Complete Mitigation of the L1 Terminal Fault (L1TF) vulnerability CVE-2018-3646 requires enabling the ESXi Side-Channel-Aware Scheduler
This post picks up another draft that had been waiting to be finished. In this Change ESXi host to use ESXi Side-Channel-Aware Scheduler v2 (SCAv2) post, I will explain how to change your VMware infrastructure to use SCAv2 instead of SCAv1. First, a quick explanation of the ESXi Side-Channel-Aware Scheduler v2 (SCAv2). According to VMware, the mitigation for this warning is to enable the 'ESXi Side-Channel-Aware Scheduler', which 'may impose a non-trivial performance impact'. Has anyone seen the effect of this 'non-trivial' performance impact?
This feature ensures vSphere maintains its lead as a platform for containers and microservices. In vSphere 6.7 U2, there is a new scheduler option called the side-channel aware scheduler to address a security vulnerability known as L1TF. In VMware vSphere, the so-called Side-Channel-Aware scheduler is available. On Hyper-V there is a similar feature called the Hyper-V Core Scheduler. Performance-wise, enabling these schedulers can have the same impact as disabling Hyper-Threading. How to enable VMware Distributed Resource Scheduler (DRS): in this post, we’ll learn the steps to enable VMware Distributed Resource Scheduler on vCenter Server 6. VMware Distributed Resource Scheduler is used to automatically load balance computing capacity to deliver optimized performance for hosts and virtual machines.
This requirement is documented in KB55806. There are two versions of the Side-Channel-Aware Scheduler (SCA). The initial version of the scheduler, SCAv1, will schedule on only one logical processor (Hyper-Thread) of a Hyper-Thread-enabled core. As a result, when enabling this version of the scheduler, available host capacity may be reduced, and VM performance may be impacted, depending on current available host capacity. Starting with ESXi 6.7u2, SCAv2 was introduced and offers performance improvements over SCAv1 while protecting from VM to VM and VM to Hypervisor information leakage. Please refer to KB55806 for guidance on choosing between SCAv1 and SCAv2, including the security and performance characteristics of the two schedulers. For the purposes of this article, we will describe enabling the ESXi Side-Channel-Aware Scheduler as enabling the Hyper-Thread-aware portion of the L1TF mitigation and name this “HTAware Mitigation”.
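As a rough illustration of the choice between the two schedulers, the sketch below maps SCAv1 and SCAv2 to the two VMkernel boot settings that are understood to select them (`hyperthreadingMitigation` and `hyperthreadingMitigationIntraVM`) and prints the matching `esxcli` lines. The helper name `esxcli_commands` is hypothetical, and the setting names and values are an assumption that should be checked against KB55806 for your ESXi build before anything is applied to a host.

```python
# Sketch only: the setting names and values below are assumptions to be
# verified against KB55806 before being applied to a real host.
SCHEDULER_SETTINGS = {
    # scheduler: (hyperthreadingMitigation, hyperthreadingMitigationIntraVM)
    "SCAv1": ("TRUE", "TRUE"),
    "SCAv2": ("TRUE", "FALSE"),
}


def esxcli_commands(scheduler):
    """Return the esxcli lines that would select the given scheduler."""
    if scheduler not in SCHEDULER_SETTINGS:
        raise ValueError(f"unknown scheduler: {scheduler!r}")
    mitigation, intra_vm = SCHEDULER_SETTINGS[scheduler]
    base = "esxcli system settings kernel set"
    return [
        f"{base} -s hyperthreadingMitigation -v {mitigation}",
        f"{base} -s hyperthreadingMitigationIntraVM -v {intra_vm}",
    ]


if __name__ == "__main__":
    # Print the commands; nothing is executed against a host.
    for line in esxcli_commands("SCAv2"):
        print(line)
```

The script only prints text, so it can be reviewed before any command is run. These are boot-time kernel settings, so the host would normally need a reboot before a change takes effect.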
You should assess the impact of enabling this scheduler on your vSphere hosts and clusters before enabling it. The HTAware Mitigation Tool is intended to assist in determining the potential impact of subsequently enabling the Side-Channel-Aware Scheduler v1 (SCAv1). The tool performs the following checks:
- Scans the virtual infrastructure for CPU utilization across Clusters, Hosts, and VMs to identify heavily utilized resources.
- Identifies VMs which may be unable to run on their current host after the mitigation is applied.
- Identifies hosts that are likely safe candidates for mitigation. This list of hosts can be provided as input to the second stage of the tool to enable the HTAware Mitigation.
It is important to note that the information provided by the Tool is advisory. The Tool is not intended to replace your own analysis of CPU utilization across the infrastructure.
Key features of the HTAware Mitigation Tool
- Collect and output historical CPU utilization information stored by vCenter for the Cluster and Host
- Identify the load impact of enabling the HTAware Mitigation on the scanned hosts. The tool also considers the load impact of reduced host capacity during rolling cluster upgrades
- Identify VMs whose total count of vCPUs is greater than the number of physical cores on the running host. Such VMs will be too “wide” to run on that host when the HTAware Mitigation is enabled.
- Identify VMs which utilize the vCPU pinning feature. The PCPU (physical CPU) numbers may no longer be valid once the scheduler is enabled.
- Provide automation functionality to apply HTAware Mitigation across vSphere clusters and/or individual hosts.
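The "too wide" and vCPU pinning checks in the list above can be sketched in a few lines. This is not the HTAware Mitigation Tool's code. The `VM` record, its field names and the example inventory are hypothetical stand-ins for data that would normally come from vCenter.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class VM:
    """Hypothetical, simplified VM record (not a vSphere API object)."""
    name: str
    vcpus: int
    cpu_affinity: Optional[str] = None  # e.g. "28-31" when pinned


def too_wide(vms: List[VM], physical_cores: int) -> List[str]:
    """Names of VMs with more vCPUs than the host has physical cores."""
    return [vm.name for vm in vms if vm.vcpus > physical_cores]


def pinned(vms: List[VM]) -> List[str]:
    """Names of VMs that use vCPU pinning (an affinity is set)."""
    return [vm.name for vm in vms if vm.cpu_affinity]


if __name__ == "__main__":
    # Hypothetical host: 16 physical cores, 32 logical processors with HT.
    inventory = [
        VM("app01", vcpus=8),
        VM("db01", vcpus=24),
        VM("batch01", vcpus=4, cpu_affinity="28-31"),
    ]
    print("too wide:", too_wide(inventory, physical_cores=16))
    print("pinned:", pinned(inventory))
```

In this made-up inventory, `db01` runs today because the host exposes 32 logical processors, but it has more vCPUs than the 16 physical cores. `batch01` is pinned to PCPU numbers that may no longer be valid once the scheduler is enabled.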
by Sven Huisman and Ryan Ververs-Bijkerk
On May 14th 2019 a group of security advisors, together with Intel, publicly announced yet another set of vulnerabilities in Intel CPUs, called Microarchitectural Data Sampling (MDS). These vulnerabilities are also known as Fallout, RIDL (Rogue In-Flight Data Load) and ZombieLoad. On the same date, major vendors like Microsoft, Google, Amazon and VMware released mitigations against these vulnerabilities. This research investigates the impact of these mitigations on a virtual desktop infrastructure and shows the effect on user density.
MDS and your VDI environment
A lot has been written and published about the MDS vulnerabilities. To learn more about these vulnerabilities, it is recommended to read cpu.fail or mdsattacks.com.
For this research, it is important to understand the context and setup for these performance tests. The goal is to understand the impact of the MDS mitigations. In a VDI environment, this means patching at the hardware level, virtualization layer, and guest OS. Because at the time of testing there was no patch available for the hardware we used, we only tested the impact of patching the hypervisor (VMware vSphere) and the guest OS (Windows 10 build 1809).
You can read more about these mitigations in the following articles:
VMware: https://www.vmware.com/security/advisories/VMSA-2019-0008.html
Microsoft: https://support.microsoft.com/en-us/help/4073119/protect-against-speculative-execution-side-channel-vulnerabilities-in
In short: for the guest OS, just install the required cumulative patch, which contains the security update. For VMware vSphere, install the required update and enable the Side-Channel-Aware scheduler. The Side-Channel-Aware scheduler is an alternative to disabling Hyper-Threading (which is required to be fully protected). The first version, introduced in vSphere ESXi, basically disables Hyper-Threading if there are no hardware mitigations in place. Newer processors may be introduced that protect the system at the hardware level. In vSphere version 6.7U2, VMware introduced the Side-Channel-Aware scheduler version 2, which should increase performance while maintaining (almost) the same level of protection.
Read about the Side Channel Aware scheduler V1 and V2 here: https://www.vmware.com/techpapers/2018/scheduler-options-vsphere67u2-perf.html
Configuration and infrastructure
This research has taken place on the GO-EUC platform which is described here. All the desktops are delivered using Citrix Virtual Desktop version 1906 and contain 2 vCPUs with 4 GB of memory. The operating system is Windows 10 1809 and is optimized using the Citrix Optimizer with the recommended template.
In order to get a complete overview of the impact, five scenarios are required.
Test | Windows | VMware vSphere | Comment |
---|---|---|---|
Baseline | 17763.475 | ESXi670-201904001 | Pre MDS patches |
vSphere Patch | 17763.475 | ESXi670-201905001 | Without Windows patch |
Windows Patch | 17763.503 | ESXi670-201905001 | Includes vSphere patch |
SCAv1 | 17763.503 | ESXi670-201905001 | Includes vSphere and Windows patch |
SCAv2 | 17763.503 | ESXi670-201905001 | Includes vSphere and Windows patch |
As always the default testing methodology is used which is described here.
Expectations and results
It is expected that MDS and enabling the Side Channel Aware Scheduler will have an impact on user density and user experience. Using Login VSI we can measure the impact by comparing the Login VSI VSImax and the Login VSI baseline. The Login VSI VSImax is one of the best metrics to see the difference in user capacity. More information about the VSImax can be found here.
[Figure: Login VSI VSImax per scenario. Higher is better.]
[Figure: Login VSI baseline per scenario. Lower is better.]
As expected, applying the MDS mitigations results in lower user capacity, but only if both the hypervisor and the guest OS are patched. Enabling SCAv1 has an even bigger impact on user capacity, while SCAv2 shows a small improvement in capacity compared to SCAv1. This is also reflected in the VSI baseline results, which show that the overall responsiveness within the desktop gets a bit slower.
It is always important to confirm the Login VSI results using other metrics and therefore performance data from the hypervisor is collected. The GO-EUC lab environment is CPU limited and therefore the CPU results should be similar to the Login VSI results.
[Figures: host CPU utilization per scenario. Lower is better.]
The CPU utilization is similar to the Login VSI results: SCAv1 has a significantly higher CPU utilization, which is caused by disabling Hyper-Threading. SCAv2 has a lower CPU utilization, which does not match the VSImax results. This has to do with the scheduling, where the VMs are allocated on the same core, which influences the performance. However, SCAv2 does show an improvement in capacity and CPU utilization compared to SCAv1.
Next, we compared the storage performance when applying the patches and enabling the SCA. As the patches should only affect the CPU, we don’t expect huge differences between the scenarios.
[Figures: storage performance per scenario. Lower is better.]
When the patches are applied to both the hypervisor and the guest OS, we see an increase in both read and write IO, which is unexpected. The average host commands/sec decreases again when enabling SCAv1 but increases again when SCAv2 is enabled. It is interesting to see what the differences are when only the first 20 minutes of the tests are compared, when the host is not saturated in any scenario.
[Figures: storage performance during the first 20 minutes of the test. Lower is better.]
When comparing only the first 20 minutes of the Login VSI tests, the average host commands per second increases by 30 to 36%. These results are more representative of the real impact of the MDS patches.
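Restricting the comparison to the first 20 minutes amounts to averaging only the samples taken before the host saturates. A minimal sketch, assuming the metric is available as `(seconds_since_start, value)` pairs. The function name `window_mean` and the sample values are hypothetical and are not measurements from this research.

```python
from statistics import mean


def window_mean(samples, window_seconds=20 * 60):
    """Average the values recorded before window_seconds have elapsed.

    samples: iterable of (seconds_since_start, value) pairs.
    """
    in_window = [value for t, value in samples if t < window_seconds]
    if not in_window:
        raise ValueError("no samples inside the window")
    return mean(in_window)


if __name__ == "__main__":
    # Hypothetical commands/sec samples; the last one is taken after
    # 20 minutes, when the host may already be saturated.
    samples = [(0, 100.0), (600, 120.0), (1199, 140.0), (1800, 400.0)]
    print("first 20 minutes:", window_mean(samples))
    print("entire test:", mean(value for _, value in samples))
```

Comparing the two averages shows how much the saturated tail of a test can dominate a whole-run average.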
There are many factors that are part of the user experience. One of the first things that users will experience is the logon time. Long logon times have a negative effect on the user experience, so it is important to keep the logon time as short as possible.
[Figures: logon times per scenario over the entire test. Lower is better.]
The impact on the logon times is huge when we compare the average logon time over the entire test. The test where SCAv1 is enabled shows an increase in logon time of 188%! This is because the logon times increase rapidly when the CPU usage is at its maximum.
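To put the percentage in perspective: an increase of 188% means the average logon time is 2.88 times the baseline, so almost three times as long. The small sketch below shows that arithmetic. The 10-second baseline is a hypothetical value chosen for readability, not a measured result.

```python
def percent_change(baseline, value):
    """Relative change of value against baseline, in percent."""
    if baseline == 0:
        raise ValueError("baseline must be non-zero")
    return (value - baseline) / baseline * 100.0


def apply_percent_increase(baseline, percent):
    """Value obtained after increasing baseline by the given percentage."""
    return baseline * (1.0 + percent / 100.0)


if __name__ == "__main__":
    baseline_logon = 10.0  # hypothetical baseline logon time in seconds
    slowed = apply_percent_increase(baseline_logon, 188.0)
    print(f"logon time after a 188% increase: {slowed:.1f} s")
    print(f"change versus baseline: {percent_change(baseline_logon, slowed):.0f}%")
```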
As the logon times are influenced by the server saturation, comparing the first 20 minutes provides a more realistic perspective.
[Figures: logon times during the first 20 minutes of the test. Lower is better.]
Applying the patches and enabling the Side-Channel-Aware scheduler does increase the logon time by 5-7% on average, but the effect isn’t as bad as the averages over the entire duration of the test suggest. This shows that the saturation point of the server has a big influence on the logon times.
Conclusion
At the beginning of 2018 the first Intel vulnerabilities were exposed, and since then multiple vendors have released mitigations for them. On May 14th 2019 new Intel vulnerabilities were exposed by a group of security advisors, called Microarchitectural Data Sampling (MDS), also known as Fallout, RIDL (Rogue In-Flight Data Load) and ZombieLoad.
It is expected to see an impact when applying these mitigations, and the results show an impact of around 25%. As the mitigations affect the CPU resources, they also have an impact on the user experience, which is visible in the logon times.
The key takeaway is to validate the sizing of your environment after applying these mitigations. Only this way can you ensure the user experience is not affected by them. This probably means more hardware is required to host all those users, which is the big downside of these mitigations.
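A back-of-the-envelope way to translate a capacity loss into hardware is sketched below. It assumes the roughly 25% impact reported here applies directly to users per host, which may not hold for other hardware. The user count and density are hypothetical, and `hosts_required` is an illustrative helper rather than a sizing tool.

```python
import math


def hosts_required(total_users, users_per_host, capacity_loss=0.25):
    """Hosts needed when each host loses capacity_loss of its user density."""
    if not 0.0 <= capacity_loss < 1.0:
        raise ValueError("capacity_loss must be in [0, 1)")
    effective_density = users_per_host * (1.0 - capacity_loss)
    return math.ceil(total_users / effective_density)


if __name__ == "__main__":
    # Hypothetical environment: 1000 users at 100 users per host.
    before = hosts_required(1000, 100, capacity_loss=0.0)
    after = hosts_required(1000, 100, capacity_loss=0.25)
    print(f"hosts before mitigations: {before}")
    print(f"hosts after mitigations:  {after}")
```

With these made-up numbers the estimate goes from 10 to 14 hosts. Failover headroom is not included, and measurements on your own hardware should replace the assumed 25%.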
Important note: the impact is related to the hardware specification and may be different in your environment.
If you don’t enable the mitigations, your environment is vulnerable to these exploits. We cannot decide for you if it is worth the risk of not enabling the mitigations. We can only advise you to test the mitigations, measure the impact on your environment and take the appropriate measures.
If you have comments about this research or want to discuss other configurations, please join us on our GO-EUC Slack channel.
Photo by Christian Wiediger on Unsplash