List:       oss-security
Subject:    [oss-security] Xen Security Advisory 304 v1 (CVE-2018-12207) - x86: Machine Check Error on Page Size
From:       Xen.org security team <security () xen ! org>
Date:       2019-11-12 18:01:10
Message-ID: E1iUaTW-0001wf-3B () xenbits ! xenproject ! org

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2018-12207 / XSA-304

            x86: Machine Check Error on Page Size Change DoS

ISSUE DESCRIPTION
=================

An erratum exists across some CPUs whereby an instruction fetch may
cause a machine check error if the pagetables have been updated in a
specific manner without invalidating the TLB.

The x86 architecture explicitly permits modification of the pagetables
without TLB invalidation, but in this corner case, the impacted core
ceases operating and an unexpected machine check or system reset occurs.

This corner case can be triggered by guest kernels.

For more details, see:
  https://software.intel.com/security-software-guidance/insights/deep-dive-machine-check-error-avoidance-page-size-change


IMPACT
======

A malicious guest kernel can crash the host, resulting in a Denial of
Service (DoS).  (This CPU bug may also be triggered accidentally.)

VULNERABLE SYSTEMS
==================

Systems running all versions of Xen are affected.

Only x86 processors are vulnerable.  ARM processors are not believed to
be vulnerable.

Only Intel Core based processors (from Nehalem onwards) are affected.
Other processor designs (the Intel Atom and Knights ranges) and other
manufacturers (AMD) are not known to be affected.

Only x86 HVM/PVH guests can exploit the vulnerability.  x86 PV guests
cannot exploit the vulnerability.

Please consult the Intel Security Advisory for details on the affected
processors.

MITIGATION
==========

Running only PV guests avoids the vulnerability.

Booting Xen with `hap_2mb=0 hap_1gb=0` on the command line, to disable
the use of HAP superpages, works around the vulnerability.

Booting Xen with `hap=0` to disable HAP entirely, or configuring HVM/PVH
guests to use shadow paging (hap=0 in xl.cfg) works around the
vulnerability, but the performance impact of shadow paging in
combination with in-guest Meltdown mitigations (KPTI, KVAS, etc) will
most likely make this option prohibitive to use.

RESOLUTION
==========

Applying the appropriate attached patches resolves this issue.

By default, Xen will disable executable superpages on
believed-vulnerable hardware, and report so at boot:

  (XEN) VMX: Disabling executable EPT superpages due to CVE-2018-12207

See the PERFORMANCE AND SAFETY CONSIDERATIONS section below.

xsa304/xsa304-*.patch           xen-unstable
xsa304/xsa304-4.12-*.patch      Xen 4.12.x
xsa304/xsa304-4.11-*.patch      Xen 4.11.x
xsa304/xsa304-4.10-*.patch      Xen 4.10.x
xsa304/xsa304-4.9-*.patch       Xen 4.9.x
xsa304/xsa304-4.8-*.patch       Xen 4.8.x

The patches consist of:
 *-1.patch: Fix on SandyBridge hardware discovered during testing
 *-2.patch: Main security fix
 *-3.patch: (4.10 and later) Runtime control of fast vs secure

$ sha256sum xsa304*/*
3365e0351b3ccb39e3be53bcbfd8219d8282f6f3d97d6c4519a3e860b27f6844  xsa304/xsa304-1.patch
1a85753717312f2b20f291c9e79271c63be2a9542fbec651d0a8fc4d8aca0408  xsa304/xsa304-2.patch
0c770aa15f2aef2bb3253194243968181a4bb1710d09d6f785ed7f5dae03b93b  xsa304/xsa304-3.patch
2d2eb25b842578bd45480c8ff6f2266617dd0db5e6e552d5ae481eb764c8aea0  xsa304/xsa304-4.8-1.patch
72d91f67af06f89d01f7dc1e6ff87f50cad28bbb0475eb5cfbb986ee51775bc2  xsa304/xsa304-4.8-2.patch
d8d18e7dd9b59f01454352a46d38699b21c5f1f7ff6bd2aa8e63fbd7a98cfca4  xsa304/xsa304-4.9-1.patch
244df964d70eab300c77210456439dfb1c46f2ddd9f1b851e1110be7573948ba  xsa304/xsa304-4.9-2.patch
2d80f2603412abb4e644b8e868f4218e90db3f59b25f833ff7342d347af6c5a8  xsa304/xsa304-4.10-1.patch
94a87371ddeccf5705ed71a961135393fa9046e4235cc90402f9292dcfffa43c  xsa304/xsa304-4.10-2.patch
9862e46c2bcbbeaba32d06d7af33b8b97fd8be5a4a35bcd70264e9913031f512  xsa304/xsa304-4.10-3.patch
b927c5b7a5dbf6260fd37ec2a594d5a0ff40b2fa78c9caaaaa59fa184c87d8d1  xsa304/xsa304-4.11-1.patch
478d7b7b27bb0a4ed874a4d6fe73282d785feed8c35f3278a07a1228d5dfad77  xsa304/xsa304-4.11-2.patch
d0e079a0af7045711a21ac52674e5821e69c370f7ef64c9ebdfc0990950f7a54  xsa304/xsa304-4.11-3.patch
4025732fd83a94c09b023f079e9b3c8399649f31e406f5f0c736a522f75fdd53  xsa304/xsa304-4.12-1.patch
2653c57fc79b98ca5cc30ceb2299d11c2ba96f4becdfb93a1cc14ca943e18420  xsa304/xsa304-4.12-2.patch
ec670ca4e3782043824e1f475ba187d89a53836d4e2ad8399daf0a91fcc747dc  xsa304/xsa304-4.12-3.patch
$

PERFORMANCE AND SAFETY CONSIDERATIONS
=====================================

Disabling executable EPT superpages does come with a performance impact,
caused by increased iTLB pressure.  The overhead will be workload and
CPU dependent.
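
As a rough illustration of where the extra iTLB pressure comes from (a
sketch only, not part of the patches): once a superpage has been fully
broken up, the same range of guest memory is covered by many 4K mappings
rather than one.  The helper name below is hypothetical; the page sizes
are the architectural x86 ones.

```c
#include <stdint.h>

/* Architectural x86 page sizes, in bytes. */
#define SZ_4K  (UINT64_C(1) << 12)
#define SZ_2M  (UINT64_C(1) << 21)
#define SZ_1G  (UINT64_C(1) << 30)

/*
 * Hypothetical helper: number of 4K mappings needed to cover the same
 * range as a single mapping of the given size.
 */
static uint64_t entries_4k_for(uint64_t mapping_size)
{
    return mapping_size / SZ_4K;
}
```

A 2M range therefore needs 512 separate 4K mappings.  Because the
workaround only breaks up superpages that a guest actually executes
from, the real cost depends on how much code the workload touches.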

In configurations where guest kernels are trusted not to mount a DoS
attempt, the mitigation can be turned off by booting with `ept=exec-sp`.

In configurations where the guest kernels are not trusted, users are
advised to measure the impact on their workloads when deciding between
fast and secure.

On Xen 4.10 and later, a runtime decision can be made between fast and
secure by using `xl set-parameters ept=[no-]exec-sp`.

NOTE REGARDING LACK OF EMBARGO
==============================

Despite an attempt to organise predisclosure, the discoverers ultimately
did not authorise a predisclosure.
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAl3K8agMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZd3sH/jRb9M9+OyI6dsFkqCwgnbL3poPgVwC6umC0he6k
nomcLvY5Tc1ClhvyXTLDOzdo20zMQo6mtLs5RFGC78CjWKM7P3aSFGay+yRHXt4q
QzoTgTPaSR+MtkahgmS+GEY5IuYSXFWZLRNmx8YXmG2GVDFU9CkfbCCo9hGknY4r
t5cMS+I7cjAuGhvf9uBxFcSr6FiARcqzk7B7qSEPOJbfEAq1XXYh4Q81Zx2iHClW
xzyGsWk5UeP+NjRFGpJZpsz9a8yx/zaYWFsjxzG3xYutjkypSoRmNCG2sMPq54Nk
yuEYHV6/r4ymgexIe+INdHfmkJRpoYadmLdV0vRfXp0vlO8=
=LdOL
-----END PGP SIGNATURE-----


["xsa304/xsa304-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 6b0b7af9e2..994d360e90 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -101,6 +101,8 @@ void vtd_ops_postamble_quirk(struct vtd_iommu *iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct vtd_iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 68e7f5fb58..25ad649c34 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1170,6 +1170,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index e7e326fe8c..4dadd9523f 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -536,3 +536,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct vtd_iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}
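
The quirk above can be modelled outside of Xen as a pure function.  This
is a minimal sketch, assuming a hypothetical `struct cpu_id` in place of
the `boot_cpu_data` fields the patch reads; the mask covers bits 34-37 of
the capability register, which the patch comment describes as the
superpage capabilities.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bits 34-37 of the VT-d capability register, as masked by the patch. */
#define CAP_SUPERPAGE_MASK (UINT64_C(0xf) << 34)

/* Hypothetical stand-in for the boot_cpu_data fields used by the quirk. */
struct cpu_id {
    bool intel;
    unsigned int family;
    unsigned int model;
};

/* Same model match as the patch: Family 6, models 0x2a and 0x2d. */
static bool is_sandybridge(const struct cpu_id *c)
{
    return c->intel && c->family == 6 &&
           (c->model == 0x2a || c->model == 0x2d);
}

/* Return the capability value with superpage support hidden if needed. */
static uint64_t quirk_cap(uint64_t cap, const struct cpu_id *c)
{
    return is_sandybridge(c) ? (cap & ~CAP_SUPERPAGE_MASK) : cap;
}
```

Returning the adjusted value rather than modifying a structure in place
is a choice made for this sketch only; the patch edits `iommu->cap`
directly.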

["xsa304/xsa304-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 451d213c8c..d2b0020b55 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -918,7 +918,7 @@ Controls for interacting with the system Extended Firmware Interface.
     uncacheable.
 
 ### ept
-> `= List of [ ad=<bool>, pml=<bool> ]`
+> `= List of [ ad=<bool>, pml=<bool>, exec-sp=<bool> ]`
 
 > Applicability: Intel
 
@@ -949,6 +949,16 @@ introduced with the Nehalem architecture.
     disable PML.  `pml=0` can be used to prevent the use of PML on otherwise
     capable hardware.
 
+*   The `exec-sp` boolean controls whether EPT superpages with execute
+    permissions are permitted.  In general this is good for performance.
+
+    However, on processors vulnerable CVE-2018-12207, HVM guest kernels can
+    use executable superpages to crash the host.  By default, executable
+    superpages are disabled on affected hardware.
+
+    If HVM guest kernels are trusted not to mount a DoS against the system,
+    this option can enabled to regain performance.
+
 ### extra_guest_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 06a7b40107..818e705fd1 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1833,6 +1833,24 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, _gfn(gfn), mfn, PAGE_ORDER_4K,
+                                    p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index ed27e8def7..d2624ea9d7 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -67,6 +67,7 @@ integer_param("ple_window", ple_window);
 
 static bool __read_mostly opt_ept_pml = true;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 static int __init parse_ept_param(const char *s)
 {
@@ -82,6 +83,8 @@ static int __init parse_ept_param(const char *s)
             opt_ept_ad = val;
         else if ( (val = parse_boolean("pml", s, ss)) >= 0 )
             opt_ept_pml = val;
+        else if ( (val = parse_boolean("exec-sp", s, ss)) >= 0 )
+            opt_ept_exec_sp = val;
         else
             rc = -EINVAL;
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index a55ff37733..6a5eeb5c13 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2371,6 +2371,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
 static void __init lbr_tsx_fixup_check(void);
 static void __init bdf93_fixup_check(void);
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avaton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Brasswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2391,6 +2487,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 220990f017..f06e51904a 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -174,6 +174,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index ebaa74449b..371b912887 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 637259bd1f..32746aa8ae 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -52,6 +52,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
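
The three decisions this patch introduces can be sketched in isolation:
how the tri-state option is resolved at boot, when an entry loses execute
permission, and when a fault is resolved by shattering.  This is a toy
model under stated assumptions, not the Xen implementation: `struct
toy_epte` and all function names are hypothetical, and only the
conditions themselves are taken from the hunks above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Tri-state option: -1 = not set by the user, 0 = disabled, 1 = enabled. */
static int8_t resolve_exec_sp(int8_t opt, bool cpu_vulnerable)
{
    if ( opt == -1 )
        return !cpu_vulnerable;   /* default: secure on vulnerable hardware */
    return opt;                   /* an explicit user choice always wins */
}

/* Hypothetical, simplified EPT entry: only the fields this sketch needs. */
struct toy_epte {
    bool x;          /* execute permission */
    bool superpage;  /* maps more than 4K */
};

/* Mirrors the check added to ept_p2m_type_to_flags(). */
static void apply_exec_sp(struct toy_epte *e, int8_t exec_sp)
{
    if ( !exec_sp && e->superpage )
        e->x = false;
}

/*
 * Mirrors the condition added to hvm_hap_nested_page_fault(): an
 * instruction fetch from a present superpage, which is not otherwise an
 * access violation, is resolved by shattering the superpage to 4K.
 */
static bool should_shatter(unsigned int page_order, bool insn_fetch,
                           bool present, bool violation)
{
    return page_order > 0 && insn_fetch && present && !violation;
}
```

Note that an explicit `ept=exec-sp` or `ept=no-exec-sp` bypasses the
hardware check entirely in this model, matching the `opt_ept_exec_sp ==
-1` test in `start_vmx()`.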

["xsa304/xsa304-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Allow runtime modification of the exec-sp setting

See patch for details.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index d2b0020b55..5e427a1cf8 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -959,6 +959,21 @@ introduced with the Nehalem architecture.
     If HVM guest kernels are trusted not to mount a DoS against the system,
     this option can enabled to regain performance.
 
+    This boolean may be modified at runtime using `xl set-parameters
+    ept=[no-]exec-sp` to switch between fast and secure.
+
+    *   When switching from secure to fast, preexisting HVM domains will run
+        at their current performance until they are rebooted; new domains will
+        run without any overhead.
+
+    *   When switching from fast to secure, all HVM domains will immediately
+        suffer a performance penalty.
+
+    **Warning: No guarantee is made that this runtime option will be retained
+      indefinitely, or that it will retain this exact behaviour.  It is
+      intended as an emergency option for people who first chose fast, then
+      change their minds to secure, and wish not to reboot.**
+
 ### extra_guest_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index d2624ea9d7..477c968409 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -95,6 +95,41 @@ static int __init parse_ept_param(const char *s)
 }
 custom_param("ept", parse_ept_param);
 
+static int parse_ept_param_runtime(const char *s)
+{
+    int val;
+
+    if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
+         !(hvm_funcs.hap_capabilities &
+           (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
+    {
+        printk("VMX: EPT not available, or not in use - ignoring\n");
+        return 0;
+    }
+
+    if ( (val = parse_boolean("exec-sp", s, NULL)) < 0 )
+        return -EINVAL;
+
+    if ( val != opt_ept_exec_sp )
+    {
+        struct domain *d;
+
+        opt_ept_exec_sp = val;
+
+        rcu_read_lock(&domlist_read_lock);
+        for_each_domain ( d )
+            if ( paging_mode_hap(d) )
+                p2m_change_entry_type_global(d, p2m_ram_rw, p2m_ram_rw);
+        rcu_read_unlock(&domlist_read_lock);
+    }
+
+    printk("VMX: EPT executable superpages %sabled\n",
+           val ? "en" : "dis");
+
+    return 0;
+}
+custom_runtime_only_param("ept", parse_ept_param_runtime);
+
 /* Dynamic (run-time adjusted) execution control flags. */
 u32 vmx_pin_based_exec_control __read_mostly;
 u32 vmx_cpu_based_exec_control __read_mostly;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index e5e4349dea..ba126f790a 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -289,15 +289,20 @@ static void change_entry_type_global(struct p2m_domain *p2m,
                                      p2m_type_t ot, p2m_type_t nt)
 {
     p2m->change_entry_type_global(p2m, ot, nt);
-    p2m->global_logdirty = (nt == p2m_ram_logdirty);
+    /* Don't allow 'recalculate' operations to change the logdirty state. */
+    if ( ot != nt )
+        p2m->global_logdirty = (nt == p2m_ram_logdirty);
 }
 
+/*
+ * May be called with ot = nt = p2m_ram_rw for its side effect of
+ * recalculating all PTEs in the p2m.
+ */
 void p2m_change_entry_type_global(struct domain *d,
                                   p2m_type_t ot, p2m_type_t nt)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(d);
 
-    ASSERT(ot != nt);
     ASSERT(p2m_is_changeable(ot) && p2m_is_changeable(nt));
 
     p2m_lock(hostp2m);
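
The interaction between the runtime parameter and the relaxed
`ot != nt` rule can be sketched as follows.  This is a simplified model
with hypothetical types (`struct toy_p2m`, `enum toy_type`) and a single
p2m rather than a walk over all domains; it only illustrates that a
recalculation is triggered when the value actually changes, and that a
same-type recalculation leaves the logdirty state alone.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified p2m types and state. */
enum toy_type { TOY_RAM_RW, TOY_RAM_LOGDIRTY };

struct toy_p2m {
    bool global_logdirty;
    unsigned int recalcs;   /* number of global recalculations requested */
};

/* Mirrors the patched change_entry_type_global(). */
static void toy_change_type_global(struct toy_p2m *p2m,
                                   enum toy_type ot, enum toy_type nt)
{
    p2m->recalcs++;
    /* A same-type 'recalculate' must not disturb the logdirty state. */
    if ( ot != nt )
        p2m->global_logdirty = (nt == TOY_RAM_LOGDIRTY);
}

/* Mirrors the value-changed check in parse_ept_param_runtime(). */
static void toy_set_exec_sp(int8_t *opt, int8_t val, struct toy_p2m *p2m)
{
    if ( val != *opt )
    {
        *opt = val;
        toy_change_type_global(p2m, TOY_RAM_RW, TOY_RAM_RW);
    }
}
```

Without the `ot != nt` guard, the recalculation in this model would clear
`global_logdirty` as a side effect of toggling `exec-sp`, which is the
case the patch comment guards against.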

["xsa304/xsa304-4.8-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index fb7edfaef9..d698b1d50a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -96,6 +96,8 @@ void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 54cb798c2e..d1978133a0 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1205,6 +1205,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 5bbbd96d51..7fca95fa87 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -539,3 +539,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}

["xsa304/xsa304-4.8-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 23d6f09d8a..5338d20c41 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1781,6 +1781,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     struct p2m_domain *p2m, *hostp2m;
     int rc, fall_through = 0, paged = 0;
     int sharing_enomem = 0;
+    unsigned int page_order = 0;
     vm_event_request_t *req_ptr = NULL;
     bool_t ap2m_active, sync = 0;
 
@@ -1851,7 +1852,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     hostp2m = p2m_get_hostp2m(currd);
     mfn = get_gfn_type_access(hostp2m, gfn, &p2mt, &p2ma,
                               P2M_ALLOC | (npfec.write_access ? P2M_UNSHARE : 0),
-                              NULL);
+                              &page_order);
 
     if ( ap2m_active )
     {
@@ -1863,7 +1864,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             goto out;
         }
 
-        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, NULL);
+        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, &page_order);
     }
     else
         p2m = hostp2m;
@@ -1905,6 +1906,23 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index af1a9d444f..b4b539ac3f 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -68,6 +68,7 @@ integer_param("ple_window", ple_window);
 
 static bool_t __read_mostly opt_pml_enabled = 1;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 /*
  * The 'ept' parameter controls functionalities that depend on, or impact the
@@ -94,6 +95,8 @@ static void __init parse_ept_param(char *s)
             opt_pml_enabled = val;
         else if ( !strcmp(s, "ad") )
             opt_ept_ad = val;
+        else if ( !strcmp(s, "exec-sp") )
+            opt_ept_exec_sp = val;
 
         s = ss + 1;
     } while ( ss );
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 0053ac0122..8d4d973ff0 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2401,6 +2401,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
     raise_softirq(VCPU_KICK_SOFTIRQ);
 }
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avaton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Brasswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2417,6 +2513,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 26aa3cddb7..d0637eeb15 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -210,6 +210,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 4cdd9b1d9f..bd71545188 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 92d10e2191..0a596f7489 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -54,6 +54,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
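
The option handling in the patch above can be read as a tri-state: -1 means
"not given on the command line", and is resolved to a hardware-dependent
default in start_vmx().  The following is only a standalone sketch of that
idea, not Xen code; struct toy_epte, resolve_exec_sp() and apply_exec_sp()
are hypothetical names invented for illustration.

```c
#include <stdbool.h>
#include <stdint.h>

/* Toy stand-in for an EPT entry; only the fields this sketch needs. */
struct toy_epte {
    bool x;          /* execute permission */
    bool superpage;  /* maps 2M or 1G rather than 4k */
};

/*
 * Resolve a tri-state option: -1 means "not specified", so pick the safe
 * default for the hardware; 0 and 1 are explicit administrator choices
 * and are returned unchanged.
 */
static int8_t resolve_exec_sp(int8_t opt, bool cpu_vulnerable)
{
    if ( opt == -1 )
        return !cpu_vulnerable;

    return opt;
}

/* Strip execute permission from superpage entries when exec-sp is off. */
static void apply_exec_sp(struct toy_epte *e, int8_t exec_sp)
{
    if ( !exec_sp && e->superpage )
        e->x = false;
}
```

Keeping the "unspecified" state distinct from 0 and 1 is what lets an
explicit ept=exec-sp override the default on hardware believed vulnerable.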

["xsa304/xsa304-4.9-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index fb7edfaef9..d698b1d50a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -96,6 +96,8 @@ void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index bbc7e40905..336b778c81 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1205,6 +1205,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index 5bbbd96d51..7fca95fa87 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -539,3 +539,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}
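
For readers unfamiliar with the capability register layout, the mask in
quirk_iommu_caps() clears the four bits starting at bit 34.  A minimal
sketch of the effect follows, assuming bit 34 advertises 2M and bit 35
advertises 1G superpage support; the toy_* names are hypothetical and this
is not the Xen implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumed position of the superpage-support field in the cap register. */
#define TOY_CAP_SPS_SHIFT 34
#define TOY_CAP_SPS_MASK  (UINT64_C(0xf) << TOY_CAP_SPS_SHIFT)

static bool toy_cap_sps_2mb(uint64_t cap)
{
    return (cap >> TOY_CAP_SPS_SHIFT) & 1;
}

static bool toy_cap_sps_1gb(uint64_t cap)
{
    return (cap >> (TOY_CAP_SPS_SHIFT + 1)) & 1;
}

/* Return the capabilities with every superpage size hidden. */
static uint64_t toy_hide_superpages(uint64_t cap)
{
    return cap & ~TOY_CAP_SPS_MASK;
}
```

The sketch uses UINT64_C() so the shift is 64 bits wide regardless of the
width of unsigned long; the patch's 0xful relies on a 64-bit unsigned long.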

["xsa304/xsa304-4.9-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 0164ae5a96..0b05b0388c 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1648,6 +1648,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     struct p2m_domain *p2m, *hostp2m;
     int rc, fall_through = 0, paged = 0;
     int sharing_enomem = 0;
+    unsigned int page_order = 0;
     vm_event_request_t *req_ptr = NULL;
     bool_t ap2m_active, sync = 0;
 
@@ -1718,7 +1719,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     hostp2m = p2m_get_hostp2m(currd);
     mfn = get_gfn_type_access(hostp2m, gfn, &p2mt, &p2ma,
                               P2M_ALLOC | (npfec.write_access ? P2M_UNSHARE : 0),
-                              NULL);
+                              &page_order);
 
     if ( ap2m_active )
     {
@@ -1730,7 +1731,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             goto out;
         }
 
-        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, NULL);
+        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, &page_order);
     }
     else
         p2m = hostp2m;
@@ -1772,6 +1773,23 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, gfn, mfn, PAGE_ORDER_4K, p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 345bfbf6fc..178ddb0925 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -67,6 +67,7 @@ integer_param("ple_window", ple_window);
 
 static bool_t __read_mostly opt_pml_enabled = 1;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 /*
  * The 'ept' parameter controls functionalities that depend on, or impact the
@@ -93,6 +94,8 @@ static void __init parse_ept_param(char *s)
             opt_pml_enabled = val;
         else if ( !strcmp(s, "ad") )
             opt_ept_ad = val;
+        else if ( !strcmp(s, "exec-sp") )
+            opt_ept_exec_sp = val;
 
         s = ss + 1;
     } while ( ss );
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 5042a86515..cb3be48283 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2496,6 +2496,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
 static void __init lbr_tsx_fixup_check(void);
 static void __init bdw_erratum_bdf14_fixup_check(void);
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avoton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Braswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2516,6 +2612,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index ecab56fbec..3837062b2c 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -215,6 +215,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 4889a64255..8845c4650b 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 92d9ee76c2..5ef894ff29 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -54,6 +54,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
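
The fault-path hunk in hvm_hap_nested_page_fault() can be pictured with a
toy model: a superpage starts out non-executable, and the first instruction
fetch that faults on it (while the guest is otherwise allowed to execute
there) causes it to be split into 4k entries that carry execute permission.
This is a rough sketch under that reading, with hypothetical names
throughout; it does not reproduce p2m_set_entry() or any real EPT layout.

```c
#include <stdbool.h>
#include <stddef.h>

#define TOY_ENTRIES_PER_SP 512   /* 4k pages per 2M superpage */

struct toy_mapping {
    bool is_superpage;                    /* one 2M entry vs. 512 4k ones */
    bool sp_exec;                         /* execute bit of the 2M entry */
    bool small_exec[TOY_ENTRIES_PER_SP];  /* execute bits once shattered */
};

/*
 * Handle an instruction-fetch fault.  Returns true if the mapping was
 * shattered and the guest should simply retry the fetch, false if there
 * is nothing to shatter or the access is a genuine violation.
 */
static bool toy_handle_fetch_fault(struct toy_mapping *m, bool guest_may_exec)
{
    size_t i;

    if ( !m->is_superpage || !guest_may_exec )
        return false;

    m->is_superpage = false;
    m->sp_exec = false;
    for ( i = 0; i < TOY_ENTRIES_PER_SP; i++ )
        m->small_exec[i] = true;

    return true;
}
```

Because only 4k mappings ever become executable, the guest cannot get an
executable superpage translation into the iTLB in the first place.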

["xsa304/xsa304-4.10-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index fb7edfaef9..d698b1d50a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -96,6 +96,8 @@ void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index 2798a49907..17cf87ccf1 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1205,6 +1205,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index d6db862678..b02688e316 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -540,3 +540,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}

["xsa304/xsa304-4.10-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index c0700dfbfe..698ab63340 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1695,6 +1695,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     struct p2m_domain *p2m, *hostp2m;
     int rc, fall_through = 0, paged = 0;
     int sharing_enomem = 0;
+    unsigned int page_order = 0;
     vm_event_request_t *req_ptr = NULL;
     bool_t ap2m_active, sync = 0;
 
@@ -1763,7 +1764,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     hostp2m = p2m_get_hostp2m(currd);
     mfn = get_gfn_type_access(hostp2m, gfn, &p2mt, &p2ma,
                               P2M_ALLOC | (npfec.write_access ? P2M_UNSHARE : 0),
-                              NULL);
+                              &page_order);
 
     if ( ap2m_active )
     {
@@ -1775,7 +1776,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             goto out;
         }
 
-        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, NULL);
+        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, &page_order);
     }
     else
         p2m = hostp2m;
@@ -1817,6 +1818,24 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, _gfn(gfn), mfn, PAGE_ORDER_4K,
+                                    p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 205f2307c2..27050c0877 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -67,6 +67,7 @@ integer_param("ple_window", ple_window);
 
 static bool_t __read_mostly opt_pml_enabled = 1;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 /*
  * The 'ept' parameter controls functionalities that depend on, or impact the
@@ -94,6 +95,8 @@ static int __init parse_ept_param(const char *s)
             opt_pml_enabled = val;
         else if ( !cmdline_strcmp(s, "ad") )
             opt_ept_ad = val;
+        else if ( !cmdline_strcmp(s, "exec-sp") )
+            opt_ept_exec_sp = val;
         else
             rc = -EINVAL;
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index fa1e0309c7..9285c2b2fa 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2490,6 +2490,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
 static void __init lbr_tsx_fixup_check(void);
 static void __init bdw_erratum_bdf14_fixup_check(void);
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avoton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Braswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2510,6 +2606,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index b4996ce658..424d42c93d 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -215,6 +215,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 7341cb191e..aad25335eb 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index e61aac2f51..47e7c412f2 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -54,6 +54,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
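
The order of checks in has_if_pschange_mc() matters: the cheap "cannot be
affected" answers come first, and an unrecognised model falls through to
"assume vulnerable".  Below is a condensed sketch of that ordering, with a
hypothetical struct toy_cpu in place of boot_cpu_data and only a handful of
the models from the patch; it is illustrative, not a replacement for the
real table.

```c
#include <stdbool.h>
#include <stdint.h>

#define TOY_CAPS_IF_PSCHANGE_MC_NO (UINT64_C(1) << 6)

struct toy_cpu {
    bool virtualised;     /* running under another hypervisor */
    bool intel;           /* vendor is Intel */
    unsigned int family;
    unsigned int model;
    uint64_t arch_caps;   /* 0 if ARCH_CAPABILITIES is not enumerated */
};

static bool toy_has_if_pschange_mc(const struct toy_cpu *c)
{
    /* Nothing useful can be done when not on bare metal. */
    if ( c->virtualised )
        return false;

    /* Hardware explicitly reports that it is not affected. */
    if ( c->arch_caps & TOY_CAPS_IF_PSCHANGE_MC_NO )
        return false;

    /* Only Intel Family 6 is known to be affected. */
    if ( !c->intel || c->family != 6 )
        return false;

    switch ( c->model )
    {
    case 0x2a: /* SandyBridge */
    case 0x55: /* Skylake-X / Cascade Lake */
        return true;

    case 0x5c: /* Goldmont */
    case 0x57: /* Knights Landing */
        return false;

    default:   /* Unknown model: fail safe. */
        return true;
    }
}
```

Defaulting unknown models to vulnerable costs some performance on future
unaffected parts that lack the capability bit, but avoids silently leaving
new hardware unprotected.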

["xsa304/xsa304-4.10-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Allow runtime modification of the exec-sp setting

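Add a runtime-only handler for the "ept" parameter so that exec-sp can be
changed after boot.  When the setting changes, the p2m of every HAP domain is
recalculated by calling p2m_change_entry_type_global() with ot = nt =
p2m_ram_rw.  To permit this, the ASSERT(ot != nt) is dropped, and such
recalculations no longer alter the global logdirty state.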

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 27050c0877..3c29b7c46f 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -107,6 +107,41 @@ static int __init parse_ept_param(const char *s)
 }
 custom_param("ept", parse_ept_param);
 
+static int parse_ept_param_runtime(const char *s)
+{
+    int val;
+
+    if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
+         !(hvm_funcs.hap_capabilities &
+           (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
+    {
+        printk("VMX: EPT not available, or not in use - ignoring\n");
+        return 0;
+    }
+
+    if ( (val = parse_boolean("exec-sp", s, NULL)) < 0 )
+        return -EINVAL;
+
+    if ( val != opt_ept_exec_sp )
+    {
+        struct domain *d;
+
+        opt_ept_exec_sp = val;
+
+        rcu_read_lock(&domlist_read_lock);
+        for_each_domain ( d )
+            if ( paging_mode_hap(d) )
+                p2m_change_entry_type_global(d, p2m_ram_rw, p2m_ram_rw);
+        rcu_read_unlock(&domlist_read_lock);
+    }
+
+    printk("VMX: EPT executable superpages %sabled\n",
+           val ? "en" : "dis");
+
+    return 0;
+}
+custom_runtime_only_param("ept", parse_ept_param_runtime);
+
 /* Dynamic (run-time adjusted) execution control flags. */
 u32 vmx_pin_based_exec_control __read_mostly;
 u32 vmx_cpu_based_exec_control __read_mostly;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 7a52ba993e..416e77b03c 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -263,17 +263,22 @@ int p2m_is_logdirty_range(struct p2m_domain *p2m, unsigned long start,
     return 0;
 }
 
+/*
+ * May be called with ot = nt = p2m_ram_rw for its side effect of
+ * recalculating all PTEs in the p2m.
+ */
 void p2m_change_entry_type_global(struct domain *d,
                                   p2m_type_t ot, p2m_type_t nt)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
-    ASSERT(ot != nt);
     ASSERT(p2m_is_changeable(ot) && p2m_is_changeable(nt));
 
     p2m_lock(p2m);
     p2m->change_entry_type_global(p2m, ot, nt);
-    p2m->global_logdirty = (nt == p2m_ram_logdirty);
+    /* Don't allow 'recalculate' operations to change the logdirty state. */
+    if ( ot != nt )
+        p2m->global_logdirty = (nt == p2m_ram_logdirty);
     p2m_unlock(p2m);
 }
 

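The change to p2m_change_entry_type_global() separates two effects that
used to be tied together: recalculating entries, and updating the global
logdirty flag.  A small sketch of that separation follows, using
hypothetical toy_* types and a counter in place of the real recalculation;
it only illustrates the control flow.

```c
#include <stdbool.h>

enum toy_type { TOY_RAM_RW, TOY_RAM_LOGDIRTY };

struct toy_p2m {
    bool global_logdirty;
    unsigned int recalcs;   /* how many times entries were recalculated */
};

static void toy_change_type_global(struct toy_p2m *p2m,
                                   enum toy_type ot, enum toy_type nt)
{
    /* Entries are always re-derived, even when ot == nt. */
    p2m->recalcs++;

    /* A pure recalculation (ot == nt) must not touch logdirty state. */
    if ( ot != nt )
        p2m->global_logdirty = (nt == TOY_RAM_LOGDIRTY);
}
```

With ot == nt the call acts purely as "recalculate everything", which is
what the runtime exec-sp handler relies on to apply the new setting.
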
["xsa304/xsa304-4.11-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index fb7edfaef9..d698b1d50a 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -96,6 +96,8 @@ void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index f242e30caf..8712d3b4dc 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1211,6 +1211,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index d6db862678..b02688e316 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -540,3 +540,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}
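
The mask in the quirk above can be checked in isolation. The following is a minimal sketch, not part of the patch: it assumes (as the comment in the patch implies) that the four bits starting at bit 34 of the VT-d capability register advertise superpage support, and the helper names `hide_superpage_caps` and `superpage_field` are hypothetical. It uses `UINT64_C` rather than `0xful` so the arithmetic is 64-bit regardless of the width of `unsigned long`.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: bits 37:34 of the capability register are the superpage field. */
#define CAP_SUPERPAGE_SHIFT 34
#define CAP_SUPERPAGE_MASK  (UINT64_C(0xf) << CAP_SUPERPAGE_SHIFT)

/* Hypothetical helper: return the capability value with the superpage bits cleared. */
static uint64_t hide_superpage_caps(uint64_t cap)
{
    return cap & ~CAP_SUPERPAGE_MASK;
}

/* Hypothetical helper: extract the 4-bit superpage field. */
static unsigned int superpage_field(uint64_t cap)
{
    return (unsigned int)((cap >> CAP_SUPERPAGE_SHIFT) & 0xf);
}
```

Only the four superpage bits change; every other capability bit passes through untouched, which is why the quirk can run before the remaining capability checks in iommu_alloc().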

["xsa304/xsa304-4.11-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index c63a07d29b..684671cb7b 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -828,7 +828,7 @@ effect the inverse meaning.
 >> set as UC.
 
 ### ept (Intel)
-> `= List of ( {no-}pml | {no-}ad )`
+> `= List of [ {no-}pml,  {no-}ad, {no-}exec-sp ]`
 
 Controls EPT related features.
 
@@ -851,6 +851,16 @@ Controls EPT related features.
 
 >> Have hardware keep accessed/dirty (A/D) bits updated.
 
+*   The `exec-sp` boolean controls whether EPT superpages with execute
+    permissions are permitted.  In general this is good for performance.
+
+    However, on processors vulnerable to CVE-2018-12207, HVM guest kernels can
+    use executable superpages to crash the host.  By default, executable
+    superpages are disabled on affected hardware.
+
+    If HVM guest kernels are trusted not to mount a DoS against the system,
+    this option can enabled to regain performance.
+
 ### extra\_guest\_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index f4a6a37149..1924434960 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1706,6 +1706,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     struct p2m_domain *p2m, *hostp2m;
     int rc, fall_through = 0, paged = 0;
     int sharing_enomem = 0;
+    unsigned int page_order = 0;
     vm_event_request_t *req_ptr = NULL;
     bool_t ap2m_active, sync = 0;
 
@@ -1774,7 +1775,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
     hostp2m = p2m_get_hostp2m(currd);
     mfn = get_gfn_type_access(hostp2m, gfn, &p2mt, &p2ma,
                               P2M_ALLOC | (npfec.write_access ? P2M_UNSHARE : 0),
-                              NULL);
+                              &page_order);
 
     if ( ap2m_active )
     {
@@ -1786,7 +1787,7 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             goto out;
         }
 
-        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, NULL);
+        mfn = get_gfn_type_access(p2m, gfn, &p2mt, &p2ma, 0, &page_order);
     }
     else
         p2m = hostp2m;
@@ -1828,6 +1829,24 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, _gfn(gfn), mfn, PAGE_ORDER_4K,
+                                    p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 493986e84a..8821a3b536 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -67,6 +67,7 @@ integer_param("ple_window", ple_window);
 
 static bool_t __read_mostly opt_pml_enabled = 1;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 /*
  * The 'ept' parameter controls functionalities that depend on, or impact the
@@ -94,6 +95,8 @@ static int __init parse_ept_param(const char *s)
             opt_pml_enabled = val;
         else if ( !cmdline_strcmp(s, "ad") )
             opt_ept_ad = val;
+        else if ( !cmdline_strcmp(s, "exec-sp") )
+            opt_ept_exec_sp = val;
         else
             rc = -EINVAL;
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 840dc2b44d..a568d62643 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2415,6 +2415,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
 static void __init lbr_tsx_fixup_check(void);
 static void __init bdw_erratum_bdf14_fixup_check(void);
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avaton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Brasswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2435,6 +2531,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index ce46201d45..93e08f89a2 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -215,6 +215,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index 89619e4afd..20eb7f6082 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index b8151d2d9f..89ae3e03f1 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -54,6 +54,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
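
The mitigation described in the commit message above (superpages marked NX, then shattered to 4k on the first instruction fetch) can be modelled with a toy structure. This is a simplified sketch for illustration only: `struct toy_entry` and the `toy_*` functions are hypothetical stand-ins, not Xen's ept_entry_t or its fault handler, and a single entry stands in for the whole mapping.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-in for an EPT leaf; not Xen's ept_entry_t. */
struct toy_entry {
    unsigned int order;   /* 0 = 4k, 9 = 2M, 18 = 1G */
    bool want_exec;       /* execute permission the p2m type/access asks for */
    bool x;               /* execute permission actually installed */
};

/* Mirrors the rule added to ept_p2m_type_to_flags(): superpages lose X
 * when executable superpages are disallowed. */
static void toy_apply_flags(struct toy_entry *e, bool exec_sp)
{
    e->x = e->want_exec;
    if ( !exec_sp && e->order > 0 )
        e->x = false;
}

/* Mirrors the fault path: an instruction fetch against a non-executable
 * superpage that should be executable is handled by shattering to 4k.
 * Returns true if the fault was handled this way. */
static bool toy_handle_insn_fetch(struct toy_entry *e, bool exec_sp)
{
    if ( e->order == 0 || e->x || !e->want_exec )
        return false;

    e->order = 0;                 /* shatter */
    toy_apply_flags(e, exec_sp);  /* 4k entries keep their X bit */
    return true;
}
```

The point of the model is that the guest only ever executes from 4k mappings on affected hardware, so it cannot set up an executable mapping that changes size behind the iTLB.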

["xsa304/xsa304-4.11-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Allow runtime modification of the exec-sp setting

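Add a runtime handler for the `ept` parameter so that exec-sp can be changed
with `xl set-parameters` without rebooting.  See the documentation change in
this patch for details.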

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/docs/misc/xen-command-line.markdown b/docs/misc/xen-command-line.markdown
index 684671cb7b..33ed1ffc40 100644
--- a/docs/misc/xen-command-line.markdown
+++ b/docs/misc/xen-command-line.markdown
@@ -861,6 +861,21 @@ Controls EPT related features.
     If HVM guest kernels are trusted not to mount a DoS against the system,
     this option can enabled to regain performance.
 
+    This boolean may be modified at runtime using `xl set-parameters
+    ept=[no-]exec-sp` to switch between fast and secure.
+
+    *   When switching from secure to fast, preexisting HVM domains will run
+        at their current performance until they are rebooted; new domains will
+        run without any overhead.
+
+    *   When switching from fast to secure, all HVM domains will immediately
+        suffer a performance penalty.
+
+    **Warning: No guarantee is made that this runtime option will be retained
+      indefinitely, or that it will retain this exact behaviour.  It is
+      intended as an emergency option for people who first chose fast, then
+      change their minds to secure, and wish not to reboot.**
+
 ### extra\_guest\_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 8821a3b536..15376e25ba 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -107,6 +107,41 @@ static int __init parse_ept_param(const char *s)
 }
 custom_param("ept", parse_ept_param);
 
+static int parse_ept_param_runtime(const char *s)
+{
+    int val;
+
+    if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
+         !(hvm_funcs.hap_capabilities &
+           (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
+    {
+        printk("VMX: EPT not available, or not in use - ignoring\n");
+        return 0;
+    }
+
+    if ( (val = parse_boolean("exec-sp", s, NULL)) < 0 )
+        return -EINVAL;
+
+    if ( val != opt_ept_exec_sp )
+    {
+        struct domain *d;
+
+        opt_ept_exec_sp = val;
+
+        rcu_read_lock(&domlist_read_lock);
+        for_each_domain ( d )
+            if ( paging_mode_hap(d) )
+                p2m_change_entry_type_global(d, p2m_ram_rw, p2m_ram_rw);
+        rcu_read_unlock(&domlist_read_lock);
+    }
+
+    printk("VMX: EPT executable superpages %sabled\n",
+           val ? "en" : "dis");
+
+    return 0;
+}
+custom_runtime_only_param("ept", parse_ept_param_runtime);
+
 /* Dynamic (run-time adjusted) execution control flags. */
 u32 vmx_pin_based_exec_control __read_mostly;
 u32 vmx_cpu_based_exec_control __read_mostly;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index 2b62bc61dd..97c417fc3e 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -257,17 +257,22 @@ int p2m_is_logdirty_range(struct p2m_domain *p2m, unsigned long start,
     return 0;
 }
 
+/*
+ * May be called with ot = nt = p2m_ram_rw for its side effect of
+ * recalculating all PTEs in the p2m.
+ */
 void p2m_change_entry_type_global(struct domain *d,
                                   p2m_type_t ot, p2m_type_t nt)
 {
     struct p2m_domain *p2m = p2m_get_hostp2m(d);
 
-    ASSERT(ot != nt);
     ASSERT(p2m_is_changeable(ot) && p2m_is_changeable(nt));
 
     p2m_lock(p2m);
     p2m->change_entry_type_global(p2m, ot, nt);
-    p2m->global_logdirty = (nt == p2m_ram_logdirty);
+    /* Don't allow 'recalculate' operations to change the logdirty state. */
+    if ( ot != nt )
+        p2m->global_logdirty = (nt == p2m_ram_logdirty);
     p2m_unlock(p2m);
 }
 

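The reason for the `ot != nt` guard in the hunk above can be shown with a toy model. This is a sketch under stated assumptions: `struct toy_p2m`, `enum toy_type` and `toy_change_entry_type_global` are hypothetical stand-ins, and the counter merely stands in for the real recalculation of entries.

```c
#include <assert.h>
#include <stdbool.h>

/* Toy stand-ins for the p2m types involved; values are arbitrary. */
enum toy_type { TOY_RAM_RW, TOY_RAM_LOGDIRTY };

struct toy_p2m {
    bool global_logdirty;
    unsigned int recalc_count;   /* how many times entries were recalculated */
};

static void toy_change_entry_type_global(struct toy_p2m *p2m,
                                         enum toy_type ot, enum toy_type nt)
{
    p2m->recalc_count++;   /* stands in for p2m->change_entry_type_global() */

    /* A pure recalculation (ot == nt) must leave the logdirty state alone. */
    if ( ot != nt )
        p2m->global_logdirty = (nt == TOY_RAM_LOGDIRTY);
}
```

Without the guard, a recalculation requested with both types set to the read/write type while log-dirty mode is active would clear the flag as a side effect, because the new type is not the log-dirty type.
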
["xsa304/xsa304-4.12-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtd: Hide superpage support for SandyBridge IOMMUs

Something causes SandyBridge IOMMUs to choke when sharing EPT pagetables, and
an EPT superpage gets shattered.  The root cause is still under investigation,
but the end result is unusable in combination with CVE-2018-12207 protections.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/drivers/passthrough/vtd/extern.h b/xen/drivers/passthrough/vtd/extern.h
index 16eada9fa2..a71c8b0f84 100644
--- a/xen/drivers/passthrough/vtd/extern.h
+++ b/xen/drivers/passthrough/vtd/extern.h
@@ -97,6 +97,8 @@ void vtd_ops_postamble_quirk(struct iommu* iommu);
 int __must_check me_wifi_quirk(struct domain *domain,
                                u8 bus, u8 devfn, int map);
 void pci_vtd_quirk(const struct pci_dev *);
+void quirk_iommu_caps(struct iommu *iommu);
+
 bool_t platform_supports_intremap(void);
 bool_t platform_supports_x2apic(void);
 
diff --git a/xen/drivers/passthrough/vtd/iommu.c b/xen/drivers/passthrough/vtd/iommu.c
index b3664ecbe0..5d34f75306 100644
--- a/xen/drivers/passthrough/vtd/iommu.c
+++ b/xen/drivers/passthrough/vtd/iommu.c
@@ -1215,6 +1215,8 @@ int __init iommu_alloc(struct acpi_drhd_unit *drhd)
     if ( !(iommu->cap + 1) || !(iommu->ecap + 1) )
         return -ENODEV;
 
+    quirk_iommu_caps(iommu);
+
     if ( cap_fault_reg_offset(iommu->cap) +
          cap_num_fault_regs(iommu->cap) * PRIMARY_FAULT_REG_LEN >= PAGE_SIZE ||
          ecap_iotlb_offset(iommu->ecap) >= PAGE_SIZE )
diff --git a/xen/drivers/passthrough/vtd/quirks.c b/xen/drivers/passthrough/vtd/quirks.c
index d6db862678..b02688e316 100644
--- a/xen/drivers/passthrough/vtd/quirks.c
+++ b/xen/drivers/passthrough/vtd/quirks.c
@@ -540,3 +540,28 @@ void pci_vtd_quirk(const struct pci_dev *pdev)
         break;
     }
 }
+
+void __init quirk_iommu_caps(struct iommu *iommu)
+{
+    /*
+     * IOMMU Quirks:
+     *
+     * SandyBridge IOMMUs claim support for 2M and 1G superpages, but don't
+     * implement superpages internally.
+     *
+     * There are issues changing the walk length under in-flight DMA, which
+     * has manifested as incompatibility between EPT/IOMMU sharing and the
+     * workaround for CVE-2018-12207 / XSA-304.  Hide the superpages
+     * capabilities in the IOMMU, which will prevent Xen from sharing the EPT
+     * and IOMMU pagetables.
+     *
+     * Detection of SandyBridge unfortunately has to be done by processor
+     * model because the client parts don't expose their IOMMUs as PCI devices
+     * we could match with a Device ID.
+     */
+    if ( boot_cpu_data.x86_vendor == X86_VENDOR_INTEL &&
+         boot_cpu_data.x86 == 6 &&
+         (boot_cpu_data.x86_model == 0x2a ||
+          boot_cpu_data.x86_model == 0x2d) )
+        iommu->cap &= ~(0xful << 34);
+}

["xsa304/xsa304-4.12-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Disable executable EPT superpages to work around
 CVE-2018-12207

CVE-2018-12207 covers a set of errata on various Intel processors, whereby a
machine check exception can be generated in a corner case when an executable
mapping changes size or cacheability without TLB invalidation.  HVM guest
kernels can trigger this to DoS the host.

To mitigate, in affected hardware, all EPT superpages are marked NX.  When an
instruction fetch violation is observed against the superpage, the superpage
is shattered to 4k and has execute permissions restored.  This prevents the
guest kernel from being able to create the necessary preconditions in the iTLB
to exploit the vulnerability.

This does come with a workload-dependent performance overhead, caused by
increased TLB pressure.  Performance can be restored, if guest kernels are
trusted not to mount an attack, by specifying ept=exec-sp on the command line.

This is part of XSA-304 / CVE-2018-12207

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Acked-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index 85081fdc94..e283017015 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -895,7 +895,7 @@ Controls for interacting with the system Extended Firmware Interface.
     uncacheable.
 
 ### ept
-> `= List of [ ad=<bool>, pml=<bool> ]`
+> `= List of [ ad=<bool>, pml=<bool>, exec-sp=<bool> ]`
 
 > Applicability: Intel
 
@@ -926,6 +926,16 @@ introduced with the Nehalem architecture.
     disable PML.  `pml=0` can be used to prevent the use of PML on otherwise
     capable hardware.
 
+*   The `exec-sp` boolean controls whether EPT superpages with execute
+    permissions are permitted.  In general this is good for performance.
+
+    However, on processors vulnerable to CVE-2018-12207, HVM guest kernels can
+    use executable superpages to crash the host.  By default, executable
+    superpages are disabled on affected hardware.
+
+    If HVM guest kernels are trusted not to mount a DoS against the system,
+    this option can enabled to regain performance.
+
 ### extra_guest_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/hvm.c b/xen/arch/x86/hvm/hvm.c
index 2089a77270..84191d4e4b 100644
--- a/xen/arch/x86/hvm/hvm.c
+++ b/xen/arch/x86/hvm/hvm.c
@@ -1814,6 +1814,24 @@ int hvm_hap_nested_page_fault(paddr_t gpa, unsigned long gla,
             break;
         }
 
+        /*
+         * Workaround for XSA-304 / CVE-2018-12207.  If we take an execution
+         * fault against a non-executable superpage, shatter it to regain
+         * execute permissions.
+         */
+        if ( page_order > 0 && npfec.insn_fetch && npfec.present && !violation )
+        {
+            int res = p2m_set_entry(p2m, _gfn(gfn), mfn, PAGE_ORDER_4K,
+                                    p2mt, p2ma);
+
+            if ( res )
+                printk(XENLOG_ERR "Failed to shatter gfn %"PRI_gfn": %d\n",
+                       gfn, res);
+
+            rc = !res;
+            goto out_put_gfn;
+        }
+
         if ( violation )
         {
             /* Should #VE be emulated for this fault? */
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index 56519fee84..ec5ab860ad 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -67,6 +67,7 @@ integer_param("ple_window", ple_window);
 
 static bool __read_mostly opt_ept_pml = true;
 static s8 __read_mostly opt_ept_ad = -1;
+int8_t __read_mostly opt_ept_exec_sp = -1;
 
 static int __init parse_ept_param(const char *s)
 {
@@ -82,6 +83,8 @@ static int __init parse_ept_param(const char *s)
             opt_ept_ad = val;
         else if ( (val = parse_boolean("pml", s, ss)) >= 0 )
             opt_ept_pml = val;
+        else if ( (val = parse_boolean("exec-sp", s, ss)) >= 0 )
+            opt_ept_exec_sp = val;
         else
             rc = -EINVAL;
 
diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 26b7ddb5fe..28cba8ec28 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -2445,6 +2445,102 @@ static void pi_notification_interrupt(struct cpu_user_regs *regs)
 static void __init lbr_tsx_fixup_check(void);
 static void __init bdw_erratum_bdf14_fixup_check(void);
 
+/*
+ * Calculate whether the CPU is vulnerable to Instruction Fetch page
+ * size-change MCEs.
+ */
+static bool __init has_if_pschange_mc(void)
+{
+    uint64_t caps = 0;
+
+    /*
+     * If we are virtualised, there is nothing we can do.  Our EPT tables are
+     * shadowed by our hypervisor, and not walked by hardware.
+     */
+    if ( cpu_has_hypervisor )
+        return false;
+
+    if ( boot_cpu_has(X86_FEATURE_ARCH_CAPS) )
+        rdmsrl(MSR_ARCH_CAPABILITIES, caps);
+
+    if ( caps & ARCH_CAPS_IF_PSCHANGE_MC_NO )
+        return false;
+
+    /*
+     * IF_PSCHANGE_MC is only known to affect Intel Family 6 processors at
+     * this time.
+     */
+    if ( boot_cpu_data.x86_vendor != X86_VENDOR_INTEL ||
+         boot_cpu_data.x86 != 6 )
+        return false;
+
+    switch ( boot_cpu_data.x86_model )
+    {
+        /*
+         * Core processors since at least Nehalem are vulnerable.
+         */
+    case 0x1f: /* Auburndale / Havendale */
+    case 0x1e: /* Nehalem */
+    case 0x1a: /* Nehalem EP */
+    case 0x2e: /* Nehalem EX */
+    case 0x25: /* Westmere */
+    case 0x2c: /* Westmere EP */
+    case 0x2f: /* Westmere EX */
+    case 0x2a: /* SandyBridge */
+    case 0x2d: /* SandyBridge EP/EX */
+    case 0x3a: /* IvyBridge */
+    case 0x3e: /* IvyBridge EP/EX */
+    case 0x3c: /* Haswell */
+    case 0x3f: /* Haswell EX/EP */
+    case 0x45: /* Haswell D */
+    case 0x46: /* Haswell H */
+    case 0x3d: /* Broadwell */
+    case 0x47: /* Broadwell H */
+    case 0x4f: /* Broadwell EP/EX */
+    case 0x56: /* Broadwell D */
+    case 0x4e: /* Skylake M */
+    case 0x5e: /* Skylake D */
+    case 0x55: /* Skylake-X / Cascade Lake */
+    case 0x8e: /* Kaby / Coffee / Whiskey Lake M */
+    case 0x9e: /* Kaby / Coffee / Whiskey Lake D */
+        return true;
+
+        /*
+         * Atom processors are not vulnerable.
+         */
+    case 0x1c: /* Pineview */
+    case 0x26: /* Lincroft */
+    case 0x27: /* Penwell */
+    case 0x35: /* Cloverview */
+    case 0x36: /* Cedarview */
+    case 0x37: /* Baytrail / Valleyview (Silvermont) */
+    case 0x4d: /* Avaton / Rangely (Silvermont) */
+    case 0x4c: /* Cherrytrail / Brasswell */
+    case 0x4a: /* Merrifield */
+    case 0x5a: /* Moorefield */
+    case 0x5c: /* Goldmont */
+    case 0x5d: /* SoFIA 3G Granite/ES2.1 */
+    case 0x65: /* SoFIA LTE AOSP */
+    case 0x5f: /* Denverton */
+    case 0x6e: /* Cougar Mountain */
+    case 0x75: /* Lightning Mountain */
+    case 0x7a: /* Gemini Lake */
+    case 0x86: /* Jacobsville */
+
+        /*
+         * Knights processors are not vulnerable.
+         */
+    case 0x57: /* Knights Landing */
+    case 0x85: /* Knights Mill */
+        return false;
+
+    default:
+        printk("Unrecognised CPU model %#x - assuming vulnerable to IF_PSCHANGE_MC\n",
+               boot_cpu_data.x86_model);
+        return true;
+    }
+}
+
 const struct hvm_function_table * __init start_vmx(void)
 {
     set_in_cr4(X86_CR4_VMXE);
@@ -2465,6 +2561,17 @@ const struct hvm_function_table * __init start_vmx(void)
      */
     if ( cpu_has_vmx_ept && (cpu_has_vmx_pat || opt_force_ept) )
     {
+        bool cpu_has_bug_pschange_mc = has_if_pschange_mc();
+
+        if ( opt_ept_exec_sp == -1 )
+        {
+            /* Default to non-executable superpages on vulnerable hardware. */
+            opt_ept_exec_sp = !cpu_has_bug_pschange_mc;
+
+            if ( cpu_has_bug_pschange_mc )
+                printk("VMX: Disabling executable EPT superpages due to CVE-2018-12207\n");
+        }
+
         vmx_function_table.hap_supported = 1;
         vmx_function_table.altp2m_supported = 1;
 
diff --git a/xen/arch/x86/mm/p2m-ept.c b/xen/arch/x86/mm/p2m-ept.c
index 952ebad82f..834d4798c8 100644
--- a/xen/arch/x86/mm/p2m-ept.c
+++ b/xen/arch/x86/mm/p2m-ept.c
@@ -174,6 +174,12 @@ static void ept_p2m_type_to_flags(struct p2m_domain *p2m, ept_entry_t *entry,
             break;
     }
     
+    /*
+     * Don't create executable superpages if we need to shatter them to
+     * protect against CVE-2018-12207.
+     */
+    if ( !opt_ept_exec_sp && is_epte_superpage(entry) )
+        entry->x = 0;
 }
 
 #define GUEST_TABLE_MAP_FAILED  0
diff --git a/xen/include/asm-x86/hvm/vmx/vmx.h b/xen/include/asm-x86/hvm/vmx/vmx.h
index ebaa74449b..371b912887 100644
--- a/xen/include/asm-x86/hvm/vmx/vmx.h
+++ b/xen/include/asm-x86/hvm/vmx/vmx.h
@@ -28,6 +28,8 @@
 #include <asm/hvm/trace.h>
 #include <asm/hvm/vmx/vmcs.h>
 
+extern int8_t opt_ept_exec_sp;
+
 typedef union {
     struct {
         u64 r       :   1,  /* bit 0 - Read permission */
diff --git a/xen/include/asm-x86/msr-index.h b/xen/include/asm-x86/msr-index.h
index 637259bd1f..32746aa8ae 100644
--- a/xen/include/asm-x86/msr-index.h
+++ b/xen/include/asm-x86/msr-index.h
@@ -52,6 +52,7 @@
 #define ARCH_CAPS_SKIP_L1DFL		(_AC(1, ULL) << 3)
 #define ARCH_CAPS_SSB_NO		(_AC(1, ULL) << 4)
 #define ARCH_CAPS_MDS_NO		(_AC(1, ULL) << 5)
+#define ARCH_CAPS_IF_PSCHANGE_MC_NO	(_AC(1, ULL) << 6)
 
 #define MSR_FLUSH_CMD			0x0000010b
 #define FLUSH_CMD_L1D			(_AC(1, ULL) << 0)
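
The order of checks in has_if_pschange_mc() and the handling of the tri-state option in start_vmx() can be summarised as follows. This is a condensed sketch, not the patch's code: `struct toy_cpu` and the `toy_*` functions are hypothetical, the per-model table is collapsed into a single field, and only the bit position of the "not affected" flag is taken from the patch.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define TOY_IF_PSCHANGE_MC_NO (UINT64_C(1) << 6)  /* bit position from the patch */

struct toy_cpu {
    bool virtualised;       /* running under another hypervisor */
    uint64_t arch_caps;     /* 0 when the MSR is not enumerated */
    bool intel_family6;
    int model_known_safe;   /* 1 = listed as not vulnerable,
                             * 0 = listed as vulnerable,
                             * -1 = unrecognised model */
};

static bool toy_has_if_pschange_mc(const struct toy_cpu *c)
{
    if ( c->virtualised )
        return false;
    if ( c->arch_caps & TOY_IF_PSCHANGE_MC_NO )
        return false;
    if ( !c->intel_family6 )
        return false;
    return c->model_known_safe != 1;   /* unknown models are assumed vulnerable */
}

/* opt: -1 = not given on the command line, 0/1 = explicit choice. */
static int toy_resolve_exec_sp(int opt, const struct toy_cpu *c)
{
    return opt == -1 ? !toy_has_if_pschange_mc(c) : opt;
}
```

An explicit command line choice always wins; the hardware check only supplies the default, and an unrecognised Intel Family 6 model defaults to the secure setting.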

["xsa304/xsa304-4.12-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/vtx: Allow runtime modification of the exec-sp setting

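Add a runtime handler for the `ept` parameter so that exec-sp can be changed
with `xl set-parameters` without rebooting.  See the documentation change in
this patch for details.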

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
Reviewed-by: George Dunlap <george.dunlap@citrix.com>

diff --git a/docs/misc/xen-command-line.pandoc b/docs/misc/xen-command-line.pandoc
index e283017015..84221fe60a 100644
--- a/docs/misc/xen-command-line.pandoc
+++ b/docs/misc/xen-command-line.pandoc
@@ -936,6 +936,21 @@ introduced with the Nehalem architecture.
     If HVM guest kernels are trusted not to mount a DoS against the system,
     this option can enabled to regain performance.
 
+    This boolean may be modified at runtime using `xl set-parameters
+    ept=[no-]exec-sp` to switch between fast and secure.
+
+    *   When switching from secure to fast, preexisting HVM domains will run
+        at their current performance until they are rebooted; new domains will
+        run without any overhead.
+
+    *   When switching from fast to secure, all HVM domains will immediately
+        suffer a performance penalty.
+
+    **Warning: No guarantee is made that this runtime option will be retained
+      indefinitely, or that it will retain this exact behaviour.  It is
+      intended as an emergency option for people who first chose fast, then
+      change their minds to secure, and wish not to reboot.**
+
 ### extra_guest_irqs
 > `= [<domU number>][,<dom0 number>]`
 
diff --git a/xen/arch/x86/hvm/vmx/vmcs.c b/xen/arch/x86/hvm/vmx/vmcs.c
index ec5ab860ad..c4d8a5ba78 100644
--- a/xen/arch/x86/hvm/vmx/vmcs.c
+++ b/xen/arch/x86/hvm/vmx/vmcs.c
@@ -95,6 +95,41 @@ static int __init parse_ept_param(const char *s)
 }
 custom_param("ept", parse_ept_param);
 
+static int parse_ept_param_runtime(const char *s)
+{
+    int val;
+
+    if ( !cpu_has_vmx_ept || !hvm_funcs.hap_supported ||
+         !(hvm_funcs.hap_capabilities &
+           (HVM_HAP_SUPERPAGE_2MB | HVM_HAP_SUPERPAGE_1GB)) )
+    {
+        printk("VMX: EPT not available, or not in use - ignoring\n");
+        return 0;
+    }
+
+    if ( (val = parse_boolean("exec-sp", s, NULL)) < 0 )
+        return -EINVAL;
+
+    if ( val != opt_ept_exec_sp )
+    {
+        struct domain *d;
+
+        opt_ept_exec_sp = val;
+
+        rcu_read_lock(&domlist_read_lock);
+        for_each_domain ( d )
+            if ( paging_mode_hap(d) )
+                p2m_change_entry_type_global(d, p2m_ram_rw, p2m_ram_rw);
+        rcu_read_unlock(&domlist_read_lock);
+    }
+
+    printk("VMX: EPT executable superpages %sabled\n",
+           val ? "en" : "dis");
+
+    return 0;
+}
+custom_runtime_only_param("ept", parse_ept_param_runtime);
+
 /* Dynamic (run-time adjusted) execution control flags. */
 u32 vmx_pin_based_exec_control __read_mostly;
 u32 vmx_cpu_based_exec_control __read_mostly;
diff --git a/xen/arch/x86/mm/p2m.c b/xen/arch/x86/mm/p2m.c
index f518f86493..16608098b1 100644
--- a/xen/arch/x86/mm/p2m.c
+++ b/xen/arch/x86/mm/p2m.c
@@ -289,15 +289,20 @@ static void change_entry_type_global(struct p2m_domain *p2m,
                                      p2m_type_t ot, p2m_type_t nt)
 {
     p2m->change_entry_type_global(p2m, ot, nt);
-    p2m->global_logdirty = (nt == p2m_ram_logdirty);
+    /* Don't allow 'recalculate' operations to change the logdirty state. */
+    if ( ot != nt )
+        p2m->global_logdirty = (nt == p2m_ram_logdirty);
 }
 
+/*
+ * May be called with ot = nt = p2m_ram_rw for its side effect of
+ * recalculating all PTEs in the p2m.
+ */
 void p2m_change_entry_type_global(struct domain *d,
                                   p2m_type_t ot, p2m_type_t nt)
 {
     struct p2m_domain *hostp2m = p2m_get_hostp2m(d);
 
-    ASSERT(ot != nt);
     ASSERT(p2m_is_changeable(ot) && p2m_is_changeable(nt));
 
     p2m_lock(hostp2m);
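
The control flow of the runtime handler in this patch (parse, compare with the current value, recalculate only on change) can be sketched as below. This is an illustration only: `toy_parse_exec_sp` is a hypothetical, much simplified stand-in for Xen's boolean parsing and accepts just the two spellings shown in the documentation hunk, and `toy_set_exec_sp` reports whether a recalculation would be needed instead of walking any domains.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical, simplified parser: returns 1 for "exec-sp",
 * 0 for "no-exec-sp", and -1 for anything else. */
static int toy_parse_exec_sp(const char *s)
{
    if ( !strcmp(s, "exec-sp") )
        return 1;
    if ( !strcmp(s, "no-exec-sp") )
        return 0;
    return -1;
}

/* Returns 1 if the setting changed (domains would need their entries
 * recalculated), 0 if it was already in effect, -1 on a parse error. */
static int toy_set_exec_sp(int *current, const char *s)
{
    int val = toy_parse_exec_sp(s);

    if ( val < 0 )
        return -1;
    if ( val == *current )
        return 0;

    *current = val;
    return 1;
}
```

Skipping the recalculation when the value is unchanged matters because the recalculation touches every HAP domain.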

