
List:       oss-security
Subject:    [oss-security] Xen Security Advisory 293 v4 (CVE-2019-17347) - x86: PV kernel context switch corruption
From:       Xen.org security team <security@xen.org>
Date:       2019-10-25 11:10:40
Message-ID: E1iNxUO-0002mf-4u@xenbits.xenproject.org

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2019-17347 / XSA-293
                              version 4

                x86: PV kernel context switch corruption

UPDATES IN VERSION 4
====================

Correct affected versions statement.

CVE assigned.

ISSUE DESCRIPTION
=================

On hardware supporting the fsgsbase feature, a 64bit PV guest can set
and clear the applicable control bit in its virtualised %cr4, but the
feature remains fully active in hardware.  Therefore, the associated
instructions are actually usable.

Linux, which does not currently support this feature, has various
optimisations in its context switch path which justifiably assume that
userspace can't actually change the fs and gs base values without a
system call.

Xen's behaviour of having this feature active behind the guest kernel's
back undermines the correctness of any context switch logic which
depends on the feature being disabled.

Userspace can therefore corrupt fsbase or gsbase (commonly used for
Thread Local Storage) in the next thread to be scheduled on the
current vcpu.
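
The following minimal probe illustrates the problem from guest
userspace.  It is an editorial sketch, not part of the advisory: it
assumes a 64bit PV guest on fsgsbase-capable hardware and a toolchain
whose assembler knows the RD/WRFSBASE mnemonics.  On an unpatched Xen
the instructions execute even though the guest kernel has
CR4.FSGSBASE clear; on a fixed system they raise #UD (seen as SIGILL).

    /* fsgsbase-probe.c: gcc -O2 -o fsgsbase-probe fsgsbase-probe.c */
    #include <setjmp.h>
    #include <signal.h>
    #include <stdint.h>
    #include <stdio.h>

    static sigjmp_buf env;

    static void sigill(int sig)
    {
        (void)sig;
        siglongjmp(env, 1);         /* #UD arrived as SIGILL. */
    }

    int main(void)
    {
        uint64_t base;

        signal(SIGILL, sigill);

        if ( sigsetjmp(env, 1) )
        {
            puts("RDFSBASE raised #UD - FSGSBASE is correctly disabled");
            return 0;
        }

        /* Read %fs base and write it straight back: harmless if it
         * executes, but it should not execute at all while the guest
         * kernel believes FSGSBASE is disabled. */
        asm volatile ( "rdfsbase %0" : "=r" (base) );
        asm volatile ( "wrfsbase %0" :: "r" (base) );

        printf("FSGSBASE usable from userspace (fs base %#lx) - affected\n",
               (unsigned long)base);
        return 0;
    }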

IMPACT
======

A malicious unprivileged guest userspace process can escalate its
privilege to that of other userspace processes in the same guest, and
potentially thereby to that of the guest operating system.

Additionally, some guest software which attempts to use this CPU
feature may trigger the bug accidentally, leading to crashes or
corruption of other processes in the same guest.

VULNERABLE SYSTEMS
==================

Xen versions 4.4 and later are vulnerable.  Xen 4.3 and earlier are not
vulnerable.

Only x86 hardware with the fsgsbase feature is vulnerable.  This is
believed to be Intel IvyBridge and later hardware, and AMD Steamroller
and later hardware.

ARM hardware is not affected.

Only 64bit PV guests can exploit the vulnerability.  32bit PV guests,
and HVM/PVH guests cannot exploit the vulnerability.

Whether the bug is exploitable, and whether it will be triggered by
accident, depend in a complicated way on the guest operating system
and its configuration.  Most guests are vulnerable to malicious
userspace processes.
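
As a rough aid (an editorial sketch, not from the advisory), hardware
support for fsgsbase can be checked from CPUID leaf 7, subleaf 0,
where EBX bit 0 advertises the feature.  Note this must be run on the
host; inside a guest, Xen filters the leaf according to the domain's
policy.

    /* fsgsbase-check.c: report whether the CPU offers FSGSBASE. */
    #include <stdio.h>

    int main(void)
    {
        unsigned int eax, ebx, ecx, edx;

        /* Leaf 0 first, to confirm leaf 7 exists at all. */
        asm volatile ( "cpuid"
                       : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                       : "0" (0), "2" (0) );
        if ( eax < 7 )
        {
            puts("CPUID leaf 7 not implemented - no fsgsbase");
            return 0;
        }

        asm volatile ( "cpuid"
                       : "=a" (eax), "=b" (ebx), "=c" (ecx), "=d" (edx)
                       : "0" (7), "2" (0) );

        printf("fsgsbase %ssupported by this CPU\n",
               (ebx & 1) ? "" : "not ");
        return 0;
    }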

MITIGATION
==========

Running only 32bit PV or HVM/PVH guests will avoid this vulnerability.

CREDITS
=======

This issue was discovered by Andy Lutomirski.

RESOLUTION
==========

Applying the appropriate attached patches resolves this issue.

xsa293/unstable-?.patch         xen-unstable
xsa293/4.11-?.patch             Xen 4.11.x
xsa293/4.10-?.patch             Xen 4.10.x
xsa293/4.9-?.patch              Xen 4.9.x
xsa293/4.8-?.patch              Xen 4.8.x
xsa293/4.7-?.patch              Xen 4.7.x

$ sha256sum xsa293* xsa293*/*
27baf055642a3a7e9d2b1a961e15a46b592eca7c6f63e28e3bcb19e4cebfd0bd  xsa293.meta
865596b3dca81712a7d3d78f22e40aed1a08732f93b1950af6f092d893323a0f  xsa293/4.7-1.patch
032559c4bbdfe0987b9d3b15cf8661d8d8a5d4e2e989c944490ac171305fba3b  xsa293/4.7-2.patch
d3d91a1a5083b0a1992750b808aefacd0f0d4e7e92d1436e620a542e935cdadd  xsa293/4.7-3.patch
14b3db49375e353394b831a342d873d83615285d516f8cb08a0e1564d675cd51  xsa293/4.8-1.patch
1efc2ee18f54c7c41f478e944b3b708eb283bfa9de68a1046033d57784846c30  xsa293/4.8-2.patch
0d28899cad0e6798ae6a96717c15363ddf5a35e334ede02becdc81538ae589cc  xsa293/4.8-3.patch
b24210a74eb9dca5c7af902d223dba1b1b372df06a99fb1b0df8e92c9f9632f3  xsa293/4.9-1.patch
f68101f80d9843c1cdbb70188caec7009a0d52d33d811d22091e7c1f265a15e1  xsa293/4.9-2.patch
194e42599eac16afab14856760901705a0600c1308645495f30d30f8dd68734c  xsa293/4.10-1.patch
1fdee59bba66bd6b3ea4949913457dbcb1b8d5cb85fd8fb60aacac9a403ee9a9  xsa293/4.10-2.patch
277ba95e9a2276378fc9b3bcf89b694b9670256cde62278ade2e90d3fd5f7c46  xsa293/4.11-1.patch
724a0f433427a747876cbec09381dc1ca99286cea0ecbdd098c6e68fb135eeda  xsa293/4.11-2.patch
837eb67900a7c70cf7a00836cb312506925ca1fd29529144ff312316b0dbb086  xsa293/unstable-1.patch
0a6df8c8778a1c7e1fb71825695a86dee36f2e9345b39a06e3a364ad8b938de0  xsa293/unstable-2.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations should contact the Xen Project Security
Team.


(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decision-making.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAl2y1+8MHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZ+v0H/21IMJyzcEdBt5Ki3zJ4gWL5XKxzy7p5r8IyvLto
KulFRzMU2gopsrSji394Inl+iydSEgSRGNMytpJ6HlYmAH+O5xJe3BVsLyf4tvTO
ONTs72xin6mm3h/cUSVtLzTfLAYX6AA37uy/kqUOGH9Bn1VDNhKFDwTjwb7riaDe
cHpvCaQJGK9HBYjzD8HyAfh0nKupgLb19FdG5r2CjXqyHK1A+bC3LPdOc9jfNYrY
YP4LV0nSU5XOBi6RrOSXySadvQQTXtaFACtpcRGQEhrXKmO+bUCQiyJzn2JtmxZP
7uMN9OqR6idl3mxgBb1QiHfxIFw2NB/MC6BoTBn4+Ea7yJk=
=cxjY
-----END PGP SIGNATURE-----

["xsa293.meta" (application/octet-stream)]
["xsa293/4.7-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Improve pv_cpuid()'s API

pv_cpuid()'s API is awkward to use.  There are already two callers jumping
through hoops to use it, and a third is on its way.

Change the API to take each parameter individually (like its counterpart,
hvm_cpuid(), already does), and introduce a new pv_cpuid_regs() wrapper
implementing the old API.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
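
As an editorial illustration (not part of the patch), the difference
for a caller wanting a single output register looks roughly like this;
the old style follows the previous prototype, which took the whole
register block and read leaf/subleaf from regs->eax/regs->ecx:

    /* Old API: leaf/subleaf smuggled through a full register block. */
    struct cpu_user_regs regs = *guest_cpu_user_regs();

    regs._eax = 7;
    regs._ecx = 0;
    pv_cpuid(&regs);            /* Results land in regs._e{a,b,c,d}x. */

    /* New API: explicit leaf/subleaf, with NULL discarding outputs. */
    uint32_t leaf7_ebx;

    pv_cpuid(7, 0, NULL, &leaf7_ebx, NULL, NULL);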

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index 82e3c2c..59395dd 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3698,7 +3698,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
         break;
     }
     case EXIT_REASON_CPUID:
-        is_pvh_vcpu(v) ? pv_cpuid(regs) : vmx_do_cpuid(regs);
+        is_pvh_vcpu(v) ? pv_cpuid_regs(regs) : vmx_do_cpuid(regs);
         update_guest_eip(); /* Safe: CPUID */
         break;
     case EXIT_REASON_HLT:
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 139737b..c001f93 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -924,17 +924,14 @@ static void _domain_cpuid(struct domain *currd,
         cpuid_count(leaf, subleaf, eax, ebx, ecx, edx);
 }
 
-void pv_cpuid(struct cpu_user_regs *regs)
+void pv_cpuid(uint32_t leaf, uint32_t subleaf,
+              uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t leaf, subleaf, a, b, c, d;
+    uint32_t a, b, c, d;
+    const struct cpu_user_regs *regs = guest_cpu_user_regs();
     struct vcpu *curr = current;
     struct domain *currd = curr->domain;
 
-    leaf = a = regs->eax;
-    b = regs->ebx;
-    subleaf = c = regs->ecx;
-    d = regs->edx;
-
     if ( cpuid_hypervisor_leaves(leaf, subleaf, &a, &b, &c, &d) )
         goto out;
 
@@ -1200,17 +1197,21 @@ void pv_cpuid(struct cpu_user_regs *regs)
     case 0x8000001e: /* Extended topology reporting */
     unsupported:
         a = b = c = d = 0;
-        break;
+        goto out;
     }
 
- out:
     /* VPMU may decide to modify some of the leaves */
     vpmu_do_cpuid(leaf, &a, &b, &c, &d);
 
-    regs->eax = a;
-    regs->ebx = b;
-    regs->ecx = c;
-    regs->edx = d;
+ out:
+    if ( eax )
+        *eax = a;
+    if ( ebx )
+        *ebx = b;
+    if ( ecx )
+        *ecx = c;
+    if ( edx )
+        *edx = d;
 }
 
 static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
@@ -1260,7 +1261,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
         return 0;
     eip += sizeof(instr);
 
-    pv_cpuid(regs);
+    pv_cpuid_regs(regs);
 
     instruction_done(regs, eip, 0);
 
@@ -3135,7 +3136,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         break;
 
     case 0xa2: /* CPUID */
-        pv_cpuid(regs);
+        pv_cpuid_regs(regs);
         break;
 
     default:
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 14bed92..6a35b89 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -649,7 +649,14 @@ enum get_cpu_vendor {
 };
 
 int get_cpu_vendor(const char vendor_id[], enum get_cpu_vendor);
-void pv_cpuid(struct cpu_user_regs *regs);
+void pv_cpuid(uint32_t leaf, uint32_t subleaf,
+              uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
+
+static inline void pv_cpuid_regs(struct cpu_user_regs *regs)
+{
+    pv_cpuid(regs->_eax, regs->_ecx,
+             &regs->_eax, &regs->_ebx, &regs->_ecx, &regs->_edx);
+}
 
 #endif /* !__ASSEMBLY__ */
 

["xsa293/4.7-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced: bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest-visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't result in any practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 938fce0..8ddb12c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -377,6 +377,63 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    unsigned int leaf1_ecx = 0, leaf1_edx = 0;
+    unsigned int leaf7_0_ebx = 0, level = 0;
+
+    pv_cpuid(0, 0, &level, NULL, NULL, NULL);
+    if ( level >= 1 )
+        pv_cpuid(1, 0, NULL, NULL, &leaf1_ecx, &leaf1_edx);
+    if ( level >= 7 )
+        pv_cpuid(7, 0, NULL, &leaf7_0_ebx, NULL, NULL);
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~(((leaf1_edx & cpufeat_mask(X86_FEATURE_TSC))
+              ? 0 : X86_CR4_TSD) |
+             ((leaf1_edx & cpufeat_mask(X86_FEATURE_DE))
+              ? 0 : X86_CR4_DE) |
+             ((leaf7_0_ebx & cpufeat_mask(X86_FEATURE_FSGSBASE))
+              ? 0 : X86_CR4_FSGSBASE) |
+             ((leaf1_ecx & cpufeat_mask(X86_FEATURE_XSAVE))
+              ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv_domain.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv_domain.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 static void set_domain_xpti(struct domain *d)
 {
     if ( is_pv_32bit_domain(d) )
@@ -551,6 +608,8 @@ int vcpu_initialise(struct vcpu *v)
 
         /* PV guests by default have a 100Hz ticker. */
         v->periodic_period = MILLISECS(10);
+
+        v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
     }
 
     v->arch.schedule_tail = continue_nonidle_domain;
@@ -563,8 +622,6 @@ int vcpu_initialise(struct vcpu *v)
         v->arch.cr3           = __pa(idle_pg_table);
     }
 
-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
-
     if ( is_pv_32bit_domain(d) )
     {
         if ( (rc = setup_compat_arg_xlat(v)) )
@@ -937,49 +994,6 @@ int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -993,7 +1007,7 @@ int arch_set_info_guest(
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long flags, cr4;
+    unsigned long flags;
     unsigned int i;
     int rc = 0, compat;
 
@@ -1210,9 +1224,8 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] =
+        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
 
     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
     for ( i = 0; i < 8; i++ )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c9c6fc9..06fd4e4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -501,33 +501,13 @@ void make_cr3(struct vcpu *v, unsigned long mfn)
         v->arch.cr3 |= get_pcid_bits(v, 0);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv_domain.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv_domain.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
@@ -546,8 +526,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index c001f93..6cbbf3f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -918,7 +918,8 @@ static void _domain_cpuid(struct domain *currd,
                           unsigned int *eax, unsigned int *ebx,
                           unsigned int *ecx, unsigned int *edx)
 {
-    if ( !is_control_domain(currd) && !is_hardware_domain(currd) )
+    if ( !is_control_domain(currd) && !is_hardware_domain(currd) &&
+         !is_idle_domain(currd) )
         domain_cpuid(currd, leaf, subleaf, eax, ebx, ecx, edx);
     else
         cpuid_count(leaf, subleaf, eax, ebx, ecx, edx);
@@ -2721,8 +2722,8 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         }
 
         case 4: /* Write CR4 */
-            v->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(v, *reg);
-            write_cr4(pv_guest_cr4_to_real_cr4(v));
+            v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, *reg);
+            write_cr4(pv_make_cr4(v));
             ctxt_switch_levelling(v);
             break;
 
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 132c2b0..3d6f547 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -637,16 +637,22 @@ bool_t update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
 
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
 
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
 
 void domain_cpuid(struct domain *d,
                   unsigned int  input,
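
The hardware %cr4 value assembled by pv_make_cr4() above can be
summarised as follows (an editorial note derived from the patch, not
part of it):

    /*
     * Start from mmu_cr4_features with PCIDE, PGE and TSD stripped, then:
     *
     *   pcid    xpti     bits added back
     *   -----   -----    ------------------------------------------------
     *   true    (any)    PCIDE (PGE stays clear: the two together would
     *                    undermine the TLB flushing invariants)
     *   false   false    PGE
     *   false   true     neither, so plain CR3 writes flush everything
     *
     * TSD is added back if the guest elected to use it, or if Xen is
     * virtualising the guest's TSC (d->arch.vtsc).
     */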

["xsa293/4.7-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8ddb12c..20c0bd0 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -431,6 +431,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE instructions are only usable in 64bit code
+     * segments.  While we must not have CR4.FSGSBASE set behind the back of a
+     * 64bit PV kernel, we do leave it set in 32bit PV context to speed up
+     * Xen's context switch path.
+     */
+    if ( !is_pv_32bit_domain(d) &&
+         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -2012,7 +2022,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv_vcpu.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 0072de7..9eca57a 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1442,7 +1442,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 6cbbf3f..583936e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2722,6 +2722,17 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         }
 
         case 4: /* Write CR4 */
+            /*
+             * If this write will disable FSGSBASE, refresh Xen's idea of the
+             * guest bases now that they can no longer change.
+             */
+            if ( (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+                 !(*reg & X86_CR4_FSGSBASE) )
+            {
+                v->arch.pv_vcpu.fs_base = __rdfsbase();
+                v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+            }
+
             v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, *reg);
             write_cr4(pv_make_cr4(v));
             ctxt_switch_levelling(v);
@@ -2993,13 +3004,14 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         case MSR_FS_BASE:
             if ( is_pv_32bit_domain(currd) )
                 goto fail;
-            val = cpu_has_fsgsbase ? __rdfsbase() : v->arch.pv_vcpu.fs_base;
+            val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                                  : v->arch.pv_vcpu.fs_base;
             goto rdmsr_writeback;
         case MSR_GS_BASE:
             if ( is_pv_32bit_domain(currd) )
                 goto fail;
-            val = cpu_has_fsgsbase ? __rdgsbase()
-                                   : v->arch.pv_vcpu.gs_base_kernel;
+            val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                                  : v->arch.pv_vcpu.gs_base_kernel;
             goto rdmsr_writeback;
         case MSR_SHADOW_GS_BASE:
             if ( is_pv_32bit_domain(currd) )
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 56ed156..04cc60d 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -265,7 +265,9 @@ void toggle_guest_mode(struct vcpu *v)
 {
     if ( is_pv_32bit_vcpu(v) )
         return;
-    if ( cpu_has_fsgsbase )
+
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 04d9e28..f9c8335 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -72,7 +72,6 @@
 #define cpu_has_nx		boot_cpu_has(X86_FEATURE_NX)
 #define cpu_has_clflush		boot_cpu_has(X86_FEATURE_CLFLUSH)
 #define cpu_has_page1gb		boot_cpu_has(X86_FEATURE_PAGE1GB)
-#define cpu_has_fsgsbase	boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_aperfmperf	boot_cpu_has(X86_FEATURE_APERFMPERF)
 #define cpu_has_smep            boot_cpu_has(X86_FEATURE_SMEP)
 #define cpu_has_invpcid         boot_cpu_has(X86_FEATURE_INVPCID)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 4b4c156..252cbdc 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -93,6 +93,14 @@ static inline uint64_t rdtsc(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -123,7 +131,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -135,7 +143,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -145,7 +153,7 @@ static inline unsigned long rdgsbase(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -157,7 +165,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 6a35b89..610b6f3 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -363,11 +363,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */
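
The shadow-ordering rule in the new write_cr4() above deserves a worked
example (an editorial note; F below stands for X86_CR4_FSGSBASE, and
the reasoning is derived from the patch):

    /*
     * info->cr4 = val & (info->cr4 | ~F);
     *
     * Clearing F (shadow has F set, val has it clear):
     *   info->cr4 | ~F == ~0UL, so the shadow becomes exactly val, with
     *   F already clear, *before* the mov updates %cr4.  An interrupt
     *   landing in the window sees F clear and takes the (always safe)
     *   MSR path.
     *
     * Setting F (shadow has F clear, val has it set):
     *   info->cr4 | ~F == ~F, so the shadow becomes val & ~F, keeping F
     *   clear until the trailing "info->cr4 = val" after the mov.  An
     *   interrupt in the window again takes the MSR path.
     *
     * In both windows the shadow under-reports the hardware capability,
     * which is safe.  Over-reporting would let rdfsbase() issue
     * RDFSBASE while %cr4.fsgsbase is really clear, and suffer #UD.
     */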

["xsa293/4.8-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Improve pv_cpuid()'s API

pv_cpuid()'s API is awkward to use.  There are already two callers jumping
through hoops to use it, and a third is on its way.

Change the API to take each parameter individually (like its counterpart,
hvm_cpuid(), already does), and introduce a new pv_cpuid_regs() wrapper
implementing the old API.

No functional change.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/hvm/vmx/vmx.c b/xen/arch/x86/hvm/vmx/vmx.c
index ab39a45..0053ac0 100644
--- a/xen/arch/x86/hvm/vmx/vmx.c
+++ b/xen/arch/x86/hvm/vmx/vmx.c
@@ -3803,7 +3803,7 @@ void vmx_vmexit_handler(struct cpu_user_regs *regs)
 
         if ( is_pvh_vcpu(v) )
         {
-            pv_cpuid(regs);
+            pv_cpuid_regs(regs);
             rc = 0;
         }
         else
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 1a22895..77f786c 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -972,17 +972,14 @@ static void _domain_cpuid(const struct domain *currd,
         cpuid_count(leaf, subleaf, eax, ebx, ecx, edx);
 }
 
-void pv_cpuid(struct cpu_user_regs *regs)
+void pv_cpuid(uint32_t leaf, uint32_t subleaf,
+              uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx)
 {
-    uint32_t leaf, subleaf, a, b, c, d;
+    uint32_t a, b, c, d;
+    const struct cpu_user_regs *regs = guest_cpu_user_regs();
     struct vcpu *curr = current;
     struct domain *currd = curr->domain;
 
-    leaf = a = regs->eax;
-    b = regs->ebx;
-    subleaf = c = regs->ecx;
-    d = regs->edx;
-
     if ( cpuid_hypervisor_leaves(leaf, subleaf, &a, &b, &c, &d) )
         goto out;
 
@@ -997,13 +994,7 @@ void pv_cpuid(struct cpu_user_regs *regs)
 
         _domain_cpuid(currd, limit, 0, &limit, &dummy, &dummy, &dummy);
         if ( leaf > limit )
-        {
-            regs->eax = 0;
-            regs->ebx = 0;
-            regs->ecx = 0;
-            regs->edx = 0;
-            return;
-        }
+            goto unsupported;
     }
 
     _domain_cpuid(currd, leaf, subleaf, &a, &b, &c, &d);
@@ -1284,17 +1275,21 @@ void pv_cpuid(struct cpu_user_regs *regs)
     case 0x8000001e: /* Extended topology reporting */
     unsupported:
         a = b = c = d = 0;
-        break;
+        goto out;
     }
 
- out:
     /* VPMU may decide to modify some of the leaves */
     vpmu_do_cpuid(leaf, &a, &b, &c, &d);
 
-    regs->eax = a;
-    regs->ebx = b;
-    regs->ecx = c;
-    regs->edx = d;
+ out:
+    if ( eax )
+        *eax = a;
+    if ( ebx )
+        *ebx = b;
+    if ( ecx )
+        *ecx = c;
+    if ( edx )
+        *edx = d;
 }
 
 static int emulate_invalid_rdtscp(struct cpu_user_regs *regs)
@@ -1353,7 +1348,7 @@ static int emulate_forced_invalid_op(struct cpu_user_regs *regs)
 
     eip += sizeof(instr);
 
-    pv_cpuid(regs);
+    pv_cpuid_regs(regs);
 
     instruction_done(regs, eip, 0);
 
@@ -2828,17 +2823,7 @@ static int priv_op_write_msr(unsigned int reg, uint64_t val,
 int pv_emul_cpuid(unsigned int *eax, unsigned int *ebx, unsigned int *ecx,
                   unsigned int *edx, struct x86_emulate_ctxt *ctxt)
 {
-    struct cpu_user_regs regs = *ctxt->regs;
-
-    regs._eax = *eax;
-    regs._ecx = *ecx;
-
-    pv_cpuid(&regs);
-
-    *eax = regs._eax;
-    *ebx = regs._ebx;
-    *ecx = regs._ecx;
-    *edx = regs._edx;
+    pv_cpuid(*eax, *ecx, eax, ebx, ecx, edx);
 
     return X86EMUL_OKAY;
 }
@@ -3329,7 +3314,7 @@ static int emulate_privileged_op(struct cpu_user_regs *regs)
         if ( v->arch.cpuid_faulting && !guest_kernel_mode(v, regs) )
             goto fail;
 
-        pv_cpuid(regs);
+        pv_cpuid_regs(regs);
         break;
 
     default:
diff --git a/xen/arch/x86/xstate.c b/xen/arch/x86/xstate.c
index 5c43ec9..09d25ac 100644
--- a/xen/arch/x86/xstate.c
+++ b/xen/arch/x86/xstate.c
@@ -654,24 +654,18 @@ static bool_t valid_xcr0(u64 xcr0)
 
 static uint64_t guest_xcr0_max(const struct domain *d)
 {
+    uint32_t eax, edx;
+
     if ( has_hvm_container_domain(d) )
     {
-        uint32_t eax, ecx = 0, edx;
+        uint32_t ecx = 0;
 
         hvm_cpuid(XSTATE_CPUID, &eax, NULL, &ecx, &edx);
-
-        return ((uint64_t)edx << 32) | eax;
     }
     else
-    {
-        struct cpu_user_regs regs = { };
+        pv_cpuid(XSTATE_CPUID, 0, &eax, NULL, NULL, &edx);
 
-        regs._eax = XSTATE_CPUID;
-        regs._ecx = 0;
-        pv_cpuid(&regs);
-
-        return (regs.rdx << 32) | regs._eax;
-    }
+    return ((uint64_t)edx << 32) | eax;
 }
 
 int validate_xstate(const struct domain *d, uint64_t xcr0, uint64_t xcr0_accum,
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 581d7b0..50badab 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -663,7 +663,14 @@ enum get_cpu_vendor {
 };
 
 int get_cpu_vendor(const char vendor_id[], enum get_cpu_vendor);
-void pv_cpuid(struct cpu_user_regs *regs);
+void pv_cpuid(uint32_t leaf, uint32_t subleaf,
+              uint32_t *eax, uint32_t *ebx, uint32_t *ecx, uint32_t *edx);
+
+static inline void pv_cpuid_regs(struct cpu_user_regs *regs)
+{
+    pv_cpuid(regs->_eax, regs->_ecx,
+             &regs->_eax, &regs->_ebx, &regs->_ecx, &regs->_edx);
+}
 
 #endif /* !__ASSEMBLY__ */
 

["xsa293/4.8-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced: bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest-visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't result in any practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 408fc0b..6c38d84 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -377,6 +377,60 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    unsigned int leaf1_ecx = 0, leaf1_edx = 0;
+    unsigned int leaf7_0_ebx = 0;
+
+    pv_cpuid(1, 0, NULL, NULL, &leaf1_ecx, &leaf1_edx);
+    pv_cpuid(7, 0, NULL, &leaf7_0_ebx, NULL, NULL);
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~(((leaf1_edx & cpufeat_mask(X86_FEATURE_TSC))
+              ? 0 : X86_CR4_TSD) |
+             ((leaf1_edx & cpufeat_mask(X86_FEATURE_DE))
+              ? 0 : X86_CR4_DE) |
+             ((leaf7_0_ebx & cpufeat_mask(X86_FEATURE_FSGSBASE))
+              ? 0 : X86_CR4_FSGSBASE) |
+             ((leaf1_ecx & cpufeat_mask(X86_FEATURE_XSAVE))
+              ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv_domain.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv_domain.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 static void set_domain_xpti(struct domain *d)
 {
     if ( is_pv_32bit_domain(d) )
@@ -564,6 +618,8 @@ int vcpu_initialise(struct vcpu *v)
 
         /* PV guests by default have a 100Hz ticker. */
         v->periodic_period = MILLISECS(10);
+
+        v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
     }
 
     v->arch.schedule_tail = continue_nonidle_domain;
@@ -576,8 +632,6 @@ int vcpu_initialise(struct vcpu *v)
         v->arch.cr3           = __pa(idle_pg_table);
     }
 
-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
-
     if ( is_pv_32bit_domain(d) )
     {
         if ( (rc = setup_compat_arg_xlat(v)) )
@@ -955,49 +1009,6 @@ int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -1011,7 +1022,7 @@ int arch_set_info_guest(
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long flags, cr4;
+    unsigned long flags;
     unsigned int i;
     int rc = 0, compat;
 
@@ -1228,9 +1239,8 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] =
+        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
 
     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
     for ( i = 0; i < 8; i++ )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2b93efb..4e98f82 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -501,33 +501,13 @@ void make_cr3(struct vcpu *v, unsigned long mfn)
         v->arch.cr3 |= get_pcid_bits(v, false);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv_domain.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv_domain.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
@@ -546,8 +526,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 77f786c..a8aebe5 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -966,7 +966,8 @@ static void _domain_cpuid(const struct domain *currd,
                           unsigned int *eax, unsigned int *ebx,
                           unsigned int *ecx, unsigned int *edx)
 {
-    if ( !is_control_domain(currd) && !is_hardware_domain(currd) )
+    if ( !is_control_domain(currd) && !is_hardware_domain(currd) &&
+         !is_idle_domain(currd) )
         domain_cpuid(currd, leaf, subleaf, eax, ebx, ecx, edx);
     else
         cpuid_count(leaf, subleaf, eax, ebx, ecx, edx);
@@ -2372,8 +2373,8 @@ static int priv_op_write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
-        curr->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
+        curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+        write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
         return X86EMUL_OKAY;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index e7abaad..0137283 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -642,16 +642,22 @@ bool_t update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
 
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
 
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
 
 void domain_cpuid(const struct domain *d,
                   unsigned int  input,

["xsa293/4.8-3.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 120f1e3..d246e40 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -428,6 +428,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE instructions are only usable in 64bit code
+     * segments.  While we must not have CR4.FSGSBASE set behind the back of a
+     * 64bit PV kernel, we do leave it set in 32bit PV context to speed up
+     * Xen's context switch path.
+     */
+    if ( !is_pv_32bit_domain(d) &&
+         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -2022,7 +2032,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv_vcpu.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 550435c..3a7b362 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1504,7 +1504,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index 7d44a4c..b1cff5e 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2372,6 +2372,17 @@ static int priv_op_write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
+        /*
+         * If this write will disable FSGSBASE, refresh Xen's idea of the
+         * guest bases now that they can no longer change.
+         */
+        if ( (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+             !(val & X86_CR4_FSGSBASE) )
+        {
+            curr->arch.pv_vcpu.fs_base = __rdfsbase();
+            curr->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+        }
+
         curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
         write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
@@ -2432,14 +2443,15 @@ static int priv_op_read_msr(unsigned int reg, uint64_t *val,
     case MSR_FS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv_vcpu.fs_base;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                               : curr->arch.pv_vcpu.fs_base;
         return X86EMUL_OKAY;
 
     case MSR_GS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdgsbase()
-                                : curr->arch.pv_vcpu.gs_base_kernel;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                               : curr->arch.pv_vcpu.gs_base_kernel;
         return X86EMUL_OKAY;
 
     case MSR_SHADOW_GS_BASE:
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 396e677..cb2abb1 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -267,7 +267,9 @@ void toggle_guest_mode(struct vcpu *v)
 {
     if ( is_pv_32bit_vcpu(v) )
         return;
-    if ( cpu_has_fsgsbase )
+
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index af2a892..89ff249 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -80,7 +80,6 @@ XEN_CPUFEATURE(XEN_LBR,         (FSCAPINTS+0)*32+24) /* Xen uses MSR_DEBUGCTL.LB
 #define cpu_has_nx		boot_cpu_has(X86_FEATURE_NX)
 #define cpu_has_clflush		boot_cpu_has(X86_FEATURE_CLFLUSH)
 #define cpu_has_page1gb		boot_cpu_has(X86_FEATURE_PAGE1GB)
-#define cpu_has_fsgsbase	boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_aperfmperf	boot_cpu_has(X86_FEATURE_APERFMPERF)
 #define cpu_has_smep            boot_cpu_has(X86_FEATURE_SMEP)
 #define cpu_has_invpcid         boot_cpu_has(X86_FEATURE_INVPCID)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 5e1df8f..08eec30 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -109,6 +109,14 @@ static inline uint64_t rdtsc_ordered(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -139,7 +147,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -151,7 +159,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -161,7 +169,7 @@ static inline unsigned long rdgsbase(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -173,7 +181,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 50badab..a5319e3 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -380,11 +380,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */

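The write_cr4() hunk just above is the subtlest part of the fix and is worth
illustrating outside of the patch.  Below is a minimal user-space sketch (not
Xen code; shadow_cr4, hw_cr4 and the stub names are invented) of the same
ordering discipline: the shadow copy must never show FSGSBASE set while the
hardware bit is clear, so the bit is dropped from the shadow before the
hardware write, and the full value is published only afterwards.

#include <stdio.h>

#define X86_CR4_FSGSBASE (1ul << 16)

/* Stand-ins for the per-cpu shadow and the real register. */
static unsigned long shadow_cr4;   /* info->cr4 in the patch */
static unsigned long hw_cr4;       /* %cr4 in the patch */

static void write_cr4_sketch(unsigned long val)
{
    /*
     * If FSGSBASE is clear in either the old shadow or the new value,
     * clear it in the shadow first.  An interrupt landing between these
     * steps can then see FSGSBASE in the shadow only if it is also set
     * in hardware, which is the safe direction for instruction users.
     */
    shadow_cr4 = val & (shadow_cr4 | ~X86_CR4_FSGSBASE);

    hw_cr4 = val;              /* the 'mov %[val], %%cr4' in the patch */

    shadow_cr4 = val;          /* publish the final value in full */
}

int main(void)
{
    shadow_cr4 = hw_cr4 = X86_CR4_FSGSBASE;

    write_cr4_sketch(0);                /* disable: shadow drops the bit first */
    printf("after disable: shadow=%#lx hw=%#lx\n", shadow_cr4, hw_cr4);

    write_cr4_sketch(X86_CR4_FSGSBASE); /* enable: shadow gains the bit last */
    printf("after enable:  shadow=%#lx hw=%#lx\n", shadow_cr4, hw_cr4);

    return 0;
}

In the real patch, the '+m' (info->cr4) operand on the asm statement is what
pins these three steps in program order without needing a full compiler
barrier.
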
["xsa293/4.9-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced: bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest-visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't cause any practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 61f6671..e9b95e4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -398,6 +398,52 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    const struct cpuid_policy *p = v->domain->arch.cpuid;
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~((p->basic.tsc     ? 0 : X86_CR4_TSD)      |
+             (p->basic.de      ? 0 : X86_CR4_DE)       |
+             (p->feat.fsgsbase ? 0 : X86_CR4_FSGSBASE) |
+             (p->basic.xsave   ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv_domain.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv_domain.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 int switch_compat(struct domain *d)
 {
     struct vcpu *v;
@@ -512,12 +558,12 @@ int vcpu_initialise(struct vcpu *v)
 
         /* PV guests by default have a 100Hz ticker. */
         v->periodic_period = MILLISECS(10);
+
+        v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
     }
     else
         v->arch.cr3 = __pa(idle_pg_table);
 
-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
-
     if ( is_pv_32bit_domain(d) )
     {
         if ( (rc = setup_compat_arg_xlat(v)) )
@@ -953,49 +999,6 @@ int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -1009,7 +1012,7 @@ int arch_set_info_guest(
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long flags, cr4;
+    unsigned long flags;
     unsigned int i;
     int rc = 0, compat;
 
@@ -1200,9 +1203,8 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] =
+        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
 
     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
     for ( i = 0; i < 8; i++ )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2982eec..234913f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -514,33 +514,13 @@ void make_cr3(struct vcpu *v, unsigned long mfn)
         v->arch.cr3 |= get_pcid_bits(v, false);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv_domain.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv_domain.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
@@ -559,8 +539,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index fec5e55..dc5a0d6 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2486,8 +2486,8 @@ static int priv_op_write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
-        curr->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
+        curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+        write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
         return X86EMUL_OKAY;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 7c26f60..7ea6753 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -648,16 +648,22 @@ bool_t update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
 
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
 
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
 
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 

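As a sanity check of the two masks the patch above introduces, here is a
stand-alone sketch of the pv_fixup_guest_cr4() combination step.  The CPUID
policy filtering is elided, and xen_cr4 is an invented stand-in for
mmu_cr4_features; the CR4 bit positions themselves are architectural.  The
guest keeps only the bits it controls, and sees Xen's settings for the
handful of visible hardware bits.

#include <stdio.h>

/* Architectural CR4 bit positions used by the two masks. */
#define X86_CR4_TSD        (1ul <<  2)
#define X86_CR4_DE         (1ul <<  3)
#define X86_CR4_PAE        (1ul <<  5)
#define X86_CR4_MCE        (1ul <<  6)
#define X86_CR4_OSFXSR     (1ul <<  9)
#define X86_CR4_OSXMMEXCPT (1ul << 10)
#define X86_CR4_FSGSBASE   (1ul << 16)
#define X86_CR4_OSXSAVE    (1ul << 18)

#define PV_CR4_GUEST_MASK \
    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
#define PV_CR4_GUEST_VISIBLE_MASK \
    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)

/* Invented stand-in for mmu_cr4_features. */
static const unsigned long xen_cr4 =
    X86_CR4_PAE | X86_CR4_OSFXSR | X86_CR4_FSGSBASE;

static unsigned long fixup_guest_cr4(unsigned long cr4)
{
    /* Guest's choice for its own bits, Xen's choice for the visible ones. */
    return (cr4 & PV_CR4_GUEST_MASK) |
           (xen_cr4 & PV_CR4_GUEST_VISIBLE_MASK);
}

int main(void)
{
    /* The disjointness the patch BUILD_BUG_ON()s; it makes the OR sound. */
    _Static_assert(!(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK),
                   "masks must be disjoint");

    /*
     * A guest asking for FSGSBASE|MCE keeps only FSGSBASE; the visible
     * bits (here PAE|OSFXSR) come from Xen's settings instead.
     */
    printf("guest view: %#lx\n",
           fixup_guest_cr4(X86_CR4_FSGSBASE | X86_CR4_MCE));

    return 0;
}
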
["xsa293/4.9-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index efdab1a..0edb131 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -441,6 +441,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE are only usable in 64bit code segments.  While
+     * we must not have CR4.FSGSBASE set behind the back of a 64bit PV kernel,
+     * we do leave it set in 32bit PV context to speed up Xen's context switch
+     * path.
+     */
+    if ( !is_pv_32bit_domain(d) &&
+         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -1987,7 +1997,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv_vcpu.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index bf5ed1f..40af7e6 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1510,7 +1510,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/arch/x86/traps.c b/xen/arch/x86/traps.c
index dc5a0d6..2f9f75f 100644
--- a/xen/arch/x86/traps.c
+++ b/xen/arch/x86/traps.c
@@ -2486,6 +2486,17 @@ static int priv_op_write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
+        /*
+         * If this write will disable FSGSBASE, refresh Xen's idea of the
+         * guest bases now that they can no longer change.
+         */
+        if ( (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+             !(val & X86_CR4_FSGSBASE) )
+        {
+            curr->arch.pv_vcpu.fs_base = __rdfsbase();
+            curr->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+        }
+
         curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
         write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
@@ -2526,14 +2537,15 @@ static int priv_op_read_msr(unsigned int reg, uint64_t *val,
     case MSR_FS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv_vcpu.fs_base;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                               : curr->arch.pv_vcpu.fs_base;
         return X86EMUL_OKAY;
 
     case MSR_GS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdgsbase()
-                                : curr->arch.pv_vcpu.gs_base_kernel;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                               : curr->arch.pv_vcpu.gs_base_kernel;
         return X86EMUL_OKAY;
 
     case MSR_SHADOW_GS_BASE:
diff --git a/xen/arch/x86/x86_64/traps.c b/xen/arch/x86/x86_64/traps.c
index 8a06b21..67a9993 100644
--- a/xen/arch/x86/x86_64/traps.c
+++ b/xen/arch/x86/x86_64/traps.c
@@ -266,7 +266,9 @@ void toggle_guest_mode(struct vcpu *v)
 {
     if ( is_pv_32bit_vcpu(v) )
         return;
-    if ( cpu_has_fsgsbase )
+
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index ff6f969..5043231 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -84,7 +84,6 @@
 #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
 
 /* CPUID level 0x00000007:0.ebx */
-#define cpu_has_fsgsbase        boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_bmi1            boot_cpu_has(X86_FEATURE_BMI1)
 #define cpu_has_hle             boot_cpu_has(X86_FEATURE_HLE)
 #define cpu_has_avx2            boot_cpu_has(X86_FEATURE_AVX2)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 8d4de61..fe6d5a0 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -120,6 +120,14 @@ static inline uint64_t rdtsc_ordered(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -150,7 +158,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -162,7 +170,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -172,7 +180,7 @@ static inline unsigned long rdgsbase(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -184,7 +192,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index da42e84..4487347 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -306,11 +306,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */

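The recurring '(read_cr4() & X86_CR4_FSGSBASE)' test in the patch above
replaces a boot-time feature check with a runtime one, since the bit now
follows the guest kernel's choice.  Below is a hedged sketch of the dispatch
pattern, with the privileged accessors stubbed out (names invented) so it
compiles as a normal user-space program:

#include <stdio.h>

#define X86_CR4_FSGSBASE (1ul << 16)

/* Stubs standing in for the privileged accessors. */
static unsigned long cur_cr4;      /* what read_cr4() would return */
static unsigned long msr_fs_base;  /* what rdmsrl(MSR_FS_BASE, ...) reads */

static unsigned long stub_rdfsbase_insn(void)
{
    /* Real code: asm volatile ( "rdfsbase %0" : "=r" (base) ); */
    return 0x1000;
}

static unsigned long rdfsbase_sketch(void)
{
    /*
     * Fast path: the RDFSBASE instruction itself, architecturally legal
     * only while %cr4.fsgsbase is set.  Otherwise fall back to the MSR,
     * which is always readable but slower.
     */
    if ( cur_cr4 & X86_CR4_FSGSBASE )
        return stub_rdfsbase_insn();

    return msr_fs_base;
}

int main(void)
{
    msr_fs_base = 0x2000;

    cur_cr4 = 0;
    printf("fsgsbase clear -> MSR path:  %#lx\n", rdfsbase_sketch());

    cur_cr4 = X86_CR4_FSGSBASE;
    printf("fsgsbase set   -> insn path: %#lx\n", rdfsbase_sketch());

    return 0;
}
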
["xsa293/4.10-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced: bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest-visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't cause any practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index fcbe767..853b524 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -734,49 +734,6 @@ int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -790,7 +747,7 @@ int arch_set_info_guest(
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long flags, cr4;
+    unsigned long flags;
     unsigned int i;
     int rc = 0, compat;
 
@@ -981,9 +938,8 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] =
+        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
 
     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
     for ( i = 0; i < 8; i++ )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c20edb9..716499a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -511,33 +511,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
         v->arch.cr3 |= get_pcid_bits(v, false);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv_domain.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv_domain.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
@@ -556,8 +536,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 958c6e3..a9caa01 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -103,6 +103,52 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    const struct cpuid_policy *p = v->domain->arch.cpuid;
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~((p->basic.tsc     ? 0 : X86_CR4_TSD)      |
+             (p->basic.de      ? 0 : X86_CR4_DE)       |
+             (p->feat.fsgsbase ? 0 : X86_CR4_FSGSBASE) |
+             (p->basic.xsave   ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv_domain.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv_domain.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 int switch_compat(struct domain *d)
 {
     struct vcpu *v;
@@ -197,7 +243,7 @@ int pv_vcpu_initialise(struct vcpu *v)
     /* PV guests by default have a 100Hz ticker. */
     v->periodic_period = MILLISECS(10);
 
-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
 
     if ( is_pv_32bit_domain(d) )
     {
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index c281936..cd04574 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -32,6 +32,7 @@
 #include <asm/hypercall.h>
 #include <asm/mc146818rtc.h>
 #include <asm/p2m.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/traps.h>
 #include <asm/shared.h>
 #include <asm/traps.h>
@@ -804,8 +805,8 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
-        curr->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
+        curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+        write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
         return X86EMUL_OKAY;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index b9fa988..aec6563 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -623,17 +623,6 @@ bool update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
-
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
-
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
-
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 
 static inline struct vcpu_guest_context *alloc_vcpu_guest_context(void)
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 6778e1b..1ddc728 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -60,6 +60,23 @@ void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d, unsigned int domcr_flags,
                          struct xen_arch_domainconfig *config);
 
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
+
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
+
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
+
 #else  /* !CONFIG_PV */
 
 #include <xen/errno.h>
@@ -73,6 +90,9 @@ static inline int pv_domain_initialise(struct domain *d,
 {
     return -EOPNOTSUPP;
 }
+
+static inline unsigned long pv_make_cr4(const struct vcpu *v) { return ~0ul; }
+
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);

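pv_make_cr4() in the patch above encodes one invariant worth spelling out:
PCIDE and PGE must never be set together, and which of the two is chosen
follows the domain's PCID/XPTI configuration.  A compressed sketch of just
that decision, with the domain state reduced to three booleans and base_cr4
invented for the example:

#include <stdbool.h>
#include <stdio.h>

#define X86_CR4_TSD   (1ul <<  2)
#define X86_CR4_PGE   (1ul <<  7)
#define X86_CR4_PCIDE (1ul << 17)

/* Invented baseline standing in for mmu_cr4_features. */
static const unsigned long base_cr4 = X86_CR4_PGE;

static unsigned long make_cr4_sketch(bool pcid, bool xpti, bool vtsc)
{
    /* Start from Xen's features, minus the bits decided per domain. */
    unsigned long cr4 =
        base_cr4 & ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);

    /*
     * PCIDE wins if the domain uses PCID; PGE is only safe when XPTI is
     * off.  At most one of the two can end up set.
     */
    if ( pcid )
        cr4 |= X86_CR4_PCIDE;
    else if ( !xpti )
        cr4 |= X86_CR4_PGE;

    if ( vtsc )
        cr4 |= X86_CR4_TSD;

    return cr4;
}

int main(void)
{
    unsigned long cr4 = make_cr4_sketch(true, true, false);

    /* The invariant write_cr4() ASSERT()s on: never PGE and PCIDE together. */
    printf("pcid+xpti: cr4=%#lx, pge&pcide=%d\n", cr4,
           !!((cr4 & X86_CR4_PGE) && (cr4 & X86_CR4_PCIDE)));

    return 0;
}
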
["xsa293/4.10-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 853b524..91c2b1c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1431,7 +1431,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv_vcpu.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index a9caa01..82c6e2f 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -146,6 +146,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE are only usable in 64bit code segments.  While
+     * we must not have CR4.FSGSBASE set behind the back of a 64bit PV kernel,
+     * we do leave it set in 32bit PV context to speed up Xen's context switch
+     * path.
+     */
+    if ( !is_pv_32bit_domain(d) &&
+         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -351,7 +361,8 @@ void toggle_guest_mode(struct vcpu *v)
     if ( is_pv_32bit_vcpu(v) )
         return;
 
-    if ( cpu_has_fsgsbase )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index cd04574..0344c98 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -805,6 +805,17 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
+        /*
+         * If this write will disable FSGSBASE, refresh Xen's idea of the
+         * guest bases now that they can no longer change.
+         */
+        if ( (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+             !(val & X86_CR4_FSGSBASE) )
+        {
+            curr->arch.pv_vcpu.fs_base = __rdfsbase();
+            curr->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+        }
+
         curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
         write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
@@ -854,14 +865,15 @@ static int read_msr(unsigned int reg, uint64_t *val,
     case MSR_FS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv_vcpu.fs_base;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                               : curr->arch.pv_vcpu.fs_base;
         return X86EMUL_OKAY;
 
     case MSR_GS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdgsbase()
-                                : curr->arch.pv_vcpu.gs_base_kernel;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                               : curr->arch.pv_vcpu.gs_base_kernel;
         return X86EMUL_OKAY;
 
     case MSR_SHADOW_GS_BASE:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 617dbb1..7903204 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1560,7 +1560,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index ff6f969..5043231 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -84,7 +84,6 @@
 #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
 
 /* CPUID level 0x00000007:0.ebx */
-#define cpu_has_fsgsbase        boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_bmi1            boot_cpu_has(X86_FEATURE_BMI1)
 #define cpu_has_hle             boot_cpu_has(X86_FEATURE_HLE)
 #define cpu_has_avx2            boot_cpu_has(X86_FEATURE_AVX2)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index 22d5b71..143eea3 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -120,6 +120,14 @@ static inline uint64_t rdtsc_ordered(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -150,7 +158,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -162,7 +170,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -172,7 +180,7 @@ static inline unsigned long rdgsbase(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -184,7 +192,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_GAS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 90a2701..a0f8bf4 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -315,11 +315,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */

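The setup.c hunk above open-codes boot_cpu_has(X86_FEATURE_FSGSBASE), now the
only boot-time consumer of the feature flag.  For reference, that flag
originates from CPUID leaf 7 (sub-leaf 0), EBX bit 0; a small detection
sketch, assuming a GCC/Clang toolchain on x86 for __get_cpuid_count():

#include <stdio.h>
#include <cpuid.h>

int main(void)
{
    unsigned int eax, ebx, ecx, edx;

    /* CPUID.(EAX=7,ECX=0):EBX bit 0 is the FSGSBASE feature flag. */
    if ( !__get_cpuid_count(7, 0, &eax, &ebx, &ecx, &edx) )
    {
        puts("CPUID leaf 7 not available");
        return 1;
    }

    printf("fsgsbase: %ssupported\n", (ebx & 1) ? "" : "not ");

    return 0;
}

After these patches, whether the instructions may actually be executed at any
given moment is decided by %cr4, not by this flag.
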
["xsa293/4.11-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced: bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest-visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't cause any practical change in guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b1e50d1..675152a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -733,49 +733,6 @@ int arch_domain_soft_reset(struct domain *d)
     return ret;
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -789,7 +746,7 @@ int arch_set_info_guest(
     struct domain *d = v->domain;
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long flags, cr4;
+    unsigned long flags;
     unsigned int i;
     int rc = 0, compat;
 
@@ -978,9 +935,8 @@ int arch_set_info_guest(
     v->arch.pv_vcpu.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv_vcpu.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv_vcpu.ctrlreg[4];
-    v->arch.pv_vcpu.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] =
+        pv_fixup_guest_cr4(v, v->arch.pv_vcpu.ctrlreg[4]);
 
     memset(v->arch.debugreg, 0, sizeof(v->arch.debugreg));
     for ( i = 0; i < 8; i++ )
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6509035..08634b7 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -505,33 +505,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
         v->arch.cr3 |= get_pcid_bits(v, false);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv_vcpu.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv_domain.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv_domain.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv_domain.xpti )
     {
@@ -550,8 +530,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index b75ff6b..3965959 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -97,6 +97,52 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    const struct cpuid_policy *p = v->domain->arch.cpuid;
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~((p->basic.tsc     ? 0 : X86_CR4_TSD)      |
+             (p->basic.de      ? 0 : X86_CR4_DE)       |
+             (p->feat.fsgsbase ? 0 : X86_CR4_FSGSBASE) |
+             (p->basic.xsave   ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv_domain.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv_domain.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 int switch_compat(struct domain *d)
 {
     struct vcpu *v;
@@ -191,7 +237,7 @@ int pv_vcpu_initialise(struct vcpu *v)
     /* PV guests by default have a 100Hz ticker. */
     v->periodic_period = MILLISECS(10);
 
-    v->arch.pv_vcpu.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
 
     if ( is_pv_32bit_domain(d) )
     {
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index ce2ec76..4abbc14 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -32,6 +32,7 @@
 #include <asm/hypercall.h>
 #include <asm/mc146818rtc.h>
 #include <asm/p2m.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/traps.h>
 #include <asm/shared.h>
 #include <asm/traps.h>
@@ -785,8 +786,8 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
-        curr->arch.pv_vcpu.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
+        curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+        write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
         return X86EMUL_OKAY;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index ec81d78..c8aa8a5 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -610,17 +610,6 @@ bool update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
-
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
-
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
-
 #define domain_max_vcpus(d) (is_hvm_domain(d) ? HVM_MAX_VCPUS : MAX_VIRT_CPUS)
 
 static inline struct vcpu_guest_context *alloc_vcpu_guest_context(void)
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 4fea764..4e4710c 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -59,6 +59,23 @@ int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
 
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
+
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
+
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
+
 #else  /* !CONFIG_PV */
 
 #include <xen/errno.h>
@@ -68,6 +85,8 @@ static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
 static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
 
+static inline unsigned long pv_make_cr4(const struct vcpu *v) { return ~0ul; }
+
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);

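The second patch in each series relies on a staleness invariant: the hardware
%fs/%gs bases can only be newer than the struct vcpu copies while
WR{FS,GS}BASE is usable, which is why write_cr() must snapshot the bases on
the enabled-to-disabled edge.  A toy model of that invariant (all names
invented):

#include <stdbool.h>
#include <stdio.h>

/* Toy model of the staleness invariant. */
struct vcpu_sketch {
    bool fsgsbase_on;            /* guest's CR4.FSGSBASE */
    unsigned long hw_fsbase;     /* what the hardware base register holds */
    unsigned long cached_fsbase; /* the struct vcpu copy */
};

/* Guest write of CR4, as in the write_cr() hunk. */
static void guest_write_cr4(struct vcpu_sketch *v, bool enable)
{
    /*
     * On the enabled -> disabled edge, snapshot the hardware base now:
     * later context switches will skip saving it, so the cached copy
     * must already be up to date.
     */
    if ( v->fsgsbase_on && !enable )
        v->cached_fsbase = v->hw_fsbase;

    v->fsgsbase_on = enable;
}

/* Context-switch save path, as in save_segments(). */
static void save_segments_sketch(struct vcpu_sketch *v)
{
    /* Hardware can only be newer than the cache if WRFSBASE was usable. */
    if ( v->fsgsbase_on )
        v->cached_fsbase = v->hw_fsbase;
}

int main(void)
{
    struct vcpu_sketch v = { .fsgsbase_on = true, .hw_fsbase = 0x1234 };

    guest_write_cr4(&v, false); /* disable: cache refreshed here ...     */
    save_segments_sketch(&v);   /* ... so skipping the read here is safe */

    printf("cached base: %#lx\n", v.cached_fsbase);

    return 0;
}
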
["xsa293/4.11-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 675152a..29f892c 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1433,7 +1433,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv_vcpu.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 3965959..228a174 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -140,6 +140,16 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE are only usable in 64bit code segments.  While
+     * we must not have CR4.FSGSBASE set behind the back of a 64bit PV kernel,
+     * we do leave it set in 32bit PV context to speed up Xen's context switch
+     * path.
+     */
+    if ( !is_pv_32bit_domain(d) &&
+         !(v->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -375,7 +385,8 @@ void toggle_guest_mode(struct vcpu *v)
 {
     ASSERT(!is_pv_32bit_vcpu(v));
 
-    if ( cpu_has_fsgsbase )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 4abbc14..312c1ee 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -786,6 +786,17 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
+        /*
+         * If this write will disable FSGSBASE, refresh Xen's idea of the
+         * guest bases now that they can no longer change.
+         */
+        if ( (curr->arch.pv_vcpu.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+             !(val & X86_CR4_FSGSBASE) )
+        {
+            curr->arch.pv_vcpu.fs_base = __rdfsbase();
+            curr->arch.pv_vcpu.gs_base_kernel = __rdgsbase();
+        }
+
         curr->arch.pv_vcpu.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
         write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
@@ -835,14 +846,15 @@ static int read_msr(unsigned int reg, uint64_t *val,
     case MSR_FS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv_vcpu.fs_base;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                               : curr->arch.pv_vcpu.fs_base;
         return X86EMUL_OKAY;
 
     case MSR_GS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdgsbase()
-                                : curr->arch.pv_vcpu.gs_base_kernel;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                               : curr->arch.pv_vcpu.gs_base_kernel;
         return X86EMUL_OKAY;
 
     case MSR_SHADOW_GS_BASE:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index ecb0149..a353d76 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1567,7 +1567,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index b237da1..861cb0a 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -90,7 +90,6 @@
 #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
 
 /* CPUID level 0x00000007:0.ebx */
-#define cpu_has_fsgsbase        boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_bmi1            boot_cpu_has(X86_FEATURE_BMI1)
 #define cpu_has_hle             boot_cpu_has(X86_FEATURE_HLE)
 #define cpu_has_avx2            boot_cpu_has(X86_FEATURE_AVX2)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index afbeb7f..1ba6ee3 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -120,6 +120,14 @@ static inline uint64_t rdtsc_ordered(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -150,7 +158,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -162,7 +170,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -174,7 +182,7 @@ static inline unsigned long rdgsshadow(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         asm volatile ( "swapgs" );
         base = __rdgsbase();
@@ -188,7 +196,7 @@ static inline unsigned long rdgsshadow(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_AS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -200,7 +208,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_AS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
@@ -212,7 +220,7 @@ static inline void wrgsbase(unsigned long base)
 
 static inline void wrgsshadow(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         asm volatile ( "swapgs\n\t"
 #ifdef HAVE_AS_FSGSBASE
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index 2bd9e69..8e253dc 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -305,11 +305,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */

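The write_cr4() change above deserves a closer look.  The expression

    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);

keeps every bit of the new value except FSGSBASE, which is only retained if
the old shadow already had it set.  The following standalone sketch
(illustrative only, not part of any patch) walks the four old/new
combinations to show that the intermediate shadow never claims FSGSBASE is
usable while hardware still has it clear:

#include <stdio.h>

#define X86_CR4_FSGSBASE (1ul << 16)

int main(void)
{
    const unsigned long vals[] = { 0, X86_CR4_FSGSBASE };

    for ( int i = 0; i < 2; i++ )
        for ( int j = 0; j < 2; j++ )
        {
            unsigned long shadow = vals[i]; /* info->cr4 before the update */
            unsigned long val = vals[j];    /* new %cr4 value              */
            unsigned long interim = val & (shadow | ~X86_CR4_FSGSBASE);

            /* Only the old=1,new=1 case leaves FSGSBASE visible in the
             * interim shadow: a 0->1 transition exposes it only after the
             * hardware write, and a 1->0 transition hides it before. */
            printf("old=%d new=%d -> interim FSGSBASE=%d\n",
                   !!(shadow & X86_CR4_FSGSBASE),
                   !!(val & X86_CR4_FSGSBASE),
                   !!(interim & X86_CR4_FSGSBASE));
        }

    return 0;
}
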
["xsa293/unstable-1.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Rewrite guest %cr4 handling from scratch

The PV cr4 logic is almost impossible to follow, and leaks bits into guest
context which definitely shouldn't be visible (in particular, VMXE).

The biggest problem however, and source of the complexity, is that it derives
new real and guest cr4 values from the current value in hardware - this is
context dependent and an inappropriate source of information.

Rewrite the cr4 logic to be invariant of the current value in hardware.

First of all, modify write_ptbase() to always use mmu_cr4_features for IDLE
and HVM contexts.  mmu_cr4_features *is* the correct value to use, and makes
the ASSERT() obviously redundant.

For PV guests, curr->arch.pv.ctrlreg[4] remains the guest's view of cr4, but
all logic gets reworked in terms of this and mmu_cr4_features only.

Two masks are introduced; bits which the guest has control over, and bits
which are forwarded from Xen's settings.  One guest-visible change here is
that Xen's VMXE setting is no longer visible at all.

pv_make_cr4() follows fairly closely from pv_guest_cr4_to_real_cr4(), but
deliberately starts with mmu_cr4_features, and only alters the minimal subset
of bits.

The boot-time {compat_,}pv_cr4_mask variables are removed, as they are a
remnant of the pre-CPUID policy days.  pv_fixup_guest_cr4() gains a related
derivation from the policy.

Another guest visible change here is that a 32bit PV guest can now flip
FSGSBASE in its view of CR4.  While the {RD,WR}{FS,GS}BASE instructions are
unusable outside of a 64bit code segment, the ability to modify FSGSBASE
matches real hardware behaviour, and avoids the need for any 32bit/64bit
differences in the logic.

Overall, this patch shouldn't make any practical change to guest behaviour.
VMXE will disappear from view, and an inquisitive 32bit kernel can now see
FSGSBASE changing, but this new logic is otherwise bug-compatible with before.

This is part of XSA-293

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 32dc4253..7ff0f10 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -769,49 +769,6 @@ void arch_domain_creation_finished(struct domain *d)
 {
 }
 
-/*
- * These are the masks of CR4 bits (subject to hardware availability) which a
- * PV guest may not legitimiately attempt to modify.
- */
-static unsigned long __read_mostly pv_cr4_mask, compat_pv_cr4_mask;
-
-static int __init init_pv_cr4_masks(void)
-{
-    unsigned long common_mask = ~X86_CR4_TSD;
-
-    /*
-     * All PV guests may attempt to modify TSD, DE and OSXSAVE.
-     */
-    if ( cpu_has_de )
-        common_mask &= ~X86_CR4_DE;
-    if ( cpu_has_xsave )
-        common_mask &= ~X86_CR4_OSXSAVE;
-
-    pv_cr4_mask = compat_pv_cr4_mask = common_mask;
-
-    /*
-     * 64bit PV guests may attempt to modify FSGSBASE.
-     */
-    if ( cpu_has_fsgsbase )
-        pv_cr4_mask &= ~X86_CR4_FSGSBASE;
-
-    return 0;
-}
-__initcall(init_pv_cr4_masks);
-
-unsigned long pv_guest_cr4_fixup(const struct vcpu *v, unsigned long guest_cr4)
-{
-    unsigned long hv_cr4 = real_cr4_to_pv_guest_cr4(read_cr4());
-    unsigned long mask = is_pv_32bit_vcpu(v) ? compat_pv_cr4_mask : pv_cr4_mask;
-
-    if ( (guest_cr4 & mask) != (hv_cr4 & mask) )
-        printk(XENLOG_G_WARNING
-               "d%d attempted to change %pv's CR4 flags %08lx -> %08lx\n",
-               current->domain->domain_id, v, hv_cr4, guest_cr4);
-
-    return (hv_cr4 & mask) | (guest_cr4 & ~mask);
-}
-
 #define xen_vcpu_guest_context vcpu_guest_context
 #define fpu_ctxt fpu_ctxt.x
 CHECK_FIELD_(struct, vcpu_guest_context, fpu_ctxt);
@@ -829,7 +786,6 @@ int arch_set_info_guest(
 #ifdef CONFIG_PV
     unsigned long cr3_gfn;
     struct page_info *cr3_page;
-    unsigned long cr4;
     int rc = 0;
 #endif
 
@@ -1005,9 +961,7 @@ int arch_set_info_guest(
     v->arch.pv.ctrlreg[0] &= X86_CR0_TS;
     v->arch.pv.ctrlreg[0] |= read_cr0() & ~X86_CR0_TS;
 
-    cr4 = v->arch.pv.ctrlreg[4];
-    v->arch.pv.ctrlreg[4] = cr4 ? pv_guest_cr4_fixup(v, cr4) :
-        real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv.ctrlreg[4] = pv_fixup_guest_cr4(v, v->arch.pv.ctrlreg[4]);
 
     memset(v->arch.dr, 0, sizeof(v->arch.dr));
     v->arch.dr6 = X86_DR6_DEFAULT;
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 7ec5954..ba444d2 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -529,33 +529,13 @@ void make_cr3(struct vcpu *v, mfn_t mfn)
         v->arch.cr3 |= get_pcid_bits(v, false);
 }
 
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v)
-{
-    const struct domain *d = v->domain;
-    unsigned long cr4;
-
-    cr4 = v->arch.pv.ctrlreg[4] & ~X86_CR4_DE;
-    cr4 |= mmu_cr4_features & (X86_CR4_PSE | X86_CR4_SMEP | X86_CR4_SMAP |
-                               X86_CR4_OSXSAVE | X86_CR4_FSGSBASE);
-
-    if ( d->arch.pv.pcid )
-        cr4 |= X86_CR4_PCIDE;
-    else if ( !d->arch.pv.xpti )
-        cr4 |= X86_CR4_PGE;
-
-    cr4 |= d->arch.vtsc ? X86_CR4_TSD : 0;
-
-    return cr4;
-}
-
 void write_ptbase(struct vcpu *v)
 {
     struct cpu_info *cpu_info = get_cpu_info();
     unsigned long new_cr4;
 
     new_cr4 = (is_pv_vcpu(v) && !is_idle_vcpu(v))
-              ? pv_guest_cr4_to_real_cr4(v)
-              : ((read_cr4() & ~(X86_CR4_PCIDE | X86_CR4_TSD)) | X86_CR4_PGE);
+              ? pv_make_cr4(v) : mmu_cr4_features;
 
     if ( is_pv_vcpu(v) && v->domain->arch.pv.xpti )
     {
@@ -574,8 +554,6 @@ void write_ptbase(struct vcpu *v)
         switch_cr3_cr4(v->arch.cr3, new_cr4);
         cpu_info->pv_cr3 = 0;
     }
-
-    ASSERT(is_pv_vcpu(v) || read_cr4() == mmu_cr4_features);
 }
 
 /*
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 7e84b04..91123a8 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -97,6 +97,52 @@ static void release_compat_l4(struct vcpu *v)
     v->arch.guest_table_user = pagetable_null();
 }
 
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4)
+{
+    const struct cpuid_policy *p = v->domain->arch.cpuid;
+
+    /* Discard attempts to set guest controllable bits outside of the policy. */
+    cr4 &= ~((p->basic.tsc     ? 0 : X86_CR4_TSD)      |
+             (p->basic.de      ? 0 : X86_CR4_DE)       |
+             (p->feat.fsgsbase ? 0 : X86_CR4_FSGSBASE) |
+             (p->basic.xsave   ? 0 : X86_CR4_OSXSAVE));
+
+    /* Masks expected to be disjoint sets. */
+    BUILD_BUG_ON(PV_CR4_GUEST_MASK & PV_CR4_GUEST_VISIBLE_MASK);
+
+    /*
+     * A guest sees the policy subset of its own choice of guest controllable
+     * bits, and a subset of Xen's choice of certain hardware settings.
+     */
+    return ((cr4 & PV_CR4_GUEST_MASK) |
+            (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK));
+}
+
+unsigned long pv_make_cr4(const struct vcpu *v)
+{
+    const struct domain *d = v->domain;
+    unsigned long cr4 = mmu_cr4_features &
+        ~(X86_CR4_PCIDE | X86_CR4_PGE | X86_CR4_TSD);
+
+    /*
+     * PCIDE or PGE depends on the PCID/XPTI settings, but must not both be
+     * set, as it impacts the safety of TLB flushing.
+     */
+    if ( d->arch.pv.pcid )
+        cr4 |= X86_CR4_PCIDE;
+    else if ( !d->arch.pv.xpti )
+        cr4 |= X86_CR4_PGE;
+
+    /*
+     * TSD is needed if either the guest has elected to use it, or Xen is
+     * virtualising the TSC value the guest sees.
+     */
+    if ( d->arch.vtsc || (v->arch.pv.ctrlreg[4] & X86_CR4_TSD) )
+        cr4 |= X86_CR4_TSD;
+
+    return cr4;
+}
+
 int switch_compat(struct domain *d)
 {
     struct vcpu *v;
@@ -191,7 +237,7 @@ int pv_vcpu_initialise(struct vcpu *v)
     /* PV guests by default have a 100Hz ticker. */
     v->periodic_period = MILLISECS(10);
 
-    v->arch.pv.ctrlreg[4] = real_cr4_to_pv_guest_cr4(mmu_cr4_features);
+    v->arch.pv.ctrlreg[4] = pv_fixup_guest_cr4(v, 0);
 
     if ( is_pv_32bit_domain(d) )
     {
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index 942ece2..b4a56f9 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -33,6 +33,7 @@
 #include <asm/hypercall.h>
 #include <asm/mc146818rtc.h>
 #include <asm/p2m.h>
+#include <asm/pv/domain.h>
 #include <asm/pv/traps.h>
 #include <asm/shared.h>
 #include <asm/traps.h>
@@ -779,8 +780,8 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
-        curr->arch.pv.ctrlreg[4] = pv_guest_cr4_fixup(curr, val);
-        write_cr4(pv_guest_cr4_to_real_cr4(curr));
+        curr->arch.pv.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
+        write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
         return X86EMUL_OKAY;
     }
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 277f99f..a131ca2 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -653,17 +653,6 @@ bool update_secondary_system_time(struct vcpu *,
 void vcpu_show_execution_state(struct vcpu *);
 void vcpu_show_registers(const struct vcpu *);
 
-/* Clean up CR4 bits that are not under guest control. */
-unsigned long pv_guest_cr4_fixup(const struct vcpu *, unsigned long guest_cr4);
-
-/* Convert between guest-visible and real CR4 values. */
-unsigned long pv_guest_cr4_to_real_cr4(const struct vcpu *v);
-
-#define real_cr4_to_pv_guest_cr4(c)                         \
-    ((c) & ~(X86_CR4_PGE | X86_CR4_PSE | X86_CR4_TSD |      \
-             X86_CR4_OSXSAVE | X86_CR4_SMEP |               \
-             X86_CR4_FSGSBASE | X86_CR4_SMAP | X86_CR4_PCIDE))
-
 static inline struct vcpu_guest_context *alloc_vcpu_guest_context(void)
 {
     return vmalloc(sizeof(struct vcpu_guest_context));
diff --git a/xen/include/asm-x86/pv/domain.h b/xen/include/asm-x86/pv/domain.h
index 7d9d09d..99a0fe7 100644
--- a/xen/include/asm-x86/pv/domain.h
+++ b/xen/include/asm-x86/pv/domain.h
@@ -61,6 +61,23 @@ int pv_vcpu_initialise(struct vcpu *v);
 void pv_domain_destroy(struct domain *d);
 int pv_domain_initialise(struct domain *d);
 
+/*
+ * Bits which a PV guest can toggle in its view of cr4.  Some are loaded into
+ * hardware, while some are fully emulated.
+ */
+#define PV_CR4_GUEST_MASK \
+    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
+
+/* Bits which a PV guest may observe from the real hardware settings. */
+#define PV_CR4_GUEST_VISIBLE_MASK \
+    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)
+
+/* Given a new cr4 value, construct the resulting guest-visible cr4 value. */
+unsigned long pv_fixup_guest_cr4(const struct vcpu *v, unsigned long cr4);
+
+/* Create a cr4 value to load into hardware, based on vcpu settings. */
+unsigned long pv_make_cr4(const struct vcpu *v);
+
 bool xpti_pcid_enabled(void);
 
 #else  /* !CONFIG_PV */
@@ -72,6 +89,8 @@ static inline int pv_vcpu_initialise(struct vcpu *v) { return -EOPNOTSUPP; }
 static inline void pv_domain_destroy(struct domain *d) {}
 static inline int pv_domain_initialise(struct domain *d) { return -EOPNOTSUPP; }
 
+static inline unsigned long pv_make_cr4(const struct vcpu *v) { return ~0ul; }
+
 #endif	/* CONFIG_PV */
 
 void paravirt_ctxt_switch_from(struct vcpu *v);

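To make the mask split in unstable-1.patch concrete: a guest's view of %cr4
is assembled from the bits it may toggle itself (PV_CR4_GUEST_MASK) plus a
fixed subset of Xen's hardware settings (PV_CR4_GUEST_VISIBLE_MASK).  The
sketch below (illustrative only; the mmu_cr4_features value is a made-up
stand-in) reproduces the composition performed by pv_fixup_guest_cr4():

#include <stdio.h>

#define X86_CR4_TSD        (1ul <<  2)
#define X86_CR4_DE         (1ul <<  3)
#define X86_CR4_PAE        (1ul <<  5)
#define X86_CR4_MCE        (1ul <<  6)
#define X86_CR4_OSFXSR     (1ul <<  9)
#define X86_CR4_OSXMMEXCPT (1ul << 10)
#define X86_CR4_FSGSBASE   (1ul << 16)
#define X86_CR4_OSXSAVE    (1ul << 18)

#define PV_CR4_GUEST_MASK \
    (X86_CR4_TSD | X86_CR4_DE | X86_CR4_FSGSBASE | X86_CR4_OSXSAVE)
#define PV_CR4_GUEST_VISIBLE_MASK \
    (X86_CR4_PAE | X86_CR4_MCE | X86_CR4_OSFXSR | X86_CR4_OSXMMEXCPT)

int main(void)
{
    /* Hypothetical host setting: all forwarded bits, plus FSGSBASE. */
    unsigned long mmu_cr4_features =
        PV_CR4_GUEST_VISIBLE_MASK | X86_CR4_FSGSBASE;

    /* The guest elects to use FSGSBASE, and leaves everything else clear. */
    unsigned long guest_choice = X86_CR4_FSGSBASE;

    /* Guest-controllable bits come from the guest's choice; visible bits
     * are forwarded from Xen's settings.  Because the two masks are
     * disjoint, Xen's own FSGSBASE setting never leaks into the view. */
    unsigned long view = (guest_choice & PV_CR4_GUEST_MASK) |
                         (mmu_cr4_features & PV_CR4_GUEST_VISIBLE_MASK);

    printf("guest view of cr4 = %#lx\n", view); /* PAE etc. forced set */

    return 0;
}
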
["xsa293/unstable-2.patch" (application/octet-stream)]

From: Andrew Cooper <andrew.cooper3@citrix.com>
Subject: x86/pv: Don't have %cr4.fsgsbase active behind a guest kernel's back

Currently, a 64bit PV guest can appear to set and clear FSGSBASE in %cr4, but
the bit remains set in hardware.  Therefore, the {RD,WR}{FS,GS}BASE
instructions are usable even when the guest kernel believes that they are
disabled.

The FSGSBASE feature isn't currently supported in Linux, and its context
switch path has some optimisations which rely on userspace being unable to use
the WR{FS,GS}BASE instructions.  Xen's current behaviour undermines this
expectation.

In 64bit PV guest context, always load the guest kernel's setting of FSGSBASE
into %cr4.  This requires adjusting how Xen uses the {RD,WR}{FS,GS}BASE
instructions.

 * Delete the cpu_has_fsgsbase helper.  It is no longer safe, as users need to
   check %cr4 directly.
 * The raw __rd{fs,gs}base() helpers are only safe to use when %cr4.fsgsbase
   is set.  Comment this property.
 * The {rd,wr}{fs,gs}{base,shadow}() and read_msr() helpers are updated to use
   the current %cr4 value to determine which mechanism to use.
 * toggle_guest_mode() and save_segments() are updated to avoid reading
   fs/gsbase if the values in hardware cannot be stale WRT struct vcpu.  A
   consequence of this is that the write_cr() path needs to cache the current
   bases, as subsequent context switches will skip saving the values.
 * write_cr4() is updated to ensure that the shadow %cr4.fsgsbase value is
   observed in a safe way WRT the hardware setting, if an interrupt happens to
   hit in the middle.
 * load_segments() is updated to use the VMLOAD optimisation if FSGSBASE is
   unavailable, even if only gs_shadow needs updating.  As a minor perf
   improvement, check cpu_has_svm first to short-circuit a context-dependent
   conditional on Intel hardware.
 * pv_make_cr4() is updated for 64bit PV guests to use the guest kernel's
   choice of FSGSBASE.

This is part of XSA-293

Reported-by: Andy Lutomirski <luto@kernel.org>
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 7ff0f10..adee35e 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1303,13 +1303,8 @@ static void load_segments(struct vcpu *n)
     per_cpu(dirty_segment_mask, cpu) = 0;
 
 #ifdef CONFIG_HVM
-    if ( !is_pv_32bit_vcpu(n) && !cpu_has_fsgsbase && cpu_has_svm &&
-         !((uregs->fs | uregs->gs) & ~3) &&
-         /*
-          * The remaining part is just for optimization: If only shadow GS
-          * needs loading, there's nothing to be gained here.
-          */
-         (n->arch.pv.fs_base | n->arch.pv.gs_base_user | n->arch.pv.ldt_ents) )
+    if ( cpu_has_svm && !is_pv_32bit_vcpu(n) &&
+         !(read_cr4() & X86_CR4_FSGSBASE) && !((uregs->fs | uregs->gs) & ~3) )
     {
         unsigned long gsb = n->arch.flags & TF_kernel_mode
             ? n->arch.pv.gs_base_kernel : n->arch.pv.gs_base_user;
@@ -1488,7 +1483,8 @@ static void save_segments(struct vcpu *v)
     regs->fs = read_sreg(fs);
     regs->gs = read_sreg(gs);
 
-    if ( cpu_has_fsgsbase && !is_pv_32bit_vcpu(v) )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( (read_cr4() & X86_CR4_FSGSBASE) && !is_pv_32bit_vcpu(v) )
     {
         v->arch.pv.fs_base = __rdfsbase();
         if ( v->arch.flags & TF_kernel_mode )
@@ -1692,8 +1688,8 @@ static void __context_switch(void)
 
 #if defined(CONFIG_PV) && defined(CONFIG_HVM)
     /* Prefetch the VMCB if we expect to use it later in the context switch */
-    if ( is_pv_domain(nd) && !is_pv_32bit_domain(nd) && !is_idle_domain(nd) &&
-         !cpu_has_fsgsbase && cpu_has_svm )
+    if ( cpu_has_svm && is_pv_domain(nd) && !is_pv_32bit_domain(nd) &&
+         !is_idle_domain(nd) && !(read_cr4() & X86_CR4_FSGSBASE) )
         svm_load_segs(0, 0, 0, 0, 0, 0, 0);
 #endif
 
diff --git a/xen/arch/x86/hvm/svm/svm.c b/xen/arch/x86/hvm/svm/svm.c
index 2584b90..23d72e8 100644
--- a/xen/arch/x86/hvm/svm/svm.c
+++ b/xen/arch/x86/hvm/svm/svm.c
@@ -1574,8 +1574,7 @@ static int svm_cpu_up_prepare(unsigned int cpu)
             goto err;
 
 #ifdef CONFIG_PV
-        if ( !cpu_has_fsgsbase )
-            per_cpu(host_vmcb_va, cpu) = __map_domain_page_global(pg);
+        per_cpu(host_vmcb_va, cpu) = __map_domain_page_global(pg);
 #endif
 
         clear_domain_page(page_to_mfn(pg));
diff --git a/xen/arch/x86/pv/domain.c b/xen/arch/x86/pv/domain.c
index 91123a8..3e82cff 100644
--- a/xen/arch/x86/pv/domain.c
+++ b/xen/arch/x86/pv/domain.c
@@ -140,6 +140,15 @@ unsigned long pv_make_cr4(const struct vcpu *v)
     if ( d->arch.vtsc || (v->arch.pv.ctrlreg[4] & X86_CR4_TSD) )
         cr4 |= X86_CR4_TSD;
 
+    /*
+     * The {RD,WR}{FS,GS}BASE are only useable in 64bit code segments.  While
+     * we must not have CR4.FSGSBASE set behind the back of a 64bit PV kernel,
+     * we do leave it set in 32bit PV context to speed up Xen's context switch
+     * path.
+     */
+    if ( !is_pv_32bit_domain(d) && !(v->arch.pv.ctrlreg[4] & X86_CR4_FSGSBASE) )
+        cr4 &= ~X86_CR4_FSGSBASE;
+
     return cr4;
 }
 
@@ -371,7 +380,8 @@ void toggle_guest_mode(struct vcpu *v)
 {
     ASSERT(!is_pv_32bit_vcpu(v));
 
-    if ( cpu_has_fsgsbase )
+    /* %fs/%gs bases can only be stale if WR{FS,GS}BASE are usable. */
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         if ( v->arch.flags & TF_kernel_mode )
             v->arch.pv.gs_base_kernel = __rdgsbase();
diff --git a/xen/arch/x86/pv/emul-priv-op.c b/xen/arch/x86/pv/emul-priv-op.c
index b4a56f9..3746e2a 100644
--- a/xen/arch/x86/pv/emul-priv-op.c
+++ b/xen/arch/x86/pv/emul-priv-op.c
@@ -780,6 +780,17 @@ static int write_cr(unsigned int reg, unsigned long val,
     }
 
     case 4: /* Write CR4 */
+        /*
+         * If this write will disable FSGSBASE, refresh Xen's idea of the
+         * guest bases now that they can no longer change.
+         */
+        if ( (curr->arch.pv.ctrlreg[4] & X86_CR4_FSGSBASE) &&
+             !(val & X86_CR4_FSGSBASE) )
+        {
+            curr->arch.pv.fs_base = __rdfsbase();
+            curr->arch.pv.gs_base_kernel = __rdgsbase();
+        }
+
         curr->arch.pv.ctrlreg[4] = pv_fixup_guest_cr4(curr, val);
         write_cr4(pv_make_cr4(curr));
         ctxt_switch_levelling(curr);
@@ -828,14 +839,15 @@ static int read_msr(unsigned int reg, uint64_t *val,
     case MSR_FS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdfsbase() : curr->arch.pv.fs_base;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdfsbase()
+                                               : curr->arch.pv.fs_base;
         return X86EMUL_OKAY;
 
     case MSR_GS_BASE:
         if ( is_pv_32bit_domain(currd) )
             break;
-        *val = cpu_has_fsgsbase ? __rdgsbase()
-                                : curr->arch.pv.gs_base_kernel;
+        *val = (read_cr4() & X86_CR4_FSGSBASE) ? __rdgsbase()
+                                               : curr->arch.pv.gs_base_kernel;
         return X86EMUL_OKAY;
 
     case MSR_SHADOW_GS_BASE:
diff --git a/xen/arch/x86/setup.c b/xen/arch/x86/setup.c
index 92da060..3440794 100644
--- a/xen/arch/x86/setup.c
+++ b/xen/arch/x86/setup.c
@@ -1610,7 +1610,7 @@ void __init noreturn __start_xen(unsigned long mbi_p)
 
     cr4_pv32_mask = mmu_cr4_features & XEN_CR4_PV32_BITS;
 
-    if ( cpu_has_fsgsbase )
+    if ( boot_cpu_has(X86_FEATURE_FSGSBASE) )
         set_in_cr4(X86_CR4_FSGSBASE);
 
     if ( opt_invpcid && cpu_has_invpcid )
diff --git a/xen/include/asm-x86/cpufeature.h b/xen/include/asm-x86/cpufeature.h
index 5592e17..1fb9af4 100644
--- a/xen/include/asm-x86/cpufeature.h
+++ b/xen/include/asm-x86/cpufeature.h
@@ -89,7 +89,6 @@
 #define cpu_has_xsaves          boot_cpu_has(X86_FEATURE_XSAVES)
 
 /* CPUID level 0x00000007:0.ebx */
-#define cpu_has_fsgsbase        boot_cpu_has(X86_FEATURE_FSGSBASE)
 #define cpu_has_bmi1            boot_cpu_has(X86_FEATURE_BMI1)
 #define cpu_has_hle             boot_cpu_has(X86_FEATURE_HLE)
 #define cpu_has_avx2            boot_cpu_has(X86_FEATURE_AVX2)
diff --git a/xen/include/asm-x86/msr.h b/xen/include/asm-x86/msr.h
index adfa2fa..a724479 100644
--- a/xen/include/asm-x86/msr.h
+++ b/xen/include/asm-x86/msr.h
@@ -124,6 +124,14 @@ static inline uint64_t rdtsc_ordered(void)
 			  : "=a" (low), "=d" (high) \
 			  : "c" (counter))
 
+/*
+ * On hardware supporting FSGSBASE, the value loaded into hardware is the
+ * guest kernel's choice for 64bit PV guests (Xen's choice for Idle, HVM and
+ * 32bit PV).
+ *
+ * Therefore, the {RD,WR}{FS,GS}BASE instructions are only safe to use if
+ * %cr4.fsgsbase is set.
+ */
 static inline unsigned long __rdfsbase(void)
 {
     unsigned long base;
@@ -154,7 +162,7 @@ static inline unsigned long rdfsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdfsbase();
 
     rdmsrl(MSR_FS_BASE, base);
@@ -166,7 +174,7 @@ static inline unsigned long rdgsbase(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
         return __rdgsbase();
 
     rdmsrl(MSR_GS_BASE, base);
@@ -178,7 +186,7 @@ static inline unsigned long rdgsshadow(void)
 {
     unsigned long base;
 
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         asm volatile ( "swapgs" );
         base = __rdgsbase();
@@ -192,7 +200,7 @@ static inline unsigned long rdgsshadow(void)
 
 static inline void wrfsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_AS_FSGSBASE
         asm volatile ( "wrfsbase %0" :: "r" (base) );
 #else
@@ -204,7 +212,7 @@ static inline void wrfsbase(unsigned long base)
 
 static inline void wrgsbase(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
 #ifdef HAVE_AS_FSGSBASE
         asm volatile ( "wrgsbase %0" :: "r" (base) );
 #else
@@ -216,7 +224,7 @@ static inline void wrgsbase(unsigned long base)
 
 static inline void wrgsshadow(unsigned long base)
 {
-    if ( cpu_has_fsgsbase )
+    if ( read_cr4() & X86_CR4_FSGSBASE )
     {
         asm volatile ( "swapgs\n\t"
 #ifdef HAVE_AS_FSGSBASE
diff --git a/xen/include/asm-x86/processor.h b/xen/include/asm-x86/processor.h
index df01ae3..f3275ca 100644
--- a/xen/include/asm-x86/processor.h
+++ b/xen/include/asm-x86/processor.h
@@ -304,11 +304,31 @@ static inline unsigned long read_cr4(void)
 
 static inline void write_cr4(unsigned long val)
 {
+    struct cpu_info *info = get_cpu_info();
+
     /* No global pages in case of PCIDs enabled! */
     ASSERT(!(val & X86_CR4_PGE) || !(val & X86_CR4_PCIDE));
 
-    get_cpu_info()->cr4 = val;
-    asm volatile ( "mov %0,%%cr4" : : "r" (val) );
+    /*
+     * On hardware supporting FSGSBASE, the value in %cr4 is the kernel's
+     * choice for 64bit PV guests, which impacts whether Xen can use the
+     * instructions.
+     *
+     * The {rd,wr}{fs,gs}base() helpers use info->cr4 to work out whether it
+     * is safe to execute the {RD,WR}{FS,GS}BASE instruction, falling back to
+     * the MSR path if not.  Some users require interrupt safety.
+     *
+     * If FSGSBASE is currently or about to become clear, reflect this in
+     * info->cr4 before updating %cr4, so an interrupt which hits in the
+     * middle won't observe FSGSBASE set in info->cr4 but clear in %cr4.
+     */
+    info->cr4 = val & (info->cr4 | ~X86_CR4_FSGSBASE);
+
+    asm volatile ( "mov %[val], %%cr4"
+                   : "+m" (info->cr4) /* Force ordering without a barrier. */
+                   : [val] "r" (val) );
+
+    info->cr4 = val;
 }
 
 /* Clear and set 'TS' bit respectively */

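For completeness, the guest-side symptom is easy to probe.  On an unfixed
hypervisor, a 64bit PV guest's userspace can execute RD/WRFSBASE even though
its kernel has cleared CR4.FSGSBASE, because the bit stayed set in hardware.
The hypothetical probe below (illustrative only, not an exploit) executes
RDFSBASE and reports whether it faulted; on a fixed Xen, or on any system
where the kernel has the feature disabled, the instruction takes #UD and the
process receives SIGILL:

#include <setjmp.h>
#include <signal.h>
#include <stdio.h>

static sigjmp_buf env;

static void sigill(int sig)
{
    (void)sig;
    siglongjmp(env, 1);
}

int main(void)
{
    signal(SIGILL, sigill);

    if ( sigsetjmp(env, 1) == 0 )
    {
        unsigned long base;

        /* "rdfsbase %rax", encoded directly so that no assembler support
         * for the FSGSBASE instructions is required. */
        asm volatile ( ".byte 0xf3, 0x48, 0x0f, 0xae, 0xc0" : "=a" (base) );

        printf("RDFSBASE executed: fs base = %#lx\n", base);
    }
    else
        printf("RDFSBASE raised #UD - the feature is correctly disabled\n");

    return 0;
}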
