'[oss-security] Xen Security Advisory 286 v6 (CVE-2020-27674) - x86 PV guest INVLPG-like flushes may '

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       oss-security
Subject:    [oss-security] Xen Security Advisory 286 v6 (CVE-2020-27674) - x86 PV guest INVLPG-like flushes may 
From:       Xen.org security team <security () xen ! org>
Date:       2021-01-19 16:34:15
Message-ID: E1l1txP-0002pI-1w () xenbits ! xenproject ! org
[Download RAW message or body]

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2020-27674 / XSA-286
                              version 6

     x86 PV guest INVLPG-like flushes may leave stale TLB entries

UPDATES IN VERSION 6
====================

CVE assigned.

ISSUE DESCRIPTION
=================

x86 PV guest kernels may use hypercalls with INVLPG-like behavior to
invalidate TLB entries even after changes to non-leaf page tables.  Such
changes to non-leaf page tables will, however, also render stale
possible TLB entries created by Xen's internal use of linear page tables
to process guest requests like update-va-mapping.  Invalidation of these
TLB entries has been missing, allowing subsequent guest requests to
change address mappings for one process to potentially modify memory
meanwhile in use elsewhere.

IMPACT
======

Malicious x86 PV guest user mode may be able to escalate their privilege
to that of the guest kernel.

VULNERABLE SYSTEMS
==================

All versions of Xen expose the vulnerability.

The vulnerability is exposed to x86 PV guests only.  x86 HVM/PVH guests
as well as ARM ones are not vulnerable.

MITIGATION
==========

There is no known mitigation.

CREDITS
=======

This issue was discovered by Jann Horn of Google Project Zero.

RESOLUTION
==========

Applying the appropriate set of attached patches resolves this issue.

xsa286-unstable/*.patch  xen-unstable
xsa286-4.14/*.patch      Xen 4.14.x
xsa286-4.13/*.patch      Xen 4.13.x
xsa286-4.12/*.patch      Xen 4.12.x
xsa286-4.11/*.patch      Xen 4.11.x
xsa286-4.10/*.patch      Xen 4.10.x

$ sha256sum xsa286* xsa286*/*
a7d4ddb15197dfcb246b84f8a89799f76070cdde99a5c1d0203229d719b0fcc1  xsa286.meta
e5f946b07989db85de2a03e4b88e09324316c0ec12d21c5afb83d463114a1f4f  \
xsa286-unstable/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
2a732c958201eb03cc0737278e75f86160e0dedbbe0a13f415ec0d17a90ec009  \
xsa286-unstable/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch \
2da4b60e19b1fbf1daf0d1bc61733763abf5653a6e53ffeadd559d0a01ec8095  \
xsa286-4.10/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
5ce7f56a9b2c9a3a63f79d7df2486c24fc130a8658deb182b22416e17c202ae9  \
xsa286-4.10/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch \
2e700e091bfd9d3fd6dd65064ec39a8a40d73bcc94b66852fd2d6fbe9ba6c2db  \
xsa286-4.11/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
d622652ce50d59bf45134baabc26b89a24e5d98b1f82230041919089a1cf1620  \
xsa286-4.11/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch \
4dc18a007ddf2bd5022ce194b861989be88170f8188ce49dbea7073bb280202f  \
xsa286-4.12/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
2c48331849d4d401b47dfc3db84bb067786b4e53155587235d919781b4a10e76  \
xsa286-4.12/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch \
dd0fad5165dcd0c3d8d551e35fa4fe29653a3b8c5ec52f7f86f434305c946338  \
xsa286-4.13/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
de1326efd4a8559c32ac68c89095f3230f723dec2acc80fc01a534578bb1be82  \
xsa286-4.13/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch \
a718f5e19ce821d1fe06f2cdc2f7ad0bbe7c7bca954c283bbc36ad50522f66ef  \
xsa286-4.14/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch \
d659d4a4119b235c7d1054980ceea9424dcc7faf3cfd3fd46627577a424256b5  \
xsa286-4.14/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch $

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmAHB6MMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZDi4IAL8YKoMnrvTD8nNVHvUyTgVRpO9w68qq5r8gG3Z6
InBZWYOp+YrMScoqFap+R1RylIcKtzlqbCn3TR0dZdKDviPMpbgIQwEHI7C7I+qM
rN4/cmEljAY+dspU2isqzX6IEDSwk4H9NcUtzN7+MbpUrJiis597IxW5T0KMM5Bd
FYd2/MmzEayZkcEtuMLcFKdl2n1mi+7x7jNlQW5FeHI+6F8SS76YlYs2d/iaDC98
cX4YMdo4ZzcXpKVXgppbga7AEC1AZaNIfBd5cFrZaCvDBYnmW4Zwz8W7R/wYO987
5ogHMu0GX92i8QwN5EBwLolhnruZIBnaSJ9PiGk0GRbgGw4=
=AADk
-----END PGP SIGNATURE-----


["xsa286.meta" (application/octet-stream)]
["xsa286-unstable/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From 055e1c3a3d95b1e753148369fbc4ba48782dd602 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b2f35b3e7d..38168189aa 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4188,7 +4188,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-unstable/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From 16a20963b3209788f2c0d3a3eebb7d92f03f5883 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 38168189aa..5a50339284 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3891,7 +3891,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN, mfn;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4041,6 +4042,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l3_page_table:
@@ -4048,6 +4051,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l4_page_table:
@@ -4055,6 +4060,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && pt_owner->arch.pv.xpti )
                     {
                         bool local_in_use = false;
@@ -4063,7 +4070,7 @@ long do_mmu_update(
                                     mfn) )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4075,7 +4082,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
                                      mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
 
@@ -4177,19 +4184,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1


["xsa286-4.10/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From 2012db463b8b8e502f69a46dd9f3aa47b31c937f Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 7090be7b4a..2aac1cbf10 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4235,7 +4235,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-4.10/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From 78d903e95efc5b0166b393d289a687c64016e8ef Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2aac1cbf10..66399a5f61 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3949,7 +3949,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4096,14 +4097,20 @@ long do_mmu_update(
                 case PGT_l2_page_table:
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
                 case PGT_l3_page_table:
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
                 case PGT_l4_page_table:
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && !cpu_has_no_xpti )
                     {
                         bool local_in_use = false;
@@ -4111,7 +4118,7 @@ long do_mmu_update(
                         if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4123,7 +4130,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               (pagetable_get_pfn(curr->arch.guest_table_user) ==
                                mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
                 case PGT_writable_page:
@@ -4224,19 +4231,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->domain_dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->domain_dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->domain_dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1


["xsa286-4.11/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From 1d021db3c8712d25e25f078833baa160c90f260f Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5ca5c8c9a2..129da1e648 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4279,7 +4279,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-4.11/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From e274c8bdc12eb596e55233040e8b49da27150f31 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 129da1e648..3528cf6b85 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3983,7 +3983,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4133,6 +4134,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l3_page_table:
@@ -4140,6 +4143,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l4_page_table:
@@ -4147,6 +4152,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && pt_owner->arch.pv_domain.xpti )
                     {
                         bool local_in_use = false;
@@ -4154,7 +4161,7 @@ long do_mmu_update(
                         if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4166,7 +4173,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               (pagetable_get_pfn(curr->arch.guest_table_user) ==
                                mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
 
@@ -4268,19 +4275,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1


["xsa286-4.12/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From b1d6f37aa5aa9f3fc5a269b9dd21b7feb7444be0 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e7b8f4ee4b..86f31b334f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4301,7 +4301,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-4.12/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From 4100d463dbdd95d85fabe387dd5676bed75f65f7 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 86f31b334f..db4cfdf20b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4005,7 +4005,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4155,6 +4156,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l3_page_table:
@@ -4162,6 +4165,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l4_page_table:
@@ -4169,6 +4174,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && pt_owner->arch.pv.xpti )
                     {
                         bool local_in_use = false;
@@ -4176,7 +4183,7 @@ long do_mmu_update(
                         if ( pagetable_get_pfn(curr->arch.guest_table) == mfn )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4188,7 +4195,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               (pagetable_get_pfn(curr->arch.guest_table_user) ==
                                mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
 
@@ -4290,19 +4297,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1


["xsa286-4.13/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From c10b2931bf63a444e0917115aad348b911caa82e Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 407340e5f5..204611ca2c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4266,7 +4266,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-4.13/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From 28b78171271dbbce88bbd4cb2de3d828a51fb169 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 204611ca2c..e56cd4bc65 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3969,7 +3969,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN, mfn;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4119,6 +4120,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l3_page_table:
@@ -4126,6 +4129,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l4_page_table:
@@ -4133,6 +4138,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && pt_owner->arch.pv.xpti )
                     {
                         bool local_in_use = false;
@@ -4141,7 +4148,7 @@ long do_mmu_update(
                                     mfn) )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4153,7 +4160,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
                                      mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
 
@@ -4255,19 +4262,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1


["xsa286-4.14/0001-x86-pv-Drop-FLUSH_TLB_GLOBAL-in-do_mmu_update-for-XP.patch" (application/octet-stream)]

From 941f69a428cd989144300519e548e346c681a1b3 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Thu, 22 Oct 2020 11:28:58 +0100
Subject: [PATCH 1/2] x86/pv: Drop FLUSH_TLB_GLOBAL in do_mmu_update() for XPTI

c/s 9d1d31ad9498 "x86: slightly reduce Meltdown band-aid overhead" removed the
use of Global TLB flushes on the Xen entry path, but added a FLUSH_TLB_GLOBAL
to the L4 path in do_mmu_update().

However, this was unnecessary.

It is the guests responsibility to perform appropriate TLB flushing if the L4
modification altered an established mapping in a flush-relevant way.  In this
case, an MMUEXT_OP hypercall will follow.  The case which Xen needs to cover
is when new mappings are created, and the resync on the exit-to-guest path
covers this correctly.

There is a corner case with multiple vCPUs in hypercalls at the same time,
which 9d1d31ad9498 changed, and this patch changes back to its original XPTI
behaviour.

Architecturally, established TLB entries can continue to be used until the
broadcast flush has completed.  Therefore, even with concurrent hypercalls,
the guest cannot depend on older mappings not being used until an MMUEXT_OP
hypercall completes.  Xen's implementation of guest-initiated flushes will
take correct effect on top of an in-progress hypercall, picking up new mapping
setting before the other vCPU's MMUEXT_OP completes.

Note: The correctness of this change is not impacted by whether XPTI uses
global mappings or not.  Correctness there depends on the behaviour of Xen on
the entry/exit paths when switching two/from the XPTI "shadow" pagetables.

This is (not really) XSA-286 (but necessary to simplify the logic).

Fixes: 9d1d31ad9498 ("x86: slightly reduce Meltdown band-aid overhead")
Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 055e1c3a3d95b1e753148369fbc4ba48782dd602)
---
 xen/arch/x86/mm.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3cb6fabdae..1caa2df0a5 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -4193,7 +4193,7 @@ long do_mmu_update(
 
         cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_TLB_GLOBAL | FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_ROOT_PGTBL);
     }
 
     perfc_add(num_page_updates, i);
-- 
2.20.1


["xsa286-4.14/0002-x86-pv-Flush-TLB-in-response-to-paging-structure-cha.patch" (application/octet-stream)]

From 10bb63c203f42d931fa1fa7dbbae7ce1765cecf2 Mon Sep 17 00:00:00 2001
From: Andrew Cooper <andrew.cooper3@citrix.com>
Date: Mon, 19 Oct 2020 15:51:22 +0100
Subject: [PATCH 2/2] x86/pv: Flush TLB in response to paging structure changes

With MMU_UPDATE, a PV guest can make changes to higher level pagetables.  This
is safe from Xen's point of view (as the update only affects guest mappings),
and the guest is required to flush (if necessary) after making updates.

However, Xen's use of linear pagetables (UPDATE_VA_MAPPING, GNTTABOP_map,
writeable pagetables, etc.) is an implementation detail outside of the
API/ABI.

Changes in the paging structure require invalidations in the linear pagetable
range for subsequent accesses into the linear pagetables to access non-stale
mappings.  Xen must provide suitable flushing to prevent intermixed guest
actions from accidentally accessing/modifying the wrong pagetable.

For all L2 and higher modifications, flush the TLB.  PV guests cannot create
L2 or higher entries with the Global bit set, so no mappings established in
the linear range can be global.  (This could in principle be an order 39 flush
starting at LINEAR_PT_VIRT_START, but no such mechanism exists in practice.)

Express the necessary flushes as a set of booleans which accumulate across the
operation.  Comment the flushing logic extensively.

This is XSA-286.

Signed-off-by: Andrew Cooper <andrew.cooper3@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
(cherry picked from commit 16a20963b3209788f2c0d3a3eebb7d92f03f5883)
---
 xen/arch/x86/mm.c | 69 ++++++++++++++++++++++++++++++++++++++++-------
 1 file changed, 59 insertions(+), 10 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1caa2df0a5..61cf6a7b9b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -3896,7 +3896,8 @@ long do_mmu_update(
     struct vcpu *curr = current, *v = curr;
     struct domain *d = v->domain, *pt_owner = d, *pg_owner;
     mfn_t map_mfn = INVALID_MFN, mfn;
-    bool sync_guest = false;
+    bool flush_linear_pt = false, flush_root_pt_local = false,
+        flush_root_pt_others = false;
     uint32_t xsm_needed = 0;
     uint32_t xsm_checked = 0;
     int rc = put_old_guest_table(curr);
@@ -4046,6 +4047,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l2_entry(va, l2e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l3_page_table:
@@ -4053,6 +4056,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l3_entry(va, l3e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     break;
 
                 case PGT_l4_page_table:
@@ -4060,6 +4065,8 @@ long do_mmu_update(
                         break;
                     rc = mod_l4_entry(va, l4e_from_intpte(req.val), mfn,
                                       cmd == MMU_PT_UPDATE_PRESERVE_AD, v);
+                    if ( !rc )
+                        flush_linear_pt = true;
                     if ( !rc && pt_owner->arch.pv.xpti )
                     {
                         bool local_in_use = false;
@@ -4068,7 +4075,7 @@ long do_mmu_update(
                                     mfn) )
                         {
                             local_in_use = true;
-                            get_cpu_info()->root_pgt_changed = true;
+                            flush_root_pt_local = true;
                         }
 
                         /*
@@ -4080,7 +4087,7 @@ long do_mmu_update(
                              (1 + !!(page->u.inuse.type_info & PGT_pinned) +
                               mfn_eq(pagetable_get_mfn(curr->arch.guest_table_user),
                                      mfn) + local_in_use) )
-                            sync_guest = true;
+                            flush_root_pt_others = true;
                     }
                     break;
 
@@ -4182,19 +4189,61 @@ long do_mmu_update(
     if ( va )
         unmap_domain_page(va);
 
-    if ( sync_guest )
+    /*
+     * Perform required TLB maintenance.
+     *
+     * This logic currently depend on flush_linear_pt being a superset of the
+     * flush_root_pt_* conditions.
+     *
+     * pt_owner may not be current->domain.  This may occur during
+     * construction of 32bit PV guests, or debugging of PV guests.  The
+     * behaviour cannot be correct with domain unpaused.  We therefore expect
+     * pt_owner->dirty_cpumask to be empty, but it is a waste of effort to
+     * explicitly check for, and exclude, this corner case.
+     *
+     * flush_linear_pt requires a FLUSH_TLB to all dirty CPUs.  The flush must
+     * be performed now to maintain correct behaviour across a multicall.
+     * i.e. we cannot relax FLUSH_TLB to FLUSH_ROOT_PGTBL, given that the
+     * former is a side effect of the latter, because the resync (which is in
+     * the return-to-guest path) happens too late.
+     *
+     * flush_root_pt_* requires FLUSH_ROOT_PGTBL on either the local CPU
+     * (implies pt_owner == current->domain and current->processor set in
+     * pt_owner->dirty_cpumask), and/or all *other* dirty CPUs as there are
+     * references we can't account for locally.
+     */
+    if ( flush_linear_pt /* || flush_root_pt_local || flush_root_pt_others */ )
     {
+        unsigned int cpu = smp_processor_id();
+        cpumask_t *mask = pt_owner->dirty_cpumask;
+
         /*
-         * Force other vCPU-s of the affected guest to pick up L4 entry
-         * changes (if any).
+         * Always handle local flushing separately (if applicable), to
+         * separate the flush invocations appropriately for scope of the two
+         * flush_root_pt_* variables.
          */
-        unsigned int cpu = smp_processor_id();
-        cpumask_t *mask = per_cpu(scratch_cpumask, cpu);
+        if ( likely(cpumask_test_cpu(cpu, mask)) )
+        {
+            mask = per_cpu(scratch_cpumask, cpu);
 
-        cpumask_andnot(mask, pt_owner->dirty_cpumask, cpumask_of(cpu));
+            cpumask_copy(mask, pt_owner->dirty_cpumask);
+            __cpumask_clear_cpu(cpu, mask);
+
+            flush_local(FLUSH_TLB |
+                        (flush_root_pt_local ? FLUSH_ROOT_PGTBL : 0));
+        }
+        else
+            /* Sanity check.  flush_root_pt_local implies local cpu is dirty. */
+            ASSERT(!flush_root_pt_local);
+
+        /* Flush the remote dirty CPUs.  Does not include the local CPU. */
         if ( !cpumask_empty(mask) )
-            flush_mask(mask, FLUSH_ROOT_PGTBL);
+            flush_mask(mask, FLUSH_TLB |
+                       (flush_root_pt_others ? FLUSH_ROOT_PGTBL : 0));
     }
+    else
+        /* Sanity check.  flush_root_pt_* implies flush_linear_pt. */
+        ASSERT(!flush_root_pt_local && !flush_root_pt_others);
 
     perfc_add(num_page_updates, i);
 
-- 
2.20.1



[prev in list] [next in list] [prev in thread] [next in thread]
Configure | About | News | Add a list | Sponsored by KoreLogic