
List:       oss-security
Subject:    [oss-security] Xen Security Advisory 345 v4 (CVE-2020-27672) - x86: Race condition in Xen mapping code
From:       Xen.org security team <security@xen.org>
Date:       2021-01-19 16:34:17
Message-ID: E1l1txR-0002tZ-DJ@xenbits.xenproject.org

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2020-27672 / XSA-345
                              version 4

                x86: Race condition in Xen mapping code

UPDATES IN VERSION 4
====================

CVE assigned.

ISSUE DESCRIPTION
=================

The Xen code handling the updating of the hypervisor's own pagetables
tries to use 2MiB and 1GiB superpages as much as possible to maximize
TLB efficiency.  Some of the operations for checking and coalescing
superpages take a non-negligible amount of time; to avoid potential lock
contention, this code also tries to avoid holding locks for the entire
operation.

Unfortunately, several potential race conditions were not considered;
precisely-timed guest actions could potentially lead to the code
writing to a page which has been freed (and thus potentially already
reused).
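
The fixes attached below serialise these update operations by
repurposing the PGT_locked bit in an L3 table's type_info field as a
per-table lock (see the third patch in each series).  As a rough
illustration only (the structure and names below are invented
stand-ins, not Xen source), the bit-lock pattern looks like this in
standalone C:

    /*
     * Illustrative sketch only, not Xen source: "struct table_info" and
     * FLAG_LOCKED are invented stand-ins for Xen's struct page_info and
     * PGT_locked.  The real change is in the attached patches.
     */
    #include <sched.h>
    #include <stdatomic.h>

    #define FLAG_LOCKED (1UL << 0)

    struct table_info {
        atomic_ulong flags;        /* stand-in for page->u.inuse.type_info */
    };

    static void table_lock(struct table_info *t)
    {
        unsigned long x, nx;

        do {
            /* Wait until the lock bit is observed clear... */
            while ( (x = atomic_load(&t->flags)) & FLAG_LOCKED )
                sched_yield();
            nx = x | FLAG_LOCKED;
            /* ...then try to set it atomically; retry if we lost the race. */
        } while ( !atomic_compare_exchange_weak(&t->flags, &x, nx) );
    }

    static void table_unlock(struct table_info *t)
    {
        unsigned long x = atomic_load(&t->flags);

        /* Clear only the lock bit, preserving all other flag bits. */
        while ( !atomic_compare_exchange_weak(&t->flags, &x, x & ~FLAG_LOCKED) )
            ;
    }

Locking one L3 table at a time in this way, rather than holding
map_pgdir_lock for the whole operation, limits the serialisation to a
single 512GiB slice of the hypervisor address space.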

IMPACT
======

A malicious guest can cause a host denial-of-service.  Data corruption
or privilege escalation cannot be ruled out.

VULNERABLE SYSTEMS
==================

Versions of Xen from at least 3.2 onward are affected.

Only x86 systems are vulnerable.  ARM systems are not vulnerable.

Guests can only exercise the vulnerability if they have passed-through
hardware devices.  Guests without passthrough configured cannot
exploit the vulnerability.

Furthermore, HVM and PVH guests can only exercise the vulnerability if
they are running in shadow mode, and only when running on VT-x capable
hardware (as opposed to SVM).  Such hardware is believed to comprise
Intel, Centaur, and Shanghai CPUs.

MITIGATION
==========

Running all guests in HVM or PVH mode, in each case with HAP enabled,
prevents those guests from exploiting the vulnerability.

CREDITS
=======

This issue was discovered by Hongyan Xia of Amazon.

RESOLUTION
==========

Applying the appropriate attached patch resolves this issue.

Note that patches for released versions are generally prepared to
apply to the stable branches, and may not apply cleanly to the most
recent release tarball.  Downstreams are encouraged to update to the
tip of the stable branch before applying these patches.

xsa345/*.patch           xen-unstable
xsa345-4.14/*.patch      Xen 4.14.x
xsa345-4.13/*.patch      Xen 4.13.x
xsa345-4.12/*.patch      Xen 4.12.x
xsa345-4.11/*.patch      Xen 4.11.x
xsa345-4.10/*.patch      Xen 4.10.x

$ sha256sum xsa345* xsa345*/*
c8b9445b05aa4c585d9817c2e6cbf08466452a15381ca5b9a0224a377522edf9  xsa345.meta
4ed69dce620449bedda29f3ce1ed767908d2bbeb888701e7c4c2461216b724f7  xsa345-4.10/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
98d3b171b197c1ff9f26ff70499a0cde3b23d048d622b12bf2ea0899de4f9e7f  xsa345-4.10/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
78c4be2f1747051d13869001180ee25bdeabe5e8138d0604a33db610b24e38f1  xsa345-4.10/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
4abd8271a70593fcde683071fdf0ac342ff9b0859b60c9790b14dd7e5ae85129  xsa345-4.11/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
3209195c1a7e8a6186b704d6bb791a3fb3c251d59e15b42bcb0ecc0d38f13a4f  xsa345-4.11/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
7e73f6c14718a0d4b25b4453b45c20bf265bd54c91b77678815be1ef7beae61f  xsa345-4.11/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
b68b82911c96feee9d05abcddf174c2f6b278829bc8c3bf3062739de8c4704b2  xsa345-4.12/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
fe2a1568a3e273ae01b3984c193e75aea16da53c6c9db27d21a2196d0f204c6e  xsa345-4.12/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
22c98f4a264bc6b15ed29da8698a733947849c16a3e9da58de88bf16986b6aad  xsa345-4.12/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
16299d885c19e1cd378a856caf8c1d1365c341bea648c0a0d5f24ae7d56015ae  xsa345-4.13/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
b820061c242c7fa4da44cbb44fa16e0d0542c16815a89699385da0c87321f7ea  xsa345-4.13/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
8a87ac2478c9bda6ef28c480b256448d51393a5e04f6e8a68ea29d9aeba92e47  xsa345-4.13/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
acf093741fecccccce0018d4a5c0f5dba367373dd1d6d04ed76ff3f178579670  xsa345-4.14/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
616f2547b4bb6d5eb9f853b1659e6e2a1fc0f67866665f4f6cdd8d763effcdfc  xsa345-4.14/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
17ae72d2af6759da17ce777e0fc9eef8f8eb6be3fe6d5b38b3589f641fc0f918  xsa345-4.14/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
65c56cb4d34ff4e97220311b303c09b54bfa44bcf4adc8e81d4a50c50eeb6b95  xsa345/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch
5512bd167c29ba7da06b2ace1397fc43ed33a362174ea927d6ca3f9bdd31748b  xsa345/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch
392524c9b0a01618e6c86a39dc1c68288065300b49548e29e9e6672947858060  xsa345/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch
$

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.


(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAmAHB6UMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZpl8IAKcA1rZlJ2EgeUvFSrwr5lq7SePZ5HUTllv90G/Q
ogyVKe3ru+lWFDOFioRUrZh4/N7LrLzKj+x9LD58knUR78gSLnoPk67yOG3Doz0g
A2DqbOihtBVITrTFgY+0RK3X735ky4PsqEHPhGI/+S2j0HgEMO1KyEJwHyodzc60
KrFV5k/eLGgL+ttKPXOYFQuGn5vCaCr749JDDh498zmzcWpLP+XbN9P+xNtteXD2
JQ0Qsj4SjjmoE9l59M9t5MhnUMIZ3vC8BnjFyRJtzpBGhuU/HvUsPkPUbLevtVC+
X+GHObsQhYq7rmLvrHExckVlq9aGJpZYMtxbZaiA/9B6Xbc=
=Hsyr
-----END PGP SIGNATURE-----


["xsa345.meta" (application/octet-stream)]
["xsa345-4.10/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From 3ecdff48c2e7f0640065df48361a52651165cfda Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fd734ff947..f6df0a41f1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5112,6 +5112,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5132,7 +5133,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5218,7 +5220,7 @@ int map_pages_to_xen(
 
             pl2e = alloc_xen_pagetable();
             if ( pl2e == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
@@ -5247,7 +5249,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5289,7 +5291,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5315,7 +5317,7 @@ int map_pages_to_xen(
 
                 pl1e = alloc_xen_pagetable();
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
@@ -5458,7 +5460,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long mfn,
-- 
2.25.1


["xsa345-4.10/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From 2ae4891f3a4428a689c9a20e08aebd662c0dcf97 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index f6df0a41f1..e644685cab 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5491,6 +5491,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5532,7 +5533,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             pl2e = alloc_xen_pagetable();
             if ( !pl2e )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5587,7 +5589,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 pl1e = alloc_xen_pagetable();
                 if ( !pl1e )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5716,7 +5719,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345-4.10/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From 24181f7635c106f8719b0eba5aac10cf747a4cec Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e644685cab..faa7df30c1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2132,6 +2132,50 @@ void page_unlock(struct page_info *page)
     } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 /*
  * PTE flags that a guest may change without re-validating the PTE.
  * All other bits affect translation, caching, or Xen's safety.
@@ -5102,6 +5146,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     unsigned long mfn,
@@ -5113,6 +5174,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5128,13 +5190,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5463,6 +5532,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5492,6 +5562,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5500,11 +5571,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5722,9 +5804,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


["xsa345-4.11/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From edbe70427e17743351f1b739ea1536acd757ae6c Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 626768a950..79a3fac3cc 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5194,6 +5194,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5214,7 +5215,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5302,7 +5304,7 @@ int map_pages_to_xen(
 
             pl2e = alloc_xen_pagetable();
             if ( pl2e == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
@@ -5331,7 +5333,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5374,7 +5376,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5401,7 +5403,7 @@ int map_pages_to_xen(
 
                 pl1e = alloc_xen_pagetable();
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
@@ -5545,7 +5547,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
-- 
2.25.1


["xsa345-4.11/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From 7101786be91dce650b6e79f1374c580c731bb348 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 79a3fac3cc..8ed3ecacbe 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5577,6 +5577,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5618,7 +5619,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             pl2e = alloc_xen_pagetable();
             if ( !pl2e )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5673,7 +5675,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 pl1e = alloc_xen_pagetable();
                 if ( !pl1e )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5802,7 +5805,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345-4.11/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From e7bbc4a0b5af76a82f0dcf4afcbf1509b020eb73 Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8ed3ecacbe..4ff24de73d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2153,6 +2153,50 @@ void page_unlock(struct page_info *page)
     current_locked_page_set(NULL);
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 /*
  * PTE flags that a guest may change without re-validating the PTE.
  * All other bits affect translation, caching, or Xen's safety.
@@ -5184,6 +5228,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5195,6 +5256,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5210,13 +5272,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5550,6 +5619,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5578,6 +5648,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5586,11 +5657,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5808,9 +5890,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


["xsa345-4.12/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From e33fad3044aaaeec6ed9914925d9558695bdb09d Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/arch/x86/mm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b4c90bd054..0e540f143b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5227,6 +5227,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5247,7 +5248,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5335,7 +5337,7 @@ int map_pages_to_xen(
 
             pl2e = alloc_xen_pagetable();
             if ( pl2e == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
@@ -5364,7 +5366,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5407,7 +5409,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5434,7 +5436,7 @@ int map_pages_to_xen(
 
                 pl1e = alloc_xen_pagetable();
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
@@ -5578,7 +5580,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
-- 
2.25.1


["xsa345-4.12/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From a151b07d504d7f442256e1c82917334216c58a21 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0e540f143b..bff2689e60 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5610,6 +5610,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5651,7 +5652,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             pl2e = alloc_xen_pagetable();
             if ( !pl2e )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5706,7 +5708,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 pl1e = alloc_xen_pagetable();
                 if ( !pl1e )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5835,7 +5838,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345-4.12/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From abaf05e183f3bc3927844ae16d6642a382c0dbef Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index bff2689e60..d6ba8c4bb4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2197,6 +2197,50 @@ void page_unlock(struct page_info *page)
     current_locked_page_set(NULL);
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 #ifdef CONFIG_PV
 /*
  * PTE flags that a guest may change without re-validating the PTE.
@@ -5217,6 +5261,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5228,6 +5289,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5243,13 +5305,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5583,6 +5652,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5611,6 +5681,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5619,11 +5690,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5841,9 +5923,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


["xsa345-4.13/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From b3e0d4e37b7902533a463812374947d4d6d2e463 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 30dffb68e8..133a393875 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5187,6 +5187,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5207,7 +5208,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5295,7 +5297,7 @@ int map_pages_to_xen(
 
             pl2e = alloc_xen_pagetable();
             if ( pl2e == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
@@ -5324,7 +5326,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5367,7 +5369,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5394,7 +5396,7 @@ int map_pages_to_xen(
 
                 pl1e = alloc_xen_pagetable();
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
@@ -5538,7 +5540,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
-- 
2.25.1


["xsa345-4.13/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From 9f6f35b833d295acaaa2d8ff8cf309bf688cfd50 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 133a393875..af726d3274 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5570,6 +5570,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5611,7 +5612,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             pl2e = alloc_xen_pagetable();
             if ( !pl2e )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(pl2e + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5666,7 +5668,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 pl1e = alloc_xen_pagetable();
                 if ( !pl1e )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&pl1e[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5795,7 +5798,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345-4.13/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From 0ff9a8453dc47cd47eee9659d5916afb5094e871 Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers it to point to an L2.
* B replaces L3[N] with a 1GiB mapping
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, that means every read would need to be considered
potentially unsafe.  Thinking carefully about that is probably
something that wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index af726d3274..d6a0761f43 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2167,6 +2167,50 @@ void page_unlock(struct page_info *page)
     current_locked_page_set(NULL);
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 #ifdef CONFIG_PV
 /*
  * PTE flags that a guest may change without re-validating the PTE.
@@ -5177,6 +5221,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5188,6 +5249,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5203,13 +5265,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5543,6 +5612,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5571,6 +5641,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_RW|_PAGE_PRESENT)
@@ -5579,11 +5650,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5801,9 +5883,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


["xsa345-4.14/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From e9c5a9ee5e2e888f8bb05cf0a353ed635300abe3 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 17 +++++++++++------
 1 file changed, 11 insertions(+), 6 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 82bc676553..03f6e6aa62 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5088,6 +5088,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5108,7 +5109,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5198,7 +5200,7 @@ int map_pages_to_xen(
 
             l2t = alloc_xen_pagetable();
             if ( l2t == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(l2t + i,
@@ -5227,7 +5229,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5271,7 +5273,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5299,7 +5301,7 @@ int map_pages_to_xen(
 
                 l1t = alloc_xen_pagetable();
                 if ( l1t == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&l1t[i],
@@ -5445,7 +5447,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
-- 
2.25.1


["xsa345-4.14/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From 8645adb7ac679e5ddc5c39e0c5c918e4a2ba5391 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 03f6e6aa62..2468347a45 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5477,6 +5477,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
@@ -5520,7 +5521,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             l2t = alloc_xen_pagetable();
             if ( !l2t )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(l2t + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5577,7 +5579,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 l1t = alloc_xen_pagetable();
                 if ( !l1t )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&l1t[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5710,7 +5713,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345-4.14/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From 6b020418d0554d9ec6eb201f50776a72db67739b Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2.
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e()
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e() on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers that it points to an L2.
* B replaces L3[N] with a 1GiB mapping.
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy
  ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy
  ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, every read would need to be considered potentially
unsafe.  Thinking carefully about that is probably something that
wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2468347a45..9c55b2b9e3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2088,6 +2088,50 @@ void page_unlock(struct page_info *page)
     current_locked_page_set(NULL);
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 #ifdef CONFIG_PV
 /*
  * PTE flags that a guest may change without re-validating the PTE.
@@ -5078,6 +5122,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5089,6 +5150,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5104,13 +5166,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5450,6 +5519,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5478,6 +5548,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
@@ -5486,11 +5557,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5716,9 +5798,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


["xsa345/0001-x86-mm-Refactor-map_pages_to_xen-to-have-only-a-sing.patch" (application/octet-stream)]

From d89e84f382b6045e311a067af6d28680e5a440ff Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:41 +0000
Subject: [PATCH 1/3] x86/mm: Refactor map_pages_to_xen to have only a single
 exit path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 19 ++++++++++++-------
 1 file changed, 12 insertions(+), 7 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d1cfc8fb4a..b909083bc2 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -275,7 +275,7 @@ static l4_pgentry_t __read_mostly split_l4e;
  * Originally cloned from share_xen_page_with_guest(), just to avoid setting
  * PGC_xen_heap on non-heap (typically) MMIO pages. Other pieces got dropped
  * simply because they're not needed in this context.
- */ 
+ */
 static void __init assign_io_page(struct page_info *page)
 {
     set_gpfn_from_mfn(mfn_x(page_to_mfn(page)), INVALID_M2P_ENTRY);
@@ -5079,6 +5079,7 @@ int map_pages_to_xen(
     l2_pgentry_t *pl2e, ol2e;
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
+    int rc = -ENOMEM;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5099,7 +5100,8 @@ int map_pages_to_xen(
         l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
 
         if ( !pl3e )
-            return -ENOMEM;
+            goto out;
+
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5189,7 +5191,7 @@ int map_pages_to_xen(
 
             l2t = alloc_xen_pagetable();
             if ( l2t == NULL )
-                return -ENOMEM;
+                goto out;
 
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(l2t + i,
@@ -5218,7 +5220,7 @@ int map_pages_to_xen(
 
         pl2e = virt_to_xen_l2e(virt);
         if ( !pl2e )
-            return -ENOMEM;
+            goto out;
 
         if ( ((((virt >> PAGE_SHIFT) | mfn_x(mfn)) &
                ((1u << PAGETABLE_ORDER) - 1)) == 0) &&
@@ -5262,7 +5264,7 @@ int map_pages_to_xen(
             {
                 pl1e = virt_to_xen_l1e(virt);
                 if ( pl1e == NULL )
-                    return -ENOMEM;
+                    goto out;
             }
             else if ( l2e_get_flags(*pl2e) & _PAGE_PSE )
             {
@@ -5290,7 +5292,7 @@ int map_pages_to_xen(
 
                 l1t = alloc_xen_pagetable();
                 if ( l1t == NULL )
-                    return -ENOMEM;
+                    goto out;
 
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&l1t[i],
@@ -5436,7 +5438,10 @@ int map_pages_to_xen(
 
 #undef flush_flags
 
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 int populate_pt_range(unsigned long virt, unsigned long nr_mfns)
-- 
2.25.1


["xsa345/0002-x86-mm-Refactor-modify_xen_mappings-to-have-one-exit.patch" (application/octet-stream)]

From 7aca90766648f1b8ebc04e0abd4d79bbd4db77a0 Mon Sep 17 00:00:00 2001
From: Wei Liu <wei.liu2@citrix.com>
Date: Sat, 11 Jan 2020 21:57:42 +0000
Subject: [PATCH 2/3] x86/mm: Refactor modify_xen_mappings to have one exit
 path

We will soon need to perform clean-ups before returning.

No functional change.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Wei Liu <wei.liu2@citrix.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Acked-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 12 +++++++++---
 1 file changed, 9 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b909083bc2..d0fc8a8142 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -5468,6 +5468,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     l1_pgentry_t *pl1e;
     unsigned int  i;
     unsigned long v = s;
+    int rc = -ENOMEM;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
@@ -5511,7 +5512,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
             /* PAGE1GB: shatter the superpage and fall through. */
             l2t = alloc_xen_pagetable();
             if ( !l2t )
-                return -ENOMEM;
+                goto out;
+
             for ( i = 0; i < L2_PAGETABLE_ENTRIES; i++ )
                 l2e_write(l2t + i,
                           l2e_from_pfn(l3e_get_pfn(*pl3e) +
@@ -5568,7 +5570,8 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
                 /* PSE: shatter the superpage and try again. */
                 l1t = alloc_xen_pagetable();
                 if ( !l1t )
-                    return -ENOMEM;
+                    goto out;
+
                 for ( i = 0; i < L1_PAGETABLE_ENTRIES; i++ )
                     l1e_write(&l1t[i],
                               l1e_from_pfn(l2e_get_pfn(*pl2e) + i,
@@ -5701,7 +5704,10 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     flush_area(NULL, FLUSH_TLB_GLOBAL);
 
 #undef FLAGS_MASK
-    return 0;
+    rc = 0;
+
+ out:
+    return rc;
 }
 
 #undef flush_area
-- 
2.25.1


["xsa345/0003-x86-mm-Prevent-some-races-in-hypervisor-mapping-upda.patch" (application/octet-stream)]

From d25359d5c42383d2bc7d1a944fab8bc039dbb442 Mon Sep 17 00:00:00 2001
From: Hongyan Xia <hongyxia@amazon.com>
Date: Sat, 11 Jan 2020 21:57:43 +0000
Subject: [PATCH 3/3] x86/mm: Prevent some races in hypervisor mapping updates

map_pages_to_xen will attempt to coalesce mappings into 2MiB and 1GiB
superpages if possible, to maximize TLB efficiency.  This means both
replacing superpage entries with smaller entries, and replacing
smaller entries with superpages.

Unfortunately, while some potential races are handled correctly,
others are not.  These include:

1. When one processor modifies a sub-superpage mapping while another
processor replaces the entire range with a superpage.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and
B.

* A walks the pagetables, gets a pointer to L2.
* B replaces L3[N] with a 1GiB mapping.
* B frees L2.
* A writes L2[M] #

This race is exacerbated by the fact that virt_to_xen_l[21]e doesn't
handle higher-level superpages properly: if you call virt_to_xen_l2e()
on a virtual address within an L3 superpage, you'll either hit a BUG()
(most likely), or get a pointer into the middle of a data page; the
same goes for virt_to_xen_l1e() on a virtual address within either an
L3 or L2 superpage.

So take the following example:

* A reads pl3e and discovers that it points to an L2.
* B replaces L3[N] with a 1GiB mapping.
* A calls virt_to_xen_l2e() and hits the BUG_ON() #

2. When two processors simultaneously try to replace a sub-superpage
mapping with a superpage mapping.

Take the following example:

Suppose L3[N] points to L2.  And suppose we have two processors, A and B,
both trying to replace L3[N] with a superpage.

* A walks the pagetables, gets a pointer to pl3e, and takes a copy
  ol3e pointing to L2.
* B walks the pagetables, gets a pointer to pl3e, and takes a copy
  ol3e pointing to L2.
* A writes the new value into L3[N]
* B writes the new value into L3[N]
* A recursively frees all the L1's under L2, then frees L2
* B recursively double-frees all the L1's under L2, then double-frees L2 #

Fix this by grabbing a lock for the entirety of the mapping update
operation.

Rather than grabbing map_pgdir_lock for the entire operation, however,
repurpose the PGT_locked bit from L3's page->type_info as a lock.
This means that rather than locking the entire address space, we
"only" lock a single 512GiB chunk of hypervisor address space at a
time.

There was a proposal for a lock-and-reverify approach, where we walk
the pagetables to the point where we decide what to do; then grab the
map_pgdir_lock, re-verify the information we collected without the
lock, and finally make the change (starting over again if anything had
changed).  Without being able to guarantee that the L2 table wasn't
freed, however, every read would need to be considered potentially
unsafe.  Thinking carefully about that is probably something that
wants to be done in public, not under time pressure.

This is part of XSA-345.

Reported-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: Hongyan Xia <hongyxia@amazon.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 92 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 89 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d0fc8a8142..00b35542d1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -2083,6 +2083,50 @@ void page_unlock(struct page_info *page)
     current_locked_page_set(NULL);
 }
 
+/*
+ * L3 table locks:
+ *
+ * Used for serialization in map_pages_to_xen() and modify_xen_mappings().
+ *
+ * For Xen PT pages, the page->u.inuse.type_info is unused and it is safe to
+ * reuse the PGT_locked flag. This lock is taken only when we move down to L3
+ * tables and below, since L4 (and above, for 5-level paging) is still globally
+ * protected by map_pgdir_lock.
+ *
+ * PV MMU update hypercalls call map_pages_to_xen while holding a page's page_lock().
+ * This has two implications:
+ * - We cannot reuse current_locked_page_* for debugging
+ * - To avoid the chance of deadlock, even for different pages, we
+ *   must never grab page_lock() after grabbing l3t_lock().  This
+ *   includes any page_lock()-based locks, such as
+ *   mem_sharing_page_lock().
+ *
+ * Also note that we grab the map_pgdir_lock while holding the
+ * l3t_lock(), so to avoid deadlock we must avoid grabbing them in
+ * reverse order.
+ */
+static void l3t_lock(struct page_info *page)
+{
+    unsigned long x, nx;
+
+    do {
+        while ( (x = page->u.inuse.type_info) & PGT_locked )
+            cpu_relax();
+        nx = x | PGT_locked;
+    } while ( cmpxchg(&page->u.inuse.type_info, x, nx) != x );
+}
+
+static void l3t_unlock(struct page_info *page)
+{
+    unsigned long x, nx, y = page->u.inuse.type_info;
+
+    do {
+        x = y;
+        BUG_ON(!(x & PGT_locked));
+        nx = x & ~PGT_locked;
+    } while ( (y = cmpxchg(&page->u.inuse.type_info, x, nx)) != x );
+}
+
 #ifdef CONFIG_PV
 /*
  * PTE flags that a guest may change without re-validating the PTE.
@@ -5069,6 +5113,23 @@ l1_pgentry_t *virt_to_xen_l1e(unsigned long v)
                          flush_area_local((const void *)v, f) : \
                          flush_area_all((const void *)v, f))
 
+#define L3T_INIT(page) (page) = ZERO_BLOCK_PTR
+
+#define L3T_LOCK(page)        \
+    do {                      \
+        if ( locking )        \
+            l3t_lock(page);   \
+    } while ( false )
+
+#define L3T_UNLOCK(page)                           \
+    do {                                           \
+        if ( locking && (page) != ZERO_BLOCK_PTR ) \
+        {                                          \
+            l3t_unlock(page);                      \
+            (page) = ZERO_BLOCK_PTR;               \
+        }                                          \
+    } while ( false )
+
 int map_pages_to_xen(
     unsigned long virt,
     mfn_t mfn,
@@ -5080,6 +5141,7 @@ int map_pages_to_xen(
     l1_pgentry_t *pl1e, ol1e;
     unsigned int  i;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
 #define flush_flags(oldf) do {                 \
     unsigned int o_ = (oldf);                  \
@@ -5095,13 +5157,20 @@ int map_pages_to_xen(
     }                                          \
 } while (0)
 
+    L3T_INIT(current_l3page);
+
     while ( nr_mfns != 0 )
     {
-        l3_pgentry_t ol3e, *pl3e = virt_to_xen_l3e(virt);
+        l3_pgentry_t *pl3e, ol3e;
 
+        L3T_UNLOCK(current_l3page);
+
+        pl3e = virt_to_xen_l3e(virt);
         if ( !pl3e )
             goto out;
 
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
         ol3e = *pl3e;
 
         if ( cpu_has_page1gb &&
@@ -5441,6 +5510,7 @@ int map_pages_to_xen(
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
@@ -5469,6 +5539,7 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     unsigned int  i;
     unsigned long v = s;
     int rc = -ENOMEM;
+    struct page_info *current_l3page;
 
     /* Set of valid PTE bits which may be altered. */
 #define FLAGS_MASK (_PAGE_NX|_PAGE_DIRTY|_PAGE_ACCESSED|_PAGE_RW|_PAGE_PRESENT)
@@ -5477,11 +5548,22 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     ASSERT(IS_ALIGNED(s, PAGE_SIZE));
     ASSERT(IS_ALIGNED(e, PAGE_SIZE));
 
+    L3T_INIT(current_l3page);
+
     while ( v < e )
     {
-        l3_pgentry_t *pl3e = virt_to_xen_l3e(v);
+        l3_pgentry_t *pl3e;
+
+        L3T_UNLOCK(current_l3page);
 
-        if ( !pl3e || !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
+        pl3e = virt_to_xen_l3e(v);
+        if ( !pl3e )
+            goto out;
+
+        current_l3page = virt_to_page(pl3e);
+        L3T_LOCK(current_l3page);
+
+        if ( !(l3e_get_flags(*pl3e) & _PAGE_PRESENT) )
         {
             /* Confirm the caller isn't trying to create new mappings. */
             ASSERT(!(nf & _PAGE_PRESENT));
@@ -5707,9 +5789,13 @@ int modify_xen_mappings(unsigned long s, unsigned long e, unsigned int nf)
     rc = 0;
 
  out:
+    L3T_UNLOCK(current_l3page);
     return rc;
 }
 
+#undef L3T_LOCK
+#undef L3T_UNLOCK
+
 #undef flush_area
 
 int destroy_xen_mappings(unsigned long s, unsigned long e)
-- 
2.25.1


