List:       oss-security
Subject:    [oss-security] Xen Security Advisory 299 v4 (CVE-2019-18421) - Issues with restartable PV type change operations
From:       Xen.org security team <security@xen.org>
Date:       2019-10-31 12:29:28
Message-ID: E1iQ9Zw-0002vd-B1@xenbits.xenproject.org

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA256

            Xen Security Advisory CVE-2019-18421 / XSA-299
                               version 4

           Issues with restartable PV type change operations

UPDATES IN VERSION 4
====================

Public release.

ISSUE DESCRIPTION
=================

To avoid using shadow pagetables for PV guests, Xen exposes the actual
hardware pagetables to the guest.  In order to prevent the guest from
modifying these page tables directly, Xen tracks how pages are
used via a type system; pages must be "promoted" before being used
as a pagetable, and "demoted" before being used for any other type.
Xen also allows for "recursive" promotions: i.e., an operating system
promoting a page to an L4 pagetable may end up causing pages to be
promoted to L3s, which may in turn cause pages to be promoted to L2s,
and so on.  These operations may take an arbitrarily large amount of
time, and so must be re-startable.

Unfortunately, making recursive pagetable promotion and demotion
operations restartable is incredibly complicated, and the code
contains several races which, if triggered, can cause Xen to drop or
retain extra type counts, potentially allowing guests to get write
access to in-use pagetables.
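
For illustration, the type-count discipline described above can be
modelled in a few lines of stand-alone C.  This is a deliberately
simplified sketch, not Xen's code: real promotion also validates
every entry of the page recursively and is preemptible, which is
where the races described above live.  The PGT_* names merely echo
Xen's constants.

    /* Toy model of the PV page type discipline (illustrative only). */
    #include <stdio.h>

    enum pgtype { PGT_none, PGT_writable, PGT_l1, PGT_l2, PGT_l3, PGT_l4 };

    struct frame {
        enum pgtype type;
        unsigned int type_count;
    };

    /* Promote 'f' to 'type'; refused if in use as another type. */
    static int promote(struct frame *f, enum pgtype type)
    {
        if ( f->type_count == 0 )
            f->type = type;          /* type may only change at count 0 */
        else if ( f->type != type )
            return -1;               /* already in use as another type  */
        f->type_count++;
        return 0;
    }

    /* Demote: drop one type ref; at zero the type may change again. */
    static void demote(struct frame *f)
    {
        if ( --f->type_count == 0 )
            f->type = PGT_none;
    }

    int main(void)
    {
        struct frame pt = { PGT_none, 0 };

        promote(&pt, PGT_l1);                    /* now an L1 pagetable */
        printf("as L1: %d\n", promote(&pt, PGT_writable));  /* -1       */
        demote(&pt);                             /* count back to zero  */
        printf("as RW: %d\n", promote(&pt, PGT_writable));  /* 0        */
        return 0;
    }

The bugs in XSA-299 amount to this invariant (write access only at
type_count zero for pagetable types) being violated when a recursive,
restartable promotion or demotion races with another operation.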

IMPACT
======

A malicious PV guest administrator may be able to escalate their
privilege to that of the host.

VULNERABLE SYSTEMS
==================

All x86 systems with untrusted PV guests are vulnerable.

HVM and PVH guests cannot exercise this vulnerability.
ARM systems are not vulnerable because ARM guests are all PVH.

All security-supported Xen versions are vulnerable.

Note that these attacks require very precise timing, which may
be difficult to exploit in practice.

MITIGATION
==========

Running only HVM or PVH guests will avoid this vulnerability.

Running PV guests in "shim" mode will also avoid this vulnerability.
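
As a concrete (illustrative) example, a guest previously run as plain
PV can be switched to shim mode from its xl configuration; this
assumes a toolstack new enough to know the pvshim option (Xen 4.10 or
later), and the name and paths below are placeholders:

    name   = "guest1"
    type   = "pvh"      # the guest's outer container is PVH...
    pvshim = 1          # ...with an embedded shim providing the PV ABI
    kernel = "/path/to/guest-kernel"
    memory = 2048
    vcpus  = 2
    disk   = [ "format=raw, vdev=xvda, target=/path/to/disk.img" ]

In this mode the PV guest's pagetable operations are handled by its
own per-guest shim Xen rather than by the host hypervisor.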

CREDITS
=======

This issue was discovered by George Dunlap of Citrix.

RESOLUTION
==========

Applying the appropriate attached patch resolves this issue.

xsa299/*.patch           xen-unstable
xsa299-4.12/*.patch      Xen 4.12.x
xsa299-4.11/*.patch      Xen 4.11.x
xsa299-4.10/*.patch      Xen 4.10.x
xsa299-4.9/*.patch       Xen 4.9.x
xsa299-4.8/*.patch       Xen 4.8.x

$ sha256sum xsa299* xsa299*/*
687fb0f3273a424726edb4d249b79cfc45d1ef7000610405b11eaac49baecaa8  xsa299.meta
6c8f46e57f61a5e1e2e5e628a32e4c9ae144218ce475309811bb9900d3fdda48  xsa299-4.8/0001-x86-mm-Clean-up-trailing-whitespace.patch
3409e71ed7bc199bcda33892ea6f70fe257c4f3906d74b4a6f4352415daeedb0  xsa299-4.8/0002-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
1179fe0f1a591c542478bf8614501f8ddb67e342d7d452f6bff3b6a999f2b20f  xsa299-4.8/0003-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
bc0352a1d82079c4072cc3871d0d397f7abb3c0480dfc3c5c542091d2ec7d7b0  xsa299-4.8/0004-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
2b96857ef3e0f8259df7ad01600f1c30ca234668d6f26744c2ae0d3d7dded090  xsa299-4.8/0005-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
fe119a8255e23a86845fa1ac5f93afa25acdaff705061c172ea9e0589b0bc1a4  xsa299-4.8/0006-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
562415d5fdb4e173443a2aa211094743a722ef1fe5a2d19c59cb3d329e101984  xsa299-4.8/0007-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
454296ac46ea5feea8866101e7c953bf6dbd37a5275f7b006eeb6d22cbae387d  xsa299-4.8/0008-x86-mm-Always-retain-a-general-ref-on-partial.patch
f203a70da67f304c2ede516ef989b58ace6774eeee4eca919631c75f09860ba3  xsa299-4.8/0009-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
1f4877c10ead99c51d822d29ebaed9774cdb97cca869fe1a1ccf905540e291c7  xsa299-4.8/0010-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
733d260d731cce9902d66dc5b42ae9d10a319acda6dadcc426b6dfeba6e917da  xsa299-4.8/0011-x86-mm-Fix-nested-de-validation-on-error.patch
cd105c15e2fd915644cb7d31000df60e51d1054a807b575d5436ccb87c1e9a18  xsa299-4.8/0012-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
d8db456679e652f5a33a0a448d379e3a88b0cf7ce1415ee46007873cfb6f49b7  xsa299-4.9/0001-x86-mm-Clean-up-trailing-whitespace.patch
e54df901b5f13d70643938ff365a09a43725637511251efc3ac55c45b80016f5  xsa299-4.9/0002-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
8da540f32ff77f5871f646a6ef2847bc3adc2aecfa4698dcec4335b72e758616  xsa299-4.9/0003-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
e97044ffb5edcc7f1094dd47e365f2f29971cacf784d8aaa9a0e42f770ca899d  xsa299-4.9/0004-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
53977fd090d488f484e6191c6b68cbc59f771d8cf4aeb230b7b9f8ddc891a58e  xsa299-4.9/0005-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
d10b9d434d341ac380e8a9c6fc4b3ddec8baf8dec9d565c2e66867f8d05497ba  xsa299-4.9/0006-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
7e01debdbe59cfa734e63b5c9d5c2799aa25f961f0d065ce8c8bdb64d577b164  xsa299-4.9/0007-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
12f0732907547367645db6300cff959f15118b91503165dc2c66083769ac7e56  xsa299-4.9/0008-x86-mm-Always-retain-a-general-ref-on-partial.patch
06044bf56130dd845e08ed9af75f4aade186d48b1cea88d7862026bbe0bf51af  xsa299-4.9/0009-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
2fea704a716d6ff8a589fba7bf5d71443e2b52f41f591f8173d50dcb3ba9a94b  xsa299-4.9/0010-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
4bcfd94bdd77726e8ea1069081f5f544705b22752a185ee4e1f58c730a902b74  xsa299-4.9/0011-x86-mm-Fix-nested-de-validation-on-error.patch
580fa03182e40f122e3d21a5c71183b6a9500eae2afba490cf43514b75e15062  xsa299-4.9/0012-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
c3bde8f42e75c0f98c22938267f947d4729e7372510dededa3750699ac8cb2f5  xsa299-4.10/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
0794fd0d20d71367977926f2393e354d4a43452a51f421616fa413acd68bf24a  xsa299-4.10/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
0591cd2fa566fcec43e2aa6e1cfb92629c816e55c7548b2534c5a7a84505cd06  xsa299-4.10/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
736966986c43bcdfcbf337fc87af6f430458bad5d105b33f7dfa0a1eb72f2416  xsa299-4.10/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
416db71e950838dbf5d024ae9ba8bb6e6685314608543fd8df0516db7786b811  xsa299-4.10/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
7d84aaf129401faa863565df084e776413dd07ec440c1a67db961b8a147651a4  xsa299-4.10/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
59d37dc3cfd811bcbbedb72ca9d80eb2d460dce4e373e581c88fdb6b874b4111  xsa299-4.10/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch
746156888f0dc4a75164cd668dd05fdf3d9b11cc96205785384f84ebcd1df4ae  xsa299-4.10/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
bcc54d2b0653e584c89c0d219d5cd82e94c2629033ea8f1b22dfd3f373267bf5  xsa299-4.10/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
4829ba66647d344f1eaad632fddab4c8c51db513d1ae18385dec195b86e76936  xsa299-4.10/0010-x86-mm-Fix-nested-de-validation-on-error.patch
7ad0b06d2748da4e4b317f4cc8c829c7fb451bf86ad778d97d231acff7cfd940  xsa299-4.10/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
225fec9475b5992338ce19da982a759b3a551c653dbbb280295b00018a107d28  xsa299-4.11/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
fa910f573bde107b90fef4568fa500bf875d7303ac93642ed8a135d639bf7f0e  xsa299-4.11/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
f5fcf8ab6940d85fe43de61463ff00bcf17a22b94da4f2b28fa45d714b0255d0  xsa299-4.11/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
e1e49d767f08889b518423935869332a40f87e824bb93a0c2707f1f99e9f0328  xsa299-4.11/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
c0f5ce00516491b1f3d2eccf25fbd67d409d855e3d4b423490f1bc37b4477e87  xsa299-4.11/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
4562543c497c17cc3a793f67a75824043ca3dea69ccc456bf9f5546825282f0e  xsa299-4.11/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
90bc777691225eb4c55804702c2cd7f2913317b13334c27b9437ee60be672cca  xsa299-4.11/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch
7903c9599ee47dc05647e5ec7a6ce3fe5e6331b527551286897429e97cf56f61  xsa299-4.11/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
c1ae9bad93e11a4a9253265318b67b45865e566b17ddd7f167bb88197a9b700c  xsa299-4.11/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
49a21bd396ab4af6b82aaa38dac733f4fde806587b5b126cd656f725b9c8eee7  xsa299-4.11/0010-x86-mm-Fix-nested-de-validation-on-error.patch
09df369fa52335e3e560af593d4e9843bab1da24aa1b4c905f9ea1ce8441af6e  xsa299-4.11/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
d27f07eb0020181487ec9dda15c6331125d6b0505fdce1ae67c0a9b524159e11  xsa299-4.12/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
00c2fb77366c427e226315cfb1cda1c67ce495ec8a0b400ff30924bc399bf283  xsa299-4.12/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
bc88c216e438af9e1dddf1e5374fd1c78c9867e8908ba3016c72d999aebaea4b  xsa299-4.12/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
cc6416c6311be82a2b89d5b14ceb9ecc6cb92ce9286bb03b91083c661186d28d  xsa299-4.12/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
732fbb80a6fc6364945e1b6534c921d503e2369c3cd25f425096549b71f75fa0  xsa299-4.12/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
20e37b3712b66111193bed02b368aff2ee0e7896dd55b5e6c928fbc97ec618b3  xsa299-4.12/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
20bec098f3ad474093ce33e4ae5e8cee5ff9f8504107c8a4ff76f2731abbab13  xsa299-4.12/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch
71addb8014eeb51a6adc4377aaa4b74ac611a28a6f62865f7020a536a1a9cbc5  xsa299-4.12/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
71bd7d75f7878571d4ea4351ea10f487a1c1a86765f67c85a25308d5df24a40e  xsa299-4.12/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
1e58d49f72c1eb158db08a17a3805e2144c0d468b6388a9a8795b67f80a699a5  xsa299-4.12/0010-x86-mm-Fix-nested-de-validation-on-error.patch
67594f941f8cecbc0ff87dfedbdbd43f4e4234d049c1a5d62143153ae96954c1  xsa299-4.12/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
08179d90ea327bca328f3a45198c31166df2aa6fb459b148dd74c716c1d5bb88  xsa299/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch
d37e7b4dd3c9d7da14a287d9fe6807f81d95bba8bdab79b729ed5aa3350fad70  xsa299/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch
660fc01fb09aee7628d65d7893ec11bf77cfe79543e390656b59f0e60334d058  xsa299/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch
2dc6ad4233ec572ba21632ab80b6149541f3169affb792e31930e3f7c6e72fc6  xsa299/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch
175fd90422bf00879de2129cd1a86bbdeb1c15ff344d286ab9634bc3f1512c03  xsa299/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch
afa26c8850085412a787d7f0cb3031f15181ee2c9b3b1a9b4a007bff7404457f  xsa299/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch
6f0502b2377db2115faf9c7bcbf35898013dcec74170950c3aa7a0586ff1e174  xsa299/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch
787c3eeaadfed46947fb17773fa8f9e9efe891658d7460eaf5291a4ca6155123  xsa299/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch
77341c4d0ab62fbb7090d2a6b60902467563ae470ac0807ef40a3ac791d2933a  xsa299/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch
e489f49f8783fb388161365072da585c049e05d80306cf963cec5ecbb3bc67c7  xsa299/0010-x86-mm-Fix-nested-de-validation-on-error.patch
17b9ae71c150747bff4d57eee8a918b1961e880e25ae2b9c0dbe933e005cb1a0  xsa299/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch
$

NOTE ON RESOLUTION
==================

Even with these fixes, the code is still very complicated.  After the
embargo is up, we plan to try getting rid of automatic recursive
pagetable promotion entirely, instead requiring guest operating
systems to promote pages one-by-one themselves.  This would obviate
the need to have restartable operations, greatly simplifying the
reference counting code.
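
As a rough sketch of what "one-by-one" promotion could look like from
the guest side, the fragment below pins each L1 before pinning the L2
that references them, using the existing MMUEXT pin operations, so
that no single hypercall has to perform deep recursive validation.
This is purely illustrative: the replacement interface has not yet
been designed, and the include paths and hypercall wrapper shown are
those of a typical Linux PV guest.

    #include <xen/interface/xen.h>   /* struct mmuext_op, MMUEXT_PIN_* */
    #include <asm/xen/hypercall.h>   /* HYPERVISOR_mmuext_op()         */

    static int pin_table(unsigned long mfn, unsigned int cmd)
    {
        struct mmuext_op op = {
            .cmd = cmd,              /* e.g. MMUEXT_PIN_L1_TABLE */
            .arg1.mfn = mfn,
        };

        return HYPERVISOR_mmuext_op(&op, 1, NULL, DOMID_SELF);
    }

    /* Promote an L2 whose slots already point at validated L1s: pin
     * each L1 first, then the L2 itself. */
    static int promote_l2_bottom_up(unsigned long l2_mfn,
                                    const unsigned long *l1_mfns,
                                    unsigned int n)
    {
        unsigned int i;
        int rc;

        for ( i = 0; i < n; i++ )
            if ( (rc = pin_table(l1_mfns[i], MMUEXT_PIN_L1_TABLE)) )
                return rc;

        return pin_table(l2_mfn, MMUEXT_PIN_L2_TABLE);
    }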

DEPLOYMENT DURING EMBARGO
=========================

Deployment of the patches and/or mitigations described above (or
others which are substantially similar) is permitted during the
embargo, even on public-facing systems with untrusted guest users and
administrators.

But: Distribution of updated software is prohibited (except to other
members of the predisclosure list).

Predisclosure list members who wish to deploy significantly different
patches and/or mitigations, please contact the Xen Project Security
Team.

(Note: this during-embargo deployment notice is retained in
post-embargo publicly released Xen Project advisories, even though it
is then no longer applicable.  This is to enable the community to have
oversight of the Xen Project Security Team's decisionmaking.)

For more information about permissible uses of embargoed information,
consult the Xen Project community's agreed Security Policy:
  http://www.xenproject.org/security-policy.html
-----BEGIN PGP SIGNATURE-----

iQFABAEBCAAqFiEEI+MiLBRfRHX6gGCng/4UyVfoK9kFAl2601kMHHBncEB4ZW4u
b3JnAAoJEIP+FMlX6CvZiYAIAMx46nYNIJ5KwV4rCKkBW1O/EDOc5dqt9PjlIKWR
PbJ4rrs9ZObvRh1Xw7nNM/leexHNYClWGAGPp/pLOyfF4nw/9B13jMF0C39vP4Fd
FMzM0jKyZreWTU38NqkrAVHbawyZNkS//1PITZy6LvA+DvwsHBz34qFsUX8Fw3vd
pu7izoozEFCzTie0zrUqwKV7yIyJ+3u3b/SjGuou0nxrbyIGuz/HIxazcFxJWwZh
4Zww3yKWMvXVedg8a2ZP5Fi+8+ePurOKz6g48gOWYefCPYXASrEaAf6s2WUp9Yi1
akddy2WIHzqd3HfOqEVKE5y8bjVvEft7mOIqOVeJBpEzh1s=
=633F
-----END PGP SIGNATURE-----


["xsa299.meta" (application/octet-stream)]
["xsa299-4.8/0001-x86-mm-Clean-up-trailing-whitespace.patch" (application/octet-stream)]

From 9dfb5597c5b51222ba9b2593d8fa2c83761be0e4 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Wed, 16 Oct 2019 09:46:35 +0100
Subject: [PATCH 01/12] x86/mm: Clean up trailing whitespace

Sometime between 4.9 and 4.10 someone cleaned up all the trailing
whitespace in mm.c; applying this patch now makes all further patches
much cleaner.

No functional change.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/arch/x86/mm.c | 118 +++++++++++++++++++++++-----------------------
 1 file changed, 59 insertions(+), 59 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9abdf15384..5c9db3f898 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1,48 +1,48 @@
 /******************************************************************************
  * arch/x86/mm.c
- * 
+ *
  * Copyright (c) 2002-2005 K A Fraser
  * Copyright (c) 2004 Christian Limpach
- * 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
  * (at your option) any later version.
- * 
+ *
  * This program is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- * 
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
 /*
  * A description of the x86 page table API:
- * 
+ *
  * Domains trap to do_mmu_update with a list of update requests.
  * This is a list of (ptr, val) pairs, where the requested operation
  * is *ptr = val.
- * 
+ *
  * Reference counting of pages:
  * ----------------------------
  * Each page has two refcounts: tot_count and type_count.
- * 
+ *
  * TOT_COUNT is the obvious reference count. It counts all uses of a
  * physical page frame by a domain, including uses as a page directory,
  * a page table, or simple mappings via a PTE. This count prevents a
  * domain from releasing a frame back to the free pool when it still holds
  * a reference to it.
- * 
+ *
  * TYPE_COUNT is more subtle. A frame can be put to one of three
  * mutually-exclusive uses: it might be used as a page directory, or a
  * page table, or it may be mapped writable by the domain [of course, a
  * frame may not be used in any of these three ways!].
- * So, type_count is a count of the number of times a frame is being 
+ * So, type_count is a count of the number of times a frame is being
  * referred to in its current incarnation. Therefore, a page can only
  * change its type when its type count is zero.
- * 
+ *
  * Pinning the page type:
  * ----------------------
  * The type of a page can be pinned/unpinned with the commands
@@ -51,20 +51,20 @@
  * This is useful to prevent a page's type count falling to zero, at which
  * point safety checks would need to be carried out next time the count
  * is increased again.
- * 
+ *
  * A further note on writable page mappings:
  * -----------------------------------------
  * For simplicity, the count of writable mappings for a page may not
  * correspond to reality. The 'writable count' is incremented for every
  * PTE which maps the page with the _PAGE_RW flag set. However, for
  * write access to be possible the page directory entry must also have
- * its _PAGE_RW bit set. We do not check this as it complicates the 
+ * its _PAGE_RW bit set. We do not check this as it complicates the
  * reference counting considerably [consider the case of multiple
  * directory entries referencing a single page table, some with the RW
  * bit set, others not -- it starts getting a bit messy].
  * In normal use, this simplification shouldn't be a problem.
  * However, the logic can be added if required.
- * 
+ *
  * One more note on read-only page mappings:
  * -----------------------------------------
  * We want domains to be able to map pages for read-only access. The
@@ -73,10 +73,10 @@
  * However, domains have free access to rings 1 & 2 of the Intel
  * privilege model. In terms of page protection, these are considered
  * to be part of 'supervisor mode'. The WP bit in CR0 controls whether
- * read-only restrictions are respected in supervisor mode -- if the 
+ * read-only restrictions are respected in supervisor mode -- if the
  * bit is clear then any mapped page is writable.
- * 
- * We get round this by always setting the WP bit and disallowing 
+ *
+ * We get round this by always setting the WP bit and disallowing
  * updates to it. This is very unlikely to cause a problem for guest
  * OS's, which will generally use the WP bit to simplify copy-on-write
  * implementation (in that case, OS wants a fault when it writes to
@@ -314,7 +314,7 @@ void __init arch_init_memory(void)
      */
     dom_io = domain_create(DOMID_IO, DOMCRF_dummy, 0, NULL);
     BUG_ON(IS_ERR(dom_io));
-    
+
     /*
      * Initialise our COW domain.
      * This domain owns sharable pages.
@@ -325,7 +325,7 @@ void __init arch_init_memory(void)
     /* First 1MB of RAM is historically marked as I/O. */
     for ( i = 0; i < 0x100; i++ )
         share_xen_page_with_guest(mfn_to_page(i), dom_io, XENSHARE_writable);
- 
+
     /* Any areas not specified as RAM by the e820 map are considered I/O. */
     for ( i = 0, pfn = 0; pfn < max_page; i++ )
     {
@@ -355,7 +355,7 @@ void __init arch_init_memory(void)
          */
         iostart_pfn = max_t(unsigned long, pfn, 1UL << (20 - PAGE_SHIFT));
         ioend_pfn = min(rstart_pfn, 16UL << (20 - PAGE_SHIFT));
-        if ( iostart_pfn < ioend_pfn )            
+        if ( iostart_pfn < ioend_pfn )
             destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
                                  (unsigned long)mfn_to_virt(ioend_pfn));
 
@@ -443,7 +443,7 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
             /* unknown */
             continue;
         }
-        
+
         /* Test the range. */
         if ( (e820.map[i].addr <= maddr) &&
              ((e820.map[i].addr + e820.map[i].size) >= (maddr + PAGE_SIZE)) )
@@ -546,7 +546,7 @@ void write_ptbase(struct vcpu *v)
 
 /*
  * Should be called after CR3 is updated.
- * 
+ *
  * Uses values found in vcpu->arch.(guest_table and guest_table_user), and
  * for HVM guests, arch.monitor_table and hvm's guest CR3.
  *
@@ -756,7 +756,7 @@ static int get_page_from_pagenr(unsigned long page_nr, struct domain *d)
 static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible);
 
-static int get_page_and_type_from_pagenr(unsigned long page_nr, 
+static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
                                          struct domain *d,
                                          int partial,
@@ -1061,7 +1061,7 @@ get_page_from_l1e(
         {
             if ( mfn != (PADDR_MASK >> PAGE_SHIFT) ) /* INVALID_MFN? */
             {
-                MEM_LOG("Non-privileged (%u) attempt to map I/O space %08lx", 
+                MEM_LOG("Non-privileged (%u) attempt to map I/O space %08lx",
                         pg_owner->domain_id, mfn);
                 return -EPERM;
             }
@@ -1154,7 +1154,7 @@ get_page_from_l1e(
         pg_owner = real_pg_owner;
     }
 
-    /* Extra paranoid check for shared memory. Writable mappings 
+    /* Extra paranoid check for shared memory. Writable mappings
      * disallowed (unshare first!) */
     if ( (l1f & _PAGE_RW) && (real_pg_owner == dom_cow) )
         goto could_not_pin;
@@ -1398,12 +1398,12 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
      * Check if this is a mapping that was established via a grant reference.
      * If it was then we should not be here: we require that such mappings are
      * explicitly destroyed via the grant-table interface.
-     * 
+     *
      * The upshot of this is that the guest can end up with active grants that
      * it cannot destroy (because it no longer has a PTE to present to the
      * grant-table interface). This can lead to subtle hard-to-catch bugs,
      * hence a special grant PTE flag can be enabled to catch the bug early.
-     * 
+     *
      * (Note that the undestroyable active grants are not a security hole in
      * Xen. All active grants can safely be cleaned up when the domain dies.)
      */
@@ -1417,7 +1417,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
 
     /* Remember we didn't take a type-count of foreign writable mappings
      * to paging-external domains */
-    if ( (l1e_get_flags(l1e) & _PAGE_RW) && 
+    if ( (l1e_get_flags(l1e) & _PAGE_RW) &&
          ((l1e_owner == pg_owner) || !paging_mode_external(pg_owner)) )
     {
         put_page_and_type(page);
@@ -1425,7 +1425,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
     else
     {
         /* We expect this is rare so we blow the entire shadow LDT. */
-        if ( unlikely(((page->u.inuse.type_info & PGT_type_mask) == 
+        if ( unlikely(((page->u.inuse.type_info & PGT_type_mask) ==
                        PGT_seg_desc_page)) &&
              unlikely(((page->u.inuse.type_info & PGT_count_mask) != 0)) &&
              (l1e_owner == pg_owner) )
@@ -1527,7 +1527,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 {
     int rc = 1;
 
-    if ( (l4e_get_flags(l4e) & _PAGE_PRESENT) && 
+    if ( (l4e_get_flags(l4e) & _PAGE_PRESENT) &&
          (l4e_get_pfn(l4e) != pfn) )
     {
         struct page_info *pg = l4e_get_page(l4e);
@@ -2085,8 +2085,8 @@ void page_unlock(struct page_info *page)
 
 /* How to write an entry to the guest pagetables.
  * Returns 0 for failure (pointer not valid), 1 for success. */
-static inline int update_intpte(intpte_t *p, 
-                                intpte_t old, 
+static inline int update_intpte(intpte_t *p,
+                                intpte_t old,
                                 intpte_t new,
                                 unsigned long mfn,
                                 struct vcpu *v,
@@ -2257,8 +2257,8 @@ static int mod_l1_entry(l1_pgentry_t *pl1e, l1_pgentry_t nl1e,
 
 
 /* Update the L2 entry at pl2e to new value nl2e. pl2e is within frame pfn. */
-static int mod_l2_entry(l2_pgentry_t *pl2e, 
-                        l2_pgentry_t nl2e, 
+static int mod_l2_entry(l2_pgentry_t *pl2e,
+                        l2_pgentry_t nl2e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2321,8 +2321,8 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
 }
 
 /* Update the L3 entry at pl3e to new value nl3e. pl3e is within frame pfn. */
-static int mod_l3_entry(l3_pgentry_t *pl3e, 
-                        l3_pgentry_t nl3e, 
+static int mod_l3_entry(l3_pgentry_t *pl3e,
+                        l3_pgentry_t nl3e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2394,8 +2394,8 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
 }
 
 /* Update the L4 entry at pl4e to new value nl4e. pl4e is within frame pfn. */
-static int mod_l4_entry(l4_pgentry_t *pl4e, 
-                        l4_pgentry_t nl4e, 
+static int mod_l4_entry(l4_pgentry_t *pl4e,
+                        l4_pgentry_t nl4e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2560,7 +2560,7 @@ struct domain *page_get_owner_and_reference(struct page_info *page)
         x = y;
         /*
          * Count ==  0: Page is not allocated, so we cannot take a reference.
-         * Count == -1: Reference count would wrap, which is invalid. 
+         * Count == -1: Reference count would wrap, which is invalid.
          * Count == -2: Remaining unused ref is reserved for get_page_light().
          */
         if ( unlikely(((x + 2) & PGC_count_mask) <= 2) )
@@ -2648,7 +2648,7 @@ static int alloc_page_type(struct page_info *page, unsigned long type,
         rc = alloc_segdesc_page(page);
         break;
     default:
-        printk("Bad type in alloc_page_type %lx t=%" PRtype_info " c=%lx\n", 
+        printk("Bad type in alloc_page_type %lx t=%" PRtype_info " c=%lx\n",
                type, page->u.inuse.type_info,
                page->count_info);
         rc = -EINVAL;
@@ -2904,8 +2904,8 @@ static int __get_page_type(struct page_info *page, unsigned long type,
             if ( (x & PGT_type_mask) != type )
             {
                 /*
-                 * On type change we check to flush stale TLB entries. This 
-                 * may be unnecessary (e.g., page was GDT/LDT) but those 
+                 * On type change we check to flush stale TLB entries. This
+                 * may be unnecessary (e.g., page was GDT/LDT) but those
                  * circumstances should be very rare.
                  */
                 cpumask_t mask;
@@ -3841,7 +3841,7 @@ long do_mmuext_op(
             else
                 rc = -EPERM;
             break;
-    
+
         case MMUEXT_INVLPG_ALL:
             if ( unlikely(d != pg_owner) )
                 rc = -EPERM;
@@ -4380,7 +4380,7 @@ static int create_grant_pte_mapping(
         MEM_LOG("Could not get page for normal update");
         return GNTST_general_error;
     }
-    
+
     mfn = page_to_mfn(page);
     va = map_domain_page(_mfn(mfn));
     va = (void *)((unsigned long)va + ((unsigned long)pte_addr & ~PAGE_MASK));
@@ -4404,7 +4404,7 @@ static int create_grant_pte_mapping(
         page_unlock(page);
         rc = GNTST_general_error;
         goto failed;
-    } 
+    }
 
     page_unlock(page);
 
@@ -4446,7 +4446,7 @@ static int destroy_grant_pte_mapping(
         MEM_LOG("Could not get page for normal update");
         return GNTST_general_error;
     }
-    
+
     mfn = page_to_mfn(page);
     va = map_domain_page(_mfn(mfn));
     va = (void *)((unsigned long)va + ((unsigned long)addr & ~PAGE_MASK));
@@ -4465,7 +4465,7 @@ static int destroy_grant_pte_mapping(
     }
 
     ol1e = *(l1_pgentry_t *)va;
-    
+
     /*
      * Check that the PTE supplied actually maps frame (with appropriate
      * permissions).
@@ -4489,8 +4489,8 @@ static int destroy_grant_pte_mapping(
 
     /* Delete pagetable entry. */
     if ( unlikely(!UPDATE_ENTRY
-                  (l1, 
-                   (l1_pgentry_t *)va, ol1e, l1e_empty(), mfn, 
+                  (l1,
+                   (l1_pgentry_t *)va, ol1e, l1e_empty(), mfn,
                    d->vcpu[0] /* Change if we go to per-vcpu shadows. */,
                    0)) )
     {
@@ -4517,7 +4517,7 @@ static int create_grant_va_mapping(
     unsigned long gl1mfn;
     struct page_info *l1pg;
     int okay;
-    
+
     adjust_guest_l1e(nl1e, d);
 
     pl1e = guest_map_l1e(va, &gl1mfn);
@@ -4570,7 +4570,7 @@ static int replace_grant_va_mapping(
     unsigned long gl1mfn;
     struct page_info *l1pg;
     int rc = 0;
-    
+
     pl1e = guest_map_l1e(addr, &gl1mfn);
     if ( !pl1e )
     {
@@ -4667,7 +4667,7 @@ static int create_grant_p2m_mapping(uint64_t addr, unsigned long frame,
         return GNTST_okay;
 }
 
-int create_grant_host_mapping(uint64_t addr, unsigned long frame, 
+int create_grant_host_mapping(uint64_t addr, unsigned long frame,
                               unsigned int flags, unsigned int cache_flags)
 {
     l1_pgentry_t pte;
@@ -4736,7 +4736,7 @@ int replace_grant_host_mapping(
     struct page_info *l1pg;
     int rc;
     unsigned int grant_pte_flags;
-    
+
     if ( paging_mode_external(current->domain) )
         return replace_grant_p2m_mapping(addr, frame, new_addr, flags);
 
@@ -4762,7 +4762,7 @@ int replace_grant_host_mapping(
         if ( !new_addr )
             return destroy_grant_pte_mapping(addr, frame, grant_pte_flags,
                                              curr->domain);
-        
+
         MEM_LOG("Unsupported grant table operation");
         return GNTST_general_error;
     }
@@ -5152,7 +5152,7 @@ void destroy_gdt(struct vcpu *v)
 }
 
 
-long set_gdt(struct vcpu *v, 
+long set_gdt(struct vcpu *v,
              unsigned long *frames,
              unsigned int entries)
 {
@@ -5214,7 +5214,7 @@ long do_set_gdt(XEN_GUEST_HANDLE_PARAM(xen_ulong_t) frame_list,
     /* Rechecked in set_gdt, but ensures a sane limit for copy_from_user(). */
     if ( entries > FIRST_RESERVED_GDT_ENTRY )
         return -EINVAL;
-    
+
     if ( copy_from_guest(frames, frame_list, nr_pages) )
         return -EFAULT;
 
@@ -5504,7 +5504,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             rcu_unlock_domain(d);
             return -ENOMEM;
         }
-        
+
         if ( copy_from_guest(e820, fmap.map.buffer, fmap.map.nr_entries) )
         {
             xfree(e820);
@@ -5656,7 +5656,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 rc = -EINVAL;
                 goto pod_target_out_unlock;
             }
-            
+
             rc = p2m_pod_set_mem_target(d, target.target_pages);
         }
 
@@ -5678,7 +5678,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 goto pod_target_out_unlock;
             }
         }
-        
+
     pod_target_out_unlock:
         rcu_unlock_domain(d);
         return rc;
@@ -5914,7 +5914,7 @@ static const struct x86_emulate_ops ptwr_emulate_ops = {
 };
 
 /* Write page fault handler: check if guest is trying to modify a PTE. */
-int ptwr_do_page_fault(struct vcpu *v, unsigned long addr, 
+int ptwr_do_page_fault(struct vcpu *v, unsigned long addr,
                        struct cpu_user_regs *regs)
 {
     struct domain *d = v->domain;
-- 
2.23.0


["xsa299-4.8/0002-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From fcfb3fc77052d328273a98e4f360354af660affe Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/12] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5c9db3f898..1c19092503 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1236,7 +1236,7 @@ get_page_from_l2e(
     int rc;
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l2e(d, l2e) ? -ERESTART : 1;
+        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;
 
     if ( unlikely((l2e_get_flags(l2e) & L2_DISALLOW_MASK)) )
     {
@@ -1278,7 +1278,7 @@ get_page_from_l3e(
     int rc;
 
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l3e(d, l3e) ? -ERESTART : 1;
+        return pv_l1tf_check_l3e(d, l3e) ? -EINTR : 1;
 
     if ( unlikely((l3e_get_flags(l3e) & l3_disallow_mask(d))) )
     {
@@ -1304,7 +1304,7 @@ get_page_from_l4e(
     int rc;
 
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l4e(d, l4e) ? -ERESTART : 1;
+        return pv_l1tf_check_l4e(d, l4e) ? -EINTR : 1;
 
     if ( unlikely((l4e_get_flags(l4e) & L4_DISALLOW_MASK)) )
     {
@@ -1567,7 +1567,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
-- 
2.23.0


["xsa299-4.8/0003-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From 2fca92eed69be2546c1c5a0c78732c71985ee08c Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/12] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will be
picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 3946ea38fd..400efbad14 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -159,7 +159,7 @@ static void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -2638,9 +2638,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2903,7 +2928,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299-4.8/0004-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From 39063864e70a30a2ee958953ed71760e7596f202 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/12] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches clearer, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit uglier in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 166 +++++++++++++++++++++++----------------
 xen/include/asm-x86/mm.h |  41 +++++++---
 2 files changed, 128 insertions(+), 79 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1c19092503..680b5b3de7 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -756,22 +756,35 @@ static int get_page_from_pagenr(unsigned long page_nr, struct domain *d)
 static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible);
 
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
                                          struct domain *d,
-                                         int partial,
-                                         int preemptible)
+                                         unsigned int flags)
 {
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
         return -EINVAL;
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1230,7 +1243,7 @@ get_page_from_l1e(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1246,8 +1259,9 @@ get_page_from_l2e(
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PSE) )
     {
-        rc = get_page_and_type_from_pagenr(mfn, PGT_l1_page_table, d,
-                                           partial, false);
+        ASSERT(!(flags & PTF_preemptible));
+
+        rc = get_page_and_type_from_pagenr(mfn, PGT_l1_page_table, d, flags);
         if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
             rc = 0;
         return rc;
@@ -1273,7 +1287,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1287,7 +1301,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_pagenr(
-        l3e_get_pfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_pfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1299,7 +1313,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1313,7 +1327,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_pagenr(
-        l4e_get_pfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_pfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1443,7 +1457,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1457,12 +1471,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(pfn);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1479,7 +1494,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool_t defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1502,13 +1517,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(pfn));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
         current->arch.old_guest_table = pg;
@@ -1523,7 +1539,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool_t defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1532,13 +1548,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(pfn));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
             current->arch.old_guest_table = pg;
@@ -1648,12 +1665,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = page_to_mfn(page);
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1663,18 +1681,19 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         }
 
         if ( !is_guest_l2_slot(d, type, i) ||
-             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1683,7 +1702,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1713,7 +1732,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = page_to_mfn(page);
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1728,7 +1748,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1746,21 +1766,23 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]),
                                                    PGT_l2_page_table |
                                                    PGT_pae_xen_l2,
-                                                   d, partial, 1);
+                                                   d,
+                                                   partial_flags | PTF_preemptible);
         }
         else if ( !is_guest_l3_slot(i) ||
-                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial)) > 0 )
+                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1777,7 +1799,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1843,19 +1865,21 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = page_to_mfn(page);
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1864,7 +1888,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1918,19 +1942,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1952,12 +1977,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1969,8 +1996,9 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1978,11 +2006,11 @@ static int free_l3_table(struct page_info *page)
     {
         if ( is_guest_l3_slot(i) )
         {
-            rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+            rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
             if ( rc < 0 )
                 break;
 
-            partial = 0;
+            partial_flags = 0;
             if ( rc == 0 )
                 unadjust_guest_l3e(pl3e[i], d);
         }
@@ -2002,12 +2030,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -2018,26 +2048,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2315,7 +2348,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, pfn, 0, true);
+    put_page_from_l2e(ol2e, pfn, PTF_defer);
 
     return rc;
 }
@@ -2389,7 +2422,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, pfn, 0, 1);
+    put_page_from_l3e(ol3e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2451,7 +2484,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, pfn, 0, 1);
+    put_page_from_l4e(ol4e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2715,7 +2748,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -3003,7 +3036,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3369,7 +3402,8 @@ int new_guest_cr3(unsigned long mfn)
 
     rc = paging_mode_refcounts(d)
          ? (get_page_from_pagenr(mfn, d) ? 0 : -EINVAL)
-         : get_page_and_type_from_pagenr(mfn, PGT_root_page_table, d, 0, 1);
+         : get_page_and_type_from_pagenr(mfn, PGT_root_page_table, d,
+                                         PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3757,7 +3791,7 @@ long do_mmuext_op(
                     rc = get_page_from_pagenr(op.arg1.mfn, d) ? 0 : -EINVAL;
                 else
                     rc = get_page_and_type_from_pagenr(
-                        op.arg1.mfn, PGT_root_page_table, d, 0, 1);
+                        op.arg1.mfn, PGT_root_page_table, d, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 955e1fa86a..8e3cef954c 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -141,19 +141,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page at
+         * @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -164,7 +179,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0
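
As an aid to reading the hunks above: a minimal, self-contained C
sketch (toy code, not the Xen sources) of how the free_lN_table()
loops pick their starting index from nr_validated_ptes and
partial_flags.  The expression is lifted straight from the diffs; the
surrounding scaffolding is invented:

    #include <assert.h>

    #define PTF_partial_set (1u << 0)

    static unsigned int start_index(unsigned int nr_validated_ptes,
                                    unsigned int partial_flags)
    {
        /* If entry [nr_validated_ptes] is partially handled, resume on
         * it; otherwise start with the last fully validated entry. */
        return nr_validated_ptes - !(partial_flags & PTF_partial_set);
    }

    int main(void)
    {
        assert(start_index(5, 0) == 4);               /* fresh pass */
        assert(start_index(5, PTF_partial_set) == 5); /* resume partial */
        return 0;
    }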


["xsa299-4.8/0005-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From 13660d718ec3a119ff0ab1f1197d0796ca79c668 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/12] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is mainly in preparation for _put_page_type taking the
partial_flags value in the future.  It also makes the call sites
easier to read (you see a flag name rather than a bare `true` or
`false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 680b5b3de7..1576a083df 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1392,7 +1392,7 @@ get_page_from_l4e(
             l3e_remove_flags((pl3e), _PAGE_USER|_PAGE_RW|_PAGE_ACCESSED);   \
     } while ( 0 )
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1475,7 +1475,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1484,7 +1484,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1521,7 +1521,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(pfn));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
     }
 
     if ( flags & PTF_defer )
@@ -1531,7 +1531,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(pfn));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1552,7 +1552,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(pfn));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
         }
 
         if ( flags & PTF_defer )
@@ -1562,7 +1562,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(pfn));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2825,11 +2825,12 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
     int rc = 0;
+    bool preemptible = flags & PTF_preemptible;
 
     for ( ; ; )
     {
@@ -3024,7 +3025,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(iommu_ret) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 rc = iommu_ret;
                 goto out;
             }
@@ -3051,7 +3052,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -3067,7 +3068,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -3273,7 +3274,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0
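
A minimal sketch of the bool-to-flags pattern this patch applies;
struct page_info is an opaque stand-in here and the body is invented
purely for illustration:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdio.h>

    #define PTF_preemptible (1u << 2)

    struct page_info;                 /* opaque stand-in */

    static int put_type_model(struct page_info *pg, unsigned int flags)
    {
        bool preemptible = flags & PTF_preemptible;

        (void)pg;
        printf("preemptible=%d\n", (int)preemptible);
        return 0;
    }

    int main(void)
    {
        /* Before: put_type_model(pg, true); -- opaque at the call site.
         * After: the intent is spelled out where the call is made. */
        return put_type_model(NULL, PTF_preemptible);
    }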


["xsa299-4.8/0006-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From aafc9cd847896748160c5820a5b9c94404a350f4 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/12] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandles the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1576a083df..16aef2f652 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -784,8 +784,43 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0
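
A toy, compilable rendering of the restructured conditional
(stand-in types and stubs; settle_ref is an invented name), stating
when the reference is retained rather than when it is released:

    #include <assert.h>
    #include <stdbool.h>
    #include <stddef.h>

    struct page_info { int refs; };

    static void put_page(struct page_info *pg) { pg->refs--; }

    static void settle_ref(struct page_info *pg, int rc, bool partial_ref,
                           bool preemptible, const struct page_info *old_gt)
    {
        if ( !rc || partial_ref )
            /* nothing: the ref is retained */;
        else if ( pg == old_gt )
            assert(preemptible);      /* partial failure: also retained */
        else
            put_page(pg);             /* plain failure: drop the ref */
    }

    int main(void)
    {
        struct page_info pg = { .refs = 1 };

        settle_ref(&pg, -1 /* error, no special case */, false, false, NULL);
        assert(pg.refs == 0);
        return 0;
    }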


["xsa299-4.8/0007-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From 62e9b9c68eba61b7bedcb5ef81be940308d4127d Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/12] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when the alloc_l[23]_table() functions see
hypercall_preempt_check() fire and return -ERESTART, they set
nr_entries_validated, but don't clear partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 16aef2f652..5e6bb3349c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1709,13 +1709,8 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) ||
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) ||
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
@@ -1786,13 +1781,8 @@ static int alloc_l3_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
                  (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) )
-- 
2.23.0
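
A toy model of the fall-through pattern the patch introduces; all
names here are stand-ins, and ERESTART is defined locally since it is
not universally exposed by errno.h:

    #include <assert.h>
    #include <errno.h>
    #include <stdbool.h>

    #ifndef ERESTART
    #define ERESTART 85
    #endif

    static bool preempt_check(void) { return false; }               /* stub */
    static int validate_entry(unsigned int i) { (void)i; return 0; } /* stub */

    static int alloc_table_model(unsigned int nr_validated, unsigned int n)
    {
        int rc = 0;
        unsigned int i;

        for ( i = nr_validated; i < n; i++ )
        {
            if ( i > nr_validated && preempt_check() )
                rc = -EINTR;          /* fall through to shared handling */
            else if ( (rc = validate_entry(i)) > 0 )
                continue;

            if ( rc == -EINTR && i )
            {
                /* one copy of the "interrupted" bookkeeping, not two */
                rc = -ERESTART;
                break;
            }
            else if ( rc < 0 )
                break;
        }
        return rc;
    }

    int main(void)
    {
        assert(alloc_table_model(0, 4) == 0);
        return 0;
    }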


["xsa299-4.8/0008-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From 391a85bbb11f9218d65afac666c980621c5b822b Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/12] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_flags.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, get_page_from_l2e()
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----

* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_flags = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was cleared; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
---
 xen/arch/x86/mm.c        | 87 ++++++++++++++++++++++++----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 5e6bb3349c..8012b6cddb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -763,10 +763,11 @@ static int __get_page_type(struct page_info *page, unsigned long type,
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
@@ -776,7 +777,11 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
@@ -789,13 +794,15 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -818,7 +825,8 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1509,8 +1517,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1555,8 +1563,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1586,8 +1594,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1714,13 +1722,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1729,6 +1746,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else if ( rc < 0 && rc != -EINTR )
         {
             MEM_LOG("Failure in alloc_l2_table: entry %d", i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1792,17 +1810,21 @@ static int alloc_l3_table(struct page_info *page)
                                                    PGT_l2_page_table |
                                                    PGT_pae_xen_l2,
                                                    d,
-                                                   partial_flags | PTF_preemptible);
+                                                   partial_flags |
+                                                   PTF_preemptible |
+                                                   PTF_retain_ref_on_restart);
         }
         else if ( !is_guest_l3_slot(i) ||
-                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
+                  (rc = get_page_from_l3e(pl3e[i], pfn, d,
+                                          partial_flags |
+                                          PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1897,14 +1919,15 @@ static int alloc_l4_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -2002,9 +2025,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -2055,9 +2076,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2088,9 +2107,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 8e3cef954c..c2d1758ad7 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -151,22 +151,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0
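
The invariant this patch establishes, in a toy refcount model
(invented helpers, not Xen structures): marking an entry partial now
always takes a general reference, so the appendix's sequence can no
longer free B while A[x] still points at it:

    #include <assert.h>

    struct page { unsigned int count; unsigned int partial; };

    static void mark_partial(struct page *pg)
    {
        pg->count++;     /* PTF_partial_set now implies a general ref */
        pg->partial = 1;
    }

    static void finish_partial(struct page *pg)
    {
        pg->partial = 0;
        pg->count--;     /* drop the ref the partial state was holding */
    }

    int main(void)
    {
        struct page B = { .count = 1 /* PGC_allocated */, .partial = 0 };

        mark_partial(&B);
        /* Even if every other reference goes away, B.count >= 1 until
         * the interrupted operation is picked up and completed. */
        assert(B.count == 2);
        finish_partial(&B);
        assert(B.count == 1);
        return 0;
    }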


["xsa299-4.8/0009-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From 4e0388af70146faaef6443a8b57807cd61e0960f Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/12] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8012b6cddb..a7084ccfa7 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -758,13 +758,12 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -777,13 +776,10 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
         return -EINVAL;
 
@@ -793,14 +789,14 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -823,7 +819,7 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1514,13 +1510,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(pfn);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1560,13 +1550,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
@@ -1591,13 +1574,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
@@ -1824,7 +1800,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1927,7 +1903,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -2025,7 +2001,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -2076,7 +2052,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2107,7 +2083,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index c2d1758ad7..a13dd1d299 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -146,7 +146,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
          * @nr_validated_ptes holds a general reference count for the
          * page.
          *
@@ -155,23 +155,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -181,8 +178,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0
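
For reference, a stand-alone model of the shrunken bitfield, assuming
x86's PAGETABLE_ORDER of 9 and a typical ABI; this is not the real
struct page_info:

    #include <stdint.h>

    #define PAGETABLE_ORDER 9

    struct pt_info_model {
        uint16_t nr_validated_ptes:PAGETABLE_ORDER + 1;   /* 10 bits */
        uint16_t :16 - PAGETABLE_ORDER - 1 - 1;           /*  5 bits pad */
        uint16_t partial_flags:1;      /* now just PTF_partial_set */
        int16_t  linear_pt_count;
    };

    /* Still packs into the same 32 bits as the 2-bit version did. */
    _Static_assert(sizeof(struct pt_info_model) == 4,
                   "layout assumption (typical ABI)");

    int main(void) { return 0; }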


["xsa299-4.8/0010-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From 2811202a267069e1287d069ecd61cde1c44c8741 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/12] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a7084ccfa7..3551b58963 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1726,7 +1726,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1822,7 +1822,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1912,7 +1912,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0
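
The essence of the fix as a toy helper (invented struct and names):
on failure, record the state entry [i] actually has rather than
claiming it holds nothing:

    struct pt_page { unsigned int nr_validated_ptes, partial_flags; };

    static void record_promotion_failure(struct pt_page *pg, unsigned int i,
                                         unsigned int partial_flags)
    {
        pg->nr_validated_ptes = i;
        /* The pre-patch code stored 0 here, discarding PTF_partial_set --
         * and with it the general reference the entry was holding, so the
         * page could never be freed.  Store the real state instead: */
        pg->partial_flags = partial_flags;
    }

    int main(void)
    {
        struct pt_page pg;

        record_promotion_failure(&pg, 3, 1 /* PTF_partial_set */);
        return pg.partial_flags == 1 ? 0 : 1;
    }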


["xsa299-4.8/0011-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From 72bc77600aa1f4c601b30af46ac3f808c93c8ca4 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 11/12] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this page
  to finish off de-validation, and the appropriate refcount adjustments
  will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but if put_page_type() is never called and get_page_type() is called
instead, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table() shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3551b58963..814d12353c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1725,6 +1725,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1758,6 +1772,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1809,7 +1824,11 @@ static int alloc_l3_table(struct page_info *page)
             rc = -ERESTART;
         }
         if ( rc < 0 )
+        {
+            /* XSA-299 Backport: Copy l3e for checking */
+            l3e = pl3e[i];
             break;
+        }
 
         adjust_guest_l3e(pl3e[i], d);
     }
@@ -1823,6 +1842,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1918,7 +1955,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl4e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0
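
A toy rendering of the new error path (all names are stand-ins):
transfer old_guest_table's general reference to the partial entry only
when it really is the page the entry points at; otherwise prefer
leaking a type ref to dropping one that may not be held:

    #include <assert.h>
    #include <stddef.h>

    #define PTF_partial_set (1u << 0)

    struct page { unsigned int partial_flags; };

    static void note_failed_validation(struct page *parent,
                                       struct page *entry_page,
                                       struct page *old_guest_table)
    {
        if ( old_guest_table != NULL )
        {
            if ( old_guest_table == entry_page )
                parent->partial_flags = PTF_partial_set; /* transfer ref */
            else
                assert(0);  /* ASSERT_UNREACHABLE(): leak, don't drop */
        }
    }

    int main(void)
    {
        struct page parent = { 0 }, child = { 0 };

        note_failed_validation(&parent, &child, &child);
        assert(parent.partial_flags == PTF_partial_set);
        return 0;
    }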


["xsa299-4.8/0012-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From 15f43ff6f0f08d64495749d3c2e7d4230ef8b632 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 12/12] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.
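
As a rough sketch, the state involved looks like this (illustrative
only: field names are as used in this series, but the real struct
page_info layout is more involved):

    /* Sketch of the per-page (de)validation bookkeeping. */
    struct page_info {
        unsigned long type_info;        /* includes the PGT_partial bit */
        unsigned int nr_validated_ptes; /* entries validated so far */
        unsigned int partial_flags;     /* PTF_partial_set: the entry at
                                         * [nr_validated_ptes] is partially
                                         * (de)validated and holds a
                                         * general reference count */
    };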

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again, at which point the put_page_type() may
drop the only page ref on an in-use page.  Some examples are listed in
the appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial and to drop a type count, and it has no way of knowing
which is intended; so if PGT_partial is cleared in the interim,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.
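
Concretely, the intent is conveyed via the flags argument (an
illustrative condensation of the call sites changed below, not
itself part of the patch):

    /* Dropping a type ref which the caller actually holds: */
    rc = _put_page_type(pg, PTF_preemptible, ptpg);

    /* Cleaning up a partial (de)validation; a no-op unless the page
     * is actually found in the PGT_partial state: */
    rc = _put_page_type(pg, PTF_preemptible | PTF_partial_set, ptpg);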

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
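
For reference, the check added to _put_page_type() in the diff below
boils down to:

    if ( !(flags & PTF_partial_set) )
        /* Caller claims to hold a full type ref; the page must be
         * validated (or the domain dying), and not PGT_partial. */
        BUG_ON((x & PGT_partial) ||
               !((x & PGT_validated) || page_get_owner(page)->is_dying));
    else if ( !(x & PGT_partial) )
        return 0;   /* nothing partial left to clean up */
    else
        BUG_ON((x & PGT_count_mask) != 1);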

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
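
For example (condensed from the do_mmuext_op() hunks below), a
preempted put records which kind of reference old_guest_table now
represents:

    case -EINTR:
    case -ERESTART:
        curr->arch.old_guest_ptpg = NULL;
        curr->arch.old_guest_table = page;
        /* -EINTR: we still hold the type ref;
         * -ERESTART: PGT_partial holds the type ref. */
        curr->arch.old_guest_table_partial = (rc == -ERESTART);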

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3
entry.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer things such that the
old_guest_table of every vcpu a guest has points to the same page.
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 98 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 94 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 400efbad14..8b0030e95a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1386,9 +1386,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 814d12353c..b5ca5031a8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1514,10 +1514,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1540,6 +1541,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         int writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(mfn), writeable);
@@ -1552,12 +1554,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(pfn));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1576,12 +1580,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+        rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(pfn));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1689,6 +1695,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1743,6 +1757,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1856,12 +1871,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
         {
@@ -1968,12 +1987,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2908,6 +2931,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         if ( unlikely((nx & PGT_count_mask) == 0) )
@@ -3345,17 +3390,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3507,11 +3569,11 @@ int new_guest_cr3(unsigned long mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3781,6 +3843,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3816,6 +3879,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3890,11 +3958,15 @@ long do_mmuext_op(
                     switch ( rc = put_page_and_type_preemptible(page) )
                     {
                     case -EINTR:
-                        rc = -ERESTART;
-                        /* fallthrough */
                     case -ERESTART:
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        /*
+                         * EINTR means we still hold the type ref;
+                         * ERESTART means PGT_partial holds the ref
+                         */
+                        curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                        rc = -ERESTART;
                         break;
                     default:
                         BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 01372838ec..372a157b70 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -342,7 +342,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -576,6 +576,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


["xsa299-4.9/0001-x86-mm-Clean-up-trailing-whitespace.patch" (application/octet-stream)]

From f7751d83cb104555a7dbfab21bb4c1a297121083 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Wed, 16 Oct 2019 09:16:15 +0100
Subject: [PATCH 01/12] x86/mm: Clean up trailing whitespace

Sometime between 4.9 and 4.10 someone cleaned up all the trailing
whitespace in mm.c; applying this patch now makes all futher patches
much cleaner.

No functional change.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
---
 xen/arch/x86/mm.c | 116 +++++++++++++++++++++++-----------------------
 1 file changed, 58 insertions(+), 58 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ae8c2de4f3..bdc2682154 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1,48 +1,48 @@
 /******************************************************************************
  * arch/x86/mm.c
- * 
+ *
  * Copyright (c) 2002-2005 K A Fraser
  * Copyright (c) 2004 Christian Limpach
- * 
+ *
  * This program is free software; you can redistribute it and/or modify
  * it under the terms of the GNU General Public License as published by
  * the Free Software Foundation; either version 2 of the License, or
  * (at your option) any later version.
- * 
+ *
  * This program is distributed in the hope that it will be useful,
  * but WITHOUT ANY WARRANTY; without even the implied warranty of
  * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the
  * GNU General Public License for more details.
- * 
+ *
  * You should have received a copy of the GNU General Public License
  * along with this program; If not, see <http://www.gnu.org/licenses/>.
  */
 
 /*
  * A description of the x86 page table API:
- * 
+ *
  * Domains trap to do_mmu_update with a list of update requests.
  * This is a list of (ptr, val) pairs, where the requested operation
  * is *ptr = val.
- * 
+ *
  * Reference counting of pages:
  * ----------------------------
  * Each page has two refcounts: tot_count and type_count.
- * 
+ *
  * TOT_COUNT is the obvious reference count. It counts all uses of a
  * physical page frame by a domain, including uses as a page directory,
  * a page table, or simple mappings via a PTE. This count prevents a
  * domain from releasing a frame back to the free pool when it still holds
  * a reference to it.
- * 
+ *
  * TYPE_COUNT is more subtle. A frame can be put to one of three
  * mutually-exclusive uses: it might be used as a page directory, or a
  * page table, or it may be mapped writable by the domain [of course, a
  * frame may not be used in any of these three ways!].
- * So, type_count is a count of the number of times a frame is being 
+ * So, type_count is a count of the number of times a frame is being
  * referred to in its current incarnation. Therefore, a page can only
  * change its type when its type count is zero.
- * 
+ *
  * Pinning the page type:
  * ----------------------
  * The type of a page can be pinned/unpinned with the commands
@@ -51,20 +51,20 @@
  * This is useful to prevent a page's type count falling to zero, at which
  * point safety checks would need to be carried out next time the count
  * is increased again.
- * 
+ *
  * A further note on writable page mappings:
  * -----------------------------------------
  * For simplicity, the count of writable mappings for a page may not
  * correspond to reality. The 'writable count' is incremented for every
  * PTE which maps the page with the _PAGE_RW flag set. However, for
  * write access to be possible the page directory entry must also have
- * its _PAGE_RW bit set. We do not check this as it complicates the 
+ * its _PAGE_RW bit set. We do not check this as it complicates the
  * reference counting considerably [consider the case of multiple
  * directory entries referencing a single page table, some with the RW
  * bit set, others not -- it starts getting a bit messy].
  * In normal use, this simplification shouldn't be a problem.
  * However, the logic can be added if required.
- * 
+ *
  * One more note on read-only page mappings:
  * -----------------------------------------
  * We want domains to be able to map pages for read-only access. The
@@ -73,10 +73,10 @@
  * However, domains have free access to rings 1 & 2 of the Intel
  * privilege model. In terms of page protection, these are considered
  * to be part of 'supervisor mode'. The WP bit in CR0 controls whether
- * read-only restrictions are respected in supervisor mode -- if the 
+ * read-only restrictions are respected in supervisor mode -- if the
  * bit is clear then any mapped page is writable.
- * 
- * We get round this by always setting the WP bit and disallowing 
+ *
+ * We get round this by always setting the WP bit and disallowing
  * updates to it. This is very unlikely to cause a problem for guest
  * OS's, which will generally use the WP bit to simplify copy-on-write
  * implementation (in that case, OS wants a fault when it writes to
@@ -311,7 +311,7 @@ void __init arch_init_memory(void)
      */
     dom_io = domain_create(DOMID_IO, DOMCRF_dummy, 0, NULL);
     BUG_ON(IS_ERR(dom_io));
-    
+
     /*
      * Initialise our COW domain.
      * This domain owns sharable pages.
@@ -322,7 +322,7 @@ void __init arch_init_memory(void)
     /* First 1MB of RAM is historically marked as I/O. */
     for ( i = 0; i < 0x100; i++ )
         share_xen_page_with_guest(mfn_to_page(i), dom_io, XENSHARE_writable);
- 
+
     /* Any areas not specified as RAM by the e820 map are considered I/O. */
     for ( i = 0, pfn = 0; pfn < max_page; i++ )
     {
@@ -352,7 +352,7 @@ void __init arch_init_memory(void)
          */
         iostart_pfn = max_t(unsigned long, pfn, 1UL << (20 - PAGE_SHIFT));
         ioend_pfn = min(rstart_pfn, 16UL << (20 - PAGE_SHIFT));
-        if ( iostart_pfn < ioend_pfn )            
+        if ( iostart_pfn < ioend_pfn )
             destroy_xen_mappings((unsigned long)mfn_to_virt(iostart_pfn),
                                  (unsigned long)mfn_to_virt(ioend_pfn));
 
@@ -440,7 +440,7 @@ int page_is_ram_type(unsigned long mfn, unsigned long mem_type)
             /* unknown */
             continue;
         }
-        
+
         /* Test the range. */
         if ( (e820.map[i].addr <= maddr) &&
              ((e820.map[i].addr + e820.map[i].size) >= (maddr + PAGE_SIZE)) )
@@ -559,7 +559,7 @@ void write_ptbase(struct vcpu *v)
 
 /*
  * Should be called after CR3 is updated.
- * 
+ *
  * Uses values found in vcpu->arch.(guest_table and guest_table_user), and
  * for HVM guests, arch.monitor_table and hvm's guest CR3.
  *
@@ -770,7 +770,7 @@ static int get_page_from_pagenr(unsigned long page_nr, struct domain *d)
 static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible);
 
-static int get_page_and_type_from_pagenr(unsigned long page_nr, 
+static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
                                          struct domain *d,
                                          int partial,
@@ -1174,7 +1174,7 @@ get_page_from_l1e(
         pg_owner = real_pg_owner;
     }
 
-    /* Extra paranoid check for shared memory. Writable mappings 
+    /* Extra paranoid check for shared memory. Writable mappings
      * disallowed (unshare first!) */
     if ( (l1f & _PAGE_RW) && (real_pg_owner == dom_cow) )
         goto could_not_pin;
@@ -1423,12 +1423,12 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
      * Check if this is a mapping that was established via a grant reference.
      * If it was then we should not be here: we require that such mappings are
      * explicitly destroyed via the grant-table interface.
-     * 
+     *
      * The upshot of this is that the guest can end up with active grants that
      * it cannot destroy (because it no longer has a PTE to present to the
      * grant-table interface). This can lead to subtle hard-to-catch bugs,
      * hence a special grant PTE flag can be enabled to catch the bug early.
-     * 
+     *
      * (Note that the undestroyable active grants are not a security hole in
      * Xen. All active grants can safely be cleaned up when the domain dies.)
      */
@@ -1443,7 +1443,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
 
     /* Remember we didn't take a type-count of foreign writable mappings
      * to paging-external domains */
-    if ( (l1e_get_flags(l1e) & _PAGE_RW) && 
+    if ( (l1e_get_flags(l1e) & _PAGE_RW) &&
          ((l1e_owner == pg_owner) || !paging_mode_external(pg_owner)) )
     {
         put_page_and_type(page);
@@ -1451,7 +1451,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
     else
     {
         /* We expect this is rare so we blow the entire shadow LDT. */
-        if ( unlikely(((page->u.inuse.type_info & PGT_type_mask) == 
+        if ( unlikely(((page->u.inuse.type_info & PGT_type_mask) ==
                        PGT_seg_desc_page)) &&
              unlikely(((page->u.inuse.type_info & PGT_count_mask) != 0)) &&
              (l1e_owner == pg_owner) )
@@ -1553,7 +1553,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 {
     int rc = 1;
 
-    if ( (l4e_get_flags(l4e) & _PAGE_PRESENT) && 
+    if ( (l4e_get_flags(l4e) & _PAGE_PRESENT) &&
          (l4e_get_pfn(l4e) != pfn) )
     {
         struct page_info *pg = l4e_get_page(l4e);
@@ -2112,8 +2112,8 @@ void page_unlock(struct page_info *page)
 
 /* How to write an entry to the guest pagetables.
  * Returns 0 for failure (pointer not valid), 1 for success. */
-static inline int update_intpte(intpte_t *p, 
-                                intpte_t old, 
+static inline int update_intpte(intpte_t *p,
+                                intpte_t old,
                                 intpte_t new,
                                 unsigned long mfn,
                                 struct vcpu *v,
@@ -2285,8 +2285,8 @@ static int mod_l1_entry(l1_pgentry_t *pl1e, l1_pgentry_t nl1e,
 
 
 /* Update the L2 entry at pl2e to new value nl2e. pl2e is within frame pfn. */
-static int mod_l2_entry(l2_pgentry_t *pl2e, 
-                        l2_pgentry_t nl2e, 
+static int mod_l2_entry(l2_pgentry_t *pl2e,
+                        l2_pgentry_t nl2e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2350,8 +2350,8 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
 }
 
 /* Update the L3 entry at pl3e to new value nl3e. pl3e is within frame pfn. */
-static int mod_l3_entry(l3_pgentry_t *pl3e, 
-                        l3_pgentry_t nl3e, 
+static int mod_l3_entry(l3_pgentry_t *pl3e,
+                        l3_pgentry_t nl3e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2424,8 +2424,8 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
 }
 
 /* Update the L4 entry at pl4e to new value nl4e. pl4e is within frame pfn. */
-static int mod_l4_entry(l4_pgentry_t *pl4e, 
-                        l4_pgentry_t nl4e, 
+static int mod_l4_entry(l4_pgentry_t *pl4e,
+                        l4_pgentry_t nl4e,
                         unsigned long pfn,
                         int preserve_ad,
                         struct vcpu *vcpu)
@@ -2592,7 +2592,7 @@ struct domain *page_get_owner_and_reference(struct page_info *page)
         x = y;
         /*
          * Count ==  0: Page is not allocated, so we cannot take a reference.
-         * Count == -1: Reference count would wrap, which is invalid. 
+         * Count == -1: Reference count would wrap, which is invalid.
          * Count == -2: Remaining unused ref is reserved for get_page_light().
          */
         if ( unlikely(((x + 2) & PGC_count_mask) <= 2) )
@@ -2680,7 +2680,7 @@ static int alloc_page_type(struct page_info *page, unsigned long type,
         rc = alloc_segdesc_page(page);
         break;
     default:
-        printk("Bad type in alloc_page_type %lx t=%" PRtype_info " c=%lx\n", 
+        printk("Bad type in alloc_page_type %lx t=%" PRtype_info " c=%lx\n",
                type, page->u.inuse.type_info,
                page->count_info);
         rc = -EINVAL;
@@ -2941,8 +2941,8 @@ static int __get_page_type(struct page_info *page, unsigned long type,
             if ( (x & PGT_type_mask) != type )
             {
                 /*
-                 * On type change we check to flush stale TLB entries. This 
-                 * may be unnecessary (e.g., page was GDT/LDT) but those 
+                 * On type change we check to flush stale TLB entries. This
+                 * may be unnecessary (e.g., page was GDT/LDT) but those
                  * circumstances should be very rare.
                  */
                 cpumask_t *mask = this_cpu(scratch_cpumask);
@@ -3892,7 +3892,7 @@ long do_mmuext_op(
             else
                 rc = -EPERM;
             break;
-    
+
         case MMUEXT_INVLPG_ALL:
             if ( unlikely(d != pg_owner) )
                 rc = -EPERM;
@@ -4426,7 +4426,7 @@ static int create_grant_pte_mapping(
         gdprintk(XENLOG_WARNING, "Could not get page for normal update\n");
         return GNTST_general_error;
     }
-    
+
     mfn = page_to_mfn(page);
     va = map_domain_page(_mfn(mfn));
     va = (void *)((unsigned long)va + ((unsigned long)pte_addr & ~PAGE_MASK));
@@ -4450,7 +4450,7 @@ static int create_grant_pte_mapping(
         page_unlock(page);
         rc = GNTST_general_error;
         goto failed;
-    } 
+    }
 
     page_unlock(page);
 
@@ -4492,7 +4492,7 @@ static int destroy_grant_pte_mapping(
         gdprintk(XENLOG_WARNING, "Could not get page for normal update\n");
         return GNTST_general_error;
     }
-    
+
     mfn = page_to_mfn(page);
     va = map_domain_page(_mfn(mfn));
     va = (void *)((unsigned long)va + ((unsigned long)addr & ~PAGE_MASK));
@@ -4511,7 +4511,7 @@ static int destroy_grant_pte_mapping(
     }
 
     ol1e = *(l1_pgentry_t *)va;
-    
+
     /*
      * Check that the PTE supplied actually maps frame (with appropriate
      * permissions).
@@ -4537,8 +4537,8 @@ static int destroy_grant_pte_mapping(
 
     /* Delete pagetable entry. */
     if ( unlikely(!UPDATE_ENTRY
-                  (l1, 
-                   (l1_pgentry_t *)va, ol1e, l1e_empty(), mfn, 
+                  (l1,
+                   (l1_pgentry_t *)va, ol1e, l1e_empty(), mfn,
                    d->vcpu[0] /* Change if we go to per-vcpu shadows. */,
                    0)) )
     {
@@ -4566,7 +4566,7 @@ static int create_grant_va_mapping(
     unsigned long gl1mfn;
     struct page_info *l1pg;
     int okay;
-    
+
     adjust_guest_l1e(nl1e, d);
 
     pl1e = guest_map_l1e(va, &gl1mfn);
@@ -4619,7 +4619,7 @@ static int replace_grant_va_mapping(
     unsigned long gl1mfn;
     struct page_info *l1pg;
     int rc = 0;
-    
+
     pl1e = guest_map_l1e(addr, &gl1mfn);
     if ( !pl1e )
     {
@@ -4719,7 +4719,7 @@ static int create_grant_p2m_mapping(uint64_t addr, unsigned long frame,
         return GNTST_okay;
 }
 
-int create_grant_host_mapping(uint64_t addr, unsigned long frame, 
+int create_grant_host_mapping(uint64_t addr, unsigned long frame,
                               unsigned int flags, unsigned int cache_flags)
 {
     l1_pgentry_t pte;
@@ -4789,7 +4789,7 @@ int replace_grant_host_mapping(
     struct page_info *l1pg;
     int rc;
     unsigned int grant_pte_flags;
-    
+
     if ( paging_mode_external(current->domain) )
         return replace_grant_p2m_mapping(addr, frame, new_addr, flags);
 
@@ -4815,7 +4815,7 @@ int replace_grant_host_mapping(
         if ( !new_addr )
             return destroy_grant_pte_mapping(addr, frame, grant_pte_flags,
                                              curr->domain);
-        
+
         return GNTST_general_error;
     }
 
@@ -5210,7 +5210,7 @@ void destroy_gdt(struct vcpu *v)
 }
 
 
-long set_gdt(struct vcpu *v, 
+long set_gdt(struct vcpu *v,
              unsigned long *frames,
              unsigned int entries)
 {
@@ -5272,7 +5272,7 @@ long do_set_gdt(XEN_GUEST_HANDLE_PARAM(xen_ulong_t) frame_list,
     /* Rechecked in set_gdt, but ensures a sane limit for copy_from_user(). */
     if ( entries > FIRST_RESERVED_GDT_ENTRY )
         return -EINVAL;
-    
+
     if ( copy_from_guest(frames, frame_list, nr_pages) )
         return -EFAULT;
 
@@ -5565,7 +5565,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
             rcu_unlock_domain(d);
             return -ENOMEM;
         }
-        
+
         if ( copy_from_guest(e820, fmap.map.buffer, fmap.map.nr_entries) )
         {
             xfree(e820);
@@ -5724,7 +5724,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 rc = -EINVAL;
                 goto pod_target_out_unlock;
             }
-            
+
             rc = p2m_pod_set_mem_target(d, target.target_pages);
         }
 
@@ -5746,7 +5746,7 @@ long arch_memory_op(unsigned long cmd, XEN_GUEST_HANDLE_PARAM(void) arg)
                 goto pod_target_out_unlock;
             }
         }
-        
+
     pod_target_out_unlock:
         rcu_unlock_domain(d);
         return rc;
@@ -5992,7 +5992,7 @@ static const struct x86_emulate_ops ptwr_emulate_ops = {
 };
 
 /* Write page fault handler: check if guest is trying to modify a PTE. */
-int ptwr_do_page_fault(struct vcpu *v, unsigned long addr, 
+int ptwr_do_page_fault(struct vcpu *v, unsigned long addr,
                        struct cpu_user_regs *regs)
 {
     struct domain *d = v->domain;
-- 
2.23.0


["xsa299-4.9/0002-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From 2f5296970ab8e5ac048f14c560d0befd53d386cc Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/12] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinqush_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).
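
The resulting convention, as seen in the hunks below:

    /* L1TF check on a non-present entry: nothing has been validated
     * yet, so report "no state to undo" (-EINTR) rather than
     * "partially done" (-ERESTART). */
    if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;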

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index bdc2682154..b5de6c108a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1257,7 +1257,7 @@ get_page_from_l2e(
     int rc;
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l2e(d, l2e) ? -ERESTART : 1;
+        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;
 
     if ( unlikely((l2e_get_flags(l2e) & L2_DISALLOW_MASK)) )
     {
@@ -1300,7 +1300,7 @@ get_page_from_l3e(
     int rc;
 
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l3e(d, l3e) ? -ERESTART : 1;
+        return pv_l1tf_check_l3e(d, l3e) ? -EINTR : 1;
 
     if ( unlikely((l3e_get_flags(l3e) & l3_disallow_mask(d))) )
     {
@@ -1327,7 +1327,7 @@ get_page_from_l4e(
     int rc;
 
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l4e(d, l4e) ? -ERESTART : 1;
+        return pv_l1tf_check_l4e(d, l4e) ? -EINTR : 1;
 
     if ( unlikely((l4e_get_flags(l4e) & L4_DISALLOW_MASK)) )
     {
@@ -1593,7 +1593,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
-- 
2.23.0


["xsa299-4.9/0003-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From 63a030c6fbb74e19b580e4c46dd25bbb1c8a6178 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/12] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will be
picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
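
In code, this comes down to the following (condensed from the
relinquish_memory() hunk below):

    if ( ret == -EINTR )
    {
        /* Fully validated again; safe to re-pin. */
        ASSERT(page->u.inuse.type_info & PGT_validated);
        set_bit(_PGT_pinned, &page->u.inuse.type_info);
    }
    else /* -ERESTART */
        /* PGT_partial is set; drop the ref PGT_pinned held. */
        put_page(page);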

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index b1e2cfa27e..1a4f89a6b1 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -159,7 +159,7 @@ static void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -2405,9 +2405,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2631,7 +2656,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299-4.9/0004-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From dd75c35162055e294f8fa4dc9609a9b2ea81d438 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/12] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
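
The old tristate thus maps onto the new flags as follows (a summary of
the conversions made in the diff below):

    #define PTF_partial_set         (1 << 0)
    #define PTF_partial_general_ref (1 << 1)

    /* partial_pte == 0   ->  partial_flags == 0
     * partial_pte == 1   ->  PTF_partial_set
     * partial_pte == -1  ->  PTF_partial_set | PTF_partial_general_ref
     */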

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 166 +++++++++++++++++++++++----------------
 xen/include/asm-x86/mm.h |  41 +++++++---
 2 files changed, 128 insertions(+), 79 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index b5de6c108a..9d225e3c28 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -770,22 +770,35 @@ static int get_page_from_pagenr(unsigned long page_nr, struct domain *d)
 static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible);
 
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
                                          struct domain *d,
-                                         int partial,
-                                         int preemptible)
+                                         unsigned int flags)
 {
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
         return -EINVAL;
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1251,7 +1264,7 @@ get_page_from_l1e(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1268,8 +1281,9 @@ get_page_from_l2e(
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PSE) )
     {
-        rc = get_page_and_type_from_pagenr(mfn, PGT_l1_page_table, d,
-                                           partial, false);
+        ASSERT(!(flags & PTF_preemptible));
+
+        rc = get_page_and_type_from_pagenr(mfn, PGT_l1_page_table, d, flags);
         if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
             rc = 0;
         return rc;
@@ -1295,7 +1309,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1310,7 +1324,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_pagenr(
-        l3e_get_pfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_pfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1322,7 +1336,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1337,7 +1351,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_pagenr(
-        l4e_get_pfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_pfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1469,7 +1483,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1483,12 +1497,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(pfn);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1505,7 +1520,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool_t defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1528,13 +1543,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(pfn));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
         current->arch.old_guest_table = pg;
@@ -1549,7 +1565,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool_t defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1558,13 +1574,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(pfn));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
             current->arch.old_guest_table = pg;
@@ -1674,12 +1691,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = page_to_mfn(page);
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1689,18 +1707,19 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         }
 
         if ( !is_guest_l2_slot(d, type, i) ||
-             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1709,7 +1728,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1739,7 +1758,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = page_to_mfn(page);
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1754,7 +1774,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1772,21 +1792,23 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_pagenr(l3e_get_pfn(pl3e[i]),
                                                    PGT_l2_page_table |
                                                    PGT_pae_xen_l2,
-                                                   d, partial, 1);
+                                                   d,
+                                                   partial_flags | PTF_preemptible);
         }
         else if ( !is_guest_l3_slot(i) ||
-                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial)) > 0 )
+                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1803,7 +1825,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1869,19 +1891,21 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = page_to_mfn(page);
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1891,7 +1915,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1945,19 +1969,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1979,12 +2004,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1996,8 +2023,9 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -2005,11 +2033,11 @@ static int free_l3_table(struct page_info *page)
     {
         if ( is_guest_l3_slot(i) )
         {
-            rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+            rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
             if ( rc < 0 )
                 break;
 
-            partial = 0;
+            partial_flags = 0;
             if ( rc == 0 )
                 unadjust_guest_l3e(pl3e[i], d);
         }
@@ -2029,12 +2057,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -2045,26 +2075,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = page_to_mfn(page);
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2344,7 +2377,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, pfn, 0, true);
+    put_page_from_l2e(ol2e, pfn, PTF_defer);
 
     return rc;
 }
@@ -2419,7 +2452,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, pfn, 0, 1);
+    put_page_from_l3e(ol3e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2482,7 +2515,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, pfn, 0, 1);
+    put_page_from_l4e(ol4e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2748,7 +2781,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -3042,7 +3075,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3414,7 +3447,8 @@ int new_guest_cr3(unsigned long mfn)
 
     rc = paging_mode_refcounts(d)
          ? (get_page_from_pagenr(mfn, d) ? 0 : -EINVAL)
-         : get_page_and_type_from_pagenr(mfn, PGT_root_page_table, d, 0, 1);
+         : get_page_and_type_from_pagenr(mfn, PGT_root_page_table, d,
+                                         PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3807,7 +3841,7 @@ long do_mmuext_op(
                     rc = get_page_from_pagenr(op.arg1.mfn, d) ? 0 : -EINVAL;
                 else
                     rc = get_page_and_type_from_pagenr(
-                        op.arg1.mfn, PGT_root_page_table, d, 0, 1);
+                        op.arg1.mfn, PGT_root_page_table, d, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 7e4ffeb160..0bf5b60ba8 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -140,19 +140,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page
+         * at @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -163,7 +178,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0
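
The tristate-to-flags conversion in the patch above is easy to misread,
so here is a small stand-alone C sketch of the mapping.  This is
hypothetical illustration only: the PTF_* names and the resume-index
expression are taken from the patch, but old_to_flags() and the
surrounding program are not Xen code.

#include <stdio.h>

#define PTF_partial_set         (1 << 0)
#define PTF_partial_general_ref (1 << 1)

/* Old encoding: partial_pte > 0 meant "partially validated, no general
 * ref held by the entry"; partial_pte < 0 meant "partially
 * invalidated, general ref held"; 0 meant nothing partial. */
static unsigned int old_to_flags(int partial_pte)
{
    if ( partial_pte > 0 )
        return PTF_partial_set;
    if ( partial_pte < 0 )
        return PTF_partial_set | PTF_partial_general_ref;
    return 0;
}

int main(void)
{
    unsigned int nr_validated_ptes = 5;
    int pte;

    for ( pte = -1; pte <= 1; pte++ )
    {
        unsigned int flags = old_to_flags(pte);
        /* free_lN_table() resumes at the partial entry itself when one
         * is recorded, and at the last fully validated entry otherwise. */
        unsigned int i = nr_validated_ptes - !(flags & PTF_partial_set);

        printf("partial_pte=%2d -> flags=%u, de-validation resumes at %u\n",
               pte, flags, i);
    }
    return 0;
}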


["xsa299-4.9/0005-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From 9859fb98bd41bc79f7dd696c5dc5f095f603829e Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/12] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is mainly in preparation for _put_page_type taking the
partial_flags value in the future.  It also makes the call sites
easier to read (since you see a flag name rather than `true` or
`false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9d225e3c28..2cf0c33d18 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1417,7 +1417,7 @@ get_page_from_l4e(
             l3e_remove_flags((pl3e), _PAGE_USER|_PAGE_RW|_PAGE_ACCESSED);   \
     } while ( 0 )
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1501,7 +1501,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1510,7 +1510,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1547,7 +1547,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(pfn));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
     }
 
     if ( flags & PTF_defer )
@@ -1557,7 +1557,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(pfn));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1578,7 +1578,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(pfn));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
         }
 
         if ( flags & PTF_defer )
@@ -1588,7 +1588,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(pfn));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2859,11 +2859,12 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
     int rc = 0;
+    bool preemptible = flags & PTF_preemptible;
 
     for ( ; ; )
     {
@@ -3063,7 +3064,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(iommu_ret) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 rc = iommu_ret;
                 goto out;
             }
@@ -3090,7 +3091,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -3106,7 +3107,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -3316,7 +3317,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0
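
A minimal sketch of the new calling convention (hypothetical code:
put_type() is a stand-in for _put_page_type() and returns dummy
values):

#include <stdbool.h>
#include <stdio.h>

#define PTF_preemptible (1 << 2)

/* The boolean is recovered from the flags word at the top of the
 * function, exactly as the patch does. */
static int put_type(unsigned int flags)
{
    bool preemptible = flags & PTF_preemptible;

    /* Only a preemptible put may be interrupted; dummy values here. */
    return preemptible ? 1 : 0;
}

int main(void)
{
    /* Call sites now read _put_page_type(pg, PTF_preemptible, ptpg)
     * rather than _put_page_type(pg, true, ptpg). */
    printf("non-preemptible: %d, preemptible: %d\n",
           put_type(0), put_type(PTF_preemptible));
    return 0;
}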


["xsa299-4.9/0006-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From 389898600769d94be21c8e37db6df905babc0eee Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/12] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandles the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2cf0c33d18..c430f2c52e 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -798,8 +798,43 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0
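
The retention rule spelled out in the new comment condenses into a
small decision function.  The following is a hypothetical, stand-alone
restatement (retain_ref() is not a Xen function; -22 plays the role of
-EINVAL):

#include <assert.h>
#include <stdbool.h>
#include <stdio.h>

static bool retain_ref(int rc, bool partial_ref,
                       bool is_old_guest_table, bool preemptible)
{
    if ( rc == 0 || partial_ref )
        return true;          /* fully validated, or ref was carried in */
    if ( is_old_guest_table )
    {
        assert(preemptible);  /* mirrors the ASSERT() the patch adds */
        return true;          /* partial validation ended in an error */
    }
    return false;             /* plain failure: put_page() drops the ref */
}

int main(void)
{
    printf("%d\n", retain_ref(0, false, false, false));    /* 1 */
    printf("%d\n", retain_ref(-22, true, false, false));   /* 1: ref came in */
    printf("%d\n", retain_ref(-22, false, false, false));  /* 0: drop it */
    return 0;
}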


["xsa299-4.9/0007-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From 8629bfa452cb8890350ce63c9e28c4d0e58b7a2c Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/12] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when the alloc_l[23]_table functions check
hypercall_preempt_check() and return -ERESTART, they set
nr_entries_validated, but don't clear partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index c430f2c52e..fce15f30ca 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1735,13 +1735,8 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) ||
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) ||
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
@@ -1812,13 +1807,8 @@ static int alloc_l3_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
                  (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) )
-- 
2.23.0
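
A toy model of the bug (hypothetical; the struct and helpers below are
illustration, not Xen code) shows why funnelling through the common
-EINTR path matters when resuming a previously-interrupted promotion:

#include <stdio.h>

#define PTF_partial_set 1

struct page { unsigned int nr_validated_ptes, partial_flags; };

/* Old behaviour: on preemption, record progress but leave whatever was
 * in partial_flags from the interrupted pass in place. */
static void preempt_old(struct page *pg, unsigned int i)
{
    pg->nr_validated_ptes = i;             /* partial_flags untouched! */
}

/* New behaviour: the common -EINTR exit records progress and clears
 * partial_flags, since entry i has not been touched at all. */
static void preempt_new(struct page *pg, unsigned int i)
{
    pg->nr_validated_ptes = i;
    pg->partial_flags = 0;
}

int main(void)
{
    /* Resume a promotion whose entry 2 was previously left partial. */
    struct page a = { 2, PTF_partial_set }, b = a;

    preempt_old(&a, 3);   /* now claims entry 3 is partially validated */
    preempt_new(&b, 3);   /* makes no claim about entry 3 */

    printf("old: nr=%u flags=%u\n", a.nr_validated_ptes, a.partial_flags);
    printf("new: nr=%u flags=%u\n", b.nr_validated_ptes, b.partial_flags);
    return 0;
}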


["xsa299-4.9/0008-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From 457232af25e037ce4fbd73dc8d6c1f3c830e735b Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/12] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this mean that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
---
 xen/arch/x86/mm.c        | 87 ++++++++++++++++++++++++----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 61 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index fce15f30ca..0ac8c4592d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -777,10 +777,11 @@ static int __get_page_type(struct page_info *page, unsigned long type,
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_pagenr(unsigned long page_nr,
                                          unsigned long type,
@@ -790,7 +791,11 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
@@ -803,13 +808,15 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -832,7 +839,8 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1535,8 +1543,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1581,8 +1589,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1612,8 +1620,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1740,13 +1748,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1755,6 +1772,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else if ( rc < 0 && rc != -EINTR )
         {
             gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1818,17 +1836,21 @@ static int alloc_l3_table(struct page_info *page)
                                                    PGT_l2_page_table |
                                                    PGT_pae_xen_l2,
                                                    d,
-                                                   partial_flags | PTF_preemptible);
+                                                   partial_flags |
+                                                   PTF_preemptible |
+                                                   PTF_retain_ref_on_restart);
         }
         else if ( !is_guest_l3_slot(i) ||
-                  (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
+                  (rc = get_page_from_l3e(pl3e[i], pfn, d,
+                                          partial_flags |
+                                          PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1923,14 +1945,15 @@ static int alloc_l4_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -2029,9 +2052,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -2082,9 +2103,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2115,9 +2134,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 0bf5b60ba8..6ab642eb18 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -150,22 +150,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0
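
Reduced to a bare counter, the appendix scenario comes down to whether
B's reference count can reach zero while A[x] still points at it.  A
hypothetical sketch:

#include <stdio.h>

int main(void)
{
    /* Without the fix: A[x] in the PTF_partial_set state holds no ref,
     * so PGC_allocated is B's only reference. */
    unsigned int b_count = 1;
    b_count--;                          /* V3: decrease_reservation(B) */
    printf("unfixed: B.count=%u -> B freed and re-assigned before "
           "free_l3_table(A) reaches A[x]\n", b_count);

    /* With the fix: the partially validated A[x] holds a general ref. */
    b_count = 2;                        /* PGC_allocated + A[x]'s ref */
    b_count--;                          /* V3: decrease_reservation(B) */
    printf("fixed:   B.count=%u -> B survives until "
           "put_page_from_l3e(A[x])\n", b_count);
    return 0;
}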


["xsa299-4.9/0009-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From a340d52c1537f1d29612d5c17a6ba5e948b10e69 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/12] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0ac8c4592d..ee91c69537 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -772,13 +772,12 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -791,13 +790,10 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
     struct page_info *page = mfn_to_page(page_nr);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_pagenr(page_nr, d)) )
         return -EINVAL;
 
@@ -807,14 +803,14 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -837,7 +833,7 @@ static int get_page_and_type_from_pagenr(unsigned long page_nr,
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1540,13 +1536,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(pfn);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1586,13 +1576,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
@@ -1617,13 +1600,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
@@ -1850,7 +1826,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1953,7 +1929,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -2052,7 +2028,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -2103,7 +2079,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2134,7 +2110,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 6ab642eb18..7b05589dc8 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -145,7 +145,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
          * @nr_validated_ptes holds a general reference count for the
          * page.
          *
@@ -154,23 +154,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -180,8 +177,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299-4.9/0010-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From 9e6caabe4314aee5fb6d405a89212d49d456ce20 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/12] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ee91c69537..19b16de5d1 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1752,7 +1752,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1848,7 +1848,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1939,7 +1939,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0
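
The one-line fix can be seen in a hypothetical before/after sketch
(illustration only, with the refcounting itself reduced away):

#include <stdio.h>

#define PTF_partial_set (1 << 0)

int main(void)
{
    /* The entry came in partially devalidated, so A[x] still holds a
     * general ref; re-promotion of that entry then fails with -EINVAL. */
    unsigned int partial_flags = PTF_partial_set;      /* carried-in state */

    unsigned int before = 0;              /* old code: unconditional zero */
    unsigned int after  = partial_flags;  /* fixed code: preserve state  */

    printf("before: partial_flags=%u -> cleanup forgets the ref (leak)\n",
           before);
    printf("after:  partial_flags=%u -> cleanup drops the ref it holds\n",
           after);
    return 0;
}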


["xsa299-4.9/0011-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From 7df5d74adffc857fefb5d778cfd27a9b8e14dac5 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 11/12] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this page to
  finish off devalidation, and the appropriate refcount adjustments
  will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but, if get_page_type() is called before that put_page_type() ever
runs, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 19b16de5d1..4af82d6447 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1751,6 +1751,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1784,6 +1798,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1835,7 +1850,11 @@ static int alloc_l3_table(struct page_info *page)
             rc = -ERESTART;
         }
         if ( rc < 0 )
+        {
+            /* XSA-299 Backport: Copy l3e for checking */
+            l3e = pl3e[i];
             break;
+        }
 
         adjust_guest_l3e(pl3e[i], d);
     }
@@ -1849,6 +1868,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1945,7 +1982,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl4e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0


["xsa299-4.9/0012-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From 1d017af45011e54c1f40171cae557d0c8d1d4104 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 12/12] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial, and to drop a type count; and has no way of knowing
which is which; and so if in between, PGT_partial is cleared,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
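In outline, the check the final hunk below adds to _put_page_type()
(a condensed sketch; `x` is the current type_info, `flags` the PTF_*
flags passed by the caller):

    if ( !(flags & PTF_partial_set) )
        /* Caller claims a full type ref: the page must be validated
         * (or the owning domain dying), and must not be PGT_partial. */
        BUG_ON((x & PGT_partial) ||
               !((x & PGT_validated) || page_get_owner(page)->is_dying));
    else if ( !(x & PGT_partial) )
        return 0;  /* someone else finished the cleanup; nothing to do */
    else
        BUG_ON((x & PGT_count_mask) != 1);  /* partial => count is 1 */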

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3
entry.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer things such that old_guest_table
for every vcpu a guest has points to the same page.
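
With the fix, V1's restarted cleanup passes PTF_partial_set (derived
from old_guest_table_partial), so the underflow above cannot happen.
Roughly:

    /* V1's restarted put_old_guest_table() (sketch): */
    rc = _put_page_type(v->arch.old_guest_table,
                        PTF_preemptible | PTF_partial_set,
                        v->arch.old_guest_ptpg);
    /* V2 already cleared PGT_partial on A, so _put_page_type()
     * returns 0 without dropping a type ref V1 never held. */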
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 98 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 94 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 1a4f89a6b1..2b0a01d24a 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1350,9 +1350,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4af82d6447..74a1491c33 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1540,10 +1540,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1566,6 +1567,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         int writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(mfn), writeable);
@@ -1578,12 +1580,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(pfn);
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(pfn));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1602,12 +1606,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(pfn);
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(pfn));
+        rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(pfn));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1715,6 +1721,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1769,6 +1783,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1882,12 +1897,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
         {
@@ -1995,12 +2014,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2942,6 +2965,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         if ( unlikely((nx & PGT_count_mask) == 0) )
@@ -3388,17 +3433,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3553,11 +3615,11 @@ int new_guest_cr3(unsigned long mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3828,6 +3890,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3866,6 +3929,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3941,11 +4009,15 @@ long do_mmuext_op(
                     switch ( rc = put_page_and_type_preemptible(page) )
                     {
                     case -EINTR:
-                        rc = -ERESTART;
-                        /* fallthrough */
                     case -ERESTART:
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        /*
+                         * EINTR means we still hold the type ref;
+                         * ERESTART means PGT_partial holds the ref
+                         */
+                        curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                        rc = -ERESTART;
                         break;
                     default:
                         BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 7ea67532dd..a0fe361d5e 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -338,7 +338,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -572,6 +572,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


["xsa299-4.10/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From bc266a68aa014af2cc3ed0a1f55723fdeac2e545 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 01/11] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to -ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the error
code used to indicate that nothing has changed from when you started
the call (which is effectively how alloc_l1_table() handles errors).
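
Taken together, the convention used below is (sketch, with
get_page_from_l2e() as a representative case):

    /*
     * -EINTR:    nothing has changed since the call started; the
     *            caller may simply fail or retry, with no cleanup.
     * -ERESTART: partial work was done; PGT_partial bookkeeping is
     *            in place, and put_page_type() can undo it.
     */
    if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;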

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ce2c082caf..0cbca48a02 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1152,7 +1152,7 @@ get_page_from_l2e(
     int rc;
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l2e(d, l2e) ? -ERESTART : 1;
+        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;
 
     if ( unlikely((l2e_get_flags(l2e) & L2_DISALLOW_MASK)) )
     {
@@ -1188,7 +1188,7 @@ get_page_from_l3e(
     int rc;
 
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l3e(d, l3e) ? -ERESTART : 1;
+        return pv_l1tf_check_l3e(d, l3e) ? -EINTR : 1;
 
     if ( unlikely((l3e_get_flags(l3e) & l3_disallow_mask(d))) )
     {
@@ -1221,7 +1221,7 @@ get_page_from_l4e(
     int rc;
 
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l4e(d, l4e) ? -ERESTART : 1;
+        return pv_l1tf_check_l4e(d, l4e) ? -EINTR : 1;
 
     if ( unlikely((l4e_get_flags(l4e) & L4_DISALLOW_MASK)) )
     {
@@ -1435,7 +1435,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
-- 
2.23.0


["xsa299-4.10/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From fd7bfe9aaee41c589c16c541ec538285dcde1fb2 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/11] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will be
picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
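
Concretely, the error handling in relinquish_memory() becomes
(condensed from the hunk below):

    case -ERESTART:
    case -EINTR:
        if ( ret == -EINTR )
        {
            /* Fully validated again: safe to re-pin. */
            ASSERT(page->u.inuse.type_info & PGT_validated);
            set_bit(_PGT_pinned, &page->u.inuse.type_info);
        }
        else
            /* PGT_partial: drop the ref PGT_pinned used to hold. */
            put_page(page);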

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 91c2b1c21a..897124f05f 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -112,7 +112,7 @@ static void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -1837,9 +1837,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2061,7 +2086,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299-4.10/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From 6bad09c708d906922fb59d7e2c06d5de9a633ca3 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/11] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
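
The conversion amounts to the following mapping (hypothetical helper
for illustration only, not part of the patch; inferred from the
substitutions in the hunks below):

    static unsigned int partial_pte_to_flags(int partial_pte)
    {
        if ( partial_pte == 0 )   /* un-validated */
            return 0;
        if ( partial_pte > 0 )    /* partial, no general ref */
            return PTF_partial_set;
        /* partial, general ref held */
        return PTF_partial_set | PTF_partial_general_ref;
    }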

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 165 ++++++++++++++++++++++++---------------
 xen/include/asm-x86/mm.h |  41 +++++++---
 2 files changed, 128 insertions(+), 78 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0cbca48a02..84ee48ec3f 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -651,20 +651,34 @@ static int alloc_segdesc_page(struct page_info *page)
 static int __get_page_type(struct page_info *page, unsigned long type,
                            int preemptible);
 
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
-    int partial, int preemptible)
+    unsigned int flags)
 {
     struct page_info *page = mfn_to_page(mfn);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1146,7 +1160,7 @@ get_page_from_l1e(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1163,8 +1177,9 @@ get_page_from_l2e(
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PSE) )
     {
-        rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d,
-                                        partial, false);
+        ASSERT(!(flags & PTF_preemptible));
+
+        rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d, flags);
         if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
             rc = 0;
         return rc;
@@ -1183,7 +1198,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1198,7 +1213,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l3e_get_mfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_mfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1216,7 +1231,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1231,7 +1246,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l4e_get_mfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_mfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1306,7 +1321,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1326,12 +1341,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1348,7 +1364,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1371,13 +1387,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
@@ -1392,7 +1409,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1401,13 +1418,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
@@ -1514,12 +1532,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1529,18 +1548,19 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         }
 
         if ( !is_guest_l2_slot(d, type, i) ||
-             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1549,7 +1569,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1573,7 +1593,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1588,7 +1609,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1605,20 +1626,22 @@ static int alloc_l3_table(struct page_info *page)
             else
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(pl3e[i]),
-                    PGT_l2_page_table | PGT_pae_xen_l2, d, partial, 1);
+                    PGT_l2_page_table | PGT_pae_xen_l2, d,
+                    partial_flags | PTF_preemptible);
         }
-        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial)) > 0 )
+        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1635,7 +1658,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1767,19 +1790,21 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1789,7 +1814,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1842,19 +1867,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1876,12 +1902,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1893,18 +1921,19 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
-        rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+        rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
         if ( rc == 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
 
@@ -1923,12 +1952,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -1939,26 +1970,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2180,7 +2214,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, pfn, 0, true);
+    put_page_from_l2e(ol2e, pfn, PTF_defer);
 
     return rc;
 }
@@ -2248,7 +2282,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, pfn, 0, 1);
+    put_page_from_l3e(ol3e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2311,7 +2345,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, pfn, 0, 1);
+    put_page_from_l4e(ol4e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2577,7 +2611,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -2862,7 +2896,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3037,7 +3071,8 @@ int new_guest_cr3(mfn_t mfn)
 
     rc = paging_mode_refcounts(d)
          ? (get_page_from_mfn(mfn, d) ? 0 : -EINVAL)
-         : get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, 0, 1);
+         : get_page_and_type_from_mfn(mfn, PGT_root_page_table, d,
+                                      PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3420,7 +3455,7 @@ long do_mmuext_op(
             if ( op.arg1.mfn != 0 )
             {
                 rc = get_page_and_type_from_mfn(
-                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, 0, 1);
+                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 1030b8b5e6..a531fe3115 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -157,19 +157,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page
+         * at @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -180,7 +195,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299-4.10/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From 255ad8804c79dc874322a7060ae0615305bcb8e8 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/11] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is in mainly in preparation for _put_page_type taking the
partial_flags value in the future.  It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 84ee48ec3f..e3264f8879 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1253,7 +1253,7 @@ get_page_from_l4e(
     return rc;
 }
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1345,7 +1345,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1354,7 +1354,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1391,7 +1391,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     }
 
     if ( flags & PTF_defer )
@@ -1401,7 +1401,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1422,7 +1422,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         }
 
         if ( flags & PTF_defer )
@@ -1432,7 +1432,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2680,11 +2680,12 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
     int rc = 0;
+    bool preemptible = flags & PTF_preemptible;
 
     for ( ; ; )
     {
@@ -2884,7 +2885,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(iommu_ret) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 rc = iommu_ret;
                 goto out;
             }
@@ -2911,7 +2912,7 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -2927,7 +2928,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -2943,7 +2944,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0


["xsa299-4.10/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From 36ce2b6e246d41ebaeb994dbf2b4e0e4555893bf Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/11] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandles the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e3264f8879..ce7f5b84f3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -678,8 +678,43 @@ static int get_page_and_type_from_mfn(
 
     rc = __get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0


["xsa299-4.10/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From 180f638fb5047c478ca32b15dd2ba9ba0ce43623 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/11] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count is
held by the entry; if set, it's a partial demotion => a general
reference count is held.  Because promotions and demotions can be
interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
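
To make the invariant concrete, here is a minimal stand-alone sketch
(illustrative only, not code from this series; the names mirror the
Xen fields):

    #include <stdbool.h>

    #define PTF_partial_set         (1 << 0)
    #define PTF_partial_general_ref (1 << 1)

    struct pt_state {
        unsigned int nr_validated_ptes; /* [0, n) fully validated */
        unsigned int partial_flags;     /* state of entry [n] */
    };

    /*
     * Under the rule above: entries [0, nr_validated_ptes) always
     * hold a general ref; entry [nr_validated_ptes] holds one only
     * if it was interrupted mid-demotion.
     */
    static bool entry_holds_general_ref(const struct pt_state *s)
    {
        return s->partial_flags & PTF_partial_general_ref;
    }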

Unfortunately, when the alloc_l[23]_table() functions check
hypercall_preempt_check() and return -ERESTART, they set
nr_entries_validated, but don't clear partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index ce7f5b84f3..9b9b67cd74 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1576,13 +1576,8 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) ||
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) ||
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
@@ -1647,13 +1642,8 @@ static int alloc_l3_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
                  (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) )
-- 
2.23.0


["xsa299-4.10/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From 29f56f0e7c11a299da497c866b4c76ebbc862045 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/11] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, get_page_from_l2e()
can't return -ERESTART.
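
Condensed, the retention rule get_page_and_type_from_mfn() ends up
implementing looks like this (a sketch of the hunk below, with error
handling and type logic elided):

    #include <stdbool.h>

    #define ERESTART 85  /* stand-in; Xen's errno extension defines this */

    enum ref_action { REF_KEEP, REF_DROP };

    /* Sketch only; mirrors the decision in the hunk below. */
    static enum ref_action retention_rule(int rc, bool partial_ref,
                                          bool retain_ref,
                                          bool is_old_guest_table)
    {
        if ( rc == 0 || partial_ref )
            return REF_KEEP;   /* success, or caller came in with a ref */
        if ( is_old_guest_table || (retain_ref && rc == -ERESTART) )
            return REF_KEEP;   /* keep it so tear-down/restart is safe */
        return REF_DROP;       /* outright failure: drop the ref */
    }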

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was freed; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
---
 xen/arch/x86/mm.c        | 84 +++++++++++++++++++++++-----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9b9b67cd74..2f185a3cd3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -658,10 +658,11 @@ static int __get_page_type(struct page_info *page, unsigned long type,
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
@@ -670,7 +671,11 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
@@ -683,13 +688,15 @@ static int get_page_and_type_from_mfn(
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -712,7 +719,8 @@ static int get_page_and_type_from_mfn(
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1379,8 +1387,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1425,8 +1433,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1456,8 +1464,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1581,13 +1589,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1596,6 +1613,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else if ( rc < 0 && rc != -EINTR )
         {
             gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1652,16 +1670,17 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(pl3e[i]),
                     PGT_l2_page_table | PGT_pae_xen_l2, d,
-                    partial_flags | PTF_preemptible);
+                    partial_flags | PTF_preemptible | PTF_retain_ref_on_restart);
         }
-        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
+        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1822,14 +1841,15 @@ static int alloc_l4_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -1927,9 +1947,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1977,9 +1995,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2010,9 +2026,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index a531fe3115..74b0246c02 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -167,22 +167,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0


["xsa299-4.10/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From 140c8876835a134daf507d6c60bdcdf9126f166f Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/11] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2f185a3cd3..693791331a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -653,13 +653,12 @@ static int __get_page_type(struct page_info *page, unsigned long type,
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -671,13 +670,10 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
@@ -687,14 +683,14 @@ static int get_page_and_type_from_mfn(
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -717,7 +713,7 @@ static int get_page_and_type_from_mfn(
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1384,13 +1380,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1430,13 +1420,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1461,13 +1444,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1680,7 +1656,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1849,7 +1825,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1947,7 +1923,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1995,7 +1971,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2026,7 +2002,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 74b0246c02..704345335c 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -162,7 +162,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
          * @nr_validated_ptef holds a general reference count for the
          * page.
          *
@@ -171,23 +171,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -197,8 +194,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299-4.10/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From 203bc967574c7c5a06ed6bb452a9761f46dce724 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/11] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by
interleaving promotion and demotion of a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.
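
In outline, the error path changes as follows (a fragment sketch, not
the full context of the hunks below):

    /* Error path in alloc_lN_table(), sketched: */
    page->nr_validated_ptes = i;
    page->partial_flags = partial_flags;  /* was "= 0": if the entry
                                           * came in with PTF_partial_set,
                                           * zeroing this leaks the
                                           * general ref it still holds */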

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 693791331a..300f147e98 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1593,7 +1593,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1678,7 +1678,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1835,7 +1835,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0


["xsa299-4.10/0010-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From 45242b9057b4feccb837362f39e0eb97dc0093c8 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/11] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this
  page to finish off de-validation, and the appropriate refcount
  adjustments will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial_flags is 0, this will correctly resume calling put_page_type()
at [x]; but if put_page_type() is never called, and get_page_type() is
called instead, validation will pick up at [x+1], neglecting to
validate [x].  If the rest of the validation succeeds, the l4 will be
validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table() shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.
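
Schematically, the fix looks like this (a sketch; lNe_get_page() and
plNe stand in for the l3/l4 variants in the actual hunks):

    if ( current->arch.old_guest_table )
    {
        /* Transfer the general ref held on behalf of old_guest_table
         * to the partial entry -- but only if it really is the page
         * at [nr_validated_ptes]. */
        if ( current->arch.old_guest_table == lNe_get_page(plNe[i]) )
            page->partial_flags = PTF_partial_set;
        else
            ASSERT_UNREACHABLE();  /* safer to leak a type ref */
    }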

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 300f147e98..2ea32463a8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1592,6 +1592,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1619,6 +1633,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1665,7 +1680,11 @@ static int alloc_l3_table(struct page_info *page)
             rc = -ERESTART;
         }
         if ( rc < 0 )
+        {
+            /* XSA-299 Backport: Copy l3e for checking */
+            l3e = pl3e[i];
             break;
+        }
 
         pl3e[i] = adjust_guest_l3e(pl3e[i], d);
     }
@@ -1679,6 +1698,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1841,7 +1878,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl3e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0


["xsa299-4.10/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From 4905f7fbaa60f75df063305c9532fb63b77deab9 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 11/11] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PTF_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again, at which point put_page_type() may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is called both to clean up
PGT_partial and to drop a type count, but has no way of knowing which
is which; so if PGT_partial is cleared in between, put_page_type()
will erroneously drop the type ref.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
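
In sketch form, the mapping at the put_old_guest_table() call site
becomes (condensed from the hunk below):

    /* old_guest_table_partial selects the caller's intent: */
    rc = _put_page_type(v->arch.old_guest_table,
                        PTF_preemptible |
                        (v->arch.old_guest_table_partial ?
                         PTF_partial_set : 0),  /* cleanup vs. ref drop */
                        v->arch.old_guest_ptpg);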

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer things so that the old_guest_table
of every vcpu a guest has points to the same page.
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 99 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 897124f05f..6074fa5947 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1075,9 +1075,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 2ea32463a8..9ae71d864a 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1384,10 +1384,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1410,6 +1411,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         int writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(_mfn(mfn)), writeable);
@@ -1422,12 +1424,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1446,12 +1450,15 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, flags | PTF_preemptible,
+                            mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1556,6 +1563,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1610,6 +1625,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1712,12 +1728,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
@@ -1891,12 +1911,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2760,6 +2784,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         if ( unlikely((nx & PGT_count_mask) == 0) )
@@ -3012,17 +3058,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3175,11 +3238,11 @@ int new_guest_cr3(mfn_t mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3448,6 +3511,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3482,6 +3546,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3550,11 +3619,15 @@ long do_mmuext_op(
                 switch ( rc = put_page_and_type_preemptible(page) )
                 {
                 case -EINTR:
-                    rc = -ERESTART;
-                    /* fallthrough */
                 case -ERESTART:
                     curr->arch.old_guest_ptpg = NULL;
                     curr->arch.old_guest_table = page;
+                    /*
+                     * EINTR means we still hold the type ref;
+                     * ERESTART means PGT_partial holds the ref
+                     */
+                    curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                    rc = -ERESTART;
                     break;
                 default:
                     BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index aec65630d9..5afaf6b9de 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -311,7 +311,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -550,6 +550,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


["xsa299-4.11/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From 852df269d247e177d5f2e9b8f3a4301a6fdd76bd Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 01/11] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).

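As an illustration of the convention being restored here (a
hypothetical caller sketch, not code from the Xen tree), the two codes
promise the caller the following:

    #include <errno.h>
    #include <stdbool.h>

    /*
     * Hypothetical sketch: how a preemptible validation caller is
     * meant to interpret the two codes distinguished above.
     */
    static int handle_validation_rc(int rc, bool *partial_state)
    {
        switch ( rc )
        {
        case 0:          /* fully validated */
            *partial_state = false;
            return 0;
        case -EINTR:     /* nothing has changed since the call began */
            *partial_state = false;
            return -ERESTART;   /* safe to simply restart from scratch */
        case -ERESTART:  /* partial work done; PGT_partial records it */
            *partial_state = true;
            return -ERESTART;   /* restart resumes, or put_page_type()
                                 * cleanly undoes the partial work */
        default:
            return rc;          /* hard failure; caller de-validates */
        }
    }
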
mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e6a4cb28f8..8ced185b49 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1110,7 +1110,7 @@ get_page_from_l2e(
     int rc;
 
     if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l2e(d, l2e) ? -ERESTART : 1;
+        return pv_l1tf_check_l2e(d, l2e) ? -EINTR : 1;
 
     if ( unlikely((l2e_get_flags(l2e) & L2_DISALLOW_MASK)) )
     {
@@ -1142,7 +1142,7 @@ get_page_from_l3e(
     int rc;
 
     if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l3e(d, l3e) ? -ERESTART : 1;
+        return pv_l1tf_check_l3e(d, l3e) ? -EINTR : 1;
 
     if ( unlikely((l3e_get_flags(l3e) & l3_disallow_mask(d))) )
     {
@@ -1175,7 +1175,7 @@ get_page_from_l4e(
     int rc;
 
     if ( !(l4e_get_flags(l4e) & _PAGE_PRESENT) )
-        return pv_l1tf_check_l4e(d, l4e) ? -ERESTART : 1;
+        return pv_l1tf_check_l4e(d, l4e) ? -EINTR : 1;
 
     if ( unlikely((l4e_get_flags(l4e) & L4_DISALLOW_MASK)) )
     {
@@ -1404,7 +1404,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
-- 
2.23.0


["xsa299-4.11/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From 6bdddd7980eac0cc883945d823986f24682ca47a Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/11] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will
be picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 29f892c04c..8fbecbb169 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -112,7 +112,7 @@ static void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -1838,9 +1838,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2062,7 +2087,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299-4.11/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From 7c0a37005f52d10903ce22851b52ae9b6f4f0ee2 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/11] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).

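To make the renaming concrete, here is a minimal sketch of how the old
tri-state maps onto the two new bits (flag values as defined in the
patch below; the helper function itself is purely illustrative):

    #define PTF_partial_set         (1 << 0)
    #define PTF_partial_general_ref (1 << 1)

    /* Illustrative only: translate the old partial_pte tri-state. */
    static unsigned int partial_pte_to_flags(int partial_pte)
    {
        if ( partial_pte == 0 )     /* entry not validated at all */
            return 0;
        if ( partial_pte > 0 )      /* partial, no general ref held */
            return PTF_partial_set;
        /* negative: partial, and a general ref is held */
        return PTF_partial_set | PTF_partial_general_ref;
    }
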
Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit uglier in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 164 +++++++++++++++++++++++----------------
 xen/include/asm-x86/mm.h |  41 ++++++----
 2 files changed, 127 insertions(+), 78 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 8ced185b49..1c4f54e328 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -610,20 +610,34 @@ static int alloc_segdesc_page(struct page_info *page)
 static int _get_page_type(struct page_info *page, unsigned long type,
                           bool preemptible);
 
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
-    int partial, int preemptible)
+    unsigned int flags)
 {
     struct page_info *page = mfn_to_page(mfn);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1104,7 +1118,7 @@ get_page_from_l1e(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1119,8 +1133,9 @@ get_page_from_l2e(
         return -EINVAL;
     }
 
-    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d,
-                                    partial, false);
+    ASSERT(!(flags & PTF_preemptible));
+
+    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d, flags);
     if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
         rc = 0;
 
@@ -1137,7 +1152,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1152,7 +1167,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l3e_get_mfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_mfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1170,7 +1185,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1185,7 +1200,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l4e_get_mfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_mfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1275,7 +1290,7 @@ void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1295,12 +1310,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1317,7 +1333,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1340,13 +1356,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
@@ -1361,7 +1378,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1370,13 +1387,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
@@ -1483,12 +1501,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1498,18 +1517,19 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         }
 
         if ( !is_guest_l2_slot(d, type, i) ||
-             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1518,7 +1538,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1542,7 +1562,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1557,7 +1578,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
         {
@@ -1574,20 +1595,22 @@ static int alloc_l3_table(struct page_info *page)
             else
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(pl3e[i]),
-                    PGT_l2_page_table | PGT_pae_xen_l2, d, partial, 1);
+                    PGT_l2_page_table | PGT_pae_xen_l2, d,
+                    partial_flags | PTF_preemptible);
         }
-        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial)) > 0 )
+        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1604,7 +1627,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1736,19 +1759,21 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1758,7 +1783,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1811,19 +1836,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1845,12 +1871,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1862,18 +1890,19 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
-        rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+        rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
         if ( rc == 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
 
@@ -1892,12 +1921,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -1908,26 +1939,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2203,7 +2237,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, pfn, 0, true);
+    put_page_from_l2e(ol2e, pfn, PTF_defer);
 
     return rc;
 }
@@ -2271,7 +2305,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, pfn, 0, 1);
+    put_page_from_l3e(ol3e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2334,7 +2368,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, pfn, 0, 1);
+    put_page_from_l4e(ol4e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2598,7 +2632,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -2889,7 +2923,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3064,7 +3098,7 @@ int new_guest_cr3(mfn_t mfn)
         return 0;
     }
 
-    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, 0, 1);
+    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3452,7 +3486,7 @@ long do_mmuext_op(
             if ( op.arg1.mfn != 0 )
             {
                 rc = get_page_and_type_from_mfn(
-                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, 0, 1);
+                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 1ea173c555..46cba52941 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -228,19 +228,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page
+         * at @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -251,7 +266,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299-4.11/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From 20b8a6702c6839bafd252789396b443d4b5c5474 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/11] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is mainly in preparation for _put_page_type taking the
partial_flags value in the future.  It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).

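For example (call sites taken from the diff below; the contrast is the
point rather than the exact lines):

    _put_page_type(pg, true, ptpg);            /* before: 'true' is opaque */
    _put_page_type(pg, PTF_preemptible, ptpg); /* after: intent is explicit */
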
No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 1c4f54e328..e2fba15d86 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1207,7 +1207,7 @@ get_page_from_l4e(
     return rc;
 }
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1314,7 +1314,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1323,7 +1323,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1360,7 +1360,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     }
 
     if ( flags & PTF_defer )
@@ -1370,7 +1370,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1391,7 +1391,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         }
 
         if ( flags & PTF_defer )
@@ -1401,7 +1401,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2701,10 +2701,11 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
+    bool preemptible = flags & PTF_preemptible;
 
     ASSERT(current_locked_page_ne_check(page));
 
@@ -2911,7 +2912,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(iommu_ret) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 rc = iommu_ret;
                 goto out;
             }
@@ -2938,7 +2939,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -2955,7 +2956,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -2972,7 +2973,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0


["xsa299-4.11/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From 7b3f9f9a797459902bebba962e31be5cbfe7b515 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/11] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandles the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index e2fba15d86..eaf7b14245 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -637,8 +637,43 @@ static int get_page_and_type_from_mfn(
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0


["xsa299-4.11/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From d28893777be56ef51562ed32502377974f738fd3 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/11] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

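A hypothetical model of that bookkeeping rule (struct and helper are
illustrative, not from the Xen tree):

    #include <stdbool.h>

    #define PTF_partial_set         (1 << 0)
    #define PTF_partial_general_ref (1 << 1)

    struct pt_progress {
        unsigned int nr_validated_ptes;  /* a.k.a. nr_entries_validated */
        unsigned int partial_flags;
    };

    /* Does the entry at index idx hold a general reference? */
    static bool entry_holds_general_ref(const struct pt_progress *p,
                                        unsigned int idx)
    {
        if ( idx < p->nr_validated_ptes )   /* fully validated entries */
            return true;
        if ( idx == p->nr_validated_ptes )  /* described by the flags */
            return p->partial_flags & PTF_partial_general_ref;
        return false;                       /* not reached yet */
    }
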
At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when the alloc_l[23]_table functions find
hypercall_preempt_check() true and return -ERESTART, they set
nr_entries_validated, but don't clear partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_entries_validated] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref set; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_entries_validated].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 18 ++++--------------
 1 file changed, 4 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index eaf7b14245..053465cb7c 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1545,13 +1545,8 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) ||
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) ||
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
@@ -1616,13 +1611,8 @@ static int alloc_l3_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(pl3e[i]) & _PAGE_PRESENT) ||
                  (l3e_get_flags(pl3e[i]) & l3_disallow_mask(d)) )
-- 
2.23.0


["xsa299-4.11/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From f608a53c25806a7a4318cbe225bc5f5bbf154d69 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/11] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

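The flag in question is PTF_retain_ref_on_restart; a condensed
contrast of the two caller classes, adapted from the diff below:

    /* alloc_lN_table(): will resume later, so keep the ref on -ERESTART */
    rc = get_page_from_l4e(pl4e[i], pfn, d,
                           partial_flags | PTF_retain_ref_on_restart);

    /* new_cr3() and similar one-shot callers do not pass the flag */
    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d,
                                    PTF_preemptible);
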
NB that since l1 promotion can't be preempted, get_page_from_l2e()
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was cleared; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
---
 xen/arch/x86/mm.c        | 84 +++++++++++++++++++++++-----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 053465cb7c..68a9e74002 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -617,10 +617,11 @@ static int _get_page_type(struct page_info *page, unsigned long type,
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
@@ -629,7 +630,11 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
@@ -642,13 +647,15 @@ static int get_page_and_type_from_mfn(
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -671,7 +678,8 @@ static int get_page_and_type_from_mfn(
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1348,8 +1356,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1394,8 +1402,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1425,8 +1433,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1550,13 +1558,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
              (rc = get_page_from_l2e(pl2e[i], pfn, d, partial_flags)) > 0 )
             continue;
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1565,6 +1582,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else if ( rc < 0 && rc != -EINTR )
         {
             gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1621,16 +1639,17 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(pl3e[i]),
                     PGT_l2_page_table | PGT_pae_xen_l2, d,
-                    partial_flags | PTF_preemptible);
+                    partial_flags | PTF_preemptible | PTF_retain_ref_on_restart);
         }
-        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d, partial_flags)) > 0 )
+        else if ( (rc = get_page_from_l3e(pl3e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1791,14 +1810,15 @@ static int alloc_l4_table(struct page_info *page)
           i++, partial_flags = 0 )
     {
         if ( !is_guest_l4_slot(d, i) ||
-             (rc = get_page_from_l4e(pl4e[i], pfn, d, partial_flags)) > 0 )
+             (rc = get_page_from_l4e(pl4e[i], pfn, d,
+                               partial_flags | PTF_retain_ref_on_restart)) > 0 )
             continue;
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -1896,9 +1916,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1946,9 +1964,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -1979,9 +1995,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 46cba52941..dc9cb869dd 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -238,22 +238,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0


["xsa299-4.11/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From 6811df7fb7a1d4bb5a75fec9cf41519b5c86c605 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/11] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 68a9e74002..4970b19aff 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -612,13 +612,12 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -630,13 +629,10 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
@@ -646,14 +642,14 @@ static int get_page_and_type_from_mfn(
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -676,7 +672,7 @@ static int get_page_and_type_from_mfn(
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1353,13 +1349,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1399,13 +1389,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1430,13 +1413,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1649,7 +1625,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1818,7 +1794,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1916,7 +1892,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1964,7 +1940,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -1995,7 +1971,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index dc9cb869dd..c6ba9e4d73 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -233,7 +233,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
          * @nr_validated_ptef holds a general reference count for the
          * page.
          *
@@ -242,23 +242,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -268,8 +265,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299-4.11/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From a6098b8920b02149220641cb13358e9012b5fc4d Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/11] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated, and a general reference count is held.
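
(An aside, not part of the patch: the invariant above can be modelled
in a few lines of standalone C.  All names here are simplified
stand-ins for Xen's real structures, purely for illustration.)

    #include <stdbool.h>

    #define PTF_partial_set (1 << 0)

    struct model_page {
        unsigned int nr_validated_ptes; /* [0, n) fully validated */
        unsigned int partial_flags;     /* state of entry [n] */
    };

    /* Entry [i] holds a general reference count iff it is fully
     * validated, or it is entry [n] and PTF_partial_set is set. */
    static bool entry_holds_general_ref(const struct model_page *pg,
                                        unsigned int i)
    {
        if ( i < pg->nr_validated_ptes )
            return true;
        return i == pg->nr_validated_ptes &&
               (pg->partial_flags & PTF_partial_set);
    }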

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by
interleaving promotion and demotion of a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4970b19aff..cfb7538403 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1562,7 +1562,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1647,7 +1647,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1804,7 +1804,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0


["xsa299-4.11/0010-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From eabd77b59f4006128501d6e15f9e620dfb349420 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/11] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this
  page to finish off de-validation, and the appropriate refcount
  adjustments will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but if put_page_type() is never called and get_page_type() is called
instead, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.
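
(For illustration only: a compilable sketch of the intended error
path, using simplified stand-in types rather than Xen's own; the
authoritative change is in the diff below.)

    #include <stddef.h>

    #define PTF_partial_set (1 << 0)

    struct model_page {
        unsigned int nr_validated_ptes;
        unsigned int partial_flags;
    };

    struct model_vcpu_arch {
        struct model_page *old_guest_table;
        struct model_page *old_guest_ptpg;
    };

    static void on_validation_failure(struct model_page *page,
                                      struct model_vcpu_arch *arch,
                                      unsigned int i,
                                      unsigned int partial_flags)
    {
        page->nr_validated_ptes = i;
        page->partial_flags = partial_flags;
        if ( arch->old_guest_table )
            /*
             * Entry [i] is partially validated and still holds a
             * general ref; mark it so de-validation drops that ref.
             */
            page->partial_flags = PTF_partial_set;
        arch->old_guest_ptpg = NULL;
        arch->old_guest_table = page;
    }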

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table() shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 55 ++++++++++++++++++++++++++++++++++++++++++++++-
 1 file changed, 54 insertions(+), 1 deletion(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index cfb7538403..aa03cb8b40 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1561,6 +1561,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1588,6 +1602,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1634,7 +1649,11 @@ static int alloc_l3_table(struct page_info *page)
             rc = -ERESTART;
         }
         if ( rc < 0 )
+        {
+            /* XSA-299 Backport: Copy l3e for checking */
+            l3e = pl3e[i];
             break;
+        }
 
         pl3e[i] = adjust_guest_l3e(pl3e[i], d);
     }
@@ -1648,6 +1667,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1810,7 +1847,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl3e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0


["xsa299-4.11/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From f0086e3ac65c8bcabb84c1c29ab00b0c8a187555 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 11/11] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PTF_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again: at which point the put_page_type may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial and to drop a type count, and has no way of knowing
which is which; so if PGT_partial is cleared in between,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
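
(For illustration: a compilable model of the intended dispatch at the
top of _put_page_type(), with stand-in bit positions; the real check
is in the hunk below.)

    #include <assert.h>
    #include <stdbool.h>

    #define PTF_partial_set (1u << 0)
    #define PGT_partial     (1u << 27) /* stand-in bit positions */
    #define PGT_validated   (1u << 28)
    #define PGT_count_mask  0xffu

    /* Returns true if a type count should actually be dropped. */
    static bool model_put_check(unsigned int x, unsigned int flags,
                                bool owner_dying)
    {
        if ( !(flags & PTF_partial_set) )
            /*
             * Caller claims a full type ref: the page must be fully
             * validated (or its domain dying), and never PGT_partial.
             */
            assert(!(x & PGT_partial) &&
                   ((x & PGT_validated) || owner_dying));
        else if ( !(x & PGT_partial) )
            return false; /* someone else finished the cleanup */
        else
            assert((x & PGT_count_mask) == 1);

        return true;
    }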

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().
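
(A sketch of how the new flag is meant to feed _put_page_type(); the
constants are stand-ins, and the real call is in put_old_guest_table()
below.)

    #include <stdbool.h>

    #define PTF_partial_set (1 << 0)
    #define PTF_preemptible (1 << 2)

    /*
     * A partially de/validated table needs PTF_partial_set so that
     * only cleanup is done; a fully validated one holds a real type
     * ref, which must genuinely be dropped.
     */
    static unsigned int old_guest_table_flags(bool partial)
    {
        return PTF_preemptible | (partial ? PTF_partial_set : 0);
    }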

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to arrange for the old_guest_table of every
vcpu a guest has to point to the same page.
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 99 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 8fbecbb169..c880568dd4 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1074,9 +1074,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index aa03cb8b40..c701c7ef14 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1353,10 +1353,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1379,6 +1380,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         int writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(_mfn(mfn)), writeable);
@@ -1391,12 +1393,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1415,12 +1419,15 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, flags | PTF_preemptible,
+                            mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1525,6 +1532,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1579,6 +1594,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1681,12 +1697,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
@@ -1860,12 +1880,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(pl4e[i]) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2782,6 +2806,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         switch ( nx & (PGT_locked | PGT_count_mask) )
@@ -3041,17 +3087,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3201,11 +3264,11 @@ int new_guest_cr3(mfn_t mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3479,6 +3542,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3513,6 +3577,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3581,11 +3650,15 @@ long do_mmuext_op(
                 switch ( rc = put_page_and_type_preemptible(page) )
                 {
                 case -EINTR:
-                    rc = -ERESTART;
-                    /* fallthrough */
                 case -ERESTART:
                     curr->arch.old_guest_ptpg = NULL;
                     curr->arch.old_guest_table = page;
+                    /*
+                     * EINTR means we still hold the type ref;
+                     * ERESTART means PGT_partial holds the ref
+                     */
+                    curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                    rc = -ERESTART;
                     break;
                 default:
                     BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 1ac5a96c08..360c38bd83 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -309,7 +309,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -542,6 +542,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


["xsa299-4.12/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From 33d051917d5ef38f678b507a3c832afde48b9b49 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 01/11] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always de-validates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to -ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).
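
(A hedged illustration of the error-code convention this relies on;
the constants are stand-ins, not Xen's errno definitions.)

    #include <stdbool.h>

    #define EINTR    4  /* stand-in: nothing changed, just retry */
    #define ERESTART 85 /* stand-in: partially done, PGT_partial set */

    /*
     * What an lNe L1TF check should yield: -EINTR, because the entry
     * has not been validated at all, so there is nothing for a later
     * put_page_type() to undo.
     */
    static int model_l1tf_check(bool l1tf_hit)
    {
        return l1tf_hit ? -EINTR : 0;
    }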

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3557cd1178..a1b55c10ff 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1409,7 +1409,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
@@ -1517,7 +1517,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         {
             if ( !pv_l1tf_check_l2e(d, l2e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l2e(l2e, pfn, d, partial);
@@ -1603,7 +1603,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             if ( !pv_l1tf_check_l3e(d, l3e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l3e(l3e, pfn, d, partial);
@@ -1783,7 +1783,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             if ( !pv_l1tf_check_l4e(d, l4e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l4e(l4e, pfn, d, partial);
-- 
2.23.0


["xsa299-4.12/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From b490792c18f74b76ec8161721c1e07f810e36309 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/11] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will be
picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
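
(For illustration, a compilable sketch of the decision being made
here, with stand-in error values; the real logic is in the
relinquish_memory() hunk below.)

    #include <stdbool.h>

    #define EINTR    4  /* stand-in error values */
    #define ERESTART 85

    struct unpin_action {
        bool reset_pinned; /* restore PGT_pinned for the next pass */
        bool drop_pin_ref; /* drop the general ref PGT_pinned held */
    };

    static struct unpin_action on_preempted_unpin(int rc)
    {
        /*
         * -EINTR: the page is back to fully validated, so the pin
         * (and the ref it holds) must be restored.  -ERESTART:
         * PGT_partial has taken over de-validation; only the ref
         * that PGT_pinned held needs dropping.
         */
        struct unpin_action a = {
            .reset_pinned = (rc == -EINTR),
            .drop_pin_ref = (rc == -ERESTART),
        };
        return a;
    }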

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 2585327834..59df8a6d8d 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -114,7 +114,7 @@ static void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -1909,9 +1909,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2161,7 +2186,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299-4.12/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From 0f9f61e5737fdd346550ec6e30161fa99e4653fa Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/11] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
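
(A compilable sketch of the mapping from the old tristate to the new
flags, using the names introduced below.)

    #define PTF_partial_set         (1 << 0)
    #define PTF_partial_general_ref (1 << 1)

    /*
     * Old encoding: 0 = un-validated; positive = partial, no general
     * ref held; negative = partial, general ref held.
     */
    static unsigned int partial_pte_to_flags(int partial_pte)
    {
        if ( partial_pte == 0 )
            return 0;
        if ( partial_pte > 0 )
            return PTF_partial_set;
        return PTF_partial_set | PTF_partial_general_ref;
    }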

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 165 ++++++++++++++++++++++++---------------
 xen/include/asm-x86/mm.h |  41 +++++++---
 2 files changed, 128 insertions(+), 78 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a1b55c10ff..3f6f8cc9b8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1094,20 +1094,35 @@ get_page_from_l1e(
 }
 
 #ifdef CONFIG_PV
+
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
-    int partial, int preemptible)
+    unsigned int flags)
 {
     struct page_info *page = mfn_to_page(mfn);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1117,7 +1132,7 @@ static int get_page_and_type_from_mfn(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1129,8 +1144,9 @@ get_page_from_l2e(
         return -EINVAL;
     }
 
-    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d,
-                                    partial, false);
+    ASSERT(!(flags & PTF_preemptible));
+
+    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d, flags);
     if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
         rc = 0;
 
@@ -1140,7 +1156,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1152,7 +1168,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l3e_get_mfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_mfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1164,7 +1180,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1176,7 +1192,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l4e_get_mfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_mfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1277,7 +1293,7 @@ static void put_data_page(struct page_info *page, bool writeable)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1300,12 +1316,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1322,7 +1339,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1345,13 +1362,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
@@ -1366,7 +1384,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1375,13 +1393,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
@@ -1492,12 +1511,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l2_pgentry_t l2e;
 
@@ -1520,17 +1540,18 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l2e(l2e, pfn, d, partial);
+            rc = get_page_from_l2e(l2e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1539,7 +1560,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1563,7 +1584,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1578,7 +1600,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l3_pgentry_t l3e = pl3e[i];
 
@@ -1597,7 +1619,8 @@ static int alloc_l3_table(struct page_info *page)
             else
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(l3e),
-                    PGT_l2_page_table | PGT_pae_xen_l2, d, partial, 1);
+                    PGT_l2_page_table | PGT_pae_xen_l2, d,
+                    partial_flags | PTF_preemptible);
         }
         else if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
         {
@@ -1606,17 +1629,18 @@ static int alloc_l3_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l3e(l3e, pfn, d, partial);
+            rc = get_page_from_l3e(l3e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1633,7 +1657,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1767,10 +1791,11 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l4_pgentry_t l4e;
 
@@ -1786,12 +1811,13 @@ static int alloc_l4_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l4e(l4e, pfn, d, partial);
+            rc = get_page_from_l4e(l4e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1801,7 +1827,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1853,19 +1879,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1887,12 +1914,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1904,18 +1933,19 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
-        rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+        rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
         if ( rc == 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
 
@@ -1934,12 +1964,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -1950,26 +1982,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2247,7 +2282,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, pfn, 0, true);
+    put_page_from_l2e(ol2e, pfn, PTF_defer);
 
     return rc;
 }
@@ -2315,7 +2350,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, pfn, 0, 1);
+    put_page_from_l3e(ol3e, pfn, PTF_defer);
     return rc;
 }
 
@@ -2378,7 +2413,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, pfn, 0, 1);
+    put_page_from_l4e(ol4e, pfn, PTF_defer);
     return rc;
 }
 #endif /* CONFIG_PV */
@@ -2649,7 +2684,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -2946,7 +2981,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3122,7 +3157,7 @@ int new_guest_cr3(mfn_t mfn)
         return 0;
     }
 
-    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, 0, 1);
+    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3473,7 +3508,7 @@ long do_mmuext_op(
             if ( op.arg1.mfn != 0 )
             {
                 rc = get_page_and_type_from_mfn(
-                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, 0, 1);
+                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 6faa563167..8406ac3c37 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -228,19 +228,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page
+         * at @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -251,7 +266,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0
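
To illustrate the encoding that replaces the signed partial_pte
field, the following standalone C sketch (not part of the patch set;
only the PTF_* values mirror the patch, everything else is a
simplified stand-in) models the resume rules described in the mm.h
comment above:

    #include <assert.h>
    #include <stdbool.h>

    #define PTF_partial_set         (1u << 0) /* entry left in PGT_partial */
    #define PTF_partial_general_ref (1u << 1) /* entry holds a general ref */

    /* Resuming validation: re-acquire a general ref only if none is held. */
    static bool reacquire_ref_on_validate(unsigned int partial_flags)
    {
        return !(partial_flags & PTF_partial_general_ref);
    }

    /* Resuming de-validation: drop a general ref only if one is held. */
    static bool drop_ref_on_devalidate(unsigned int partial_flags)
    {
        return partial_flags & PTF_partial_general_ref;
    }

    int main(void)
    {
        /* Interrupted promotion: 'set' without 'general ref'. */
        assert(reacquire_ref_on_validate(PTF_partial_set));
        /* Interrupted demotion: both bits set. */
        assert(drop_ref_on_devalidate(PTF_partial_set |
                                      PTF_partial_general_ref));
        return 0;
    }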


["xsa299-4.12/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From db1d801aa8dcb918a27486a6e8d9cf5d7307dec3 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/11] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is mainly in preparation for _put_page_type() taking the
partial_flags value in the future.  It also makes call sites easier
to read, since a flag name conveys more than a bare `true` or `false`.

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 3f6f8cc9b8..0740b61af8 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1200,7 +1200,7 @@ get_page_from_l4e(
 }
 #endif /* CONFIG_PV */
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1320,7 +1320,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1329,7 +1329,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1366,7 +1366,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     }
 
     if ( flags & PTF_defer )
@@ -1376,7 +1376,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1397,7 +1397,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         }
 
         if ( flags & PTF_defer )
@@ -1407,7 +1407,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2757,10 +2757,11 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
+    bool preemptible = flags & PTF_preemptible;
 
     ASSERT(current_locked_page_ne_check(page));
 
@@ -2969,7 +2970,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(iommu_ret) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 rc = iommu_ret;
                 goto out;
             }
@@ -2996,7 +2997,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -3013,7 +3014,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -3030,7 +3031,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0
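
The shape of the change can be seen in miniature below.  This is a
standalone sketch, not Xen code; only PTF_preemptible matches the
patch, and put_type_old()/put_type_new() are hypothetical stand-ins
for the two signatures of _put_page_type():

    #include <stdbool.h>

    #define PTF_preemptible (1u << 2)

    /* Old shape: a bare boolean at every call site. */
    static int put_type_old(bool preemptible)
    {
        (void)preemptible;             /* behaviour elided */
        return 0;
    }

    /* New shape: a flags word.  Call sites name their intent, and
     * further PTF_* bits can be added without changing the signature. */
    static int put_type_new(unsigned int flags)
    {
        bool preemptible = flags & PTF_preemptible;

        (void)preemptible;             /* behaviour elided */
        return 0;
    }

    int main(void)
    {
        put_type_old(true);            /* true... of what? */
        put_type_new(PTF_preemptible); /* self-describing */
        return 0;
    }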


["xsa299-4.12/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From 6f257854c8778774210281c5c21028c4b7739b44 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/11] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that the alloc_lN_table() functions mishandle the "linear pt
failure" situation described in the comment; this will be addressed
in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0740b61af8..0a4d39a2c3 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1122,8 +1122,43 @@ static int get_page_and_type_from_mfn(
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0
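
The restructured conditional amounts to a small decision function.
The standalone sketch below (not Xen code; -ERESTART is stood in for
by a local constant) encodes the "retain the refcount if" list from
the new comment:

    #include <assert.h>
    #include <stdbool.h>

    #define ERESTART 85 /* stand-in value for the sketch */

    static bool retain_ref(int rc, bool partial_ref,
                           bool is_old_guest_table, bool preemptible)
    {
        if ( rc == 0 || partial_ref )   /* validated, or ref came in */
            return true;
        if ( is_old_guest_table )       /* partial validation + error */
        {
            assert(preemptible);        /* only reachable if preemptible */
            return true;
        }
        return false;                   /* otherwise put_page() */
    }

    int main(void)
    {
        assert(retain_ref(0, false, false, false));        /* success */
        assert(retain_ref(-ERESTART, true, false, true));  /* partial_ref */
        assert(retain_ref(-22, false, true, true));        /* old_guest_table */
        assert(!retain_ref(-22, false, false, false));     /* plain failure */
        return 0;
    }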


["xsa299-4.12/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From 4ad70553611a7a4e4494d5a3b51b5cc295a488e0 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/11] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, when alloc_l[23]_table() check hypercall_preempt_check()
and return -ERESTART, they set nr_validated_ptes but don't clear
partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_validated_ptes] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_validated_ptes].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0a4d39a2c3..bbd29a68f4 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1554,21 +1554,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
-        l2_pgentry_t l2e;
+        l2_pgentry_t l2e = pl2e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) )
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) )
             continue;
-
-        l2e = pl2e[i];
-
-        if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
+        else if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
         {
             if ( !pv_l1tf_check_l2e(d, l2e) )
                 continue;
@@ -1640,13 +1632,8 @@ static int alloc_l3_table(struct page_info *page)
         l3_pgentry_t l3e = pl3e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
                  (l3e_get_flags(l3e) & l3_disallow_mask(d)) )
-- 
2.23.0
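
The effect of falling through to the common -EINTR handler, rather
than duplicating the interruption bookkeeping, can be modelled in
isolation.  This is a simplified standalone sketch, not Xen code;
only the field names and the flag mirror the patch:

    #include <assert.h>

    #define PTF_partial_set (1u << 0)

    struct page_state {
        unsigned int nr_validated_ptes;
        unsigned int partial_flags;
    };

    /* Common interruption path: entry i has not been touched on this
     * pass, so any PTF_partial_set left over from a previous pass must
     * be cleared along with recording the progress made so far. */
    static void preempt_at(struct page_state *page, unsigned int i)
    {
        page->nr_validated_ptes = i;
        page->partial_flags = 0;
    }

    int main(void)
    {
        /* Resuming a previously-interrupted promotion at entry 4... */
        struct page_state page = { 4, PTF_partial_set };

        /* ...and being preempted again just before entry 7. */
        preempt_at(&page, 7);

        /* The buggy early-exit path left partial_flags untouched. */
        assert(page.partial_flags == 0);
        return 0;
    }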


["xsa299-4.12/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From 51fe4e67d954649fcf103116be6206a769f0db1e Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/11] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_flags.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, get_page_from_l2e()
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_flags = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was cleared; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
---
 xen/arch/x86/mm.c        | 84 +++++++++++++++++++++++-----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index bbd29a68f4..4d3ebf341d 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1102,10 +1102,11 @@ get_page_from_l1e(
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
@@ -1114,7 +1115,11 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
@@ -1127,13 +1132,15 @@ static int get_page_and_type_from_mfn(
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -1156,7 +1163,8 @@ static int get_page_and_type_from_mfn(
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1354,8 +1362,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1400,8 +1408,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1431,8 +1439,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1569,13 +1577,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else
             rc = get_page_from_l2e(l2e, pfn, d, partial_flags);
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1584,6 +1601,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else if ( rc < 0 && rc != -EINTR )
         {
             gdprintk(XENLOG_WARNING, "Failure in alloc_l2_table: slot %#x\n", i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1642,7 +1660,7 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(l3e),
                     PGT_l2_page_table | PGT_pae_xen_l2, d,
-                    partial_flags | PTF_preemptible);
+                    partial_flags | PTF_preemptible | PTF_retain_ref_on_restart);
         }
         else if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
         {
@@ -1651,13 +1669,14 @@ static int alloc_l3_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l3e(l3e, pfn, d, partial_flags);
+            rc = get_page_from_l3e(l3e, pfn, d,
+                                   partial_flags | PTF_retain_ref_on_restart);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1833,13 +1852,14 @@ static int alloc_l4_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l4e(l4e, pfn, d, partial_flags);
+            rc = get_page_from_l4e(l4e, pfn, d,
+                                   partial_flags | PTF_retain_ref_on_restart);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -1936,9 +1956,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1986,9 +2004,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2019,9 +2035,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 8406ac3c37..02079e1324 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -238,22 +238,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0
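
The appendix's central point, that a general reference held by the
partially-validated entry pins B until the interrupted operation
resumes, reduces to a few lines.  In this standalone sketch (not Xen
code) a plain counter stands in for the page's reference count:

    #include <assert.h>

    struct page { unsigned int count; };

    static void get_ref(struct page *p) { p->count++; }

    /* Returns 1 if this put freed the page. */
    static int put_ref(struct page *p) { return --p->count == 0; }

    int main(void)
    {
        struct page B = { 1 };   /* PGC_allocated only */

        /* With the fix, partial validation of A[x] takes a general
         * ref on B, so PTF_partial_set implies a ref is held. */
        get_ref(&B);

        /* V3's decrease_reservation() drops PGC_allocated... */
        assert(!put_ref(&B));    /* ...but B is not freed... */

        /* ...until V1 resumes and put_page_from_l3e() runs on A[x]. */
        assert(put_ref(&B));
        return 0;
    }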


["xsa299-4.12/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From 8a8d836f7f7418e659d37817a66cd7a6b115042b Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/11] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 4d3ebf341d..886e93b8aa 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1097,13 +1097,12 @@ get_page_from_l1e(
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -1115,13 +1114,10 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
@@ -1131,14 +1127,14 @@ static int get_page_and_type_from_mfn(
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -1161,7 +1157,7 @@ static int get_page_and_type_from_mfn(
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1359,13 +1355,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1405,13 +1395,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1436,13 +1419,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1676,7 +1652,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1859,7 +1835,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1956,7 +1932,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -2004,7 +1980,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2035,7 +2011,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 02079e1324..f0fd35bf6b 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -233,7 +233,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
         * @nr_validated_ptes holds a general reference count for the
          * page.
          *
@@ -242,23 +242,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -268,8 +265,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0
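
Since the previous patch establishes that the two bits are always set
and cleared together, one bit now carries the same information.  The
standalone sketch below (not Xen code; collapse() is a hypothetical
helper for illustration) shows the two encodings side by side:

    #include <assert.h>

    /* Before: two bits, constrained to move together. */
    #define PTF_partial_set_old         (1u << 0)
    #define PTF_partial_general_ref_old (1u << 1)

    /* After: one bit means both "partially validated" and
     * "holds a general reference". */
    #define PTF_partial_set (1u << 0)

    static unsigned int collapse(unsigned int old_flags)
    {
        /* The invariant the previous patch establishes. */
        assert(!(old_flags & PTF_partial_set_old) ==
               !(old_flags & PTF_partial_general_ref_old));

        return (old_flags & PTF_partial_set_old) ? PTF_partial_set : 0;
    }

    int main(void)
    {
        assert(collapse(0) == 0);
        assert(collapse(PTF_partial_set_old |
                        PTF_partial_general_ref_old) == PTF_partial_set);
        return 0;
    }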


["xsa299-4.12/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From da3d1d258e54fe600f7f75287183b74d957ec63b Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/11] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_entries_validated and partial_flags.

The rule is that entries [0, nr_entries_validated) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_entries_validated] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_entries_validated]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by the
use of interleaving of promoting and demoting a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_flags = PTF_partial_set

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 886e93b8aa..0a094291da 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1581,7 +1581,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1674,7 +1674,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1845,7 +1845,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0
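
The fix itself is one line per allocation function: write the
loop-local partial_flags back instead of 0.  The standalone sketch
below (not Xen code; simplified fields) shows why that preserves the
record that entry i still holds a general reference:

    #include <assert.h>

    #define PTF_partial_set (1u << 0)

    struct page_state {
        unsigned int nr_validated_ptes;
        unsigned int partial_flags;
    };

    /* Failure at entry i, which may have come in partially
     * de-validated from an earlier, interrupted operation. */
    static void record_failure(struct page_state *page, unsigned int i,
                               unsigned int partial_flags)
    {
        page->nr_validated_ptes = i;
        page->partial_flags = partial_flags;  /* buggy version wrote 0 */
    }

    int main(void)
    {
        struct page_state page = { 0, 0 };

        /* Entry 3 came in with PTF_partial_set and then hit -EINVAL. */
        record_failure(&page, 3, PTF_partial_set);

        /* De-validation will now drop the general ref rather than
         * leak it for the lifetime of the domain. */
        assert(page.partial_flags == PTF_partial_set);
        return 0;
    }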


["xsa299-4.12/0010-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From b3e169dc8daeae85b0b51c25fdb142e2e552ec7f Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/11] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this
  page to finish off de-validation, and the appropriate refcount
  adjustments will be done.

alloc_l[2-3]_table() indicates partial validation to its callers by
setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type at [x];
but if put_page_type() is never called and get_page_type() is called
instead, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table() shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 53 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0a094291da..a432e69c74 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1580,6 +1580,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1607,6 +1621,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1623,7 +1638,7 @@ static int alloc_l3_table(struct page_info *page)
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
-        l3_pgentry_t l3e = pl3e[i];
+        l3e = pl3e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
             rc = -EINTR;
@@ -1675,6 +1690,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1851,7 +1884,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl3e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(l4e) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0


["xsa299-4.12/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From ea3dc624c5e6325a9c2f079e52a85965d4ab6ce8 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 11/11] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PTF_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again, at which point put_page_type() may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is called both to clean up
PGT_partial and to drop a type count, and has no way of knowing which
is intended; so if PGT_partial is cleared in the meantime,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
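
As a sketch, the mode check reduces to something like this (the PGT_*
masks and the dying-domain test are simplified stand-ins; the actual
change is in the _put_page_type() hunk below):

    #include <assert.h>
    #include <stdbool.h>

    #define PGT_partial     (1ul << 30)
    #define PGT_validated   (1ul << 31)
    #define PGT_count_mask  0xfffful
    #define PTF_partial_set (1u << 0)

    /* Returns true if the type count should really be dropped, false if
     * this was a partial-cleanup call with nothing left to do; aborts on
     * evidence of misaccounting. */
    static bool put_mode_check(unsigned long x, unsigned int flags,
                               bool owner_dying)
    {
        if ( !(flags & PTF_partial_set) )
            /* Full drop: the page must be validated, or its domain dying. */
            assert(!(x & PGT_partial) &&
                   ((x & PGT_validated) || owner_dying));
        else if ( !(x & PGT_partial) )
            return false;
        else
            assert((x & PGT_count_mask) == 1);
        return true;
    }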

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_entries &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_entries -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_entries == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other
l3.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer things such that old_guest_table
for every vcpu a guest has points to the same page.
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 99 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index 59df8a6d8d..f1ae5f89f5 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1104,9 +1104,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index a432e69c74..81774368a0 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1359,10 +1359,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1385,6 +1386,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         bool writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(_mfn(mfn)), writeable);
@@ -1397,12 +1399,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1421,12 +1425,15 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, flags | PTF_preemptible,
+                            mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1535,6 +1542,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1598,6 +1613,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1704,12 +1720,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
@@ -1897,12 +1917,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(l4e) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2831,6 +2855,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         switch ( nx & (PGT_locked | PGT_count_mask) )
@@ -3092,17 +3138,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3253,11 +3316,11 @@ int new_guest_cr3(mfn_t mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3494,6 +3557,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3528,6 +3592,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3596,11 +3665,15 @@ long do_mmuext_op(
                 switch ( rc = put_page_and_type_preemptible(page) )
                 {
                 case -EINTR:
-                    rc = -ERESTART;
-                    /* fallthrough */
                 case -ERESTART:
                     curr->arch.old_guest_ptpg = NULL;
                     curr->arch.old_guest_table = page;
+                    /*
+                     * EINTR means we still hold the type ref;
+                     * ERESTART means PGT_partial holds the ref
+                     */
+                    curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                    rc = -ERESTART;
                     break;
                 default:
                     BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 214e44ce1c..2cfce7b36b 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -307,7 +307,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -581,6 +581,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


["xsa299/0001-x86-mm-L1TF-checks-don-t-leave-a-partial-entry.patch" (application/octet-stream)]

From dae551145ad4c4720f7b700d10d58b609acbae4e Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 01/11] x86/mm: L1TF checks don't leave a partial entry

On detection of a potential L1TF issue, most validation code returns
-ERESTART to allow the switch to shadow mode to happen and cause the
original operation to be restarted.

However, in the validation code, the return value -ERESTART has been
repurposed to indicate 1) the function has partially completed
something which needs to be undone, and 2) calling put_page_type()
should cleanly undo it.  This causes problems in several places.

For L1 tables, on receiving an -ERESTART return from alloc_l1_table(),
alloc_page_type() will set PGT_partial on the page.  If for some
reason the original operation never restarts, then on domain
destruction, relinquish_memory() will call free_page_type() on the
page.

Unfortunately, alloc_ and free_l1_table() aren't set up to deal with
PGT_partial.  When returning a failure, alloc_l1_table() always
de-validates whatever it's validated so far, and free_l1_table()
always devalidates the whole page.  This means that if
relinquish_memory() calls free_page_type() on an L1 that didn't
complete due to an L1TF, it will call put_page_from_l1e() on "page
entries" that have never been validated.

For L2+ tables, setting rc to -ERESTART causes the rest of the
alloc_lN_table() function to *think* that the entry in question will
have PGT_partial set.  This will cause it to set partial_pte = 1.  If
relinquish_memory() then calls free_page_type() on one of those pages,
then free_lN_table() will call put_page_from_lNe() on the entry when
it shouldn't.

Rather than indicating -ERESTART, indicate -EINTR.  This is the code
to indicate that nothing has changed from when you started the call
(which is effectively how alloc_l1_table() handles errors).

mod_lN_entry() shouldn't have any of these types of problems, so leave
potential changes there for a clean-up patch later.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 8 ++++----
 1 file changed, 4 insertions(+), 4 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 99816fc67c..9cac6d1cdf 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1379,7 +1379,7 @@ static int alloc_l1_table(struct page_info *page)
     {
         if ( !(l1e_get_flags(pl1e[i]) & _PAGE_PRESENT) )
         {
-            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -ERESTART : 0;
+            ret = pv_l1tf_check_l1e(d, pl1e[i]) ? -EINTR : 0;
             if ( ret )
                 goto out;
         }
@@ -1488,7 +1488,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         {
             if ( !pv_l1tf_check_l2e(d, l2e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l2e(l2e, pfn, d, partial);
@@ -1575,7 +1575,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             if ( !pv_l1tf_check_l3e(d, l3e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l3e(l3e, pfn, d, partial);
@@ -1756,7 +1756,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             if ( !pv_l1tf_check_l4e(d, l4e) )
                 continue;
-            rc = -ERESTART;
+            rc = -EINTR;
         }
         else
             rc = get_page_from_l4e(l4e, pfn, d, partial);
-- 
2.23.0


["xsa299/0002-x86-mm-Don-t-re-set-PGT_pinned-on-a-partially-de-val.patch" (application/octet-stream)]

From fbc279651dba9e5be6ff29c89a92f1bce40aee56 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 02/11] x86/mm: Don't re-set PGT_pinned on a partially
 de-validated page

When unpinning pagetables, if an operation is interrupted,
relinquish_memory() re-sets PGT_pinned so that the un-pin will be
picked up again when the hypercall restarts.

This is appropriate when put_page_and_type_preemptible() returns
-EINTR, which indicates that the page is back in its initial state
(i.e., completely validated).  However, for -ERESTART, this leads to a
state where a page has both PGT_pinned and PGT_partial set.

This happens to work at the moment, although it's not really a
"canonical" state; but in subsequent patches, where we need to make a
distinction in handling between PGT_validated and PGT_partial pages,
this causes issues.

Move to a "canonical" state by:
- Only re-setting PGT_pinned on -EINTR
- Re-dropping the refcount held by PGT_pinned on -ERESTART

In the latter case, the PGT_partial bit will be cleared further down
with the rest of the other PGT_partial pages.
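
Condensed into a toy sketch (stand-in types; the real change is in
the relinquish_memory() hunk below):

    #include <assert.h>
    #include <errno.h>

    #define PGT_validated (1u << 31)
    #define PGT_pinned    (1u << 29)

    struct toy_page { unsigned int type_info; unsigned int count; };

    /* -EINTR: the page is fully validated again, so re-set PGT_pinned
     * (keeping the general ref it holds).  Anything else (-ERESTART):
     * PGT_partial now carries the state, so drop the ref that PGT_pinned
     * held and leave PGT_pinned clear. */
    static void handle_interrupted_unpin(struct toy_page *pg, int ret)
    {
        if ( ret == -EINTR )
        {
            assert(pg->type_info & PGT_validated);
            pg->type_info |= PGT_pinned;
        }
        else
            pg->count--;  /* analogue of put_page() */
    }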

While here, clean up some trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/domain.c | 31 ++++++++++++++++++++++++++++---
 1 file changed, 28 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index c8d7f491ea..febbb23336 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -120,7 +120,7 @@ void play_dead(void)
      * this case, heap corruption or #PF can occur (when heap debugging is
      * enabled). For example, even printk() can involve tasklet scheduling,
      * which touches per-cpu vars.
-     * 
+     *
      * Consider very carefully when adding code to *dead_idle. Most hypervisor
      * subsystems are unsafe to call.
      */
@@ -1972,9 +1972,34 @@ static int relinquish_memory(
             break;
         case -ERESTART:
         case -EINTR:
+            /*
+             * -EINTR means PGT_validated has been re-set; re-set
+             * PGT_pinned again so that it gets picked up next time
+             * around.
+             *
+             * -ERESTART, OTOH, means PGT_partial is set instead.  Put
+             * it back on the list, but don't set PGT_pinned; the
+             * section below will finish off de-validation.  But we do
+             * need to drop the general ref associated with
+             * PGT_pinned, since put_page_and_type_preemptible()
+             * didn't do it.
+             *
+             * NB we can do an ASSERT for PGT_validated, since we
+             * "own" the type ref; but theoretically, the PGT_partial
+             * could be cleared by someone else.
+             */
+            if ( ret == -EINTR )
+            {
+                ASSERT(page->u.inuse.type_info & PGT_validated);
+                set_bit(_PGT_pinned, &page->u.inuse.type_info);
+            }
+            else
+                put_page(page);
+
             ret = -ERESTART;
+
+            /* Put the page back on the list and drop the ref we grabbed above */
             page_list_add(page, list);
-            set_bit(_PGT_pinned, &page->u.inuse.type_info);
             put_page(page);
             goto out;
         default:
@@ -2225,7 +2250,7 @@ void vcpu_kick(struct vcpu *v)
      * pending flag. These values may fluctuate (after all, we hold no
      * locks) but the key insight is that each change will cause
      * evtchn_upcall_pending to be polled.
-     * 
+     *
      * NB2. We save the running flag across the unblock to avoid a needless
      * IPI for domains that we IPI'd to unblock.
      */
-- 
2.23.0


["xsa299/0003-x86-mm-Separate-out-partial_pte-tristate-into-indivi.patch" (application/octet-stream)]

From b821f79f88c194cf3064f41ea66cd3d9554a8cb5 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 03/11] x86/mm: Separate out partial_pte tristate into
 individual flags

At the moment, partial_pte is a tri-state that contains two distinct bits
of information:

1. If zero, the pte at index [nr_validated_ptes] is un-validated.  If
   non-zero, the pte was last seen with PGT_partial set.

2. If positive, the pte at index [nr_validated_ptes] does not hold a
   general reference count.  If negative, it does.

To make future patches more clear, separate out this functionality
into two distinct, named bits: PTF_partial_set (for #1) and
PTF_partial_general_ref (for #2).
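
Assuming the semantics just described, the mapping from the old
tri-state onto the new flags can be sketched as follows (illustrative
only):

    #define PTF_partial_set         (1u << 0)
    #define PTF_partial_general_ref (1u << 1)

    /* partial_pte == 0: entry not validated at all, no flags.
     * partial_pte  > 0: partially validated, no general ref held.
     * partial_pte  < 0: partially validated, general ref held. */
    static unsigned int flags_from_partial_pte(int partial_pte)
    {
        if ( partial_pte == 0 )
            return 0;
        return partial_pte > 0 ? PTF_partial_set
                               : PTF_partial_set | PTF_partial_general_ref;
    }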

Additionally, a number of functions which need this information also
take other flags to control behavior (such as `preemptible` and
`defer`).  These are hard to read in the caller (since you only see
'true' or 'false'), and ugly when many are added together.  In
preparation for adding yet another flag in a future patch, collapse
all of these into a single `flag` variable.

NB that this does make checking for what was previously the '-1'
condition a bit more ugly in the put_page_from_lNe() functions (since
you have to check for both partial_set and general ref); but this
clause will go away in a future patch.

Also note that the original comment had an off-by-one error:
partial_flags (like partial_pte before it) concerns
plNe[nr_validated_ptes], not plNe[nr_validated_ptes+1].

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 165 ++++++++++++++++++++++++---------------
 xen/include/asm-x86/mm.h |  41 +++++++---
 2 files changed, 128 insertions(+), 78 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 9cac6d1cdf..6abdd20b89 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1064,20 +1064,35 @@ get_page_from_l1e(
 }
 
 #ifdef CONFIG_PV
+
+/*
+ * The following flags are used to specify behavior of various get and
+ * put commands.  The first two are also stored in page->partial_flags
+ * to indicate the state of the page pointed to by
+ * page->pte[page->nr_validated_entries].  See the comment in mm.h for
+ * more information.
+ */
+#define PTF_partial_set         (1 << 0)
+#define PTF_partial_general_ref (1 << 1)
+#define PTF_preemptible         (1 << 2)
+#define PTF_defer               (1 << 3)
+
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
-    int partial, int preemptible)
+    unsigned int flags)
 {
     struct page_info *page = mfn_to_page(mfn);
     int rc;
+    bool preemptible = flags & PTF_preemptible,
+         partial_ref = flags & PTF_partial_general_ref;
 
-    if ( likely(partial >= 0) &&
+    if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && partial >= 0 &&
+    if ( unlikely(rc) && !partial_ref &&
          (!preemptible || page != current->arch.old_guest_table) )
         put_page(page);
 
@@ -1087,7 +1102,7 @@ static int get_page_and_type_from_mfn(
 define_get_linear_pagetable(l2);
 static int
 get_page_from_l2e(
-    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, int partial)
+    l2_pgentry_t l2e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     unsigned long mfn = l2e_get_pfn(l2e);
     int rc;
@@ -1099,8 +1114,9 @@ get_page_from_l2e(
         return -EINVAL;
     }
 
-    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d,
-                                    partial, false);
+    ASSERT(!(flags & PTF_preemptible));
+
+    rc = get_page_and_type_from_mfn(_mfn(mfn), PGT_l1_page_table, d, flags);
     if ( unlikely(rc == -EINVAL) && get_l2_linear_pagetable(l2e, pfn, d) )
         rc = 0;
 
@@ -1110,7 +1126,7 @@ get_page_from_l2e(
 define_get_linear_pagetable(l3);
 static int
 get_page_from_l3e(
-    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, int partial)
+    l3_pgentry_t l3e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1122,7 +1138,7 @@ get_page_from_l3e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l3e_get_mfn(l3e), PGT_l2_page_table, d, partial, 1);
+        l3e_get_mfn(l3e), PGT_l2_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) &&
          !is_pv_32bit_domain(d) &&
          get_l3_linear_pagetable(l3e, pfn, d) )
@@ -1134,7 +1150,7 @@ get_page_from_l3e(
 define_get_linear_pagetable(l4);
 static int
 get_page_from_l4e(
-    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, int partial)
+    l4_pgentry_t l4e, unsigned long pfn, struct domain *d, unsigned int flags)
 {
     int rc;
 
@@ -1146,7 +1162,7 @@ get_page_from_l4e(
     }
 
     rc = get_page_and_type_from_mfn(
-        l4e_get_mfn(l4e), PGT_l3_page_table, d, partial, 1);
+        l4e_get_mfn(l4e), PGT_l3_page_table, d, flags | PTF_preemptible);
     if ( unlikely(rc == -EINVAL) && get_l4_linear_pagetable(l4e, pfn, d) )
         rc = 0;
 
@@ -1247,7 +1263,7 @@ static void put_data_page(struct page_info *page, bool writeable)
  * Note also that this automatically deals correctly with linear p.t.'s.
  */
 static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 0;
 
@@ -1270,12 +1286,13 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             rc = _put_page_type(pg, true, ptpg);
         }
-        else if ( defer )
+        else if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1292,7 +1309,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
 }
 
 static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     struct page_info *pg;
     int rc;
@@ -1315,13 +1332,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( unlikely(partial > 0) )
+    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+         PTF_partial_set )
     {
-        ASSERT(!defer);
+        ASSERT(!(flags & PTF_defer));
         return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
     }
 
-    if ( defer )
+    if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
@@ -1336,7 +1354,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 }
 
 static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
-                             int partial, bool defer)
+                             unsigned int flags)
 {
     int rc = 1;
 
@@ -1345,13 +1363,14 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( unlikely(partial > 0) )
+        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
+              PTF_partial_set )
         {
-            ASSERT(!defer);
+            ASSERT(!(flags & PTF_defer));
             return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
         }
 
-        if ( defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
@@ -1463,12 +1482,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t  *pl2e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l2_pgentry_t l2e;
 
@@ -1491,17 +1511,18 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l2e(l2e, pfn, d, partial);
+            rc = get_page_from_l2e(l2e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', retain 'general ref' */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         else if ( rc < 0 && rc != -EINTR )
@@ -1511,7 +1532,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1535,7 +1556,8 @@ static int alloc_l3_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t  *pl3e;
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1550,7 +1572,7 @@ static int alloc_l3_table(struct page_info *page)
         memset(pl3e + 4, 0, (L3_PAGETABLE_ENTRIES - 4) * sizeof(*pl3e));
 
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l3_pgentry_t l3e = pl3e[i];
 
@@ -1569,7 +1591,8 @@ static int alloc_l3_table(struct page_info *page)
             else
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(l3e),
-                    PGT_l2_page_table | PGT_pae_xen_l2, d, partial, 1);
+                    PGT_l2_page_table | PGT_pae_xen_l2, d,
+                    partial_flags | PTF_preemptible);
         }
         else if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
         {
@@ -1578,17 +1601,18 @@ static int alloc_l3_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l3e(l3e, pfn, d, partial);
+            rc = get_page_from_l3e(l3e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             rc = -ERESTART;
         }
         if ( rc < 0 )
@@ -1606,7 +1630,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1740,10 +1764,11 @@ static int alloc_l4_table(struct page_info *page)
     unsigned long  pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t  *pl4e = map_domain_page(_mfn(pfn));
     unsigned int   i;
-    int            rc = 0, partial = page->partial_pte;
+    int            rc = 0;
+    unsigned int   partial_flags = page->partial_flags;
 
     for ( i = page->nr_validated_ptes; i < L4_PAGETABLE_ENTRIES;
-          i++, partial = 0 )
+          i++, partial_flags = 0 )
     {
         l4_pgentry_t l4e;
 
@@ -1759,12 +1784,13 @@ static int alloc_l4_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l4e(l4e, pfn, d, partial);
+            rc = get_page_from_l4e(l4e, pfn, d, partial_flags);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
-            page->partial_pte = partial ?: 1;
+            /* Set 'set', leave 'general ref' set if this entry was set */
+            page->partial_flags = partial_flags | PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1774,7 +1800,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_pte = 0;
+                page->partial_flags = 0;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
@@ -1826,19 +1852,20 @@ static int free_l2_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l2_pgentry_t *pl2e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl2e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
         if ( is_guest_l2_slot(d, page->u.inuse.type_info, i) )
-            rc = put_page_from_l2e(pl2e[i], pfn, partial, false);
+            rc = put_page_from_l2e(pl2e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
 
         if ( !i-- )
             break;
@@ -1860,12 +1887,14 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -1877,18 +1906,19 @@ static int free_l3_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l3_pgentry_t *pl3e;
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned int partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     pl3e = map_domain_page(_mfn(pfn));
 
     for ( ; ; )
     {
-        rc = put_page_from_l3e(pl3e[i], pfn, partial, 0);
+        rc = put_page_from_l3e(pl3e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
 
-        partial = 0;
+        partial_flags = 0;
         if ( rc == 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
 
@@ -1907,12 +1937,14 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
     return rc > 0 ? 0 : rc;
@@ -1923,26 +1955,29 @@ static int free_l4_table(struct page_info *page)
     struct domain *d = page_get_owner(page);
     unsigned long pfn = mfn_x(page_to_mfn(page));
     l4_pgentry_t *pl4e = map_domain_page(_mfn(pfn));
-    int rc = 0, partial = page->partial_pte;
-    unsigned int  i = page->nr_validated_ptes - !partial;
+    int rc = 0;
+    unsigned partial_flags = page->partial_flags,
+        i = page->nr_validated_ptes - !(partial_flags & PTF_partial_set);
 
     do {
         if ( is_guest_l4_slot(d, i) )
-            rc = put_page_from_l4e(pl4e[i], pfn, partial, 0);
+            rc = put_page_from_l4e(pl4e[i], pfn, partial_flags);
         if ( rc < 0 )
             break;
-        partial = 0;
+        partial_flags = 0;
     } while ( i-- );
 
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_pte = partial ?: -1;
+        page->partial_flags = (partial_flags & PTF_partial_set) ?
+            partial_flags :
+            (PTF_partial_set | PTF_partial_general_ref);
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
         page->nr_validated_ptes = i + 1;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
         rc = -ERESTART;
     }
 
@@ -2219,7 +2254,7 @@ static int mod_l2_entry(l2_pgentry_t *pl2e,
         return -EBUSY;
     }
 
-    put_page_from_l2e(ol2e, mfn_x(mfn), 0, true);
+    put_page_from_l2e(ol2e, mfn_x(mfn), PTF_defer);
 
     return rc;
 }
@@ -2287,7 +2322,7 @@ static int mod_l3_entry(l3_pgentry_t *pl3e,
         if ( !create_pae_xen_mappings(d, pl3e) )
             BUG();
 
-    put_page_from_l3e(ol3e, mfn_x(mfn), 0, 1);
+    put_page_from_l3e(ol3e, mfn_x(mfn), PTF_defer);
     return rc;
 }
 
@@ -2350,7 +2385,7 @@ static int mod_l4_entry(l4_pgentry_t *pl4e,
         return -EFAULT;
     }
 
-    put_page_from_l4e(ol4e, mfn_x(mfn), 0, 1);
+    put_page_from_l4e(ol4e, mfn_x(mfn), PTF_defer);
     return rc;
 }
 #endif /* CONFIG_PV */
@@ -2619,7 +2654,7 @@ int free_page_type(struct page_info *page, unsigned long type,
     if ( !(type & PGT_partial) )
     {
         page->nr_validated_ptes = 1U << PAGETABLE_ORDER;
-        page->partial_pte = 0;
+        page->partial_flags = 0;
     }
 
     switch ( type & PGT_type_mask )
@@ -2912,7 +2947,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
         if ( !(x & PGT_partial) )
         {
             page->nr_validated_ptes = 0;
-            page->partial_pte = 0;
+            page->partial_flags = 0;
         }
         page->linear_pt_count = 0;
         rc = alloc_page_type(page, type, preemptible);
@@ -3088,7 +3123,7 @@ int new_guest_cr3(mfn_t mfn)
         return 0;
     }
 
-    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, 0, 1);
+    rc = get_page_and_type_from_mfn(mfn, PGT_root_page_table, d, PTF_preemptible);
     switch ( rc )
     {
     case 0:
@@ -3439,7 +3474,7 @@ long do_mmuext_op(
             if ( op.arg1.mfn != 0 )
             {
                 rc = get_page_and_type_from_mfn(
-                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, 0, 1);
+                    _mfn(op.arg1.mfn), PGT_root_page_table, currd, PTF_preemptible);
 
                 if ( unlikely(rc) )
                 {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 3863e4ce57..0e628fa417 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -231,19 +231,34 @@ struct page_info
          * setting the flag must not drop that reference, whereas the instance
          * clearing it will have to.
          *
-         * If @partial_pte is positive then PTE at @nr_validated_ptes+1 has
-         * been partially validated. This implies that the general reference
-         * to the page (acquired from get_page_from_lNe()) would be dropped
-         * (again due to the apparent failure) and hence must be re-acquired
-         * when resuming the validation, but must not be dropped when picking
-         * up the page for invalidation.
+         * If partial_flags & PTF_partial_set is set, then the page at
+         * @nr_validated_ptes had PGT_partial set as a result of an
+         * operation on the current page.  (That page may or may not
+         * still have PGT_partial set.)
          *
-         * If @partial_pte is negative then PTE at @nr_validated_ptes+1 has
-         * been partially invalidated. This is basically the opposite case of
-         * above, i.e. the general reference to the page was not dropped in
-         * put_page_from_lNe() (due to the apparent failure), and hence it
-         * must be dropped when the put operation is resumed (and completes),
-         * but it must not be acquired if picking up the page for validation.
+         * If PTF_partial_general_ref is set, then the PTE at
+         * @nr_validated_ptes holds a general reference count for the
+         * page.
+         *
+         * This happens:
+         * - During de-validation, if de-validation of the page was
+         *   interrupted
+         * - During validation, if an invalid entry is encountered and
+         *   validation is preemptible
+         * - During validation, if PTF_partial_general_ref was set on
+         *   this entry to begin with (perhaps because we're picking
+         *   up from a partial de-validation).
+         *
+         * When resuming validation, if PTF_partial_general_ref is clear,
+         * then a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
+         *
+         * When resuming de-validation, if PTF_partial_general_ref is
+         * clear, no reference should be dropped; if it is set, a
+         * reference should be dropped.
+         *
+         * NB that PTF_partial_set and PTF_partial_general_ref are
+         * defined in mm.c, the only place where they are used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -254,7 +269,7 @@ struct page_info
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
             u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            s16 partial_pte:2;
+            u16 partial_flags:2;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299/0004-x86-mm-Use-flags-for-_put_page_type-rather-than-a-bo.patch" (application/octet-stream)]

From 07cc728267ad1e659757ac2a32e467438dae1032 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 04/11] x86/mm: Use flags for _put_page_type rather than a
 boolean

This is in mainly in preparation for _put_page_type taking the
partial_flags value in the future.  It also makes it easier to read in
the caller (since you see a flag name rather than `true` or `false`).

No functional change intended.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 +++++++++++++------------
 1 file changed, 13 insertions(+), 12 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 6abdd20b89..68d17db4ad 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1170,7 +1170,7 @@ get_page_from_l4e(
 }
 #endif /* CONFIG_PV */
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg);
 
 void put_page_from_l1e(l1_pgentry_t l1e, struct domain *l1e_owner)
@@ -1290,7 +1290,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
         }
         else if ( flags & PTF_defer )
         {
@@ -1299,7 +1299,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         }
         else
         {
-            rc = _put_page_type(pg, true, ptpg);
+            rc = _put_page_type(pg, PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1336,7 +1336,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
          PTF_partial_set )
     {
         ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     }
 
     if ( flags & PTF_defer )
@@ -1346,7 +1346,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         return 0;
     }
 
-    rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1367,7 +1367,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
               PTF_partial_set )
         {
             ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         }
 
         if ( flags & PTF_defer )
@@ -1377,7 +1377,7 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
             return 0;
         }
 
-        rc = _put_page_type(pg, true, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -2727,10 +2727,11 @@ static int _put_final_page_type(struct page_info *page, unsigned long type,
 }
 
 
-static int _put_page_type(struct page_info *page, bool preemptible,
+static int _put_page_type(struct page_info *page, unsigned int flags,
                           struct page_info *ptpg)
 {
     unsigned long nx, x, y = page->u.inuse.type_info;
+    bool preemptible = flags & PTF_preemptible;
 
     ASSERT(current_locked_page_ne_check(page));
 
@@ -2936,7 +2937,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
             if ( unlikely(rc) )
             {
-                _put_page_type(page, false, NULL);
+                _put_page_type(page, 0, NULL);
                 goto out;
             }
         }
@@ -2962,7 +2963,7 @@ static int _get_page_type(struct page_info *page, unsigned long type,
 
 void put_page_type(struct page_info *page)
 {
-    int rc = _put_page_type(page, false, NULL);
+    int rc = _put_page_type(page, 0, NULL);
     ASSERT(rc == 0);
     (void)rc;
 }
@@ -2979,7 +2980,7 @@ int get_page_type(struct page_info *page, unsigned long type)
 
 int put_page_type_preemptible(struct page_info *page)
 {
-    return _put_page_type(page, true, NULL);
+    return _put_page_type(page, PTF_preemptible, NULL);
 }
 
 int get_page_type_preemptible(struct page_info *page, unsigned long type)
@@ -2996,7 +2997,7 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, true,
+    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
                                  v->arch.old_guest_ptpg) )
     {
     case -EINTR:
-- 
2.23.0


["xsa299/0005-x86-mm-Rework-get_page_and_type_from_mfn-conditional.patch" (application/octet-stream)]

From 1b19b8d643b072ab4d4cd25a684d7315ab0a1512 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 05/11] x86/mm: Rework get_page_and_type_from_mfn conditional

Make it easier to read by declaring the conditions in which we will
retain the ref, rather than the conditions under which we release it.

The only way (page == current->arch.old_guest_table) can be true is if
preemptible is true; so remove this from the query itself, and add an
ASSERT() to that effect on the opposite path.

No functional change intended.

NB that alloc_lN_table() mishandles the "linear pt failure" situation
described in the comment; this will be addressed in a future patch.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 39 +++++++++++++++++++++++++++++++++++++--
 1 file changed, 37 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 68d17db4ad..da47a66f8b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1092,8 +1092,43 @@ static int get_page_and_type_from_mfn(
 
     rc = _get_page_type(page, type, preemptible);
 
-    if ( unlikely(rc) && !partial_ref &&
-         (!preemptible || page != current->arch.old_guest_table) )
+    /*
+     * Retain the refcount if:
+     * - page is fully validated (rc == 0)
+     * - page is not validated (rc < 0) but:
+     *   - We came in with a reference (partial_ref)
+     *   - page is partially validated but there's been an error
+     *     (page == current->arch.old_guest_table)
+     *
+     * The partial_ref-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_ref might be true coming in:
+     * - mfn has been partially demoted as type `type`; i.e. has
+     *   PGT_partial set
+     * - mfn has been partially demoted as L(type+1) (i.e., a linear
+     *   page; e.g. we're being called from get_page_from_l2e with
+     *   type == PGT_l1_table, but the mfn is PGT_l2_table)
+     *
+     * If there's an error, in the first case, _get_page_type will
+     * either return -ERESTART, in which case we want to retain the
+     * ref (as the caller will consider it retained), or -EINVAL, in
+     * which case old_guest_table will be set; in both cases, we need
+     * to retain the ref.
+     *
+     * In the second case, if there's an error, _get_page_type() can
+     * *only* return -EINVAL, and *never* set old_guest_table.  In
+     * that case we also want to retain the reference, to allow the
+     * page to continue to be torn down (i.e., PGT_partial cleared)
+     * safely.
+     *
+     * Also note that we shouldn't be able to leave with the reference
+     * count retained unless we succeeded, or the operation was
+     * preemptible.
+     */
+    if ( likely(!rc) || partial_ref )
+        /* nothing */;
+    else if ( page == current->arch.old_guest_table )
+        ASSERT(preemptible);
+    else
         put_page(page);
 
     return rc;
-- 
2.23.0


["xsa299/0006-x86-mm-Have-alloc_l-23-_table-clear-partial_flags-wh.patch" (application/octet-stream)]

From 9c13c5cd7ffd0f421a8cfd6510450b8dac7b1c40 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 06/11] x86/mm: Have alloc_l[23]_table clear partial_flags when
 preempting

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.
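
To make the bookkeeping concrete, here is a self-contained C model of
the invariant as it stands at this point in the series.  All names
(model_page, MODEL_PTF_*, check_invariant) are invented for
illustration; they are not Xen types or APIs.

#include <assert.h>
#include <stdbool.h>

#define MODEL_PTF_partial_set         (1 << 0)
#define MODEL_PTF_partial_general_ref (1 << 1)

struct model_entry {
    bool validated;        /* fully validated */
    bool partially_valid;  /* promotion/demotion in progress */
    bool holds_ref;        /* holds a general reference count */
};

struct model_page {
    unsigned int nr_validated_ptes;
    unsigned int partial_flags;
    struct model_entry entries[512];
};

static void check_invariant(const struct model_page *pg)
{
    unsigned int i, n = pg->nr_validated_ptes;

    /* Entries [0, nr_validated_ptes): validated, each holding a ref. */
    for ( i = 0; i < n; i++ )
        assert(pg->entries[i].validated && pg->entries[i].holds_ref);

    if ( pg->partial_flags == 0 )
        /* [n] is not validated and no reference count is held. */
        assert(!pg->entries[n].validated && !pg->entries[n].holds_ref);
    else
    {
        /*
         * [n] is partially validated; a general ref is held only if
         * the interrupted operation was a demotion.
         */
        assert(pg->partial_flags & MODEL_PTF_partial_set);
        assert(pg->entries[n].partially_valid);
        assert(pg->entries[n].holds_ref ==
               !!(pg->partial_flags & MODEL_PTF_partial_general_ref));
    }
}

int main(void)
{
    /* An interrupted demotion: the partial entry still holds its ref. */
    static struct model_page pg = {
        .nr_validated_ptes = 0,
        .partial_flags = MODEL_PTF_partial_set |
                         MODEL_PTF_partial_general_ref,
    };

    pg.entries[0].partially_valid = true;
    pg.entries[0].holds_ref = true;
    check_invariant(&pg);
    return 0;
}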

Unfortunately, when the alloc_l[23]_table functions see
hypercall_preempt_check() return true and bail out with -ERESTART,
they set nr_validated_ptes, but don't clear partial_flags.

If we were picking up from a previously-interrupted promotion, that
means that PTF_partial_set would be set even though
[nr_validated_ptes] was not partially validated.  This means that
if the page in this state were de-validated, put_page_type() would
erroneously be called on that entry.

Perhaps worse, if we were racing with a de-validation, then we might
leave both PTF_partial_set and PTF_partial_general_ref; and when
de-validation picked up again, both the type and the general ref would
be erroneously dropped from [nr_validated_ptes].

In a sense, the real issue here is code duplication.  Rather than
duplicate the interruption code, set rc to -EINTR and fall through to
the code which already handles that case correctly.

Given the logic at this point, it should be impossible for
partial_flags to be non-zero; add an ASSERT() to catch any changes.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 25 ++++++-------------------
 1 file changed, 6 insertions(+), 19 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index da47a66f8b..24ce8d7c50 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1525,21 +1525,13 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
-        l2_pgentry_t l2e;
+        l2_pgentry_t l2e = pl2e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( !is_guest_l2_slot(d, type, i) )
+            rc = -EINTR;
+        else if ( !is_guest_l2_slot(d, type, i) )
             continue;
-
-        l2e = pl2e[i];
-
-        if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
+        else if ( !(l2e_get_flags(l2e) & _PAGE_PRESENT) )
         {
             if ( !pv_l1tf_check_l2e(d, l2e) )
                 continue;
@@ -1612,13 +1604,8 @@ static int alloc_l3_table(struct page_info *page)
         l3_pgentry_t l3e = pl3e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
-        {
-            page->nr_validated_ptes = i;
-            rc = -ERESTART;
-            break;
-        }
-
-        if ( is_pv_32bit_domain(d) && (i == 3) )
+            rc = -EINTR;
+        else if ( is_pv_32bit_domain(d) && (i == 3) )
         {
             if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) ||
                  (l3e_get_flags(l3e) & l3_disallow_mask(d)) )
-- 
2.23.0


["xsa299/0007-x86-mm-Always-retain-a-general-ref-on-partial.patch" (application/octet-stream)]

From b9b8f9a55480637084609d07f94b4c453c71b950 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 07/11] x86/mm: Always retain a general ref on partial

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page struct:
nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated.

At the moment, a distinction is made between promotion and demotion
with regard to whether the entry itself "holds" a general reference
count: when entry promotion is interrupted (i.e., returns -ERESTART),
the entry is not considered to hold a reference; when entry demotion
is interrupted, the entry is still considered to hold a general
reference.

PTF_partial_general_ref is used to distinguish between these cases.
If clear, it's a partial promotion => no general reference count held
by the entry; if set, it's partial demotion, so a general reference
count held.  Because promotions and demotions can be interleaved, this
value is passed to get_page_and_type_from_mfn and put_page_from_l*e,
to be able to properly handle reference counts.

Unfortunately, because a refcount is not held, it is possible to
engineer a situation where PTF_partial_set is set but the page in
question has been assigned to another domain.  A sketch is provided in
the appendix.

Fix this by having the parent page table entry hold a general
reference count whenever PTF_partial_set is set.  (For clarity of
change, keep two separate flags.  These will be collapsed in a
subsequent changeset.)

This has two basic implications.  On the put_page_from_lNe() side,
this means that the (partial_set && !partial_ref) case can never happen,
and no longer needs to be special-cased.

Secondly, because both flags are set together, there's no need to carry over
existing bits from partial_pte.

(NB there is still another issue with calling _put_page_type() on a
page which had PGT_partial set; that will be handled in a subsequent
patch.)

On the get_page_and_type_from_mfn() side, we need to distinguish
between callers which hold a reference on partial (i.e.,
alloc_lN_table()), and those which do not (new_cr3, PIN_LN_TABLE, and
so on): pass a flag if the type should be retained on interruption.

NB that since l1 promotion can't be preempted, that get_page_from_l2e
can't return -ERESTART.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
* Appendix: Engineering PTF_partial_set while a page belongs to a
  foreign domain

Suppose A is a page which can be promoted to an l3, and B is a page
which can be promoted to an l2, and A[x] points to B.  B has
PGC_allocated set but no other general references.

V1:  PIN_L3 A.
  A is validated, B is validated.
  A.type_count = 1 | PGT_validated | PGT_pinned
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated (A[x] holds a general ref)

V1: UNPIN A.
  A begins de-validation.
  Arrange to be interrupted when i < x
  V1->old_guest_table = A
  V1->old_guest_table_ref_held = false
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = i < x
  B.type_count = 0
  B.count = 1 | PGC_allocated

V2: MOD_L4_ENTRY to point some l4e to A.
  Picks up re-validation of A.
  Arrange to be interrupted halfway through B's validation
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated (PGT_partial holds a general ref)
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = PTF_partial_set

V3: MOD_L3_ENTRY to point some other l3e (not in A) to B.
  Validates B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated ("other l3e" holds a general ref)

V3: MOD_L3_ENTRY to clear l3e pointing to B.
  Devalidates B.
  B.type_count = 0
  B.count = 1 | PGC_allocated

V3: decrease_reservation(B)
  Clears PGC_allocated
  B.count = 0 => B is freed

B gets assigned to a different domain

V1: Restarts UNPIN of A
  put_old_guest_table(A)
    ...
      free_l3_table(A)

Now since A.partial_flags has PTF_partial_set, free_l3_table() will
call put_page_from_l3e() on A[x], which points to B, while B is owned
by another domain.

If A[x] held a general refcount for B on partial validation, as it does
for partial de-validation, then B would still have a reference count of
1 after PGC_allocated was cleared; so B wouldn't be freed until after
put_page_from_l3e() had happened on A[x].
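
The arithmetic of that last point, as a trivially checkable model
(reference counts reduced to plain integers; the scenario follows the
sketch above):

#include <assert.h>

int main(void)
{
    int b_count;

    /* Pre-fix: the partially-validated A[x] holds no general ref. */
    b_count = 1;              /* PGC_allocated only */
    b_count -= 1;             /* decrease_reservation drops it */
    assert(b_count == 0);     /* B freed while A[x] still points at it */

    /* Post-fix: A[x] holds a general ref while PTF_partial_set. */
    b_count = 1 + 1;          /* PGC_allocated + A[x]'s partial ref */
    b_count -= 1;             /* decrease_reservation */
    assert(b_count == 1);     /* B survives until put_page_from_l3e() */
    return 0;
}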
---
 xen/arch/x86/mm.c        | 84 +++++++++++++++++++++++-----------------
 xen/include/asm-x86/mm.h | 15 ++++---
 2 files changed, 58 insertions(+), 41 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 24ce8d7c50..d294b09a38 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1072,10 +1072,11 @@ get_page_from_l1e(
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
-#define PTF_partial_set         (1 << 0)
-#define PTF_partial_general_ref (1 << 1)
-#define PTF_preemptible         (1 << 2)
-#define PTF_defer               (1 << 3)
+#define PTF_partial_set           (1 << 0)
+#define PTF_partial_general_ref   (1 << 1)
+#define PTF_preemptible           (1 << 2)
+#define PTF_defer                 (1 << 3)
+#define PTF_retain_ref_on_restart (1 << 4)
 
 static int get_page_and_type_from_mfn(
     mfn_t mfn, unsigned long type, struct domain *d,
@@ -1084,7 +1085,11 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref;
+         partial_ref = flags & PTF_partial_general_ref,
+         partial_set = flags & PTF_partial_set,
+         retain_ref  = flags & PTF_retain_ref_on_restart;
+
+    ASSERT(partial_ref == partial_set);
 
     if ( likely(!partial_ref) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
@@ -1097,13 +1102,15 @@ static int get_page_and_type_from_mfn(
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
      *   - We came in with a reference (partial_ref)
+     *   - page is partially validated (rc == -ERESTART), and the
+     *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
      * The partial_ref-on-error clause is worth an explanation.  There
      * are two scenarios where partial_ref might be true coming in:
-     * - mfn has been partially demoted as type `type`; i.e. has
-     *   PGT_partial set
+     * - mfn has been partially promoted / demoted as type `type`;
+     *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
      *   page; e.g. we're being called from get_page_from_l2e with
      *   type == PGT_l1_table, but the mfn is PGT_l2_table)
@@ -1126,7 +1133,8 @@ static int get_page_and_type_from_mfn(
      */
     if ( likely(!rc) || partial_ref )
         /* nothing */;
-    else if ( page == current->arch.old_guest_table )
+    else if ( page == current->arch.old_guest_table ||
+              (retain_ref && rc == -ERESTART) )
         ASSERT(preemptible);
     else
         put_page(page);
@@ -1324,8 +1332,8 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
         else if ( flags & PTF_defer )
         {
@@ -1370,8 +1378,8 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
     if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
          PTF_partial_set )
     {
-        ASSERT(!(flags & PTF_defer));
-        return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        /* partial_set should always imply partial_ref */
+        BUG();
     }
 
     if ( flags & PTF_defer )
@@ -1401,8 +1409,8 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
         if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
               PTF_partial_set )
         {
-            ASSERT(!(flags & PTF_defer));
-            return _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+            /* partial_set should always imply partial_ref */
+            BUG();
         }
 
         if ( flags & PTF_defer )
@@ -1540,13 +1548,22 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         else
             rc = get_page_from_l2e(l2e, pfn, d, partial_flags);
 
-        if ( rc == -ERESTART )
-        {
-            page->nr_validated_ptes = i;
-            /* Set 'set', retain 'general ref' */
-            page->partial_flags = partial_flags | PTF_partial_set;
-        }
-        else if ( rc == -EINTR && i )
+        /*
+         * It shouldn't be possible for get_page_from_l2e to return
+         * -ERESTART, since we never call this with PTF_preemptible.
+         * (alloc_l1_table may return -EINTR on an L1TF-vulnerable
+         * entry.)
+         *
+         * NB that while on a "clean" promotion, we can never get
+         * PGT_partial.  It is possible to arrange for an l2e to
+         * contain a partially-devalidated l2; but in that case, both
+         * of the following functions will fail anyway (the first
+         * because the page in question is not an l1; the second
+         * because the page is not fully validated).
+         */
+        ASSERT(rc != -ERESTART);
+
+        if ( rc == -EINTR && i )
         {
             page->nr_validated_ptes = i;
             page->partial_flags = 0;
@@ -1556,6 +1573,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
         {
             gdprintk(XENLOG_WARNING,
                      "Failure %d in alloc_l2_table: slot %#x\n", rc, i);
+            ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
                 page->nr_validated_ptes = i;
@@ -1614,7 +1632,7 @@ static int alloc_l3_table(struct page_info *page)
                 rc = get_page_and_type_from_mfn(
                     l3e_get_mfn(l3e),
                     PGT_l2_page_table | PGT_pae_xen_l2, d,
-                    partial_flags | PTF_preemptible);
+                    partial_flags | PTF_preemptible | PTF_retain_ref_on_restart);
         }
         else if ( !(l3e_get_flags(l3e) & _PAGE_PRESENT) )
         {
@@ -1623,13 +1641,14 @@ static int alloc_l3_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l3e(l3e, pfn, d, partial_flags);
+            rc = get_page_from_l3e(l3e, pfn, d,
+                                   partial_flags | PTF_retain_ref_on_restart);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1806,13 +1825,14 @@ static int alloc_l4_table(struct page_info *page)
             rc = -EINTR;
         }
         else
-            rc = get_page_from_l4e(l4e, pfn, d, partial_flags);
+            rc = get_page_from_l4e(l4e, pfn, d,
+                                   partial_flags | PTF_retain_ref_on_restart);
 
         if ( rc == -ERESTART )
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = partial_flags | PTF_partial_set;
+            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
         }
         else if ( rc < 0 )
         {
@@ -1909,9 +1929,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1959,9 +1977,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -1992,9 +2008,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = (partial_flags & PTF_partial_set) ?
-            partial_flags :
-            (PTF_partial_set | PTF_partial_general_ref);
+        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index 0e628fa417..ba9be4ed40 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -241,22 +241,25 @@ struct page_info
          * page.
          *
          * This happens:
-         * - During de-validation, if de-validation of the page was
+         * - During validation or de-validation, if the operation was
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
          * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because we're picking
-         *   up from a partial de-validation).
+         *   this entry to begin with (perhaps because it picked up a
+         *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is clear,
-         * then a general reference must be re-acquired; if it is set, no
-         * reference should be acquired.
+         * When resuming validation, if PTF_partial_general_ref is
+         * clear, then a general reference must be re-acquired; if it
+         * is set, no reference should be acquired.
          *
          * When resuming de-validation, if PTF_partial_general_ref is
          * clear, no reference should be dropped; if it is set, a
          * reference should be dropped.
          *
+         * NB at the moment, PTF_partial_set should be set if and only if
+         * PTF_partial_general_ref is set.
+         *
          * NB that PTF_partial_set and PTF_partial_general_ref are
          * defined in mm.c, the only place where they are used.
          *
-- 
2.23.0


["xsa299/0008-x86-mm-Collapse-PTF_partial_set-and-PTF_partial_gene.patch" (application/octet-stream)]

From 88ac5077bc1d8405f87dfe5ef1bdd803593bf2b5 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 08/11] x86/mm: Collapse PTF_partial_set and
 PTF_partial_general_ref into one

...now that they are equivalent.  No functional change intended.
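
For illustration, here is an assumed, simplified sketch of what the
collapse means for the bitfield layout (struct names invented;
PAGETABLE_ORDER is 9 on x86-64, giving 512 entries per level):

#include <stdint.h>

#define PAGETABLE_ORDER 9

/* Before: two partial flags tracked per page. */
struct pt_partial_before {
    uint16_t nr_validated_ptes : PAGETABLE_ORDER + 1;      /* 10 bits */
    uint16_t                   : 16 - PAGETABLE_ORDER - 1 - 2;
    uint16_t partial_flags     : 2;   /* set + general_ref */
    int16_t  linear_pt_count;
};

/* After: PTF_partial_set alone, since it now implies the ref. */
struct pt_partial_after {
    uint16_t nr_validated_ptes : PAGETABLE_ORDER + 1;
    uint16_t                   : 16 - PAGETABLE_ORDER - 1 - 1;
    uint16_t partial_flags     : 1;
    int16_t  linear_pt_count;
};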

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c        | 50 +++++++++++-----------------------------
 xen/include/asm-x86/mm.h | 29 +++++++++++------------
 2 files changed, 26 insertions(+), 53 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index d294b09a38..0655c0570b 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1067,13 +1067,12 @@ get_page_from_l1e(
 
 /*
  * The following flags are used to specify behavior of various get and
- * put commands.  The first two are also stored in page->partial_flags
- * to indicate the state of the page pointed to by
+ * put commands.  The first is also stored in page->partial_flags to
+ * indicate the state of the page pointed to by
  * page->pte[page->nr_validated_entries].  See the comment in mm.h for
  * more information.
  */
 #define PTF_partial_set           (1 << 0)
-#define PTF_partial_general_ref   (1 << 1)
 #define PTF_preemptible           (1 << 2)
 #define PTF_defer                 (1 << 3)
 #define PTF_retain_ref_on_restart (1 << 4)
@@ -1085,13 +1084,10 @@ static int get_page_and_type_from_mfn(
     struct page_info *page = mfn_to_page(mfn);
     int rc;
     bool preemptible = flags & PTF_preemptible,
-         partial_ref = flags & PTF_partial_general_ref,
          partial_set = flags & PTF_partial_set,
          retain_ref  = flags & PTF_retain_ref_on_restart;
 
-    ASSERT(partial_ref == partial_set);
-
-    if ( likely(!partial_ref) &&
+    if ( likely(!partial_set) &&
          unlikely(!get_page_from_mfn(mfn, d)) )
         return -EINVAL;
 
@@ -1101,14 +1097,14 @@ static int get_page_and_type_from_mfn(
      * Retain the refcount if:
      * - page is fully validated (rc == 0)
      * - page is not validated (rc < 0) but:
-     *   - We came in with a reference (partial_ref)
+     *   - We came in with a reference (partial_set)
      *   - page is partially validated (rc == -ERESTART), and the
      *     caller has asked the ref to be retained in that case
      *   - page is partially validated but there's been an error
      *     (page == current->arch.old_guest_table)
      *
-     * The partial_ref-on-error clause is worth an explanation.  There
-     * are two scenarios where partial_ref might be true coming in:
+     * The partial_set-on-error clause is worth an explanation.  There
+     * are two scenarios where partial_set might be true coming in:
      * - mfn has been partially promoted / demoted as type `type`;
      *   i.e. has PGT_partial set
      * - mfn has been partially demoted as L(type+1) (i.e., a linear
@@ -1131,7 +1127,7 @@ static int get_page_and_type_from_mfn(
      * count retained unless we succeeded, or the operation was
      * preemptible.
      */
-    if ( likely(!rc) || partial_ref )
+    if ( likely(!rc) || partial_set )
         /* nothing */;
     else if ( page == current->arch.old_guest_table ||
               (retain_ref && rc == -ERESTART) )
@@ -1329,13 +1325,7 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         struct page_info *pg = l2e_get_page(l2e);
         struct page_info *ptpg = mfn_to_page(_mfn(pfn));
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-        else if ( flags & PTF_defer )
+        if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
@@ -1375,13 +1365,6 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     pg = l3e_get_page(l3e);
 
-    if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-         PTF_partial_set )
-    {
-        /* partial_set should always imply partial_ref */
-        BUG();
-    }
-
     if ( flags & PTF_defer )
     {
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1406,13 +1389,6 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
     {
         struct page_info *pg = l4e_get_page(l4e);
 
-        if ( (flags & (PTF_partial_set | PTF_partial_general_ref)) ==
-              PTF_partial_set )
-        {
-            /* partial_set should always imply partial_ref */
-            BUG();
-        }
-
         if ( flags & PTF_defer )
         {
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
@@ -1648,7 +1624,7 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc == -EINTR && i )
         {
@@ -1832,7 +1808,7 @@ static int alloc_l4_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             /* Set 'set', leave 'general ref' set if this entry was set */
-            page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+            page->partial_flags = PTF_partial_set;
         }
         else if ( rc < 0 )
         {
@@ -1929,7 +1905,7 @@ static int free_l2_table(struct page_info *page)
     else if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L2_PAGETABLE_ENTRIES - 1 )
     {
@@ -1977,7 +1953,7 @@ static int free_l3_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L3_PAGETABLE_ENTRIES - 1 )
     {
@@ -2008,7 +1984,7 @@ static int free_l4_table(struct page_info *page)
     if ( rc == -ERESTART )
     {
         page->nr_validated_ptes = i;
-        page->partial_flags = PTF_partial_set | PTF_partial_general_ref;
+        page->partial_flags = PTF_partial_set;
     }
     else if ( rc == -EINTR && i < L4_PAGETABLE_ENTRIES - 1 )
     {
diff --git a/xen/include/asm-x86/mm.h b/xen/include/asm-x86/mm.h
index ba9be4ed40..32807d4dd4 100644
--- a/xen/include/asm-x86/mm.h
+++ b/xen/include/asm-x86/mm.h
@@ -236,7 +236,7 @@ struct page_info
          * operation on the current page.  (That page may or may not
          * still have PGT_partial set.)
          *
-         * If PTF_partial_general_ref is set, then the PTE at
+         * Additionally, if PTF_partial_set is set, then the PTE at
          * @nr_validated_ptef holds a general reference count for the
          * page.
          *
@@ -245,23 +245,20 @@ struct page_info
          *   interrupted
          * - During validation, if an invalid entry is encountered and
          *   validation is preemptible
-         * - During validation, if PTF_partial_general_ref was set on
-         *   this entry to begin with (perhaps because it picked up a
+         * - During validation, if PTF_partial_set was set on this
+         *   entry to begin with (perhaps because it picked up a
          *   previous operation)
          *
-         * When resuming validation, if PTF_partial_general_ref is
-         * clear, then a general reference must be re-acquired; if it
-         * is set, no reference should be acquired.
+         * When resuming validation, if PTF_partial_set is clear, then
+         * a general reference must be re-acquired; if it is set, no
+         * reference should be acquired.
          *
-         * When resuming de-validation, if PTF_partial_general_ref is
-         * clear, no reference should be dropped; if it is set, a
-         * reference should be dropped.
+         * When resuming de-validation, if PTF_partial_set is clear,
+         * no reference should be dropped; if it is set, a reference
+         * should be dropped.
          *
-         * NB at the moment, PTF_partial_set should be set if and only if
-         * PTF_partial_general_ref is set.
-         *
-         * NB that PTF_partial_set and PTF_partial_general_ref are
-         * defined in mm.c, the only place where they are used.
+         * NB that PTF_partial_set is defined in mm.c, the only place
+         * where it is used.
          *
          * The 3rd field, @linear_pt_count, indicates
          * - by a positive value, how many same-level page table entries a page
@@ -271,8 +268,8 @@ struct page_info
          */
         struct {
             u16 nr_validated_ptes:PAGETABLE_ORDER + 1;
-            u16 :16 - PAGETABLE_ORDER - 1 - 2;
-            u16 partial_flags:2;
+            u16 :16 - PAGETABLE_ORDER - 1 - 1;
+            u16 partial_flags:1;
             s16 linear_pt_count;
         };
 
-- 
2.23.0


["xsa299/0009-x86-mm-Properly-handle-linear-pagetable-promotion-fa.patch" (application/octet-stream)]

From 12f81c49bd8ff183503d8cb97552a6b801eab64c Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 09/11] x86/mm: Properly handle linear pagetable promotion
 failures

In order to allow recursive pagetable promotions and demotions to be
interrupted, Xen must keep track of the state of the sub-pages
promoted or demoted.  This is stored in two elements in the page
struct: nr_validated_ptes and partial_flags.

The rule is that entries [0, nr_validated_ptes) should always be
validated and hold a general reference count.  If partial_flags is
zero, then [nr_validated_ptes] is not validated and no reference
count is held.  If PTF_partial_set is set, then [nr_validated_ptes]
is partially validated, and a general reference count is held.

Unfortunately, in cases where an entry began with PTF_partial_set set,
and get_page_from_lNe() returns -EINVAL, the PTF_partial_set bit is
erroneously dropped.  (This scenario can be engineered mainly by
interleaving promotion and demotion of a page which has "linear
pagetable" entries; see the appendix for a sketch.)  This means that
we will "leak" a general reference count on the page in question,
preventing the page from being freed.

Fix this by setting page->partial_flags to the partial_flags local
variable.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix

Suppose A and B can both be promoted to L2 pages, and A[x] points to B.

V1: PIN_L2 B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY pointing something to A.
  In the process of validating A[x], grab an extra type / ref on B:
  B.type_count = 2 | PGT_validated
  B.count = 3 | PGC_allocated
  A.type_count = 1 | PGT_validated
  A.count = 2 | PGC_allocated

V1: UNPIN B.
  B.type_count = 1 | PGT_validated
  B.count = 2 | PGC_allocated

V1: MOD_L3_ENTRY removing the reference to A.
  De-validate A, down to A[x], which points to B.
  Drop the final type on B.  Arrange to be interrupted.
  B.type_count = 1 | PGT_partial
  B.count = 2 | PGC_allocated
  A.type_count = 1 | PGT_partial
  A.nr_validated_entries = x
  A.partial_pte = -1

V2: MOD_L3_ENTRY adds a reference to A.

At this point, get_page_from_l2e(A[x]) tries
get_page_and_type_from_mfn(), which fails because it's the wrong type;
and get_l2_linear_pagetable() also fails, because B isn't validated as
an l2 anymore.
---
 xen/arch/x86/mm.c | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 0655c0570b..00e112f674 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1553,7 +1553,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
             }
@@ -1647,7 +1647,7 @@ static int alloc_l3_table(struct page_info *page)
         if ( i )
         {
             page->nr_validated_ptes = i;
-            page->partial_flags = 0;
+            page->partial_flags = partial_flags;
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1818,7 +1818,7 @@ static int alloc_l4_table(struct page_info *page)
             if ( i )
             {
                 page->nr_validated_ptes = i;
-                page->partial_flags = 0;
+                page->partial_flags = partial_flags;
                 if ( rc == -EINTR )
                     rc = -ERESTART;
                 else
-- 
2.23.0


["xsa299/0010-x86-mm-Fix-nested-de-validation-on-error.patch" (application/octet-stream)]

From e2b0f5123dc84db23f5b4d4ad672c89055f0d8bd Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:49 +0100
Subject: [PATCH 10/11] x86/mm: Fix nested de-validation on error

If an invalid entry is discovered when validating a page-table tree,
the entire tree which has so far been validated must be de-validated.
Since this may take a long time, alloc_l[2-4]_table() set current
vcpu's old_guest_table immediately; put_old_guest_table() will make
sure that put_page_type() will be called to finish off the
de-validation before any other MMU operations can happen on the vcpu.

The invariant for partial pages should be:

* Entries [0, nr_validated_ptes) should be completely validated;
  put_page_type() will de-validate these.

* If [nr_validated_ptes] is partially validated, partial_flags should
  have PTF_partial_set set.  put_page_type() will be called on this
  page to finish off de-validation, and the appropriate refcount
  adjustments will be done (see the sketch below).
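
The following compilable sketch models that resume rule for
de-validation; the model_* names are invented stand-ins, not Xen's
actual primitives:

#define MODEL_PTF_partial_set (1 << 0)

struct model_page {
    unsigned int nr_validated_ptes;
    unsigned int partial_flags;
};

/* Empty stubs standing in for the real de-validation primitives. */
static void model_put_entry_type(struct model_page *pg, unsigned int i)
{ (void)pg; (void)i; }
static void model_drop_entry_ref(struct model_page *pg, unsigned int i)
{ (void)pg; (void)i; }

static void model_devalidate(struct model_page *pg)
{
    unsigned int i = pg->nr_validated_ptes;

    /*
     * [i] needs finishing off only if it was left partially
     * validated; in that case it also holds a general ref, so both
     * the type and the ref must be dropped.
     */
    if ( pg->partial_flags & MODEL_PTF_partial_set )
    {
        model_put_entry_type(pg, i);
        model_drop_entry_ref(pg, i);
    }

    /* Entries [0, i) are fully validated; de-validate each one. */
    while ( i-- > 0 )
    {
        model_put_entry_type(pg, i);
        model_drop_entry_ref(pg, i);
    }
}

int main(void)
{
    struct model_page pg = { .nr_validated_ptes = 4,
                             .partial_flags = MODEL_PTF_partial_set };

    model_devalidate(&pg);
    return 0;
}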

The alloc_l[2-3]_table() functions indicate partial validation to
their callers by setting current->old_guest_table.

Unfortunately, this is mishandled.

Take the case where validating lNe[x] returns an error.

First, alloc_l3_table() doesn't check old_guest_table at all; as a
result, partial_flags is not set when it should be.  nr_validated_ptes
is set to x; and since PTF_partial_set is clear, de-validation resumes at
nr_validated_ptes-1.  This means that the l2 page at pl3e[x] will not
have put_page_type() called on it when de-validating the rest of the
l3: it will be stuck in the PGT_partial state until the domain is
destroyed, or until it is re-used as an l2.  (Any other page type will
fail.)

Worse, alloc_l4_table(), rather than setting PTF_partial_set as it
should, sets nr_validated_ptes to x+1.  When de-validating, since
partial is 0, this will correctly resume calling put_page_type() at
[x]; but if put_page_type() is never called and get_page_type() is
called instead, validation will pick up at [x+1],
neglecting to validate [x].  If the rest of the validation succeeds,
the l4 will be validated even though [x] is invalid.

Fix this in both cases by setting PTF_partial_set if old_guest_table
is set.

While here, add some safety catches:
- old_guest_table must point to the page contained in
  [nr_validated_ptes].
- alloc_l1_table() shouldn't set old_guest_table

If we experience one of these situations in production builds, it's
safer to avoid calling put_page_type for the pages in question.  If
they have PGT_partial set, they will be cleaned up on domain
destruction; if not, we have no idea whether a type count is safe to
drop.  Retaining an extra type ref that should have been dropped may
trigger a BUG() on the free_domain_page() path, but dropping a type
count that shouldn't be dropped may cause a privilege escalation.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
---
 xen/arch/x86/mm.c | 53 +++++++++++++++++++++++++++++++++++++++++++++--
 1 file changed, 51 insertions(+), 2 deletions(-)

diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 00e112f674..43ff3627eb 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1552,6 +1552,20 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
             ASSERT(current->arch.old_guest_table == NULL);
             if ( i )
             {
+                /*
+                 * alloc_l1_table() doesn't set old_guest_table; it does
+                 * its own tear-down immediately on failure.  If it
+                 * did we'd need to check it and set partial_flags as we
+                 * do in alloc_l[34]_table().
+                 *
+                 * Note on the use of ASSERT: if it's non-null and
+                 * hasn't been cleaned up yet, it should have
+                 * PGT_partial set; and so the type will be cleaned up
+                 * on domain destruction.  Unfortunately, we would
+                 * leak the general ref held by old_guest_table; but
+                 * leaking a page is less bad than a host crash.
+                 */
+                ASSERT(current->arch.old_guest_table == NULL);
                 page->nr_validated_ptes = i;
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
@@ -1579,6 +1593,7 @@ static int alloc_l3_table(struct page_info *page)
     unsigned int   i;
     int            rc = 0;
     unsigned int   partial_flags = page->partial_flags;
+    l3_pgentry_t   l3e = l3e_empty();
 
     pl3e = map_domain_page(_mfn(pfn));
 
@@ -1595,7 +1610,7 @@ static int alloc_l3_table(struct page_info *page)
     for ( i = page->nr_validated_ptes; i < L3_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
-        l3_pgentry_t l3e = pl3e[i];
+        l3e = pl3e[i];
 
         if ( i > page->nr_validated_ptes && hypercall_preempt_check() )
             rc = -EINTR;
@@ -1648,6 +1663,24 @@ static int alloc_l3_table(struct page_info *page)
         {
             page->nr_validated_ptes = i;
             page->partial_flags = partial_flags;
+            if ( current->arch.old_guest_table )
+            {
+                /*
+                 * We've experienced a validation failure.  If
+                 * old_guest_table is set, "transfer" the general
+                 * reference count to pl3e[nr_validated_ptes] by
+                 * setting PTF_partial_set.
+                 *
+                 * As a precaution, check that old_guest_table is the
+                 * page pointed to by pl3e[nr_validated_ptes].  If
+                 * not, it's safer to leak a type ref on production
+                 * builds.
+                 */
+                if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                    page->partial_flags = PTF_partial_set;
+                else
+                    ASSERT_UNREACHABLE();
+            }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
         }
@@ -1824,7 +1857,23 @@ static int alloc_l4_table(struct page_info *page)
                 else
                 {
                     if ( current->arch.old_guest_table )
-                        page->nr_validated_ptes++;
+                    {
+                        /*
+                         * We've experienced a validation failure.  If
+                         * old_guest_table is set, "transfer" the general
+                         * reference count to pl4e[nr_validated_ptes] by
+                         * setting PTF_partial_set.
+                         *
+                         * As a precaution, check that old_guest_table is the
+                         * page pointed to by pl4e[nr_validated_ptes].  If
+                         * not, it's safer to leak a type ref on production
+                         * builds.
+                         */
+                        if ( current->arch.old_guest_table == l4e_get_page(l4e) )
+                            page->partial_flags = PTF_partial_set;
+                        else
+                            ASSERT_UNREACHABLE();
+                    }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
                 }
-- 
2.23.0


["xsa299/0011-x86-mm-Don-t-drop-a-type-ref-unless-you-held-a-ref-t.patch" (application/octet-stream)]

From e033990c472001a8c37b0ac1a410d7b4ed13be69 Mon Sep 17 00:00:00 2001
From: George Dunlap <george.dunlap@citrix.com>
Date: Thu, 10 Oct 2019 17:57:50 +0100
Subject: [PATCH 11/11] x86/mm: Don't drop a type ref unless you held a ref to
 begin with

Validation and de-validation of pagetable trees may take arbitrarily
large amounts of time, and so must be preemptible.  This is indicated
by setting the PGT_partial bit in the type_info, and setting
nr_validated_entries and partial_flags appropriately.  Specifically,
if the entry at [nr_validated_entries] is partially validated,
partial_flags should have the PGT_partial_set bit set, and the entry
should hold a general reference count.  During de-validation,
put_page_type() is called on partially validated entries.

Unfortunately, there are a number of issues with the current algorithm.

First, doing a "normal" put_page_type() is not safe when no type ref
is held: there is nothing to stop another vcpu from coming along and
picking up validation again, at which point put_page_type() may drop
the only page ref on an in-use page.  Some examples are listed in the
appendix.

The core issue is that put_page_type() is being called both to clean
up PGT_partial and to drop a type count, and has no way of knowing
which is which; so if PGT_partial is cleared in between,
put_page_type() will drop the type ref erroneously.

What is needed is to distinguish between two states:
- Dropping a type ref which is held
- Cleaning up a page which has been partially de/validated

Fix this by telling put_page_type() which of the two activities you
intend.

When cleaning up a partial de/validation, take no action unless you
find a page partially validated.

If put_page_type() is called without PTF_partial_set, and finds the
page in a PGT_partial state anyway, then there's certainly been a
misaccounting somewhere, and carrying on would almost certainly cause
a security issue, so crash the host instead.
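
Distilled out of context, the new guard is roughly the following
compilable model (bit positions, the owner_dying parameter, and
assert() standing in for BUG_ON() are simplifications for
illustration):

#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define PGT_partial     (1u << 15)
#define PGT_validated   (1u << 14)
#define PGT_count_mask  0x0fffu
#define PTF_partial_set (1u << 0)

/*
 * Returns 0 if there is nothing to do; otherwise the caller proceeds
 * to drop the type count.
 */
static int put_type_guard(uint32_t x, unsigned int flags, bool owner_dying)
{
    if ( !(flags & PTF_partial_set) )
        /*
         * Full drop: the caller must hold a real type ref on a
         * validated page (or the owner is dying).  Anything else is
         * misaccounting, so crash rather than carry on.
         */
        assert(!(x & PGT_partial) &&
               ((x & PGT_validated) || owner_dying));
    else if ( !(x & PGT_partial) )
        return 0;   /* partial cleanup, but someone else finished it */
    else
        assert((x & PGT_count_mask) == 1);  /* partial => count is 1 */

    return 1;
}

int main(void)
{
    /* A full drop of a held ref on a validated page: permitted. */
    return put_type_guard(PGT_validated | 1, 0, false) == 1 ? 0 : 1;
}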

In put_page_from_lNe, pass partial_flags on to _put_page_type().

old_guest_table may be set either with a fully validated page (when
using the "deferred put" pattern), or with a partially validated page
(when a normal "de-validation" is interrupted, or when a validation
fails part-way through due to invalid entries).  Add a flag,
old_guest_table_partial, to indicate which of these it is, and use
that to pass the appropriate flag to _put_page_type().

While here, delete stray trailing whitespace.

This is part of XSA-299.

Reported-by: George Dunlap <george.dunlap@citrix.com>
Signed-off-by: George Dunlap <george.dunlap@citrix.com>
Reviewed-by: Jan Beulich <jbeulich@suse.com>
-----
Appendix:

Suppose page A, when interpreted as an l3 pagetable, contains all
valid entries; and suppose A[x] points to page B, which when
interpreted as an l2 pagetable, contains all valid entries.

P1: PIN_L3_TABLE
  A -> PGT_l3_table | 1 | valid
  B -> PGT_l2_table | 1 | valid

P1: UNPIN_TABLE
  > Arrange to interrupt after B has been de-validated
  B:
    type_info -> PGT_l2_table | 0
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_ptes -> (less than x)

P2: mod_l4_entry to point to A
  > Arrange for this to be interrupted while B is being validated
  B:
    type_info -> PGT_l2_table | 1 | partial
    (nr_validated_ptes &c set as appropriate)
  A:
    type_info -> PGT_l3_table | 1 | partial
    nr_validated_ptes -> x
    partial_pte = 1

P3: mod_l3_entry some other unrelated l3 to point to B:
  B:
    type_info -> PGT_l2_table | 1

P1: Restart UNPIN_TABLE

At this point, since A.nr_validated_ptes == x and A.partial_pte !=
0, free_l3_table() will call put_page_from_l3e() on pl3e[x], dropping
its type count to 0 while it's still being pointed to by some other l3.

A similar issue arises with old_guest_table.  Consider the following
scenario:

Suppose A is a page which, when interpreted as an l2, has valid entries
until entry x, which is invalid.

V1:  PIN_L2_TABLE(A)
  <Validate until we try to validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V1 -> old_guest_table = A
  <delayed>

V2: PIN_L2_TABLE(A)
  <Pick up where V1 left off, try to re-validate [x], get -EINVAL>
  A -> PGT_l2_table | 1 | PGT_partial
  V2 -> old_guest_table = A
  <restart>
  put_old_guest_table()
    _put_page_type(A)
      A -> PGT_l2_table | 0

V1: <restart>
  put_old_guest_table()
    _put_page_type(A) # UNDERFLOW

Indeed, it is possible to engineer things such that old_guest_table
for every vcpu a guest has points to the same page.
---
 xen/arch/x86/domain.c        |  6 +++
 xen/arch/x86/mm.c            | 99 +++++++++++++++++++++++++++++++-----
 xen/include/asm-x86/domain.h |  4 +-
 3 files changed, 95 insertions(+), 14 deletions(-)

diff --git a/xen/arch/x86/domain.c b/xen/arch/x86/domain.c
index febbb23336..918564a910 100644
--- a/xen/arch/x86/domain.c
+++ b/xen/arch/x86/domain.c
@@ -1168,9 +1168,15 @@ int arch_set_info_guest(
                     rc = -ERESTART;
                     /* Fallthrough */
                 case -ERESTART:
+                    /*
+                     * NB that we're putting the kernel-mode table
+                     * here, which we've already successfully
+                     * validated above; hence partial = false;
+                     */
                     v->arch.old_guest_ptpg = NULL;
                     v->arch.old_guest_table =
                         pagetable_get_page(v->arch.guest_table);
+                    v->arch.old_guest_table_partial = false;
                     v->arch.guest_table = pagetable_null();
                     break;
                 default:
diff --git a/xen/arch/x86/mm.c b/xen/arch/x86/mm.c
index 43ff3627eb..79c3e4c473 100644
--- a/xen/arch/x86/mm.c
+++ b/xen/arch/x86/mm.c
@@ -1329,10 +1329,11 @@ static int put_page_from_l2e(l2_pgentry_t l2e, unsigned long pfn,
         {
             current->arch.old_guest_ptpg = ptpg;
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
         }
         else
         {
-            rc = _put_page_type(pg, PTF_preemptible, ptpg);
+            rc = _put_page_type(pg, flags | PTF_preemptible, ptpg);
             if ( likely(!rc) )
                 put_page(pg);
         }
@@ -1355,6 +1356,7 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
         unsigned long mfn = l3e_get_pfn(l3e);
         bool writeable = l3e_get_flags(l3e) & _PAGE_RW;
 
+        ASSERT(!(flags & PTF_partial_set));
         ASSERT(!(mfn & ((1UL << (L3_PAGETABLE_SHIFT - PAGE_SHIFT)) - 1)));
         do {
             put_data_page(mfn_to_page(_mfn(mfn)), writeable);
@@ -1367,12 +1369,14 @@ static int put_page_from_l3e(l3_pgentry_t l3e, unsigned long pfn,
 
     if ( flags & PTF_defer )
     {
+        ASSERT(!(flags & PTF_partial_set));
         current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
         current->arch.old_guest_table = pg;
+        current->arch.old_guest_table_partial = false;
         return 0;
     }
 
-    rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+    rc = _put_page_type(pg, flags | PTF_preemptible, mfn_to_page(_mfn(pfn)));
     if ( likely(!rc) )
         put_page(pg);
 
@@ -1391,12 +1395,15 @@ static int put_page_from_l4e(l4_pgentry_t l4e, unsigned long pfn,
 
         if ( flags & PTF_defer )
         {
+            ASSERT(!(flags & PTF_partial_set));
             current->arch.old_guest_ptpg = mfn_to_page(_mfn(pfn));
             current->arch.old_guest_table = pg;
+            current->arch.old_guest_table_partial = false;
             return 0;
         }
 
-        rc = _put_page_type(pg, PTF_preemptible, mfn_to_page(_mfn(pfn)));
+        rc = _put_page_type(pg, flags | PTF_preemptible,
+                            mfn_to_page(_mfn(pfn)));
         if ( likely(!rc) )
             put_page(pg);
     }
@@ -1506,6 +1513,14 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
 
     pl2e = map_domain_page(_mfn(pfn));
 
+    /*
+     * NB that alloc_l2_table will never set partial_pte on an l2; but
+     * free_l2_table might if a linear_pagetable entry is interrupted
+     * partway through de-validation.  In that circumstance,
+     * get_page_from_l2e() will always return -EINVAL; and we must
+     * retain the type ref by doing the normal partial_flags tracking.
+     */
+
     for ( i = page->nr_validated_ptes; i < L2_PAGETABLE_ENTRIES;
           i++, partial_flags = 0 )
     {
@@ -1570,6 +1585,7 @@ static int alloc_l2_table(struct page_info *page, unsigned long type)
                 page->partial_flags = partial_flags;
                 current->arch.old_guest_ptpg = NULL;
                 current->arch.old_guest_table = page;
+                current->arch.old_guest_table_partial = true;
             }
         }
         if ( rc < 0 )
@@ -1677,12 +1693,16 @@ static int alloc_l3_table(struct page_info *page)
                  * builds.
                  */
                 if ( current->arch.old_guest_table == l3e_get_page(l3e) )
+                {
+                    ASSERT(current->arch.old_guest_table_partial);
                     page->partial_flags = PTF_partial_set;
+                }
                 else
                     ASSERT_UNREACHABLE();
             }
             current->arch.old_guest_ptpg = NULL;
             current->arch.old_guest_table = page;
+            current->arch.old_guest_table_partial = true;
         }
         while ( i-- > 0 )
             pl3e[i] = unadjust_guest_l3e(pl3e[i], d);
@@ -1870,12 +1890,16 @@ static int alloc_l4_table(struct page_info *page)
                          * builds.
                          */
                         if ( current->arch.old_guest_table == l4e_get_page(l4e) )
+                        {
+                            ASSERT(current->arch.old_guest_table_partial);
                             page->partial_flags = PTF_partial_set;
+                        }
                         else
                             ASSERT_UNREACHABLE();
                     }
                     current->arch.old_guest_ptpg = NULL;
                     current->arch.old_guest_table = page;
+                    current->arch.old_guest_table_partial = true;
                 }
             }
         }
@@ -2801,6 +2825,28 @@ static int _put_page_type(struct page_info *page, unsigned int flags,
         x  = y;
         nx = x - 1;
 
+        /*
+         * Is this expected to do a full reference drop, or only
+         * cleanup partial validation / devalidation?
+         *
+         * If the former, the caller must hold a "full" type ref;
+         * which means the page must be validated.  If the page is
+         * *not* fully validated, continuing would almost certainly
+         * open up a security hole.  An exception to this is during
+         * domain destruction, where PGT_validated can be dropped
+         * without dropping a type ref.
+         *
+         * If the latter, do nothing unless type PGT_partial is set.
+         * If it is set, the type count must be 1.
+         */
+        if ( !(flags & PTF_partial_set) )
+            BUG_ON((x & PGT_partial) ||
+                   !((x & PGT_validated) || page_get_owner(page)->is_dying));
+        else if ( !(x & PGT_partial) )
+            return 0;
+        else
+            BUG_ON((x & PGT_count_mask) != 1);
+
         ASSERT((x & PGT_count_mask) != 0);
 
         switch ( nx & (PGT_locked | PGT_count_mask) )
@@ -3058,17 +3104,34 @@ int put_old_guest_table(struct vcpu *v)
     if ( !v->arch.old_guest_table )
         return 0;
 
-    switch ( rc = _put_page_type(v->arch.old_guest_table, PTF_preemptible,
-                                 v->arch.old_guest_ptpg) )
+    rc = _put_page_type(v->arch.old_guest_table,
+                        PTF_preemptible |
+                        ( v->arch.old_guest_table_partial ?
+                          PTF_partial_set : 0 ),
+                        v->arch.old_guest_ptpg);
+
+    if ( rc == -ERESTART || rc == -EINTR )
     {
-    case -EINTR:
-    case -ERESTART:
+        v->arch.old_guest_table_partial = (rc == -ERESTART);
         return -ERESTART;
-    case 0:
-        put_page(v->arch.old_guest_table);
     }
 
+    /*
+     * It shouldn't be possible for _put_page_type() to return
+     * anything else at the moment; but if it does happen in
+     * production, leaking the type ref is probably the best thing to
+     * do.  Either way, drop the general ref held by old_guest_table.
+     */
+    ASSERT(rc == 0);
+
+    put_page(v->arch.old_guest_table);
     v->arch.old_guest_table = NULL;
+    v->arch.old_guest_ptpg = NULL;
+    /*
+     * Safest default if someone sets old_guest_table without
+     * explicitly setting old_guest_table_partial.
+     */
+    v->arch.old_guest_table_partial = true;
 
     return rc;
 }
@@ -3219,11 +3282,11 @@ int new_guest_cr3(mfn_t mfn)
             switch ( rc = put_page_and_type_preemptible(page) )
             {
             case -EINTR:
-                rc = -ERESTART;
-                /* fallthrough */
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                rc = -ERESTART;
                 break;
             default:
                 BUG_ON(rc);
@@ -3460,6 +3523,7 @@ long do_mmuext_op(
                     {
                         curr->arch.old_guest_ptpg = NULL;
                         curr->arch.old_guest_table = page;
+                        curr->arch.old_guest_table_partial = false;
                     }
                 }
             }
@@ -3494,6 +3558,11 @@ long do_mmuext_op(
             case -ERESTART:
                 curr->arch.old_guest_ptpg = NULL;
                 curr->arch.old_guest_table = page;
+                /*
+                 * EINTR means we still hold the type ref; ERESTART
+                 * means PGT_partial holds the type ref
+                 */
+                curr->arch.old_guest_table_partial = (rc == -ERESTART);
                 rc = 0;
                 break;
             default:
@@ -3562,11 +3631,15 @@ long do_mmuext_op(
                 switch ( rc = put_page_and_type_preemptible(page) )
                 {
                 case -EINTR:
-                    rc = -ERESTART;
-                    /* fallthrough */
                 case -ERESTART:
                     curr->arch.old_guest_ptpg = NULL;
                     curr->arch.old_guest_table = page;
+                    /*
+                     * EINTR means we still hold the type ref;
+                     * ERESTART means PGT_partial holds the ref
+                     */
+                    curr->arch.old_guest_table_partial = (rc == -ERESTART);
+                    rc = -ERESTART;
                     break;
                 default:
                     BUG_ON(rc);
diff --git a/xen/include/asm-x86/domain.h b/xen/include/asm-x86/domain.h
index 7cebfa4fb9..212303f371 100644
--- a/xen/include/asm-x86/domain.h
+++ b/xen/include/asm-x86/domain.h
@@ -302,7 +302,7 @@ struct arch_domain
 
     struct paging_domain paging;
     struct p2m_domain *p2m;
-    /* To enforce lock ordering in the pod code wrt the 
+    /* To enforce lock ordering in the pod code wrt the
      * page_alloc lock */
     int page_alloc_unlock_level;
 
@@ -571,6 +571,8 @@ struct arch_vcpu
     struct page_info *old_guest_table;  /* partially destructed pagetable */
     struct page_info *old_guest_ptpg;   /* containing page table of the */
                                         /* former, if any */
+    bool old_guest_table_partial;       /* Are we dropping a type ref, or just
+                                         * finishing up a partial de-validation? */
     /* guest_table holds a ref to the page, and also a type-count unless
      * shadow refcounts are in use */
     pagetable_t shadow_table[4];        /* (MFN) shadow(s) of guest */
-- 
2.23.0


