List:       xml-dev
Subject:    Re: [xml-dev] Write an XSLT program that generates an XSLT program or write a universal XSLT program
From:       Michael Kay <mike () saxonica ! com>
Date:       2022-05-12 8:17:52
Message-ID: 89A42550-76A1-4079-9F9D-70B54F8D5082 () saxonica ! com

The XSD validator which I wrote in XSLT and described at Markup UK 2018

https://www.saxonica.com/papers/markupuk-2018mhk.pdf

is still sitting on an internal shelf and hasn't seen the light of day in public,
though it reached the point where it was passing something like 95% of the tests.

This was a "back end" schema validator only; it relied on Saxon's Java schema
compiler to process the raw XSD documents, including generation of finite state
automata for the complex types. But I don't think that doing the front end in XSLT
would be particularly difficult (in fact, most of the difficulties are in the back
end). Verifying subsumption of restricted types is probably the hardest part.

There are a few issues described in the paper which Rick's note doesn't address:

* assertions would be straightforward if they used untyped XPath. But they don't;
they work on semi-validated data (validated against everything except the
assertions), and constructing semi-validated data in (non-schema-aware?) XSLT poses a
challenge. For example, in an assertion, "@discount lt @price" compares the typed
values of the two attributes, not the untyped values.

* XSD rules for equality of atomic values (for example, in uniqueness constraints)
aren't the same as XPath equality rules (e.g. timezone handling is different)
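
To make the first point concrete, here is a minimal sketch in Python rather than XSLT. It assumes both attributes are declared as xs:decimal and uses made-up values; the function names are hypothetical. It only illustrates why the typed and untyped readings of "@discount lt @price" can disagree, and is not taken from any validator.

```python
from decimal import Decimal

def untyped_lt(a: str, b: str) -> bool:
    # Untyped reading: compare the raw lexical forms as strings,
    # character by character in codepoint order.
    return a < b

def typed_lt(a: str, b: str) -> bool:
    # Typed reading: convert to the declared type first
    # (assumed here to be xs:decimal), then compare the values.
    return Decimal(a) < Decimal(b)

# Made-up attribute values for illustration.
discount, price = "9.50", "10.00"

print(untyped_lt(discount, price))  # False, because "9" sorts after "1"
print(typed_lt(discount, price))    # True, because 9.50 is less than 10.00
```

An assertion evaluated over untyped data would therefore reject a document that the typed evaluation accepts.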

Yes, working with the XSD specification is a nightmare; it's the toughest spec I've
ever had to work with other than Algol 68, and unlike Algol 68, some of the apparent
formality turns out to be spurious; when it gets to tricky things that ought to be
formal, like whether two types are identical, the spec bails out. Perhaps I'm a
masochist, but for me, that's a fun engineering challenge.

I've considered the approach of validating complex types by turning them into regular
expressions against a string and using a regex engine. The main reason I decided
against it is that regex engines produce no useful diagnostics; they just tell you
the string doesn't match. Perhaps the answer to that would be to write a regex engine
with better diagnostics - I can see that being useful!
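
As a rough illustration of the trade-off, the Python sketch below uses the content model from the quoted message, which I am assuming means "A followed by either B, Z, X or just Z". The regex check returns only a boolean. The second function is not a regex engine at all: it is a hand-enumerated stand-in that works only for models without repetition, and every name in it is hypothetical.

```python
import re

# Assumed reading of the content model: A, then either (B, Z, X) or Z,
# written as a regex over a space-separated string of child element names.
CONTENT_MODEL = re.compile(r"A (B Z X|Z)")

# The same model enumerated by hand; feasible only because it has no repetition.
VALID_SEQUENCES = [["A", "B", "Z", "X"], ["A", "Z"]]

def matches_model(children):
    """Regex check: returns True or False, with no location information."""
    return CONTENT_MODEL.fullmatch(" ".join(children)) is not None

def first_unexpected(children):
    """Return (index, expected names) at the first point of failure, or None if valid."""
    candidates = VALID_SEQUENCES
    for i, name in enumerate(children):
        # Names that any still-viable sequence would allow at this position.
        expected = sorted({seq[i] if i < len(seq) else "(end)" for seq in candidates})
        candidates = [seq for seq in candidates if i < len(seq) and seq[i] == name]
        if not candidates:
            return i, expected
    # All children were allowed; check that the content is also complete.
    if any(len(seq) == len(children) for seq in candidates):
        return None
    return len(children), sorted({seq[len(children)] for seq in candidates})

print(matches_model(["A", "Z", "X"]))     # False
print(first_unexpected(["A", "Z", "X"]))  # (2, ['(end)'])
```

Even this stand-in reports where the error was detected (an X where the content should have ended), not the likely cause (a missing B), which is the limitation described in the quoted message.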

Michael Kay
Saxonica

> On 12 May 2022, at 08:49, Rick Jelliffe <rjelliffe@allette.com.au> wrote:
> 
> People interested in doing this should feel free to grab code from
> https://github.com/Schematron/schematron/tree/master/trunk/xsd2sch
> (or even update it!)
>
> In about 2008, JSTOR sponsored an R&D project to implement the reasonably large
> subset of XSD 1.0 that they used, to run as Schematron: this was not only to
> advance the state of the art, but because they were (I gather) finding XSD
> validators of the time just spewed out standard messages and numbers, which were as
> unhelpful as Voynich to editors and so on. (Perhaps they wanted to use apps and
> pipelines that did not support XSD too? Phases/progressive validation could also
> open up some extra workflow possibilities.)
>
> The coverage is approximately:
> * simple datatypes: believed to be 100%
> * list and union datatypes: not supported
> * structural constraints on elements and attributes: supported (~)
> * multiple namespaces, import and include: supported (~)
> * identity constraints: not supported
> * dynamic constraints (xsi:type, xsi:nil): not supported
> * tricky prefixes (elementFormDefault): not supported
> 
> Obviously implementing identity constraints and xsd:assert would be a doddle.
> (There is a page on identity constraints at the link below to give the idea.) It
> needs much more testing to be ready for commercial use, but is good enough for
> targeted use or cannibalization.
>
> The main difficulty of the project was retaining technical staff, if I recall: they
> absolutely hated having to deal with the XSD specification and found the technology
> had too many edge cases to be tractable, which meant that the project had to be
> organized in small discrete chunks, not for Scrum reasons but simply because of
> mental fatigue. (These were not dummies: one was working through his PhD, another
> ended up in Redmond.)
>
> Anyway, the code is there, and descriptions of the approaches (originally on
> O'Reilly's blog) are at Schematron.com (find "Converting XML Schemas to Schematron"
> for background), with details at https://schematron.com/document/2974.html
>
> I guess the main surprise to come out of it was that we could validate content
> models using XPath 2. Originally we started with just pairwise validation for
> element content types (x/y can only be followed by z, etc.), but it dawned on me that
> we could make a string listing the names of child elements in sequence, separated
> by spaces (e.g. "head body"), and test if that matched a regex generated from the
> content model, which took care of cardinality constraints too. (Which meant that
> Schematron was strictly more powerful than XSD 1.0.)
>
> The joy at finding we could do content model grammar validation was tempered by the
> realization that we could not give much better validation diagnostics: the messages
> always had to be in terms of where the error was detected rather than what caused
> it. E.g. if the content model was (A, (B, Z, X) | Z) and the instance had A, Z, X
> it would say "we found unexpected X here instead of Z" rather than e.g. "After A, B
> is missing, so you cannot have the Z followed by an X." Presumably some extra
> smarts could be added for this, and perhaps the XSD could have some annotations to
> help.
>
> The larger issue was that Schematron allows semantic assertions and diagnostics:
> you can express a constraint in natural language in terms that the target user
> understands, and give feedback to them. (A real example: I was working on a
> pipeline system where the edited documents were translated into several
> intermediate XML vocabs and structures before being output and validated. The
> company employed devops people to look at the validation logs, then trace back to
> the original authoring format, then decide if it was a programming error or a markup
> error.) So merely converting an XSD to Schematron did not give the advantage of
> having efficient, specific, targeted feedback.
>
> (It goes deeper than the names. The grammar-based schemas have no capability of
> capturing and transmitting intention: if an attribute or element is required, why
> is it required? If a content model is super-complicated, what simpler pattern is
> actually being modelled, albeit clumsily?)
>
> I would not want to implement this again using XSLT 2. Maybe 3 is better (?) but I
> think doing at least some of the stages in some general-purpose language (Java,
> etc) that allowed decoratable objects would have reduced the mental complexity a
> lot: immutability just sucks sometimes.
> 
> Cheers
> Rick
> 
> 
> 
> On Mon, 9 May 2022, 21:16 Roger L Costello, <costello@mitre.org> wrote:
>
> Hi Folks,
> 
> 
> 
> The Schematron processor that I use is an XSLT program that takes a Schematron
> schema as input and transforms it into an XSLT program that is specific to that
> Schematron schema:
> 
> 
> Schematron schema --> XSLT --> XSLT for the particular Schematron schema
> 
> 
> 
> Then the "XSLT for the particular Schematron schema" is run, taking as input the
> XML document to be validated. The output is the validation results:
> 
> 
> XML doc to be validated --> XSLT for the particular Schematron schema --> validation results
> 
> 
> Rick et al chose to implement Schematron validation by generating a stylesheet for
> the particular Schematron schema.
> 
> 
> An alternative strategy would have been to create a universal stylesheet that
> directly performs Schematron validation on the XML doc to be validated:
> 
> 
> XML doc to be validated --> universal stylesheet --> validation results
> 
> 
> 
> Interestingly, Michael Kay has a blog post
> (https://dev.saxonica.com/blog/mike/2018/02/could-we-write-an-xsd-schema-processor-in-xslt.html)
> in which he discusses the idea of using XSLT to build an XML Schema validator. He
> explores the idea of whether to write an XSLT program that generates another XSLT
> program (as Schematron does) or whether to write a universal XSLT program. At the
> end of his blog, Michael writes:
> 
> 
> I still have an open mind about whether a universal stylesheet should be used, or a
> generated stylesheet for a particular schema.
> 
> 
> A fascinating parallel, I think.
> 
> 
> 
> /Roger
> 


