List:       xerces-j-dev
Subject:    Re: Parsing external entities in multifile documents
From:       Paul Kinnucan <paulk () mathworks ! com>
Date:       2003-05-29 15:47:14

The following are examples of a book and a chapter file
that I concocted to illustrate my problem. Both files
are exactly as Epic generated them.

Book file:

<?xml version="1.0" encoding="utf-8"?>
<!--ArborText, Inc., 1988-2002, v.4002-->
<!DOCTYPE book PUBLIC "-//The MathWorks//DTD axdocbook variant//"
 "http://www-internal.mathworks.com/devel/Adoc/nightly/matlab/doc/tools/epiccustom/doctypes/tmwbook/tmwbook.dtd"
[ <!ENTITY fake.txt SYSTEM "fake.txt">
<!ENTITY fake_chapter1.xml SYSTEM "fake_chapter1.xml">
]>
<?Pub UDT instructions _comment FontColor="red"?>
<?Pub UDT template _font?>
<?Pub Inc?>
<book id="a1054221435">
<title>Fake Book</title>
<bookinfo><?Pub Dtl?>
<productname>==Name of this product, for top of printed title page==</productname>
<titleabbrev>==Short title, for bottom of printed title page==</titleabbrev>
<subtitle>==Title for banner on each HTML page==</subtitle>
<releaseinfo>==Version info, for bottom of printed title page==</releaseinfo>
<copyright><year>==Range of copyright years, e.g., 2000&#x2013;2002.==</year>
<holder>by The MathWorks, Inc.</holder></copyright>
<revhistory>
<revision>
<revnumber>==Number of this printing batch, e.g., &#x201c;Second printing&#x201d;==</revnumber>
<date>==Month and year of printing batch, e.g., September 2000==</date>
<revremark>==Comment, e.g., &#x201c;Revised for version 1.0.2 (Release 12)&#x201d;==</revremark>
</revision>
</revhistory>
<abbrev>==Unique book code for this book==</abbrev>
<subjectset>
<subject><subjectterm>MATLAB</subjectterm></subject>
</subjectset>
</bookinfo>&Table_of_Contents;
<?Pub Caret?>&fake_chapter1.xml;
&Index;
</book>
<?Pub *0000001494 0?>

Chapter file that makes a reference to an external text file:

<?xml version="1.0" encoding="utf-8"?>
<!-- Fragment document type declaration subset:
ArborText, Inc., 1988-2002, v.4002
<!DOCTYPE book PUBLIC "-//The MathWorks//DTD axdocbook variant//"
 "http://www-internal.mathworks.com/devel/Adoc/nightly/matlab/doc/tools/epiccustom/doctypes/tmwbook/tmwbook.dtd"
[ <!ENTITY fake.txt SYSTEM "fake.txt">
]>
-->
<?Pub UDT instructions _comment FontColor="red"?>
<?Pub UDT template _font?>
<chapter id="a1054221617">
<title>Fake Chapter</title>
<para>Here is an inserted fake text file:</para>
<para>&fake.txt;<?Pub Caret?></para>
</chapter>
<?Pub *0000000598 0?>


Please note that Epic includes the declaration for &fake.txt; in both the
book doctype and the chapter doctype, as you suggested. However, I need
to parse the chapter file as a standalone document, and Epic comments out
the chapter's doctype declaration, so the parser ignores it.

The result is that when my Java application tries to parse the chapter file
as a standalone document, Xerces signals an unrecoverable error
(undefined external entity) when it encounters the external entity
reference (&fake.txt;).

Is there a way I can successfully parse chapter files without modifying
them, e.g., by supplying a custom entity resolver that extracts entity
declarations from the commented-out doctype? (Modification of the chapter
file XML is not an option as I explained in my previous post.)
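
A minimal sketch of one possible approach, offered under assumptions rather
than as a confirmed Xerces recipe. A SAX EntityResolver is consulted only for
entities that are already declared, so on its own it cannot supply the missing
declaration. The sketch instead leaves the file on disk untouched and restores
the commented-out doctype in an in-memory copy before parsing. The class name
ChapterParser, its method names, and the regular expression are hypothetical;
the pattern is based only on the sample chapter above. Only standard JAXP and
SAX interfaces are used.

```java
import java.io.StringReader;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.EntityResolver;
import org.xml.sax.InputSource;

// Hypothetical helper; the class and method names are illustrative only.
final class ChapterParser {

    // Matches the comment Epic writes at the head of a chapter file and
    // captures the doctype declaration inside it. The pattern is an
    // assumption based on the sample chapter in this message.
    private static final Pattern COMMENTED_DOCTYPE = Pattern.compile(
        "<!--\\s*Fragment document type declaration subset:.*?(<!DOCTYPE.*?\\]>)\\s*-->",
        Pattern.DOTALL);

    // Returns the chapter text with the commented-out doctype restored as a
    // live declaration. Text without the comment is returned unchanged.
    static String restoreDoctype(String chapterXml) {
        Matcher m = COMMENTED_DOCTYPE.matcher(chapterXml);
        if (!m.find()) {
            return chapterXml;
        }
        return chapterXml.substring(0, m.start()) + m.group(1)
            + chapterXml.substring(m.end());
    }

    // Parses the in-memory copy. baseUri is the chapter's location, used to
    // resolve relative system identifiers such as "fake.txt". The resolver
    // is optional and may redirect or stub out external entities.
    static Document parse(String chapterXml, String baseUri, EntityResolver resolver)
            throws Exception {
        DocumentBuilder builder =
            DocumentBuilderFactory.newInstance().newDocumentBuilder();
        if (resolver != null) {
            builder.setEntityResolver(resolver);
        }
        InputSource source =
            new InputSource(new StringReader(restoreDoctype(chapterXml)));
        source.setSystemId(baseUri);
        return builder.parse(source);
    }
}
```

The restored declaration names book as the document type while the root
element is chapter; that mismatch matters only to a validating parser, so the
sketch assumes non-validating parsing. A resolver passed to parse() could
return an empty InputSource for the DTD's system identifier if the DTD is not
reachable. Note also that the text file is read as a parsed entity, so a
program listing containing "<" or "&" would still fail to parse.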

- Paul



K. Venugopal writes:
 > 
 > Hi Paul ,
 > 
 > Paul Kinnucan wrote:
 > 
 > >Hi,
 > >
 > >I need some advice on how to deal with a problem that I have 
 > >encountered trying to use xerces to parse external entities
 > >in multifile documents created by Arbortext's Epic editor.
 > >
 > >The documents in question are technical manuals consisting of a book
 > >file that references a set of chapter files as external entities
 > >defined by the book's doctype declaration.  The chapter files are
 > >themselves XML files, i.e., XML "fragments" of the book. At the head
 > >of each chapter file is an XML comment that encloses a doctype
 > >declaration that specifies the same doctype as that defined by the
 > >book file's doctype declaration, i.e., the book's document type.
 > >
 > >The problem occurs when writers use Epic to "include" external text
 > >files (usually nonXML program listings) in a chapter file.  Epic
 > >implements this by inserting an entity definition for the inserted
 > >file in the commented out doctype declaration at the head of the
 > >chapter file and a reference to the entity where the inserted 
 > >text is to appear. When displaying the chapter, Epic knows to
 > >look for the definition of the entity in the commented out
 > >doctype declaration. However, xerces does not. It regards the
 > >external entity as undefined and errors out, preventing me from
 > >parsing the file.
 > >  
 > >
 > If I have understood your problem correctly, you can declare the entities
 > for your text files in your book XML, where you have declared the entities
 > for your chapter XML files.
 > 
 > In addition to this you need to set
 > parser.setProperty("http://java.sun.com/xml/jaxp/properties/schemaLanguage", 
 > "http://www.w3.org/2001/XMLSchema");
 > 
 > It should have worked when you set schema validation to true; the above
 > property is needed only when using JAXP. I will look into this.
 > 
 > 
 > Regards
 > venu
 > 
 > >How can I parse such files? I've noticed that the DOMParser class
 > >has setEntityResolver() and getEntityResolver() methods. This
 > >suggests to me that it might be possible for me to define and use my
 > >own external entity resolver. This resolver would try to use the
 > >default resolver and if that failed would look for a definition of the
 > >entity in the commented-out doctype declaration at the head of the
 > >file. Does the setEntityResolver method actually support such 
 > >a solution? Is there a better way to resolve this problem? Any
 > >help you can give me would be deeply appreciated.
 > >
 > >- Paul
 > >
 > >
 > >---------------------------------------------------------------------
 > >To unsubscribe, e-mail: xerces-j-user-unsubscribe@xml.apache.org
 > >For additional commands, e-mail: xerces-j-user-help@xml.apache.org
 > >
 > >  
 > >
 > 



