

List:       koffice-devel
Subject:    Re: Interesting QDomDocument::setContent variant
From:       David Faure <faure () kde ! org>
Date:       2004-06-07 16:53:36
Message-ID: 200406071853.36220.faure () kde ! org

On Monday 16 February 2004 13:49, Nicolas Goutte wrote:
> On Monday 16 February 2004 12:33, David Faure wrote:
> > On Monday 16 February 2004 12:28, Nicolas Goutte wrote:
> > > On Monday 16 February 2004 11:39, David Faure wrote:
> > > > On Monday 16 February 2004 11:33, Nicolas Goutte wrote:
> > > > > On Monday 16 February 2004 11:17, David Faure wrote:
> > > > > > The OASIS format has a special tag for whitespace to ensure
> > > > > > whitespace is preserved. This is more reliable than using real
> > > > > > whitespace and depending on the XML parser to respect it.
> > > > >
> > > > > Something like:
> > > > > <text:span> </text:span>
> > > > > too?
> > > >
> > > > No, <text:s text:c="4"/> for 4 spaces.
> > >
> > > Yes, but you do not write:
> > > <text:span><text:s text:c="1"/></text:span>
> > > but
> > > <text:span> </text:span>
> >
> > Yes, because one space is always preserved, the problem is that 2 or more
> > spaces can be collapsed as a single one, no?
> 
> No, the problem is the feature:
> http://trolltech.com/xml/features/report-whitespace-only-CharData
> of Qt's SAX parser.
> 
> (See QXmlSimpleReader in Qt's doc.)
> 
> With it off (which is what QDomDocument::setContent does by default), if white
> space is the only content of an element, it is not reported (i.e. it is ignored).
> 
> >
> > Or do you see problems with XML parsers (which ones?) not seeing the single
> > space in <text:span> </text:span>?
> 
> Try it with a normal QDomDocument::setContent 
> 
> QDomDocument doc;
> doc.setContent( QCString( "<test> </test>" ) );
> qDebug("%s\n", doc.toCString().data() );
> 
> And you get:
> <test/>
> 
> (I know the problem very well. I have had it since the start of KWord's AbiWord 
> import filter. Without my doing it on purpose, the test file I had made 
> contained exactly such a construction and of course caused problems with 
> QDomDocument::setContent. That is why the AbiWord import filter is built on 
> SAX (i.e. the QXML classes).)

I see.
I just tried this setContent variant, after hitting the <text:span> </text:span> parsing bug.
The problem is that it reports far too much whitespace. Even the newline between two
<text:p> elements is reported.
For instance:
      <text:p>foo</text:p>
      <text:p>bar</text:p>
leads to the firstChild()/nextSibling() loop returning the whitespace between the
two paragraphs as a text node, which after toElement() looks like a tag with
tagName == "".
Handling this would mean skipping such 'empty tag' nodes everywhere in the 
code, or getting rid of all newlines and indentation in the generated XML files
(now I see why the OO files have none!).

In fact, even if we used <text:s> for a single space (which is not
recommended), we would still not be able to read OO-generated files
(which contain <text:span> </text:span>). The OASIS spec says we should
process whitespace inside the text:p element and its children, and NOT in the
rest of the XML, but QDom/QXml don't allow such fine-grained control, only a
global on/off setting.
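
For the writing side, the <text:s> rule discussed in this thread could be
sketched roughly as follows in plain C++. encodeSpaces is a hypothetical
helper, not an existing KOffice or Qt function. It assumes that a single
space is written literally and that a run of two or more spaces becomes one
<text:s text:c="N"/> element. XML escaping and other whitespace characters
(tabs, newlines) are deliberately left out.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a KOffice or Qt API): encode the character data
// of one paragraph so that runs of spaces survive whitespace collapsing.
// Assumption: one space stays literal, a run of n >= 2 spaces becomes
// <text:s text:c="n"/>. Escaping of &, < and > is not handled here.
std::string encodeSpaces(const std::string& text)
{
    std::string out;
    std::size_t i = 0;
    while (i < text.size()) {
        if (text[i] != ' ') {
            // Ordinary character: copy it through unchanged.
            out += text[i];
            ++i;
            continue;
        }
        // Measure the length of this run of spaces.
        std::size_t run = 0;
        while (i + run < text.size() && text[i + run] == ' ')
            ++run;
        assert(run >= 1); // we only get here when text[i] is a space
        if (run == 1)
            out += ' ';
        else
            out += "<text:s text:c=\"" + std::to_string(run) + "\"/>";
        i += run;
    }
    return out;
}
```

With this, "a b" would be written unchanged, while "a    b" would be written
as a<text:s text:c="4"/>b. It only helps with the files we write ourselves;
it does nothing for reading files that contain <text:span> </text:span>.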

So removing all our newlines and indentation, and using QXmlSimpleReader to get
all whitespace reported, sounds like the only way out.

-- 
David Faure, faure@kde.org, sponsored by Trolltech to work on KDE,
Konqueror (http://www.konqueror.org), and KOffice (http://www.koffice.org).
_______________________________________________
koffice-devel mailing list
koffice-devel@mail.kde.org
https://mail.kde.org/mailman/listinfo/koffice-devel