From koffice-devel  Tue Apr 13 17:26:06 2004
From: Werner Trobin <trobin () kde ! org>
Date: Tue, 13 Apr 2004 17:26:06 +0000
To: koffice-devel
Subject: Re: Faster saving
Message-Id: <200404131926.06637.trobin () kde ! org>
X-MARC-Message: https://marc.info/?l=koffice-devel&m=108187721219304

On Tuesday 13 April 2004 18:54, Thomas Zander wrote:
> Slightly off-topic, but interesting: I recently read about the results of
> a new IO system approach in an API I use.
> The old one used a stream-based approach, where each byte is encoded as
> soon as it comes in.
> The new style has a Buffer object which can contain any number of
> characters; only when that is full is all the text converted
> (like from QChar or char* to utf-8).
> Contrary to what you might expect, the speed gains were amazing; it seems
> that staying in the processor's cache while doing something as trivial as
> utf-8 encoding is quite lucrative.

Interesting. Any links?
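
To make the buffering idea concrete, here is a minimal sketch in plain C++. It
is not the API Thomas refers to (that one is not named in this thread); the
class BufferedUtf8Writer and all of its members are hypothetical. It assumes
the input arrives as UTF-16 code units (roughly what QChar holds) and converts
a whole buffer in one tight loop instead of invoking the encoder per character.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: collects UTF-16 code units and converts them to UTF-8
// only when the buffer is full (or on flush()), not once per character.
class BufferedUtf8Writer {
public:
    explicit BufferedUtf8Writer(std::size_t capacity) : m_capacity(capacity) {
        assert(capacity > 0);
        m_pending.reserve(capacity);
    }

    // Only stores the code units; the conversion is deferred to flush().
    void write(const std::u16string& text) {
        for (char16_t unit : text) {
            m_pending.push_back(unit);
            if (m_pending.size() >= m_capacity)
                flush();
        }
    }

    // Converts everything that is pending in a single loop.
    void flush() {
        std::size_t i = 0;
        const std::size_t n = m_pending.size();
        while (i < n) {
            char32_t cp = m_pending[i];
            if (cp >= 0xD800 && cp <= 0xDBFF) {      // high surrogate
                if (i + 1 == n)
                    break;                           // pair incomplete, keep it pending
                const char32_t low = m_pending[i + 1];
                if (low >= 0xDC00 && low <= 0xDFFF) {
                    cp = 0x10000 + ((cp - 0xD800) << 10) + (low - 0xDC00);
                    ++i;                             // the low surrogate is consumed too
                } else {
                    cp = 0xFFFD;                     // unpaired high surrogate
                }
            } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
                cp = 0xFFFD;                         // unpaired low surrogate
            }
            ++i;
            appendUtf8(cp);
        }
        m_pending.erase(0, i);
    }

    // Flushes the rest and returns the encoded bytes.
    std::string finish() {
        flush();
        if (!m_pending.empty()) {                    // dangling high surrogate
            appendUtf8(0xFFFD);
            m_pending.clear();
        }
        return m_out;
    }

private:
    void appendUtf8(char32_t cp) {
        if (cp < 0x80) {
            m_out += static_cast<char>(cp);
        } else if (cp < 0x800) {
            m_out += static_cast<char>(0xC0 | (cp >> 6));
            m_out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {
            m_out += static_cast<char>(0xE0 | (cp >> 12));
            m_out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            m_out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {
            m_out += static_cast<char>(0xF0 | (cp >> 18));
            m_out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            m_out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            m_out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }

    std::size_t m_capacity;
    std::u16string m_pending;  // not yet encoded
    std::string m_out;         // encoded UTF-8 bytes
};
```

A caller would create one writer per output, call write() for every piece of
text and finish() at the end. Whether this is actually faster than per-character
encoding depends on the codec and the buffer size, so it needs measuring.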

[...]
 
> > > > It has taken (rounded values):
> > > > QCString: 19s
> > > > QTextStream: 55s
> > > > KBufferedIODevice( 8KB ): 41s
> > > > KBufferedIODevice( 64KB ): 35s
> > > > KBufferedIODevice( 128KB ): 34s
> > > >
> > > > So something is still wrong.
> > >
> > > valgrind-calltree/KCachegrind explains it. It's the unicode->utf8
> > > conversion that takes too much time with the QTextStream solutions,
> > > because the codec is triggered for every character - whereas with the
> > > QCString solution it's only triggered once, for the whole of the data.
> > >
> > > I also agree with waiting for Qt 4, we'll optimize then, depending on
> > > how it does it.
> >
> > One more thing on this thread: it's about time we start implementing
> > saving in the OASIS format. Since you wrote most of the current code
> > relating to this (oowriterexport.*), do you think we should use the
> > QDomDocument solution or your "direct writing of text into the ZIP
> > device" solution (zipWriteData), i.e. writing literal XML directly?
> >
> > Doesn't the latter lead to syntax errors too often? (and encoding
> > errors?) It has the advantage that when writing out utf8 or simple
> > latin1 stuff (tags/attributes) no char*->QString->char* conversion is
> > needed, though (unlike currently).
> >
> > Hmm, Werner wrote another solution, with a class that has an API almost
> > like QDom, but writes out the stuff along the way instead of all at
> > once, IIRC. (This has the advantage that it requires much less memory
> > than the huge-QDomDocument approach). Werner, what's the status on this?

I haven't really thought about that since we decided not to try it for the Word 
filter. From what I still remember, you'd need some convenience "fake-dom" 
classes and a stack. Then you can provide a nice API at less cost than the 
real QDomDocument. To avoid keeping the full document in memory, this fake-dom 
serializes as much as possible and gets rid of the "real" objects in memory.
This will of course be slower than writing the strings directly.
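
A minimal sketch of such a "fake-dom" writer in plain C++, to illustrate the
stack idea. This is not the code that was considered for the Word filter; the
class StreamingXmlWriter and its methods are hypothetical names, and it writes
to a std::ostream where the real thing would presumably write to the ZIP device.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical "fake-dom" writer: every element is written out as soon as
// possible; only the names of the currently open elements stay in memory.
class StreamingXmlWriter {
public:
    explicit StreamingXmlWriter(std::ostream& out) : m_out(out) {}

    void startElement(const std::string& name) {
        closePendingStartTag();
        m_out << '<' << name;
        m_open.push_back(name);
        m_startTagOpen = true;
    }

    // Only valid directly after startElement() or another addAttribute().
    void addAttribute(const std::string& name, const std::string& value) {
        assert(m_startTagOpen);
        m_out << ' ' << name << "=\"" << escape(value) << '"';
    }

    void addText(const std::string& text) {
        closePendingStartTag();
        m_out << escape(text);
    }

    void endElement() {
        assert(!m_open.empty());
        if (m_startTagOpen) {
            m_out << "/>";                       // element had no content
            m_startTagOpen = false;
        } else {
            m_out << "</" << m_open.back() << '>';
        }
        m_open.pop_back();
    }

private:
    void closePendingStartTag() {
        if (m_startTagOpen) {
            m_out << '>';
            m_startTagOpen = false;
        }
    }

    // Escapes the characters that are special in XML text and attributes.
    static std::string escape(const std::string& raw) {
        std::string result;
        result.reserve(raw.size());
        for (char c : raw) {
            switch (c) {
            case '&': result += "&amp;"; break;
            case '<': result += "&lt;"; break;
            case '>': result += "&gt;"; break;
            case '"': result += "&quot;"; break;
            default:  result += c; break;
            }
        }
        return result;
    }

    std::ostream& m_out;
    std::vector<std::string> m_open;  // stack of open element names
    bool m_startTagOpen = false;      // "<name" written, '>' still missing
};
```

The stack is what keeps the output well-formed without a document tree: the
caller cannot close a tag under the wrong name, because endElement() takes the
name from the stack. Memory use grows with nesting depth, not document size.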

While writing this mail I got another idea for how to speed up writing of a 
document: If you look at a representative document (whatever that might 
be :-) you'll see lots of paragraphs and little extra stuff like tables, 
headers, footers, footnotes,... I.e. there won't be a lot of bells and 
whistles in a normal document, but a lot of XML paragraph "elements" which are 
very similar if viewed on a character level (given that OASIS also uses 
namespaces, as far as I saw).

What about having "templates" of paragraphs in memory, already in string form 
and encoded? When you have to write out another paragraph you just (deep) 
copy that full string and fill in the attributes. E.g.

<paragraph><blah width="        "/><foo name="              ">bleh</foo>
<layout>
<blubb font="      "/>...
</layout>

In order to avoid excessive search&replace copying you have to add some spaces 
for the attribute values (for most of them we know the maximum length). Then 
you also know the index of the attributes and can replace on the fly. If you 
have tags of variable length (e.g. font name) you can split the full string 
in several pieces and process them sequentially.
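
A minimal sketch of the fixed-width variant in plain C++, assuming the slot
widths from the example above (8 characters for width, 14 for name). The names
Slot, ParagraphTemplate, fillSlot and renderParagraph are hypothetical, and the
closing </paragraph> tag is added only to make the sample complete.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical fixed-width slot inside a pre-encoded template string.
struct Slot {
    std::size_t offset;  // index of the first reserved character
    std::size_t width;   // number of reserved characters
};

// The template text plus the positions of its attribute values.
struct ParagraphTemplate {
    std::string text;
    Slot width;
    Slot name;
};

// Builds the template once; the slot offsets are recorded while building,
// so no searching is needed later.
ParagraphTemplate makeParagraphTemplate() {
    ParagraphTemplate t;
    t.text = "<paragraph><blah width=\"";
    t.width = Slot{t.text.size(), 8};
    t.text.append(8, ' ');
    t.text += "\"/><foo name=\"";
    t.name = Slot{t.text.size(), 14};
    t.text.append(14, ' ');
    t.text += "\">bleh</foo></paragraph>";
    return t;
}

// Overwrites one slot in place; the string length never changes.
void fillSlot(std::string& text, const Slot& slot, const std::string& value) {
    assert(slot.offset + slot.width <= text.size());
    assert(value.size() <= slot.width);  // value must fit the reserved space
    text.replace(slot.offset, slot.width, slot.width, ' ');  // reset padding
    text.replace(slot.offset, value.size(), value);
}

// Deep-copies the template and fills in both attributes.
std::string renderParagraph(const ParagraphTemplate& t,
                            const std::string& width,
                            const std::string& name) {
    std::string copy = t.text;
    fillSlot(copy, t.width, width);
    fillSlot(copy, t.name, name);
    return copy;
}
```

Note that the padding ends up inside the attribute value (e.g. width="12.5    "),
so whoever reads the file has to tolerate or strip the trailing spaces. Values
are assumed to be already escaped and encoded when they are filled in.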

This would probably pay off for tags which are used really really often. If 
you implement the fake-dom hierarchy you could even hide all those crude 
hacks behind a sane API by providing a ParagraphElement or so, where you just 
set the attributes via public setFoo() calls.

That said, I unfortunately don't have the time to give it a try for the next 
few months :-(

Ciao,
Werner

P.S. Don't forget the rope datatype (a heavy-duty string, hehe :-) if you are 
writing code which concatenates long strings and performs substring 
operations. It's an STL extension; you can find a description at: 
http://www.sgi.com/tech/stl/Rope.html