'design discussion: caching QTextFormat for faster text loading'


List:       koffice-devel
Subject:    design discussion: caching QTextFormat for faster text loading
From:       Jos van den Oever <jos.van.den.oever () kogmbh ! com>
Date:       2010-08-13 9:59:52
Message-ID: 201008131529.12046.jos.van.den.oever () kogmbh ! com

Hi all,

As you may have noticed, I'm working on speeding up KOffice at the moment. The 
biggest bottleneck for the loading of documents at the moment is our styling 
system. In this mail, I want to explain why this is the case and give a 
suggestion for improving the situation.

For the explanation, I use an example document. It is a document with 20000 
paragraphs. This is not an unusually high number; the document has about 400 
pages. An empty line is also a paragraph. The paragraphs have only 3 different 
styles. I counted this with the following command:

 xmlstarlet sel \
   -N text="urn:oasis:names:tc:opendocument:xmlns:text:1.0" \
   -t -m "//text:p" -v '@text:style-name' -n content.xml | sort | uniq -c

I also counted the character styles in use. There were only two.
This is a low number; the usual number will be less than a hundred, though. 
I'm using so few styles to show that the styling bottleneck is not caused by 
having a lot of different styles, but is also present for very simple 
documents.

Even though this document has so few styles, KOffice spends almost 50% of the 
execution time on applying these styles. This was measured with the 
--benchmark-loading argument on KWord running in valgrind.

So much time is spent there because of a difference in the way styles work 
in ODF and in QTextDocument. The equivalents to styles in QTextDocument are 
QTextFormat subclasses: there is QTextBlockFormat for paragraphs, 
QTextCharFormat for parts of paragraphs (e.g. text:span) and some other 
subclasses.

In ODF there is style inheritance; in QTextDocument there is not. So to obtain 
the QTextCharFormat for a stretch of text, a QTextCharFormat for the style and 
each of its ancestors is created. These are merged together, starting from the 
ultimate ancestor up to the most specific style. The function
  QTextFormat::merge( const QTextFormat & other )
is used for this.
When loading the document mentioned above, this function is called about 
200000 times. The same formats are created over and over again, once for each 
time a style is mentioned. This is unneeded and should be avoided.
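
To make the cost concrete, here is a minimal sketch of the flattening step 
described above. It does not use Qt: Format is a plain std::map standing in 
for QTextFormat, and OdfStyle, StyleTable, flatten and g_mergeCount are 
hypothetical names invented for this illustration, not KOffice code. The 
merge() helper only imitates the overriding behaviour of QTextFormat::merge.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Stand-in for QTextFormat: a bag of property name -> value pairs.
using Format = std::map<std::string, std::string>;

// Hypothetical ODF style: its own properties plus the name of its parent.
struct OdfStyle {
    std::string parent;   // empty when the style has no parent
    Format properties;
};

using StyleTable = std::map<std::string, OdfStyle>;

// Counts merges so the cost of repeated flattening is visible.
static long g_mergeCount = 0;

// Imitates QTextFormat::merge: values in 'other' override those in 'target'.
void merge(Format &target, const Format &other) {
    ++g_mergeCount;
    for (const auto &kv : other) {
        target[kv.first] = kv.second;
    }
}

// Flattens one style: collect the ancestor chain, then merge from the
// ultimate ancestor down to the most specific style.
Format flatten(const StyleTable &styles, const std::string &name) {
    std::vector<const OdfStyle *> chain;
    std::string current = name;
    while (!current.empty()) {
        auto found = styles.find(current);
        assert(found != styles.end() && "unknown style name");
        chain.push_back(&found->second);
        assert(chain.size() <= styles.size() && "cycle in style parents");
        current = found->second.parent;
    }
    Format result;
    for (auto it = chain.rbegin(); it != chain.rend(); ++it) {
        merge(result, (*it)->properties);
    }
    return result;
}
```

In this sketch every call to flatten() costs one merge per style in the 
chain, no matter how often the same style name was seen before. That is the 
repeated work the rest of this mail is about.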

To avoid this, a change is needed in the way text (QTextBlocks) is created 
out of ODF in KOffice. Right now, a QTextCursor is used to keep track of the 
current style and when a style statement is encountered, that style is 
converted to a QTextFormat and applied to the style at the cursor position. 
After the stretch, the style is unapplied.

An alternative approach would be to give each QTextFormat that is the result 
of applying styles a unique identifier and to store it. The identifier should 
consist of all the information needed to recreate the QTextFormat for the 
current styles. A list of the ODF style identifiers would do this. Then, 
instead of creating a QTextFormat over and over, only the identifier needs to 
be created. The QTextFormat is then only created if it does not yet exist.
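
A rough sketch of such a cache, again without Qt, could look like this. 
Format stands in for QTextFormat, StyleKey is the list of ODF style 
identifiers, and FormatCache with its Builder callback are hypothetical 
names for this illustration only. The actual proof of concept may be 
structured differently.

```cpp
#include <functional>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Stand-in for QTextFormat: a bag of property name -> value pairs.
using Format = std::map<std::string, std::string>;

// The identifier: the list of ODF style names currently in effect.
using StyleKey = std::vector<std::string>;

// Hypothetical cache: builds a format the first time a key is seen and
// hands out the stored result on every later request.
class FormatCache {
public:
    using Builder = std::function<Format(const StyleKey &)>;

    explicit FormatCache(Builder builder) : m_builder(std::move(builder)) {}

    // Returns the format for 'key', creating it only if it does not exist.
    const Format &format(const StyleKey &key) {
        auto it = m_formats.find(key);
        if (it == m_formats.end()) {
            ++m_builds;
            it = m_formats.emplace(key, m_builder(key)).first;
        }
        return it->second;
    }

    // Number of times the expensive builder was actually run.
    int builds() const { return m_builds; }

private:
    Builder m_builder;
    std::map<StyleKey, Format> m_formats;
    int m_builds = 0;
};
```

The builder is where the expensive merging of the style hierarchy would 
happen. With only a handful of distinct keys in a document, it runs a 
handful of times, however many paragraphs refer to those styles.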

The QTextFormat that is created is then the amalgamation of the ODF style 
hierarchy and can be applied with the quick QTextCursor functions
 void setCharFormat ( const QTextCharFormat & format )
 void setBlockFormat ( const QTextBlockFormat & format )

For the example document this would cut down the calls to QTextFormat::merge 
from about 200000 to about 200, a reduction by a factor of 1000. The number 
of calls to expensive functions like KoParagraphStyle::applyStyle and 
KoCharacterStyle::applyStyle would be reduced greatly too.

In short, all calls to the various instances of applyStyle should be put in a 
central class that creates QTextFormat instances from ODF styles and caches 
the results. The resulting styles are set directly on stretches of text.

I am currently working on such a cache as a proof of concept. The text loading 
class is very complex, so do not expect review requests or commits soon. The 
proof of concept mainly serves to show that a near 2x speed-up of loading is 
possible for large documents.

Styling information is not the only data stored in QTextFormat instances. The 
change tracking information and the RDF annotations are also stored there. For 
these, and possibly other future data, the cache will not be used. Change 
tracking and RDF information are usually unique to a stretch of text, in 
contrast to styling information. Hence, storing that information may be done 
in the same way that it is done now.
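
One way to picture this split, purely as an assumption on my side: the 
cached format stays shared and untouched, and the per-stretch data is set on 
a private copy. Format again stands in for QTextCharFormat, and withSpanData 
is a hypothetical helper, not existing KOffice code.

```cpp
#include <map>
#include <string>

// Stand-in for QTextCharFormat: a bag of property name -> value pairs.
using Format = std::map<std::string, std::string>;

// Returns a private copy of the shared style format with one property that
// is unique to a stretch of text added. The shared format is not modified.
Format withSpanData(const Format &cachedStyle,
                    const std::string &property,
                    const std::string &value) {
    Format spanFormat = cachedStyle;
    spanFormat[property] = value;
    return spanFormat;
}
```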

Cheers,
Jos

-- 
Jos van den Oever, software architect
+49 391 25 19 15 53
http://kogmbh.com/legal/
_______________________________________________
koffice-devel mailing list
koffice-devel@kde.org
https://mail.kde.org/mailman/listinfo/koffice-devel