'RFC: Parser/Document Design'

List:       koffice-devel
Subject:    RFC: Parser/Document Design
From:       Werner Trobin <trobin () kde ! org>
Date:       2001-07-26 14:22:08

Hi!

I decided that it's time to annoy you all with a looong mail again ;)
Please try to think about it and tell me your opinion.

Note: This is just about *reading* WinWord files. The design for writing
will probably be similar, just the other way round :)

Current state of development:
-----------------------------
1) We already have the generated code which is able to read and write the
   basic structures of the files. This works quite well, and except for the
   few structures we can't generate easily, everything seems to work fine.

2) The string class of choice is UString. The reason I chose it over
   UT_UCS2String is that it's implicitly shared and we can switch the byte
   order on demand, which is nice due to a change in iconv.

3) We have an interface for the OLE library and iconv which is not really
   finished or perfect, but it works.

Next steps:
------------
Problem: We have to parse at least three different kinds of Word files,
and we have the documentation of two of them. So far so good. The
problem is that there are some differences between Word95 and Word97
structures (and they are bigger than the differences between Word97 and
Word2000).
What to do in the parser? Where to put the abstraction from the MS Word
oddities? There are several questions like that and you can probably come
up with some I didn't even think of. However, after two hours of twisting
my brain in different directions at the same time I came up with the
following:

LLDocument   <---- HLAdapter
                      |
               HLDocument

This is the "hierarchy" of the document abstraction. The LLDocument is
an abstract base class and contains callbacks for stuff like
"paragraphStart", "tableCellEnd",... and passes Word97 structures(!!)
around. The HLDocument, also an abstract base class, provides a more
"abstract" way of handling all that (e.g. "foundParagraph()"), so that
"simple" filters (e.g. doc -> txt) don't have to mess around with the low
level interface. The HLAdapter class inherits LLDocument and implements
all the callbacks it's interested in. Then it triggers the HLDocument
callbacks if it feels like doing so (e.g. when it has read a complete
paragraph).
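
A minimal C++ sketch of this document hierarchy might look as follows. It
is an illustration only: std::string stands in for UString, the Word97
structures are reduced to one made-up field each, and the callback names
runOfText and paragraphEnd as well as all signatures are assumptions, not
a settled interface. The low-level callbacks get empty default bodies here
so that an implementation only overrides what it is interested in.

```cpp
#include <string>

// Stand-ins for the generated Word97 structures (hypothetical subsets).
namespace Word97 {
struct PAP { int istd = 0; };        // paragraph properties
struct CHP { bool fBold = false; };  // character properties
}

// Low-level interface: callbacks are handed Word97 structures.
class LLDocument {
public:
    virtual ~LLDocument() {}
    virtual void paragraphStart(const Word97::PAP&) {}
    virtual void runOfText(const std::string&, const Word97::CHP&) {}
    virtual void paragraphEnd() {}
};

// High-level interface: no Word structures, just "found something" events.
class HLDocument {
public:
    virtual ~HLDocument() {}
    virtual void foundParagraph(const std::string& text) = 0;
};

// Adapter: collects low-level events and triggers the high-level callback
// once a complete paragraph has been read.
class HLAdapter : public LLDocument {
public:
    explicit HLAdapter(HLDocument& target) : m_target(target) {}

    void paragraphStart(const Word97::PAP&) override { m_text.clear(); }
    void runOfText(const std::string& text, const Word97::CHP&) override {
        m_text += text;  // formatting is dropped at this level
    }
    void paragraphEnd() override { m_target.foundParagraph(m_text); }

private:
    HLDocument& m_target;
    std::string m_text;
};
```

A parser would only ever see the LLDocument side; whether it is talking to
a full filter or to the adapter makes no difference to it.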

For you as a user of the library there are two ways to write a filter:
- hardcore, i.e. inherit LLDocument and do all the work yourself. This will
  probably be needed for the KWord filter and the AbiWord filter, as they
  want to extract as much information as possible.
- lazy, i.e. inherit HLDocument and just implement those few callbacks. This
  might work for stuff like an ASCII preview, some basic DOC -> HTML
  conversion (e.g. for mail programs which want to show a preview for an
  attached Word file,...). This is just a quick&dirty approach, tho.
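
To give an idea of the "lazy" route, here is a rough sketch of a
doc -> txt filter. It assumes a high-level interface with a single
foundParagraph callback taking a std::string; both the signature and the
names TextFilter and demo are made up for this example.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical high-level interface, reduced to a single callback.
class HLDocument {
public:
    virtual ~HLDocument() {}
    virtual void foundParagraph(const std::string& text) = 0;
};

// "Lazy" doc -> txt filter: one paragraph per line, formatting ignored.
class TextFilter : public HLDocument {
public:
    explicit TextFilter(std::ostream& out) : m_out(out) {}
    void foundParagraph(const std::string& text) override {
        m_out << text << '\n';
    }

private:
    std::ostream& m_out;
};

// Usage: feed two paragraphs by hand and return the plain-text result.
// In the library the calls would come from the parser via the adapter.
std::string demo() {
    std::ostringstream out;
    TextFilter filter(out);
    filter.foundParagraph("First paragraph");
    filter.foundParagraph("Second paragraph");
    return out.str();
}
```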

In the library I suggest using the following structure to make it work:

Parser <--- Parser9x <--- Parser97
                      `--- Parser95

The Parser class is a quite simple abstract base which just allows
construction and shares some common variables (e.g. the storage, the streams,
the iconv abstraction class, ...). The Parser9x class provides some common
functionality which can be shared between Word97 and Word95. We should be
careful about what to put there. Finally, the two Parser95 and Parser97
classes do all the dirty work, walking through the streams.
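
The parser hierarchy could be sketched roughly like this. Everything in it
is a placeholder: the file name stands in for the shared storage, streams
and iconv wrapper, and splitting the work into readHeader and walkText is
just one possible way to let Parser9x hold the common control flow while
the version-specific classes fill in the steps.

```cpp
#include <string>

// Placeholder for the low-level callback interface (hypothetical here).
class LLDocument {
public:
    virtual ~LLDocument() {}
};

// Abstract base: allows construction and holds the shared state.
class Parser {
public:
    virtual ~Parser() {}
    virtual bool parse(LLDocument& doc) = 0;
    virtual int version() const = 0;

protected:
    explicit Parser(const std::string& fileName) : m_fileName(fileName) {}
    std::string m_fileName;  // stand-in for storage, streams, iconv wrapper
};

// Functionality shared between Word95 and Word97.
class Parser9x : public Parser {
public:
    bool parse(LLDocument& doc) override {
        // Common sequence; the individual steps are version specific.
        return readHeader() && walkText(doc);
    }

protected:
    explicit Parser9x(const std::string& fileName) : Parser(fileName) {}
    virtual bool readHeader() = 0;
    virtual bool walkText(LLDocument& doc) = 0;
};

class Parser97 : public Parser9x {
public:
    explicit Parser97(const std::string& fileName) : Parser9x(fileName) {}
    int version() const override { return 97; }

protected:
    bool readHeader() override { return true; }           // placeholder
    bool walkText(LLDocument&) override { return true; }  // placeholder
};

class Parser95 : public Parser9x {
public:
    explicit Parser95(const std::string& fileName) : Parser9x(fileName) {}
    int version() const override { return 95; }

protected:
    bool readHeader() override { return true; }           // placeholder
    bool walkText(LLDocument&) override { return true; }  // placeholder
};
```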

Note that we only use Word97 structures in the callbacks of the LLDocument
interface. The reason is that we don't want to know the version we're
reading, we only want to get the content. This means that we have to get
rid of the distinction between Word97 and Word95 at some level. I don't know
if it's the best level to do so, feel free to suggest a different solution.

However, in this case the Word95 parser has two tasks:
- read the structures from the stream (native Word95) and perform all the
  parsing depending on their information
- convert the Word95 structures and pass them to all the callback functions
  of the LLDocument.
This doesn't mean that we have to convert *all* structures to Word97
structures in this parser, but "just" those we want to pass around (e.g. CHP,
PAP, SEP,...).
As this code for the conversion might involve some "intelligence" I doubt that
we can generate this code.
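
As an illustration of what such a hand-written conversion might involve,
here is a small sketch for a cut-down CHP. The field names loosely follow
the usual naming, but the chosen subset and the mapping (one font index in
the old structure fanned out into three in the new one) are assumptions
made for this example and would have to be checked against the
documentation.

```cpp
#include <cstdint>

namespace Word97 {
// Hypothetical subset of the Word97 character properties.
struct CHP {
    bool fBold = false;
    bool fItalic = false;
    std::uint16_t hps = 0;       // font size in half points
    std::uint16_t ftcAscii = 0;  // font index, ASCII text
    std::uint16_t ftcFE = 0;     // font index, Far East text
    std::uint16_t ftcOther = 0;  // font index, everything else
};
}

namespace Word95 {
// Hypothetical subset of the Word95 character properties.
struct CHP {
    bool fBold = false;
    bool fItalic = false;
    std::uint16_t hps = 0;  // font size in half points
    std::uint16_t ftc = 0;  // single font index
};

// Hand-written conversion: most fields are copied 1:1, but some need a
// decision, which is why generating this code is doubtful.
inline Word97::CHP toWord97(const CHP& old) {
    Word97::CHP chp;
    chp.fBold = old.fBold;
    chp.fItalic = old.fItalic;
    chp.hps = old.hps;
    // The "intelligent" part: one old field maps to three new ones.
    chp.ftcAscii = old.ftc;
    chp.ftcFE = old.ftc;
    chp.ftcOther = old.ftc;
    return chp;
}
}
```

With something like this in place, Parser95 would keep working on the
native structures internally and only call the conversion right before
invoking an LLDocument callback.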

As "glue" between those two hierarchies I suggest using a plain Factory class
which takes a filename, creates the OLE storage on it, reads the first 4 bytes
of the WordDocument stream and checks the nFib. Depending on that it creates
the proper parser. All the "user" of the library has to do is to create the
Document object and call parse() on it.
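
The version check in the factory could be sketched as below. Opening the
OLE storage is left out; the function simply takes the first four bytes of
the WordDocument stream. It assumes that the nFib is the little-endian
16-bit value in bytes 2 and 3 (after the 2-byte magic number) and that
values from 193 upwards mean Word97 or later. The cut-off constant and the
function names are assumptions for this example and should be verified
against the documentation.

```cpp
#include <cstdint>
#include <memory>
#include <string>

// Minimal stand-ins for the parser classes (hypothetical).
class Parser {
public:
    virtual ~Parser() {}
    virtual std::string name() const = 0;
};

class Parser95 : public Parser {
public:
    std::string name() const override { return "Parser95"; }
};

class Parser97 : public Parser {
public:
    std::string name() const override { return "Parser97"; }
};

// Extracts the nFib from the first four bytes of the WordDocument stream.
// Assumption: bytes 0-1 are the magic number, bytes 2-3 the little-endian nFib.
inline std::uint16_t readNFib(const unsigned char header[4]) {
    return static_cast<std::uint16_t>(header[2] | (header[3] << 8));
}

// Picks the parser depending on the nFib.
inline std::unique_ptr<Parser> createParser(const unsigned char header[4]) {
    const std::uint16_t firstWord97NFib = 193;  // assumed cut-off
    if (readNFib(header) >= firstWord97NFib)
        return std::unique_ptr<Parser>(new Parser97);
    return std::unique_ptr<Parser>(new Parser95);
}
```

A real factory would also have to reject files it cannot handle at all
(e.g. a missing WordDocument stream or an unknown nFib) instead of falling
back to Parser95.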

An issue still to be discussed is whether we want to have Word97 structures
in the interface for high-level filtering (HLDocument) or not. If we don't,
we can hide some ugly MSisms, but of course we then have to "invent" new
structures.

So... that was all. Thanks a lot for reading up to this point.

Awaiting your comments, flames, and questions,
Werner, who will not be able to reply until tomorrow morning
_______________________________________________
Koffice-devel mailing list
Koffice-devel@master.kde.org
http://master.kde.org/mailman/listinfo/koffice-devel
