'Re: [Wvware-devel] RFC: Parser/Document Design'


List:       koffice-devel
Subject:    Re: [Wvware-devel] RFC: Parser/Document Design
From:       shaheed <srhaque () iee ! org>
Date:       2001-07-28 10:54:15

Hi Werner,

> Next steps:
> ------------
> Problem: We have to parse at least three different kinds of Word files,
> and we have the documentation of two of them. So far so good. The
> problem is that there are some differences between Word95 and Word97
> structures (and they are bigger than the differences between Word97 and
> Word2000).

BTW: note that a correct Word97 parser should handle most (all?) Word2000
docs, since AFAICS the extensions were designed to be backwards compatible.
So all that one would lose is the W2K extensions. Note also that since W2K
can use HTML as a native format, we might eventually be able to reverse
engineer any such extensions by looking at the HTML file!

> What to do in the parser? Where to have the abstraction from the Ms Word
> oddities,... There are several questions like that and you can probably
> come up with some I didn't even think of. However, after two hours of
> twisting my brain in different directions at the same time I came up with
> the following:
>
> LLDocument   <---- HLAdapter
>
>                HLDocument
>
> This is the "hierarchy" of the document abstraction. The LLDocument is
> an abstract base class and contains callbacks for stuff like
> "paragraphStart", "tableCellEnd",... and passes Word97 structures(!!)
> around. The HLDocument, also an abstract base class, provides a more
> "abstract" way of handling all that (e.g. "foundParagraph()",), so that
> "simple" filters (e.g. doc -> txt) don't have to mess around with the low
> level interface. The HLAdapter class inherits LLDocument and implements
> all the callbacks it's interested in. Then it triggers the HLDocument
> callbacks if it feels like doing so (e.g. when it read a complete
> paragraph).

Sounds good to me.
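
To make sure we mean the same thing, here is a minimal C++ sketch of that
split. Only LLDocument, HLDocument, HLAdapter, paragraphStart and
foundParagraph come from the proposal; PAP97, text(), paragraphEnd() and
TextFilter are names I made up for illustration, and the real interface will
certainly carry more than this.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for a Word97 paragraph-properties structure.
struct PAP97 {
    int jc = 0;  // justification code
};

// Low-level interface: fine-grained callbacks carrying Word97 structures.
class LLDocument {
public:
    virtual ~LLDocument() = default;
    virtual void paragraphStart(const PAP97&) {}
    virtual void text(const std::string&) {}   // assumed callback name
    virtual void paragraphEnd() {}             // assumed callback name
};

// High-level interface: one callback per complete paragraph.
class HLDocument {
public:
    virtual ~HLDocument() = default;
    virtual void foundParagraph(const std::string& text, const PAP97& pap) = 0;
};

// Adapter: collects the low-level events and triggers the high-level
// callback once a complete paragraph has been read.
class HLAdapter : public LLDocument {
public:
    explicit HLAdapter(HLDocument& target) : m_target(target) {}

    void paragraphStart(const PAP97& pap) override {
        m_pap = pap;
        m_buffer.clear();
    }
    void text(const std::string& run) override { m_buffer += run; }
    void paragraphEnd() override { m_target.foundParagraph(m_buffer, m_pap); }

private:
    HLDocument& m_target;
    PAP97 m_pap;
    std::string m_buffer;
};

// A "lazy" doc -> txt style filter that only implements the high-level callback.
class TextFilter : public HLDocument {
public:
    void foundParagraph(const std::string& text, const PAP97&) override {
        paragraphs.push_back(text);
    }
    std::vector<std::string> paragraphs;
};
```

A parser would drive the HLAdapter exactly like any other LLDocument, and the
simple filter never sees the individual text runs.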

> For you as a user of the library there are two ways to write a filter:
> - hardcore, i.e. inherit LLDocument and do all the work yourself. This will
>   probably be needed for the KWord filter and the AbiWord filter, as they
>   want to extract as much information as possible.
> - lazy, i.e. inherit HLDocument and just implement those few callbacks.
> This might work for stuff like an ASCII preview, some basic DOC -> HTML
> conversion (e.g. for mail programs which want to show a preview for an
> attached Word file,...). This is just a quick&dirty approach, tho.

We should consider having the library implement a subclass of HLDocument 
which is geared to regression testing. The idea here would be to *machine 
generate* a class called DumpDocument which makes a text dump of the document 
(including textualised PAPs for example). Not only would this be good for 
testing, but it would be a brilliant reverse engineering tool!
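
Roughly what I have in mind, as a hand-written sketch of what the generator
might emit. The PAP97 here is a tiny hypothetical subset (two fields), the
foundParagraph signature is an assumption, and the output format is just
something diff-friendly that I picked for the example.

```cpp
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical two-field subset of the Word97 PAP, for illustration only.
struct PAP97 {
    int jc = 0;             // justification code
    bool fInTable = false;  // paragraph is inside a table
};

// Assumed shape of the high-level interface.
class HLDocument {
public:
    virtual ~HLDocument() = default;
    virtual void foundParagraph(const std::string& text, const PAP97& pap) = 0;
};

// Writes one line per structure field so that two dumps can be diffed.
class DumpDocument : public HLDocument {
public:
    explicit DumpDocument(std::ostream& out) : m_out(out) {}

    void foundParagraph(const std::string& text, const PAP97& pap) override {
        m_out << "PARAGRAPH\n";
        dump(pap);
        m_out << "  text=\"" << text << "\"\n";
    }

private:
    // This is the part that would be machine generated from the structure
    // descriptions, one output line per field.
    void dump(const PAP97& pap) {
        m_out << "  PAP.jc=" << pap.jc << "\n";
        m_out << "  PAP.fInTable=" << pap.fInTable << "\n";
    }

    std::ostream& m_out;
};
```

A regression test would then just compare the dump of a known file against a
stored reference dump.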

> In the library I suggest to use the following structure to make it work:
>
> Parser <--- Parser9x <--- Parser97
>                       `--- Parser95
>
> The Parser class is a quite simple abstract base which just allows
> construction, shares some common variables (e.g. the storage, the streams,
> the iconv abstraction class,....). The Parser9x class provides some common
> functionality which can be shared between Word97 and Word95. We should be
> careful about what to put there. Finally, the two Parser95 and Parser97
> classes do all the dirty work, walking through the streams.

The plus side of this scheme is that there is a clean separation between
Word95 and Word97, and so this is easy to extend to Word 2 etc. (when Dom
gets time :-)). The down side is that Word95 and Word97 are so similar that
there will be a lot of duplication. Thus:

Parser <--- Parser9x <--- Parser97
                     <--- Parser95

where Parser9x will call virtual methods implemented by Parser95 and Parser97
where required (which should be rare).
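
In C++ terms, something like the following sketch. The shared logic sits in
Parser9x and calls a virtual hook that the two subclasses fill in.
characterCount() and bytesPerCharacter() are hypothetical names, and the
values returned are only there to give the example something to compute; they
are not meant as a statement about the file formats.

```cpp
#include <cstddef>

// Abstract base: construction and shared state would live here.
class Parser {
public:
    virtual ~Parser() = default;
    virtual std::size_t characterCount(std::size_t byteLength) const = 0;
};

// Logic shared between Word95 and Word97. Anything version specific is
// requested through a virtual hook instead of being duplicated.
class Parser9x : public Parser {
public:
    std::size_t characterCount(std::size_t byteLength) const override {
        return byteLength / bytesPerCharacter();
    }

protected:
    // Hypothetical hook; the subclasses supply the version-specific answer.
    virtual std::size_t bytesPerCharacter() const = 0;
};

class Parser95 : public Parser9x {
protected:
    std::size_t bytesPerCharacter() const override { return 1; }  // illustrative value
};

class Parser97 : public Parser9x {
protected:
    std::size_t bytesPerCharacter() const override { return 2; }  // illustrative value
};
```

The point is only that the walking code is written once, and the subclasses
stay small.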

> Note that we only use Word97 structures in the callbacks of the LLDocument
> interface. The reason is, that we don't want to know the version we're
> reading, we only want to get the content. This means that we have to get
> rid of the distinction between Word97 and Word95 at some level. I don't
> know if it's the best level to do so, feel free to suggest a different
> solution.

I think we should hide the low level differences in the low level code:

Parser <--- Parser9x <--- Parser97
                     <--- Parser95 <--- Parser95to97

See below for how Parser95to97 is built.

> However, in this case the Word95 parser has two tasks:
> - read the structures from the stream (native Word95) and perform all the
>   parsing depending on their information
> - Convert the Word95 structures and pass them to all the callback functions
>   of the LLDocument.
> This doesn't mean that we have to convert *all* structures to Word97
> structures in this parser, but "just" those we want to pass around (e.g.
> CHP, PAP, SEP,...).
> As this code for the conversion might involve some "intelligence" I doubt
> that we can generate this code.

Actually, I think we can generate most of it. There are three cases:

- 95 is bit-for-bit identical to 97.

- 95 is name-for-name identical to 97, and the only difference is that the
range of values permitted for 97 is bigger than for 95.

- neither of the above. This applies to just a few structures.

We can therefore machine generate conversion functions quite easily for all 
but a few cases, and then hand-edit the few exception cases. So, Parser95to97 
simply promotes output structures. See (1d) in my original note 
http://www.geocrawler.com/lists/3/SourceForge/6086/50/6120533/.
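
As an illustration of the first two cases, here is the kind of function a
generator could emit. Word95::Sample and Word97::Sample are made-up
structures, not real Word ones, and toWord97() is a name I chose for the
example: one field is identical in both versions, the other two keep their
names but get a wider type in 97.

```cpp
#include <cstdint>

namespace Word95 {
struct Sample {              // hypothetical structure
    std::uint16_t flags;     // identical in both versions
    std::uint8_t  count;     // same name, smaller range than in 97
    std::int16_t  indent;    // same name, smaller range than in 97
};
}  // namespace Word95

namespace Word97 {
struct Sample {              // hypothetical structure
    std::uint16_t flags;
    std::uint16_t count;
    std::int32_t  indent;
};
}  // namespace Word97

// Generated conversion: a field-by-field copy. The integer promotions take
// care of the fields whose range grew, so no hand-written logic is needed.
inline Word97::Sample toWord97(const Word95::Sample& in) {
    Word97::Sample out;
    out.flags  = in.flags;
    out.count  = in.count;
    out.indent = in.indent;
    return out;
}
```

The few structures in the third case would get a hand-edited toWord97()
instead of a generated one.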

> As "glue" between those two hierarchies I suggest using a plain Factory
> class which takes a filename, creates the OLE storage on it, reads the
> first 4 bytes of the WordDocument stream and checks the nFib. Depending on
> that it creates the proper parser. All the "user" of the library has to do
> is to create the Document object and call parse() on it.

Fine.
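
For the record, a sketch of the dispatch step as I understand it. It works on
the four header bytes directly and leaves the OLE storage out. I am assuming
that nFib is a 16-bit little-endian value at byte offset 2, and that 193 is
the first Word97 value; both should be checked against the format
documentation before anyone relies on them. createParser() and version() are
hypothetical names.

```cpp
#include <array>
#include <cstdint>
#include <memory>
#include <string>

class Parser {
public:
    virtual ~Parser() = default;
    virtual std::string version() const = 0;
};

class Parser95 : public Parser {
public:
    std::string version() const override { return "Word95"; }
};

class Parser97 : public Parser {
public:
    std::string version() const override { return "Word97"; }
};

// Assumption: nFib is stored little-endian in bytes 2 and 3 of the stream.
inline std::uint16_t readNFib(const std::array<std::uint8_t, 4>& header) {
    return static_cast<std::uint16_t>(header[2] | (header[3] << 8));
}

// Picks a parser from the first four bytes of the WordDocument stream.
inline std::unique_ptr<Parser> createParser(const std::array<std::uint8_t, 4>& header) {
    const std::uint16_t kFirstWord97 = 193;  // assumed threshold, to be verified
    if (readNFib(header) >= kFirstWord97) {
        return std::unique_ptr<Parser>(new Parser97);
    }
    return std::unique_ptr<Parser>(new Parser95);
}
```

A real factory would also reject values that are too old to be supported
rather than handing them to Parser95.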

> An issue still to be discussed is if we want to have Word97 structures in
> the interface for high-level filtering (HLDocument) or not. If we don't
> want to have that we can hide some ugly MSisms, but of course we have to
> "invent" new structures.

Yes, we do want Word97. Anything else will lose information. There are some
MSisms a library like wv2 can assist filter writers with, and we should find
a home for them somewhere. Perhaps a MidLevelDocument (MLDocument?) for
things like eliminating the end-of-paragraph markers, or converting the ugly
field stuff into something a bit more structured.
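
To give an idea of the sort of helper I mean, here is a small sketch. It
assumes the paragraph mark is a carriage return and that a field is
delimited by the characters 0x13 (begin), 0x14 (separator) and 0x15 (end); it
handles a single non-nested field only. Field, stripParagraphMark() and
parseField() are hypothetical names.

```cpp
#include <string>

// A field split into its two halves, e.g. instruction "PAGE", result "3".
struct Field {
    std::string instruction;
    std::string result;
};

// Removes one trailing paragraph mark (carriage return), if present.
inline std::string stripParagraphMark(std::string text) {
    if (!text.empty() && text.back() == '\r') {
        text.pop_back();
    }
    return text;
}

// Extracts the first non-nested field from text.
// Returns false if no complete field (begin ... end) is found.
inline bool parseField(const std::string& text, Field& out) {
    const char kBegin = 0x13, kSeparator = 0x14, kEnd = 0x15;

    const std::string::size_type b = text.find(kBegin);
    if (b == std::string::npos) return false;
    const std::string::size_type e = text.find(kEnd, b + 1);
    if (e == std::string::npos) return false;

    const std::string::size_type s = text.find(kSeparator, b + 1);
    if (s != std::string::npos && s < e) {
        out.instruction = text.substr(b + 1, s - b - 1);
        out.result = text.substr(s + 1, e - s - 1);
    } else {
        // No separator: the field has an instruction but no result.
        out.instruction = text.substr(b + 1, e - b - 1);
        out.result.clear();
    }
    return true;
}
```

Nested fields would need a proper stack, which is exactly the kind of thing
filter writers should not have to redo each time.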

Apologies if this all sounds a bit familiar :-)

Thanks, Shaheed
_______________________________________________
Koffice-devel mailing list
Koffice-devel@master.kde.org
http://master.kde.org/mailman/listinfo/koffice-devel
