Hi!

I decided that it's time to annoy you all with a looong mail again ;)
Please try to think about it and tell me your opinion.

Note: This is just about *reading* WinWord files. The design for writing
will probably be similar, just the other way round :)

Current state of development:
-----------------------------
1) We already have the generated code which is able to read and write the
   basic structures of the files. This works quite well, and except for the
   few structures we can't generate easily, everything seems to work fine.
2) The string class of choice is UString. The reason I chose it over
   UT_UCS2String is that it's implicitly shared and we can switch the byte
   order on demand, which is nice due to a change in iconv.
3) We have an interface for the OLE library and iconv which is not really
   finished or perfect, but it works.

Next steps:
-----------
Problem: We have to parse at least three different kinds of Word files, and
we have the documentation for two of them. So far so good. The problem is
that there are some differences between Word95 and Word97 structures (and
they are bigger than the differences between Word97 and Word2000). What to
do in the parser? Where to put the abstraction over the MS Word
oddities,... There are several questions like that, and you can probably
come up with some I didn't even think of. However, after two hours of
twisting my brain in different directions at the same time I came up with
the following:

    LLDocument  <----  HLAdapter
                           |
                       HLDocument

This is the "hierarchy" of the document abstraction. The LLDocument is an
abstract base class and contains callbacks for stuff like "paragraphStart",
"tableCellEnd",... and passes Word97 structures(!!) around. The HLDocument,
also an abstract base class, provides a more "abstract" way of handling all
that (e.g. "foundParagraph()"), so that "simple" filters (e.g. doc -> txt)
don't have to mess around with the low-level interface. The HLAdapter class
inherits LLDocument and implements all the callbacks it's interested in.
Then it triggers the HLDocument callbacks if it feels like doing so (e.g.
when it has read a complete paragraph).

For you as a user of the library there are two ways to write a filter:
- hardcore, i.e. inherit LLDocument and do all the work yourself. This will
  probably be needed for the KWord filter and the AbiWord filter, as they
  want to extract as much information as possible.
- lazy, i.e. inherit HLDocument and just implement those few callbacks.
  This might work for stuff like an ASCII preview, some basic DOC -> HTML
  conversion (e.g. for mail programs which want to show a preview of an
  attached Word file,...). This is just a quick&dirty approach, though.
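To make this a bit more concrete, here is a very rough C++ sketch of the
document abstraction. Apart from the three class names, everything in it
(which callbacks exist, their signatures, the helper members, the header
name) is invented on the spot, so please read it as an illustration of the
idea and not as the proposed interface:

    #include "ustring.h"   // wherever our UString ends up living -- sketch only

    namespace Word97 {
        struct PAP;   // paragraph properties (generated structure)
        struct CHP;   // character properties (generated structure)
    }

    // Low-level interface: fine-grained callbacks, Word97 structures are
    // passed around.
    class LLDocument {
    public:
        virtual ~LLDocument() {}
        virtual void paragraphStart( const Word97::PAP& pap ) = 0;
        virtual void runOfText( const UString& text, const Word97::CHP& chp ) = 0;
        virtual void paragraphEnd() = 0;
        virtual void tableCellEnd() = 0;
        // ...lots more (sections, tables, fields,...)
    };

    // High-level interface: a few coarse callbacks, no Word structures
    // visible at all.
    class HLDocument {
    public:
        virtual ~HLDocument() {}
        virtual void foundParagraph( const UString& text ) = 0;
        // ...a handful of similarly "simple" callbacks
    };

    // The adapter implements the low-level callbacks it's interested in and
    // triggers the high-level ones once it has collected enough information
    // (here: a complete paragraph).
    class HLAdapter : public LLDocument {
    public:
        HLAdapter( HLDocument* document ) : m_document( document ) {}

        virtual void paragraphStart( const Word97::PAP& ) { m_text = UString(); }
        virtual void runOfText( const UString& text, const Word97::CHP& ) { m_text += text; }
        virtual void paragraphEnd() { m_document->foundParagraph( m_text ); }
        virtual void tableCellEnd() {}   // the lazy filters don't care

    private:
        HLDocument* m_document;
        UString m_text;
    };

A KWord or AbiWord filter would then subclass LLDocument directly, while
something like the ASCII preview would only subclass HLDocument, implement
foundParagraph(), and let an HLAdapter sit in between.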
In the library I suggest using the following structure to make it work:

    Parser <--- Parser9x <--- Parser97
                          `-- Parser95

The Parser class is a quite simple abstract base which just allows
construction and shares some common variables (e.g. the storage, the
streams, the iconv abstraction class,...). The Parser9x class provides some
common functionality which can be shared between Word97 and Word95. We
should be careful about what to put there. Finally, the two classes,
Parser95 and Parser97, do all the dirty work, walking through the streams.

Note that we only use Word97 structures in the callbacks of the LLDocument
interface. The reason is that we don't want to know which version we're
reading, we only want to get the content. This means that we have to get
rid of the distinction between Word97 and Word95 at some level. I don't
know if that's the best level to do so, feel free to suggest a different
solution. However, in this case the Word95 parser has two tasks:
- read the structures from the stream (native Word95) and perform all the
  parsing depending on their information
- convert the Word95 structures and pass them to all the callback functions
  of the LLDocument.

This doesn't mean that we have to convert *all* structures to Word97
structures in this parser, but "just" those we want to pass around (e.g.
CHP, PAP, SEP,...). As the conversion code might involve some
"intelligence", I doubt that we can generate it.

As "glue" between those two hierarchies I suggest using a plain Factory
class which takes a filename, creates the OLE storage on it, reads the
first 4 bytes of the WordDocument stream and checks the nFib. Depending on
that it creates the proper parser. All the "user" of the library has to do
is to create the Document object and call parse() on it.

An issue still to be discussed is whether we want to have Word97 structures
in the interface for high-level filtering (HLDocument) or not. If we don't
want that, we can hide some ugly MS-isms, but of course we then have to
"invent" new structures.

So... that was all. Thanks a lot for reading up to this point.

Awaiting your comments, flames, and questions,
Werner, who will not be able to reply until tomorrow morning
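P.S.: To illustrate the "glue" part, here is a rough sketch of the Factory.
The OLE class and method names are made up (the real ones are whatever our
OLE interface ends up providing), the parser constructors are invented as
well, and the nFib boundaries are from memory and have to be double-checked
against the documentation:

    #include <string>
    // (the real headers for the OLE interface and the parsers would go here)

    class Factory {
    public:
        // Takes a filename, creates the OLE storage on it, peeks at the FIB
        // and returns the matching parser (or 0 if we can't handle the file).
        static Parser* createParser( const std::string& fileName )
        {
            OLEStorage storage( fileName );          // invented wrapper class
            if ( !storage.open() )
                return 0;

            OLEStreamReader* stream = storage.createStreamReader( "WordDocument" );
            stream->readU16();                       // wIdent -- bytes 0..1 of the FIB
            unsigned short nFib = stream->readU16(); // nFib   -- bytes 2..3
            delete stream;

            if ( nFib < 101 )                 // older than Word 6 -- we give up
                return 0;
            else if ( nFib < 193 )            // Word 6/95 range (to be verified)
                return new Parser95( storage, nFib );
            else                              // Word 97 and newer (Word 2000,...)
                return new Parser97( storage, nFib );
        }
    };

The Document object the user sees would then just call
Factory::createParser() internally and forward parse() to whatever parser
it gets back.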