'Re: [RFC] Export filter architecture for MS Office'


List:       koffice-devel
Subject:    Re: [RFC] Export filter architecture for MS Office
From:       shaheed <srhaque () iee ! org>
Date:       2001-09-12 20:59:49

On Wednesday 12 September 2001 1:07 pm, Thomas Zander wrote:
> > The first thing is that it seems to me that the director/builder pattern
> > that Eva proposed is very similar to the KOffice olefilters architecture,
> > since that fundamentally is all about embedding objects one within
> > another. So, it is interesting that not only is there an existence proof
> > that director/builder works nicely in this application, but that there
> > are implications that:
>
> I don't follow this sentence. The one thing the Builder pattern (which you
> can find in your local copy of 'Design Patterns', ISBN 0201633612) does is
> allow you to build an object structure via a default interface. In other
> words, it has nothing to do with embedding at all.

I meant that the read-side of the olefilters architecture is similar to the 
director.Construct() method of director/builder. It understands an OLE 
docfile, but does not restrict whether the output data is represented as 
embedded objects.

Of course, the write-side of olefilters supports embedding, but that is a 
different matter.

> One of the things inherent in the pattern is that it does not return
> anything to you, as you can see in Eva's example (and my reply to that).
> All methods are implemented to return void.
>
> This is because the pattern was created to (and I quote) "Separate the
> construction of a complex object from its representation so that the same
> construction process can create different representations."
>
> Where this and your example come together is that the director (which knows
> only the common builder interface) calls the methods and knows only the
> basic relations between the objects created by the builder. So the director
> should not have to know the direct parent of the object it creates.

That is right for a pure director/builder model. But there are cases where I 
believe it is not sufficient, such as olefilters' write-side which, in order 
to create documents that mirror the embedding of the original, MUST preserve 
the parent-child relationship: the only place it can learn this is from the 
read-side.
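
To make the parent-child point concrete, here is a minimal C++ sketch. It is 
not olefilters code: ExportBuilder, OutlineBuilder, NodeId and construct() 
are names made up for illustration. Unlike the pure pattern, where every 
method returns void, addChild() returns a handle which the director passes 
back as the parent of later nodes. That is one way the write-side could 
learn the nesting from the read-side.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical names throughout; none of this is existing olefilters API.
using NodeId = std::size_t;

// Builder interface: each call names the parent and returns a handle,
// so the nesting seen by the read-side reaches the write-side.
class ExportBuilder {
public:
    virtual ~ExportBuilder() = default;
    virtual NodeId root() = 0;
    virtual NodeId addChild(NodeId parent, const std::string &kind) = 0;
};

// One concrete builder: records the tree and can print it as an outline.
class OutlineBuilder : public ExportBuilder {
public:
    OutlineBuilder() { nodes_.push_back({"document", 0, 0}); }

    NodeId root() override { return 0; }

    NodeId addChild(NodeId parent, const std::string &kind) override {
        assert(parent < nodes_.size());  // the parent must already exist
        nodes_.push_back({kind, parent, nodes_[parent].depth + 1});
        return nodes_.size() - 1;
    }

    NodeId parentOf(NodeId id) const { return nodes_[id].parent; }

    // Nodes are listed in creation order, indented two spaces per level.
    std::string outline() const {
        std::string out;
        for (const Node &n : nodes_)
            out += std::string(2 * n.depth, ' ') + n.kind + "\n";
        return out;
    }

private:
    struct Node { std::string kind; NodeId parent; std::size_t depth; };
    std::vector<Node> nodes_;
};

// Director for a toy input: one embedded part holding two paragraphs.
void construct(ExportBuilder &b) {
    NodeId part = b.addChild(b.root(), "embedded-part");
    b.addChild(part, "paragraph");
    b.addChild(part, "paragraph");
}
```

Calling construct() on an OutlineBuilder and then outline() gives the 
document with the embedded part indented beneath it and the two paragraphs 
beneath that. A real write-side would create OLE storages and streams 
instead of text lines.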

> Let me state an example.
>
> I have a text document and want to convert that to KWord.
> In KWord we use frames, and in a frame(set) we have paragraphs. This
> relation is missing in the text document, the paragraphs are directly
> beneath the document 'node'
>
> If I now want to convert that same text document to OpenOffice, which does
> not do frames, the paragraph would directly be a child of the document.
>...

So, in the former case, we have to synthesise the parent-child relationship. 
(BTW: it's even worse if you try to recognise tables in a text document). But 
what if you are going from a more structured form to a less structured form? 

It must be true that an infrastructure that preserves the hierarchy will allow 
a more perfect transliteration than one which does not! For example, the 
hierarchy in a Word document is roughly:

OLE docfile
    Word data OLE streams
        Sections
            Paragraphs
                Runs

The current Word filter does not preserve section information (because I 
could not think of a KWord representation), but a more generic filter would 
certainly benefit from being able to represent that paragraph X belonged to 
section A. Werner has solved this nicely for import; I am saying that the 
same is needed for export.
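
As a rough illustration of going from a more structured form to a less 
structured one, here is a C++ sketch. All the names in it (WordBuilder, 
Section, FlatBuilder, SectionedBuilder, construct()) are invented for this 
example and are not taken from wv2 or olefilters. The director always 
reports section boundaries; a builder for a target without sections ignores 
them, while a builder for a target with sections can record that paragraph 
X belonged to section A.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical interface, not the wv2 or olefilters API.
class WordBuilder {
public:
    virtual ~WordBuilder() = default;
    virtual void beginSection(const std::string &name) = 0;
    virtual void paragraph(const std::string &text) = 0;
    virtual void endSection() = 0;
};

// Toy source model: sections that own paragraphs.
struct Section {
    std::string name;
    std::vector<std::string> paragraphs;
};

// Director: walks the source hierarchy and reports every level of it.
void construct(const std::vector<Section> &doc, WordBuilder &b) {
    for (const Section &s : doc) {
        b.beginSection(s.name);
        for (const std::string &p : s.paragraphs)
            b.paragraph(p);
        b.endSection();
    }
}

// Target without sections: the boundaries are simply dropped.
class FlatBuilder : public WordBuilder {
public:
    void beginSection(const std::string &) override {}
    void paragraph(const std::string &text) override { out += text + "\n"; }
    void endSection() override {}
    std::string out;
};

// Target with sections: each paragraph is tagged with its owning section.
class SectionedBuilder : public WordBuilder {
public:
    void beginSection(const std::string &name) override { current_ = name; }
    void paragraph(const std::string &text) override {
        assert(!current_.empty());  // a paragraph outside a section is a director bug
        out += current_ + ": " + text + "\n";
    }
    void endSection() override { current_.clear(); }
    std::string out;

private:
    std::string current_;
};
```

The same construct() call drives both builders. Only the builder decides 
whether the section information survives, which is why I would rather have 
the infrastructure carry the hierarchy and let each target throw away what 
it cannot represent.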

> If you go back to an email I sent earlier, as an answer to Eva's email
>  http://lists.kde.org/?l=koffice-devel&m=100014931211274&w=4
> you should notice my example at the bottom which starts with 'Consider the
> following;'.
>
> This tells you the solution provided by the pattern (again coming from the
> book) which uses numbering per type. So if someone does not know about a
> type there is no problem in the director/builder communication.

Sorry, I don't have the book (my brother does, but that is not much use, I 
know :-)). On the other hand, you can see that the techniques used by the 
word import filter follow exactly the builder pattern (see the gotXXX() 
methods).

> > 1. the director might usefully be an abstract class
>
> Right, for every input type you have one implementing class.
> Similarly you have one builder implementation for each output type.
>
> > 2. we might usefully have implementations for OLE-based documents and
> > anything else that has a *regularised* notion of nesting
>
> Nesting is done on a document basis, right? Where a document is anything
> from a picture to an excel sheet.
>
> Then I don't understand your concern about nesting _inside_ the filter. On
> a global scale a document is opened via a filter and will be converted to
> something native to the office suite.
> Any embedded parts will be treated in the same way, in that they will be
> filtered to a native format.

Yes and no. Nesting of documents is done on a document basis, but one can also 
consider using the director/builder patterns to parse lower level nested 
constructs. My point was that doing so formally is worthwhile when the code 
can be reused (i.e. is reasonably standardised), but may not be in other 
cases. I tried to exploit this by using templates for certain common 
structures in the word filter, but hand-generated code for much else in the 
parser.

> From a reference POV in the document to a nested document this will have to
> be done (and is currently done in KOffice) via filenames inside the archive
> for the whole document.
> I.e. a document points to a picture with a href, as something like:
>   <embedded type=picture href="images/picture1" width="123" height="123" />
>
> There are 2 scenarios I can imagine when opening an input file with an
> embedded part. One is that as soon as the import filter finds an embedded
> part the filter manager is told that it should convert it to filename
> 'tar:something/xyz'. This can be done directly or after the main filter has
> finished (and thus the converted document is not in memory anymore).
> The second way is that this proves to be impossible for some reason and the
> same filter just creates a new builder and does the conversion in the same
> filter instance, again immediately or after the main document has been
> written.
>
> > The point of all this is that if we do it right, I think we can make such
> > filters usable across multiple OpenSource projects. We only implement the
> logic of each filter once: all projects get improved capabilities. Does
> > this sound feasible? Any better ideas around?
>
> The trick about this is to create an interface for the builder that every
> builder should implement and every director can use. Therefore the interface
> has to be so big that it can be used to build every type of output document
> we plan to create.
> I certainly hope this can be done, but it will be very tricky ;)

I think that this is a very big job for a single interface. Remember, in 
addition to the simple hierarchy I drew above, you will need to design:

1. cross-application types such as styles. To be useful, your builder 
interface will have to be the closure of all possible style settings in all 
possible apps. 

2. many unique cross-application types. For example: sections, text, tables, 
date fields, spreadsheet formulae, mathematical formulae, column width 
settings, figure captions,...

I am proposing something a bit less ambitious: something that looks a bit 
like director/builder but is specific to MS Word writing. If it is 
successful, then the same style of coding might work elsewhere. It's not 
code reuse, but it is learning reuse. My analogy is that I saw how the 
Excel import filter was written, tidied it up a bit, and was then able to 
rapidly develop the kwmf, msod and ppt filters using very similar techniques.

> I have been talking about another approach to this problem; the approach is
> to get all the open source editors to use one DTD (not as hard as you
> imagine). If this is done all filters will just have to output one DTD. I
> think you can imagine the advantages of that.

When this is complete, I acknowledge that my work may prove obsolete, and a 
single MS Word exporter will fit all editors. But the world needs export now, 
and even when your work is done, I will be as happy to see my 
first-generation export work be thrown away then as I am to see my 
first-generation import work be thrown away now. With luck, it should be 
straightforward to adapt/rewrite/learn-from-but-discard my code for a second 
generation of unified export filters.

> Among other things it will allow all parties to join forces in creating a
> filter for any format since it will work with all applications using the
> standard DTD.
>
> Lots of the above ideas can still be used with the one-DTD solution, and I
> hope that my lengthy email does not scare you off the notion of doing
> this at all, since I believe it's a good idea (hence my lengthy email ;)

Sure, I know the success of KOffice and other suites is important to us both. 
My bottom line is that wv2 has the makings of a cross-editor import facility, 
and I believe we should try for cross-editor export too, while we wait for 
unified DTDs.

Thanks for taking the time to critique my note. Writing Word filters is a 
very lonely job :-).

Shaheed
_______________________________________________
Koffice-devel mailing list
Koffice-devel@mail.kde.org
http://mail.kde.org/mailman/listinfo/koffice-devel
