Werner Trobin wrote:
> Torben, Reggie, and all the others agreed on storing embedded KOffice Parts
> and binary data (e.g. pictures, movies, sounds) via a simple gzipped-tar-
> structure. Yesterday (990730) Torben "released" KTar which will probably be

I had imagined more of an XML-based scheme in which all parts are stored as embedded parts of the document itself. E.g. a KWord document with a KFormula part would be stored like this:

  <main editor="..." version="...">
    Here goes editor text
    <part editor="..." version="...">
      Here goes formula description
    </part>
    More editor text here
  </main>

In this example, the tag "main" signals that this part should be opened as the main window and loaded initially. "part" indicates an embedded part within that main document. The "editor" and "version" properties tell the storage system which part should be loaded. The content here is passed to the naming service, which loads the appropriate component. Binary data is saved as uuencoded/base64 data.

An extension to this model is that each part can have a DTD associated with it, so that documents can be validated by the storage system upon reading and formatted for the part accordingly, i.e. the system does the parsing and the part gets the data pretty much well-chewed.

I think that such a text-based solution has various advantages over a binary solution. One of them is the above-mentioned validation and parsing. Others are system-level indexing, exchange with non-KDE systems (via XSL) and out-of-part browsing of the document's contents.

Also note that none of the parts need be named in this example, contrary to a .tar-based solution. The user is not obliged to specify a name (although one can optionally be supplied); a name is something by which the user wants to identify the part -- nameless parts are only interesting within the scope of the enclosing document.

Text need not be more than marginally less efficient if the access layer provides a high enough abstraction. (It should probably have more in common with DOM than with other storage systems such as Bento.) The text could always be piped through a compression filter before being stored to disk.

The downside of such a solution is that I reckon most of the code already written for the storage system would have to be rewritten.
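
To make the XML scheme described above concrete, here is a minimal Python sketch of how a storage layer might read such a document. Only the tag names "main" and "part" and the "editor" and "version" properties come from the description; the attribute values, the "binary" element with its "encoding" attribute, and the function names are assumptions made up for illustration.

```python
import base64
import xml.etree.ElementTree as ET

# Hypothetical document. Tag names "main"/"part" and the "editor"/"version"
# attributes follow the post; all values and the "binary" element are assumed.
PAYLOAD = base64.b64encode(b"\x89PNG fake image bytes").decode("ascii")
DOC = """\
<main editor="kword" version="1">Here goes editor text
<part editor="kformula" version="1">Here goes formula description</part>
More editor text here
<binary encoding="base64">{payload}</binary>
</main>""".format(payload=PAYLOAD)


def walk(element, depth=0):
    """Yield (depth, editor, version, own_text) for the main part and each embedded part."""
    if element.tag in ("main", "part"):
        # Own text: text directly inside this element, excluding child elements.
        chunks = [element.text or ""]
        chunks += [child.tail or "" for child in element]
        own_text = " ".join(" ".join(chunks).split())
        yield depth, element.get("editor"), element.get("version"), own_text
    for child in element:
        yield from walk(child, depth + 1)


def binaries(element):
    """Decode every base64-encoded binary blob found in the document."""
    return [
        base64.b64decode(blob.text)
        for blob in element.iter("binary")
        if blob.get("encoding") == "base64"
    ]


root = ET.fromstring(DOC)
parts = list(walk(root))
blobs = binaries(root)

for depth, editor, version, text in parts:
    print("  " * depth + f"{editor} v{version}: {text!r}")
```

In a real storage system each (editor, version) pair would presumably be handed to the naming service to locate the component, rather than printed.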

(Rewriting the storage code may already be too late for your timeframes.)

The other approach is of course to have each part store its contents in XML (or really whatever format it chooses), making it possible to link between parts within the same document, which brings us back to the .tar-based solution (more on that later).

> 1) Name of the KOffice Files:
> I think it is important to show that the file is a tgz-archieve, so I suggest
> using filenames like "MyLetter.kwd.tgz", "Sales.ksp.tgz", or
> "Meeting.kpr.tgz".

Does this naming convention carry a hidden specification that the name of the file indicates the main part (i.e. the part that is started as the frame of the document)? I don't quite grasp where the responsibility for opening the document lies. Does the system decompress the file and then start the main part, feeding it the file? Or is the file opened by the storage system, which identifies and prepares all parts and then starts the frame, feeding it its part's contents? I would prefer the latter model, as it puts more responsibility on the storage system and less on the part itself.

(Control question: does a part necessarily know the type of the other parts stored inside it? And should it be possible to exploit this type knowledge to extract information from sub-parts? I think not: while it creates better integration, it breaks encapsulation between parts, making it more of a KOffice-plugin model than a system component model.)

> I'm not quite sure if we should go for a "flat" directory structure (one
> single

The main problem with a flat directory layout (and any other layout with a fixed number of directories, for that matter) is that you easily run into name conflicts (e.g. if the first picture in a text is always named "picture0", then you'll run into conflicts when the user tries to merge two texts). The same applies to directories named after the type of the part. Each part effectively needs its own directory.
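
The name-conflict argument above can be illustrated with a small Python sketch. It is not taken from any KOffice code: the dictionaries stand in for archive directories, and the function names merge_flat and merge_scoped are hypothetical.

```python
def merge_flat(a, b):
    """Merge two flat name -> data maps and report the names that collide."""
    clashes = sorted(set(a) & set(b))
    merged = {**a, **b}  # on a clash, the entry from b silently replaces a's
    return merged, clashes


def merge_scoped(a, b):
    """Merge by keeping each source in its own scope (one 'directory' per part)."""
    return {0: dict(a), 1: dict(b)}


# Two texts whose first picture is, by convention, named "picture0".
letter = {"picture0": b"logo"}
report = {"picture0": b"chart"}

flat, clashes = merge_flat(letter, report)
scoped = merge_scoped(letter, report)

print("flat clashes:", clashes)
print("scoped keys:", {scope: sorted(names) for scope, names in scoped.items()})
```

With the flat layout one of the two pictures is lost unless the storage system renames it; with one scope per part both survive under the same local name.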

Which brings me to the main problem with .tar-based files: they model a directory on a file system. A directory is ultimately only a subset of the files on the disk; it doesn't contain data itself. This is not a proper model of a document: a part has contents of its own as well as other parts stored within it. HTTP has the same problem, which is why one gets the index.html file by default when one asks for the directory itself.

Hence, there is a need to extend the .tar model: a "directory" would contain both other parts and the contents of the part itself. Such a "directory" (why not call it a part?) would not necessarily have a name, unless the user has assigned one. A name becomes nothing more than an optional label. I can't see any reason to let the storage system generate a lot of "partXxx" names that are mixed with labels. E.g. (using the previously mentioned example):

  lab.kwd.tgz
   +--- 0.ksp
   |     +--- 0.jpg
   +--- 1.kdg "Layout"
   +--- 0.jpg

This is a KWord file which has a spreadsheet embedded (which in turn has a picture embedded) and, back in the text, a diagram and a picture. Both the picture in the text and the one in the spreadsheet are named with the ordinal 0 because they are in different scopes. The diagram part is also explicitly named by the user (which would make it possible to reach it with a URL: file:/home/jdoe/lab.kwd.tgz/Layout). For an explanation of the special branches (a part's content branch and the document-level attribute stream), look below.

> Another thing to discuss is if we need a "summary.ko" file. This could e.g.
> contain the author's name, the dates,... This information is stored at least

This would be a nice replacement for extended attributes (supporting such features on file systems that don't have them implemented). In my opinion this should be treated as properties of the document as a whole and not of a particular part, i.e. it is as much a part of the file as the name and access rights are. Hence, it should probably be searchable from within the shell.
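
The extended model described above (a part that holds its own contents, child parts and an optional label, plus properties that belong to the document as a whole) could be sketched in Python roughly as follows. The class names Part and Document, the find method and the "author" property value are all hypothetical; only the part names and the "Layout" label are taken from the example.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Part:
    """A 'directory' that carries its own contents as well as child parts."""
    name: str                     # scope-local name, e.g. "0.jpg"
    content: bytes = b""          # the part's own data
    label: Optional[str] = None   # optional user-assigned name
    children: List["Part"] = field(default_factory=list)

    def find(self, path: str) -> Optional["Part"]:
        """Resolve a '/'-separated path; each step matches a child's name or label."""
        node = self
        for step in filter(None, path.split("/")):
            node = next(
                (c for c in node.children if step in (c.name, c.label)), None
            )
            if node is None:
                return None
        return node


@dataclass
class Document:
    """Document-wide properties live beside the part tree, not inside a part."""
    root: Part
    attributes: Dict[str, str] = field(default_factory=dict)


doc = Document(
    root=Part("lab.kwd", b"text", children=[
        Part("0.ksp", b"cells", children=[Part("0.jpg", b"pic-in-sheet")]),
        Part("1.kdg", b"diagram", label="Layout"),
        Part("0.jpg", b"pic-in-text"),
    ]),
    attributes={"author": "jdoe"},  # assumed value, for illustration only
)

print(doc.root.find("Layout").name)
print(doc.root.find("0.ksp/0.jpg").content)
```

Because lookups are scoped to a part's children, the two pictures named "0.jpg" never collide, and the labelled diagram can be reached by its label alone.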

The summary information should probably be in a designated stream of the ".tar" itself, like the content branch of a part (see the example above). The downside of such a solution is that one breaks backward compatibility with .tar files (unless of course one names the special branches ".data", ".content", ".attributes", ".dtd" or something like that).

An extension to this model is a special branch that holds the DTD for any parts that are stored in XML format (or a reference to a system-wide installed one -- that is a tradeoff between space and portability to systems without that part installed).

Sincerely,
R.