'Re: KOffice Storage Structure'

List:       koffice
Subject:    Re: KOffice Storage Structure
From:       Roland Kaufmann <roland.kaufmann () student ! uib ! no>
Date:       1999-08-04 20:48:25

Werner Trobin wrote:
> Torben, Reggie, and all the others agreed on storing embedded KOffice Parts
> and binary data (e.g. pictures, movies, sounds) via a simple gzipped-tar-
> structure. Yesterday (990730) Torben "released" KTar which will probably be

I had rather imagined an XML-based scheme in which all parts were stored
as embedded parts of the document itself. E.g. a KWord document with
a KFormula part would be stored like this:

	<main editor="KWord" version="1.0">
		Here goes editor text
		<part editor="KFormula" version="1.0">
			Here goes formula description
		</part>
		More editor text here
	</main>

In this example, the tag "main" signals that this part should be opened
as the main window and loaded initially. "part" indicates that this is
an embedded part within that main document. The "editor" and "version"
properties tell the storage system which part should be loaded. Their
contents are passed to the naming service, which loads the appropriate
component. Binary data is saved as uuencoded/base64 data.
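
A rough sketch of how a storage layer might walk such a document, in
Python. The LOADERS table is a hypothetical stand-in for the naming
service, and the loader functions are placeholders; none of this is
existing KOffice code.

```python
import base64
import xml.etree.ElementTree as ET

DOC = """<main editor="KWord" version="1.0">
Here goes editor text
<part editor="KFormula" version="1.0">Here goes formula description</part>
More editor text here
</main>"""

# Hypothetical stand-in for the naming service: (editor, version) -> loader.
LOADERS = {
    ("KWord", "1.0"): lambda text: ("KWord", text),
    ("KFormula", "1.0"): lambda text: ("KFormula", text),
}

def load(element):
    """Load one part and, recursively, the parts embedded in it."""
    key = (element.get("editor"), element.get("version"))
    loader = LOADERS[key]  # KeyError if no matching component is registered
    # A part's own text is the text before its first child plus the text
    # that follows each child element.
    own = [element.text or ""] + [child.tail or "" for child in element]
    loaded = loader(" ".join(s.strip() for s in own if s.strip()))
    children = [load(child) for child in element if child.tag == "part"]
    return loaded, children

root = ET.fromstring(DOC)
assert root.tag == "main"  # "main" marks the part opened as the main window
tree = load(root)

# Binary data would travel as base64 text inside an element.
encoded = base64.b64encode(b"\x89PNG").decode("ascii")
assert base64.b64decode(encoded) == b"\x89PNG"
```

The point of the sketch is that the part itself never sees the tags:
the storage layer resolves "editor" and "version" and hands each
component only its own text.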

An extension to this model is that each part can have a DTD associated, 
so that documents can be validated upon reading by the storage system, 
and formatted for the part accordingly, i.e. the system does the parsing
and the part gets the data pretty much well-chewed.

I think that such a text-based solution has various advantages over a
binary solution. One of them is the above-mentioned validation and
parsing. Others are system-level indexing, exchange with non-KDE
systems (via XSL) and out-of-part browsing of the document's contents.

Also note that none of the parts needs to be named in this example,
contrary to a .tar-based solution. The user is not obliged to specify a
name (although one can optionally be supplied); a name is something by
which the user wants to identify the part -- nameless parts are only
interesting within the scope of the enclosing document.

Text need be only marginally less efficient if the access layer
provides a high enough abstraction. (It should probably have more in
common with DOM than with other storage systems such as Bento.) The text
could always be piped through a compression filter before being stored
to disk.
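
To illustrate the compression filter, here is a small Python sketch
using gzip. The sample document is made up and deliberately repetitive,
so the sizes it prints say nothing about real KOffice files.

```python
import gzip

# Made-up, repetitive sample text standing in for a stored document.
xml_text = ('<main editor="KWord" version="1.0">'
            + "Here goes editor text. " * 200
            + "</main>")

raw = xml_text.encode("utf-8")
packed = gzip.compress(raw)                 # what would be written to disk
restored = gzip.decompress(packed).decode("utf-8")

assert restored == xml_text                 # the filter is lossless
print(len(raw), "bytes as text,", len(packed), "bytes compressed")
```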

The downside of such a solution is that I reckon most code already
written for the storage system would have to be rewritten (which may
already be too late for your timeframes).

The other approach is of course to have each part store its contents
in XML (or really what-ever format it chooses), making it possible to
link between parts within the same document, which brings us back to
the .tar-based solution (more on that later)

> 1) Name of the KOffice Files:
> I think it is important to show that the file is a tgz-archieve, so I suggest
> using filenames like "MyLetter.kwd.tgz", "Sales.ksp.tgz", or
> "Meeting.kpr.tgz".

Does this naming convention carry a hidden rule that the name
of the file indicates the main part (i.e. the part that is started
as the frame of the document)?

I don't quite grasp where the responsibility for opening the
document lies. Does the system decompress the file and then start
the main part, feeding it the file, or is the file opened by the
storage system, which identifies and prepares all parts and then
starts the frame, feeding it its part contents?

I would prefer the latter model, as it puts more responsibility on
the storage system and less on the part itself. (Control question:
Does a part necessarily know the type of other parts stored inside
it? And: should it be possible to exploit this type knowledge to
extract information from sub-parts? I think not, since while it
creates better integration, it breaks encapsulation between parts,
making it more a KOffice-plugin model than a system component model.)

> I'm not quite sure if we should go for a "flat" directory structure (one
> single

The main problem with a flat directory layout (and any other layout with
a fixed number of directories for that matter) is that you easily run into
name conflicts (e.g. if the first picture in a text is always named
"picture0" then you'll run into conflicts when the user tries to merge
two texts).

The same applies to directories named after the type of the part. Each
part effectively needs its own directory.

Which brings me to the main problem with .tar-based files: They're
modeling a directory on a file system. A directory is ultimately only a
subset of the files on the disk. A directory doesn't contain data itself.

This is not a proper model of a document. A part has contents of its
own as well as other parts stored inside it. HTTP has the same problem,
which is why one gets the index.html file by default if one asks for
the directory itself.

Hence, there is a need to extend the .tar model: A "directory" would
contain both other parts and the contents of the part itself. Such a
"directory" (why not call it a part?) would not necessarily have a
name, unless the user has assigned one. A name becomes nothing more
than an optional label. I can't see any reason to let the storage
system generate a lot of "partXxx" names that are mixed with labels.
E.g. (using the previously mentioned example):

lab.kwd.tgz +--- <data> +--- <contents>
            |           |
            |           +--- <dtd>
            |           |
            |           +--- 0.ksp +--- <contents>
            |           |                |
            |           |                +--- 0.jpg +--- <contents>
            |           |
            |           +--- 1.kdg "Layout" +--- <contents>
            |           |
            |           +--- 0.jpg +--- <contents>
            |
            +--- <attributes>

This is a KWord file which has an embedded spreadsheet (which in turn
has an embedded picture), a diagram and a picture.
Both the picture in the text and the one in the spreadsheet are named
with the ordinal 0 because they are in different scopes. The diagram
part is also explicitly named by the user (which would make it possible
to reach it with a URL: file:/home/jdoe/lab.kwd.tgz/Layout).
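
The scoping can be sketched in a few lines of Python. The Part class
and its methods are hypothetical names made up for illustration, and
the scoped names ("0.ksp", "1.kdg", "0.jpg") are simply passed in by
the caller, since how ordinals get assigned is left open here.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional

@dataclass
class Part:
    name: str                       # scoped name, e.g. "0.jpg"
    contents: bytes = b""           # the part's own data
    label: Optional[str] = None     # optional user-assigned label
    children: Dict[str, "Part"] = field(default_factory=dict)

    def add(self, child: "Part") -> "Part":
        # Names only have to be unique within the enclosing part.
        if child.name in self.children:
            raise ValueError(f"name conflict in {self.name}: {child.name}")
        self.children[child.name] = child
        return child

    def resolve(self, path: str) -> "Part":
        """Follow a '/'-separated path of scoped names or user labels."""
        node = self
        for step in filter(None, path.split("/")):
            matches = [c for c in node.children.values()
                       if step in (c.name, c.label)]
            if not matches:
                raise KeyError(step)
            node = matches[0]
        return node

doc = Part("lab.kwd")
sheet = doc.add(Part("0.ksp"))
sheet.add(Part("0.jpg", b"picture in spreadsheet"))
doc.add(Part("1.kdg", label="Layout"))
doc.add(Part("0.jpg", b"picture in text"))
```

Both pictures can be called "0.jpg" without conflict, and
doc.resolve("Layout") reaches the diagram by its label in the same way
the URL would.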

For an explanation of <attributes> and <dtd>, look below.

> Another thing to discuss is if we need a "summary.ko" file. This could e.g.
> contain the author's name, the dates,... This information is stored at least

This would be a nice replacement for extended attributes (supporting
such features on file systems that don't have them implemented). In
my opinion these should be treated as properties of the document as a
whole and not of a particular part, i.e. they are as much part of the
file as the name and access rights are. Hence, they should probably be
searchable from within the shell.

This should probably be in a designated stream of the ".tar" itself,
like the content branch of a part. (See the example above.)

The downside of such a solution is that one breaks backward
compatibility with .tar files (unless of course one names the
special branches ".data", ".content", ".attributes", ".dtd"
or something like that).
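
A sketch of the dot-prefixed variant, using Python's tarfile module on
an in-memory archive. The member layout and the "author=jdoe" attribute
format are assumptions made up for the example, not an agreed format.

```python
import io
import tarfile

def add_member(tar, name, data):
    """Add a bytes payload to the archive under the given member name."""
    info = tarfile.TarInfo(name)
    info.size = len(data)
    tar.addfile(info, io.BytesIO(data))

buf = io.BytesIO()
with tarfile.open(fileobj=buf, mode="w:gz") as tar:
    # Reserved branches are ordinary members whose names start with a dot,
    # so a plain tar tool can still list and extract them.
    add_member(tar, ".attributes", b"author=jdoe\n")
    add_member(tar, ".data/.content", b'<main editor="KWord" version="1.0"/>')
    add_member(tar, ".data/0.jpg/.content", b"picture bytes")

buf.seek(0)
with tarfile.open(fileobj=buf, mode="r:gz") as tar:
    names = tar.getnames()
    attributes = tar.extractfile(".attributes").read()
```

Here the document-wide attributes can be read without touching any
part, and each part directory carries its own ".content" member next to
its sub-parts.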

An extension to this model is also a special branch that holds the
DTD for any parts that are stored in XML format (or a reference to
a system-wide installed one -- that is a tradeoff between space
and portability to systems without that part installed).

Sincerely,
	R.
