'Re: KDE does not recognice KWord docs'


List:       koffice
Subject:    Re: KDE does not recognice KWord docs
From:       Holger Schroeder <holger-kde () holgis ! net>
Date:       2003-09-29 10:36:38

Hi all,

On Saturday 27 September 2003 23:48, you wrote:
> On Saturday 27 September 2003 20:25, Nicolas Goutte wrote:
> > It is part of KDE. KZip 3.1 will simply generate "fat"-based data, KZip
> > 3.2 will give "unx" ones.
>
> Thanks guys for this investigation. I wasn't aware of that change.
> (Holger: it seems that this change broke the magic-recognition of KOffice
> files, where the uncompressed file called "mimetype" would have its
> contents at position 38 in the ZIP file).
>
the structure of a local file header in a zip archive is:

	local file header signature     4 bytes (0x04034b50)
	version needed to extract       2 bytes
	general purpose bit flag        2 bytes
	compression method              2 bytes
	last mod file time              2 bytes
	last mod file date              2 bytes
	crc-32                          4 bytes
	compressed size                 4 bytes
	uncompressed size               4 bytes
	file name length                2 bytes
	extra field length              2 bytes
	file name                       (variable size)
	extra field                     (variable size)

(from http://www.pkware.com/products/enterprise/white_papers/appnote.html)
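
to make the arithmetic concrete: the fixed part of the header is 4+2+2+2+2+2+4+4+4+2+2 = 30 bytes, so the content of the first file starts at 30 + file name length + extra field length. the following is only a minimal sketch of that calculation, not the kzip code. the names firstContentOffset and readLe16 and the "return 0 on error" convention are made up for this example.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Size of the fixed part of a local file header, from the signature up to
// and including the extra field length: 4+2+2+2+2+2+4+4+4+2+2 = 30 bytes.
constexpr std::size_t kLocalHeaderFixedSize = 30;
constexpr std::size_t kNameLengthOffset = 26;   // "file name length" field
constexpr std::size_t kExtraLengthOffset = 28;  // "extra field length" field

// ZIP stores multi-byte header fields in little-endian byte order.
inline std::uint16_t readLe16(const std::vector<std::uint8_t>& buf, std::size_t pos) {
    assert(pos + 1 < buf.size());
    return static_cast<std::uint16_t>(buf[pos] | (buf[pos + 1] << 8));
}

// Offset of the first content byte of the first entry in the archive.
// Returns 0 if the buffer is too short or does not start with the local
// file header signature (hypothetical error convention for this sketch).
inline std::size_t firstContentOffset(const std::vector<std::uint8_t>& buf) {
    if (buf.size() < kLocalHeaderFixedSize) return 0;
    // The signature 0x04034b50 appears on disk as the bytes 'P' 'K' 0x03 0x04.
    if (buf[0] != 'P' || buf[1] != 'K' || buf[2] != 0x03 || buf[3] != 0x04) return 0;
    return kLocalHeaderFixedSize
         + readLe16(buf, kNameLengthOffset)
         + readLe16(buf, kExtraLengthOffset);
}
```

with the file name "mimetype" (8 bytes) and no extra field this gives 30 + 8 + 0 = 38, the position mentioned in the quoted mail. an offset of 55 would correspond to an extra field of 17 bytes, assuming the same file name.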

as you can see from the patch contained in thomas zander's first mail in this 
thread, the mime type recognition is done by detecting the string 
application/x-kword in the file at a fixed offset.

when you look at "old" kzip files as i coded them, there was no use for an 
extra field. so the file contains the string mimetypeapplication/x-kword: 
the filename and the beginning of the content are "concatenated" when there 
is no extra field. in this case (we know how long the filename is, and we 
know that there is no extra field) the position where the content, i.e. the 
mimetype string, begins is fixed.
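
a small sketch of why a fixed-offset check of this kind breaks as soon as an extra field appears. this is not the actual magic-detection code. hasMagicAt and fakeFirstEntry are hypothetical helpers, the 30 header bytes are just placeholders, and the extra field is filled with dummy bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Returns true if `magic` occurs in `data` exactly at position `offset`.
inline bool hasMagicAt(const std::string& data, std::size_t offset,
                       const std::string& magic) {
    return offset <= data.size() && data.compare(offset, magic.size(), magic) == 0;
}

// Hypothetical stand-in for the first entry of an archive: 30 placeholder
// bytes for the fixed header fields, then file name, extra field, content,
// in the order the local file header defines.
inline std::string fakeFirstEntry(const std::string& name, std::size_t extraLen,
                                  const std::string& content) {
    // Both length fields in the header are only 2 bytes wide.
    assert(name.size() <= 0xFFFF && extraLen <= 0xFFFF);
    return std::string(30, '\0') + name + std::string(extraLen, 'x') + content;
}
```

for example, hasMagicAt(fakeFirstEntry("mimetype", 0, "application/x-kword"), 38, "application/x-kword") is true, but with an extra field of 17 bytes the same check at offset 38 fails and the string is found at offset 55 instead.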

the extra field as it is introduced now lets us store the following per-file 
information, which the old way could not:

struct ParseFileInfo {
   // file related info
 //  QCString name;              // filename
   mode_t perm;                  // permissions of this file
   time_t atime;                 // last access time (UNIX format)
   time_t mtime;                 // modification time (UNIX format)
   time_t ctime;                 // creation time (UNIX format)
   int uid;                      // user id (-1 if not specified)
   int gid;                      // group id (-1 if not specified)
   QCString guessed_symlink;     // guessed symlink target
   int extralen;                 // length of extra field
..
}

so in the "not-koffice" case we should by default write the new fileformat, as 
these values are kind of useful there.

so how to fix this for koffice ?

i see two possibilities:

1.) add an option to allow writing of zip files without this extra info and 
use it in koffice, as the permissions and Xtimes are not needed in the files.

2.) as we only want to have "application/x-kword" at a fixed offset in the 
zip-archive, it would also be possible to not create a first file with the 
filename mimetype and the _content_ application/x-kword, but a first file 
with the _name_ mimetypeapplication/x-kword and any content after that.
this would have the advantage, that our mimetype is _always_ at this offset, 
no matter which different extra fields with which lengths will be ever 
introduced, as the file name is saved in the zip file before the extra 
fields. as long as nobody creates a "zip format version 2", which will then 
be a whole new format, we would have solved this issue.
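
a sketch of possibility 2, to show that the string stays where it is whatever the extra field looks like. again this is only an illustration: layoutEntry and findMagic are hypothetical helpers, and the 30 placeholder bytes stand in for the fixed header fields.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// The file name starts right after the 30 fixed header bytes, before any
// extra field, so a string inside the name has a position that does not
// depend on the extra field length.
constexpr std::size_t kNameOffset = 30;

// Hypothetical helper: lays out name, extra field and content in the order
// the local file header defines, after 30 placeholder header bytes.
inline std::string layoutEntry(const std::string& name, const std::string& extra,
                               const std::string& content) {
    // The file name length field in the header is only 2 bytes wide.
    assert(name.size() <= 0xFFFF);
    return std::string(kNameOffset, '\0') + name + extra + content;
}

// Position of the first occurrence of `magic`, or std::string::npos.
inline std::size_t findMagic(const std::string& entry, const std::string& magic) {
    return entry.find(magic);
}
```

with the name mimetypeapplication/x-kword, findMagic returns 30 + 8 = 38 for an empty extra field as well as for one of 17 or 100 bytes. that is the same offset the old files without an extra field had.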

the only thing that should be checked is, how openoffice would handle these 
files. ok, i looked at an example file from openwriter. they have no first 
mimetype file in their format, they directly start with the file content.xml.

so i guess they neither care about a file named "mimetype" nor about a file 
called "mimetypeapplication/x-kword".

so somebody could change the code in koffice that writes the mimetype to do 
this instead, and it should work. that would have the advantage that we don't 
have to introduce a new function in kzip to not write the extra stuff, which 
would be a little bit ugly, if i understood it right, with all these 
virtual_hooks. and we wouldn't have to always check that nobody breaks kzip in 
the future.

while we are at it, i would not only call the first file 
mimetypeapplication/x-kword, but also suffix it with the version it was 
created with. perhaps we can use it for something in the future, and these 
few bytes do not hurt anybody. so it would be called for example 
mimetypeapplication/x-kword-1.3.0 or so.

> > No, sure. I cannot remember if KOffice 1.2.x had its own KZip (named
> > KoZip) or not. So I do not know if the change has to be done if KDE 3.1.x
> > or in KOffice 1.2.x.
>
> CVS says that KoZip was part of KOffice-1.2.x indeed. But:
> > But in any case, I am really starting to ask me if for the last
> > KOffice-own file format it is useful to have again a subtle change.
> > However this would mean to force KZip 3.2 to be able to write in the
> > "fat" modus, either on command or simply for uncompressed files.
>
> Yes, I think we shouldn't do something that changes our 'magic'
> recognition:
> * other projects/tools/etc. might use the magic we had
> previously, this change will break it

by using solution 2 it would be unbroken again

> * are we sure that the new offset is
> always going to be 55? What's between position 30 and position 55? This
> looks more fragile to me.

the extra field can be of variable length, and the file content starts 
directly after the filename and the extra field. so there is no general 
solution if we want the detection string to be in the content and at the same 
time allow this extra field. in the code of kzip.cpp there is already a 
parseInfoZipUnixNew, and at some point somebody will surely introduce yet 
another length for the extra field...

> * OpenOffice.org uses the "fat" format in ZIP files, and the whole point of
> switching to ZIP was to use the same thing as they do, and particularly
> having the same kind of magic mimetype recognition.
>
i have no idea how they are doing mimetype detection. iirc their "weak" 
detection is solely based on the filename extension, and their "strong" 
detection they use when loading a file parses their manifest.xml, a kind of 
"table of contents" for their archive. but i may be wrong here...

> So we need to fix KZip to give us "fat" format again. Holger: do you know
> if that's easily doable? Is there any benefit in the "unx" stuff? Should
> just move "back" to fat, or should we add a method to choose the format?
>
> (In case you missed the rest of the thread: zipinfo <file> shows "fat" or
> "unx"; "fat" on OOo and kdelibs-3.1-generated files, and "unx" in
> kdelibs-cvs-generated files)

unfortunately i am quite busy with university, so i can't hack on that right 
now, but i will follow this discussion, so feel free to ask if i did not 
explain something well enough.

Holger

____________________________________
koffice mailing list
koffice@mail.kde.org
To unsubscribe please visit:
http://mail.kde.org/mailman/listinfo/koffice