
List:       kde-bugs-dist
Subject:    Bug#9286: KWord exports HTML files encoded in UTF-8
From:       Nicolas GOUTTE <nicog () snafu ! de>
Date:       2001-01-23 0:35:29

I am not the programmer of KWord's HTML export filter, but I would like to 
comment on two bug reports concerning it:

#6721 "saving as htm - Umlaute not encoded"
#9286 "html-export doesn't crash, but the result is trash"

Both reports complain that the HTML export filter does not encode umlauts 
correctly. I am afraid that this is not true.

The programmer of the export filter has chosen to create an HTML file 
encoded in UTF-8, as it is the default encoding of the whole KOffice package.
However, that means that the file looks odd when you view it in an editor 
that displays only ISO-8859-1 (Latin 1), as the characters do not look as 
expected.
This is because UTF-8 encodes non-ASCII Unicode characters in multiple 
bytes!
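
To make the effect concrete, here is a small Python sketch (Python is only 
used for illustration here; it has nothing to do with KWord or the filter 
itself). It shows the two bytes that UTF-8 uses for one umlaut and what a 
Latin-1-only editor makes of them:

```python
# Illustration only: one UTF-8 encoded umlaut as seen by a Latin-1 editor.
text = "\u00e4"                  # a-umlaut, a single Unicode character

utf8_bytes = text.encode("utf-8")
print(len(utf8_bytes))           # number of bytes used for one character

# A Latin-1-only editor displays every byte as a character of its own,
# so the umlaut shows up as two unrelated characters.
misread = utf8_bytes.decode("latin-1")
print(misread)

# ASCII characters keep their single-byte form in UTF-8, which is why
# text without accents looks the same in both encodings.
ascii_bytes = "a".encode("utf-8")
print(len(ascii_bytes))
```

The file content is the same in both views; only the interpretation of the 
bytes differs.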

Note that Konqueror has problems displaying such files (e.g. bug #18536)!

However, the HTML file itself is a correct and valid file. It is compliant 
with HTML 4.01, as the encoding is given through the line:

<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">

Encoding the umlauts as entities like &auml; takes more space, and you can 
encode only ISO-8859-1, Greek, a few scientific and a few other characters 
that way. However, you are right that it is more human-readable than UTF-8. 
And yes, it would be perfect if you could choose the encoding you want 
before the export filter starts its work...
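
As a rough sketch of the limitation of named entities, the following Python 
snippet uses the HTML 4 entity table from the standard library 
(html.entities.codepoint2name). The function name to_entities and the 
fallback to numeric references like &#269; are my own choices for the 
example, not something the export filter does:

```python
from html.entities import codepoint2name  # table of HTML 4 named entities


def to_entities(text):
    """Replace non-ASCII characters by entities; leave ASCII markup alone."""
    out = []
    for ch in text:
        cp = ord(ch)
        if cp < 128:
            out.append(ch)                       # plain ASCII, incl. < > & "
        elif cp in codepoint2name:
            out.append("&" + codepoint2name[cp] + ";")
        else:
            out.append("&#%d;" % cp)             # no named entity exists
    return "".join(out)


# a-umlaut and Greek alpha have names; c-caron (U+010D) does not.
sample = "<p>\u00e4 \u03b1 \u010d</p>"
converted = to_entities(sample)
print(converted)
```

The result is pure ASCII, but characters outside the entity table can only 
be written as numbers, which is hardly more readable than UTF-8.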

For now, if you have the program recode(1), there are possible workarounds.

If you want entities like &auml;:

	recode -d utf-8..html "yourfile.html"

Please replace "yourfile.html" with the name of the exported file; this file 
will be overwritten by recode(1).
Note: the -d is important, or else recode(1) will also change < > & " into 
&lt; &gt; &amp; &quot;
With an editor, you may then change the line:
<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
into:
<META HTTP-EQUIV="Content-Type" content="text/html; charset=us-ascii">

If you want ISO-8859-1 (latin1):

	recode utf-8..latin1 "yourfile.html"

and then with an editor, you MUST change the line:
<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">
into:
<META HTTP-EQUIV="Content-Type" content="text/html; charset=ISO-8859-1">
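
If recode(1) is not at hand, the same two steps (re-encode, then fix the 
charset in the META line) can be sketched in Python. This is only an 
illustration under my own assumptions: the function name is made up, every 
occurrence of "charset=utf-8" in the file is replaced (not only the one in 
the META line), and characters that do not exist in Latin 1 are written as 
numeric references, which is not necessarily what recode(1) does:

```python
import re


def utf8_html_to_latin1(data):
    """Re-encode UTF-8 HTML bytes as ISO-8859-1 and adjust the charset."""
    text = data.decode("utf-8")
    # Simplification: replaces every "charset=utf-8", wherever it occurs.
    text = re.sub(r"charset=utf-8", "charset=ISO-8859-1", text,
                  flags=re.IGNORECASE)
    # Characters outside Latin 1 become numeric references like &#945;
    return text.encode("iso-8859-1", errors="xmlcharrefreplace")


source = (
    '<META HTTP-EQUIV="Content-Type" content="text/html; charset=utf-8">\n'
    "<p>\u00e4 \u03b1</p>\n"
).encode("utf-8")

result = utf8_html_to_latin1(source)
print(result)
```

To convert a real file, read it in binary mode, pass the bytes to the 
function and write the result back in binary mode, preferably to a new file 
so that the original export is kept.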


*PLEASE NOTE*
Note 1: the HTML export filter still has the bug that part of the text is 
not saved, so bug #9286 ("html-export doesn't crash, but the result is 
trash") is NOT closed but becomes similar to #9440 and #9481 ("Kword - Save 
As HTML bug"). I can even confirm that this bug still exists.
Note 2: I have not tested whether KWord can correctly import the exported 
HTML file back. (This would be another bug, if it does not!)
