
List:       kde-devel
Subject:    Re: Why was this feature removed?
From:       Lars Knoll <lars () trolltech ! com>
Date:       2002-11-28 7:59:21


> Lars Knoll wrote:
> > Just FYI: The encoding used when saving and loading files has nothing to
> > do with Qt behaviour when choosing fonts. The problem is usually that the
> > applications save files either in Latin1 (really bad for everyone outside
> > US and western europe; if a KDE app does this it should be fixed) or the
> > local encoding (meaning you can't save thai when you're locale is set to
> > something else than tis620 or utf8). The only way to get around this in
> > the long term is to start getting rid of locale encodings and start using
> > utf8 everywhere for data (and fileneames).
> >
> > If you need real multilanguage support, the only way to go is to set your
> > locale encoding to utf8. Unfortunately not all applications support this
> > correctly (especially some old command line tools or older XServers).
>
> Tell me if I understand things correctly:
>
> For plain text, there is no encoding.  There are only 8-bit bytes.

In a way. A plain text file is only an array of bytes. It is possible to try 
some magic to guess the encoding, but that is doomed to fail in many cases.
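A minimal illustration of why guessing is doomed (a Python sketch, not Qt code): the very same bytes on disk decode to completely different text depending on which encoding you assume.

```python
# The same three raw bytes read from a "plain text" file...
raw = bytes([0xA1, 0xA2, 0xA3])

# ...are accented Latin symbols under Latin-1...
print(raw.decode("latin-1"))   # '¡¢£'
# ...but Thai letters under TIS-620. Nothing in the file says which is right.
print(raw.decode("tis-620"))   # 'กขฃ'
```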

> These can be displayed as anything, depending on the choice of encoding
> of font used.  In other words, the encoding pertains to deciding what

No. It has nothing to do with the font. Qt/KDE internally represent the data 
in a QString (which stores the data as Unicode). So your application has to 
convert the raw data to Unicode while loading the file, and back to a certain 
encoding while saving. Once the string is correctly converted to Unicode, we 
know it is e.g. Thai (because it lies in the range 0x0E00-0x0E7F in Unicode). 
Qt then tries to find a font that can display these characters (and will use 
either a tis-620/8859-11 or a Unicode font for this).
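The load/convert step described above can be sketched like this (a hedged Python sketch standing in for what QString and the Qt codecs do in C++; the helper names are mine, not Qt API):

```python
def load_text(raw: bytes, encoding: str) -> str:
    """Convert raw file bytes to Unicode -- what an app does on load."""
    return raw.decode(encoding)

def is_thai(ch: str) -> bool:
    """Thai occupies U+0E00..U+0E7F in Unicode."""
    return 0x0E00 <= ord(ch) <= 0x0E7F

# "sawasdee" stored on disk as TIS-620 bytes, decoded to Unicode on load:
text = load_text("\u0e2a\u0e27\u0e31\u0e2a\u0e14\u0e35".encode("tis-620"),
                 "tis-620")
# After conversion, the Unicode range identifies the script, so a Thai-capable
# font can be chosen.
print(all(is_thai(ch) for ch in text))  # True
```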

Using Unicode internally in all of KDE has made KDE development a lot 
simpler, is more future proof, and allows you to intermix Chinese with Thai 
and Arabic (of course you can only save that data to disk in a few encodings, 
e.g. utf8, if you don't want to lose information).
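To make the saving point concrete (again a Python sketch, not KDE code): a string mixing several scripts survives a round trip through utf8, while a single-script 8-bit encoding simply cannot hold it.

```python
mixed = "hello \u0e44\u0e17\u0e22 \u4e2d\u6587"  # English + Thai + Chinese

data = mixed.encode("utf-8")           # works: UTF-8 covers all of Unicode
assert data.decode("utf-8") == mixed   # lossless round trip

try:
    mixed.encode("tis-620")            # Thai-only encoding: no room for Chinese
except UnicodeEncodeError:
    print("tis-620 cannot represent the Chinese characters")
```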

> character to display corresponding to a certain byte of text.  The
> encoding doesn't pertain to the bytes of text themselves.  However
> obviously the text wouldn't be readable unless the bytes that were
> created when you typed certain characters were not subsequently fed back
> into the same encoding map, to display the same characters that were
> originally typed.

Yes. Except that there is one more step involved as explained above. 

> What I am getting at is that for a user who only needs access to two
> languages (English and another), isn't it the most straightforward path
> to simply allow one to choose their font encoding manually?  Then plain
> text only needs to have one byte per character.

What if you want to write one document containing both languages at the same 
time? The real way to go is to use a default encoding that can contain all of 
them. 

> Also, I wonder the ramifications of using something like a UTF8 for text
> files such as program source code?  I mean, there is a place for plain
> text editors, and it seems necessary to allow the user to control the
> font encoding used for the extended character set.

You're still stuck in the encoding mess Unix has had for 20 years. With the 
Unicode standard there is finally a way out of this hell. This is one thing 
we can learn from MS: they've gone a lot further down this road than we have, 
and all of NT/2000/XP are completely Unicode based.

> Ugh.  I can't believe how complicated all this is.
>
> I just want to type Thai and English text and filenames.

Just set your locale to en_US.UTF8 (export LANG=en_US.UTF8) and you should be 
able to do so.

> Also, using UTF8 makes havoc upon the console and terminal environments.
>   It is much preferable to use one byte per character data for
> filenames, so that they can be read in the terminals and consoles.  Then
> again, it is simply a matter of choosing a font with the right extended
> character set, and choosing the encoding.

And a simple 8 bit encoding won't work: you can't squeeze all of Thai and 
e.g. Russian and English into 8 bits. utf8 is a multibyte encoding (one 
character can be encoded in 1-4 bytes, and ASCII stays 1 byte). IMO the best 
choice.
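The variable width is easy to see in a quick Python check: ASCII stays one byte, while characters further up in Unicode take two, three, or four.

```python
# How many UTF-8 bytes each character needs:
for ch in ("a",             # ASCII                  -> 1 byte
           "\u00e9",        # Latin-1 e-acute        -> 2 bytes
           "\u0e01",        # Thai ko kai            -> 3 bytes
           "\U00010000"):   # outside the BMP        -> 4 bytes
    print(hex(ord(ch)), len(ch.encode("utf-8")))
```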

Well, recent shell utilities should be able to deal with utf8, and konsole 
can as well. I have my laptop set to use utf8 as my locale and run konsole 
in utf8 mode. Like that I can use Thai, Russian and German filenames and 
they all display correctly when I type ls.
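The filename case works because on a utf8 locale a non-ASCII filename is just a UTF-8 byte sequence on disk, so any script round-trips. A small sketch (using a throwaway temp directory purely for illustration):

```python
import os
import tempfile

with tempfile.TemporaryDirectory() as d:
    name = "\u0e44\u0e17\u0e22.txt"          # Thai filename
    open(os.path.join(d, name), "w").close() # create it
    print(name in os.listdir(d))             # the name comes back intact
```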

Lars

 
>> Visit http://mail.kde.org/mailman/listinfo/kde-devel#unsub to unsubscribe <<
