List: kde-core-devel
Subject: HOWTO: Deal properly with internationalisation in Qt 2 and KDE 2 code
From: David Faure <david () mandrakesoft ! com>
Date: 2000-11-27 22:35:37
"Sergey A. Sukiyazov" <ssukiyazov@freemail.ru> wrote this nice summary
of common problems with the way US or European developers write
Qt/KDE code, leading to many problems for users of other encodings.
Those who already know most of this stuff should still have a look at
the bit about QTextStream (section 3): I was quite surprised to learn that
its default conversion depends on the underlying device...
Glossary
=========
For those a bit confused by the terms used below, I'll try to define the
most common ones, as an introduction. If you know those terms, skip
to the second part of the document.
* encoding: the way character codes (usually 0-255) are interpreted.
For historical reasons, the same codes between 0 and 255 mean
different things from one country to another. Which character each
code maps to is determined by the "encoding".
"Charset" is more or less the same concept, but the term is mostly used
for fonts.
* ascii: an encoding covering only the range 0-127, with 'American'
unaccented letters, digits, punctuation marks, etc.
* latin1: the encoding used by Western European countries,
which adds accented letters etc. to ascii. It happens to be the
default encoding in many places in Qt and KDE, which is why
non-latin1 users have more trouble than latin1 users :(
Latin1 is also known as ISO-8859-1.
* "local8bit": the correct encoding for a given country (Qt chooses
what local8bit does depending on $LANG and $LC_CTYPE).
This encoding defines other characters (esp. in the range 128-255),
that are needed by a given country/language.
Sergey also calls them the "National characters".
* utf8/utf16: Encodings for "Unicode", i.e. finally, an encoding that doesn't
depend on the country. In Unicode, each code means a single character,
which is why unicode is the real long term solution to this mess... if only
there were Unicode fonts... :)
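To make the glossary a bit more concrete, here is a minimal sketch in plain
C++ (no Qt) of how the same byte values are read under two different
encodings. The helper names latin1ToUnicode and koi8rToUnicode are
hypothetical, and the KOI8-R table only covers the four bytes used for
illustration, not the whole encoding.

```cpp
#include <cstdint>

// latin1 (ISO-8859-1): every byte value is its own Unicode code point.
std::uint16_t latin1ToUnicode(unsigned char byte)
{
    return byte;
}

// KOI8-R (a Russian 8-bit encoding): hypothetical partial table with only
// four entries. Returns 0 for every byte this sketch does not know.
std::uint16_t koi8rToUnicode(unsigned char byte)
{
    switch (byte) {
    case 0xE6: return 0x0424; // CYRILLIC CAPITAL LETTER EF
    case 0xC1: return 0x0430; // CYRILLIC SMALL LETTER A
    case 0xCA: return 0x0439; // CYRILLIC SMALL LETTER SHORT I
    case 0xCC: return 0x043B; // CYRILLIC SMALL LETTER EL
    default:   return 0;      // not covered by this sketch
    }
}
```

The byte 0xE6 is U+00E6 when read as latin1 but U+0424 when read as KOI8-R,
which is why text written in one encoding and read in another turns into
garbage.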
QString stores everything in Unicode, so for any text displayed in the GUI
there's no need to take any special action. The tricky part is knowing how to
convert this to the right encoding when interfacing with the file system, when
storing data somewhere, etc.
Anyway, over to you, Sergey (I took the liberty of rephrasing some sentences):
HOW TO DEAL PROPERLY WITH INTERNATIONALISATION
==============================================
1. A properly localized system must permit the use of national characters
everywhere, including in filenames etc.
2. The methods QString::latin1() and QString::ascii(), and the implicit
cast QString => (const char *), return an empty string for national
Unicode strings, because the conversion only proceeds until the first
Unicode character with a non-zero high byte occurs in the input string.
Yet most programs convert QString to char* through the .latin1() method.
Consider using (const char *)str.local8Bit() (str being a QString) or
(const char *)QFile::encodeName(fileName) (fileName being a QString)
in order to support national characters everywhere (for example, in
system calls etc.).
In general, the most frequently encountered problem is the conversion from
QString to char* through the .latin1() or .ascii() method (latin1() being
the default). These methods convert the Unicode string until they
encounter the first character with a non-zero high byte, so the
conversion stops at the first national character ==> a string of
Russian Unicode characters is not converted at all and these methods
return an empty string.
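The truncation described in point 2 can be modelled in plain C++ without
Qt. This is only a sketch of the behaviour as described above, not Qt's
actual code; latin1Model is a hypothetical name and std::u16string stands
in for QString.

```cpp
#include <string>

// Model of the conversion described above (an assumption, not Qt's code):
// a character whose high byte is zero yields its low byte, any other
// character yields 0, which terminates the resulting C string.
std::string latin1Model(const std::u16string &text)
{
    std::string bytes;
    for (char16_t ch : text) {
        const unsigned high = ch >> 8;
        bytes.push_back(high == 0 ? static_cast<char>(ch & 0xFF) : '\0');
    }
    // Going through c_str() cuts the result at the first 0 byte, just as
    // any function taking a const char * would.
    return std::string(bytes.c_str());
}
```

With this model a pure ASCII string survives unchanged, a string starting
with a Russian letter comes out empty, and a mixed string is cut at its
first Russian letter.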
Furthermore, it is necessary to compile programs using Qt 2.x with the
option -DQT_NO_ASCII_CAST in order to prevent the automatic use of the
.latin1() method to convert QString to (const char*). Compilation may
then abort with an error, for example on code like this:
----C++ Sample Code : Cut from here ------------------------
QString fileName = QString::fromLocal8Bit("Файл");
FILE *f;
....
// implicit QString => (const char *) cast: rejected with -DQT_NO_ASCII_CAST
f = ::fopen( fileName, "r+" );
.....
----C++ Sample Code : Cut from here ------------------------
If the option -DQT_NO_ASCII_CAST is not given, the conversion is done
automatically through QString::latin1(), which results in an empty string.
To avoid the compilation error when the option -DQT_NO_ASCII_CAST is
given, it is necessary to use the following:
----C++ Sample Code : Cut from here ------------------------
QString fileName = QString::fromLocal8Bit("Файл");
FILE *f;
....
// explicit conversion to the local 8-bit file name encoding
f = ::fopen( (const char *)QFile::encodeName(fileName), "r+" );
.....
----C++ Sample Code : Cut from here ------------------------
[Note: KDE's compilation system defines -DQT_NO_ASCII_CAST by default]
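For readers without Qt at hand, the idea behind local8Bit() (convert to
whatever 8-bit encoding the user's locale asks for, and notice when that
is impossible) can be sketched with the standard C++ library. This is an
analogy and not how Qt implements it; toLocal8Bit is a hypothetical helper
and std::wstring stands in for QString.

```cpp
#include <cwchar>
#include <string>
#include <vector>

// Hypothetical helper: convert wide text to the 8-bit encoding of the
// current C locale. Returns false if some character cannot be
// represented in that encoding, instead of silently losing it.
bool toLocal8Bit(const std::wstring &text, std::string &out)
{
    std::mbstate_t state = std::mbstate_t();
    const wchar_t *src = text.c_str();
    // First pass: a null destination only counts the bytes needed.
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        return false;
    std::vector<char> buf(len + 1);
    state = std::mbstate_t();
    src = text.c_str();
    // Second pass: do the conversion, including the terminating 0.
    std::wcsrtombs(buf.data(), &src, buf.size(), &state);
    out.assign(buf.data(), len);
    return true;
}
```

A program would call std::setlocale(LC_ALL, "") from <clocale> once at
startup so that the conversion follows the user's environment; without
that call the default "C" locale is used.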
The above example applies to all conversions from Unicode to 8-bit
strings. In general, you should use the methods QString::local8Bit() and
QString::fromLocal8Bit(...) instead of QString::latin1(),
QString::QString(const char *) and QString::fromLatin1(...).
3. The second error that leads to incorrect conversion of national
characters from Unicode to a one-byte string arises when using the
class QTextStream bound to a file or to an 8-bit string (QByteArray
or QCString).
A text stream (class QTextStream) bound to a file (class QIODevice)
will always use the local8Bit conversion (Encoding == QTextStream::Locale),
but a text stream (class QTextStream) bound to a one-byte string
(class QByteArray or class QCString) will use the Latin1 conversion
(Encoding == QTextStream::Latin1)!
If Encoding == QTextStream::Latin1, no conversion is made, so use this
to write char *s or QCStrings to a stream without conversion.
You can check this by looking at the Qt source code
(file qt-2.2.2/src/tools/qtextstream.cpp, line 556).
If we insert a Unicode string (class QString) containing national
characters into a stream bound to a one-byte string, the method
QTextStream::operator<<(const QString& s) is executed, and if
Encoding == QTextStream::Latin1, it converts 'Russian characters' to '?'!
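The effect of writing Unicode text through a Latin1 stream can be modelled
without Qt as follows. This is a sketch of the behaviour described above,
not the code from qtextstream.cpp; writeAsLatin1 is a hypothetical name
and std::u16string stands in for QString.

```cpp
#include <string>

// Model of writing Unicode text to a Latin1 stream, as described above
// (an assumption, not Qt's code): code units up to 0xFF are written as a
// single byte, everything else is replaced by '?'.
std::string writeAsLatin1(const std::u16string &text)
{
    std::string out;
    for (char16_t ch : text)
        out.push_back(ch <= 0xFF ? static_cast<char>(ch) : '?');
    return out;
}
```

In this model a four-letter Russian word comes out as four question marks,
while ASCII text passes through untouched, so the loss only shows up once
somebody feeds the program non-latin1 data.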
For characters in the ISO-8859-1 or us-ascii encodings the conversion has
no effect: the codes of these characters do not change. In Unicode strings
holding only ISO-8859-1 or us-ascii characters, the high byte is 0 and the
low byte is equal to the character code, which is the same as in the
one-byte encoding. Strings in the UTF8 encoding that contain only us-ascii
characters also remain unchanged (ISO-8859-1 characters above 127 take
two bytes in UTF8).
For national characters, the high byte in Unicode is non-zero and the
low byte does not coincide with the one-byte character code. Such
strings are therefore altered by the conversion QString ==> (const char *).
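The high byte / low byte argument can be checked with a few lines of plain
C++. The helpers highByte and lowByte are hypothetical names, and a
char16_t code unit stands in for one character of a QString.

```cpp
#include <cstdint>

// High and low byte of a 16-bit Unicode code unit.
std::uint8_t highByte(char16_t ch)
{
    return static_cast<std::uint8_t>(ch >> 8);
}

std::uint8_t lowByte(char16_t ch)
{
    return static_cast<std::uint8_t>(ch & 0xFF);
}
```

For 'A' (U+0041) and for U+00E9 (e with acute accent, 0xE9 in latin1) the
high byte is 0 and the low byte is the one-byte code. For the Cyrillic
capital letter EF (U+0424) the high byte is 0x04 and the low byte is 0x24,
which has nothing to do with its KOI8-R code 0xE6.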
If programs are tested using only strings in the encodings ISO-8859-1
or us-ascii, the errors of conversion QString <==> (const char *) will
not be visible, because no conversion happens.
When using the class QTextStream, always explicitly set the
character conversion mode by calling the method
QTextStream::setEncoding(...). This will prevent incorrect
conversion of national characters when using QTextStream.
Thanks for spreading this information among all programmers
writing programs for KDE or simply Qt, so that in the future we can
keep these problems to a minimum.
Best regards
Sukiyazov Sergey <corwin@dstu.rnd.runnet.ru>
<ssukiyazov@freemail.ru>
-------------------------------------------------------
--
David FAURE, david@mandrakesoft.com, faure@kde.org
http://www.mandrakesoft.com/~david/, http://www.konqueror.org/
KDE, Making The Future of Computing Available Today
See http://www.kde.org/kde1-and-kde2.html for how to set up KDE 2