------Boundary=d6saesis32t5dhizyuu8ydsrtis3g6
Content-Type: text/plain; charset="iso-8859-1"
MIME-Version: 1.0

Hi Waldo,

On Thursday, 05. Jun 2003 17:36 Waldo Bastian wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> [Now with patch]
>
> When using utf8 for filename encoding, Qt is unable to access files
> that do not have valid utf8 names. This can happen when a user on a
> Unix system with a utf8 locale accesses a CD that uses a different
> encoding for its files.
>
> The problem is rooted in the following equality:
> QFile::encodeName(QFile::decodeName(path)) == path
>
> Qt is able to process files properly as long as this equality holds
> true. Even when the actual encoding of a filename does not correspond
> to the encoding used by QFile, Qt will, as long as this equality
> holds, pass the same 8bit string that it receives from e.g. readdir(3)
> to system functions such as open(2). In such a case the visual
> representation of the filename will be incorrect but actions on the
> file will continue to work as expected.
>
> When the equality does not hold true, Qt will pass an 8bit string to
> system functions such as open(2) that differs from the 8bit string it
> received from e.g. readdir(3). Such an action is likely to fail: this
> different 8bit string will most likely not point to an existing file
> or, worse, point to a different file.
>
> Not every 8bit string is a valid utf8 sequence. When QFile uses a utf8
> codec, it replaces invalid utf8 sequences with QChar::replacement in
> the QString. When such a QString is then converted back to utf8 again,
> the resulting 8bit string is a valid utf8 sequence but no longer
> identical to the original 8bit string.
>
> I would like to propose that QFile::decodeName/encodeName use a
> modified utf8 codec such that the conversion utf8 -> QString -> utf8
> always results in the original 8bit string, even if such a string is
> not a valid utf8 sequence.
>
> I have attached a patch that illustrates what such a modified codec
> could look like. I have used 0xfffd as escape character, maybe another
> character such as 0xffff would be more suitable.
>
> I am aware that this very problem can be solved for KDE applications
> by providing our own encoding function via QFile::setEncodingFunction
> but since this problem will affect Qt-only applications as well, it
> would be better if it could be solved in Qt itself.

You're right. One needs a solution to the problem you described.

However, I don't like using 0xfffd+QChar(ch) for mapping these characters
to Unicode, as the forward and back transformations violate the utf8
encoding a lot.

I've implemented a slightly different solution, mapping the characters to
a surrogate pair in the supplementary private use area, as this should
hopefully lead to fewer conflicts. The only disadvantage is that currently
(until we have better surrogate handling in Qt) each of these characters
will show up as two boxes instead of one box and the char mapped from
latin1.

The diff against qt-3.2 beta2 is attached.

Cheers,
Lars

--
Lars Knoll, Senior Software Engineer
Trolltech AS, Waldemar Thranes gt.
98, N-0175 Oslo, Norway

------Boundary=d6saesis32t5dhizyuu8ydsrtis3g6
Content-Type: text/x-diff
Content-Disposition: attachment; filename="patch.diff"
MIME-Version: 1.0

--- src/codecs/qutfcodec.cpp 2003-07-04 08:30:46 -0000
+++ src/codecs/qutfcodec.cpp 2003-07-04 08:30:46 -0000
@@ -70,8 +70,15 @@
                     }
                 }
                 if (u > 0xffff) {
-                    *cursor++ = 0xf0 | ((uchar) (u >> 18));
-                    *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+                    // see QString::fromUtf8() and QString::utf8() for explanations
+                    if (u > 0x10fe00 && u < 0x10ff00) {
+                        *cursor++ = (u - 0x10fe00);
+                        ++ch;
+                        continue;
+                    } else {
+                        *cursor++ = 0xf0 | ((uchar) (u >> 18));
+                        *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+                    }
                 } else {
                     *cursor++ = 0xe0 | ((uchar) (u >> 12));
                 }
@@ -79,7 +86,7 @@
             }
             *cursor++ = 0x80 | ((uchar) (u&0x3f));
         }
-        ch++;
+        ++ch;
     }
     *cursor = 0;
     lenInOut = cursor - (uchar*)rstr.data();
--- src/tools/qstring.cpp 2003-07-04 08:30:46 -0000
+++ src/tools/qstring.cpp 2003-07-04 08:30:46 -0000
@@ -5184,8 +5184,18 @@
                     }
                 }
                 if (u > 0xffff) {
-                    *cursor++ = 0xf0 | ((uchar) (u >> 18));
-                    *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+                    // if people are working in utf8, but strings are encoded in eg. latin1, the resulting
+                    // name might be invalid utf8. This and the corresponding code in fromUtf8 takes care
+                    // we can handle this without losing information. This can happen with latin filenames
+                    // and a utf8 locale under Unix.
+                    if (u > 0x10fe00 && u < 0x10ff00) {
+                        *cursor++ = (u - 0x10fe00);
+                        ++ch;
+                        continue;
+                    } else {
+                        *cursor++ = 0xf0 | ((uchar) (u >> 18));
+                        *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+                    }
                 } else {
                     *cursor++ = 0xe0 | ((uchar) (u >> 12));
                 }
@@ -5193,7 +5203,7 @@
             }
             *cursor++ = 0x80 | ((uchar) (u&0x3f));
         }
-        ch++;
+        ++ch;
     }
     rstr.truncate( cursor - (uchar*)rstr.data() );
     return rstr;
@@ -5220,13 +5230,14 @@
     if ( len < 0 )
         len = strlen( utf8 );
     QString result;
-    result.setLength( len ); // worst case
+    result.setLength( len*2 ); // worst case
     QChar *qch = (QChar *)result.unicode();
     uint uc = 0;
     int need = 0;
+    int error = -1;
     uchar ch;
     for (int i=0; i