List: kde-core-devel
Subject: Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
From: qt-bugs () trolltech ! com
Date: 2003-07-04 8:31:48
Hi Waldo,
On Thursday, 05. Jun 2003 17:36 Waldo Bastian wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> [Now with patch]
>
> When using utf8 for filename encoding, Qt is unable to access files
> that do not have valid utf8 names. This can happen when a user on a
> Unix system with a utf8 locale accesses a CD that uses a different
> encoding for its files.
>
> The problem is rooted in the following equality:
> QFile::encodeName(QFile::decodeName(path)) == path
>
> Qt is able to process files properly as long as this equality holds
> true. Even when the actual encoding of a filename does not correspond
> to the encoding used by QFile, Qt will, as long as this equality
> holds, pass the same 8bit string that it receives from e.g. readdir(3)
> to system functions such as open(2). In such case the visual
> representation of the filename will be incorrect but actions on the
> file will continue to work as expected.
>
> When the equality does not hold true, Qt will pass an 8bit string to
> system functions such as open(2) that differs from the 8bit string it
> received from e.g. readdir(3). Such action is likely to fail: this
> different 8bit string will most likely not point to an existing file
> or worse, point to a different file.
>
> Not every 8bit string is a valid utf8 sequence. When QFile uses a utf8
> codec it replaces invalid utf8 sequences with QChar::replacement in the
> QString. When such a QString is then converted back to utf8 again, the
> resulting 8bit string is a valid utf8 sequence but no longer identical
> to the original 8bit string.
>
> I would like to propose that QFile::decodeName/encodeName uses a
> modified utf8 codec such that the conversion utf8 ->QString -> utf8
> always results in the original 8bit string, even if such string is not
> a valid utf8 sequence.
>
> I have attached a patch that illustrates what such a modified codec could
> look like. I have used 0xfffd as the escape character; maybe another
> character such as 0xffff would be more suitable.
>
> I am aware that this very problem can be solved for KDE applications
> by providing our own encoding function via QFile::setEncodingFunction
> but since this problem will affect Qt-only applications as well, it
> would be better if it could be solved in Qt itself.
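To make the failure concrete, here is a minimal standalone sketch of the
round trip with a plain replacement-character decoder. It uses std::string
and std::u32string rather than QCString and QString, and decodeLossy() and
encode() are made-up names, not Qt API. The decoder is deliberately
simplified (it does not reject overlong forms, for instance).

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const char32_t kReplacement = 0xfffd;  // U+FFFD REPLACEMENT CHARACTER

// Hypothetical helper: decode UTF-8, turning every byte that does not start
// a complete sequence into U+FFFD. The original byte value is lost.
std::u32string decodeLossy(const std::string &in)
{
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        int need = -1;  // -1: this byte cannot start a sequence
        char32_t u = 0;
        if (c < 0x80)                { u = c;        need = 0; }
        else if ((c & 0xe0) == 0xc0) { u = c & 0x1f; need = 1; }
        else if ((c & 0xf0) == 0xe0) { u = c & 0x0f; need = 2; }
        else if ((c & 0xf8) == 0xf0) { u = c & 0x07; need = 3; }

        bool ok = need >= 0 && i + static_cast<std::size_t>(need) < in.size();
        for (int k = 1; ok && k <= need; ++k) {
            const unsigned char t = static_cast<unsigned char>(in[i + k]);
            ok = (t & 0xc0) == 0x80;  // must be a continuation byte
            u = (u << 6) | (t & 0x3f);
        }
        if (ok) { out.push_back(u); i += need + 1; }
        else    { out.push_back(kReplacement); ++i; }
    }
    return out;
}

// Hypothetical helper: plain UTF-8 encoder.
std::string encode(const std::u32string &in)
{
    std::string out;
    for (char32_t u : in) {
        if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xc0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3f));
        } else if (u < 0x10000) {
            out += static_cast<char>(0xe0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (u & 0x3f));
        } else {
            out += static_cast<char>(0xf0 | (u >> 18));
            out += static_cast<char>(0x80 | ((u >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (u & 0x3f));
        }
    }
    return out;
}

// Valid UTF-8 survives the round trip; a Latin-1 encoded name does not.
void demonstrate()
{
    assert(encode(decodeLossy("caf\xc3\xa9")) == "caf\xc3\xa9");
    assert(encode(decodeLossy("caf\xe9")) != "caf\xe9");
}
```

With this, the four-byte Latin-1 name "caf\xe9" comes back as the six bytes
"caf\xef\xbf\xbd", so open(2) would be handed a name that readdir(3) never
returned.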
You're right, the problem you describe needs a solution. However,
I don't like using 0xfffd+QChar(ch) for mapping these characters to
Unicode, as the forward and back transformations violate the utf8
encoding a lot.
I've implemented a slightly different solution that maps each such byte to
a surrogate pair in the supplementary private use area, as this should
hopefully lead to fewer conflicts. The only disadvantage is that
currently (until we have better surrogate handling in Qt) each of
these characters will show up as two boxes, rather than as one box followed
by the char mapped from latin1. The diff against qt-3.2 beta2 is attached.
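For reference, the arithmetic behind the mapping, as a standalone sketch
(SurrogatePair, escapeByte(), toCodePoint() and unescape() are illustrative
names, not Qt API): an undecodable byte b becomes the code point
0x10fe00 + b, which in UTF-16 is the pair (0xdbff, 0xde00 + b).

```cpp
#include <cassert>
#include <cstdint>

struct SurrogatePair {
    std::uint16_t high;
    std::uint16_t low;
};

// Escape one undecodable byte as a surrogate pair. The constants 0xdbff and
// 0xde00 are the ones used in the attached diff.
SurrogatePair escapeByte(std::uint8_t byte)
{
    return { 0xdbff, static_cast<std::uint16_t>(0xde00 + byte) };
}

// Standard UTF-16 rule for combining a high and a low surrogate.
std::uint32_t toCodePoint(SurrogatePair p)
{
    assert(p.high >= 0xd800 && p.high <= 0xdbff);
    assert(p.low >= 0xdc00 && p.low <= 0xdfff);
    return 0x10000u
         + ((static_cast<std::uint32_t>(p.high) - 0xd800u) << 10)
         + (static_cast<std::uint32_t>(p.low) - 0xdc00u);
}

// Recover the original byte from an escaped code point.
std::uint8_t unescape(std::uint32_t codePoint)
{
    assert(codePoint >= 0x10fe00u && codePoint <= 0x10feffu);
    return static_cast<std::uint8_t>(codePoint - 0x10fe00u);
}
```

For the Latin-1 byte 0xe9, escapeByte() gives (0xdbff, 0xdee9),
toCodePoint() gives 0x10fee9, and unescape() gives back 0xe9.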
Cheers,
Lars
--
Lars Knoll, Senior Software Engineer
Trolltech AS, Waldemar Thranes gt. 98, N-0175 Oslo, Norway
["patch.diff" (text/x-diff)]
--- src/codecs/qutfcodec.cpp 2003-07-04 08:30:46 -0000
+++ src/codecs/qutfcodec.cpp 2003-07-04 08:30:46 -0000
@@ -70,8 +70,15 @@
}
}
if (u > 0xffff) {
- *cursor++ = 0xf0 | ((uchar) (u >> 18));
- *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+ // see QString::fromUtf8() and QString::utf8() for explanations
+ if (u > 0x10fe00 && u < 0x10ff00) {
+ *cursor++ = (u - 0x10fe00);
+ ++ch;
+ continue;
+ } else {
+ *cursor++ = 0xf0 | ((uchar) (u >> 18));
+ *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+ }
} else {
*cursor++ = 0xe0 | ((uchar) (u >> 12));
}
@@ -79,7 +86,7 @@
}
*cursor++ = 0x80 | ((uchar) (u&0x3f));
}
- ch++;
+ ++ch;
}
*cursor = 0;
lenInOut = cursor - (uchar*)rstr.data();
--- src/tools/qstring.cpp 2003-07-04 08:30:46 -0000
+++ src/tools/qstring.cpp 2003-07-04 08:30:46 -0000
@@ -5184,8 +5184,18 @@
}
}
if (u > 0xffff) {
- *cursor++ = 0xf0 | ((uchar) (u >> 18));
- *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+ // if people are working in utf8, but strings are encoded in eg. latin1, the resulting
+ // name might be invalid utf8. This and the corresponding code in fromUtf8 takes care
+ // we can handle this without losing information. This can happen with latin filenames
+ // and a utf8 locale under Unix.
+ if (u > 0x10fe00 && u < 0x10ff00) {
+ *cursor++ = (u - 0x10fe00);
+ ++ch;
+ continue;
+ } else {
+ *cursor++ = 0xf0 | ((uchar) (u >> 18));
+ *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+ }
} else {
*cursor++ = 0xe0 | ((uchar) (u >> 12));
}
@@ -5193,7 +5203,7 @@
}
*cursor++ = 0x80 | ((uchar) (u&0x3f));
}
- ch++;
+ ++ch;
}
rstr.truncate( cursor - (uchar*)rstr.data() );
return rstr;
@@ -5220,13 +5230,14 @@
if ( len < 0 )
len = strlen( utf8 );
QString result;
- result.setLength( len ); // worst case
+ result.setLength( len*2 ); // worst case
QChar *qch = (QChar *)result.unicode();
uint uc = 0;
int need = 0;
+ int error = -1;
uchar ch;
for (int i=0; i<len; i++) {
- ch = *utf8++;
+ ch = utf8[i];
if (need) {
if ( (ch&0xc0) == 0x80 ) {
uc = (uc << 6) | (ch & 0x3f);
@@ -5244,25 +5255,42 @@
}
}
} else {
- // error
- *qch++ = QChar::replacement;
+ // See QString::utf8() for explanation.
+ //
+ // The surrogate below corresponds to a Unicode value of (0x10fe00+ch) which
+ // is in one of the private use areas of Unicode.
+ i = error;
+ *qch++ = QChar(0xdbff);
+ *qch++ = QChar(0xde00+((uchar)utf8[i]));
need = 0;
}
+ error = -1;
} else {
if ( ch < 128 ) {
*qch++ = ch;
} else if ((ch & 0xe0) == 0xc0) {
uc = ch & 0x1f;
need = 1;
+ error = i;
} else if ((ch & 0xf0) == 0xe0) {
uc = ch & 0x0f;
need = 2;
+ error = i;
} else if ((ch&0xf8) == 0xf0) {
uc = ch & 0x07;
need = 3;
+ error = i;
}
}
}
+ if (error != -1) {
+ // we have some invalid characters remaining we need to add to the string
+ for (int i = error; i < len; ++i) {
+ *qch++ = QChar(0xdbff);
+ *qch++ = QChar(0xde00+((uchar)utf8[i]));
+ }
+ }
+
result.truncate( qch - result.unicode() );
return result;
}
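
The two halves of the patch can be sketched together as a standalone round
trip. This is not the Qt code: it uses std::string and std::u32string, and
decodeName()/encodeName() here are free functions named after the QFile
ones for readability only. It also makes two assumptions beyond the hunks
above: bytes that cannot start a sequence are escaped too, and overlong
sequences as well as sequences that already encode a code point in the
escape range are treated as invalid, so that they also survive the round
trip.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

const char32_t kEscapeBase = 0x10fe00;  // base value taken from the patch

// Only bytes >= 0x80 can be undecodable, so only this sub-range is produced.
bool isEscape(char32_t u)
{
    return u >= kEscapeBase + 0x80 && u <= kEscapeBase + 0xff;
}

// Decode UTF-8; a byte that does not start an acceptable sequence becomes
// the private-use code point kEscapeBase + byte instead of U+FFFD.
std::u32string decodeName(const std::string &in)
{
    static const char32_t minValue[4] = { 0, 0x80, 0x800, 0x10000 };
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        const unsigned char c = static_cast<unsigned char>(in[i]);
        int need = -1;  // -1: this byte cannot start a sequence
        char32_t u = 0;
        if (c < 0x80)                { u = c;        need = 0; }
        else if ((c & 0xe0) == 0xc0) { u = c & 0x1f; need = 1; }
        else if ((c & 0xf0) == 0xe0) { u = c & 0x0f; need = 2; }
        else if ((c & 0xf8) == 0xf0) { u = c & 0x07; need = 3; }

        bool ok = need >= 0 && i + static_cast<std::size_t>(need) < in.size();
        for (int k = 1; ok && k <= need; ++k) {
            const unsigned char t = static_cast<unsigned char>(in[i + k]);
            ok = (t & 0xc0) == 0x80;  // must be a continuation byte
            u = (u << 6) | (t & 0x3f);
        }
        // Assumption: reject overlong forms, values past U+10FFFF and
        // literal escape code points, so that encodeName() inverts us.
        ok = ok && u >= minValue[need] && u <= 0x10ffff && !isEscape(u);

        if (ok) { out.push_back(u); i += need + 1; }
        else    { out.push_back(static_cast<char32_t>(kEscapeBase + c)); ++i; }
    }
    return out;
}

// Encode to UTF-8, writing escaped code points back as the single raw byte.
std::string encodeName(const std::u32string &in)
{
    std::string out;
    for (char32_t u : in) {
        assert(u <= 0x10ffff);
        if (isEscape(u)) {
            out += static_cast<char>(u - kEscapeBase);
        } else if (u < 0x80) {
            out += static_cast<char>(u);
        } else if (u < 0x800) {
            out += static_cast<char>(0xc0 | (u >> 6));
            out += static_cast<char>(0x80 | (u & 0x3f));
        } else if (u < 0x10000) {
            out += static_cast<char>(0xe0 | (u >> 12));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (u & 0x3f));
        } else {
            out += static_cast<char>(0xf0 | (u >> 18));
            out += static_cast<char>(0x80 | ((u >> 12) & 0x3f));
            out += static_cast<char>(0x80 | ((u >> 6) & 0x3f));
            out += static_cast<char>(0x80 | (u & 0x3f));
        }
    }
    return out;
}

// The equality from the original report.
bool roundTrips(const std::string &raw)
{
    return encodeName(decodeName(raw)) == raw;
}
```

In this sketch roundTrips() holds for valid UTF-8 such as "caf\xc3\xa9" and
also for the Latin-1 name "caf\xe9", whose last byte is carried through the
decoded string as the code point 0x10fee9.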