
List:       kde-core-devel
Subject:    Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
From:       qt-bugs () trolltech ! com
Date:       2003-07-04 8:31:48

Hi Waldo,

On Thursday, 05. Jun 2003 17:36 Waldo Bastian wrote:
> -----BEGIN PGP SIGNED MESSAGE-----
> Hash: SHA1
>
> [Now with patch]
>
> When using utf8 for filename encoding, Qt is unable to access files
> that do not have valid utf8 names. This can happen when a user on a
> Unix system with a utf8 locale accesses a CD that uses a different
> encoding for its files.
>
> The problem is rooted in the following equality:
> 	QFile::encodeName(QFile::decodeName(path)) == path
>
> Qt is able to process files properly as long as this equality holds
> true. Even when the actual encoding of a filename does not correspond
> to the encoding used by QFile, Qt will, as long as this equality
> holds, pass the same 8bit string that it receives from e.g. readdir(3)
> to system functions such as open(2). In such a case the visual
> representation of the filename will be incorrect but actions on the
> file will continue to work as expected.
>
> When the equality does not hold true, Qt will pass an 8bit string to
> system functions such as open(2) that differs from the 8bit string it
> received from e.g. readdir(3). Such action is likely to fail: this
> different 8bit string will most likely not point to an existing file
> or worse, point to a different file.
>
> Not every 8bit string is a valid utf8 sequence. When QFile uses a utf8
> codec, it replaces invalid utf8 sequences with QChar::replacement in the
> QString. When such a QString is then converted back to utf8 again, the
> resulting 8bit string is a valid utf8 sequence but no longer identical
> to the original 8bit string.
>
> I would like to propose that QFile::decodeName/encodeName use a
> modified utf8 codec such that the conversion utf8 -> QString -> utf8
> always results in the original 8bit string, even if such a string is not
> a valid utf8 sequence.
>
> I have attached a patch that illustrates what such a modified codec could
> look like. I have used 0xfffd as the escape character; maybe another
> character such as 0xffff would be more suitable.
>
> I am aware that this very problem can be solved for KDE applications
> by providing our own encoding function via QFile::setEncodingFunction
> but since this problem will affect Qt-only applications as well, it
> would be better if it could be solved in Qt itself.
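
To make the failure mode described above concrete, here is a minimal
sketch in plain C++ (no Qt). The names lossyDecode, encode and roundTrips
are hypothetical helpers invented for this illustration, and the decoder
is simplified (no overlong or surrogate checks). It only assumes what the
report states: invalid bytes are replaced by U+FFFD while decoding, so
re-encoding cannot reproduce the original 8bit string.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: decode UTF-8 to code points, substituting U+FFFD
// for every byte that is not part of a well-formed sequence.
std::vector<uint32_t> lossyDecode(const std::string &bytes)
{
    std::vector<uint32_t> out;
    std::size_t i = 0;
    while (i < bytes.size()) {
        unsigned char c = static_cast<unsigned char>(bytes[i]);
        std::size_t need = 0;
        uint32_t cp = 0;
        if (c < 0x80)                { cp = c;        need = 0; }
        else if ((c & 0xe0) == 0xc0) { cp = c & 0x1f; need = 1; }
        else if ((c & 0xf0) == 0xe0) { cp = c & 0x0f; need = 2; }
        else if ((c & 0xf8) == 0xf0) { cp = c & 0x07; need = 3; }
        else { out.push_back(0xfffd); ++i; continue; }  // stray byte

        bool ok = i + need < bytes.size();  // enough bytes left?
        for (std::size_t k = 1; ok && k <= need; ++k) {
            unsigned char t = static_cast<unsigned char>(bytes[i + k]);
            if ((t & 0xc0) != 0x80) ok = false;
            else cp = (cp << 6) | (t & 0x3f);
        }
        if (ok) { out.push_back(cp); i += need + 1; }
        else    { out.push_back(0xfffd); ++i; }  // information is lost here
    }
    return out;
}

// Hypothetical helper: encode code points as UTF-8.
std::string encode(const std::vector<uint32_t> &cps)
{
    std::string out;
    for (uint32_t u : cps) {
        if (u < 0x80) {
            out += char(u);
        } else if (u < 0x800) {
            out += char(0xc0 | (u >> 6));
            out += char(0x80 | (u & 0x3f));
        } else if (u < 0x10000) {
            out += char(0xe0 | (u >> 12));
            out += char(0x80 | ((u >> 6) & 0x3f));
            out += char(0x80 | (u & 0x3f));
        } else {
            out += char(0xf0 | (u >> 18));
            out += char(0x80 | ((u >> 12) & 0x3f));
            out += char(0x80 | ((u >> 6) & 0x3f));
            out += char(0x80 | (u & 0x3f));
        }
    }
    return out;
}

// The equality from the report: encodeName(decodeName(path)) == path.
bool roundTrips(const std::string &path)
{
    return encode(lossyDecode(path)) == path;
}
```

With these helpers, a latin1 name such as "caf\xe9.txt" does not round
trip, while the utf8 spelling "caf\xc3\xa9.txt" does.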

You're right, a solution is needed for the problem you described. However,
I don't like using 0xfffd+QChar(ch) to map these characters to
Unicode, as the forward and back transformations violate the utf8
encoding a lot.

I've implemented a slightly different solution that maps each invalid byte
to a surrogate pair in the supplementary private use area, as this should
hopefully lead to fewer conflicts. The only disadvantage is that
currently (until we have better surrogate handling in Qt) each of
these characters will show up as two boxes, instead of one box and the
char mapped from latin1. The diff against qt-3.2 beta2 is attached.
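
The arithmetic behind the mapping can be sketched in plain C++ as follows.
This is an illustration under stated assumptions, not the code in the
patch: escapeByte, codePoint and unescapePair are hypothetical names, and
only the constants 0xdbff, 0xde00 and 0x10fe00 are taken from the diff.

```cpp
#include <cstdint>
#include <string>

// Constants taken from the patch; the names are hypothetical.
const char16_t kEscapeHigh    = 0xdbff;  // fixed high surrogate
const char16_t kEscapeLowBase = 0xde00;  // low surrogate = base + byte

// Map one undecodable byte to the surrogate pair for U+10FE00 + byte.
std::u16string escapeByte(unsigned char b)
{
    std::u16string s;
    s += kEscapeHigh;
    s += char16_t(kEscapeLowBase + b);
    return s;
}

// Standard UTF-16 rule for combining a surrogate pair into a code point.
uint32_t codePoint(char16_t high, char16_t low)
{
    return 0x10000u
         + ((uint32_t(high) - 0xd800u) << 10)
         + (uint32_t(low) - 0xdc00u);
}

// If the pair is one of the escapes, recover the original byte.
bool unescapePair(char16_t high, char16_t low, unsigned char &b)
{
    uint32_t u = codePoint(high, low);
    if (u >= 0x10fe00u && u < 0x10ff00u) {
        b = static_cast<unsigned char>(u - 0x10fe00u);
        return true;
    }
    return false;  // an ordinary supplementary character
}
```

For the latin1 byte 0xe9 this gives the pair 0xdbff 0xdee9, i.e. the code
point 0x10fee9, and unescaping returns 0xe9 again. The diff tests
u > 0x10fe00 rather than u >= 0x10fe00; since only bytes of 0x80 and above
can be part of an invalid sequence, that difference should not matter for
the escapes produced here.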

Cheers,
Lars

--
Lars Knoll, Senior Software Engineer
Trolltech AS, Waldemar Thranes gt. 98, N-0175 Oslo, Norway

["patch.diff" (text/x-diff)]

--- src/codecs/qutfcodec.cpp	2003-07-04 08:30:46 -0000
+++ src/codecs/qutfcodec.cpp	2003-07-04 08:30:46 -0000

@@ -70,8 +70,15 @@
 		    }
 		}
 		if (u > 0xffff) {
-		    *cursor++ = 0xf0 | ((uchar) (u >> 18));
-		    *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+		    // see QString::fromUtf8() and QString::utf8() for explanations
+		    if (u > 0x10fe00 && u < 0x10ff00) {
+			*cursor++ = (u - 0x10fe00);
+			++ch;
+			continue;
+		    } else {
+			*cursor++ = 0xf0 | ((uchar) (u >> 18));
+			*cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+		    }
 		} else {
 		    *cursor++ = 0xe0 | ((uchar) (u >> 12));
 		}
@@ -79,7 +86,7 @@
  	    }
  	    *cursor++ = 0x80 | ((uchar) (u&0x3f));
  	}
- 	ch++;
+ 	++ch;
     }
     *cursor = 0;
     lenInOut = cursor - (uchar*)rstr.data();

--- src/tools/qstring.cpp	2003-07-04 08:30:46 -0000
+++ src/tools/qstring.cpp	2003-07-04 08:30:46 -0000

@@ -5184,8 +5184,18 @@
 		    }
 		}
 		if (u > 0xffff) {
-		    *cursor++ = 0xf0 | ((uchar) (u >> 18));
-		    *cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+		    // if people are working in utf8, but strings are encoded in eg. latin1, the resulting
+		    // name might be invalid utf8. This and the corresponding code in fromUtf8 takes care
+		    // we can handle this without losing information. This can happen with latin filenames
+		    // and a utf8 locale under Unix.
+		    if (u > 0x10fe00 && u < 0x10ff00) {
+			*cursor++ = (u - 0x10fe00);
+			++ch;
+			continue;
+		    } else {
+			*cursor++ = 0xf0 | ((uchar) (u >> 18));
+			*cursor++ = 0x80 | ( ((uchar) (u >> 12)) & 0x3f);
+		    }
 		} else {
 		    *cursor++ = 0xe0 | ((uchar) (u >> 12));
 		}
@@ -5193,7 +5203,7 @@
  	    }
  	    *cursor++ = 0x80 | ((uchar) (u&0x3f));
  	}
- 	ch++;
+ 	++ch;
     }
     rstr.truncate( cursor - (uchar*)rstr.data() );
     return rstr;
@@ -5220,13 +5230,14 @@
     if ( len < 0 )
 	len = strlen( utf8 );
     QString result;
-    result.setLength( len ); // worst case
+    result.setLength( len*2 ); // worst case
     QChar *qch = (QChar *)result.unicode();
     uint uc = 0;
     int need = 0;
+    int error = -1;
     uchar ch;
     for (int i=0; i<len; i++) {
-	ch = *utf8++;
+	ch = utf8[i];
 	if (need) {
 	    if ( (ch&0xc0) == 0x80 ) {
 		uc = (uc << 6) | (ch & 0x3f);
@@ -5244,25 +5255,42 @@
 		    }
 		}
 	    } else {
-		// error
-		*qch++ = QChar::replacement;
+		// See QString::utf8() for explanation.
+		//
+		// The surrogate below corresponds to a Unicode value of (0x10fe00+ch) which
+		// is in one of the private use areas of Unicode.
+		i = error;
+		*qch++ = QChar(0xdbff);
+		*qch++ = QChar(0xde00+((uchar)utf8[i]));
 		need = 0;
 	    }
+	    error = -1;
 	} else {
 	    if ( ch < 128 ) {
 		*qch++ = ch;
 	    } else if ((ch & 0xe0) == 0xc0) {
 		uc = ch & 0x1f;
 		need = 1;
+		error = i;
 	    } else if ((ch & 0xf0) == 0xe0) {
 		uc = ch & 0x0f;
 		need = 2;
+		error = i;
 	    } else if ((ch&0xf8) == 0xf0) {
 		uc = ch & 0x07;
 		need = 3;
+		error = i;
 	    }
 	}
     }
+    if (error != -1) {
+	// we have some invalid characters remaining we need to add to the string
+	for (int i = error; i < len; ++i) {
+	    *qch++ = QChar(0xdbff);
+	    *qch++ = QChar(0xde00+((uchar)utf8[i]));
+	}
+    }
+
     result.truncate( qch - result.unicode() );
     return result;
 }

