Hello everyone,

I'm going to rehash a couple of situations that arise or have arisen in KDE
regarding character encodings. I have thought of a solution, which involves
Trolltech (TT) adding a couple of methods to QString and QCString, but
before I send an e-mail to qt-bugs@, I'd like to have some feedback.

Background:
Throughout Qt and KDE, all APIs take filenames in the form of QStrings, that
is, in Unicode representation. However, when dealing with lower-level system
calls, those QStrings must be converted to an 8-bit representation.
Normally, when talking to the operating system or other applications, the
locale encoding is used, but that is lossy. UTF-8 is recommended, but the
other side must be prepared to use it.

The problem:
(Qt issue N23835, thread:
http://lists.kde.org/?l=kde-core-devel&m=105730766410987&w=2)
In early July, we fixed a problem with the handling of files whose names
were not properly UTF-8 encoded when using UTF-8 for filenames. Without that
fix, users were unable to open (or even rename, I think) files whose names
were "broken". The solution our troll friends came up with was to make the
UTF-8 encoder/decoder algorithms map each byte of an invalid UTF-8 sequence
to a section of the private-use range in Unicode Plane 16 (from U+10FE00 to
U+10FEFF), representing each of those values by two UTF-16 surrogates in a
QString. You might have seen its effect in that you now see two "squares"
where the invalid character is located.

The side-effect of this is that the UTF-8 codec can now decode any string,
which is not the correct behaviour. For instance, Kopete relies on whether a
message decodes as UTF-8 to determine if it was properly encoded (see
BR 67727).
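
To illustrate the side-effect just described, here is a minimal sketch in
Python (for brevity; it is not Qt code and not the actual N23835 patch). It
assumes an escape scheme of the kind described above, where each undecodable
byte b becomes the code point U+10FE00 + b. The handler name "pua-escape" and
the functions decode_lenient, encode_lenient and is_valid_utf8 are
hypothetical names chosen for this example.

```python
import codecs

ESCAPE_BASE = 0x10FE00  # start of the private-use block named in the text


def _escape_invalid(err):
    """Decode error handler: map each undecodable byte b to U+10FE00 + b."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(chr(ESCAPE_BASE + b) for b in bad), err.end


# "pua-escape" is a made-up handler name for this sketch.
codecs.register_error("pua-escape", _escape_invalid)


def decode_lenient(raw: bytes) -> str:
    """Never fails: invalid bytes are escaped instead of rejected."""
    return raw.decode("utf-8", errors="pua-escape")


def encode_lenient(text: str) -> bytes:
    """Reverse of decode_lenient: escaped code points become raw bytes again.

    Caveat: a string that legitimately contains U+10FE00..U+10FEFF is
    indistinguishable from an escaped byte and will not round-trip.
    """
    out = bytearray()
    for ch in text:
        cp = ord(ch)
        if ESCAPE_BASE <= cp <= ESCAPE_BASE + 0xFF:
            out.append(cp - ESCAPE_BASE)
        else:
            out += ch.encode("utf-8")
    return bytes(out)


def is_valid_utf8(raw: bytes) -> bool:
    """Validity check of the kind Kopete needs: it requires a strict decode."""
    try:
        raw.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False
```

With a Latin-1 encoded name such as b"caf\xe9.txt", decode_lenient succeeds
and encode_lenient restores the original bytes, which is what the filename
fix needs. The same input makes is_valid_utf8 return False, which a caller
can no longer learn from a decoder that accepts everything.
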
Besides, this doesn't solve all the problems: other encodings might fail the
same way UTF-8 does, which still renders broken filenames inoperable. My
second request in
http://lists.kde.org/?l=kde-core-devel&m=105731424516065&w=2 isn't solved
either.

The next problem:
(BR 65378, BR 56197)
In more than one instance, one exact encoding method for a given Unicode
string is desired. Bug #56197 requires an encoding parameter to be sent back
and forth between kio_ftp and the KIO master so that FTP filenames can be
reconstructed in their original 8-bit form. Bug #65378 requires either that
the parameters for the application be already encoded (i.e., change the
return value from QStringList to QValueList) or that a flag indicate whether
QString::local8Bit or QFile::encodeName should be used.

My proposal:
To solve all of those problems and to eradicate the side-effect, a
non-trivial fix is required. First of all, the patch to Qt from issue N23835
should be reverted, making the UTF-8 codec fully conformant again.

Next (and here's what I am proposing to TT), both QString and QCString would
hold a QTextCodec* pointer to the codec that can be used to convert the
string back to its original form. QFile::encodeName and QFile::decodeName
would act as a special QTextCodec in this regard, and they have to work for
every encoding, not just UTF-8. One solution would be to break the filename
into its components and encode each one separately; if any fail, the same
"broken UTF-8" decoding of the current solution can be applied.

As for KDE code, we'd have to check where in the filesystem-handling code
any assumption about the codec is made. One such example is Bug #65378. The
solution there would then be simple: instead of relying on
QString::local8Bit, the associated codec encoder would be used.

As for Bug #56197, the solution would still be including the encoding in the
metadata.
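
The proposal above can be sketched roughly as follows, in Python for brevity
rather than Qt/C++. Everything here is an assumption made for illustration:
CodecString is a hypothetical stand-in for a QString that remembers its
codec, decode_path and encode_path are made-up helpers, and Python's built-in
"surrogateescape" error handler stands in for the "broken UTF-8" fallback (it
escapes bytes differently from the Qt patch).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CodecString:
    """Hypothetical stand-in for a QString that remembers its codec."""
    text: str
    codec: str               # codec the text was decoded with
    errors: str = "strict"   # "surrogateescape" if the fallback was needed

    def to_8bit(self) -> bytes:
        """Re-encode with the recorded codec, not with the locale's."""
        return self.text.encode(self.codec, self.errors)


def decode_component(raw: bytes, codec: str) -> CodecString:
    """Decode strictly; fall back to byte escaping only if that fails."""
    try:
        return CodecString(raw.decode(codec), codec)
    except UnicodeDecodeError:
        escaped = raw.decode(codec, "surrogateescape")
        return CodecString(escaped, codec, "surrogateescape")


def decode_path(raw: bytes, codec: str) -> list:
    """Decode each path component separately, as suggested in the text."""
    return [decode_component(part, codec) for part in raw.split(b"/")]


def encode_path(parts: list) -> bytes:
    """Rebuild the original 8-bit path from its decoded components."""
    return b"/".join(part.to_8bit() for part in parts)
```

For a path such as b"/home/user/caf\xe9/r\xc3\xa9sum\xc3\xa9.txt" decoded as
UTF-8, only the "caf\xe9" component needs the fallback; the others decode
strictly, and encode_path returns the original bytes. Likewise, an FTP
filename decoded as ISO-8859-1 re-encodes to its original bytes through
to_8bit, regardless of what the locale encoding is.
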

For the problem that Issue N23835 was meant to solve, KDE code has to make
sure that the codec value is kept alongside the QString internally (that is,
to be sure that the QString represents a filename). That way, when
re-encoding back to its 8-bit form in order to (for instance) rename the
file, the original 8-bit value is restored. In order to launch an
application, we end up with Bug #65378, which means the codec value would
have to be transmitted through the DCOP stream (easiest solution: include it
in the QString's marshalling format).

Going even further, for KDE4 we could have applications that are launched
from kdeinit as libraries (like Konqueror) receive their argument list in
Unicode form, thus preserving the codec as well as any other character
(example: imagine running an application from the minicli with one of the
arguments containing a character that cannot be encoded in the locale's
encoding).

I hope I have been clear enough. I have written this text in order to get
some feedback, so I'd really appreciate any comments. Please find me on IRC
if you wish to have a live discussion.

PS: I can solve Bug #65378 with a workaround (namely, a bitmap returned from
KRun::processDesktopExec indicating whether QFile::encodeName should be used
or not).

-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom@mail.com
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358