From kde-core-devel Mon Nov 17 18:39:03 2003
From: Thiago Macieira
Date: Mon, 17 Nov 2003 18:39:03 +0000
To: kde-core-devel
Subject: Re: Problem with encodings in several places in KDE
X-MARC-Message: https://marc.info/?l=kde-core-devel&m=106909516607535

Waldo Bastian wrote:
>Keep in mind that N23835 fixes (part of) the problem now, your proposal will
>not be available before Qt 4.

I am aware of that. But my proposal could be included in Qt 3.3 (if that is
released) without too great damage -- marshalling formats notwithstanding.
See below.

>> Next, (and here's what I am proposing to TT) is that both QString and
>> QCString hold a QTextCodec* pointer to the codec that can be used to
>> convert the string back to its original form. QFile::encodeName and decode
>> would be a special QTextCodec in this regard and they have to work for
>> every encoding, not just UTF-8. One solution would be to break the
>> filename into its components and encode each one separately; if any fail,
>> the same "broken UTF-8" decoding of the current solution can be applied.
>
>??? You want to register the encoding for each of the segments and keep them
>around, even under transformation?

No, that's not what I want. I want a pathname to be broken into components
(separated by QDir::separator) and each component to be inspected separately
for invalid sequences.

The idea is that one could have a file with an undecodable filename but whose
leading directory names are legal. The current solution only solves the UTF-8
case, not any other cases that might arise (as you yourself pointed out in
July).

The idea is to decode the legal components properly, but use the "broken
UTF-8" method for those that can't be decoded. Since nothing else should be
generating those surrogate pairs, it's safe to assume that when they are
present in a path component, they indicate undecoded 8-bit sequences.

For other strings that are not filenames, the old rules would apply. I.e.,
QString::fromUtf8(s).utf8() == s is not guaranteed.

(A leading marker per component might be advisable, performance-wise.)

>Keeping codecs around in the QString would indeed be nice, yes. Changes in
>marshall format would break KDE4 - KDE3 wire compatibility though. Then
>again, if we just drop that requirement, the migration to D-BUS will become
>a lot easier.

Not necessarily, since the wire format is versioned. Qt can read all the
previous versions' marshalling formats and write in them -- at least QStrings
can. It's just a matter of handshake.

KDE classes might not be checking the stream version and thus might not be
prepared to handle multiple versions of themselves.

By the way, a recommendation for future wire formats: include a size argument
leading each block, in such a way that new fields are appended to the end.
Older implementations will ignore them if present; newer implementations will
assume default values if missing.

--
  Thiago Macieira  -  Registered Linux user #65028
   thiagom@mail.com
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
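
The size-prefixed block recommendation in the message can be illustrated with
a minimal sketch. Everything here is an assumption made for the example: the
record layout, the names (Record, writeV1, writeV2, readV1, readV2) and the
default value are hypothetical, and this is not Qt's QDataStream format or
the DCOP wire format. It only shows how a leading size lets an old reader
skip fields it does not know, and lets a new reader fall back to a default
when a field is missing.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical block layout, for illustration only:
//   [u32 payload size][u32 id][u32 flags (appended by the "new" writer)]
// All integers are big-endian.
using Bytes = std::vector<std::uint8_t>;

void putU32(Bytes &out, std::uint32_t v) {
    for (int shift = 24; shift >= 0; shift -= 8)
        out.push_back(static_cast<std::uint8_t>((v >> shift) & 0xFF));
}

// Reads a u32 at `pos`, staying inside [pos, end). On failure `pos` is unchanged.
bool getU32(const Bytes &in, std::size_t &pos, std::size_t end, std::uint32_t &v) {
    if (end > in.size() || pos > end || end - pos < 4) return false;
    v = 0;
    for (int i = 0; i < 4; ++i) v = (v << 8) | in[pos++];
    return true;
}

struct Record {
    std::uint32_t id = 0;
    std::uint32_t flags = 7;  // hypothetical default used when the field is absent
};

// "Old" writer: knows only `id`.
void writeV1(Bytes &out, std::uint32_t id) {
    putU32(out, 4);  // payload size in bytes
    putU32(out, id);
}

// "New" writer: appends `flags` after the fields the old format already had.
void writeV2(Bytes &out, const Record &r) {
    putU32(out, 8);
    putU32(out, r.id);
    putU32(out, r.flags);
}

// "Old" reader: reads `id`, then jumps to the end of the block,
// ignoring any trailing fields it does not know about.
bool readV1(const Bytes &in, std::size_t &pos, std::uint32_t &id) {
    std::uint32_t size = 0;
    if (!getU32(in, pos, in.size(), size)) return false;
    if (size > in.size() - pos) return false;  // truncated block
    const std::size_t end = pos + size;
    if (!getU32(in, pos, end, id)) return false;
    pos = end;
    return true;
}

// "New" reader: `flags` is optional; the default is kept if the block is short.
bool readV2(const Bytes &in, std::size_t &pos, Record &r) {
    std::uint32_t size = 0;
    if (!getU32(in, pos, in.size(), size)) return false;
    if (size > in.size() - pos) return false;  // truncated block
    const std::size_t end = pos + size;
    r = Record();
    if (!getU32(in, pos, end, r.id)) return false;
    std::uint32_t flags = 0;
    if (getU32(in, pos, end, flags)) r.flags = flags;
    pos = end;  // also skips fields added by even newer writers
    return true;
}
```

In this sketch a stream holding one writeV2 block followed by one writeV1
block can be consumed by either reader: readV1 skips the extra flags field,
and readV2 fills in the default for the block that lacks it. The cost is four
extra bytes per block, plus the discipline of only ever appending fields.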