From: Thiago Macieira
Date: Fri, 04 Jul 2003 12:48:59 +0000
To: kde-core-devel
Subject: Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
X-MARC-Message: https://marc.info/?l=kde-core-devel&m=105732305622736

Waldo Bastian wrote:
> I don't know if there are other multi-byte encodings around that have the
> same problem as utf8. Some of the encodings popular in Asia are multi-byte,
> but I am not familiar with them beyond that. They may pose a problem.

Well, if the 8-bit sequence is restored perfectly, it shouldn't pose a
problem. But the encoding must be done per path component: for instance, a
directory name ending in shifted state should be considered invalid, because
it is followed by a slash and the start of the next component.

> We must be careful with autodetection of encoding, that will work fine when
> decoding (to QString), but there is a risk that we no longer know which
> encoding was used when we get to the point where we need to encode the
> string again (from QString).

Indeed. It might be interesting to keep a selection of encodings per FTP or
FISH site, as we do with browser identifications. I hate keeping state,
but this seems to be the only way, unless someone comes up with a bright new
idea.

> It may be possible to pass the encoded string as-is via KURL, although that
> is somewhat fragile. For that to work in combination with the ftp slave, we
> probably need KURL::setPath8Bit(const QCString &) and QCString
> KURL::path8Bit(). Not sure if that will work, since it relies on URLs being
> passed as KURL and that may not always be the case.
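To make the round-trip concern concrete, here is a minimal sketch in Python
rather than against KURL itself. It percent-encodes a raw 8-bit path one
component at a time and checks that the original bytes come back unchanged.
The helper names encode_path_8bit and decode_path_8bit are hypothetical and
only loosely mirror the setPath8Bit()/path8Bit() idea quoted above; they are
not an existing KDE API.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def encode_path_8bit(raw_path: bytes) -> str:
    """Percent-encode each path component separately (hypothetical helper).

    The '/' separators are never passed through the encoder, so every
    component is escaped on its own and the byte values are kept as-is.
    """
    return "/".join(quote_from_bytes(part, safe="") for part in raw_path.split(b"/"))


def decode_path_8bit(encoded: str) -> bytes:
    """Inverse of encode_path_8bit: recover the original 8-bit sequence."""
    return b"/".join(unquote_to_bytes(part) for part in encoded.split("/"))


# A path mixing a Latin-1 directory name with a UTF-8 file name.
raw = b"/home/user/caf\xe9/r\xc3\xa9sum\xc3\xa9.txt"
encoded = encode_path_8bit(raw)
restored = decode_path_8bit(encoded)

# Decoding the whole path as UTF-8 fails, which is the original problem.
try:
    raw.decode("utf-8")
    decodes_as_utf8 = True
except UnicodeDecodeError:
    decodes_as_utf8 = False
```

As long as the bytes are carried in escaped form and never forced through a
text decoder, the path survives; the fragile part is any step that turns it
into a Unicode string without knowing the encoding.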

That's also something that we'll have to stress test with KURL: to see
whether it preserves the original 8-bit encoding after some transformations.

We have a lot of problems here:
- the local filesystem (file:/) protocol, in which we must use the local
  encoding for filenames
- filesystem-like protocols, in which we must translate the same way as
  above, but allow the user to select the encoding, and probably keep that
  state cached
- normal URLs should translate into UTF-8 as per IRI: i.e., "é" must be
  equivalent to %C3%A9, not %E9. But not on file-like protocols!

IRI also complicates things a lot... Typed characters and entities should be
handled as UTF-8, which means %E9 cannot be translated into "é" even on
Latin 1 pages...

We could maybe construct a "URL encoding hint database", which would return
a different encoding depending on the protocol and/or hostname of the URL.

For one thing, the hostname component of a URL already seems to be handled
correctly.

-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom@mail.com
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
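
The "URL encoding hint database" idea can be sketched roughly as follows.
This is only an illustration in Python under stated assumptions: the table
contents, the host names, and the function names encoding_for and
encode_name are all made up for the example, and a real implementation
would live in KURL or the io-slaves and persist the user's choices.

```python
from urllib.parse import quote, urlsplit

# Hypothetical hint table: (scheme, hostname) -> encoding chosen by the user.
# A hostname of None acts as a scheme-wide default.
HINTS = {
    ("ftp", "ftp.example.org"): "iso-8859-1",
    ("fish", None): "utf-8",
}

# Schemes treated as filesystem-like in this sketch (an assumption).
FILE_LIKE = {"ftp", "fish"}


def encoding_for(url: str, local_encoding: str = "utf-8") -> str:
    """Pick the encoding used to turn a name into percent-escapes."""
    parts = urlsplit(url)
    if parts.scheme == "file":
        # Local filesystem: always the local encoding.
        return local_encoding
    if parts.scheme in FILE_LIKE:
        # Filesystem-like protocol: per-host hint, then scheme default,
        # then fall back to the local encoding.
        return (
            HINTS.get((parts.scheme, parts.hostname))
            or HINTS.get((parts.scheme, None))
            or local_encoding
        )
    # Normal URLs: UTF-8, as per IRI.
    return "utf-8"


def encode_name(name: str, url: str, local_encoding: str = "utf-8") -> str:
    """Percent-encode one path component for the given URL."""
    return quote(name, safe="", encoding=encoding_for(url, local_encoding))


http_form = encode_name("\u00e9", "http://www.example.com/")    # "é" on a normal URL
ftp_form = encode_name("\u00e9", "ftp://ftp.example.org/pub/")  # "é" on a Latin-1 FTP site
```

With these assumed hints, the same "é" becomes %C3%A9 on the http URL and
%E9 on the FTP site, which is exactly the split between IRI-style URLs and
file-like protocols described above.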