From kde-core-devel  Fri Jul 04 12:48:59 2003
From: Thiago Macieira <thiagom () wanadoo ! fr>
Date: Fri, 04 Jul 2003 12:48:59 +0000
To: kde-core-devel
Subject: Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
X-MARC-Message: https://marc.info/?l=kde-core-devel&m=105732305622736
MIME-Version: 1
Content-Type: multipart/mixed; boundary="--Boundary-02=_CfXB/oP3fNmji+u"


--Boundary-02=_CfXB/oP3fNmji+u
Content-Type: text/plain;
  charset="iso-8859-1"
Content-Transfer-Encoding: quoted-printable
Content-Disposition: inline

Waldo Bastian wrote:
>I don't know if there are other multi-byte encodings around that have the
> same problem as utf8. Some of the encodings popular in Asia are multi-byte,
> but I am not familiar with them beyond that. They may pose a problem.

Well, if the 8-bit sequence is restored perfectly, it shouldn't pose a
problem. But the encoding must be done per path component: for instance, a
directory name ending in a shifted state should be considered invalid,
because a slash follows, marking the start of the next component.
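
A minimal sketch of the per-component rule, assuming ISO-2022-JP as the
example of a stateful encoding (the thread does not name one). The helper
names ends_in_initial_state and decode_path are hypothetical and are not
part of KURL or any KDE API.

```python
# Sketch only: ISO-2022-JP is an assumed example of a stateful encoding.
# Escape sequences from RFC 1468, mapped to "does this leave us shifted?"
ESCAPES = {
    b"\x1b(B": False,  # back to ASCII, the initial state
    b"\x1b(J": True,   # JIS X 0201 Roman
    b"\x1b$@": True,   # JIS X 0208-1978, two bytes per character
    b"\x1b$B": True,   # JIS X 0208-1983, two bytes per character
}


def ends_in_initial_state(component: bytes) -> bool:
    """Return True if the component finishes back in the ASCII state."""
    shifted = False
    i = 0
    while i < len(component):
        if component[i] == 0x1B:
            seq = component[i:i + 3]
            if seq not in ESCAPES:
                return False  # unknown or truncated escape sequence
            shifted = ESCAPES[seq]
            i += 3
        else:
            i += 1
    return not shifted


def decode_path(raw: bytes, encoding: str = "iso2022_jp") -> list:
    """Decode a raw 8-bit path one component at a time."""
    parts = []
    for component in raw.split(b"/"):
        if not ends_in_initial_state(component):
            raise ValueError("component ends in shifted state: %r" % component)
        parts.append(component.decode(encoding))
    return parts
```

Splitting on the raw slash byte before decoding is the point of the sketch:
a component that is still shifted when the slash arrives is rejected rather
than decoded together with its neighbour.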

>We must be careful with autodetection of encoding, that will work fine when
>decoding (to QString), but there is a risk that we no longer know which
>encoding was used when we get to the point where we need to encode the
> string again (from QString).

Indeed. It might be interesting to keep a selection of encodings per FTP or
FISH site, as we do with browser identifications. I hate keeping state,
but this seems to be the only way, unless someone comes up with a bright new
idea.
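
The risk described above can be illustrated with a small sketch: a name is
autodetected on decode, and the encoding that worked is remembered per site
so that re-encoding yields the original bytes. The SiteEncodings class, its
candidate list, and the site key are assumptions made for illustration, not
an existing KDE facility.

```python
class SiteEncodings:
    """Remember, per site, which encoding successfully decoded a name."""

    def __init__(self, candidates=("utf-8", "iso-8859-1")):
        self.candidates = candidates  # tried in order when nothing is known
        self.by_site = {}             # site key -> encoding name

    def decode(self, site: str, raw: bytes) -> str:
        known = self.by_site.get(site)
        for enc in ([known] if known else self.candidates):
            try:
                text = raw.decode(enc)
            except UnicodeDecodeError:
                continue
            self.by_site[site] = enc  # keep the state for later encoding
            return text
        raise ValueError("no candidate encoding fits %r" % raw)

    def encode(self, site: str, text: str) -> bytes:
        # Fall back to the first candidate if the site was never seen.
        return text.encode(self.by_site.get(site, self.candidates[0]))


store = SiteEncodings()
site = "ftp://ftp.example.org"           # hypothetical site key
raw = b"caf\xe9.txt"                     # Latin-1 bytes as sent by the server
name = store.decode(site, raw)           # UTF-8 fails, Latin-1 is remembered
restored = store.encode(site, name)      # same bytes as the server sent
lossy = name.encode("utf-8")             # what happens if the state is lost
```

Without the remembered state, the re-encoded name differs from what the
server sent, which is exactly the failure mode quoted above.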

>It may be possible to pass the encoded string as-is via KURL, although that
> is somewhat fragile. For that to work in combination with the ftp slave, we
> probably need KURL::setPath8Bit(const QCString &) and QCString
>KURL::path8Bit(). Not sure if that will work, since it relies on URLs being
>passed as KURL and that may not always be the case.

That's also something we'll have to stress-test with KURL: whether it
preserves the original 8-bit encoding after some transformations.
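
As an illustration of what such a stress test could check, here is a sketch
using Python's urllib.parse in place of KURL (an assumption; KURL's actual
behaviour would have to be tested in C++). The roundtrip helper, the host
name, and the sample paths are hypothetical.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes, urlsplit


def roundtrip(raw_path: bytes, host: str = "ftp.example.org") -> bytes:
    """Percent-encode raw bytes into a URL, transform it, and recover them."""
    url = "ftp://" + host + quote_from_bytes(raw_path)
    # Stand-in for "some transformations": parse, rebuild, parse again.
    rebuilt = urlsplit(url).geturl()
    return unquote_to_bytes(urlsplit(rebuilt).path)


SAMPLES = [
    b"/pub/caf\xe9.txt",      # Latin-1 name
    b"/pub/caf\xc3\xa9.txt",  # the same name in UTF-8
    b"/a b/\xff\xfe",         # bytes that are valid in neither as text
]

failures = [s for s in SAMPLES if roundtrip(s) != s]
```

The bytes survive here because they never pass through a character string
in an assumed encoding; they are only ever percent-escaped and unescaped.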

We have a lot of problems here:
- the local filesystem (file:/) protocol, in which we must use the local
encoding for filenames
- filesystem-like protocols, in which we must translate the same way as above,
but allow the user to select the encoding -- and probably keep that state
cached
- normal URLs should translate into UTF-8 as per IRI: i.e., "é" must be
equivalent to %C3%A9, not %E9. But not on file-like protocols!

IRI also complicates things a lot... Typed characters and entities should be
handled as UTF-8, which means %E9 cannot be translated into "é" even on
Latin 1 pages...
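
A short sketch of the IRI rule in both directions, using Python's
urllib.parse. The function names iri_to_uri_path and uri_path_for_display
are hypothetical, and the all-or-nothing fallback in the second one is a
simplification chosen for brevity.

```python
from urllib.parse import quote, unquote_to_bytes


def iri_to_uri_path(path: str) -> str:
    """Typed characters are percent-encoded from their UTF-8 bytes."""
    return quote(path, safe="/", encoding="utf-8")


def uri_path_for_display(path: str) -> str:
    """Show escapes as characters only if they form valid UTF-8."""
    try:
        return unquote_to_bytes(path).decode("utf-8")
    except UnicodeDecodeError:
        return path  # e.g. %E9 stays as it is, whatever the page charset


utf8_form = iri_to_uri_path("/caf\u00e9")                       # "/caf%C3%A9"
latin1_form = quote("/caf\u00e9", safe="/", encoding="latin-1")  # "/caf%E9"
```

The two encoded forms name different resources under the IRI reading, so
only the first may be displayed as "é".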

We could maybe construct a "URL encoding hint database", which would return
a different encoding depending on the protocol and/or hostname of the URL.
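
One possible shape for such a database, as a sketch only: the most specific
hint (protocol plus hostname) wins, then a per-protocol default, then UTF-8
as per IRI. The EncodingHints class, its method names, and the UTF-8
fallback for filesystem-like protocols without a user choice are all
assumptions.

```python
import sys
from urllib.parse import urlsplit


class EncodingHints:
    """Look up a filename encoding from the protocol and hostname of a URL."""

    def __init__(self):
        # (scheme, host) -> encoding selected by the user for that site
        self._by_site = {}
        # scheme -> default; file:/ uses the local filesystem encoding
        self._by_scheme = {"file": sys.getfilesystemencoding()}

    def set_site(self, scheme: str, host: str, encoding: str) -> None:
        self._by_site[(scheme.lower(), host.lower())] = encoding

    def lookup(self, url: str) -> str:
        parts = urlsplit(url)
        key = (parts.scheme, parts.hostname or "")
        if key in self._by_site:
            return self._by_site[key]
        # Anything without a hint is treated as UTF-8, as per IRI.
        return self._by_scheme.get(parts.scheme, "utf-8")


hints = EncodingHints()
hints.set_site("ftp", "ftp.example.org", "iso-8859-1")  # hypothetical site
```

With this setup, hints.lookup("ftp://ftp.example.org/pub/") returns the
user's choice, while a URL on a host with no stored hint falls through to
the per-protocol default or to UTF-8.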

At least the hostname component of a URL seems to be handled correctly.

-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom@mail.com
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358

--Boundary-02=_CfXB/oP3fNmji+u--