List: kde-core-devel
Subject: Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
From: Thiago Macieira <thiagom () wanadoo ! fr>
Date: 2003-07-04 12:48:59
Waldo Bastian wrote:
>I don't know if there are other multi-byte encodings around that have the
> same problem as utf8. Some of the encodings popular in Asia are multi-byte,
> but I am not familiar with them beyond that. They may pose a problem.
Well, if the 8-bit sequence is restored perfectly, it shouldn't pose a
problem. But the encoding must be done per path component: for instance, a
directory name ending in shifted state should be considered invalid, because
the slash that follows marks the start of the next component.
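To make the per-component rule concrete, here is a rough sketch in Python
(not KDE code). It assumes a stateful encoding such as ISO-2022-JP and treats
a component as valid only if decoding and re-encoding gives back exactly the
same bytes. The function names are made up for illustration.

```python
def round_trips(component: bytes, encoding: str) -> bool:
    """True if decoding and re-encoding restores the exact byte sequence."""
    try:
        return component.decode(encoding).encode(encoding) == component
    except UnicodeError:
        return False


def invalid_components(path: bytes, encoding: str) -> list:
    """Check each '/'-separated component on its own, never the whole path."""
    return [c for c in path.split(b"/") if not round_trips(c, encoding)]


# A name that shifts into JIS X 0208 and shifts back to ASCII at the end.
good = "日本".encode("iso2022_jp")
# The same name with the final shift-back escape (ESC ( B) cut off, so the
# component ends in shifted state right before the next slash.
bad = good[:-3]

print(invalid_components(b"/home/" + bad + b"/" + good, "iso2022_jp"))
```

The round-trip test is stricter than "does it decode": a component left in
shifted state may still decode, but the re-encoded bytes differ from the
original, so the original 8-bit name could not be restored from the QString.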
>We must be careful with autodetection of encoding, that will work fine when
>decoding (to QString), but there is a risk that we no longer know which
>encoding was used when we get to the point where we need to encode the
> string again (from QString).
Indeed. It might be interesting to keep a selection of encodings per FTP or
FISH site such as we do with browser identifications. I hate keeping states,
but this seems to be the only way, unless someone comes up with a bright new
idea.
>It may be possible to pass the encoded string as-is via KURL, although that
> is somewhat fragile. For that to work in combination with the ftp slave, we
> probably need KURL::setPath8Bit(const QCString &) and QCString
>KURL::path8Bit(). Not sure if that will work, since it relies on URLs being
>passed as KURL and that may not always be the case.
That's also something that we'll have to stress-test with KURL: to see whether
it preserves the original 8-bit encoding after some transformations.
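As an illustration of the property such a stress test would check, here is a
small Python sketch. It uses the standard urllib.parse functions as a stand-in
for KURL (it says nothing about how KURL itself behaves), and the helper names
are hypothetical: the raw 8-bit path is percent-escaped byte by byte and must
come back unchanged.

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes


def path_to_url_path(raw: bytes) -> str:
    """Escape the raw bytes without assuming any character encoding."""
    return quote_from_bytes(raw, safe="/")


def url_path_to_path(escaped: str) -> bytes:
    """Recover the original 8-bit sequence from the escaped form."""
    return unquote_to_bytes(escaped)


raw = b"/pub/caf\xe9.txt"          # Latin-1 bytes, not valid UTF-8
escaped = path_to_url_path(raw)    # "/pub/caf%E9.txt"
restored = url_path_to_path(escaped)
print(escaped, restored == raw)
```

Because the escaping works on bytes, no decoding to Unicode happens in
between, which is the fragile step the stress test would be looking for.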
We have a lot of problems here:
- the local filesystem (file:/) protocol, in which we must use the local
encoding for filenames
- filesystem-like protocols, in which we must translate the same way as above,
but allow the user to select the encoding -- and probably keep that state
cached
- normal URLs should translate into UTF-8 as per IRI: i.e., "é" must be
equivalent to %C3%A9, not %E9. But not on file-like protocols!
IRI also complicates things a lot... Typed characters and entities should be
handled as UTF-8, which means %E9 cannot be translated into "é" even on
Latin 1 pages...
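The difference between the two escapings can be seen with Python's standard
urllib.parse (used here only to illustrate the byte values, not as a model of
KURL):

```python
from urllib.parse import quote, unquote

# IRI-style: the character is converted to UTF-8 first, then escaped.
utf8_form = quote("é")                          # "%C3%A9"
# Legacy style: the character is escaped from its Latin 1 byte.
latin1_form = quote("é", encoding="latin-1")    # "%E9"

# Decoding %E9 as UTF-8 does not give "é"; the lone byte 0xE9 is not valid
# UTF-8, so it comes back as U+FFFD (the replacement character).
as_utf8 = unquote("%E9")
# Only an explicit Latin 1 interpretation restores the character.
as_latin1 = unquote("%E9", encoding="latin-1")

print(utf8_form, latin1_form, repr(as_utf8), as_latin1)
```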
We could maybe construct a "URL encoding hint database", which would return a
different encoding depending on the protocol and/or hostname of the URL.
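A minimal sketch of what such a lookup could look like, in Python rather than
in KDE terms. Everything here is an assumption for illustration: the function
name, the set of filesystem-like protocols, the shape of the per-site table,
and the fallback to the local encoding when the user has selected nothing.

```python
import sys
from urllib.parse import urlsplit

# Hypothetical set of protocols that behave like a remote filesystem.
FILESYSTEM_LIKE = {"ftp", "fish"}


def encoding_hint(url, site_overrides=None, local_encoding=None):
    """Return the encoding to use for the path of `url`.

    site_overrides maps (protocol, hostname) to an encoding the user picked.
    """
    local = local_encoding or sys.getfilesystemencoding()
    overrides = site_overrides or {}
    parts = urlsplit(url)
    if parts.scheme == "file":
        # Local filesystem: always the local encoding.
        return local
    if parts.scheme in FILESYSTEM_LIKE:
        # Same translation as file:/, unless the user selected an encoding
        # for this particular site.
        return overrides.get((parts.scheme, parts.hostname), local)
    # Normal URLs: UTF-8, as per IRI.
    return "utf-8"


overrides = {("ftp", "ftp.example.org"): "koi8-r"}
print(encoding_hint("ftp://ftp.example.org/pub/", overrides))
print(encoding_hint("http://www.example.org/caf%C3%A9"))
```

Keying on the pair (protocol, hostname) keeps the cached state small and
mirrors the way browser identifications are stored per site.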
For one thing, the hostname component of a URL seems to be handled
correctly.
--
Thiago Macieira - Registered Linux user #65028
thiagom@mail.com
ICQ UIN: 1967141 PGP/GPG: 0x6EF45358; fingerprint:
E067 918B B660 DBD1 105C 966C 33F5 F005 6EF4 5358