'Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible'

List:       kde-core-devel
Subject:    Re: [Issue N23835] [PATCH] Files with non-utf8 names unaccessible
From:       Thiago Macieira <thiagom () wanadoo ! fr>
Date:       2003-07-04 12:48:59

Waldo Bastian wrote:
>I don't know if there are other multi-byte encodings around that have the
> same problem as utf8. Some of the encodings popular in Asia are multi-byte,
> but I am not familiar with them beyond that. They may pose a problem.

Well, if the 8-bit sequence is restored perfectly, it shouldn't pose a 
problem. But the encoding must be done per path component: for instance, a 
directory name ending in a shifted state should be considered invalid, 
because a slash follows it and begins the next component.
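
To make the per-component idea concrete, here is a rough sketch in Python 
(not KDE code). The function name decode_path is made up for illustration, 
and the re-encode-and-compare check is only my assumption of what "restored 
perfectly" could mean in practice:

```python
def decode_path(raw: bytes, encoding: str) -> list[str]:
    """Decode a '/'-separated byte path one component at a time."""
    decoded = []
    for component in raw.split(b"/"):
        # A strict decode raises UnicodeDecodeError (a ValueError) on bad bytes.
        text = component.decode(encoding, errors="strict")
        # Re-encode and compare: the original 8-bit sequence must come back
        # exactly, otherwise the same file could not be addressed again.
        if text.encode(encoding) != component:
            raise ValueError(
                f"component {component!r} does not round-trip in {encoding}"
            )
        decoded.append(text)
    return decoded


# UTF-8 bytes for "pub/café" decode cleanly, component by component.
components = decode_path(b"pub/caf\xc3\xa9", "utf-8")
```

With a stateful encoding such as ISO-2022-JP, a component that switches to 
a double-byte set and never switches back is rejected with ValueError here, 
which is the "ending in a shifted state" case.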

>We must be careful with autodetection of encoding, that will work fine when
>decoding (to QString), but there is a risk that we no longer know which
>encoding was used when we get to the point where we need to encode the
> string again (from QString).

Indeed. It might be interesting to keep a selected encoding per FTP or FISH 
site, as we do with browser identification. I hate keeping state, but this 
seems to be the only way, unless someone comes up with a bright new idea.
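
A small Python sketch of the risk and of why some state seems necessary. 
DecodedName and autodetect are hypothetical names, and the candidate list is 
an assumption; the point is only that the detected encoding has to travel 
with the decoded string:

```python
from typing import NamedTuple


class DecodedName(NamedTuple):
    text: str
    encoding: str  # remembered so the name can be re-encoded identically


def autodetect(raw: bytes, candidates=("utf-8", "latin-1")) -> DecodedName:
    """Return the first candidate encoding that decodes the bytes strictly."""
    for enc in candidates:
        try:
            return DecodedName(raw.decode(enc), enc)
        except UnicodeDecodeError:
            continue
    raise ValueError("no candidate encoding fits")


raw = b"caf\xe9.txt"          # not valid UTF-8, so Latin-1 is chosen
name = autodetect(raw)

# Re-encoding with the remembered encoding restores the original bytes;
# re-encoding with a different one names a different file.
same_bytes = name.text.encode(name.encoding) == raw
other_bytes = name.text.encode("utf-8")
```

Without the encoding field, the second case is what we would get whenever 
the default differs from what was detected.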

>It may be possible to pass the encoded string as-is via KURL, although that
> is somewhat fragile. For that to work in combination with the ftp slave, we
> probably need KURL::setPath8Bit(const QCString &) and QCString
>KURL::path8Bit(). Not sure if that will work, since it relies on URLs being
>passed as KURL and that may not always be the case.

That's also something that we'll have to stress test with KURL: to see 
whether it preserves the original 8-bit encoding after some transformations.
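
As an illustration of the kind of test I mean, here it is in Python with 
urllib.parse standing in for KURL (so this says nothing about how KURL 
itself behaves). The host name and the helper survives_round_trip are made 
up:

```python
from urllib.parse import quote_from_bytes, unquote_to_bytes, urlsplit, urlunsplit


def survives_round_trip(raw_path: bytes, host: str = "ftp.example.org") -> bool:
    """Carry raw path bytes through a URL and check they come back unchanged."""
    # Percent-encode the bytes as-is, without assuming any character encoding.
    url = "ftp://" + host + quote_from_bytes(raw_path)
    # One transformation: split the URL into parts and put it back together.
    rebuilt = urlunsplit(urlsplit(url))
    # Recover the bytes from the path of the rebuilt URL.
    return unquote_to_bytes(urlsplit(rebuilt).path) == raw_path


encoded = quote_from_bytes(b"/pub/caf\xe9.txt")
ok = survives_round_trip(b"/pub/caf\xe9.txt")
```

A real test would chain many more transformations (relative URL 
resolution, cleaning the path, passing the URL as a plain string, and so on).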

We have a lot of problems here:
- the local filesystem (file:/) protocol, in which we must use the local 
encoding for filenames
- filesystem-like protocols, in which we must translate the same way as above, 
but allow the user to select the encoding -- and probably keep that state 
cached
- normal URLs should translate into UTF-8 as per IRI: i.e., "é" must be 
equivalent to %C3%A9, not %E9. But not on file-like protocols!

IRI also complicates things a lot... Typed characters and entities should be 
handled as UTF-8, which means %E9 cannot be translated into "é" even on 
Latin 1 pages...
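
The difference is easy to see with Python's urllib.parse (used here purely 
as an illustration; the variable names are mine):

```python
from urllib.parse import quote, unquote, unquote_to_bytes

# IRI-style: the character is converted to UTF-8, then percent-encoded.
iri_style = quote("é")                          # "%C3%A9"

# Legacy Latin-1 style: one byte, a different URL.
latin1_style = quote("é", encoding="latin-1")   # "%E9"

# Decoding as UTF-8 recovers "é" only from the first form.
from_utf8 = unquote("%C3%A9")

# %E9 alone is not valid UTF-8; only the raw byte can be recovered safely.
raw_byte = unquote_to_bytes("%E9")
```

So a link written as %E9 on a Latin 1 page and a typed "é" end up as two 
different URLs under the IRI rules.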

We could maybe construct a "URL encoding hint database", which would return a 
different encoding depending on the protocol and/or hostname of the URL.
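
Something along these lines, sketched in Python. The table contents, the 
host names and the function encoding_hint are all invented for the example; 
a real database would be filled from user choices:

```python
import sys
from urllib.parse import urlsplit

# Hypothetical per-site choices made by the user, keyed by (protocol, host).
HOST_HINTS = {("ftp", "ftp.example.jp"): "shift_jis"}

# Hypothetical per-protocol defaults for filesystem-like protocols.
PROTOCOL_HINTS = {"ftp": "latin-1", "fish": "utf-8"}


def encoding_hint(url: str) -> str:
    """Return the encoding to use for the path of the given URL."""
    parts = urlsplit(url)
    scheme = parts.scheme.lower()
    host = (parts.hostname or "").lower()
    if scheme == "file":
        # Local filesystem: always the local encoding.
        return sys.getfilesystemencoding()
    if (scheme, host) in HOST_HINTS:
        # A site-specific selection wins over the protocol default.
        return HOST_HINTS[(scheme, host)]
    # Normal URLs fall back to UTF-8, as per IRI.
    return PROTOCOL_HINTS.get(scheme, "utf-8")


hint = encoding_hint("ftp://ftp.example.jp/pub/")
```

The lookup order (local filesystem, then site, then protocol, then UTF-8) 
is just one possible policy.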

For one thing, at least the hostname component of a URL seems to be handled 
correctly.

-- 
  Thiago Macieira  -  Registered Linux user #65028
   thiagom@mail.com           
    ICQ UIN: 1967141   PGP/GPG: 0x6EF45358; fingerprint:
    E067 918B B660 DBD1 105C  966C 33F5 F005 6EF4 5358
