'Re: Q: The dimension of UNICODE awareness'


List:       kde-devel
Subject:    Re: Q: The dimension of UNICODE awareness
From:       Waldo Bastian <bastian () suse ! de>
Date:       1999-10-14 10:49:10

On Thu, 14 Oct 1999, Boris Povazay wrote:
> Waldo Bastian wrote:
> 
> > Yes. An URL contains by definition only ASCII.
> > 
> > > > Does anybody know something about Unix interna regarding this?
> > > Unix internals use the locale setting, no ? Don't know how
> > > that copes with more than 256 values though... It
> > > probably doesn't.
> > 
> > URL's are very bad in handling unicode stuff.. Basically URLs are 8-bit
> > based. I guess that the only portable way to make a URL of a Unicode
> > filename is to encode the filename with UTF8 and then encode this UTF8
> > filename with the normal URL encoding stuff. That means that the URL of
> > a korean filename looks horrible... even for koreans.
> > 
> > Maybe someone can check if there is an RFC in the making which covers
> > URLs & Unicode?
> 
> URIs are currently either ASCII or interpreted as UTF-8. The following
> paragraph (http://www.w3.org/TR/WD-charmod#URIs) holds the information
> concerning the current status:
> ---
> 5. Character Encoding in URIs 
> 
>         According to the current definition [RFC2396], URIs are
>         restricted to a subset of US-ASCII. There is also an escaping
>         mechanism to encode arbitrary byte values using the %HH
>         convention, but because in general, the mapping from
>         characters to bytes is not defined, this is of limited use.
>         To avoid future incompatibilities, W3C specifications MUST
>         include the following paragraph by reference:
> 
>              For all syntactic elements in the format/protocol which
>              are being interpreted as URIs, characters that are
>              syntactically not allowed by the generic URI syntax
>              (i.e. all non-ASCII characters plus the excluded
>              characters [RFC2396, Section 2.4.3]) MUST be treated as
>              follows: Each such character is represented in UTF-8 as
>              one or more bytes, each of these bytes is escaped with
>              the URI escaping mechanism (i.e. converted to %HH, where
>              HH is the hexadecimal notation of the byte value), and
>              the original character is replaced by the resulting
>              character sequence.
> 
>         Example: In the URI <http://www.w3.org/People/Dürst/>, the
>         character "ü" is not allowed. The representation of "ü" in
>         UTF-8 consists of two bytes with the values 0xC3 and 0xBC.
>         The URI is therefore converted to
>         <http://www.w3.org/People/D%C3%BCrst/>.
> 
>         Note: The intent of this is not to freeze the definitions of
>         URIs to a subset of US-ASCII characters forever, but to assure
>         that W3C technology correctly and predictably interacts with
>         systems that are based on the current definition of URIs
>         while not inhibiting a future extension of the URI definition.
> 
>         Note: Current W3C specifications already contain provisions in
>         accordance with the above. For [XML 1.0], please see Section
>         4.2.2, External Entities. For [HTML 4.0], please see Appendix
>         B.2.1: Non-ASCII characters in URI attribute values, which
>         also contains some provisions for backwards compatibility.
>         Further information and links can be found at [I18NURI].
> ---
> The following paragraph
> (http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1) shows what
> can be expected in the future (since everything is going to be
> UNICODE-based):
> ---
> B.2.1 Non-ASCII characters in URI attribute values
> 
> Although URIs do not contain non-ASCII values (see [URI], section 2.1)
> authors sometimes specify them in attribute values expecting URIs
> (i.e., defined with %URI; in the DTD). For instance, the following
> href value is illegal:
> 
>   <A href="http://foo.org/Håkon">...</A>
> 
> We recommend that user agents adopt the following convention for
> handling non-ASCII characters in such cases: 
> 
>    1. Represent each character in UTF-8 (see [RFC2044]) as one or more
>       bytes.
>    2. Escape these bytes with the URI escaping mechanism (i.e., by
>       converting each byte to %HH, where HH is the hexadecimal
>       notation of the byte value).
> 
> This procedure results in a syntactically legal URI (as defined in
> [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of
> the character encoding to which the HTML document carrying the URI may
> have been transcoded.
> 
>   Note. Some older user agents trivially process URIs in HTML using
>   the bytes of the character encoding in which the document was
>   received. Some older HTML documents rely on this practice and break
>   when transcoded. User agents that want to handle these older
>   documents should, on receiving a URI containing characters outside
>   the legal set, first use the conversion based on UTF-8. Only if the
>   resulting URI does not resolve should they try constructing a URI
>   based on the bytes of the character encoding in which the document
>   was received.
> 
>   Note. The same conversion based on UTF-8 should be applied to
>   values of the name attribute for the A element.
> ---
> 
> Hope it helps!

Great. So both documents say basically the same thing: convert the
Unicode string to UTF-8, then URL-encode the resulting UTF-8 bytes.
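
To make the two steps concrete, here is a minimal sketch in plain C++.
It is not KUrl code: percentEncodeUtf8() is a hypothetical helper, it
assumes its input is already UTF-8, and it is meant for a single
filename or path segment, so "/" gets escaped too. The set of characters
left unescaped is a deliberately conservative assumption (ASCII letters,
digits and "-", "_", ".", "~").

```cpp
#include <cassert>
#include <string>

// Hypothetical helper, not part of KUrl: percent-encode one filename or
// path segment. The input is assumed to be UTF-8 already, so each byte
// of a multi-byte character becomes its own %HH escape.
std::string percentEncodeUtf8(const std::string &utf8)
{
    static const char hex[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : utf8) {
        // Conservative assumption: only these ASCII characters pass
        // through unescaped.
        const bool unreserved =
            (c >= 'A' && c <= 'Z') || (c >= 'a' && c <= 'z') ||
            (c >= '0' && c <= '9') ||
            c == '-' || c == '_' || c == '.' || c == '~';
        if (unreserved) {
            out += static_cast<char>(c);
        } else {
            out += '%';
            out += hex[c >> 4];    // high nibble
            out += hex[c & 0x0F];  // low nibble
        }
    }
    return out;
}

// Usage: the example from the W3C draft quoted above. "ü" is the two
// UTF-8 bytes 0xC3 0xBC.
void exampleEncode()
{
    assert(percentEncodeUtf8("D\xC3\xBCrst") == "D%C3%BCrst");
}
```

A real implementation would need a different set of allowed characters
for each URL component (host, path, query), which this sketch ignores.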

This means that the URL of a Korean filename looks like shit. So I
think we should distinguish between the "URL encoding according to the
RFCs" and a "printable URL". In the printable version we can unescape
everything as long as the result remains unambiguous (we should still
escape ":", "/", "?" and maybe a few others).

So we would get, e.g.,
   QCString KUrl::url() 
and 
   QString KUrl::printableURL()
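
A rough sketch of what the "printable" direction could look like, again
in plain C++ rather than Qt. printableUrl() and hexValue() are
hypothetical names, and the list of characters that stay escaped (":",
"/", "?", "#", "%", plus space and control characters) is only an
assumption about what "a few others" might include. The result is still
a UTF-8 byte string; a QString version would additionally decode those
bytes to Unicode.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Value of one hex digit, or -1 if the character is not a hex digit.
static int hexValue(char ch)
{
    if (ch >= '0' && ch <= '9') return ch - '0';
    if (ch >= 'a' && ch <= 'f') return ch - 'a' + 10;
    if (ch >= 'A' && ch <= 'F') return ch - 'A' + 10;
    return -1;
}

// Hypothetical helper: turn %HH escapes back into raw bytes unless that
// would make the URL ambiguous.
std::string printableUrl(const std::string &encoded)
{
    // Assumed set of characters that must stay escaped.
    static const std::string keepEscaped = ":/?#%";
    std::string out;
    for (std::size_t i = 0; i < encoded.size(); ++i) {
        if (encoded[i] == '%' && i + 2 < encoded.size()) {
            const int hi = hexValue(encoded[i + 1]);
            const int lo = hexValue(encoded[i + 2]);
            if (hi >= 0 && lo >= 0) {
                const int byte = hi * 16 + lo;
                // Space and control characters also stay escaped.
                const bool safe = byte > 0x20 && byte != 0x7F &&
                    keepEscaped.find(static_cast<char>(byte)) ==
                        std::string::npos;
                if (safe) {
                    out += static_cast<char>(byte);
                    i += 2;  // skip the two hex digits
                    continue;
                }
            }
        }
        out += encoded[i];  // copy unchanged, including kept escapes
    }
    return out;
}

// Usage: the escaped W3C example becomes readable again, while an
// escaped "/" inside a name stays escaped.
void examplePrintable()
{
    assert(printableUrl("http://www.w3.org/People/D%C3%BCrst/") ==
           "http://www.w3.org/People/D\xC3\xBCrst/");
    assert(printableUrl("a%2Fb") == "a%2Fb");
}
```

The sketch does not check that the unescaped bytes form valid UTF-8; a
real version would probably want to leave invalid sequences escaped.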

Opinions?

Cheers,
Waldo
