List: kde-devel
Subject: Re: Q: The dimension of UNICODE awareness
From: Waldo Bastian <bastian () suse ! de>
Date: 1999-10-14 10:49:10
On Thu, 14 Oct 1999, Boris Povazay wrote:
> Waldo Bastian wrote:
>
> > Yes. A URL contains by definition only ASCII.
> >
> > > > Does anybody know something about Unix internals regarding this?
> > > Unix internals use the locale setting, no? Don't know how
> > > that copes with more than 256 values though... It
> > > probably doesn't.
> >
> > URLs are very bad at handling Unicode stuff. Basically URLs are 8-bit
> > based. I guess that the only portable way to make a URL of a Unicode
> > filename is to encode the filename with UTF-8 and then encode this UTF-8
> > filename with the normal URL encoding stuff. That means that the URL of
> > a Korean filename looks horrible... even for Koreans.
> >
> > Maybe someone can check if there is an RFC in the making which covers
> > URLs & Unicode?
>
> URIs are currently either ASCII or interpreted as UTF-8. The following
> paragraph (http://www.w3.org/TR/WD-charmod#URIs) holds the information
> concerning the current status:
> ---
> 5. Character Encoding in URIs
>
> According to the current definition [RFC2396], URIs are restricted to a
> subset of US-ASCII. There is also an escaping mechanism to encode
> arbitrary byte values using the %HH convention, but because in general,
> the mapping from characters to bytes is not defined, this is of limited
> use. To avoid future incompatibilities, W3C specifications MUST include
> the following paragraph by reference:
>
> For all syntactic elements in the format/protocol which are
> being interpreted as URIs, characters that are syntactically not
> allowed by the generic URI syntax (i.e. all non-ASCII
> characters plus the excluded characters [RFC2396, Section 2.4.3])
> MUST be treated as follows: Each such character is
> represented in UTF-8 as one or more bytes, each of these bytes is
> escaped with the URI escaping mechanism (i.e. converted to
> %HH, where HH is the hexadecimal notation of the byte value),
> and the original character is replaced by the resulting
> character sequence.
>
> Example: In the URI <http://www.w3.org/People/Dürst/>, the
> character "ü" is not allowed. The representation of "ü" in UTF-8
> consists of two bytes with the values 0xC3 and 0xBC. The URI is
> therefore converted to <http://www.w3.org/People/D%C3%BCrst/>.
>
> Note: The intent of this is not to freeze the definitions of
> URIs to a subset of US-ASCII characters forever, but to assure that W3C
> technology correctly and predictably interacts with systems that
> are based on the current definition of URIs while not inhibiting a
> future extension of the URI definition.
>
> Note: Current W3C specifications already contain provisions in
> accordance with the above. For [XML 1.0], please see Section 4.2.2,
> External Entities. For [HTML 4.0], please see Appendix B.2.1:
> Non-ASCII characters in URI attribute values, which also contains some
> provisions for backwards compatibility. Further information and
> links can be found at [I18NURI].
> ---
> The following paragraph
> (http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1) shows what
> can be expected in the future (since everything is going to be
> UNICODE-based):
> ---
> B.2.1 Non-ASCII characters in URI attribute values
>
> Although URIs do not contain non-ASCII values (see [URI], section 2.1)
> authors sometimes specify them in attribute values expecting URIs (i.e.,
> defined with %URI; in the DTD). For instance, the following href value
> is illegal:
>
> <A href="http://foo.org/Håkon">...</A>
>
> We recommend that user agents adopt the following convention for
> handling non-ASCII characters in such cases:
>
> 1. Represent each character in UTF-8 (see [RFC2044]) as one or more
>    bytes.
> 2. Escape these bytes with the URI escaping mechanism (i.e., by
>    converting each byte to %HH, where HH is the hexadecimal notation of
>    the byte value).
>
> This procedure results in a syntactically legal URI (as defined in
> [RFC1738], section 2.2 or [RFC2141], section 2) that is independent of
> the character encoding to which the HTML document carrying the URI may
> have been transcoded.
>
> Note. Some older user agents trivially process URIs in HTML using the
> bytes of the character encoding in which the document was received. Some
> older HTML documents rely on this practice and break when transcoded.
> User agents that want to handle these older documents should, on
> receiving a URI containing characters outside the legal set, first use
> the conversion based on UTF-8. Only if the resulting URI does not
> resolve should they try constructing a URI based on the bytes of the
> character encoding in which the document was received.
>
> Note. The same conversion based on UTF-8 should be applied to values
> of the name attribute for the A element.
> ---
>
> Hope it helps!
Great. So they basically say the same thing: go from Unicode to UTF-8, and
then from UTF-8 to URL-encoding.
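
A rough sketch of those two steps in plain C++ (no Qt, just std::string),
under a few assumptions: the helper names appendUtf8 and encodeUrlComponent
are made up for illustration, the input is a sequence of Unicode code
points, and only ASCII letters, digits and "-", "_", ".", "~" are left
unescaped, which is a deliberately conservative choice rather than the
exact character set from the RFCs:

```cpp
#include <cassert>
#include <string>

// Append the UTF-8 encoding of one Unicode code point to `out`.
// (Hypothetical helper, not part of any KDE or Qt API.)
void appendUtf8(std::string& out, char32_t cp) {
    // Only valid Unicode scalar values: no surrogates, nothing above U+10FFFF.
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));
    if (cp < 0x80) {
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
}

// Step 1: Unicode -> UTF-8 bytes. Step 2: every byte outside the
// conservative "safe" set becomes %HH.
std::string encodeUrlComponent(const std::u32string& text) {
    static const char hex[] = "0123456789ABCDEF";

    std::string utf8;
    for (char32_t cp : text) {
        appendUtf8(utf8, cp);
    }

    std::string out;
    for (unsigned char b : utf8) {
        const bool safe = (b >= 'A' && b <= 'Z') || (b >= 'a' && b <= 'z') ||
                          (b >= '0' && b <= '9') ||
                          b == '-' || b == '_' || b == '.' || b == '~';
        if (safe) {
            out += static_cast<char>(b);
        } else {
            out += '%';
            out += hex[b >> 4];
            out += hex[b & 0x0F];
        }
    }
    return out;
}
```

With this, encodeUrlComponent(U"D\u00FCrst") gives "D%C3%BCrst", matching
the W3C example quoted above. A single Hangul syllable such as U+D55C
already turns into three escaped bytes ("%ED%95%9C").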
This means that the URL of a Korean filename looks like shit. So I
think we should distinguish between the "URL encoding according to the
RFCs" and a "printable URL". In the printable version we can unescape
everything as long as it remains unambiguous. (We should still escape
":", "/", "?" and maybe a few others.)
So we would get e.g.
QCString KUrl::url()
and
QString KUrl::printableURL()
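
To make the "printable" idea concrete, here is a minimal sketch in plain
C++, not a proposal for the actual KUrl code. The function name
printableUrl, the std::string in/out types and the exact set of bytes that
stay escaped (":", "/", "?", plus "#", "%", space and control characters as
my guess at the "few others") are all assumptions. The result is a UTF-8
byte string; a real version would presumably convert that to a QString.

```cpp
#include <cassert>
#include <string>

// Value of one hex digit, or -1 if `c` is not a hex digit.
int hexValue(char c) {
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Unescape %HH sequences for display, except where the decoded byte
// would make the URL ambiguous. (Hypothetical helper.)
std::string printableUrl(const std::string& url) {
    const std::string keepEscaped = ":/?#%";  // assumed set, see above

    std::string out;
    for (std::size_t i = 0; i < url.size(); ++i) {
        if (url[i] == '%' && i + 2 < url.size()) {
            const int hi = hexValue(url[i + 1]);
            const int lo = hexValue(url[i + 2]);
            if (hi >= 0 && lo >= 0) {
                assert(hi < 16 && lo < 16);  // hexValue's contract
                const unsigned char b = static_cast<unsigned char>(hi * 16 + lo);
                const bool unprintable = b <= 0x20 || b == 0x7F;
                const bool delimiter =
                    keepEscaped.find(static_cast<char>(b)) != std::string::npos;
                if (!unprintable && !delimiter) {
                    out += static_cast<char>(b);  // safe to show decoded
                    i += 2;                       // skip the two hex digits
                    continue;
                }
            }
        }
        // Ordinary character, malformed escape, or an escape we keep as-is.
        out += url[i];
    }
    return out;
}
```

For "http://www.w3.org/People/D%C3%BCrst/" this returns the URL with the
UTF-8 bytes of "ü" in place of "%C3%BC", while "a%2Fb%3Fc" comes back
unchanged because unescaping it would introduce a new "/" and "?".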
Opinions?
Cheers,
Waldo