'Re: Q: The dimension of UNICODE awareness'

List:       kde-devel
Subject:    Re: Q: The dimension of UNICODE awareness
From:       Boris Povazay <boris () jouh ! at>
Date:       1999-10-14 9:44:00

Waldo Bastian wrote:

> Yes. An URL contains by definition only ASCII.
> 
> > > Does anybody know something about Unix interna regarding this?
> > Unix internals use the locale setting, no ? Don't know how
> > that copes with more than 256 values though... It
> > probably doesn't.
> 
> URL's are very bad in handling unicode stuff.. Basically URLs are 8-bit
> based. I guess that the only portable way to make a URL of a Unicode
> filename is to encode the filename with UTF8 and then encode this UTF8
> filename with the normal URL encoding stuff. That means that the URL of
> a korean filename looks horrible... even for koreans.
> 
> Maybe someone can check if there is an RFC in the making which covers
> URLs & Unicode?

URIs are currently either ASCII or interpreted as UTF-8. The following
section (http://www.w3.org/TR/WD-charmod#URIs) describes the current
status:
---
5. Character Encoding in URIs 

        According to the current definition [RFC2396], URIs are
        restricted to a subset of US-ASCII. There is also an escaping
        mechanism to encode arbitrary byte values using the %HH
        convention, but because in general, the mapping from characters
        to bytes is not defined, this is of limited use. To avoid future
        incompatibilities, W3C specifications MUST include the following
        paragraph by reference:

             For all syntactic elements in the format/protocol which
             are being interpreted as URIs, characters that are
             syntactically not allowed by the generic URI syntax (i.e.
             all non-ASCII characters plus the excluded characters
             [RFC2396, Section 2.4.3]) MUST be treated as follows: Each
             such character is represented in UTF-8 as one or more
             bytes, each of these bytes is escaped with the URI
             escaping mechanism (i.e. converted to %HH, where HH is the
             hexadecimal notation of the byte value), and the original
             character is replaced by the resulting character sequence.

        Example: In the URI <http://www.w3.org/People/Dürst/>, the
        character "ü" is not allowed. The representation of "ü" in
        UTF-8 consists of two bytes with the values 0xC3 and 0xBC. The
        URI is therefore converted to
        <http://www.w3.org/People/D%C3%BCrst/>.

        Note: The intent of this is not to freeze the definitions of
        URIs to a subset of US-ASCII characters forever, but to assure
        that W3C technology correctly and predictably interacts with
        systems that are based on the current definition of URIs while
        not inhibiting a future extension of the URI definition.

        Note: Current W3C specifications already contain provisions in
        accordance with the above. For [XML 1.0], please see Section
        4.2.2, External Entities. For [HTML 4.0], please see Appendix
        B.2.1: Non-ASCII characters in URI attribute values, which also
        contains some provisions for backwards compatibility. Further
        information and links can be found at [I18NURI].
---
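
As a rough illustration of the rule quoted above, here is a minimal
Python sketch. It is my own, not part of the W3C draft. The function
name escape_uri and the EXCLUDED set are made up for this example, and I
assume that "%" and "#" should be left alone so that existing escapes
and fragment identifiers survive, even though RFC 2396 section 2.4.3
lists them among the excluded characters.

```python
# Hypothetical helper, not taken from any specification or library.
# Printable ASCII characters that RFC 2396 section 2.4.3 excludes.
# Assumption: "%" and "#" are deliberately left out of this set so that
# already-escaped sequences and fragment identifiers are not re-escaped.
EXCLUDED = set('<>"{}|\\^[]`')


def escape_uri(uri: str) -> str:
    """Replace disallowed characters by the %HH form of their UTF-8 bytes."""
    out = []
    for ch in uri:
        code = ord(ch)
        # Controls and space (< 0x21), DEL and all non-ASCII (> 0x7E),
        # plus the excluded printable ASCII characters, get escaped.
        if code < 0x21 or code > 0x7E or ch in EXCLUDED:
            out.append("".join(f"%{b:02X}" for b in ch.encode("utf-8")))
        else:
            out.append(ch)
    return "".join(out)


print(escape_uri("http://www.w3.org/People/Dürst/"))
```

With the URI from the example, "ü" becomes the two bytes 0xC3 and 0xBC
and the result is http://www.w3.org/People/D%C3%BCrst/.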
The following section
(http://www.w3.org/TR/REC-html40/appendix/notes.html#h-B.2.1) shows what
can be expected in the future, since everything is going to be
Unicode-based:
---
B.2.1 Non-ASCII characters in URI attribute values

Although URIs do not contain non-ASCII values (see [URI], section 2.1)
authors sometimes specify them in attribute values expecting URIs (i.e.,
defined with %URI; in the DTD). For instance, the following href value
is illegal:

  <A href="http://foo.org/Håkon">...</A>

We recommend that user agents adopt the following convention for
handling non-ASCII characters in such cases: 

   1. Represent each character in UTF-8 (see [RFC2044]) as one or more
      bytes.
   2. Escape these bytes with the URI escaping mechanism (i.e., by
      converting each byte to %HH, where HH is the hexadecimal notation
      of the byte value).

This procedure results in a syntactically legal URI (as defined in
[RFC1738], section 2.2 or [RFC2141], section 2) that is independent of
the character encoding to which the HTML document carrying the URI may
have been transcoded.

  Note. Some older user agents trivially process URIs in HTML using the
  bytes of the character encoding in which the document was received.
  Some older HTML documents rely on this practice and break when
  transcoded. User agents that want to handle these older documents
  should, on receiving a URI containing characters outside the legal
  set, first use the conversion based on UTF-8. Only if the resulting
  URI does not resolve should they try constructing a URI based on the
  bytes of the character encoding in which the document was received.

  Note. The same conversion based on UTF-8 should be applied to values
of the name attribute for the A element. 
---
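
The fallback order described in the first note of B.2.1 could be
sketched roughly like this in Python. This is only my reading of the
note, not code from the HTML spec. The names candidate_uris and SAFE are
invented for the example, and the caller is assumed to try the returned
URIs in order until one resolves.

```python
from urllib.parse import quote

# Assumption: reserved characters and "%" are kept as-is so that the URI
# structure and any existing escapes are preserved.
SAFE = ":/?#[]@!$&'()*+,;=%"


def candidate_uris(href: str, doc_encoding: str) -> list:
    """Return URIs to try in order: UTF-8 based first, then legacy bytes."""
    candidates = [quote(href, safe=SAFE, encoding="utf-8")]
    try:
        # Older practice: escape the bytes of the document's own encoding.
        legacy = quote(href, safe=SAFE, encoding=doc_encoding)
    except UnicodeEncodeError:
        # The character does not exist in that encoding; no fallback.
        return candidates
    if legacy != candidates[0]:
        candidates.append(legacy)
    return candidates


print(candidate_uris("http://foo.org/Håkon", "iso-8859-1"))
```

For the href from the example this gives http://foo.org/H%C3%A5kon
first and http://foo.org/H%E5kon second, since "å" is 0xC3 0xA5 in
UTF-8 and 0xE5 in ISO-8859-1.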

Hope it helps!

ciao BoP
-- 
http://kde.jouh.at - bp@jouh.at
DI Boris Povazay
