'Re: [I18n] __STDC_ISO_10646__ is sane'

List:       xfree-i18n
Subject:    Re: [I18n] __STDC_ISO_10646__ is sane
From:       Bruno Haible <haible () ilog ! fr>
Date:       2001-02-28 19:17:03

About the support for implementing a line breaking algorithm and similar,
Ienup Sung writes:

> Additionally, speaking of LC_CTYPE class and character mapping expansion,
> in each locale, you can specify your own classification character class and
> mapping by using wctype, iswctype, wctrans, and towctrans functions by the way.
> Also, in our framework, well, and also in my understanding, most of the major
> commercial Unix systems' i18n framework, people can even add your own
> classification/mapping functions to any of the locale by specifying them
> in their localedef source file and also accompanying locale specific method
> files so that not only you can specify a new character classification 
> and mapping classes but also your own classification functions and that's
> the reason why some of the Asian locales in Solaris have locale-specific
> classification functions like isphonogram, isideogram, iswchar6, iswchar9,
> iswchar21, and also conversion functions like tojhira and tojkata for
> directly localized applications.

Sure, in glibc and Solaris one can create new locales with additional
information. But this doesn't solve the problem of the person who
implements a line breaking algorithm:

   1. The algorithm/application is intended to run on all locales of
      all systems. The person who implements it is different from the
      one who maintains the locales of glibc or Solaris. If an
      application needs an iswideograph function, will you add it to
      all existing Solaris locales?

   2. Locales only provide for additional mappings of type
             wchar_t --> bool
      and
             wchar_t --> wchar_t

      but what the line breaking algorithm needs is a
             wchar_t --> { 0, 1, ..., 19 }
      mapping.
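
To make point 2 concrete, here is a minimal sketch. The enum and both
function names are hypothetical and the class table is drastically
shortened; it is not a real line breaking table. The first function shows
the kind of yes/no answer the locale API gives; the second shows the
multi-valued answer a line breaker needs, which can only be written when
the meaning of a wchar_t value is known.

```c
#include <wchar.h>
#include <wctype.h>

/* Hypothetical, heavily truncated set of line breaking classes. */
enum lb_class { LB_OTHER, LB_SPACE, LB_OPEN, LB_CLOSE, LB_IDEOGRAPHIC };

/* What the locale API offers: a boolean answer per character. */
int is_alpha_in_locale(wint_t wc)
{
    return iswctype(wc, wctype("alpha")) != 0;
}

/* What a line breaker needs: one of N classes per character.
   Returns 1 and stores the class in *out if wchar_t values are known to be
   ISO 10646 code points; returns 0 if nothing can be said about wc. */
int lb_class_of(wchar_t wc, enum lb_class *out)
{
#ifdef __STDC_ISO_10646__
    unsigned long u = (unsigned long) wc;

    if (u == 0x0020)
        *out = LB_SPACE;
    else if (u == 0x0028)
        *out = LB_OPEN;
    else if (u == 0x0029)
        *out = LB_CLOSE;
    else if (u >= 0x4E00 && u <= 0x9FFF)   /* CJK unified ideographs */
        *out = LB_IDEOGRAPHIC;
    else
        *out = LB_OTHER;
    return 1;
#else
    (void) wc;
    (void) out;
    return 0;   /* encoding of wchar_t unknown: cannot classify */
#endif
}
```

On a system that does not define __STDC_ISO_10646__, lb_class_of() can
only give up, which is exactly the problem.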

So I'm back to the claim:

   Without __STDC_ISO_10646__, wchar_t is useless except for the few
   things provided in libc: wctype.h and wcwidth().

> As an example, internally, for Solaris, we took m_strscanfor() and
> ...Each locale who needs sophisticated (and, right and correct) text
> boundary resolution, will and can provide their own locale-specific text
> boundary resolution module

OK, then with special vendor support there is better support for line
breaking. This means:

   Without __STDC_ISO_10646__, wchar_t is useless except for the few
   things provided in libc/other vendor libs: wctype.h and wcwidth()
   and line breaking.

Then someone wants to implement smart quotes: turning U+0022 into
U+201C/U+201D based on context. Or similar things.
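
A rough sketch of such a smart-quotes pass, assuming a deliberately naive
heuristic (a quote at the start of the text or after a space opens, any
other quote closes). The function name is made up for illustration. Note
that the body can only be written under __STDC_ISO_10646__, because it
has to store the code points U+201C and U+201D into wchar_t objects.

```c
#include <stddef.h>
#include <wchar.h>

/* Replace each U+0022 in s, in place, by U+201C (opening) or U+201D
   (closing). Returns the number of quotes replaced, or -1 if wchar_t is
   not known to hold ISO 10646 code points. */
long smarten_quotes(wchar_t *s)
{
#ifdef __STDC_ISO_10646__
    long n = 0;
    size_t i;

    for (i = 0; s[i] != L'\0'; i++) {
        if (s[i] == (wchar_t) 0x0022) {
            /* Naive context rule: opening quote at start or after a space. */
            int opens = (i == 0 || s[i - 1] == (wchar_t) 0x0020);
            s[i] = (wchar_t) (opens ? 0x201C : 0x201D);
            n++;
        }
    }
    return n;
#else
    (void) s;
    return -1;   /* no portable way to name U+201C/U+201D as wchar_t */
#endif
}
```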

No matter how much vendor support you add, there will always be areas
where people are blocked because they don't know which character a
wchar_t value represents.

> this deficiency doesn't really call for the notion of wchar_t ==
> Unicode and I rather think two are totally separate and irrelevant
> topics/issues.

This deficiency calls for either adding __STDC_ISO_10646__, or not
using wchar_t at all except for very trivial programs. And even in
those simple programs,
  - the wide char input functions fgetwc() etc. are better avoided
    because they are unreliable (they just fail when the file is not
    in the expected encoding, without any possibility for error
    recovery),
  - the wide char input and output functions don't mix well with
    getc() and putc(),
  - the towupper, towlower functions are better avoided because
    uppercasing is better done at string level (German eszett -> SS
    etc.) and the towtitle function is missing.
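
As an illustration of the first point, here is a minimal sketch of the
alternative: read bytes with the char* APIs and decode them yourself with
mbrtowc(), so that a bad byte can be replaced and skipped instead of
aborting the whole read. The function name and the choice of L'?' as
replacement are assumptions made for the example.

```c
#include <stddef.h>
#include <string.h>
#include <wchar.h>

/* Decode inlen bytes in the current locale's encoding into out[].
   Each undecodable byte becomes L'?' and decoding resumes at the next
   byte. Returns the number of wide characters written; *errors receives
   the number of replacements made. */
size_t decode_lenient(const char *in, size_t inlen,
                      wchar_t *out, size_t outcap, size_t *errors)
{
    mbstate_t st;
    size_t i = 0, n = 0;

    memset(&st, 0, sizeof st);
    *errors = 0;
    while (i < inlen && n < outcap) {
        wchar_t wc;
        size_t r = mbrtowc(&wc, in + i, inlen - i, &st);

        if (r == (size_t) -1 || r == (size_t) -2) {
            /* Invalid or truncated sequence: substitute, skip one byte,
               and reset the conversion state. */
            out[n++] = L'?';
            (*errors)++;
            i++;
            memset(&st, 0, sizeof st);
        } else {
            out[n++] = wc;
            i += (r == 0) ? 1 : r;   /* r == 0: a NUL was decoded */
        }
    }
    return n;
}
```

With fgetwc() the caller only sees WEOF and EILSEQ and has no defined
way to continue; here the recovery policy is in the application's hands.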

What remains is very very little. The only thing the wchar_t APIs
offer that cannot be done with char* APIs is wcwidth().

Bruno