
List:       lyx-devel
Subject:    Re: [Fwd: The design of Encoding class]
From:       "Asger Alstrup Nielsen" <alstrup () diku ! dk>
Date:       1999-03-31 21:42:40

> Let me give more detail about the story. For some languages, most of them
> Asian, we have ASCII glyphs in our encoding. But, unlike most Western
> encodings (like the Latin series), we traditionally separate ASCII and Asian
> glyphs into two fonts. It's not required, but it's the current status.
> Therefore, X invented XFontSet, which is a composite font including several
> independent fonts. For example, a fontset may be defined as
> 
>   defaultFontSet : -*-24-*-iso8859-1, -*-24-*-big5-0
>
> The underlying mechanism of the X library will parse the input string and
> use one of these two fonts to draw the glyphs.

In what encoding does the X library want the strings to use these fontsets?

> For Western languages which can fit within an 8-bit encoding, we don't need
> step 3 because they are already fixed-width encodings. That is to say, the
> document representation encoding and the font encoding are the same.
> 
> For Asian languages which can't fit within an 8-bit encoding, the document
> representation encoding may differ from the font encoding if we don't
> use a font set.

I don't know much about font set, but if it makes things easier, I don't see
why we should not support it.

> The X library may use the mbstowcs of the C library or its own mbstowcs,
> depending on a compile-time option. If it uses that of the C library, the
> width of a wide character depends on the implementation of the C library.
> 
> Therefore, it seems that the Encoding class is just another implementation of
> mbstowcs.  Why do we need to do it again?

To some extent, the Encoding class is just another implementation of mbstowcs.
However, it is extended and improved in several important ways:

mbstowcs converts a multibyte string to a wide character string according to
the current locale.  However, as far as I know, the current locale is not well
defined, and it's furthermore not portable.  We don't know which encoding the
string is in after we have converted it, except that it fits the current
locale!

mbstowcs is restricted to converting from a variable length encoding to a fixed
width encoding.  The Encoding class is designed to convert from any encoding to
any other encoding.
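
To make the "any encoding to any other encoding" idea concrete, here is a
minimal sketch that uses Unicode code points as the pivot between two
converters.  Only the names Encoding, toUnicode and fromUnicode come from this
thread; the signatures, the Latin1 and Utf8 classes and the convert helper are
assumptions made for illustration, and the UTF-8 decoder does no validation of
malformed input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical interface: every encoding converts to and from code points.
class Encoding {
public:
    virtual ~Encoding() = default;
    virtual std::u32string toUnicode(const std::string& bytes) const = 0;
    virtual std::string fromUnicode(const std::u32string& text) const = 0;
};

// ISO-8859-1: fixed width, the byte value equals the code point.
class Latin1 : public Encoding {
public:
    std::u32string toUnicode(const std::string& bytes) const override {
        std::u32string out;
        for (unsigned char b : bytes) out.push_back(b);
        return out;
    }
    std::string fromUnicode(const std::u32string& text) const override {
        std::string out;
        for (char32_t c : text)  // unrepresentable code points become '?'
            out.push_back(c < 0x100 ? static_cast<char>(c) : '?');
        return out;
    }
};

// UTF-8: variable length, one to four bytes per code point.
class Utf8 : public Encoding {
public:
    std::u32string toUnicode(const std::string& bytes) const override {
        std::u32string out;
        std::size_t i = 0;
        while (i < bytes.size()) {
            unsigned char lead = static_cast<unsigned char>(bytes[i++]);
            // Number of continuation bytes implied by the lead byte.
            int extra = lead < 0x80 ? 0 : lead < 0xE0 ? 1 : lead < 0xF0 ? 2 : 3;
            char32_t c = extra == 0 ? lead : (lead & (0x3F >> extra));
            for (; extra > 0 && i < bytes.size(); --extra, ++i)
                c = (c << 6) | (static_cast<unsigned char>(bytes[i]) & 0x3F);
            out.push_back(c);
        }
        return out;
    }
    std::string fromUnicode(const std::u32string& text) const override {
        std::string out;
        for (char32_t c : text) {
            assert(c <= 0x10FFFF);
            if (c < 0x80) {
                out.push_back(static_cast<char>(c));
            } else if (c < 0x800) {
                out.push_back(static_cast<char>(0xC0 | (c >> 6)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            } else if (c < 0x10000) {
                out.push_back(static_cast<char>(0xE0 | (c >> 12)));
                out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            } else {
                out.push_back(static_cast<char>(0xF0 | (c >> 18)));
                out.push_back(static_cast<char>(0x80 | ((c >> 12) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | ((c >> 6) & 0x3F)));
                out.push_back(static_cast<char>(0x80 | (c & 0x3F)));
            }
        }
        return out;
    }
};

// Any-to-any conversion: decode with the source, encode with the target.
std::string convert(const Encoding& from, const Encoding& to,
                    const std::string& bytes) {
    return to.fromUnicode(from.toUnicode(bytes));
}
```

With this shape, the source and target encodings are named explicitly at the
call site, for instance convert(Latin1(), Utf8(), text), instead of being
implied by whatever locale happens to be active.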

I'm not exactly sure what you intend to do, but I think it's something like
this:

We mostly ignore the issue of encodings, and just make sure that the input,
output and displaying encodings are compatible, so that we don't have to do any
explicit encoding management and conversions, except the conversions the C
locale library can provide.

The problem with this approach is that we are caging ourselves.  With this
approach, we cannot convert one document to another encoding, paste something
from one document into another with a different encoding, convert parts of it
to Unicode insets, or even uppercase a letter in the general case.

We won't be able to make a document portable:  On one machine, the Danish
locale might mean one thing, but on another, it might be something else.  The
result:  The document is not portable!

This is much the same situation as we have with LyX today, and I for one want
to get over these limitations.

I want LyX to free the user of the trouble of encodings.  And the best way to
do that is to make things as flexible as possible:  The user can decide every
encoding in the process specifically, as long as the necessary support has been
implemented.

I understand that providing Unicode conversion might be problematic for some
Asian encodings, and we have to deal with this.  So I propose that we for
starters adopt a pseudo-Unicode encoding, where we simply map the Asian
encodings that are hard to convert to Unicode directly into the Unicode space,
knowing full well that the mapping does not conform to the standard.

This way, we will get most of the claimed benefits:  We can paste documents
into each other, change encodings, etc., but with the disclaimer that an
exported file in Unicode might not really be Unicode.
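
One possible reading of the pseudo-Unicode idea, as a rough sketch only:  each
two-byte character of a Big5-like encoding is packed into a single value and
shifted into a private-use range, so it round-trips without a real conversion
table.  The function names, the choice of U+F0000 as the base, and the rule
that any byte of 0x80 or above starts a two-byte character are all assumptions
made for this illustration; none of them comes from the design doc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Assumed base: start of Supplementary Private Use Area-A (U+F0000).
constexpr char32_t kPseudoBase = 0xF0000;

// Map bytes to pseudo-Unicode: ASCII passes through unchanged, and each
// lead/trail byte pair becomes kPseudoBase + (lead << 8 | trail).
std::u32string toPseudoUnicode(const std::string& bytes) {
    std::u32string out;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        unsigned char lead = static_cast<unsigned char>(bytes[i]);
        if (lead < 0x80 || i + 1 == bytes.size()) {
            out.push_back(lead);  // ASCII, or a dangling lead byte kept as-is
            continue;
        }
        unsigned char trail = static_cast<unsigned char>(bytes[++i]);
        out.push_back(kPseudoBase + (lead << 8 | trail));
    }
    return out;
}

// Inverse mapping: unpack pseudo code points back into the original bytes.
std::string fromPseudoUnicode(const std::u32string& text) {
    std::string out;
    for (char32_t c : text) {
        if (c >= kPseudoBase && c <= kPseudoBase + 0xFFFF) {
            char32_t packed = c - kPseudoBase;
            out.push_back(static_cast<char>(packed >> 8));
            out.push_back(static_cast<char>(packed & 0xFF));
        } else {
            assert(c < 0x100);  // this sketch handles nothing else
            out.push_back(static_cast<char>(c));
        }
    }
    return out;
}
```

The result is not real Unicode, which is exactly the disclaimer above, but the
bytes survive a round trip, and a proper converter could later replace the
mapping without changing its callers.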

At least the iso-8859-x encodings will work as intended, and over time, we
might be able to provide the converters for the hard Asian encodings as well.

> I think we can use mbstowcs to implement the Encoding class.
> With this approach, we don't need to write a converter for every language.

If you happen to know that mbstowcs is well defined and standardized for
Asian encodings, please go ahead and use it to implement the converters!  I
don't care how you implement the converters, I just care that we get them.

> Agree! toUnicode and fromUnicode should be used on the Unicode inset only.
> The question is who will use it? The Unicode inset is designed to let users
> put a couple of glyphs of another language in their document. For example,
> put some Chinese glyphs in a German document.

toUnicode and fromUnicode are mostly a hidden layer that is only used for the
purposes mentioned in the design doc.  You will also find the answer to the
question of who will use it there.

> People who use Chinese will
> not use the Unicode inset to put Chinese glyphs in their document.

Specifically, the Unicode inset is not meant for this.

> The next question is how LyX supports the inset. I think a call to
> XDrawString16 is enough and it's very efficient.

Probably.

> What I am concerned about is how Encoding will be used in LyX. Is it used by
> the Unicode inset only? I think so according to its current structure. If you
> want to use it in the text buffer, I suggest we need two more methods,
> toWidthCharacter and fromWidthCharacter, to translate between variable
> length encoding and fixed length encoding.

This is already part of the design.  We will define a set of encodings:  Some
of them are fixed width, and others are variable length encodings.  As far as
the support has been implemented, we will be able to convert between those
freely.
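
As an illustration of why the fixed width versus variable length distinction
matters for something like a text buffer, the sketch below compares how the
byte offset of the n-th character is found in each kind.  The function names
and the Big5-like rule (bytes below 0x80 stand alone, any other byte starts a
two-byte character) are assumptions for this example, not part of the design.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <string>

// Fixed width: the offset of character n is plain arithmetic.
std::size_t offsetFixed(std::size_t n, std::size_t bytesPerChar) {
    assert(bytesPerChar > 0);
    return n * bytesPerChar;
}

// Variable length: the string must be scanned from the start, because the
// size of each character is only known after looking at its first byte.
std::size_t offsetVariable(const std::string& bytes, std::size_t n) {
    std::size_t i = 0;
    while (n > 0 && i < bytes.size()) {
        i += static_cast<unsigned char>(bytes[i]) < 0x80 ? 1 : 2;
        --n;
    }
    return std::min(i, bytes.size());  // clamp if the last pair is truncated
}
```

Converting a variable length document encoding to a fixed width one on load
would turn the scan into the arithmetic case, at the cost of more memory per
character.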

This will be used to provide the flexibility and freedom I mentioned above.

If proper support was implemented, this would among other things imply that
Chinese could be written in Unicode internally, and exported as Big5, if
somebody thought that would be fun.  Or the other way around:  Use Big5 as
document encoding, and export as Unicode.  Or keep everything in Big5.  Or in
Unicode.  Or whatever else you can think of.

Greets,

Asger

