
List:       kfm-devel
Subject:    Re: Gecko
From:       Lars Knoll <Lars.Knoll () mpi-hd ! mpg ! de>
Date:       2000-04-10 8:26:20

On Mon, 10 Apr 2000, Andreas Pour wrote:

> Lars Knoll wrote:
> > 
> > On Sun, 9 Apr 2000, Mukhsein Johari wrote:
> > 
> > > > Unicode yes, BiDi no. They have some screenshots, but if you can read
> > > > Arabic or Hebrew, you'll notice that all they do is reverse single
> > > > words. They don't reverse the ordering of the words. So it would be like
> > > > reading a sentence like the following in English:
> > > > .car a is This
> > >
> > > Hmm? How can this be? Isn't the word order dependent on how the file was
> > > 'written'? Also, since you have _whole_ words displayed correctly,
> > > wouldn't the reversal of word order be trivial?
> > 
> > No. The file is written in logical order (the order you would pronounce
> > the thing when speaking). The BiDi algorithm then transforms it to the
> > visual order for printing. So a text like the following (capital letters
> > stand for Hebrew characters, lowercase ones for English):
> > THIS IS HEBREW WITH two english WORDS INSIDE.
> > would get reordered to:
> > .EDISNI SDROW two english HTIW WERBEH SI SIHT
> > 
> > What Mozilla does is reverse the letters of every Hebrew word without
> > putting the words themselves into the correct order.
> > 
> > Anyway, the reversal of word order is the hard part; reversing the order
> > inside a single word is more or less trivial. An example: Imagine you have
> > some markup in the sentence above:
> > THIS IS HEBREW <b>WITH two</b> english WORDS INSIDE.
> > The output on the screen will have to look like:
> > .EDISNI SDROW <b>two</b> english <b>HTIW</b> WERBEH SI SIHT
> > 
> > See the problem?
> 
> I would think reversing words is harder -- you have to deal with
> punctuation and when it separates a word or is part of a word.  For
> example, 'pseudo-transparency' must be treated as one word, but
> 'something-other than this-is the cause' has a different
> interpretation.  Periods can also be used within a word and to separate
> words.  In other words, tokenizing words is the difficult step.  Once
> you have accomplished that, you just apply whatever formatting code the
> word occurs in to the word individually.
>
> For example, if you have tokenized the sentence above, you find two
> words in between the formatting codes <b>, so you would just apply them
> to each word individually.  Since you have already tokenized the words,
> you would then pick out language sets and reverse any phrases and any
> words within phrases falling within a language set to be so reversed.
> 
> Thus, in your example, step 1 is tokenizing words (and applying
> formatting codes to all words they fall within individually):
> 
> 	Token 1:  THIS             Lang: Hebrew   Format: Reg.
> 	Token 2:  IS               Lang: Hebrew   Format: Reg.
> 	Token 3:  HEBREW           Lang: Hebrew   Format: Reg.
> 	Token 4:  WITH             Lang: Hebrew   Format: Bold
> 	Token 5:  two              Lang: English  Format: Bold
> 	Token 6:  english          Lang: English  Format: Reg.
> 	Token 7:  WORDS            Lang: Hebrew   Format: Reg.
> 	Token 8:  INSIDE           Lang: Hebrew   Format: Reg.
> 
> Of course your formatting code can be quite a bit more complicated. 
> Once you have your token list, step 2 is to reconstruct in the proper
> letter/word order.  For example, Token 1 is Hebrew, so you reverse
> letters and start at the right; you keep doing this till you come to
> English, then keep the order as is (not sure what you do if a line break
> occurs inside the English, but I'm sure there's a standard for that),

Line breaks are applied in logical order.
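
To illustrate, here is a toy sketch in Python (not the khtml code and not the full Unicode algorithm): the words are first broken into lines in logical order, and only then is each line reordered for display. It uses the convention from this thread (uppercase for Hebrew, lowercase for English), assumes a right-to-left paragraph, and treats punctuation as part of the word it is attached to. The names is_rtl, break_lines and visual_order are made up for this example.

```python
from itertools import groupby


def is_rtl(word):
    # Thread convention (an assumption of this sketch): uppercase = Hebrew.
    return any(c.isupper() for c in word)


def break_lines(words, width):
    """Greedy line breaking on the logical-order word list."""
    lines, current = [], []
    for word in words:
        if current and len(" ".join(current + [word])) > width:
            lines.append(current)
            current = []
        current.append(word)
    if current:
        lines.append(current)
    return lines


def visual_order(words):
    """Reorder one line for display, assuming an RTL paragraph."""
    # Group consecutive words of the same direction into runs.
    runs = [(rtl, list(group)) for rtl, group in groupby(words, key=is_rtl)]
    visual = []
    # The runs are laid out right to left, so walk them in reverse.
    for rtl, run in reversed(runs):
        if rtl:
            # Hebrew run: reverse the word order and the letters.
            visual.extend(word[::-1] for word in reversed(run))
        else:
            # English run: keep words and letters as they are.
            visual.extend(run)
    return " ".join(visual)


logical = "THIS IS HEBREW WITH two english WORDS INSIDE.".split()
for line in break_lines(logical, 24):
    print(visual_order(line).rjust(24))
# prints:
#  two HTIW WERBEH SI SIHT
#    .EDISNI SDROW english
```

With a width of 24 the English phrase is split: "two" ends the first line (at its left edge) and "english" starts the second (at its right edge). Reordering the whole paragraph first and breaking afterwards would give a different, wrong result.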

> then when you get back to Hebrew, start reversing again. So you end up
> with the correct result.
> 
> But maybe I'm missing something :-).

Well, in principle, that's what you have to do. But there are quite a
few things which make the implementation non-trivial. Don't you think
there is a reason that Mozilla still doesn't support it the right
way? And it took me (knowing Hebrew) more than half a year, coming
back to the problem quite often, to find a solution which works (complies
with the Unicode BiDi algorithm) and is fast and flexible enough to be used
for all cases.

The basic BiDi algorithm
(http://www.unicode.org/unicode/reports/tr9/tr9-6.html) is described in a
document of about 10 pages, but the implementation is non-trivial,
especially getting it to work with all the HTML/CSS formatting stuff.
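
As a rough illustration of the formatting problem, here is a toy Python sketch (again not the khtml code, and far simpler than the real algorithm) that runs the <b> example quoted above through a word-level reordering. Each word carries a bold flag, the words are reordered assuming a right-to-left paragraph, and bold spans are rebuilt from the visual order. The names is_rtl, visual_tokens and render are made up, and uppercase again stands in for Hebrew.

```python
from itertools import groupby


def is_rtl(word):
    # Thread convention (an assumption of this sketch): uppercase = Hebrew.
    return any(c.isupper() for c in word)


def visual_tokens(tokens):
    """tokens: (word, bold) pairs in logical order; returns visual order."""
    runs = [(rtl, list(group))
            for rtl, group in groupby(tokens, key=lambda t: is_rtl(t[0]))]
    visual = []
    # RTL paragraph: runs go right to left, so walk them in reverse.
    for rtl, run in reversed(runs):
        if rtl:
            # Hebrew run: reverse word order and letters, keep the flag.
            visual.extend((word[::-1], bold) for word, bold in reversed(run))
        else:
            visual.extend(run)
    return visual


def render(visual):
    """Rebuild <b> spans from neighbouring words with the same bold flag."""
    parts = []
    for bold, group in groupby(visual, key=lambda t: t[1]):
        text = " ".join(word for word, _ in group)
        parts.append(f"<b>{text}</b>" if bold else text)
    return " ".join(parts)


# THIS IS HEBREW <b>WITH two</b> english WORDS INSIDE.
logical = [("THIS", False), ("IS", False), ("HEBREW", False),
           ("WITH", True), ("two", True),
           ("english", False), ("WORDS", False), ("INSIDE.", False)]
print(render(visual_tokens(logical)))
# prints: .EDISNI SDROW <b>two</b> english <b>HTIW</b> WERBEH SI SIHT
```

The single logical <b> span comes out as two separate visual spans. Real HTML/CSS adds nesting, inline boxes and explicit direction overrides on top of this, which the sketch ignores.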

Anyway, it has been working nicely in khtml for two weeks now :-)

Lars
