
List:       kfm-devel
Subject:    Re: Gecko
From:       Andreas Pour <pour () mieterra ! com>
Date:       2000-04-10 7:34:30

Lars Knoll wrote:
> 
> On Sun, 9 Apr 2000, Mukhsein Johari wrote:
> 
> > > Unicode yes, BiDi no. They have some screenshots, but if you can read
> > > arabic or hebrew, you'll notice, that all they do is reverse single
> > > words. They don't reverse the ordering of the words. So it would be like
> > > reading a sentence like the following in english:
> > > .car a is This
> >
> > Hmm? How can this be? Isn't the word order dependent on how the file was
> > 'written'? Also, since you have _whole_ words displayed correctly,
> > wouldn't the reversal of word order be trivial?
> 
> No. The file is written in logical order (the order you would pronounce
> the thing when speaking). The BiDi algorithm then transforms it to the
> visual order for printing. So a text like the following (big letters
> hebrew chars, small ones english):
> THIS IS HEBREW WITH two english WORDS INSIDE.
> would get reordered to:
> .EDISNI SDROW two english HTIW WERBEH SI SIHT
> 
> What mozilla does is to reverse every hebrew word without changing to the
> correct order.
> 
> Anyway, the reversal of word order is the hard thing, reversing the order
> inside a single word is more or less trivial. An example: Imagine you have
> some markup in the sentence above:
> THIS IS HEBREW <b>WITH two</b> english WORDS INSIDE.
> The output on the screen will have to look like:
> .EDISNI SDROW <b>two</b> english <b>HTIW</b> WERBEH SI SIHT
> 
> See the problem?

I would think reversing words is harder -- you have to deal with
punctuation and when it separates a word or is part of a word.  For
example, 'pseudo-transparency' must be treated as one word, but
'something-other than this-is the cause' has a different
interpretation.  Periods can also be used within a word and to separate
words.  In other words, tokenizing words is the difficult step.  Once
you have accomplished that, you just apply whatever formatting code the
word occurs in to the word individually.

For example, once you have tokenized the sentence above, you find two
words between the <b> formatting codes, so you just apply the code
to each word individually.  With the words tokenized, you then pick out
the language sets and reverse the phrases, and the words within those
phrases, that fall within a language set that needs reversing.

Thus, in your example, step 1 is tokenizing words (and applying
formatting codes to all words they fall within individually):

	Token 1:  THIS             Lang: Hebrew   Format: Reg.
	Token 2:  IS               Lang: Hebrew   Format: Reg.
	Token 3:  HEBREW           Lang: Hebrew   Format: Reg.
	Token 4:  WITH             Lang: Hebrew   Format: Bold
	Token 5:  two              Lang: English  Format: Bold
	Token 6:  english          Lang: English  Format: Reg.
	Token 7:  WORDS            Lang: Hebrew   Format: Reg.
	Token 8:  INSIDE           Lang: Hebrew   Format: Reg.
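
A rough sketch of step 1 in Python, just to make the idea concrete.  It
assumes the convention used in this thread (upper case stands in for
Hebrew, lower case for English), handles only <b> markup, and simply
drops punctuation, which is exactly the part I called difficult above.
The names Token and tokenize are made up for illustration.

```python
import re
from typing import List, NamedTuple


class Token(NamedTuple):
    text: str
    lang: str   # "Hebrew" or "English" (thread convention: case of the word)
    fmt: str    # "Bold" or "Reg."


def tokenize(markup: str) -> List[Token]:
    """Split markup into words, tagging each with language and format."""
    tokens = []
    bold = False
    # Pieces are either a <b>/</b> tag or a run of letters.
    # Punctuation is ignored in this sketch.
    for piece in re.findall(r"</?b>|[A-Za-z]+", markup):
        if piece == "<b>":
            bold = True
        elif piece == "</b>":
            bold = False
        else:
            lang = "Hebrew" if piece.isupper() else "English"
            tokens.append(Token(piece, lang, "Bold" if bold else "Reg."))
    return tokens


for i, tok in enumerate(
    tokenize("THIS IS HEBREW <b>WITH two</b> english WORDS INSIDE."), start=1
):
    print(f"Token {i}:  {tok.text:<16} Lang: {tok.lang:<8} Format: {tok.fmt}")
```

The formatting state is carried per word, so each token keeps its own
format no matter where it ends up after reordering.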

Of course your formatting codes can be quite a bit more complicated. 
Once you have your token list, step 2 is to reconstruct it in the proper
letter/word order.  For example, Token 1 is Hebrew, so you reverse its
letters and start at the right; you keep doing this till you come to
English, then keep the order as is (not sure what you do if a line break
occurs inside the English, but I'm sure there's a standard for that),
then when you get back to Hebrew, start reversing again. So you end up
with the correct result.
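
A minimal sketch of step 2 in Python, under some loud assumptions: the
line's base direction is right-to-left, there are no line breaks,
punctuation has been dropped, and each token is a hypothetical
(text, lang, bold) tuple.  This is only the simplified scheme described
here, not the real Unicode BiDi algorithm.

```python
from itertools import groupby

# Token list from step 1, as (text, lang, bold) tuples (illustrative format).
TOKENS = [
    ("THIS", "Hebrew", False),
    ("IS", "Hebrew", False),
    ("HEBREW", "Hebrew", False),
    ("WITH", "Hebrew", True),
    ("two", "English", True),
    ("english", "English", False),
    ("WORDS", "Hebrew", False),
    ("INSIDE", "Hebrew", False),
]


def visual_order(tokens):
    """Return the left-to-right display string for a right-to-left line."""
    # Group consecutive tokens of the same language into runs.
    runs = [(lang, list(group))
            for lang, group in groupby(tokens, key=lambda t: t[1])]
    out = []
    # The line is laid out from the right, so the last run comes first
    # when reading the result from left to right.
    for lang, run in reversed(runs):
        if lang == "Hebrew":
            # Hebrew: reverse the word order and the letters of each word.
            run = [(text[::-1], l, bold) for text, l, bold in reversed(run)]
        # English runs keep their internal order unchanged.
        for text, _, bold in run:
            # Formatting is applied to each word individually.
            out.append(f"<b>{text}</b>" if bold else text)
    return " ".join(out)


print(visual_order(TOKENS))
```

Because the bold flag travels with each word, the single <b>WITH two</b>
span in the source comes out as two separate bold words on screen.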

But maybe I'm missing something :-).

Ciao,

Andreas
