List:       koffice
Subject:    Re: Implementing hyphenation / wordwrap for Thai
From:       Rudiger Koch <rkoch () sas ! co ! th>
Date:       1999-09-10 3:52:37

On Thu, 09 Sep 1999, Jo Dillon wrote:
> Rudiger Koch (rkoch@sas.co.th) spake thusly:
> [Thai linebreaks] 
> > PS:
> > The Thai version of MS Word can do the linebreaks, but not properly. About 20%
> > of all linebreaks are wrong. Getting Thai support for KDE right could boost
> > the use of Linux and KDE here. It might even be possible to convince
> > officials to prefer Linux over Windows in Thai schools, much like
> > ScholarNet in Mexico. 
> 
>   Well, it's certainly something worth working on. The trouble is it's hard
> for European or American developers to work on something like this since
> they don't know the language :) Perhaps you could explain in more detail
> what would be required?
> 

Fortunately, the code to do the actual Thai word separation is already
available. It currently works as a command-line utility for HTML and LaTeX. In
the case of HTML, a <WBR> tag is inserted between all words. The <WBR> tag
tells the browser that this is a _potential_ position to break a line. It
does not create a gap (whitespace) between words, though.
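
To make the HTML case concrete, here is a rough sketch of the last step only:
joining words that have already been separated with <WBR> tags. The function
name insert_wbr is made up for illustration and is not taken from the existing
utility; the word separation itself is assumed to have happened beforehand.

```python
from html import escape


def insert_wbr(words):
    """Join already-separated words with <WBR> tags.

    Hypothetical helper, not the existing utility: it only shows that
    <WBR> marks a potential break without adding any whitespace.
    """
    # Escape each word so that characters like '<' or '&' stay valid HTML.
    return "<WBR>".join(escape(word) for word in words)


if __name__ == "__main__":
    # Two Thai words ("language", "Thai") that are written without a gap.
    print(insert_wbr(["ภาษา", "ไทย"]))
```

The browser renders the result as one unbroken run of text, but it is allowed
to wrap the line at the tag.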

We just need a new interface to make the code usable with K-Applications.
Since hyphenation is a similar linebreak problem, I suggest tackling both
with the same approach, which would probably be an API or a CORBA
interface that provides this service to applications. I am just asking for help
in defining this interface.
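
As a starting point for discussion, such an interface could look roughly like
the sketch below. Everything in it is an assumption: the names BreakPoint,
LineBreakService and WordListBreaker are invented here, and the greedy
word-list matcher is only a toy stand-in, not the algorithm of the existing
Thai code. The idea is that an application asks for break positions and learns
for each one whether a hyphen has to be drawn (hyphenation) or not (Thai word
breaks).

```python
from abc import ABC, abstractmethod
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class BreakPoint:
    offset: int         # index in the text before which a line may be broken
    needs_hyphen: bool  # True for hyphenation, False for Thai-style breaks


class LineBreakService(ABC):
    """Hypothetical service interface shared by hyphenators and word breakers."""

    @abstractmethod
    def break_points(self, text: str) -> List[BreakPoint]:
        """Return all potential line-break positions in text."""


class WordListBreaker(LineBreakService):
    """Toy implementation: greedy longest match against a fixed word list."""

    def __init__(self, words):
        # Try longer words first; ignore empty entries to avoid looping forever.
        self._words = sorted((w for w in words if w), key=len, reverse=True)

    def break_points(self, text: str) -> List[BreakPoint]:
        points, pos = [], 0
        while pos < len(text):
            match = next((w for w in self._words if text.startswith(w, pos)), None)
            if match is None:
                pos += 1  # unknown character: skip it, offer no break here
                continue
            pos += len(match)
            if pos < len(text):  # no break point at the very end of the text
                points.append(BreakPoint(pos, needs_hyphen=False))
        return points


if __name__ == "__main__":
    breaker = WordListBreaker(["ภาษา", "ไทย"])
    print(breaker.break_points("ภาษาไทย"))
```

A hyphenation backend would implement the same method and set needs_hyphen to
True, so KWord or KMail would not need to know which kind of backend is in use.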

Some more background:
Thai is a tonal language, like Chinese. Thai script is totally
different, however. It is heavily influenced by the Indian languages Pali and
Sanskrit. It is character based, like Roman, Greek or Cyrillic. That is about
all that Thai and European scripts have in common.

A Thai syllable always starts with a consonant that carries a vowel. This
vowel can be written to the left of, to the right of, above or below the
carrying consonant, or it may not be written at all, depending on the vowel.
Word boundaries have no marker, which makes Thai difficult to read and causes
KWord and KMail to do a really bad job with Thai. It is amazing that X can
handle such a complex script! The guys at MIT really did a great job!

To get an idea of what Thai script looks like, check out:
http://www.nectec.or.th
where you will find GIFs with Thai sentences.

-Rudiger


--
 Software Advanced Solutions           Fon: +66 76 218 826 
 48 Villa 1 Yaowarat Soi 1             Fax: +66 76 214 041
 Phuket, Thailand 83000                rkoch@sas.co.th

 // Why use Windows when the door is open and free of charge?
 // Linux: The choice of a GNU generation
