List:       aspell-user
Subject:    Re: International support
From:       Samphan Raruenrom <samphan () thai ! com>
Date:       1999-01-11 14:46:29
Message-ID: LYR14511-8633-1999.01.11-09.45.14--kevinatk#home.com () franklin ! oit ! unc ! edu

Kevin Atkinson wrote:
> Samphan Raruenrom wrote:
> > In Thai, we don't put spaces between words at all, so
> > the same situation happens naturally.
> > Typical Thai word-segmentation algorithms (which usually
> > do spell checking as well) use a maximal-match backtracking
> > algorithm with trie word list(s).
> > My implementation is at http://www.thai.net/libinthai/
> > The IBM Classes for Unicode implementation is at
> > http://www.ibm.com/java/education/boundaries/boundaries.html
> OK, so how do you detect boundaries of unknown or misspelled words?

IBM ICU's algorithm, as described at the above URL, is:
: If we exhausted our possibilities without finding 
: a valid sequence of words, it either means there's
: an error in the text, or the text includes a word 
: that isn't in the dictionary. In either case, we restore
: the set of break positions that matched the most 
: characters, advance one character past where the
: mismatch occurred in that sequence, and start over 
: from there. This works pretty well: usually only
: one or two boundary positions around the error 
: are in the wrong place.
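
To make the recovery step concrete, here is a rough sketch of my own
in Python. It is not the ICU code and not the libinthai code. The
function names (build_trie, word_ends, longest_run, segment) are made
up for illustration, and I am assuming that a run of skipped
characters is reported as one unknown chunk, which the text above does
not specify. It uses Latin toy words only to keep the example short.

```python
def build_trie(words):
    """Build a nested-dict trie; the key "" marks the end of a word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node[""] = True
    return root


def word_ends(trie, text, start):
    """End offsets of dictionary words starting at `start`, longest first."""
    node, ends = trie, []
    for i in range(start, len(text)):
        node = node.get(text[i])
        if node is None:
            break
        if "" in node:
            ends.append(i + 1)
    return ends[::-1]


def longest_run(trie, text, start):
    """Maximal match with backtracking from `start`.

    Returns (path, complete). `path` is a list of break positions. If no
    sequence of words reaches the end of the text, `path` is the attempt
    that matched the most characters and `complete` is False.
    """
    best = [start]
    dead = set()          # positions already expanded without success
    stack = [[start]]
    while stack:
        path = stack.pop()
        pos = path[-1]
        if pos > best[-1]:
            best = path
        if pos == len(text):
            return path, True
        if pos in dead:
            continue
        dead.add(pos)
        # Push shorter candidates first so the longest one is tried first.
        for end in reversed(word_ends(trie, text, pos)):
            stack.append(path + [end])
    return best, False


def segment(words, text):
    """Split `text` into chunks, restarting one character past each mismatch."""
    trie = build_trie(words)
    breaks = {0, len(text)}
    pos = 0
    while pos < len(text):
        path, complete = longest_run(trie, text, pos)
        if len(path) > 1:             # at least one word matched in this run
            breaks.update(path)
        if complete:
            break
        pos = path[-1] + 1            # advance one character past the mismatch
    cuts = sorted(breaks)
    return [text[a:b] for a, b in zip(cuts, cuts[1:])]


if __name__ == "__main__":
    print(segment(["cat", "cats", "sand", "dog"], "catsanddog"))
    print(segment(["cat", "dog"], "catxydog"))
```

In the first call the longest first match "cats" leads to a dead end,
so the search backtracks to "cat" and continues with "sand". In the
second call no dictionary word starts at "x" or "y", so those
characters end up in a chunk that matches no dictionary word, which a
spelling checker could then flag.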
