'Statistical/Corpus Based Word BreakIterators for ICU'

[prev in list] [next in list] [prev in thread] [next in thread] 

List:       icu
Subject:    Statistical/Corpus Based Word BreakIterators for ICU
From:       Jeffrey Sorensen <sorenj () ieee ! org>
Date:       2004-12-13 18:19:38
Message-ID: 41BDDD3A.9070503 () ieee ! org
[Download RAW message or body]

I would like to ask the ICU developers if there is interest in a new
algorithm and implementation for an alternative to the
DictionaryBasedBreakIterator class for languages such as Chinese and
Thai that do not mark word boundaries with whitespace.
For more on the current solution, see
http://oss.software.ibm.com/icu/apiref/classDictionaryBasedBreakIterator.html#_details

A potential solution, depending upon the application area, is the
use of character N-grams and a corpus trained statistical model to
compute a likely segmentation.  This is the solution pioneered by
Richard Sproat and presented in:
Sproat R., Shih C., Gale W., Chang N. (1996): "A Stochastic
Finite-State Word-Segmentation Algorithm for Chinese".
Computational Linguistics, 22(3)

Chinese segmentation is still an active are of research, and
people interested in the current state of the art should be
encouraged to review the material on the SIGHAN website
http://www.sighan.org/

With regards to the ICU community, my intention is not to provide
a state of the art segmenter, but instead the necessary APIs and
implementations for a reference implementation that achieves
approximately 90% accuracy for use in Text Boundary applications such
as indexing for document retrieval and text selection for double and
triple clicking in word processing.

 From an implementation standpoint, the work is already complete, the
matter is "simply" one of making it available.  This involves making the
code for the new BreakIterator itself available, ~200 lines, but more
importantly, find training material that can be used to generate a model
that can be used without non-commercial encumbrances.

 From an ICU user's perspective, I am proposing a new class derived from
BreakIterator that implements a DynamicProgrammingBasedBreakIterator (DPBI).
Like the DictionaryBasedBreakIterator, my current proposed solution is
asymmetric; that is, moving backwards is different from moving forwards,
and the class internally maintains a list of break points for the current
text it is assigned to support moving backwards.  (Alternatively, the models
described below can be built for forwards and backwards contexts, but this
could mean moving backwards generates a different word segmentation.)

I'll also briefly describe the implementation.  The DPBI is the composition
of two non-deterministic finite state automata.  The first is a machine that
hypothesizes a break between every pair of characters in the target 
language.
The second is a weighted finite state transducer that assigns the 
probability
of seeing a break between the last (n-1) characters and the next character.
For cases where the last (n-1) characters were not seen, there is also a
series of "back-off" epsilon arcs to the probability of seeing a word break
given the last (n-2), (n-3), ... characters.  Here the past n characters is
understood to include past hypothesized breaks (from the state machine
viewpoint, a break is considered a character inserted into the stream.)

The weighted transducer is trained using a corpus of text that has been
segmented according to the custom of the locale, where the weight
is the fraction of times that a break is introduced given the history.

For segmenting a text sequence of N characters, there are some 2^(N-1)
hypothetical solutions.  DPBI performs a computational efficient search, the
Viterbi-Bellman dynamic programming algorithm.

If there is interest, I would be happy to post samples of documents that 
have
been segmented using this technique.  Regarding the availability of the 
code
and the models, especially under the terms of the ICU license agreement,
work will need to be done.  Looking at the overall ICU framework, the
introduction, however trivially, of weighted finite state transducers could
have broader applications.

I welcome comments both positive and negative, as this type of statistical
segmentation may not be adequate for all applications.  I am certainly also
interested in any other proposal that are on the table for improving the
present (and limited) Dictionary based word segmentation in ICU.

_______________________________________________
icu mailing list
icu@oss.software.ibm.com
http://oss.software.ibm.com/developerworks/oss/mailman/listinfo/icu
[prev in list] [next in list] [prev in thread] [next in thread]