List:       aspell-devel
Subject:    [aspell-devel] Experiments with spell-checking...
From:       "Mike C. Fletcher" <mcfletch () rogers ! com>
Date:       2002-10-28 21:19:16
Message-ID: 3DBDA9D4.40907 () rogers ! com

I've been working on and off for the last week on a contextualisable
spell-check engine for Python.  So far I have a basically functional
system, but I'm running into a number of issues.  I'm wondering if
others have suggestions about how to fix these problems (or better
approaches).  I'm going to summarise the approach I've taken so the
problems have some useful context:

Overview of the system

    I'm providing two major services with the spell-checker:

        The first is the standard "is this word in the dictionary"
"check", which is fast, and very basic regardless of the underlying
technology.  For any given word you do one lookup in the database,
which returns the (normally first, but optionally all) word-set
object in which the word was found, along with any variants on the
normalised form of the word.  For example, if the word checked was
"Form" and only "form" was in the database, you'd get back a list
containing only "form", and your app could decide whether that
constituted an error or not (e.g. by looking at whether you are at
the start of a sentence, in a title, etc.).
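
As a rough illustration of that first service, here is a minimal
sketch of the lookup against a dict-like word-set; the names (check,
store) and the lower-casing normalisation are placeholders, not the
actual PySpelling interfaces:

    def check(word, store):
        """Return the stored variants of word, or None if unknown."""
        key = word.lower()                  # stand-in normalisation
        record = store.get(key)
        if record is None:
            return None                     # not in this word-set
        # Stored variants are '\000'-joined, as described below.
        variants = record.split('\000')
        # The caller decides whether e.g. "Form" vs. "form" is an
        # error, based on sentence position, titles, and so on.
        return variants

    words = {'form': 'form'}                # toy in-memory word-set
    print check('Form', words)              # -> ['form']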

        The second service is the "what words are similar to this?"
"suggestion", which is comparatively very slow (0.03s at
edit-distance=1 through 2.5s at edit-distance=2, depending on the
chosen "fuzziness").  The algorithm is currently based on a phonetic
compression mechanism (with the rules read from Aspell's files) with
edit-distance calculations determining which records are considered
'hits'.  At the moment, this uses an approach based on the
documentation for the aspell engine.  It does a heavily-trimmed query
for single-edit-distance fuzziness (~2000 queries on a 500,000 word
dictionary) and a (slightly-trimmed) straight scan through the
database for more lax searches (which scans an average of ~200,000
records in a 500,000 word dictionary).
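
To sketch the suggestion side: the code below assumes a
phonetic_store mapping phonetic keys to '\000'-joined words and a
soundslike() function built from Aspell's phonetic rules.  Both
names are placeholders, and the straight keys() scan shown
corresponds only to the lax-search path (the trimmed
single-edit-distance queries aren't shown):

    def edit_distance(a, b):
        """Plain Levenshtein distance via dynamic programming."""
        previous = range(len(b) + 1)
        for i in range(1, len(a) + 1):
            current = [i]
            for j in range(1, len(b) + 1):
                cost = int(a[i - 1] != b[j - 1])
                current.append(min(previous[j] + 1,         # deletion
                                   current[j - 1] + 1,      # insertion
                                   previous[j - 1] + cost)) # substitution
            previous = current
        return previous[-1]

    def suggest(word, phonetic_store, soundslike, max_distance=1):
        """Collect words whose phonetic key is within max_distance edits."""
        target = soundslike(word)
        hits = []
        for key in phonetic_store.keys():   # the expensive full scan
            if edit_distance(target, key) <= max_distance:
                hits.extend(phonetic_store[key].split('\000'))
        return hits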

    Eventually I intend to provide a few other services:

        Rank words according to their likely relevance to a context
(assumes the words are in the dictionary).  This will be built on a
number of mechanisms, including whole-word-list rankings by
dictionary (e.g. common English words having a higher ranking than
Shakespearean English words in most dictionaries, but the reverse in
a Shakespearean dictionary), statistical recording per-user (i.e.
you always use the word "ripping", so it's more likely to be what
you mean), and potentially algorithmic mechanisms based on the
context of the word (for example, noun-verb agreement).  Beyond the
obvious use-case of sorting suggestions, this would let technical
writers take a standard-English dictionary split by commonality of
word (such lists are available) and colourise each word in a corpus
according to its commonality.

        Word-completion (lookup of all words in the current
dictionary that complete the current prefix); a small sketch of the
prefix-scan idea follows this list.

        Correction and usage tracking (which would feed some of the
ranking mechanisms above).  This would likely be per-user (and
optional).  It would allow you to store per-app/per-doc/whatever
usage databases to provide both more efficient and more
personal/contextualised services to the user.  Correction tracking
is a sub-set of usage tracking, used solely by the spelling system.
Usage tracking in the more general approach allows you to make
educated guesses, based on the current input and past history, about
what the user probably intended to say.
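
The word-completion service mentioned above is essentially a prefix
scan over an ordered word list.  A minimal in-memory sketch using
the bisect module; a BTree-backed on-disk store would allow the same
sort of ordered range scan without loading everything into memory:

    import bisect

    def complete(prefix, sorted_words):
        """Return every word in sorted_words starting with prefix."""
        start = bisect.bisect_left(sorted_words, prefix)
        matches = []
        for word in sorted_words[start:]:
            if not word.startswith(prefix):
                break
            matches.append(word)
        return matches

    words = ['form', 'formal', 'format', 'formula', 'fort']
    print complete('form', words)  # -> ['form', 'formal', 'format', 'formula']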


The primary focus of the system's current design is on modularity
and flexibility, so the code allows for loading both in-memory and
on-disk word lists and mixing and matching word-lists within
multiple loaded dictionaries (note: aspell also offers mix-and-match
mechanisms for word-lists).  The idea is that you load word-lists
per-document, per-project, and per-application and add them to the
current user's default dictionaries/correction-lists/usage-statistics
to give you a contextually-relevant dictionary.
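
To make the mix-and-match idea concrete, here is a sketch of a
dictionary object that is just an ordered collection of dict-like
word-list objects (document, project, user, system), checked in
turn.  The class and method names are illustrative only, not the
actual PySpelling interfaces:

    class CompositeDictionary:
        def __init__(self, word_lists):
            # Most specific (e.g. per-document) word-lists first.
            self.word_lists = word_lists

        def check(self, word):
            """Return (word_list, variants) for the first list with word."""
            key = word.lower()              # stand-in normalisation
            for word_list in self.word_lists:
                variants = word_list.get(key)
                if variants is not None:
                    return word_list, variants
            return None, None

    # e.g. CompositeDictionary([document_words, project_words,
    #                           user_words, system_english])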


Current Status and Problems

    I'm currently using Python's bsddb wrappers for storing the
word-lists.  I'm storing the phonetic database as
"phonetic-compressed": word [+ '\000'+word] and the non-phonetic
database as "normalised": word [+ '\000'+word].  That means the
database requires (loosely) 4 times the storage space of the words
themselves, plus the overhead of the database.  In practice that
means the 500,000 word phonetic database (7.1MB of plain-text words)
takes around 29MB when compiled.  500,000 is a fairly large
dictionary (my dead-trees "official scrabble dictionary" is about
500,000 words), but I'm thinking a real-world installation will want
~1,000,000 words to be available in system word-sets (for each
language), so we'd be talking around 120MB * (number of languages)
just for the system dictionaries.  So, some thoughts on how to
compress the word-sets without dramatically overloading the system
(processing overhead) would be appreciated.
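
For reference, a rough sketch of that record layout when building
the two databases with the Python bsddb module; soundslike() again
stands in for the Aspell-rule-based phonetic compression, and the
file names are arbitrary:

    import bsddb

    phonetic_db = bsddb.hashopen('phonetic.db', 'c')
    normal_db = bsddb.hashopen('words.db', 'c')

    def add_word(word, soundslike):
        """File word under its phonetic key and its normalised form."""
        for db, key in ((phonetic_db, soundslike(word)),
                        (normal_db, word.lower())):
            if db.has_key(key):
                existing = db[key].split('\000')
                if word not in existing:
                    db[key] = db[key] + '\000' + word
            else:
                db[key] = word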

    bsddb will, if a program using it crashes with a db open for
writing, corrupt the database and be unable to open it again.  Is
there a better embeddable database system for this kind of work?
I'm considering testing meta-kit (which I used a few years ago), but
the last time I worked with it the same problem plagued it (it would
corrupt files if not properly closed).

    The scanning speed is just not usable under the current bsddb
system (even if I use a non-thread-safe naive approach), so the
aspell-style searches for larger edit distances (i.e. more mistakes
allowed) are prohibitively expensive (a single suggestion taking 2.5
seconds).  The current code goes through around 200,000 iterations
for a distance=2 search (even after optimising out the 2-deletion
cases).  That would likely be unnoticeably fast in C, but it's
hideously slow in Python.
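
One example of the kind of cheap trimming involved, for what it's
worth: the difference in length between two strings is a lower bound
on their edit distance, so many keys can be rejected before running
the full dynamic programming (the edit_distance sketch earlier).
This only shaves the constant factor; the per-key work is still
being done in interpreted Python:

    def within_distance(target, key, max_distance):
        if abs(len(target) - len(key)) > max_distance:
            return False                    # rejected without full DP
        return edit_distance(target, key) <= max_distance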

    I am considering creating a phonetic-feature-set database instead
of continuing work on the scanning phonetic database.  This would map
each 2-character sub-string of the phonetic compression to the set of
words having that feature, with the start and end of the word treated
as characters, giving n+1 entries for each word.  To do that without
creating huge databases I would need a mechanism for storing pointers
to words, rather than the words themselves, within the database.  I'm
sure there must be some useful way to do that, but everything I've
seen relies on storing integers and then doing a query to resolve
each integer back to a string.  That seems like it would still be
expensive (especially using bsddb).  Any better mechanisms people
know of?  As a note, this is a toolkit for building spell-checkers,
so I'll likely provide both mechanisms, but it's unlikely anyone
would use both at the same time (it would mean ~180MB/language for
system dictionaries).
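
A sketch of what that feature-set index might look like: every
2-character sub-string of the phonetic key (with explicit start and
end markers) maps to the words containing it, stored as packed
integer IDs rather than the words themselves.  The '^'/'$' markers,
struct packing and names are all just illustrative assumptions:

    import struct

    def features(phonetic_key):
        """Return the n+1 bigrams of a key, with start/end markers."""
        padded = '^' + phonetic_key + '$'
        return [padded[i:i + 2] for i in range(len(padded) - 1)]

    def index_word(word_id, phonetic_key, feature_db):
        """Append word_id (4 packed bytes) to each feature's record."""
        packed = struct.pack('<I', word_id)
        for feature in features(phonetic_key):
            feature_db[feature] = feature_db.get(feature, '') + packed

    def word_ids(record):
        """Unpack a feature record into a list of integer word IDs."""
        return [struct.unpack('<I', record[i:i + 4])[0]
                for i in range(0, len(record), 4)]

(feature_db here is any dict-like mapping; resolving the IDs back to
words is the extra query round-trip worried about above.)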

For those interested in playing with the current code, the project is on 
sourceforge at:

    http://sourceforge.net/projects/pyspelling/

Enjoy yourselves,
Mike

_______________________________________
  Mike C. Fletcher
  Designer, VR Plumber, Coder
  http://members.rogers.com/mcfletch/

