
List:       lucene-dev
Subject:    Large stop word list
From:       Culley Angus <Culley.Angus () SilentOne ! com>
Date:       2001-11-13 4:46:09

I am currently looking at building a brute-force analyzer that will
effectively be able to index the textual content of any binary file
(within reason).

This is being done by weeding out as much predictable 'binary junk' as
possible from the stream, then building a (large) stop word list to filter
out the rest.

In order to do this, I may need a rather large stop word list to filter out
the inevitable junk that will be encountered, and I am a little worried
about the possible size of this list.
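
To make the idea concrete, here is a minimal sketch of the two steps in
plain Java. It does not use Lucene's analyzer API, and the class name,
method names, minimum token length and the sample stop word are all
assumptions made up for illustration. It keeps the stop words in a hash
set, so each lookup takes roughly constant time however large the list
grows; under that assumption the cost of a big list is mostly memory.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical sketch: not part of Lucene, names chosen for illustration.
public class BruteForceFilterSketch {

    // Step 1: weed out predictable binary junk by keeping only runs of
    // ASCII letters that are at least minLen characters long (lower-cased).
    static List<String> extractTokens(byte[] data, int minLen) {
        List<String> tokens = new ArrayList<>();
        StringBuilder run = new StringBuilder();
        for (byte b : data) {
            char c = (char) (b & 0xFF);
            boolean letter = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z');
            if (letter) {
                run.append(Character.toLowerCase(c));
            } else {
                if (run.length() >= minLen) {
                    tokens.add(run.toString());
                }
                run.setLength(0);
            }
        }
        // Flush a run that reaches the end of the input.
        if (run.length() >= minLen) {
            tokens.add(run.toString());
        }
        return tokens;
    }

    // Step 2: drop every token found in the (possibly large) stop word set.
    static List<String> removeStopWords(List<String> tokens, Set<String> stopWords) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (!stopWords.contains(token)) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Made-up sample: control bytes around some text and a format marker.
        byte[] data = {0, 1, 'H', 'e', 'l', 'l', 'o', (byte) 0xFF, 'x', 'x',
                       ' ', 'w', 'o', 'r', 'l', 'd', ' ', 'J', 'F', 'I', 'F', 2};
        Set<String> stopWords = new HashSet<>(Arrays.asList("jfif"));
        List<String> tokens = extractTokens(data, 3);
        System.out.println(removeStopWords(tokens, stopWords));
    }
}
```

In this sketch the short run "xx" is discarded by the length check before
the stop word list is consulted, which is one way to keep the list itself
smaller.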

Has anyone had problems with something like this, specifically with a
brute-force-style filter or a particularly large stop word list?

Thanks,
Culley.

