'Re: protwords.txt support in stemmers'


List:       solr-dev
Subject:    Re: protwords.txt support in stemmers
From:       Grant Ingersoll <gsingers () apache ! org>
Date:       2010-03-30 19:17:15
Message-ID: EC247F2C-B55F-4772-82F1-2F83E550D26B () apache ! org


On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote:

> On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcmuir@gmail.com> wrote:
> > We have a few choices:
> > * we could treat this stuff as impl details, and add protwords.txt support
> > to all stemming factories. we could just wrap the filter with a
> > KeywordMarkerFilter internally.
> > * we could deprecate the explicit protwords.txt in the few factories that
> > support it, and instead create a factory for KeywordMarkerFilter.
> > * we could do something else, e.g. both.
> > 
> > So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
> > could do:
> > 
> > <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> > <filter class="solr.SomeStemmer"/>
> > 
> > and get the same effect, instead of having to add support for protwords.txt
> > to every single stem factory.
> 
> Yep, this decomposition seems more powerful.
> 
> Sort of related: for a long time I've had the idea of allowing the
> expression of more complex filter chains that can conditionally
> execute some parts based on tags set by other parts.
> 
> This is straightforward to just hand-code in Java of course, but
> trickier to do well in a declarative setting:
> 
> <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
> <filter class="solr.SomeStemmer" skipTags="protect"/>
> 
> The idea was to also make this fast by allocating a bit per tag
> (assuming we somehow knew all of the possible ones in a particular
> filter chain) and using a bitfield (long) to set and test.  I was
> planning on using Token.flags before the new analysis attribute stuff
> came into being.

I believe you have to declare the Attributes up front, right?  Should be
possible to know them, right?
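
To make the bit-per-tag idea above concrete, here is a minimal sketch in plain Java. It assumes every tag name in a chain is known when the chain is built, so each one can be given a fixed bit in a long. TagRegistry and TagBits are hypothetical names for illustration only; they are not existing Lucene or Solr classes, and nothing here touches Token.flags or the attribute API.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: hands out one bit of a long per tag name in a filter chain. */
final class TagRegistry {
    private final Map<String, Long> masks = new LinkedHashMap<>();

    /** Returns the mask for a tag, allocating the next free bit on first use. */
    long maskFor(String tag) {
        Long existing = masks.get(tag);
        if (existing != null) {
            return existing;
        }
        if (masks.size() >= Long.SIZE) {
            throw new IllegalStateException("more than 64 tags in one chain");
        }
        long mask = 1L << masks.size();
        masks.put(tag, mask);
        return mask;
    }
}

/** Hypothetical set/test operations on a per-token flags word. */
final class TagBits {
    private TagBits() {}

    /** Marks the token as carrying the tag(s) in mask. */
    static long set(long flags, long mask) {
        return flags | mask;
    }

    /** True if the token carries at least one of the tags in mask. */
    static boolean hasAny(long flags, long mask) {
        return (flags & mask) != 0;
    }
}
```

In this sketch a tagging filter would call TagBits.set with the mask for "protect", and a stemmer configured with skipTags="protect" would call TagBits.hasAny once per token and pass the token through unchanged when it returns true.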

> 
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc).  A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).

Good opportunity to get rid of the TypeAttribute altogether, too, as that
thing is no longer useful.
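
As a rough illustration of the tokenizer-side idea quoted above, the sketch below scans a term once, records what it contains as bits, and lets a later stage decide with a single mask test whether the term needs the slow path. TokenProps and its constants are hypothetical names, and needsSplitting is only a stand-in for the kind of check a filter like WDF might make, not its actual logic.

```java
/** Hypothetical per-token property bits, computed in one pass over the term. */
final class TokenProps {
    static final long HAS_UPPER  = 1L;
    static final long HAS_LOWER  = 1L << 1;
    static final long HAS_DIGIT  = 1L << 2;
    static final long HAS_HYPHEN = 1L << 3;

    private TokenProps() {}

    /** Scans the term once and returns the OR of the property bits found. */
    static long scan(CharSequence term) {
        long bits = 0L;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i);
            if (Character.isUpperCase(c)) {
                bits |= HAS_UPPER;
            } else if (Character.isLowerCase(c)) {
                bits |= HAS_LOWER;
            } else if (Character.isDigit(c)) {
                bits |= HAS_DIGIT;
            } else if (c == '-') {
                bits |= HAS_HYPHEN;
            }
        }
        return bits;
    }

    /** Fast-path test: mixed case, digits, or hyphens mean the term may need splitting. */
    static boolean needsSplitting(long bits) {
        boolean mixedCase = (bits & HAS_UPPER) != 0 && (bits & HAS_LOWER) != 0;
        return mixedCase || (bits & (HAS_DIGIT | HAS_HYPHEN)) != 0;
    }
}
```

A downstream filter would then call needsSplitting on the stored bits instead of walking the characters of every token again.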

