List: solr-dev
Subject: Re: protwords.txt support in stemmers
From: Grant Ingersoll <gsingers () apache ! org>
Date: 2010-03-30 19:17:15
Message-ID: EC247F2C-B55F-4772-82F1-2F83E550D26B () apache ! org
On Mar 30, 2010, at 8:33 AM, Yonik Seeley wrote:
> On Tue, Mar 30, 2010 at 8:06 AM, Robert Muir <rcmuir@gmail.com> wrote:
> > We have two choices:
> > * we could treat this stuff as impl details, and add protwords.txt support
> > to all stemming factories. we could just wrap the filter with a
> > keywordmarkerfilter internally.
> > * we could deprecate the explicit protwords.txt in the few factories that
> > support it, and instead create a factory for KeywordMarkerFilter.
> > * we could do something else, e.g. both.
> >
> > So, to illustrate, by adding a factory for the KeywordMarkerFilter, a user
> > could do:
> >
> > <filter class="solr.KeywordMarkerFilterFactory" protected="protwords.txt"/>
> > <filter class="solr.SomeStemmer"/>
> >
> > and get the same effect, instead of having to add support for protwords.txt
> > to every single stem factory.
>
> Yep, this decomposition seems more powerful.
>
> Sort of related: for a long time I've had the idea of allowing the
> expression of more complex filter chains that can conditionally
> execute some parts based on tags set by other parts.
>
> This is straightforward to just hand-code in Java of course, but
> trickier to do well in a declarative setting:
>
> <filter class="solr.Tagger" tag="protect" words="protwords.txt"/>
> <filter class="solr.SomeStemmer" skipTags="protect"/>
>
> The idea was to also make this fast by allocating a bit per tag
> (assuming we somehow knew all of the possible ones in a particular
> filter chain) and using a bitfield (long) to set and test. I was
> planning on using Token.flags before the new analysis attribute stuff
> came into being.
I believe you have to declare the Attributes up front, right? Should be
possible to know them, right?
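
For what it's worth, the bit-per-tag idea quoted above can be sketched in
plain Java without touching any Lucene types. Everything here is
hypothetical (TagRegistry, maskFor, set and hasAny are made-up names, not
existing Solr or Lucene API), and it assumes a chain never needs more than
64 tags so they fit in one long:

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch: one bit per tag name, packed into a long. */
final class TagRegistry {
    // Tag name -> bit position, assigned in registration order.
    private final Map<String, Integer> bitIndex = new LinkedHashMap<>();

    /** Registers the tag if it is new and returns its single-bit mask. */
    long maskFor(String tag) {
        Integer idx = bitIndex.get(tag);
        if (idx == null) {
            if (bitIndex.size() >= Long.SIZE) {
                throw new IllegalStateException("more than 64 tags in one chain");
            }
            idx = bitIndex.size();
            bitIndex.put(tag, idx);
        }
        return 1L << idx;
    }

    /** Returns flags with every bit of mask switched on. */
    static long set(long flags, long mask) {
        return flags | mask;
    }

    /** True if flags carries at least one bit of mask (the skipTags test). */
    static boolean hasAny(long flags, long mask) {
        return (flags & mask) != 0L;
    }
}
```

Under that assumption, a factory would resolve "protect" to a mask once when
the chain is built, the tagging filter would call set() per token, and the
stemmer would skip any token for which hasAny() is true.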
>
> It would also be nice to make the token categories generated by
> tokenizers into tags (like StandardTokenizer's ACRONYM, etc). A
> tokenizer that detected many of the properties could significantly
> speed up analysis because tokens would not have to be re-analyzed to
> see if they contain mixed case, numbers, hyphens, etc (i.e. the fast
> path for WDF would be checking a bit per token).
Good opportunity to get rid of the TypeAttribute altogether, too, as that
thing is no longer useful.
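
A rough sketch of the "bit per token" fast path described above, again in
plain Java rather than against the real analysis API. TokenProps, scan and
isSimple are hypothetical names, and the notion of "simple" used here is a
deliberately coarse assumption, not what WDF actually checks:

```java
/** Hypothetical sketch: classify a token's characters once, as a bitfield. */
final class TokenProps {
    static final int HAS_UPPER  = 1;
    static final int HAS_LOWER  = 1 << 1;
    static final int HAS_DIGIT  = 1 << 2;
    static final int HAS_HYPHEN = 1 << 3;

    /** Single pass over the term, as a tokenizer could do while emitting it. */
    static int scan(CharSequence term) {
        int props = 0;
        for (int i = 0; i < term.length(); i++) {
            char c = term.charAt(i); // char-based; ignores supplementary code points
            if (Character.isUpperCase(c)) {
                props |= HAS_UPPER;
            } else if (Character.isLowerCase(c)) {
                props |= HAS_LOWER;
            } else if (Character.isDigit(c)) {
                props |= HAS_DIGIT;
            } else if (c == '-') {
                props |= HAS_HYPHEN;
            }
        }
        return props;
    }

    /** Assumed fast-path test: no hyphen and at most one character class. */
    static boolean isSimple(int props) {
        return (props & HAS_HYPHEN) == 0 && Integer.bitCount(props) <= 1;
    }
}
```

A downstream filter could then pass a token through untouched whenever
isSimple() holds and only re-scan the rest. Note that a capitalized word
such as "Solr" sets two bits and so falls off the fast path in this version;
a real implementation would want finer-grained properties.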