
List:       wekalist
Subject:    Re: [Wekalist] StringToWordVector
From:       Mark Hall <mhall () pentaho ! com>
Date:       2013-12-16 3:11:03
Message-ID: CED4D4CD.E690%mhall () pentaho ! com

On 14/12/13 10:53 am, "Mike Vogel" <Mike.Vogel@knowledgent.com> wrote:

>Using explorer is it possible to use stop words and NGramTokenizer?
> 
>When I try these settings:
> · Lowercase the input strings
> · Apply the default stop words
> · Stem the words using one of the default stemmers
> · NGramTokenizer with min and max 2
> 
>As soon as I use the NGramTokenizer the stop words are no longer applied.

It's not that the stop words aren't applied; it's just that the
NGramTokenizer returns tokens that are multiple words. So none of the
n-grams are in the stop words list (which is applied after tokenisation).
I guess NGramTokenizer would need an option to apply a list of stop words
internally so that it could skip them when forming n-grams. The same would
be true of stemming.
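
To illustrate the ordering I mean, here is a rough standalone sketch in
plain Java. It does not use the Weka API at all: the class StopwordNGrams,
the method ngrams and the tiny stop word set are all made up for the
example. It lowercases the text, drops stop words at the single-word level,
and only then joins the remaining words into n-grams.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

// Hypothetical helper for illustration only; not part of Weka.
public class StopwordNGrams {

    // Removes stop words from the word sequence first, then builds all
    // n-grams with min <= n <= max from the words that remain.
    public static List<String> ngrams(String text, Set<String> stopwords,
                                      int min, int max) {
        List<String> words = new ArrayList<>();
        for (String w : text.toLowerCase().split("\\W+")) {
            // split() can yield an empty leading token, so skip those too
            if (!w.isEmpty() && !stopwords.contains(w)) {
                words.add(w);
            }
        }
        List<String> result = new ArrayList<>();
        for (int n = min; n <= max; n++) {
            for (int i = 0; i + n <= words.size(); i++) {
                result.add(String.join(" ", words.subList(i, i + n)));
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Assumed stop word list, chosen just for this example
        Set<String> stop = Set.of("the", "on");
        // prints [cat sat, sat mat]
        System.out.println(ngrams("The cat sat on the mat", stop, 2, 2));
    }
}
```

Note that skipping stop words this way makes words adjacent that were not
adjacent in the original text ("sat mat" above), which may or may not be
what you want. A stemmer could be applied in the same loop, to each single
word before it is added to the list.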

Cheers,
Mark.

