List:       james-user
Subject:    Re: Bayesian Analysis spam filter is under attack
From:       David Legg <david.legg () searchevent ! co ! uk>
Date:       2007-11-25 17:39:54
Message-ID: 4749B36A.6030504 () searchevent ! co ! uk

I don't think you need to worry about incorrectly tokenizing each word.

In your example it doesn't matter if McDonalds gets (incorrectly?) split 
into Mc and Donalds, since it is up to the Bayesian analysis to decide 
whether the combination of Mc and Donalds indicates spam.  The point is 
that, at the moment, I don't believe the filter gets the chance to make 
that decision, since all it sees is McDonalds as one token.
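
To make the two options concrete, here is a minimal sketch. It assumes the 
split in question happens at a lowercase-to-uppercase boundary, which this 
thread does not spell out. The function name split_on_case_change and the 
min_prefix parameter are hypothetical and are not taken from the James 
source; min_prefix=4 is one reading of "split after the 3rd character".

```python
# Hypothetical helper, not part of Apache James.
# Splits a token wherever a lowercase letter is followed by an uppercase
# letter, but only if at least `min_prefix` characters precede the boundary.
def split_on_case_change(token, min_prefix=0):
    parts = []
    start = 0
    for i in range(1, len(token)):
        at_boundary = token[i - 1].islower() and token[i].isupper()
        if at_boundary and i >= min_prefix:
            parts.append(token[start:i])
            start = i
    parts.append(token[start:])
    return parts


# Unconditional split: the Bayesian analysis sees "Mc" and "Donalds"
# as separate tokens and weighs them itself.
print(split_on_case_change("McDonalds"))

# Guarded split (assumed reading of the 3-character rule): short prefixes
# such as "Mc" and "Mac" stay attached, longer joined words still split.
print(split_on_case_change("McDonalds", min_prefix=4))
print(split_on_case_change("MacDonald", min_prefix=4))
print(split_on_case_change("CheapPillsOnline", min_prefix=4))
```

With min_prefix left at 0 every case change produces a new token, so the 
filter gets to judge each piece; with the guard, names like McDonalds stay 
whole at the cost of one more rule in the tokenizer.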

David -

Tom Brown wrote:
> It seems like you wouldn't need to run a dictionary past each token,
> just make sure the split happens after the 3rd character (so McDonalds
> or MacDonald doesn't get split, but the majority of the SPAM is
> correctly tokenized).
>   


