List:       spambayes
Subject:    [Spambayes] using SpamBayes for Wiki filtering
From:       trac () matt-good ! net (Matthew Good)
Date:       2005-11-27 1:25:18
Message-ID: 1133054719.23256.21.camel () localhost ! localdomain

On Sun, 2005-11-27 at 12:16 +1300, Tony Meyer wrote:
> The tokenizer is designed to tokenize email, but you could certainly
> write your own tokenizer (or subclass the existing one) designed to
> tokenize wiki pages.  Once you've got tokens, you can use the
> existing classifier and storage classes (classifier.py and storage.py).
>
> However, you might find that the email tokenizer does reasonably well
> on wiki pages; email and web text are not particularly different.  It
> would be worth trying that first.

Yeah, a lot of the wiki-formatting constructs should be pretty close to
what people use in plain-text emails.  I may also see how it behaves if
I convert it to HTML first instead of using the raw Wiki text.
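
As a rough illustration of the "write your own tokenizer" option quoted
above, here is a minimal sketch that turns raw wiki text into tokens.
The function name tokenize_wiki, the "url:" token prefix, and the 3 to 12
character length filter are all assumptions made for this example; none
of them come from SpamBayes or Trac.

```python
import re

# Hypothetical helper, not part of SpamBayes or Trac.
# Matches http/https URLs up to whitespace or wiki link delimiters.
URL_RE = re.compile(r"https?://[^\s\]|]+")
# Plain alphanumeric runs; wiki markup such as ''' or [[ ]] is skipped.
WORD_RE = re.compile(r"[A-Za-z0-9]+")


def tokenize_wiki(text):
    """Yield tokens from raw wiki text.

    URL hosts come first as "url:<host>" tokens, followed by the
    lowercased words of the remaining text.
    """
    for url in URL_RE.findall(text):
        # "http://host/path".split("/") -> ["http:", "", "host", "path"]
        host = url.split("/")[2].lower()
        yield "url:" + host

    # Blank out the URLs so their pieces are not counted again as words.
    remaining = URL_RE.sub(" ", text)
    for word in WORD_RE.findall(remaining):
        # Assumed length filter: drop very short and very long runs.
        if 3 <= len(word) <= 12:
            yield word.lower()


page = "Buy '''cheap''' pills at [http://Spam.example.com/buy now]"
tokens = list(tokenize_wiki(page))
```

A token stream like this is the kind of input a classifier would be
trained on; whether it does any better than the stock email tokenizer
would have to be measured on real wiki spam.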

I started writing a subclass of the SQLClassifier in order to store the
statistics in the Trac db, which is pretty straightforward.
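
To make the idea of keeping the statistics in a database concrete, here
is a small sketch of per-token spam/ham counts stored in a SQL table.
It does not use or reproduce the SQLClassifier interface: the class
name TokenCountStore, the table name spam_tokens, and its columns are
hypothetical, and sqlite3 in memory only stands in for the Trac db.

```python
import sqlite3


class TokenCountStore:
    """Hypothetical store of per-token spam/ham counts in a SQL table."""

    def __init__(self, conn):
        self.conn = conn
        conn.execute(
            "CREATE TABLE IF NOT EXISTS spam_tokens ("
            "token TEXT PRIMARY KEY, "
            "nspam INTEGER NOT NULL, "
            "nham INTEGER NOT NULL)"
        )

    def learn(self, tokens, is_spam):
        """Record one trained page, counting each distinct token once."""
        # The column name comes from this fixed pair, never from user input.
        column = "nspam" if is_spam else "nham"
        for token in set(tokens):
            self.conn.execute(
                "INSERT OR IGNORE INTO spam_tokens VALUES (?, 0, 0)",
                (token,),
            )
            self.conn.execute(
                f"UPDATE spam_tokens SET {column} = {column} + 1 "
                "WHERE token = ?",
                (token,),
            )
        self.conn.commit()

    def counts(self, token):
        """Return (nspam, nham) for a token, or (0, 0) if unseen."""
        row = self.conn.execute(
            "SELECT nspam, nham FROM spam_tokens WHERE token = ?",
            (token,),
        ).fetchone()
        return row if row else (0, 0)


store = TokenCountStore(sqlite3.connect(":memory:"))
store.learn(["pills", "pills", "cheap"], is_spam=True)
store.learn(["cheap"], is_spam=False)
```

Counting each token once per page is a choice made for this sketch; a
real subclass would follow whatever the base classifier expects.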

> > Are there any other projects using SpamBayes like this that I can
> > use as an example?
> 
> There's a plug-in for a web proxy to use SpamBayes for web filtering  
> in the contrib/ directory.  If you google through the archives of  
> this list (or maybe spambayes-dev?) there's an example of Skip using  
> SpamBayes for music classification, IIRC.  I've used the classifier &
> storage for classification of lines of dialogue in a scripted  
> performance.

Thanks, I'll check that out.  It looks like using SpamBayes should work
out OK.  The harder part will probably be adding the appropriate places
to the Trac UI for training the classifier (and presumably making this
extensible, since we'll want to allow various spam-prevention plugins).

-- 
Matthew Good <trac at matt-good.net>
