List:       spamassassin-users
Subject:    Re: Autolearning from rules rather than score
From:       RW <rwmaillists () googlemail ! com>
Date:       2009-01-29 18:03:06
Message-ID: 20090129180306.0b9accf7 () gumby ! homeunix ! com

On Thu, 29 Jan 2009 10:32:05 +0100
Matus UHLAR - fantomas <uhlar@fantomas.sk> wrote:

> On 28.01.09 22:36, RW wrote:
> > I just pass it through dspam and then score like this:
> > 
> > header   DS_HAM       X-DSPAM-Result =~ /^(Innocent|Whitelisted)/
> > header   DS_SPAM      X-DSPAM-Result =~ /^Spam/
> > meta     DS_HAM_FULL  DS_HAM && (BAYES_00 || BAYES_05)
> > 
> > score    DS_HAM        -2.5
> > score    DS_SPAM       21.0
> > score    DS_HAM_FULL  -15.0
> 
> don't you trust dspam too much?

No, it's 9 points below the 30-point threshold, and DSPAM FPs once in
a blue moon.

> 
> > score    BAYES_00     -2.5
> > score    BAYES_05     -1.5
> 
> Why do you do this?
> 1. you are assigning scores for BAYES even if BAYES is turned off

Are any of the BAYES_* rules hit if BAYES is turned off?

> 2. BAYES_00 has score -2.599 when network rules are on, you are
> lowering effectiveness - 

Not really. BAYES_00 is left independently scored just to allow a
little extra safety margin for the unlikely combination DS_SPAM +
BAYES_00; aside from that combination, BAYES_00 will always trip the
DS_HAM_FULL meta rule. It's set so that BAYES_00 + DS_HAM +
DS_HAM_FULL = -20, which eliminates the most serious problem in SA,
whereby a large legitimate mail sometimes accumulates a lot of textual
hits and runs up a huge score. The precise value doesn't matter all
that much, but -20 lets me see at a glance how many other points I'm
getting.
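
For reference, that -20 is just the three scores above adding up
(these are my local settings, not the stock ones):

    BAYES_00      -2.5
    DS_HAM        -2.5
    DS_HAM_FULL  -15.0
    ------------------
    total        -20.0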

I don't regard the default BAYES scores as all that sensible, because
the scoring algorithm tolerates a very high level of FPs. I think most
people who use sa-learn properly would see better results if they
dropped BAYES_00 to -10 or lower. Personally I've never seen a single
spam hit BAYES_00.
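
If anyone wants to try that, it's just a local.cf override along these
lines (the exact value is a matter of taste, and depends on how clean
your training corpus is):

    score BAYES_00   -10.0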


> > However, thinking about it a bit more, I think that the only real
> > problem is that ham that scores between 0.1 and 5.0
> > won't be learned as ham, and I can fix that by moving the autolearn
> > threshold to up to 4.9.
> 
> So you are willing to feed _anything_ to BAYES filters, which means
> that without manual intervention, FP and FN rate will both quickly
> increase

That's a bit alarmist. The worst case is that the 4.9 threshold causes
a small number of spams that aren't recognised by dspam to be
*temporarily* learned as ham in SA until I correct it. That sounds
pretty good to me: if a new type of mail starts to arrive that's not
recognised by DSPAM, I'd like SA to behave like DSPAM does and give it
the benefit of the doubt.
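
For anyone following along, that just means raising the non-spam
auto-learn threshold in local.cf, something like this (the default is
0.1; I'm assuming the stock auto-learn threshold settings here):

    bayes_auto_learn_threshold_nonspam 4.9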


The normal way to train DSPAM is to let it autolearn everything and
correct it when it's wrong. Once DSPAM has learned from reasonably
sized corpora it can stand a few percent of misclassifications in new
mail, because they're washed out by correct classifications. It's only
if you stop correcting that you have a real problem, and even then it
will simply catch less spam, due to the strong bias against FPs.
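
By "correct it" I just mean the usual dspam error retraining, roughly
along these lines (the user name and file names are just placeholders
for my local setup, so treat this as a sketch rather than a recipe):

    # a spam that dspam marked Innocent
    dspam --user rw --class=spam --source=error < missed-spam.eml

    # a ham that dspam marked Spam
    dspam --user rw --class=innocent --source=error < false-positive.eml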