List:       spamassassin-users
Subject:    Re: Help tagging URL spam
From:       Alex <mysqlstudent () gmail ! com>
Date:       2012-01-03 0:39:15
Message-ID: CAB1R3siSFvwX3hfL0X0sPxWXF4YACgeij5m+svpGfMNY_E70qA () mail ! gmail ! com

Hi,

>>>> http://pastebin.com/raw.php?i=1Y5QCkfh
>>>> http://pastebin.com/raw.php?i=KdmZXM0d
...
>> What I haven't been able to figure out is a more generalized pattern
>> from these, such as something in the header that is inconsistent with
>> non-spam or contains some type of invalid header data, such as the
>> mismatch between having originated at yahoo but being sent as
>> sbcglobal?
>>
>> Shouldn't bayes have picked this up after learning a dozen or more of
>> these?
>>
>
> IMHO, yes. Are you sure you are training bayes correctly? Are you using the
> same user to train bayes as the user that is running SA? Work through some
> of the advice already given regarding bayes.

Yes, I'm pretty sure bayes is solid. I'm autolearning, but with
thresholds of -1 and 13 instead of the defaults (0.1 and 12), and the
database holds about 9.7M tokens. I could probably return to the
defaults now, since it's been running for a while. Bayes is in MySQL
with bayes_sql_username set, so it always uses that database, and
there aren't any other databases. I am wondering why, even after a
sync, the sync isn't reflected in the dump below (last journal sync
atime is still 0), but perhaps that's due to MySQL?

$ sa-learn --sync
$ sa-learn --dump magic
0.000          0          3          0  non-token data: bayes db version
0.000          0     383489          0  non-token data: nspam
0.000          0     484418          0  non-token data: nham
0.000          0    9768178          0  non-token data: ntokens
0.000          0 1316487858          0  non-token data: oldest atime
0.000          0 1325550621          0  non-token data: newest atime
0.000          0          0          0  non-token data: last journal sync atime
0.000          0 1325550621          0  non-token data: last expiry atime
0.000          0    5529600          0  non-token data: last expire atime delta
0.000          0     982370          0  non-token data: last expire reduction count

Does the mysql bayes even have a journal or is it managed by mysql?
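
For reference, the setup described above would look roughly like this in local.cf. This is only a sketch: the DSN, username, and password are placeholders, not my real values, and only the two threshold numbers are taken from my actual config.

```
# Bayes stored in MySQL (DSN and credentials below are placeholders)
bayes_store_module   Mail::SpamAssassin::BayesStore::MySQL
bayes_sql_dsn        DBI:mysql:sa_bayes:localhost
bayes_sql_username   sa_user
bayes_sql_password   CHANGEME

# Autolearn with stricter-than-default thresholds
# (stock defaults are 0.1 for nonspam and 12.0 for spam)
bayes_auto_learn                    1
bayes_auto_learn_threshold_nonspam  -1
bayes_auto_learn_threshold_spam     13
```

Going back to the defaults would just mean dropping the two threshold lines.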

> sanjit.in is now listed in a couple URIBLs (URIBL_PH_SURBL &
> URIBL_HOSTKARMA_BL) - don't know if it was listed at the time you received
> them.

Yes, for me too.

> They hit some local meta rules I have combining FREEMAIL_FROM with
> __HAS_ANY_URI, __MANY_RECIPS, and various missing/blank subject rules. For
> me these are relatively good indicators of FREEMAIL spam.

Yes, the missing/blank are good triggers for metas.
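
A meta along those lines might look something like the following. I haven't seen the actual local rules, so the rule name, the exact combination, and the score are all made up for illustration. It also assumes the FreeMail plugin is loaded and that __HAS_ANY_URI and __MANY_RECIPS exist in the installed ruleset. A blank (as opposed to missing) subject would need its own sub-rule, which isn't shown here.

```
# Hypothetical local meta: name, logic, and score are illustrative only
meta      LOCAL_FREEMAIL_URI_NOSUBJ  (FREEMAIL_FROM && __HAS_ANY_URI && __MANY_RECIPS && MISSING_SUBJECT)
describe  LOCAL_FREEMAIL_URI_NOSUBJ  Freemail sender, URI present, many recipients, no Subject header
score     LOCAL_FREEMAIL_URI_NOSUBJ  1.5
```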

RW <rwmaillists@googlemail.com> wrote:

> RP_MATCHES_RCVD=-1.613,

> IIWY I'd take a look at how RP_MATCHES_RCVD is working for you. A lot
> of us find it does more harm than good. In particular it's adding a
> negative score to AOL, Yahoo, etc.

That's a good idea. I noticed that one of the samples hit this rule and
the other didn't. I'm not sure I can remove it altogether, because so
many ham messages hit it on my system, but maybe a meta can be built
from it, such as combining it with FREEMAIL_FROM and adding points when
the message doesn't hit RP_MATCHES_RCVD (i.e. the envelope sender
domain doesn't match the receiving relay's)?
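
Something like this is what I have in mind. It is an untested sketch: the rule name and the 0.5 score are placeholders I picked, and it assumes FREEMAIL_FROM (from the FreeMail plugin) and RP_MATCHES_RCVD are both enabled.

```
# Hypothetical meta: freemail sender whose envelope domain does NOT
# match the relay it was handed over from. Name and score are placeholders.
meta      LOCAL_FREEMAIL_RP_MISMATCH  (FREEMAIL_FROM && !RP_MATCHES_RCVD)
describe  LOCAL_FREEMAIL_RP_MISMATCH  Freemail From, envelope sender domain does not match relay
score     LOCAL_FREEMAIL_RP_MISMATCH  0.5
```

That would leave the negative score on RP_MATCHES_RCVD in place for ham and only add points in the freemail mismatch case.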

Thanks again,
Alex