List:       spambayes
Subject:    Re: [Spambayes] overtraining and retraining
From:       Jesus Cea <jcea () jcea ! es>
Date:       2011-10-17 14:01:14
Message-ID: 4E9C352A.6080201 () jcea ! es

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

On 16/10/11 18:45, skip@pobox.com wrote:
> 
>>> 2. When I train over a message, I keep training in a loop until
>>> the message probability goes under 20% (ham) or over 90%
>>> (spam). As the database ages, training spam needs more
>>> "looping", that is, the probability goes up slowly. The ham
>>> training, nevertheless, is fast and the loop counting is low.
> 
> Jesus> Uhm, the wiki says: "never train the same message twice".
> Jesus> Reason? I am breaking this badly.
> 
> Jesus,
> 
> I use train to exhaustion as referenced in your other email
> (contrib/tte.py in the SpamBayes distribution).  I currently have
> 21 hams and 17 spams in my current training database.  I suggest
> you just toss out everything but the most recent 10-15 hams and
> spams then start with that.
> 
> I cheat as well, since both my pobox.com mail forwarding service
> and Gmail (where it forwards to) apply their own spam filters
> before SpamBayes gets a crack at my mail.  The downside of that is
> that I need to scan their held spams periodically.

Thanks for your reply, Skip, but you don't address any of my concerns
:-): 1. Do not train with the same message twice. 2. Keep spam/ham
balanced. 3. Is it normal that "training" can slowly degrade the
quality? And if so, what do people do about it (besides deleting the
DB and retraining with recent samples)?

I think that 1 & 2 are related to the Bayes assumption about
independent samples. But the code is abusing Bayes so badly that
breaking this condition is actually irrelevant in our context :-).
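
To make the loop quoted above concrete (train on the same message
until its score drops under 20% for ham or rises over 90% for spam),
here is a minimal sketch. It does NOT use the SpamBayes API:
ToyClassifier, learn(), score() and train_until_confident() are
hypothetical names, and the scoring is plain naive Bayes with add-one
smoothing rather than the SpamBayes combining scheme. Only the two
cutoffs come from the description above.

```python
import math
from collections import Counter


class ToyClassifier:
    """Hypothetical stand-in for a token classifier (not SpamBayes)."""

    def __init__(self):
        self.spam_counts = Counter()  # token -> number of spams containing it
        self.ham_counts = Counter()   # token -> number of hams containing it
        self.nspam = 0
        self.nham = 0

    def learn(self, tokens, is_spam):
        """Count each distinct token once for this message."""
        if is_spam:
            self.spam_counts.update(set(tokens))
            self.nspam += 1
        else:
            self.ham_counts.update(set(tokens))
            self.nham += 1

    def score(self, tokens):
        """Return a spam probability in [0, 1] (naive Bayes, add-one smoothing)."""
        log_odds = 0.0
        for tok in set(tokens):
            p_spam = (self.spam_counts[tok] + 1) / (self.nspam + 2)
            p_ham = (self.ham_counts[tok] + 1) / (self.nham + 2)
            log_odds += math.log(p_spam / p_ham)
        # Clamp so math.exp cannot overflow on very long messages.
        log_odds = max(-50.0, min(50.0, log_odds))
        return 1.0 / (1.0 + math.exp(-log_odds))


def train_until_confident(clf, tokens, is_spam,
                          ham_cutoff=0.20, spam_cutoff=0.90, max_rounds=50):
    """Retrain on one message until its score crosses the cutoff.

    Returns the number of training rounds used. max_rounds is an
    assumption added here so the loop always terminates.
    """
    rounds = 0
    while rounds < max_rounds:
        p = clf.score(tokens)
        if (is_spam and p > spam_cutoff) or (not is_spam and p < ham_cutoff):
            break
        clf.learn(tokens, is_spam)
        rounds += 1
    return rounds


if __name__ == "__main__":
    clf = ToyClassifier()
    for _ in range(5):
        clf.learn(["meeting", "agenda"], is_spam=False)
    spam = ["cheap", "pills", "meeting"]
    print("rounds:", train_until_confident(clf, spam, is_spam=True))
    print("score:", clf.score(spam))
```

Each extra round counts the same message as if it were a new,
independent sample, which is exactly what points 1 and 2 warn about.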

BTW, what are the changes between 1.1a4 (my version) and 1.1a6? I
can't find an updated CHANGELOG...

- -- 
Jesus Cea Avion                         _/_/      _/_/_/        _/_/_/
jcea@jcea.es - http://www.jcea.es/     _/_/    _/_/  _/_/    _/_/  _/_/
jabber / xmpp:jcea@jabber.org         _/_/    _/_/          _/_/_/_/_/
.                              _/_/  _/_/    _/_/          _/_/  _/_/
"Things are not so easy"      _/_/  _/_/    _/_/  _/_/    _/_/  _/_/
"My name is Dump, Core Dump"   _/_/_/        _/_/_/      _/_/  _/_/
"El amor es poner tu felicidad en la felicidad de otro" - Leibniz
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.10 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org/

iQCVAwUBTpw1Kplgi5GaxT1NAQLlbQP/RxagFrvQcmWpz54cku6GR2KLkZByS54E
1ArPp92RlarYEaB0fUhn1D8JBbIOgwPHT65sE1p94mh18D7NxIVsJdUW4Ay9ZnR7
62CttlHFBMynv7xJGSzZ8d4OECwIqSobNqUYZgRLEwdKOvT/uak1t3DXW2o8xpRD
swfOemBzEtI=
=98ok
-----END PGP SIGNATURE-----