[prev in list] [next in list] [prev in thread] [next in thread] 

List:       python-dev
Subject:    [Python-Dev] The first trustworthy <wink> GBayes results
From:       tim.one () comcast ! net (Tim Peters)
Date:       2002-08-31 21:43:39
Message-ID: LNBBLJKPBEHFEDALKOLCKEEJBBAB.tim.one () comcast ! net
[Download RAW message or body]

[Tim, to Paul Graham]
> ...
> I also noted earlier that FREE (all caps) is now one of the 15 words that
> most often makes it into the scorer's best-15 list, and cutting
> the legs off a clue like that is unattractive on the face of it.  So I'm
> loathe to fold case unless experiment proves that's an improvement, and it
> just doesn't look likely to do so.

Those experiments have been run now.  Folding case gave a slight but
significant improvement in the false negative rate.  It had no effect on the
false positive rate, but did change the *set* of messages flagged as false
positives:  conference announcments are no longer flagged (for their VISIT
OUR WEBSITE FOR MORE INFORMATION! kinds of repeated SCREAMING), but some
highly off-topic messages do (e.g., talking about money is now
indistinguishable from screaming about MONEY).  So, overall, I'm leaving
case-folding in.  It does (of course) reduce the database size, and reduce
the amount of training data needed.  I have no idea what this does for
corpora in languages other than English (for that matter, I don't even know
what "fold case" *means* in other languages <wink>).

Experiment also showed that boosting the "unknown word" probability from 0.2
to 0.5 was a pure win:  it had no significant effect on the false positive
rate, but cut the false negative rate by a third.  The only change I've seen
that had a bigger effect on reducing false negatives was adding special
parsing and tagging for embedded URLs.



[prev in list] [next in list] [prev in thread] [next in thread] 

Configure | About | News | Add a list | Sponsored by KoreLogic