
List:       spambayes-bugs
Subject:    [spambayes-bugs] [ spambayes-Feature Requests-854705 ] Detect "
From:       noreply () sourceforge ! net (SourceForge ! net)
Date:       2005-05-13 3:57:39
Message-ID: E1DWRIx-000462-Kn () sc8-sf-web2 ! sourceforge ! net

Feature Requests item #854705, was opened at 2003-12-06 01:58
Message generated for change (Comment added) made by anadelonbrin
You can respond by visiting: 
https://sourceforge.net/tracker/?func=detail&atid=498106&aid=854705&group_id=61702

Category: None
Group: None
>Status: Closed
Priority: 5
Submitted By: Julian Morrison (julianm)
Assigned to: Nobody/Anonymous (nobody)
Summary: Detect "line noise" in subject and body

Initial Comment:
Spell check words in the message subject and body,
generate tokens for the count of misspellings in each.
Perhaps also generate tokens for the ratio of
incorrect/correct spellings? This could be chunked to
make it easier to train, e.g.: all, more than half, about
half, less than half, none. These should be separate
for subject and for body since garble in the header is
very predictive of spam.
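
A minimal sketch of the bucketed tokens described above, not an existing
SpamBayes feature. The names KNOWN_WORDS, misspelling_bucket and
spelling_tokens, the token spellings, the extra "empty" bucket and the
0.4/0.6 cut-offs are all assumptions made for illustration; a real version
would use a proper dictionary or spell checker.

```python
import re

# Hypothetical stand-in for a real dictionary or spell checker.
KNOWN_WORDS = {"buy", "cheap", "now", "meeting", "at", "noon"}

WORD_RE = re.compile(r"[a-z']+")


def misspelling_bucket(wrong, total):
    """Map a misspelling count to one of the coarse buckets from the request."""
    if total == 0:
        return "empty"  # extra bucket (assumption): the section has no words
    if wrong == 0:
        return "none"
    if wrong == total:
        return "all"
    ratio = wrong / total
    if ratio < 0.4:  # the 0.4 and 0.6 cut-offs are arbitrary assumptions
        return "less-than-half"
    if ratio <= 0.6:
        return "about-half"
    return "more-than-half"


def spelling_tokens(text, section, known_words=KNOWN_WORDS):
    """Yield a count token and a bucketed-ratio token for one section."""
    words = WORD_RE.findall(text.lower())
    wrong = sum(1 for word in words if word not in known_words)
    yield "%s:misspelled-count:%d" % (section, wrong)
    yield "%s:misspelled-ratio:%s" % (section, misspelling_bucket(wrong, len(words)))
```

Calling spelling_tokens("Buy cheap xqzv now", "subject") yields
"subject:misspelled-count:1" and "subject:misspelled-ratio:less-than-half".
Passing the section name keeps subject and body tokens separate, as the
request asks.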

Also, there has to be some way to look for words with
"impossible to pronounce" consonant clusters such as
"dvgkbm". Could spambayes be made to look for
"syllables"? E.g., by parsing words into syllables and
generating tokens for each? I'm not sure there's a
parsing technique that's sufficiently
internationalized.  Perhaps even just generating tokens
for ASCII consonant clusters would be better than nothing.
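
The ASCII-only fallback could be sketched roughly as follows. This is an
illustration under stated assumptions, not SpamBayes code: the function
name, the "consonant-cluster:" token prefix, treating "y" as a vowel, and
the minimum run length of four are all arbitrary choices.

```python
import re

# Runs of four or more ASCII consonants. Treating 'y' as a vowel and
# using a minimum length of four are assumptions, not tuned values.
CLUSTER_RE = re.compile(r"[bcdfghjklmnpqrstvwxz]{4,}")


def consonant_cluster_tokens(text):
    """Yield one token per long ASCII consonant run found in the text."""
    for word in re.findall(r"[a-z]+", text.lower()):
        for cluster in CLUSTER_RE.findall(word):
            yield "consonant-cluster:" + cluster
```

With these choices "dvgkbm" produces "consonant-cluster:dvgkbm" and
"hello" produces nothing, but an abbreviation such as "xmlrpc" is flagged
as well.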

----------------------------------------------------------------------

>Comment By: Tony Meyer (anadelonbrin)
Date: 2005-05-13 15:57

Message:
Logged In: YES 
user_id=552329

I tried generating tokens if a token wasn't in a dictionary
(more-or-less the same as spell checking), and that didn't
help.  See the wiki http://entrian.com/sbwiki for more
details and the patch, in case anyone else wants to try it.

Unless anyone can show that this helps, it won't be added.

The unpronounceable suggestion is unlikely to help if
dictionary words didn't.  I don't see how it would work
outside English, anyway.

----------------------------------------------------------------------

Comment By: Julian Morrison (julianm)
Date: 2003-12-06 03:41

Message:
Logged In: YES 
user_id=21754

Yeah you're right about "unpronounceable:xmlrpc", oops, my
bad. Sorry, ignore that bit.

The hack I suggested for misspellings can be extended to
unpronounceability counts, or anything similar. If it's a
known token and a statistical ham indicator, then never
count it as "unpronounceable" or "misspelled". That approach
would quickly enough learn tech-speak or whatever, but it
would catch high incidence of garble.

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-12-06 03:11

Message:
Logged In: YES 
user_id=85414

What's the difference between the tokeniser spitting
out "xmlrpc" and spitting out "unpronounceable:xmlrpc"?
That doesn't make any difference.  The difference is when
you "generate tokens for the count of misspellings" (or
unpronounceables) - then your system starts to decide
that high unpronounceable counts are spammy, and techie
messages get more spammy.  (Unless the tech-speak
outweighs the spam garbage, but even we're not *that*
techie!)


----------------------------------------------------------------------

Comment By: Julian Morrison (julianm)
Date: 2003-12-06 03:04

Message:
Logged In: YES 
user_id=21754

Hmm, would it not merely learn token
"unpronounceable:xmlrpc" as a ham indicator?

Also, as a spellcheck hack: words that are already
recognised tokens, and are ham indicators, should not count
as misspelled even if the spell check rejects them. This
would then quickly learn not to add "xmlrpc" into the
misspelled-words count and ratio.
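
A rough sketch of this exemption, assuming the trained classifier's state
can be read as a plain mapping from token to spam probability. The
spam_prob dict, the count_misspelled name and the 0.2 ham cut-off are
hypothetical and do not correspond to the real SpamBayes API.

```python
def count_misspelled(words, known_words, spam_prob, ham_cutoff=0.2):
    """Count words the dictionary rejects, skipping trained ham clues.

    spam_prob is a hypothetical mapping of token -> spam probability
    (0.0 = pure ham, 1.0 = pure spam); tokens never seen in training
    are simply absent from it.
    """
    count = 0
    for word in words:
        if word in known_words:
            continue  # the dictionary accepts it
        prob = spam_prob.get(word)
        if prob is not None and prob <= ham_cutoff:
            continue  # already learned as a ham indicator, so exempt
        count += 1
    return count
```

For example, with known_words = {"meeting"} and spam_prob = {"xmlrpc": 0.05},
the words ["xmlrpc", "dvgkbm", "meeting"] give a count of 1; before "xmlrpc"
has been trained (an empty spam_prob) the same words give 2.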

----------------------------------------------------------------------

Comment By: Richie Hindle (richiehindle)
Date: 2003-12-06 02:53

Message:
Logged In: YES 
user_id=85414

We spambayes developers spend a lot of time talking
about smtp, pop3, cdo, mapi, tcpip, http, html, py2exe,
rfc822, chi2, kmail, ie, oe, xmlrpc, bsddb...

Now those things would be trained as ham clues, but
your scheme would dilute them.  I'm not saying it's a
bad idea, but just because something is unpronounceable
and not in the dictionary doesn't make it the same class
of thing as all the other tokens which are unpronounceable
and not in the dictionary.


----------------------------------------------------------------------
