
List:       spamassassin-users
Subject:    Current best-practices around normalize_charset?
From:       Jay Sekora <jsekora@csail.mit.edu>
Date:       2013-07-16 15:30:38
Message-ID: 51E5671E.3000005@csail.mit.edu

Hi.  We're running SpamAssassin 3.3.1, and pursuant to some advice I've 
seen in archives of this list and spamassassin-dev (e.g., 
http://osdir.com/ml/spamassassin-dev/2009-07/msg00156.html), I am *not* 
using normalize_charset.  Unfortunately, this makes filtering text in 
multibyte binary encodings almost impossible.  Even if I can come up 
with a word I want to match, word boundaries aren't at byte boundaries, 
so a byte-by-byte rule would need several alternative match strings, and 
it still couldn't match the first or last character of the phrase I'm 
after.  For a language like Chinese, where words tend to be one or two 
characters long, that's a big problem.  And that's on top of the 
alternative patterns needed to cover each non-Unicode encoding, of 
course.
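To make the "alternative patterns" problem concrete, here's a small illustration (Python, not SpamAssassin rule syntax; the word is just a made-up example) showing that the same Chinese word produces entirely different byte sequences under common encodings, so a raw-byte rule needs one pattern per encoding:

```python
# "invoice" -- a word often seen in Chinese-language spam;
# simplified vs. traditional written forms
simplified, traditional = "发票", "發票"

print("utf-8 :", simplified.encode("utf-8").hex(" "))   # three bytes per character
print("gb2312:", simplified.encode("gb2312").hex(" "))  # two bytes per character
print("big5  :", traditional.encode("big5").hex(" "))   # two bytes per character
```

None of the three byte sequences share so much as a common prefix, which is why a single pattern can't cover them without charset normalization.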

Anyway, my question is, is that advice still valid (for 3.3.1, which is 
packaged for Debian Squeeze, or for latest stable)?  And if so, what do 
people tend to do to write rules for East Asian character sets (or, for 
that matter, for Western character sets encoded in binary to make them 
harder to filter)?  The traffic on the bug report quoted in the above 
message is kind of ambiguous.
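The brute-force alternative I'm describing amounts to pre-encoding the target word under each encoding you expect and matching raw bytes; a rough sketch of the idea in Python (the word and message body here are hypothetical examples, and this still has the character-boundary problem described above):

```python
import re

word = "发票"  # example target word ("invoice")

# one raw-byte pattern per encoding we expect to see in the wild
patterns = [re.compile(re.escape(word.encode(enc)))
            for enc in ("utf-8", "gb2312")]

# example message body, as raw GB2312 bytes
body = "请与我们联系代开发票".encode("gb2312")

print(any(p.search(body) for p in patterns))  # True: the gb2312 pattern matches
```

Multiply that by every word and every plausible encoding, and the rule set gets unwieldy fast, which is why I'm asking what people actually do in practice.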

(I will note that ok_languages and ok_locales are pretty useless here, 
at least for site-wide use, since we have users with correspondence in 
pretty much any language we've ever seen spam in.)

Jay

-- 
Jay Sekora
Linux system administrator and postmaster,
The Infrastructure Group
MIT Computer Science and Artificial Intelligence Laboratory
