List:       spamassassin-users
Subject:    Re: Evasion with Unicode format characters
From:       John Hardin <jhardin () impsec ! org>
Date:       2018-10-30 20:34:17
Message-ID: alpine.LNX.2.21.1810301329480.4629 () athena ! impsec ! org

On Tue, 30 Oct 2018, Cedric Knight wrote:

> I thought of submitting a patch via Bugzilla, but then decided to first
> ask and check that I understood the general principles of body checks,
> and SpamAssassin's current approach to Unicode. Apologies for the length
> of this message. I hope the main points make sense.
>
> A fair number of webcam bitcoin 'sextortion' scams have evaded detection
> and worried recipients because of including relevant credentials.
> (Incidentally, I assume the credentials and addresses are mostly from
> the 2012 LinkedIn breach, but someone on the RIPE abuse list reports
> Mailman passwords were also used). BITCOIN_SPAM_05 is catching some of
> this spam, but on writing body regexes to catch the wave around 16
> October, I noticed that my rules weren't matching because the source was
> liberally injected with invisible characters:
> Content preview:  I a<U+200C>m a<U+200C>wa<U+200C>re blabla is one of
> your pa<U+200C>ss. L<U+200C>ets   g<U+200C>et strai<U+200C>ght
> to<U+200C> po<U+200C>i<U+200C>nt. No<U+200C>t o<U+200C>n<U+200C>e

Would you send me a zipped copy of a sample? I would like to update the ZW 
text obfuscation rule for that, and possibly others.

> As minor points, 'Format' excludes a couple of separator characters in
> the same range that instead match [:space:]
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:subhead=Format%20character:]
> Then there is the C1 [:cntrl:] set, which some MUAs may render
> silently, I think including the 0x9D matched by the recent
> __UNICODE_OBFU_ZW (what's the significance of UNICODE in the rule name?):
> https://unicode.org/cldr/utility/list-unicodeset.jsp?a=[:General_Category=Control:]
> Finally, there may be a case for including as 'almost' invisible narrow
> blanks like U+200A &hairsp; U+202F and maybe U+205F. The Perl Unicode
> database may not be completely up-to-date here, and Perl 5.18 doesn't
> recognise U+061C, U+2066 and U+1BCA1 ranges as \p{Format}, although 5.24
> does.

"UNICODE" because the invisible crap ain't ANSI. :)

> So my patch was going to be something to eliminate Format characters
> from get_rendered_body_text_array() like:
> --- lib/Mail/SpamAssassin/Message.pm	(revision 1844922)
> +++ lib/Mail/SpamAssassin/Message.pm	(working copy)
> @@ -1167,6 +1167,8 @@
>   $text =~ s/\n+\s*\n+/\x00/gs;		# double newlines => null
> # $text =~ tr/ \t\n\r\x0b\xa0/ /s;	# whitespace (incl. VT, NBSP) => space
> # $text =~ tr/ \t\n\r\x0b/ /s;		# whitespace (incl. VT) => single space
> +  # do not render zero-width Unicode characters used as obfuscation:
> +  $text =~ s/[\p{Format}\N{U+200C}\N{U+2028}\N{U+2029}\N{U+061C}\N{U+180E}\N{U+2065}-\N{U+2069}]//gs;
>   $text =~ s/\s+/ /gs;		        # Unicode whitespace => single space
>   $text =~ tr/\x00/\n/;			# null => newline

The problem with this approach is that the *presence* of such characters is 
a pretty strong spam sign, and stripping them at render time would hide it 
from body rules.

Potentially those tests could be moved to RAWBODY rules, though - I'll 
investigate that for the ZW rule.
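
To illustrate the trade-off, here is a minimal Python sketch. It is not 
SpamAssassin code, and the function name and sample text are hypothetical. 
It counts the Format characters in a string before removing them, so the 
"presence" signal is kept even when the text is normalised for matching. It 
assumes Python's bundled Unicode database classifies these code points as 
category Cf, which, as with Perl, depends on the interpreter version.

```python
import unicodedata

# Line/paragraph separators that the quoted patch also removes.
# They are categories Zl/Zp, not Cf, so they are listed explicitly.
EXTRA_INVISIBLE = {"\u2028", "\u2029"}


def strip_invisible(text):
    """Return (cleaned_text, removed_count) for a Unicode string.

    removed_count is the number of Format (Cf) characters and extra
    separators dropped; a caller could treat a non-zero count as a
    spam indicator while matching rules against cleaned_text.
    """
    kept = []
    removed = 0
    for ch in text:
        if unicodedata.category(ch) == "Cf" or ch in EXTRA_INVISIBLE:
            removed += 1
        else:
            kept.append(ch)
    return "".join(kept), removed


if __name__ == "__main__":
    # Hypothetical sample modelled on the content preview quoted above.
    sample = "I a\u200cm a\u200cwa\u200cre"
    cleaned, hits = strip_invisible(sample)
    print(repr(cleaned), hits)
```

Returning the count alongside the cleaned text means neither piece of 
information has to be sacrificed for the other.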

-- 
  John Hardin KA7OHZ                    http://www.impsec.org/~jhardin/
  jhardin@impsec.org    FALaholic #11174     pgpk -a jhardin@impsec.org
  key: 0xB8732E79 -- 2D8C 34F4 6411 F507 136C  AF76 D822 E6E6 B873 2E79
-----------------------------------------------------------------------
   ...the Fates notice those who buy chainsaws...
                                               -- www.darwinawards.com
-----------------------------------------------------------------------
  Tomorrow: Halloween