List:       spamassassin-devel
Subject:    [Bug 5072]  New: Images: Instead of OCR (difficult), just recognize whether image is likely to conta
From:       bugzilla-daemon () issues ! apache ! org
Date:       2006-08-29 15:48:44
Message-ID: 5072.dev () spamassassin ! apache ! org

http://issues.apache.org/SpamAssassin/show_bug.cgi?id=5072

           Summary: Images: Instead of OCR (difficult), just recognize
                    whether image is likely to contain text
           Product: Spamassassin
           Version: unspecified
          Platform: All
        OS/Version: All
            Status: NEW
          Severity: enhancement
          Priority: P5
         Component: Rules
        AssignedTo: dev@spamassassin.apache.org
        ReportedBy: richard@2006.atterer.net


Hi,
ATM the race between spammers and spam filters seems to be in the area of
images: the spammers make it harder and harder to OCR their images.
So maybe you could look at recognizing "text" in general. IMHO, for many people
it is highly unlikely that a "normal" mail will contain a picture with lots of
text in it, so the mere presence of such a picture could be an indicator for
spam.
A number of ways to estimate the amount of text in a picture spring to mind.

Possible ways of implementing this:

- frequency analysis (e.g. using a DCT) - I guess text will have a lot of
high-freq spikes, and additionally there will be a low-freq spike in the
vertical direction - the text lines!
- Number of different colours
- Number of lines which are exactly 1, 2, 3... pixels wide
- When "filling" areas of adjacent pixels of the same colour (i.e. individual
characters), the size of the resultant bounding boxes in pixels will probably
show a particular distribution. E.g.: "90% of all fillable areas have a bounding
box with an area of 200 +/- 50 square pixels."
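
To make the last three ideas a bit more concrete, here is a rough Python sketch.
It is only an illustration, not a proposed SpamAssassin rule: it assumes the
image has already been decoded into a list of rows of pixel values, and the
function names, the tiny sample image and the 4..8 pixel area range are all
made up for the example.

```python
from collections import Counter, deque


def count_colours(image):
    """Number of distinct pixel values in the image."""
    return len({pixel for row in image for pixel in row})


def run_length_histogram(image):
    """Count horizontal runs of identical pixels, keyed by run length."""
    histogram = Counter()
    for row in image:
        run = 0
        for i, pixel in enumerate(row):
            run += 1
            # A run ends at the end of the row or when the colour changes.
            if i + 1 == len(row) or row[i + 1] != pixel:
                histogram[run] += 1
                run = 0
    return histogram


def region_bounding_boxes(image):
    """Flood-fill each 4-connected same-colour region.

    Returns one (width, height) bounding box per region, in scan order.
    """
    height = len(image)
    width = len(image[0]) if height else 0
    seen = [[False] * width for _ in range(height)]
    boxes = []
    for y0 in range(height):
        for x0 in range(width):
            if seen[y0][x0]:
                continue
            colour = image[y0][x0]
            seen[y0][x0] = True
            queue = deque([(x0, y0)])
            min_x = max_x = x0
            min_y = max_y = y0
            while queue:
                x, y = queue.popleft()
                min_x, max_x = min(min_x, x), max(max_x, x)
                min_y, max_y = min(min_y, y), max(max_y, y)
                for nx, ny in ((x + 1, y), (x - 1, y), (x, y + 1), (x, y - 1)):
                    if (0 <= nx < width and 0 <= ny < height
                            and not seen[ny][nx] and image[ny][nx] == colour):
                        seen[ny][nx] = True
                        queue.append((nx, ny))
            boxes.append((max_x - min_x + 1, max_y - min_y + 1))
    return boxes


def fraction_in_area_range(boxes, low, high):
    """Fraction of bounding boxes whose area lies in [low, high]."""
    if not boxes:
        return 0.0
    hits = sum(1 for w, h in boxes if low <= w * h <= high)
    return hits / len(boxes)


# Hypothetical sample: three 2x3 "characters" (1) on a background (0).
GLYPH_ROW = [0, 1, 1, 0, 1, 1, 0, 1, 1, 0]
sample = [[0] * 10, GLYPH_ROW, GLYPH_ROW, GLYPH_ROW, [0] * 10]

boxes = region_bounding_boxes(sample)
print(count_colours(sample))                 # 2
print(fraction_in_area_range(boxes, 4, 8))   # 0.75
print(run_length_histogram(sample)[2])       # 9
```

Note that the background is itself one big fillable area, so it shows up as a
single large bounding box. A real check would need thresholds tuned on actual
spam and non-spam images.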

Hope this conveys the idea and that it's useful to you!
Cheers, Richard



------- You are receiving this mail because: -------
You are the assignee for the bug, or are watching the assignee.