Yahoo Groups archive

Milter-greylist

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by AIDA Shinra

At Thu, 2 Nov 2006 16:07:41 +0100 (CET),
Oliver Fromme wrote:
> 
> 
> The problem is that OCR itself is heavyweight.  I've worked
> with quite a few OCR systems in the past 15 years, and all
> of them require a serious amount of processing performance.
> And even then you have a certain error rate.  The good
> systems (those that have a low error rate) even require a
> "training" process in advance for recognizing the fonts
> being used for the documents being scanned.
> 
> Therefore I think that running OCR on embedded images on a
> mail server isn't an option, I'm afraid.  Unless you're
> prepared that processing of every email requires a long
> time, and that there are errors in the recognized text
> (so that your regex will have trouble matching).  It will
> even make it easier to run a denial-of-service attack
> against your mail server, by simply sending many emails
> that contain multiple large images with random pixel
> patterns.
> 
> I think a better approach (i.e. much faster and more
> reliable) would be to create a public database of such
> images.  Spammers aren't generating new images for every
> single mail, so that should be feasible.
> 
> First, when someone identifies a mail as spam, a hash of
> the image (e.g. an MD5 checksum) is submitted to the
> database.  This could be pretty much automated by a script
> or little tool, so a single hotkey from within your MUA
> will do it.
> 
> Second, if email is received somewhere else with one or
> more images attached, the MD5 checksum will be calculated
> and checked against the database.  If it has been reported
> before, the mail is tagged as spam.
> 
> In fact, the database could be implemented as a DNS black
> list, so milter-greylist would already support it.  ;-)
> The DNS query would ask for the MD5 checksum (as a string
> of 32 hex digits).  Maybe the reply could also contain some
> indication of how many people have already reported this
> image as spam, so a higher number would mean it is more
> likely to be spam indeed (i.e. a higher "score").
> 
> Just an idea.  Anyway, OCR is not a solution, I think.
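
A rough sketch of the lookup Oliver describes, in Python. The zone name
is hypothetical (no such public list is implied to exist), and encoding
the report count in the last octet of a 127.0.0.N reply is only an
assumption made for illustration; the original proposal does not fix a
reply format.

```python
import hashlib
import socket

# Hypothetical zone name, used only for this sketch.
ZONE = "imagehash.example.org"


def dnsbl_query_name(image_bytes: bytes, zone: str = ZONE) -> str:
    """Return the DNS name to query: <32 hex digits of MD5>.<zone>."""
    digest = hashlib.md5(image_bytes).hexdigest()
    return f"{digest}.{zone}"


def report_count(image_bytes: bytes, zone: str = ZONE) -> int:
    """Return the report count from the reply, or 0 if the image is unlisted.

    Assumption: a listed image resolves to 127.0.0.N, where N is the
    (capped) number of reports. This encoding is invented for the sketch.
    """
    try:
        address = socket.gethostbyname(dnsbl_query_name(image_bytes, zone))
    except socket.gaierror:
        return 0  # no such name or lookup failure: treat as not reported
    return int(address.rsplit(".", 1)[1])
```

Only dnsbl_query_name() is needed on the reporting side; the receiving
side would call something like report_count() on each attached image
and compare the result with a local threshold.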

MD5 can easily be bypassed by making invisible, unique changes to each
image. We need something like a DFT or wavelet transform for this
purpose.
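
To illustrate the point, here is a small Python sketch. It does not use
a DFT or a real wavelet library; it substitutes a much simpler
block-average signature (roughly what the coarsest level of a Haar
decomposition keeps), and the test image is synthetic. The function
names are made up for this example.

```python
import hashlib


def average_hash(pixels, grid=8):
    """Return a grid*grid-bit signature of a grayscale image.

    pixels is a list of rows of ints in 0..255; width and height are
    assumed to be multiples of grid.
    """
    height, width = len(pixels), len(pixels[0])
    bh, bw = height // grid, width // grid
    means = []
    for gy in range(grid):
        for gx in range(grid):
            block = [pixels[y][x]
                     for y in range(gy * bh, (gy + 1) * bh)
                     for x in range(gx * bw, (gx + 1) * bw)]
            means.append(sum(block) / len(block))
    overall = sum(means) / len(means)
    bits = 0
    for mean in means:
        # One bit per block: brighter than the image average or not.
        bits = (bits << 1) | int(mean > overall)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def md5_of(pixels):
    """MD5 of the raw pixel values, row by row."""
    return hashlib.md5(bytes(v for row in pixels for v in row)).hexdigest()


# Synthetic 32x32 image: a horizontal gradient.
original = [[x * 8 for x in range(32)] for _ in range(32)]

# An "invisible" change: one pixel nudged by one grey level.
tweaked = [row[:] for row in original]
tweaked[5][7] += 1

print(md5_of(original) == md5_of(tweaked))                    # False
print(hamming(average_hash(original), average_hash(tweaked)))  # 0
```

The MD5 checksums differ after a one-pixel change, while the coarse
signature stays identical. A database keyed on such a signature would
have to match on small Hamming distance rather than exact equality.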

I suggest rendering the HTML mail and determining whether its layout is
spammy. For example, if an image covers 95% of the screen, the mail is
probably spam. A poor man's version of this strategy can be partly
implemented with SpamAssassin rules.
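
A very rough Python sketch of the coverage check. It does not render
anything: it only reads the width and height attributes declared on
img tags and compares the largest image with an assumed 800x600
screen. The viewport size, the class name, and the function names are
all assumptions for illustration; a real renderer would also have to
deal with CSS, percentages, and images that carry no declared size.

```python
from html.parser import HTMLParser

VIEWPORT = (800, 600)  # assumed screen size in pixels


class ImageAreaParser(HTMLParser):
    """Track the largest declared <img> area, clipped to the viewport."""

    def __init__(self):
        super().__init__()
        self.largest_area = 0

    def handle_starttag(self, tag, attrs):
        if tag != "img":
            return
        attr = dict(attrs)
        try:
            width = int(attr.get("width") or 0)
            height = int(attr.get("height") or 0)
        except ValueError:
            return  # percentages or junk values: ignored in this sketch
        width = min(width, VIEWPORT[0])
        height = min(height, VIEWPORT[1])
        self.largest_area = max(self.largest_area, width * height)


def image_coverage(html: str) -> float:
    """Fraction of the assumed viewport covered by the largest image."""
    parser = ImageAreaParser()
    parser.feed(html)
    parser.close()
    return parser.largest_area / (VIEWPORT[0] * VIEWPORT[1])


def looks_spammy(html: str, threshold: float = 0.95) -> bool:
    """Apply the 95% rule of thumb from the text above."""
    return image_coverage(html) >= threshold
```

For example, a body consisting of a single 800x590 image gives a
coverage of about 0.98 and is flagged, while a mail with a 100x50 logo
is not.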
