At Thu, 2 Nov 2006 16:07:41 +0100 (CET),
Oliver Fromme wrote:
>
> The problem is that OCR itself is heavyweight. I've worked
> with quite a few OCR systems in the past 15 years, and all
> of them require a serious amount of processing performance.
> And even then you have a certain error rate. The good
> systems (those that have a low error rate) even require a
> "training" process in advance for recognizing the fonts
> being used for the documents being scanned.
>
> Therefore I think that running OCR on embedded images on a
> mail server isn't an option, I'm afraid. Unless you're
> prepared that processing of every email requires a long
> time, and that there are errors in the recognized text
> (so that your regex will have trouble matching). It will
> even make it easier to run a denial-of-service attack
> against your mail server, by simply sending many emails
> that contain multiple large images with random pixel
> patterns.
>
> I think a better approach (i.e. much faster and more
> reliable) would be to create a public database of such
> images. Spammers aren't generating new images for every
> single mail, so that should be feasible.
>
> First, when someone identifies a mail as spam, a hash of
> the image (e.g. an MD5 checksum) is submitted to the
> database. This could be pretty much automated by a script
> or little tool, so a single hotkey from within your MUA
> will do it.
>
> Second, if somewhere else email is received with one or
> more images attached, the MD5 checksum will be calculated
> and verified with the database. If it had been reported
> before, it is tagged as spam.
>
> In fact, the database could be implemented as a DNS black
> list, so milter-greylist would already support it. ;-)
> The DNS query would ask for the MD5 checksum (as a string
> of 32 hex digits). Maybe the reply could also contain some
> indication of how many people reported this image as spam
> already, so a higher number would indicate a better
> reliability that it is indeed spam (i.e. higher "score").
>
> Just an idea. Anyway, OCR is not a solution, I think.

MD5 can easily be bypassed by making invisible, unique changes to
each image. For this purpose we need something like a DFT or wavelet
transform, so that near-identical images still produce matching
signatures.

I suggest rendering the HTML mail and determining whether its layout
is spammy or not. For example, if an image covers 95% of the screen,
the mail is probably spam. A poor man's version of this strategy can
be partly implemented with SpamAssassin rules.
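
To make the MD5 point concrete, here is a minimal Python sketch, not a tested filter. It builds a DNSBL-style query name from an image's MD5 as described in the quoted proposal, then shows that nudging a single pixel changes the name while a coarse signature stays the same. Everything here is an assumption for illustration: the zone `imgbl.example.org` is hypothetical, the image is a synthetic grayscale grid rather than a decoded attachment, and the block-mean "average hash" is only a much simpler stand-in for the DFT or wavelet approach, not an implementation of it.

```python
import hashlib

# Hypothetical zone name, for illustration only; no such list is implied.
BLACKLIST_ZONE = "imgbl.example.org"


def md5_query_name(image_bytes, zone=BLACKLIST_ZONE):
    """Build a DNSBL-style query name: <32 hex digits of MD5>.<zone>."""
    return hashlib.md5(image_bytes).hexdigest() + "." + zone


def average_hash(pixels, size=4):
    """Coarse signature of a grayscale image given as rows of 0-255 values.

    The image is reduced to size x size block means; each bit records
    whether a block is brighter than the mean of all blocks.
    """
    height, width = len(pixels), len(pixels[0])
    means = []
    for by in range(size):
        for bx in range(size):
            rows = range(by * height // size, (by + 1) * height // size)
            cols = range(bx * width // size, (bx + 1) * width // size)
            block = [pixels[y][x] for y in rows for x in cols]
            means.append(sum(block) / len(block))
    overall = sum(means) / len(means)
    bits = 0
    for m in means:
        bits = (bits << 1) | (1 if m > overall else 0)
    return bits


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


def to_bytes(pixels):
    """Flatten the pixel grid into raw bytes, standing in for a file body."""
    return bytes(v for row in pixels for v in row)


# Synthetic 16x16 image: dark left half, bright right half.
original = [[40 if x < 8 else 200 for x in range(16)] for y in range(16)]

# The spammer's copy: one pixel changed by one grey level.
tweaked = [row[:] for row in original]
tweaked[3][5] += 1

print(md5_query_name(to_bytes(original)))
print(md5_query_name(to_bytes(tweaked)))  # different name, so the lookup misses
print(hamming(average_hash(original), average_hash(tweaked)))  # signatures agree
```

A real filter would compare signatures by Hamming distance against a threshold instead of an exact lookup, which is also why such signatures do not map directly onto a DNS blacklist query.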
Re: [milter-greylist] [off-topic] OCR milter?
2006-11-02 by AIDA Shinra