Emmanuel Dreyfus wrote:
> I'd like to fight better the image spams. The only solution I have heard
> about is the OCR plug-in for spamassassin. I don't want to run spamassassin
> because it's too heavyweight. I'm looking for a way to convert the image
> into text and to run a regex filter on it.
>
> Is there something lightweight and reliable that does that? If not I'm
> going to develop it.
>
> My idea would be to create a new milter that would perform OCR on images
> contained in the message and would attach the obtained text at the end
> of the message so that others tools (milter-regex for instance) could work
> on it.
>
> Any opinion on that approach?

The problem is that OCR itself is heavyweight. I've worked
with quite a few OCR systems in the past 15 years, and all
of them require a serious amount of processing power.
And even then you have a certain error rate. The good
systems (those that have a low error rate) even require a
"training" process in advance to recognize the fonts
used in the documents being scanned.

Therefore I think that running OCR on embedded images on a
mail server isn't an option, I'm afraid. Unless you're
prepared to accept that processing every email takes a long
time, and that there are errors in the recognized text
(so that your regex will have trouble matching). It will
even make it easier to run a denial-of-service attack
against your mail server, by simply sending many emails
that contain multiple large images with random pixel
patterns.

I think a better approach (i.e. much faster and more
reliable) would be to create a public database of such
images. Spammers aren't generating new images for every
single mail, so that should be feasible.

First, when someone identifies a mail as spam, a hash of
the image (e.g. an MD5 checksum) is submitted to the
database. This could be pretty much automated by a script
or a little tool, so a single hotkey from within your MUA
will do it.

Second, when an email with one or more attached images is
received somewhere else, the MD5 checksum of each image is
calculated and checked against the database. If it has been
reported before, the mail is tagged as spam.
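
A minimal sketch of the hashing step, assuming Python and its
standard email and hashlib modules. The function names and the
in-memory set standing in for the database are made up for
illustration; a real tool would submit to or query a shared
service instead.

```python
import hashlib
from email import message_from_bytes


def image_md5_checksums(raw_message: bytes) -> list[str]:
    """Return the MD5 hex digest of every image part in a raw email."""
    msg = message_from_bytes(raw_message)
    digests = []
    for part in msg.walk():
        # Only look at parts declared as image/* (gif, jpeg, png, ...).
        if part.get_content_maintype() != "image":
            continue
        # decode=True undoes the base64 / quoted-printable transfer
        # encoding, so the hash is taken over the raw image bytes.
        payload = part.get_payload(decode=True)
        if payload:
            digests.append(hashlib.md5(payload).hexdigest())
    return digests


def is_reported(digests, reported):
    """True if any checksum is in the set of reported checksums.

    'reported' is a plain set here, a stand-in for the database.
    """
    return any(d in reported for d in digests)
```

The same function would serve both sides: the reporting tool
submits the checksums it returns, and the receiving mail server
looks them up.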

In fact, the database could be implemented as a DNS black
list, so milter-greylist would already support it. ;-)
The DNS query would ask for the MD5 checksum (as a string
of 32 hex digits). Maybe the reply could also contain some
indication of how many people have already reported this
image as spam, so a higher number would indicate greater
confidence that it is indeed spam (i.e. a higher "score").
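
A rough sketch of what such a lookup could look like, in
Python. Everything specific here is an assumption: the zone
"imgbl.example.org" is a placeholder and not a real service,
and encoding the report count in the last octet of a
127.0.0.N reply is just one possible convention borrowed from
how DNS black lists usually answer. A 32-digit checksum fits
into a single DNS label (the limit is 63 characters).

```python
import re
import socket

ZONE = "imgbl.example.org"  # hypothetical zone, not a real service

_MD5_RE = re.compile(r"[0-9a-f]{32}")


def query_name(md5_hex: str, zone: str = ZONE) -> str:
    """Build the DNS name to query: <32 hex digits>.<zone>."""
    md5_hex = md5_hex.lower()
    if not _MD5_RE.fullmatch(md5_hex):
        raise ValueError("expected 32 hex digits")
    return f"{md5_hex}.{zone}"


def report_count(address: str):
    """Interpret a reply under the assumed 127.0.0.N convention.

    Returns N (number of reports, at most 255), or None if the
    address does not look like a black list answer.
    """
    octets = address.split(".")
    if len(octets) != 4 or octets[:3] != ["127", "0", "0"]:
        return None
    return int(octets[3])


def lookup(md5_hex: str, zone: str = ZONE):
    """Query the list; None means not listed or lookup failed."""
    try:
        address = socket.gethostbyname(query_name(md5_hex, zone))
    except socket.gaierror:
        return None  # NXDOMAIN or resolver error: treat as unlisted
    return report_count(address)
```

A mail filter could then compare the returned count against a
threshold of its own choosing before tagging the mail.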

Just an idea. Anyway, OCR is not a solution, I think.

Best regards
Oliver

--
Oliver Fromme, secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.