Yahoo Groups archive

Milter-greylist

Index last updated: 2026-04-28 23:32 UTC

Thread

[off-topic] OCR milter?


2006-11-02 by Emmanuel Dreyfus

Hello

It's a bit off-topic, but I know there are a bunch of mail system experts
on this list...

I'd like to fight image spam more effectively. The only solution I have heard
about is the OCR plug-in for SpamAssassin. I don't want to run SpamAssassin
because it's too heavyweight. I'm looking for a way to convert the image
into text and run a regex filter on it.

Is there something lightweight and reliable that does that? If not I'm
going to develop it.

My idea would be to create a new milter that performs OCR on the images
contained in the message and attaches the resulting text at the end
of the message, so that other tools (milter-regex, for instance) could work
on it.
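
A minimal sketch of that pipeline, assuming a hypothetical `ocr` callable (for instance a wrapper around an external OCR program) that takes image bytes and returns text. It only shows the extract, OCR, and regex steps with Python's standard `email` module; the milter glue is left out, and the function names are illustrative.

```python
import re
from email import message_from_bytes
from email.message import EmailMessage


def image_parts(raw_message: bytes):
    """Yield the decoded bytes of every image/* part in a raw message."""
    msg = message_from_bytes(raw_message)
    for part in msg.walk():
        if part.get_content_maintype() == "image":
            payload = part.get_payload(decode=True)
            if payload:
                yield payload


def ocr_text_matches(raw_message: bytes, ocr, pattern: str) -> bool:
    """Run `ocr` (hypothetical callable) on each image and test the text."""
    regex = re.compile(pattern, re.IGNORECASE)
    return any(regex.search(ocr(data)) for data in image_parts(raw_message))


if __name__ == "__main__":
    # Build a small test message with one fake image attachment.
    msg = EmailMessage()
    msg["Subject"] = "test"
    msg.set_content("see attached")
    msg.add_attachment(b"GIF89a-fake", maintype="image", subtype="gif",
                       filename="offer.gif")

    # Stand-in for a real OCR engine: always "recognizes" the same text.
    def fake_ocr(data: bytes) -> str:
        return "BUY CHEAP STOCK now"

    print(ocr_text_matches(msg.as_bytes(), fake_ocr, r"cheap\s+stock"))
```

Keeping the OCR step behind a plain callable means the regex side can be tested without any OCR engine installed.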

Any opinion on that approach?

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by Michael Menge

Hi,

the OCR plugin has to do a lot of work to convert the image to PNG and
then run gocr on the image. But even if you don't want to run SA, you
could write an interface/wrapper which connects sendmail with the OCR
plugin via the milter interface.

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by Emmanuel Dreyfus

On Thu, Nov 02, 2006 at 03:49:30PM +0100, Michael Menge wrote:
> the ocr plugin has to do much work to convert the image to png and  
> then run gocr on the image. but even if you don't want to run SA you  
> could write a interface/wrapper wich connects sendmail with the OCR  
> plugin via milter interface

Correct me if I'm wrong, but isn't that spamassassin plug-in written in 
Perl? 

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by Oliver Fromme

Emmanuel Dreyfus wrote:
 > I'd like to fight better the image spams. The only solution I have heard 
 > about is the OCR plug-in for spamassassin. I don't want to run spamassassin
 > because it's too heavyweight. I'm looking for a way to convert the image 
 > into text and to run a regex filter on it.
 > 
 > Is there something lightweight and reliable that does that? If not I'm
 > going to develop it.
 > 
 > My idea would be to create a new milter that would perform OCR on images
 > contained in the message and would attach the obtained text at the end
 > of the message so that others tools (milter-regex for instance) could work 
 > on it. 
 > 
 > Any opinion on that approach?

The problem is that OCR itself is heavyweight.  I've worked
with quite a few OCR systems in the past 15 years, and all
of them require a serious amount of processing performance.
And even then you have a certain error rate.  The good
systems (those that have a low error rate) even require a
"training" process in advance for recognizing the fonts
being used for the documents being scanned.

Therefore I think that running OCR on embedded images on a
mail server isn't an option, I'm afraid.  Unless you're
prepared to accept that processing every email takes a long
time, and that there are errors in the recognized text
(so that your regex will have trouble matching).  It will
even make it easier to run a denial-of-service attack
against your mail server, by simply sending many emails
that contain multiple large images with random pixel
patterns.

I think a better approach (i.e. much faster and more
reliable) would be to create a public database of such
images.  Spammers aren't generating new images for every
single mail, so that should be feasible.

First, when someone identifies a mail as spam, a hash of
the image (e.g. an MD5 checksum) is submitted to the
database.  This could be pretty much automated by a script
or little tool, so a single hotkey from within your MUA
will do it.

Second, when an email with one or more images attached is
received somewhere else, the MD5 checksum is calculated
and checked against the database.  If the image has been
reported before, the mail is tagged as spam.
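
The two steps above can be sketched as follows. This is only an in-memory stand-in for the shared database: the class and method names are hypothetical, and a real service would need storage, authentication and abuse protection.

```python
import hashlib


def image_md5(data: bytes) -> str:
    """Return the MD5 checksum of an image as 32 hex digits."""
    return hashlib.md5(data).hexdigest()


class ImageHashDB:
    """Toy in-memory database of reported image checksums."""

    def __init__(self):
        self._reports = {}

    def report(self, data: bytes) -> str:
        """Step 1: a user reports an image seen in a spam mail."""
        digest = image_md5(data)
        self._reports[digest] = self._reports.get(digest, 0) + 1
        return digest

    def report_count(self, data: bytes) -> int:
        """Step 2: a receiver checks how often this image was reported."""
        return self._reports.get(image_md5(data), 0)


if __name__ == "__main__":
    db = ImageHashDB()
    db.report(b"spam image bytes")
    print(db.report_count(b"spam image bytes"), db.report_count(b"other"))
```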

In fact, the database could be implemented as a DNS black
list, so milter-greylist would already support it.  ;-)
The DNS query would ask for the MD5 checksum (as a string
of 32 hex digits).  Maybe the reply could also contain some
indication of how many people reported this image as spam
already, so a higher number would indicate a better
reliability that it is indeed spam (i.e. higher "score").
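
A sketch of what such a DNS lookup could look like. The zone name `imghash.example.org` is hypothetical, and so is the reply encoding: the sketch assumes the list answers with `127.0.0.N`, where `N` is the (capped) number of reports, and that an unlisted checksum yields a failed lookup.

```python
import ipaddress
import socket

ZONE = "imghash.example.org"  # hypothetical zone, not a real service


def query_name(md5_hex: str, zone: str = ZONE) -> str:
    """Build the DNS name to query for a 32-hex-digit MD5 checksum."""
    digest = md5_hex.lower()
    if len(digest) != 32 or any(c not in "0123456789abcdef" for c in digest):
        raise ValueError("expected 32 hex digits")
    return f"{digest}.{zone}"


def score_from_reply(address: str) -> int:
    """Assumed encoding: 127.0.0.N means the image was reported N times."""
    ip = ipaddress.IPv4Address(address)
    if ip not in ipaddress.IPv4Network("127.0.0.0/8"):
        return 0
    return int(ip) & 0xFF


def lookup(md5_hex: str, zone: str = ZONE) -> int:
    """Return the assumed report score, or 0 if the checksum is not listed."""
    try:
        return score_from_reply(socket.gethostbyname(query_name(md5_hex, zone)))
    except socket.gaierror:
        return 0  # lookup failed (e.g. NXDOMAIN): treat as not listed
```

A 32-character label fits within the 63-character limit for a single DNS label, so the checksum can be used as one label in front of the zone.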

Just an idea.  Anyway, OCR is not a solution, I think.

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"Python tricks" is a tough one, cuz the language is so clean. E.g.,
C makes an art of confusing pointers with arrays and strings, which
leads to lotsa neat pointer tricks; APL mistakes everything for an
array, leading to neat one-liners; and Perl confuses everything
period, making each line a joyous adventure <wink>.
        -- Tim Peters

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by Chris Hoogendyk

Just a further comment on this.

The images are coming through with multiple background and text colors
with matching grey levels and lint speckling over them. This is designed
to make OCR more difficult. I don't know how successful the OCR has
been, but I can tell by looking at the images that the spammers are
trying to circumvent that tool.


---------------

Chris Hoogendyk

-
   O__  ---- Systems Administrator
  c/ /'_ --- Biology & Geology Departments
 (*) \(*) -- 140 Morrill Science Center
~~~~~~~~~~ - University of Massachusetts, Amherst 

<hoogendyk@...>

--------------- 

Erdös 4





Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by AIDA Shinra

At Thu, 2 Nov 2006 16:07:41 +0100 (CET),
Oliver Fromme wrote:
> 
> 
> The problem is that OCR itself is heavyweight.  I've worked
> with quite a few OCR systems in the past 15 years, and all
> of them require a serious amount of processing performance.
> And even then you have a certain error rate.  The good
> systems (those that have a low error rate) even require a
> "training" process in advance for recognizing the fonts
> being used for the documents being scanned.
> 
> Therefore think that running OCR on embedded images on a
> mail server isn't an option, I'm afraid.  Unless you're
> prepared that processing of every email requires a long
> time, and that there are errors in the recognized text
> (so that your regex will have trouble matching).  It will
> even make it easier to run a denial-of-service attack
> against your mail server, by simply sending many emails
> that contain multiple large images with random pixel
> patterns.
> 
> I think a better approach (i.e. much faster and more
> reliable) would be to create a public database of such
> images.  Spammers aren't generating new images for every
> single mail, so that should be feasible.
> 
> First, when someone identifies a mail as spam, a hash of
> the image (e.g. an MD5 checksum) is submitted to the
> database.  This could be pretty much automated by a script
> or little tool, so a single hotkey from within your MUA
> will do it.
> 
> Second, if somewhere else email is received with one or
> more images attached, the MD5 checksum will be calculated
> and verified with the database.  If it had been reported
> before, it is tagged as spam.
> 
> In fact, the database could be implemented as a DNS black
> list, so milter-greylist would already support it.  ;-)
> The DNS query would ask for the MD5 checksum (as a string
> of 32 hex digits).  Maybe the reply could also contain some
> indication of how many people reported this image as spam
> already, so a higher number would indicate a better
> reliability that it is indeed spam (i.e. higher "score").
> 
> Just an idea.  Anyway, OCR is not a solution, I think.

MD5 can easily be bypassed by making invisible, unique changes to
images. We need something like a DFT or wavelet transform for this
purpose.
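
To illustrate the point, here is a sketch that uses an "average hash", a much simpler stand-in for a DFT or wavelet based fingerprint (it is not either of those techniques). It assumes the image has already been decoded and scaled down to a small grid of grayscale values; the function names are illustrative.

```python
import hashlib


def average_hash(pixels) -> int:
    """One bit per pixel: set when the pixel is brighter than the mean."""
    flat = [value for row in pixels for value in row]
    mean = sum(flat) / len(flat)
    bits = 0
    for value in flat:
        bits = (bits << 1) | (value > mean)
    return bits


def hamming(a: int, b: int) -> int:
    """Number of differing bits between two hashes."""
    return bin(a ^ b).count("1")


def md5_of(pixels) -> str:
    """MD5 of the raw pixel values, for comparison."""
    return hashlib.md5(bytes(v for row in pixels for v in row)).hexdigest()


if __name__ == "__main__":
    original = [[10, 10, 200, 200],
                [10, 10, 200, 200],
                [200, 200, 10, 10],
                [200, 200, 10, 10]]
    tweaked = [row[:] for row in original]
    tweaked[0][0] = 12  # an invisible change to one pixel

    print(md5_of(original) == md5_of(tweaked))
    print(hamming(average_hash(original), average_hash(tweaked)))
```

Two images would then be treated as the same when the Hamming distance is below some threshold, rather than only on an exact match.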

I suggest rendering the HTML mail and determining whether its layout is
spammy or not. For example, if an image covers 95% of the screen, then the
mail is probably spam. A poor man's version of this strategy can be
implemented with SpamAssassin rules.
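
A rough sketch of the coverage test, without any real HTML rendering. It assumes the image dimensions and the amount of visible text have already been extracted from the message; the 800x600 viewport and the 200-character text limit are arbitrary assumptions, only the 95% figure comes from the suggestion above.

```python
def image_coverage(image_sizes, viewport=(800, 600)) -> float:
    """Fraction of an assumed viewport covered by the given (w, h) images."""
    view_w, view_h = viewport
    covered = sum(min(w, view_w) * min(h, view_h) for w, h in image_sizes)
    return min(1.0, covered / (view_w * view_h))


def looks_spammy(image_sizes, text_chars, threshold=0.95, max_text=200) -> bool:
    """Flag mails that are almost all image and carry very little text."""
    return image_coverage(image_sizes) >= threshold and text_chars <= max_text


if __name__ == "__main__":
    print(looks_spammy([(800, 600)], text_chars=20))
    print(looks_spammy([(100, 100)], text_chars=20))
```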

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by manu@netbsd.org

Oliver Fromme <olli@...> wrote:

> I think a better approach (i.e. much faster and more
> reliable) would be to create a public database of such
> images.  Spammers aren't generating new images for every
> single mail, so that should be feasible.

But it's extremely easy for the spammer to generate a new image each
time, with a few changing pixels. That costs nearly nothing.

I haven't checked, but I would not be surprised if that were already
the case.

OCR is time-consuming, but it can be spread across many machines, should you
need it. And in order to avoid DoS, you can decide that messages
containing GIF images are second-class citizens and process them more slowly
than regular mail. It's extremely easy to do: you only have to nice the OCR
process.
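
A sketch of running the OCR step at low priority, assuming a POSIX system with the `nice` command and an OCR program that takes the image path as its argument (the exact command line is an assumption). The timeout is an addition that bounds the time spent on any single image.

```python
import subprocess


def niced(command, level=19):
    """Prefix a command so that it runs at the given nice level."""
    return ["nice", "-n", str(level), *command]


def run_ocr_low_priority(command, timeout=60) -> str:
    """Run the OCR command niced; raises subprocess.TimeoutExpired if it hangs."""
    result = subprocess.run(
        niced(command),
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return result.stdout


if __name__ == "__main__":
    # Only shows the command that would be executed; nothing is run here.
    print(niced(["gocr", "image.pnm"]))
```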


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by manu@netbsd.org

Chris Hoogendyk <hoogendyk@...> wrote:

> The images are coming through with multiple background and text colors
> with matching grey levels and lint speckling over them. This is designed
> to make OCR more difficult. I don't know how successful the OCR has
> been, but I can tell by looking at the images that the spammers are
> trying to circumvent that tool.

I have had good feedback from users of the SpamAssassin OCR plugin...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by Oliver Fromme

manu@... wrote:
 > Oliver Fromme wrote:
 > 
 > > I think a better approach (i.e. much faster and more
 > > reliable) would be to create a public database of such
 > > images.  Spammers aren't generating new images for every
 > > single mail, so that should be feasible.
 > 
 > But it's extremely easy for the spammer to generate a new image each
 > time, with a few changing pixels. That costs nearly nothing.

In order to change even a single pixel, the spammer would
have to decompress the image, and then compress it again.
That costs quite a bit of CPU resources, so I don't think
they're doing that when sending millions of spam mails.

Someone else mentioned that the spammers are already trying
to confuse OCR, by including background patterns, color
gradients, speckle pixels etc...   Maybe that can be turned
to our advantage.  It should be possible to write a filter
that detects such anti-OCR patterns.  (Very similar to the
filters that detect anti-Regex tyyp0s 1n Subjetc 1ines...)

 > OCR is time-consuming, but it can be spread on many machines, should you
 > need it. And in order to avoid DoS, you can decide that images
 > containing GIF are second-class citizen and process them slower than
 > regular mail. It's extremely easy to do: you only have to nice the OCR
 > computing.

It might work on personal machines that receive only mail
for one person, or maybe a few users.  (Obviously that's
already the case, see SpamAssassin + gocr.)  But it won't
work on large servers that receive mail (and spam) for
hundreds or thousands of people.

So far I don't receive much spam with images.  Or maybe I
just don't notice because greylisting and other measures
drop them before I see them.  :-)   But if it became a
real problem one day, I would simply drop all mails that
contained images.  There's no reason someone has to mail
me an image.  And even _if_ someone wants to do that, he
will have to gzip it or uuencode it or whatever (spammers
cannot gzip or uuencode their images, because most people
won't see them).

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"Python is an experiment in how much freedom programmers need.
Too much freedom and nobody can read another's code; too little
and expressiveness is endangered."
        -- Guido van Rossum

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-02 by manu@netbsd.org

Oliver Fromme <olli@...> wrote:

> In order to change even a single pixel, the spammer would
> have to decompress the image, and then compress it again.
> That costs quite a bit of CPU resources, so I don't think
> they're doing that when sending millions of spam mails.

I suspect they just build the image on the fly when sending spam. That
costs nearly nothing.
 
> It might work on personal machines that receive only mail
> for one person, or maybe a few users.  (Obviously that's
> already the case, see SpamAssassin + gocr.)  But it won't
> work on large servers that receive mail (and spam) for
> hundreds or thousands of people.

It's easy to spread the job across several machines...

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] [off-topic] OCR milter?

2006-11-03 by Jack L. Stone

At 09:38 PM 11.2.2006 +0100, you wrote:
>Oliver Fromme <olli@...> wrote:
>
>> In order to change even a single pixel, the spammer would
>> have to decompress the image, and then compress it again.
>> That costs quite a bit of CPU resources, so I don't think
>> they're doing that when sending millions of spam mails.
>
>I suspect they just build the image on the fly when sending spam. That
>costs nearly nothing.
> 
>> It might work on personal machines that receive only mail
>> for one person, or maybe a few users.  (Obviously that's
>> already the case, see SpamAssassin + gocr.)  But it won't
>> work on large servers that receive mail (and spam) for
>> hundreds or thousands of people.
>
>It's easy to scatter the job on several machines...
>
>-- 
>Emmanuel Dreyfus
>http://hcpnet.free.fr/pubz
>manu@...
>

I don't know how this fits into the picture with MGL, but have you looked
at SA's alternative to the OCR plugin?

It seemed to me that the OCR setup used with SA had way too many moving parts,
and Dallas Engelken provided an image filter ruleset alternative that has
done an excellent job of catching the image spam. None is getting through
here.

# ImageInfo Plugin for SpamAssassin
# Version: 0.6
# Current Home: http://www.rulesemporium.com/plugins.htm#imageinfo
# Created: 2006-08-02
# Modified: 2006-10-04
# By: Dallas Engelken <dallase@...>
#
# Changes: 
#   0.6 - fixed dems_ bug in image_size_range_
#   0.5 - added image_named and image_to_text_ratio
#   0.4 - added image_size_exact and image_size_range
#   0.3 - added jpeg support
#   0.2 - optimized by theo
#   0.1 - added gif/png support
#
# Files:
#   ImageInfo.pm (plugin)  - http://www.rulesemporium.com/plugins/ImageInfo.pm
#   imageinfo.cf (ruleset) - http://www.rulesemporium.com/plugins/imageinfo.cf
#   

(^_^)
Happy trails,
Jack L. Stone

System Admin
Sage-american

Greylist.db file size on Tru64

2006-11-03 by Daniel Clar

Hello,

I'm running milter-greylist on an HP Tru64 box. It works fine, but when the
greylist.db file grows beyond about 32000000 bytes, the process stops and
cannot be restarted.

The size is not always the same, but I suppose that depends on buffering.

Any ideas?

Regards,

Daniel

RE: [milter-greylist] Greylist.db file size on Tru64

2006-11-03 by attila.bruncsak@itu.int

> I'm running milter-greylist on a HP Tru64 box it's working 
> fine but when the
> greylist.db file is more than 32000000 bytes the process 
> stops and is unable to
> be restarted.

Hello,

This is just a guess, the problem may be somewhere else:

Check the maximum allowed resources for a process,
especially the data segment size via the command

ulimit -a

You may want to tune your system if some values are inappropriate.
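
The same check can be sketched with Python's standard `resource` module, if Python is available on the box. It reports the limits of the Python process itself, which are inherited from the invoking shell and may differ from those of the running milter-greylist daemon. Which limits matter here is a guess; the selection below is an assumption.

```python
import resource

# Limits that seem most relevant to a growing greylist.db (an assumption).
LIMITS = {
    "data segment": resource.RLIMIT_DATA,
    "file size": resource.RLIMIT_FSIZE,
    "open files": resource.RLIMIT_NOFILE,
}


def describe(value) -> str:
    """Render a limit value, showing RLIM_INFINITY as 'unlimited'."""
    return "unlimited" if value == resource.RLIM_INFINITY else str(value)


def current_limits():
    """Return {name: (soft, hard)} for the limits listed above."""
    return {
        name: tuple(describe(v) for v in resource.getrlimit(res))
        for name, res in LIMITS.items()
    }


if __name__ == "__main__":
    for name, (soft, hard) in current_limits().items():
        print(f"{name}: soft={soft} hard={hard}")
```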

Best,
Attila
