Yahoo Groups archive

Milter-greylist

Index last updated: 2026-04-28 23:32 UTC

Message

Re: [milter-greylist] Re: Limiting resident memory usage

2006-11-03 by Matt Kettler

Oliver Fromme wrote:
> Hi,

Hello.

> 
> I'm very sorry, this will become a somewhat lengthy mail.
> I hope your spam filter doesn't drop it.  ;-)
> But the issue at hand is _not_ trivial and cannot be
> discussed in a single paragraph.

Agreed, many of my posts here have been very long.

> 
> Matt Kettler wrote:
>  > That said, do you really think your regex execution is THAT slow
>  > that it will take longer than 50us of real CPU time to evaluate
>  > one on a 2ghz processor?
> 
> Do you know how regular expressions work? 

Yes, I'm *extensively* familiar with how regexes work.


> The expression
> itself (i.e. the string from the configuration file) is
> not stored in memory, neither is it used directly for
> matching.

I know.

> 
> Instead, during parsing, the regex is "compiled" into a
> finite state automaton.  

Yes, true. This should happen when the config file is parsed, not on a
per-connection basis.

> That automaton is basically a
> data structure consisting of several tables (arrays),
> typically several hundred bytes in size, sometimes even
> thousands.

True. The regexes in question are 791 bytes total.

  The size depends on structure of the regexp
> and the actual implementation of the regular expression
> library, and how it trades off speed and size.  I assume
> that milter-greylist uses the system library (regex(3)),
> which can behave quite differently on different platforms,
> depending on whether the authors turned their attention
> to speed or to memory consumption.

Agreed that regexes CAN be slow. However, this whole thread is about two
specific regexes.

> 
> When the automaton is applied ("executed") to a string,
> the speed also depends very much on the structure of the
> regular expression.  If the automaton only contains simple
> states (e.g. fixed characters and unlimited repetitions),
> it is almost as fast as a plain string comparison, i.e.
> negligible.  On the other hand, complicated states such
> as repetition ranges or back-references require recursion
> and will need significantly more CPU time.  If you even
> nest them in multiple levels, CPU consumption will sky-
> rocket exponentially.
> 
> Bottom line:  When using regular expressions, it's worth
> to understand how they work, and then craft them in a way
> that is most efficient.

Yes, exactly. Read my posts. All of them point out you need to write your
regexes WELL.

In fact, this thread started with me offering performance tuning suggestions to
someones regexes.

All I'm arguing against is the concept that DNS is faster than a small number of
well written regexes. That's utter nonsense, but it's the view that "eclark" has
been espousing.

Let me re-quote myself from an earlier post:

------------------
However, using a RBL to replace 2 short lightweight regexes is NOT a performance
gain. It's a massive performance loss.
------------------

That's my point. Not that regexes are always faster. But that modest numbers of
them are faster.  Let's not take this further than it belongs.

Really.. eclark blasted me for suggesting that someone use the following two
regexes:

acl greylist domain /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
acl greylist domain /[0-9]{12}/

And suggested they'd be better off using a DNS query. Even if they have 10 more
regexes just like those, they won't be better off with the DNS query, except
perhaps a slight reduction in memory usage.

Those two are NOT slower than a DNS lookup. Period. Even if the DNS is locally
cached. They're structured to prevent back-tracking, a common "killer" in regexes.

Let's look at them in perl, which is going to do a memory-large compilation
trying to tune for speed:

$perl -mre=debug /[0-9]{1,3}[-.][0-9]{1,3}[-.][0-9]{1,3}[-.]/
size 73 Got 588 bytes for offset annotations.

$perl -mre=debug -e '/[0-9]{12}/'
size 14 Got 116 bytes for offset annotations.

Do you really think you can complete a DNS query with 791 bytes of data storage?
Including recursion and parsing? AND use fewer CPU clock cycles?

I don't think you believe that, and nor do I. But apparently some folks do.

> So, the bottom line is:  Whether regular expressions or
> DNS BLs are more efficient for someone depend on a whole
> lot of things.  

Agreed.. Which is why my posts have been very careful to constrain the situation
to comparing locally cached DNS against a small number of well written regexes.


> You cannot generically say that one is
> always more efficient than the other. 

Agreed 100%. I've never made that claim.

Attachments

Move to quarantaine

This moves the raw source file on disk only. The archive index is not changed automatically, so you still need to run a manual refresh afterward.