Yahoo Groups archive

Milter-greylist

Index last updated: 2026-04-28 23:32 UTC

Thread

slow reloading of the database: the problem explained

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-10 by AIDA Shinra

At Tue, 9 Jan 2007 21:32:51 +0000,
Emmanuel Dreyfus wrote:
> 
> Hello
> 
> I tracked down the problem of slow database reloading in 3.1.x series.
> 
> The problem was introduced with the pending_insert_to_queue() and
> autowhite_insert_to_queue() functions, that sort the in-memory database
> by date. For each tuple to be inserted, the whole database is searched, 
> which results in a major slow down when thousands of tuples are reloaded,
> at startup time.
>
> Are these functions useful at all? I don't see the point of sorting 
> the tuples, since the code in charge of retreiving tuples does not use
> their sorted property to win a shortcut.
> 
> Attached is a patch that remove the sorting of tuples. It restores the
> instant loading time we had before 3.1 series. Feedback is welcome: does
> it harm in any way to nuke that code?

The whole database need to be sorted in order for pending_timeout()
and autowhite_timeout() to work.

I think pending_insert_to_queue() and autowhite_insert_to_queue()
finish quickly as long as entries are inserted in old-to-new order. Am
I wrong? Profiler will clarify where is the problem.

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-10 by Emmanuel Dreyfus

On Wed, Jan 10, 2007 at 04:40:33PM +0900, AIDA Shinra wrote:
> > Attached is a patch that remove the sorting of tuples. It restores the
> > instant loading time we had before 3.1 series. Feedback is welcome: does
> > it harm in any way to nuke that code?
> 
> The whole database need to be sorted in order for pending_timeout()
> and autowhite_timeout() to work.
> 
> I think pending_insert_to_queue() and autowhite_insert_to_queue()
> finish quickly as long as entries are inserted in old-to-new order. Am
> I wrong? Profiler will clarify where is the problem.

That functions causes the reloading operation to switch from O(n) to O(n^2),
since we have to walk all previously inserted tuples before inserting a new
tuple. The cost is huge for a big databases.

When I wrote milter-greylist at once, I did no special effort at cleaning 
all timed out entries. I just kicked them out occasionnally when I encountered
them while walking the database for seach and dump operations. 

Of course, as a search succeeds before walking the whole database, some 
timed out entries remained undeleted. That does not really matter, it justs
uses memory. Andanyway, we still often walk (and therefore clean up) the 
whole database each time a new message gets in.

I think it would be wise to get back to that situation. I'll do that if
nobody objects.

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-10 by Oliver Fromme

Emmanuel Dreyfus wrote:
 > I tracked down the problem of slow database reloading in 3.1.x series.
 > 
 > The problem was introduced with the pending_insert_to_queue() and
 > autowhite_insert_to_queue() functions, that sort the in-memory database
 > by date. For each tuple to be inserted, the whole database is searched, 
 > which results in a major slow down when thousands of tuples are reloaded,
 > at startup time.
 > 
 > Are these functions useful at all? I don't see the point of sorting 
 > the tuples, since the code in charge of retreiving tuples does not use
 > their sorted property to win a shortcut.
 > 
 > Attached is a patch that remove the sorting of tuples. It restores the
 > instant loading time we had before 3.1 series. Feedback is welcome: does
 > it harm in any way to nuke that code?

Just a thought:  Maybe it would make sense to sort them,
so you can use a simple binary search algorithm to retrieve
them, which should improve retrieval speed considerably
(it would scale much better with large lists).

Of course, the sorting should be done only once, after
the whole databas has been reloaded.  On the other hand,
the list would need to be kept sorted whenever a new
tuple is added during operation.  For that purposed, maybe
it would be useful to use a balanced tree of some sort.

(Unfortunately, my time is too limited to have a look at
implementing it myself.  :-(  But maybe someone else can
pick up the idea.)

Of course, all of that would be unnecessary if we had a
real SQL backend.  ;-)  (SCNR)

Best regards
   Oliver

-- 
Oliver Fromme,  secnetix GmbH & Co. KG, Marktplatz 29, 85567 Grafing
Dienstleistungen mit Schwerpunkt FreeBSD: http://www.secnetix.de/bsd
Any opinions expressed in this message may be personal to the author
and may not necessarily reflect the opinions of secnetix in any way.

"... there are two ways of constructing a software design:  One way
is to make it so simple that there are _obviously_ no deficiencies and
the other way is to make it so complicated that there are no _obvious_
deficiencies."        -- C.A.R. Hoare, ACM Turing Award Lecture, 1980

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-10 by Emmanuel Dreyfus

On Wed, Jan 10, 2007 at 11:51:21AM +0100, Oliver Fromme wrote:
> Just a thought:  Maybe it would make sense to sort them,
> so you can use a simple binary search algorithm to retrieve
> them, which should improve retrieval speed considerably
> (it would scale much better with large lists).

Well, if nobody volunteer for that, I'll go ahead restoring the old 
behavior, and we'll wait for someone to improve it later.

-- 
Emmanuel Dreyfus
manu@...

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-10 by Joel Reicher

> Just a thought:  Maybe it would make sense to sort them,
> so you can use a simple binary search algorithm to retrieve
> them, which should improve retrieval speed considerably
> (it would scale much better with large lists).
...
> Of course, all of that would be unnecessary if we had a
> real SQL backend.  ;-)  (SCNR)

Most (all?) databases are implemented with a hash table, aren't they?

If you want better performance on retrieval, why not just use a hash table
for the list? Insertion then remains quite simple too.

Personally, I'm happy with the simple, unsorted list, since I think
the delay on it is tiny compared with the DNS lookups that happen during
the various stages of mail delivery.

But if you really do want a more sophisticated data structure, I
suspect a binary tree might be overkill.

Cheers,

	- Joel

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-13 by AIDA Shinra

At Wed, 10 Jan 2007 08:58:10 +0000,
Emmanuel Dreyfus wrote:
> 
> On Wed, Jan 10, 2007 at 04:40:33PM +0900, AIDA Shinra wrote:
> > > Attached is a patch that remove the sorting of tuples. It restores the
> > > instant loading time we had before 3.1 series. Feedback is welcome: does
> > > it harm in any way to nuke that code?
> > 
> > The whole database need to be sorted in order for pending_timeout()
> > and autowhite_timeout() to work.
> > 
> > I think pending_insert_to_queue() and autowhite_insert_to_queue()
> > finish quickly as long as entries are inserted in old-to-new order. Am
> > I wrong? Profiler will clarify where is the problem.
> 
> That functions causes the reloading operation to switch from O(n) to O(n^2),
> since we have to walk all previously inserted tuples before inserting a new
> tuple. The cost is huge for a big databases.

No. These functions are O(n) as long as the dump is sorted in old to
new order because they search insertion position from latest to
earliest order.

I believe the real source of our problem lies in another
code. Profiling will reveal it.

Re: [milter-greylist] slow reloading of the database: the problem explained

2007-01-13 by Emmanuel Dreyfus

On Sat, Jan 13, 2007 at 04:46:35PM +0900, AIDA Shinra wrote:
> I believe the real source of our problem lies in another
> code. Profiling will reveal it.

Experience shows that if you remove the logic from pending_insert_to_queue()
and just insert at the end of the queue, then the original performance is
restored. Just give it a try (don't forget to patch 
autowhite_insert_to_queue() too)

-- 
Emmanuel Dreyfus
manu@...

Move to quarantaine

This moves the raw source file on disk only. The archive index is not changed automatically, so you still need to run a manual refresh afterward.