Yahoo Groups archive

Milter-greylist

Index last updated: 2026-04-28 23:32 UTC

Thread

Segfault due to non-reentrant GeoIP?

Segfault due to non-reentrant GeoIP?

2010-02-09 by Enrico Scholz

Hi,

with milter-greylist 4.2.3 on a CentOS5 platform I get segfaults[1] like

| Error Traversing Database for ipnum = 3645194514 - Perhaps database is corrupt?
| *** Segmentation fault
| Register dump:

This seems to happen when two hosts connect within a very short time. I
think that GeoIP_id_by_name() is called for the second host while this
function is still executed for the first one.  As both are operating on
the same 'geoip_handle' handle, this will corrupt internal state when
geoip is not reentrant.



Enrico

Footnotes: 
[1] see https://bugzilla.redhat.com/show_bug.cgi?id=511849#c18

Re: [milter-greylist] Segfault due to non-reentrant GeoIP?

2010-02-09 by Emmanuel Dreyfus

On Tue, Feb 09, 2010 at 01:29:31PM +0100, Enrico Scholz wrote:
> This seems to happen when two hosts connect within a very short time. I
> think that GeoIP_id_by_name() is called for the second host while this
> function is still executed for the first one.  As both are operating on
> the same 'geoip_handle' handle, this will corrupt internal state when
> geoip is not reentrant.

And reentrant it is not: nm shows that gethostbyname is used. Will
you contribute a fix?

-- 
Emmanuel Dreyfus
manu@...

Re: Segfault due to non-reentrant GeoIP?

2010-02-09 by Enrico Scholz

Emmanuel Dreyfus <manu-S783fYmB3Ccdnm+yROfE0A@...> writes:

>> This seems to happen when two hosts connect within a very short time. I
>> think that GeoIP_id_by_name() is called for the second host while this
>> function is still executed for the first one.  As both are operating on
>> the same 'geoip_handle' handle, this will corrupt internal state when
>> geoip is not reentrant.
>
> And reentrant it is not: nm shows that gethostbyname is used. 

This is harmless because it is used in the update code only.  Problem seems
to be the GeoIP _check_mtime() function and other races (e.g. non-atomic
lseek() + read() sequences).


> Will you contribute a fix?

I sent a patch to the maillist; I hope that yahoo groups accepts it...


Enrico

Re: Segfault due to non-reentrant GeoIP?

2010-02-10 by Enrico Scholz

Enrico Scholz
<enrico.scholz-jNDFPZUTrfQ+B2oLq8eQJv4efur1V5z/s0AfqQuZ5sE@...>
writes:

> This seems to happen when two hosts connect within a very short time. I
> think that GeoIP_id_by_name() is called for the second host while this
> function is still executed for the first one.  As both are operating on
> the same 'geoip_handle' handle, this will corrupt internal state when
> geoip is not reentrant.

While looking over the code it seems that other subsystems have similar
issues. E.g. 'p0f' uses a shared 'p0fsock' socket in non-atomic
reconnect-write-read sequences.  Ditto for 'spamd'.

Should there be added mutex locks to all these subsystems, or would it
make more sense to add locking to the calling instance?


Enrico

Re: [milter-greylist] Re: Segfault due to non-reentrant GeoIP?

2010-02-10 by Petar Bogdanovic

On Wed, Feb 10, 2010 at 02:18:17PM +0100, Enrico Scholz wrote:
> Enrico Scholz
> <enrico.scholz-jNDFPZUTrfQ+B2oLq8eQJv4efur1V5z/s0AfqQuZ5sE@...>
> writes:
> 
> > This seems to happen when two hosts connect within a very short time. I
> > think that GeoIP_id_by_name() is called for the second host while this
> > function is still executed for the first one.  As both are operating on
> > the same 'geoip_handle' handle, this will corrupt internal state when
> > geoip is not reentrant.
> 
> While looking over the code it seems that other subsystems have similar
> issues. E.g. 'p0f' uses a shared 'p0fsock' socket in non-atomic
> reconnect-write-read sequences.  Ditto for 'spamd'.

I don't understand.  Why should we lock that?  In case of spamd, every
thread talks to a separate spamd child through its own socket fd..

		Petar Bogdanovic

Re: Segfault due to non-reentrant GeoIP?

2010-02-10 by Enrico Scholz

Petar Bogdanovic <petar-+Dgt6vZh/JqsTnJN9+BGXg@...> writes:

>> > This seems to happen when two hosts connect within a very short time. I
>> > think that GeoIP_id_by_name() is called for the second host while this
>> > function is still executed for the first one.  As both are operating on
>> > the same 'geoip_handle' handle, this will corrupt internal state when
>> > geoip is not reentrant.
>> 
>> While looking over the code it seems that other subsystems have similar
>> issues. E.g. 'p0f' uses a shared 'p0fsock' socket in non-atomic
>> reconnect-write-read sequences.  Ditto for 'spamd'.
>
> I don't understand.  Why should we lock that?

In case of p0f, there is

----
static int p0fsock = -1;

int p0f_lookup(priv)
{
        ...
        if (p0f_reconnect() != 0)
                return -1;
        ...
        if (write(p0fsock, &req ,sizeof(req)) != sizeof(req)) {
        ...
        if (read(p0fsock, &rep, sizeof(rep)) != sizeof(rep)) {
        ...
}
----

p0f_lookup() is called unlocked from mlfi_connect().


DKIM (which might be also affected), calls dkim_verify() with a static
dkim_ptr.


> In case of spamd, every thread talks to a separate spamd child through
> its own socket fd..

Sorry; you are right. Spamd is ok.



Enrico

Re: [milter-greylist] Re: Segfault due to non-reentrant GeoIP?

2010-02-10 by manu@netbsd.org

Enrico Scholz <enrico.scholz@...-chemnitz.de> wrote:

> Should there be added mutex locks to all these subsystems, or would it
> make more sense to add locking to the calling instance?

I don't know. Whar are the pros and cons?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

[PATCH] Make p0f queries reentrant (was: Segfault due to non-reentrant GeoIP?)

2010-02-20 by Enrico Scholz

A patch which fixes a similar problem in the p0f code:

Do not share the p0fsock descriptor between several threads because
resulting the connect - write - read - close sequence is racy. p0f
protocol allows one query per connection only, so this patch allocates
and frees a local socket per connection.

Move to quarantaine

This moves the raw source file on disk only. The archive index is not changed automatically, so you still need to run a manual refresh afterward.