Yahoo Groups archive

Milter-greylist

hang in new configuration load

2014-05-01 by Bruncsak, Attila

Hello,

Sometimes milter-greylist (V4.5.10) hangs when trying to load a new configuration file.
There are more than 24000 threads, all blocked in conf_update().
One thread is executing conf_load_internal(),
but it makes no progress: it blocks on a lock,
and I could not figure out which one.

The only way out is to restart the process.

Is it only me who is having this trouble?

Best,
Attila

Re: [milter-greylist] hang in new configuration load

2014-05-01 by manu@...

Bruncsak, Attila <attila.bruncsak@...> wrote:

> Sometimes the milter-greylist (V4.5.10) hangs when trying to load new
> configuration file. There are more than 24000 threads, all are blocked in
> conf_update(). There is one thread which is executing the
> conf_load_internal()


Could you build with -g and check where exactly it stalls in
conf_load_internal()?

> Is it only me who is having this trouble?

I must confess I never reload, I always restart.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

Re: [milter-greylist] hang in new configuration load

2014-05-01 by Jim Klimov

On 1 May 2014, 10:43:17 CEST, manu@... wrote:
>Bruncsak, Attila <attila.bruncsak@...> wrote:
>
>> Sometimes the milter-greylist (V4.5.10) hangs when trying to load new
>> configuration file. There are more than 24000 threads, all are blocked in
>> conf_update(). There is one thread which is executing the
>> conf_load_internal()
>
>
>Could you build with -g and check where exactly it stalls in
>conf_load_internal()?
>
>> Is it only me who is having this trouble?
>
>I must confess I never reload, I always restart.

I also used to always restart, though in my recent deployments I switched to dynamic reloading by updating the config file. On this list I was kindly pointed to a command-line mode that verifies the config file, so I only push a valid config to the service (a script assembles the configs on the relays from several snippets, some maintained in CVS and some locally).

However, since it all runs under Solaris SMF, which restarts failed services, and there are no more failures due to invalid configs, I'd have to check whether any of our hosts had spontaneous milter restarts this year ;)

That said, a hang like yours likely won't cause a restart (the process is not dead?), and we don't update configs that often.

BTW, what is your deployment's scale (how many config lines, why so many threads)? Maybe size also matters? ;)

How often do you change the configs? Does the problem occur only after a few reloads (i.e. possibly leaks), or can it happen on the first attempt?

//Jim
--
Typos courtesy of K-9 Mail on my Samsung Android

RE: [milter-greylist] hang in new configuration load

2014-05-01 by Bruncsak, Attila

> 
> BTW, what is your deployment's scale (how many config lines, why so many threads)? Maybe size also matters? ;)
> 
> How often do you change the configs? Does the problem occur after a few reloads (i.e. may be leaks) or this can be on the first
> attempt?
> 

The configuration file now contains 75893 entries.
It is created dynamically
and changes about once an hour.

A syntax check is executed before the file is applied to production.
The production milter-greylist process never gets an incorrect configuration file.

The threads accumulated over several hours during the night while the milter-greylist process was hung.
The counterpart sendmail processes all timed out.

The problem does not show up regularly after a few reloads.
Sometimes the process stays up for a week or two.
A reload finishes in about 4 - 5 seconds if the hang does not happen, which is mostly the case.

RE: [milter-greylist] hang in new configuration load

2014-05-01 by Bruncsak, Attila

> Could you build with -g and check where exactly it stalls in
> conf_load_internal()?
> 

It always builds with "-g" flag for the C compiler by default.
It requires special configuration effort to switch off the debugging.

Re: [milter-greylist] hang in new configuration load

2014-05-01 by manu@...

Bruncsak, Attila <attila.bruncsak@...> wrote:

> > Could you build with -g and check where exactly it stalls in
> > conf_load_internal()?
> 
> It always builds with "-g" flag for the C compiler by default.
> It requires special configuration effort to switch off the debugging.

So if you attach the process with gdb, where are you?

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-02 by Bruncsak, Attila

> 
> So if you attach the process with gdb, where are you?
> 

Here I am:

(gdb) attach 2945
Attaching to process 2945
Reading symbols from /usr/local/bin/milter-greylist...done.
warning: .dynamic section for "/lib64/libresolv.so.2" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/libpthread.so.0" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/libnsl.so.1" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/libc.so.6" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
warning: .dynamic section for "/lib64/ld-linux-x86-64.so.2" is not at the expected address
warning: difference appears to be caused by prelink, adjusting expectations
Reading symbols from /lib64/libresolv.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libresolv.so.2
Reading symbols from /lib64/libpthread.so.0...(no debugging symbols found)...done.
[Thread debugging using libthread_db enabled]
[New Thread 0x2abf9ec88940 (LWP 23910)]
[New Thread 0x2abfa148c940 (LWP 23907)]
[New Thread 0x2abfa1e8d940 (LWP 23904)]
[New Thread 0x2abfa008a940 (LWP 23901)]
[New Thread 0x2abf9e287940 (LWP 23899)]
[New Thread 0x2abfa288e940 (LWP 23895)]
[New Thread 0x2abf9b081940 (LWP 23847)]
[New Thread 0x2abfa328f940 (LWP 23746)]
[New Thread 0x2abfa3c90940 (LWP 23743)]
[New Thread 0x2abf9c483940 (LWP 23728)]
[New Thread 0x2abf9ba82940 (LWP 16424)]
[New Thread 0x2abf9a680940 (LWP 2952)]
[New Thread 0x2abf99c7f940 (LWP 2951)]
[New Thread 0x2abf9927e940 (LWP 2950)]
[New Thread 0x2abf9887d940 (LWP 2949)]
[New Thread 0x2abf97e7c940 (LWP 2946)]
Loaded symbols for /lib64/libpthread.so.0
Reading symbols from /lib64/libnsl.so.1...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnsl.so.1
Reading symbols from /lib64/libc.so.6...(no debugging symbols found)...done.
Loaded symbols for /lib64/libc.so.6
Reading symbols from /lib64/ld-linux-x86-64.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/ld-linux-x86-64.so.2
Reading symbols from /lib64/libnss_files.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_files.so.2
Reading symbols from /lib64/libnss_dns.so.2...(no debugging symbols found)...done.
Loaded symbols for /lib64/libnss_dns.so.2
warning: no loadable sections found in added symbol-file system-supplied DSO at 0x7fffe8dfd000
0x00002abf96fe2346 in poll () from /lib64/libc.so.6
(gdb) info threads
  17 Thread 0x2abf97e7c940 (LWP 2946)  0x00002abf96aec280 in pthread_cond_timedwait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  16 Thread 0x2abf9887d940 (LWP 2949)  0x00002abf96aeeccb in accept () from /lib64/libpthread.so.0
  15 Thread 0x2abf9927e940 (LWP 2950)  0x00002abf96aeeccb in accept () from /lib64/libpthread.so.0
  14 Thread 0x2abf99c7f940 (LWP 2951)  0x00002abf96aec019 in pthread_cond_wait@@GLIBC_2.3.2 () from /lib64/libpthread.so.0
  13 Thread 0x2abf9a680940 (LWP 2952)  0x00002abf96aef9c8 in do_sigwait () from /lib64/libpthread.so.0
  12 Thread 0x2abf9ba82940 (LWP 16424)  0x00002abf96fdd48b in read () from /lib64/libc.so.6
  11 Thread 0x2abf9c483940 (LWP 23728)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  10 Thread 0x2abfa3c90940 (LWP 23743)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  9 Thread 0x2abfa328f940 (LWP 23746)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  8 Thread 0x2abf9b081940 (LWP 23847)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  7 Thread 0x2abfa288e940 (LWP 23895)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  6 Thread 0x2abf9e287940 (LWP 23899)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  5 Thread 0x2abfa008a940 (LWP 23901)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  4 Thread 0x2abfa1e8d940 (LWP 23904)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  3 Thread 0x2abfa148c940 (LWP 23907)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
  2 Thread 0x2abf9ec88940 (LWP 23910)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
* 1 Thread 0x2abf9726fb80 (LWP 2945)  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
(gdb)

But this is the milter-greylist process while it is working normally.
To provide useful information I have to wait until it hangs, which may take an hour
or even two weeks.

I have tried to force the hang with the command:
while :; do touch /etc/mail/greylist.conf; sleep 10; done
I let this run for two hours, with no success (I mean no hang :-( )

Please let me know if I can provide any other useful information before the hang occurs on its own.

RE: [milter-greylist] hang in new configuration load

2014-05-02 by Bruncsak, Attila

> > 0x00002abf96fe2346 in poll () from /lib64/libc.so.6
> 
> What does bt tell?
> 
(gdb) bt
#0  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
#1  0x000000000042c707 in mi_listener ()
#2  0x000000000042ba61 in smfi_main ()
#3  0x0000000000408a3a in main (argc=<value optimized out>, argv=0x7fffe8df5e18) at milter-greylist.c:2071
(gdb)

Re: [milter-greylist] hang in new configuration load

2014-05-02 by manu@...

Bruncsak, Attila <attila.bruncsak@...> wrote:

> #0  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
> #1  0x000000000042c707 in mi_listener ()
> #2  0x000000000042ba61 in smfi_main ()
> #3  0x0000000000408a3a in main (argc=<value optimized out>, argv=0x7fffe8df5e18) at milter-greylist.c:2071
> (gdb)

Um we are in libmilter. That one will be hard to debug.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-12 by Bruncsak, Attila

> > #0  0x00002abf96fe2346 in poll () from /lib64/libc.so.6
> > #1  0x000000000042c707 in mi_listener ()
> > #2  0x000000000042ba61 in smfi_main ()
> > #3  0x0000000000408a3a in main (argc=<value optimized out>, argv=0x7fffe8df5e18) at milter-greylist.c:2071
> > (gdb)
> 
> Um we are in libmilter. That one will be hard to debug.
> 

So here we are:

(gdb) thread 3462
[Switching to thread 3462 (Thread 0x2abf9a881940 (LWP 679))]#0  0x00002abf96aeb5f0 in pthread_rwlock_wrlock ()
   from /lib64/libpthread.so.0
(gdb) bt
#0  0x00002abf96aeb5f0 in pthread_rwlock_wrlock () from /lib64/libpthread.so.0
#1  0x000000000040f43c in conf_load_internal (timestamp=<value optimized out>) at conf.c:192
#2  0x00002abf96ae783d in start_thread () from /lib64/libpthread.so.0
#3  0x00002abf96feb26d in clone () from /lib64/libc.so.6
(gdb)

conf.c:192 is ACL_WRLOCK.
I have stopped all the sendmail processes and the deadlock seemingly is still there.
I have to clean up the system now; production must go on.
Now comes the code-reading phase: is there any thread that can hold that lock
for longer than it takes to read the ACL structure in memory?

Re: [milter-greylist] hang in new configuration load

2014-05-12 by manu@...

'Bruncsak, Attila' attila.bruncsak@... [milter-greylist]
<milter-greylist@yahoogroups.com> wrote:

> I have stopped all the sendmail processes and seemingly the deadlock is
> still there. I will have to clean the system now, the production must go
> on. It is now code reading phase, is there any thread which can keep that
> lock longer than just reading the ACL structure itself in the memory?

In acl.c we have a rather long ACL_RDLOCK/ACL_UNLOCK in acl_filter(). It
calls ac->ac_acr->acr_filter() and acl_actions(). Both may perform
network operations (e.g.: DNSRBL).
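
A minimal sketch of that failure mode, assuming plain POSIX rwlocks: a reader keeps the read lock while it waits on something slow, and the writer (the reload) cannot get the write lock until the reader returns. The names acl_lock, slow_reader, try_reload and run_demo are hypothetical and not taken from milter-greylist; a semaphore stands in for the network query, and the writer uses a timed lock only so that the demonstration terminates.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <pthread.h>
#include <semaphore.h>
#include <time.h>

static pthread_rwlock_t acl_lock = PTHREAD_RWLOCK_INITIALIZER;
static sem_t reader_has_lock;    /* posted once the reader holds the lock */
static sem_t reader_may_finish;  /* posted to let the simulated lookup return */

/* Hypothetical stand-in for a filter that blocks while holding the read lock. */
static void *slow_reader(void *arg)
{
    (void)arg;
    pthread_rwlock_rdlock(&acl_lock);
    sem_post(&reader_has_lock);
    sem_wait(&reader_may_finish);   /* simulates a query that does not answer */
    pthread_rwlock_unlock(&acl_lock);
    return NULL;
}

/* Hypothetical stand-in for the reload path.
 * Returns 0 on success or an errno value such as ETIMEDOUT. */
static int try_reload(long wait_ms)
{
    struct timespec deadline;
    int rc;

    clock_gettime(CLOCK_REALTIME, &deadline);
    deadline.tv_nsec += (wait_ms % 1000) * 1000000L;
    deadline.tv_sec += wait_ms / 1000 + deadline.tv_nsec / 1000000000L;
    deadline.tv_nsec %= 1000000000L;

    rc = pthread_rwlock_timedwrlock(&acl_lock, &deadline);
    if (rc == 0)
        pthread_rwlock_unlock(&acl_lock);
    return rc;
}

/* Tries a reload while the reader is stuck, then again after it has finished. */
static void run_demo(int *while_reader_stuck, int *after_reader_done)
{
    pthread_t reader;
    int rc;

    sem_init(&reader_has_lock, 0, 0);
    sem_init(&reader_may_finish, 0, 0);
    rc = pthread_create(&reader, NULL, slow_reader, NULL);
    assert(rc == 0);
    (void)rc;

    sem_wait(&reader_has_lock);
    *while_reader_stuck = try_reload(200);

    sem_post(&reader_may_finish);
    pthread_join(reader, NULL);
    *after_reader_done = try_reload(200);

    sem_destroy(&reader_has_lock);
    sem_destroy(&reader_may_finish);
}
```

With an untimed pthread_rwlock_wrlock() in place of the timed call, the first reload attempt would wait for as long as the reader stays blocked, which matches the backtrace at conf.c:192.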

How can that be fixed? We can lock the individual entry and assign it a
refcount (config reload will decrease the refcount, unlink the entry,
and leave it allocated if refcount > 0). Or we can make sure no thread
gets stuck forever here, but I am not sure how that could be achieved.


-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-13 by Bruncsak, Attila

> In acl.c we have a rather long ACL_RDLOCK/ACL_UNLOCK in acl_filter(). It
> calls ac->ac_acr->acr_filter() and acl_actions(). Both may perform
> network operations (e.g.: DNSRBL).
> 
> How can that be fixed? We can lock the individual entry and assign it a
> refcount (config reload will decrease the refcount, unlink the entry,
> and leave it allocated if refcount > 0). Or we can make sure no thread
> gets stuck forever here, but I am not sure how that could be achieved.
> 

As a principle, a lock-protected critical section must not contain any
action other than manipulating the process memory shared
between the threads.
This excludes every I/O operation, such as sending a syslog message.

Please let us know if anybody disagrees with me!

If you agree with me, we should start thinking about redesigning the program
to adhere strictly to that principle.
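
A minimal sketch of that principle, assuming a mutex-protected structure: the critical section only copies shared memory into local variables, and the I/O happens afterwards on the private copy. The structure shared_state and the functions record_hit and report are hypothetical illustrations, not code from milter-greylist.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <pthread.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical shared data; every field is protected by lock. */
struct shared_state {
    pthread_mutex_t lock;
    char last_sender[128];
    unsigned long hits;
};

static struct shared_state state = { PTHREAD_MUTEX_INITIALIZER, "", 0 };

/* The critical section only updates memory shared between threads. */
static void record_hit(const char *sender)
{
    assert(sender != NULL);
    pthread_mutex_lock(&state.lock);
    snprintf(state.last_sender, sizeof(state.last_sender), "%s", sender);
    state.hits++;
    pthread_mutex_unlock(&state.lock);
}

/* Copy under the lock, release it, then do the I/O on the private copy. */
static unsigned long report(FILE *out)
{
    char sender[sizeof(state.last_sender)];
    unsigned long hits;

    pthread_mutex_lock(&state.lock);
    memcpy(sender, state.last_sender, sizeof(sender));
    hits = state.hits;
    pthread_mutex_unlock(&state.lock);

    /* This write may block, but no lock is held while it does. */
    fprintf(out, "hits=%lu last=%s\n", hits, sender);
    return hits;
}
```

If the output stalls, only the calling thread waits; other threads can still take the lock and update the state.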

Re: [milter-greylist] hang in new configuration load

2014-05-13 by Emmanuel Dreyfus

On Tue, May 13, 2014 at 09:37:47AM +0000, 'Bruncsak, Attila' attila.bruncsak@... [milter-greylist] wrote:
> As a principle all lock protected critical section must not contain any
> other actions than manipulating the memory of the process shared
> between the threads. 
 
I agree. But the problem here is that acl_filter() is working on 
an ACL entry and we do not want it to be freed during its operation.

We do a rough lock of the whole list during the operation, and it
can be improved. Locking the individual entry seems the way to go,
but this means adding a refcount so that it can be detached from the
list (by config reload) during operation, and freed by the last
user when the refcount drops to zero.

That will let config reload work, but it does not solve the problem
of stuck threads holding a reference on an ACL entry forever. As the threads
accumulate, we will reach the point where the program still fails.
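
A minimal sketch of the per-entry refcount idea, assuming a singly linked list protected by one short-lived mutex. The list itself owns one reference to each entry; a reload unlinks the entries and drops that reference, and whoever drops the last reference frees the entry. All names (struct entry, entry_add, entry_get, entry_put, list_clear, freed_count) are hypothetical and not the milter-greylist data structures.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct entry {
    struct entry *next;
    int refcount;   /* protected by list_lock; the list holds one reference */
    int id;
};

static pthread_mutex_t list_lock = PTHREAD_MUTEX_INITIALIZER;
static struct entry *list_head;
static int freed_count;   /* instrumentation for the example only */

static void entry_add(int id)
{
    struct entry *e = calloc(1, sizeof(*e));

    assert(e != NULL);
    e->id = id;
    e->refcount = 1;              /* the list's own reference */
    pthread_mutex_lock(&list_lock);
    e->next = list_head;
    list_head = e;
    pthread_mutex_unlock(&list_lock);
}

/* Find an entry and pin it; the lock is held only for the lookup. */
static struct entry *entry_get(int id)
{
    struct entry *e;

    pthread_mutex_lock(&list_lock);
    for (e = list_head; e != NULL; e = e->next)
        if (e->id == id)
            break;
    if (e != NULL)
        e->refcount++;
    pthread_mutex_unlock(&list_lock);
    return e;
}

/* Drop one reference; the last user frees the entry. */
static void entry_put(struct entry *e)
{
    int last;

    pthread_mutex_lock(&list_lock);
    assert(e->refcount > 0);
    last = (--e->refcount == 0);
    if (last)
        freed_count++;
    pthread_mutex_unlock(&list_lock);
    if (last)
        free(e);
}

/* Reload path: detach every entry, then drop the list's references. */
static void list_clear(void)
{
    struct entry *e, *next;

    pthread_mutex_lock(&list_lock);
    e = list_head;
    list_head = NULL;
    pthread_mutex_unlock(&list_lock);

    for (; e != NULL; e = next) {
        next = e->next;           /* read before the entry can be freed */
        e->next = NULL;
        entry_put(e);
    }
}
```

A thread that pinned an entry before the reload keeps a valid pointer until its own entry_put(), so the reload never has to wait for it. A thread that never returns still keeps its entry allocated forever, which is the remaining concern stated above.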



-- 
Emmanuel Dreyfus
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-13 by Bruncsak, Attila

> 
> We do a rough lock of the whole list during the operation, and it
> can be improved. Locking the individual entry seems the way to go,
> but this means adding a refcount so that it can be detached from the
> list (by config reload) during operation, and freed by the last
> user when refcount drops to zero.
> 

What about the following?

config_load() creates a totally new ACL list, not relying on any existing data
(no locking is needed then).
The ACL list has its own lock and an associated reference count,
which is initialized by config_load().
Threads created after the new configuration is loaded will start using this ACL list.
Old threads, when they finish their job, will release the lock of the old ACL list,
and the last old thread frees the list when the reference count reaches 0.
So old threads and new threads will use different ACL lists as well as different locks.

The same logic may be applied not only to the ACL list but to the full configuration,
as the ACL list is only part of the configuration data.

If we go further with this logic, there is no need for a reference count on the ACL list;
the reference count of the configuration should be used, since one configuration has
exactly one ACL list. Freeing the configuration includes freeing its associated ACL list.

Oh, by the way, no ACL lock is needed at all if the ACL list is not modified after loading the new configuration!
(I am just thinking further as I am writing this e-mail)
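
A minimal sketch of this proposal, assuming one reference count per configuration and a mutex that only guards the "current" pointer and the counts. The new configuration is built without any lock, published with a pointer swap, and the old one is freed by its last user. The names struct config, config_acquire, config_release, config_publish and configs_freed are hypothetical; the ACL list would be a read-only member of struct config.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

struct config {
    int refcount;     /* protected by current_lock; "current" holds one reference */
    int generation;   /* placeholder for the ACL list and other settings,
                         all read-only once the configuration is published */
};

static pthread_mutex_t current_lock = PTHREAD_MUTEX_INITIALIZER;
static struct config *current;
static int configs_freed;   /* instrumentation for the example only */

/* Drop one reference; the last user frees the whole configuration. */
static void config_release(struct config *c)
{
    int last;

    pthread_mutex_lock(&current_lock);
    assert(c->refcount > 0);
    last = (--c->refcount == 0);
    if (last)
        configs_freed++;
    pthread_mutex_unlock(&current_lock);
    if (last)
        free(c);
}

/* Called when a thread starts its job: pin whatever is current right now. */
static struct config *config_acquire(void)
{
    struct config *c;

    pthread_mutex_lock(&current_lock);
    c = current;
    if (c != NULL)
        c->refcount++;
    pthread_mutex_unlock(&current_lock);
    return c;
}

/* Reload path: build the new configuration privately, then swap the pointer. */
static void config_publish(int generation)
{
    struct config *fresh = calloc(1, sizeof(*fresh));
    struct config *old;

    assert(fresh != NULL);
    fresh->refcount = 1;          /* the reference held by "current" */
    fresh->generation = generation;
    /* Parsing the new file would happen here, with no lock held. */

    pthread_mutex_lock(&current_lock);
    old = current;
    current = fresh;
    pthread_mutex_unlock(&current_lock);

    if (old != NULL)
        config_release(old);      /* old users keep it alive until they finish */
}
```

The lock is held only for a pointer read or swap and a counter update, so a thread blocked in a network operation cannot delay a reload; it merely keeps its old configuration allocated.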

Re: [milter-greylist] hang in new configuration load

2014-05-14 by manu@...

'Bruncsak, Attila' attila.bruncsak@... [milter-greylist]
<milter-greylist@yahoogroups.com> wrote:

> What about the following?

We still don't address the problem of stuck threads that accumulate to
the point the process will fail. The locking problem is a symptom (that
should be fixed), but not the root cause.

-- 
Emmanuel Dreyfus
http://hcpnet.free.fr/pubz
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-14 by Bruncsak, Attila

> 
> > What about the following?
> 
> We still don't address the problem of stuck threads that accumulate to
> the point the process will fail. The locking problem is a symptom (that
> should be fixed), but not the root cause.
> 

Currently, the threads accumulate because they are all waiting permanently
on a lock for the new configuration to become available.
With my proposal, only threads created after the new configuration has been
read in and is available in memory would use it; existing threads
keep using the current (old) configuration without waiting on any lock.

Re: [milter-greylist] hang in new configuration load

2014-05-14 by Johann Klasek

On Wed, May 14, 2014 at 01:15:35PM +0200, manu@... [milter-greylist] wrote:
> 'Bruncsak, Attila' attila.bruncsak@... [milter-greylist]
> <milter-greylist@yahoogroups.com> wrote:
> 
> > What about the following?
> 
> We still don't address the problem of stuck threads that accumulate to
> the point the process will fail. The locking problem is a symptom (that
> should be fixed), but not the root cause.

Attila's variant seems to eliminate the cause, so you have no symptoms at all,
isn't that so?
This looks like the common approach to me in such cases: build the data structure in
a shadow area and replace it in a single final action, instead of making several
in-place changes to in-use data ...

Johann

Re: [milter-greylist] hang in new configuration load

2014-05-14 by Emmanuel Dreyfus

On Wed, May 14, 2014 at 02:16:16PM +0200, Johann Klasek johann@... [milter-greylist] wrote:
> Attila's variant seems to eliminate the cause, so you have no symptoms at all,
> isn't that so?

I understand that the symptom vanishes in the short term (which is desirable, I
do not oppose the change), but you still have stuck threads that eat
more and more resources in the process.

-- 
Emmanuel Dreyfus
manu@...

RE: [milter-greylist] hang in new configuration load

2014-05-14 by Bruncsak, Attila

> On Wed, May 14, 2014 at 02:16:16PM +0200, Johann Klasek johann@... [milter-greylist] wrote:
> > Attila's variant seems to eliminate the cause, so you have no symptoms at all,
> > isn't that so?
> 
> I understand that the symptom vanishes in the short term (which is desirable, I
> do not oppose the change), but you still have stuck threads that eat
> more and more resources in the process.
> 

That is right. With my variant a given thread may still get stuck forever;
that does not change.
The big improvement is that the stuck thread does not hold any lock,
so it cannot make the whole milter-greylist process hang.

In my case I guess (I have not verified it) that spamd (SpamAssassin) is stuck
and holds up the corresponding thread in milter-greylist,
which then holds up all the other working threads.

In addition, under the assumption I made above, I think that
if I killed the spamd process,
the corresponding thread would continue to run
and release all the resources it is using at the end.
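
One possible way to keep a thread from waiting forever on an unresponsive peer is to bound every wait. The following is only a sketch under that assumption: read_with_timeout is a hypothetical helper built on POSIX poll() and read(), and nothing here reflects how milter-greylist actually talks to spamd.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <poll.h>
#include <unistd.h>

/*
 * Read up to len bytes from fd, waiting at most timeout_ms for data.
 * Returns the number of bytes read, 0 on end of file, or -1 with errno set
 * (ETIMEDOUT when nothing arrived in time).
 * Simplification: the full timeout starts again if poll() is interrupted
 * by a signal.
 */
static ssize_t read_with_timeout(int fd, void *buf, size_t len, int timeout_ms)
{
    struct pollfd pfd = { .fd = fd, .events = POLLIN };
    int rc;

    assert(buf != NULL);
    do {
        rc = poll(&pfd, 1, timeout_ms);
    } while (rc < 0 && errno == EINTR);

    if (rc < 0)
        return -1;                /* poll() itself failed */
    if (rc == 0) {
        errno = ETIMEDOUT;        /* the peer did not answer in time */
        return -1;
    }
    return read(fd, buf, len);
}
```

A caller that gets ETIMEDOUT can close the connection and release whatever it holds, instead of staying blocked until the peer is killed.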
