Hello,
After a recent batch of package updates on my Debian 6.0 amd64 system, I (and at least one other forum member) have encountered a problem with the 'lookup-domain-daemon': this service now uses 100% of our CPU.
One of the packages updated on both systems was MySQL (client, common, server, server_core), v5.1.61. But looking at the content of the 'lookup-domain-daemon.pl' file, I doubt that MySQL is the cause of the problem. On my system, the following packages were also updated:
- firmware-linux-free_2.6.32-41squeeze2_all.deb,
- linux-base_2.6.32-41squeeze2_all.deb,
- linux-image-2.6.32-5-amd64_2.6.32-41squeeze2_amd64.deb,
- linux-libc-dev_2.6.32-41squeeze2_amd64.deb.
On my particular system, I have no more than 4 processes linked to 'lookup-domain-daemon.pl' and 3 of them are using 100% of my CPU (I have 2 x Intel Xeon E5620).
root@sd-28802:~# ps aux | grep "lookup"
root 13199 0.0 0.1 76472 47280 ? Ss 03:55 0:00 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14321 100 0.2 88608 53528 ? R 03:58 0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14326 100 0.2 88608 53528 ? R 03:58 0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14336 101 0.2 88608 53528 ? R 03:58 0:49 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
The file '/var/webmin/lookup-domain-daemon.pid' indicates process #13199 as the initial one.
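The check above (which 'lookup-domain-daemon.pl' processes are pegging the CPU) can be automated. The following is only a minimal sketch in Python: the function name `busy_processes`, the 90% threshold and the `SAMPLE` constant are illustrative assumptions, and the sample text is simply the `ps aux` output quoted above.

```python
from typing import List, NamedTuple


class ProcInfo(NamedTuple):
    pid: int
    cpu_percent: float
    command: str


def busy_processes(ps_output: str, needle: str, min_cpu: float = 90.0) -> List[ProcInfo]:
    """Return processes whose command contains `needle` and whose %CPU >= min_cpu."""
    found = []
    for line in ps_output.splitlines():
        # `ps aux` columns: USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
        fields = line.split(None, 10)
        if len(fields) < 11 or not fields[1].isdigit():
            continue  # header line or truncated output
        command = fields[10]
        if needle not in command:
            continue
        try:
            cpu = float(fields[2])
        except ValueError:
            continue  # unexpected %CPU value, skip the line
        if cpu >= min_cpu:
            found.append(ProcInfo(int(fields[1]), cpu, command))
    return found


# The four lines reported in this post, used here as sample input.
SAMPLE = """\
root 13199 0.0 0.1 76472 47280 ? Ss 03:55 0:00 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14321 100 0.2 88608 53528 ? R 03:58 0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14326 100 0.2 88608 53528 ? R 03:58 0:50 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
root 14336 101 0.2 88608 53528 ? R 03:58 0:49 /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
"""
```

On a live system the input could come from `subprocess.run(["ps", "aux"], capture_output=True, text=True).stdout` instead of `SAMPLE`. With the sample above, the idle parent (13199) is filtered out and only the three busy children remain.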
The file '/var/webmin/lookup-domain-daemon.log' shows no error message and contains normal output ("[Mon Apr 16 04:00:03 2012] user=root NOUSER", for example), which suggests that the daemon is still working even while using 100% of the CPU.
After executing "/etc/init.d/lookup-domain stop" and waiting approximately 1 minute, the 4 processes are gone from the "ps aux" output.
I am wondering why this specific daemon stopped working.
If I can be of any help to figure out the root of the problem, I'm ready to help.
Tristan CHARBONNIER
PS: Topic on the forum: https://www.virtualmin.com/node/21885
Comments
Submitted by JamieCameron on Sun, 04/15/2012 - 23:17 Comment #1
Which Virtualmin version are you running there? We recently released version 3.91, which includes a fix for an issue like this.
Submitted by John_B on Mon, 04/16/2012 - 03:36 Comment #2
For me the problem, on Debian 6 64 bit, kernel 3, (reported in forum post https://www.virtualmin.com/node/21885) started with the upgrade to Virtualmin 3.91.
Incidentally, I also found that issuing the command 'service lookup-domain stop' stops the process only after a delay. You can of course stop it immediately with kill -9 [pid number].
Submitted by tristanleboss on Mon, 04/16/2012 - 08:17 Comment #3
I also use v3.91-gpl, which was updated at the same time (I forgot to mention it at first, sorry).
Submitted by andreychek on Mon, 04/16/2012 - 08:43 Comment #4
Jamie, there's another fellow who posted a similar issue in the forums who sees high CPU usage with lookup-domain and is using Virtualmin 3.91 as well.
Submitted by John_B on Mon, 04/16/2012 - 09:03 Comment #5
Probably referring to my thread, linked above.
Unless this apparent bug can be resolved fairly quickly, it would be useful to have brief advice on whether it is safe to downgrade Virtualmin back to 3.90, and on the easiest method for downgrading, since the resource consumption of lookup-domain-d under 3.90 was acceptable on my setup.
Submitted by JamieCameron on Mon, 04/16/2012 - 12:04 Comment #6
John - when lookup-domain is using high CPU like this, can you try running:
strace -o /tmp/strace.txt -p XXX
where XXX is the PID of the high CPU process. Let it run for 10 seconds, hit ctrl-c, and then email me the strace.txt file at jcameron@virtualmin.com
A short-term mitigation for this issue is to just stop the lookup-domain-daemon.pl process. Virtualmin will fall back to an alternate method of looking up users, which uses more CPU for each email but won't run into this bug.
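The strace capture requested above (attach to the busy PID, let it run for about 10 seconds, then interrupt it) can also be scripted, which helps when the process is short-lived. This is only a sketch in Python: the helper names `strace_command` and `trace_for` are made up for illustration, and it assumes strace is installed and the script runs with enough privileges to attach to the process.

```python
import signal
import subprocess
from typing import List


def strace_command(pid: int, out_path: str = "/tmp/strace.txt") -> List[str]:
    """Build the strace invocation quoted above for a given PID."""
    if pid <= 0:
        raise ValueError("pid must be a positive integer")
    return ["strace", "-o", out_path, "-p", str(pid)]


def trace_for(pid: int, seconds: float = 10.0, out_path: str = "/tmp/strace.txt") -> int:
    """Attach strace to `pid`, stop it after `seconds`, and return strace's exit code."""
    proc = subprocess.Popen(strace_command(pid, out_path))
    try:
        # strace exits on its own if the traced process ends first.
        return proc.wait(timeout=seconds)
    except subprocess.TimeoutExpired:
        # Same effect as pressing Ctrl-C in the terminal: strace detaches and exits.
        proc.send_signal(signal.SIGINT)
        return proc.wait()
```

For example, `trace_for(14321)` would trace PID 14321 for up to 10 seconds and leave the result in /tmp/strace.txt.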
Submitted by John_B on Mon, 04/16/2012 - 13:04 Comment #7
Thanks for the tip about stopping the process and still running Spamassassin.
I tried running strace and ltrace on the process pid (which anyway is not persistent), but can get no output either to file or screen. Sorry!
Submitted by JamieCameron on Mon, 04/16/2012 - 20:29 Comment #8
So does the process that is using 100% of CPU only run for a short time before exiting?
Submitted by John_B on Tue, 04/17/2012 - 01:57 Comment #9
Correct. And it restarts with a new pid; I sometimes see one, sometimes two running, for maybe a minute or two each. I can find nothing in the messages or dmesg logs about it.
Submitted by JamieCameron on Tue, 04/17/2012 - 13:53 Comment #10
So in Virtualmin 3.91, the lookup-domain-daemon server was changed to run a separate process for each incoming message, rather than processing them in series. This shouldn't make any difference unless your system is low on memory, or you get huge spikes of email.
Roughly how many messages does your system get each hour?
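To illustrate why a process-per-message design is sensitive to bursts of mail, here is a toy model in Python. It is not Virtualmin's code: the function `peak_concurrency` is hypothetical, and the 60-second handler lifetime is an assumption loosely based on the "minute or two" reported earlier in this thread.

```python
from typing import Iterable


def peak_concurrency(arrival_times: Iterable[float], service_seconds: float) -> int:
    """Largest number of handlers alive at once if each message gets its own process."""
    events = []
    for t in arrival_times:
        events.append((t, 1))                     # a handler starts
        events.append((t + service_seconds, -1))  # that handler exits
    # Tuples sort by time first; at equal times an exit (-1) sorts before a start (+1).
    events.sort()
    running = 0
    peak = 0
    for _, delta in events:
        running += delta
        peak = max(peak, running)
    return peak


# 200 messages spread evenly over a day (one every 432 seconds).
steady_day = [i * 432.0 for i in range(200)]
# 50 messages arriving within 10 seconds, as a mail loop might produce.
burst = [i * 0.2 for i in range(50)]
```

Under these assumptions, `peak_concurrency(steady_day, 60.0)` stays at one handler, while `peak_concurrency(burst, 60.0)` reaches fifty simultaneous handlers, which is why the hourly message rate matters here.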
Submitted by John_B on Tue, 04/17/2012 - 15:40 Comment #11
Less than 200 per day including spam in the three accounts for which spamassassin is now enabled. Less than double that for accounts with alias forwarding (no actual mailbox), plus spam hitting the server for accounts with no mail set up.
After the problem started I disabled spam protection for all but three domains, with a total of 5 email addresses. Since then, if I start the lookup-domain service, within five minutes it is showing one or two processes using 100% CPU. The chances are that in that time one or two emails have come in. Ten arriving in five minutes would be unusual.
Submitted by JamieCameron on Tue, 04/17/2012 - 23:11 Comment #12
You might want to check the log file
/var/log/procmail.log
to see how many messages your system is actually processing per day - it's possible there is a local mail loop that is creating more email than you expect.
Submitted by John_B on Wed, 04/18/2012 - 09:30 Comment #13
You are right.
1. an uninstalled Aegir was trying to write to a deleted mailbox (some kind of php-based cron job)
2. awstats cron job was trying to write warning messages to a mailbox already full with warning messages at www-data@servername.com, and writing instead to /var/www/Maildir/new
It never occurred to me to check that www-data mailbox. Anyway, awstats is not useful for me because it reports only one unique visitor when behind a reverse proxy, so I disabled it, then disabled its associated cron tabs.
I am now seeing spamassassin consuming 70% CPU, but only briefly, and lookup-domain.d consuming 3.1% RAM on 1.5GB, so 48M. So all OK.
Thanks for your help.
Submitted by JamieCameron on Wed, 04/18/2012 - 10:37 Comment #14
Cool, that will explain it. I will mark this bug as closed.
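For anyone hitting the same symptom, the per-day count from /var/log/procmail.log can be obtained with a short script. The Python sketch below assumes the standard procmail log layout, in which each delivery starts with an mbox-style "From sender date" line; a customised procmail configuration may log a different format, in which case the regular expression needs adjusting. The function name `messages_per_day` is illustrative.

```python
import re
from collections import Counter
from typing import Iterable

# Assumed layout of the first line of each procmail log entry:
#   From sender@example.com  Mon Apr 16 04:00:03 2012
FROM_LINE = re.compile(
    r"^From \S+\s+\w{3} (?P<mon>\w{3})\s+(?P<day>\d{1,2}) "
    r"\d{2}:\d{2}:\d{2} (?P<year>\d{4})\s*$"
)
MONTHS = {
    name: number
    for number, name in enumerate(
        "Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec".split(), start=1
    )
}


def messages_per_day(lines: Iterable[str]) -> Counter:
    """Count log entries per calendar day, keyed as YYYY-MM-DD."""
    counts: Counter = Counter()
    for line in lines:
        m = FROM_LINE.match(line)
        if m and m["mon"] in MONTHS:
            key = f"{m['year']}-{MONTHS[m['mon']]:02d}-{int(m['day']):02d}"
            counts[key] += 1
    return counts
```

Passing an open file handle for /var/log/procmail.log to `messages_per_day` and printing the sorted items gives one count per day; a day with far more entries than expected points to a mail loop like the one found here.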
Submitted by dboone on Tue, 05/01/2012 - 13:41 Comment #15
Sorry for the new issue I posted, but I never saw this post when I searched. Here's a link to my post with a solution:
https://www.virtualmin.com/node/22044