ClamAV and SpamD

We have been experiencing extremely high CPU and disk IO utilization on our production box, with Load Avg anywhere from 3 or 4 up to as much as 20 and 30 (higher at one freakish point). I am working with Canonical and they have been helping with identifying any issues with the installation. Since it appears that ClamAV and Spamd are the biggest culprits (they have on more than one occasion basically locked up our server and stopped other services), I think it would be wise for us to offload the processing to another server. I saw you have instructions on this, but I may need a bit more guidance.

We're running VMware ESXi 3.51 and will be buying another physical server over the next couple of months. Canonical was suggesting we move to 12.04 (which is coming out over the next 1-2 weeks).

1) Can you provide guidance in setting up a second Ubuntu VM strictly for processing Spamd and ClamAV?
2) Lookup-domain.d (sometimes .p) still seem to be running very hot (same as with issue 1). Was this daemon updated with the latest Vm update?
3) Since we rely on Virtualmin for our server, when do you expect it to be able to support 12.04?
4) Do you have any information or experience with tweaking ESXi 3.51 to make the most of performance for Ubuntu?

Thanks so much.

ps. I just saw how to close out all our open tickets. I'll do that shortly.

Status: 
Closed (fixed)

Comments

Howdy --

1) Can you provide guidance in setting up a second Ubuntu VM strictly for processing Spamd and ClamAV?

That should actually be straightforward... all you would need is a basic install of Ubuntu, install spamassassin and clamav on it (from the Ubuntu repository), and then follow the setup instructions here in the section named "Moving Spam and Virus Scanning to Another System":

http://www.virtualmin.com/documentation/email/spam-av

2) Lookup-domain.d (sometimes .p) still seem to be running very hot (same as with issue 1). Was this daemon updated with the latest Vm update?

Yup! Jamie put changes into lookup-daemon that should cause it to use less CPU. Those changes are active in Virtualmin version 3.91.

3) Since we rely on Virtualmin for our server, when do you expect it to be able to support 12.04?

You should see Virtualmin support within a few weeks of 12.04 coming out.

4) Do you have any information or experience with tweaking ESXi 3.51 to make the most of performance for Ubuntu?

Sorry, I don't have any recommendations for tweaks with VMware.

Though, once Ubuntu 12.04 support is available in Virtualmin -- you may see performance improvements in the new 3.2 kernel.

Great. I will review the procedure for setting up the secondary Ubuntu server for spamd and clamav. I assume I would have to use the same version of Ubuntu (8.04) to make it simple.

Also, Canonical made the following changes to /etc/spamassassin/local.cf:

    dns_available yes
    lock_method flock
    use_bayes 1
    bayes_auto_learn 1
    use_auto_whitelist 0

They would also like to add some parameters to Postfix, but fear that Virtualmin may override them. Would you check this link and let me know if it would be OK for them to try these parameters, or if they are something that I can do from within the vm gui?

http://www.systmbx.com/postfix/how-to-stop-spam-using-postfix-configuration

While you're at it concerning updating, when you get the new server, may I suggest you also replace that hopelessly outdated ESXi 3.5 with the current version 5.0? :) I'm quite sure you're going to see many many improvements there.

You just need to make sure to get a 64-bit system, since ESXi 4 and 5 run only on 64-bit, which should be rather standard by now though. And make sure to put enough memory in the box. Especially services like SpamAssassin and ClamAV need quite a bit of memory. If possible, give the Virtualmin VM 2 or more GB of RAM for smooth performance.

The distro/version used on your other server doesn't matter.

Since you're an Ubuntu user -- I'd probably actually use the new Ubuntu coming out next week, 12.04 LTS, since it'll be around for quite some time (though note that Virtualmin itself won't be supported on that for a few more weeks, but you can use it as a dedicated SpamAssassin/ClamAV system in the meantime).

Regarding the Postfix parameters you mentioned -- I don't see any parameters mentioned in that link which would be incompatible with any parameters Virtualmin sets.

Upgrading to the latest ESXi is also on the plate. We were a bit hesitant in upgrading the existing ESXi 3.5 to the newer one since it's our production server. We did it on our test server (up to 4) and it appears to be running fine, so we may just bite the bullet and at least upgrade this one as well. The current box has 12GB RAM, it's 64 bit XEON quad core, and RAID 5 w/ 5 spindles and about 1TB storage. So the box is pretty capable. It's all in the plan. I'm just buying a little time until we have the resources to place a newer server as production and demote the existing to test.

Thanks for the feedback on the parameters. I'm going to have Canonical do that next. I appreciate all your feedback.

I know that the lookup-domain.d(.p) daemon was updated and we have 3.91 of Virtualmin running, but Canonical has asked the following. I am going to proceed with offloading ClamAV and SpamAssassin to a secondary VM to help reduce the burden, but I'm not sure if this process will still be a factor after that. I assume this is also part of the email items. Thanks.

FROM CANONICAL: I just saw lots of lookup-domain processes running with high CPU usage. Please don't forget to check with Virtualmin why this happens and whether there is a tweak or workaround to avoid such issues.

14670 root 0 -20 3660 3656 1948 D 12 0.0 2:21.09 atop
5632 clamav 20 0 170m 116m 7180 S 11 1.2 1:34.17 clamd
22661 root 20 0 8996 7288 1696 R 6 0.1 0:00.21 lookup-domain.p
22637 root 20 0 8732 7068 1680 R 5 0.1 0:00.22 lookup-domain.p
22638 root 20 0 8336 6684 1672 R 5 0.1 0:00.21 lookup-domain.p
22657 root 20 0 8600 6852 1672 R 5 0.1 0:00.21 lookup-domain.p
22298 root 20 0 37600 34m 1420 R 5 0.3 0:00.72 lookup-domain-d
22332 root 20 0 36956 34m 1396 R 5 0.3 0:00.62 lookup-domain-d
22672 root 20 0 8336 6636 1672 R 5 0.1 0:00.20 lookup-domain.p
22210 root 20 0 39460 36m 1472 R 5 0.4 0:00.88 lookup-domain-d
22554 root 20 0 34172 31m 1408 R 5 0.3 0:00.36 lookup-domain-d
22569 root 20 0 34168 31m 1404 R 5 0.3 0:00.35 lookup-domain-d
22653 root 20 0 8080 6376 1672 R 5 0.1 0:00.21 lookup-domain.p
22710 root 20 0 8204 6500 1672 R 5 0.1 0:00.18 lookup-domain.p
22711 root 20 0 7780 6124 1660 R 5 0.1 0:00.18 lookup-
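A paste like the one above is easier to compare between incidents once it is totalled per command. The following is a minimal sketch, assuming the rows use the default top column order (PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND). The `summarize` helper and the `SAMPLE` text are illustrative names, and the sample rows are copied from the paste above.

```python
from collections import Counter

# A few rows copied from the top output above.
SAMPLE = """\
5632 clamav 20 0 170m 116m 7180 S 11 1.2 1:34.17 clamd
22661 root 20 0 8996 7288 1696 R 6 0.1 0:00.21 lookup-domain.p
22298 root 20 0 37600 34m 1420 R 5 0.3 0:00.72 lookup-domain-d
22332 root 20 0 36956 34m 1396 R 5 0.3 0:00.62 lookup-domain-d
"""

def summarize(top_text):
    """Return (process count, summed %CPU) per command name."""
    counts = Counter()
    cpu = Counter()
    for line in top_text.splitlines():
        fields = line.split()
        if len(fields) < 12:
            continue  # skip blank rows and rows missing columns
        command = fields[11]          # COMMAND is the 12th column
        counts[command] += 1
        cpu[command] += float(fields[8])  # %CPU is the 9th column
    return counts, cpu

counts, cpu = summarize(SAMPLE)
for command, n in counts.most_common():
    print(command, n, cpu[command])
```

Note that top truncates command names, so lookup-domain.p and lookup-domain-d are counted as separate entries here.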

Well, let's try one thing just to make sure there isn't a problem with the lookup-domain-daemon process.

Try running this command:

/etc/init.d/lookup-domain restart

That will make sure the daemon is running properly, and using the newly tweaked code.

After doing that, do you continue to see the same problems? Or does that assist with the high CPU usage you're seeing?

Actually, I have the command listed as one of my regular commands to run when I see this process running this hot and often. The server was rebooted this morning and all processes are running fresh. Restarting the processes usually does calm it down, but then it will start running hot again at various points and I would restart the process again. I think that's why Canonical was asking about the process.

Is there any chance we could log into your system to review the logs as well as the resource usage?

Also, roughly how many users do you have on the system there that receive email?

And what sort of email traffic are you receiving -- do you know how many messages a day you're receiving?

Thanks!

I would like to coordinate with Canonical since they are working on this as well. But, lookup-domain was just running about 30 instances (based on TOP from what I saw) and CPU was at 0%. I restarted lookup-d and it is again back to a normal state.

I certainly can provide access to the server for you to look at it. I can create an account for you as I did with Canonical.

We have just under 800 accounts on the server. 10GB Ram, XEON Quad 2.0, RAID 5 array under VMWare 3.5 (yes we are upgrading it).

I'm not sure whether, as a temporary fix, we could have monit watch it and restart lookup-d if it sees it spiking. Just asking, since I'm certainly not a sysadmin.
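The watch-and-restart idea can be sketched as a small script run from cron. This is only an illustration under stated assumptions, not a monit configuration: the threshold of 50 is a hypothetical value, the helper names are made up for the example, and the process names are expected to come from something like `ps -eo comm=`. The restart command is the one quoted earlier in this thread. As noted above, a restart only calms things down for a while, so this would be a stopgap at best.

```python
import subprocess

THRESHOLD = 50  # hypothetical limit; tune to what is normal on the server
RESTART_CMD = ["/etc/init.d/lookup-domain", "restart"]  # command from this thread

def count_matching(process_names, prefix="lookup-domain"):
    """Count process names that start with the given prefix."""
    return sum(1 for name in process_names if name.strip().startswith(prefix))

def check_and_restart(process_names, threshold=THRESHOLD, dry_run=True):
    """Return True when a restart is warranted; run it unless dry_run is set."""
    running = count_matching(process_names)
    if running <= threshold:
        return False
    if not dry_run:
        subprocess.run(RESTART_CMD, check=True)
    return True

# Dry run against made-up data: 60 daemon processes exceed the threshold.
print(check_and_restart(["lookup-domain-d"] * 60 + ["clamd"]))
```

Keeping `dry_run=True` by default means the decision logic can be tried out without touching the daemon.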

Also, as a side note, Canonical made some adjustments to reject more spam from hitting the server.

How would you like me to proceed with access?

Correction, there were hundreds of lookup-d instances running on the server before I restarted that process. Wow!

Well, if it helps, I'm not looking to make any changes now, only to review the logs and running processes. So we shouldn't be getting in Canonical's way.

However, if we do wish to make any changes, I'll propose them here, and then we can coordinate those with Canonical.

If that's okay -- the next step would be to provide an account of some sort. You could email the details to eric@virtualmin.com.

I will add the account for you. I want to make sure you can see and access the logs. What is the best way to add the account on the server for you? I can do it right at a SecureCRT prompt or through Virtualmin. Sorry for the ignorant questions, but I've jumped into the mix since my old developer (pseudo sysadmin) has left. I will then email you the login credentials to the email you referenced.

Well, how you add the account doesn't matter -- the thing to do would be to make sure it's a member of the "admin" group, so that it has sudo rights.

You also have the option of using the Virtualmin Support module, and simply enabling Remote Access from there. That temporarily grants us access to the root account, if logging in as root hadn't been disabled.

Just emailed you credentials. I'm in my office and want to stick with this as long as you can. As in my email, if there is a way to throttle this process from launching hundreds of times, that would be great. Not sure if Munin would be able to do that. We're using it for other services.

Thanks, we'll look into this.

The issue though is that lookup-domain needs to process each incoming message -- so if you get a large batch of incoming messages, that could cause a spike in resource usage.

You had mentioned in your email about a problem with an outgoing newsletter -- that wouldn't actually cause any lookup-domain processes to be launched, since they're only used on incoming messages.

The only reason that would occur is if that user was also seeing a large number of bounced messages that were being delivered back to their account.

Now, there's no way to rate limit lookup-domain -- it runs once every time it's told to by Postfix/procmail. But, it may be possible to rate limit incoming emails from Postfix. I don't have much experience in such a setup, though this documentation here may be a place to start:

http://www.porcupine.org/postfix/doc/rate.html#process

We're going to review your logs to see if there's anything unusual that we can find with the resource issues you're seeing there, we'll let you know what we find.

Thank you for the additional information. Looking forward to seeing if you can find something unusual since this has gotten worse. Regarding the client sending the email blast, the recipients do include about 120 email accounts on our server and usually has attachments of 250k - 700k, so that may be a reason for so many spawned lookup-d instances. I'm going to look and see if ClamAV also scans all outgoing messages and attachments in case that is also contributing.

Both SpamAssassin and ClamAV only scan incoming messages -- there's nothing on a typical Virtualmin setup that scans outgoing messages, only incoming messages are scanned.

Good to know. I can rule that out. Again, sorry if that's common knowledge. Let me know if you see anything that can shed light.

As a temporary measure until we place a new server for dedicated email (to split the processing between multiple servers), I am going to follow your instructions on using a second Ubuntu VM for processing AV and Spam. Since I assume that this only scans the email, is about 6GB free storage and 4GB Ram on a Xeon Dual Core adequate? Or is there a recommendation that you can provide?

Yup, those specs sound great!

That's actually a little more than you should need -- but if you can spare the RAM, that would certainly ensure that you have enough RAM.

Thanks. I'm actually trying to get this done today since we've been experiencing some severe issues with high load and I think this should at least help. We have a paid version of Virtualmin on our production server (3.91) and on this other server, it's running the GPL version at 3.703. Is that OK, and is it the latest? I am a firm believer in paying for quality software and support, and Virtualmin is a definite plus so far. Once we get our new server, I plan on getting another paid version (unless you have a bundle for multiple servers), but in the meantime, will these two versions work well together?

Well, we'd certainly always recommend using the most recent Virtualmin version, and 3.703 is fairly old. I'd be concerned about bugs and security issues in it.

That said, running SpamAssassin and Clamd on a remote server doesn't require that Virtualmin be running on that remote server -- so the Virtualmin version you have there won't affect the performance of SpamAssassin and ClamAV.

I just have it to make it easy to make sure all modules are up to date. I've been trying to find how to update it to the latest version. I keep finding a post from 2008 about upgrading it, but it looks like it is for Webmin. Can you point me in the right direction?

Well, I don't want to introduce too many topics into this one request, or I'll start having trouble keeping track of them all :-)

We'd be happy to help though, and you're welcome to open up a new request to talk about that. Be sure to let us know how you went about installing Virtualmin on that server. Thanks!

Is it possible to use spamassassin on a remote server for a single virtual server? I would like to test it out thoroughly, and the same with ClamAV. I think I have spamassassin configured correctly on a separate server.

Unfortunately, it's only possible to change the configuration for the entire server.

What I would recommend doing is change SpamAssassin to use your remote server, and do so outside of business hours.

Once you make the change, look in the mail logs on your remote SpamAssassin server (/var/log/mail.log), where you should be able to see SpamAssassin processing email that is arriving on your other server.

Also, if you look in your mail.log and procmail.log files on your primary server, you shouldn't be seeing SpamAssassin related errors.

Final fix is a new server and should be here on Friday. Dual 6 Core processors, 30GB RAM, 6 drive RAID 10 array with 15000RPM SAS drives running VMWare 5. We are going to split up servers into 1-Email, 1-Spam/Av, 1-Everything else. In the meantime, here are some changes I made on our current "eggs-in-one-basket" system.

1) Disabled ClamAV scanning on the server
2) Increased the Max Child Processes for Spamassassin from 5 to 20
3) Decreased the Postfix Child Processes from 100 to 50
4) Decreased the max allowed recipients from 50 to 20
5) Moved all cron processes to early morning hours

I'm also notifying the two top clients for email distribution to their users to send me their blasts and I'll regulate them. I still cannot explain why the load explodes and idle goes to 0% when there are only 150 emails coming into the queue. So, if you want to help figure that out to possibly help others using Pro, I'll be available this evening to run tests in the wee hours on the current box, because it still just doesn't make sense and I know there is something causing this odd behavior.

Okay, that all sounds good -- with the only possible exception being that increasing SpamAssassin's child processes could potentially cause more load on your system, if more SpamAssassin processes are dealing with spam at the same time.

As Joe mentioned in your other support request though, we're a little baffled as to what's going on... you shouldn't see issues like you're describing with 100 emails (or even 1000 emails).

We aren't seeing this behavior on other systems (and certainly none as high-end as yours), so we aren't sure what to change or fix.

One thing I had been wondering is if maybe you're seeing some sort of hardware IO bottleneck, but that's difficult to test on a live server.

However, are you saying that it would be possible to schedule a time to reproduce the issue you're seeing? That is, if we can find a time in the evening/night when Jamie has some availability, would you be able to send yourself 100 emails to reproduce the issue?

What timezone are you in, and what time frame are you available?

Oh, and one other thing is that Jamie rewrote the lookup-daemon after you initially brought all this up -- it's now multi-threaded, and can process more email at once. Which, on most systems, allows it to run more efficiently.

However, since we're seeing such odd behavior on your setup, that doesn't seem to be helping -- and may even be making things worse.

If you like, Jamie had offered that we could put the old lookup-domain version back onto your system, which would cause it to process emails slower, since it's not multi-threaded, and it can only process one email at a time.

I'm on EST and last night was testing the system between 11:30pm and 1:30am. So high load averages are much easier to monitor. For instance, my test included setting up an alias for my domain (rjrsolutions.com) that sends an email to all users in the domain. Very simple to set up. Since the accounts we have also do a bunch of forwarding, it turns out to be a good 100 deliveries. If possible, I'm available tonight since I have to figure out what is going on.

Also, if we can put the old lookup-domain.d(p) back, I think that would be a plus for us, since I would rather slow down delivery than lose all communication until it clears out.

One last question. I followed your instructions to remove ClamAV for all our domains and then disabled it through Vm, but I'm still seeing ClamD pop up and eating CPU occasionally when running atop -D. What am I missing? I was trying to disable AV from all processing to help until the new server is in or we identify the anomaly.

I've attached a replacement for lookup-domain-daemon.pl to this bug report, which can switch to "serial" mode. What you need to do is :

  1. Save it as /usr/share/webmin/virtual-server/lookup-domain-daemon.pl
  2. Add the line lookup_domain_serial=1 to /etc/webmin/virtual-server/config
  3. Run /etc/init.d/lookup-domain restart
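Step 2 above can be done by hand in any editor. For anyone scripting it, here is a minimal sketch, assuming the Virtualmin config file is a plain list of key=value lines. The `ensure_setting` helper is a made-up name for this example and is not part of Webmin or Virtualmin. It is written so that running it twice does not add the line twice.

```python
from pathlib import Path

def ensure_setting(config_path, key, value):
    """Set key=value in a key=value file; return True if the file changed."""
    path = Path(config_path)
    lines = path.read_text().splitlines() if path.exists() else []
    wanted = f"{key}={value}"
    if wanted in lines:
        return False  # already set, nothing to do
    # Drop any older value for the same key, then append the new one.
    lines = [ln for ln in lines if not ln.startswith(f"{key}=")]
    lines.append(wanted)
    path.write_text("\n".join(lines) + "\n")
    return True

# Intended use on the server (as root), followed by the restart in step 3:
# ensure_setting("/etc/webmin/virtual-server/config", "lookup_domain_serial", "1")
```

Taking a copy of the config file before changing it is a sensible precaution on a production box.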

Let us know if that helps..

Fantastic. Great opportunity to test it out as well. I've set up two mailing lists, one for each of the clients that sends to all their users. One has 130 accounts and the other has 28 accounts. The mail utility I'm using can be set to deliver metered sends. I have another 2 of them I have to send to the 130 user account. I know how it performed with the "new" lookup-domain-d, so I'll apply the above and do the next one to see how it reacts.

I have it set to send 10 emails and pause 15 seconds before sending the next 10. Since you have an account on our server to SSH in, do you want to open up top and watch the reaction?
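The metered send described here (10 emails, then a 15 second pause) works out as follows for a 130 recipient list. This is a rough sketch: the `metered_batches` helper and the example.com addresses are placeholders, and the time spent actually sending each batch is ignored.

```python
def metered_batches(recipients, batch_size=10, pause_seconds=15):
    """Yield (start_offset_seconds, batch) pairs for a metered send."""
    for i in range(0, len(recipients), batch_size):
        yield (i // batch_size) * pause_seconds, recipients[i:i + batch_size]

# Placeholder addresses standing in for the 130 account list.
recipients = [f"user{n}@example.com" for n in range(130)]
schedule = list(metered_batches(recipients))

# 13 batches; the last one starts 180 seconds after the first.
print(len(schedule), schedule[-1][0])
```

So the whole list is spread over about three minutes, which gives a sense of how quickly the incoming messages reach lookup-domain during the test.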

I just tried the "new" old file and it had a positive impact. I was able to let the send continue up to about 100 emails (15 seconds between each group of 10) and the load went up to about 7.5 and idle CPU went to 5% after a while and then I paused it so the system could catch up. During the processing, I did see clamd pop up a couple times with high CPU util, but that is supposed to be off. I used the method that Vm provided, but it still seems to be running. What's the best option to shut off?

PS. I'm ready for another, You want to log in and watch?

Ok, I am logging in now ..

It completed the send. Of course this was the first one in about 1.5 weeks that did not go crazy. That's how it's supposed to be. It got much better once I applied the lookup-domain fix you sent. The last one I did shot load to about 20 and idle to 0%. I think it's a disk i/o bottleneck.

Great! Disk IO certainly could be an issue ... or lack of memory.

I think it's Disk I/O. I have 10 GB RAM on the box. Any way to allow more memory to lessen the disk I/O on anything?

If you have 10 GB, RAM probably isn't the issue .. but disk IO may be.