CentOS buffer timeout

Jamie, this is either a really big one, or else I've just screwed up the server config.

My users in VMPro are seeing constant socket timeouts when fetching POP3 mail, sending SMTP, connecting over FTP, or sometimes even HTTP.

This started a few weeks ago and may have been correlated with an upgrade, but I'm not sure at this point.

Server: Linux 2.6.18-164.el5 on x86_64, CentOS Linux 5.3, Virtualmin 3.75 Pro, 5.82 GB RAM total, 968.71 MB used

I've tuned the kernel every way I can think of to support lots of sockets, but it still times out. It may be fine for an hour, then it will start locking up every few minutes. I have Nagios checking SMTP, FTP, and POP3 every minute.

Any ideas?

Status: 
Closed (fixed)

Comments

Submitted by Joe on Tue, 12/22/2009 - 15:33 Pro Licensee

Any clues in the maillog?

Anything relevant in the kernel log? (dmesg)

Likewise, /var/log/messages?

There are no error messages relevant to the timeouts that I can find in any log. I checked the /var/log folder and looked at every log with timestamps indicating it could have entries correlated with the problem. I also checked the HTTP logs in each hosted domain to make sure none were causing a problem.

It just stops accepting new sockets for a while, then starts again. Webmin hangs for a few seconds, the SSH console hangs, and POP3, SMTP, and IMAP also hang.

The system is straight from the CentOS ISO with the VMPro install script, set up a few months ago. Since the problem started I have of course gone through many CentOS performance tuning steps... all have failed.

Most of the sockets eventually get through, since the delay clears before 60 seconds. Nagios has a default 10-second timeout; I tweaked that to 15 seconds and it didn't change the frequency of failures much. Since most connections -eventually- get through, that would explain why there are few or no error log traces.

There are only 275 users and 15 domains on this machine. The FreeBSD machine next to it has over 1000 users (1 domain) with all the same basic services running - web, SMTP, FTP, POP3, IMAP, DNS - but no delays.

Unfortunately, the domain on the FreeBSD machine cannot co-exist with the domains on the CentOS machine due to differences in the default naming of users.

Howdy -- here's a few questions regarding the issues you're seeing --

  1. What is your system's load average during one of the times you're having this trouble? You can see that by typing "uptime".

  2. How many active connections do you have to your server? You can see that with "netstat -an|grep tcp|wc -l"

  3. Can you run the "dmesg" command, and attach the output of that to this bug report?

  4. What is the output of the "free" command?

I've scripted your suggested commands, and the next time I catch the system lagging I'll run it and attach the output.
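Roughly, the script is just your four commands appended to a log with a timestamp (the log path and the dmesg tail length are arbitrary choices on my part):

```shell
#!/bin/sh
# snapshot.sh - grab load, memory, socket count, and recent kernel
# messages whenever the stall is caught in the act
LOG=/tmp/stall-snapshot.log
{
    date
    uptime
    free
    command -v netstat >/dev/null &&
        echo "tcp sockets: $(netstat -an | grep -c tcp)"
    dmesg 2>/dev/null | tail -20
    echo '----'
} >> "$LOG"
```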

During freeze:

running free:

                 total       used       free     shared    buffers     cached
    Mem:       6098128    3507456    2590672          0     322436    2078856
    -/+ buffers/cache:    1106164    4991964
    Swap:      6094840          0    6094840

running uptime:

    18:26:58 up 4 days, 7:03, 1 user, load average: 4.97, 2.68, 2.09

number of connections (netstat -an|grep tcp|wc -l):

    65

During normal operations:

running free:

                 total       used       free     shared    buffers     cached
    Mem:       6098128    3498608    2599520          0     322444    2079104
    -/+ buffers/cache:    1097060    5001068
    Swap:      6094840          0    6094840

running uptime:

    18:30:39 up 4 days, 7:06, 1 user, load average: 0.24, 1.51, 1.76

number of connections (netstat -an|grep tcp|wc -l):

    90

dmesg dump txt file attached

In the procmail log I'm getting a bunch of ClamAV errors - they happen with every email. I tried to turn off ClamAV long ago but apparently failed.

    ERROR: Can't connect to clamd: No such file or directory

I've turned off FTP and HTTP and the delays still happen. I'm betting on this being spam related... a burst of connections comes in and swamps the system somehow. I've tried seriously increasing the connection capacity without a cure. I'll try limiting inbound mail to 1 delivery address per connection to see if that smooths out the surges.
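Assuming the MTA here is Postfix (the Virtualmin default), the closest knob I know of is the per-message recipient limit - a sketch of the change, not a recommendation:

```shell
# Cap each SMTP message at one recipient so a multi-recipient spam burst
# turns into many small transactions instead of one big delivery fan-out.
# (smtpd_recipient_limit is a standard Postfix parameter; needs root.)
postconf -e smtpd_recipient_limit=1
postfix reload
```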

I have stats collection running every minute, and graphed a good correlation to the problem - jpeg attached.

It appears the load spikes every 12 to 15 minutes and is followed by a spike in received emails.

I've turned off ClamAV completely and turned off Bayes in SpamAssassin to reduce load, to little effect.

Reviewing the detailed mail and procmail logs to see if there is a particular source.

(deleted excerpt from procmail log... it was a jumbled mess)

JPEG with CPU load and emails per minute attached.

This jpg, I believe, shows iowait CPU load delaying the delivery of emails. When the iowait clears, the backlogged emails come through in a burst. Here is a summary from dmesg:

    [root@home ~]# dmesg -c | egrep "READ|WRITE|dirtied" | egrep -o '([a-zA-Z]*)' | sort | uniq -c | sort -rn | head
        942 kjournald
        736 repquota
        356 pdflush
        129 collectinfo
         62 lookup
         52 httpd
         49 syslogd
         48 spamd
         39 pop
         27 procmail
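For anyone following along: those READ/WRITE/dirtied lines only appear in dmesg while the kernel's block-dump switch (vm.block_dump=1) is enabled, which is presumably how the summary above was captured. The aggregation itself is plain text munging; here it is on a couple of made-up sample lines:

```shell
# Group block-dump style lines by the process name before the "(pid)":
printf '%s\n' \
    'kjournald(2021): WRITE block 7340072 on sda2' \
    'kjournald(2021): WRITE block 7340080 on sda2' \
    'repquota(9917): READ block 1204 on sda2' |
    sed 's/(.*//' | sort | uniq -c | sort -rn
# top line of output: "      2 kjournald"
```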

Haven't quite figured out what to do next to track this down.

I've tried turning off journaling by changing fstab to ext2 and rebooting. That did not work, as the source I followed forgot to mention clearing the journal and deleting the journal file before the fstab change. I should research two sources before making arbitrary changes :>) The system still works the same as before, but I'm hesitant to try changing ext2/ext3 settings again.
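For the record, the journal half of an ext3-to-ext2 conversion is done with tune2fs, not fstab. Demonstrated here on a scratch image file so no real disk is touched (requires e2fsprogs; the image path is arbitrary):

```shell
# Build a small ext3 filesystem in a file, then strip its journal:
dd if=/dev/zero of=/tmp/demo.img bs=1M count=8 2>/dev/null
mke2fs -q -F -j /tmp/demo.img            # -j = create with a journal (ext3)
tune2fs -O ^has_journal /tmp/demo.img    # remove the journal -> plain ext2
e2fsck -fy /tmp/demo.img >/dev/null      # fsck, then set the fstab type to ext2
```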

I've ordered a replacement hard drive and will clone the system. With a cloned disk I'll try making the complete transition from ext3 to ext2. Also, the new disk may fix the problem.

I've adjusted vm.dirty_background_ratio=5 to see if that impacts anything - making pdflush work more often with a smaller amount of cached writes. The period and magnitude of the cycle stay the same. I'll try increasing it to 20.
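The knobs in question live under /proc/sys/vm; reading them needs no root (the write does):

```shell
# pdflush starts background writeback at this % of dirty memory:
cat /proc/sys/vm/dirty_background_ratio
# above this %, processes are forced to write back synchronously:
cat /proc/sys/vm/dirty_ratio
# The change tried above (root; same as sysctl -w vm.dirty_background_ratio=5):
# echo 5 > /proc/sys/vm/dirty_background_ratio
```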

Submitted by Joe on Fri, 12/25/2009 - 14:09 Pro Licensee

I really don't think this is a journalling or filesystem issue. In fact, I'm almost completely sure of it. (ext3 is faster than ext2 for most classes of problem, not slower...and while the default configuration might not be right for all loads, it's usually not the first or even second or third bottleneck that heavily loaded servers run into.)

I'd suspect network issues. I've seen similar behavior in the past on high-load web caches when there was an MTU mismatch (due to virtual networking happening at the router), duplex/link speed mismatches, or syncing problems when the speed was changing rapidly (generally also due to router configuration).
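Quick ways to check for those mismatches from the box itself (eth0 is a placeholder for your interface name):

```shell
IF=eth0    # placeholder - substitute the real interface
# Negotiated speed/duplex (mismatches with the switch show up here):
command -v ethtool >/dev/null && ethtool "$IF" | egrep 'Speed|Duplex'
# MTU plus RX/TX error and drop counters:
ip -s link show "$IF" 2>/dev/null | head -6
```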

The clamav daemon not running would absolutely cause a huge backlog (every mail would have to time out on the virus check before being delivered), and would tie up lots of connections. When you say you "turned off" ClamAV, did you just stop the service, or did you actually configure Virtualmin not to do virus scanning? The former would pretty much guarantee the backlog.

Would it be possible for me to drop in on this box? I might be able to spot the trouble if I could interact with it for a few minutes while it is under duress.

Sidenote - I've found that in general you (i.e. me) should not just turn off something like ClamAV in the system settings if it is already turned on. The virtual servers (at least on my systems) do not clean themselves of the config settings. I had to re-enable ClamAV in Virtualmin, go to each virtual server and turn off ClamAV, then disable ClamAV again in VM.

To answer... for the past 48 hours ClamAV has been totally turned off and there are no more log messages complaining about "could not find ClamAV". The iowait problem is still happening.

I changed the fstab settings to "noatime" and the high-CPU-load cycle appears to have changed. Without the atime updates there are fewer inodes to flush from the cache, and the time the system is stalled on each cycle appears to be reduced.
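For reference, the change is one mount option in /etc/fstab plus a remount (the device and mount point here are placeholders):

```shell
# /etc/fstab - add noatime to the options column:
#   /dev/sda2   /   ext3   defaults,noatime   1 1
# Apply it without a reboot:
mount -o remount,noatime /
```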

The best diagnostics right now are the System Statistics - CPU 1 min graph in VM, showing the periodic nature of the CPU spikes, and top with "1" and "i" toggled to show individual core iowaits and hide idle processes.
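The same iowait numbers are also available outside of top for watching from a plain terminal (iostat needs the sysstat package on CentOS; everything reads /proc/stat underneath):

```shell
# One line per interval; "wa" = iowait %, "b" = processes blocked on I/O:
command -v vmstat >/dev/null && vmstat 1 2
# Per-device utilization and wait times (sysstat package):
# iostat -x 5
# Raw counters; the fifth number after "cpu" is cumulative iowait ticks:
grep '^cpu ' /proc/stat
```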

Joe - I'll email you login info to the machine.

Joe, this is still a problem.

I've been troubleshooting for a long time now and have tuned the heck out of the server; the problem is still there, but the impact is less.

There is a CPU surge every 12 minutes 45 seconds, +/- 15 seconds... up to 800% use (8 cores). Top reports that the use is entirely iowait. It appears that the program using the CPU is the journal or a flush command.

If I block all incoming mail (port 25) via a remote firewall/router, the CPU cycling stops. If I allow incoming mail again, it starts.
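The same experiment can be run from the box itself, and the connection states on port 25 during a stall are worth counting too (the iptables lines need root; a sketch):

```shell
# Temporarily drop inbound SMTP at the host (root required):
iptables -I INPUT -p tcp --dport 25 -j DROP
# ... watch a couple of the 12-minute cycles, then restore:
iptables -D INPUT -p tcp --dport 25 -j DROP

# Meanwhile, tally connection states on port 25 - a pile-up of SYN_RECV
# or CLOSE_WAIT during a stall points at the MTA rather than the disk:
netstat -ant | awk '$4 ~ /:25$/ {print $6}' | sort | uniq -c
```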

Today I'm going to try turning off SPAMAssassin to see if that changes the results.

Submitted by Joe on Sun, 01/10/2010 - 00:33 Pro Licensee

I kinda ran out of ideas.

I did see some erratic behavior in disk I/O performance when running hdparm -tT, though I don't understand sdparm well enough to tweak the settings in any useful way. I'm going to guess there is a bug in the disk controller driver in the kernel, or possibly buggy hardware. I don't really know how to address it beyond swapping hardware or hoping that a kernel upgrade corrects the problem. We've run up against the limits of my hardware/kernel knowledge.

I used to be much more knowledgeable about hardware when my job was building hardware, but these days, I almost never touch the stuff. So all the tools (like sdparm for modern disks) are new to me and I have no idea what to look for, in terms of obvious gotchas (in the past, I would expect to find the disk configured to use an old mode rather than a DMA mode, but that isn't possible with SATA drives).

In short, I don't know how to fix it, and I'm not even sure how to begin troubleshooting beyond the obvious stuff (which we've already pretty much ruled out): memory, CPU contention, file handles, etc.

One last stab at the problem: Are you running the latest available kernel? Maybe it's a driver bug that's been fixed.

Joe... problem fixed.

After adjusting every setting I could, the problem would not go away. I plugged the hard drive into a duplicate server, and the problem did not go away. So...

I installed CentOS on a new hard drive, installed VMPro via the script, and restored the individual domains via the backup/restore process. Once that was done, I tweaked the various settings that did not carry over. The new drive and OS worked fine... no drop-outs at all.

I took the old disk and ran extensive diagnostics on it, and found no hardware problem. I assume that the OS/Webmin/VM/applications had been corrupted somehow. Unless you are very curious and would like a disk image to play with, I'll be wiping the old disk soon and reusing it elsewhere.

As part of the process, though, I did buy a 2nd copy of VM Pro... so now I have a 50-domain version running on CentOS and a 10-domain version on FreeBSD.

thanks for the help... Steve

Great! Sounds like the underlying cause was some kind of filesystem corruption...

Automatically closed -- issue fixed for 2 weeks with no activity.