CentOS 5.6 Upgrade: CPU load increases and server becomes unresponsive [#17978]

Submitted by tellis on Fri, 04/22/2011 - 16:46 Pro Licensee

Hi - We've upgraded two of our VirtualMin boxes to CentOS 5.6 over the last three days and we're seeing odd behavior. Screenshots of the server statistics graphs are attached. They show CPU load increasing at a steady rate (a 45-degree angle, which is weird), and while this happens the machines become unresponsive. We cannot SSH in or get HTTP traffic from Apache, although the machines do respond to ping. Again, this started happening immediately after upgrading from CentOS 5.5 to 5.6 through the VirtualMin interface. All other requested patches from the VirtualMin repository had been applied prior to the CentOS 5.6 upgrade. Has anyone else seen this behavior? We cannot pin down the offending service.

Status:

Active

Comments

Submitted by tellis on Fri, 04/22/2011 - 16:48 Pro Licensee Comment #1

Also, the file attachment uploader is not working - tried on Firefox and IE (Win 32) - happy to e-mail my screen shots! Thanks!

Submitted by JamieCameron on Fri, 04/22/2011 - 17:03 Comment #2

Can you email them to me at jcameron@virtualmin.com?

Also, can you run top when this load is happening, and if so what does it report is the top process by CPU usage?

Submitted by andreychek on Fri, 04/22/2011 - 17:07 Comment #3

Well, the actual process of upgrading all those packages can be a bit CPU intensive.

However, I haven't run into anyone who has had trouble performing the upgrade, or difficulty accessing the server during that time.

The key would be to determine what's going on with the processes on your server at the moment.

Normally, I'd suggest doing that by logging in via SSH, and using a tool such as "top".

If SSH isn't working for you, are you able to use Virtualmin, and browse to Webmin -> System -> Running Processes? That area may help give you an idea of what's utilizing your CPU.

Submitted by tellis on Fri, 04/22/2011 - 18:12 Pro Licensee Comment #4

Hi Jamie

I've e-mailed the screenshots to you. top revealed nothing suspect: memory consumption on both servers was well below physical RAM capacity and CPU utilization was nowhere near pegged. top looked like a normal, healthy system. One server was upgraded three days ago. The behavior happened again this morning on server "A", after we had stopped the following services, which appeared to be newly started at boot following the CentOS 5.6 upgrade:

libvirt - all
iscsi - all
qemu
snmptrapd
dovecot
mcstrans
pcscd
xend
xendomains
HALdaemon
hplip

Server "B" had a subset of these launched but I've turned them off as well after "B" became unresponsive yesterday afternoon. Both servers are presently at runlevel 3 (they were at 5 when these episodes happened) and we're watching them closely. These servers have been in production for 18 months and were stable on CentOS 5.4 and 5.5 - no problems.

Submitted by tellis on Fri, 04/22/2011 - 18:16 Pro Licensee Comment #5

Hi andreychek

The instability started hours after a successful upgrade via Virtualmin. As I mentioned to Jamie, top revealed nothing and once the server becomes unresponsive, Virtualmin cannot be accessed.

Thanks

Tony

Submitted by JamieCameron on Fri, 04/22/2011 - 19:07 Comment #6

Thanks for those graphs - they aren't too useful unfortunately, as clearly logging has completely halted during the failure period, which causes those straight lines.

Another thing to look at in top is sorting processes by memory usage, which you can get by hitting M.

Also, when the problem is happening, try running vmstat 1 to see IO load. This is shown in the bi and bo columns.
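As a rough sketch of how the vmstat output suggested above could be captured and summarized, the Python below parses text in the usual `vmstat 1` layout and picks out the bi and bo columns. It also reports `b` and `wa`, which are not mentioned in this thread but are standard vmstat columns (processes in uninterruptible sleep, and percentage of CPU time spent waiting on I/O). The function name `parse_vmstat` and all sample numbers are invented for illustration and are not taken from the affected servers.

```python
def parse_vmstat(text):
    """Parse `vmstat` output into a list of dicts keyed by column name."""
    rows = []
    header = None
    for line in text.splitlines():
        fields = line.split()
        if not fields:
            continue
        # The column-name line is the one containing both "bi" and "bo".
        if "bi" in fields and "bo" in fields:
            header = fields
            continue
        # Data lines are all integers and have as many fields as the header.
        if header and len(fields) == len(header) and all(f.isdigit() for f in fields):
            rows.append(dict(zip(header, map(int, fields))))
    return rows


# Illustrative sample only: these numbers are made up.
SAMPLE = """\
procs -----------memory---------- ---swap-- -----io---- --system-- -----cpu------
 r  b   swpd   free   buff  cache   si   so    bi    bo   in   cs us sy id wa st
 0  0      0 512000  80000 900000    0    0     5    12  100  200  2  1 96  1  0
 1  6      0 511000  80000 900500    0    0  4200  9800  900 1500  3  2 20 75  0
"""

if __name__ == "__main__":
    for row in parse_vmstat(SAMPLE):
        print("blocked=%(b)d bi=%(bi)d bo=%(bo)d iowait=%(wa)d%%" % row)
```

Redirecting `vmstat 1` to a file while the problem is building and feeding that file to a parser like this may help show whether blocked processes and I/O wait climb even though CPU usage looks normal in top.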

Submitted by tellis on Mon, 04/25/2011 - 10:52 Pro Licensee Comment #7

More info: One of our two servers became unresponsive again. This is what was on tty1 when I checked the physical machine:

audit: backlog limit exceeded
audit: audit_backlog=321 > audit_backlog_limit=320
printk: 2 messages suppressed
audit: audit_backlog=321 > audit_backlog_limit=320
audit: audit_lost=4352 audit_rate_limit=0 audit_backlog_limit=320
audit: backlog limit exceeded

I've increased auditd buffer length. Any thoughts on what's flooding audit or how to trace would be appreciated.

Thanks
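For anyone wanting to track these counters over time, here is a minimal Python sketch that pulls the `audit_*` values out of console or log lines like the ones quoted above and keeps the highest value seen for each. The helper name `summarize_audit` is hypothetical; the input lines are the ones from the tty1 output in this comment.

```python
import re

# Matches counters such as audit_backlog=321 or audit_lost=4352.
KEY_VALUE = re.compile(r"(audit_\w+)=(\d+)")


def summarize_audit(lines):
    """Return the highest value seen for each audit_* counter."""
    peaks = {}
    for line in lines:
        if "audit:" not in line:
            continue  # skip unrelated lines such as the printk notice
        for key, value in KEY_VALUE.findall(line):
            peaks[key] = max(peaks.get(key, 0), int(value))
    return peaks


# Lines as reported on tty1 in this thread.
CONSOLE = [
    "audit: backlog limit exceeded",
    "audit: audit_backlog=321 > audit_backlog_limit=320",
    "printk: 2 messages suppressed",
    "audit: audit_backlog=321 > audit_backlog_limit=320",
    "audit: audit_lost=4352 audit_rate_limit=0 audit_backlog_limit=320",
    "audit: backlog limit exceeded",
]

if __name__ == "__main__":
    for key, value in sorted(summarize_audit(CONSOLE).items()):
        print("%s = %d" % (key, value))
```

A steadily growing `audit_lost` value would suggest the audit daemon is not draining events at all rather than merely being briefly overloaded, though that is an interpretation, not something confirmed in this thread.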

Submitted by andreychek on Mon, 04/25/2011 - 11:26 Comment #8

One possible cause of those messages is if SELinux was enabled. Although Virtualmin disables that during installation, it may have been re-enabled somewhere along the way.

If you look at the file "/etc/selinux/config", what is "SELINUX" set to? It should normally be set to "disabled".

If you change that value, the server would need to be rebooted afterwards for that setting to take effect.
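If you'd rather script the check than read the file by eye, something like the following Python sketch could work. It only parses the text of a file in the `/etc/selinux/config` format and returns the SELINUX value. The function name `selinux_mode` and the sample file contents are assumptions for illustration; the real file on your server may contain additional comments or settings.

```python
def selinux_mode(config_text):
    """Return the SELINUX= value from selinux config text, or None if absent."""
    for line in config_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # ignore blank lines and comments
        key, sep, value = line.partition("=")
        # Match the SELINUX key exactly, not SELINUXTYPE.
        if sep and key.strip() == "SELINUX":
            return value.strip().lower()
    return None


# Assumed example contents, not copied from the servers in this thread.
SAMPLE_CONFIG = """\
# This file controls the state of SELinux on the system.
SELINUX=disabled
SELINUXTYPE=targeted
"""

if __name__ == "__main__":
    print(selinux_mode(SAMPLE_CONFIG))
```

Note that the file only reflects what will be applied at boot; the `getenforce` command reports the state of the running system.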

Submitted by tellis on Mon, 04/25/2011 - 11:45 Pro Licensee Comment #9

Hi - SELinux has been and remains disabled. Thanks for the suggestion.

Submitted by andreychek on Tue, 04/26/2011 - 08:36 Comment #10

Those messages you're seeing on the console should simultaneously be logged to /var/log/messages.

If you look in that logfile, do you see any other corresponding notices that may explain what's generating the audit messages?
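One way to hunt for those corresponding notices, sketched in Python under some assumptions: the helper below returns the non-audit lines that sit within a few lines of any "audit:" entry, so the neighbours of each burst can be read together. The name `lines_near_audit`, the `window` parameter, and every sample log line (including the `exampled` daemon and `hostA` hostname) are hypothetical and only there to show the idea.

```python
def lines_near_audit(lines, window=2):
    """Return non-audit lines within `window` lines of any 'audit:' line."""
    audit_at = [i for i, line in enumerate(lines) if "audit:" in line]
    keep = set()
    for i in audit_at:
        start = max(0, i - window)
        stop = min(len(lines), i + window + 1)
        for j in range(start, stop):
            if "audit:" not in lines[j]:
                keep.add(j)
    # Preserve the original log order.
    return [lines[j] for j in sorted(keep)]


# Invented sample in a syslog-like layout; not real output from this thread.
SAMPLE_LOG = [
    "Apr 25 09:00:01 hostA exampled[2100]: entry far from the audit burst",
    "Apr 25 09:10:00 hostA exampled[2100]: entry A",
    "Apr 25 09:10:01 hostA exampled[2100]: entry B",
    "Apr 25 09:10:02 hostA exampled[2100]: entry C",
    "Apr 25 09:14:58 hostA exampled[2100]: entry just before the burst",
    "Apr 25 09:15:00 hostA kernel: audit: backlog limit exceeded",
    "Apr 25 09:15:00 hostA kernel: audit: audit_backlog=321 > audit_backlog_limit=320",
    "Apr 25 09:15:01 hostA exampled[2100]: entry just after the burst",
]

if __name__ == "__main__":
    for line in lines_near_audit(SAMPLE_LOG, window=1):
        print(line)
```

To use it on a real server, the list could be built from the contents of /var/log/messages, for example with `open("/var/log/messages").read().splitlines()`.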

Submitted by bbuhlman on Fri, 04/29/2011 - 12:30 Comment #11

We have had to reload the OS on this server after moving the hard disks to identical server hardware failed to resolve our issues. You may temporarily see VirtualMin serial number 5557099 appear from two different machines while we complete the reload and restore.

We are still trying to figure out what in the CentOS 5.6 upgrade was responsible. The hardware is an HP ProLiant DL360 G3.

Submitted by andreychek on Fri, 04/29/2011 - 12:40 Comment #12

That's no problem with regard to the license. You may see a warning about it being used on multiple machines, but that warning will go away after a couple of days, once it no longer sees both IPs in use.

Submitted by tellis on Fri, 05/27/2011 - 17:05 Pro Licensee Comment #13

Update: We're still struggling with this issue on our HP DL360s and DL380s. I've come across the following RH bugs, which seem to point to the cciss kernel module for HP Smart Array. However, other controllers seem to be affected as well. RH is attributing the problem to firmware, not the OS. Considering LSI is involved too, I'm not sure about their conclusion. We've rolled back the kernel from 2.6.18-238.9.1.el5 (CentOS 5.6) to 2.6.18-194.el5, which takes the associated cciss kernel module from version 3.6.22-RH1 back to 3.6.20-RH4.

"System hang on access to cciss drive" https://bugzilla.redhat.com/show_bug.cgi?id=615543

"Server hangs, processes being blocked for more than 120 seconds" https://bugzilla.redhat.com/show_bug.cgi?id=605444

A key quote from this bug: "I have similar output on the screen and the server completely freezes which unfortunately leaves me without any log entries in messages or dmesg to paste here."

Thanks -Tony