One of our servers suffered a massive overload at peak hours an hour ago because swapping kicked in, and from the top output below it looks like the cause was the monitor.pl process eating up lots of RAM:
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
31725 root 20 0 1656m 1.5g 3576 D 8 43.3 0:14.00 monitor.pl
In the urgency we had to kill that process, but then had to take the server down and finally reboot it, because the heavy swapping left it unable to recover.
How can monitor.pl even be able to use that much RAM?
That server usually runs fine with 2.5 GB and has 3.5 GB of RAM allocated. But the combined RAM usage, and most probably the disk access contention generated by that process, just killed the whole server.
btw "Collect all available package updates" was off if that matters (for collectinfo.pl).
I'm stumped and worried that this may happen again on that high-traffic server.
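To catch a repeat early, the top line above can be parsed and checked against a resident-memory threshold. This is only a minimal sketch: the function names and the 1024 MiB limit are assumptions made for illustration, and it assumes the default top column order shown in the header line.

```python
# Sketch: flag a memory-heavy process from one line of `top` output.
# The 1024 MiB default threshold is an assumption for illustration only.

UNIT_TO_MIB = {"k": 1 / 1024, "m": 1.0, "g": 1024.0, "t": 1024.0 * 1024.0}


def size_to_mib(field):
    """Convert a top memory field such as '1656m', '1.5g' or '3576' to MiB."""
    suffix = field[-1].lower()
    if suffix in UNIT_TO_MIB:
        return float(field[:-1]) * UNIT_TO_MIB[suffix]
    return float(field) / 1024  # no suffix: the value is in KiB


def parse_top_line(line):
    """Split a process line laid out as
    PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND."""
    f = line.split(None, 11)
    return {
        "pid": int(f[0]),
        "user": f[1],
        "virt_mib": size_to_mib(f[4]),
        "res_mib": size_to_mib(f[5]),
        "state": f[7],
        "mem_pct": float(f[9]),
        "command": f[11].strip(),
    }


def is_memory_hog(proc, res_limit_mib=1024):
    """True when resident memory exceeds the (hypothetical) limit."""
    return proc["res_mib"] > res_limit_mib


# The line reported in this thread.
sample = "31725 root 20 0 1656m 1.5g 3576 D 8 43.3 0:14.00 monitor.pl"
proc = parse_top_line(sample)
```

For the reported line, RES works out to 1536 MiB, so `is_memory_hog(proc)` is true with the assumed limit. The D state in that line means uninterruptible sleep, usually waiting on I/O, which fits a process stuck in swap.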
Comments
Submitted by andreychek on Wed, 01/20/2010 - 10:36 Comment #1
Howdy -- we may need to get Jamie's input on the specifics there as to why it's using such a decent chunk of RAM.
In the meantime though, you may want to start out by temporarily disabling the status monitoring feature (In System Settings -> Features and Plugins -> Status Monitoring). That should prevent monitor.pl from firing up again until you re-enable it.
A few questions to help get an idea of what's going on --
How many Virtual Servers do you have on your system?
What is the output of "free" on your server now?
What version of Virtualmin are you using -- are you by chance using the latest, 3.76?
Are you using Cloudmin on this server? If so, which version of that do you have?
Thanks!
Submitted by beat on Wed, 01/20/2010 - 10:59 Comment #2
             total       used       free     shared    buffers     cached
Mem:       3578188    3249788     328400          0      40204    1829420
-/+ buffers/cache:    1380164    2198024
Swap:      4194296          8    4194288
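For reading the numbers above: the "-/+ buffers/cache" row is derived from the "Mem:" row by subtracting buffers and page cache from "used" and adding them to "free". A small sketch that reproduces that arithmetic (function names are made up for illustration; values are in KiB, as printed by this older `free` layout):

```python
def parse_free(text):
    """Parse the 'Mem:' and 'Swap:' rows of `free` output into lists of ints."""
    rows = {}
    for line in text.splitlines():
        parts = line.split()
        if parts and parts[0] in ("Mem:", "Swap:"):
            rows[parts[0].rstrip(":")] = [int(x) for x in parts[1:]]
    return rows


def app_memory(mem_row):
    """Return (memory used by applications, memory effectively available)."""
    total, used, free, shared, buffers, cached = mem_row
    used_by_apps = used - buffers - cached
    available = free + buffers + cached
    return used_by_apps, available


# The output posted in this comment.
FREE_OUTPUT = """\
total used free shared buffers cached
Mem: 3578188 3249788 328400 0 40204 1829420
-/+ buffers/cache: 1380164 2198024
Swap: 4194296 8 4194288
"""

rows = parse_free(FREE_OUTPUT)
used_by_apps, available = app_memory(rows["Mem"])
```

So after the reboot, applications hold about 1.3 GB, roughly 2.1 GB is free or reclaimable cache, and swap is essentially untouched (8 KiB used).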
Submitted by JamieCameron on Wed, 01/20/2010 - 12:17 Comment #3
Using 1.5G of RAM is crazy, as all monitor.pl does is check the status of various servers and websites.
What monitors do you have defined at Webmin -> Others -> System and Server Status? A screenshot would be useful..
Also, if you kill monitor.pl and re-run it manually, does it use up 1.5G of RAM again?
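One way to answer the manual re-run question with a number is to start the command from a small wrapper and read the peak resident size afterwards. This is a sketch only, assuming a Unix system with Python's `resource` module; the function name is made up, and the command to pass in is whatever invocation of monitor.pl applies on your system (its path is not given in this thread).

```python
import resource
import subprocess
import sys


def peak_child_rss(cmd):
    """Run cmd to completion and return (exit_code, ru_maxrss).

    ru_maxrss is the high-water mark over all children this wrapper has
    waited for so far, so run one command per wrapper invocation.
    On Linux the value is in KiB.
    """
    proc = subprocess.run(
        cmd, stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL
    )
    usage = resource.getrusage(resource.RUSAGE_CHILDREN)
    return proc.returncode, usage.ru_maxrss


# Harmless demonstration command: a Python interpreter that exits at once.
exit_code, max_rss = peak_child_rss([sys.executable, "-c", "pass"])
```

A peak anywhere near 1.5 GB on a manual run would point at one of the configured monitors rather than a one-off event.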
Submitted by beat on Wed, 01/20/2010 - 17:29 Comment #4
I never saw it at 1.5 GB again.
Our "System and Server Status" : all green except last one:
Monitoring                        On host  Status
Website site1.com                 Local
Website site1.com (SSL)           Local
Postfix Server                    Local
BIND DNS Server                   Local
Website site2.com (SSL)           Local
Apache Webserver                  Local
MySQL Database Server             Local
Website site2.com                 Local
PostgreSQL Database Server        Local    down
(PostgreSQL is down because it is not needed and not completely installed. I haven't yet looked into why, but as it's off and unneeded on that server, it's on the to-do list.)
Submitted by beat on Wed, 01/20/2010 - 17:53 Comment #5
If that matters, here is the error when attempting to finish configuring PostgreSQL through aptitude U:
Setting up postgresql-8.3 (8.3.9-0ubuntu8.04) ...
* Starting PostgreSQL 8.3 database server
* The PostgreSQL server failed to start. Please check the log output:
2010-01-21 00:37:16 CET FATAL: could not load server certificate file "server.crt": Permission denied
...fail!
invoke-rc.d: initscript postgresql-8.3, action "start" failed.
dpkg: error processing postgresql-8.3 (--configure):
subprocess post-installation script returned error exit status 1
dpkg: dependency problems prevent configuration of postgresql:
postgresql depends on postgresql-8.3; however:
Package postgresql-8.3 is not configured yet.
dpkg: error processing postgresql (--configure):
dependency problems - leaving unconfigured
That's from a plain Virtualmin Pro installation on a new Ubuntu 8.04 LTS 64-bit server (Xen instance). I remember we had to comment out a line in a PostgreSQL configuration file on another server to get it running, but didn't remember right away which one. You may want to fix that in a future Virtualmin installer.
Looks like this: http://www.mail-archive.com/ubuntu-bugs@lists.ubuntu.com/msg1462324.html
Tried this: https://bugs.launchpad.net/ubuntu/+source/postgresql-8.3/+bug/370422
sudo adduser postgres ssl-cert
The user `postgres' is already a member of `ssl-cert'.
Finally remembered and changed this line to false:
ssl = true # (change requires restart)
in /etc/postgresql/8.3/main/postgresql.conf (as our firewall doesn't allow external database access that can be ok for now, but the bug is elsewhere)
Again, I doubt it's related, but I'm including it just in case, for completeness.
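For anyone scripting the same workaround, the change to the ssl line can be sketched as a plain text rewrite. This is an illustrative helper only (the function name is made up); it works on the file contents as a string, leaves commented-out lines and other settings such as ssl_ciphers alone, and PostgreSQL still needs a restart afterwards, as the inline comment in the config says.

```python
import re

# Matches an active "ssl = <value>" line, keeping indentation and any
# trailing comment; "#ssl = ..." and "ssl_ciphers = ..." do not match.
SSL_LINE = re.compile(r"^(\s*)ssl\s*=\s*\S+(.*)$")


def set_ssl(conf_text, enabled):
    """Return conf_text with the active ssl setting set to true or false."""
    value = "true" if enabled else "false"
    out = []
    for line in conf_text.splitlines():
        m = SSL_LINE.match(line)
        if m:
            line = f"{m.group(1)}ssl = {value}{m.group(2)}"
        out.append(line)
    return "\n".join(out) + "\n"


# The line quoted above, plus two lines that must stay untouched.
before = (
    "ssl = true # (change requires restart)\n"
    "#ssl_ciphers = 'ALL'\n"
    "port = 5432\n"
)
after = set_ssl(before, enabled=False)
```

Disabling SSL only sidesteps the "Permission denied" on server.crt; as noted, the underlying permission problem is elsewhere.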
Submitted by JamieCameron on Wed, 01/20/2010 - 18:48 Comment #6
So monitor.pl doesn't spike up to 1.5GB anymore?
I wonder, did perhaps any of the sites being monitored have a huge file as their index page?
Submitted by beat on Wed, 01/20/2010 - 18:53 Comment #7
As I said, I haven't seen any spikes, neither before nor after that single event. We do continuous monitoring and logging of our servers, and 1.5 GB of RAM use would show up on the graphs...
The homepages are pretty normal, and I doubt the pages could have returned more than 8 MB, as that's our limit for HTTP replies in mod_security.
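On the huge-index-page theory: a status checker can protect itself by capping how much of a response it reads, independent of any server-side limit. The sketch below is not how monitor.pl works (its implementation is not shown in this thread); the function name and chunk size are assumptions, and it accepts any binary file-like object, such as the response returned by `urllib.request.urlopen`.

```python
import io


def read_capped(stream, limit_bytes, chunk_size=64 * 1024):
    """Read at most limit_bytes from a binary stream.

    Returns (data, truncated), where truncated is True when the stream
    still had data left after the limit was reached.
    """
    chunks = []
    total = 0
    while total < limit_bytes:
        chunk = stream.read(min(chunk_size, limit_bytes - total))
        if not chunk:
            # Stream ended before the limit: nothing was cut off.
            return b"".join(chunks), False
        chunks.append(chunk)
        total += len(chunk)
    # Limit reached: probe one more byte to see whether data remains.
    truncated = bool(stream.read(1))
    return b"".join(chunks), truncated


# In-memory stand-ins for a small and an oversized HTTP body.
small_body, small_cut = read_capped(io.BytesIO(b"x" * 10), 100)
big_body, big_cut = read_capped(io.BytesIO(b"x" * 1000), 100)
```

With a cap like this, memory use per check stays bounded by the limit no matter what the site returns.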
Submitted by JamieCameron on Wed, 01/20/2010 - 20:01 Comment #8
Hmm .. it is kind of hard to debug this then, unless it happens repeatedly (not that we'd really want that).