Hi guys,
My production server periodically has load average spikes of 100-200. These last about 3-4 minutes, and then the system either goes completely south, requiring a reboot, or settles back down to its norm of about 1.
I notice that these events always start about 20-25 minutes past the hour and that collectinfo.pl is always in the "run" state from a ps taken during this time.
So I guess I'm looking to know a few things:
1) Are there any known issues with collectinfo.pl that could cause it to spin out of control and/or spawn processes erroneously?
2) What would be the impact if I turned it off?
I'm looking to find the root cause, since this has devastating effects on my customers' ecommerce sites. Their customers start experiencing timeouts and the phones start ringing.
Please help if you can.
Comments
Submitted by JamieCameron on Sun, 06/27/2010 - 16:23 Comment #1
collectinfo.pl runs pretty often - usually once every 5 minutes or so. So it isn't surprising to see it running, and it would likely take even longer when the system is loaded.
You could try running top to see what is using the most CPU time when this happens. Also, when top is running, hit M to see what is using the most RAM. Often high load can be caused by a process using too much memory.
Submitted by tbirnseth on Sun, 06/27/2010 - 16:43 Comment #2
Unfortunately, I can't usually get in when the load has spiked. I do capture some ps output, which is basically the same as top. An example of what I capture is included below.
What is the impact of turning off collectinfo.pl?
The version I'm looking at in cron runs at 21 minutes after every hour, not every 5 minutes. It's the only constant (other than the http/php-cgi activity) that I see across these "events".
Here is the ps output captured during today's "event" (using "ps -eflFH r"). Note that almost everything is in a "disk wait" state:
F S UID PID PPID C PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
4 D root 31841 31823 0 78 0 - 20587 sync_p 24276 1 11:21 ? 0:02 /usr/bin/perl /usr/libexec/webmin/virtual-server/collectinfo.pl
0 D ezom 31763 31761 0 76 0 - 43996 sync_p 4792 1 11:20 ? 0:00 /usr/bin/php -f ./launchDispatch.php
0 D root 31732 31730 0 78 0 - 12553 sync_b 15040 1 11:20 ? 0:01 /usr/bin/perl /usr/libexec/webmin/status/monitor.pl
4 D tackle 32417 31107 0 76 0 - 44997 sync_p 19132 1 11:25 ? 0:00 /usr/bin/php-cgi
4 D tackle 32393 31107 0 76 0 - 44997 sync_p 19236 1 11:25 ? 0:00 /usr/bin/php-cgi
4 R tackle 32286 31107 0 78 0 - 44324 - 14376 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D tackle 32273 31107 0 76 0 - 45970 sync_p 17652 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D tackle 32263 31107 0 76 0 - 44324 sync_p 12992 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D tackle 32257 31107 0 76 0 - 44324 sync_p 12856 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D tackle 32239 31107 0 76 0 - 45323 sync_p 15120 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D tackle 32213 31107 0 78 0 - 45323 sync_p 15092 1 11:23 ? 0:00 /usr/bin/php-cgi
4 R tackle 32208 31107 0 76 0 - 45579 - 15736 1 11:23 ? 0:00 /usr/bin/php-cgi
4 R tackle 32197 31107 0 78 0 - 44324 - 13136 1 11:23 ? 0:00 /usr/bin/php-cgi
4 D cork 31738 31107 0 76 0 - 47446 sync_p 15408 1 11:20 ? 0:04 /usr/bin/php-cgi
4 D 509 30066 31107 0 76 0 - 60552 sync_p 11880 1 10:42 ? 0:01 /usr/bin/php-cgi
4 D rmt 19978 31107 0 76 0 - 47719 sync_p 20620 1 06:20 ? 0:04 /usr/bin/php-cgi
4 D 509 15504 31107 0 76 0 - 61069 sync_p 8164 1 04:25 ? 0:32 /usr/bin/php-cgi
5 D apache 32242 23962 0 76 0 - 74720 sync_p 11484 0 11:23 ? 0:00 /usr/sbin/httpd
5 D apache 32240 23962 0 76 0 - 74720 sync_p 11068 1 11:23 ? 0:00 /usr/sbin/httpd
5 D apache 32180 23962 0 76 0 - 74785 sync_p 10572 1 11:22 ? 0:00 /usr/sbin/httpd
5 D apache 32164 23962 0 76 0 - 74720 sync_p 11608 1 11:22 ? 0:00 /usr/sbin/httpd
5 D apache 32159 23962 0 76 0 - 74720 sync_p 11616 1 11:22 ? 0:00 /usr/sbin/httpd
5 D apache 32044 23962 0 76 0 - 74785 sync_p 11088 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32043 23962 0 76 0 - 74720 sync_p 10992 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32032 23962 0 76 0 - 74785 sync_p 10592 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32022 23962 0 76 0 - 74720 sync_p 12196 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32013 23962 0 76 0 - 74720 sync_p 11120 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32004 23962 0 76 0 - 74720 sync_p 11300 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 32003 23962 0 77 0 - 74720 sync_p 10924 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31993 23962 0 76 0 - 74720 sync_p 11148 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31991 23962 0 76 0 - 74785 sync_p 11064 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31988 23962 0 76 0 - 92825 - 4436 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31983 23962 0 76 0 - 74785 sync_p 10588 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31982 23962 0 76 0 - 74785 sync_p 10724 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31966 23962 0 76 0 - 74785 sync_p 10684 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31962 23962 0 76 0 - 74785 sync_p 10804 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31960 23962 0 76 0 - 74785 sync_p 10788 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31958 23962 0 76 0 - 74785 sync_p 10736 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31956 23962 0 76 0 - 74785 sync_p 10892 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31952 23962 0 76 0 - 92825 - 4396 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31943 23962 0 76 0 - 92825 - 5032 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31937 23962 0 76 0 - 74785 sync_p 10768 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31935 23962 0 76 0 - 74785 sync_p 10780 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31921 23962 0 76 0 - 92825 - 4296 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31906 23962 0 76 0 - 92825 - 5224 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31898 23962 0 76 0 - 74785 sync_p 10720 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31897 23962 0 76 0 - 74785 sync_p 11272 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31896 23962 0 76 0 - 74720 sync_p 11584 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31895 23962 0 76 0 - 74785 sync_p 10844 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31882 23962 0 76 0 - 92825 - 4600 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31881 23962 0 76 0 - 92825 sync_b 4648 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31857 23962 0 76 0 - 92825 - 4416 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31856 23962 0 78 0 - 92825 - 4080 0 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31840 23962 0 76 0 - 92825 - 4404 1 11:21 ? 0:00 /usr/sbin/httpd
5 D apache 31807 23962 0 78 0 - 92825 - 4100 1 11:20 ? 0:00 /usr/sbin/httpd
5 D apache 31792 23962 0 76 0 - 92825 sync_p 3528 1 11:20 ? 0:00 /usr/sbin/httpd
5 D apache 31717 23962 0 76 0 - 92825 - 4616 1 11:19 ? 0:00 /usr/sbin/httpd
5 D apache 31497 23962 0 76 0 - 74720 sync_b 11224 1 11:13 ? 0:00 /usr/sbin/httpd
5 D apache 30571 23962 0 76 0 - 74850 sync_p 11680 0 10:55 ? 0:00 /usr/sbin/httpd
5 D apache 27407 23962 0 76 0 - 74850 sync_b 12064 1 09:33 ? 0:02 /usr/sbin/httpd
5 D apache 27058 23962 0 76 0 - 74850 sync_b 11448 1 09:22 ? 0:02 /usr/sbin/httpd
5 D apache 20992 23962 0 76 0 - 92825 stext 5340 1 06:49 ? 0:03 /usr/sbin/httpd
4 R root 32570 3810 4 78 0 - 16424 - 1004 0 11:28 pts/0 0:00 ps -eflFH r
4 D postfix 32479 2776 0 76 0 - 14167 sync_b 2992 1 11:26 ? 0:00 smtpd -n smtp -t inet -u -o smtpd_sasl_auth_enable yes -o smtp_bind_address 98.129.216.127
4 D postfix 32467 2776 0 76 0 - 13617 sync_p 2344 0 11:26 ? 0:00 cleanup -z -t unix -u
5 D root 17795 1 0 76 0 - 21839 sync_p 2748 1 Jun26 ? 3:21 /usr/libexec/webmin/virtual-server/lookup-domain-daemon.pl
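A listing this long is easier to read as a tally. The following is only a rough sketch in Python, assuming the 17-column layout shown in the header line above; the function name summarize_ps is made up for illustration. It counts processes by state and, for those in state D, by wait channel and command.

```python
from collections import Counter

def summarize_ps(text):
    """Tally process states, wait channels and commands from ps -eflF style output."""
    states = Counter()
    wchans = Counter()
    commands = Counter()
    for line in text.splitlines():
        # 16 fixed columns come first; everything after them is the command line.
        fields = line.split(None, 16)
        # Skip blank or truncated lines and the header row.
        if len(fields) < 17 or fields[1] == "S":
            continue
        state, wchan, cmd = fields[1], fields[10], fields[16]
        states[state] += 1
        if state == "D":
            # Only processes in uninterruptible sleep are broken down further.
            wchans[wchan] += 1
            commands[cmd.split()[0]] += 1
    return states, wchans, commands

# A few rows copied from the capture above.
sample = """\
F S UID PID PPID C PRI NI ADDR SZ WCHAN RSS PSR STIME TTY TIME CMD
4 D root 31841 31823 0 78 0 - 20587 sync_p 24276 1 11:21 ? 0:02 /usr/bin/perl /usr/libexec/webmin/virtual-server/collectinfo.pl
4 R tackle 32286 31107 0 78 0 - 44324 - 14376 1 11:23 ? 0:00 /usr/bin/php-cgi
5 D apache 32242 23962 0 76 0 - 74720 sync_p 11484 0 11:23 ? 0:00 /usr/sbin/httpd
5 D apache 31881 23962 0 76 0 - 92825 sync_b 4648 1 11:21 ? 0:00 /usr/sbin/httpd
"""

states, wchans, commands = summarize_ps(sample)
print("states:", dict(states))
print("wait channels (D only):", dict(wchans))
print("commands (D only):", dict(commands))
```

Feeding it the whole capture rather than the four sample rows should show at a glance how many processes are stuck in D and which wait channels dominate.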
Submitted by tbirnseth on Sun, 06/27/2010 - 16:46 Comment #3
Lovely, it formatted just fine in the edit window...
Submitted by JamieCameron on Sun, 06/27/2010 - 16:47 Comment #4
Turning off collectinfo.pl will break Virtualmin's system statistics graphs, and will make the System Information page slower to load. However, other things will still run fine.
You might try SSHing in first, running top, then waiting for the problem to occur.
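If staying logged in with a terminal open isn't practical, a small logger started ahead of time can record the same kind of information to a file. This is only a sketch, not anything that ships with Virtualmin: it assumes a Linux /proc filesystem, and the function names, the log path and the 10 second interval are all made up for illustration.

```python
import glob
import os
import time

def parse_stat(text):
    """Return (pid, command, state) from the contents of /proc/<pid>/stat."""
    # The command name sits in parentheses and may itself contain spaces
    # or parentheses, so split on the first "(" and the last ")".
    left = text.index("(")
    right = text.rindex(")")
    pid = int(text[:left])
    state = text[right + 1:].split()[0]
    return pid, text[left + 1:right], state

def blocked_processes():
    """List (pid, command) for processes in uninterruptible sleep (state D)."""
    found = []
    for path in glob.glob("/proc/[0-9]*/stat"):
        try:
            with open(path) as handle:
                pid, command, state = parse_stat(handle.read())
        except (OSError, ValueError, IndexError):
            continue  # the process exited, or the file was not parseable
        if state == "D":
            found.append((pid, command))
    return found

def snapshot_line(now, load, blocked):
    """Format one log line: timestamp, 1-minute load, blocked processes."""
    stamp = time.strftime("%Y-%m-%d %H:%M:%S", time.localtime(now))
    names = " ".join(f"{command}[{pid}]" for pid, command in blocked)
    return f"{stamp} load={load:.2f} blocked={len(blocked)} {names}".rstrip()

def watch(log_path, interval=10, samples=None):
    """Append a snapshot every `interval` seconds; samples=None runs forever."""
    count = 0
    while samples is None or count < samples:
        line = snapshot_line(time.time(), os.getloadavg()[0], blocked_processes())
        with open(log_path, "a") as log:
            log.write(line + "\n")
        count += 1
        time.sleep(interval)

# Hypothetical usage: start this shortly before 20 past the hour, e.g.
# watch("/var/tmp/loadwatch.log", interval=10)
```

Because each sample is appended and the file is closed straight away, whatever was written before the box became unresponsive should still be there after a reboot. Note that the logger itself writes to disk, so it may also stall once everything is in disk wait.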