These forums are locked and archived, but all topics have been migrated to the new forum. You can search for this topic on the new forum: Search for High Server Load because of iowait - find the cause? on the new forum.
Hello all,
Today one of my servers is becoming very slow, with very High Load Average (up to 200). There's no very high CPU use, but a high IOWAIT.
Normally I can identify the problem, it is usually email, someone sending spam. But today I can't find out what is causing the high load iowait.
My top shows:
top - 14:27:34 up 4:12, 2 users, load average: 84.58, 53.22, 43.74 Tasks: 681 total, 1 running, 679 sleeping, 0 stopped, 1 zombie Cpu0 : 10.8%us, 2.0%sy, 0.0%ni, 0.0%id, 86.8%wa, 0.0%hi, 0.3%si, 0.0%st Cpu1 : 2.3%us, 0.7%sy, 0.0%ni, 96.3%id, 0.7%wa, 0.0%hi, 0.0%si, 0.0%st Cpu2 : 0.3%us, 0.3%sy, 0.0%ni, 0.0%id, 99.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu3 : 0.3%us, 0.3%sy, 0.0%ni, 0.0%id, 99.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu4 : 5.3%us, 1.0%sy, 0.0%ni, 82.4%id, 11.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu5 : 0.7%us, 0.3%sy, 0.0%ni, 0.0%id, 99.0%wa, 0.0%hi, 0.0%si, 0.0%st Cpu6 : 1.3%us, 0.3%sy, 0.0%ni, 0.0%id, 98.3%wa, 0.0%hi, 0.0%si, 0.0%st Cpu7 : 0.0%us, 0.7%sy, 0.0%ni, 0.0%id, 99.3%wa, 0.0%hi, 0.0%si, 0.0%st Mem: 16037364k total, 13479312k used, 2558052k free, 1231820k buffers Swap: 18284540k total, 0k used, 18284540k free, 7857368k cached PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND 55028 eptvmast 20 0 276m 21m 7512 S 6.0 0.1 0:00.92 php-cgi 45290 eptvmast 20 0 286m 31m 7804 S 4.3 0.2 0:14.55 php-cgi 54977 1261 20 0 277m 19m 7536 S 3.0 0.1 0:00.25 php-cgi 55165 1080 20 0 186m 24m 7476 D 2.0 0.2 0:00.06 php-cgi 16602 phoenixa 20 0 181m 20m 7744 S 1.3 0.1 0:49.99 php-cgi 55166 helicida 20 0 273m 16m 7276 D 1.3 0.1 0:00.04 php-cgi 3501 mysql 20 0 12.9g 1.1g 7492 S 1.0 7.1 2:58.34 mysqld 54940 root 20 0 78880 3436 2408 S 1.0 0.0 0:00.04 auth 45639 postfix 20 0 82552 4988 3112 S 0.7 0.0 0:00.12 cleanup 51587 root 20 0 15504 1820 992 R 0.7 0.0 0:01.30 top 54757 postfix 20 0 98952 5132 3564 S 0.7 0.0 0:00.20 smtpd 249 root 20 0 0 0 0 S 0.3 0.0 0:01.41 scsi_eh_3 1880 root 16 -4 93156 864 572 S 0.3 0.0 0:02.81 auditd 3688 root 20 0 238m 53m 3220 S 0.3 0.3 0:04.80 spamd 11834 dovecot 20 0 78840 3324 2268 S 0.3 0.0 0:10.39 auth 41193 postfix 20 0 82584 5004 3112 S 0.3 0.0 0:00.55 cleanup 47545 567 20 0 15824 2196 992 S 0.3 0.0 0:37.54 top 53714 1316 20 0 18460 1988 1548 S 0.3 0.0 0:00.14 imap 53715 apache 20 0 334m 16m 2712 S 0.3 0.1 0:00.01 httpd 54547 866 20 0 18472 1968 1564 S 0.3 0.0 0:00.01 imap 54912 756 20 0 18456 1964 1548 D 0.3 0.0 0:00.04 imap 1 root 20 0 19232 1528 1236 S 0.0 0.0 0:01.24 init 2 root 20 0 0 0 0 S 0.0 0.0 0:00.00 kthreadd 3 root RT 0 0 0 0 S 0.0 0.0 0:00.07 migration/0 4 root 20 0 0 0 0 S 0.0 0.0 0:00.30 ksoftirqd/0 5 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/0 6 root RT 0 0 0 0 S 0.0 0.0 0:00.01 watchdog/0 7 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/1 8 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/1 9 root 20 0 0 0 0 S 0.0 0.0 0:00.19 ksoftirqd/1 10 root RT 0 0 0 0 S 0.0 0.0 0:00.00 watchdog/1 11 root RT 0 0 0 0 S 0.0 0.0 0:00.01 migration/2 12 root RT 0 0 0 0 S 0.0 0.0 0:00.00 stopper/2 13 root 20 0 0 0 0 S 0.0 0.0 0:00.18 ksoftirqd/2
My iotop normally does not show anything unusual , but for every several seconds it shows like belllow for one second:
Total DISK READ: 69.10 K/s | Total DISK WRITE: 3.79 M/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 53168 be/4 vendas4@ 3.84 K/s 0.00 B/s 0.00 % 99.99 % dovecot/imap 31584 be/4 tlazarim 0.00 B/s 0.00 B/s 0.00 % 99.99 % spamd child 55433 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % repquota -u -v /home 55382 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % repquota -u -v /home 54574 be/4 root 0.00 B/s 0.00 B/s 0.00 % 99.99 % repquota -u -v /home 55938 be/4 comunica 0.00 B/s 0.00 B/s 0.00 % 99.99 % procmail-wrapper -o -a mail2.intercode.com.br -d comunicacao-frutal.mg.gov.br 17162 be/4 barrosde 0.00 B/s 0.00 B/s 0.00 % 99.99 % spamd child 54903 be/4 educacao 26.87 K/s 0.00 B/s 0.00 % 99.99 % dovecot/imap 56043 be/4 a.robert 26.87 K/s 0.00 B/s 0.00 % 99.99 % dovecot/imap 52897 be/4 editorar 3.84 K/s 0.00 B/s 0.00 % 99.99 % spamd child 109664 be/4 controli 0.00 B/s 0.00 B/s 0.00 % 99.99 % php-cgi 460 be/3 root 0.00 B/s 153.55 K/s 0.00 % 95.16 % [jbd2/dm-0-8] 45595 be/4 postfix 0.00 B/s 11.52 K/s 0.00 % 24.99 % cleanup -z -t unix -u 5197 be/4 apache 0.00 B/s 0.00 B/s 0.00 % 14.66 % httpd 3562 be/4 mysql 7.68 K/s 295.59 K/s 0.00 % 9.89 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 1631 be/4 root 0.00 B/s 19.19 K/s 0.00 % 4.60 % [flush-253:0] 44141 be/4 mysql 0.00 B/s 15.36 K/s 0.00 % 3.06 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 1911 be/4 root 0.00 B/s 7.68 K/s 0.00 % 0.48 % rsyslogd -i /var/run/syslogd.pid -c 5 109309 be/4 mysql 0.00 B/s 7.68 K/s 0.00 % 0.30 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 3523 be/4 mysql 0.00 B/s 0.00 B/s 0.00 % 0.23 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 41193 be/4 postfix 0.00 B/s 7.68 K/s 0.00 % 0.00 % cleanup -z -t unix -u 49440 be/4 apache 0.00 B/s 3.84 K/s 0.00 % 0.00 % httpd 53694 be/4 apache 0.00 B/s 3.84 K/s 0.00 % 0.00 % httpd 2713 be/4 clam 0.00 B/s 1167.00 K/s 0.00 % 0.00 % clamd 51907 be/4 apache 0.00 B/s 3.84 K/s 0.00 % 0.00 % httpd 56015 be/4 clam 0.00 B/s 0.00 B/s 0.00 % 0.00 % clamd 5496 be/4 root 0.00 B/s 3.84 K/s 0.00 % 0.00 % lookup-domain-daemon.pl 55207 be/4 apache 0.00 B/s 3.84 K/s 0.00 % 0.00 % httpd 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd] 3 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0] 4 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0] 5 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/0] 6 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0] 7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1] 8 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/1] 9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1] 10 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/1] 11 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/2] 12 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/2] 13 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/2] 14 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/2] 15 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/3] 16 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/3] 17 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/3] 18 rt/4 root 0.00 B/s 0.00 B/
Normally it is like this:
Total DISK READ: 159.43 K/s | Total DISK WRITE: 680.50 K/s TID PRIO USER DISK READ DISK WRITE SWAPIN IO> COMMAND 460 be/3 root 0.00 B/s 0.00 B/s 0.00 % 35.70 % [jbd2/dm-0-8] 944 be/3 root 0.00 B/s 73.88 K/s 0.00 % 18.93 % [jbd2/dm-2-8] 108619 be/4 mysql 0.00 B/s 42.77 K/s 0.00 % 2.06 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 3562 be/4 mysql 0.00 B/s 7.78 K/s 0.00 % 1.96 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 3723 be/4 postgrey 0.00 B/s 15.55 K/s 0.00 % 0.04 % socket 43280 be/4 mysql 0.00 B/s 3.89 K/s 0.00 % 0.04 % mysqld --basedir=/usr --datadir=/var/lib/mysql/ --plugin-dir=/~=/var/run/mysqld/mysqld.pid --socket=/var/lib/mysql/mysql.sock 65734 be/4 postfix 0.00 B/s 11.67 K/s 0.00 % 0.00 % cleanup -z -t unix -u 41193 be/4 postfix 0.00 B/s 23.33 K/s 0.00 % 0.00 % cleanup -z -t unix -u 1881 be/3 root 0.00 B/s 7.78 K/s 0.00 % 0.00 % auditd 1911 be/4 root 0.00 B/s 3.89 K/s 0.00 % 0.00 % rsyslogd -i /var/run/syslogd.pid -c 5 65505 be/4 apache 0.00 B/s 3.89 K/s 0.00 % 0.00 % httpd 1 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % init 2 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [kthreadd] 3 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/0] 4 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/0] 5 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/0] 6 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/0] 7 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/1] 8 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/1] 9 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/1] 10 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/1] 11 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/2] 12 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/2] 13 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/2] 14 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/2] 15 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/3] 16 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/3] 17 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/3] 18 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/3] 19 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/4] 20 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/4] 21 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/4] 22 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/4] 23 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/5] 24 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/5] 25 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/5] 26 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/5] 27 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/6] 28 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/6] 29 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/6] 30 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/6] 31 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [migration/7] 32 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [stopper/7] 33 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [ksoftirqd/7] 34 rt/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [watchdog/7] 35 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [events/0] 36 be/4 root 0.00 B/s 0.00 B/s 0.00 % 0.00 % [events/1] 37 be/4 root 0.00 B/s 0.00 B/s
The postfix email queue is normal, and I usually get the culprit there. But not today.
I have enough free disk space
[root@linux03 ~]# df -h Filesystem Size Used Avail Use% Mounted on /dev/mapper/vg_linux_a-lv_root 900G 813G 42G 96% / tmpfs 7.7G 0 7.7G 0% /dev/shm /dev/sda1 477M 121M 331M 27% /boot /dev/mapper/vg_linux_b-LogVol00 1.4T 911G 395G 70% /home /dev/sdc1 1.8T 1.3T 414G 77% /backup [root@linux03 ~]#
Can anyone please advise how to identify the problem?
Thanks a lot
Rogerio
Howdy,
Sounds like you've already looked for a lot of the usual culprits!
Could you post the output of this command:
ps auxwf
That will give us an idea of what all is going on at the moment.
Also, this command will show if there's a DoS of some sort occurring:
netstat -ntu | awk '{print $5}' | cut -d: -f1 | sort | uniq -c | sort -nr | head -10
Hi Andrey, here it is:
And here is second command:
Thank you
Hi Andrey, here it is:
I’ve attached the ps command,
And here is second command:
Thank you
Hello!
I am using this script to check high iowait which is caused in my case mostly by pop3 proc.
while true; do date; ps auxf | awk '{if($8=="D") print $0;}'; sleep 1; done
you will filter with that processes that have D flag - Uninterruptible sleep (usually IO)
put that in an exec file and run it when the service is degraded.
after, you could check with lsof -p (pid no from previous script output) and see the files that are written or read.
Howdy,
Hmm, it can be tough to discern the cause and the effect just by looking at the process list... but I notice two things.
You do seem to have a decent amount of PHP processes running, though many of them are older.
However, the thing that really stood out to me is how many emails procmail is currently in the process of delivering.
What would you think about temporarily disabling Postfix (just by running "/etc/init.d/postfix stop", and seeing if your load goes back to normal after a couple of minutes.
If that doesn't help, we may want to look deeper into the PHP processes running, and what they're up to.
-Eric
Hi Andrey,
I still could not find the cause of the high IO wait. Look at the top command right now, Postfix is stopped.
On iotop I noticed a process, index.cgi that's using most of the IO... what is this process? A few moments ago it was writing 30Mb/s
Also, is there a way to check if the HD is failing? Read/write errors?
Thank you
Howdy,
Hmm, I'm not aware of a program named "index.cgi" that would run as root, that's a little unusual.
It sounds like a web app, but web apps shouldn't run as root. It might be worth investigating what's going on there.
In the meantime though, you could always kill that process.
I don't think you're seeing a disk related issue here, though if you wanted, you could run "dmesg" and review the output towards the end to see if any disk related errors are being generated.
-Eric
Hi Eric,
Today i didn't have any problems so far. But looking at dmegs, among a lot of firewall entries , I've found the follwing:
It seems to refer to sdb disk...
Howdy,
Yeah, it does appear that you're seeing some disk related errors on /dev/sdb... that may be contributing to the problem you're seeing there.
-Eric