Submitted by cjcollins on Sat, 12/07/2013 - 22:20
Problem: All sites go down when adding a new Virtual Server. I still have SSH access, and running top shows an apache2 process sitting at 100% CPU.
Quick Fix: I can recover all the sites by killing the apache2 process and then starting it back up again. This is done by,
# top
# ps ax | grep -v grep | grep apache2
# kill PID
where PID is the apache2 process taking up all the CPU.
# /etc/init.d/apache2 start
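The manual steps above could be sketched as a small helper. This is only a sketch, not part of the original report: the function name busiest_apache_pid is made up for illustration, and it assumes the process name is exactly apache2, as on the Ubuntu server described here. Note that ps reports CPU use averaged over the life of the process, so its numbers may differ from what top shows.

```shell
#!/bin/sh
# Hypothetical helper: print the PID of the apache2 process with the highest %CPU.
# Reads "PID %CPU COMMAND" lines on stdin, e.g. from:
#   ps -e -o pid= -o pcpu= -o comm=
busiest_apache_pid() {
    awk '$3 == "apache2" && (best == "" || $2 + 0 > max) {
             max = $2 + 0   # highest %CPU seen so far
             best = $1      # PID that owns it
         }
         END { if (best != "") print best }'
}

# Possible usage (left commented out so nothing is killed by accident):
#   PID=$(ps -e -o pid= -o pcpu= -o comm= | busiest_apache_pid)
#   kill "$PID" && /etc/init.d/apache2 start
```

If no apache2 process is found, the function prints nothing, so the caller should check for an empty PID before calling kill.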
The network looks like this,
ns1.bluerubyhosting.com --> 173.165.128.42 (1:1 NAT) --> 192.168.7.5 --> eth0 on main virtualmin
ns2.bluerubyhosting.com --> 173.165.128.45 (1:1 NAT) --> 192.168.7.6 --> eth1 on main virtualmin
Please help! Thanks
Status:
Closed (fixed)
Comments
Submitted by cjcollins on Sat, 12/07/2013 - 22:22 Comment #1
Submitted by andreychek on Sat, 12/07/2013 - 22:34 Comment #2
Howdy -- are the authentication details of your slave DNS server correct?
The error I see in your error logs shows this:
Login to RPC server as root rejected
That most commonly occurs when the root password, or the IP address, are incorrect.
Submitted by cjcollins on Sat, 12/07/2013 - 22:37 Comment #3
Before, my second NS2 Virtualmin (192.168.7.6) had a different password than my main NS1 Virtualmin (192.168.7.5). Now that both IPs go to the same Virtualmin, the login password for NS2 should be the same as for NS1. How/where do I set that? I think that is the problem.
Submitted by cjcollins on Sat, 12/07/2013 - 23:15 Comment #4
The same error, Login to RPC server as root rejected
is generated when I go to Webmin > Cluster Webmin Servers and try to add 192.168.7.6 as a second server.
Submitted by cjcollins on Sun, 12/08/2013 - 20:36 Comment #5
Here's more information. When I try to create a new virtual site, all the other sites go down, but I can still SSH into the Ubuntu server. When I run the "top" command I see this,
PID USER PR NI VIRT RES SHR S %CPU %MEM TIME+ COMMAND
6781 www-data 20 0 399m 11m 632 R 100 0.1 2:38.80 apache2
I can see that apache2 is using 100% of the CPU. I can see which process it is by running this command:
# ps ax | grep -v grep | grep apache
6781 ? R 4:44 /usr/sbin/apache2 -k start
So basically apache2 is trying to start and sits at 100% CPU crashing all the other sites.
Submitted by JamieCameron on Sat, 12/07/2013 - 23:41 Comment #6
So if NS1 and NS2 are the same machine, you don't actually need a cluster slave setup - otherwise Virtualmin will try to create the slave DNS zone on the same system as the master.
Am I correct that you actually only have a single physical system?
Submitted by cjcollins on Sun, 12/08/2013 - 00:04 Comment #7
Ya that's right I now have a single physical system. I went ahead and deleted all traces of the second server (192.168.7.6) from Webmin > Webmin Server Index, Cluster Webmin server, Cluster Usermin Servers. After doing that I tried creating the virtual site again and I have a different output. (see attached) The original error "Login to RPC server as root rejected" went away but I still have the problem with the process "apache2 -k start" maxing out the CPU and crashing all the other sites.
Before I really did have a second server (192.168.7.6) and so I forgot to erase them out of the cluster when I decided to just add a second NIC on my main server and assign it 192.168.7.6.
Submitted by JamieCameron on Sun, 12/08/2013 - 12:30 Comment #8
So if you try to restart Apache with
/etc/init.d/apache2 start
does it actually start up, or does it just crash? If the latter, what gets logged to the Apache error log?
Submitted by cjcollins on Sun, 12/08/2013 - 20:40 Comment #9
I tried killing the apache2 process running at 100% cpu and starting it back up,
# /etc/init.d/apache2 start
* Starting web server apache2 [Sun Dec 08 13:02:28 2013] [warn] NameVirtualHost 173.165.128.42:80 has no VirtualHosts
I have attached the error log.
Submitted by andreychek on Sun, 12/08/2013 - 14:30 Comment #10
When you killed Apache, and started it on the command line -- that appears to have worked properly, that's just a warning that you saw. Are your websites working for you at that point?
Regarding the CPU issue -- the logs show some unusual errors that Apache is generating.
Do you have any unusual modules enabled, perhaps modules from third party repositories?
It's possible that an Apache module is mis-behaving.
What output do you receive if you run this command:
ls /etc/apache2/mods-enabled
Also, what output does this show:
dpkg -l apache2
Submitted by cjcollins on Sun, 12/08/2013 - 14:35 Comment #11
The websites work when I reboot the system. I can recreate the problem by just trying to "create a virtual server".
I just barely rebooted and ran both of those commands. (see attached).
Submitted by cjcollins on Sun, 12/08/2013 - 20:41 Comment #12
Submitted by andreychek on Sun, 12/08/2013 - 15:19 Comment #13
Hmm, I don't see any unusual modules enabled there.
Do you see an influx of bandwidth, that corresponds with the Apache CPU load you're seeing? I'm curious if that's related to traffic, rather than a misbehaving module.
Submitted by cjcollins on Sun, 12/08/2013 - 15:24 Comment #14
I'm using monitis to monitor the server. See the picture of today's graph attached. The top graph is pings, so the red dots are when the server appeared to be down. They line up directly with when the CPU load shot up to 100%.
Submitted by cjcollins on Sun, 12/08/2013 - 16:11 Comment #15
I just realized I can bring the sites back online by killing the apache2 process and starting it back up again. I made a short video showing the problem,
https://www.dropbox.com/s/v5esb1vxaosan1p/virtualmin_crash.mp4
Submitted by andreychek on Sun, 12/08/2013 - 16:37 Comment #16
Thanks for the video! Yeah, I understand what's occurring, I just don't know why that might happen.
There's nothing Virtualmin does that should cause that sort of behavior... Virtualmin just adds VirtualHost content for the new domain, and then restarts Apache.
What if you run this command on that Apache process:
strace -p PID > apache_strace.txt 2>&1
Substitute the Apache process's process ID in place of the "PID" above.
Then, after 5-10 seconds, kill that process if it doesn't end automatically.
Could you attach the resulting file (apache_strace.txt)?
Submitted by cjcollins on Sun, 12/08/2013 - 17:35 Comment #17
Sure. It's taking too long to upload that so here's a link from my dropbox,
https://www.dropbox.com/s/u59fgh3kbbwqurh/apache_strace.txt
Submitted by andreychek on Sun, 12/08/2013 - 18:09 Comment #18
Thanks! I've sent the relevant bits over to Jamie, let's see what he can make of it.
I'm going to post it below for future reference -- the messages below repeat throughout the entire file:
Process 22466 attached - interrupt to quit
gettimeofday({1386543157, 496531}, NULL) = 0
gettimeofday({1386543157, 496717}, NULL) = 0
gettimeofday({1386543157, 496886}, NULL) = 0
poll([{fd=67, events=POLLIN}], 1, 3000) = 1 ([{fd=67, revents=POLLHUP}])
read(67, "", 13160) = 0
gettimeofday({1386543157, 497414}, NULL) = 0
gettimeofday({1386543157, 497568}, NULL) = 0
gettimeofday({1386543157, 497725}, NULL) = 0
gettimeofday({1386543157, 497869}, NULL) = 0
poll([{fd=67, events=POLLIN}], 1, 3000) = 1 ([{fd=67, revents=POLLHUP}])
read(67, "", 13160) = 0
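For anyone reading a trace like this later, a quick way to see which calls dominate is to count them. The following is a rough sketch, not something that was run in this ticket; the function name summarize_strace is made up for illustration, and it assumes the plain "name(args) = result" line format shown above.

```shell
#!/bin/sh
# Hypothetical helper: count how often each system call appears in strace output.
# Reads the trace on stdin, e.g.:  summarize_strace < apache_strace.txt
summarize_strace() {
    # Keep only lines that start with "name(", reduce each to the call name,
    # then count the occurrences and list the most frequent call first.
    sed -n 's/^\([a-z_][a-z0-9_]*\)(.*/\1/p' | sort | uniq -c | sort -rn
}
```

In the excerpt above, poll() returns immediately with POLLHUP and read() returns 0 (end of file) on the same descriptor each time, with no waiting in between. A summary made up almost entirely of gettimeofday, poll and read would be consistent with the process spinning in a loop on a closed descriptor.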
Submitted by JamieCameron on Sun, 12/08/2013 - 22:50 Comment #19
So I had a look, and it seems that just running
apache2ctl graceful
is enough to trigger this problem ... which suggests it is actually some kind of Apache bug. As a work-around, I configured Virtualmin to not use that command - instead, it restarts Apache to apply config changes.
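Done by hand, the same idea (a full restart instead of a graceful one) might look roughly like the sketch below. This is not Virtualmin's actual code or configuration; the function name apply_apache_config and the APACHECTL variable are made up for illustration. It only uses the configtest and restart actions of apache2ctl.

```shell
#!/bin/sh
# Hypothetical helper: apply Apache config changes with a full restart
# rather than "apache2ctl graceful".
# APACHECTL is an assumed override so that a different control command can be used.
apply_apache_config() {
    ctl="${APACHECTL:-apache2ctl}"
    # Check the configuration first so that a broken config does not take the sites down.
    if "$ctl" configtest; then
        "$ctl" restart
    else
        echo "config test failed; Apache left untouched" >&2
        return 1
    fi
}
```

The trade-off is that a full restart drops connections that are in progress, which a graceful restart is meant to avoid.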
Submitted by cjcollins on Mon, 12/09/2013 - 00:15 Comment #20
I just confirmed the problem is fixed. I can add/delete sites without apache crashing the other sites. Thanks Jamie!
Submitted by JamieCameron on Mon, 12/09/2013 - 12:03 Comment #21
Great! Now as to why apache2ctl causes Apache to hang, I don't know ..
Submitted by cjcollins on Sat, 01/04/2014 - 00:31 Comment #22
Submitted by cjcollins on Wed, 01/15/2014 - 05:20 Comment #23
I still have the issue with apache2 crashing. Now it's just random and I don't know what triggers it. Here's my quick fix for the problem. I wrote a script that runs every 20 minutes and checks whether Apache has crashed. Basically, I know it crashed if there's only one apache2 process running. Here's my script,
#!/bin/bash
# Watchdog: if only one apache2 process is left, assume Apache has hung and restart it.
ps aux | grep -v grep | grep apache2
BROKE=$(ps aux | grep -v grep | grep apache2 | wc -l)
# awk splits on runs of whitespace; cut -d" " returns an empty field when ps pads its columns
PID=$(ps aux | grep -v grep | grep apache2 | tail -n 1 | awk '{print $2}')
echo "$BROKE"
echo "$PID"
if [ "$BROKE" -eq 1 ]
then
    date >> /home/chris/scripts/apache_crash_log
    echo "PID=$PID" >> /home/chris/scripts/apache_crash_log
    echo "server crashed...sites down!"
    kill "$PID"
    /etc/init.d/apache2 stop
    sleep 2
    /etc/init.d/apache2 start
    echo "sites should be back soon..."
    # give Apache a few seconds to come up before listing its processes
    sleep 5
    ps aux | grep -v grep | grep apache2
    sleep 1
    wall /home/chris/scripts/apache_crash_log
    echo "--------------------------------" >> /home/chris/scripts/apache_crash_log
else
    echo "everything looks good!"
fi
I tested this in a real situation and it fixed the problem. I went to
Webmin>Cluster>Cluster Cron Jobs
and added this script as a cron job to run every 20 minutes. I don't know what else to do, but if it works I'm happy. What do you think?