Submitted by hakan@applepaj.se on Fri, 03/16/2012 - 07:15 Pro Licensee
We are thinking of doing a complete reinstall of our Cloudmin master because it keeps hanging intermittently. Which configuration files do we need to back up beforehand so that we can get the virtual machines running again as quickly as possible?
Regards, Jakob
Status:
Active
Comments
Submitted by andreychek on Fri, 03/16/2012 - 11:50 Comment #1
Howdy -- sorry to hear that your server is locking up!
Are you hosting your Virtual Machines (VM's) on your Cloudmin master server?
Or are your VM's on another server?
Also, which type of VM are you using? Xen? KVM? Or another?
Thanks!
Submitted by JamieCameron on Fri, 03/16/2012 - 12:54 Comment #2
It may be better for us to debug the underlying problem, as a re-install is unlikely to help and may cause Cloudmin to lose track of your VMs. Can you tell us more about these hangs?
Submitted by hakan@applepaj.se on Sat, 03/17/2012 - 02:53 Pro Licensee Comment #3
Hi,
Yes, we have four VM's running on it now, and we plan to run at least one more. All of them are KVM.
/J
Submitted by hakan@applepaj.se on Sat, 03/17/2012 - 05:01 Pro Licensee Comment #4
Hi,
That would be great. The problem is that the server runs fine for a day or two, sometimes a week, and then all VMs stop responding. After a reboot everything is fine again until the next hang. We have looked at the logs but can't find the cause. What information or logs do you need to debug this?
Regards, Jakob
Submitted by andreychek on Sat, 03/17/2012 - 10:24 Comment #5
When the VM's stop responding -- is the host server still responsive? Or is the host unavailable as well?
Also, is there any chance we could log into your host and take a peek at some of your logs and related info?
It's hard to say what exactly the issue is, though we'd be interested in the messages, syslog, and kern.log files -- as well as your current dmesg output, and whether you're using the most recent kernel version available to your distribution.
So you're welcome to provide us with that info if you prefer, but it may be simpler if we logged in to take a look.
Also, if you know roughly when this issue last occurred, that would help us know where to look in the logs.
Thanks!
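For anyone narrowing a large log down to the time of a hang, here is a minimal Python sketch. It assumes the traditional syslog timestamp prefix (for example `Mar 17 12:32:03`), which carries no year, so the year is passed in by hand. The function name `lines_near`, the ten-minute window, and the second sample line are all made up for illustration. Month names are parsed with `%b`, so this assumes an English/C locale.

```python
from datetime import datetime, timedelta


def lines_near(log_lines, incident, minutes=10, year=2012):
    """Return the log lines stamped within `minutes` of `incident`.

    Assumes each line starts with a 15-character syslog timestamp
    such as "Mar 17 12:32:03"; lines that do not parse are skipped.
    """
    window = timedelta(minutes=minutes)
    selected = []
    for line in log_lines:
        try:
            stamp = datetime.strptime(f"{year} {line[:15]}", "%Y %b %d %H:%M:%S")
        except ValueError:
            continue  # not a timestamped line (e.g. a wrapped message)
        if abs(stamp - incident) <= window:
            selected.append(line)
    return selected


# Hypothetical sample: two lines in syslog format, hours apart.
sample = [
    "Mar 17 09:00:00 bfg kernel: [160000.000000] unrelated message",
    "Mar 17 12:32:03 bfg kernel: [173304.969193] Disabling IRQ #19",
]
near_hang = lines_near(sample, datetime(2012, 3, 17, 12, 30))
```

In practice the lines would come from a file such as kern.log or syslog rather than a hard-coded list.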
Submitted by hakan@applepaj.se on Mon, 03/19/2012 - 05:01 Pro Licensee Comment #6
That would be great, how do we proceed?
Submitted by hakan@applepaj.se on Mon, 03/19/2012 - 07:24 Pro Licensee Comment #7
We found this in the kernel log after a reboot, could it have something to do with the problem?
"Mar 17 12:32:03 bfg kernel: [173304.969115] irq 19: nobody cared (try booting with the "irqpoll" option)
Mar 17 12:32:03 bfg kernel: [173304.969120] Pid: 0, comm: swapper Not tainted 2.6.32-39-server #86-Ubuntu
Mar 17 12:32:03 bfg kernel: [173304.969122] Call Trace:
Mar 17 12:32:03 bfg kernel: [173304.969124] [] __report_bad_irq+0x2b/0xa0
Mar 17 12:32:03 bfg kernel: [173304.969133] [] note_interrupt+0x18c/0x1d0
Mar 17 12:32:03 bfg kernel: [173304.969136] [] handle_fasteoi_irq+0xdd/0x100
Mar 17 12:32:03 bfg kernel: [173304.969140] [] handle_irq+0x22/0x30
Mar 17 12:32:03 bfg kernel: [173304.969145] [] do_IRQ+0x6c/0xf0
Mar 17 12:32:03 bfg kernel: [173304.969147] [] ret_from_intr+0x0/0x11
Mar 17 12:32:03 bfg kernel: [173304.969149] [] ? finish_task_switch+0x59/0xe0
Mar 17 12:32:03 bfg kernel: [173304.969155] [] ? finish_task_switch+0x50/0xe0
Mar 17 12:32:03 bfg kernel: [173304.969159] [] ? thread_return+0x48/0x41f
Mar 17 12:32:03 bfg kernel: [173304.969163] [] ? cpu_idle+0xeb/0x110
Mar 17 12:32:03 bfg kernel: [173304.969167] [] ? rest_init+0x77/0x80
Mar 17 12:32:03 bfg kernel: [173304.969171] [] ? start_kernel+0x36d/0x376
Mar 17 12:32:03 bfg kernel: [173304.969174] [] ? x86_64_start_reservations+0x125/0x129
Mar 17 12:32:03 bfg kernel: [173304.969178] [] ? x86_64_start_kernel+0xfa/0x109
Mar 17 12:32:03 bfg kernel: [173304.969180] handlers:
Mar 17 12:32:03 bfg kernel: [173304.969181] [] (pdc_interrupt+0x0/0x2d0 [sata_promise])
Mar 17 12:32:03 bfg kernel: [173304.969193] Disabling IRQ #19"
Update: The error seems to have disappeared after we replaced the SATA card.
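A trace like the one above can be spotted automatically. The following Python sketch scans kernel log text for "nobody cared" reports, collects the driver modules listed under "handlers:", and notes whether the kernel went on to disable the IRQ. The function name `summarize_irq_events` and the regular expressions are assumptions based only on the lines quoted in this thread, so other kernel versions may format these messages differently.

```python
import re

# Patterns modelled on the log lines quoted in this thread (an assumption).
IRQ_RE = re.compile(r"irq (\d+): nobody cared")
HANDLER_RE = re.compile(r"\((\w+)\+0x[0-9a-f]+/0x[0-9a-f]+ \[(\w+)\]\)")
DISABLE_RE = re.compile(r"Disabling IRQ #(\d+)")


def summarize_irq_events(log_text):
    """Return one dict per "nobody cared" report found in log_text."""
    events = []
    current = None
    for line in log_text.splitlines():
        report = IRQ_RE.search(line)
        if report:
            current = {"irq": int(report.group(1)), "modules": [], "disabled": False}
            events.append(current)
            continue
        if current is None:
            continue
        handler = HANDLER_RE.search(line)
        if handler:
            # group(2) is the module name in square brackets, e.g. sata_promise
            current["modules"].append(handler.group(2))
        disabled = DISABLE_RE.search(line)
        if disabled and int(disabled.group(1)) == current["irq"]:
            current["disabled"] = True
            current = None  # this report is complete
    return events


# Three lines taken from the trace posted above.
sample_log = "\n".join((
    'Mar 17 12:32:03 bfg kernel: [173304.969115] irq 19: nobody cared (try booting with the "irqpoll" option)',
    'Mar 17 12:32:03 bfg kernel: [173304.969181] [] (pdc_interrupt+0x0/0x2d0 [sata_promise])',
    'Mar 17 12:32:03 bfg kernel: [173304.969193] Disabling IRQ #19',
))
events = summarize_irq_events(sample_log)
```

For the trace in this thread, the module reported is `sata_promise`, which is consistent with the suspicion that the Promise SATA card is involved.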
Submitted by andreychek on Mon, 03/19/2012 - 09:31 Comment #8
Yeah it's certainly possible that something was awry with the SATA card.
It's up to you how we proceed then -- if you'd like us to take a look at your logs, you can either enable Remote Support using the Virtualmin Support module, or you can email your login details to eric@virtualmin.com.
If you do that, be sure to let us know when the last problem occurred, so we know where to look in the logs.
Or, if you'd like to see if this new SATA card fixes the issue, we can hold off to see if it happens again.
Submitted by hakan@applepaj.se on Mon, 03/19/2012 - 09:41 Pro Licensee Comment #9
We will let it run and see. If it hangs again, we will let you know.
Thanks! Jakob
Submitted by hakan@applepaj.se on Wed, 03/28/2012 - 05:43 Pro Licensee Comment #10
So, after about a week the server is behaving strangely (slowly) again. I have attached the log files you wanted; tell me if there is any other info you need.
Kernel is 2.6.32.40-server
Regards, Jakob
Submitted by andreychek on Wed, 03/28/2012 - 11:40 Comment #11
Well, I see a few unusual issues related to what I think are some video drivers, though I'm not sure if that's related to the slowness you're seeing now.
What output do you see if you run the command "uptime" on the host -- are you seeing a high load at the moment?
If so, what processes do you see consuming a lot of resources when you run the command "top"?
Also, if you're still seeing this problem, we'd also be happy to log into your host and take a look around a bit.
Submitted by hakan@applepaj.se on Thu, 03/29/2012 - 04:25 Pro Licensee Comment #12
Output from uptime
user@bfg:~$ uptime
11:12:30 up 17:06, 2 users, load average: 70.39, 49.91, 43.15
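To put those numbers in context, here is a small Python sketch that pulls the three load averages out of `uptime` output and relates the 1-minute figure to the number of CPU cores. The function names are made up for illustration, and the core count of 8 is purely hypothetical, since the thread does not say how many cores the host has. As a rough rule of thumb, a sustained load well above the core count means processes are queuing. On Linux the load average also counts tasks in uninterruptible I/O wait, so a stalled disk controller can push it up even when the CPUs are mostly idle.

```python
import re


def parse_load_averages(uptime_output):
    """Return the (1 min, 5 min, 15 min) load averages as floats."""
    match = re.search(
        r"load averages?:\s*([\d.]+),?\s+([\d.]+),?\s+([\d.]+)", uptime_output
    )
    if match is None:
        raise ValueError("no load average found in uptime output")
    return tuple(float(value) for value in match.groups())


def load_per_core(load, cores):
    """Load divided by core count; values well above 1.0 suggest queuing."""
    return load / cores


# The uptime output posted above.
one_min, five_min, fifteen_min = parse_load_averages(
    "11:12:30 up 17:06, 2 users, load average: 70.39, 49.91, 43.15"
)

# HYPOTHETICAL core count, used only to illustrate the calculation.
ratio = load_per_core(one_min, cores=8)
```

On a live system, `os.getloadavg()` and `os.cpu_count()` from the standard library give the same inputs without parsing text.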
Submitted by andreychek on Thu, 03/29/2012 - 08:01 Comment #13
Okay, so that load there appears to be the issue -- something is hammering your server. And we just need to figure out what :-)
If you run the command "top", do any processes stand out to you?
Also, are you able to review the "uptime" output of your individual VPS's?
If one VPS were having significant load issues, that could potentially affect the host server like you're seeing.
Submitted by hakan@applepaj.se on Thu, 03/29/2012 - 08:27 Pro Licensee Comment #14
OK, we received the same IRQ error yesterday, so we replaced the other SATA card and updated the motherboard BIOS. It's running fine right now with a load of 2-5 on the master, which seems normal. The VPSes have a load of 0.3 to 4 depending on their workload, so that seems OK too.
We'll let you know next week if everything is ok or if the problem shows up again. I think the problem is the Promise SATA-card.
/Håkan