False Monitoring Alerts

Hello,

Every day I wake up to emails saying that the HTTP and SSL services have gone down for all of the virtual servers that I have set up for monitoring. It happens at different times: yesterday it was around 7 AM and today it was around 6 AM.

When I check my system, the sites are NOT down. This is a problem because I have clients complaining every day about these false alarms. I increased the timeout for these monitoring checks from 60 to 120 seconds, and that didn't seem to help.

The uptime on my server is 9 days, and no services such as Apache seem to have failed either.

Can someone point out any particular problems?

Thanks, G

Status: Closed (fixed)

Comments

Howdy -- until we get this resolved, you might consider preventing the status notifications from going to Virtual Server owners.

To do that, go into System settings -> Server Templates -> Default -> Status Monitoring.

From there, first enter an email address in "Additional email address for monitoring messages", and then set "Send email to server owner" to "No".

Sometimes at the 10 second timeout, the check can time out -- but in theory, it should be fine at 120 seconds.

Do you know of anything particularly CPU intensive going on around 6-7am?

You might also want to check the files logs/access_log and logs/error_log under the problem domains for entries from around the time those messages were sent - there may be messages indicating why the status check failed.
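If it helps, here is a rough sketch of how those two logs could be filtered down to the minutes around an alert. It is only an illustration: the helper names (`parse_timestamp`, `lines_near`) are made up for this example, it assumes the standard Apache access log timestamp and the Apache 2.2-style error log timestamp, and it ignores the time zone offset, so the alert time is taken to be in the server's local time.

```python
import re
from datetime import datetime, timedelta

# Assumed timestamp shapes (not taken from the poster's actual logs):
#   access_log: [12/Jun/2009:06:10:01 -0700]
#   error_log:  [Fri Jun 12 06:10:01 2009]   (Apache 2.2 style)
ACCESS_RE = re.compile(r"\[(\d{2}/\w{3}/\d{4}:\d{2}:\d{2}:\d{2}) [+-]\d{4}\]")
ERROR_RE = re.compile(r"\[(\w{3} \w{3} +\d{1,2} \d{2}:\d{2}:\d{2} \d{4})\]")


def parse_timestamp(line):
    """Return the line's timestamp as a naive datetime, or None if absent."""
    m = ACCESS_RE.search(line)
    if m:
        return datetime.strptime(m.group(1), "%d/%b/%Y:%H:%M:%S")
    m = ERROR_RE.search(line)
    if m:
        # %a and %b are locale-dependent; an English locale is assumed.
        return datetime.strptime(m.group(1), "%a %b %d %H:%M:%S %Y")
    return None


def lines_near(lines, when, window_minutes=5):
    """Keep only lines stamped within window_minutes of `when`."""
    window = timedelta(minutes=window_minutes)
    hits = []
    for line in lines:
        ts = parse_timestamp(line)
        if ts is not None and abs(ts - when) <= window:
            hits.append(line.rstrip("\n"))
    return hits


# The alert emails use a "12/Jun/2009 06:10" style time, which parses as:
alert_time = datetime.strptime("12/Jun/2009 06:10", "%d/%b/%Y %H:%M")
```

To use it, open a domain's `logs/access_log` or `logs/error_log`, pass the file object as `lines`, and print whatever `lines_near(f, alert_time)` returns.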

Thanks for the reply guys!!

I have checked every log and don't see anything going on around those times. I even ran sar and vmstat for those times and nothing seems to be showing critical. The system load is at a minimum of 0.02 at 5 minutes and a maximum of 0.21 at 5 minutes.

I even put "watch" to monitor certain logs and nothing popped up that was obvious, though I could certainly be overlooking something.

I also went ahead and changed the timeout to 60 seconds from 10 in the "Default Template" and went into each template I have and told it to go by the "Default Template" settings to see if that makes a difference.

Also, I forgot to mention: I turned off the option to mail alerts to server owners in the meantime, to see if this fixes anything.

In the emails you get when the site is reported as down, is there any more detail such as an HTTP error message?

The message is the same in all emails for every domain:

Monitor on sld.mydomain.tld.com for 'Website sld.mydomain.sld' has detected that the service has gone down at 12/Jun/2009 06:10

That's the only thing in the body of the email and it's the same every time.
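Since the email carries no HTTP detail, one option is to run a small independent probe from cron around the problem hours and log what it sees. The sketch below is not how the Webmin status monitor works internally; `probe` is a hypothetical helper, and the `opener` parameter exists only so the function can be exercised without a network.

```python
import time
import urllib.error
import urllib.request


def probe(url, timeout=10.0, opener=urllib.request.urlopen):
    """Request `url` once; return (outcome, seconds_elapsed).

    outcome is "HTTP <code>" when the server answered, or
    "error: <reason>" when the request failed or timed out.
    """
    started = time.monotonic()
    try:
        with opener(url, timeout=timeout) as resp:
            outcome = f"HTTP {resp.status}"
    except urllib.error.HTTPError as exc:
        # The server answered, but with a 4xx/5xx status.
        outcome = f"HTTP {exc.code}"
    except urllib.error.URLError as exc:
        # DNS failure, refused connection, TLS problem, and so on.
        outcome = f"error: {exc.reason}"
    except OSError as exc:
        # Covers socket timeouts raised outside of URLError.
        outcome = f"error: {exc}"
    return outcome, time.monotonic() - started
```

Logging the outcome and elapsed time for each domain every minute would show whether requests are slow, refused, or returning an error status at the moment the alert fires.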

Also, just a thought...it would be nice if the system would tell you that a virtual server is back up after it succeeds with the check after failure.

I am guessing this is what Mon is for? I guess I would have to read up on Mon.

You can have it send email when a service goes back up as well - just go to Webmin -> Others -> System and Server Status -> Scheduled Monitoring, and in the 'Send email when' field select 'When a service changes status'.

As for the underlying cause, does it help if you go to Webmin -> Others -> System and Server Status -> whatever.com , and increase the 'Failures before reporting' field?
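To illustrate why raising that field can hide one-off blips, here is a minimal sketch of a consecutive-failure threshold combined with status-change notifications. This is not Webmin's code; the `Monitor` class and its attribute names are invented for the example.

```python
class Monitor:
    """Report 'down' only after N consecutive failed checks, then 'up' on recovery."""

    def __init__(self, failures_before_reporting=1):
        self.threshold = failures_before_reporting
        self.consecutive_failures = 0
        self.reported_down = False

    def record(self, check_passed):
        """Feed in one check result; return 'down', 'up', or None (no email)."""
        if check_passed:
            self.consecutive_failures = 0
            if self.reported_down:
                self.reported_down = False
                return "up"
            return None
        self.consecutive_failures += 1
        if not self.reported_down and self.consecutive_failures >= self.threshold:
            self.reported_down = True
            return "down"
        return None


# One isolated failure, then two failures in a row.
results = [True, False, True, False, False, True]
strict = Monitor(failures_before_reporting=1)
lenient = Monitor(failures_before_reporting=2)
strict_events = [strict.record(r) for r in results]
lenient_events = [lenient.record(r) for r in results]
```

With a threshold of 1 the isolated failure produces a "down" and an "up" email; with a threshold of 2 it produces nothing, and only the back-to-back failures are reported.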

Jamie,

I made the change to "When a service changes status". Thanks for that since I was not aware we had that.

Also, I went back to every whatever.com and increased the "Failures before reporting" field, and it looks like that may have worked, but it's too soon to tell. I haven't received any failures today, which is good. I was getting them daily :-)

I will give it a day or two (until Monday) to see if this helped and keep you posted.

Thanks again

Ok, let us know.

By the way, do you get the failure emails for all domains at the same time, or at different times?

Jamie,

Still no failure emails so looking good so far. :-)

And yes, I get failures for all domains at the same time.

Was Apache perhaps restarted at that time? You can tell by looking at the /var/log/httpd/error_log or /var/log/apache2/error_log file.

Jamie,

There were no errors for Apache, nor any signs of a shutdown of Apache or other services. I had looked for this before and checked again, and found nothing.

But, I still haven't received any alerts since the change on Friday so this is good :-)

Ok, so this looks like some kind of transient failure at that time - perhaps the system was loaded enough that the HTTP request didn't return in time. To work around this in future, I will have Virtualmin set that failure count to 2 for all new domains.

Automatically closed -- issue fixed for 2 weeks with no activity.