Server up but Virtualmin won't stay up

#2 Thu, 12/26/2013 - 12:17

Locutus

Might be a memory or resource issue, killing the Webmin process. You might want to check your syslog for respective messages, also /etc/webmin/miniserv.log might contain useful information.

Check free for available memory. If you're on an OpenVZ VPS, check /proc/user_beancounters.

#3 Thu, 12/26/2013 - 13:00

itmustbe

It would appear email spam is up quite a bit today too, perhaps coincidentally?

@Locutus There is no miniserv.log to be found in that location, and I am not on an OpenVZ VPS.

#4 Thu, 12/26/2013 - 17:07

Locutus

I'm sorry, stupid me. :) The log is at /var/webmin/miniserv.log of course.

The increased spam mails can possibly cause more resources to be used, if you're e.g. using SpamAssassin in standalone mode (which will spawn an SA process for each mail).

Please check the other things I mentioned (syslog, free). You can also use the tool atop which records historical performance data like which process uses how much memory/CPU etc., to find potential resource leaks.

#5 Thu, 12/26/2013 - 18:56

itmustbe

@Locutus I think we may be off on the wrong track suspecting memory usage, as running "top" shows nothing unusual, and I checked my host's admin panel (at Media Temple) and looked at my VPS memory and CPU usage live, today, over the week, and over the month, and there are no spikes at all. The server and all services appear to be running normally other than the fact Virtualmin -- out of nowhere -- no longer pulls up (not for more than a moment, anyway!) One user of three (this is a small server!) reported a surge of spam emails, but the other accounts have been fine (all use SSL and very strong passwords).

#6 Thu, 12/26/2013 - 20:03

itmustbe

@Locutus I have found the miniserv.log but there are no unusual or very recent entries. I can't recall which mode I have SpamAssassin running in... it's the mode that takes the least virtual server memory. To check "free" do I just type "free", like running "top"? And can you point me in the direction of my syslog? (I'm accustomed to using Virtualmin and Webmin for most tasks; I can at least ssh and su in as root, but I'm a bit out of my depth after that.) The logwatch (which I have set to detailed reporting) this morning showed nothing unusual other than 59 emails delivered to "root", which usually receives no emails unless something is wrong. Unfortunately, I only know how to check those through Webmin (via the Read User Mail server), and of course I still can't pull up Webmin :( I was tempted earlier to reboot the server via Media Temple's VPS control panel, but I don't want to have any Webmin-related bootup issues that take existing services (email/websites) offline.

#7 Thu, 12/26/2013 - 20:24

itmustbe

Ah, something new in my troubleshooting! SpamAssassin is not running at all; I can see in my email headers that it has stopped along with Webmin/Virtualmin. So that explains the sudden influx of spam to one email account. I know "/etc/init.d/webmin start" works to try to start Webmin (though it is not working in my case, or at least, Webmin isn't staying up more than a split second or two). I'm not sure what command to use to try to restart SpamAssassin though (even after some research on the web)? I believe it is operating as a separate process, maybe spamc or spamd? Of course the main issue is still the lack of Webmin/Virtualmin all of a sudden, but SpamAssassin is a pretty important process too, and it's odd they're both down together, while email and sites continue to run as normal...

#8 Fri, 12/27/2013 - 04:49

Locutus

There might be resource issues even though you don't see them in free, i.e. if processes have already been killed. But that's just a guess of course. It's certainly not normal that Webmin and SpamAssassin simply stop running.

The syslog is usually located in /var/log/syslog or /var/log/messages, depending on your distribution. Check that first and look for crash or OOM messages.
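A minimal sketch of that search, for anyone who would rather not page through the log by hand. The list of phrases is an assumption about what memory trouble usually looks like in a log, not an exhaustive set, and the helper names are made up for this example.

```python
import re

# Phrases that commonly accompany memory trouble in a system log.
# This list is an assumption; extend it to match what your system logs.
MEMORY_PATTERNS = re.compile(
    r"out of memory|oom-killer|killed process|"
    r"cannot allocate memory|can't allocate memory",
    re.IGNORECASE,
)


def find_memory_errors(lines):
    """Return the lines that contain one of the memory-related phrases."""
    return [line.rstrip("\n") for line in lines if MEMORY_PATTERNS.search(line)]


def scan_log(path):
    """Read a log file and return its memory-related lines."""
    # errors="replace" keeps odd bytes in the log from aborting the scan.
    with open(path, errors="replace") as log:
        return find_memory_errors(log)
```

Run as root, scan_log("/var/log/messages") or scan_log("/var/log/syslog") returns the matching lines, depending on which file your distribution uses.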

Also check those 59 emails sent to root. They should be located in /root/Maildir/new or /root/Maildir/cur.
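If Webmin is down and you can't use its mail reader, a short script can at least summarize what those mails are about. This is only a sketch using Python's standard mailbox module; the function name is made up here, and the path you pass in is whatever Maildir your system actually uses.

```python
import mailbox
from collections import Counter


def count_subjects(maildir_path):
    """Count how often each Subject occurs in a Maildir (new/ and cur/)."""
    # create=False: fail instead of creating a Maildir if the path is wrong.
    box = mailbox.Maildir(maildir_path, factory=None, create=False)
    counts = Counter()
    for message in box:
        counts[str(message.get("Subject", "(no subject)"))] += 1
    return counts
```

For example, count_subjects("/root/Maildir").most_common(5) lists the five most frequent subjects, which quickly shows whether the mails all report the same problem.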

#9 Fri, 12/27/2013 - 07:55

itmustbe

I'll check the root emails in just a sec, and the logs, but quickly, the Logwatch this morning looked a lot stranger, with ClamAV in trouble:

--------------------- clam-update Begin ------------------------ 

The ClamAV update process was started 1 time(s)

Last ClamAV update process started at Thu Dec 26 03:31:06 2013

Last Status:
   main.cvd is up to date (version: 55, sigs: 2424225, f-level: 60, builder: neo)
   Downloading daily-18284.cdiff [100%]
   Downloading daily-18285.cdiff [100%]
   Downloading daily-18286.cdiff [100%]
   Downloading daily-18287.cdiff [100%]
   WARNING: [LibClamAV] mpool_malloc(): Can't allocate memory (262144 bytes).
   WARNING: [LibClamAV] cli_mpool_strdup(): Can't allocate memory (24 bytes).
   WARNING: [LibClamAV] cli_loadhash: Problem parsing database at line 52176
   WARNING: [LibClamAV] Can't load daily.mdb: Malformed database
   WARNING: [LibClamAV] cli_tgzload: Can't load daily.mdb
   WARNING: [LibClamAV] Can't load /var/lib/clamav/clamav-ba720c437667db49a41f36fdea54b7d8.tmp/clamav-00990949e1715fc6913f79e25927592d.cld: Malformed database
   ERROR: Failed to load new database: Malformed database
   ERROR: During database load : ERROR: Failed to load new database: Malformed database
   WARNING: Database load exited with status 55
   ERROR: Failed to load new database

The following ERRORS and/or WARNINGS were detected when
running the ClamAV update process.  If these ERRORS and/or
WARNINGS do not show up in the "Last Status" section above,
then their underlying cause has probably been corrected.

ERRORS:
   During database load : ERROR: Failed to load new database: Malformed database: 1 Time(s)
   Failed to load new database: 1 Time(s)
   Failed to load new database: Malformed database: 1 Time(s)

WARNINGS:
   [LibClamAV] Can't load /var/lib/clamav/clamav-ba720c437667db49a41f36fdea54b7d8.tmp/clamav-00990949e1715fc6913f79e25927592d.cld: Malformed database: 1 Time(s)
   [LibClamAV] cli_mpool_strdup(): Can't allocate memory (24 bytes).: 1 Time(s)
   [LibClamAV] mpool_malloc(): Can't allocate memory (262144 bytes).: 1 Time(s)
   [LibClamAV] cli_tgzload: Can't load daily.mdb: 1 Time(s)
   [LibClamAV] cli_loadhash: Problem parsing database at line 52176: 1 Time(s)
   Database load exited with status 55: 1 Time(s)
   [LibClamAV] Can't load daily.mdb: Malformed database: 1 Time(s)

---------------------- clam-update End ------------------------- 


--------------------- Clamav Begin ------------------------ 


Daemon check list:
   Database status OK: 144 Time(s)

---------------------- Clamav End -------------------------

#10 Fri, 12/27/2013 - 08:06

itmustbe

By the way, I'm running the latest CentOS 6.

So that bit I just pasted above about ClamAV I also found in my system log, but today's entries show ClamAV is fine again, and loaded its database ok. I see nothing else in /var/log/messages other than these ClamAV entries over the last week, all of which looked normal except the one pasted above (the second-to-last one in the main system log for Dec. 26). Oddly, I'm only seeing ClamAV entries in this system messages log, at least over the last week, but perhaps that's normal.

I will look at those root emails next, as I appeared to get 124 of them overnight in addition to the other 59!

#11 Fri, 12/27/2013 - 08:17

itmustbe

So the first system message:

postfix::is_postfix_running failed : Failed to query Postfix config command to get the current value of parameter process_id_directory: at ../web-lib-funcs.pl line 1376.

Actually it's looking like all 59 + 124 messages are along those lines, though I'm just checking a few randomly right now.

Should I try just rebooting the server via Media Temple's control panel? It's been running well for a while now (it'd actually been 90 days of uptime or so when I last looked at it the other week and ran some backups for the end of the month... I've been running Virtualmin/Webmin very happily for over a year now, with the server updating itself, and it gets very little usage, just a couple of up-to-date WordPress sites, some static sites, and a bit of email). I only hesitate to reboot as I can at least SSH in right now, and I'd hate for something to go terribly wrong, and wish I had spent more time troubleshooting while I still had a way in!

#12 Fri, 12/27/2013 - 08:19

itmustbe

I just typed "free" as well (and all those root emails do seem to be about Postfix):

             total       used       free     shared    buffers     cached
Mem:       3774872     930212    2844660          0          0     318108
-/+ buffers/cache:     612104    3162768
Swap:            0          0          0
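For reading this output: the "-/+ buffers/cache" row is the "Mem:" row with buffers and cache subtracted, i.e. what applications themselves are using. A small sketch of that arithmetic, assuming the old six-column layout shown above with values in KiB (the function name is made up for illustration):

```python
def app_memory_used(mem_line):
    """From the 'Mem:' line of `free`, return (KiB used by apps, percent of total)."""
    fields = mem_line.split()
    # Column order assumed: total, used, free, shared, buffers, cached.
    total, used, free, shared, buffers, cached = (int(x) for x in fields[1:7])
    apps = used - buffers - cached
    return apps, 100.0 * apps / total


apps, pct = app_memory_used(
    "Mem:       3774872     930212    2844660          0          0     318108"
)
print(f"{apps} KiB used by applications ({pct:.1f}% of total)")
# prints: 612104 KiB used by applications (16.2% of total)
```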

#13 Fri, 12/27/2013 - 09:08

Locutus

Please enclose all screen listings in [code][/code] tags, otherwise monospace font and linebreaks are lost, making it unreadable.

#14 Fri, 12/27/2013 - 09:13

Locutus

The memory errors you receive from ClamAV are odd, considering your "free" shows enough memory. It might be a hardware issue of your server. Is it a physical or virtual machine?

You can try rebooting it. Using atop you can record historical memory usage data, to see if at the time of problems occurring there's a memory issue.

About the Postfix error, Eric or someone else from the Virtualmin team would have to say something. You can try running postconf (if that's the "Postfix config command" they're talking about) and see if it works.

#15 Fri, 12/27/2013 - 09:35

itmustbe

The atop command doesn't appear to be installed on my system.

It is a virtual machine on Media Temple's VPS service. I shudder to say that it's inside a Plesk/Parallels virtual container of some sort (I'm a refugee from Plesk's control panel!)

Running postconf appears to work fine, I get a whole bunch of output in my terminal.

I thought those memory errors odd too, though they cleared up over the day, as this morning ClamAV had no such trouble. Still no SpamAssassin or Webmin/Virtualmin running though! I'm still a little concerned about rebooting in case I have more trouble. Should I wait to hear from Eric on this forum before rebooting?

#16 Fri, 12/27/2013 - 09:36

itmustbe

Sorry, I just saw your note on enclosing code in tags. I knew I was doing something wrong there; I'll do that with any future lines of code to make them more readable!

#17 Fri, 12/27/2013 - 09:54

Locutus

Eric might be able to say more, yeah, since I'm not familiar with CentOS or Plesk. He also has more experience with (resource) issues on several virtual machine hosters.

#18 Fri, 12/27/2013 - 10:17

itmustbe

I will await Eric's feedback here then... just in case there's something we're missing to check before rebooting. Perhaps rebooting will cure everything magically, but in my experience (mostly with Plesk long ago!) rebooting while other things are going wrong is not always wise, as one can lose one's access to the server, and it seems with Linux that most ailments can be cured over SSH and without a reboot.

The server throughout this period has been performing quite normally, I should add... no slowdown that typically accompanies memory issues. Just a lack of Virtualmin/Webmin and SpamAssassin these last few days, which is rather worrying of course, but you'd never know it from accessing the mailserver and websites.

#19 Fri, 12/27/2013 - 15:19

andreychek

Do you have a /proc/user_beancounters file? If so, could you post its contents?

-Eric

#20 Sat, 12/28/2013 - 08:57

itmustbe

Here are the contents of the /proc/user_beancounters file:

Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
    90999:  kmemsize                 76751205             78077952            228000000            240000000                    0
            lockedpages                     0                    0                 1200                 1200                    0
            privvmpages                924582               926555               896536               943718               892528
            shmpages                     1206                 1206                90000                90000                    4
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numproc                        92                  120                  600                  600                    0
            physpages                  214748               218221               896536               943718                    0
            vmguarpages                     0                    0               524288           2147483647                    0
            oomguarpages               192788               192788               524288           2147483647                    0
            numtcpsock                     36                   36                 2000                 2000                    0
            numflock                       14                   14                 1000                 1100                    0
            numpty                          1                    1                  100                  100                    0
            numsiginfo                      0                   30                 1024                 1024                    0
            tcpsndbuf                  676192               676192             10000000             20000000                    0
            tcprcvbuf                  589824               589824             10000000             20000000                    0
            othersockbuf               339864               341152              5000000             10000000                    0
            dgramrcvbuf                     0                    0             10000000             10000000                    0
            numothersock                  279                  281                 2000                 2000                    0
            dcachesize               42796557             42918127             57000000             60000000                    0
            numfile                      3948                 4015                40000                40000                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numiptent                      34                   34                  500                  500                    0

#21 Sat, 12/28/2013 - 10:29

Locutus

There are nearly 900,000 failed allocations of private virtual memory pages (the failcnt of 892528 for privvmpages), so it is indeed a memory-related issue.

I guess you need to save memory in Virtualmin, or ask your hoster to increase the "privvmpages" limit for you. Eric might have some ideas too, since I'm not familiar with this kind of virtualization.
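To spot such failures without reading the whole table, something like the following could work. It is a sketch that assumes the column layout shown in the post above (resource, held, maxheld, barrier, limit, failcnt); the function names are made up for illustration, and the sample rows are copied from that post.

```python
def parse_beancounters(text):
    """Parse user_beancounters text into {resource: {column: value}}."""
    columns = ("held", "maxheld", "barrier", "limit", "failcnt")
    counters = {}
    for line in text.splitlines():
        fields = line.split()
        # The first resource row carries a "uid:" prefix; drop it.
        if fields and fields[0].endswith(":") and fields[0] != "Version:":
            fields = fields[1:]
        # Data rows are a resource name followed by five integers.
        if len(fields) != 6 or not all(f.isdigit() for f in fields[1:]):
            continue
        if fields[0] == "dummy":  # placeholder rows carry no information
            continue
        counters[fields[0]] = dict(zip(columns, map(int, fields[1:])))
    return counters


def failed_resources(counters):
    """Return {resource: failcnt} for every resource with failures."""
    return {name: c["failcnt"] for name, c in counters.items() if c["failcnt"] > 0}


SAMPLE = """\
Version: 2.5
       uid  resource          held     maxheld     barrier       limit  failcnt
    90999:  kmemsize      76751205    78077952   228000000   240000000        0
            privvmpages     924582      926555      896536      943718   892528
            shmpages          1206        1206       90000       90000        4
"""

print(failed_resources(parse_beancounters(SAMPLE)))
# prints: {'privvmpages': 892528, 'shmpages': 4}
```

On a real system the text would come from reading /proc/user_beancounters as root instead of the SAMPLE string.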

(Didn't know that other services besides OpenVZ use the beancounter file, otherwise I'd have asked for it immediately without the "if you're running under OpenVZ" restriction.)

#22 Sat, 12/28/2013 - 14:08

itmustbe

If the memory is now free, do you have any ideas why Virtualmin/Webmin wouldn't be staying up after I run /etc/init.d/webmin start? Or is trying to pull it up perhaps responsible for the memory spike recorded in this beancounters output? It seems like rebooting might help if it's a memory-related issue? Presumably SpamAssassin is down because its process is resource-intensive too? I've never had similar trouble over the past year+ (everything typically ticks along fine around 1GB of memory usage out of my guaranteed 2GB... in fact I already have the server tuned for minimal memory usage, as I used to have just 1GB of guaranteed memory before I got a free upgrade).

#23 Sun, 12/29/2013 - 04:41

Locutus

Unfortunately I'm not really familiar with OpenVZ and related virtualization systems that use the beancounters file, so I can't really say why the memory allocation requests fail. Eric might be able to say more about this.

The general consensus and suggestion is (considering the myriad of problems we've seen in this forum that are related to OpenVZ-like systems) to not use such a virtualization hoster with Virtualmin.

It might be as simple as asking your hoster to increase the beancounter limits that fail for you. It might also be that there's no real solution and you need to find another hoster. But as I said, I can only guess here, Eric might know more.

#24 Sun, 12/29/2013 - 10:57

itmustbe

Thank you @Locutus for taking your time during the holidays with this thread... and you were right all along about it being a memory/resource issue :) The latest message to root was "fatal: couldn't execute /usr/bin/gpg: Cannot allocate memory". Rebooting the server brought Virtualmin and SpamAssassin back up and everything is running as normal again.

#25 Mon, 12/30/2013 - 10:20

andreychek

Howdy,

Yeah, as Locutus mentioned, you are seeing resource failures with your VPS.

It appears that your provider is using either OpenVZ or Virtuozzo. Those VPS types can have issues even if you aren't technically out of RAM: if you're using what they call "burst memory", that memory can be taken away from one of the processes on your server and given to another user on that host, if they need it.

What you'd want to do is ask your provider for more guaranteed RAM, as you seem to frequently be running out of RAM.

Each failure that shows up in that user_beancounters file represents a process that may be killed off due to a resource problem.

-Eric

#26 Mon, 12/30/2013 - 11:19

itmustbe

I've noticed some oddities actually in Virtualmin with my reported memory usage. It was after some upgrade or other (some Virtualmin upgrade in the past maybe six months or so?) The issue is that Virtualmin started reporting real memory usage at half its actual usage. Right now it says "Real memory 3.42 GB total, 590.42 MB used", but in fact I'm using twice that much memory. The memory available is correct, however... I'm actually supposed to get 2GB of memory guaranteed, up to 4GB burstable, and it never appears (in the historical data within Media Temple's control panel) as if I've gone over 50% usage of my guaranteed allocation, so it's odd that the beancounters file indicates these RAM usage problems... but perhaps it's that Plesk Virtuozzo software; I'm no fan of their software.

Here's the current state of the beancounters file, no failures so far... I'll definitely keep an eye on this file from time to time now I know it exists!

Version: 2.5
       uid  resource                     held              maxheld              barrier                limit              failcnt
    90999:  kmemsize                 69569131             70860800            228000000            240000000                    0
            lockedpages                     0                    0                 1200                 1200                    0
            privvmpages                276047               277960               896536               943718                    0
            shmpages                     1206                 1206                90000                90000                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numproc                       104                  135                  600                  600                    0
            physpages                  261641               265441               896536               943718                    0
            vmguarpages                     0                    0               524288           2147483647                    0
            oomguarpages               157361               157361               524288           2147483647                    0
            numtcpsock                     47                   47                 2000                 2000                    0
            numflock                       10                   11                 1000                 1100                    0
            numpty                          1                    1                  100                  100                    0
            numsiginfo                      0                   30                 1024                 1024                    0
            tcpsndbuf                  948552               948552             10000000             20000000                    0
            tcprcvbuf                  770048               770048             10000000             20000000                    0
            othersockbuf               383792               388464              5000000             10000000                    0
            dgramrcvbuf                     0                    0             10000000             10000000                    0
            numothersock                  300                  302                 2000                 2000                    0
            dcachesize               39941079             40034396             57000000             60000000                    0
            numfile                      4115                 4165                40000                40000                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            dummy                           0                    0  9223372036854775807  9223372036854775807                    0
            numiptent                      34                   34                  500                  500                    0

#27 Mon, 12/30/2013 - 11:27

Locutus

Depending on usage (number of websites, use of PHP, FTP uploads, incoming emails, spam and virus scanning, other general activity), the server's memory usage could over time easily peak over 2 GB, even if the average is lower. "Burstable" memory is very unreliable, as you've seen, and can lead to processes being killed randomly if they hold that memory for too long.

Having burstable memory is even counter-productive in this case. The OS sees that it has 4 GB of memory, and wants to make use of it. It does not know that half of that memory is dangerously unreliable.

Also note that you're already using about one third of the allowed privvmpages (276047 held against a barrier of 896536), so over time the limit could be reached again.
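To put those page counts into more familiar units, here is the arithmetic as a short sketch. The 4096-byte page size is an assumption (it is the usual value on x86 Linux); the two page counts are the privvmpages held and barrier values from the beancounters listing above.

```python
PAGE_SIZE = 4096  # bytes per page; assumed, typical for x86 Linux


def pages_to_gib(pages, page_size=PAGE_SIZE):
    """Convert a page count into GiB."""
    return pages * page_size / 2**30


held, barrier = 276047, 896536  # privvmpages: held and barrier

print(f"held:    {pages_to_gib(held):.2f} GiB")
print(f"barrier: {pages_to_gib(barrier):.2f} GiB")
print(f"usage:   {held / barrier:.0%} of the barrier")
# prints:
# held:    1.05 GiB
# barrier: 3.42 GiB
# usage:   31% of the barrier
```

Under that page-size assumption, the barrier works out to the same 3.42 GB that Virtualmin reports as total real memory.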

I suppose the only way to somewhat reliably prevent that (aside from not using Virtuozzo/OpenVZ) would be to regularly reboot the server. Or have a script monitor the beancounters and reboot the server if some limit is being reached. Both of which I'd not recommend for serious web hosting.
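For illustration only, the check such a monitoring script would perform might look roughly like this. It is a sketch that only reports and never reboots anything; the function name, the snapshot layout and the 90% threshold are all assumptions made for this example, and the caveat above about serious hosting still applies.

```python
def beancounter_warnings(previous, current, warn_ratio=0.9):
    """Compare two snapshots and return a list of warning strings.

    Each snapshot maps a resource name to a dict with the keys
    "held", "barrier" and "failcnt" (integers).
    """
    warnings = []
    for name, now in current.items():
        before = previous.get(name, {})
        # A growing failcnt means allocations were refused since the last check.
        new_failures = now["failcnt"] - before.get("failcnt", 0)
        if new_failures > 0:
            warnings.append(f"{name}: {new_failures} new failed allocations")
        # Warn early when usage approaches the barrier.
        if now["barrier"] > 0 and now["held"] >= warn_ratio * now["barrier"]:
            warnings.append(
                f"{name}: held {now['held']} is at or above "
                f"{warn_ratio:.0%} of barrier {now['barrier']}"
            )
    return warnings
```

Run from cron with the previous snapshot saved to disk, the returned warnings could be mailed to the admin, leaving the decision to restart services or reboot to a human.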