Urgent!!! Usermin memory leak consumes almost all server memory.

Hello,

Last night Usermin behaved very strangely for one single user.

When this particular user tries to log in to Usermin, 2 processes (miniserv.pl) get stuck at the top of top, each with 100% CPU and 30% memory and rising.

At the same time the Usermin panel keeps loading in the browser, showing only a small part of the left menu and nothing more.

When I stop Usermin (service usermin stop), it stops and everything goes back to normal.

I tried to log in as another user and everything was normal, without any issues, but every time I try to log in as this particular user the issue comes back.

So the issue is repeatable.

The client said that at that moment he was trying to set up email filters.

Usermin enabled modules:

Filter and Forward Mail, Read Mail, SpamAssassin Mail Filter

If you need more information I will try to get it.

Status: Active

Comments

Could be that the user has a corrupted mail file or mail index.

Try SSHing into your system, cding to the user's home directory, and then running:

rm .usermin/mailbox/*index*

Hi Jamie,

Sorry, but we were not able to keep the email account in that state, because the client needed it, and if he accessed it, it was very likely to take the whole server down.

So we backed up the messages, recreated the email account, and copied the messages back in. After that it was working right. I can't test it, but I'm sure that this file was OK because it was one of the first things I checked.

It was like:

pass=some password
user=*
nologout=1

But this should never happen. If we had not reacted fast, the server would have been down within minutes. So this should somehow be prevented.

I think you should consider implementing some resource limits on system processes and users' processes.

We are considering writing a plugin for Webmin/Virtualmin for cgroups management. You are probably aware of this kernel feature, which can limit server resources for groups of users, processes, and executables. This would help make the system far more stable.

When it is ready, we can share it with you to add to the official Virtualmin (if you like), and you could integrate it more deeply into Virtualmin. If it is built into the Virtualmin code, it can control the resources of every process Virtualmin starts. For example, creating a backup on a loaded server would then not hurt the performance of clients' services (websites); nice is not the best way to do this.

Yeah, Usermin shouldn't be able to use up all memory like this - there must be some corrupt index or other file that is causing it to go into an infinite loop. If you have code that can prevent this in the general case, I would be very interested to see it.

Hi Jamie,

We have faced that issue again, and it brought a production server very close to a halt. We do not have actual Perl code, but we have tried starting Usermin in a control group. To do that we need to edit /etc/usermin/start and add cgexec -g */cgroup after the exec. This works, but it probably has to be redone every time Virtualmin updates, so it is not an ideal solution. If you have a better idea, we will be glad to read it.

Creating a control group plugin for Virtualmin and adding most of the users and processes to control groups would make the system nearly bulletproof. If every script started from Webmin had an option to add parameters before the command, we could start them directly in a control group.

We have found that the memory leak is gone after deleting the contents of the /homes/user/.usermin/ directory. Probably only the .ip and .pag files need to be deleted, but we just deleted the whole contents and it works OK.
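Since clearing the directory also removes whatever state triggers the leak, here is a minimal sketch of archiving it first so the broken files can be examined or replayed later. The function name snapshot_and_clear_usermin and both of its arguments are hypothetical and not part of Usermin; the sketch assumes tar with gzip support is installed.

```shell
#!/bin/sh
# Hypothetical helper (not part of Usermin): archive a user's .usermin
# directory, then empty it.
#   $1 = the user's home directory
#   $2 = an existing directory that will hold the archive
snapshot_and_clear_usermin() {
    home=$1
    dest=$2
    [ -d "$home/.usermin" ] || return 1
    archive="$dest/usermin-state-$(date +%Y%m%d%H%M%S).tar.gz"
    # Archive first; stop here if the archive could not be written.
    tar -czf "$archive" -C "$home" .usermin || return 1
    # Remove every entry inside .usermin (dotfiles included) but keep the
    # directory itself, so its ownership and permissions are unchanged.
    ( cd "$home/.usermin" && find . ! -name . -prune -exec rm -rf {} + ) || return 1
    # Print the archive path so the caller can record it.
    echo "$archive"
}
```

The resulting archive could later be unpacked under a test account to try to reproduce the problem.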

A cgroup shouldn't really be needed - Usermin shouldn't leak RAM, regardless of what the user does.

What was the user action that triggered this latest problem?

Hi Jamie,

This is the third time we have reported a memory leak, and each of them brought the server close to a halt in less than 10 minutes. Only our fast reaction saved it. This time it was eating 200MB/s, which on a 64GB RAM machine will fill the memory in about 5 minutes even if all of the memory is free, which cannot be the case on a busy server. We were lucky to catch it with all memory full and 40% swap free.

I'm telling you all these details just so you understand how serious this is.

The user action was only an attempt to log in. We then temporarily disabled the mail user, but if the user already has a session the disabling does not work, and when the user refreshes the page the leak starts again. So a cgroup with hard memory limits is mandatory just for safety. No one is ever immune to memory leaks in code, but safety comes first.

We plan in the future to write a module for Virtualmin, and maybe rewrite parts of it, to add cgroups for users and for Webmin/Virtualmin/Usermin. But we do not know when we will have the time for this; maybe when something like this really brings a server to a halt, hopefully not.

As I see it, the init scripts of both Usermin and Webmin do not have options to add parameters before and/or after the executable in sysconfig, as most init scripts do (Apache, for example).

Also, we have options for setting CPU priority for scheduled jobs, IO class for scheduled jobs, and IO priority for scheduled jobs,

but these apply only to scheduled scripts. Can you add an option for custom parameters before all the executables that Webmin/Virtualmin starts, such as a couple of text fields whose contents are added raw before the command? For example, in /etc/usermin/start:
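As a rough check of that estimate, assuming 1GB = 1024MB and a constant leak rate (the helper name seconds_to_fill is made up for illustration):

```shell
#!/bin/sh
# Seconds until a leak of rate_mb MB/s consumes size_gb GB of memory.
# Integer division, so the result is rounded down.
seconds_to_fill() {
    size_gb=$1
    rate_mb=$2
    echo $(( size_gb * 1024 / rate_mb ))
}

seconds_to_fill 64 200   # prints 327
```

That is 327 seconds, or roughly five and a half minutes, and it assumes the machine starts with all 64GB free.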

#!/bin/sh
echo Starting Usermin server in /usr/libexec/usermin
trap '' 1
LANG=
export LANG
#PERLIO=:raw
unset PERLIO
export PERLIO
exec (some text added here) /usr/libexec/usermin/miniserv.pl /etc/usermin/miniserv.conf

How can we achieve that in a way that will survive updates?
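One possible stopgap, sketched under assumptions rather than offered as a supported Webmin/Usermin feature: a small helper that puts the prefix back whenever an update has rewritten the start script. The function name apply_exec_prefix is hypothetical, and the sketch assumes the script contains a line beginning with "exec /" that launches miniserv.pl, as in the listing above. Running it a second time leaves the file unchanged.

```shell
#!/bin/sh
# Hypothetical helper: insert a command prefix after "exec" in a start script.
#   $1 = path to the start script (e.g. /etc/usermin/start)
#   $2 = text to insert between "exec" and the miniserv.pl path
apply_exec_prefix() {
    script=$1
    prefix=$2
    tmp=$(mktemp) || return 1
    # Only lines where "exec " is followed directly by an absolute path to
    # miniserv.pl are rewritten. A line that already carries a prefix does
    # not match, which makes the helper safe to run repeatedly.
    awk -v p="$prefix" '
        /^exec \/.*miniserv\.pl/ { print "exec " p " " substr($0, 6); next }
        { print }
    ' "$script" > "$tmp" || { rm -f "$tmp"; return 1; }
    # Overwrite in place so the script keeps its owner and permissions.
    cat "$tmp" > "$script"
    rm -f "$tmp"
}

# Example call (commented out; it would modify a real file):
# apply_exec_prefix /etc/usermin/start 'cgexec -g */cgroup'
```

Such a helper could be run after each package update, for instance from a cron job, but that scheduling is an assumption and not something the thread confirms.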

Are you able to determine if this particular issue is always occurring when the same user logs in?

Or is that difficult to determine?

Before we deleted the described databases, the leak started every time the user tried to log in, and it did not stop until we stopped or restarted Usermin.

Would it be possible for one of us to login to your system? If you have a user that can trigger this, I'd like to see what's happening inside Usermin when the problem occurs.

We have fixed it, and we do not know how to break the database to reproduce it. If you guide me on how to collect all the details you need, I can do it the next time this occurs.

Also, can we add the needed parameters in a way that they will not be overwritten by the next update? It is unacceptable for us to run this ticking bomb in production without a safety switch (a cgroup with a memory limit).

Perhaps you can update this bug if it happens again?

I agree that this should be fixed, but I don't think that a process-level memory limit is the right answer.

The memory limit is set high enough not to be hit in normal working situations and low enough not to eat up the whole memory. For example, it is now set to 5GB, which should not be reached unless there is a memory leak, but if something breaks it will not fill up the RAM.

When this happens again I will try to move this particular account and reproduce it on a test machine where I can give you access.
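For comparison only, and plainly not the cgroup setup described here: the shell's built-in ulimit can apply a similar ceiling without extra tooling. It limits the virtual address space of each process separately, not the combined memory of a group, so it is a coarser safety net. The wrapper name run_with_vm_limit is hypothetical.

```shell
#!/bin/sh
# Hypothetical wrapper: run a command with a per-process cap on virtual
# memory, given in kilobytes. The limit is set inside a subshell, so it
# affects only the command being started (and its children), not the caller.
run_with_vm_limit() {
    limit_kb=$1
    shift
    (
        ulimit -v "$limit_kb" || exit 1
        exec "$@"
    )
}

# 5GB expressed in kilobytes: 5 * 1024 * 1024 = 5242880
# run_with_vm_limit 5242880 /usr/libexec/usermin/miniserv.pl /etc/usermin/miniserv.conf
```

A process that hits this ceiling fails its next memory allocation, so the leak would stop at the limit and not spread to the rest of the machine.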

Thanks! So far we haven't been able to reproduce anything like that, so being able to see what's going on there would be a big help in sorting out what's going on and why that's occurring.