High usage of memory and high number of system processes

After upgrading all sites from PHP72 to PHP74 on a Virtualmin server, we see that many sites return HTTP error code 500. Those sites have many /bin/php74-cgi processes running, and we suspect the cause of the 500 errors is that the Virtualmin site's Linux userid is exceeding its process limit. We also see a much larger number of system processes and a huge increase in memory usage.
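For reference, a quick way to make the per-user pile-up visible (a sketch; the binary name is taken from this report):

$ ps -C php74-cgi -o user= | sort | uniq -c | sort -rn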

Status: Active
Virtualmin version: 6.16
Webmin version: 1.973

Comments

What gets logged to the logs/error_log file under the domain's home directory when you see one of those 500 errors?

Please help us by telling us how to change the /home/*/fcgi-bin/php7.4.fcgi script so its last line says "exec /bin/php74-cgi" instead of "/bin/php74-cgi".

On our server, 80 sites are incorrectly missing the "exec", and 2 sites correctly have the "exec". The ones missing the "exec" look like this:

#!/bin/bash
ulimit -u 60
ulimit -v 1048576
PHPRC=$PWD/../etc/php7.4
export PHPRC
umask 022
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=99999
export PHP_FCGI_MAX_REQUESTS
SCRIPT_FILENAME=$PATH_TRANSLATED
export SCRIPT_FILENAME
/bin/php74-cgi

Note the last line should say "exec /bin/php74-cgi". How do we correctly fix this?

(We determined the missing "exec" is the cause of our problem--it means that when Apache sends a "terminate" signal to its worker processes, the bash script exits, leaving the php74-cgi process running. The orphan php74-cgi processes pile up, causing (1) some sites to fail due to reaching the limit of allowed processes for that site, and (2) server-wide memory exhaustion.)
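To illustrate the mechanism with a minimal sketch (the script name is hypothetical, and sleep stands in for php74-cgi):

$ cat orphan-demo.sh
#!/bin/bash
# a wrapper without "exec": bash forks the child and waits for it
sleep 600

$ bash orphan-demo.sh &
$ kill -TERM $!            # what Apache sends when terminating a worker
$ pgrep -af 'sleep 600'    # the child is still running, now orphaned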

Thanks for your attention.

Virtualmin will always generate that script with exec, unless you have a custom script defined at System Settings -> Server Templates.

On our server, 80 sites are using a custom template. Could you recommend how to add the missing "exec" to /home/*/fcgi-bin/php7.4.fcgi while keeping the ulimit settings?

Thanks.

Did you manually add the ulimit settings to the php7.4.fcgi file, or was this done via Virtualmin?

It was done via Virtualmin.

When we created a virtual server, we selected the custom template.

CPU and memory resource limits section in the custom template:
Maximum number of processes: At most 40
Maximum size per process: At most 500 MB
Maximum CPU time per process: At most 60 seconds

After the virtual server was created, we used "Resource Limits" to adjust the "Memory and CPU limits" as the server required more resources.

We need the php7.*.fcgi bash script to contain the ulimit statements (to limit resources), followed by the "exec" (to ensure that Apache can terminate its workers without creating orphans that will bring down the server). Can you confirm we can have both?
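For concreteness, the wrapper we are asking for would simply combine the two forms quoted above (a sketch built from the scripts shown in this thread):

#!/bin/bash
ulimit -u 60               # process limit kept
ulimit -v 1048576          # memory limit kept
PHPRC=$PWD/../etc/php7.4
export PHPRC
umask 022
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=99999
export PHP_FCGI_MAX_REQUESTS
SCRIPT_FILENAME=$PATH_TRANSLATED
export SCRIPT_FILENAME
exec /bin/php74-cgi        # exec kept so SIGTERM reaches PHP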

We did Virtualmin "Create Virtual Server" using the "Default Setting". The result is:

$ cat /home/alextest1/fcgi-bin/php7.3.fcgi
#!/bin/bash
PHPRC=$PWD/../etc/php7.3
export PHPRC
umask 022
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=99999
export PHP_FCGI_MAX_REQUESTS
SCRIPT_FILENAME=$PATH_TRANSLATED
export SCRIPT_FILENAME
exec /opt/rh/rh-php73/root/usr/bin/php-cgi

Note the last line starts with "exec".

Now we did a Virtualmin "Edit Virtual Server" and changed the Maximum number of processes from unlimited to 60. The result:

$ cat /home/alextest1/fcgi-bin/php7.3.fcgi
#!/bin/bash
ulimit -u 60
PHPRC=$PWD/../etc/php7.3
export PHPRC
umask 022
export PHP_FCGI_CHILDREN
PHP_FCGI_MAX_REQUESTS=99999
export PHP_FCGI_MAX_REQUESTS
SCRIPT_FILENAME=$PATH_TRANSLATED
export SCRIPT_FILENAME
/opt/rh/rh-php73/root/usr/bin/php-cgi

Note that the last line is incorrectly missing the "exec". Without it, when the httpd daemon tries to terminate its workers (e.g. on a graceful restart), the result is orphaned processes and eventual server memory exhaustion. This is a serious bug. The "exec" must be there.

How do we get both limits on process counts and the correct "exec"? In other words, why does adding a ulimit result in the "exec" statement being removed from the last line?

Around line 809 in /usr/libexec/webmin/virtual-server/php-lib.pl, a comment says:

If using process limits, we can't exec PHP as there will be no chance for the limit to be applied :(

The /usr/libexec/webmin/virtual-server/php-lib.pl code then proceeds to remove the "exec" from the fcgi bash script.

Removing the "exec" is catastrophic: Apache sends SIGTERM to its workers when it wants to terminate them. Without the "exec", the SIGTERM goes to the fcgi bash script instead of to the php workers, which terminates the bash script and leaves the php worker processes orphaned. More and more orphaned php workers pile up. This leads to (1) sites with "ulimit -u" set start returning http code 500 because they are unable to fork even one php worker, the site's orphaned php workers having reached the process limit; and (2) the orphaned workers continue accumulating until all server memory is exhausted. This catastrophe is entirely due to removing the "exec" and is unacceptable.

Please note that process limits and "exec" can work together, contrary to the comment in the code. Here is an example to illustrate:

$ cat test.ulimit.sh
#!/bin/bash
#
# First show us the maximum number of processes
#
grep 'Max processes' /proc/$$/limits
#
# Next set maximum number of processes to some crazy number
#
ulimit -u 112
#
# Again show us the maximum number of processes
#
grep 'Max processes' /proc/$$/limits
#
# Now exec "grep" and ask it to show us the maximum number of processes
#
exec grep 'Max processes' /proc/$$/limits
#
echo due to the "exec" this line will never be reached

$ sh test.ulimit.sh
Max processes             4096                 281121               processes
Max processes             112                  112                  processes
Max processes             112                  112                  processes

Above, you can see we set "ulimit -u 112". Then we "exec grep", and the ulimit is still the same. There is no problem with setting a process ulimit and then exec'ing.

Please fix the catastrophic problem, ensuring that the "exec" in fcgi scripts is never removed.

Much appreciated.

Submitted by Ilia on Fri, 06/04/2021 - 09:09


Jamie, I can confirm that saving Administration Options ⇾ Edit Resource Limits: Resource Limits removes the exec part from the wrapper scripts. This is a bug!

Note the last line should say "exec /bin/php74-cgi". How do we correctly fix this?

Those files are marked as immutable, so you would need to run chattr -i filename on the file before being able to edit and save it.
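To patch many sites at once, something like this would work (a sketch, run as root; it assumes the last line is exactly "/bin/php74-cgi", as in the scripts quoted above):

for f in /home/*/fcgi-bin/php7.4.fcgi; do
    chattr -i "$f"                                          # clear the immutable flag
    sed -i 's|^/bin/php74-cgi$|exec /bin/php74-cgi|' "$f"   # add the missing exec
    chattr +i "$f"                                          # restore the immutable flag
done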

Oh right, I totally forgot about that code. Now that I look into it, the reason the exec is removed is that otherwise there's no way to enforce the limit on the number of processes the user can run, because the process is already running.

Let me try and understand what you are trying to say. We have an fcgi file which says

exec .../php..

We add a ulimit to that file and you change that line to

.../php...

And you are saying that the reason the exec was removed was so that a running process can take advantage of the new ulimit.

However, do you not have to restart the process if you change from exec to no exec? Which process are you talking about?

Not having the exec statement brings the Virtualmin server down: it exhausts all memory and causes 500 errors.

Are you now saying things are working as designed?

In the file php-lib.pl, your comment around where the "exec" statement is removed talks about setting the ulimit for the wrapper script.

I wonder if you could answer this question. Let's say I want to set the ulimit to 60 processes. I do this, and the exec statement is removed from the script. This requires the script to re-run, since you are now changing how PHP gets invoked. Let's say this has happened and we are running in this state. Now I want to change the limit from 60 to 80 because this is a busy site, so the fcgi file is changed to reflect the new value. The comments in the script imply that the exec was removed so that we can take advantage of this new scenario. There is a way of changing the limits of a running process, and that is by invoking the prlimit command.

Do you use the prlimit command to do this?

I did a simple search and could find no indication that prlimit is used.

Therefore, to change the limit from 60 to 80, the fcgi script would need to be re-invoked; your comments quite clearly state that this is a wrapper script.

If the process needs to be re-invoked for the above change anyway, why was the "exec" statement removed?

Ted

So the exec statement will cause the PHP command to be run in the same process as the .fcgi wrapper, rather than launching a new process. Since the ulimit is also set in that same process, it won't be able to stop the PHP process from executing, since no new process is being started - and hence, the limit on the number of processes will have no effect.

The only way around this would be to somehow check in the .fcgi script if the process limit has already been hit, and exit at that point.
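A hedged sketch of what such a check might look like (hypothetical, not Virtualmin's actual code; the threshold and error handling are illustrative only):

#!/bin/bash
MAX_PROCS=60
# refuse to spawn if the user is already at the cap; mod_fcgid will
# report the failed spawn to the client as an error
if [ "$(pgrep -c -u "$(id -un)")" -ge "$MAX_PROCS" ]; then
    exit 1
fi
ulimit -v 1048576      # other limits can still be applied here
exec /bin/php74-cgi    # exec kept, so SIGTERM reaches PHP directly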

Getting back to the original problem: if we want to limit a site, how do we handle the orphaned processes?

So, even without the exec line, the only extra processes should be shell scripts that launch PHP - on their own, they should only consume a small amount of RAM.

Do the actual PHP processes never exit, even after Apache is restarted? Normally they have a timeout before exiting, which you can set in Virtualmin on the PHP Options page.

Submitted by Ilia on Tue, 06/15/2021 - 15:54

I have taken a closer look at this issue. Jamie, please have a closer look too at this Edit Resource Limits functionality, as there are a few bugs I could find (described below).

At first, in addition to the initial bug report -- in my tests the actual error was coming not from having exec removed; the actual problem, which generated the error (and prevented a simple PHP script from running), was presumably incorrect values for ulimit -v. I assume the values being set happen to be too low? Perhaps the units are different? However, ulimit expects values in kB, so they should be correct. This still remains unclear to me, but adding a few extra zeros to the memory limit in the FCGI wrapper script made the PHP page work for me. Could this be related to the PHP memory_limit option? However, if the overall memory limit set on the Edit Resource Limits page is higher than the system can provide, it still wouldn't work and would (apparently) fail silently, as there would not be enough memory to allocate and the kernel would kill the process. Perhaps adding more robust error logging would help?
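On the units question, ulimit -v does take kilobytes, which is easy to check from a shell (1048576 kB should show up as 1073741824 bytes):

$ bash -c 'ulimit -v 1048576; grep "Max address space" /proc/$$/limits'
Max address space         1073741824           1073741824           bytes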

@Ted, what happens if, using the console, you run su - username and then check the limits? Can you even switch to that user?

Jamie, overall I don't understand why we are using ulimit inside the FCGI wrapper script at all (passing the same values). Limits for the user are already imposed via the /etc/security/limits.conf configuration file, right? Besides, the user cannot raise the limits for itself (only set lower values -- which would make sense in this case), as they are imposed by the administrator, and this FCGI script runs as the domain owner. Moreover, Apache is already imposing limits on the process with the RLimitNPROC, RLimitCPU and RLimitMEM directives. Could you please explain why it's needed? Do you want to impose limits on the given process? If so, how can the whole memory allowance for a user be assigned to a single process/script directly? It would make sense to give a script, say, 10% of all the memory that goes to the user. Perhaps we shouldn't use ulimit -v at all? Am I missing something here?
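One way to see which of these layers actually wins is to inspect a live process directly (a sketch; the user and binary names are taken from this thread):

# for each of the user's running php-cgi processes, show the limits it got
for pid in $(pgrep -u klub -f php74-cgi); do
    grep -E 'Max processes|Max address space' "/proc/$pid/limits"
done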

The bugs (all tested in FCGI mode):

  1. When limits are set for the very first time using the Edit Resource Limits page, the FCGI wrapper script is not getting modified (adding ulimit and removing exec), while limits.conf and the Apache virtualhost.conf are getting modified as expected. It takes a few tries to reproduce, as sometimes it works. Try toggling a few times from unlimited to some values, back to unlimited, and again to some values.
  2. The FcgidMaxRequestLen directive is getting removed upon adding the limits - is this expected?
  3. When disabling limits, the limits are only getting removed correctly from the limits.conf file. The Apache virtualhost.conf file is left with RLimitNPROC, RLimitCPU and RLimitMEM intact - looks like a bug?
  4. When disabling limits, the FCGI wrapper file (in my case php7.4.fcgi) at some point lost its +x execute permission, leaving things in a non-working state, even with the default, previously working wrapper script.
  5. And the cherry on top, not related to limits directly but indirectly -- the postsave.cgi page displayed afterwards has a link to edit databases (the "Manage and create databases" link) even when the database feature is not enabled for the given virtual server.

It might be useful to start with our initial problem: we were worried that on a Virtualmin server with many web sites, a handful of sites might unexpectedly see a high http transaction rate and/or http transactions taking longer to complete, starving the other web sites on the same server of Apache workers and causing denial of service or service degradation for the remaining sites.

Now I would welcome any comment you have about this initial problem. Perhaps it is not a real problem ?

For the above described problem, we have for years coded the "Maximum number of processes" option in Virtualmin for each web site. Some web sites are set to 20 maximum processes, some to 30, and some to 60, depending on our gut feel of how big and heavy the web site is. Our thinking was that this would act as a wall, limiting a busy site to a maximum number of Apache worker processes, ensuring that other sites have enough Apache workers not to suffer performance degradation.

Now I would welcome any comment you have about this solution to the stated problem.

Unfortunately, coding "Maximum number of processes" has serious consequences: (1) more and more orphaned idle Apache processes are created, eventually crashing the server due to memory starvation, and (2) any site where maximum processes is coded will start consistently returning error code 500 on every http transaction once the number of orphaned idle Apache processes reaches the site's maximum. Both (1) and (2) are catastrophic conditions, and both happen because with "Maximum number of processes" coded, Virtualmin removes the "exec" in the Apache worker wrapper script--so when Apache sends a termination signal (SIGTERM) to the bash wrapper script, the wrapper terminates, leaving its child Apache worker process orphaned.

(Since before 2015 we have "temporarily" dealt with massive quantities of orphaned Apache worker processes by regularly running a script that terminates orphaned Apache processes. This is not a good operational practice. And it recently bit us in the ass.)
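For illustration, a minimal sketch of that kind of cleanup (hypothetical, not the actual script): kill php-cgi processes whose parent has died and which have therefore been reparented to PID 1.

ps -eo pid=,ppid=,comm= | awk '$2 == 1 && $3 ~ /php.*cgi/ {print $1}' | xargs -r kill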

It may also be that using "maximum number of processes" is not giving us what we really intended. We want a busy site (i.e. high http transaction rates and/or http transactions taking longer) to suffer the resulting performance degradation, while other web sites on the same server are unaffected. It sounds like what might happen with "maximum number of processes" is that the busy web site will start seeing apparently random http 500 errors? That would not be desirable, and is not what we thought we were getting with an option named "maximum number of processes".

Your expertise and feedback on any of these issues is appreciated. If there is a better way of doing things, we are all ears.

I'm looking above to see what questions you are asking us at this point in time. I'm assuming most of the questions/comments are for other members of Virtualmin? Unless I have misunderstood, I think only this question is addressed to us:

what happens if, using the console, you run su - username and then check the limits?

Here is what happens ...

$ su -s /usr/bin/bash - klub
Last login: Wed Jun 16 15:00:33 EDT 2021 on pts/0
[klub@myweb ~]$ ulimit -a
core file size          (blocks, -c) 0
data seg size           (kbytes, -d) unlimited
scheduling priority             (-e) 0
file size               (blocks, -f) unlimited
pending signals                 (-i) 281121
max locked memory       (kbytes, -l) 64
max memory size         (kbytes, -m) unlimited
open files                      (-n) 122880
pipe size            (512 bytes, -p) 8
POSIX message queues     (bytes, -q) 819200
real-time priority              (-r) 0
stack size              (kbytes, -s) 8192
cpu time               (seconds, -t) 600
max user processes              (-u) 30
virtual memory          (kbytes, -v) 1048576
file locks                      (-x) unlimited

It appears that the Virtualmin "maximum number of processes" we set becomes the "ulimit -u" as soon as we log in to the account.

Let us know if you have any other questions.

Thanks for your time,

Alex Nishri, working with Ted Sikorski on the problem.

It might also be useful for you to see the following ...

$ grep klub /etc/security/limits.conf
klub hard as 1048576
klub hard nproc 30
klub hard cpu 10
@klub hard as 1048576
@klub hard nproc 30
@klub hard cpu 10

I see one more question you are asking us ...

Do the actual PHP processes never exit, even after Apache is restarted?

What we see is orphaned idle Apache workers. I say idle because an strace shows they are in a perpetual wait, never getting new work.

This starts with the Apache parent terminating what it thinks is its Apache worker process by sending a terminate signal (SIGTERM). Except the SIGTERM actually goes to the wrapper script, which terminates and leaves its child Apache worker orphaned. That orphaned process never gets any more http transactions to handle, so no PHP script is going to be started in the orphaned idle Apache worker process - and hence no PHP timeout limits apply.

Regarding the limits.conf file - this only applies when a user logs in via SSH, and so won't have any impact on FCGI scripts.

I wonder if the best solution is to instead switch to FPM to run PHP scripts, and set up limits there?
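For illustration, per-site caps in FPM live in the pool definition; something like this (a hypothetical sketch -- pool name, paths and values are placeholders, not a recommendation):

; /etc/php-fpm.d/klub.conf -- per-domain pool
[klub]
user = klub
group = klub
listen = /run/php-fpm/klub.sock
pm = dynamic
; hard cap on PHP workers for this site only
pm.max_children = 30
pm.start_servers = 2
pm.min_spare_servers = 1
pm.max_spare_servers = 4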

We are still waiting for answers to the questions we posed above,

[1] We are worried that on a Virtualmin server with many web sites, a handful of sites might unexpectedly see a high http transaction rate and/or http transactions taking longer to complete, starving the other web sites on the same server of Apache workers and causing denial of service or service degradation for the remaining sites. For this problem, we have for years coded the "Maximum number of processes" option in Virtualmin for each web site. Some web sites are set to 20 maximum processes, some to 30, and some to 60, depending on our gut feel of how big and heavy the web site is. Our thinking was that this would act as a wall, limiting a busy site to a maximum number of Apache worker processes, ensuring that other sites have enough Apache workers not to suffer performance degradation. Is this the best way to prevent a small number of sites with unexpectedly high http transaction rates and/or slower http transactions from starving the other web sites of Apache workers?

[2] Is coding "Maximum number of processes" giving us the desired result?

[3] Unfortunately, coding "Maximum number of processes" has serious consequences: (a) more and more orphaned idle Apache processes are created, eventually crashing the server due to memory starvation, and (b) any site where maximum processes is coded will start consistently returning error code 500 on every http transaction once the number of orphaned idle Apache processes reaches the site's maximum. Both (a) and (b) are catastrophic conditions, and both happen because with "Maximum number of processes" coded, Virtualmin removes the "exec" in the Apache worker wrapper script--so when Apache sends a termination signal (SIGTERM) to the bash wrapper script, the wrapper terminates, leaving its child Apache worker process orphaned. Can you fix "Maximum number of processes" so it does not have these catastrophic consequences?

Thanks for your time,

Alex Nishri, working with Ted Sikorski on the problem.