Apache 2.2.15 + mod_fcgid 2.3.7 (default Virtualmin installation): graceful restarts generate errors in both the browser and the error log [#30033]

Submitted by gpetrov on Sat, 09/14/2013 - 01:04

Hi,

I posted this in the forum a few days ago (https://www.virtualmin.com/node/29913), but since I have not received any feedback there, I am also submitting it here.

I can also confirm that I am able to reproduce the same error on more than one setup, with both CentOS 5 and 6.

There is a problem (most likely in mod_fcgid) that causes errors in both the error log and the client browser for processes that are still running during a graceful restart.

Can you please confirm you get the same errors in your setups?

I am still digging through the mod_fcgid code; maybe I'll come up with a fix.

What is the setup:

CentOS 6.4 x86_64 minimal installation
Virtualmin 4.02.gpl GPL installed by the automatic .sh script, all default settings
mod_fcgid.x86_64 2.3.7-1.el6 from the virtualmin repo
httpd.x86_64 1:2.2.15-29.el6.vm.1 from the virtualmin repo
Single virtual server, running under the default FCGId execution mode, with the default of 90 sec php execution time
Single test.php file containing:

<?php
for($i = 1; $i <= 30; $i++) {
   echo $i."\n";
   sleep(1);
}
?>

What is the error:

Run the script via the browser, then do a graceful restart of Apache (service httpd graceful). After around 12 seconds you will see a "No data received" error in your browser (Chrome) and the following in the Apache error log for that virtual server:

(22)Invalid argument: mod_fcgid: can't lock process table in pid 25570 (the pid number will be different)

Further experiments show that this script gets forcefully killed before ending.

If you reduce the time the script executes to 5 seconds ($i <= 4), you'll get the same result, this time after 5 seconds.

Further experiments show this process completes, but you still get the errors both in the browser and the error log.

This is most likely a problem in mod_fcgid, not in Virtualmin itself.

The first tweak to the experiment was to add a file write to the script, to show which run completes and which gets killed before that. That is how I got the results above.

Add this inside the loop:

file_put_contents("test.txt", "test run for: ".$i." seconds");

So why 12 seconds, and where is this set? After some time I discovered that increasing FcgidErrorScanInterval to 60 lets the second process complete (but you still get the errors).
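For anyone repeating the experiment, here is a minimal sketch of the test script with the progress write folded into the loop. The function name run_probe, the $delay parameter, the __DIR__-based path, the trailing " (completed)" marker and the CLI guard are illustrative additions, not part of the original test.php:

```php
<?php
// Sketch of test.php with a progress marker. run_probe(), $delay and the
// " (completed)" suffix are illustrative additions, not the original script.
function run_probe($seconds, $marker, $delay = 1) {
    for ($i = 1; $i <= $seconds; $i++) {
        echo $i . "\n";
        // Overwrite the marker on each pass, so it always holds the last second reached.
        file_put_contents($marker, "test run for: " . $i . " seconds");
        sleep($delay);
    }
    // Reached only if the process was not killed mid-loop.
    file_put_contents($marker, " (completed)", FILE_APPEND);
}

// Run the 30 second loop only when served through the web server (for example
// the cgi-fcgi SAPI), not when the file is loaded from the command line.
if (PHP_SAPI !== 'cli') {
    run_probe(30, __DIR__ . "/test.txt");
}
```

If test.txt ends with " (completed)", the run finished. If it stops at an earlier line such as "test run for: 12 seconds", the process was killed inside the loop.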

If you check the mod_fcgid code in fcgid_pm_main.c, the graceful restart should be performed by the function kill_all_subprocess(), but evidently scan_errorlist() is also executed, even though there is a check for procmgr_must_exit().

The error in the log "can't lock process table in pid 25570" probably means that some information about the process is destroyed immediately upon the graceful restart, so we will never get the result back.

Even if we get around the early termination of the processes by increasing FcgidErrorScanInterval, the second problem is actually bigger: all your users are going to see this error.
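For reference, the FcgidErrorScanInterval workaround would look roughly like the fragment below. The file location is an assumption (on a stock CentOS install the mod_fcgid settings usually live somewhere under /etc/httpd/conf.d/). As far as I can tell from the mod_fcgid documentation, the directive is server-wide rather than per virtual host and defaults to 3 seconds. The value 60 is simply the one tried in this report, not a recommendation:

```apache
# Assumed location: a server-wide file such as /etc/httpd/conf.d/fcgid.conf
<IfModule mod_fcgid.c>
    # Seconds between scans for processes pending termination.
    # 60 is the value tried in this report; the documented default is 3.
    FcgidErrorScanInterval 60
</IfModule>
```

Apache has to be restarted for the change to take effect. As noted above, this only delays the forced kill; the browser and error log errors remain.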

Do you get the same? So far I can propose to:

1. Try to fix this problem and deploy a custom version of mod_fcgid.
2. Keeping in mind that mod_fcgid cannot share the APC cache and is probably doomed long term, push the php-fpm + Apache implementation. Both Apache 2.2 and 2.4 are possible.

Thanks for your time!

Status:

Active

Comments

Submitted by andreychek on Sat, 09/14/2013 - 09:55 Comment #1

Howdy -- thanks for the report about FCGID.

I was able to reproduce it on a 64 bit system, though the problem didn't occur on my 32 bit test system.

I did some digging on Google, and ran into a lot of people experiencing that issue.

I found a bug report, which appears to be for that issue, here:

https://issues.apache.org/bugzilla/show_bug.cgi?id=50309

And a patch for it here:

https://issues.apache.org/bugzilla/attachment.cgi?id=27982&action=diff

There are a few issues in resolving that, though... In general, we go out of our way not to supply custom packages. We typically suggest filing a bug report with the vendor regarding issues like this.

Providing an option for PHP-FPM is something we've been considering, though it may still be some time before that implementation is completed.

One thing you could try in the meantime is to see if this EPEL version resolves the issue you're seeing. It appears to include the patch from that bug report. The EPEL FCGID RPM is available here:

http://pkgs.org/centos-6-rhel-6/epel-x86_64/mod_fcgid-2.3.7-1.el6.x86_64...

Does that RPM resolve the issue you're seeing?

Submitted by gpetrov on Mon, 09/16/2013 - 03:42 Comment #2

Hello andreychek and thanks for the time you spent on this!

The issue described in the link you provided is a different one, and the patch was already applied in mod_fcgid 2.3.7.

As you said, that issue was widely discussed, as it caused Apache to crash and leave behind processes that eventually fill all available memory and make the whole server unresponsive. It should have been fixed, though. I didn't make that clear in the first post, sorry about that.

We are facing another problem, which I suppose is somehow connected to the old one. As you can see, Apache did not crash and there are no leftover processes, since all of them are killed gracefully or forcefully at some point. So far my digging in the code shows that the mutex file is flushed by the parent process upon graceful restart, which makes it impossible for the still-running child processes to return their result when they end.

Another problem is that scan_errorlist() kills the still-running processes way too soon, not letting them end gracefully. scan_errorlist() should not apply to processes which have already been TERM-ed. There is a mechanism which forcefully kills processes at some point if they do not end gracefully, but the time to wait is hardcoded to 8 seconds. I would propose exposing that as a new config directive, FcgidGracefulWait, so you have some control over it.

It is interesting that the problem does not show up on 32 bit systems. I have never tested it on a 32 bit system.

I have also sent this issue to the Apache bugs and users mailing lists; so far no replies.

Unfortunately I have no experience with Apache modules whatsoever... but anyway I will try to dig into it and propose a fix.

As for PHP-FPM, I would strongly encourage you to implement it. I would suggest making it an optional (non-default) choice, to avoid the pain of testing/reworking all the script installers or breaking anything. I can help with early testing. Apache 2.4 is another good addition (with mpm_event and mod_proxy_fcgi); there are still a few bugs there, but the community seems to fix these very fast.