Backup with >170 domains silently fails. Too many open files (CLOSE_WAIT to Rackspace container)

One of my Virtualmin scheduled backups to Rackspace seemed to be completing successfully (all domain .tar.gz files present), but it was not saving the log, sending the results email, or deleting the old backups. The server has just over 700 sites (mainly aliases).

I have tracked the cause down to a combination of two bugs during the "delete old backups" phase:

First: in the rs_http_single_call function in /usr/share/webmin/virtual-server/rs-lib.pl, an HTTP connection is opened to the Rackspace server and an API call is issued. On success the connection is closed and the response is returned. On error, however, the function returns an error code but the connection is not closed and remains in the CLOSE_WAIT state. There are 5 places where this function can return without closing the connection.
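
A minimal sketch of the fix pattern (illustrative names only, not the actual rs-lib.pl internals): route every failure through a small helper that closes the connection before returning, so no error path can leave a socket in CLOSE_WAIT. The helper assumes Webmin's close_http_connection function is available, and the return convention shown is purely for illustration.

        # Close the connection before propagating any error (sketch only)
        sub fail_and_close
        {
        my ($h, $err) = @_;
        &close_http_connection($h) if ($h);     # assumed Webmin helper
        return (0, $err);
        }

        # Error paths inside the API call would then look like:
        #   return &fail_and_close($h, "Invalid HTTP response") if (!$line);
        # instead of returning the error directly and leaking the socket.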

Second: the purge_domain_backups function in /usr/share/webmin/virtual-server/backups-lib.pl loops through all files in the Rackspace container. For each file rs_stat_object is called. If the file is a .gz file it is deleted, along with its .dom and .info files. The next loop iteration is the .dom file (which was just deleted), so when rs_stat_object is called a 404 is correctly returned, but this is considered an error by rs_http_single_call, which then leaves the connection open as described above. Could this function be refactored to move the rs_stat_object call into the conditional block that follows it, which is not executed for .dom and .info files?

After deleting about 170 files the process has over 1,000 connections in CLOSE_WAIT (roughly 170 × 6 leaked connections, as detailed below, which is enough to exhaust the typical 1,024 per-process open-file limit). The OS then refuses to allow any more files to be opened, causing the process to crash and exit silently. Original forum thread here.

Status: 
Active

Comments

Thanks for your bug report! Jamie will get back with you regarding that.

Thanks for pointing this out - we'll fix the unclosed HTTP connections in the next release.

About the purge function - it shouldn't matter that the .dom and .info files were removed, as they are skipped when deleting anyway via this line:

                if ($f =~ /^$re($|\/)/ && $f !~ /\.(dom|info)$/ &&

Hi, thanks for looking into this. I have been doing some further investigation and testing...

Issue #1: I'm glad you agree the HTTP connections should be closed. I have patched this locally and it has allowed the backup to run without crashing; however, 12 hours in it is still deleting the old backups, and my monitoring shows it is only deleting 1 per minute. I believe this slow processing is due to the next issue:

Issue #2: What you suggest about the .dom and .info files being skipped makes a lot of sense, as this is how we would expect it to function. However, from my earlier testing and debug output I can confirm this is only half true. The filename check you quoted will prevent a deletion attempt on those files, but if you look at the first line of the loop, rs_stat_object is still called on the file before that filename check:

        foreach my $f (@$files) {
                local $st = &rs_stat_object($rsh, $host, $f);
                next if (!ref($st));
                local $ctime = int($st->{'X-Timestamp'});
                if ($f =~ /^$re($|\/)/ && $f !~ /\.(dom|info)$/ &&
                    $f !~ /\.\d+$/) {
                        # Found one to delete
                        $mcount++;
                        next if (!$ctime || $ctime >= $cutoff);
                        local $old = int((time() - $ctime) / (24*60*60));
                        &$first_print(&text('backup_deletingfile',
                                            "<tt>$f</tt>", $old));
                        local $err = &rs_delete_object($rsh, $host, $f);
...

As the result of this call ($st and $ctime) is not needed until after the filename test passes, I propose moving the first 3 lines of the loop inside that if block, as follows:

        foreach my $f (@$files) {
                if ($f =~ /^$re($|\/)/ && $f !~ /\.(dom|info)$/ &&
                    $f !~ /\.\d+$/) {
                        # Found one to delete
                        local $st = &rs_stat_object($rsh, $host, $f);
                        next if (!ref($st));
                        local $ctime = int($st->{'X-Timestamp'});
                        $mcount++;
                        next if (!$ctime || $ctime >= $cutoff);
                        local $old = int((time() - $ctime) / (24*60*60));
                        &$first_print(&text('backup_deletingfile',
                                            "<tt>$f</tt>", $old));
                        local $err = &rs_delete_object($rsh, $host, $f);
...

I believe this will actually have the effect of removing 6 unneeded API calls per site and speed up the process immensely, because when rs_stat_object is called on a non-existent file, rs_http_call receives a 404 and then makes 3 API attempts with a 10 second delay after each "fail" (confirmed with strace). This is what was leaving 6 open connections per domain (3 for the .dom + 3 for the .info). Now that patch #1 is in place the connections are closed, but those 6 requests are still made, adding a 60 second delay per site.

(Maybe a 404 response should not be considered an error, as it is a valid API response and unlikely to change when the same call is retried. But maybe this is beyond the scope of this issue, involves changing a more widely shared library, and would have no impact here if we remove the unnecessary calls anyway. Just a thought.)
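
To illustrate the thought, here is a minimal sketch of a retry wrapper that treats 404 as a final, valid answer rather than a transient failure, so it neither retries nor sleeps. The names and return convention are illustrative and not taken from the shared library:

        # Retry only transient failures; 200 and 404 are both final answers
        sub call_with_retries
        {
        my ($call, $tries, $delay) = @_;        # $call returns ($code, $data)
        my ($code, $data);
        for(my $i = 0; $i < $tries; $i++) {
                ($code, $data) = &$call();
                last if ($code == 200 || $code == 404);
                sleep($delay);                  # wait before the next attempt
                }
        return ($code, $data);
        }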

I am going to apply proposed fix #2 here and I am expecting it to reduce the backup time by several hours!

Thanks for pointing that out as well - you're right that those stat calls are unnecessary. I'll re-order the code to fix this.

Brilliant :) After applying those two fixes here, last night's backup completed in just over 2 hours. The completion email was sent, the log was saved, and all backups older than 30 days were deleted.

.. deleted 706 old backups, and skipped 590 that were not old enough

The skipped count doesn't seem high enough; I would expect it to be around 706*30 = 21,180.

It does add up on my other server: ".. deleted 21 old backups, and skipped 610 that were not old enough". 21*30 = 630, so accounting for a few failures, it's pretty close.

I will keep an eye on this over the next few days and give a final conclusion shortly.

Thanks for the awesome product :)

Still running fine after a couple more days :)

I have investigated why the skipped count appears consistently low, and it seems that rs_list_objects only returns the first 10,000 files. This is a Rackspace API limitation; there are workarounds.

While this isn't actually causing a problem, because the files to delete happen to be within the first 10,000 returned, I can see it could cause trouble if the files didn't happen to come back in the right order, or if I had another 600 domains on the box! A sketch of one workaround (paging the listing) is below.
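
For reference, a minimal sketch of the usual workaround: page through the Cloud Files (Swift) container listing with the API's limit and marker query parameters, resuming each request after the last object name seen. This is illustrative code, not the rs_list_objects implementation; the storage URL and auth token are assumed to come from a prior authentication call.

        use strict;
        use warnings;
        use LWP::UserAgent;
        use URI::Escape;
        use JSON::PP;

        # Return all object names in a container, fetched 10,000 per request
        sub list_all_objects
        {
        my ($storage_url, $auth_token, $container) = @_;
        my $ua = LWP::UserAgent->new();
        my (@all, $marker);
        while(1) {
                my $url = "$storage_url/$container?format=json&limit=10000";
                $url .= "&marker=".&uri_escape($marker) if (defined($marker));
                my $resp = $ua->get($url, 'X-Auth-Token' => $auth_token);
                return undef if (!$resp->is_success);
                my $page = decode_json($resp->decoded_content);
                last if (!@$page);                       # No more objects
                push(@all, map { $_->{'name'} } @$page);
                $marker = $page->[-1]->{'name'};         # Resume after this name
                }
        return \@all;
        }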

Thanks, I didn't know about that 10,000-file limit! The next release of Virtualmin will increase it to 1M files.