Virtualmin backups do not log which servers failed (empty string), which also results in cleaning the incremental dir

Since the last update of the Virtualmin backups (which resolved the problems with overwriting changes made to a virtual server during backup), we have discovered a possible bug:

Sometimes, during backup, when some of the virtual servers fail, their names (and IDs) are not logged, and we get something like this at the end of the backup:

xxx servers backed up successfully, 1 had errors.

Servers with errors:

This is probably due to the domain data for the failed domain not being correctly added to the @errdoms array.

This then results in something very bad: at the end of the backup, Virtualmin removes the corresponding file for each failed domain from the incremental dir, which should force a full backup of that domain on the next run (around line 1467 in backups-lib.pl). When this is executed with an empty domain ID, it actually unlinks the whole incremental dir, thus forcing a full backup of all domains next time.
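For illustration, this is roughly what happens when the failed domain's record ends up with an empty ID (a simplified standalone sketch - the path is made up and this is not the actual Virtualmin code):

my $incremental_backups_dir = "/etc/webmin/virtual-server/incremental";   # example path only
my $d = { 'id' => '' };    # failed domain whose data was never filled in
my $path = "$incremental_backups_dir/$d->{'id'}";
print "$path\n";    # prints ".../incremental/" - the directory itself
# so the unlink at that point removes the whole incremental dir,
# not just one per-domain file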

We do not see this every time; sometimes the domain name (and the ID and all the other domain data) is logged correctly. Also, sometimes a single backup log contains both empty names of failed domains and real ones, like this:

xxx servers backed up successfully, 3 had errors.

Servers with errors:  faileddomain2.com faileddomain3.com

There is a blank space before faileddomain2.com, which is actually the empty failed domain.

It probably depends on which feature failed - some feature(s) do not log the domain correctly when they fail.

We see this problem on multiple installations; it is not specific to a particular installation or domain name.

A quick fix to avoid a full backup every time is to comment out the line that unlinks the file in the incremental dir. But the bug should still be tracked down and fixed at its origin.

Thanks for the time spent on resolving this. We would also like to help with further research, but please send us some ideas on where to look for this bug.

Thanks!

Status: 
Closed (fixed)

Comments

OK, we have actually done some more research and discovered this in the backup log:

Creating backup for virtual server  ..
.. some domain details missing!

.. completed in 0 seconds

Then we did some brainstorming, and I think we found the problem (not the exact code, but the principle):

Sometimes virtual servers fail in the backup because the virtual server was deleted after the backup started (the backup script lists the existing domains when it starts; if there are many domains and the backup takes something like 10 hours, it is common for some of them to be deleted before they are actually backed up).

Because of the in-memory domain list cache (which exists to avoid rereading the domains dir every time the domains have to be listed), sometimes we get a domain with failing features (missing Apache configuration, missing domain, etc.) when the data for this domain comes from the cache, or we get a failing non-existent domain with an empty name if the cache has expired.

In short: when a domain is deleted after the backup (of all domains) has started, and the domain list cache has expired, this domain fails and is logged with an empty name, which then results in unlinking the whole directory containing the incremental data... so the next backup is actually a full backup for every domain.

What this means:

  1. There should be a check that a domain still exists (i.e. was not deleted since the backup started) before it is backed up. Alternatively, the list of domains should be reread without the cache before each domain.
  2. The domain list cache should be invalidated when a domain is removed (otherwise it contains data for a non-existent domain).
  3. The code that removes the incremental file for a failed domain should check that the id is not empty (otherwise it actually deletes the whole dir).

Can you please confirm the problem? And can you suggest a patch fixing pt. 1 (checking whether the domain exists before backing it up) or pt. 3, as this would solve the problem for us before an official fix is released?

Thanks a lot for your time!

Here is a quick fix for pt. 3:

In backups-lib.pl, on line 1468, replace:

foreach my $d (@errdoms) {
    &unlink_file("$incremental_backups_dir/$d->{'id'}");
    }

with:

foreach my $d (@errdoms) {
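    # only remove the per-domain incremental file if we actually have an id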
    if($d->{'id'}) {
        &unlink_file("$incremental_backups_dir/$d->{'id'}");
        }
    }
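
And here is a rough, untested sketch of the pt. 1 idea - skipping a domain whose data file has disappeared since the backup started. The loop and the $domains_dir location are our assumptions, not the actual code from backups-lib.pl, so it would need to be adapted:

foreach my $d (@doms) {
    # skip domains whose data file was deleted after the backup started
    if (!$d->{'id'} || !-r "$domains_dir/$d->{'id'}") {
        print STDERR "Skipping $d->{'dom'} - deleted since the backup started\n";
        next;
        }
    # ... existing per-domain backup code ...
    }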

Thanks for pointing this out - your diagnosis is spot-on about domains being deleted before the backup gets to them. I will fix this properly in the next Virtualmin release (version 4.13).

Thanks a lot Jamie!

Does the quick fix look OK? I don't know whether $d->{'id'} will actually return an empty string, false or something else, and whether this kind of check is correct.
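If I understand Perl's truth rules correctly, both undef and the empty string (and also the string '0') count as false, so the check should skip the broken entry either way. A tiny standalone illustration (not Virtualmin code, just the same kind of test):

foreach my $d ({ }, { 'id' => '' }, { 'id' => '12345' }) {
    if ($d->{'id'}) {
        print "would remove incremental file for id $d->{'id'}\n";
        }
    else {
        print "would skip - no usable id\n";
        }
    }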

Here is another suggestion about the backups (should I post this as a feature suggestion in the forum?):

When using SSH mode for the backups, compress the file on the remote server: the local server just makes the tar, copies it to the remote (backup) server and starts the compression there. This will use more bandwidth, but save CPU. It makes sense for cases where the backup server has a fast local connection to the primary (Virtualmin) server and you want to save CPU. We have kind of achieved this manually by changing the compression to "tar only" for the backups and then running gzip on all .tar files once a day on the remote server using cron. But this has the drawback that we need a lot of spare space to accumulate the whole uncompressed backup before the gzip starts.

In this case we could even redirect the tar output directly to the backup server instead of creating a local file and then copying it (which would also reduce I/O on the Virtualmin server). There could be a checkbox for that in the interface to let you choose between I/O and stability.
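Just to illustrate the idea (the host, user and paths below are made up, and this is not how Virtualmin does its transfers today), the whole thing boils down to something like:

# illustration only - host, user and paths are made up
my $src       = "/home/example.com";
my $user_host = "backup\@backuphost.example.com";
my $remote    = "/backups/example.com.tar.gz";
# stream the tar over ssh and compress on the backup server, so the
# Virtualmin server does no gzip work and writes no local archive
system("tar cf - $src | ssh $user_host 'gzip -c > $remote'");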

This could also be achieved by adding a field to the backup interface (when using SSH) to execute a command on the remote server after each file is transferred (this could be used to gzip or bzip2 it), and a field to execute a command on the remote server after all files are transferred (this could be used to start a clamdscan on the whole backup, for example, to give you an early warning about hacked websites). Two variables could be added to the command by Virtualmin - the filename of the transferred file and the local (on the remote server) path. Anything else can be achieved by writing your own commands/scripts.

Let me know what you think.

While technically that could be done, for most users it would be sub-optimal, as bandwidth is "scarcer" than local CPU time. Also, it wouldn't work for backup destinations that only allow scp. So I don't think it is worth the effort to implement this.

OK, thanks a lot Jamie, I guess we will keep compressing the backups on the backup server via a cron schedule.

Thanks for your time!

Automatically closed -- issue fixed for 2 weeks with no activity.