Backup periodically excludes disks on one VM

I use Cloudmin to manage KVM instances. Every few days, the backup function decides that the disk on one particular guest is excluded and then does not back up anything. The next day, it's back to normal and backs the disk up as usual. Thinking that something in that guest's config was screwy, I manually excluded and then re-included the disk in the backup, but the problem persists.

Suggestions?

Status: 
Closed (fixed)

Comments

So does the VM that is failing backups have only a single disk, or is just one disk of multiple being excluded?

It only has a single disk.

On the host system, is the disk stored in a regular file, or in an LVM logical volume?

Ok - and can you post the error message from the backup that happens when the disk is excluded?

It only says "All disks excluded"

See the attached screenshot of the backup logs for the host in question, and another showing the detail of yesterday's backup attempt.

Here's the contents of the backuplogs file (1446967506_1422816029196850):

size=
server=1422816029196850
id=1446967506_1422816029196850
empty=1
base_dest=
msg=All disks have been either excluded from the backup or are for swap
time=1446967506
ok=0
backup=1422773601157730
host=scanner
dest=host:/vmbackup/2015-11-08/scanner.gz
dests=host:/vmbackup/2015-11-08/scanner.gz
owner=

Thanks - it looks like maybe Cloudmin isn't detecting some of the disks on your VMs at all.

On the backup form, what do you have the "Shut down systems?" option set to?

It's set to not shut down the systems.

OK - so when a backup works properly, I assume it is creating an LVM snapshot for each VM?

You should see a message like "Creating LVM snapshots for disks of XYZ" during the backup process.
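(For context, the snapshot-based backup is conceptually the standard LVM sequence sketched below. These are not necessarily the exact commands Cloudmin runs, and the volume group and logical volume names are made up for illustration.)

# create a point-in-time snapshot of the guest's logical volume (vg0/scanner_img is hypothetical)
lvcreate --snapshot --name scanner_snap --size 1G /dev/vg0/scanner_img
# read the frozen snapshot and compress it into the backup file
dd if=/dev/vg0/scanner_snap bs=1M | gzip > /vmbackup/2015-11-11/scanner.gz
# drop the snapshot once the copy is finished
lvremove -f /dev/vg0/scanner_snap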

Is there anything unusual about the state of the guest when this exclusion happens? For example, is it down or in some unreachable state?

As far as I know, the guest is fine.

Here's the snippet from the email yesterday when it worked:

Backing up scanner to /vmbackup/2015-11-11/scanner.gz on host system ..
    Creating LVM snapshots for disks of scanner ..
    Compressing LVM disks for scanner ..
    Removing LVM snapshots for disks of scanner ..
.. created backup of 4.14 GB

Saving details of system scanner to /vmbackup/2015-11-11/scanner.serv on host system ..
.. done

Saving list of disks for scanner to /vmbackup/2015-11-11/scanner.disks on host system ..
.. done

And here's today's when it didn't work:

Backing up scanner to /vmbackup/2015-11-12/scanner.gz on host system ..
    Creating LVM snapshots for disks of scanner ..
.. system excluded from backup

And there is nothing in the system status history around that time (ignore those "parent host is down" entries; that was me swapping out the network switch and forgetting to properly set the MTU on the switch for the network that interconnects the parent hosts):

Date               Old status           New status           Via
09/Nov/2015 00:18  Parent host is down  SSH                  Monitoring (toor)
08/Nov/2015 23:45  SSH                  Parent host is down  Monitoring
05/Nov/2015 05:22  Down                 SSH                  Web UI (toor)
05/Nov/2015 05:20  SSH                  Down                 Web UI (toor)
12/Aug/2015 15:00  No SSH               SSH                  Monitoring
12/Aug/2015 14:55  SSH                  No SSH               Monitoring

OK, it looks like the cause is that during the backup Cloudmin sometimes fails to fetch the list of LVM volume groups from the host system, which causes it to mis-identify which disks are on LVM.
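(As a sanity check, the volume group and logical volume lists can be pulled by hand on the host with the standard LVM tools; if these ever come back empty or hang, that would match the intermittent failure:)

vgs   # list volume groups on the host
lvs   # list logical volumes, including the guests' disk images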

Do you ever get any error or unusual output (like missing disk sizes) when you visit the Manage Disks page for the problem VM?

Sorry that I haven't updated this ticket in a while. I got busy with other things.

Anyway, this particular host still periodically does the "exclude" thing. I recently added another host (via manual migration from a libvirt-managed KVM guest) and it has 3 disks on it. I marked two of the disks as "exclude from backups," but did not do so for the main (boot + root filesystem) disk.

It shows up as "excluded" when the backups run.

I went to take a look at the main disk (vda) and it's a bit weird. It thinks the current use of the disk is virtual memory, as shown in the attachment.

I've attached the screen shot for the other two disks as well.

Cloudmin will exclude disks that it thinks are used for swap, which could explain this problem.

Can you post the /etc/fstab file from that VM? That's what we use to determine if a disk is for swap or not.
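(For illustration: the swap entries in /etc/fstab are the lines whose third field is "swap", so you can list them with something like:)

awk '$3 == "swap"' /etc/fstab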

Incidentally, the backup of this VM worked last night. It didn't consider it excluded. As a reminder, I moved this machine over (manually) from a plain old libvirt-managed KVM instance. Do you recommend that I move swap onto its own disk to not anger Cloudmin?

Here is the /etc/fstab:

UUID=cf78de5a-3065-49e6-8d3c-209a358e154f /               ext4    errors=remount-ro,noatime 0       1
UUID=f7b40af2-0800-4035-b954-58ebfcebc186 none            swap    sw              0       0
UUID=d85cece1-0a1c-46ca-a52d-ef748a1eb605 /tmp ext4 noatime 0 2
UUID=3a7f5e0f-a134-4cac-be0a-31f604c1e0b4 /var/lib/plexmediaserver ext4 noatime 0 2

And here's the blkid output:

/dev/vda1: UUID="cf78de5a-3065-49e6-8d3c-209a358e154f" TYPE="ext4"
/dev/vda5: UUID="f7b40af2-0800-4035-b954-58ebfcebc186" TYPE="swap"
/dev/vdb1: UUID="d85cece1-0a1c-46ca-a52d-ef748a1eb605" TYPE="ext4"
/dev/vdc1: UUID="3a7f5e0f-a134-4cac-be0a-31f604c1e0b4" TYPE="ext4"

I'd recommend always using real device paths in /etc/fstab instead of UUID= lines, as Cloudmin doesn't have any way of converting those to actual disk images at backup time. Try making that change, and let us know if it solves the problem.
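For example, given the blkid output above, the device-path version of that fstab would look something like this (options unchanged):

/dev/vda1 /                        ext4 errors=remount-ro,noatime 0 1
/dev/vda5 none                     swap sw                        0 0
/dev/vdb1 /tmp                     ext4 noatime                   0 2
/dev/vdc1 /var/lib/plexmediaserver ext4 noatime                   0 2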

I didn't remove the UUIDs, but I did change to a separate disk for swap for the htpc machine that just started acting up. I think it's stable.

I just went and looked at the machine about which I originally complained. It, too, has swap on the same disk as root and it uses UUIDs.

I think there is a bug in Cloudmin here, because it doesn't fail consistently. When I get a moment, I will try to poke around in the codebase to see if I can figure out what's up. I am perfectly willing to follow your advice about isolating swap and using real device paths, but I'm also perplexed about why the backups decide to work sometimes when they should be breaking all of the time.

I had another look at the code, and found what is almost certainly the cause of this inconsistent failure. A fix will be included in the next release of Cloudmin.

Regardless, we recommend using device paths in /etc/fstab instead of UUID= lines.

Glad that you tracked it down!

I understand about the device names. UUIDs seem to be the default from the Ubuntu installer nowadays, at least, and those problematic machines were migrated from my old configuration that I managed with virsh/libvirt.

Incidentally, doesn't the master execute commands on the guests at backup time? If so, could you not fire off a blkid to dig into the UUID?

Cloudmin does try to get the label-to-device mapping if the VM is running, using the e2label command and the links in /dev/disk/by-uuid. However, this doesn't work if the VM is shut down.
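(For reference, while a VM is running you can resolve a fstab UUID to its device either through those /dev/disk/by-uuid links or with blkid; e.g., using the root filesystem's UUID from the fstab above, both of these should print /dev/vda1:)

readlink -f /dev/disk/by-uuid/cf78de5a-3065-49e6-8d3c-209a358e154f
blkid -U cf78de5a-3065-49e6-8d3c-209a358e154f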

OK, cool. Thanks for the explanation. I'm going to chalk this one up to the swap being on the same disk as the root filesystem, combined with whatever bug you found. Feel free to close out this case at your leisure!

Thanks!

Thanks - marking this as fixed