Backup periodically excludes disks on one VM

I use Cloudmin to manage KVM instances. Every few days, the backup function decides that the disk on one particular guest is excluded and then does not back up anything. The next day, it's back to normal and backs the disk up as usual. Thinking that something in that guest's config was screwy, I manually excluded and then re-included the disk in the backup, but the problem persists.

Suggestions?

Status: 
Closed (fixed)

Comments

So does the VM that is failing backups have only a single disk, or is just one disk of multiple being excluded?

It only has a single disk.

On the host system, is the disk stored in a regular file, or in an LVM logical volume?

Ok - and can you post the error message from the backup that happens when the disk is excluded?

It only says "All disks excluded"

See the attached screenshot of the backup logs for the host in question, and another showing the detail of yesterday's backup attempt.

Here's the contents of the backuplogs file (1446967506_1422816029196850):

size=
server=1422816029196850
id=1446967506_1422816029196850
empty=1
base_dest=
msg=All disks have been either excluded from the backup or are for swap
time=1446967506
ok=0
backup=1422773601157730
host=scanner
dest=host:/vmbackup/2015-11-08/scanner.gz
dests=host:/vmbackup/2015-11-08/scanner.gz
owner=

Thanks - it looks like maybe Cloudmin isn't detecting some of the disks on your VMs at all.

On the backup form, what do you have the "Shut down systems?" option set to?

It's set to not shut down the systems.

OK - so when a backup works properly, I assume it is creating an LVM snapshot for each VM?

You should see a message like "Creating LVM snapshots for disks of XYZ" during the backup process.
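(For context, the snapshot-based backup is conceptually the standard LVM sequence sketched below. These are not necessarily the exact commands Cloudmin runs, and the volume group and logical volume names are made up for illustration.)

# create a point-in-time snapshot of the guest's logical volume (vg0/scanner_img is hypothetical)
lvcreate --snapshot --name scanner_snap --size 1G /dev/vg0/scanner_img
# read the frozen snapshot and compress it into the backup file
dd if=/dev/vg0/scanner_snap bs=1M | gzip > /vmbackup/2015-11-11/scanner.gz
# drop the snapshot once the copy is finished
lvremove -f /dev/vg0/scanner_snap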

Is there anything unusual about the state of the guest when this exclusion happens? For example, is it down or in some unreachable state?

As far as I know, the guest is fine.

Here's the snippet from the email yesterday when it worked:

Backing up scanner to /vmbackup/2015-11-11/scanner.gz on host system ..
    Creating LVM snapshots for disks of scanner ..
    Compressing LVM disks for scanner ..
    Removing LVM snapshots for disks of scanner ..
.. created backup of 4.14 GB

Saving details of system scanner to /vmbackup/2015-11-11/scanner.serv on host system ..
.. done

Saving list of disks for scanner to /vmbackup/2015-11-11/scanner.disks on host system ..
.. done

And here's today's when it didn't work:

Backing up scanner to /vmbackup/2015-11-12/scanner.gz on host system ..
    Creating LVM snapshots for disks of scanner ..
.. system excluded from backup

And there is nothing in the system status history around that time (ignore those "parent host is down" entries; that was me swapping out the network switch and forgetting to properly set the MTU on the switch for the network that interconnects the parent hosts):

Date               Old status           New status           Via
09/Nov/2015 00:18  Parent host is down  SSH                  Monitoring (toor)
08/Nov/2015 23:45  SSH                  Parent host is down  Monitoring
05/Nov/2015 05:22  Down                 SSH                  Web UI (toor)
05/Nov/2015 05:20  SSH                  Down                 Web UI (toor)
12/Aug/2015 15:00  No SSH               SSH                  Monitoring
12/Aug/2015 14:55  SSH                  No SSH               Monitoring

OK, it looks like the cause is that during the backup Cloudmin sometimes fails to fetch the list of LVM volume groups from the host system, which causes it to mis-identify which disks are on LVM.
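(As a sanity check, the volume group and logical volume lists can be pulled by hand on the host with the standard LVM tools; if these ever come back empty or hang, that would match the intermittent failure:)

vgs   # list volume groups on the host
lvs   # list logical volumes, including the guests' disk images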

Do you ever get any error or unusual output (like missing disk sizes) when you visit the Manage Disks page for the problem VM?

Sorry that I haven't updated this ticket in a while. I got busy with other things.

Anyway, this particular host still periodically does the "exclude" thing. I recently added another host (via manual migration from a libvirt-managed KVM guest) and it has 3 disks on it. I marked two of the disks as "exclude from backups," but did not do so for the main (boot + root filesystem) disk.

It shows up as "excluded" when the backups run.

I went to take a look at the main disk (vda) and it's a bit weird. It thinks the current use of the disk is virtual memory, as shown in the attachment.

I've attached the screen shot for the other two disks as well.

Cloudmin will exclude disks that it thinks are used for swap, which could explain this problem.

Can you post the /etc/fstab file from that VM? That's what we use to determine if a disk is for swap or not.
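(For illustration: the swap entries in /etc/fstab are the lines whose third field is "swap", so you can list them with something like:)

awk '$3 == "swap"' /etc/fstab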

Incidentally, the backup of this VM worked last night. It didn't consider it excluded. As a reminder, I moved this machine over (manually) from a plain old libvirt-managed KVM instance. Do you recommend that I move swap onto its own disk to not anger Cloudmin?

Here is the /etc/fstab:

UUID=cf78de5a-3065-49e6-8d3c-209a358e154f /               ext4    errors=remount-ro,noatime 0       1
UUID=f7b40af2-0800-4035-b954-58ebfcebc186 none            swap    sw              0       0
UUID=d85cece1-0a1c-46ca-a52d-ef748a1eb605 /tmp ext4 noatime 0 2
UUID=3a7f5e0f-a134-4cac-be0a-31f604c1e0b4 /var/lib/plexmediaserver ext4 noatime 0 2

And here's the blkid output:

/dev/vda1: UUID="cf78de5a-3065-49e6-8d3c-209a358e154f" TYPE="ext4"
/dev/vda5: UUID="f7b40af2-0800-4035-b954-58ebfcebc186" TYPE="swap"
/dev/vdb1: UUID="d85cece1-0a1c-46ca-a52d-ef748a1eb605" TYPE="ext4"
/dev/vdc1: UUID="3a7f5e0f-a134-4cac-be0a-31f604c1e0b4" TYPE="ext4"

I'd recommend always using real device paths in /etc/fstab instead of UUID= lines, as Cloudmin doesn't have any way of converting those to actual disk images at backup time. Try making that change, and let us know if it solves the problem.
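For example, given the blkid output above, the device-path version of that fstab would look something like this (options unchanged):

/dev/vda1 /                        ext4 errors=remount-ro,noatime 0 1
/dev/vda5 none                     swap sw                        0 0
/dev/vdb1 /tmp                     ext4 noatime                   0 2
/dev/vdc1 /var/lib/plexmediaserver ext4 noatime                   0 2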

I didn't remove the UUIDs, but I did change to a separate disk for swap for the htpc machine that just started acting up. I think it's stable.

I just went and looked at the machine about which I originally complained. It, too, has swap on the same disk as root and it uses UUIDs.

I think there is a bug in Cloudmin here, because it doesn't fail consistently. When I get a moment, I will try to poke around in the codebase to see if I can figure out what's up. I am perfectly willing to follow your advice about isolating swap and using real device paths, but I'm also perplexed about why the backups decide to work sometimes when they should be breaking all of the time.

I had another look at the code, and found what is almost certainly the cause of this inconsistent failure. A fix will be included in the next release of Cloudmin.

Regardless, we recommend using device paths in /etc/fstab instead of UUID= lines.

Glad that you tracked it down!

I understand about the device names. UUIDs seem to be the default from the Ubuntu installer nowadays, at least, and those problematic machines were migrated from my old configuration that I managed with virsh/libvirt.

Incidentally, doesn't the master execute commands on the guests at backup time? If so, could you not fire off a blkid to dig into the UUID?

Cloudmin does try to get the label-to-device mapping if the VM is running, using the e2label command and the links in /dev/disk/by-uuid. However, this doesn't work if the VM is shut down.
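(For reference, while a VM is running you can resolve a fstab UUID to its device either through those /dev/disk/by-uuid links or with blkid; e.g., using the root filesystem's UUID from the fstab above, both of these should print /dev/vda1:)

readlink -f /dev/disk/by-uuid/cf78de5a-3065-49e6-8d3c-209a358e154f
blkid -U cf78de5a-3065-49e6-8d3c-209a358e154f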

OK, cool. Thanks for the explanation. I'm going to chalk this one up to the swap being on the same disk as the root filesystem, combined with whatever bug you found. Feel free to close out this case at your leisure!

Thanks!

Thanks - marking this as fixed