Replication fails from Virtualmin Pro via Cloudmin - invalid tar file

Hey guys

I have a configuration with 4 servers, made up of a cloudmin (connect) server and 3 web servers.

Virtual servers exist on web101 and I'm trying to ask cloudmin to replicate the websites to web102 and web103. This process fails every time, with the error:

Starting replication from web101 of Virtualmin settings ..
    Finding source and destination systems ..
    .. found source web101 and 2 destinations
    Refreshing domains on source system ..
    .. done
    Creating temporary directories ..
    .. done
    Backing up 4 virtual servers on source system ..
    .. created backup of 711.95 kB
    Transferring backups to destination systems ..
    .. done
    Restoring backups on destination systems ..
    .. 0 restores succeeded, 2 failed
    Failed to restore on web102 : Failed to read backup file : /tmp/.webmin/105207_54742_1_fastrpc.cgi/ssltest.local.tar.gz : Not a valid tar or tar.gz file
    Failed to restore on web103 : Failed to read backup file : /tmp/.webmin/953777_63257_1_fastrpc.cgi/ssltest.local.tar.gz : Not a valid tar or tar.gz file
    Removing excess virtual servers on destinations ..
    .. no domains need deletion on web102
    .. no domains need deletion on web103
Replication failed - see the output above for the reason why.

So I've interrupted the process to grab the tar files that cloudmin stores on the 2 destination servers, inside the /tmp/.webmin directory. In the example above there are 4 tar files related to the virtual servers to replicate and then 1 virtualmin settings tar file. When I try to extract these myself it's evident that the tar files are invalid / corrupt.

# tar -zxvf benchmark.local.tar.gz
./
tar: Skipping to next header
gzip: stdin: invalid compressed data--crc error
gzip: stdin: invalid compressed data--length error
tar: Child returned status 1
tar: Error is not recoverable: exiting now
#
# tar zxvf cloudmin101.local.tar.gz
gzip: stdin: invalid compressed data--format violated
./
./.backup/
./.usermin/
tar: Skipping to next header
tar: Child returned status 1
tar: Error is not recoverable: exiting now

My initial thought was that this was related to an interrupted process during the transfer / dd process. However, the machines are all VMs on the same host at present (during testing), connected to a single physical switch. The network is rock solid, achieves 3 Gbps between the machines, and never drops any packets.

Is there any chance someone could point me in the right direction to troubleshoot this further? Cheers

Status: Closed (fixed)

Comments

It's also worth adding that the files vary in size:

[root@web103 953777_63257_1_fastrpc.cgi]# ls -alh
total 728K
drwxr-x--- 4  502  502 4.0K Oct 25 17:18 .
drwxr-xr-x 4 root root   72 Oct 25 17:18 ..
drwxrwxrwx 2 root root    6 Oct 25 17:18 .backup
-rw-r--r-- 1 root root  26K Oct 25 17:18 benchmark.local.tar.gz
-rw-r--r-- 1 root root  27K Oct 25 17:18 cloudmin101.local.tar.gz
-rw-r--r-- 1 root root  27K Oct 25 17:18 replication.local.tar.gz
-rw-r--r-- 1 root root 630K Oct 25 17:18 ssltest.local.tar.gz
drwx------ 2  502  502    6 Jul 15 17:42 .usermin
-rw-r--r-- 1 root root 5.4K Oct 25 17:18 virtualmin.tar.gz

That's quite unusual - it seems like the ssh transfer of those files failed silently, resulting in a corrupt or truncated file.
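
One way to confirm whether a file was corrupted or truncated in transit is to run the same integrity check on the source and on each destination, then compare the reported size and checksum. The following is a minimal sketch, assuming `sha256sum`, `gzip` and `tar` are available on all systems; `check_archive` is a hypothetical helper name and is not part of Webmin, Virtualmin or Cloudmin.

```shell
#!/bin/sh
# check_archive FILE
# Prints the size and SHA-256 of FILE, then tests the gzip stream and
# the tar listing. Returns 0 if the archive reads cleanly, 1 otherwise.
check_archive() {
    f="$1"
    if [ ! -f "$f" ]; then
        echo "$f: missing"
        return 1
    fi
    # Byte size and checksum, for comparison between systems
    size=$(wc -c < "$f" | tr -d ' ')
    sum=$(sha256sum "$f" | cut -d' ' -f1)
    echo "$f size=$size sha256=$sum"
    # gzip -t checks the compressed stream (CRC and length) without extracting
    if ! gzip -t "$f" 2>/dev/null; then
        echo "$f: gzip integrity check FAILED"
        return 1
    fi
    # tar -tzf lists the members and fails if the tar stream is damaged
    if ! tar -tzf "$f" >/dev/null 2>&1; then
        echo "$f: tar listing FAILED"
        return 1
    fi
    echo "$f: OK"
    return 0
}
```

Running `check_archive benchmark.local.tar.gz` against the copy on the source and the copy under /tmp/.webmin on a destination should print the same size and checksum if the transfer was clean; a mismatch points at the transfer step rather than at the backup itself.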

Is the destination system perhaps out of disk space?

Hey Jamie

Nope, these are all brand new systems. /tmp is part of the root filesystem and has a 124GB xfs within a logical volume. Total disk usage on the entire machine is 3GB.
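
For reference, the disk-space question can be checked quickly on each node. This is a small sketch, assuming a POSIX `df` and `awk`; `free_kb` is a hypothetical helper name used only for illustration.

```shell
#!/bin/sh
# free_kb DIR
# Prints the kilobytes available on the filesystem that holds DIR.
# df -P forces the POSIX single-line-per-filesystem layout, so the
# available space is always the 4th field of the 2nd line.
free_kb() {
    df -Pk "$1" | awk 'NR==2 {print $4}'
}

# Example: show free space where the temporary backup files are written
echo "/tmp has $(free_kb /tmp) kB free"
```

This only covers free blocks; a filesystem can also run out of inodes, which would need a separate check.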

On the source system, which backup compression format do you have selected in Virtualmin? This is visible at System Settings -> Module Configuration -> Backup and restore -> Backup compression format.

Backup compression format is currently gzip on the source system.

And on the destination too?

Yep on all 3 nodes it's gzip

If on the source system you just make a regular Virtualmin backup of one of the domains, scp it to the destination and then restore, does it work OK?

Yeah, manual backup, SCP and restore via the virtualmin web interface works as expected.

Still getting the same errors with a newly created virtual server on the source system:

Failed to restore on web102 : Failed to read backup file : /tmp/.webmin/862446_16406_1_fastrpc.cgi/benchmark.local.tar.gz : Not a valid tar or tar.gz file

Failed to restore on web103 : Failed to read backup file : /tmp/.webmin/38937_19047_1_fastrpc.cgi/benchmark.local.tar.gz : Not a valid tar or tar.gz file

Is /tmp perhaps full on the source or destination system?

Nope. I made sure I ran the manual backup, SCP and restore from that location to be sure. But as I said in an earlier post, the systems are all Hyper-V nodes with 127GB root file systems and only ~3GB of data on each.

Any chance we could login to this system to see what's going wrong?

Erm yeah, I'll have to grant VPN access and then logins for the 3 virtualmin nodes and the cloudmin node. Could you let me know where to send the credentials? I'll then set it all up for you.

All of the nodes I grant access to have zero data on them and so you can't hurt anything. Feel free to reboot, or whatever else you may need to do as long as you let me know exactly what any potential fix might have been.

Thanks

I've just sent those through to you.

Hey Jamie

I can see that you've not yet connected to the VPN using the details I emailed over. Could you please give me an update on this, and also confirm that you safely received the credentials last Friday?

Thanks

I didn't get your email - what address did you send it from?

Hi Jamie

I've just sent a further copy to both your Virtualmin and webmin mailboxes. It was sent from chris at domain_removed dot co dot uk. Please let me know if it was received this time.

Thanks

Well, that was very interesting - after much debugging I found the problem, which was a corner case in which transferred files could be corrupted but only when logging in as a sudo-capable user instead of root. I've fixed this on your system, and will include the fix in the next Cloudmin release.

Status: Active » Fixed

That's brilliant news Jamie, thanks again. I'm quite relieved it was a bug and not a misconfiguration on our part.