Redundant services using Cloudmin Pro

OBJECTIVE>> To set up a dual-node HA cluster that provides uninterrupted and/or load-balanced hosting for KVM instances managed by Cloudmin Pro and mirrored to the second host using DRBD.

HARDWARE>>
Nodes: 2x dual Xeon, 16GB
Internet connectivity: 2x Internet leased lines (different ISPs), load-balanced through a dual-WAN router
Virtualization: KVM
Replication: DRBD/Pacemaker
Replicated drive: /KVM

NETWORKING:
Both Internet lines connected to the dual-WAN router
Both nodes: eth0 connected to the router for private IP assignment and Internet access
Both nodes: eth1 connected to each other for DRBD/Corosync/GFS2 communication

SOFTWARE:
Both nodes: CentOS 6.5 installed
Both nodes: Cloudmin Pro installed
Both nodes: DRBD configured in dual-primary mode to replicate /KVM on a GFS2 filesystem
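Before testing failover on a setup like this, it can help to confirm that DRBD really is connected and in dual-primary mode on both nodes. The following is a minimal Python sketch, assuming the DRBD 8.x status-line format found in /proc/drbd; the function names and the sample text are illustrative only and are not output captured from these nodes.

```python
import re

# Matches a DRBD 8.x resource status line as shown in /proc/drbd, e.g.
#  0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----
STATUS_RE = re.compile(
    r"^\s*(?P<minor>\d+):\s+cs:(?P<cs>\S+)\s+ro:(?P<ro>\S+)\s+ds:(?P<ds>\S+)"
)


def parse_drbd_status(text):
    """Return one dict per resource status line found in /proc/drbd text."""
    resources = []
    for line in text.splitlines():
        match = STATUS_RE.match(line)
        if not match:
            continue
        local_role, _, peer_role = match.group("ro").partition("/")
        local_disk, _, peer_disk = match.group("ds").partition("/")
        resources.append({
            "minor": int(match.group("minor")),
            "cs": match.group("cs"),
            "roles": (local_role, peer_role),
            "disks": (local_disk, peer_disk),
        })
    return resources


def is_healthy_dual_primary(resource):
    """True if the resource is connected, Primary on both sides, and in sync."""
    return (
        resource["cs"] == "Connected"
        and resource["roles"] == ("Primary", "Primary")
        and resource["disks"] == ("UpToDate", "UpToDate")
    )


# Illustrative sample text (hypothetical, not taken from node1 or node2).
SAMPLE_OK = " 0: cs:Connected ro:Primary/Primary ds:UpToDate/UpToDate C r-----\n"
SAMPLE_PEER_DOWN = " 0: cs:WFConnection ro:Primary/Unknown ds:UpToDate/DUnknown C r-----\n"

if __name__ == "__main__":
    for sample in (SAMPLE_OK, SAMPLE_PEER_DOWN):
        for res in parse_drbd_status(sample):
            print(res["minor"], res["cs"], is_healthy_dual_primary(res))
```

On a real node the text would come from reading /proc/drbd; the check would be run on both nodes, since each one reports the roles from its own point of view.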

CLOUDMIN CONFIGURATION:
node2 set up as a Cloudmin replica of node1
node1 and node2 are part of a host failover group ("System Operations -> Force Failover" OPTION MISSING)

TRIAL: Creating the first instance, Test1, on node1 and testing failover to the secondary host, node2.

With the above setup, we expected Cloudmin to assume that the test1 image file exists on the local drive /KVM on node2 once it has been created on node1, and to switch to node2:Test1 when node1 fails.

Instead, test1 goes down as soon as node1 is shut down.

QUESTIONS:

  1. What are we doing wrong here?
  2. Are the topology and configuration correct?
  3. Are we missing steps?
  4. Is there anything additional to be done?
  5. Is there an alternative/easier/more logical approach to achieving the objective?

Thanking you, Surjo Banerjee

Status: Active

Comments

Image mirroring using DRBD isn't something we have tested, unfortunately. The common setup is an NFS or iSCSI mount from a central storage system, shared by two VM hosts. Cloudmin can then do automatic failover if one host goes down.

Thank you for your reply Jamie.

OK, please give me a hint on the following scenario then.

If host#1 is reading the image from its local storage, /kvm/test1.img, and it fails, will Cloudmin Pro then switch to host#2 and look for /kvm/test1.img on host#2's local storage, assuming failover is set up the right way? That would be logical if Cloudmin is simply replicating/moving the test1 instance (everything except test1.img) from host#1 to host#2, and it should solve our problem.

Please help us out. We have already spent more than six months on research, and after all our trials and a review of every piece of documentation available, Cloudmin seems to be our best chance. We have been using Webmin/Virtualmin for more than a year now, so we are quite familiar and very comfortable with your products. The launch of our healthcare application is held up because we have not been able to put together a redundant hosting platform.

We feel we are very close to building this HA cluster using a combination of DRBD, KVM, GFS2 and Cloudmin Pro and I am willing to share the entire approach if it helps you in any way. At this point all the major elements are functioning as expected, just a bit of fine tuning will get us on track here.

Your guidance is highly appreciated! :-)

Awaiting your response..

Thanking you, Surjo.

Jamie,

Status: Failover not triggered automatically and cannot find a way to trigger manual failover in Cloudmin.

  1. We have been successful in moving KVM instances between node1 and node2, and Cloudmin detects the locally available img copies in /kvm on both nodes; the /kvm directory is replicated between the two servers by DRBD on GFS2. So the live migration works great!
  2. We have also set up a separate dedicated Cloudmin Pro server.
  3. We have registered node1 and node2 and added them to the failover group.
  4. As before, we are unable to find "System Operations -> Force Failover" to force node1 to fail and test the manual switchover.
  5. Therefore, we manually turned off node1 by pressing and holding the power button.
  6. /kvm/test1.img and its related files are all locally available on node2; however, Cloudmin Pro does not restart the test instance test1 on node2, even after we refreshed node1's status to indicate that it's down. It doesn't even seem to attempt a switchover. No messages or notifications.
  7. Looking closely at the instructions documented at https://www.virtualmin.com/documentation/cloudmin/vm/failover , I now realize that what I see in the current version of Cloudmin Pro is a bit different from those instructions. Therefore I have a feeling something's not right with my failover configuration.

https://www.virtualmin.com/documentation/cloudmin/vm/failover

Ex#1 "If you want Cloudmin to automatically perform failovers, set the Failover group enabled? to automatic mode. Otherwise you can select manual mode to trigger failovers manually when you detect a host system has gone down."

My Cloudmin version has three options: Automatic and manual / Manual only / Disabled.

Setting failover to Automatic and manual: no migration action observed in Cloudmin after forcing node1 to fail. Test1 has been down on node1 for more than 30 minutes and hasn't migrated to node2.

Ex#2 Setting failover to Manual: there is no way to force failover. System Operations -> Force Failover is not shown as an option, contrary to the documentation: "If you have set a failover group to manual mode, Cloudmin will not automatically move virtual systems off down hosts in the group. However, you can force a failover at System Operations -> Force Failover. "

Remaining settings:

  1. I set "Host downtime before automatic failover" to 1 minute.
  2. On failover, send email to: both checked (admin and system owners).
  3. Selected hosts: both nodes selected in the right-side box.
  4. Virtual systems to fail over: Any system on selected hosts.

SAVE
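To make the expected behaviour under these settings concrete, here is a minimal Python sketch of the decision being tested. It is only a model of the behaviour the documentation describes, not Cloudmin's actual implementation; the class, function names, and mode constants are hypothetical and simply mirror the options listed above.

```python
from dataclasses import dataclass

# Mode labels as they appear in the failover group form described above.
AUTOMATIC = "Automatic and manual"
MANUAL = "Manual only"
DISABLED = "Disabled"


@dataclass(frozen=True)
class FailoverGroup:
    """Hypothetical model of a failover group's settings."""
    mode: str
    hosts: frozenset
    downtime_minutes: int


def should_auto_failover(group, host, minutes_down):
    """True if a host's downtime should trigger an automatic failover."""
    return (
        group.mode == AUTOMATIC
        and host in group.hosts
        and minutes_down >= group.downtime_minutes
    )


def pick_target(group, failed_host, hosts_up):
    """Pick another host in the group that is currently up, or None."""
    candidates = sorted(
        h for h in group.hosts if h != failed_host and h in hosts_up
    )
    return candidates[0] if candidates else None


if __name__ == "__main__":
    group = FailoverGroup(AUTOMATIC, frozenset({"node1", "node2"}), 1)
    # node1 has been down for 30 minutes, node2 is still up.
    if should_auto_failover(group, "node1", 30):
        print("expected target:", pick_target(group, "node1", {"node2"}))
```

With node1 down for 30 minutes against a 1-minute threshold, this model expects test1 to be moved to node2, which is the behaviour that is not being observed.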

Now I am removing Cloudmin from node1 and node2 to find out if that helps.

Please advise.

Surjo.

I read your most recent post after I typed the following text, and realized you attempted my first suggestion already. We may want to wait for Jamie's advice, as it sounds like you're not seeing the "Force Failover" option. However, I did want to offer what I typed up, even if some of it is no longer relevant to you --

In the scenario Jamie described, both host #1 and host #2 would be sharing the same storage, where the Virtual Machine images would be stored on a shared NFS or iSCSI mount.

The exact details of that scenario are described here:

https://www.virtualmin.com/documentation/cloudmin/vm/failover

There's an alternative that Cloudmin supports. Rather than using shared storage as described above, Cloudmin can also monitor a given server and, if it goes down, change the DNS records so that a second server becomes the live server.
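As a rough illustration of the idea behind DNS-based failover, here is a minimal Python sketch that decides which addresses should remain in a round-robin record set given each server's health. This is not how Cloudmin implements it; the function name is hypothetical and the IP addresses are placeholders from the documentation range.

```python
def records_to_publish(host_ips, host_is_up):
    """Return the sorted IPs of all hosts currently reported as up.

    host_ips maps host name -> IP address; host_is_up maps host name -> bool.
    Hosts with no health entry are treated as down.
    """
    return sorted(
        ip for host, ip in host_ips.items() if host_is_up.get(host, False)
    )


# Placeholder addresses (hypothetical, from the 192.0.2.0/24 documentation range).
HOST_IPS = {"node1": "192.0.2.11", "node2": "192.0.2.12"}

if __name__ == "__main__":
    # Both servers up: both addresses stay published, giving load balancing.
    print(records_to_publish(HOST_IPS, {"node1": True, "node2": True}))
    # node1 down: only node2's address stays published.
    print(records_to_publish(HOST_IPS, {"node1": False, "node2": True}))
```

When both servers are healthy, both addresses stay in DNS and traffic is spread across them; when one fails, its address is dropped so clients resolve only to the surviving server.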

That scenario is described here:

https://www.virtualmin.com/documentation/cloudmin/vm/roundrobin

Would either of those do what you're after? The second option is interesting in that it would also allow you to do load balancing, if you wanted.

Thanks Andrey,

I have already looked at Jamie's post that mentions those two options. The first one, as you are aware, hasn't been successful for us because failover doesn't switch anything. Ideally, failover is what we would want to get working, because DRBD's limitations may mean it does not handle a load-balanced dual-primary setup very well.

We will need to spend some time getting DNS round robin and DRBD to work stably in dual-primary mode. Proxmox seems to work with DRBD if shared storage isn't available.

Could you please advise? As far as I have read, Cloudmin is superior to Proxmox in many ways, and I would like to stick with it because of the seamless integration with the VM-level Virtualmin/Webmin deployments that we have to manage.

I do believe Cloudmin can do it, considering that it recognizes the local images during migration of VMs. It is only that failover doesn't seem to be active for some reason, as I described in my previous post.

Please advise, I think we are very very close..

Regards, Surjo.

Regarding the "force failover" option, did you select the VM from the left menu, or the host system? It only appears when a VM is chosen, AND the host system for the VM is part of a failover group.

If this still doesn't help, I would be glad to log in to your Cloudmin system to see what is going wrong. Since your setup is a bit unusual, there may be bugs in the failover process.