Automatic failover

Hi,

Could you please implement a way for a CloudMin slave server to automatically become the master if the current master fails?

Status: Active

Comments

That's a pretty good idea - currently a replica has to be told manually to take over.

That said, this would be tricky to implement properly, as having two masters that each think they are the only one could damage virtual machines.

Hi Jamie,

How about this: if a master goes down, it is automatically demoted to a slave on reboot, e.g. by something in the startup script that demotes it. In the meantime, the slave has promoted itself to master because the old master was offline. This way, every node in the cluster is a slave by default as soon as it boots up. It then checks the status of the other nodes to see which one is the master. If there is no master, it can promote itself.

Or something like that? Basically similar to the Linux Heartbeat project.

And, it would be a good idea to email the system owner about state changes as well, just to keep a watchful eye on everything.
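To make the boot-time rule concrete, here is a rough sketch in Python. Everything in it is hypothetical: `Role`, `role_after_boot`, `apply_boot_rule` and the `notify` callback are names made up for illustration and are not Cloudmin APIs. It assumes the node has already asked its peers for their roles, and it does not cover the case where the peers cannot be reached at all.

```python
from enum import Enum
from typing import Callable, Iterable


class Role(Enum):
    SLAVE = "slave"
    MASTER = "master"


def role_after_boot(peer_roles: Iterable[Role]) -> Role:
    """Decide the role of a node that has just booted.

    Every node starts as a slave. It promotes itself only if none of
    the peers that answered reports being the master.
    """
    if any(role is Role.MASTER for role in peer_roles):
        return Role.SLAVE
    return Role.MASTER


def apply_boot_rule(peer_roles: Iterable[Role],
                    notify: Callable[[str], None]) -> Role:
    """Apply the boot rule and report any state change.

    `notify` stands in for whatever would email the system owner
    (an assumption; the real mechanism is not specified here).
    """
    old_role = Role.SLAVE  # default role at boot
    new_role = role_after_boot(peer_roles)
    if new_role is not old_role:
        notify(f"role changed: {old_role.value} -> {new_role.value}")
    return new_role
```

For example, `apply_boot_rule([Role.SLAVE], print)` promotes the node and prints the state change, while a peer list containing `Role.MASTER` leaves it as a slave and sends nothing.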

What if a system loses network connectivity without a reboot though, and then comes back? In this case, it wouldn't be able to know if it should be a slave or master ...

True. How about this:

It stays in slave mode until it can contact a master to confirm that it is a slave, or until we manually promote it to master. That is, a slave always stays a slave until it finds the other node(s) in the cluster and can confirm whether there is indeed a master. Then there could be some sort of "election" to determine which node should be the master, according to rules we set. For example:

ServerA = Primary Master
ServerB = Secondary Master
ServerC = Tertiary Master

But if there's a network problem, then we generally have bigger problems to worry about as well :)
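A minimal sketch of that election rule in Python, just to pin down the idea. The function name `next_role`, the `PRIORITY` list and the shape of `peer_roles` are all assumptions for illustration, not anything Cloudmin provides. It is only the rule as described: it is not a full guard against two masters when the network is split into groups that can each see some peers.

```python
from typing import Dict, List

# Configured order of preference (primary, secondary, tertiary).
PRIORITY: List[str] = ["ServerA", "ServerB", "ServerC"]


def next_role(me: str, priority: List[str], peer_roles: Dict[str, str]) -> str:
    """Return "master" or "slave" for node `me`.

    `peer_roles` maps each peer that answered to the role it reported
    ("master" or "slave"). Peers that could not be reached are absent.
    """
    # No peer reachable: nothing can be confirmed, so stay a slave
    # until contact is restored or an admin promotes this node by hand.
    if not peer_roles:
        return "slave"

    # A reachable peer is already the master: remain a slave.
    if "master" in peer_roles.values():
        return "slave"

    # No master found among reachable nodes: hold the election.
    # The first node in the priority list that is present wins.
    candidates = set(peer_roles) | {me}
    winner = next((name for name in priority if name in candidates), None)
    return "master" if winner == me else "slave"
```

With this rule, `next_role("ServerB", PRIORITY, {"ServerC": "slave"})` gives `"master"` because ServerA is absent and ServerB ranks next, while `next_role("ServerB", PRIORITY, {})` stays `"slave"` since no other node could be contacted.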

Yeah, a "master" election is the only viable solution. I will look into how this could be implemented.
