A few weeks ago, Maish Saidel-Keesing wrote a post about Fault Tolerance with multiple vCPUs. He makes valid points in his argument, but I still want to give you a little bit of my own view on this topic.

With Fault Tolerance, two VMs run nearly symmetrically on two different ESXi hosts, with one (the primary) processing I/O and the other one (the secondary) dropping it. With the release of vSphere 6.0, VMware will support this feature for VMs with up to 4 vCPUs and 64 GB of memory. [More Details here]
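For illustration, here is a minimal pyVmomi sketch of what turning on FT for a VM looks like through the vSphere API. The vCenter address, credentials, and VM name are placeholders, and in practice you would also monitor the returned task and handle errors:

```python
# Minimal sketch: enable Fault Tolerance for one VM via pyVmomi.
# The vCenter address, credentials, and VM name below are placeholders.
import ssl
from pyVim.connect import SmartConnect, Disconnect
from pyVmomi import vim

ctx = ssl._create_unverified_context()  # lab use only; verify certificates in production
si = SmartConnect(host="vcenter.example.com",
                  user="administrator@vsphere.local",
                  pwd="secret", sslContext=ctx)

# Find the VM to protect by name (placeholder name).
content = si.RetrieveContent()
view = content.viewManager.CreateContainerView(
    content.rootFolder, [vim.VirtualMachine], True)
vm = next(v for v in view.view if v.name == "legacy-app-vm")
view.DestroyView()

# CreateSecondaryVM_Task asks vCenter to spawn the secondary copy on
# another host in the cluster, which is what "turning on FT" does.
# Passing no host argument lets vCenter pick a suitable one.
task = vm.CreateSecondaryVM_Task()
print("FT enable task submitted:", task.info.key)

Disconnect(si)
```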

Let me try to summarize Maish's argument:

FT is not that big a deal of a feature: it only protects against a hardware failure of the ESXi host (keeping the protected VM's service uninterrupted). It does NOT detect or deal with failures at the operating system or application level.

So what Maish thinks we really need are clustering mechanisms at the application level; legacy applications that don't offer them should be phased out.

In general, I would not disagree with this opinion. In an ideal world, all applications would be stateless, scalable, and protectable with a load balancer in front of them. But it will take 1X or more years until all applications are built in such a new 'modern' way. We will not get rid of the legacy applications in the short term.

Within my last 4 years as an instructor, I have received one question nearly every time I delivered a vSphere class:

‘Can we finally protect our SMP-VMs now with Fault Tolerance? No?! Awww :(’

So I would not say there is no need out there for this feature. Being involved in some biddings last year, we very often had the requirement to deliver a system for automation solutions within large building complexes (airports, factories, etc.).

The software used in such domains is sometimes the legacy application par excellence (ironically), programmed with paradigms from long before agile/RESTful/virtualization played a role in the tech world. Sometimes you can license a cluster feature (and pay 10 times as much as for a 1-node license); sometimes you can't cluster it at all and need other ideas or workarounds to increase availability.

Some biddings were lost to competitors who were able to deliver solutions that could (on paper) tolerate a hardware outage without any service or session impact.

For me, SMP-FT brings the typical design considerations into play:

  • How does the cluster work? Does it work at the application/OS level, or does it only protect against a general outage?
  • What were the causes of failures/failovers in the past? (e.g. for vCenter, in most cases where I experienced a failure it was because of a database problem [40%], an Active Directory / SSO problem [10%], a hardware failure [45%], or something else [5%]) -> A feature like FT would have protected against the hardware share of those failures (45%), i.e. a huge amount of what was experienced in the past, though not the database or SSO problems. The same considerations can be applied to all kinds of applications (e.g. a virtual load balancer, Horizon View Connection Server, etc.); a quick way to check your own environment is sketched after this list.
  • How much would a suitable solution cost to build, buy, or update?
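When weighing these considerations, it helps to first see what is actually protected today. Below is a small follow-up sketch (reusing the ServiceInstance si and the placeholder credentials from the connection example above) that lists each VM together with its FT state:

```python
# Follow-up sketch: inventory the FT state of all VMs, reusing the
# ServiceInstance "si" from the connection sketch above (same placeholders).
from pyVmomi import vim

def ft_inventory(si):
    """Print each VM's name and its faultToleranceState
    (e.g. notConfigured, needSecondary, running)."""
    content = si.RetrieveContent()
    view = content.viewManager.CreateContainerView(
        content.rootFolder, [vim.VirtualMachine], True)
    for vm in view.view:
        print(vm.name, vm.runtime.faultToleranceState)
    view.DestroyView()
```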

Sure, we need to get rid of legacy applications, but to be honest… this will be a very long road (the business decides and pays for it), and once we have gotten to the point where today's legacy applications are gone, the next generation of legacy applications will be in place and in need of transformation (Docker?! 😉).

We should see FT for what it is: a new tool within our VMware toolkit to fit specific requirements and protect VMs (legacy and new ones) on a new level, with pros and cons (as always). IMO, every tool/feature that gives us more opportunities to protect our IT is very welcome.