Unlocking Resilience with Fault Domains in vSAN

My Journey from Infrastructure Admin to Cloud Architect: Understanding vSAN Fault Domains

As an infrastructure admin, I have always been fascinated by the underlying technology that powers our virtualized environment. Recently, I had the opportunity to delve deeper into vSAN and its concepts, specifically the Fault Domain (FD) concept. In this blog post, I will share my journey of understanding vSAN FDs and how they can be used to protect our cluster against rack or site failures.

What are Fault Domains in vSAN?

In vSAN, a Fault Domain (FD) is a group of one or more ESXi hosts that share a common point of failure, typically a rack or a site. vSAN never places two replicas of the same object, or a replica and its witness, in the same FD. If a whole FD fails (e.g., a top-of-rack switch failure or a site disconnection), the surviving components still hold a majority of the votes, so the object remains available.
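To make the vote-majority idea concrete, here is a minimal sketch in Python. It is a toy model, not vSAN code: the `Component` class, the `is_available` helper, and the rack names are all hypothetical, and it assumes one vote per component. It only checks the vote majority, whereas a real cluster also needs a complete copy of the data to be reachable.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One component of an object (hypothetical model, not a vSAN API)."""
    name: str
    fault_domain: str
    votes: int = 1  # assumption: every component carries a single vote


def is_available(components, failed_fds):
    """Return True if the components outside the failed FDs hold more than half of all votes."""
    total = sum(c.votes for c in components)
    alive = sum(c.votes for c in components if c.fault_domain not in failed_fds)
    return alive * 2 > total


# A mirrored object tolerating one failure: two replicas plus one witness, one per rack.
vm_disk = [
    Component("replica-1", "rack-1"),
    Component("replica-2", "rack-2"),
    Component("witness", "rack-3"),
]

print(is_available(vm_disk, {"rack-1"}))            # True: 2 of 3 votes survive
print(is_available(vm_disk, {"rack-1", "rack-2"}))  # False: only 1 of 3 votes survives
```

Because each component lives in a different FD, losing any single rack removes exactly one vote, which is why the object survives.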

The Smallest Fault Domain is the Host Itself

If we don’t configure any FD in vCenter, every ESXi host acts as its own implicit FD, because vSAN never places two replicas of the same object on the same host, even if the host has more than one disk group. So, the smallest FD is the host itself.

Protecting against Rack Failures

In vSAN, the smallest number of FDs required to protect against a single rack failure with a mirroring policy (FTT=1) is 3 (2x VMDK + 1x witness). To achieve this, we can spread the hosts over three racks, for example two ESXi hosts per rack, and define one FD per rack. With this configuration, if one rack fails, the two remaining racks still hold a majority of the components, so our virtual machines stay accessible.

Using Fault Domains for Protection

To protect against two rack failures, we can use a mirroring policy with FTT=2. This requires five FDs (3x VMDK + 2x witness), following the 2n+1 rule for mirroring, where n is the number of failures to tolerate. With this configuration, even if two racks fail, the three remaining racks still hold a majority of the components and keep our virtual machines accessible.
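The FD counts for mirroring can be worked out with a few lines of Python. This is a minimal sketch of the 2n+1 arithmetic only; the function name `mirroring_layout` is hypothetical, and the "one witness per tolerated failure" breakdown is the simplified model used in this post rather than a statement about how many witness components vSAN actually creates.

```python
def mirroring_layout(ftt: int) -> dict:
    """Simplified RAID-1 layout for a given number of failures to tolerate (FTT)."""
    if ftt < 0:
        raise ValueError("FTT cannot be negative")
    replicas = ftt + 1   # one more data copy than the failures we want to survive
    witnesses = ftt      # simplified model: one witness per tolerated failure
    return {
        "replicas": replicas,
        "witnesses": witnesses,
        "fault_domains": replicas + witnesses,  # equals 2 * ftt + 1
    }


for ftt in (1, 2):
    print(ftt, mirroring_layout(ftt))
```

For FTT=1 this gives 2 replicas, 1 witness, and 3 FDs; for FTT=2 it gives 3 replicas, 2 witnesses, and 5 FDs.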

Another important aspect to consider is that vSAN distributes components automatically, so FDs are our way to influence where the components are placed. For example, if we have five racks and ten ESXi hosts, we could place two hosts per rack, define one FD per rack, and let vSAN spread the components of each object across different racks.
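The five-rack, ten-host example can be sketched as follows. Everything here is illustrative: the host and rack names, `group_by_fault_domain`, and `place_components` are hypothetical, and the placement rule (first host of each FD, in name order) is a deliberately naive stand-in for the real vSAN placement logic, which also weighs capacity and other factors.

```python
from collections import defaultdict

# Assumption: ten hosts named esxi-01..esxi-10, two per rack, racks rack-1..rack-5.
host_to_rack = {f"esxi-{i:02d}": f"rack-{(i - 1) // 2 + 1}" for i in range(1, 11)}


def group_by_fault_domain(mapping):
    """Build one fault domain per rack from a host -> rack mapping."""
    fds = defaultdict(list)
    for host, rack in sorted(mapping.items()):
        fds[rack].append(host)
    return dict(fds)


def place_components(fault_domains, component_names):
    """Naive placement: at most one component per FD, on the FD's first host."""
    if len(component_names) > len(fault_domains):
        raise ValueError("not enough fault domains for one component per FD")
    placement = {}
    for name, (fd, members) in zip(component_names, sorted(fault_domains.items())):
        placement[name] = (fd, members[0])
    return placement


fault_domains = group_by_fault_domain(host_to_rack)
placement = place_components(
    fault_domains,
    ["replica-1", "replica-2", "replica-3", "witness-1", "witness-2"],
)
for name, (fd, host) in placement.items():
    print(f"{name}: {fd} / {host}")
```

With five components and five FDs, every component lands in a different rack, so no single rack failure can take out more than one of them.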

Stretched Clusters for Additional Protection

In some cases, for example when only two racks are available, we can still protect against a single rack failure by borrowing the stretched cluster concept. In this approach, vSAN treats Rack 1 as the Preferred Site and Rack 2 as the Secondary Site, and uses a Primary Failures To Tolerate (PFTT) of 1 to mirror components between those “sites.” Each rack then holds a full copy of the data, so losing one rack does not take the virtual machines offline.

Witness Host for Split Brain Scenario

When using a stretched cluster, a witness host is essential. It must be hosted outside both racks, for example on a standalone tower host, so that a third rack is not required. The witness acts as a tie-breaker: if the two racks lose contact with each other, it protects the cluster against a split-brain scenario.
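The tie-breaker role of the witness can be illustrated with a small quorum model in Python. This is a sketch under stated assumptions: one vote per "site" and one for the witness, a hypothetical `surviving_partition` helper, and a link failure in which the witness can still reach the preferred rack. It is not how vSAN is implemented internally.

```python
def surviving_partition(votes, partitions):
    """Return the partition holding a strict majority of votes, or None if there is none."""
    total = sum(votes.values())
    for members in partitions:
        if sum(votes[m] for m in members) * 2 > total:
            return members
    return None


# Assumption: one vote each for the two racks and the witness.
with_witness = {"preferred": 1, "secondary": 1, "witness": 1}
without_witness = {"preferred": 1, "secondary": 1}

# The link between the racks fails; the witness still reaches the preferred rack.
winner = surviving_partition(with_witness, [{"preferred", "witness"}, {"secondary"}])
print(sorted(winner))  # ['preferred', 'witness']

# Without a witness, neither rack has a majority: the split-brain situation.
print(surviving_partition(without_witness, [{"preferred"}, {"secondary"}]))  # None
```

With only two voters, a link failure leaves each side with exactly half of the votes, so neither can safely continue; the third vote from the witness is what breaks the tie.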

Conclusion

In conclusion, Fault Domains are an essential vSAN concept for protecting a cluster against rack or site failures. Understanding how to configure FDs and how many of them a given policy needs is crucial for any infrastructure admin or cloud architect. By using FDs, we can ensure that our virtual machines remain available even when a whole rack fails. When fewer racks are available, a stretched cluster with a witness host offers another way to protect against rack failures.