Infrastructure and Resource Planning with VMware vSAN ESA/MAX Architecture (Deep Dive)
Hello. With VMware vSphere 8, many important changes have been made to the vSAN architecture. In addition to the ESA architecture that arrived with vSphere 8, vSphere 8 U2 introduced the vSAN Max architecture, which brings very nice enhancements to SDS/HCI platforms.
Another new feature is VMware's core-based subscription licensing, which lets you build a more modular platform with even more flexibility and functionality. Personally, I like the core-based subscription licensing. Having used it for months now, I find it more cost-effective than the previous socket-based per-CPU licensing. VMware has bundled many of its products here, allowing you to run more integrated and functional environments, and with add-on licensing you can grow very flexibly by adding features as you need them.
vSAN is very flexible compared to other HCI solutions and has very low overhead because it runs as a native ESXi service. Its beauty is that it is very easy to install, maintain, upgrade, manage, and monitor. Even a single day is enough to learn it, which makes life much easier for operations teams compared to open-source SDS solutions. Architecturally, there is no need for a storage controller VM or a container-based service to run and operate, which makes it very attractive; such components bring both troubleshooting difficulties and compute overhead.
vSAN ESA Architecture
What makes it different from the vSAN OSA architecture is that it comes with a completely new log-structured file system (vSAN LFS).
vSAN ESA provides very flexible metadata management. It uses a tree-based data structure (the B-tree mentioned below) for more efficient metadata handling.
As shown in the figure below, keeping metadata pages in memory gives a performance-oriented, scalable structure. Actively accessed blocks that would otherwise have to be retrieved from disk can be served from memory, which provides higher performance.
vSAN ESA uses an adaptive write method, selecting the most appropriate write path based on the incoming I/O request. The default path handles small I/Os, while the second path is preferred for larger incoming I/O requests. This allows write performance to scale with the type and size of the workload, rather than using the same write operation for every workload.
This ensures high IOPS, high bandwidth, and low latency, especially for write-intensive workloads.
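To make the idea concrete, here is a minimal Python sketch of the adaptive write concept; it is not the actual vSAN internals. The 64 KiB cutoff and the path names are purely illustrative assumptions.

```python
# Conceptual sketch only -- not actual vSAN internals. Small I/Os take the default
# (latency-oriented) path, large I/Os are steered to a bandwidth-oriented path.
# The 64 KiB threshold is an assumption chosen purely for illustration.

SMALL_IO_THRESHOLD = 64 * 1024  # bytes; illustrative cutoff, not a documented vSAN value

def choose_write_path(io_size_bytes: int) -> str:
    """Pick a write path based on the size of the incoming I/O request."""
    if io_size_bytes <= SMALL_IO_THRESHOLD:
        return "default path (small, latency-sensitive writes)"
    return "large-I/O path (big, bandwidth-heavy writes)"

for size in (4 * 1024, 32 * 1024, 256 * 1024, 1024 * 1024):
    print(f"{size // 1024:>5} KiB -> {choose_write_path(size)}")
```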
With ESA and the LFS file system, a RAID-6 write request performs almost the same as a RAID-1 write request. This lets you write at close to RAID-1 performance while consuming roughly 50% less raw capacity.
For read requests, the request is first processed by the Distributed Object Manager (DOM) client. The DOM client checks its cache for recently used blocks; if the block is in the cache, the read is served immediately. If not, the second step follows: the LFS checks whether the requested data is still held in an in-memory buffer. If it is not, the third step follows: a lookup is performed in the metadata structure, known as the B-tree, and the data is located. The result is sent back to the DOM client, the checksum is verified, the data is decompressed if it was compressed, and the read request is completed.
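As an illustration of this three-step read flow, here is a hedged Python sketch. All of the names (dom_client_cache, lfs_in_memory_buffer, btree_metadata) and the placeholder helpers are hypothetical; the sketch only mirrors the logic described above, not the real vSAN implementation.

```python
# Hedged sketch of the three-step read path described above. All names and the
# placeholder helpers are hypothetical; this mirrors the logic, not the real code.

dom_client_cache = {"blk-1": b"hot data"}        # step 1: recently used blocks
lfs_in_memory_buffer = {"blk-2": b"buffered"}    # step 2: data still held in memory by LFS
btree_metadata = {"blk-3": {"location": 0x1000, "compressed": False}}  # step 3: B-tree

def read_from_disk(location: int) -> bytes:      # placeholder for the actual disk read
    return b"cold data"

def verify_checksum(data: bytes) -> None:        # placeholder for checksum verification
    pass

def decompress(data: bytes) -> bytes:            # placeholder for decompression
    return data

def read_block(block_id: str) -> bytes:
    if block_id in dom_client_cache:             # 1) DOM client cache hit
        return dom_client_cache[block_id]
    if block_id in lfs_in_memory_buffer:         # 2) LFS in-memory buffer hit
        return lfs_in_memory_buffer[block_id]
    meta = btree_metadata[block_id]              # 3) B-tree metadata lookup
    data = read_from_disk(meta["location"])
    verify_checksum(data)                        # checksum verified before returning
    return decompress(data) if meta["compressed"] else data

for blk in ("blk-1", "blk-2", "blk-3"):
    print(blk, "->", read_block(blk))
```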
With ESA and the Native Key Provider, you can build your own cluster encryption without using an external KMS. The Data Encryption Key (DEK) is shared by all nodes in the cluster, so data can be moved or read between nodes without needing to be decrypted and re-encrypted. Compared to the OSA architecture, the encryption overhead for each I/O operation is greatly reduced. In addition, compression is not affected by encryption.
ESA also provides native snapshot support on vSAN clusters. ESA snapshots use the B-tree structure, the same one used for write and read requests, instead of the traditional snapshot chain. The LFS file system keeps metadata about which data belongs to which snapshot. Snapshot deletion is nearly 100 times faster: when a snapshot is deleted on ESA, the operation is largely a metadata deletion. The deletion is logical, and the metadata and data are then removed at the appropriate time. With version 8, a maximum of 32 snapshots can be taken per object.
The ESA architecture claims each disk independently. In the OSA architecture, if the cache disk of a disk group fails, all the capacity disks behind that cache disk go out of service. In the example below, there are 2 disk groups with a total of 12 x 8TB capacity disks. Assuming 50% utilization, when the cache disk of the second disk group fails, a rebuild starts for 24TB of data and 50% of the host's capacity is affected. In the same scenario on ESA, since there is no cache tier, a single disk failure impacts only as much data as is actually stored on that disk.
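A quick back-of-the-envelope check of this example in Python, using the figures above (2 disk groups, 12 x 8TB capacity disks, 50% utilization):

```python
# Back-of-the-envelope check of the failure-impact example above, under the stated
# assumptions: 2 OSA disk groups, 12 x 8TB capacity disks in total, 50% utilization.

disks_total = 12
disk_tb = 8.0
utilization = 0.50

# OSA: losing one disk group's cache device takes half the capacity disks offline
osa_disks_lost = disks_total / 2                      # 6 disks behind the failed cache disk
osa_rebuild_tb = osa_disks_lost * disk_tb * utilization
print(f"OSA cache-disk failure -> rebuild {osa_rebuild_tb:.0f} TB "
      f"({osa_disks_lost / disks_total:.0%} of host capacity affected)")

# ESA: no cache tier, each disk is claimed independently, so only one disk is affected
esa_rebuild_tb = 1 * disk_tb * utilization
print(f"ESA single-disk failure -> rebuild {esa_rebuild_tb:.0f} TB "
      f"({1 / disks_total:.0%} of host capacity affected)")
```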
As mentioned above, thanks to the LFS file system, a RAID-6 write request in ESA performs almost the same as a RAID-1 write request. With the ESA architecture, the node requirements for erasure coding have also been updated: a dedicated witness node is no longer required for the RAID-1 policy, whereas the OSA architecture needed one. This means both RAID-1 and RAID-5 storage policies can be created on an ESA cluster with a minimum of 3 nodes. With RAID-5 we get a 50% capacity overhead saving compared to RAID-1, and on a larger cluster, where ESA can use the wider 4+1 RAID-5 scheme, the saving rises to 75%.
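As a quick sketch, the capacity math per policy looks like this, assuming the commonly cited data-to-parity layouts (RAID-1 mirror, RAID-5 as 2+1 or 4+1, RAID-6 as 4+2):

```python
# Quick sketch of capacity overhead per storage policy, assuming the commonly cited
# data-to-parity layouts (RAID-1 FTT=1 mirror, RAID-5 as 2+1 or 4+1, RAID-6 as 4+2).

policies = {
    "RAID-1 (FTT=1)": 2.0,   # 2x raw consumed per usable TB -> 100% overhead
    "RAID-5 (2+1)":   1.5,   # 50% overhead, minimum 3 nodes on ESA
    "RAID-5 (4+1)":   1.25,  # 25% overhead, needs a larger cluster
    "RAID-6 (4+2)":   1.5,   # 50% overhead, FTT=2
}

usable_needed_tb = 100
for name, factor in policies.items():
    raw_tb = usable_needed_tb * factor
    overhead = (factor - 1) * 100
    print(f"{name:<16} raw needed for {usable_needed_tb} TB usable: "
          f"{raw_tb:>6.1f} TB (overhead {overhead:.0f}%)")
```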
With vSAN ESA, compression is enabled by default. However, if you want, you can create a different storage policy that turns compression off using vCenter Storage Policy Based Management (SPBM). Why might this be necessary? For workloads and file types that already compress their own data, such as PostgreSQL or video, it is better to rely on their built-in compression: there is no need to recompress already-compressed data, and the CPU can be used more effectively.
vSAN ESA delivers up to 4x the compression ratio of the OSA architecture. While OSA has a theoretical compression ratio of 2:1, ESA can go up to 8:1 depending on the workload. In this article, I have used the more conservative, practically guaranteed 2:1 ratio in the calculations.
NOTE: The ESA architecture does not use cluster-level deduplication. Instead, you can use granular, per-policy compression (cluster-level compression is no longer available), which allows very flexible storage policies.
Minimum Disk and Node requirements
For vSAN ESA, all disks in the node must be NVMe. Unlike OSA, vSAN ESA does not use disk groups with separate cache-tier and capacity-tier devices. Since the architecture removes the cache tier, there is no need for dedicated cache disks; instead, all disks in the node are NVMe devices of the same type and capacity. In a way, when we use mixed-use disks with ESA, it is as if every disk is already running in what used to be the cache tier. This is one of the main reasons the architecture delivers high compression and I/O performance.
Read-intensive disks are now supported in the vSAN ESA architecture. However, not just any read-intensive disk can be used: certain requirements such as a high performance class must be met, and the endurance rating (DWPD) must be at least 1. If you are wondering whether there is a cost advantage over mixed-use NVMe, you will see that there is not much difference in list prices; for this reason, I used mixed-use disks in the BOM list I created. Ultimately, the disk type depends on the read/write profile of your cluster. If the workload is read-heavy, read-intensive NVMe can make a difference in performance.
- Type: NVMe TLC
- Performance Class: Class F (100,000–349,999) or higher
- Endurance Class: 1 DWPD or higher
- Capacity: 1.6TB or higher
- vSAN ESA ReadyNode Minimum # Devices/Host: 2
- vSAN Max ReadyNode Minimum # Devices/Host: 8
Can we use different capacities of NVMe disks on the same host in the ESA architecture? The answer is YES.
But my preference would still be to use symmetric disks. The nice thing about supporting different-capacity disks in an asymmetric vSAN cluster is that if disks of another capacity become available from a different ESA cluster in the future, you can reuse them across cluster nodes. (The ReadyNode vendor and server models must match!)
With vSAN ESA, the cluster-level object limit has increased from 9,000 to 27,000. With vSAN 8 U2, you can go up to 500 VMs per node on ESA clusters, while on the OSA side the limit is still 200 VMs.
Sample sizing and BOM for ESA
As an example, I have prepared a kit list below for three different manufacturers.
DELL PowerEdge R750 BOM
HPE ProLiant DL380 Gen11 BOM
Lenovo ThinkSystem SR650 V3 BOM
According to the above configuration, the total compute and storage resources on the ESA cluster will be as follows.
In the new licensing packages, the most logical choice is to use at least 16 cores per CPU. With the core-based subscription, we no longer have a core limit per socket (we are no longer stuck with the 32-cores-per-socket limit). Now that core counts per socket have increased so much, it is more advantageous to use high-core-count CPUs in clusters with high compute requirements.
For example, the cost of 2 x 64-core CPUs is almost the same as the cost of 4 x 32-core CPUs, just as the cost of two 32GB RDIMMs is close to that of one 64GB RDIMM.
In this way, you significantly reduce your white space, server, and infrastructure requirements. How much do you gain in server cost? Perhaps around 20%. But if you operate as a large cloud provider, the savings in white space and infrastructure can be substantial (around 50%).
In terms of energy and BTUs, there is not much difference, because high-core-count CPUs have very high TDP values.
CPU models such as the new 5th Generation Intel Xeon Platinum 8592+ (350W) or 4th Generation AMD EPYC 9654 (360W) have very high power requirements. If the nodes are also to be AI-ready, there is an additional requirement of approximately 350W per GPU. With 2 sockets and 2 GPUs per node, that is roughly 1.4kW. Taking disks and fiber NICs into account, at least 1.8–2.4kW of PSU capacity is needed per node. Once we factor in the BTUs required to cool these servers, the advantage on the white space and infrastructure side and the cost on the energy and cooling side almost even out.
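Here is a rough per-node power budget matching the assumptions above. The per-NVMe wattage, the NIC/miscellaneous allowance, and the 20% PSU headroom are my own rough assumptions, not vendor figures.

```python
# Rough per-node power budget under the assumptions in the text: two ~350-360W CPUs and
# two ~350W GPUs. The per-NVMe wattage, the NIC/misc allowance and the 20% PSU headroom
# are my own rough assumptions, not vendor figures.

cpu_w, cpus = 360, 2          # e.g. EPYC 9654-class TDP
gpu_w, gpus = 350, 2          # per-GPU figure from the text
nvme_w, nvme_count = 20, 8    # assumed ~20W per NVMe device under load
nic_misc_w = 150              # NICs, fans, motherboard, etc. (assumption)

load_w = cpu_w * cpus + gpu_w * gpus + nvme_w * nvme_count + nic_misc_w
psu_headroom = 1.2            # ~20% sizing headroom (assumption)

print(f"Estimated load per node: {load_w / 1000:.2f} kW")
print(f"Suggested PSU capacity : {load_w * psu_headroom / 1000:.2f} kW")
print(f"Cooling load           : {load_w * 3.412:.0f} BTU/h")
```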
In fact, for some server models today, liquid cooling has almost become a necessity instead of air cooling.
The IT world is always equalizing the units somehow, isn’t it? 😊
Licensing
We calculate the licensing for our sample BOM list as follows.
In the new VVF subscription licensing, we use the formula (vSphere Foundation: subscription capacity = number of cores per CPU x number of CPUs per host x number of ESXi hosts).
If we take 56 cores per CPU, 2 sockets per server, and 10 nodes per cluster in the example HCI scenario:
56 x 2 x 10 = 1120 core subscription.
For the vSAN Enterprise Add-On (TiB for HCI), we use the formula (vSAN: subscription capacity = raw TiB per ESXi host x number of ESXi hosts in each vSAN cluster).
If we take 8 x 6.4TB NVMe disks per node and 10 nodes per cluster in the example HCI scenario:
51.2TB x 10 = 512TB RAW, which corresponds to a 466 TiB subscription capacity.
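The same calculation in a few lines of Python; the licensed TiB is rounded up to the next whole TiB here, which matches the 466 TiB figure.

```python
import math

# Recomputing the example VVF sizing above: core subscription and vSAN add-on TiB.

cores_per_cpu, cpus_per_host, hosts = 56, 2, 10
core_subscription = cores_per_cpu * cpus_per_host * hosts
print(f"VVF core subscription: {core_subscription} cores")    # 56 x 2 x 10 = 1120

disks_per_host, disk_tb = 8, 6.4
raw_tb = disks_per_host * disk_tb * hosts                      # 51.2 TB x 10 = 512 TB
raw_tib = math.ceil(raw_tb * 1e12 / 2**40)                     # decimal TB -> binary TiB
print(f"vSAN add-on capacity : {raw_tb:.0f} TB RAW -> {raw_tib} TiB subscription")
```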
NOTE: If you use VCF with the vSAN Add-On subscription licensing, you get 1 TiB of vSAN capacity for free for each VCF core subscription.
If we adjust the example HCI scenario according to VCF, this time we do the calculation as follows.
If we take 56 cores per CPU, 2 sockets per server, and 10 nodes per cluster in the VCF scenario:
We reach 56 x 2 x 10 = 1120 VCF core subscription.
With 51.2TB x 10 = 512TB RAW, the cluster needs a 466 TiB vSAN Enterprise Add-On subscription. Since the 1120 VCF core subscription includes 1120 TiB of vSAN capacity, that 466 TiB is already covered for free, and as a bonus we can still grow into the remaining 654 TiB at no extra cost. Note that VCF core licensing is not cheap 😊 but it is a really nice option for private cloud environments.
NOTE: If we used 24 x 6.4TB NVMe disks in each node in the VCF scenario (or added +16 more 6.4TB NVMe disks per node as a scale-up), then:
153.6TB x 10 = 1536TB RAW, meaning a 1397 TiB vSAN Enterprise Add-On subscription would be needed.
If we subtract the 1120 TiB included with the 1120-core VCF subscription from the 1397 TiB of vSAN RAW capacity, we would need to license an additional 277 TiB of vSAN Enterprise Add-On subscription for the 10-node VCF cluster.
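Here is the same VCF math as a short Python sketch, covering both the base scenario and the 24-disk scale-up note above. It assumes 1 TiB of vSAN capacity included per VCF core, as described, and rounds the licensed TiB up.

```python
import math

# The same cluster under VCF licensing: each VCF core subscription includes 1 TiB of
# vSAN capacity, so only vSAN TiB beyond that entitlement needs the add-on.

def tb_to_tib(tb: float) -> int:
    return math.ceil(tb * 1e12 / 2**40)   # round up, since licensed capacity is sold per TiB

vcf_cores = 56 * 2 * 10                   # 1120 VCF core subscriptions
included_tib = vcf_cores                  # 1 TiB of vSAN included per VCF core

for disks_per_host in (8, 24):            # base scenario vs. the scale-up NOTE above
    raw_tib = tb_to_tib(disks_per_host * 6.4 * 10)
    extra = max(0, raw_tib - included_tib)
    print(f"{disks_per_host} disks/host: {raw_tib} TiB RAW, "
          f"{included_tib} TiB included, additional add-on needed: {extra} TiB")
```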
ESA Architecture Network Requirements
In the all-flash OSA architecture, a 10G network was sufficient as a minimum, but I usually prefer a 25G network for the vSAN VMkernel. If you are managing a small OSA cluster of 3/4/6 nodes, 10G fiber might be sufficient for the vSAN network, but if you need high capacity beyond 6 nodes, 25G makes more sense.
The cost of 10G and 25G switches and of SFP+/SFP28 modules is very close. For this reason, if you are investing in a new cluster, it may be worth choosing dual-rate 10/25G switches.
In addition, even in the OSA architecture some workloads can have very high read/write requirements. If you have such applications and want to run them on vSAN, a 25G network will give you headroom for the future.
On the ESA side, VMware's guidance is very clear: if the number and capacity of NVMe disks in your ESA cluster is high, even a 25G network can be risky. If you are managing a small ESA cluster, you need at least a 25G network. If the cluster is larger, either use 25G+25G LACP/LAG or a 100G active/passive network.
It is very important that the vSAN VMkernel network is 100G, especially if you are using large numbers of high-capacity (e.g. 15TB) NVMe disks in large environments!
If you are using vSAN Max, you will need at least a 50G network anyway, so it may be more appropriate to choose a 100G switch here; that way you do not need LACP/LAG.
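To see why this guidance is so strict, here is a rough Python estimate comparing the aggregate throughput a node's NVMe devices can deliver against the vSAN uplink speed. The ~3 GB/s per-device figure is my own assumption for illustration, not a vSAN or vendor specification.

```python
# Rough comparison of aggregate per-node NVMe throughput vs. vSAN uplink speed.
# The ~3 GB/s (= 24 Gbit/s) per-device figure is an illustrative assumption only.

nvme_per_node = 8
per_device_gbit = 3 * 8                      # assumed ~3 GB/s per NVMe -> 24 Gbit/s
aggregate_gbit = nvme_per_node * per_device_gbit

for nic_gbit in (10, 25, 50, 100):
    ratio = aggregate_gbit / nic_gbit
    print(f"{nic_gbit:>3}G vSAN uplink: devices can supply ~{aggregate_gbit} Gbit/s "
          f"-> uplink is ~{ratio:.0f}x oversubscribed")
```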
vSAN MAX Architecture
We actually saw the basic version of the vSAN Max architecture as HCI Mesh in vSAN 7 U1. It evolved over subsequent releases and finally matured and became available with vSAN 8 U2. With HCI Mesh, we were already able to share capacity between different vSAN clusters. With vSAN Max, we can now completely disaggregate storage and provide it to standard vSphere clusters. In a sense, we can completely decouple VMware environments from the SAN storage and SAN switch architecture, or we can provide both SAN and HCI resources to non-HCI vSphere clusters using hybrid datastores. (A SAN datastore may require a SAN HBA depending on the situation.)
vSAN MAX runs only on ESA architecture and scales as the primary storage resource for your vSphere clusters.
Each vSAN Max cluster can scale up to 8.6 petabytes.
Use Cases
Infrastructure cost optimization; vSAN Max enables right-sizing of resources to reduce licensing costs.
Unified storage; vSAN Max lets server resources that are not ideal for HCI (such as Gen1 blades or legacy hardware) still benefit from vSAN. If you want to take advantage of vSAN while keeping resources independent of each other, this is the solution for you. You can easily deploy a shared storage platform in the datacenter.
Fast scaling for Cloud Native applications; vSAN Max can be an ideal storage resource for Cloud Native applications.
The minimum requirements for a vSAN Max cluster are as follows.
Other vSphere or HCI cluster types that are supported/not supported for vSAN Max are as follows.
Should vSAN Max Cluster and client vSphere Cluster use the same CPU manufacturer?
NO, processors from different manufacturers can be used in the vSAN Max cluster and the client vSphere cluster. For example, a client vSphere cluster using Intel CPUs can be connected to a vSAN Max cluster using AMD CPUs (or vice versa). The important thing is that the vSAN Max cluster uses hardware compatible with the ESA architecture (such as disks, NVMe controllers, and NICs).
A maximum of 128 hosts can be connected to the vSAN Max cluster.
vSAN Max Architecture Network Requirements
vSAN Max requires high network capacity due to its architecture.
- Datacenter-class redundant switch
- 100Gb uplink for the vSAN Max cluster VMkernel (at least 10Gb is sufficient for client vSphere clusters, but at least 25Gb is recommended depending on workload)
- It is recommended to use NIOC (vSphere Network I/O Control) on the vSphere Distributed Switch
- LACP can be used for 10G or 25G network infrastructures, but an active/passive connection is the most suitable alternative, since it avoids the operational complexity of LACP
Example Switch configuration
Storage Policy Based Management (SPBM) is included for all vSphere clusters on vSAN Max.
Cross-cluster vMotion is supported for client vSphere clusters. VMs can be moved between client vSphere clusters that are connected to a vSAN Max cluster.
Example sizing and BOM for vSAN Max
As an example, I have prepared a kit-list for three different manufacturers below.
DELL PowerEdge R760 Server BOM
HPE ProLiant DL380 Gen11 BOM
Lenovo ThinkSystem SR650 V3 BOM
According to the above configuration, the total compute and storage resources on the vSAN Max cluster will be as follows.
Licensing
Licensing is the same as vSAN ESA architecture. You can calculate the RAW capacity of the vSAN Max cluster in TiB and purchase it as an add-on for both VCF and VVF.
As a sample calculation
For the vSAN Advanced/Enterprise Add-On TiBs, we use the formula (vSAN: subscription capacity = raw TiB per ESXi host x number of ESXi hosts in each vSAN cluster).
In the example vSAN Max scenario, if we take 16 x 6.4TB NVMe disks per node and 10 nodes per cluster:
102.4TB x 10 = 1024TB RAW, which corresponds to a 932 TiB subscription capacity.
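The same calculation as a short Python snippet, rounding the licensed capacity up to the next whole TiB:

```python
import math

# Recomputing the vSAN Max add-on sizing above: 16 x 6.4TB NVMe per node, 10 nodes.
disks_per_host, disk_tb, hosts = 16, 6.4, 10
raw_tb = disks_per_host * disk_tb * hosts              # 102.4 TB x 10 = 1024 TB RAW
raw_tib = math.ceil(raw_tb * 1e12 / 2**40)             # decimal TB -> binary TiB, rounded up
print(f"vSAN Max add-on: {raw_tb:.0f} TB RAW -> {raw_tib} TiB subscription capacity")
```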
The nodes in the vSphere clusters that will access vSAN Max must run at least vSphere 8. No additional licensing is required for the vSphere cluster servers that connect to the vSAN Max datastore. (Base vSphere licensing is still a must!)
Is vSAN ESA or vSAN Max suitable for every workload?
In my experience, one of the biggest mistakes made in HCI architectures is the positioning of individual workloads. If you are going to run workloads like Microsoft SQL AAG, Hadoop, or file servers on HCI, you should analyze and plan very carefully; otherwise it's more trouble than it's worth :)
Running workloads like Microsoft SQL Server AAG and Hadoop on HCI is not very logical, because those systems already have redundancy built in. If you still want to run them on HCI, I recommend using a RAID-0 storage policy for such workloads; that way you use the HCI capacity more effectively. Applying a RAID-1/5/6 storage policy on the vSAN side to a system that already has node-to-node redundancy of its own is not the right approach.
Especially if you are using vSAN ESA disks as archive disks for a service like a SIEM, you are making a big mistake. Likewise, running a high-capacity file server on architectures like HCI or ESA is completely unnecessary.
This recommendation applies not only to vSAN or other HCI vendors, but also to legacy All-NVMe SAN storage.
Unstructured data workloads are a complete waste on HCI and NVMe SAN in my opinion; let them run on the platforms they know best. That way you are not putting capital on the line unnecessarily.
Another recommendation: if you are going to run a large cluster, size your network infrastructure and switch investment accordingly. Especially in the telco sector, if you are going to operate HCI as the infrastructure, you should definitely go one tier above the minimum requirement.
What workloads is vSAN ESA suitable for?
First, I think VDI is perfect for HCI.
Second, standard VM workloads (e.g. application server, web server, standalone DB server)
Third, it is now a very suitable solution for Telco 5G RAN (there are very nice, rugged servers for 5G RAN, I will definitely do a review about them later)
Fourth, if you want to connect small remote/branch offices (ROBO) to the central site, it is a suitable solution for both the central site and the branches.
Fifth, cloud native infrastructures. When we talk about VMware, Tanzu and vSAN are a very good duo and they work very integrated with each other.
Sixth, if you are a service provider, it is ideal for private cloud and public cloud service, you just need to analyze and plan properly.
The last thing I would like to mention is that if you think using SAS/SATA SSDs on vSAN OSA will reduce server costs, you are wrong. Whether you use mixed-use or read-intensive SSDs, you will end up with almost the same cost as NVMe disks. But if you use ESA with NVMe disks, you gain a lot, in terms of both performance and cost. I will show this below.
As shown below, SSD and NVMe disks of the same class are almost identical in list cost, and NVMe can even be more affordable in some projects. But you will see that there is a very serious difference in performance.
So if you are going to invest in a new server, and you are thinking about doing it with vSAN, I would recommend that you use the ESA architecture.
If you look at the compression ratios, the effective capacity lets you reduce costs significantly. And since the back end is NVMe, you are unlikely to have a performance problem here.
The same is true for mixed-use and read-intensive disks on the vSAN ESA and Max side. Again, the cost of the disks you see below is nearly the same, with only small differences in endurance and performance.
Finally, another important point: the vSAN ESA architecture does not support the new tri-mode RAID controllers for NVMe disks. The server must use a PCIe NVMe controller for the NVMe back end.
See you soon.