GPU HA & Sharing Strategy

(VMware + NVIDIA + IIoT workloads)

Audience: Infra, Ops, IoT App, Architecture
Scope: L4 GPUs, VMware, Kubernetes, PCI Passthrough vs vGPU, SAN vs No-SAN
Goal: Maximum clarity on when, why, and how to design HA correctly—without overengineering.


1. Ground Truth

1.1 What HA really means (often confused)

| Layer | What HA Means |
| --- | --- |
| VMware HA | VM restarts on another host |
| GPU HA | GPU workload continues with acceptable downtime |
| Application HA | App survives failure (may restart) |
| Business HA | SLA respected (minutes vs. seconds matter) |

🚨 GPU hardware itself is NOT live-migratable (for L4-class GPUs today).

So HA ≠ zero downtime.
HA = fast, predictable recovery.


2. GPU Allocation Models (Core Decision)

2.1 PCI Passthrough (Most misunderstood)

What it is

  • GPU exclusively bound to one VM

  • Bare-metal-like performance

Truth

  • ❌ No automatic GPU failover

  • ❌ VM restart requires manual GPU rebind

  • ❌ vMotion not supported

When it makes sense

  • Dedicated inference nodes

  • Predictable load

  • App-level restart is acceptable
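
When running passthrough, it is worth verifying from inside the guest that the GPU actually landed in the VM. A minimal sketch in Python, assuming the NVIDIA driver (and therefore nvidia-smi) is installed in the guest:

```python
import subprocess

def passthrough_gpu_visible() -> bool:
    """Check from inside the guest VM that the passthrough GPU is visible.

    Assumes the NVIDIA driver (and hence nvidia-smi) is installed in
    the guest; returns False if no GPU is enumerated.
    """
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name,memory.total", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
    except (FileNotFoundError, subprocess.CalledProcessError):
        return False
    gpus = [line for line in out.stdout.splitlines() if line.strip()]
    for gpu in gpus:
        print(f"Detected GPU: {gpu}")
    return len(gpus) > 0

if __name__ == "__main__":
    print("Passthrough OK" if passthrough_gpu_visible() else "No GPU visible")
```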


2.2 NVIDIA vGPU (Licensed, controlled sharing)

What it is

  • GPU sliced into profiles (e.g., L4-1Q, 2Q, 4Q)

  • VMware + NVIDIA vGPU license required

Truth

  • ✅ Better utilization

  • ✅ Multiple VMs per GPU

  • ❌ Still no live GPU migration

  • ❌ Requires SAN for VMware HA

When it makes sense

  • Multiple small/medium workloads

  • Shared GPU economics

  • You already accept SAN cost
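
A quick sanity check on vGPU density is simple arithmetic: framebuffer divided by profile size. A minimal sketch, assuming a 24 GB L4 framebuffer; the exact profiles and per-GPU VM limits you are licensed for come from the NVIDIA vGPU documentation:

```python
# Hypothetical density check: how many vGPU VMs fit on one L4?
# Assumes a 24 GB framebuffer; the actually supported profiles and
# counts come from NVIDIA's vGPU docs for your driver branch.
L4_FRAMEBUFFER_GB = 24

PROFILE_SIZES_GB = {  # profile name -> framebuffer per VM
    "L4-1Q": 1,
    "L4-2Q": 2,
    "L4-4Q": 4,
}

for profile, size_gb in PROFILE_SIZES_GB.items():
    vms_per_gpu = L4_FRAMEBUFFER_GB // size_gb
    print(f"{profile}: up to {vms_per_gpu} VMs per L4 ({size_gb} GB each)")
```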


3. SAN & HBA — The Non-Negotiable Reality

3.1 Why SAN is required for VMware HA

VMware HA needs:

VM Disk (VMDK) must be accessible by ALL ESXi hosts

That means:

  • Shared storage

  • SAN (FC / iSCSI)

  • HBA or NIC depending on SAN type

3.2 When HBA card is required

| Scenario | HBA Needed? | Why |
| --- | --- | --- |
| FC SAN | ✅ Yes | Fibre Channel protocol |
| iSCSI SAN | ❌ No (NIC is enough) | Ethernet based |
| No SAN | ❌ No | Local disks only |

📌 The HBA has NOTHING to do with the GPU.
It exists only for shared-storage access.


4. Scenario-Based Strategy (This is the heart)


Scenario A — Dedicated GPU Nodes, No SAN (Recommended Default)

Architecture

Node-1 (L4) → VM-1 → App Pod
Node-2 (L4) → VM-2 → App Pod
Node-3 (L4) → VM-3 → App Pod
+ Buffer Node (CPU / optional GPU)

Characteristics

  • PCI Passthrough

  • Local disks only

  • Kubernetes controls restart

Failure Handling

Node failure →
VM lost →
K8s reschedules pod →
App starts on another GPU node

Why this works

  • GPU workloads are stateless (CV inference)

  • Startup < 2–5 minutes acceptable

  • No SAN dependency

  • Lowest operational complexity

Verdict

BEST for IIoT / CV / Edge inference
✅ Cheapest
✅ Clearest ownership
❌ No VM-level HA (but app HA exists)
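
For the restart path in this scenario, a sketch of the Kubernetes side: a Deployment that claims one GPU per replica, so pods from a failed node get rescheduled onto surviving GPU nodes. This uses the official kubernetes Python client; the deployment name and image are hypothetical:

```python
from kubernetes import client, config

def gpu_inference_deployment() -> client.V1Deployment:
    """Build a Deployment whose pods each claim one GPU.

    If the node (and VM) hosting a replica dies, Kubernetes reschedules
    the pod onto another node that still advertises nvidia.com/gpu.
    The name and image are illustrative, not from the design doc.
    """
    container = client.V1Container(
        name="cv-inference",
        image="registry.example.com/cv-inference:latest",  # hypothetical image
        resources=client.V1ResourceRequirements(
            limits={"nvidia.com/gpu": "1"},  # one passthrough L4 per pod
        ),
    )
    template = client.V1PodTemplateSpec(
        metadata=client.V1ObjectMeta(labels={"app": "cv-inference"}),
        spec=client.V1PodSpec(restart_policy="Always", containers=[container]),
    )
    spec = client.V1DeploymentSpec(
        replicas=3,  # one per GPU node; failover headroom comes from N+1 capacity
        selector=client.V1LabelSelector(match_labels={"app": "cv-inference"}),
        template=template,
    )
    return client.V1Deployment(
        metadata=client.V1ObjectMeta(name="cv-inference"),
        spec=spec,
    )

if __name__ == "__main__":
    config.load_kube_config()  # assumes a reachable cluster kubeconfig
    apps = client.AppsV1Api()
    apps.create_namespaced_deployment(namespace="default", body=gpu_inference_deployment())
```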


Scenario B — VMware HA with SAN (Enterprise Traditional)

Architecture

ESXi Cluster (3 nodes)
├── SAN (shared datastore)
├── HBA / iSCSI
└── NVIDIA vGPU

Failure Handling

Host failure →
VM restarted on another host →
vGPU reattached →
App restarts

Tradeoffs

| Aspect | Reality |
| --- | --- |
| Cost | High (SAN + HBA + vGPU license) |
| Complexity | High |
| GPU failover | Restart only |
| Performance | Slight overhead |

Verdict

⚠️ Only if VM-centric org mandates VMware HA
❌ Overkill for inference workloads


Scenario C — Hybrid: GPU Passthrough + Buffer Node (Smart HA)

Architecture

3 GPU Nodes (PCI Passthrough)
1 Buffer Node (CPU / spare GPU)
Kubernetes controls placement

Failure Handling

GPU node down →
Load shifts to remaining nodes →
Buffer absorbs burst →
Manual rebind if needed

Why this is powerful

  • No SAN

  • No vGPU license

  • Operationally predictable

  • Capacity-based HA, not infra-based HA

📌 This matches industrial HA philosophy (N+1 capacity)


5. N+1 Capacity Rule (Non-Negotiable)

If:

Each L4 handles 10 streams
You run 3 nodes

Then:

Max production load = 20 streams
3rd node = failover capacity

🚨 If you run all GPUs at 100%, no HA exists
This is physics, not software.
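
The rule as arithmetic, using the numbers above; a minimal sketch:

```python
# N+1 capacity check: with one node reserved for failover,
# production load must fit on the remaining nodes.
def max_safe_load(nodes: int, streams_per_node: int, spare_nodes: int = 1) -> int:
    """Return the maximum production load that still leaves failover headroom."""
    usable_nodes = nodes - spare_nodes
    if usable_nodes <= 0:
        raise ValueError("Need at least one node beyond the spare capacity")
    return usable_nodes * streams_per_node

# Example from this section: 3 nodes x 10 streams each, N+1.
print(max_safe_load(nodes=3, streams_per_node=10))  # -> 20 streams
print(max_safe_load(nodes=4, streams_per_node=10))  # -> 30 streams
```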


6. Decision Matrix (Print This)

| Requirement | Best Choice |
| --- | --- |
| Lowest cost | PCI Passthrough |
| Fastest inference | PCI Passthrough |
| Shared GPU usage | vGPU |
| VMware-only ops | vGPU + SAN |
| Edge / IIoT | No SAN |
| Predictable restart | Kubernetes |
| Zero infra drama | Independent nodes |

7. Monitoring & Ops (Often Ignored)

Mandatory

  • NVIDIA DCGM

  • GPU temperature, memory, power

  • Node-level alerts
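
DCGM is the right production tool for this list; as a minimal polling sketch of the same basics (temperature, memory, power), NVML via the pynvml package (assumed installed) is enough. The thresholds are illustrative, not vendor limits:

```python
import pynvml

# Illustrative alert thresholds -- tune to your facility, not vendor specs.
TEMP_LIMIT_C = 85
MEM_USED_LIMIT = 0.90  # fraction of framebuffer

def poll_gpus() -> None:
    """Print temperature, memory, and power for every visible GPU via NVML."""
    pynvml.nvmlInit()
    try:
        for i in range(pynvml.nvmlDeviceGetCount()):
            handle = pynvml.nvmlDeviceGetHandleByIndex(i)
            temp = pynvml.nvmlDeviceGetTemperature(handle, pynvml.NVML_TEMPERATURE_GPU)
            mem = pynvml.nvmlDeviceGetMemoryInfo(handle)
            power_w = pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0  # NVML reports mW
            mem_frac = mem.used / mem.total
            print(f"GPU{i}: {temp} C, {mem_frac:.0%} memory, {power_w:.1f} W")
            if temp > TEMP_LIMIT_C or mem_frac > MEM_USED_LIMIT:
                print(f"GPU{i}: ALERT -- raise a node-level alarm here")
    finally:
        pynvml.nvmlShutdown()

if __name__ == "__main__":
    poll_gpus()
```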

Facilities

  • UPS sizing for GPU surge

  • Cooling validation (the L4 still draws power spikes)


8. Final Recommendation (Clear & Defensible)

Primary Strategy

Independent GPU nodes with PCI Passthrough + Kubernetes HA + N+1 capacity

SAN + HBA only when

  • VMware HA is mandated

  • Stateful VMs must survive host loss

  • Budget + ops maturity exist

vGPU only when

  • Multiple small workloads must share GPU

  • License cost justified

  • SAN already exists


9. One-Line Summary

“We design HA at the application and capacity level, not by forcing GPUs into legacy virtualization models.”

HIGH-LEVEL DECISION FLOW

START
|
v
Is GPU workload STATELESS?
(CV inference, analytics, AI inference)
|
+-- NO --> (Training / Stateful GPU / Long jobs)
| |
| v
| Use Dedicated GPU + SAN
| (vGPU optional)
| END
|
+-- YES -->
|
v
Is ZERO or NEAR-ZERO DOWNTIME required?
(< 1 min, no restart allowed)
|
+-- YES -->
| |
| v
| GPU live migration NOT supported
| → Redesign application (active-active)
| → Use traffic routing, not VMware HA
| END
|
+-- NO -->
|
v
Is VM-LEVEL HA mandatory by Infra policy?
|
+-- YES -->
| |
| v
| SAN REQUIRED
| |
| Is GPU sharing required?
| |
| +-- YES --> vGPU + SAN + HBA/iSCSI
| |
| +-- NO --> PCI Passthrough + SAN
| |
| END
|
+-- NO -->
|
v
Do you want GPU sharing across workloads?
|
+-- YES -->
| |
| v
| vGPU needed
| SAN recommended
| (Cost + license justified?)
| |
| +-- YES --> vGPU + SAN
| |
| +-- NO --> Re-evaluate workload
| END
|
+-- NO -->
|
v
Can the application tolerate a 2–5 min restart?
|
+-- YES -->
| |
| v
| PCI Passthrough
| Independent GPU nodes
| Kubernetes HA
| N+1 capacity
| NO SAN
| END
|
+-- NO -->
|
v
Add BUFFER NODE
(CPU or spare GPU)
Capacity-based HA
END
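
The same flow condensed into code, so it can live next to the runbooks; a minimal sketch that mirrors the branches above (the input booleans are the questions in the diagram):

```python
def gpu_ha_decision(
    stateless: bool,
    near_zero_downtime: bool,
    vm_ha_mandatory: bool,
    gpu_sharing: bool,
    restart_tolerable: bool,
) -> str:
    """Mirror the decision flow above; return the recommended design."""
    if not stateless:
        # Training / stateful GPU / long jobs
        return "Dedicated GPU + SAN (vGPU optional)"
    if near_zero_downtime:
        # GPU live migration is not supported: HA must move into the app.
        return "Redesign application as active-active; use traffic routing, not VMware HA"
    if vm_ha_mandatory:
        base = "vGPU" if gpu_sharing else "PCI Passthrough"
        return f"{base} + SAN (HBA for FC, NIC for iSCSI)"
    if gpu_sharing:
        return "vGPU + SAN if cost and license are justified; otherwise re-evaluate workload"
    if restart_tolerable:
        return "PCI Passthrough + independent GPU nodes + Kubernetes HA + N+1, no SAN"
    return "Add a buffer node (CPU or spare GPU); capacity-based HA"

# Default IIoT / CV path from the flow:
print(gpu_ha_decision(True, False, False, False, True))
```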

SAN & HBA SUB-DECISION (Embed This Separately)

Do we need VMware HA (VM restart on another host)?
|
+-- NO --> NO SAN
|          NO HBA
|
+-- YES -->
|
v
Shared Storage Required
|
+-- FC SAN --> HBA REQUIRED
|
+-- iSCSI --> NIC sufficient
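
And the storage sub-decision as a small helper; a sketch (the "fc"/"iscsi" labels are this sketch's own):

```python
def storage_requirements(vmware_ha: bool, san_type: str | None = None) -> str:
    """Encode the SAN & HBA sub-decision above.

    san_type is "fc" or "iscsi" once VMware HA is required.
    """
    if not vmware_ha:
        return "No SAN, no HBA (local disks only)"
    if san_type == "fc":
        return "Shared storage: FC SAN, HBA required"
    if san_type == "iscsi":
        return "Shared storage: iSCSI SAN, NIC sufficient"
    return "Shared storage required: choose FC (HBA) or iSCSI (NIC)"

print(storage_requirements(vmware_ha=False))
print(storage_requirements(vmware_ha=True, san_type="fc"))
```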

QUICK DECISION TABLE (For Meetings)

| Question | If Answer is YES | If Answer is NO |
| --- | --- | --- |
| Stateless workload? | Proceed with app-level HA | SAN-based design |
| GPU sharing needed? | vGPU | PCI Passthrough |
| VMware HA mandatory? | SAN required | No SAN |
| Restart acceptable? | K8s + N+1 | Redesign app |
| Budget constrained? | No SAN | SAN + vGPU |

RECOMMENDED DEFAULT PATH (IIoT / CV / Edge)

If:

Stateless CV inference
Restart acceptable
No VMware HA mandate

Then:

PCI Passthrough
Independent GPU nodes
Kubernetes scheduling
N+1 capacity
Optional buffer node
