Confidential AI Inference: How TEEs and GPU Confidential Computing Protect Data-in-Use

For two decades, the security industry has been very good at two things and silent about a third. We encrypt data at rest, so a stolen disk yields ciphertext. We encrypt data in transit, so a tapped wire yields ciphertext. But the moment a model loads a prompt into memory to compute on it, the protection evaporates: the plaintext sits in RAM and GPU VRAM, readable by the hypervisor, the cloud operator, a privileged insider, or anyone who can dump a memory page. Confidential AI inference closes that third gap by running the model inside hardware-enforced trusted execution environments (TEEs), so prompts and weights stay protected from everything outside the hardware boundary even while the GPU is computing on them. For regulated workloads (patient records summarized by an LLM, trading signals scored by a model, biometric templates matched against a vector index), that data-in-use exposure is the difference between a deployable system and a compliance non-starter. This article walks through the full stack: the threat model, CPU and GPU TEEs, attestation, key release, a reference architecture, and where it breaks.

What this covers: the data-in-use threat model, CPU TEEs (Intel TDX, AMD SEV-SNP, Arm CCA), NVIDIA GPU confidential computing on Hopper and Blackwell, remote attestation and key release, a regulated reference architecture, and the real-world trade-offs.

Context and Background

Cryptographic protection has historically covered two of the three states of data. Encryption at rest (LUKS, BitLocker, cloud KMS-backed volume encryption) protects bytes sitting on storage. Encryption in transit (TLS 1.3, mTLS, WireGuard) protects bytes moving across a network. The third state — data in use, the bytes loaded into CPU registers, RAM, and accelerator memory during actual computation — was, until recently, simply trusted to the platform. If you ran inference on a cloud VM, you were trusting the hypervisor, the host firmware, the cloud operator’s control plane, and every privileged process that could call ptrace or read /proc/<pid>/mem not to look.

That trust assumption is exactly what confidential computing removes. The Confidential Computing Consortium defines it as protecting data in use by performing computation in a hardware-based, attested trusted execution environment (TEE). The hardware encrypts memory and enforces that even the most privileged software outside the TEE — hypervisor, host OS, system management mode — cannot read the plaintext. For AI specifically, this matters twice over: you protect the user’s data (the prompt, the retrieved documents, the embeddings) and the provider’s intellectual property (the model weights, which on a frontier model represent tens of millions of dollars of training spend). Both live in memory at inference time, and both were previously exposed. If you are running these workloads on Kubernetes, the same primitives underpin confidential containers, which package this isolation at pod granularity.

It is worth being precise about the threat model, because confidential computing is often oversold. The adversaries it actually defends against are: (1) the untrusted cloud operator and its administrators, who have physical access to the host and root on the hypervisor; (2) a malicious or compromised host software stack — a backdoored hypervisor, a rootkit in the host kernel, firmware tampering; (3) co-tenants in a multi-tenant environment attempting to read across the isolation boundary; and (4) physical attackers with bus probes or cold-boot DRAM access. What it does not defend against is equally important: it does not protect against bugs or backdoors in the code you run inside the TEE, it does not stop a model that is configured to log prompts to an external sink, and it does not by itself defeat side channels. The mental model to carry through the rest of this article is a boundary: everything inside is measured and protected, everything outside sees only ciphertext, and the entire security argument rests on (a) the boundary being enforced by silicon and (b) you being able to prove, remotely, exactly what is inside it.

A Reference Architecture for Confidential AI Inference

A confidential AI inference service runs the model server inside a confidential VM (a CPU TEE such as Intel TDX or AMD SEV-SNP), pairs it with a GPU running in confidential computing mode so weights and activations are protected in VRAM, and gates the release of model and data decryption keys on a successful remote attestation of the entire stack. The client connects over an attested TLS channel (RA-TLS) that proves it is talking to genuine TEE hardware running the expected software before any secret is exchanged. No plaintext key, weight, or prompt exists outside the hardware-protected boundary.

Figure 1: End-to-end confidential AI inference. The client establishes an RA-TLS channel into a confidential VM; an attestation agent gathers CPU and GPU evidence, a remote verifier (including NVIDIA’s NRAS for the GPU) issues an attestation result, and only then does a Key Broker Service release the keys that decrypt model weights and serve the request.

The figure shows the four moving parts that every design must reconcile: the CPU TEE that holds the runtime and orchestrates, the GPU CC mode that does the heavy compute, the attestation pipeline that establishes trust, and the key broker that withholds secrets until trust is proven. Get any one of them wrong — say, a GPU that is attested but a CPU VM that is not, or keys released before attestation completes — and the guarantee collapses to ordinary cloud security.

A worked example makes the dependencies concrete. Suppose you serve a 70B-parameter model whose weights you encrypt at rest with a key held in your KBS. A request arrives. The confidential VM boots, measures itself, fetches a signed GPU report, and presents the combined evidence to the verifier with a fresh nonce. The verifier confirms the CVM measurement matches your approved inference image digest, confirms the GPU is a genuine H100 in CC mode via NRAS, and returns a token. The runtime presents that token to the KBS, receives the weight-decryption key wrapped to an in-enclave transport key, decrypts the weights inside the CVM, and streams them into protected VRAM over the encrypted PCIe channel. The prompt arrived over RA-TLS, so it was never plaintext on the wire or in host memory. At no point did the cloud operator, the hypervisor, or a co-tenant see a decrypted weight, prompt, or response. Remove any single guarantee — say the KBS releases keys without checking the token, or the GPU runs in normal mode — and an operator with host access reads everything. The subsections below take each layer in turn.

The CPU TEE: Confidential VMs

The CPU-side TEE is the trust anchor. Modern confidential computing on x86 has converged on VM-level isolation rather than the older process-level enclave model (Intel SGX), because LLM serving stacks — PyTorch, CUDA libraries, a web server, a tokenizer — are far too large and dynamic to fit the constrained, hand-partitioned SGX enclave model comfortably. Intel TDX (Trust Domain Extensions) and AMD SEV-SNP (Secure Encrypted Virtualization with Secure Nested Paging) both create a hardware-isolated VM whose memory is transparently encrypted by an on-die memory controller using keys the hypervisor never sees. The guest runs a nearly unmodified Linux kernel; the CPU enforces that the host cannot read guest memory and, critically, cannot silently modify it without detection (the “integrity” in SEV-SNP and the secure EPT in TDX). Arm CCA (Confidential Compute Architecture) brings the same idea to Arm via the Realm Management Extension, carving out “realms” that the normal-world hypervisor cannot inspect — relevant as Arm-based servers (AWS Graviton-class, Azure Cobalt) take on more inference.

The mechanics differ in instructive ways. Under AMD SEV-SNP, each VM is assigned an address-space identifier (ASID) bound to a distinct AES-128 memory-encryption key managed by the on-die AMD Secure Processor; the encryption happens transparently in the memory controller, so DRAM contents are ciphertext. SNP’s headline addition over earlier SEV revisions is integrity: a Reverse Map Table (RMP) tracks the ownership and validation state of every page, defeating the replay, remapping, and aliasing attacks that plagued earlier SEV. A hypervisor can no longer silently swap a guest page for an old or attacker-chosen one without the guest detecting it on access. Under Intel TDX, the isolation unit is a Trust Domain (TD); a small, Intel-signed module called the TDX Module runs in a new CPU mode (SEAM) and mediates between the untrusted VMM and the TD, enforcing a secure Extended Page Table so the host cannot map or read TD memory, while the Multi-Key Total Memory Encryption (MKTME) engine encrypts each TD’s pages under a private key. Arm CCA uses the Realm Management Extension to partition physical memory via the Granule Protection Table, with a firmware Realm Management Monitor brokering realm lifecycle, conceptually parallel to the TDX Module’s role. The practical upshot for an architect: all three give you a VM-shaped TEE that you can SSH into, run a normal container runtime inside, and treat as a sealed box. But the measurement you attest (what firmware, what kernel command line, what initial memory) and the certificate chain you verify against differ per vendor, and your attestation tooling has to speak the right dialect.

The CPU TEE holds the orchestration logic, the network termination, the attestation agent, and the staging buffers that marshal data to and from the GPU. It is the component that holds the decrypted model and data keys once attestation succeeds, and it is the root from which the GPU’s trust is extended. A design decision that recurs here is TCB size: every library, kernel module, and sidecar you pull into the confidential VM enlarges what you must measure and trust. A minimal, purpose-built inference image with a pinned digest is far easier to reason about — and to write a key-release policy against — than a sprawling general-purpose VM image.
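The pinned-digest idea can be sketched in a few lines. This is a minimal illustration, not a production measurement path: `APPROVED_IMAGE`, `APPROVED_DIGESTS`, and `is_approved` are hypothetical names, and a real deployment would compare the launch measurement reported by the hardware rather than hashing bytes in user space.

```python
import hashlib
import hmac


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest of a byte string."""
    return hashlib.sha256(data).hexdigest()


# Illustrative stand-in for the bytes of a built inference image (assumption).
APPROVED_IMAGE = b"inference-image v1: kernel + runtime + model server"

# Hypothetical allowlist that a key-release policy would be written against.
APPROVED_DIGESTS = frozenset({sha256_hex(APPROVED_IMAGE)})


def is_approved(image: bytes, approved: frozenset = APPROVED_DIGESTS) -> bool:
    """True only if the image hashes to one of the pinned digests."""
    digest = sha256_hex(image)
    # compare_digest avoids leaking how many leading characters matched.
    return any(hmac.compare_digest(digest, pinned) for pinned in approved)


if __name__ == "__main__":
    print("pinned image approved:", is_approved(APPROVED_IMAGE))
    # Any extra byte (an added sidecar, a changed library) changes the digest.
    print("modified image approved:", is_approved(APPROVED_IMAGE + b" + sidecar"))
```

The smaller and more stable the image, the fewer digests the allowlist has to carry and the easier each one is to audit.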

GPU Confidential Computing

A CPU TEE alone is useless for serious LLM inference, because the actual matrix multiplies happen on a GPU, and historically the GPU was a giant hole in the boundary: weights and activations sat in VRAM in plaintext, and anyone able to read the device’s memory over PCIe or via the driver could exfiltrate them. NVIDIA Confidential Computing, introduced on the Hopper H100 and carried forward and strengthened on the Blackwell generation, closes this. When the GPU is placed in CC mode, its on-board memory is protected, the firmware enforces that the host cannot read VRAM, and all data crossing the PCIe bus between the CPU TEE and the GPU is encrypted.

Because PCIe itself has no native confidentiality for this path on Hopper, the data is moved through encrypted bounce buffers: the CPU TEE encrypts a payload into a shared, non-confidential staging region, and the GPU decrypts it on-chip into its protected memory, with authenticated encryption protecting integrity and confidentiality in both directions. Concretely, when the inference runtime issues a cudaMemcpy from confidential host memory to the device, the data is first AES-encrypted into a shared bounce buffer that the host can see (but only as ciphertext), DMA’d across PCIe, and decrypted by the GPU’s on-die engine into protected VRAM; results return by the inverse path. The session keys for this channel are negotiated during a secure handshake (an SPDM-style session) that is itself rooted in the GPU’s attestation, so the encrypted tunnel cannot be established with an unattested or spoofed device. Two consequences follow. First, the overhead concentrates on the transfer path, not the compute path — once weights and the KV cache are resident in protected VRAM, the tensor cores run at native speed; it is the host-device copies (uploading activations, downloading logits) that pay the encryption tax. Second, this is why batching and keeping working sets resident matter so much for confidential inference economics: you amortize the encrypted-transfer cost across more useful compute. The GPU produces an attestation report — a signed measurement of its firmware, VBIOS, and CC configuration — that the verifier checks against NVIDIA’s attestation service.
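A back-of-the-envelope model makes the amortization argument concrete. The sketch below assumes, as a simplification, that CC mode slows only host-device transfers by a fixed fraction (`transfer_tax`) and leaves compute time unchanged; the function name and all numbers are hypothetical, chosen only to show the shape of the effect.

```python
def cc_overhead_fraction(compute_ms: float, transfer_ms: float, transfer_tax: float) -> float:
    """Fractional slowdown of CC mode versus non-CC mode.

    Simplifying assumption: only transfer time is inflated, by a factor
    of (1 + transfer_tax); compute time is identical in both modes.
    """
    if compute_ms < 0 or transfer_ms < 0 or transfer_tax < 0:
        raise ValueError("times and tax must be non-negative")
    baseline = compute_ms + transfer_ms
    if baseline == 0:
        raise ValueError("workload must take some time")
    confidential = compute_ms + transfer_ms * (1.0 + transfer_tax)
    return (confidential - baseline) / baseline


if __name__ == "__main__":
    tax = 0.5  # hypothetical: encrypted transfers take 50% longer
    # Chatty workload: as much time moving data as computing on it.
    chatty = cc_overhead_fraction(compute_ms=10, transfer_ms=10, transfer_tax=tax)
    # Batched workload: same transfer cost, amortized over far more compute.
    batched = cc_overhead_fraction(compute_ms=190, transfer_ms=10, transfer_tax=tax)
    print(f"chatty:  {chatty:.1%}")
    print(f"batched: {batched:.1%}")
```

Under this model the overhead reduces to `transfer_ms * transfer_tax / (compute_ms + transfer_ms)`, so it shrinks as the share of time spent in resident compute grows.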

The Combined Trust Chain

The hard part is binding the two TEEs into one trust domain. The CPU confidential VM attests itself to a verifier, and separately the GPU attests itself; the architecture must ensure that the specific GPU the verifier blessed is the same physical GPU bound to this specific CPU TEE, and that the encrypted channel between them was negotiated with keys rooted in both attestations. NVIDIA’s model has the CPU TEE collect the GPU’s signed report and forward it as part of a combined evidence bundle. Only when both the CPU quote and the GPU report verify — and the secure session between them is established — does the key broker release secrets. This composition is what makes end-to-end confidentiality real rather than theatrical: a chain is exactly as strong as its weakest attested link, and an un-attested GPU paired with a pristine CPU TEE leaks everything.
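One way to picture the composition is as a single appraisal that fails if any link is missing. The sketch below is illustrative only: the `CpuQuote` and `GpuReport` shapes, the `signature_ok` flags (standing in for real vendor signature checks), and the convention that the CPU quote's report data carries a hash of the GPU report are all assumptions made for this example, not NVIDIA's or any CPU vendor's actual format.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class CpuQuote:
    measurement: str     # launch measurement of the confidential VM
    nonce: bytes         # verifier-supplied freshness value
    report_data: bytes   # assumed here to commit to the GPU report
    signature_ok: bool   # stand-in for CPU vendor chain verification


@dataclass(frozen=True)
class GpuReport:
    cc_mode: bool        # GPU reports it is in confidential-computing mode
    nonce: bytes
    raw: bytes           # raw signed report bytes
    signature_ok: bool   # stand-in for GPU vendor verification


def combined_evidence_ok(cpu: CpuQuote, gpu: GpuReport,
                         expected_nonce: bytes, approved: frozenset) -> bool:
    """Accept only if every link in the CPU+GPU chain checks out."""
    checks = [
        cpu.signature_ok,
        gpu.signature_ok,
        cpu.nonce == expected_nonce,
        gpu.nonce == expected_nonce,
        cpu.measurement in approved,
        gpu.cc_mode,
        # Binding: this CVM vouches for exactly this GPU report.
        cpu.report_data == hashlib.sha256(gpu.raw).digest(),
    ]
    return all(checks)


if __name__ == "__main__":
    nonce = b"n" * 32
    gpu = GpuReport(cc_mode=True, nonce=nonce, raw=b"gpu-report", signature_ok=True)
    cpu = CpuQuote("M1", nonce, hashlib.sha256(gpu.raw).digest(), True)
    print(combined_evidence_ok(cpu, gpu, nonce, frozenset({"M1"})))
```

Expressing the policy as one conjunction makes the weakest-link property explicit: there is no code path in which a valid CPU quote compensates for a GPU that is unattested or not in CC mode.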

Attestation, Key Release, and the Trust Chain

Attestation is the mechanism that converts “I am running in a TEE” from a claim into cryptographic evidence a remote party can verify. The IETF RATS architecture, RFC 9334, gives the canonical vocabulary: an Attester produces Evidence about its own state; a Verifier appraises that Evidence against Reference Values and policy to produce Attestation Results; and a Relying Party consumes those results to make a trust decision — here, whether to release keys. Keeping these roles distinct is what lets a key broker trust a measurement it did not itself compute.

In a confidential AI deployment the flow runs end to end before a single token is served. The confidential VM collects its CPU quote, a hardware-signed measurement of the VM’s initial memory, firmware, and configuration (a TDX TD-quote or an SEV-SNP attestation report). It requests a signed report from the GPU. It bundles both as evidence and sends them to a verifier. The verifier checks the CPU quote against the CPU vendor’s certificate chain (for Intel, the certificates served by the Provisioning Certification Service; for AMD, the VCEK/ASK/ARK chain anchored in the AMD Root Key) and the GPU report against NVIDIA’s Remote Attestation Service (NRAS), which validates the GPU firmware measurements and CC configuration against NVIDIA’s signed reference values. If everything matches the expected reference values, the verifier issues a signed attestation result token. The runtime presents that token to a Key Broker Service (KBS), and only then are the model-weight and data-decryption keys released into the TEE. This is the central inversion of confidential computing: keys are released because the environment proved itself, not because someone with cloud credentials asked.

It helps to separate three things the RATS model deliberately keeps apart. The Evidence is raw, vendor-specific, and untrusted on its face — a TDX quote is meaningless to a relying party that cannot parse Intel’s format and verify Intel’s signature. The Verifier is the specialist that can do that parsing and signature checking and that holds the reference values; centralizing it means your key broker does not need to understand four hardware formats. The Attestation Result is a normalized, verifier-signed statement (“a genuine TDX TD with measurement M and a genuine H100 in CC mode were observed at time T”) that the relying party can consume with a single trust relationship — it trusts the verifier. This separation is what makes the architecture composable and is why you should resist the temptation to have your application directly parse raw quotes.
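The separation of roles can be sketched as two small functions: a verifier that turns evidence into a signed result, and a key broker that trusts only that result. Everything here is a simplified stand-in: the evidence is a plain dict rather than a vendor quote, `VERIFIER_KEY` and the HMAC tag substitute for the asymmetric signature a real verifier would use, and `issue_token` and `release_key` are hypothetical names.

```python
import hashlib
import hmac
import json
from typing import Optional

# Stand-in secret: a real verifier signs with a private key and the
# broker checks the signature with the matching public key.
VERIFIER_KEY = b"demo-verifier-signing-key"


def _tag(body: str) -> str:
    return hmac.new(VERIFIER_KEY, body.encode(), hashlib.sha256).hexdigest()


def issue_token(evidence: dict, reference: dict, now: int, ttl_s: int = 60) -> Optional[str]:
    """Verifier: appraise evidence against reference values, emit a result."""
    if evidence.get("cvm_measurement") != reference["cvm_measurement"]:
        return None
    if evidence.get("gpu_cc_mode") is not True:
        return None
    claims = {"cvm_measurement": evidence["cvm_measurement"],
              "gpu_cc_mode": True, "exp": now + ttl_s}
    body = json.dumps(claims, sort_keys=True)
    return body + "." + _tag(body)


def release_key(token: str, policy: dict, keys: dict, now: int) -> Optional[bytes]:
    """Relying party (key broker): consume the result, never raw evidence."""
    body, _, tag = token.rpartition(".")
    if not hmac.compare_digest(tag.encode(), _tag(body).encode()):
        return None  # not issued by the trusted verifier, or tampered with
    claims = json.loads(body)
    if now >= claims["exp"]:
        return None  # result is stale
    if claims["cvm_measurement"] != policy["cvm_measurement"]:
        return None
    if claims["gpu_cc_mode"] is not True:
        return None
    return keys[policy["key_id"]]


if __name__ == "__main__":
    reference = {"cvm_measurement": "M1"}
    policy = {"cvm_measurement": "M1", "key_id": "weights-key"}
    keys = {"weights-key": b"\x00" * 32}
    token = issue_token({"cvm_measurement": "M1", "gpu_cc_mode": True}, reference, now=1000)
    print(release_key(token, policy, keys, now=1010) is not None)
```

Note that `release_key` never parses a vendor format: it holds one trust relationship (to the verifier) and a declarative policy, which is the property the role separation is meant to buy.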

Figure 2: The attestation and key-release sequence. The confidential VM gathers its own CPU quote and the GPU’s signed report, the verifier appraises both (delegating GPU verification to NVIDIA NRAS), and the Key Broker releases decryption keys only against a valid attestation token.

Two practical patterns make this usable. RA-TLS (remote-attestation TLS) embeds the attestation evidence directly into the X.509 certificate presented during the TLS handshake, so establishing the secure channel and proving the TEE’s identity happen in one round trip — the client refuses to complete the handshake unless the embedded evidence verifies. The trick is to put the public key of an ephemeral, in-TEE-generated key pair into the attestation report (in the report’s user-data field), so the evidence cryptographically binds “this is genuine TEE hardware in this exact state” to “and it holds the private key terminating this TLS session.” Without that binding, an attacker could relay a valid attestation from one machine while terminating TLS on another. Key Broker Services (the pattern used by projects such as the CNCF Confidential Containers’ Trustee, and by cloud key-management offerings with confidential bindings) externalize the policy: an operator declares “release key K only to a TEE whose measurement is M, running image digest D,” and the broker enforces it. The same binding technique applies — the released key is wrapped to a transport key whose public half appears in the verified evidence, so only the attested TEE can unwrap it.
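The key-binding check at the heart of RA-TLS is small enough to show directly. This sketch assumes the TEE places a SHA-256 hash of its TLS public key in the report's user-data field and that the evidence itself has already been verified; the function names and the placeholder key bytes are hypothetical, and real report fields have vendor-specific sizes and encodings.

```python
import hashlib
import hmac


def user_data_for_key(public_key_der: bytes) -> bytes:
    """What the TEE would place in the report's user-data field."""
    return hashlib.sha256(public_key_der).digest()


def channel_is_bound(cert_public_key_der: bytes, report_user_data: bytes) -> bool:
    """Client-side check: does the verified report vouch for this TLS key?"""
    expected = hashlib.sha256(cert_public_key_der).digest()
    return hmac.compare_digest(expected, report_user_data)


if __name__ == "__main__":
    # Placeholder bytes standing in for DER-encoded public keys.
    tee_key = b"public key generated inside the TEE"
    attacker_key = b"public key held by a relaying attacker"

    # Genuine TEE: the report commits to the key that terminates TLS.
    report_user_data = user_data_for_key(tee_key)
    print("genuine endpoint bound:", channel_is_bound(tee_key, report_user_data))

    # Relay attack: a valid report is replayed, but TLS terminates on a
    # different key, so the binding check fails.
    print("relayed report bound:", channel_is_bound(attacker_key, report_user_data))
```

The same comparison applies to key release: the broker wraps the secret to the public key whose hash appears in the verified evidence, so a relayed report yields a ciphertext the relay cannot open.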

A subtle but critical property is freshness. Every attestation flow must include a verifier-supplied nonce that the attester folds into its signed evidence, so an adversary cannot capture a valid attestation token and replay it later against a machine that has since been compromised or reconfigured. Tokens are short-lived and bound to the specific session. Crucially, the data path that the released keys protect still runs through the encrypted CPU-to-GPU channel, so even after keys are released, plaintext weights and activations never traverse an unprotected bus or sit in host-readable memory.
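Freshness is straightforward to sketch on the verifier side: issue a random nonce, remember it briefly, and accept it exactly once. The `NonceRegistry` class below is a hypothetical in-memory illustration (a real verifier would need shared state across replicas and would also check that the nonce appears inside the signed evidence).

```python
import secrets
from typing import Dict


class NonceRegistry:
    """Issues single-use nonces that expire after ttl_s seconds."""

    def __init__(self, ttl_s: float) -> None:
        self._ttl_s = ttl_s
        self._expiry_by_nonce: Dict[bytes, float] = {}

    def issue(self, now: float) -> bytes:
        """Create a fresh random nonce and record when it expires."""
        nonce = secrets.token_bytes(32)
        self._expiry_by_nonce[nonce] = now + self._ttl_s
        return nonce

    def consume(self, nonce: bytes, now: float) -> bool:
        """Accept a nonce at most once, and only before it expires."""
        # pop() removes the nonce, so a replay finds nothing to consume.
        expires_at = self._expiry_by_nonce.pop(nonce, None)
        return expires_at is not None and now < expires_at


if __name__ == "__main__":
    registry = NonceRegistry(ttl_s=30)
    nonce = registry.issue(now=0)
    print("first use accepted:", registry.consume(nonce, now=10))
    print("replay accepted:", registry.consume(nonce, now=11))
```

Passing `now` explicitly keeps the sketch deterministic; production code would read a monotonic clock instead.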

Figure 3: The CPU-TEE-to-GPU-CC data path. Encrypted weights in the confidential VM are staged, encrypted into a bounce buffer, transferred over PCIe, and decrypted on-chip into protected VRAM; results return the same way. Plaintext never exists outside hardware-protected memory.

Comparing the Hardware

The four building-block technologies differ in granularity, maturity, and what exactly they measure. The table below summarizes the practical distinctions an architect needs.

Technology	Vendor	Isolation unit	Memory protection	Attestation evidence	Role in AI inference
Intel TDX	Intel	Trust Domain (VM)	Per-TD encrypted memory, secure EPT	TD-quote signed via Intel chain	Host CVM for the runtime
AMD SEV-SNP	AMD	Encrypted VM	Per-VM encryption + integrity (reverse map table)	SEV-SNP attestation report	Host CVM for the runtime
Arm CCA	Arm	Realm	Realm memory isolated via RME	Realm attestation token	Emerging CVM on Arm servers
NVIDIA GPU CC	NVIDIA	GPU (H100/Blackwell)	Protected VRAM, encrypted PCIe bounce buffers	Signed GPU report verified by NRAS	Accelerated compute layer

The key architectural reading of this table: TDX, SEV-SNP, and CCA fill the same role (the host confidential VM), and you typically pick whichever your cloud and CPU vendor offer, while NVIDIA GPU CC is a complementary layer that must be composed with one of them. There is no GPU-only confidential inference; the GPU’s trust is always extended from a CPU TEE that holds the keys and orchestrates attestation.

Confidential VMs vs. Confidential Containers

There is a second axis of choice: packaging. A confidential VM gives you the whole guest (kernel, userland, your inference server) as one attested unit. It is conceptually simple: you measure the VM image, and everything inside shares one trust boundary. The downside is granularity: the entire VM is the unit of isolation and of measurement, so a multi-tenant platform either dedicates a CVM per tenant or accepts that co-located workloads share a boundary. Confidential containers (the CNCF Confidential Containers project, building on Kata-style microVMs) push the boundary down to the pod: each confidential pod runs inside its own lightweight TEE-backed VM, attested independently, integrated with Kubernetes scheduling and the Trustee key-broker. This is the right model when you already orchestrate inference on Kubernetes and want per-workload attestation and key release without hand-rolling VM lifecycle management; the trade-off is a more elaborate stack (the guest pulls and verifies its own image, an agent inside the pod brokers attestation) and a slightly larger conceptual surface. For a single, dedicated regulated inference service, a CVM is often simpler; for a multi-tenant platform serving many models or customers, confidential containers usually win on isolation granularity and operational fit. The attestation, GPU-CC, and key-release mechanics described above are identical in both packagings; only the unit of isolation and the orchestration glue differ.

Trade-offs, Gotchas, and What Goes Wrong

Confidentiality is not free, and the failure modes are subtle. The first cost is performance. Memory encryption, the encrypted PCIe bounce-buffer path, and attestation handshakes all add overhead. The honest answer is that overhead is highly workload-dependent: compute-bound large-batch LLM inference, where the GPU spends most of its time in matrix multiplies on data already resident in protected VRAM, tends to see small overhead, while small, chatty, transfer-heavy workloads that constantly move data across the encrypted PCIe path see more. NVIDIA’s own confidential computing documentation and performance write-ups report that for many LLM inference workloads the overhead is modest, frequently in the low single-digit to low double-digit percent range depending on model size and batch size. Treat any single headline number as approximate, and benchmark your own model, batch size, and sequence length. Do not design capacity plans around a blog-post percentage.

Figure 4: Deployment and threat boundaries. The confidential VM, GPU in CC mode, and inference runtime sit inside the trusted boundary; the host hypervisor, cloud operator, and network/storage are explicitly untrusted and see only ciphertext.

The second issue is side channels. TEEs protect memory contents, not all observable behavior. Access patterns, cache timing, page-fault sequences, and power draw can leak information; the long line of SGX side-channel research (controlled-channel attacks, Foreshadow/L1TF-class transient-execution issues) is a standing reminder that “encrypted memory” is not “no information leakage.” Because the hypervisor still schedules the TEE and handles its page faults, a malicious host occupies an unusually powerful position to observe access patterns at page granularity. For inference specifically, output-length and inter-token timing side channels can leak information about prompts and responses even when memory is sealed: response length alone correlates with content, and a network observer measuring token-streaming cadence learns more than you might expect. Mitigations (padding, constant-time-ish serving, batching unrelated requests) all cost performance.
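Length padding, the simplest of those mitigations, can be sketched as rounding every response up to a bucket boundary so that an observer learns only the bucket, not the exact length. The function names and the default bucket of 64 tokens are assumptions for illustration; the second function shows the cost side of the trade.

```python
def padded_token_count(n_tokens: int, bucket: int = 64) -> int:
    """Round a response length up to the next multiple of `bucket`.

    Empty responses are padded to one full bucket so that "no output"
    is not distinguishable from a short output.
    """
    if n_tokens < 0:
        raise ValueError("n_tokens must be non-negative")
    if bucket <= 0:
        raise ValueError("bucket must be positive")
    buckets_needed = max(1, -(-n_tokens // bucket))  # ceiling division
    return buckets_needed * bucket


def padding_overhead(n_tokens: int, bucket: int = 64) -> float:
    """Fraction of the padded response that is filler."""
    padded = padded_token_count(n_tokens, bucket)
    return (padded - n_tokens) / padded


if __name__ == "__main__":
    for n in (1, 64, 65, 300):
        print(n, "->", padded_token_count(n), f"(filler {padding_overhead(n):.0%})")
```

Larger buckets hide more but waste more bandwidth and generation time, which is the performance cost in concrete form.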

Third, attestation complexity is a real operational tax: you now run verifiers, manage reference values and certificate chains, handle revocation, and must re-attest as firmware and microcode are patched. A TCB recovery event (where a vendor revokes a vulnerable firmware version) can invalidate previously-good measurements overnight, and if your key-release policy pins exact measurements, a routine security patch can break your fleet’s ability to obtain keys until you update reference values. This is a feature, not a bug (you want to stop trusting vulnerable firmware), but it demands an operational process, not a one-time setup.

Fourth, there is supply-chain and root-of-trust dependence: you are ultimately trusting Intel, AMD, Arm, and NVIDIA’s silicon, firmware signing keys, and attestation services. You also depend on the availability of those services: if NRAS or a vendor PCS endpoint is unreachable, attestation, and therefore key release, can stall. Confidential computing narrows your trust base dramatically, from “the entire cloud stack and every operator” down to “the CPU/GPU vendor’s hardware root of trust,” but it does not eliminate it. It relocates trust to the hardware vendor, which for most regulated workloads is a far better place to put it, but a place worth naming explicitly in your risk register.
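The exact-pin problem around TCB recovery suggests expressing reference values as a versioned policy rather than a single hard-coded value. The sketch below shows one possible shape, assuming firmware exposes an orderable version triple; `ReferencePolicy`, `firmware_acceptable`, and the version numbers are all hypothetical, and this minimum-version approach is an illustration rather than any vendor's prescribed method.

```python
from dataclasses import dataclass
from typing import FrozenSet, Tuple

Version = Tuple[int, int, int]


@dataclass(frozen=True)
class ReferencePolicy:
    policy_version: int                      # bumped on every change, kept in version control
    min_firmware: Version                    # oldest firmware still trusted
    revoked: FrozenSet[Version] = frozenset()  # versions explicitly withdrawn


def firmware_acceptable(firmware: Version, policy: ReferencePolicy) -> bool:
    """Accept firmware at or above the floor that has not been revoked."""
    return firmware >= policy.min_firmware and firmware not in policy.revoked


# Before a (hypothetical) TCB recovery: 1.0.0 is the trusted floor.
POLICY_V1 = ReferencePolicy(policy_version=1, min_firmware=(1, 0, 0))

# After the vendor revokes 1.0.0: raise the floor and record the revocation.
POLICY_V2 = ReferencePolicy(policy_version=2, min_firmware=(1, 1, 0),
                            revoked=frozenset({(1, 0, 0)}))

if __name__ == "__main__":
    for fw in [(1, 0, 0), (1, 1, 0), (1, 2, 0)]:
        print(fw, "v1:", firmware_acceptable(fw, POLICY_V1),
              "v2:", firmware_acceptable(fw, POLICY_V2))
```

A floor-plus-revocation policy lets a later patch pass without an emergency policy edit, while the explicit `policy_version` leaves an audit trail for each time the floor moved. Whether a floor is acceptable at all, versus exact pins, is a risk decision for the deployment.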

Practical Recommendations

Treat confidential AI inference as a system property, not a single product you buy. Start by writing down your threat model explicitly: if your adversary is the cloud operator or a privileged insider, confidential computing is directly on point; if it is a malicious model author or a poisoned dependency inside your TEE, it is not — confidential computing protects the boundary, not the code you put inside it. Insist on end-to-end attestation: a design that attests the CPU VM but runs the GPU in non-CC mode protects nothing once weights hit VRAM. Bind key release to attestation through a real KBS with declarative policy, and version your reference values so firmware patches do not silently break trust or, worse, get waved through. Keep the TCB small: the more you cram into the confidential VM, the larger your attack surface and the harder your measurement is to reason about. Finally, benchmark on your own workload before committing capacity.

Checklist:

[ ] Threat model written down; adversary explicitly named (operator, insider, co-tenant).
[ ] CPU TEE (TDX / SEV-SNP / CCA) selected to match cloud and vendor availability.
[ ] GPU confirmed in confidential-computing mode (H100/Blackwell), not just CC-capable.
[ ] Combined CPU+GPU attestation verified, with GPU evidence checked against NRAS.
[ ] Key release gated on attestation via a KBS with declarative, versioned policy.
[ ] RA-TLS (or equivalent) so clients verify the TEE before sending data.
[ ] Reference values version-controlled; TCB-recovery / patch process defined.
[ ] Side-channel exposure assessed (timing, output length, access patterns).
[ ] Own-workload performance benchmark run before capacity planning.

Frequently Asked Questions

What is confidential AI inference?

It is running model inference inside hardware trusted execution environments so that prompts, retrieved data, intermediate activations, and model weights remain protected in memory while the model computes on them. The CPU and GPU enforce that even the hypervisor and cloud operator cannot read the plaintext, and decryption keys are released only after the environment proves its identity through remote attestation.

How is this different from just encrypting data at rest and in transit?

At-rest and in-transit encryption leave data exposed during computation — the moment bytes are loaded into RAM or VRAM to be processed, they are plaintext and readable by privileged software. Confidential computing adds the missing third state, data-in-use protection, by encrypting memory in hardware and isolating the workload from the host.

Do I need a special GPU for confidential AI inference?

Yes. You need a GPU with confidential computing support — NVIDIA’s Hopper H100 and the Blackwell generation provide it — running in CC mode, paired with a CPU TEE (Intel TDX or AMD SEV-SNP). A standard GPU will hold model weights and activations in plaintext VRAM, defeating the purpose, so the accelerator must participate in the trust chain.

What is remote attestation and why does it matter?

Remote attestation is the process by which the TEE produces hardware-signed evidence about its own state, a verifier appraises that evidence against expected reference values (per IETF RFC 9334), and a relying party uses the result to decide whether to trust the environment. It matters because it is what lets a key broker safely release model and data keys only to a genuine, correctly-configured TEE rather than to anyone with cloud access.

What performance overhead should I expect?

It is workload-dependent. Large-batch, compute-bound LLM inference with data resident in protected VRAM tends to see modest overhead, while transfer-heavy workloads that constantly cross the encrypted PCIe path see more. Published figures are often in the low single-digit to low double-digit percent range, but treat any specific number as approximate and benchmark your own model and batch size.

Does confidential computing protect against every attack?

No. It protects data-in-use against the host and operator, but it does not eliminate side channels (timing, cache, access-pattern, output-length leakage), it does not protect against malicious or vulnerable code you run inside the TEE, and it relocates — rather than removes — trust to the silicon and firmware vendors. It is a powerful boundary, not a universal guarantee.

Confidential AI Inference: TEEs and GPU Confidential Computing (2026)