Kubernetes Cost Optimization and GPU Rightsizing (2026)

Most Kubernetes bills are not high because the workloads are heavy. They are high because the cluster is full of air. Kubernetes cost optimization is, at its core, the discipline of removing that air — the gap between what pods request and what they actually use, the idle nodes that never drain, and the GPUs sitting at single-digit utilization while the meter runs at full price. In 2026, with accelerator capacity scarce and AI workloads pushing a single H100 node past the cost of an entire web tier, the gap is no longer a rounding error. It is the line item finance asks about.

This post treats cost optimization as an engineering control loop, not a one-time cleanup. We walk the full stack: tuning requests and limits, bin-packing nodes with Karpenter, scaling workloads with HPA and KEDA, and then the part that dominates modern bills — slicing GPUs with MIG, time-slicing, and MPS so one accelerator serves many tenants.

What this covers: the FinOps loop, the request-to-usage gap, Karpenter consolidation, fractional GPUs, spot interruption handling, and the gotchas that turn savings into outages.

Context and Background

Kubernetes overspend has three structural causes, and almost every cluster suffers from all three at once.

The first is over-provisioned requests. Developers set CPU and memory requests defensively — they copy a manifest, double the numbers “to be safe,” and never revisit them. The scheduler treats requests as reservations, so a pod requesting 2 cores but using 0.2 holds 1.8 cores hostage. Across hundreds of deployments, requested capacity routinely runs two to five times actual usage. You pay for the reservation, not the work.

The second is idle nodes. Autoscalers add nodes under pressure but are far more timid about removing them. A node that drops to 15% utilization after a traffic spike often lingers for hours because a single unmovable pod — or a misconfigured pod disruption budget — pins it in place. Each lingering node is a full instance you rent and barely use.

The third, and increasingly the largest, is idle GPUs. A model-serving pod scheduled onto a whole A100 may use a fraction of its compute and memory, yet Kubernetes hands it the entire device because the default GPU resource model is integer-only — one pod, one card. At AI-node prices, a GPU at 10% utilization is the single most expensive idle resource in your fleet.

These three causes share a root: Kubernetes optimizes for availability and isolation by default, and cost is a property you have to opt into. The scheduler honors requests as hard reservations because doing otherwise would risk noisy-neighbor failures; autoscalers scale down cautiously because aggressive eviction risks availability; GPUs are handed out whole because partitioning needs explicit hardware and plugin configuration. None of this is a bug. It means cost optimization is fundamentally about relaxing safe defaults deliberately, with enough measurement to know you have not relaxed them past the point of stability. That framing matters because it explains why one-time cleanups fail: the defaults reassert themselves with every new deployment.

The fix is not a tool; it is a loop. The FinOps practice frames it as inform, optimize, operate: make spend visible by team and workload, take action to remove waste, then bake the discipline into platform defaults so waste does not creep back. Cost optimization that runs once decays in weeks as new services land with fresh defensive requests. We cover the broader economics of cloud spend and carbon-aware scheduling in our FinOps and GreenOps guide; here we go deep on the Kubernetes-specific mechanics.

The Cost Optimization Stack

Figure 1: The Kubernetes cost optimization control loop — observe, recommend, act, bin-pack, and guard.

The control loop starts with observability (Kubecost or OpenCost mapping spend to workloads), feeds recommendations from the Vertical Pod Autoscaler and idle-node reports, acts by tuning requests and scaling replicas, then lets Karpenter bin-pack and consolidate the resulting demand onto the fewest, cheapest nodes — with FinOps budgets and spot pools wrapping the whole thing as guardrails.

To reduce Kubernetes cost, close three gaps in order: the request-to-usage gap at the pod level, the node-packing gap at the cluster level, and the replica-count gap at the workload level. Tackle them in that sequence — right-sizing requests first makes every downstream packing and scaling decision cheaper and more accurate.

Requests and Limits Rightsizing

Requests are the foundation, because the scheduler makes every placement decision from them. If requests are inflated, the bin-packer packs inflated reservations and provisions nodes for capacity nobody uses. Fix requests first or everything below inherits the error.

The mechanism for closing the request-to-usage gap is the Vertical Pod Autoscaler (VPA). It watches historical CPU and memory consumption and emits three recommendations (target, lower bound, upper bound) derived from usage percentiles. Run VPA in recommendation mode, which is updateMode "Off" in the VPA spec, to get the numbers without auto-applying them. Auto-apply on production traffic restarts pods, which you rarely want unannounced.

A realistic starting manifest looks like this:

resources:
  requests:
    cpu: "250m"
    memory: "256Mi"
  limits:
    memory: "512Mi"   # memory limit prevents node OOM
    # no CPU limit — see throttling discussion below

Two deliberate choices here. Memory carries both a request and a limit because memory is incompressible — exceed the limit and the kernel OOMKills the container, so the limit is a hard safety ceiling that protects the node. CPU carries a request but no limit, because CPU is compressible: a CPU limit enforced by the Completely Fair Scheduler throttles the container even when the node has spare cycles, adding latency for no benefit. Set the request to your steady-state need so the scheduler reserves a fair floor, and let bursts use idle headroom. This single pattern — memory limit, no CPU limit — eliminates a large class of mysterious p99 latency spikes while keeping nodes safe.

There is a second-order effect most teams miss: requests and limits together determine a pod’s Quality of Service (QoS) class, and QoS decides who dies first when a node runs out of memory. A pod whose requests equal its limits for every resource is Guaranteed and is evicted last. A pod with requests below limits is Burstable. A pod with neither is BestEffort and is the first thing the kubelet kills under memory pressure. If you right-size carelessly — stripping limits off everything to “save money” — you quietly demote critical services into eviction-prone classes. Keep latency-critical and stateful pods Guaranteed or tightly Burstable; let genuinely sacrificial batch work be BestEffort so it absorbs pressure instead of your API tier.
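To make the class boundaries concrete, here is a sketch of a container that would land in the Guaranteed class. The values are placeholders, not recommendations. Note that Guaranteed requires a CPU limit equal to the CPU request, so it gives up the burst headroom of the memory-limit, no-CPU-limit pattern shown earlier, which on its own produces a Burstable pod.

```yaml
# Hypothetical values. Guaranteed QoS requires requests == limits
# for both CPU and memory on every container in the pod.
resources:
  requests:
    cpu: "500m"
    memory: "512Mi"
  limits:
    cpu: "500m"
    memory: "512Mi"
```

The class Kubernetes actually assigned is reported in the pod's status.qosClass field, which is a quick way to check that a right-sizing change did not demote a critical service.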

The goal is a request-to-usage ratio near 1.3 to 1.5 — enough headroom for normal variance, not the 3x–5x most clusters start with. Pull VPA recommendations into a dashboard, sort deployments by absolute waste (requested minus used, times replica count, times node price), and fix the biggest line items first. The top ten deployments usually account for most of the recoverable spend. One caution on VPA: its CPU recommendations are driven by percentile usage over a window, so a service with rare but real spikes — a daily report job, a cache warm — can be sized down to a number that throttles or starves it during the spike. Always sanity-check recommendations against peak behavior, not just the median, before applying. Right-sizing is a measurement problem first and a YAML problem second.
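The recommendation-only mode described earlier in this section can be sketched as a single manifest. This assumes the VPA components are installed in the cluster; the Deployment name checkout-api is hypothetical.

```yaml
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: checkout-api-vpa        # hypothetical name
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api          # hypothetical workload
  updatePolicy:
    updateMode: "Off"           # compute recommendations, never evict or resize pods
```

With this in place the recommendations appear in the object's status under recommendation.containerRecommendations, where a dashboard or a script can read the target, lowerBound, and upperBound values per container.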

Bin-Packing and Node Autoscaling

Once requests are honest, the question becomes: how few nodes can hold this demand? That is the bin-packing problem, and it is where Karpenter has changed the economics versus the older Cluster Autoscaler.

Cluster Autoscaler works from fixed node groups. You predefine instance types, and it scales those groups up and down. It is reliable but coarse — it cannot pick a cheaper instance shape for the actual pending pods, and it is conservative about scale-down.

Karpenter is provisioner-driven. It looks at the exact resource shape of pending pods and launches the cheapest instance — across families, sizes, and spot or on-demand — that fits them. Crucially, it runs consolidation: it continuously scans for nodes that are underused or whose pods would fit elsewhere, drains them, and reschedules onto fewer, denser nodes. That consolidation loop is the single highest-leverage cost lever in a modern cluster.

Figure 2: How pending pods are scheduled, then consolidated onto fewer nodes to cut spend.

A pending pod is first tried against existing nodes; if it fits, the scheduler binds it and density rises. If nothing fits, Karpenter provisions the cheapest viable instance. Periodically the consolidation scan finds low-utilization nodes, drains their pods onto denser packing, and terminates the now-empty nodes — the step that actually removes spend rather than just adding capacity.

The enemy that consolidation fights is fragmentation. Over a day of scale-ups and scale-downs, pods accumulate across nodes like files on a fragmented disk: each node ends up 40–60% used, none full enough to evacuate, and collectively they waste the equivalent of several whole nodes. The scheduler alone will not fix this, because it only places new pods and never relocates running ones. Consolidation is the defragmenter — it deliberately disturbs the running placement to recover the stranded capacity. That is also why it must be allowed to evict and reschedule pods; a cluster that forbids disruption (through blanket pod disruption budgets or do-not-evict annotations everywhere) freezes its own fragmentation in place and pays for it indefinitely. The lever and its cost are the same mechanism: controlled disruption now in exchange for fewer nodes.
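As a counterpoint to blanket disruption bans, a pod disruption budget can protect availability while still leaving room for consolidation. The sketch below assumes a hypothetical Deployment labeled app: checkout-api with several replicas; it allows one pod at a time to be evicted rather than none.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: checkout-api-pdb        # hypothetical name
spec:
  maxUnavailable: 1             # permits controlled eviction, one pod at a time
  selector:
    matchLabels:
      app: checkout-api         # hypothetical label
```

A budget that allows zero disruptions (for example, minAvailable equal to the replica count) is the kind that pins nodes in place.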

A minimal Karpenter NodePool that prefers spot and consolidates aggressively:

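apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: default
spec:
  template:
    spec:
      # nodeClassRef is required in karpenter.sh/v1.
      # Assumes AWS and an existing EC2NodeClass named "default".
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64", "arm64"]
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 1m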

Allowing arm64 lets Karpenter pick Graviton-class instances when the workload has multi-arch images — often a meaningful per-core discount. WhenEmptyOrUnderutilized is the policy that does the real work: it consolidates not just empty nodes but underused ones, repacking the cluster as demand ebbs.

The reason provisioner-driven autoscaling beats fixed node groups on cost is shape-matching. With node groups you guess the instance shape up front; whatever you guess, some pods waste it. A node group of 8-core machines running pods that each want 3 cores strands two cores per node permanently. Karpenter instead reads the aggregate shape of the pending batch and picks an instance whose CPU-to-memory ratio fits — a memory-heavy batch lands on a memory-optimized family, a compute batch on a compute-optimized one. Because it spans capacity types in the same decision, it can also fall back from spot to on-demand automatically when spot is unavailable, so you get the discount without hand-managing two parallel node groups.

There is a subtlety worth flagging: consolidation interacts with pod anti-affinity and topology spread. If your pods carry anti-affinity rules or strict topology-spread constraints, consolidation may be unable to repack them as tightly as the raw resource math suggests, because the constraints forbid co-location. The bin-packer is only ever as dense as your scheduling constraints allow, and overly rigid anti-affinity is a hidden cost driver that no autoscaler can fix. Audit your spread and affinity rules with the same scrutiny you give requests; a “high availability” rule that pins one pod per node can double your node count for a workload that would have been perfectly safe at two-per-node.
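One way to soften a rigid one-pod-per-node rule is a topology spread constraint that expresses a preference instead of a hard requirement. The fragment below is a sketch for a pod template; the app label is hypothetical, and whether a soft constraint is acceptable depends on the workload's availability needs.

```yaml
# Pod template fragment. Spreads replicas across nodes when possible,
# but still lets the scheduler co-locate them when capacity is tight.
topologySpreadConstraints:
  - maxSkew: 1
    topologyKey: kubernetes.io/hostname
    whenUnsatisfiable: ScheduleAnyway   # soft preference; DoNotSchedule would make it hard
    labelSelector:
      matchLabels:
        app: checkout-api               # hypothetical label
```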

Workload Autoscaling

Vertical sizing tunes a pod; horizontal scaling tunes the number of pods, and that is where you stop paying for idle replicas during quiet periods. The Horizontal Pod Autoscaler (HPA) scales replicas on CPU, memory, or custom metrics. For request-driven services, scaling on a custom metric like requests-per-second or queue depth tracks real load far better than CPU, which lags and overshoots.

A concrete worked example makes the compounding obvious. Numbers below are illustrative. Take a service running 40 replicas, each requesting 2 cores but using 0.4 — a 5x over-provision. Right-sizing requests to 0.6 cores (a safe 1.5x ratio) drops reserved CPU from 80 cores to 24, a 70% cut before a single node is touched. Now feed that honest demand into Karpenter consolidation and the cluster repacks 24 cores onto perhaps three dense nodes instead of the eight half-empty nodes the inflated requests had spread across. Then move the stateless replicas onto spot at, say, a 70% discount. The three multipliers stack: 0.3 (right-sizing) times tighter packing times 0.3 (spot) compounds far below the original bill. No single lever did it; the sequence did, which is why order matters more than any individual technique.

For event-driven and bursty workloads, KEDA extends HPA with dozens of scalers — Kafka lag, SQS depth, cron schedules, Prometheus queries — and, critically, scale-to-zero. A consumer that processes a nightly batch can sit at zero replicas all day and spin up only when the queue fills. Combined with Karpenter consolidation, scale-to-zero means the nodes backing that workload also disappear, so you pay nothing for idle event consumers. That pairing — KEDA scaling pods to zero, Karpenter removing the emptied nodes — is one of the cleanest cost wins available and applies to most asynchronous workloads.
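A scale-to-zero consumer might look roughly like the following ScaledObject. It assumes KEDA is installed and uses the Kafka scaler; the Deployment name, broker address, topic, and thresholds are all hypothetical.

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: nightly-consumer          # hypothetical name
spec:
  scaleTargetRef:
    name: nightly-consumer        # hypothetical Deployment to scale
  minReplicaCount: 0              # allow scale-to-zero when there is no lag
  maxReplicaCount: 20
  cooldownPeriod: 300             # seconds to wait before scaling back to zero
  triggers:
    - type: kafka
      metadata:
        bootstrapServers: kafka.example.svc:9092   # hypothetical broker
        consumerGroup: nightly-consumer
        topic: nightly-jobs
        lagThreshold: "50"
```

The cooldown period is the knob to watch: too short and the consumer flaps between zero and one replica on a trickle of messages.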

A caution that separates working autoscaling from thrashing autoscaling: scaling reactions must be slower than your workload’s natural noise, or the cluster oscillates. HPA exposes stabilization windows and scaling policies precisely so a brief dip does not trigger a scale-down that a brief recovery immediately reverses. Each oscillation costs you — pods cold-start, caches re-warm, and Karpenter churns nodes underneath. Set scale-down stabilization conservatively (tens of seconds to minutes depending on the workload), keep scale-up responsive, and the system converges instead of flapping. The two autoscalers also compose in a specific order worth understanding: HPA and KEDA change the number of pods, which changes pending demand, which is what Karpenter then bin-packs into nodes. Get the pod-level signal right first; the node-level economics follow from it.
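The asymmetry described above, slow to scale down and quick to scale up, can be sketched with the behavior block of an autoscaling/v2 HPA. The metric name http_requests_per_second is hypothetical and assumes a custom metrics adapter is serving it; the windows and targets are illustrative.

```yaml
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: checkout-api-hpa          # hypothetical name
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: checkout-api            # hypothetical workload
  minReplicas: 3
  maxReplicas: 40
  metrics:
    - type: Pods
      pods:
        metric:
          name: http_requests_per_second   # hypothetical custom metric
        target:
          type: AverageValue
          averageValue: "100"
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300      # ignore brief dips
      policies:
        - type: Percent
          value: 50                        # remove at most half the replicas per minute
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0        # react to load immediately
```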

Guardrails That Keep Requests Honest

Right-sizing once is easy; keeping it right is the hard part, because every new deployment arrives with the same defensive defaults that caused the original bloat. The durable fix is policy enforcement at admission time. A LimitRange per namespace sets default requests and limits and caps the maximum a single pod can ask for, so a developer who omits requests gets a sane default instead of an unbounded reservation. A ResourceQuota caps total requested CPU, memory, and GPU per namespace, turning the namespace into a hard budget that a runaway deployment cannot exceed.
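A minimal sketch of both built-ins for one namespace follows. The namespace team-a and every number are hypothetical. The LimitRange deliberately caps only memory: setting a CPU max without a CPU default would cause containers that omit a CPU limit to be assigned the max as their limit, which works against the no-CPU-limit pattern discussed earlier.

```yaml
apiVersion: v1
kind: LimitRange
metadata:
  name: default-sizes
  namespace: team-a               # hypothetical namespace
spec:
  limits:
    - type: Container
      defaultRequest:             # applied when a container omits requests
        cpu: 250m
        memory: 256Mi
      default:                    # applied when a container omits limits
        memory: 512Mi
      max:
        memory: 8Gi               # largest memory limit a single container may set
---
apiVersion: v1
kind: ResourceQuota
metadata:
  name: team-a-budget
  namespace: team-a
spec:
  hard:
    requests.cpu: "40"
    requests.memory: 80Gi
    requests.nvidia.com/gpu: "2"  # extended resources are quota'd via the requests. prefix
```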

Beyond the built-ins, a policy engine running as a validating admission controller (Kyverno and OPA Gatekeeper are common choices) can reject manifests that violate cost rules: a pod requesting more than 4 cores without an exception label, a deployment with no requests at all, a GPU pod that does not specify a sharing profile. Pushing these checks to admission time means the cluster cannot drift back into waste silently; the bad manifest is rejected at apply, with a message telling the author why. This is the “operate” leg of the FinOps loop made concrete: optimization that lives only in a dashboard erodes, but optimization encoded as an admission policy holds. Treat cost guardrails the way you treat security guardrails: as code that runs on every change, not as a quarterly review.
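As one possible shape for such a rule, the sketch below uses Kubernetes' built-in ValidatingAdmissionPolicy with a CEL expression rather than an external policy engine. It assumes a cluster version where the admissionregistration.k8s.io/v1 policy API is available, and it covers only the simplest case from the list above: rejecting Deployments whose containers omit a CPU request. Names are hypothetical.

```yaml
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicy
metadata:
  name: require-cpu-requests      # hypothetical name
spec:
  failurePolicy: Fail
  matchConstraints:
    resourceRules:
      - apiGroups: ["apps"]
        apiVersions: ["v1"]
        operations: ["CREATE", "UPDATE"]
        resources: ["deployments"]
  validations:
    - expression: >-
        object.spec.template.spec.containers.all(c,
          has(c.resources) && has(c.resources.requests) &&
          'cpu' in c.resources.requests)
      message: "every container must set a CPU request"
---
apiVersion: admissionregistration.k8s.io/v1
kind: ValidatingAdmissionPolicyBinding
metadata:
  name: require-cpu-requests-binding
spec:
  policyName: require-cpu-requests
  validationActions: ["Deny"]     # reject at apply time instead of only warning
```

The policy does nothing until a binding references it, which makes it easy to roll out in a warn-only mode first and switch to Deny once teams have cleaned up.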

GPU Rightsizing and Spot

GPUs invert the cost model. On a CPU cluster, a few percent of waste is annoying; on a GPU cluster, a single under-used accelerator can dwarf the cost of the rest of the namespace. GPU rightsizing in Kubernetes is therefore less about shaving requests and more about sharing one physical device among many pods, because the default one-pod-one-GPU model leaves most of the silicon idle for inference and development workloads.

There are three sharing mechanisms, and they trade isolation against flexibility. Picking the right one per workload is the core decision.

Figure 3: Three ways to share one physical GPU — MIG, time-slicing, and MPS — and where each fits.

A single A100 or H100 can be carved by MIG into hardware-isolated partitions, shared loosely by time-slicing, or shared spatially by MPS. MIG suits multi-tenant inference where isolation matters; time-slicing suits dev and test where you just want more hands on the device; MPS suits cooperative HPC batch where co-located kernels can pack a card.

Fractional GPU Mechanisms

MIG (Multi-Instance GPU) physically partitions supported NVIDIA data-center GPUs into as many as seven instances, each with dedicated compute slices, dedicated memory, and dedicated memory bandwidth. Because the isolation is in hardware, one tenant cannot starve or crash another — a noisy inference pod in one MIG slice cannot touch the latency of a pod in the neighboring slice. The cost is rigidity: profiles are fixed sizes, you reconfigure by draining the node, and a workload that needs slightly more than a slice provides cannot borrow from an idle neighbor. See the NVIDIA MIG user guide for the exact per-GPU profile tables.
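From the pod's side, consuming a MIG slice is a matter of requesting the slice's resource name instead of a whole GPU. The sketch below assumes MIG is already enabled on the node and that the NVIDIA device plugin exposes slices under per-profile resource names (the mixed strategy); the 1g.5gb profile corresponds to an A100 40GB, and other GPUs expose different profile names. The image is hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: inference-small           # hypothetical name
spec:
  containers:
    - name: model-server
      image: registry.example.com/model-server:latest   # hypothetical image
      resources:
        limits:
          nvidia.com/mig-1g.5gb: 1   # one MIG slice, not the whole card
```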

Time-slicing lets many pods share a GPU by interleaving their work in time: the GPU context-switches between processes, and each one gets the whole device for its turn. There is no memory isolation and no performance isolation. Pods take turns, and one greedy process can monopolize cycles or exhaust memory and OOM its neighbors. It is trivial to enable through the NVIDIA device plugin and ideal for development, CI, and notebook fleets where you want many people on a card and occasional contention is acceptable.
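Enabling time-slicing is mostly a device plugin configuration change. The sketch below assumes the NVIDIA GPU Operator is installed in a gpu-operator namespace and reads its device plugin configuration from a ConfigMap; the ConfigMap name, the key any, and the replica count are illustrative choices.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: time-slicing-config       # hypothetical name
  namespace: gpu-operator         # assumed install namespace
data:
  any: |-
    version: v1
    sharing:
      timeSlicing:
        resources:
          - name: nvidia.com/gpu
            replicas: 4           # each physical GPU is advertised as 4 schedulable GPUs
```

Once the device plugin picks this up, a node with one physical GPU advertises four nvidia.com/gpu resources. Pods still request whole integers; they are simply integers of a shared device, with none of the isolation MIG provides.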

MPS (Multi-Process Service) runs multiple processes’ kernels concurrently on the GPU with soft, cooperative isolation. That gives better throughput than time-slicing for co-located batch and HPC jobs, but fault isolation is limited: a fatal GPU fault in one client can take down the other clients sharing the device. It is the middle ground: more spatial sharing than time-slicing, less protection than MIG.

| Mode | Isolation | Granularity | Best use case |
| --- | --- | --- | --- |
| MIG | Hardware (strong) | Up to 7 fixed partitions | Multi-tenant inference, SLA-bound serving |
| Time-slicing | None | Many pods per GPU | Dev, test, CI, notebooks |
| MPS | Soft (process) | Concurrent kernels | Cooperative HPC and batch jobs |

A practical pattern: MIG for production inference where tenants must not interfere, time-slicing for the development cluster where utilization matters more than isolation, and MPS for trusted internal batch pipelines that can cooperate. Savings figures here are illustrative — but moving an inference fleet from one-pod-one-GPU to seven MIG slices can, in principle, raise effective GPU utilization several-fold for workloads that genuinely fit a slice, which translates directly into fewer accelerators rented.

Spot GPUs and Observability

Spot is only one of two pricing levers, and they compose. The other is commitment-based discounting — Savings Plans and Reserved Instances on AWS, Committed Use Discounts on GCP, Reserved VM Instances on Azure — where you trade flexibility for a lower rate by committing to a baseline spend or capacity over one to three years. The right structure is layered: cover your stable, always-on baseline with commitments for the deepest sustained discount, and run your elastic, interruptible surplus on spot. Spot handles the spiky top of the curve; commitments handle the floor that never goes to zero. Karpenter’s capacity-type awareness helps here because it can prefer spot for burst while your committed on-demand baseline absorbs the rest. A cluster running everything on flat on-demand pricing is leaving both levers unused.

Spot and preemptible instances offer steep discounts, frequently in the range of 60–90% off on-demand (illustrative and provider-dependent), in exchange for the cloud reclaiming the capacity on short notice. For stateless, checkpointable, or replicated workloads, that trade is almost free money. The catch is interruption: you get a brief warning before the instance vanishes (two minutes on AWS; other providers can give as little as 30 seconds), and your system must drain gracefully in that window.

Figure 4: Handling a spot interruption. The cloud emits a reclaim notice, the node agent cordons the node and sends SIGTERM, the workload checkpoints, and the controller reschedules the pod onto remaining capacity before the node is reclaimed.
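On the workload side, the drain sequence in Figure 4 depends on the pod being given enough time to checkpoint and on something actually doing the checkpoint when termination starts. The Deployment below is a sketch of that shape. The image, the /app/checkpoint.sh script, and the 90-second grace period are hypothetical, and the node selector assumes Karpenter's capacity-type label.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-worker              # hypothetical name
spec:
  replicas: 3
  selector:
    matchLabels:
      app: batch-worker
  template:
    metadata:
      labels:
        app: batch-worker
    spec:
      nodeSelector:
        karpenter.sh/capacity-type: spot       # run only on spot capacity
      terminationGracePeriodSeconds: 90        # must fit inside the provider's notice window
      containers:
        - name: worker
          image: registry.example.com/batch-worker:latest   # hypothetical image
          lifecycle:
            preStop:
              exec:
                # Hypothetical script; runs before SIGTERM is delivered to the container.
                command: ["/bin/sh", "-c", "/app/checkpoint.sh"]
```

The grace period covers both the preStop hook and the time the process takes to exit after SIGTERM, so it should be set comfortably below the interruption notice rather than equal to it.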

That two-minute window is the whole game. A w
