GPU FinOps: Reducing Your $10M AI Compute Bill

Balys Kriksciunas · Tue Apr 14 2026 · 8 min read

#ai #infrastructure #finops #gpu #cost #compute #commitments #optimization

When GPU spend crosses $500k/month, informal cost discipline stops working. A FinOps playbook for large AI compute bills — attribution, commitments, workload placement, and the structural changes that matter.

GPU FinOps: Reducing Your $10M AI Compute Bill

An AI company’s GPU bill follows a predictable curve. First year: invisible. Second year: noticeable. Third year: dominant. At $10M/year of GPU spend, compute becomes the single largest line item after payroll — and the organization has to treat it with the same seriousness as headcount budgeting.

This post is the FinOps playbook for large GPU bills. Not the “tag your workloads” basics; the structural moves that shift millions of dollars.

The Levers, Ranked By Impact

From our work with shops at the $5M–$50M/year compute scale, the levers in rough order of dollar impact:

Commitment structure — on-demand vs reserved vs bare-metal
Workload right-sizing — matching workload to GPU generation
Utilization — getting above 70% sustained
Provider mix — neoclouds vs hyperscalers
Model right-sizing — not using GPT-4o for classification
Disaggregated serving — see our dedicated post
Quantization — FP8 is essentially free on H100
Caching — prompt caching, semantic caching
Model distillation — custom smaller models for your workload
Geographic optimization — region selection

We’ll focus on the top five. They represent 70–80% of the achievable savings.

Lever 1: Commitment Structure

On-demand GPU pricing is a tax you pay for flexibility. If your workload is sustained, commit.

Commitment	Discount vs on-demand
Spot / preemptible	30–60%
1-year reserved	30–50%
3-year reserved	50–65%
Dedicated / bare-metal multi-year	50–70%
Co-location + owned hardware	60–75% (incl. capex amortization)

At $10M/year of spend, moving from on-demand to 1-year reserved saves $3M–$5M. 3-year reserved saves more but locks you in when hardware generations change.

The right blend:

Baseline (predictable, 24/7): 60–70% on reserved or bare-metal
Peak (sustained but bursty): 20–30% on shorter reserved
Burst (spiky, unpredictable): 10–20% on on-demand
Speculative (research, experiments): spot / preemptible

This is hard to execute without baseline visibility. Which leads to lever 2.

Lever 2: Workload Right-Sizing

“All our inference is on H100” is a red flag. Different workloads want different hardware.

Typical workload placement we recommend:

Workload	GPU	Why
70B+ inference	B200 / MI300X	Memory, FP4/FP8 efficiency
7–13B inference at high QPS	H100 / L40S	Right-sized for throughput
7–13B inference at low QPS	L4 / A10	No need to pay for H100
Batch embedding	A100 / L40S	Cheap bandwidth
Training foundation models	B200 / H100 w/ InfiniBand	Network matters
Fine-tuning	H100 (or MI300X for memory)	Balanced FLOPS / memory
Dev / experimentation	A100 / L40S / spot	Cost matters, perf doesn’t

Moving classification workloads off H100 and onto L4 easily cuts that workload’s cost 4–5x. Across a $10M bill, right-sizing typically delivers 15–25% total savings.

Lever 3: Utilization

An H100 sitting at 35% utilization costs the same as one at 90%. The dirty secret of many AI fleets is sub-50% sustained utilization on expensive hardware.

What drives low utilization:

Over-provisioned for peak. You spun up for Black Friday traffic; it’s now February.
Batching inefficiency. Many models, few requests per model.
Dev / experimentation on production hardware.
Bad autoscaling. Scale-up is fast; scale-down is cautious.
Idle workloads. Notebooks left running, forgotten replicas, dev environments.

Structural fixes:

Multi-tenancy within a replica. Multi-LoRA serving to put many customers on one base model (see LoRA, QLoRA, and PEFT).
Aggressive idle termination. Notebooks auto-shut after N hours. Replicas with no traffic drop to zero.
Workload consolidation. Fewer, fuller replicas vs. many sparse ones.
Right-sized autoscale targets. Target 70–80% average utilization, not 30%.

Each 10-point utilization improvement on a large fleet is a meaningful dollar amount. Going from 40% to 70% on a $10M fleet is effectively $4M in savings.

Lever 4: Provider Mix

Hyperscalers charge a premium for integrated services. Neoclouds undercut on GPU specifically. See Multi-Cloud GPU Strategy.

Typical discount (neocloud vs hyperscaler on-demand):

25–45% lower on H100
20–35% lower on B200
30–40% on MI300X (where available)

At scale, the delta is worth real engineering investment to run multi-cloud.

Pattern:

Bulk reserved on neocloud (CoreWeave, Lambda, Crusoe): biggest cost line
Hyperscaler for compliance-sensitive workloads: AWS / Azure / GCP
Burst on the cheapest on-demand available

For a $10M/year bill, shifting from pure hyperscaler to a mixed deployment with neocloud reserved as the baseline typically saves $2M–$3M.

Lever 5: Model Right-Sizing

A team with only one model in their stack is always spending too much. Different queries merit different models.

Current 2026 price ladder:

Tier	Model	Rough cost ($/M input)
Frontier	GPT-4o, Claude Opus 4	$15–$25
Fast frontier	Claude Sonnet 4, Gemini 2.5 Pro	$3–$5
Mid tier	GPT-4o-mini, Claude Haiku 4.5	$0.15–$0.50
Cheap tier	Llama-3.3-70B hosted	$0.30–$0.50
Ultra cheap	Llama-3.2-8B hosted	$0.05–$0.10
Your own	Self-hosted 70B FP8	~$0.15–$0.40 (blended)

A task-appropriate ladder lets you run a classification at 1/50th the cost of its GPT-4o version. Router logic costs a little complexity; the savings compound. Model selection is one piece of a larger cost puzzle — infrastructure, human oversight, and governance often exceed raw inference spend, as we’ve documented in our enterprise AI agent TCO breakdown.

See AI FinOps: Tracking Token Spend for the attribution side and LLM Gateway Patterns for the routing side.

Attribution: The Foundation

You cannot cut what you don’t see. Every large AI shop ends up with the same three-layer attribution:

Layer 1: Per-workload

Which service, which team, which feature. Implemented via:

Tags on cloud resources
Virtual keys at the LLM gateway
Service labels in Kubernetes

Layer 2: Per-customer (for SaaS)

Which end customer is generating which spend. Implemented via:

Session tracking through the gateway
Usage events to analytics
Per-customer invoice rollups

Layer 3: Per-business-outcome

Cost per resolved support ticket. Cost per generated document. Cost per qualified lead. This is where FinOps becomes a strategic conversation.

Financial Structures Worth Knowing

As spend grows past a threshold, you gain access to commercial options:

Reserved capacity contracts

1-year or 3-year capacity at fixed pricing. Best for baseline load.

Enterprise agreements

Annual or multi-year commitments across multiple services. Hyperscalers will negotiate at $500k+/year.

Bare metal / dedicated racks

Lease racks in a colo. Buy or lease the GPUs. Eliminates cloud margin; adds ops overhead. Worth it above ~1000 GPUs sustained.

Owned infrastructure

Capex your own cluster. Best long-term economics; massive capex commitment; ops team required.

Most shops at $10M/year compute spend should have at least reserved capacity and an enterprise agreement. Bare metal is worth evaluating above $20M/year.

Organizational Structure

FinOps above $5M/year needs more than a dashboard. It needs an operating model:

Dedicated FinOps role — one person minimum owns GPU cost as their primary metric
Monthly cost review — engineering leadership + finance; standing agenda
Quarterly forecasting — projected spend vs budget with assumptions
Architecture council — major new workloads get cost-reviewed before build
Budgets with teeth — team budgets that, if exceeded, trigger conversations

The human structure matters more than any tool. Teams with “a FinOps dashboard” save little. Teams with “Tomás reviews the GPU bill weekly and escalates deviations” save a lot.

The Common Traps

1. Saving 10% by optimizing the wrong line item. Focus on the top 3 workloads. Everything else is noise.

2. Ignoring capacity commits when scaling down. You committed to 100 H100 for 3 years; you now need 50. Either sublease, stay committed, or eat the loss.

3. Hyperscaler credits mask bad habits. “AWS gave us $5M in credits” → team doesn’t optimize → credits run out → cliff.

4. Chasing every new GPU generation. B200 cost/perf is real, but upgrading mid-reservation just costs money.

5. Under-investing in observability. You cannot manage what you cannot measure. Spend is a sacred cow until you have the data to question it.

The Roadmap From Chaos To Control

If your company’s GPU spend has crept above $500k/month without FinOps discipline:

Month 1: Baseline attribution. Who, what, how much. Accept the number is bigger than anyone thinks.

Month 2: Rate card negotiation with current providers. 10–20% gains from talking.

Month 3: Reserved commitments for baseline load. 20–30% on committed portion.

Month 4: Right-sizing. Audit workload → GPU match. 10–20% gains.

Month 5: Multi-cloud evaluation. Neoclouds vs hyperscalers for suitable workloads. 20–30% gains on shifted workloads.

Month 6: Utilization drive. Consolidate, autoscale aggressively, kill idle.

Month 7+: Structural levers — model right-sizing, disaggregation, distillation.

A team that executes this sequence typically cuts their compute bill 30–50% in the first year.

The Short Version

GPU FinOps at scale is a discipline, not a dashboard. The big wins come from:

Commitment structure (reserved + baseline)
Workload-to-GPU matching
Utilization above 70% sustained
Multi-provider mix
Model-tier routing

These aren’t glamorous. They’re boring. They’re also where the dollars are. Companies that invest here have 20–40% lower unit economics than those that don’t. At serious scale, that’s the difference between a profitable AI product and one that isn’t.

GPU FinOps: Reducing Your $10M AI Compute Bill

GPU FinOps: Reducing Your $10M AI Compute Bill

The Levers, Ranked By Impact

Lever 1: Commitment Structure

Lever 2: Workload Right-Sizing

Lever 3: Utilization

Lever 4: Provider Mix

Lever 5: Model Right-Sizing

Attribution: The Foundation

Layer 1: Per-workload

Layer 2: Per-customer (for SaaS)

Layer 3: Per-business-outcome

Financial Structures Worth Knowing

Reserved capacity contracts

Enterprise agreements

Bare metal / dedicated racks

Owned infrastructure

Organizational Structure

The Common Traps

The Roadmap From Chaos To Control

The Short Version

Further Reading

Related Posts

AI FinOps: Tracking Token Spend Across Your Org

The Inference Cost Paradox: Models Are Nearly Free, But Your Agent Bill Just Hit $100K

vLLM vs SGLang: Inference Engine Comparison 2026