TURION .AI

GPU FinOps: Reducing Your $10M AI Compute Bill

Balys Kriksciunas · 8 min read

When GPU spend crosses $500k/month, informal cost discipline stops working. A FinOps playbook for large AI compute bills — attribution, commitments, workload placement, and the structural changes that matter.

An AI company’s GPU bill follows a predictable curve. First year: invisible. Second year: noticeable. Third year: dominant. At $10M/year of GPU spend, compute becomes the single largest line item after payroll — and the organization has to treat it with the same seriousness as headcount budgeting.

This post is the FinOps playbook for large GPU bills. Not the “tag your workloads” basics; the structural moves that shift millions of dollars.


The Levers, Ranked By Impact

From our work with shops at the $5M–$50M/year compute scale, the levers in rough order of dollar impact:

  1. Commitment structure — on-demand vs reserved vs bare-metal
  2. Workload right-sizing — matching workload to GPU generation
  3. Utilization — getting above 70% sustained
  4. Provider mix — neoclouds vs hyperscalers
  5. Model right-sizing — not using GPT-4o for classification
  6. Disaggregated serving — see our dedicated post
  7. Quantization — FP8 is essentially free on H100
  8. Caching — prompt caching, semantic caching
  9. Model distillation — custom smaller models for your workload
  10. Geographic optimization — region selection

We’ll focus on the top five. They represent 70–80% of the achievable savings.


Lever 1: Commitment Structure

On-demand GPU pricing is a tax you pay for flexibility. If your workload is sustained, commit.

Commitment | Discount vs on-demand
Spot / preemptible | 30–60%
1-year reserved | 30–50%
3-year reserved | 50–65%
Dedicated / bare-metal multi-year | 50–70%
Co-location + owned hardware | 60–75% (incl. capex amortization)

If the whole $10M/year is on-demand today, moving it to 1-year reserved saves $3M–$5M. 3-year reserved saves more but locks you into current hardware while newer GPU generations ship.

The right blend:

  • Baseline (predictable, 24/7): 60–70% on reserved or bare-metal
  • Peak (sustained but bursty): 20–30% on shorter reserved
  • Burst (spiky, unpredictable): 10–20% on on-demand
  • Speculative (research, experiments): spot / preemptible

This is hard to execute without baseline visibility: you need to know which load is truly 24/7 before you commit to it (see Attribution below).
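
To make the blend concrete, here is a minimal sketch of the arithmetic. The discounts are simply the midpoints of the ranges in the table above, and the mix itself (including putting the whole baseline on 3-year reserved) is a hypothetical example, not a recommendation for any particular fleet.

```python
# Midpoint discounts vs on-demand, taken from the table above (illustrative only).
DISCOUNT = {
    "on_demand": 0.00,
    "spot": 0.45,          # midpoint of 30-60%
    "reserved_1y": 0.40,   # midpoint of 30-50%
    "reserved_3y": 0.575,  # midpoint of 50-65%
}


def blended_cost(on_demand_equivalent: float, mix: dict[str, float]) -> float:
    """Annual cost of a fleet that would cost `on_demand_equivalent` if bought
    entirely on-demand, given the share of capacity under each commitment type."""
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("mix shares must sum to 1.0")
    return sum(
        on_demand_equivalent * share * (1 - DISCOUNT[kind])
        for kind, share in mix.items()
    )


# Hypothetical blend: baseline on 3-year reserved, peak on 1-year reserved,
# burst on on-demand, speculative work on spot.
mix = {"reserved_3y": 0.65, "reserved_1y": 0.20, "on_demand": 0.10, "spot": 0.05}
cost = blended_cost(10_000_000, mix)
print(f"${cost:,.0f}")
```

Swapping in your own negotiated discounts and shares shows quickly how sensitive the total is to the size of the baseline tier.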


Lever 2: Workload Right-Sizing

“All our inference is on H100” is a red flag. Different workloads want different hardware.

Typical workload placement we recommend:

Workload | GPU | Why
70B+ inference | B200 / MI300X | Memory, FP4/FP8 efficiency
7–13B inference at high QPS | H100 / L40S | Right-sized for throughput
7–13B inference at low QPS | L4 / A10 | No need to pay for H100
Batch embedding | A100 / L40S | Cheap bandwidth
Training foundation models | B200 / H100 w/ InfiniBand | Network matters
Fine-tuning | H100 (or MI300X for memory) | Balanced FLOPS / memory
Dev / experimentation | A100 / L40S / spot | Cost matters, perf doesn’t

Moving classification workloads off H100 and onto L4 can cut that workload’s cost 4–5x. Across a $10M bill, right-sizing typically delivers 15–25% total savings.
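
One way to make the placement table enforceable is to encode it as a lookup that an architecture review or a deploy check can call. The sketch below is an illustration only: the function name, the workload labels, and the 50 QPS cut-off between "high" and "low" are assumptions, and any model under 70B is treated like the 7–13B rows because the table does not cover the sizes in between.

```python
def recommend_gpu(kind: str, params_b: float = 0.0, qps: float = 0.0,
                  high_qps_threshold: float = 50.0) -> str:
    """Return the GPU family the placement table suggests for a workload.

    `high_qps_threshold` is a hypothetical cut-off; tune it to your traffic.
    """
    if kind == "inference":
        if params_b >= 70:
            return "B200 / MI300X"
        # Assumption: everything below 70B follows the 7-13B rows of the table.
        if qps >= high_qps_threshold:
            return "H100 / L40S"
        return "L4 / A10"

    table = {
        "batch_embedding": "A100 / L40S",
        "pretraining": "B200 / H100 w/ InfiniBand",
        "fine_tuning": "H100 (or MI300X for memory)",
        "dev": "A100 / L40S / spot",
    }
    if kind not in table:
        raise ValueError(f"unknown workload kind: {kind}")
    return table[kind]


print(recommend_gpu("inference", params_b=8, qps=2))
```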


Lever 3: Utilization

An H100 sitting at 35% utilization costs the same as one at 90%. The dirty secret of many AI fleets is sub-50% sustained utilization on expensive hardware.

What drives low utilization:

  • Over-provisioned for peak. You spun up for Black Friday traffic; it’s now February.
  • Batching inefficiency. Many models, few requests per model.
  • Dev / experimentation on production hardware.
  • Bad autoscaling. Scale-up is fast; scale-down is cautious.
  • Idle workloads. Notebooks left running, forgotten replicas, dev environments.

Structural fixes:

  • Multi-tenancy within a replica. Multi-LoRA serving to put many customers on one base model (see LoRA, QLoRA, and PEFT).
  • Aggressive idle termination. Notebooks auto-shut after N hours. Replicas with no traffic drop to zero.
  • Workload consolidation. Fewer, fuller replicas vs. many sparse ones.
  • Right-sized autoscale targets. Target 70–80% average utilization, not 30%.
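
The idle-termination fix is the easiest of these to automate. A minimal sketch, assuming you can already obtain a last-activity timestamp per notebook or replica from your own platform (the resource names and the 4-hour default are hypothetical):

```python
from datetime import datetime, timedelta, timezone


def idle_resources(last_activity: dict[str, datetime], now: datetime,
                   max_idle: timedelta = timedelta(hours=4)) -> list[str]:
    """Names of resources whose last activity is older than `max_idle`."""
    return sorted(
        name for name, seen in last_activity.items() if now - seen > max_idle
    )


now = datetime(2026, 2, 1, 12, 0, tzinfo=timezone.utc)
last_activity = {
    "notebook-a": now - timedelta(hours=5),   # idle past the limit
    "notebook-b": now - timedelta(hours=1),   # recently used
    "replica-dev": now - timedelta(days=3),   # forgotten replica
}
to_stop = idle_resources(last_activity, now)
print(to_stop)
```

The function only selects candidates; actually stopping them is left to whatever scheduler or cloud API you run.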

Each 10-point utilization improvement on a large fleet is real money. Going from 40% to 70% means the same work fits on 40/70 ≈ 57% of the capacity; on a $10M fleet that is roughly $4M in savings, provided the freed capacity can actually be released.
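
The utilization arithmetic generalizes to any starting point. The sketch below holds useful work constant and solves for the spend needed at the higher utilization; it assumes the freed capacity is not locked into a commitment.

```python
def utilization_savings(annual_spend: float, current_util: float,
                        target_util: float) -> float:
    """Spend freed by serving the same work at a higher sustained utilization."""
    if not 0 < current_util <= target_util <= 1:
        raise ValueError("expected 0 < current_util <= target_util <= 1")
    # Useful work is proportional to spend * utilization; keep it constant.
    new_spend = annual_spend * current_util / target_util
    return annual_spend - new_spend


savings = utilization_savings(10_000_000, 0.40, 0.70)
print(f"${savings:,.0f}")  # prints $4,285,714
```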


Lever 4: Provider Mix

Hyperscalers charge a premium for integrated services. Neoclouds undercut on GPU specifically. See Multi-Cloud GPU Strategy.

Typical discount (neocloud vs hyperscaler on-demand):

  • 25–45% lower on H100
  • 20–35% lower on B200
  • 30–40% lower on MI300X (where available)

At scale, the delta is worth real engineering investment to run multi-cloud.

Pattern:

  • Bulk reserved on neocloud (CoreWeave, Lambda, Crusoe): biggest cost line
  • Hyperscaler for compliance-sensitive workloads: AWS / Azure / GCP
  • Burst on the cheapest on-demand available

For a $10M/year bill, shifting from pure hyperscaler to a mixed deployment with neocloud reserved as the baseline typically saves $2M–$3M.


Lever 5: Model Right-Sizing

A team with only one model in their stack is almost always spending too much. Different queries merit different models.

Current 2026 price ladder:

Tier | Model | Rough cost ($/M input tokens)
Frontier | GPT-4o, Claude Opus 4 | $15–$25
Fast frontier | Claude Sonnet 4, Gemini 2.5 Pro | $3–$5
Mid tier | GPT-4o-mini, Claude Haiku 4.5 | $0.15–$0.50
Cheap tier | Llama-3.3-70B hosted | $0.30–$0.50
Ultra cheap | Llama-3.1-8B hosted | $0.05–$0.10
Your own | Self-hosted 70B FP8 | ~$0.15–$0.40 (blended)

A task-appropriate ladder lets you run a classification task at roughly 1/50th the cost of running it on GPT-4o. Router logic costs a little complexity; the savings compound.
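
Router logic can start very small. The sketch below maps a task type to a tier and estimates input cost; the prices are midpoints of the ranges in the ladder above, while the task labels and the task-to-tier policy are hypothetical and would normally come from evals on your own traffic.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tier:
    name: str
    usd_per_m_input: float  # midpoint of the range in the price ladder


TIERS = {
    "frontier": Tier("frontier", 20.0),
    "fast_frontier": Tier("fast_frontier", 4.0),
    "mid": Tier("mid", 0.325),
    "ultra_cheap": Tier("ultra_cheap", 0.075),
}

# Hypothetical policy: which tier each kind of task is allowed to use.
ROUTES = {
    "classification": "ultra_cheap",
    "extraction": "mid",
    "summarization": "fast_frontier",
    "complex_reasoning": "frontier",
}


def route(task_type: str) -> Tier:
    """Pick a tier for a task; unknown task types fall back to frontier."""
    return TIERS[ROUTES.get(task_type, "frontier")]


def input_cost(task_type: str, input_tokens: int) -> float:
    """Estimated input-token cost in USD for one task type."""
    return route(task_type).usd_per_m_input * input_tokens / 1_000_000


print(route("classification").name, input_cost("classification", 2_000_000))
```

Falling back to the frontier tier for unknown tasks protects quality at the expense of cost; a cost-first shop might prefer to fail closed and force every new task type to be classified explicitly.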

See AI FinOps: Tracking Token Spend for the attribution side and LLM Gateway Patterns for the routing side.


Attribution: The Foundation

You cannot cut what you don’t see. Every large AI shop ends up with the same three-layer attribution:

Layer 1: Per-workload

Which service, which team, which feature. Implemented via:

  • Tags on cloud resources
  • Virtual keys at the LLM gateway
  • Service labels in Kubernetes

Layer 2: Per-customer (for SaaS)

Which end customer is generating which spend. Implemented via:

  • Session tracking through the gateway
  • Usage events to analytics
  • Per-customer invoice rollups

Layer 3: Per-business-outcome

Cost per resolved support ticket. Cost per generated document. Cost per qualified lead. This is where FinOps becomes a strategic conversation.
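
All three layers are rollups over the same usage events, just grouped differently. A minimal sketch, assuming each event already carries a team, a customer, a cost, and an outcome count (the field names and the sample data are hypothetical):

```python
from collections import defaultdict

# Hypothetical usage events, e.g. exported from gateway logs joined with billing.
events = [
    {"team": "support", "customer": "acme", "cost_usd": 120.0, "tickets_resolved": 300},
    {"team": "support", "customer": "globex", "cost_usd": 80.0, "tickets_resolved": 100},
    {"team": "search", "customer": "acme", "cost_usd": 50.0, "tickets_resolved": 0},
]


def rollup(events, key):
    """Sum cost_usd per distinct value of `key` (layer 1: team, layer 2: customer)."""
    totals = defaultdict(float)
    for event in events:
        totals[event[key]] += event["cost_usd"]
    return dict(totals)


def cost_per_outcome(events, team, outcome_field):
    """Layer 3: cost per business outcome for one team, or None if no outcomes."""
    rows = [e for e in events if e["team"] == team]
    outcomes = sum(e[outcome_field] for e in rows)
    if outcomes == 0:
        return None  # a ratio would be meaningless without any outcomes
    return sum(e["cost_usd"] for e in rows) / outcomes


by_team = rollup(events, "team")
by_customer = rollup(events, "customer")
per_ticket = cost_per_outcome(events, "support", "tickets_resolved")
print(by_team, by_customer, per_ticket)
```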


Financial Structures Worth Knowing

As spend grows past a threshold, you gain access to commercial options:

Reserved capacity contracts

1-year or 3-year capacity at fixed pricing. Best for baseline load.

Enterprise agreements

Annual or multi-year commitments across multiple services. Hyperscalers will negotiate at $500k+/year.

Bare metal / dedicated racks

Lease racks in a colo. Buy or lease the GPUs. Eliminates cloud margin; adds ops overhead. Worth it above ~1000 GPUs sustained.

Owned infrastructure

Capex your own cluster. Best long-term economics; massive capex commitment; ops team required.

Most shops at $10M/year compute spend should have at least reserved capacity and an enterprise agreement. Bare metal is worth evaluating above $20M/year.


Organizational Structure

FinOps above $5M/year needs more than a dashboard. It needs an operating model:

  • Dedicated FinOps role — one person minimum owns GPU cost as their primary metric
  • Monthly cost review — engineering leadership + finance; standing agenda
  • Quarterly forecasting — projected spend vs budget with assumptions
  • Architecture council — major new workloads get cost-reviewed before build
  • Budgets with teeth — team budgets that, if exceeded, trigger conversations

The human structure matters more than any tool. Teams with “a FinOps dashboard” save little. Teams with “Tomás reviews the GPU bill weekly and escalates deviations” save a lot.
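
A small script can feed that weekly review without replacing it. The sketch below flags teams whose spend exceeds budget by more than a tolerance, and also any team spending with no budget at all; the team names, figures, and 10% tolerance are hypothetical.

```python
def flag_deviations(actuals: dict[str, float], budgets: dict[str, float],
                    tolerance: float = 0.10) -> list[str]:
    """Teams whose spend exceeds budget by more than `tolerance`, in name order."""
    flagged = []
    for team, spent in sorted(actuals.items()):
        budget = budgets.get(team)
        # Spend with no budget at all is flagged as well.
        if budget is None or spent > budget * (1 + tolerance):
            flagged.append(team)
    return flagged


actuals = {"search": 105_000.0, "support": 130_000.0, "labs": 10_000.0}
budgets = {"search": 100_000.0, "support": 100_000.0}
print(flag_deviations(actuals, budgets))
```

The output is a list of conversations to have, which is the point: the script finds the deviation, a person owns the escalation.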


The Common Traps

1. Saving 10% by optimizing the wrong line item. Focus on the top 3 workloads. Everything else is noise.

2. Ignoring capacity commits when scaling down. You committed to 100 H100s for 3 years; you now need 50. Either sublease, stay committed, or eat the loss.

3. Hyperscaler credits mask bad habits. “AWS gave us $5M in credits” → team doesn’t optimize → credits run out → cliff.

4. Chasing every new GPU generation. The B200’s cost/perf gains are real, but upgrading mid-reservation means paying for the old commitment and the new hardware at the same time.

5. Under-investing in observability. You cannot manage what you cannot measure. Spend is a sacred cow until you have the data to question it.


The Roadmap From Chaos To Control

If your company’s GPU spend has crept above $500k/month without FinOps discipline:

Month 1: Baseline attribution. Who, what, how much. Accept the number is bigger than anyone thinks.

Month 2: Rate card negotiation with current providers. 10–20% gains from talking.

Month 3: Reserved commitments for baseline load. 20–30% on committed portion.

Month 4: Right-sizing. Audit workload → GPU match. 10–20% gains.

Month 5: Multi-cloud evaluation. Neoclouds vs hyperscalers for suitable workloads. 20–30% gains on shifted workloads.

Month 6: Utilization drive. Consolidate, autoscale aggressively, kill idle.

Month 7+: Structural levers — model right-sizing, disaggregation, distillation.

A team that executes this sequence typically cuts their compute bill 30–50% in the first year.


The Short Version

GPU FinOps at scale is a discipline, not a dashboard. The big wins come from:

  • Commitment structure (reserved capacity for baseline load)
  • Workload-to-GPU matching
  • Utilization above 70% sustained
  • Multi-provider mix
  • Model-tier routing

These aren’t glamorous. They’re boring. They’re also where the dollars are. Companies that invest here have 20–40% lower unit costs than those that don’t. At serious scale, that’s the difference between a profitable AI product and one that isn’t.


Compute bill growing faster than revenue? Let’s talk — we run FinOps engagements for AI-heavy companies at every scale.
