
On-Demand vs Spot GPUs: When to Use Each

DAIRX Team · 5 min read

Spot GPUs can cost 30–70% less than on-demand pricing, depending on the GPU. For a team spending $5,000/month on GPU compute, that's up to $3,500 in potential monthly savings. The tradeoff is simple: your instance can be terminated with little or no warning when demand rises. Whether that tradeoff makes sense depends entirely on what you're running.

How Spot Pricing Works

Spot instances — sometimes called preemptible or interruptible instances — represent unused GPU capacity that providers sell at a steep discount. When other customers need that capacity at full price, your spot instance gets terminated, typically with a 30-second to 2-minute warning depending on the provider.

The discounts are substantial across the market, though they vary by GPU class:

- A100 80GB: ~$2.00/hr on-demand → ~$0.85–$1.20/hr spot (40–60% off)
- RTX 4090: ~$0.55/hr on-demand → ~$0.18–$0.28/hr spot (50–70% off)
- H100 SXM: ~$3.50/hr on-demand → ~$1.80–$2.50/hr spot (30–50% off)

GPUs with higher general availability tend to offer deeper spot discounts.
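The percentage figures follow directly from the hourly rates; a quick sanity check (the rates are representative of the ranges above, not live quotes):

```python
def spot_discount(on_demand_hr: float, spot_hr: float) -> float:
    """Percent saved by running on spot instead of on-demand."""
    return round((1 - spot_hr / on_demand_hr) * 100, 1)

# Representative rates, matching the ranges in the text.
a100 = spot_discount(2.00, 0.85)   # A100 80GB, low end of the spot range
rtx  = spot_discount(0.55, 0.28)   # RTX 4090, high end of the spot range
h100 = spot_discount(3.50, 1.80)   # H100 SXM, low end of the spot range
```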

When Spot GPUs Make Sense

Spot instances work for workloads that are either short enough to likely complete before interruption, or resilient enough to survive termination without losing meaningful progress. Hyperparameter sweeps, batch inference jobs, and checkpointed training runs are typical examples.
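One way to make "resilient enough" concrete is to compare the expected spot cost, including re-billed rework after interruptions, against the on-demand price. A rough model, with all parameters as illustrative assumptions:

```python
def expected_spot_cost(spot_hr: float, runtime_hr: float,
                       interruptions: float, rework_hr: float) -> float:
    """Expected total spot cost: base runtime plus redone work.

    interruptions: expected number of preemptions over the run
    rework_hr: hours of lost progress re-billed per preemption
    """
    return spot_hr * (runtime_hr + interruptions * rework_hr)

# Assumed example: a 48 h run at $1.00/hr spot vs $2.00/hr on-demand,
# expecting 4 preemptions that each cost 0.5 h of redone work.
spot_cost = expected_spot_cost(1.00, 48, 4, 0.5)   # 50.0
on_demand_cost = 2.00 * 48                         # 96.0
```

Even with generous rework assumptions, spot wins here; the model only flips toward on-demand when rework per preemption grows large, which is exactly what checkpointing prevents.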

When You Need On-Demand

Some workloads cannot tolerate interruption at any discount. Production inference serving is the clearest case: a preempted instance means user-facing downtime, and no hourly savings cover that. The math simply doesn't work out.

Making Spot Work: Checkpointing

The key to using spot GPUs effectively is making your workloads interruption-resilient. This is primarily a checkpointing problem.

The goal isn't to avoid all interruptions — it's to make each one cost a small, bounded amount of progress. When your checkpoint interval is 30 minutes and your training run is 48 hours, each interruption costs at most about 1% of total compute time (half that on average) while saving you 60%+ on the hourly rate.
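A minimal sketch of interval checkpointing with resume. Step counts stand in for wall-clock intervals, and the file name, interval, and counter state are all illustrative; a real training loop would save model and optimizer state instead:

```python
import os
import pickle

CKPT = "train_state.pkl"    # illustrative checkpoint path
CHECKPOINT_EVERY = 100      # steps between saves (~30 min in practice)

def load_checkpoint():
    """Resume from the last saved state, or start fresh."""
    if os.path.exists(CKPT):
        with open(CKPT, "rb") as f:
            return pickle.load(f)
    return {"step": 0, "loss_sum": 0.0}

def save_checkpoint(state):
    # Write to a temp file then rename, so a preemption mid-write
    # never leaves a corrupt checkpoint behind.
    tmp = CKPT + ".tmp"
    with open(tmp, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp, CKPT)

def train(total_steps):
    state = load_checkpoint()
    while state["step"] < total_steps:
        state["step"] += 1
        state["loss_sum"] += 1.0 / state["step"]  # placeholder for real work
        if state["step"] % CHECKPOINT_EVERY == 0:
            save_checkpoint(state)
    return state
```

If the instance is reclaimed, at most CHECKPOINT_EVERY steps of work are lost; simply rerunning the same script picks up from the last save.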

Spot Instances · Cost Optimization · ML Training