
How to Choose the Right GPU for ML Training

DAIRX Team · 6 min read

Not every workload needs an H100. This is the most expensive misconception in ML infrastructure. We regularly see teams provisioning NVIDIA's flagship datacenter GPUs for workloads that would run equally well — sometimes better — on hardware costing 60% less. The GPU you choose should match your actual computational requirements, not your aspirations.

Understanding GPU Tiers

Modern GPUs available for cloud rental fall into roughly four tiers, each with distinct strengths. Consumer-grade cards like the RTX 4090 offer 24GB VRAM with exceptional single-precision performance at $0.30–$0.80/hr — ideal for inference, fine-tuning, and prototyping. Professional GPUs like the A6000 and L40S provide 48GB VRAM with ECC memory at $0.70–$1.50/hr, well-suited for medium model training and production inference.

Datacenter GPUs like the A100 (40GB or 80GB variants) are purpose-built for ML with excellent tensor core performance and NVLink for multi-GPU scaling, running $1.10–$3.50/hr. And flagship H100 SXM cards with 80GB HBM3 and 3TB/s bandwidth offer peak performance at $2.50–$5.00/hr — but only provide meaningful advantage for large-scale training of 70B+ parameter models.
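For back-of-envelope planning, the tiers above can be encoded as a small lookup table. A minimal sketch using the indicative price ranges quoted here (the `cheapest_tier` helper is illustrative, and real spot prices vary by provider and region):

```python
# Tier table built from the indicative specs above; prices are
# illustrative low/high hourly rates, not quotes from any provider.
GPU_TIERS = [
    # (name, vram_gb, low $/hr, high $/hr)
    ("RTX 4090", 24, 0.30, 0.80),
    ("A6000 / L40S", 48, 0.70, 1.50),
    ("A100 80GB", 80, 1.10, 3.50),
    ("H100 SXM", 80, 2.50, 5.00),
]

def cheapest_tier(vram_needed_gb: float) -> str:
    """Return the lowest-cost tier (by low-end hourly price) with enough VRAM."""
    candidates = [t for t in GPU_TIERS if t[1] >= vram_needed_gb]
    if not candidates:
        raise ValueError("no single GPU fits; consider multi-GPU sharding")
    return min(candidates, key=lambda t: t[2])[0]
```

A 14GB inference workload resolves to the RTX 4090 tier, while a 60GB training job lands on the A100 80GB, not the H100, since both have the VRAM but the A100 is cheaper.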

Matching Workloads to Hardware

Fine-tuning a 7B parameter model with LoRA or QLoRA? A single RTX 4090 or A6000 handles this efficiently. QLoRA was specifically designed to run on consumer hardware. Renting an H100 for LoRA fine-tuning is like hiring a freight truck to deliver a letter.
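As a rough sanity check, you can estimate whether a QLoRA run fits a 24GB card. A sketch under loose assumptions: the fixed 6GB allowance for LoRA adapters, adapter optimizer state, activations, and CUDA overhead is a guess for illustration, not a measured figure.

```python
def qlora_vram_gb(params_billions: float, overhead_gb: float = 6.0) -> float:
    """Very rough QLoRA footprint: 4-bit base weights (0.5 bytes/param)
    plus a fixed allowance for adapters, optimizer state on the adapters,
    activations, and runtime overhead. Illustrative only."""
    base_weights = params_billions * 1e9 * 0.5 / 1024**3  # GiB of quantized weights
    return base_weights + overhead_gb
```

A 7B model comes out around 9GB total, comfortably inside a 24GB RTX 4090, which is the point: the H100 buys nothing here.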

Training a custom model from scratch in the 1B–13B parameter range? An A100 80GB or a pair of A6000s is the sweet spot. You need the VRAM headroom for optimizer states and activations, but you don't need the H100's Transformer Engine. For models in this size range, an H100 typically finishes training only 15–25% faster, while its hourly cost is often more than 60% higher.

Large-scale distributed training of 70B+ parameter models is where H100s justify their premium. The HBM3 bandwidth and improved NVLink become significant when scaling across 8+ GPUs with heavy inter-GPU communication. For multi-node training, the H100's networking capabilities make a measured difference.

For inference serving, expensive hardware is almost never required. A quantized model running on RTX 4090s often outperforms an unquantized model on A100s for serving throughput, at a fraction of the cost. Profile your actual inference workload before upgrading hardware.
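One way to make that comparison concrete is cost per generated token. A small helper, where the throughput figures in the usage comments are hypothetical placeholders, not benchmarks; measure your own workload:

```python
def usd_per_million_tokens(price_per_hr: float, tokens_per_sec: float) -> float:
    """Serving cost per 1M generated tokens at steady throughput."""
    return price_per_hr / (tokens_per_sec * 3600) * 1e6

# Hypothetical numbers for illustration only:
# a cheap card at $0.50/hr sustaining 100 tok/s with a quantized model
# costs ~$1.39 per million tokens; a $1.50/hr card would need to sustain
# 3x the throughput just to break even on cost per token.
```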

Memory Matters More Than You Think

The single most common bottleneck in model training isn't compute — it's VRAM. Running out of GPU memory during a training run means either reducing batch size (slower convergence) or switching to a larger GPU (higher cost). Estimating VRAM requirements upfront saves time and money.
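A common rule of thumb for mixed-precision training with Adam is about 16 bytes of model state per parameter, before activations. A sketch:

```python
def training_vram_gb(params_billions: float, bytes_per_param: int = 16) -> float:
    """Rule-of-thumb model-state footprint for mixed-precision Adam:
    fp16 weights (2) + fp16 grads (2) + fp32 master weights (4)
    + fp32 Adam moments (8) = 16 bytes/param. Activations are extra
    and depend on batch size and sequence length."""
    return params_billions * 1e9 * bytes_per_param / 1024**3
```

By this estimate, a full fine-tune of a 7B model needs over 100GB of model state alone, more than a single 80GB A100, which is exactly why LoRA, quantization, and optimizer sharding exist.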

For inference, VRAM requirements are roughly parameter count times bytes per parameter, plus overhead for the KV cache. A 7B model in FP16 needs about 14GB for weights; quantized to INT4, they fit in 3.5GB. Know your numbers before provisioning.

The Cost-Performance Sweet Spot

For most ML practitioners — startups training custom models, researchers running experiments, engineers building inference pipelines — the A100 80GB hits the optimal price-performance ratio in early 2026. It's battle-tested, widely available across providers, and priced competitively on independent GPU clouds.

Before provisioning any GPU, estimate your VRAM requirements, estimate your training duration, and compare the total job cost across available options. A GPU that's 20% faster but costs 80% more isn't a better deal — it's a worse one.
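That last comparison is simple arithmetic. An illustrative sketch with made-up prices and hours:

```python
def job_cost(price_per_hr: float, baseline_hours: float, speedup: float = 1.0) -> float:
    """Total cost of a job that takes baseline_hours on the reference GPU,
    run on a GPU that is `speedup` times faster."""
    return price_per_hr * baseline_hours / speedup

# Illustrative: a 100-hour job on a $1.50/hr reference GPU vs. a GPU
# that is 20% faster but 80% more expensive per hour.
baseline = job_cost(1.50, 100)             # reference total
premium = job_cost(1.50 * 1.8, 100, 1.2)   # faster but pricier
assert premium > baseline  # the "faster" option costs more end to end
```

The speedup divides the hours while the price multiplies them, so a 1.2x speedup cannot offset a 1.8x price, and the total bill is 1.5x the baseline.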

Machine Learning · GPU Selection · Infrastructure