GPU VRAM vs Memory Bandwidth: What Actually Matters for AI and LLM Deployments

When AI teams ask "which GPU should we deploy?" they are really asking two questions at once: will the model fit, and how fast will it run? Those map almost one-to-one onto two specs on the data sheet — VRAM and memory bandwidth — and the gap between them across NVIDIA’s lineup is wider than the marketing copy makes obvious.

A 24 GB RTX 4090 and an 80 GB H100 SXM are both "datacenter-capable" GPUs, but they sit on completely different parts of the curve. The 4090 has more VRAM than a 16 GB V100 and FP16 tensor throughput in the same range as an A100, yet in single-user inference it generates tokens at a fraction of the H100’s speed on the same model, because memory bandwidth, not compute, is what gates token generation.

The chart below plots ~50 NVIDIA GPUs — from the RTX 3050 up through the GB300 superchip — on log-log axes of VRAM (GB) versus memory bandwidth (GB/s). It is the single clearest way we have found to look at GPU selection for AI workloads.

Figure: NVIDIA GPU VRAM vs memory bandwidth across NVIDIA consumer, workstation, and datacenter GPUs. Legend: RTX 30 (Ampere), RTX 40 (Ada), RTX 50 (Blackwell), Workstation Pro, Datacenter.

Data table columns: GPU, Class, Architecture, VRAM (GB), Memory Bandwidth (GB/s).

The shape of the chart matters. Consumer cards cluster in the lower-left. Workstation Pro parts sit mid-axis with lots of VRAM but modest bandwidth. The datacenter cluster runs along a diagonal — when NVIDIA moves a generation, HBM-class GPUs buy both axes at once, which is why the gap between an L4 and a B200 is two log-scale tick-marks in each direction.

Why VRAM matters: capacity is binary

VRAM is the brutal, on/off constraint. Either the model fits, or it does not. There is no "almost fits" — when you exceed device memory, you spill to system RAM (10–30× slower) or shard across multiple GPUs and pay a non-trivial interconnect tax.

For a generative AI workload, three things compete for VRAM:

  • Model weights, sized by parameter count and quantization. FP16/BF16 ≈ 2 bytes per parameter, FP8 ≈ 1 byte, INT4 ≈ 0.5 bytes. A 70B model is ~140 GB at FP16, ~70 GB at FP8, ~35 GB at INT4.
  • KV cache, the attention key/value tensors retained for every token in the context window. KV cache scales with context length × batch size × number of attention layers. This is the silent killer — an 80 GB H100 can hold a 70B model in FP8 and still OOM on a 100K-context multi-tenant workload because the cache eats the remaining headroom.
  • Activations and optimizer state during training. Adam-class optimizers carry ~4× the weights themselves (in FP32 master copies plus moments). You do not train a 70B model on an 80 GB card — you shard it across eight.
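
The weight and KV-cache arithmetic above can be sketched in a few lines. This is a rough estimate, not a profiler: the layer count, KV-head count, and head dimension are assumptions modeled on a Llama-style 70B architecture with grouped-query attention, the cache is assumed to be stored in FP16, and the function names are illustrative.

```python
# Bytes per parameter for the precisions discussed above.
BYTES_PER_PARAM = {"fp16": 2.0, "fp8": 1.0, "int4": 0.5}


def weights_gb(params_billion: float, precision: str) -> float:
    """Weight footprint in GB (1 GB = 1e9 bytes)."""
    return params_billion * BYTES_PER_PARAM[precision]


def kv_cache_gb(tokens: int, batch: int, layers: int, kv_heads: int,
                head_dim: int, bytes_per_value: float = 2.0) -> float:
    """KV cache in GB: one key and one value vector per layer, per KV head, per token."""
    per_token_bytes = 2 * layers * kv_heads * head_dim * bytes_per_value
    return per_token_bytes * tokens * batch / 1e9


# Assumed architecture (hypothetical, Llama-style 70B with grouped-query attention).
LAYERS, KV_HEADS, HEAD_DIM = 80, 8, 128

weights = weights_gb(70, "fp8")
cache = kv_cache_gb(100_000, 1, LAYERS, KV_HEADS, HEAD_DIM)  # one 100K-token sequence
print(f"weights {weights:.1f} GB + KV cache {cache:.1f} GB = {weights + cache:.1f} GB")
```

Under these assumptions a single 100K-token sequence adds roughly 33 GB of cache on top of 70 GB of FP8 weights, which is already past an 80 GB card before any concurrency is added.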

A practical way to read the chart: the x-axis tells you which models you can run on a single card, period. Anything to the right of 24 GB starts to cover serious 13–34B inference workloads. 80 GB is the threshold for one-card 70B serving with reasonable context. 141 GB and above (H200, B200, B300) is where 100B+ class single-card inference and longer contexts become comfortable.

Why memory bandwidth matters: speed is continuous

Bandwidth is the spec that decides how fast a model that does fit will actually run. For LLM inference, it is more important than any compute number on the data sheet — and the reason is straightforward.

When an LLM generates a token, it has to read every weight that participates in producing that token. At batch size 1 (a single-user chatbot or coding assistant), the GPU’s tensor cores are starved: they finish the math in microseconds and then sit idle waiting for the next chunk of weights to arrive over the memory bus. The whole loop is gated by how many bytes per second the memory subsystem can deliver.

That gives a useful first-order estimate:

tok/s ≈ memory bandwidth ÷ active model bytes

A 70B model at FP8 occupies ~70 GB. At the RTX 4090’s ~1.0 TB/s, that puts a single-user theoretical ceiling around 14 tok/s (a bandwidth-only figure, since 70 GB does not fit in the 4090’s 24 GB and would have to be split across cards). On an H100 SXM at 3.35 TB/s, it is ~48 tok/s. On a B200 at 8.0 TB/s, it is ~114 tok/s. Real-world numbers will land lower (kernel overhead, KV cache reads, attention math), but the ratios hold, and that is why HBM dominates the top of the chart.
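
The estimate is simple enough to check by hand. A minimal sketch, using the bandwidth figures quoted above; the function name is illustrative, and the result is an upper bound rather than a benchmark:

```python
def decode_ceiling_tok_s(bandwidth_gb_s: float, active_model_gb: float) -> float:
    """First-order decode ceiling: every generated token streams all active weights once."""
    return bandwidth_gb_s / active_model_gb


MODEL_GB = 70.0  # 70B parameters at FP8, ~1 byte per parameter

# Bandwidth figures as quoted in the text, in GB/s.
gpus = {"RTX 4090": 1000.0, "H100 SXM": 3350.0, "B200": 8000.0}

for name, bandwidth in gpus.items():
    print(f"{name}: ~{decode_ceiling_tok_s(bandwidth, MODEL_GB):.0f} tok/s")
```

Swapping `MODEL_GB` for 35.0 (INT4) doubles every ceiling, which is why quantization helps speed as well as fit.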

If you want a concrete tokens-per-second number for your own GPU + model + quantization combination — including multi-GPU tensor parallelism over NVLink or PCIe — plug it into the OCOLO Token Speed Calculator. It is built around the bandwidth-bound decode model and lets you compare configurations side by side, including hourly/monthly power costs.

Two caveats keep the picture honest. Prefill (processing the prompt before generating the first token) is compute-bound, not bandwidth-bound, so FLOPs matter there. And batched serving — many users hitting the same model simultaneously — amortizes the weight reads across requests, shifting the bottleneck back toward compute. But the single-user decode regime is what most teams optimize for first, and that regime is bandwidth all the way down.
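
One way to see where batching shifts the bottleneck is to compare the time spent reading weights with the time spent on math for a single decode step. The sketch below is a simplified roofline-style estimate: it assumes about 2 FLOPs per parameter per token, ignores KV-cache traffic and attention math, and uses a peak-compute value (`peak_tflops`) that is a placeholder to replace with your GPU's data-sheet number.

```python
def decode_step_times(params_billion: float, bytes_per_param: float, batch: int,
                      bandwidth_gb_s: float, peak_tflops: float) -> tuple[float, float]:
    """Return (memory_seconds, compute_seconds) for one decode step over a batch."""
    weight_bytes = params_billion * 1e9 * bytes_per_param
    memory_s = weight_bytes / (bandwidth_gb_s * 1e9)   # weights are read once per step
    flops = 2 * params_billion * 1e9 * batch           # ~2 FLOPs per parameter per token
    compute_s = flops / (peak_tflops * 1e12)
    return memory_s, compute_s


def crossover_batch(bytes_per_param: float, bandwidth_gb_s: float, peak_tflops: float) -> float:
    """Batch size at which compute time catches up with weight-read time."""
    return (peak_tflops * 1e12) * bytes_per_param / (2 * bandwidth_gb_s * 1e9)


# Hypothetical GPU: 3,350 GB/s of bandwidth and an assumed 1,000 TFLOPS of usable compute.
mem_s, comp_s = decode_step_times(70, 1.0, batch=1, bandwidth_gb_s=3350, peak_tflops=1000)
print(f"batch 1: memory {mem_s * 1e3:.1f} ms, compute {comp_s * 1e3:.2f} ms")
print(f"compute-bound above batch ~{crossover_batch(1.0, 3350, 1000):.0f}")
```

With these placeholder numbers, the weight read takes about 21 ms at batch size 1 against about 0.14 ms of math, and the two only meet at a batch size of roughly 150.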

Reading the chart: five patterns worth noticing

  1. Consumer GPUs top out at 32 GB / 1.8 TB/s with the RTX 5090. That is enough to run quantized 30–40B models well, but you are out of room for serious training or multi-tenant 70B inference.
  2. The 4060 Ti 16GB and 5060 Ti 16GB are flat-bandwidth. Doubling VRAM did not increase bandwidth — same 288 GB/s and 448 GB/s respectively. Useful if you need a model to fit but do not need it to run fast.
  3. Workstation Ada (L40, L40S, RTX 6000 Ada) is the unsung sweet spot for self-hosting. 48 GB at ~0.9 TB/s on GDDR6 ECC means you fit a 70B model in INT4 or a 30B model in FP8 on a single card without paying HBM prices. (A 30B model in FP16 is ~60 GB of weights and does not fit in 48 GB.)
  4. Datacenter HBM parts climb both axes together. A100 40GB → A100 80GB → H100 SXM → H200 → B200 → B300 is a single trajectory: more VRAM and more bandwidth at every step. This is what an NVIDIA generation actually buys you.
  5. GB200 and GB300 are 2-GPU modules. The 384 GB / 16 TB/s and 576 GB / 16 TB/s points are Grace-Blackwell superchips, each pairing two Blackwell GPUs, not single GPUs. Useful as deployment units, not for apples-to-apples comparison with a B200.

What this means for picking colocation capacity

Once you have picked a GPU class, the spec sheet hands the next problem to the facility. The top of this chart is not just expensive in capex — it is genuinely hard to host.

  • Power per rack. A B200 SXM is ~1,000 W. An 8-GPU HGX node is ~10 kW or more including the host. A row of training pods can land at 30–50+ kW per rack, well outside the 5–10 kW per rack design point of most legacy colocation sites.
  • Cooling. B200, B300, and GB200/GB300 deployments are largely liquid-cooled at scale. Rear-door heat exchangers, direct-to-chip cold plates, or full immersion are no longer optional at the dense end of the chart. The decision is "does this facility support the cooling topology my hardware needs," not "does it have enough air."
  • Networking. 8 TB/s of on-package memory bandwidth only helps if you can feed data between nodes fast enough. Multi-node training or large-scale inference clusters need InfiniBand, NVIDIA Spectrum-X, or NVLink Switch fabrics, and those are facility-level decisions about cable plant, switch placement, and per-rack port counts.
  • Footprint and weight. Modern AI nodes are dense and heavy. Floor loading and rack rail spec actually matter again.
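
The power bullet reduces to simple division. A rough sketch, using the ~10 kW node figure from above; the function name and the whole-nodes-only rule are illustrative assumptions, and real facilities also reserve headroom for networking and redundancy:

```python
def nodes_per_rack(rack_budget_kw: float, node_kw: float = 10.0) -> int:
    """How many whole nodes fit within a rack's power budget."""
    return int(rack_budget_kw // node_kw)


for budget in (5, 10, 30, 50):
    print(f"{budget:>2} kW rack -> {nodes_per_rack(budget)} node(s) at ~10 kW each")
```

A 5 kW legacy rack cannot power even one such node, while a 30–50 kW rack holds three to five.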

If you are sizing a deployment in the H200/B200/GB300 class, the colocation question is not a detail to figure out at the end. It often shapes which generation of GPU you can deploy at all. Ocolo’s colocation provider directory for buyers is built around exactly these constraints — power density, cooling capability, and AI-ready infrastructure — so you can shortlist facilities that match the kit before you commit.

Frequently asked questions

Is VRAM or memory bandwidth more important for LLM inference?

Both, in that order. VRAM is binary — if the model does not fit on the GPU (after weights, KV cache, and activations), nothing else matters; you cannot run it on one card. Memory bandwidth is continuous — once it fits, bandwidth determines how many tokens per second you generate at batch size 1. For single-user inference, bandwidth is the single most important spec on the data sheet.

How much VRAM do I need to run a 70B-parameter model?

Roughly: ~140 GB at FP16/BF16, ~70 GB at FP8, ~35 GB at INT4 — for the weights alone. Add 5–20+ GB on top for the KV cache (more for long contexts and concurrent users), plus framework overhead. In practice that means a 70B model fits on a single 80 GB H100 in FP8, on a single 48 GB workstation card (L40S, RTX 6000 Ada) in INT4, or has to be sharded across two 24 GB consumer cards in INT4. For low-latency, multi-tenant 70B serving with long contexts, an H200 (141 GB) or B200 (192 GB) is the comfortable target.
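
As a rough check on those pairings, the sketch below computes how many cards of a given size are needed for 70B weights plus a cache-and-overhead allowance. The 8 GB allowance is an assumed value inside the 5–20+ GB range above, and even sharding with no per-card duplication is a simplification.

```python
import math

WEIGHTS_GB = {"fp16": 140.0, "fp8": 70.0, "int4": 35.0}  # 70B weights, from the figures above


def cards_needed(precision: str, vram_gb: float, overhead_gb: float = 8.0) -> int:
    """Minimum card count, assuming weights and overhead shard evenly across cards."""
    return math.ceil((WEIGHTS_GB[precision] + overhead_gb) / vram_gb)


print("FP8 on 80 GB cards: ", cards_needed("fp8", 80))
print("INT4 on 48 GB cards:", cards_needed("int4", 48))
print("INT4 on 24 GB cards:", cards_needed("int4", 24))
print("FP16 on 80 GB cards:", cards_needed("fp16", 80))
```

Raising `overhead_gb` toward the top of the range is a quick way to see when a long-context or multi-user workload pushes a one-card configuration to two.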

Why is the H200 so much faster than the H100 for inference?

Same Hopper compute architecture, more memory and more bandwidth. The H100 SXM has 80 GB of HBM3 at 3.35 TB/s. The H200 has 141 GB of HBM3e at 4.8 TB/s. For decode-bound LLM inference, that 1.43× bandwidth jump translates almost directly into a 1.4× tok/s improvement on the same model — and the extra 61 GB of VRAM lets you run larger models or longer contexts on a single card.

Does GDDR7 close the gap with HBM?

Partially. GDDR7 brought the top consumer and workstation Blackwell parts (RTX 5090, RTX Pro 6000) to ~1.8 TB/s, above an A100 40GB (1.55 TB/s) and well above the V100; the RTX 5080 sits lower, at just under 1 TB/s. But the top HBM parts have moved on: H200 at 4.8 TB/s, B200 at 8 TB/s, GB300 modules at 16 TB/s. GDDR7 is excellent for a workstation tier; it does not replace HBM at the datacenter tier.

Can I run large models on a consumer GPU like the RTX 5090?

Yes, with caveats. A 32 GB RTX 5090 at 1.79 TB/s comfortably runs ~13B models at FP16 and 30–40B models at INT4; a 70B model at INT4 (~35 GB of weights) does not fit on one card and has to be split across two. It is a serious local-inference card. What it is not is a multi-tenant production server: you have no ECC memory, no NVLink, no SXM form factor for dense racks, and the consumer driver stack is not licensed for many datacenter deployments. For research, prototyping, and single-user workloads it punches above its weight; for shared production AI, the workstation or datacenter tier earns its premium.

What does the OCOLO Token Speed Calculator show that the spec sheet doesn’t?

The data sheet gives you raw bandwidth in GB/s. The Token Speed Calculator translates that into tokens per second for your model at your quantization, accounts for KV cache memory at your context length, factors in multi-GPU tensor parallelism over NVLink or PCIe, and shows you the hourly, daily, and monthly power cost at your local electricity rate. It is the difference between knowing the GPU is "fast" and knowing it does 87 tok/s on Llama-3.1-70B at FP8 with a 16K context.

Should I colocate or use cloud GPUs?

Cloud GPUs are right for spiky workloads, experimentation, and teams without infrastructure capacity. Colocation tends to win when you have steady, predictable utilization (24/7 inference serving, ongoing training runs), when you already own or want to own the hardware, when data residency or networking topology is a constraint, or when monthly cloud bills cross the threshold where 18–36 months of colocation amortizes the gap. For AI deployments specifically, the constraint is rarely just dollars — it is finding facilities with the power density, cooling, and AI-grade networking to host modern HBM-class hardware.
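
The amortization threshold can be framed as a break-even calculation. Every number below is a made-up placeholder, not a quote or a market price; substitute your own hardware cost, colocation fee, and cloud bill.

```python
import math


def breakeven_months(hardware_capex: float, colo_monthly: float, cloud_monthly: float) -> float:
    """Months until owning hardware in colocation costs less than renting in the cloud."""
    monthly_saving = cloud_monthly - colo_monthly
    if monthly_saving <= 0:
        return math.inf  # cloud is cheaper every month; colocation never pays back
    return hardware_capex / monthly_saving


# Hypothetical inputs for illustration only.
months = breakeven_months(hardware_capex=300_000, colo_monthly=5_000, cloud_monthly=17_500)
print(f"break-even after {months:.0f} months")
```

This simple version ignores financing, resale value, and staff time, all of which move the break-even point in practice.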

Why are datacenter GPUs so much more expensive per GB of VRAM than consumer cards?

You are not paying for VRAM, you are paying for everything around it: HBM stacks (much costlier than GDDR), ECC memory, NVLink/NVSwitch interconnect, SXM form factors, datacenter-grade drivers and licensing, multi-instance GPU partitioning, validated multi-year reliability, and warranty/support tiers. On the chart, the diagonal climb of A100 → H100 → H200 → B200 is what those dollars buy: more bandwidth and more capacity and the platform features needed to run them in a multi-tenant production cluster.

Where to go next

If you are still narrowing the hardware decision, size your model on the OCOLO Token Speed Calculator before you commit to a SKU — bandwidth-bound tok/s estimates beat data-sheet specs every time.

When you are ready to find a facility that can actually host the kit at the right power density, cooling topology, and AI-ready networking, the Ocolo Buyers directory lists colocation providers that match those constraints.
