The question and answer may seem simple, but I have found that many experienced engineers fail to reason at this level. This was the point that helped me stand out in my NVIDIA interview.
I interviewed for this role:
Software Engineer — Data Center Performance
Location: US, CA, Santa Clara
TC range: 272,000 USD — 425,500 USD
The question I was asked by a senior manager during the interview was:
How would you estimate the memory consumption of an LLM? How does it impact the context length we can work with?
This was my detailed answer with back-of-the-envelope calculations which secured my prestigious NVIDIA offer in an instant:
Llama-70B model + AMD MI300X
(80 layers, 8192 hidden dimensions)
70B parameters at FP16 precision = 70B × 2 bytes = 140 GB
Activation memory ≈ 0.25 × 140 GB = 35 GB
KV Cache per Token = 2 × Precision × Layers × Hidden Dimension
KV cache per token = 2 × 2 bytes × 80 × 8192 ≈ 2.6 MB
Total KV Cache = KV Cache per Token × Context Length × Concurrent Requests
Maximum Context Length Calculation
AMD MI300X = 192GB capacity
Available for KV Cache = Total Memory - Model Memory - Activation Memory
192 GB - 140 GB - 35 GB = 17 GB available for KV cache
Maximum Context Length = Available KV Cache ÷ (KV Cache per Token × Concurrency)
For single request: 17 GB ÷ 2.6 MB = ~6,500 tokens
Using 8 GPUs = 8-way tensor parallelism
Model memory is divided among the 8 GPUs.
Model Parameters: 140 GB ÷ 8 = 17.5 GB per GPU
Activation Memory: 17.5 GB × 0.25 = 4.4 GB per GPU
Available KV Cache per GPU: 192 GB - 17.5 GB - 4.4 GB = 170.1 GB
Effective KV Cache Capacity: 170.1 GB × 8 ≈ 1,360 GB
This represents an 80x increase in KV cache capacity compared to a single GPU.
New Maximum Context Length: 1,360 GB ÷ 2.6 MB = ~523,000 tokens.
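For reference, here is the same back-of-the-envelope arithmetic as a short Python sketch. It assumes the figures above (Llama-70B, FP16, a 192 GB MI300X) and uses decimal GB/MB to match the article's arithmetic; treat it as a planning estimate, not a benchmark.

```python
# Back-of-the-envelope sketch of the numbers above (decimal GB/MB).
# Assumed figures: Llama-70B (80 layers, 8192 hidden dim, FP16) on a 192 GB MI300X.
GB = 1e9

params, bytes_per_param = 70e9, 2          # FP16 weights
layers, hidden_dim = 80, 8192
hbm = 192 * GB                             # MI300X capacity

model_mem = params * bytes_per_param                       # ~140 GB
activation_mem = 0.25 * model_mem                          # ~35 GB (rule of thumb)
kv_per_token = 2 * bytes_per_param * layers * hidden_dim   # ~2.6 MB (keys + values)

# Single GPU: whatever is left after weights and activations goes to KV cache.
kv_budget = hbm - model_mem - activation_mem               # ~17 GB
print(f"single GPU: {kv_budget / kv_per_token:,.0f} tokens max context")   # ~6,500

# 8-way tensor parallelism: weights and activations are sharded per GPU.
tp = 8
kv_budget_tp = tp * (hbm - model_mem / tp - 0.25 * model_mem / tp)          # ~1,360 GB
# ~519,000 tokens, i.e. the ~523K figure above within rounding of the 2.6 MB per-token value.
print(f"8-way TP:   {kv_budget_tp / kv_per_token:,.0f} tokens max context")
```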
Some explanation:
LLM memory consumption involves three core components:
- Number of model parameters (static for a given LLM architecture)
- Activation memory
- KV cache
Doing these calculations and digging deeper is essential today. For instance, in 2023, I had already worked with CNNs for over a year but thought the output of a CNN was just a single number (the predicted label). Most engineers I met had the same misconception. In reality, the output is a vector whose size equals the number of supported classes.
I have worked with LLMs for a while but only became aware of such projections recently, while reading a deep learning book (the DL1943 cheatsheet; I did not find other books covering these core topics).
1. Number of Parameters (Static)
Model Memory = Parameter Count × Precision
The precision varies by quantization approach:
- FP16: 2 bytes per parameter (standard for production)
- FP8: 1 byte per parameter (emerging standard)
- INT4: 0.5 bytes per parameter (aggressive quantization)
Example: Llama-70B with FP16 precision requires 70B × 2 bytes = 140 GB
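A minimal sketch of this formula, assuming only the parameter count and the bytes-per-parameter values listed above:

```python
# Weight memory = parameter count x bytes per parameter (decimal GB).
PRECISION_BYTES = {"FP16": 2.0, "FP8": 1.0, "INT4": 0.5}

def model_memory_gb(param_count: float, precision: str) -> float:
    return param_count * PRECISION_BYTES[precision] / 1e9

for precision in PRECISION_BYTES:
    print(f"70B @ {precision}: {model_memory_gb(70e9, precision):.0f} GB")
# FP16 -> 140 GB, FP8 -> 70 GB, INT4 -> 35 GB
```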
2. Activation Memory (Processing Buffer)
Activation memory stores intermediate computations during inference. Attention is theoretically quadratic in sequence length, O(N²), but modern optimizations such as FlashAttention bring the memory consumption down to O(N).
25% of model memory is a good practical buffer.
Activation Memory ≈ Model Memory × 0.25
Example: For the 70B model: 140 GB × 0.25 = 35 GB
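A one-function sketch of this heuristic; the 25% ratio is the practical buffer used here, and real activation footprints vary with batch size, sequence length, and attention implementation:

```python
# Activation buffer ~= 25% of model memory (practical heuristic, not an exact law).
def activation_memory_gb(model_memory_gb: float, ratio: float = 0.25) -> float:
    return model_memory_gb * ratio

print(activation_memory_gb(140.0))  # 35.0 GB for the 70B FP16 model
```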
3. KV Cache (Dynamic Constraint)
The key-value (KV) cache is the primary bottleneck for context length. It grows linearly with both context length and the number of concurrent requests.
KV Cache per Token = 2 × Precision × Layers × Hidden Dimension
Example: Llama-70B (80 layers, 8192 hidden dimension) at FP16:
2 × 2 bytes × 80 × 8192 = 2.6 MB per token
Total KV Cache = KV Cache per Token × Context Length × Concurrent Requests
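A sketch of the two KV-cache formulas, using the Llama-70B shape above; the four 8K-token requests in the usage example are an arbitrary illustration:

```python
# KV cache per token = 2 (keys + values) x precision x layers x hidden dim.
def kv_bytes_per_token(precision_bytes: int, layers: int, hidden_dim: int) -> int:
    return 2 * precision_bytes * layers * hidden_dim

def total_kv_bytes(per_token: int, context_length: int, concurrent_requests: int) -> int:
    return per_token * context_length * concurrent_requests

per_token = kv_bytes_per_token(2, 80, 8192)                  # Llama-70B @ FP16
print(f"{per_token / 1e6:.2f} MB per token")                 # ~2.6 MB
print(f"{total_kv_bytes(per_token, 8192, 4) / 1e9:.0f} GB")  # ~86 GB for four 8K-token requests
```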
Maximum Context Length Calculation
Using AMD MI300X with 192GB capacity as an example:
Available for KV Cache = Total Memory - Model Memory - Activation Memory
192 GB - 140 GB - 35 GB = 17 GB available for KV cache
Maximum Context Length = Available KV Cache ÷ (KV Cache per Token × Concurrency)
For single request: 17 GB ÷ 2.6 MB = ~6,500 tokens
This demonstrates a critical constraint: the model can handle either one 6,500-token request or roughly six concurrent 1,000-token requests.
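The same budget check as a sketch, with the single-GPU numbers above baked in as constants (the function name and constants are mine, chosen for illustration):

```python
# Single MI300X: subtract weights and activations, divide the rest by KV bytes per token.
HBM, MODEL_MEM, ACT_MEM = 192e9, 140e9, 35e9
KV_PER_TOKEN = 2 * 2 * 80 * 8192          # ~2.6 MB (Llama-70B @ FP16)

def max_context_tokens(concurrency: int) -> int:
    available = HBM - MODEL_MEM - ACT_MEM                # ~17 GB for KV cache
    return int(available / (KV_PER_TOKEN * concurrency))

print(max_context_tokens(1))   # ~6,500 tokens for one request
print(max_context_tokens(6))   # ~1,080 tokens each for six concurrent requests
```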
Tensor Parallelism Solution
Tensor parallelism addresses memory constraints by distributing models across multiple GPUs. Unlike simply having “more memory,” it fundamentally changes memory allocation patterns.
Memory Distribution with 8-Way Tensor Parallelism
Model Parameters: 140 GB ÷ 8 = 17.5 GB per GPU
Activation Memory: 17.5 GB × 0.25 = 4.4 GB per GPU
Available KV Cache per GPU: 192 GB - 17.5 GB - 4.4 GB = 170.1 GB
Effective KV Cache Capacity: 170.1 GB × 8 ≈ 1,360 GB
This represents an 80x increase in KV cache capacity compared to single-GPU deployment.
New Maximum Context Length: 1,360 GB ÷ 2.6 MB = ~523,000 tokens.
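A sketch of the 8-way split, assuming weights and activations shard evenly across GPUs and ignoring tensor-parallel communication buffers (which real deployments must also account for):

```python
# 8-way tensor parallelism: per-GPU static memory shrinks, KV-cache room is pooled.
TP, HBM = 8, 192e9
KV_PER_TOKEN = 2 * 2 * 80 * 8192

model_per_gpu = 140e9 / TP                       # 17.5 GB
act_per_gpu = 0.25 * model_per_gpu               # ~4.4 GB
kv_per_gpu = HBM - model_per_gpu - act_per_gpu   # ~170 GB
kv_total = kv_per_gpu * TP                       # ~1,360 GB pooled KV budget

print(f"{kv_total / 17e9:.0f}x the single-GPU KV-cache room")   # ~80x
# ~519,000 tokens, i.e. the ~523K figure above within rounding.
print(f"{kv_total / KV_PER_TOKEN:,.0f} max context tokens")
```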
Concurrency-Context Trade-off Options
With 1,360 GB available for KV cache, deployments can choose different configurations (see the sketch after this list):
- Long Context Mode: 64K tokens × 8 requests = 512K total tokens
- High Throughput Mode: 8K tokens × 64 requests = 512K total tokens
- Balanced Mode: 32K tokens × 16 requests = 512K total tokens
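Each mode spends the same pooled budget differently. Here is a quick sketch that checks whether a given (context length, concurrency) pair fits, using the mode names and token counts listed above:

```python
# Any (context length, concurrency) pair is valid as long as its KV cache fits the pool.
KV_BUDGET_GB = 1360
KV_PER_TOKEN = 2 * 2 * 80 * 8192          # bytes, Llama-70B @ FP16

MODES = {
    "long context":    (64_000, 8),
    "high throughput": (8_000, 64),
    "balanced":        (32_000, 16),
}

for name, (context_len, requests) in MODES.items():
    needed_gb = context_len * requests * KV_PER_TOKEN / 1e9
    fits = "fits" if needed_gb <= KV_BUDGET_GB else "over budget"
    print(f"{name:15s} {context_len:>6,} tok x {requests:>2} req -> {needed_gb:,.0f} GB ({fits})")
```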
Core concepts
- Context length depends on available memory, not model size — proper architecture enables long contexts even for large models.
- Memory components scale differently with parallelization — tensor parallelism dramatically reduces per-GPU costs for static components while creating shared KV cache capacity.
- Context length vs. concurrency is a deliberate trade-off — optimal configuration depends on specific workload patterns.
- Measurement is essential — these calculations provide planning frameworks, but real-world benchmarking with representative workloads remains crucial.