This content originally appeared on Level Up Coding – Medium and was authored by Natalia Trifonova

Curious which GPU to buy or rent for LLM inference? Builds with one, two, or four RTX 4090, RTX 5090, or RTX PRO 6000 cards are among the most affordable yet capable options. I ran a series of benchmarks across multiple GPU cloud servers to evaluate their performance for LLM workloads, specifically serving LLaMA and Qwen models with the vLLM inference engine. This article explains how I tested, what I measured, and what insights I gained by running these models on different GPU configurations.
What Was Measured?
LLM workloads are not just about raw FLOPS. When serving models in production — especially in interactive, multi-turn settings like chat — you care about:
- Download speed of Hugging Face models
- Token latency metrics like:
— TTFT (Time to First Token)
— TPOT (Time per Output Token)
— ITL (Inter-Token Latency)
— E2EL (End-to-End Latency)
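These metrics are closely related: for a request that generates N output tokens, E2EL is roughly TTFT + (N - 1) × TPOT, and TPOT is essentially the average inter-token latency. TTFT captures responsiveness, while TPOT and ITL capture sustained generation speed.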
What Was Tested?
I created a comprehensive benchmark script that automates the following steps:
1. Run System Benchmark: The script starts with YABS, a popular hardware test suite, to capture CPU, memory, disk, and network performance.
2. Download the Model: Simulate production readiness by downloading models from Hugging Face, measuring both the time taken and the average download speed.
3. Launch vLLM Container: Spin up a Docker container using vllm/vllm-openai:latest, bind-mounting the downloaded model directory. The container exposes an OpenAI-compatible API endpoint.
4. Run Inference Benchmark: Finally, it runs `benchmark_serving.py` inside the container to simulate multi-request, high-concurrency LLM usage with synthetic inputs. I use the Qwen/Qwen3-Coder-30B-A3B-Instruct model with tensor parallelism set to the number of GPUs on the machine. (A rough sketch of these steps follows this list.)
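For illustration, here is a minimal, hand-rolled sketch of steps 1-3. The real logic lives in the repository's scripts; the port, paths, and Docker flags below are my own assumptions, not the script's exact values.

# Step 1: system benchmark with YABS (CPU, memory, disk, network)
curl -sL https://yabs.sh | bash

# Step 2: download the model from Hugging Face and time it
MODEL=Qwen/Qwen3-Coder-30B-A3B-Instruct
time huggingface-cli download "$MODEL" --local-dir "./models/$MODEL"

# Step 3: launch the vLLM OpenAI-compatible server in Docker,
# bind-mounting the downloaded weights and using every GPU for tensor parallelism
docker run -d --gpus all --ipc=host -p 8000:8000 \
  -v "$PWD/models:/models" \
  vllm/vllm-openai:latest \
  --model "/models/$MODEL" \
  --tensor-parallel-size "$(nvidia-smi -L | wc -l)"

Step 4, the benchmark run itself, is shown below under the command parameters.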
What Models and Configs Were Used?
- Model: Qwen/Qwen3-Coder-30B-A3B-Instruct
- Serving: vLLM + OpenAI API-compatible interface
- Command Parameters:
— Input/output tokens: 1000
— Concurrency: 200
— Prompts: 1000
— Metrics: ttft,tpot,itl,e2el
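Put together, a benchmark invocation with these parameters looks roughly like this (benchmark_serving.py ships with vLLM; flag names vary somewhat between vLLM versions, so treat this as an approximation rather than the exact command the script runs):

# Synthetic, high-concurrency load against the OpenAI-compatible endpoint
python3 benchmark_serving.py \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 200 \
  --percentile-metrics ttft,tpot,itl,e2el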
Importance of Driver Versions
During our benchmarking with the NVIDIA RTX 5090 and RTX PRO 6000, we observed a huge performance discrepancy between driver versions:
With the older 570.86.15 driver, inference performance on the RTX 5090 was comparable to that of the RTX 4090. After upgrading to driver 575.57.08, we saw significant gains across all vLLM benchmarks.
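You can check which driver a machine is running before benchmarking with a single nvidia-smi query:

# Print the driver version and GPU name for each installed card
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader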
Hardware Tested
I ran the benchmark across several server configurations: 4 x RTX 4090, 4 x RTX 5090, 1 x RTX PRO 6000, and 2 x RTX PRO 6000. These setups are popular among self-hosting enthusiasts, so I wanted to find out which one is the most cost-efficient.
Results
Results of running vLLM with the Qwen3-Coder-30B-A3B-Instruct model on each of the machines above
Most providers we’ve worked with offer 10 Gbps internet and actually deliver that speed, with occasional drops due to network congestion. Download speed to distant servers is lower, so renting a server that is geographically far from your cloud storage is not the best idea.
Key Takeaways
- Model download speed can be a limiting factor if your storage or bandwidth is subpar. In some cases, setting HF_HUB_ENABLE_HF_TRANSFER=1 helped achieve significantly better download speeds from Hugging Face (see the snippet after this list).
- Token generation latency (especially TTFT) can vary even across servers with similar GPUs, due to backend configuration and memory bandwidth differences, as well as drivers and software versions.
- 4090s perform well for the cost, especially for smaller models like Qwen-3B or LLaMA-8B. However, for larger models or batch inference, the PRO 6000 is a clear winner. Even on the relatively small Qwen/Qwen3-Coder-30B-A3B-Instruct model used in this test, a single RTX PRO 6000 is faster than four 4090s or four 5090s. The prefill-decode disaggregation technique I described in the previous article can reduce the amount of data transferred over the PCIe bus, which is the primary performance bottleneck of low-VRAM GPUs when running larger models. In the vast majority of cases, though, the PRO 6000 will be the better option.
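As noted in the first takeaway, the hf_transfer backend can make Hugging Face downloads noticeably faster. A minimal example (the target directory is an arbitrary choice):

# Install the Rust-based accelerated download backend and enable it for one download
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir ./models/Qwen3-Coder-30B-A3B-Instruct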
How to Run This Yourself
Clone the repository:
git clone https://github.com/cloudrift-ai/server-benchmark.git
cd server-benchmark
Install dependencies:
./scripts/setup.sh
Run the benchmark:
./scripts/run_benchmarks.sh
GitHub Repository
You can find the code in the cloudrift-ai/server-benchmark repository on GitHub. The model and other parameters are easily customizable if you want to run it yourself. Feel free to let me know on Discord or in the comments which configuration or model you’d like to see benchmarked next!