This content originally appeared on Level Up Coding – Medium and was authored by Natalia Trifonova

Curious which GPU to buy or rent for LLM inference? Builds with one, two, or four RTX 4090, RTX 5090, or RTX PRO 6000 cards are among the most affordable yet capable options. I ran a series of benchmarks across multiple GPU cloud servers to evaluate their performance for LLM workloads, specifically serving LLaMA and Qwen models with the vLLM inference engine. This article explains how I tested, what I measured, and what insights I gained by running these models on different GPU configurations.
What Was Measured?
LLM workloads are not just about raw FLOPS. When serving models in production — especially in interactive, multi-turn settings like chat — you care about:
- Download speed of Hugging Face models
- Token latency metrics like:
— TTFT (Time to First Token)
— TPOT (Time per Output Token)
— ITL (Inter-Token Latency)
— E2EL (End-to-End Latency)
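These metrics are closely related: for a request that generates N output tokens, E2EL is roughly TTFT + (N - 1) × TPOT, and TPOT is essentially the average inter-token latency. TTFT captures responsiveness, while TPOT and ITL capture sustained generation speed.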
What Was Tested?
I created a comprehensive benchmark script that automates the following steps:
1. Run System Benchmark: The script starts with YABS, a popular hardware test suite, to capture CPU, memory, disk, and network performance.
2. Download the Model: Simulate production readiness by downloading models from Hugging Face, measuring both the time taken and the average download speed.
3. Launch vLLM Container: Spin up a Docker container using vllm/vllm-openai:latest, bind-mounting the downloaded model directory. The container exposes an OpenAI-compatible API endpoint.
4. Run Inference Benchmark: Finally, it runs `benchmark_serving.py` inside the container to simulate multi-request, high-concurrency LLM usage with synthetic inputs. I use the Qwen/Qwen3-Coder-30B-A3B-Instruct model with tensor parallelism set to the number of GPUs on the machine. (A rough sketch of these steps follows this list.)
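For illustration, here is a minimal, hand-rolled sketch of steps 1-3. The real logic lives in the repository's scripts; the port, paths, and Docker flags below are my own assumptions, not the script's exact values.

# Step 1: system benchmark with YABS (CPU, memory, disk, network)
curl -sL https://yabs.sh | bash

# Step 2: download the model from Hugging Face and time it
MODEL=Qwen/Qwen3-Coder-30B-A3B-Instruct
time huggingface-cli download "$MODEL" --local-dir "./models/$MODEL"

# Step 3: launch the vLLM OpenAI-compatible server in Docker,
# bind-mounting the downloaded weights and using every GPU for tensor parallelism
docker run -d --gpus all --ipc=host -p 8000:8000 \
  -v "$PWD/models:/models" \
  vllm/vllm-openai:latest \
  --model "/models/$MODEL" \
  --tensor-parallel-size "$(nvidia-smi -L | wc -l)"

Step 4, the benchmark run itself, is shown below under the command parameters.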
What Models and Configs Were Used?
- Model: Qwen/Qwen3-Coder-30B-A3B-Instruct
- Serving: vLLM + OpenAI API-compatible interface
- Command Parameters:
— Input/output tokens: 1000
— Concurrency: 200
— Prompts: 1000
— Metrics: ttft,tpot,itl,e2el
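Put together, a benchmark invocation with these parameters looks roughly like this (benchmark_serving.py ships with vLLM; flag names vary somewhat between vLLM versions, so treat this as an approximation rather than the exact command the script runs):

# Synthetic, high-concurrency load against the OpenAI-compatible endpoint
python3 benchmark_serving.py \
  --backend vllm \
  --base-url http://localhost:8000 \
  --model Qwen/Qwen3-Coder-30B-A3B-Instruct \
  --dataset-name random \
  --random-input-len 1000 \
  --random-output-len 1000 \
  --num-prompts 1000 \
  --max-concurrency 200 \
  --percentile-metrics ttft,tpot,itl,e2el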
Importance of Driver Versions
During our benchmarking with the NVIDIA RTX 5090 and RTX PRO 6000, we observed a huge performance discrepancy between driver versions:
With the older 570.86.15 driver, inference performance on the RTX 5090 was comparable to that of the RTX 4090. After upgrading to driver 575.57.08, we saw significant gains across all vLLM benchmarks.
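You can check which driver a machine is running before benchmarking with a single nvidia-smi query:

# Print the driver version and GPU name for each installed card
nvidia-smi --query-gpu=driver_version,name --format=csv,noheader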
Hardware Tested
I ran the benchmark across several server configurations: 4 x RTX 4090, 4 x RTX 5090, 1 x RTX PRO 6000, and 2 x RTX PRO 6000. These setups are popular among self-hosting enthusiasts, so I wanted to find out which one is the most cost-efficient.
Results
Results of running vLLM with the Qwen3-Coder-30B-A3B-Instruct model on each of the machines above
Most providers we’ve worked with offer 10 Gbps internet and actually deliver that speed, with occasional drops due to network congestion. Download speed to distant servers is lower, so renting a server that is geographically far from your cloud storage is not the best idea.
Key Takeaways
- Model download speed can be a limiting factor if your storage or bandwidth is subpar. In some cases, setting HF_HUB_ENABLE_HF_TRANSFER=1 helped achieve significantly better download speeds from Hugging Face (see the snippet after this list).
- Token generation latency (especially TTFT) can vary even across servers with similar GPUs, due to backend configuration and memory bandwidth differences, as well as drivers and software versions.
- 4090s perform well for the cost, especially for smaller models like Qwen-3B or LLaMA-8B. However, for larger models or batch inference, the PRO 6000 is a clear winner. Even on the relatively small Qwen/Qwen3-Coder-30B-A3B-Instruct model used in this test, a single RTX PRO 6000 is faster than four 4090s or four 5090s. The prefill-decode disaggregation technique I described in the previous article can reduce the amount of data transferred over the PCIe bus, which is the primary performance bottleneck of low-VRAM GPUs when running larger models. In the vast majority of cases, though, the PRO 6000 will be the better option.
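As noted in the first takeaway, the hf_transfer backend can make Hugging Face downloads noticeably faster. A minimal example (the target directory is an arbitrary choice):

# Install the Rust-based accelerated download backend and enable it for one download
pip install hf_transfer
HF_HUB_ENABLE_HF_TRANSFER=1 huggingface-cli download \
  Qwen/Qwen3-Coder-30B-A3B-Instruct --local-dir ./models/Qwen3-Coder-30B-A3B-Instruct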
How to Run This Yourself
Clone the repository:
git clone https://github.com/cloudrift-ai/server-benchmark.git
cd server-benchmark
Install dependencies:
./scripts/setup.sh
Run the benchmark:
./scripts/run_benchmarks.sh
GitHub Repository
You can find the code in the cloudrift-ai/server-benchmark repository on GitHub. The model and other parameters are easily customizable if you want to run it yourself. Feel free to let me know on Discord or in the comments which configuration or model you’d like to see benchmarked next!