
LLM Inference Benchmarking - genAI-perf and vLLM

After spending hours dealing with ChatGPT hallucinations, I finally had to do a Google search to find the right tool for LLM inference benchmarking. It turns out NVIDIA has done a great job creating a robust tool that can be used across different platforms, including Triton and OpenAI-compatible APIs.


LLM benchmarking can be confusing, as people often mix up LLM performance testing with benchmarking. Performance testing validates the overall capacity of your server infrastructure, including network latency, CPU performance, and other system-level throughputs. Benchmarking tools, on the other hand, primarily focus on LLM inference engine–specific parameters, which are critical if you are planning to run your own inference platform — something most enterprises are now focusing on.


This post is part of a series I will be writing as I go through the process of learning and experimenting with vLLM-based inference solutions, along with insights from real-world use cases of operating LLM inference platforms in enterprise environments.


Here are some of the most common inference use cases; a GenAI-Perf sketch after the table shows how these input/output shapes translate into benchmark parameters.


| Workload pattern | Typical examples | ISL vs OSL | GPU & KV cache impact | Batching & serving implications |
|---|---|---|---|---|
| Generation-heavy | Code generation, emails, content creation | Short input (~50–200) → long output (~800–1.5K) | KV cache grows over time during decoding; sustained GPU utilization during generation | Benefits from continuous batching; latency sensitive at high concurrency |
| Context-heavy (RAG / chat) | Summarization, multi-turn chat, retrieval-augmented QA | Long input (~1K–2K) → short output (~50–300) | Large upfront KV allocation; high memory footprint per request | Limits max concurrency; batching constrained by KV cache size |
| Balanced (translation / conversion) | Language translation, code refactoring | Input ≈ output (~600–1.8K each) | Stable KV usage throughout request lifecycle | Predictable batching; easier capacity planning |
| Reasoning-intensive | Math, puzzles, complex coding with CoT | Very short input (~50–150) → very long output (2K–10K+) | Explosive KV growth; long-lived sequences | Poor batch efficiency; throughput drops sharply at scale |
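
These ISL/OSL profiles map fairly directly onto GenAI-Perf's synthetic input/output flags (the same flags used later in this post). As a rough sketch, the run below approximates a generation-heavy pattern; the token counts are assumptions chosen to mimic that shape, not values from a real workload:

# Short prompt, long completion: roughly the generation-heavy row above
genai-perf profile \
  -m Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --streaming \
  --synthetic-input-tokens-mean 150 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 1000 \
  --output-tokens-stddev 0 \
  --request-count 50 \
  --warmup-request-count 10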


In this example, we will set up inference and benchmarking on a single node for experimentation purposes. In a production setup, the benchmarking tool should run from a separate node.


vLLM Benchmarking Setup

For decent benchmarking, you need the following to get started:


  • NVIDIA GPU–powered compute platform. This can be your desktop, or you can use any of the Neo Cloud providers. My obvious preference is Denvr Cloud. Feel free to sign up — https://www.denvr.com/

  • Hugging Face login. Sign up for a free Hugging Face account. You’ll need it to download models and access gated models such as Meta Llama and others.

  • LLM-labs repo. https://github.com/kchandan/llm-labs


Step-by-step guide


To install the necessary packages on the Linux VM (e.g., NVIDIA drivers, Docker, etc.), the easiest approach is to update the IP address in the Ansible inventory file and then let the playbook handle the full installation.


cat llmops/ansible/inventory/hosts.ini
; [vllm_server]
; server_name ansible_user=ubuntu
[llm_workers]
<IP Address> ansible_user=ubuntu ansible_ssh_private_key_file=~/.ssh/<your_key_file>
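
Before running the playbook, a quick ad-hoc ping confirms that Ansible can reach the worker over SSH (the llm_workers group name comes from the inventory above):

# Verify SSH connectivity to the hosts in the llm_workers group
ansible -i ansible/inventory/hosts.ini llm_workers -m ping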

Once the IP address is updated, run the Ansible playbook to install the required packages:


ansible-playbook -i ansible/inventory/hosts.ini ansible/setup_worker.yml

After installation, verify that the NVIDIA driver looks good:


ubuntu@llmops:~/llm-labs$ nvidia-smi
Sun Jan 11 21:53:01 2026
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 590.48.01              Driver Version: 590.48.01      CUDA Version: 13.1     |
+-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|=========================================+========================+======================|
|   0  NVIDIA A100-SXM4-40GB          Off |   00000000:0A:00.0 Off |                    0 |
| N/A   47C    P0             50W /  400W |       0MiB /  40960MiB |      0%      Default |
|                                         |                        |             Disabled |
+-----------------------------------------+------------------------+----------------------+

+-----------------------------------------------------------------------------------------+
| Processes:                                                                              |
|  GPU   GI   CI              PID   Type   Process name                        GPU Memory |
|        ID   ID                                                               Usage      |
|=========================================================================================|
|  No running processes found                                                             |
+-----------------------------------------------------------------------------------------+

Create a common Docker bridge network (default bridge driver) so that all containers can talk to each other:

docker network create llmops-net
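
To confirm the network was created, and later to see which containers have joined it, you can inspect it:

# Check that the bridge network exists and list its attached containers
docker network ls | grep llmops-net
docker network inspect llmops-net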

Export your Hugging Face token (needed for downloading gated models):

export HF_TOKEN=<your_hf_token>
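
Optionally, you can confirm the token is valid before the model download starts. This is a sketch of an extra step, assuming the huggingface_hub CLI is installed via pip and picks up the HF_TOKEN environment variable:

# Optional: verify the token resolves to your Hugging Face account
pip install -U huggingface_hub
huggingface-cli whoami    # prints your Hugging Face username if the token is valid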

Now launch the vLLM Docker Compose stack. It will take some time to pull the image and load the model:

ubuntu@llmops:~/llm-labs/llmops/vllm$ docker compose -f docker-compose-vllm-qwen3-0.6B.yml up -d
[+] up 1/1
 ✔ Container vllm           Created                                       0.3s
ubuntu@llmops:~/llm-labs/llmops/vllm$ docker compose -f docker-compose.monitoring.yml up -d
WARN[0000] Found orphan containers ([vllm]) for this project. If you removed or renamed this service in your compose file, you can run this command with the --remove-orphans flag to clean it up.
 ✔ Container prometheus     Created                                       0.5s
 ✔ Container dcgm-exporter  Created                                       0.5s
 ✔ Container node-exporter  Created                                       0.5s
 ✔ Container cadvisor       Created                                       0.5s
 ✔ Container grafana        Created

Ignore the orphan container warning. I have deliberately kept these two compose files separate so that more model-specific compose files can be added to the same repo later.

Once all containers are downloaded and running, the output should look like this (with no container in a crash loop):


ubuntu@llmops:~/llm-labs/llmops/vllm$ docker ps
CONTAINER ID   IMAGE                             COMMAND                  CREATED              STATUS                    PORTS                                         NAMES
750f8e14201d   grafana/grafana:latest            "/run.sh"                58 seconds ago       Up 58 seconds             0.0.0.0:3000->3000/tcp, [::]:3000->3000/tcp   grafana
270c865726e9   prom/prometheus:latest            "/bin/prometheus --c…"   59 seconds ago       Up 58 seconds             0.0.0.0:9090->9090/tcp, [::]:9090->9090/tcp   prometheus
f679c2313fd2   gcr.io/cadvisor/cadvisor:latest   "/usr/bin/cadvisor -…"   59 seconds ago       Up 58 seconds (healthy)   0.0.0.0:8080->8080/tcp, [::]:8080->8080/tcp   cadvisor
28873c028c0b   prom/node-exporter:latest         "/bin/node_exporter …"   59 seconds ago       Up 58 seconds             0.0.0.0:9100->9100/tcp, [::]:9100->9100/tcp   node-exporter
5e3f54b8f485   nvidia/dcgm-exporter:latest       "/usr/local/dcgm/dcg…"   59 seconds ago       Up 58 seconds             0.0.0.0:9400->9400/tcp, [::]:9400->9400/tcp   dcgm-exporter
3b002c0b1d47   vllm/vllm-openai:latest           "vllm serve --model …"   About a minute ago   Up About a minute         0.0.0.0:8000->8000/tcp, [::]:8000->8000/tcp   vllm
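
Before pointing a benchmark at it, it is worth sanity-checking that the vLLM OpenAI-compatible API is actually serving the model. A quick check against the standard endpoints (port 8000 as mapped above; the model name must match whatever the compose file passes to vllm serve):

# List the model(s) served by vLLM
curl -s http://localhost:8000/v1/models

# Send a single chat completion request
curl -s http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "Qwen/Qwen3-0.6B", "messages": [{"role": "user", "content": "Say hello"}], "max_tokens": 32}'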

Now that the base vLLM inference setup is in place, the next step is to set up NVIDIA GenAI-Perf:


pip install genai-perf
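
GenAI-Perf pulls in a fair number of Python dependencies, so installing it into a virtual environment keeps the host Python clean. A minimal sketch (the venv path is just an example):

# Install GenAI-Perf into an isolated virtual environment
python3 -m venv ~/genai-perf-venv
source ~/genai-perf-venv/bin/activate
pip install genai-perf
genai-perf --help    # confirms the CLI is on the PATH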

Do a quick test run to see if everything is working:

genai-perf profile \
  -m Qwen/Qwen3-0.6B \
  --endpoint-type chat \
  --synthetic-input-tokens-mean 200 \
  --synthetic-input-tokens-stddev 0 \
  --output-tokens-mean 100 \
  --output-tokens-stddev 0 \
  --streaming \
  --request-count 50 \
  --warmup-request-count 10

[2026-01-11 23:53:27] DEBUG    Inferred tokenizer from model name: Qwen/Qwen3-0.6B                                  config_tokenizer.py:79
[2026-01-11 23:53:27] INFO     Profiling these models: Qwen/Qwen3-0.6B                                              create_config.py:58
[2026-01-11 23:53:27] INFO     Model name 'Qwen/Qwen3-0.6B' cannot be used to create artifact directory.            perf_analyzer_config.py:157
                               Instead, 'Qwen_Qwen3-0.6B' will be used.
[2026-01-11 23:53:27] INFO     Creating tokenizer for: Qwen/Qwen3-0.6B                                              subcommand.py:190
[2026-01-11 23:53:29] INFO     Running Perf Analyzer : 'perf_analyzer -m Qwen/Qwen3-0.6B --async                    subcommand.py:98
                               --warmup-request-count 10 --stability-percentage 999 --request-count 50 -i http
                               --concurrency-range 1 --service-kind openai --endpoint v1/chat/completions
                               --input-data artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/inputs.json
                               --profile-export-file artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export.json'
[2026-01-11 23:53:52] INFO     Loading response data from 'artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export.json'   profile_data_parser.py:66
[2026-01-11 23:53:52] INFO     Parsing total 50 requests.                                                           llm_profile_data_parser.py:124
Progress: 100%|████████████████████████████████████████| 50/50 [00:00<00:00, 260.92requests/s]

                               NVIDIA GenAI-Perf | LLM Metrics
┏━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┳━━━━━━━━┓
┃                            Statistic ┃    avg ┃    min ┃    max ┃    p99 ┃    p90 ┃    p75 ┃
┡━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━╇━━━━━━━━┩
│             Time To First Token (ms) │  12.79 │  11.14 │  16.74 │  15.22 │  13.30 │  13.05 │
│            Time To Second Token (ms) │   3.18 │   3.06 │   3.73 │   3.57 │   3.27 │   3.24 │
│                 Request Latency (ms) │ 336.79 │ 324.87 │ 348.00 │ 347.84 │ 346.32 │ 345.02 │
│             Inter Token Latency (ms) │   3.27 │   3.17 │   3.39 │   3.39 │   3.37 │   3.36 │
│     Output Token Throughput Per User │ 305.64 │ 295.21 │ 315.82 │ 315.69 │ 312.30 │ 311.15 │
│                    (tokens/sec/user) │        │        │        │        │        │        │
│      Output Sequence Length (tokens) │  99.98 │  99.00 │ 100.00 │ 100.00 │ 100.00 │ 100.00 │
│       Input Sequence Length (tokens) │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │ 200.00 │
│ Output Token Throughput (tokens/sec) │ 296.71 │    N/A │    N/A │    N/A │    N/A │    N/A │
│         Request Throughput (per sec) │   2.97 │    N/A │    N/A │    N/A │    N/A │    N/A │
│                Request Count (count) │  50.00 │    N/A │    N/A │    N/A │    N/A │    N/A │
└──────────────────────────────────────┴────────┴────────┴────────┴────────┴────────┴────────┘
[2026-01-11 23:53:52] INFO     Generating artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export_genai_perf.json   json_exporter.py:64
[2026-01-11 23:53:52] INFO     Generating artifacts/Qwen_Qwen3-0.6B-openai-chat-concurrency1/profile_export_genai_perf.csv


If you are able to see these metrics from GenAI-Perf, it means your setup is complete.
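
Once this smoke test passes, the natural follow-up is to sweep concurrency and watch how time to first token and throughput behave as load increases. A minimal sketch using GenAI-Perf's --concurrency option (the sweep values and request count are arbitrary choices for illustration):

# Sweep client concurrency against the same model and endpoint
for c in 1 2 4 8 16; do
  genai-perf profile \
    -m Qwen/Qwen3-0.6B \
    --endpoint-type chat \
    --streaming \
    --synthetic-input-tokens-mean 200 \
    --output-tokens-mean 100 \
    --request-count 100 \
    --warmup-request-count 10 \
    --concurrency "$c"
done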

Now let’s move on to setting up the Grafana dashboard.


First, ensure that the Prometheus data source is configured in Grafana. By default, it points to localhost, so we need to switch it to the prometheus service name used in the Docker Compose file.


Prometheus Grafana Setup for vLLM Benchmark
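
Before relying on the dashboards, a quick way to verify the metrics pipeline itself: vLLM exposes its own Prometheus /metrics endpoint, and Prometheus reports scrape-target health over its HTTP API. Ports are as mapped in the compose files; which targets show up depends on the bundled prometheus.yml:

# vLLM's native Prometheus metrics (request counts, KV cache usage, etc.)
curl -s http://localhost:8000/metrics | grep "^vllm" | head

# Health of the Prometheus scrape targets
curl -s http://localhost:9090/api/v1/targets | grep -o '"health":"[a-z]*"'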

As part of the Docker Compose setup, Grafana should automatically pick up the dashboard (NVIDIA + vLLM).


You should now be able to see the metrics flowing into the Grafana dashboard.


vLLM + DCGM Grafana Dashboard

At this point, what we have achieved is a basic “hello-world” setup for our LLM benchmarking infrastructure. The next big challenge is to benchmark properly and identify how we can tweak vLLM parameters and GenAI-Perf settings to squeeze the maximum out of the hardware. In this example, I am using a single A100-40GB GPU. It may not sound like much, but these are very powerful cards and work extremely well for agentic workflows where small language models are heavily used.
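
As a preview of the kind of knobs involved, these are the vLLM serving flags I expect to sweep first on an A100-40GB. The values below are illustrative starting points, not tuned recommendations:

# Illustrative vllm serve flags (values are examples only):
#   --gpu-memory-utilization  fraction of GPU memory reserved for weights + KV cache
#   --max-model-len           caps per-request context length, bounding KV cache per sequence
#   --max-num-seqs            upper limit on concurrently batched sequences
vllm serve Qwen/Qwen3-0.6B \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-seqs 256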


The next blog will focus more on capturing additional metrics and logs, and on how to get the best out of your hardware. If you are looking to collaborate, please check out https://www.becloudready.com/dev-rel-services



