This content originally appeared on DEV Community and was authored by Sumit Roy
When we first set up Prometheus + Grafana for Gemma 2B on Kubernetes in Article 1, I expected to see nice dashboards with:
- Tokens per request
- Latency per inference
- Number of inferences processed
…but all we got were boring container metrics: CPU%, memory usage, restarts.
Sure, they told us the pod was alive, but nothing about the model itself.
No clue if inference was slow, if requests were timing out, or how many tokens were processed.
Debugging the Metrics Problem
We checked:
- Prometheus scraping the Ollama pod?
- Grafana dashboards connected?
- Metrics endpoint on Ollama?
That's when we realized:
- Ollama by default doesn't expose model-level metrics.
- It only serves the API for inference, nothing else.
- Prometheus was scraping… nothing useful.
The Fix: Ollama Exporter as Sidecar
While digging through GitHub issues, we found a project: Ollama Exporter
It runs as a sidecar container inside the same pod as Ollama, talks to the Ollama API, and exposes real metrics at /metrics for Prometheus.
Basically:

```
[ Ollama Pod ]
 ├── Ollama Server   (API → 11434)
 └── Ollama Exporter (Metrics → 11435)
```
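In Deployment terms that means two containers in one pod template, so they can talk over localhost. A minimal sketch of the containers section, assuming the stock ollama/ollama image for the server (tags and field details are illustrative):

```yaml
# Sketch only: two containers sharing one pod.
containers:
  - name: ollama
    image: ollama/ollama:latest            # inference server
    ports:
      - containerPort: 11434               # Ollama API
  - name: ollama-exporter
    image: ghcr.io/jmorganca/ollama-exporter:latest
    ports:
      - containerPort: 11435               # /metrics scraped by Prometheus
    env:
      - name: OLLAMA_HOST
        value: "http://localhost:11434"    # exporter queries the server next door
```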
How We Integrated It
Here's the snippet we added to the Ollama deployment:
```yaml
- name: ollama-exporter
  image: ghcr.io/jmorganca/ollama-exporter:latest
  ports:
    - containerPort: 11435
  env:
    - name: OLLAMA_HOST
      value: "http://localhost:11434"
```
And in Prometheus config:
```yaml
scrape_configs:
  - job_name: 'ollama'
    static_configs:
      - targets: ['ollama-service:11435']
```
The Metrics We Finally Got
After adding the exporter, Grafana lit up with:
| Metric Name | What It Shows |
| --- | --- |
| `ollama_requests_total` | Number of inference requests |
| `ollama_latency_seconds` | Latency per inference request |
| `ollama_tokens_processed` | Tokens processed per inference |
| `ollama_model_load_time` | Time taken to load the Gemma 2B model |
Suddenly, we had real model observability, not just pod health.
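If you want to graph or alert on these, rate() over the counters is the usual starting point. A small sketch of recording rules, assuming ollama_requests_total and ollama_tokens_processed are counters (check the exporter's /metrics output for the exact types):

```yaml
# Hypothetical recording rules built on the metric names above.
# If latency is exported as a histogram, use histogram_quantile()
# over its _bucket series instead of the raw metric.
groups:
  - name: ollama-recording
    rules:
      - record: ollama:requests_per_second:rate5m
        expr: rate(ollama_requests_total[5m])
      - record: ollama:tokens_per_second:rate5m
        expr: rate(ollama_tokens_processed[5m])
```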
Lessons Learned
- Default Kubernetes metrics ≠ model metrics → you need a sidecar like Ollama Exporter.
- One scrape job away → Prometheus won't scrape what you don't tell it to.
- Metrics help tuning → we later used these metrics to set CPU/memory requests properly (a sketch follows below).
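For illustration only, the numbers below are placeholders rather than our production values; the point is that the latency and token-rate metrics tell you whether a given requests/limits block is actually enough:

```yaml
# Illustrative values - derive real ones from observed latency and
# throughput under representative load. Goes in the Ollama container spec.
resources:
  requests:
    cpu: "2"
    memory: 4Gi
  limits:
    cpu: "4"
    memory: 8Gi
```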
What's Next?
Now that we have model-level observability, the next steps are:
- Adding alerting rules for latency spikes or token errors (see the sketch after this list).
- Exporting historical metrics into long-term storage (e.g., Thanos or Mimir; Loki is the equivalent for logs).
- Trying multiple models (Gemma 3, LLaMA 3, Phi-3) and comparing inference latency across them.
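For the alerting piece, here is a minimal sketch of a plain Prometheus rule file; the expression assumes ollama_latency_seconds can be averaged directly (i.e., it is a gauge), and the 2-second threshold is a placeholder to tune:

```yaml
# rules/ollama-alerts.yml - referenced from rule_files: in prometheus.yml
groups:
  - name: ollama-alerts
    rules:
      - alert: OllamaHighLatency
        # If the exporter ships a histogram, switch to histogram_quantile()
        # over ollama_latency_seconds_bucket instead.
        expr: avg_over_time(ollama_latency_seconds[5m]) > 2
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Ollama inference latency has been above 2s for 10 minutes"
```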
Let's Connect
If you try this setup or improve it, I'd love to hear from you!
Drop a star on the repo if it helped you; it keeps me motivated to write more experiments like this!