📊 Adding Observability to Gemma 2B on Kubernetes with Prometheus & Grafana



This content originally appeared on DEV Community and was authored by Sumit Roy

This is a follow-up to Article 1, where we deployed Gemma 2B with Ollama on Kubernetes.

When we first set up Prometheus + Grafana for Gemma 2B on Kubernetes, we expected to see nice dashboards with:

  • Tokens per request
  • Latency per inference
  • Number of inferences processed

…but all we got were boring container metrics: CPU%, memory usage, restarts.

Sure, they told us the pod was alive, but nothing about the model itself.
No clue if inference was slow, if requests were timing out, or how many tokens were processed.

🔍 Debugging the Metrics Problem

We checked:

  • Prometheus scraping the Ollama pod? ✅
  • Grafana dashboards connected? ✅
  • Metrics endpoint on Ollama? ❌

That's when we realized:

  • Ollama by default doesn’t expose model-level metrics.
  • It only serves the API for inference, nothing else.
  • Prometheus was scraping… nothing useful.

💡 The Fix: Ollama Exporter as Sidecar

While digging through GitHub issues, we found a project: Ollama Exporter

It runs as a sidecar container inside the same pod as Ollama, talks to the Ollama API, and exposes real metrics at /metrics for Prometheus.

Basically:

[ Ollama Pod ]
    ├── Ollama Server (API → 11434)
    └── Ollama Exporter (Metrics → 11435)

🛠 How We Integrated It

Here's the snippet we added to the Ollama deployment:

# Added to the containers: list of the existing Ollama Deployment
- name: ollama-exporter
  image: ghcr.io/jmorganca/ollama-exporter:latest
  ports:
    - containerPort: 11435          # /metrics is served on this port
  env:
    - name: OLLAMA_HOST             # where the exporter reaches the Ollama API
      value: "http://localhost:11434"

And in Prometheus config:

scrape_configs:
  - job_name: 'ollama'
    static_configs:
      # the Service must expose the exporter's 11435 port (see the Service sketch above)
      - targets: ['ollama-service:11435']
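
If you run Prometheus through the Prometheus Operator (e.g., kube-prometheus-stack) rather than a static scrape config, a ServiceMonitor does the same job. This is only a sketch under that assumption; the release label and the app: ollama selector are placeholders that have to match your installation:

apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ollama
  labels:
    release: kube-prometheus-stack   # must match the Prometheus serviceMonitorSelector
spec:
  selector:
    matchLabels:
      app: ollama                    # label on the Service above
  endpoints:
    - port: metrics                  # named port that fronts 11435
      interval: 30s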

📊 The Metrics We Finally Got

After adding the exporter, Grafana lit up with:

Metric Name               What It Shows
ollama_requests_total     Number of inference requests
ollama_latency_seconds    Latency per inference request
ollama_tokens_processed   Tokens processed per inference
ollama_model_load_time    Time taken to load the Gemma 2B model

Suddenly, we had real model observability, not just pod health.
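
To turn these into Grafana panels, we query them with PromQL. The exact metric types depend on the exporter version, so treat this as a sketch: it assumes ollama_requests_total and ollama_tokens_processed are counters and ollama_latency_seconds is a gauge (if it is a histogram, use histogram_quantile over its _bucket series instead). Wrapping the queries in recording rules keeps the dashboards cheap:

groups:
  - name: ollama-dashboard
    rules:
      - record: ollama:requests_per_second:rate5m
        expr: rate(ollama_requests_total[5m])            # request throughput
      - record: ollama:tokens_per_second:rate5m
        expr: rate(ollama_tokens_processed[5m])          # token throughput
      - record: ollama:latency_seconds:avg5m
        expr: avg_over_time(ollama_latency_seconds[5m])  # assumes a gauge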

🚀 Lessons Learned

  • Default Kubernetes metrics ≠ model metrics → you need a sidecar like Ollama Exporter.
  • One scrape job away → Prometheus won't scrape what you don't tell it to.
  • Metrics help tuning → we later used these metrics to set CPU/memory requests properly.

🔮 What's Next?

Now that we have model-level observability, the next steps are:

  • Adding alerting rules for latency spikes or token errors (a sketch follows this list).
  • Exporting historical metrics into long-term storage (e.g., Thanos).
  • Trying multiple models (Gemma 3, LLaMA 3, Phi-3) and comparing inference latency across them.
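
Here is a rough idea of what those alerting rules could look like, reusing the metric names above. The thresholds are invented for illustration, and the latency expression again assumes a gauge:

groups:
  - name: ollama-alerts
    rules:
      - alert: OllamaHighLatency
        expr: avg_over_time(ollama_latency_seconds[5m]) > 2   # illustrative threshold
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Gemma 2B inference latency has been above 2s for 10 minutes"
      - alert: OllamaNoRequests
        expr: rate(ollama_requests_total[15m]) == 0
        for: 30m
        labels:
          severity: info
        annotations:
          summary: "No inference requests seen for 30 minutes"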

💬 Let's Connect

If you try this setup or improve it, I'd love to hear from you!

Drop a star ⭐ on the repo if it helped you; it keeps me motivated to write more experiments like this!

