Latency Slayer: a Redis 8 semantic cache gateway that makes LLMs feel instant



This content originally appeared on DEV Community and was authored by Mohit Agnihotri

This is a submission for the Redis AI Challenge: Real-Time AI Innovators.


What I Built

Latency Slayer is a tiny Rust reverse proxy that sits in front of any LLM API.

It uses embeddings + vector search in Redis 8 to detect “repeat-ish” prompts and return a cached answer instantly. New prompts are answered once by the LLM and stored with per-field TTLs, so only the response expires while metadata persists.

Why it matters: dramatically lower latency and cost, with transparent drop-in integration for any chat or RAG app.
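At a high level, the request path looks like the following minimal sketch. This is an illustrative outline, not the actual gateway code: the helpers (embed, call_llm, nearest_cached, store) are stubbed out, and the 0.15 distance threshold is an assumed value; the real Redis calls behind them are sketched later in this post.

```rust
/// Cosine-distance cutoff for accepting a cached answer (assumed value).
const HIT_THRESHOLD: f32 = 0.15;

/// Placeholder: the real gateway calls OpenAI text-embedding-3-small here.
fn embed(_prompt: &str) -> Vec<f32> {
    vec![0.0; 1536]
}

/// Placeholder: the real gateway forwards the prompt to the upstream LLM API.
fn call_llm(prompt: &str) -> String {
    format!("(llm answer for: {prompt})")
}

/// Placeholders for the Redis lookups/writes sketched later in the post.
fn nearest_cached(_vec: &[f32]) -> Option<(f32, String)> {
    None // (cosine distance, cached response)
}
fn store(_prompt: &str, _vec: &[f32], _resp: &str) {}

fn handle_prompt(prompt: &str) -> String {
    let vec = embed(prompt); // 1536-d FP32 embedding of the incoming prompt
    // Hit: a semantically close earlier prompt whose cached response is still live.
    if let Some((dist, resp)) = nearest_cached(&vec) {
        if dist <= HIT_THRESHOLD {
            return resp; // instant answer, no LLM round trip
        }
    }
    // Miss: ask the LLM once, then persist response (with TTL), vector, and metrics.
    let resp = call_llm(prompt);
    store(prompt, &vec, &resp);
    resp
}

fn main() {
    println!("{}", handle_prompt("What is the capital of France?"));
}
```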

Core tricks

  • Redis Query Engine + HNSW vectors (COSINE) to find semantically similar earlier prompts (index sketch after this list).
  • Hash field expiration (HSETEX / HGETEX) so we can expire just the “response” field without deleting the whole hash.
  • Redis Streams for real-time hit-rate and latency metrics, rendered in a tiny dashboard.
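As a concrete reference for the first trick, here is a minimal sketch of creating that vector index from Rust with the redis crate, issuing a raw FT.CREATE. The index name idx:prompts is an assumption; the schema (FP32, 1536-d embedding plus model/route/user tags over vec:* hashes) follows the data model described below.

```rust
use redis::Client;

// Build the HNSW (COSINE) index over vec:* hashes. Index name is illustrative.
fn create_index(con: &mut redis::Connection) -> redis::RedisResult<()> {
    redis::cmd("FT.CREATE")
        .arg("idx:prompts")
        .arg("ON").arg("HASH")
        .arg("PREFIX").arg(1).arg("vec:")
        .arg("SCHEMA")
        .arg("embedding").arg("VECTOR").arg("HNSW")
        .arg(6) // six attribute tokens follow (three name/value pairs)
        .arg("TYPE").arg("FLOAT32")
        .arg("DIM").arg(1536)
        .arg("DISTANCE_METRIC").arg("COSINE")
        .arg("model").arg("TAG")
        .arg("route").arg("TAG")
        .arg("user").arg("TAG")
        .query(con)
}

fn main() -> redis::RedisResult<()> {
    let client = Client::open("redis://127.0.0.1/")?;
    let mut con = client.get_connection()?;
    create_index(&mut con)
}
```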

Demo

Screenshots:

Dashboard screenshot 1

Dashboard screenshot 2

How I Used Redis 8

  • Vector search (HNSW, COSINE) on a HASH document that stores an embedding field (FP32, 1536-d from OpenAI text-embedding-3-small); a query sketch follows this list.
  • Per-field TTL on hashes: HSETEX to set the response field and its TTL in a single step; HGETEX to read and optionally refresh TTLs. This gives granular cache lifetimes without deleting other fields (like usage or model metadata); a TTL sketch follows this list.
  • Redis Streams: an XADD to analytics:cache per request; the dashboard subscribes and renders hit rate, token savings, and latency deltas in real time; a metrics sketch follows this list.
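To make the vector-search bullet concrete, here is a sketch of the top-1 KNN lookup as a raw FT.SEARCH. The to_blob helper and the idx:prompts index name are the same assumptions as in the index sketch above; the reply is left as a raw redis::Value for brevity.

```rust
use redis::Connection;

/// Pack an FP32 embedding into the little-endian byte blob the query engine expects.
fn to_blob(embedding: &[f32]) -> Vec<u8> {
    embedding.iter().flat_map(|f| f.to_le_bytes()).collect()
}

/// Fetch the single nearest cached prompt and its cosine distance (raw reply).
fn nearest(con: &mut Connection, embedding: &[f32]) -> redis::RedisResult<redis::Value> {
    redis::cmd("FT.SEARCH")
        .arg("idx:prompts")
        .arg("*=>[KNN 1 @embedding $vec AS score]") // top-1 semantic match
        .arg("PARAMS").arg(2).arg("vec").arg(to_blob(embedding))
        .arg("SORTBY").arg("score")
        .arg("RETURN").arg(1).arg("score")
        .arg("DIALECT").arg(2) // KNN queries require query dialect 2+
        .query(con)
}
```

A result only counts as a hit when score (the cosine distance) falls below the route’s threshold; anything above it is treated as a miss.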
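For the per-field TTL bullet, a sketch of the two Redis 8 commands issued raw (typed wrappers for them are not yet common in client libraries; argument order follows the Redis 8 docs). The 600-second TTL and the sliding refresh on read are assumed policies, not necessarily what the gateway ships with.

```rust
use redis::Connection;

/// Cache the response with its own TTL; meta/usage fields are left untouched.
fn cache_response(con: &mut Connection, key: &str, resp: &str) -> redis::RedisResult<()> {
    // HSETEX key EX 600 FVS 1 resp <value>: set the field and its TTL in one step.
    redis::cmd("HSETEX")
        .arg(key)
        .arg("EX").arg(600)
        .arg("FVS").arg(1)
        .arg("resp").arg(resp)
        .query(con)
}

/// Read the response and, on a hit, slide its TTL forward another 600 seconds.
fn read_response(con: &mut Connection, key: &str) -> redis::RedisResult<Option<String>> {
    let mut vals: Vec<Option<String>> = redis::cmd("HGETEX")
        .arg(key)
        .arg("EX").arg(600)
        .arg("FIELDS").arg(1).arg("resp")
        .query(con)?;
    Ok(vals.pop().flatten())
}
```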
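And the metrics path: one XADD per proxied request, with the dashboard tailing the stream. Field names follow the data-model section; the blocking XREAD below is a simplified stand-in for the real dashboard subscription.

```rust
use redis::Connection;

/// Emit one analytics event per proxied request.
fn record_event(
    con: &mut Connection,
    hit: bool,
    latency_ms: u64,
    tokens_saved: u64,
) -> redis::RedisResult<String> {
    redis::cmd("XADD")
        .arg("analytics:cache").arg("*") // "*" lets Redis assign the entry ID
        .arg("event").arg("lookup")
        .arg("hit").arg(hit as i64)
        .arg("latency_ms").arg(latency_ms)
        .arg("tokens_saved").arg(tokens_saved)
        .query(con)
}

/// Block until new entries arrive after `last_id` (e.g. "$" for only-new entries).
fn tail_metrics(con: &mut Connection, last_id: &str) -> redis::RedisResult<redis::Value> {
    redis::cmd("XREAD")
        .arg("BLOCK").arg(0)
        .arg("STREAMS").arg("analytics:cache").arg(last_id)
        .query(con)
}
```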

Data model (simplified)

  • cache:{fingerprint} → Hash fields: prompt, resp, meta, usage, created_at (with resp having its own TTL)
  • vec:{fingerprint} → Vector field + tags (model, route, user); see the write sketch after this list
  • Stream: analytics:cache with {event, hit, latency_ms, tokens_saved}
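Here is a sketch of writing the searchable vector document on a miss, so the three structures above stay in sync. The parameter names and the FP32 little-endian packing are assumptions consistent with the index sketch earlier.

```rust
use redis::Connection;

/// Write the vector document for a new prompt fingerprint (miss path).
fn store_vector(
    con: &mut Connection,
    fingerprint: &str,
    embedding: &[f32],
    model: &str,
    route: &str,
    user: &str,
) -> redis::RedisResult<()> {
    let blob: Vec<u8> = embedding.iter().flat_map(|f| f.to_le_bytes()).collect();
    redis::cmd("HSET")
        .arg(format!("vec:{fingerprint}"))
        .arg("embedding").arg(blob) // FP32 LE bytes, matching the index schema
        .arg("model").arg(model)
        .arg("route").arg(route)
        .arg("user").arg(user)
        .query(con)
}
```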

Why Redis 8?

  • New field-level expiration commands on hashes make the cache lifecycle clean and safe.
  • New int8 vector support promises lower memory use and faster search; adopting it is on our roadmap (see “What’s next”).
  • Battle-tested Streams/Pub/Sub give us real-time observability with a tiny footprint.

What’s next

  • Prefetch: predict likely next prompts and warm them proactively.
  • Hybrid filters: combine vector similarity + tags (model/route) for stricter cache hits.
  • Cold-start tuning: adapt hit threshold by route and user cohort.
  • INT8 quantization: we currently store FP32 vectors for simplicity; quantizing to int8 is planned to lower memory and speed up search.

