RAG at Scale: What It Takes To Serve 10,000 Queries A Day

Most teams start with a simple RAG prototype. It feels elegant, almost magical. A vector database, a handful of chunks, a clean loop that retrieves and generates answers. It works beautifully for demos and internal tools. Then you ship it to real users and everything changes. Latency creeps up. Costs balloon. Hallucinations slip through. Cold starts make terrible first impressions.

None of this is a surprise. Naive RAG was never built for production traffic.

At Oxford Dynamics, we’ve helped teams navigate this exact transition: from a prototype that impresses stakeholders to a system that holds up under real load. This post covers what naive RAG actually does, why it falls apart, what advanced RAG adds, and the engineering work that makes a system stable at ten thousand queries a day.

What naive RAG actually is

At its core, naive RAG is a four-step loop. Embed the query. Run similarity search. Take the top chunks. Send them directly to the model. It’s straightforward, but brittle.
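A minimal sketch of that four-step loop. The `embed` function here is a toy bag-of-words stand-in for a real embedding model, and `generate` is a placeholder callable standing in for the LLM call; both are assumptions for illustration, not any particular library's API.

```python
import math
from collections import Counter


def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def naive_rag(query, chunks, generate, top_k=2):
    q_vec = embed(query)                                   # 1. embed the query
    ranked = sorted(chunks,                                # 2. similarity search
                    key=lambda c: cosine(q_vec, embed(c)),
                    reverse=True)
    context = ranked[:top_k]                               # 3. take the top chunks
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + query
    return generate(prompt)                                # 4. send them to the model
```

Nothing in this loop checks whether the retrieved chunks are actually relevant, permitted, or current, which is where the problems described next come from.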

It breaks because chunk boundaries are often poor, so the model sees fragments rather than full ideas. Dense similarity alone retrieves irrelevant content. Multi-hop questions fall apart because retrieval is shallow. There’s no metadata filtering, no time awareness, no permissions logic. And the model can’t request additional retrieval when the initial pass is weak.

Naive RAG is fine for demos or hackathons. It’s not fine for systems under load.

What advanced RAG adds

Advanced RAG strengthens retrieval before the model generates anything. It doesn’t rely on a single vector search call. It layers techniques that compensate for the structural weaknesses in naive RAG.

Query rewriting clarifies or decomposes questions. When a user asks “What did the CEO say about it?”, the system resolves “it” from the conversation context and rewrites the question to “What did the CEO say about Q3 earnings guidance?” before retrieval begins.

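Hypothetical Document Embeddings (HyDE) has the model draft a hypothetical answer first, then uses that draft’s embedding for the similarity search. Because the draft looks more like the documents being searched than a short question does, this often retrieves better chunks than the original query would.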

Hybrid search blends dense retrieval with sparse keyword matching. Vector similarity finds semantically related content. BM25 catches exact terms that embedding models miss.
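One common way to blend the two result lists is reciprocal rank fusion (RRF), which needs only the rank positions and not the raw scores. The sketch below assumes each retriever returns an ordered list of document IDs; the constant `k=60` is the conventional default, and the function name is ours. It is one option among several, not the only way to combine dense and sparse results.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list.

    Each document scores 1 / (k + rank) per list it appears in,
    so items ranked well by both retrievers rise to the top.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


dense_hits = ["a", "b", "c"]   # hypothetical vector-search ranking
sparse_hits = ["b", "d", "a"]  # hypothetical BM25 ranking
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```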

Metadata filtering extracts constraints before searching. “Last month” becomes a date filter. “In the presentation” filters to .pptx files. User context restricts results to permitted documents.
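A rough sketch of that idea, using simple keyword rules. A real system would more likely use an LLM or a proper parser to extract constraints; the `Filters` fields and the document dictionary keys (`id`, `path`, `modified`) are assumptions made for this example.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional


@dataclass
class Filters:
    date_from: Optional[date] = None
    date_to: Optional[date] = None
    extension: Optional[str] = None


def extract_filters(query, today):
    """Turn phrases in the query into structured constraints."""
    filters = Filters()
    q = query.lower()
    if "last month" in q:
        first_of_this_month = today.replace(day=1)
        filters.date_to = first_of_this_month - timedelta(days=1)
        filters.date_from = filters.date_to.replace(day=1)
    if "presentation" in q:
        filters.extension = ".pptx"
    return filters


def apply_filters(docs, filters, allowed_ids):
    """Keep only documents the user may see and that match the constraints."""
    kept = []
    for doc in docs:
        if doc["id"] not in allowed_ids:
            continue  # permissions come first
        if filters.extension and not doc["path"].endswith(filters.extension):
            continue
        if filters.date_from and not (
            filters.date_from <= doc["modified"] <= filters.date_to
        ):
            continue
        kept.append(doc)
    return kept
```

Filtering before the similarity search shrinks the candidate set, so the vector search never ranks documents the user could not have been shown anyway.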

Reranking uses a cross-encoder to score each query-chunk pair and select the top N. This step trades speed for precision.
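The shape of a reranking step, sketched with the scoring model injected as a function. In practice `score_pair` would wrap a cross-encoder model; here it is a hypothetical parameter so the example runs without one, and the token-overlap scorer below is only a toy stand-in.

```python
def rerank(query, chunks, score_pair, top_n=3):
    """Score every (query, chunk) pair and keep the top_n chunks."""
    scored = [(score_pair(query, chunk), chunk) for chunk in chunks]
    # sort() is stable, so equally scored chunks keep their retrieval order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [chunk for _, chunk in scored[:top_n]]


def toy_overlap_score(query, chunk):
    """Toy scorer: number of shared tokens (NOT a real cross-encoder)."""
    return len(set(query.lower().split()) & set(chunk.lower().split()))


candidates = [
    "billing email address",
    "reset your password by email",
    "holiday schedule",
]
best = rerank("reset password email", candidates, toy_overlap_score, top_n=2)
```

Because every candidate costs one model call, reranking is usually applied to a few dozen first-pass results, not to the whole corpus.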

Context enrichment makes retrieved chunks useful. Parent chunks, summary chunks, and sliding windows transform fragments into something the model can reason over.
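As one illustration, a sliding-window expansion can be sketched as below: each retrieved chunk pulls in its neighbours, and overlapping windows are merged into a single passage. The function name and the assumption that chunks are stored in document order are ours.

```python
def expand_with_neighbors(chunks, hit_indexes, window=1):
    """Expand each hit to include `window` chunks on either side,
    merging adjacent or overlapping windows into one passage."""
    keep = set()
    for i in hit_indexes:
        lo = max(0, i - window)
        hi = min(len(chunks), i + window + 1)
        keep.update(range(lo, hi))

    passages, current = [], []
    for j in sorted(keep):
        if current and j != current[-1] + 1:
            passages.append(current)  # gap found: close the current passage
            current = []
        current.append(j)
    if current:
        passages.append(current)
    return [" ".join(chunks[j] for j in group) for group in passages]
```

For example, with seven chunks and hits at positions 1 and 5, a window of 1 yields two passages covering chunks 0 to 2 and 4 to 6.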

This tightens retrieval quality. But better retrieval doesn’t guarantee a production-ready system.

What breaks at scale

Advanced RAG solves the quality problem. It doesn’t solve the systems problem. At a hundred concurrent users, four things fail in a predictable order. Latency climbs because every step runs sequentially. Costs balloon from redundant computation. Cold starts punish users after every deployment. And hallucinations erode trust when retrieval gaps go undetected.

These are all engineering problems, not model problems. The fixes live in the layer below retrieval.

The production engineering layer

This is the part most RAG guides skip. Advanced retrieval gets you quality. The layer below it keeps the system alive under load.

[Diagram: what happens when a query arrives]

[Diagram: what keeps the system alive between queries]

A few things to call out.

The semantic cache sits before retrieval, not after. If an incoming query is similar enough to one you’ve already answered (above a configurable similarity threshold), you return the cached result and skip the entire pipeline. For systems with repetitive query patterns like support desks or HR bots, this single component can eliminate a significant chunk of your compute. The savings depend heavily on your query distribution; diverse research-style queries will see less benefit than high-volume support traffic.
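A minimal in-memory sketch of such a cache. The `toy_embed` function is a bag-of-words stand-in for a real embedding model, the linear scan would be replaced by a vector index at any real size, and the class and method names are assumptions for illustration.

```python
import math


def toy_embed(text):
    """Toy stand-in for an embedding model: sparse token counts."""
    vec = {}
    for tok in text.lower().replace("?", "").split():
        vec[tok] = vec.get(tok, 0.0) + 1.0
    return vec


def cosine(a, b):
    dot = sum(v * b.get(k, 0.0) for k, v in a.items())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SemanticCache:
    def __init__(self, embed, threshold=0.9):
        self.embed = embed
        self.threshold = threshold  # the configurable similarity threshold
        self.entries = []           # list of (embedding, answer)

    def put(self, query, answer):
        self.entries.append((self.embed(query), answer))

    def get(self, query):
        """Return the answer of the most similar cached query,
        or None if nothing clears the threshold (a cache miss)."""
        q_vec = self.embed(query)
        best_score, best_answer = 0.0, None
        for vec, answer in self.entries:
            score = cosine(q_vec, vec)
            if score > best_score:
                best_score, best_answer = score, answer
        return best_answer if best_score >= self.threshold else None
```

The threshold is the main tuning knob: set it too low and users get answers to questions they did not ask; set it too high and the cache rarely hits.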

Retrieval fans out in parallel. Vector search, BM25, and metadata lookups don’t need to happen sequentially. Running them concurrently reduces wall-clock retrieval time to the duration of the slowest single source, rather than the sum of all three.
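The fan-out can be sketched with `asyncio.gather`. The three coroutines below are placeholders that simulate I/O with `asyncio.sleep`; their names, delays, and return values are invented for the example.

```python
import asyncio
import time


async def vector_search(query):
    await asyncio.sleep(0.2)   # simulated vector DB round trip
    return ["vec-hit"]


async def bm25_search(query):
    await asyncio.sleep(0.15)  # simulated keyword index round trip
    return ["bm25-hit"]


async def metadata_lookup(query):
    await asyncio.sleep(0.1)   # simulated metadata store round trip
    return ["meta-hit"]


async def retrieve_all(query):
    """Run all three lookups concurrently and flatten the results."""
    results = await asyncio.gather(
        vector_search(query),
        bm25_search(query),
        metadata_lookup(query),
    )
    # gather() returns results in the order the coroutines were passed in.
    return [hit for source in results for hit in source]


def timed_retrieve(query):
    """Return (hits, elapsed_seconds) for one concurrent retrieval."""
    start = time.perf_counter()
    hits = asyncio.run(retrieve_all(query))
    return hits, time.perf_counter() - start
```

With these simulated delays, the sequential cost would be about 0.45 seconds, while the concurrent version takes roughly as long as the slowest call, about 0.2 seconds.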

Connection pooling on the vector database matters more than you’d expect. At 10,000 queries a day, traffic rarely arrives evenly, and each query may issue several lookups, so peak hours bring sustained concurrent connections. Without pooling, each query opens and closes its own connection, and the setup overhead adds up fast.

Streaming the LLM response reduces perceived latency dramatically. The user sees tokens arriving within a few hundred milliseconds instead of waiting three to five seconds for a complete response.
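To make the difference concrete, the sketch below measures time to first token against total time. `fake_llm_stream` is a placeholder generator that simulates token-by-token output; a real client would yield tokens from the model provider’s streaming response instead.

```python
import time


def fake_llm_stream(answer, delay=0.01):
    """Placeholder for a streaming LLM call: yields one token at a time."""
    for token in answer.split():
        time.sleep(delay)  # simulated per-token generation time
        yield token + " "


def stream_to_user(tokens, emit):
    """Forward tokens as they arrive and record both latencies."""
    start = time.perf_counter()
    first_token_at = None
    parts = []
    for tok in tokens:
        if first_token_at is None:
            first_token_at = time.perf_counter() - start
        emit(tok)          # e.g. write to a websocket or SSE response
        parts.append(tok)
    total = time.perf_counter() - start
    return "".join(parts), first_token_at, total
```

The total generation time is unchanged; what improves is how soon the user sees the first token.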

The observability layer isn’t optional. Without per-query cost tracking, cache hit rates, retrieval quality scores, and hallucination rates, you’re flying blind. The teams we’ve worked with that skip this step end up debugging production issues by guessing.
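A sketch of the kind of per-query record this implies. The field names are assumptions, and `flagged_unsupported` presumes some separate groundedness check exists upstream; in production these records would go to a metrics backend, not an in-memory list.

```python
import math
from dataclasses import dataclass, field


@dataclass
class QueryRecord:
    cache_hit: bool
    latency_ms: float
    cost_usd: float
    top_retrieval_score: float
    flagged_unsupported: bool  # output of a hypothetical groundedness check


@dataclass
class Metrics:
    records: list = field(default_factory=list)

    def record(self, rec):
        self.records.append(rec)

    def summary(self):
        n = len(self.records)
        if n == 0:
            return {}
        latencies = sorted(r.latency_ms for r in self.records)
        p95_index = math.ceil(0.95 * n) - 1  # nearest-rank percentile
        return {
            "queries": n,
            "cache_hit_rate": sum(r.cache_hit for r in self.records) / n,
            "avg_cost_usd": sum(r.cost_usd for r in self.records) / n,
            "avg_retrieval_score":
                sum(r.top_retrieval_score for r in self.records) / n,
            "flagged_rate":
                sum(r.flagged_unsupported for r in self.records) / n,
            "p95_latency_ms": latencies[p95_index],
        }
```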

Background processes keep the system warm. Precomputing embeddings for common query patterns, invalidating stale cache entries, and running warm-up sequences after each deployment prevent the cold start problem from ever reaching users.

Error handling and graceful degradation don’t show up in architecture diagrams, but they’re non-negotiable. What happens when the vector DB is unreachable? When the reranker times out? When the LLM returns a refusal? Production systems need circuit breakers on every external call, fallback paths (serve from cache even if stale, degrade to BM25-only retrieval), and retry logic with exponential backoff. The happy path is the easy part. The failure modes are where production systems earn their keep.
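The three pieces named above (a circuit breaker, retries with exponential backoff, and a BM25-only fallback) can be sketched together as follows. The class, function names, and default thresholds are illustrative assumptions, and a production breaker would also need thread safety and metrics.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker is open and the call is skipped."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_after=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None

    def call(self, fn, *args):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_after:
                raise CircuitOpenError("circuit open; skipping call")
            self.opened_at = None  # half-open: allow one trial call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # trip (or re-trip) the breaker
            raise
        self.failures = 0  # success closes the circuit
        return result


def retry_with_backoff(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    """Retry fn, doubling the wait after each failure."""
    for attempt in range(attempts):
        try:
            return fn()
        except CircuitOpenError:
            raise  # no point retrying while the breaker is open
        except Exception:
            if attempt == attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)


def retrieve_with_fallback(query, vector_search, bm25_search, breaker,
                           sleep=time.sleep):
    """Try the vector path; degrade to BM25-only if it keeps failing."""
    try:
        hits = retry_with_backoff(
            lambda: breaker.call(vector_search, query), sleep=sleep)
        return hits, "hybrid"
    except Exception:
        return bm25_search(query), "bm25-only"
```

Once the breaker has tripped, later queries skip the failing vector call entirely and go straight to the fallback until `reset_after` seconds have passed, which keeps a struggling dependency from adding retry latency to every request.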

The real lesson

Naive RAG gets you a demo. Advanced RAG gets you better retrieval. Neither is enough for real traffic. Production RAG is an engineering problem. The plumbing matters more than the prompt.

If you want a system that stays steady at ten thousand queries a day, the gains come from architecture, caching, concurrency, and operational discipline. RAG isn’t an LLM trick. It’s a systems problem wrapped around a language model.

This is exactly the kind of challenge we tackle at Oxford Dynamics. If your team is navigating the jump from RAG prototype to production system, we’d be happy to share what we’ve learned.
