Natoma Blog
The Invisible Tax Killing Your AI Agents at Scale

Shreyas Karnik
Technical tips
The MCP ecosystem has a scaling problem that nobody is measuring correctly. It's not latency. It's not auth. It's the silent cost of loading hundreds of tool definitions into your agent's context window — and the cascading accuracy failures that follow. We hit this wall ourselves. We built the fix. Today, Natoma launches tool search — and your agents are about to get a lot smarter.
What Tool Fatigue Actually Is
Every MCP tool exposes a definition: a name, a description, and a JSON input schema. That definition gets injected into the model's context window so the agent knows what tools are available. A single tool costs 200-500 tokens depending on schema complexity.
At a small scale, this is fine. Connect Slack, GitHub, and Jira — maybe 30 tools total, ~10K tokens of definitions. The agent reasons over them easily.
The problem emerges at enterprise scale. An organization with 30 SaaS products, each exposing 20-50 operations as tools, pushes 600-1,500 tool definitions into context. That's 200K-500K tokens consumed before the agent processes the user's actual request.
But token consumption is the visible symptom. The deeper problem is accuracy degradation.
Models make measurably worse tool selections as the candidate set grows. Past 20-30 tools, the agent must compare increasingly similar descriptions, reason over overlapping parameter schemas, and pick from a combinatorial space that grows with every added integration. Empirically, tool selection errors increase non-linearly — 50 tools is not 5x harder than 10 tools, it's qualitatively different. The model's attention fragments across too many candidates.
This is tool fatigue: the compounding degradation in agent performance from exposing too many tool definitions. It manifests as:
Wrong tool selection — the agent picks `updateUser` when it meant `patchUserProfile`
Hallucinated parameters — the agent invents parameter names that don't exist
Wasted reasoning steps — the agent retries after calling the wrong tool, burning tokens on recovery
Slower responses — more tool definitions = more tokens to process per turn
Larger context windows don't fix this. The problem isn't capacity — it's attention. A 200K window with 100K tokens of tool definitions performs worse than a 32K window with 2K tokens of tool definitions, even if both have the same remaining space for reasoning.
With Claude Sonnet, cached input tokens cost $0.30/MTok while uncached ones cost $3/MTok — a 10x difference. Any change to tool definitions mid-session invalidates the KV-cache for all subsequent actions and observations.
So loading 200+ tools doesn't just hurt accuracy — it creates a double tax: you pay full price for every token on every turn because your cache is constantly busted. This reframes tool fatigue from "your agent gets dumb" to "your agent gets dumb AND 10x more expensive."
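A quick back-of-envelope makes the double tax concrete. This sketch uses the Sonnet prices quoted above; the 600-tool count, 350-token average, and 20-turn session length are illustrative assumptions, not measured values:

```python
# Back-of-envelope cost of re-reading tool definitions every turn,
# using the Claude Sonnet prices quoted above. Tool count, tokens per
# tool, and session length are illustrative assumptions.

TOKENS_PER_TOOL = 350          # midpoint of the 200-500 token range
UNCACHED = 3.00 / 1_000_000    # $ per input token, cache miss
CACHED = 0.30 / 1_000_000      # $ per input token, cache hit

def definition_cost(n_tools: int, turns: int, cached: bool) -> float:
    """Dollars spent on tool definitions alone over a session."""
    rate = CACHED if cached else UNCACHED
    return n_tools * TOKENS_PER_TOOL * turns * rate

# 600 tools over a 20-turn session:
busted = definition_cost(600, 20, cached=False)  # definitions keep changing
warm = definition_cost(600, 20, cached=True)     # definitions stay stable
```

With a constantly invalidated cache the session pays $12.60 just for definitions; with a warm cache, $1.26. Same tokens, 10x the bill.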
Natoma Search-Then-Execute Pattern
We didn't find this problem in a benchmark. We found it in production. As an AI-first company, Natoma runs the same agentic infrastructure we build for our customers — and tool fatigue was quietly killing our accuracy long before we had a name for it. That's why we built tool search: the MCP scaling problem nobody was measuring was very much our problem too.
The fix is conceptually simple: don't expose all tools upfront. Instead, expose a search interface that lets the agent discover the right tool on-demand.
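A minimal sketch of the pattern, in Python for illustration. The tool names (`search_tools`, `execute_tool`), the registry, and the toy keyword scoring are all hypothetical stand-ins, not Natoma's actual API:

```python
# Sketch of search-then-execute. In production the registry would be an
# indexed corpus of every connected MCP server's operations, and scoring
# would be the hybrid retrieval described later in this post.

REGISTRY = {
    "createMessage": {
        "description": "Send an SMS to a phone number",
        "schema": {"to": "string", "body": "string"},
    },
    "listMessages": {
        "description": "List previously sent SMS messages",
        "schema": {"page": "integer"},
    },
}

def search_tools(query: str, top_k: int = 3) -> list[dict]:
    """Return lightweight matches (name + description, no schemas)."""
    def score(meta: dict) -> int:
        words = meta["description"].lower().split()
        return sum(w in words for w in query.lower().split())
    ranked = sorted(REGISTRY, key=lambda n: score(REGISTRY[n]), reverse=True)
    return [{"name": n, "description": REGISTRY[n]["description"]}
            for n in ranked[:top_k]]

def execute_tool(name: str, **params) -> dict:
    """Fetch the schema just-in-time and dispatch the call."""
    schema = REGISTRY[name]["schema"]  # loaded on demand, not upfront
    # ... validate params against schema, then invoke the MCP server
    return {"tool": name, "params": params}

# Agent flow: discover, then invoke. Only these 2 tool definitions
# ever sit in the context window, regardless of registry size.
hits = search_tools("send a text message")
result = execute_tool(hits[0]["name"], to="+15551234567", body="hi")
```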
The agent sees just two operations, a search tool and an execute tool, instead of scanning 80+ Twilio tool definitions. The parameter schema arrives just-in-time. Context tax drops from O(N) to O(1).
How precise is the search?
If the search returns wrong results, you've traded one problem (too many tools) for another (wrong tools). The agent wastes steps recovering, or worse, executes the wrong operation. Search precision isn't a nice-to-have — it's the foundation.
Why Retrieval Method Matters
API operation metadata is a peculiar search domain. It contains both highly structured identifiers (`getUserById`, `phone_number`, `OAuth2`) and natural language descriptions ("Retrieves a list of all active subscriptions for the given customer"). Users and agents query it with a mix of both.
Each retrieval method handles this domain differently:
BM25 alone excels at exact keyword hits. Searching for `createMessage` or `phone_number` returns perfect results because the terms appear verbatim in the operation metadata. But natural language queries like "send a text message" find nothing — those words don't appear in the operation ID or parameter names.
Semantic embeddings alone handle the abstraction gap beautifully. "Send a text message" and `createMessage` are neighbors in embedding space. On natural language queries — which dominate in practice — embeddings hold their accuracy even as the tool pool grows to hundreds of candidates. They are the primary workhorse.
Hybrid (BM25 + embeddings via Reciprocal Rank Fusion, or RRF) adds a safety net for exact-identifier queries. In production, agents frequently reference tools by name after they've already discovered them — querying `getUserById` verbatim to re-invoke a known operation. Pure embeddings can struggle here: a query for `getUserById` might rank `findUserByEmail` higher because the two are semantically similar, even though the agent wanted an exact match. BM25 handles that case cleanly, so hybrid adds resilience precisely where embeddings are weakest.
The tradeoff is worth understanding honestly: on the benchmark below, which uses natural language queries throughout, embeddings outperform hybrid at every large pool size. Hybrid's advantage is production robustness across mixed query types, not raw accuracy on a single-style benchmark.
The solution for production is to run both and fuse the results.
BM25: The Lexical Backbone
BM25 (Best Matching 25) is a probabilistic ranking function that scores documents by term frequency, inverse document frequency, and document length normalization. For API metadata, we build the index over a concatenation of each operation's ID, summary, description, path, tags, and parameter names.
BM25's strength here is precision on structured queries. When the agent has already seen an operation ID from a previous search result and queries by name, BM25 returns an exact match instantly. It also handles domain-specific terminology well — "OAuth", "webhook", "pagination" — because these terms appear in the corpus with clear frequency signals.
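The index-and-score loop can be sketched in a few lines. This is a minimal BM25 over made-up operation metadata, not the production index; a real build would concatenate the full ID, summary, description, path, tags, and parameter names from OpenAPI specs:

```python
import math
from collections import Counter

K1, B = 1.5, 0.75  # standard BM25 free parameters

# Search text per operation: ID + summary + path + parameter names.
# These three operations are illustrative examples.
OPS = {
    "createMessage": "createMessage Send an SMS message /v1/Messages to body",
    "listMessages": "listMessages List previously sent SMS messages /v1/Messages page",
    "getUserById": "getUserById Retrieve a user by id /v1/Users/{id} id",
}

docs = {name: text.lower().split() for name, text in OPS.items()}
N = len(docs)
avgdl = sum(len(d) for d in docs.values()) / N
df = Counter(t for d in docs.values() for t in set(d))  # document frequency

def bm25(query: str, name: str) -> float:
    """Sum of per-term idf * saturated tf, length-normalized."""
    doc, tf = docs[name], Counter(docs[name])
    score = 0.0
    for term in query.lower().split():
        if term not in tf:
            continue
        idf = math.log(1 + (N - df[term] + 0.5) / (df[term] + 0.5))
        score += idf * tf[term] * (K1 + 1) / (
            tf[term] + K1 * (1 - B + B * len(doc) / avgdl))
    return score

def search(query: str) -> list[str]:
    return sorted(docs, key=lambda n: bm25(query, n), reverse=True)
```

Querying `getUserById` by name ranks it first instantly, while a natural language query with zero lexical overlap ("deliver a notification") scores nothing anywhere: exactly the strength and the gap described above.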
Sentence Embeddings: The Semantic Layer
For the semantic component, we use all-MiniLM-L6-v2 — a 384-dimensional sentence embedding model that runs locally via ONNX runtime. No external API calls required.
Why this model? Three reasons:
Size — ~80MB, loads in ~2 seconds, runs on CPU. No GPU required.
Quality — consistently ranks in the top tier for semantic similarity tasks on MTEB benchmarks at its size class.
Local execution — no per-query API costs, no network latency, no privacy concerns about sending API schemas to third-party embedding services.
Each operation's search text (ID + summary + description + path + tags) is embedded at index build time. At query time, we embed the query and compute cosine similarity against all operation vectors. For typical API specs (50-200 operations), this is a brute-force scan that completes in under 5ms — no approximate nearest neighbor index needed.
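The query-time scan is simple enough to show directly. In this sketch the 384-dimensional vectors are random stand-ins; in the real index they come from all-MiniLM-L6-v2 at build time:

```python
import numpy as np

# Brute-force cosine-similarity scan over pre-normalized operation
# vectors. Vectors here are random stand-ins for real embeddings.

rng = np.random.default_rng(0)
op_names = [f"op_{i}" for i in range(200)]  # typical API spec size
op_vecs = rng.standard_normal((200, 384)).astype(np.float32)
op_vecs /= np.linalg.norm(op_vecs, axis=1, keepdims=True)  # unit length

def top_k(query_vec: np.ndarray, k: int = 5) -> list[str]:
    q = query_vec / np.linalg.norm(query_vec)
    sims = op_vecs @ q            # cosine similarity via dot product
    idx = np.argsort(-sims)[:k]   # full scan + sort: fine at this scale
    return [op_names[i] for i in idx]

# A query vector identical to an operation's vector ranks it first.
hits = top_k(op_vecs[42])
```

At a few hundred vectors the full matrix-vector product is microseconds of work, which is why no approximate nearest neighbor index is needed.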
Reciprocal Rank Fusion: Merging Without Tuning
The two retrieval methods produce ranked lists with incomparable scores — BM25 outputs a relevance score based on term statistics, cosine similarity outputs a value between -1 and 1. You can't naively add them.
Reciprocal Rank Fusion (RRF) sidesteps this entirely by operating on ranks, not scores. Each candidate's fused score is the sum, over every ranked list it appears in, of `1 / (k + rank)`, where `rank` is its 1-based position in that list.
The constant k=60 dampens the weight given to top-ranked results. RRF's practical advantage is that it requires no hyperparameter tuning between the two retrieval methods. No score normalization. No learned weights. An operation that ranks #1 in both lists gets the highest fused score. An operation that ranks #1 in one list and doesn't appear in the other still gets boosted. It just works.
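The whole fusion step fits in a few lines. The ranked lists below are made-up examples:

```python
# Reciprocal Rank Fusion: fuse ranked lists by rank position alone,
# with the conventional smoothing constant k=60.

def rrf(*ranked_lists: list[str], k: int = 60) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, item in enumerate(ranking, start=1):
            scores[item] = scores.get(item, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical output of the two retrieval stages:
bm25_hits = ["getUserById", "listUsers", "findUserByEmail"]
embed_hits = ["findUserByEmail", "getUserById", "updateUser"]
fused = rrf(bm25_hits, embed_hits)
```

`getUserById` (ranked #1 and #2) fuses ahead of `findUserByEmail` (#3 and #1), and `updateUser`, which appears in only one list, still makes the fused ranking.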
Cross-Encoder Reranking: When Precision Is Critical
The retrieval methods above (BM25, embeddings, hybrid) are all bi-encoders: they embed query and document independently and compute similarity between the resulting vectors. Fast and scalable, but they can't model interactions between query terms and document terms.
A cross-encoder processes the query and each candidate together as a single sequence, allowing the model to attend to the relationship between them. This is significantly more accurate — but O(N) in candidates, so it can't scan your entire tool pool. The solution is to run it as a second stage over the top-K candidates from RRF.
We use `mixedbread-ai/mxbai-rerank-xsmall-v1` — an 85MB BERT-style cross-encoder that runs locally via the same ONNX runtime as the embedding model. No additional infrastructure. Latency is ~50-150ms for 40 candidates on CPU, added on top of the hybrid retrieval pass.
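The two-stage shape is the important part, and it can be sketched independently of any model. Here `cross_encoder_score` is a toy word-overlap stand-in for the real mxbai-rerank-xsmall-v1 scorer; the candidate pairs are made-up:

```python
# Second-stage reranking: re-score only the top-K survivors of the
# RRF stage, so the expensive scorer runs O(top_k), never O(all tools).

def cross_encoder_score(query: str, doc: str) -> float:
    """Toy stand-in: fraction of query words present in the doc.
    The real cross-encoder attends over (query, doc) jointly."""
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / max(len(q), 1)

def rerank(query: str, candidates: list[tuple[str, str]],
           top_k: int = 40) -> list[str]:
    """candidates: (name, search_text) pairs from the RRF stage."""
    scored = [(cross_encoder_score(query, text), name)
              for name, text in candidates[:top_k]]
    scored.sort(reverse=True)
    return [name for _, name in scored]

candidates = [
    ("findUserByEmail", "find a user account by email address"),
    ("createMessage", "send an sms text message to a phone number"),
]
best = rerank("send a text message", candidates)
```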
Benchmarking Against 3,673 Real APIs
We benchmarked all four methods — BM25, embeddings, hybrid, and reranked (hybrid + cross-encoder) — against the [Salesforce/xlam-function-calling-60k](https://huggingface.co/datasets/Salesforce/xlam-function-calling-60k) dataset: 60,000 natural language queries mapped to correct function calls across 3,673 real-world APIs spanning 21 categories (finance, social, weather, mapping, etc.). Large-pool examples were synthetically constructed by padding real tool sets with cross-domain distractor tools.
| Tool Pool Size | N | BM25 @1 | BM25 @3 | Embed @1 | Embed @3 | Hybrid @1 | Hybrid @3 | Rerank @1 | Rerank @3 |
|---|---|---|---|---|---|---|---|---|---|
| 2-5 tools | 672 | 96.0% | 99.1% | 97.9% | 100.0% | 97.0% | 100.0% | 97.0% | 99.9% |
| 6-15 tools | 82 | 95.1% | 100.0% | 98.8% | 100.0% | 97.6% | 100.0% | 98.8% | 100.0% |
| 50-200 tools | 80 | 82.5% | 90.0% | 88.8% | 96.3% | 86.3% | 92.5% | 91.3% | 95.0% |
| 200+ tools | 80 | 80.0% | 87.5% | 81.3% | 93.8% | 80.0% | 90.0% | 87.5% | 97.5% |
*@1/@3 = correct tool appears in the top 1 or 3 results. Models: all-MiniLM-L6-v2 (Xenova/all-MiniLM-L6-v2) + mxbai-rerank-xsmall-v1, both via Transformers.js ONNX runtime. N=914 examples (754 original + 160 synthetic). Synthetic large-pool examples use cross-domain distractor tools to simulate enterprise scale.*
Five findings stand out:
Small pools are easy — everything works. Below 15 tools, all four methods hit near-perfect accuracy. This is why tool fatigue is invisible during prototyping — you don't feel it until production scale.
BM25 degrades fastest. From 99% @3 at small pools to 87.5% @3 at 200+ tools. As more tools share vocabulary ("get", "list", "create", "update"), lexical discrimination collapses. The signal-to-noise ratio drops with scale.
Semantic embeddings are the retrieval workhorse. 93.8% @3 at 200+ tools — the embedding space separates "find giveaways for beta access" from "fetch Ethereum blockchain details" regardless of how many tools share the word "fetch". Embeddings consistently outperform BM25 and hybrid at @3, making them the primary driver for recall.
Hybrid trades a little benchmark accuracy for production robustness. At 200+ tools, hybrid @1 is 80.0% versus 81.3% for pure embeddings — embeddings edge it on this natural-language-only benchmark. Hybrid's real advantage shows up on mixed query traffic, where exact-identifier lookups are common and BM25's verbatim matching earns its keep. Its value is resilience across both natural language and identifier queries, not raw single-style accuracy.
Reranking is the largest single accuracy gain. At 200+ tools, the cross-encoder pushes @1 to 87.5% (vs 81.3% embeddings, 80.0% hybrid) and @3 to 97.5% (vs 93.8% embeddings). That means the right tool is in the first three results 97.5% of the time — nearly matching small-pool accuracy even at enterprise scale. The gain costs roughly 100ms of added latency per search query on CPU.
The right choice depends on your latency budget:
Latency-sensitive paths (< 10ms): embeddings only, 93.8% @3
Balanced (< 20ms): hybrid RRF, 90.0% @3 with better exact-identifier robustness
Accuracy-critical paths (< 200ms): hybrid + cross-encoder reranker, 97.5% @3
The Implication for MCP Architecture
Tool fatigue is not a temporary pain point — it's a structural consequence of how MCP works today. Every new integration adds tools to the context window. At some point, the number of tool definitions overwhelms the model's ability to select the right one.
The benchmark quantifies the cliff: somewhere between 15 and 50 tools, retrieval accuracy starts to drop. By 200+ tools, a naive approach loses 1 in 7 queries. In an agent that chains 3-5 tool calls per task, that compounds fast — a 15% per-step error rate means roughly half of multi-step tasks will hit at least one wrong tool selection.
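The compounding arithmetic above is worth making explicit. This treats per-step tool selection as independent and uses the benchmark's @3 recall as a rough proxy for per-step success, both simplifying assumptions:

```python
# Task-level success for an s-step chain, assuming each step's tool
# selection succeeds independently with probability p.

def task_success(per_step_accuracy: float, steps: int) -> float:
    return per_step_accuracy ** steps

naive = task_success(0.85, 4)      # ~0.52: roughly half of 4-step tasks fail
reranked = task_success(0.975, 4)  # ~0.90 with 97.5% @3 retrieval
```

A 15-point difference in per-step accuracy becomes nearly a 40-point difference in whole-task success over four steps. That is why retrieval precision dominates at scale.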
The search-then-execute pattern with hybrid retrieval — and optional cross-encoder reranking — offers a path out:
Constant context cost — two tool definitions regardless of API surface area.
On-demand schema loading — parameter details arrive only when needed.
97.5% recall @3 at 200+ tools — hybrid + reranker; embeddings alone hit 93.8% @3 for latency-sensitive paths.
Local execution — no external API dependency; embeddings + reranker both run via ONNX on CPU.
No tuning required — RRF is rank-based; reranker is a pretrained model, no fine-tuning needed.
Tiered latency — choose accuracy vs. speed: embeddings (~5ms), hybrid (~8ms), hybrid + reranker (~150ms)
The constraint on enterprise AI agents shifts from "how many tools can the model handle" to "how many APIs can we index" — and indexing scales linearly.
If you're building an MCP platform and your tool count is climbing past 50, you'll hit the accuracy wall soon. Natoma's tool search gives your agents the right tool almost every time — without the token tax. No more drowning your model in definitions it doesn't need. No more wrong tool selections compounding across a five-step task. Just 97.5% recall @3 at enterprise scale, constant context cost, and agents that get faster and cheaper as your integration surface grows — not slower and dumber. It's fewer, smarter tools backed by a retrieval layer that makes the right operations discoverable on demand.
—
*Benchmark details: embedding model `Xenova/all-MiniLM-L6-v2`; reranker `mixedbread-ai/mxbai-rerank-xsmall-v1` (top-40 RRF candidates re-scored). Both run locally via [Transformers.js] ONNX runtime — no external API calls.
Dataset: [Salesforce/xlam-function-calling-60k] — 60,000 natural language queries mapped to function calls across 3,673 real-world APIs. Sample: N=914 total (754 original + 160 synthetic large-pool expansions). Bucket sizes: 2-5 tools (N=672), 6-15 tools (N=82), 50-200 tools (N=80), 200+ tools (N=80). Synthetic examples pad real query/tool pairs with cross-domain distractor tools to target pool sizes of 50, 100, 200, and 300 tools. RRF smoothing constant k=60.*
—
Shreyas Karnik is a software engineer at Natoma with a passion for software engineering, artificial intelligence, and cybersecurity. He enjoys exploring new technologies and building tools that improve how developers build, ship, and secure software.


