
Smart Tool Loading: How We Cut LLM Token Costs by 75%

// · 5 min read

When you have 270+ tools available, the naive approach is to dump all of them into the LLM’s system prompt. Here’s every tool, here’s what each one does, now figure out which one you need. This works, technically. But it costs a fortune.

Each tool definition includes a name, description, and parameter schema. Across 270+ tools, that adds up to roughly 33,000 tokens just for the tool definitions. Every single LLM call pays that cost, even if the agent is handling a simple email reply that only needs 3 or 4 tools. At scale, with agents making dozens of LLM calls per task, those 33K tokens per call become the dominant cost in the system.

AgenticMail Enterprise solves this with a three-tier tool loading system that brings the baseline down to around 3,000 tokens. That's a 75% reduction in token costs for most interactions.

Tier 1: Essential Tools (~20 tools, always loaded)

The first tier is a small set of tools that every agent needs regardless of context. These are the fundamentals: reading and sending messages, basic memory operations (store a fact, recall a fact), requesting additional capabilities, and status reporting.

This set is deliberately small. Each tool earns its place by being used in more than 80% of agent interactions across our usage data. If an agent almost always needs it, it’s essential. If it’s needed frequently but not almost always, it belongs in the next tier.

The ~20 essential tools consume about 3,000 tokens of system prompt space. That’s the baseline cost for every LLM call, and it’s manageable.
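A minimal sketch of what the tier model could look like. The `ToolDef` shape, tool names, and token counts here are illustrative, not AgenticMail's actual types:

```typescript
// Illustrative tier model: decide which tool definitions enter the prompt.
type Tier = "essential" | "contextual" | "specialist";

interface ToolDef {
  name: string;
  tier: Tier;
  tokens: number; // estimated token cost of this tool's definition
}

// Essential tools always load; contextual and specialist tools are merged
// in only once some loading logic has activated them by name.
function resolveTools(all: ToolDef[], activated: Set<string>): ToolDef[] {
  return all.filter((t) => t.tier === "essential" || activated.has(t.name));
}

function promptTokenCost(tools: ToolDef[]): number {
  return tools.reduce((sum, t) => sum + t.tokens, 0);
}

// Tiny sample registry (names and token counts are made up).
const registry: ToolDef[] = [
  { name: "send_message", tier: "essential", tokens: 150 },
  { name: "recall_fact", tier: "essential", tokens: 120 },
  { name: "search_inbox", tier: "contextual", tokens: 200 },
  { name: "run_code", tier: "specialist", tokens: 400 },
];

const baseline = resolveTools(registry, new Set());
const withEmail = resolveTools(registry, new Set(["search_inbox"]));
```

Every call pays for `baseline`; anything beyond that has to be earned by a signal or an explicit request.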

Tier 2: Contextual Tools (auto-loaded by channel signals)

The second tier contains tools loaded automatically based on signals from the current interaction. If the conversation mentions “email” or “inbox,” the email management tools get loaded. If someone mentions “schedule” or “meeting,” the calendar tools appear. If the conversation involves a URL, the browser tools get added.

The signal detection is keyword-based but weighted by recency and frequency. A single mention of "email" in a long conversation might not trigger loading, but two mentions in the last few messages will. This prevents false positives from historical context while staying responsive to the current topic.
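One way to implement that weighting is to score each keyword hit by how recent its message is and accumulate across hits, so both frequency and recency contribute. A sketch with a made-up decay scheme and threshold:

```typescript
// Sketch of recency- and frequency-weighted keyword detection.
// The decay scheme and threshold are illustrative, not production values.
function signalScore(messages: string[], keywords: string[]): number {
  const n = messages.length;
  let score = 0;
  messages.forEach((msg, i) => {
    const recency = (i + 1) / n; // last message weighs 1.0, older ones less
    for (const kw of keywords) {
      if (msg.toLowerCase().includes(kw)) score += recency;
    }
  });
  return score;
}

const LOAD_THRESHOLD = 1.5;

function shouldLoadGroup(messages: string[], keywords: string[]): boolean {
  return signalScore(messages, keywords) > LOAD_THRESHOLD;
}

// One old mention buried in history is not enough...
const oldMention = [
  "check my email please",
  "ok, done",
  "what about the report",
  "any updates",
  "let's move on",
];

// ...but two mentions in the last few messages cross the threshold.
const recentMentions = [
  "hi there",
  "sounds good",
  "check my email inbox",
  "reply to that email",
];
```

With these weights the lone early mention scores 0.2, while the two recent ones score 1.75 and trigger loading of the email group.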

Contextual loading typically adds 2,000 to 8,000 tokens depending on how many tool groups are relevant. An agent handling email triage might have the essential tools plus email tools plus calendar tools, totaling maybe 7,000 tokens. Still far less than 33,000.

Tier 3: Specialist Tools (on demand via request_tools)

The third tier contains specialized tools only loaded when an agent explicitly asks for them: advanced data analysis, telephony controls, code execution environments, specialized API integrations.

One of the essential tools in Tier 1 is called request_tools. When an agent realizes it needs a capability it doesn’t have, it calls request_tools with a description of what it needs. The system matches that description against available specialist tools and loads the relevant ones for subsequent calls.

This is a one-time cost per conversation. Once loaded, specialist tools stay available for the remainder of that interaction.
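A hypothetical sketch of the matching step, using simple word overlap between the agent's request and each specialist tool's description. The tool names and the overlap heuristic are illustrative; a production matcher would likely use embeddings or richer scoring:

```typescript
// Hypothetical sketch of request_tools matching: word overlap between the
// agent's free-text request and each specialist tool's description.
interface SpecialistTool {
  name: string;
  description: string;
}

const specialists: SpecialistTool[] = [
  { name: "run_sql", description: "advanced data analysis over sql databases" },
  { name: "place_call", description: "telephony controls for outbound calls" },
  { name: "exec_sandbox", description: "code execution in a sandboxed environment" },
];

function requestTools(query: string, catalog: SpecialistTool[]): string[] {
  const words = new Set(query.toLowerCase().split(/\s+/));
  return catalog
    .filter((t) => t.description.split(/\s+/).some((w) => words.has(w)))
    .map((t) => t.name);
}

const matched = requestTools("I need to run some data analysis", specialists);
```

Whatever `requestTools` returns gets added to the activated set for the rest of the conversation, so subsequent calls see those definitions without asking again.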

The request_tools approach also creates an explicit record of which specialist capabilities each agent uses. If 60% of agents end up requesting the same specialist tool, that’s a signal it should be promoted to the contextual tier.
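That promotion signal can be computed directly from the request log. A sketch, with a hypothetical `promotionCandidates` helper:

```typescript
// Sketch: flag specialist tools requested by more than a threshold share
// of agents as candidates for promotion to the contextual tier.
// promotionCandidates is a hypothetical helper, not AgenticMail's API.
function promotionCandidates(
  requestsByAgent: Map<string, Set<string>>,
  threshold = 0.6
): string[] {
  const counts = new Map<string, number>();
  for (const tools of requestsByAgent.values()) {
    for (const tool of tools) counts.set(tool, (counts.get(tool) ?? 0) + 1);
  }
  const totalAgents = requestsByAgent.size;
  return [...counts.entries()]
    .filter(([, n]) => n / totalAgents > threshold)
    .map(([name]) => name);
}

// Two of three agents requested run_sql (67% > 60%); place_call stays put.
const requestLog = new Map<string, Set<string>>([
  ["agent_a", new Set(["run_sql"])],
  ["agent_b", new Set(["run_sql", "place_call"])],
  ["agent_c", new Set<string>()],
]);

const candidates = promotionCandidates(requestLog);
```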

The Token Math

A typical email triage session:

Without smart loading: 33,000 tokens of tool definitions × 15 LLM calls = 495,000 tokens just for tools.

With smart loading: (3,000 base + 4,000 email tools) × 15 LLM calls = 105,000 tokens for tools.

That’s a 79% reduction for this scenario. The actual savings vary by use case, but 70% to 80% is typical.
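The same arithmetic, spelled out (using the 15-call session from the example above):

```typescript
// The token math from the triage example, as a small calculation.
const CALLS = 15;
const naiveTokens = 33_000 * CALLS;            // every call carries all 270+ definitions
const smartTokens = (3_000 + 4_000) * CALLS;   // essentials + email tools only
const savings = 1 - smartTokens / naiveTokens; // fraction of tool tokens saved
```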

Tradeoffs

Smart tool loading isn't free. There's a cold-start problem: the first LLM call in a new context might not have the right tools loaded yet, requiring a follow-up call after contextual tools are added.

There’s also the risk of missing tools. If the contextual signals don’t fire correctly, an agent might not realize it has access to a capability it needs. The request_tools escape hatch mitigates this, but it requires the agent to know it’s missing something.

Despite these tradeoffs, the cost savings make smart tool loading essential for production. Burning 33K tokens per call on tool definitions that won’t be used is a luxury that doesn’t survive contact with a real budget.

Source Code

The tool resolver header documents the three-tier strategy and the token savings it achieves (its tool counts differ slightly from the figures quoted above):

/**
 * Three tier lazy loading:
 *   TIER 1: ESSENTIAL (always loaded, ~12 tools, ~3K tokens)
 *   TIER 2: CONTEXTUAL (loaded when context signals demand)
 *   TIER 3: SPECIALIST (loaded on demand only, ~100+ tools)
 *
 * TOKEN IMPACT:
 *   Simple chat: ~12 tools (~3K tokens) instead of 98 (~20K) = 85% reduction
 */

View the full source on GitHub
