
Token Efficiency in AI Agent Systems: A Technical Survey and Specification Framework

Comprehensive analysis of token waste in agentic AI systems, demonstrating that knowledge graph-based capability delivery reduces context consumption by 10x compared to flat file scanning, with direct economic implications in a $7.8B market growing at 46% CAGR.

OSSA Research Team · 24 min read


OSSA Technical Report TR-2026-002 (Revised)
Open Standard for Software Agents
March 2026


Abstract. Token consumption is the dominant cost driver in LLM-based agent systems, yet efficiency remains architecturally unaddressed at the protocol and specification layer. In a market projected at $7.8 billion in 2026 with a 46% CAGR, enterprise agent inquiries surging 1,445% between Q1 2024 and Q2 2025, and heavy-usage costs reaching $12,000 per 10-day period, token optimization has moved from academic concern to existential business requirement. This revised report synthesizes empirical findings from 23 peer-reviewed papers (2024-2026), production telemetry from MCP and A2A deployments, serialization benchmarks, and new data on knowledge graph versus raw context injection approaches. We identify four systemic waste categories — serialization overhead (40-70% of context), trajectory accumulation (39.9-59.7% reducible), multi-agent coordination tax (29-50% reducible), and protocol envelope bloat — and demonstrate that knowledge graph-based capability delivery via Qdrant vector search achieves 10x fewer tokens than flat file scanning. We further identify Parkinson's Law as applied to agent systems: unlimited context windows lead to measurably worse performance, paralleling research on students given excessive time on examinations. Our analysis demonstrates that a token-aware agent specification, exemplified by the OSSA manifest, can reduce total pipeline costs by 60-80% while maintaining or improving task performance.

Keywords: token efficiency, agent systems, knowledge graphs, vector search, Parkinson's Law, multi-agent coordination, serialization, context management, MCP, A2A, OSSA, trajectory reduction, prompt compression


1. Introduction

1.1 The Token Paradox in a $7.8 Billion Market

Per-token inference costs have decreased approximately 280-fold between 2023 and 2026 [1]. Claude Opus 4.5 operates at $5/$25 per million input/output tokens; DeepSeek-V3 achieves $0.07/M input with cache hits [2]. Despite this, enterprise AI expenditure is accelerating — cloud computing bills rose 19% in 2025 [3], and Gartner projects 40% of enterprise AI agent pilots will be cancelled by 2027 due to unsustainable costs [4].

The scale of the problem has become concrete. The agentic AI market reached $7.8 billion in 2026, growing at a compound annual growth rate of 46% [27]. Gartner documented a 1,445% surge in enterprise agent-related inquiries between Q1 2024 and Q2 2025 [28], reflecting a transition from experimental to production deployments. Enterprise adoption projections indicate that 40% of enterprise applications will incorporate agentic capabilities by 2028, up from approximately 5% in 2025 [28].

The paradox resolves when examining consumption patterns. As agents transition from single-turn generation to multi-step reasoning with tool-calling loops, token consumption scales quadratically or exponentially with task complexity [5]. On the OpenRouter platform, daily usage of Claude 4 Sonnet reaches 100 billion tokens, of which 99% are input tokens accumulated in agent trajectories — only 1% are newly generated output [6].

1.2 Parkinson's Law Applied to Agent Systems

A counterintuitive finding has emerged from both empirical research and production deployments: agents given larger context windows or unlimited token budgets frequently perform worse than constrained agents. This parallels Parkinson's Law — "work expands to fill the time available" — applied to computational cognition.

The phenomenon manifests in several documented ways:

  1. Attention dilution. Transformer attention mechanisms distribute probability mass across all tokens in the context window. As window size increases, the probability assigned to any individual relevant token decreases, producing the "lost in the middle" effect documented by Liu et al. [29]. Agents with 200K context windows consistently miss critical information positioned in the middle 40% of their context.

  2. Decision paralysis through over-information. Kim et al. [13] demonstrated that multi-agent systems with unconstrained budgets spent disproportionate tokens on coordination overhead without improving task accuracy. When the same systems were given fixed token budgets via CoRL [17], accuracy increased — the constraint forced prioritization.

  3. Trajectory sprawl. Xiao et al. [6] found that agents with large context windows accumulated irrelevant observations (build noise, cache files, ANSI escape codes) without pruning, precisely because the window could accommodate them. Constrained agents were forced to consolidate, producing cleaner decision chains.

  4. The examination analogy. Educational psychology research consistently demonstrates that students given excessive time on examinations perform worse than those given moderate time constraints, due to second-guessing, answer-changing, and cognitive fatigue [30]. Agent systems exhibit identical behavior: unlimited token budgets lead to action repetition, hypothesis cycling, and exploration of dead-end reasoning paths.

Jin et al. [17] formalized this insight with CoRL, demonstrating that reinforcement-learning-controlled token budgets produce agents that surpass the best expert LLM on multi-agent tasks. The constraint is not a limitation — it is an architectural feature. OSSA's budget propagation primitive (Section 7.6) operationalizes this finding at the specification level.

1.3 The Economic Imperative

At current pricing, heavy agentic usage costs approximately $12,000 per 10-day period for enterprise deployments running continuous agent pipelines [31]. Annualized, this represents $438,000 per pipeline — a figure that scales linearly with the number of production agent workflows. For organizations running 10-50 agent pipelines, annual token costs reach $4.4M-$21.9M.

Token optimization is therefore not a marginal efficiency gain but a determinant of whether agentic AI is economically viable at scale. A 60% reduction in token consumption — achievable through the specification-level primitives proposed in this report — translates to $2.6M-$13.1M in annual savings for a mid-scale enterprise deployment.

1.4 Scope and Contributions

This report makes six contributions:

  1. Empirical characterization of token consumption across serialization formats, agent trajectories, multi-agent coordination, and protocol overhead, drawing on published benchmarks and production data.
  2. Taxonomy of waste — a four-category classification of avoidable token consumption in agent systems with quantified reduction potential for each.
  3. Knowledge graph analysis — comparative evaluation of knowledge graph + vector store approaches versus static file scanning for agent capability delivery.
  4. Parkinson's Law formalization — documentation of the counter-productive effects of unlimited context on agent performance.
  5. Specification-level primitives — concrete mechanisms for the OSSA standard that address token efficiency at the contract layer rather than the application layer.
  6. Cost model — a parametric framework for estimating token consumption in composed agent pipelines, validated against published benchmarks.

2. Serialization Overhead

2.1 The Format Tax

Structured data serialization is the single largest source of avoidable token waste in agent systems. A production analysis by The New Stack found that serialization overhead — field names, braces, brackets, quotes, colons, commas — consumes 40-70% of available context tokens in RAG and agent pipelines [7]. This overhead is "pure inefficiency": structural formatting repeated across records that conveys no information the model requires for reasoning.

The overhead is especially acute for agent manifests, tool definitions, and capability registries — all of which are arrays of uniform objects with repeated field names. A single MCP tool definition consumes approximately 150-200 tokens in standard JSON format, of which roughly 60-80 tokens (35-45%) are structural syntax rather than semantic content.

2.2 Empirical Format Comparison

The TOON (Token-Oriented Object Notation) project [8] published the first rigorous cross-format benchmark in November 2025, testing 209 data-retrieval questions across four LLMs (Claude Haiku 4.5, Gemini 2.5 Flash, GPT-5 Nano, Grok 4 Fast) with deterministic validation (type-aware answer comparison, no LLM judge). Token counts were measured with the o200k_base tokenizer.

Table 1: Format comparison on mixed-structure datasets (209 questions, 4 LLMs)

| Format | Mean Tokens | Accuracy (%) | Efficiency Score (acc%/1K tok) | vs JSON (pretty) |
|---|---|---|---|---|
| TOON | 2,744 | 73.9 | 26.9 | -39.6% tokens |
| JSON compact | 3,081 | 70.7 | 22.9 | -32.2% tokens |
| YAML | 3,719 | 69.0 | 18.6 | -18.2% tokens |
| JSON (pretty) | 4,545 | 69.7 | 15.3 | baseline |
| XML | 5,167 | 67.1 | 13.0 | +13.7% tokens |

Key findings:

  • Token reduction correlates with accuracy improvement, not degradation. TOON achieved both the lowest token count and the highest accuracy across all four models. This contradicts the naive assumption that compression necessarily loses information.
  • The efficiency gap is multiplicative, not additive. TOON's 76% efficiency improvement over JSON (26.9 vs 15.3 acc%/1K tokens) compounds across every agent interaction in a pipeline.
  • Format advantage is data-shape-dependent. For uniform arrays of objects (the dominant shape for agent capability lists and tool definitions), TOON achieves 50-60% savings over JSON. For deeply nested hierarchies, compact JSON matches or exceeds TOON [9].

2.3 Per-Model Accuracy Breakdown

Table 2: Format accuracy by model (% correct, 209 questions)

| Format | Claude Haiku 4.5 | Gemini 2.5 Flash | GPT-5 Nano | Grok 4 Fast |
|---|---|---|---|---|
| TOON | 59.8 | 82.8 | 80.4 | 72.7 |
| JSON | 57.4 | 77.5 | 76.1 | 67.9 |
| YAML | 56.0 | 78.9 | 73.7 | 67.5 |
| XML | 55.5 | 75.6 | 71.3 | 66.0 |
| JSON compact | 58.4 | 80.4 | 76.6 | 69.4 |

The accuracy advantage of compact formats is consistent across all models tested, though the magnitude varies. Gemini 2.5 Flash shows the largest accuracy differential (5.3pp between TOON and JSON), suggesting that models with larger context training may be more sensitive to format efficiency.

2.4 Implications for Agent Specifications

Agent manifests, capability registries, and tool definitions are overwhelmingly composed of uniform arrays of objects — exactly the data shape where compact serialization provides maximum benefit. A typical MCP tools/list response containing 50 tools consumes approximately 9,000-10,000 tokens in standard JSON. In a TOON-style compact format with header-once field declaration, the same payload requires approximately 4,500-5,500 tokens — a 40-50% reduction with no information loss.

For OSSA, this implies the specification should:

  1. Define multiple serialization profiles (full, compact, fingerprint) selectable by consumer based on context budget
  2. Specify a canonical compact encoding for capability arrays that declares field names once
  3. Support schema references ($ref) to avoid inlining repeated type definitions
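The header-once saving behind the compact profile can be sanity-checked mechanically. A minimal sketch, using character counts as a rough proxy for tokenizer counts; the tools payload and the TOON-style encoder here are illustrative, not the normative OSSA encoding:

```python
import json

# Illustrative uniform tool records -- the dominant shape for capability lists.
tools = [
    {"name": f"tool_{i}", "kind": "search", "max_results": 10, "safe": True}
    for i in range(50)
]

def toon_style(records):
    """Header-once encoding: declare field names a single time, then
    emit one comma-separated row of values per record."""
    fields = list(records[0].keys())
    header = f"tools[{len(records)}]{{{','.join(fields)}}}:"
    rows = [",".join(str(r[f]) for f in fields) for r in records]
    return "\n".join([header] + rows)

pretty = json.dumps(tools, indent=2)
compact = json.dumps(tools, separators=(",", ":"))
toon = toon_style(tools)

# Character counts as a crude stand-in for tokenizer counts.
for label, text in (("pretty JSON", pretty), ("compact JSON", compact), ("TOON-style", toon)):
    print(f"{label:12s} {len(text):6d} chars")
```

Compact JSON still repeats every field name 50 times; the header-once encoding names each field exactly once, which is where the 40-60% savings on uniform arrays comes from.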

3. Knowledge Graphs vs. Raw Context Injection

3.1 The 90% Waste Problem

The dominant pattern in current agent systems is raw context injection: the agent's capabilities, tool definitions, system prompts, and retrieved knowledge are concatenated into a flat text payload and injected into the context window. Production telemetry from enterprise MCP deployments reveals that agents spend approximately 90% of their context window on knowledge delivery — loading tool definitions, scanning capability files, and processing manifests — before performing any actual reasoning [31].

This is architecturally equivalent to requiring a human expert to re-read their entire resume, job description, and credential portfolio before answering each question. The information exists; the delivery mechanism is catastrophically inefficient.

3.2 Knowledge Graph Architecture for Agent Capability Delivery

A knowledge graph approach restructures agent capability information as a typed, indexed, traversable graph rather than a flat file. In this architecture:

  • Agents are nodes with typed properties (identity, trust tier, version, composition compatibility)
  • Capabilities are edges connecting agents to action types, input/output schemas, and domain categories
  • Trust relationships are weighted edges reflecting verification status, interaction history, and federation agreements
  • Tool definitions are leaf nodes attached to capability edges, loaded only when traversed

Vector embeddings of capability descriptions enable semantic search: an orchestrator can query "find agents that can analyze Python security vulnerabilities" and retrieve the 3-5 most relevant agents with their capability summaries in approximately 200-400 tokens — versus scanning 50 agent manifests at 200-400 tokens each (10,000-20,000 tokens total).
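The retrieval pattern can be illustrated without a vector database. A toy sketch using bag-of-words vectors and cosine similarity in place of learned embeddings; the agent names and descriptions are invented, and a production deployment would use a real embedding model behind Qdrant:

```python
import math
from collections import Counter

agents = {
    "sec-analyzer": "analyze python security vulnerabilities and unsafe code",
    "doc-writer":   "generate markdown documentation from source code",
    "test-runner":  "execute unit tests and report failures",
    "dep-auditor":  "scan python dependencies for known vulnerabilities",
}

def embed(text):
    """Toy embedding: bag-of-words term counts (stand-in for a learned model)."""
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

index = {name: embed(desc) for name, desc in agents.items()}

def top_k(query, k=2):
    """Return the k agents whose capability text best matches the query --
    only these candidates (not all manifests) enter the LLM context."""
    q = embed(query)
    return sorted(index, key=lambda n: cosine(q, index[n]), reverse=True)[:k]

print(top_k("find agents that can analyze python security vulnerabilities"))
```

The orchestrator then loads manifests only for the returned candidates, which is what collapses a 10,000-20,000-token scan into a few hundred tokens.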

3.3 Qdrant Vector Search: 10x Token Efficiency

Empirical benchmarks comparing Qdrant vector search against flat file scanning for agent capability discovery demonstrate a consistent 10x reduction in token consumption [32]:

Table 2a: Capability discovery — vector search vs. flat file scanning

| Method | Tokens per Query | Agents Evaluated | Accuracy (correct routing) | Latency |
|---|---|---|---|---|
| Flat file scan (50 manifests) | 15,000-20,000 | 50 | 78% | 2.1s |
| Qdrant vector search (top-5) | 1,500-2,000 | 50 (indexed) | 91% | 0.3s |
| Qdrant + fingerprint rerank | 1,800-2,400 | 50 (indexed) | 94% | 0.4s |
| Reduction factor | ~10x | — | +16pp | ~5x |

The accuracy improvement is not coincidental. Flat file scanning injects all 50 manifests into the context window, creating attention competition between relevant and irrelevant agent descriptions. Vector search pre-filters to semantically relevant candidates, presenting the LLM with a curated, high-signal subset. This is the Parkinson's Law effect in microcosm: less context produces better decisions.

3.4 Knowledge Graph + OSSA Manifest Synergy

The OSSA manifest format is designed as a token-efficient identity delivery mechanism. A full OSSA manifest consumes 200-400 tokens; a fingerprint profile consumes 15-30 tokens. When combined with a knowledge graph backend:

  1. Indexing phase (offline): OSSA manifests are parsed, capabilities are extracted as graph edges, and vector embeddings are generated for capability descriptions. Cost: zero runtime tokens.
  2. Discovery phase (runtime): Orchestrator issues a semantic query. Qdrant returns top-k capability fingerprints (15-30 tokens each). Total: 75-150 tokens for 5 candidates.
  3. Selection phase (runtime): Orchestrator loads compact profiles for top-3 candidates (60-120 tokens each). Total: 180-360 tokens.
  4. Execution phase: Only the selected agent's full manifest is loaded. Total: 200-400 tokens.

Aggregate discovery cost: 455-910 tokens versus 10,000-20,000 tokens for flat scanning. This is the specification-level solution to the 90% knowledge delivery waste problem.


4. Trajectory Accumulation

4.1 The Compounding Cost Curve

In multi-turn agent systems, the context window at step n contains the full history of all prior steps: system prompt, task description, and every (action, observation) pair from steps 1 through n-1. Because LLM inference cost is proportional to input length (with O(n^2) attention cost ameliorated by KV caching but not eliminated), the total cost of an N-step trajectory is:

C_total = Σ_{n=1}^{N} [ c_input(n) + c_output(n) ]
        ≈ Σ_{n=1}^{N} [ (S + Σ_{k=1}^{n-1} |step_k|) · price_input + |output_n| · price_output ]

where S is the system prompt size, |step_k| is the token length of step k's (action, observation) pair, and |output_n| is the length of step n's newly generated output. This sum grows quadratically in the number of steps for fixed-length steps, and super-quadratically when tool outputs grow over time (e.g., accumulating file edits).

Production data from Trae Agent on SWE-bench Verified: mean accumulated input per issue reaches 1.0M tokens, with 99% being prior context and 1% new generation [6].
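The compounding curve is easy to reproduce numerically. A sketch of the sum above with illustrative sizes (a 2,000-token system prompt and 1,500-token steps are placeholders, not measured values):

```python
def trajectory_input_tokens(n_steps, system_prompt=2_000, step_size=1_500):
    """Total input tokens billed over an N-step trajectory when each step
    re-sends the system prompt plus all prior (action, observation) pairs."""
    total = 0
    for n in range(1, n_steps + 1):
        total += system_prompt + (n - 1) * step_size  # context length at step n
    return total

for n in (10, 20, 40):
    print(n, trajectory_input_tokens(n))
```

Doubling the trajectory from 20 to 40 steps roughly quadruples total input tokens (about 325K to 1.25M with these placeholder sizes) — the quadratic signature; the 40-step figure lands in the same range as the ~1.0M tokens per issue observed in production.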

4.2 Waste Taxonomy

Xiao et al. [6] (AgentDiet, arXiv:2509.23586, September 2025) conducted the first systematic analysis of trajectory waste in coding agents. They identified three categories through manual inspection of 50 agent trajectories on SWE-bench Verified:

Table 3: Trajectory waste categories

| Category | Definition | Example | Prevalence |
|---|---|---|---|
| Useless | Information irrelevant to the task that entered context through tool output | Cache files in find output, make[2]: Entering/Leaving directory build noise, ANSI escape codes | All trajectories |
| Redundant | Information that appears multiple times in the trajectory | Full file contents shown on open, then shown again after str_replace with only a few lines changed | Dominant source; grows quadratically with edit count |
| Expired | Information that has been superseded by later actions | Pre-edit file state after a subsequent edit; test results before the code fix | Grows with task complexity; actively misleading |

4.3 Reduction Methods and Results

Table 4: Trajectory reduction approaches (2025-2026)

| Method | Paper | Benchmark | Token Reduction | Cost Reduction | Accuracy Change | Mechanism |
|---|---|---|---|---|---|---|
| AgentDiet | Xiao et al. (Sep 2025) [6] | SWE-bench Verified, Multi-SWE-bench Flash | 39.9-59.7% | 21.1-35.9% | -1.0% to +2.0% | LLM reflection module compresses past steps; GPT-5 mini as reflector |
| Observation Masking | Lindenbauer et al. (NeurIPS DL4Code, Dec 2025) [10] | SWE-bench Lite | ~50% | ~53% | 0% to +1.0% | Replace old tool outputs with placeholders; preserve reasoning and action history |
| SupervisorAgent | arXiv:2510.26585 (Oct 2025) [5] | GAIA (3 levels) | 29.7% (Smolagent), 39.4% (OAgents) | N/R | Competitive | Runtime supervisor monitors and intervenes to prevent waste, loops, hallucination propagation |
| Focus Agent | arXiv:2601.07190 (Jan 2026) [11] | SWE-bench Lite (5 hard instances) | 22.7% mean, 57% max | N/R | 0% (3/5 = 60% both agents) | Agent autonomously consolidates knowledge into persistent block and prunes raw history |
| DEPO | arXiv:2511.15392 (Nov 2025) [12] | Webshop, BabyAI, GSM8K, MATH, SimulEq | up to 60.9% | N/R | Maintained or improved | Preference optimization with efficiency bonus in loss function |
| LLM Summarization | Lindenbauer et al. [10] | SWE-bench Lite | ~50% | ~50% | -1% to 0% | Separate LLM generates summaries of trajectory segments; higher overhead than masking |

Critical finding: Trajectory reduction does not degrade performance. In multiple studies [6, 10, 11], reducing context improved agent behavior by preventing "lost in the middle" effects and eliminating stale information that caused action repetition. AgentDiet on Gemini 2.5 Pro actually reduced the average number of steps required to solve tasks [6].
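Observation masking, the simplest method in Table 4, fits in a few lines: tool outputs older than a small window are replaced with placeholders while every action and reasoning entry is kept verbatim. A sketch; the message structure is illustrative:

```python
def mask_observations(trajectory, keep_last=2, placeholder="[output elided]"):
    """Replace tool outputs older than the last `keep_last` messages with a
    placeholder; assistant actions and reasoning are preserved verbatim."""
    n = len(trajectory)
    return [
        {**msg, "content": placeholder}
        if msg["role"] == "tool" and i < n - keep_last else msg
        for i, msg in enumerate(trajectory)
    ]

trajectory = [
    {"role": "assistant", "content": "run the test suite"},
    {"role": "tool", "content": "FAILED test_auth.py ... 400 lines of log"},
    {"role": "assistant", "content": "patch auth.py line 42"},
    {"role": "tool", "content": "edit applied; full file contents follow ..."},
    {"role": "assistant", "content": "re-run the tests"},
    {"role": "tool", "content": "all tests passed"},
]
masked = mask_observations(trajectory)
for m in masked:
    print(m["role"], "|", m["content"][:40])
```

The stale test log and superseded file dump (Redundant and Expired waste) disappear; the decision chain survives intact.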

4.4 Cost Structure of Trajectory Reduction

The gap between token reduction (up to 59.7%) and cost reduction (up to 35.9%) in AgentDiet [6] reflects KV cache invalidation. When a token in the trajectory is modified or removed, the KV cache for all subsequent tokens must be recomputed. This creates a trade-off:

  • Early compression (small a parameter): More iterations of compression, higher cumulative savings, but more frequent cache invalidation
  • Late compression (large a parameter): Fewer compressions, lower savings, but better cache utilization

AgentDiet found the optimal balance at a=2, b=1, theta=500 tokens (compress step s-2 at step s, only if the step exceeds 500 tokens) [6].
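That schedule reduces to a small predicate. A sketch with the paper's a and theta parameters; the batching parameter b is omitted, the step sizes are invented, and the variable names are ours:

```python
def should_compress(current_step, candidate_step, step_tokens, a=2, theta=500):
    """At step s, a past step k becomes eligible for compression once
    k <= s - a, and only if it exceeds theta tokens (AgentDiet's b
    batching parameter is not modeled in this sketch)."""
    return candidate_step <= current_step - a and step_tokens[candidate_step] > theta

step_tokens = {1: 1_200, 2: 300, 3: 2_500, 4: 450}  # illustrative step sizes
compressible = [k for k in step_tokens
                if should_compress(current_step=5, candidate_step=k, step_tokens=step_tokens)]
print(compressible)
```

Delaying eligibility by a steps keeps the most recent context (and its KV cache) untouched, which is exactly the cache-invalidation trade-off described above.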

4.5 Implications for OSSA

The trajectory waste problem is fundamentally a specification gap — no agent standard defines how context should be managed between steps or between composed agents. OSSA should specify:

  1. Consolidation hooks in the iterative operator — specification-level definition of what state persists between iterations
  2. Output projections in sequential composition (>>) — typed declarations of which fields propagate to downstream agents
  3. Observation schemas — typed output formats for tool responses that enable structural compression (e.g., diff format instead of full file content after edits)

5. Multi-Agent Coordination Cost

5.1 Scaling Laws

Kim et al. [13] (arXiv:2512.08296, December 2025) conducted the first controlled study of agent system scaling, evaluating 180 configurations across five architectures (Single, Independent, Centralized, Decentralized, Hybrid), three LLM families (OpenAI, Google, Anthropic), and four benchmarks, with matched token budgets to isolate architectural effects.

Key findings relevant to token efficiency:

  • Tool-coordination trade-off: Multi-agent systems fragment the per-agent token budget, leaving insufficient capacity for complex tool orchestration. A 4-agent system with a fixed total budget gives each agent 25% of the token budget, which can be insufficient for tool-heavy tasks.
  • Capability ceiling: When base model performance is already high, coordination overhead becomes a net cost. The token budget consumed by inter-agent messages, shared context, and orchestration reduces the budget available for actual reasoning.
  • Architecture-dependent error amplification: Without validation bottlenecks, errors propagate between agents, consuming additional tokens for correction loops.

5.2 Communication Tax in Multi-Agent SE

Gargari et al. [14] (arXiv:2601.14470, January 2026) introduced the term "Tokenomics" for the empirical study of token distribution in multi-agent SE systems. Using ChatDev with GPT-5 on 30 software development tasks:

  • Total token consumption per task ranged from 50K to 400K+ tokens depending on task complexity
  • The "communication tax" — tokens consumed by inter-agent coordination messages rather than direct task work — represented a significant fraction of total consumption
  • Testing and debugging phases consumed disproportionately more tokens than design and coding phases

5.3 Token Reduction in Multi-Agent Systems

Table 5: Multi-agent efficiency approaches

| Method | Paper | Setting | Token Reduction | Accuracy Impact |
|---|---|---|---|---|
| SupervisorAgent | [5] | Smolagent framework, GAIA | 29.7% mean | Competitive (Pareto improvement) |
| SupervisorAgent | [5] | OAgents framework, GAIA | 39.4% mean, 50.2% max on L1 | Competitive |
| AgentDropout | Wang et al. 2025 [5] | Design-time agent pruning | Architecture-dependent | Maintained |
| SafeSieve | Zhang et al. 2025 [5] | Communication link pruning | Architecture-dependent | Maintained |
| S^2-MAD | Zeng et al. NAACL 2025 [15] | Multi-agent debate sparsification | Significant | Maintained |
| GLM | Huan et al. (Nov 2025) [16] | Graph-CoT reasoning, multi-agent | 95.7% | +38% accuracy |
| CoRL | Jin et al. (Nov 2025) [17] | Centralized multi-LLM, RL-controlled budget | Budget-controllable | Surpasses best expert LLM |
GLM's 95.7% token reduction [16] deserves attention: by decomposing monolithic Graph-CoT prompts into specialized agents with selective context sharing, they simultaneously improved accuracy (+38%) and dramatically reduced tokens. This demonstrates that architectural decomposition with context projection — the same pattern OSSA's composition algebra enables — is the highest-leverage optimization available.


6. Protocol and Envelope Overhead

6.1 MCP Tool Definition Overhead

MCP exposes tools via tools/list, which dumps every tool's name, description, and full JSON Schema input definition into the LLM's context. Anthropic's engineering blog acknowledged this directly: "Once too many servers are connected, tool definitions and results can consume excessive tokens, reducing agent efficiency" [18].

Table 6: MCP tool definition cost (measured with cl100k_base tokenizer)

| Component | Tokens per Tool (typical) | % of Total |
|---|---|---|
| Tool name + description | 30-80 | 20-40% |
| Input schema (properties, types, descriptions) | 80-150 | 50-60% |
| JSON structural syntax (braces, quotes, colons) | 20-40 | 15-25% |
| Total | 130-270 | 100% |

For an agent connected to multiple MCP servers totaling 100 tools: 13,000-27,000 tokens consumed before a single tool call. On a 200K context window, this is 6.5-13.5% of capacity.

6.2 A2A Agent Card Overhead

A2A Agent Cards at /.well-known/agent.json contain skills, authentication schemes, I/O modes, and provider metadata. While not directly injected into LLM context (they're consumed by the A2A client runtime), they impose overhead when agents need to reason about peer capabilities for routing decisions.

6.3 Protocol Optimization Strategies

Table 7: Protocol-level token reduction strategies

| Strategy | Mechanism | Est. Context Savings | Implementation Status |
|---|---|---|---|
| Lazy tool loading | Load definitions on-demand per tool invocation | 60-90% of initial overhead | Anthropic recommendation; not spec-mandated |
| Code execution mode | Agent writes code calling MCP tools instead of individual tool calls through LLM loop | 50-80% per multi-tool workflow | Anthropic engineering blog [18]; requires sandbox |
| Capability fingerprinting | Hash-based IDs; load full definition only on invocation | 80-95% of initial overhead | Novel; no implementation |
| Schema $ref | Reference shared type definitions instead of inlining | 40-70% for tools with common schemas | Supported by JSON Schema; not by MCP spec |
| Compact serialization | TOON/compact-JSON for tool responses | 30-50% per tool call response | No protocol support |
| Response projection | Typed output schemas that return only requested fields | 50-90% per tool response | Application-level only |
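Lazy loading, the first strategy above, inverts MCP's all-at-once default: only names and one-line summaries enter the initial context, and a full schema is fetched when a tool is actually selected. A sketch; the registry contents are illustrative, and MCP does not currently mandate this pattern:

```python
class LazyToolRegistry:
    """Keeps full tool definitions out of the LLM context until invocation."""

    def __init__(self, definitions):
        self._definitions = definitions  # full JSON Schemas stay server-side

    def summaries(self):
        """Cheap listing for the initial context: name plus a one-liner."""
        return [{"name": name, "summary": d["description"][:60]}
                for name, d in self._definitions.items()]

    def load(self, name):
        """Full definition, injected only when the tool is invoked."""
        return self._definitions[name]

registry = LazyToolRegistry({
    "grep_repo": {"description": "Search repository files by regex",
                  "inputSchema": {"type": "object",
                                  "properties": {"pattern": {"type": "string"}}}},
    "read_file": {"description": "Read a file by path",
                  "inputSchema": {"type": "object",
                                  "properties": {"path": {"type": "string"}}}},
})
print(registry.summaries())
```

With 100 tools at 130-270 tokens each, the summary listing replaces 13,000-27,000 tokens of up-front definitions with roughly 1,000-2,000 tokens of names and one-liners.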

7. Prompt Compression Techniques

7.1 Hard Prompt Compression

Hard prompt compression removes tokens from the prompt while preserving the original vocabulary space. Key methods:

  • LLMLingua [19]: Uses self-information from a small LM to score token importance; removes low-information tokens. Achieves up to 20x compression with <5% accuracy loss on downstream tasks.
  • LLMLingua-2 [20]: Data-distillation approach; trains a classifier to predict which tokens to keep using labels from a teacher LLM.
  • CompactPrompt [21] (arXiv:2510.18043, October 2025): End-to-end pipeline combining self-information pruning, n-gram abbreviation, and numerical quantization. Achieves up to 60% total token reduction on TAT-QA and FinQA with <5% accuracy drop across Claude 3.5 Sonnet, GPT-4.1-Mini, and Llama 3.3-70B.
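The self-information criterion behind LLMLingua can be illustrated with a unigram model: tokens that are highly predictable under a background distribution carry little information (low −log p) and are pruned first. A toy sketch; production systems score tokens with a small causal LM rather than corpus counts, and the corpus here is invented:

```python
import math
from collections import Counter

corpus = ("the model reads the prompt and the context and the tools "
          "quarterly revenue grew by seven percent in the third quarter").split()
freq = Counter(corpus)
total = sum(freq.values())

def self_information(token):
    # Unseen tokens get a small floor count (an assumption of this toy model).
    return -math.log2(freq.get(token, 0.5) / total)

def compress(prompt, keep_ratio=0.8):
    """Keep the most informative keep_ratio of tokens, in original order."""
    tokens = prompt.split()
    k = max(1, round(len(tokens) * keep_ratio))
    ranked = sorted(range(len(tokens)),
                    key=lambda i: self_information(tokens[i]), reverse=True)
    return " ".join(tokens[i] for i in sorted(ranked[:k]))

print(compress("the revenue grew by seven percent in the third quarter"))
```

With this corpus the frequent function word "the" scores lowest and is dropped while the content tokens survive; LLMLingua applies the same ranking with LM-derived probabilities and budget-controlled thresholds.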

7.2 Lossless and Context-Aware Compression

Meta-Tokens [22] (arXiv:2506.00307, June 2025) introduced the first lossless prompt compression method, discovering repeated subsequences and replacing them with meta-tokens prepended by a dictionary. LLM-DCP [23] (arXiv:2504.11004, April 2025) models prompt compression as a Markov Decision Process, achieving 12.9x compression with 3.04% improvement in Rouge-2 over prior SOTA. TokenSqueeze [24] (arXiv:2511.13223, November 2025) addresses reasoning model "overthinking" via adaptive reasoning depth selection.


8. OSSA Specification Primitives for Token Efficiency

Based on the empirical evidence in Sections 2-7, we propose seven specification-level primitives for OSSA v0.5.

8.1 Multi-Profile Manifest Serialization

Primitive: Every OSSA manifest MUST support three serialization profiles:

| Profile | Content | Typical Size | Use Case |
|---|---|---|---|
| full | Complete manifest with descriptions, schemas, examples | 200-400 tokens/agent | Documentation, initial discovery |
| compact | Header-once field declaration, abbreviated descriptions, type references | 60-120 tokens/agent | Runtime composition, tool loading |
| fingerprint | URN + type signatures + composition compatibility hash | 15-30 tokens/agent | Routing decisions, capability matching |

The fingerprint profile enables an orchestrator to evaluate 100 agents for routing in ~2,000 tokens instead of ~30,000 tokens.
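A single manifest object can emit all three profiles. A sketch; the field names and hash construction are illustrative rather than the normative OSSA schema:

```python
import hashlib
import json

manifest = {
    "urn": "urn:ossa:agent:sec-analyzer:1.2.0",
    "description": "Static analysis of Python code for security vulnerabilities.",
    "capabilities": [
        {"name": "scan", "input": "python_source", "output": "finding_list",
         "description": "Scan source text and return ranked findings."},
    ],
}

def profile_full(m):
    return json.dumps(m, separators=(",", ":"))

def profile_compact(m):
    """Abbreviated: URN plus name:input->output signatures, no prose."""
    caps = [f'{c["name"]}:{c["input"]}->{c["output"]}' for c in m["capabilities"]]
    return json.dumps({"urn": m["urn"], "caps": caps}, separators=(",", ":"))

def profile_fingerprint(m):
    """URN plus a deterministic hash of the type signatures."""
    sig = "|".join(f'{c["name"]}:{c["input"]}->{c["output"]}' for c in m["capabilities"])
    return f'{m["urn"]}#{hashlib.sha256(sig.encode()).hexdigest()[:12]}'

for fn in (profile_full, profile_compact, profile_fingerprint):
    print(len(fn(manifest)), fn(manifest))
```

Because the fingerprint is derived deterministically from the type signature, two orchestrators compute identical fingerprints without exchanging full manifests — the same property Section 8.5 requires.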

8.2 Output Projection in Composition Operators

Primitive: The sequential composition operator (>>) MUST support typed output projection:

AgentA >> project(["severity", "affected_files"]) >> AgentB

Based on the GLM finding that selective context sharing achieves 95.7% token reduction with +38% accuracy [16], this is the highest-leverage specification primitive.
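In plain Python the projection stage is a one-line filter between agents. A sketch; the >> syntax above is OSSA's, so the stages are modeled here as chained functions, with field names taken from the example:

```python
def project(fields):
    """Composition stage that forwards only the declared fields downstream."""
    def stage(payload):
        return {k: payload[k] for k in fields if k in payload}
    return stage

# Upstream agent output: a few decision-relevant fields plus bulky noise.
agent_a_output = {
    "severity": "high",
    "affected_files": ["auth.py", "session.py"],
    "raw_scan_log": "x" * 5_000,              # bulky log, useless downstream
    "tool_trace": ["grep", "parse", "rank"],
}

downstream_input = project(["severity", "affected_files"])(agent_a_output)
print(downstream_input)
```

AgentB receives two short fields instead of kilobytes of scan log — the selective-sharing mechanism behind GLM's result.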

8.3 Consolidation Strategy in Iterative Composition

Primitive: The iterative operator MUST specify a consolidation strategy with retain, summarize, drop, and accumulate modes, based on Focus Agent's finding that aggressive consolidation achieves 57% token savings without accuracy loss [11].

8.4 Observation Schema Typing

Primitive: Tool outputs MUST declare a typed schema that enables structural compression. File-editing tools SHOULD return diff format rather than full file content; search tools SHOULD return projected results rather than full records.

8.5 Capability Fingerprinting

Primitive: Every OSSA capability MUST have a deterministic fingerprint computed from its type signature, enabling routing decisions without loading full capability descriptions.

8.6 Token Budget Propagation

Primitive: Composition operators MUST support a budget parameter that propagates token constraints. Based on CoRL's demonstration that budget-controlled multi-agent systems surpass unconstrained systems in accuracy [17] — the Parkinson's Law principle operationalized.
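Propagation can be sketched as an allowance that composition operators split among children and decrement as agents spend. The split policy and numbers below are illustrative; OSSA would define the normative semantics:

```python
class TokenBudget:
    """A propagated allowance: operators split it across children,
    agents decrement it, and overruns fail fast."""

    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self, tokens):
        if self.used + tokens > self.limit:
            raise RuntimeError(f"budget exceeded: {self.used + tokens}/{self.limit}")
        self.used += tokens

    def split(self, n):
        """Even split across n parallel branches (one possible policy)."""
        return [TokenBudget(self.limit // n) for _ in range(n)]

pipeline = TokenBudget(10_000)
branches = pipeline.split(4)      # 2,500 tokens per parallel branch
branches[0].spend(2_000)          # within budget
try:
    branches[0].spend(1_000)      # would exceed this branch's allowance
except RuntimeError as err:
    print(err)
```

The hard ceiling is the point: per CoRL, forcing each branch to prioritize within a fixed allowance improves rather than degrades accuracy.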

8.7 Deduplication in Parallel Composition

Primitive: The parallel composition operator MUST deduplicate shared input context across parallel branches.


9. Cost Model

9.1 Worked Example: OSSA-Optimized Pipeline

Table 9: Cost comparison — naive vs OSSA-optimized (per pipeline execution)

| Component | Naive | OSSA-Optimized | Reduction |
|---|---|---|---|
| Manifest loading (4 agents) | 1,200 tokens | 120 tokens (fingerprint) | 90% |
| Tool definitions (47 tools) | 8,500 tokens | 850 tokens (lazy load ~5) | 90% |
| Inter-agent context (3 transfers) | 9,600 tokens | 1,440 tokens (projection) | 85% |
| Execution (4 agents, avg 8 steps) | 120,000 tokens | 60,000 tokens (masking + consolidation) | 50% |
| Iterative refinement (5 iterations) | 75,000 tokens | 30,000 tokens (consolidation) | 60% |
| Total | 214,300 tokens | 92,410 tokens | 56.9% |
| Cost @ $3/M input avg | $0.643 | $0.277 | 56.9% |

At 1,000 executions/day across 10 tenants: $10,980/month saved. At the $12,000/10-day heavy-usage rate, this optimization extends the same budget from 10 days to approximately 23 days — a 2.3x effective cost reduction.
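The bottom-line figures follow from a simple parametric model. A sketch reproducing Table 9's totals; component token counts are those in the table, and the $3/M blended input price is the report's:

```python
PRICE_PER_TOKEN = 3 / 1_000_000  # blended $3 per million input tokens

naive = {"manifests": 1_200, "tools": 8_500, "transfers": 9_600,
         "execution": 120_000, "refinement": 75_000}
optimized = {"manifests": 120, "tools": 850, "transfers": 1_440,
             "execution": 60_000, "refinement": 30_000}

def pipeline_cost(components):
    tokens = sum(components.values())
    return tokens, tokens * PRICE_PER_TOKEN

naive_tokens, naive_cost = pipeline_cost(naive)
opt_tokens, opt_cost = pipeline_cost(optimized)
reduction = 1 - opt_tokens / naive_tokens

print(f"naive: {naive_tokens} tok (${naive_cost:.3f})  "
      f"optimized: {opt_tokens} tok (${opt_cost:.3f})  "
      f"reduction: {reduction:.1%}")
print(f"monthly savings at 1,000 exec/day: "
      f"${(naive_cost - opt_cost) * 1_000 * 30:,.0f}")
```

Substituting a deployment's own component measurements into the two dictionaries yields its pipeline-level reduction directly.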

9.2 Market-Scale Impact

With the agentic AI market at $7.8B in 2026 and token costs representing an estimated 35-50% of total deployment costs [27], the addressable waste is $1.1B-$1.6B annually. OSSA's specification-level optimizations, applied across the ecosystem, could redirect $660M-$960M from waste into productive capability.


10. Conclusion

Token efficiency in agent systems is not an optimization problem — it is an architectural problem. The four waste categories identified in this report (serialization overhead, trajectory accumulation, multi-agent coordination tax, protocol envelope bloat) are all consequences of specification-level decisions: the choice of JSON as the universal format, the absence of output projection in composition operators, the lack of consolidation hooks in iterative execution, and the all-at-once tool loading pattern in MCP.

Two findings from this revised analysis deserve emphasis. First, knowledge graph + vector store architectures deliver agent capabilities in 10x fewer tokens than flat file scanning, with higher accuracy — the OSSA manifest format is designed to serve as the structured input to such systems. Second, Parkinson's Law applies to agent context windows: unlimited tokens produce worse outcomes than constrained budgets, and OSSA's budget propagation primitive operationalizes this insight at the specification level.

In a market growing at 46% CAGR with enterprise costs reaching $12,000 per 10-day period, token optimization is not marginal — it determines whether agentic AI scales. OSSA is uniquely positioned to address this because it operates at the contract layer between transport protocols and application logic. The specification primitives proposed here collectively enable 57-70% cost reduction based on the empirical evidence reviewed.

No existing agent standard (MCP, A2A, ECMA NLIP, Oracle Agent Spec) addresses token efficiency at the specification level. This represents a significant gap in the agent infrastructure stack and a concrete differentiation opportunity for OSSA.


References

[1] Deloitte, "AI Tokens: How to Navigate AI's New Spend Dynamics," Deloitte Insights, January 2026.

[2] Shakudo, "Top 9 Large Language Models as of February 2026," shakudo.io, February 2026.

[3] G. Fitzmaurice, "Cloud spending projected to grow 19% this year on back of strong 2024," IT Pro, February 2025.

[4] Gartner, cited in "AI Agents Surge in 2026 Boom — Token Crisis Threatens Scalability," FinancialContent, January 2026.

[5] "Stop Wasting Your Tokens: Towards Efficient Runtime Multi-Agent Systems (SupervisorAgent)," arXiv:2510.26585, October 2025.

[6] Y.-A. Xiao et al., "Improving the Efficiency of LLM Agent Systems through Trajectory Reduction (AgentDiet)," arXiv:2509.23586, September 2025.

[7] B. C. Gain, "A Guide to Token-Efficient Data Prep for LLM Workloads," The New Stack, December 2025.

[8] TOON (Token-Oriented Object Notation), github.com/toon-format/toon, November 2025. Benchmarks across Claude Haiku 4.5, Gemini 2.5 Flash, GPT-5 Nano, Grok 4 Fast.

[9] F. Francis, "TOON vs. JSON vs. YAML: Token Efficiency Breakdown for LLM," Medium, November 2025.

[10] T. Lindenbauer et al., "Cutting Through the Noise: Smarter Context Management for LLM-Powered Agents," JetBrains Research / TUM, presented at NeurIPS 2025 Deep Learning 4 Code Workshop, December 2025.

[11] "Active Context Compression: Autonomous Memory Management in LLM Agents (Focus Agent)," arXiv:2601.07190, January 2026.

[12] "DEPO: Dual-Efficiency Preference Optimization for LLM Agents," arXiv:2511.15392, November 2025.

[13] Y. Kim et al., "Towards a Science of Scaling Agent Systems," arXiv:2512.08296, December 2025.

[14] Gargari et al., "Tokenomics: Quantifying Where Tokens Are Used in Agentic Software Engineering," arXiv:2601.14470, January 2026.

[15] Y. Zeng et al., "S^2-MAD: Breaking the Token Barrier to Enhance Multi-Agent Debate Efficiency," NAACL 2025.

[16] C. Huan et al., "Scaling Graph Chain-of-Thought Reasoning: A Multi-Agent Framework with Efficient LLM Serving (GLM)," arXiv:2511.01633, November 2025.

[17] B. Jin et al., "Controlling Performance and Budget of a Centralized Multi-agent LLM System with Reinforcement Learning (CoRL)," arXiv:2511.02755, November 2025.

[18] Anthropic, "Code Execution with MCP: Building More Efficient AI Agents," anthropic.com/engineering, 2026.

[19] H. Jiang et al., "LLMLingua: Compressing Prompts for Accelerated Inference of Large Language Models," EMNLP 2023.

[20] Z. Pan et al., "LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression," arXiv:2403.12968, 2024.

[21] "CompactPrompt: A Unified Pipeline for Prompt and Data Compression in LLM Workflows," arXiv:2510.18043, October 2025.

[22] "Lossless Token Sequence Compression via Meta-Tokens," arXiv:2506.00307, June 2025.

[23] "Dynamic Compressing Prompts for Efficient Inference of Large Language Models (LLM-DCP)," arXiv:2504.11004, April 2025.

[24] "TokenSqueeze: Performance-Preserving Compression for Reasoning LLMs," arXiv:2511.13223, November 2025.

[25] K. Varda, "Cap'n Proto: Insanely Fast Data Serialization," capnproto.org.

[26] Chen et al., "Token Reduction Should Go Beyond Efficiency in Generative Models — From Vision, Language to Multimodality," arXiv:2505.18227, May 2025.

[27] Markets and Markets, "Agentic AI Market Size, Share & Industry Trends Analysis Report," 2026. Projected $7.8B (2026) to $52.0B (2030) at 46.0% CAGR.

[28] Gartner, "Emerging Technology: Techscape for Agentic AI," Q3 2025. Documents 1,445% surge in enterprise inquiries Q1 2024 to Q2 2025; projects 5% to 40% enterprise application penetration by 2028.

[29] N. F. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024.

[30] P. Ackerman and R. Kanfer, "Test Length and Cognitive Fatigue: An Empirical Examination of Effects on Performance and Test-Taker Reactions," Journal of Experimental Psychology: Applied, 2009.

[31] Enterprise deployment cost data aggregated from Anthropic Claude Team billing, OpenAI API usage reports, and production MCP server telemetry, January-February 2026.

[32] Qdrant benchmarks on OSSA manifest corpus, internal evaluation, February 2026. 50 OSSA manifests indexed with capability embeddings; queries evaluated against ground-truth routing decisions.

[33] StepShield framework, arXiv:2601.22136, January 2026. Demonstrates step-level safety verification in multi-agent pipelines with token-efficient checkpoint validation.


Document version: 2.0.0 | OSSA v0.4.1 | openstandardagents.org
