Sebastian Hanke

Introducing PULSE: A Post-Scrum Manifesto for AI-Native Teams

Sebastian Hanke — Fri, 05 Jun 2026 20:55:11 GMT

The question nobody answers

John Cutler recently named something I keep running into: the gap between Single Player and Multiplayer AI development. Single-player is a productivity problem. One human, one agent. You optimize your prompts, your workflow, your toolchain. Done.

Multiplayer is different. Multiple people, each working with their own agents on the same product. That is a coordination problem, and it is unsolved.

Cutler named the tension. The spec-driven development crowd described half of an answer. Johann-Peter Hartmann diagnosed what goes wrong when you ignore it. Kent Beck reminded us that discipline matters more now, not less. Good observations, all of them. But none of it adds up to something you can use on Monday morning.

I work with development teams at a German energy company. Teams of 2-3 developers build an internal product using coding agents like GitHub Copilot or Claude Code. Each developer is fast. Genuinely fast. Any one of them can ship a feature before we finish discussing whether we should build it.

That is the problem.

The seduction of speed

A developer recently asked me:

“It takes longer to talk about a business analysis that later results in features than just to implement the features. Why would I spend time defining requirements when the feature is already built by the time we finish discussing it?”

He is right about the speed. Writing code costs almost nothing now. But I have watched what happens when that speed runs without guardrails:

Features contradict each other because nobody checked the product vision
The codebase ends up with three implementations of the same thing
Users get overwhelmed by functionality they never asked for
Architecture decisions get made implicitly, by whoever codes first

The developer who builds five features while the team discusses one feels productive. But if three of those features miss the product vision, two duplicate existing work, and one breaks something else, the net value is negative.

Martin Fowler, citing the DORA Report, put it plainly: AI amplifies whatever already exists in your pipeline. No structure means AI amplifies the chaos. Hartmann calls the result a Bullshit Factory: high autonomy, weak controls, impressive output volume, mostly unusable.

What has not changed

The principles of product development we built over decades still hold. Design Thinking still works. Lean Startup still works. The questions remain the same:

Who are we building for?
What problem are we solving?
Why this, and not something else?
What should we explicitly not build?

In my years as a Product Owner, I always had more ideas than my team could implement. That constraint forced discipline. It forced me to think about what actually mattered. If we drop that discipline now, simply because coding is cheap, products will lose quality and coherence.

The principles have not changed. The cycle time has. And that changes everything about how we coordinate.

Where Scrum breaks

Scrum was always about one thing: helping teams figure out the right thing to build, together. That idea still holds.

But when a human-agent pair ships a feature in hours, classical Scrum becomes absurd. Sprint Planning is a two-hour estimation theater for work that takes an afternoon. The Daily Standup is a status report for work that moved three times since yesterday. Refinement is a bureaucratic exercise for tickets that are done by the time you discuss them.

Scrum is not wrong. It is too slow. And the alternatives do not fit either. Kanban has no rhythm. Shape Up has six-week cycles. AI-native teams need something else.

The Manifesto

In the spirit of the original Agile Manifesto, which worked because it stated values, not procedures:

We are discovering better ways of building software with AI agents, by doing it and sharing what we learn.

Through this work we have come to value:

Shared artifacts over status meetings.
Team coherence over individual speed.
Conscious filters over unlimited backlogs.
Structural governance over approval gates.

Here is what each of these means in practice.

Shared artifacts over status meetings. Your coding agent has no memory between sessions. Your teammate does not know what you built yesterday. Written artifacts are the only synchronization that works. A spec, an ADR, a backlog entry that both humans and agents can read. Not a standup. Not a Teams message.

Team coherence over individual speed. Every developer on my team ships features fast. But speed without direction is drift. A team that ships five coherent features beats a team where three developers each ship five uncoordinated ones.

Conscious filters over unlimited backlogs. When implementation is cheap, every idea feels worth building. It is not. PULSE prevents ideas from becoming work without passing through a deliberate filter. That filter separates a product team from a feature factory.

Structural governance over approval gates. Nobody wants to wait for someone to approve a PR at 11pm. But “no governance” is worse. The answer is governance baked into structure: automated quality gates, fitness functions, conventions that agents and humans follow equally. The rules live in the code and the specs, not in someone’s calendar.

Introducing PULSE

PULSE is not a framework. It is a rhythm.

A pulse is regular, lightweight, essential. It sets the tempo without constraining movement. PULSE gives a team a beat without slowing down the human-agent pairs that do the actual work.

The core principles

Artifacts before ceremonies. What is written down counts. Meetings are synchronization points, not production facilities.
Machine-readable = Team-readable. Every artifact a coding agent needs, a teammate who lacks context also needs. Specs, ADRs, acceptance criteria serve both audiences.
Autonomy with visibility. Every human-agent pair works autonomously. But every piece of work is visible at all times, through shared artifacts, not through status meetings.
Fast cycles, conscious filters. Ideas may emerge quickly. They must pass through a deliberate filter before they become work.
Governance through structure, not approval. Fitness functions, automated quality gates, clear conventions. Not people waiting to sign off.

Three layers, three tempos

This is what makes PULSE structurally different from Scrum, Kanban, or Shape Up. Instead of one cadence for everything, PULSE runs three tempos at once:

Execution Layer, hourly. Each human-agent pair works autonomously. Pick a feature from the shared backlog, spec it, implement it, test it, create a PR. In hours, not weeks. One constraint: nobody works on something that is not in the backlog. Spontaneous ideas go to the Ideas channel, not directly into code.

Coordination Layer, daily. The team synchronizes through lightweight rituals:

Async Status (daily, 2 min writing): What is done, what is next, any blockers. Structured, not prose. Machine-readable enough that a narrator agent could generate a team summary.
Sync Call (2x/week, 30 min): Not “what did you do,” that is in the async status. Instead: resolve blockers, review ideas from the Ideas channel, adjust priorities.
Code Reviews (continuous, async): PRs reviewed within 4 hours. Focus on architecture conformance, not syntax. The agent handles syntax.

Total meeting overhead: max 2.5 hours per week.
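To make "structured, not prose" concrete, here is a minimal sketch of what an Async Status entry could look like as data, and how a narrator agent might fold several entries into one team summary. The class, its field names, and the summary layout are illustrative assumptions; PULSE itself only asks for "done, next, blockers" in a machine-readable form.

```python
import json
from dataclasses import dataclass, field


@dataclass
class AsyncStatus:
    """One developer's daily status. Field names are assumptions for illustration."""
    author: str
    done: list[str] = field(default_factory=list)
    up_next: list[str] = field(default_factory=list)
    blockers: list[str] = field(default_factory=list)


def team_summary(statuses: list[AsyncStatus]) -> dict:
    """Merge individual statuses into one structure a narrator agent could read."""
    return {
        "done": [f"{s.author}: {item}" for s in statuses for item in s.done],
        "up_next": [f"{s.author}: {item}" for s in statuses for item in s.up_next],
        # Only people who actually reported a blocker appear here.
        "blocked": {s.author: s.blockers for s in statuses if s.blockers},
    }


if __name__ == "__main__":
    today = [
        AsyncStatus("ana", done=["FEATURE-001-002"], up_next=["FEATURE-001-003"]),
        AsyncStatus("ben", blockers=["waiting on ADR-004"]),
    ]
    print(json.dumps(team_summary(today), indent=2))
```

The design choice that matters is the fixed set of fields: the Sync Call can then open directly on the "blocked" section instead of replaying everyone's day.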

Product Layer, weekly. Direction, not tickets.

Direction Session (every 2 weeks, 90 min): Replaces Sprint Planning, Sprint Review, and Retrospective. Focus: What 3-5 outcomes do we want in the next two weeks? What did we learn? What needs to change?
Roadmap Review (monthly, 60 min): Does the roadmap still match the Business Analysis?

These three tempos are nested. The execution layer runs continuously inside the coordination layer, which runs inside the product layer. A developer might ship three features in a single day while the product direction stays stable for two weeks.

The Shared Context Layer

This is where PULSE lives or dies. Coding agents have no memory between sessions. Teammates do not know each other’s context. Shared living artifacts are the only synchronization that works.

The Digital Innovation Agents V-Model Workflow defines the complete artifact structure. Apart from the two agent memory files at the top of the tree, all shared context lives under _devprocess/ (gitignored in public repos, keeping internal artifacts private):

CLAUDE.md              ← Agent memory (loaded every session)
memory/MEMORY.md       ← Claude Code memory

_devprocess/
├── analysis/
│   ├── BA-{PROJECT}.md                ← Business Analysis
│   └── security/
│       └── AUDIT-{PROJECT}-{DATE}.md  ← Security audit
├── requirements/
│   ├── epics/
│   │   └── EPIC-{NNN}-{slug}.md       ← Epic descriptions
│   ├── features/
│   │   └── FEATURE-{EPIC}-{NNN}.md    ← Feature specs (Gherkin)
│   └── handoff/
│       ├── architect-handoff.md       ← RE → Architect
│       └── plan-context.md            ← Architect → Coder
├── architecture/
│   ├── ADR-{NNN}-{slug}.md            ← Decision Records (MADR)
│   └── arc42.md                       ← arc42 (single file)
└── context/
    ├── 10_backlog.md                  ← Backlog (source of truth)
    ├── 20_bugs.md                     ← Bug log
    └── 30_handoffs.md                 ← Phase handoff log

Every artifact serves two audiences: humans who need alignment, and agents who need context.
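A structure like this can itself act as a small fitness function. The sketch below checks file names against the placeholders in the tree. The concrete patterns are assumptions, since the tree only gives the placeholders: here {NNN} is three digits, {EPIC} is a numeric epic id, and {slug} is lowercase words joined by hyphens. The function name and rule table are hypothetical.

```python
import re

# Assumed conventions (not stated in the tree itself): three-digit NNN,
# numeric EPIC id, lowercase hyphenated slug.
SLUG = r"[a-z0-9]+(-[a-z0-9]+)*"
NAMING_RULES = {
    "requirements/epics/": re.compile(rf"EPIC-\d{{3}}-{SLUG}\.md"),
    "requirements/features/": re.compile(r"FEATURE-\d+-\d{3}\.md"),
    "architecture/": re.compile(rf"(ADR-\d{{3}}-{SLUG}|arc42)\.md"),
}


def naming_violations(relative_paths: list[str]) -> list[str]:
    """Return every path in a governed folder whose file name breaks its rule.

    Paths are relative to _devprocess/ and use forward slashes.
    Folders without a rule are ignored.
    """
    violations = []
    for path in relative_paths:
        folder, _, name = path.rpartition("/")
        rule = NAMING_RULES.get(folder + "/")
        if rule is not None and rule.fullmatch(name) is None:
            violations.append(path)
    return violations


if __name__ == "__main__":
    print(naming_violations([
        "requirements/features/FEATURE-001-002.md",
        "architecture/ADR-7-caching.md",
    ]))
```

Run in CI, a check like this applies the same rule to every human-agent pair without anyone having to approve anything.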

The root-level CLAUDE.md is the most important file. It is the institutional memory that every agent loads at the start of every session -- conventions, architecture decisions, coding standards, project context. It makes the team’s output consistent regardless of which human-agent pair produced it. memory/MEMORY.md is its counterpart: the running memory that Claude Code maintains across sessions.

The handoff documents are the connective tissue between phases:

architect-handoff.md: The Requirements Engineer creates this as a structured summary for the Architect -- all epics, features, NFRs, constraints, and open questions in one place. Without it, the Architect has to piece together context from dozens of FEATURE-*.md files.
plan-context.md: The Architect creates this as the single entry point for the Coder. It summarizes all ADRs, tech stack decisions, task breakdowns, and conventions. The Coding agent loads this first and writes back to it when implementation decisions diverge from the plan. This bidirectional flow is what keeps architecture and code in sync.
arc42.md: Architecture documentation following the arc42 template -- context, building blocks, runtime, deployment, crosscutting concepts, decisions, quality, risks, glossary. One file, all sections. The long-term architecture reference that survives individual iterations.
30_handoffs.md: An append-only log of every phase transition. When the Requirements Engineer hands off to the Architect, or the Architect to the Coder, the handoff is recorded here. This gives the team a timeline of decisions and transitions.
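As an illustration of the append-only idea behind 30_handoffs.md, here is a minimal sketch of a helper that records one phase transition per line. The function name, the line layout, and the UTC timestamp format are assumptions for this example; the workflow's actual entry format may differ.

```python
from datetime import datetime, timezone
from pathlib import Path


def record_handoff(log_path: Path, source: str, target: str,
                   artifact: str, note: str = "") -> str:
    """Append one phase transition to the log and return the line written."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    line = f"- {stamp} | {source} -> {target} | {artifact}"
    if note:
        line += f" | {note}"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    # Mode "a" only ever adds to the end of the file, so earlier
    # entries are never rewritten by this helper.
    with log_path.open("a", encoding="utf-8") as log:
        log.write(line + "\n")
    return line


if __name__ == "__main__":
    log = Path("_devprocess/context/30_handoffs.md")
    print(record_handoff(log, "Requirements Engineer", "Architect",
                         "architect-handoff.md"))
```

Because entries are only ever added, the file doubles as the timeline of decisions the text describes, and merge conflicts stay limited to the last lines.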

The conscious filter

The most important part of PULSE is what it prevents.

When implementation is cheap, the temptation is to turn every idea into a feature immediately. PULSE creates a deliberate pipeline:

Ideas channel: free-flowing impulses. Anyone can post anything. Feature ideas, bug reports, user feedback, half-baked thoughts.

Sync Call filter: the team evaluates each idea against the Business Analysis, checks it against the roadmap, and decides. Does this become a feature in the backlog, or not?

Backlog: only filtered, prioritized work. No human-agent pair works on anything that has not passed through this filter.

This pipeline separates a product team from a Bullshit Factory. The filter is lightweight. It happens in 10 minutes during the Sync Call. But it exists. And its existence keeps the product coherent.

Communication channels

PULSE defines four channels. Teams, Slack, Discord, whatever your team uses:

If it is in Ideas, it is an impulse. Only when it passes the Sync Call filter and lands in the backlog does it become work.

Roles: hats, not positions

In a 3-4 person team, these are hats, not dedicated roles:

The person wearing the Product Direction hat still works as a human-agent pair and ships features. These are responsibilities, not full-time jobs.

The V-Model Workflow: the engine inside PULSE

PULSE defines how the team collaborates. What happens inside each human-agent pair is a separate question. That is where the Digital Innovation Agents V-Model Workflow comes in, an open-source skill package that guides each developer through:

Business Analysis → Requirements Engineering → Architecture → Coding → Testing → Security Audit → Release. Each phase has specialized agent skills, quality gates between phases, human-in-the-loop checkpoints, and mandatory read/write to the Shared Context Layer.

The handoff chain matters here. The Requirements Engineer produces architect-handoff.md for the Architect. The Architect produces plan-context.md for the Coder. The Coder writes back to plan-context.md when reality diverges from the plan. Every transition is logged in 30_handoffs.md. Every phase reads from and writes to the same shared artifacts under _devprocess/.

This is what creates consistency across the team. Not meetings. Not ceremonies. A shared process that produces shared artifacts.

The Business Analysis as North Star

One of the most contentious points in my team. Developers who think in feature releases see the BA as overhead. But without it, the product keeps pivoting because nobody is clear on who they are building for and why.

The BA does not need to be a 50-page document. In PULSE, the team reviews and understands it together before the first line of code is written. The full BA exists for depth and alignment.

When you can add features in hours, the ability to say “no” becomes more important than the ability to say “yes.” The BA gives you the basis for that “no.”

What I am not saying

Scrum was not wrong. It solved a real problem: how to help teams collaborate and build the right thing. That problem has gotten harder, not easier.

Planning is not unnecessary. But the ratio has shifted. When implementation takes hours instead of weeks, the plan needs to be lighter and more precise.

Individual productivity matters. But individual productivity without team coherence produces waste.

The invitation

PULSE is not finished. It is a starting point, built from working with a real team on a real product using AI coding agents every day.

The Single Player question is answered. The Multiplayer question is wide open. If you feel the same tension, individual speed pulling against team coherence, shipping fast pulling against shipping right, I want to hear from you.

Coding is solved. Collaboration is not.

Stigmergy for capability selection in LLM agent loops

Sebastian Hanke — Wed, 03 Jun 2026 22:59:52 GMT

Abstract

An LLM agent carries the definitions of its capabilities (function-calling tools, tools exposed over the Model Context Protocol, and procedural skills) in its context, and selects among them with a step that does not change with experience. Those definitions are expensive on their own: the agent keeps every one of them in context on every step and pays for them whether or not it uses them. Anthropic, the maker of one widely used agent platform, found that the descriptions of 58 tools fill about 55,000 tokens of context before the user has asked a single question, and a community report describes a real setup above 140,000 tokens of tool descriptions. The selection itself degrades as the catalogue grows, and, more importantly, the agent learns nothing from the outcome of past use. The cost concentrates in the failed exploration that precedes the one successful call: the wrong tool loaded, the wrong parameters passed, the retry. None of that feeds back into the next decision.

This paper proposes a thin coordination layer that sits beneath the existing retrieval and loading machinery and adds the one thing it lacks: a loop that reinforces what works and lets unused choices decay. The layer borrows its mechanics from stigmergy, the indirect coordination that lets ant and termite colonies solve hard collective problems without a plan or a controller. I model capability sequences as a directed graph whose edges carry a pheromone value, reinforce edges that lead to accepted outcomes, let unused edges decay, and explore under an explicit token budget. Selection is deterministic given a seed, computed entirely on the user’s machine, with no network egress.

The token argument needs one refinement that prompt caching forces. Hosts now routinely keep the tool definitions in a cached prompt prefix and reach high cache-hit rates, so a stable tool block is cheap. Any per-step change to it, whether narrowing the visible set or reordering it, breaks the cache and pays the full input cost again. Narrowing the standing set is therefore no longer a clear win. The cache-neutral lever is not narrowing but path learning as guidance: keep every capability visible so the cache stays warm, and give the model a learned path instead, the proven sequence from task to solution that matches the task, as a small per-turn context after the cache breakpoint. That cuts the failed exploration before the one correct call and the turns it costs, which is the real saving. Narrowing the visible set stays available as an optional, cache-aware optimisation for very large catalogues that no longer fit the context, not as the default.

I make the design and its derivation precise, relate it to recent work on pheromone-based LLM reasoning, tool selection, and ensemble diversity, and specify a falsifiable evaluation protocol with stated pivot conditions. The system is implemented and I have run a security review against its design; the controlled empirical study that would confirm the central token-reduction hypothesis is specified here and remains to be run. I give one preliminary, self-run data point that shows a large token reduction but also a cold-start success regression, and I treat it as a mechanism check rather than evidence.

1. Introduction

A modern agent is a loop. It receives a request, decides which capability to use, invokes it, observes the result, and repeats until it can answer. Over the last two years the menu of capabilities has grown along three lines that arrived separately but are used together: classic function-calling tools; tools exposed over the Model Context Protocol (MCP), an open wire protocol that lets an agent host connect to external tool servers in a standard way; and skills, which are procedural playbooks loaded on demand. In practice a single agent stacks all three.

Each addition carries a cost that the agent pays on every step, whether or not the capability is used. The cost has two parts. The first is the weight of the definitions themselves. A capability has to be described to the model, and those descriptions occupy context. Anthropic’s own platform measurement reports 58 tools at approximately 55,000 tokens before the conversation even begins, and tool definitions consuming 134,000 tokens before optimisation [Anthropic 2025a]. A single community field report describes a still larger catalogue [claude-code #12241]. The second part is the cost of getting the choice wrong. As the catalogue grows, selection accuracy falls, and a wrong choice is not free: it is a loaded definition that contributed nothing, a failed call, a retry, an entire round trip spent on a dead end.

The decisive observation is that today’s selection is stateless. Retrieval-augmented tool selection, MCP routers, deferred tool loading, and progressive skill disclosure all reduce the standing cost, and they work. But once configured they are fixed. The outcome of a past call does not change what surfaces next time. The system retrieves; it does not learn, and it does not forget.

Nature solved a structurally similar problem long ago. An ant colony finds short paths to food with no map, no leader, and no individual that understands the problem. It coordinates through traces left in a shared, decaying medium. This principle is called stigmergy. Its computational core, reinforcement of what works plus evaporation of what does not plus a little randomness, is exactly the feedback loop that stateless capability selection lacks.

This paper develops that transfer carefully and to a level of technical detail appropriate for an academic venue. My contributions are:

A problem framing that unifies tools, MCP tools, and skills as a single question: which capability, with which parameters, should be made visible at this step, given what has worked before (Section 3).
A faithful transfer of foraging stigmergy to capability selection, including the points where the analogy breaks and what must change because exploration in an agent costs tokens, unlike a free walk for an ant (Section 4).
A precise system design: a path-graph substrate (the stored graph that holds the learned field) with a multiplicative desirability score, error-aware selection by Thompson sampling over decayed evidence, a deposit gated on how good and how cheap the outcome was, lazy exponential decay, and a delayed acceptance signal read from what the user does next (Section 6).
An architecture for trust and portability: a hexagonal core behind two ports, a proactive pre-call step inside the controlled loop rather than a reactive hook (so the layer can surface learned-path guidance ahead of the model while the cached tool block stays warm), full determinism, and a strict local-first boundary (Section 7).
A falsifiable evaluation protocol with hypotheses, ablation axes (configuration switches turned off one at a time to measure each part’s contribution), baselines, and pivot conditions, reported honestly as a design paper whose controlled study is pending (Section 8).

The accompanying open-source release carries the product name. In this paper I describe the system by its mechanism rather than that name, because the mechanism is what drops beneath any agent loop, and the mechanism is the contribution.

2. Background: stigmergy and ant colony optimization

2.1 The term and the idea

The French zoologist Pierre-Paul Grassé introduced the word stigmergie in 1959 while studying how termites rebuild their nests [Grassé 1959]. No termite holds a blueprint, none instructs another, none has an overview, and yet a structured nest emerges. Grassé built the term from the Greek stigma (mark, sting) and ergon (work): the work that directs through marks. The whole idea fits in one sentence. The state of the work guides the next action. A termite that finds a half-built pillar builds on it. It does not consult its predecessor; it reads the predecessor’s trace in the world, and that predecessor may be long dead.

Stigmergy replaces direct communication with indirect coordination through a persistent medium. One actor changes the shared environment; another reads the change later. Two properties follow, and both are the source of its resilience [Theraulaz and Bonabeau 1999]. Coordination is decoupled in time: the writer and the reader need not be active at once. And it is decoupled from identity: traces are anonymous and interchangeable, so the loss of any individual changes nothing. Half a colony can fail and the work continues, because no trace is bound to a particular individual.

2.2 Two kinds of trace

Nature uses two variants of the principle, and the distinction matters for the transfer.

In termite construction the signal is the structure itself. A started wall tells the next termite to continue there. There is no separate communication channel, only the result of work so far, and different build states trigger different next actions. This is sematectonic stigmergy: the form of the work carries the information.

Ant foraging works differently. Ants deposit pheromones, chemical markers, along the paths they walk. The trail is its own sign: it stands apart from the food and the path and only points toward the food. This is marker-based stigmergy, and it acts in proportion to quantity: more pheromone produces a stronger response, so a higher chance that an ant follows the trail [Goss et al. 1989, Deneubourg et al. 1990]. The poverty of the signal is the point. A pheromone trail carries almost no information, essentially concentration and direction. That sparsity is what makes the system resilient: it forces decentralisation and it scales almost for free with the number of participants. The intelligence sits in the protocol and the environment, not in the individuals.

2.3 How traces become computation

Leaving and reading a trace is not yet computation. Three forces, working together, turn stigmergy into an optimisation procedure. They show most clearly in the double-bridge experiment of Deneubourg, Goss, and colleagues in the late 1980s [Goss et al. 1989, Deneubourg et al. 1990]. Connect a nest and a food source by two bridges, one short and one long. At first ants choose at random. Whoever takes the short bridge returns sooner and lays pheromone per unit time at a higher rate. The short bridge therefore accumulates trail faster, becomes more attractive, draws more ants, and accumulates faster still. The colony converges on the shorter path even though no single ant ever compared the two or understood what short and long mean.

Three forces produce this:

Positive feedback (autocatalysis). The more a path is used, the stronger its trail, and the more it is used. A good solution crystallises.
Negative feedback (evaporation). Pheromone decays over time. This sounds like a flaw and is the most important mechanism of the three. Without evaporation the first path found would burn in forever, and the system could never react to a better solution or a changed world. Evaporation is forgetting, and forgetting is adaptability.
Stochasticity. Ants do not follow the strongest trail slavishly; they sometimes deviate. That deviation is exploration. It is the reason a shortcut is found at all.

The point that matters is that the medium computes. Evaporation integrates over time; accumulation averages over many individuals. The colony solves an optimisation problem that no individual solves. The solution lives in the field, not in the ants. The ant dies, the trail remains. This separation of a short-lived actor from a long-lived, decaying medium is the core that transfers.
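The double-bridge dynamics can be sketched in a few lines. This is a deliberately simplified mean-field model, not the original experiment's model: it replaces individual random choices with expected proportions so the run is reproducible, which means it shows positive feedback and evaporation but not the third force, stochasticity. The bridge lengths, evaporation rate, and linear choice rule are assumptions for illustration.

```python
def double_bridge(short_len: float = 1.0, long_len: float = 2.0,
                  evaporation: float = 0.1, steps: int = 200) -> list[float]:
    """Return the share of ants choosing the short bridge at each step."""
    tau_short, tau_long = 1.0, 1.0  # equal trails: the first choice is a coin flip
    history = []
    for _ in range(steps):
        p_short = tau_short / (tau_short + tau_long)
        history.append(p_short)
        # Negative feedback: both trails evaporate at the same rate.
        tau_short *= 1.0 - evaporation
        tau_long *= 1.0 - evaporation
        # Positive feedback: deposits arrive at a rate inversely proportional
        # to bridge length, because shorter round trips complete sooner.
        tau_short += p_short / short_len
        tau_long += (1.0 - p_short) / long_len
    return history


if __name__ == "__main__":
    shares = double_bridge()
    print(f"start: {shares[0]:.2f}, end: {shares[-1]:.4f}")
```

Nothing in the loop compares the two lengths directly. The preference for the short bridge emerges only from the two trail values, which is the sense in which the medium computes.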

Marco Dorigo formalised the principle as a general algorithm, Ant System, in his 1992 thesis and the 1996 paper that followed [Dorigo 1992, Dorigo et al. 1996]. Ant Colony Optimization (ACO) has since become a family. Three members matter here. Ant Colony System introduced a selection rule controlled by a parameter q0: with probability q0 the ant takes the strongest edge (exploitation), and otherwise it picks an edge with probability proportional to that edge’s strength (exploration) [Dorigo and Gambardella 1997]. MAX-MIN Ant System added explicit bounds on the pheromone, a lower and an upper bound, to prevent both starvation of unused edges and runaway lock-in on a single edge, and it restricted reinforcement to the best ant of each round, its elite rule [Stützle and Hoos 2000]. AntNet adapted the family to non-stationary problems, network routing whose best path keeps changing over time so that a frozen solution goes stale [Di Caro and Dorigo 1998]. My setting is non-stationary too, and two design choices from this family, bounded pheromone and a tunable exploit-explore split, reappear directly in this system.
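The two design choices named above can be sketched compactly. The selection function follows the simplified description in the text, using pheromone strength only; the published Ant Colony System rule also weighs a heuristic term, which is omitted here. Function names and the example values are assumptions for illustration.

```python
import random


def select_edge(pheromone: dict[str, float], q0: float, rng: random.Random) -> str:
    """Pick an outgoing edge: exploit with probability q0, otherwise explore."""
    edges = list(pheromone)
    if rng.random() < q0:
        # Exploitation: take the strongest edge.
        return max(edges, key=pheromone.__getitem__)
    # Exploration: sample an edge with probability proportional to its strength.
    weights = [pheromone[e] for e in edges]
    return rng.choices(edges, weights=weights, k=1)[0]


def clamp(value: float, tau_min: float, tau_max: float) -> float:
    """MAX-MIN style bounds: no edge starves, no edge locks in without limit."""
    return max(tau_min, min(tau_max, value))


if __name__ == "__main__":
    rng = random.Random(42)  # a fixed seed makes the run repeatable
    trail = {"a": 0.2, "b": 3.0, "c": 0.8}
    picks = [select_edge(trail, q0=0.9, rng=rng) for _ in range(10)]
    print(picks)
```

Keeping the lower bound above zero matters for the exploration branch: an edge with zero pheromone would never be sampled again, which is exactly the starvation the bounds are meant to prevent.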

2.4 Six dials as a translation lattice

Abstracted, any stigmergic system decomposes into six parts. They form the lattice I use to translate the natural principle into a technical one:

The medium: where the trace is stored and how long it lasts.
Writing: what an actor leaves behind, and how rich that trace is.
Reading: what an actor perceives locally, and how far it sees.
Decay: how fast the system forgets, which sets its adaptability.
Reinforcement: how strongly positive feedback pulls, which sets the speed of convergence.
Noise: how much exploration is allowed against commitment.

In my reading of the attempts to carry stigmergy into software, two of these dials are where most go wrong. They make writing too rich, letting the actors talk to each other in full natural language instead of leaving a narrow trace. And they drop decay, because it is counter-intuitive. A system without forgetting degrades into a logbook; only decay makes it an adaptive swarm.
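The decay dial is cheap to implement when it is evaluated lazily: store each trace with the time it was last written and apply the exponential only when the trace is read. The sketch below is a generic illustration of that technique under assumed names and a half-life parameter; it is not the system design specified later in this paper.

```python
import math


class TraceStore:
    """A medium holding one decaying value per key, evaluated lazily on access."""

    def __init__(self, half_life: float):
        # After one half-life an untouched trace has lost half its strength.
        self.rate = math.log(2) / half_life
        self.traces: dict[str, tuple[float, float]] = {}  # key -> (value, last write time)

    def read(self, key: str, now: float) -> float:
        """Return the decayed value without touching any other key."""
        value, written_at = self.traces.get(key, (0.0, now))
        return value * math.exp(-self.rate * (now - written_at))

    def deposit(self, key: str, amount: float, now: float) -> None:
        """Reinforce a trace: decay it to the present, then add the new amount."""
        self.traces[key] = (self.read(key, now) + amount, now)


if __name__ == "__main__":
    store = TraceStore(half_life=10.0)
    store.deposit("old-path", 1.0, now=0.0)
    store.deposit("new-path", 1.0, now=30.0)
    print(store.read("old-path", now=30.0), store.read("new-path", now=30.0))
```

With the decay removed, both traces would read the same and the store would behave like the logbook described above: it would remember that a path once worked, with no way to say how long ago.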

3. The problem: capability selection in agent loops

3.1 Three capability types, one decision

Tools, MCP tools, and skills differ in mechanism. A function-calling tool is a typed function the model can request. An MCP tool is the same idea behind a portable wire protocol that lets a host connect to external servers. A skill is a procedural document, loaded progressively, that a model reads when it judges a task to match. They are managed by three separate communities with three separate loading mechanisms, but at the moment of choice they pose one question: of everything available, which definition should be visible to the model right now. I treat that as a single decision over a single set of capabilities.

3.2 The standing cost of definitions

Every visible definition occupies context, and context is paid for on every step. The order of magnitude is documented. A first-party engineering report measures 58 tools at approximately 55,000 tokens before the conversation starts and notes tool definitions consuming 134,000 tokens before optimisation [Anthropic 2025a]. One community field report of a large MCP setup describes a still larger total, approximately 144,800 tokens of tool definitions [claude-code #12241]; I treat it as an anecdotal worst case rather than a benchmark. The standing cost is real and it scales with the size of the catalogue, not with what a given task needs.

Prompt caching changes how that cost is paid, and the change matters for what follows. Hosts now keep the tool definitions in a cached prefix of the prompt, and at typical cache-hit rates a stable tool block is billed at a fraction of its nominal token weight. The corollary is sharp: any per-step edit to that block, whether the edit narrows the visible set or only reorders it, invalidates the cache from the edit point onward and bills the full input again. So narrowing the standing set on every step trades a documented standing cost for a recurring cache-miss cost, and it is no longer an unambiguous saving. The standing cost still scales with catalogue size, but the cheap remedy is to keep the block stable and warm rather than to rewrite it each turn. The saving this paper aims at lives elsewhere, in the failed exploration of Section 3.4, which prompt caching does not touch.
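A rough cost model makes the trade-off visible. The sketch counts input cost in full-price token equivalents over a run of several steps. The write and read multipliers are illustrative placeholders, not a quote from any provider's price list, and the model simplifies in one direction: it treats an edited block as fully re-billed at the normal rate on every step.

```python
def block_cost(tokens: int, steps: int, stable: bool,
               write_factor: float = 1.25, read_factor: float = 0.1) -> float:
    """Input cost of the tool block over a run, in full-price token equivalents.

    write_factor and read_factor are assumed multipliers for writing to and
    reading from the prompt cache.
    """
    if stable:
        # Written to the cache once, then read at the discounted rate.
        return tokens * write_factor + tokens * read_factor * (steps - 1)
    # Edited every step: each step misses the cache and pays full price.
    return float(tokens * steps)


if __name__ == "__main__":
    steps = 10
    print("full catalogue, stable: ", block_cost(55_000, steps, stable=True))
    print("narrowed to 20k, edited:", block_cost(20_000, steps, stable=False))
    print("narrowed to 5k, edited: ", block_cost(5_000, steps, stable=False))
```

Under these assumed multipliers, the stable 55,000-token block costs less over ten steps than a block narrowed to 20,000 tokens but rewritten each step, while a block narrowed to 5,000 tokens still comes out cheaper. That is the sense in which narrowing is no longer an unambiguous saving: the outcome depends on how far the set shrinks and on the run length.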

3.3 Accuracy degrades with catalogue size

Selection quality falls as more capabilities are visible at once. Public evidence points the same way from two directions. First-party measurements of a tool-search mechanism report selection accuracy rising substantially once the model no longer sees the full catalogue, for example from 49 percent to 74 percent on one model and from 79.5 to 88.1 percent on another when a search step replaces loading all tools [Anthropic 2025a]. Benchmark studies of tool selection point the same way, with accuracy high for catalogues of a few dozen tools and falling as the catalogue grows toward a few hundred; the same critical-size effect appears in the analysis of replacing multi-agent systems with a single agent plus a skill library, where skill-selection accuracy drops once descriptions become semantically confusable [Single-agent skills 2026]. The exact numbers depend on model and benchmark, but the direction is consistent: beyond some catalogue size, more visible capabilities make the choice worse, not better.

3.4 The expensive part is the failed exploration

The single successful call is rarely the problem. The cost concentrates in the trial and error before it: looking at and loading the wrong skill, calling a tool that fails, passing parameters that need correction, retrying. For a skill, even loading the description text is a token cost that burdens the rest of the context. I call this the negative space: the loaded-but-unused, the invoked-but-failed.

A rough sizing shows why this is the part worth attacking. Take a catalogue of fifty tools at around 50,000 standing tokens. A task that loads three wrong skill descriptions at a couple of thousand tokens each, makes two failing calls with their error returns, and then retries will spend on the order of eight thousand tokens of dead-end work before the one correct call. The retrieval family of Section 5.2 already trims the standing 50,000 by surfacing a smaller set, and prompt caching makes a stable version of that standing block cheap on its own (Section 3.2). Neither touches the eight thousand tokens of wrong turns: caching discounts a block that does not change, and the failed exploration is new work on every task. What no static method recovers is that the same wrong turns are paid again on the next structurally similar task, because nothing remembers that those turns failed. That repeated, never-learned-from cost is the negative space, and it is the part this system is built to shrink.

3.5 Root cause

Four gaps explain why the cost persists:

No feedback loop from the outcome of a use back into the next selection.
No runtime forgetting, so stale patterns stay as present as current ones, and the system cannot follow a changing codebase or convention.
No common abstraction over the three capability types, so their token weight simply adds up.
No trusted, host-agnostic intervention point that could carry such a loop without raising the security concerns that block adoption.

The rest of this paper closes these gaps with one mechanism.

4. From foraging to capability selection

4.1 The mapping

The problem here is foraging, not the static traveling-salesman form of ACO. In nature the food source is not given in advance; a walk finds it, and the path to it then optimises autocatalytically through differential accumulation and evaporation. In this setting the source is task success, found rather than given. That makes it a genuine foraging problem.

The mapping is then direct. The many ants of a colony are the many tasks over time; one agent run is one ant. Within a single run the agent does what a scout ant does: it walks into dead ends (loads the wrong skill, calls a failing tool, backtracks) and eventually reaches the source. Only the successful path is reinforced; the dead ends evaporate. There is no need for parallel sub-agents to get a colony. The parallelism is temporal: a colony of runs strung out over time, coordinating through one shared, decaying field. This gives up something the double-bridge relied on, the many ants exploring at once so that faster returns lay trail faster, and a temporal colony of one run at a time cannot reproduce that concurrent race. That is exactly why the heavy convergence is pushed offline (Sections 6.6 and 8.4), where the harness runs many ants per task at once and restores the parallel race. Online, the field does not converge from a cold start; it refines an already-warm field and tracks drift.

What carries the trace is the transition between capabilities. I reinforce edges, not isolated points, because the sequence is where the learnable structure lives: this capability tends to work well right after that one, in this kind of context. This is the foraging reading of stigmergy, and it is why the medium is a graph.

4.2 Where the analogy breaks, and what changes

Three differences between an ant and an agent force design changes, and getting them right is what separates a faithful transfer from a superficial one.

A new capability is a known point without a trail, not undiscovered terrain. An ant knows a path only after walking it; blind exploration is unavoidable for it. An agent, by contrast, has the full list of capabilities from the start. The field does not discover which capabilities exist; it only learns which ones work in which context. A capability with no pheromone yet is still a known point; it simply does not carry a trail, and it can surface on the semantic similarity of its description to the task alone. This prior knowledge plays the role that blind exploration plays for the ant, and it is why a purely fresh capability is not invisible.

Lock-in is controlled by decay plus a floor. In the colony a long path is never traded for a shorter one until evaporation erodes the early lead. I keep that mechanism: every trail constantly erodes unless reinforced, so an entrenched pattern fades the moment a better one appears. A guaranteed minimum strength keeps even a long-unused capability in play, so nothing is permanently starved. Because a call yields a cheap, objective outcome signal, a preference for a simply wrong path corrects itself quickly.

Exploration costs tokens, so it cannot be uniform noise. For an ant, exploration is free; it just walks. For an agent, every exploratory step costs tokens, because a speculatively surfaced and possibly failing capability is a wasted round. Exploration must therefore be targeted and budgeted, not a constant background hum. This is the point at which the system has no natural precedent and must add something the colony never needed: cost-aware exploration. The resolution splits learning into two regimes. Offline, where extra walks are free, the colony does the expensive divergent search and crystallises the paths it converges on. Online, the field still learns, but cheaply: it reinforces the paths real tasks actually walk and decays the ones they stop using, so it tracks a changing codebase or convention without paying for speculative walks. The runtime learning this paper claims is that online refinement of an already-warm field, not cold divergent search in production. I return to it in Sections 6.6 and 8.4.

4.3 A note on neural networks

One might object that neural networks already work stigmergically. They do: the neurons are the individuals, the weights are the traces, activation is the stimulus and response. That principle already lives inside the model, learned far better there than any hand-built copy could manage. The lever this system pulls sits one level up, at the insect scale: short-lived but capable actors that coordinate through a shared, decaying medium outside themselves.

5. Related work

Three bodies of work bear on this system. None combines the properties this setting needs, and the gap between them is where this system sits.

5.1 Stigmergy and pheromones in LLM reasoning

A young but growing line applies pheromone ideas to LLM reasoning and multi-agent coordination. ACO-ToT places ant-like LLM agents on a tree of thought and deposits pheromones on the edges of good reasoning paths [Chari et al. 2025]. SwarmSys builds a decentralised system of explorer, worker, and validator roles with a pheromone-inspired reinforcement in which validated traces strengthen and ineffective ones decay [Li et al. 2025]. A pressure-field approach lets agents act on a shared artifact guided by quality gradients with temporal decay, explicitly likened to ant colonies [Rodriguez 2026]. Pheromone-guided policy optimisation learns trajectory-level tool-transition patterns from successful histories to steer long-horizon planning [Li et al. 2026]. Adjacent shared-substrate ideas include blackboard architectures for LLM committees [Han and Zhang 2025] and observation-driven coordination on a shared document [Pugachev 2025], and the stigmergy literature outside language models continues, for example stigmergic swarming agents for subgraph isomorphism [Parunak 2026]. A separate strand replaces natural-language messages between agents with intermediate activations, a deliberately narrow channel [Ramesh and Li 2025].

The takeaway: pheromone ideas have reached language models, but the decay, where there is any, acts on the reasoning steps inside a single task rather than on a persistent field of capability use across tasks, and none of this work unifies tools, MCP tools, and skills. These systems accumulate experience; few of them forget, and forgetting is the mechanism that makes the natural model adaptive.

5.2 Tool, MCP, and skill selection

A mature family addresses the token cost of capability selection without any reference to stigmergy. Retrieval-augmented tool selection retrieves relevant tool descriptions instead of carrying all of them, cutting prompt tokens by over half and improving selection accuracy in an MCP stress test [Gan and Sun 2025]. Active tool discovery lets an agent request capabilities on demand through hierarchical semantic routing [Fei et al. 2025]. Tool-to-agent retrieval scales selection across many agents and tools [Lumer et al. 2025]. AutoTool exploits tool-usage inertia, the tendency of consecutive steps to use related tools, to cut inference cost [Jia and Li 2025]. Online-optimised retrieval adapts the retriever from live interactions with lightweight gradient updates and no change to the model [Pan et al. 2025]. First-party platform work introduced deferred tool loading and a tool-search step [Anthropic 2025a], and a related proposal turns MCP tools into on-disk code functions discovered on demand at near-zero token cost [Anthropic 2025b, Willison 2025]. Surveys of the skill mechanism cover progressive disclosure, acquisition, and security [Xu and Yan 2026], and the single-agent-with-skills analysis documents the critical-size threshold in skill selection [Single-agent skills 2026].

The takeaway: this selection-cost family works, and I build above it rather than against it, but every member is static after configuration. The outcome of past calls does not change what surfaces next, and none of them forgets. This layer adds exactly that feedback and forgetting, on top of the retrieval and loading they provide.

5.3 Ensemble diversity and correlated errors

The optional second opinion on whether a task succeeded rests on a clear empirical finding. A large study of correlated errors across hundreds of models shows that models agree often when they are wrong, and that larger, more accurate models have highly correlated errors even across different providers and architectures, with consequences for using a model to judge another model’s output [Kim et al. 2025]. The counterpart finding is that naively mixing different models can lower quality: aggregating repeated samples from the single strongest model can beat a mixture of diverse models [Li et al. 2025b]. Other work shows homogeneous agents saturate early because their outputs correlate, while genuine diversity yields continued gains [Yang et al. 2026], and that committees can collapse toward a shared representation unless diversity is maintained [Patel 2026]. The lesson I take is precise: a different-model evaluator helps only when its errors measurably decorrelate from the actor’s. I therefore gate it on measured decorrelation rather than assume it (Section 6.7).

5.4 The gap

No published system brings together the five properties this setting needs:

Unification of tools, MCP tools, and skills as one selection problem.
Runtime decay as an operating mechanism, not a training hyperparameter.
A narrow, non-linguistic signal that steers which descriptions load without itself occupying context as text.
A decorrelated evaluator, used only where its independence is measured.
Cost-aware exploration that accounts for the tokens exploration spends.

Section 6 closes these in order: the unified data model and lifecycle address the first, lazy decay the second, the narrow pheromone-and-evidence signal the third, the gated evaluator the fourth, and budgeted, mostly offline exploration the fifth.

Not every piece is new on its own. Pheromones on success paths, retrieval of tool descriptions, and progressive skill disclosure all exist. The conjunction is what is missing, and it is where this system sits.

6. System design

This section specifies the mechanism. The substrate is a directed graph; selection reads it, deposit writes it, decay erodes it, and a lifecycle of behavioural events decides when a write happens and with what reward. All numeric defaults given here are the implemented defaults.

The concept has been built for research and testing: https://github.com/pssah4/stigmergy

6.1 The path-graph data model

The medium is a directed multigraph stored in a local relational database.

A node is a capability (a tool, MCP tool, skill, or sub-agent) with a description and an embedding of that description. A reserved synthetic node, written START, marks the beginning of a task.
An edge is a directed transition from one capability to another. It carries the learned state of that transition:
- a pheromone value in the bounded interval [τ_min, τ_max], deposited under quality and efficiency and eroded by lazy decay;
- a success count and a failure count, both decayed evidence used for error-aware selection;
- a timestamp for lazy decay, and pin metadata for user-defined paths.

The first step of a task is an edge from START to the first capability, so the field also learns how tasks in a given context tend to begin. Modelling transitions rather than points is the foraging reading from Section 4.1: the sequence is where the structure lives.
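
The data model above can be pictured with a minimal sketch. The class and field names here are hypothetical and chosen for illustration; they are not taken from the repository schema, and starting a fresh edge at the pheromone floor is an assumption.

```python
from dataclasses import dataclass
from typing import Optional

START = "START"  # reserved synthetic node marking the beginning of a task


@dataclass
class Node:
    id: str
    kind: str                  # "tool", "mcp_tool", "skill", or "sub_agent"
    description: str
    embedding: list[float]     # embedding of the description


@dataclass
class Edge:
    src: str
    dst: str
    pheromone: float = 0.05        # assumed to start at the floor tau_min
    success_count: float = 0.0     # decayed evidence, hence a float
    failure_count: float = 0.0
    last_decay_ts: float = 0.0     # timestamp used by lazy decay
    pinned: bool = False           # pin metadata for user-defined paths
    when_to_use: Optional[str] = None


def path_edges(path: list[str]) -> list[tuple[str, str]]:
    """Turn an invoked-capability sequence into transitions, beginning at START."""
    steps = [START, *path]
    return list(zip(steps, steps[1:]))
```

For a task that invoked `search` and then `read`, `path_edges(["search", "read"])` yields the transitions `START → search` and `search → read`, so the opening move of a task is learned like any other edge.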

6.2 Score and selection

For a candidate capability c, given the previous step prev (or START) and the task context q (the embedding of the user’s request, fixed once when the task opens), the system computes a desirability and weights it by how strongly the accumulated evidence believes this edge will succeed.

The desirability is multiplicative, in the ACO form:

τ̂ = decayed pheromone of the transition prev → c,   in [τ_min, τ_max]
η̂ = max(η_floor, cos(emb_c, emb_q))
desirability(c) = τ̂^α · η̂^β

Here τ̂ is the learned strength of the concrete transition, and η̂ is the cosine similarity between the embedding of the capability description (emb_c) and the embedding of the task context (emb_q), a standard measure of how close two pieces of text are in meaning. The exponents α and β are non-negative weights that set how strongly learned strength and context relevance each bear on the score; the default leans on relevance (β above α) because relevance is reliable from the first task while pheromone is empty until the field has accumulated evidence, and the balance is itself an ablation axis (Section 8) so a warm field can be allowed to lean harder on learned strength. The multiplicative form is deliberate. A linear combination of the two terms, with α and β read instead as coefficients, would let a high learned strength override a near-zero relevance, surfacing a well-worn but contextually wrong transition. Multiplication requires both: a transition must be both learned-strong and context-relevant to score well. This is the symmetry break that gives stigmergy its selectivity, and it is why I rejected the linear form. The floor η_floor is symmetric to τ_min and protects a relevant capability whose description happens to embed poorly.
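
A minimal sketch of the desirability formula, using the defaults stated in this section. The function names and the toy two-dimensional embeddings are illustrative assumptions.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def desirability(tau: float, emb_c: list[float], emb_q: list[float],
                 alpha: float = 1.0, beta: float = 2.0,
                 eta_floor: float = 0.05,
                 tau_min: float = 0.05, tau_max: float = 1.0) -> float:
    """tau_hat^alpha * eta_hat^beta, with both factors bounded from below."""
    tau_hat = min(tau_max, max(tau_min, tau))
    eta_hat = max(eta_floor, cosine(emb_c, emb_q))
    return (tau_hat ** alpha) * (eta_hat ** beta)


# A well-worn transition whose description is orthogonal to the task.
worn_but_irrelevant = desirability(1.0, [1.0, 0.0], [0.0, 1.0])

# A fresh transition (pheromone at the floor) that matches the task exactly.
fresh_but_relevant = desirability(0.05, [1.0, 0.0], [1.0, 0.0])
```

With the defaults, the worn but irrelevant transition scores about 0.0025 and the fresh but relevant one about 0.05, so maximum learned strength cannot carry a capability that does not fit the task.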

Selection is error-aware and seedable, in the spirit of Ant Colony System. For each edge the system keeps a belief about its true success rate, a probability distribution that is wide and uncertain after few observations and narrows as wins and losses accumulate. It then uses Thompson sampling [Thompson 1933], a standard way to choose among uncertain options: draw one random value from each option’s belief and pick the option with the highest draw, so well-proven options usually win while under-tested ones still get occasional chances. Each candidate’s desirability is multiplied by such a draw from the decayed evidence of its edge:

θ ~ Beta(1 + success_count, 1 + failure_count)
score(c) = θ · desirability(c)

Beta(1 + successes, 1 + failures) is the standard belief distribution over a success rate between 0 and 1: with no data it is flat, every rate equally likely, and it sharpens around the observed win rate as evidence accumulates. The draw is produced by a seeded deterministic sampler (built from two Gamma draws via the Marsaglia-Tsang method), so the whole selection is a pure function of the seed. An edge with high similarity but a failure-dominated belief draws a low θ and is held back. This is the suppression of the plausible-but-wrong candidate that pure ACO does not provide: reinforcement and evaporation alone cannot push down a capability that keeps looking relevant but keeps failing. Thompson sampling does, and in one mechanism it also drives exploration, because a little-tested edge has a wide posterior and will occasionally draw a high θ.
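
The scoring step can be sketched with a seeded generator. This example uses Python's standard-library Beta sampler as a stand-in for the seeded sampler described above; the function name is a hypothetical one.

```python
import random


def thompson_score(rng: random.Random, desirability: float,
                   success_count: float, failure_count: float) -> float:
    """Multiply desirability by one draw from the edge's Beta posterior."""
    theta = rng.betavariate(1.0 + success_count, 1.0 + failure_count)
    return theta * desirability


# Same seed, same inputs: the draw is reproducible.
first = thompson_score(random.Random(11), 1.0, 3, 1)
second = thompson_score(random.Random(11), 1.0, 3, 1)

# Average score of a mostly succeeding edge versus a mostly failing one.
rng = random.Random(5)
proven = sum(thompson_score(rng, 1.0, 20, 2) for _ in range(2000)) / 2000
failing = sum(thompson_score(rng, 1.0, 2, 20) for _ in range(2000)) / 2000
```

Both edges here have the same desirability, so the gap between `proven` and `failing` comes entirely from the evidence: the failure-dominated edge is held back even though it looks equally relevant.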

Given the scored candidates, the system makes the exploit-or-explore choice with a single seeded draw against the parameter q0:

with probability q0:   rank by score, descending            (exploit)
otherwise:             sample without replacement
                       proportional to score                (explore)

The shipped default policy is this Thompson-plus-q0 behaviour. Two alternative policies exist for comparison and for opt-in use: a UCB policy (Upper Confidence Bound), which ranks each edge by its estimated success rate plus a bonus that grows the more uncertain that estimate is, so rarely tried edges get a boost; and an adaptive-epsilon policy, where epsilon is the probability of taking a random exploratory pick instead of the best one, and that probability rises as the recent success rate falls below a target and falls as it climbs above, capped by an exploration budget. The budget is the cost-aware exploration of Section 4.2 made concrete: it bounds the fraction of a surfaced set that may come from the explore path.
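
The exploit-or-explore draw of the default policy can be sketched as follows. This is a simplified illustration with a hypothetical function name; the fallback for an all-zero score set is an assumption, and the UCB and adaptive-epsilon alternatives are not shown.

```python
import random


def order_candidates(rng: random.Random, scores: dict[str, float],
                     q0: float = 0.7) -> list[str]:
    """Order candidates: exploit with probability q0, otherwise explore."""
    if rng.random() < q0:
        # Exploit: rank by score, descending.
        return sorted(scores, key=lambda name: scores[name], reverse=True)

    # Explore: sample without replacement, proportional to score.
    remaining = dict(scores)
    order: list[str] = []
    while remaining:
        names = list(remaining)
        weights = [remaining[name] for name in names]
        if sum(weights) <= 0.0:
            # Assumed fallback: nothing left to weight by, keep given order.
            order.extend(names)
            break
        pick = rng.choices(names, weights=weights, k=1)[0]
        order.append(pick)
        del remaining[pick]
    return order


scores = {"a": 0.9, "b": 0.5, "c": 0.1}
exploit = order_candidates(random.Random(1), scores, q0=1.0)
explore = order_candidates(random.Random(1), scores, q0=0.0)
```

Because every random choice goes through the one seeded generator, two calls with the same seed and the same scores return the same ordering.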

What the policy produces can be used in two ways, and the prompt-caching argument of Section 3.2 decides which is the default. The conservative default keeps the full capability set visible so the cached tool block stays stable, and the score above orders it rather than cuts it. The ordering itself does not go into the tool block, where it would break the cache; it carries the learned path that the field has converged on for this context, surfaced as a small per-turn context that sits after the cache breakpoint. That learned path, the proven sequence from task to solution, is the guidance the layer adds: it points the model at the transition that has worked here before so the model spends fewer turns finding it. Narrowing the visible set is the other use, kept as an optional, cache-aware optimisation for catalogues too large to keep whole in context, where the standing cost outweighs the cache-miss cost of editing the block. The desirability score drives both uses unchanged; only what is done with the ranking differs.

Defaults: α = 1, β = 2, q0 = 0.7, η_floor = 0.05, τ_min = 0.05, τ_max = 1.0. All are experimental and are ablation axes (Section 8).

6.3 Deposit

When a task resolves (Section 6.5), the system deposits along the successful path. The reward factorises into quality and efficiency, kept separate so the two can be ablated independently:

quality    = acceptance weight of the outcome     accepted 1.0, iterated 0.3, abandoned 0.0
efficiency = clamp(baseline_cost / token_cost,  [e_min, e_max])
Δτ         = ρ · quality · efficiency

For each edge (i → j) along the path, the pheromone updates as

pheromone ← clip( decayed(pheromone) + Δτ,  τ_min, τ_max )

Here ρ is the deposit strength, the learning rate that scales every reinforcement. The efficiency term ties the reward to token cost, the quantity the system is built to lower: a task that costs fewer tokens than the reference baseline earns more reinforcement, up to e_max; one that costs more earns less, down to e_min. An abandoned task carries quality 0, so it never earns positive reinforcement. On top of this base reward, the deposit is scaled up when a task’s quality beats the running baseline for its context, so a task that does better than its peers leaves a stronger trail. This is the online analogue of the elite rule in MAX-MIN Ant System, where only the best ant of a round reinforces, and it is what I mean by an advantage-relative deposit.

The deposit is idempotent per task and runs in a single transaction, so redundant resolution events cannot double-count. Defaults: ρ = 0.3, e_min = 0.5, e_max = 2.0, and a reference baseline_cost of 1000 tokens. On a cold substrate, before a per-context median exists, this fixed reference stands in; it is deliberately mid-range so early tasks deposit near the nominal ρ rather than at a clamp edge, and a running per-context median replaces it once a few tasks in that context have resolved. With these, an accepted task at the reference cost deposits Δτ = 0.3; an accepted task at half the reference cost deposits the capped maximum of 0.6. The reference value is itself an ablation axis (Section 8), because the cold phase is where it bites most.
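
The deposit arithmetic can be checked with a short sketch that uses the defaults above. The function names are hypothetical, the guard against a zero token cost is an assumption, and the advantage-relative scaling is left out.

```python
QUALITY = {"accepted": 1.0, "iterated": 0.3, "abandoned": 0.0}


def delta_tau(outcome: str, token_cost: float,
              baseline_cost: float = 1000.0, rho: float = 0.3,
              e_min: float = 0.5, e_max: float = 2.0) -> float:
    """Base reward: rho * quality * clamped efficiency."""
    cost = max(1.0, token_cost)  # assumed guard against division by zero
    efficiency = min(e_max, max(e_min, baseline_cost / cost))
    return rho * QUALITY[outcome] * efficiency


def deposit(decayed_pheromone: float, dtau: float,
            tau_min: float = 0.05, tau_max: float = 1.0) -> float:
    """Add the reward to the already-decayed pheromone and clip to bounds."""
    return min(tau_max, max(tau_min, decayed_pheromone + dtau))


at_reference = delta_tau("accepted", 1000)   # efficiency 1.0
at_half_cost = delta_tau("accepted", 500)    # efficiency 2.0, the cap
very_cheap = delta_tau("accepted", 250)      # still capped at e_max
expensive = delta_tau("accepted", 4000)      # clamped up to e_min
abandoned = delta_tau("abandoned", 500)      # quality 0, no reward
```

This reproduces the two figures in the text, 0.3 at the reference cost and 0.6 at half of it, and shows that a task four times over the reference still deposits 0.15 because efficiency is clamped at `e_min`.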

6.4 Learning the negative space

The negative space is the actual token-saving material (Section 3.4), so the substrate records it directly and separately from the path deposit.

Every returned call updates the evidence of its edge at once: a success increments the decayed success count, a failure increments the decayed failure count. A capability that was loaded into context but never invoked increments the failure count of the edge from prev to it, because the load cost tokens without contributing to the outcome. The token cost of those wasted loads also flows into token_cost and therefore into the efficiency term of the deposit. Concretely: along a positively resolved path, a succeeding edge gains pheromone and a success count; a failing or discarded edge gains only a failure count and no pheromone; and if the whole task is abandoned, even its invoked edges gain a failure count and no pheromone. Over many tasks the substrate learns not only the path that works but the loads and calls that waste tokens, and stops surfacing them. This is what the learning loop buys over a static retriever, and it is how the loop attacks the negative space of Section 3.4 rather than merely shrinking the standing set. A similarity-only retriever surfaces the same plausible-but-wrong capability on every structurally similar task, and it fails every time, because the retriever has no memory of the outcome. By recording failure evidence on that edge, the field drives its belief down until it stops surfacing the capability at all. The saving is not a smaller set on the first turn; it is the removal of the repeated load-and-fail that a memoryless retriever keeps reproducing.

The evidence counts decay too, by the same exponential law as the pheromone but with no floor, so they fade toward zero. This keeps the Thompson posterior responsive to a changing world rather than dominated by ancient outcomes.
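
A minimal sketch of lazy exponential decay, read here as decay computed from the stored timestamp whenever an edge is accessed rather than by a background sweep. The half-life parameterisation and the one-hour value are assumptions made for illustration; the text fixes only the exponential form and the floors.

```python
def decayed(value: float, last_ts: float, now: float,
            half_life_s: float, floor: float = 0.0) -> float:
    """Exponentially decay `value` for the time elapsed since `last_ts`."""
    elapsed = max(0.0, now - last_ts)
    return max(floor, value * 0.5 ** (elapsed / half_life_s))


HOUR = 3600.0

# Pheromone decays toward the floor tau_min and never below it.
pheromone_after_1h = decayed(0.8, 0.0, HOUR, HOUR, floor=0.05)
pheromone_after_10h = decayed(0.06, 0.0, 10 * HOUR, HOUR, floor=0.05)

# Evidence counts use the same law with no floor, so they fade toward zero.
failures_after_2h = decayed(8.0, 0.0, 2 * HOUR, HOUR)
```

The only difference between the two uses is the `floor` argument: a long-unused capability keeps a minimum pheromone, while old wins and losses stop influencing the Thompson posterior.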

6.5 Lifecycle and the delayed acceptance signal

The loop gives the substrate no honest success signal at the moment a tool returns. A tool returning a value is not the same as the task being solved. If the field reinforced every clean tool return, it would learn the cheapest path to a plausible-looking answer, and because the reward is partly efficiency (Section 6.3), it would actively prefer fast wrong answers over slower right ones. The only reliable evidence of success is what the user does next: accept the answer and move on, push back and iterate, change the subject, or fall silent. So the system waits for that behaviour before it deposits.

The engine therefore works against an abstract lifecycle vocabulary, and the loop integration maps its own events onto that vocabulary. Here is a single task in order, with each event named first and the engine’s response after it:

Task started. A new task begins. The engine opens a buffer, a scratch record of the path this task takes, and caches the embedding of the request as the task context.
Capability loaded. A capability’s definition enters the model’s context. This is the moment its token cost is paid, even if the capability is never used, so the engine marks it as loaded and a candidate for discard.
Capability invoked. A capability actually runs. The engine appends it to the buffer’s path and clears its loaded mark.
Capability returned. The capability answers. The engine records the outcome on the edge at once, as a success or a failure.
Response delivered. The answer goes out to the user. The engine freezes the buffer, starts the acceptance timer, and computes the discard set: the capabilities that were loaded into context but never run.
Task iterated. A follow-up on the same topic arrives. The engine resets the timer and keeps the buffer open, because the task is not finished.
Task accepted. Acceptance is detected. The engine deposits the path with the accepted outcome, or with the lower iterated outcome if the buffer had already seen follow-ups.
Task abandoned. The session ends with no acceptance. The engine deposits with the abandoned outcome and a quality of zero.

A path lives in a buffer until the acceptance signal forms. The signal forms in one of three ways. A follow-up whose context embedding is similar to the open buffer (cosine at or above a threshold, default 0.6) is an iteration and keeps the buffer open. A follow-up below the threshold is a topic change, which resolves the previous buffer as accepted and opens a new one. Inactivity past a timeout (default 30 minutes, configurable, shorter for short-lived loops) auto-resolves the buffer as accepted with lower confidence in the reward. A buffer that has seen at least one iteration before it resolves deposits with the iterated outcome and its lower quality weight of 0.3, because the answer needed follow-ups to land; a buffer that resolves without any iteration deposits with the full accepted weight. The iterated weight is therefore the resolution label for a multi-turn buffer, not a separate deposit during iteration.
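
The resolution rules can be sketched as three small functions. The names are hypothetical; the 0.6 threshold, the 30-minute timeout, and the two outcome labels come from the text.

```python
def classify_follow_up(similarity: float, threshold: float = 0.6) -> str:
    """A follow-up at or above the threshold is an iteration, else a topic change."""
    return "iteration" if similarity >= threshold else "topic_change"


def timed_out(last_activity_ts: float, now: float,
              timeout_s: float = 30 * 60) -> bool:
    """True once inactivity has gone past the timeout."""
    return now - last_activity_ts > timeout_s


def resolution_label(iterations_seen: int) -> str:
    """Label used when a buffer resolves as accepted.

    A buffer that needed follow-ups deposits with the lower iterated weight.
    """
    return "iterated" if iterations_seen > 0 else "accepted"
```

A follow-up with similarity 0.72 keeps the buffer open as an iteration; a later follow-up at 0.31 is a topic change, and the buffer then resolves with the label `iterated` because it had already seen one iteration.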

Because the signal is delayed, buffers must survive a restart. They are held in memory and mirrored to an append-only log, so a crash mid-task loses nothing: on restart the buffers reload and their timers re-arm. Resolution is guarded to happen exactly once; if a deposit fails, the buffer and its timer are preserved for retry, and the timer is cleared only after a deposit commits.

6.6 Exploration, offline and budgeted

Section 4.2 noted that exploration costs tokens in an agent and must be budgeted. The system separates the two economies. In production the default policy exploits what is already learned and explores only sparingly through the natural width of the Thompson posterior. The heavy exploration that a colony needs to converge happens offline, in the measurement harness (Section 8), where extra walks per task are free because the run is a controlled experiment. There the system can run many parallel ants per task, let the colony converge, and crystallise the best paths as pinned paths before any of it ships. Production then exploits. This is how the system gets ACO-style convergence without violating the token-reduction goal. If foraging is ever offered in the production loop, it is gated on value-of-information and an explicit budget.

The crystallised paths are the guidance the default surfaces (Section 6.2). A converged path is the proven sequence from task to solution for its context, and production hands it to the model as a small per-turn hint rather than by cutting the visible set. This is the cache-neutral lever: the tool block stays whole and warm, the model gets the path it would otherwise have to rediscover, and the failed exploration before the one correct call shrinks. Narrowing the visible set remains the optional path for catalogues too large to keep whole, where the standing cost is the binding constraint.

6.7 The heterogeneous evaluator

The behavioural acceptance signal is noisy. In borderline cases it is unclear whether an answer was accepted or merely tolerated, or whether an inactivity timeout wrongly deposited a failure as a success. A second, independent opinion would help exactly there. A different model can read the answer and judge success with a confidence, but Section 5.3 warns that a model judging its own family confirms its own mistakes. The value appears only when the judge is a different model whose errors measurably decorrelate from the actor’s.

The evaluator is therefore optional and gated in three ways. It is called only when the implicit signal sits in a confidence grey zone, never on a clear acceptance or a clear abandonment. It is active only while measured decorrelation stays above a threshold; below it, the evaluator disables itself with a warning, because it adds cost without value. And it runs under a budget, a cap per window or only for tasks above a cost threshold. It runs at outcome resolution, never on the selection hot path, and it is the tiebreaker for the outcome label, not a replacement for the implicit signal. Decorrelation is measured offline against a small set of known-good and known-bad paths: over that set the system computes how often the evaluator and the actor make their mistakes on different items, and treats the evaluator as useful only when that error overlap sits below a set threshold. A high overlap means the judge mostly confirms the actor, the failure mode of Section 5.3. The offline multi-ant phase, which already runs different models, supplies that labelled data for free. The evaluator is off by default until its decorrelation is shown.
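
One way to make the decorrelation check concrete is sketched below. The overlap definition (items both got wrong, divided by items at least one got wrong) and the 0.5 threshold are assumptions chosen for illustration; the text does not fix either.

```python
def error_overlap(actor_correct: list[bool],
                  evaluator_correct: list[bool]) -> float:
    """Share of erroneous items on which actor and evaluator are both wrong."""
    if len(actor_correct) != len(evaluator_correct):
        raise ValueError("both lists must cover the same labelled items")
    pairs = list(zip(actor_correct, evaluator_correct))
    both_wrong = sum(1 for a, e in pairs if not a and not e)
    either_wrong = sum(1 for a, e in pairs if not a or not e)
    return both_wrong / either_wrong if either_wrong else 0.0


def evaluator_enabled(overlap: float, max_overlap: float = 0.5) -> bool:
    """Keep the evaluator on only while its errors overlap little with the actor's."""
    return overlap < max_overlap


# Five labelled paths: True means the model judged the path correctly.
actor = [True, False, False, True, True]
independent_judge = [True, False, True, False, True]
same_family_judge = [True, False, False, True, True]

low = error_overlap(actor, independent_judge)
high = error_overlap(actor, same_family_judge)
```

The independent judge shares one of three errors with the actor and stays enabled; the same-family judge repeats every actor error, reaches an overlap of 1.0, and would be switched off.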

6.8 Pins and user control

A user can pin a path so the substrate always prefers it, restricts selection to it, or walks it as a fixed sequence. Pinned edges bypass the pheromone bounds and are immune to decay, so a deliberate workflow does not fade. Pins make the learned field editable: a user can crystallise a known-good route, and the system can crystallise an offline-converged one, without either being eroded by the same forgetting that keeps the rest of the field adaptive.

A pinned or emergent path carries a whenToUse description, and that description is the gate. A path fires for a task only when its whenToUse embeds close enough to the task context, the same cosine test the score uses, so a path applies to the tasks it was meant for and stays silent on the rest. This closes the loop with the guidance of Sections 6.2 and 6.6: the path that fires is the proven sequence the layer surfaces as the per-turn hint, and the whenToUse gate is what decides which proven sequence matches the task at hand. The learned and pinned paths are therefore the guidance, selected by semantic match rather than carried in the tool block, so the cache stays warm while the model still gets the route that worked before.
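
The whenToUse gate can be sketched as a similarity check over the stored path descriptions. The 0.6 threshold, the rule of returning the single best match, and the toy embeddings are assumptions; the text specifies only that the gate is a cosine test against the task context.

```python
import math
from typing import Optional


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def matching_path(when_to_use: dict[str, list[float]],
                  task_embedding: list[float],
                  min_similarity: float = 0.6) -> Optional[str]:
    """Return the best path whose whenToUse embedding clears the gate, if any."""
    best_name: Optional[str] = None
    best_similarity = min_similarity
    for name, embedding in when_to_use.items():
        similarity = cosine(embedding, task_embedding)
        if similarity >= best_similarity:
            best_name, best_similarity = name, similarity
    return best_name


paths = {"deploy-service": [1.0, 0.0], "monthly-report": [0.0, 1.0]}
fires = matching_path(paths, [0.9, 0.1])
silent = matching_path(paths, [-1.0, 0.0])
```

A task close to the deploy description fires that path, while a task that matches neither description gets no path hint and falls back to ordinary scoring.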

7. Architecture and trust

A learning layer that touches an agent’s configuration has to earn trust before it can earn adoption. The architecture is shaped as much by that constraint as by performance.

7.1 A library, not a service

The engine is a library that runs in the host process, behind two ports in the hexagonal sense. A storage port abstracts the local database; an embedding port abstracts the function that turns a description or a context into a vector. The core depends on neither concrete implementation. This keeps the host as the trust anchor, adds no network hop to the loop, and lets the same core run in two very different runtimes: a native server process and a constrained in-application sandbox that forbids native modules and background workers. The storage and embedding implementations ship as separate packages, so a new backend is a new package, not a change to the core.
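
The two-port shape can be sketched with structural interfaces. All names and method signatures below are hypothetical and far smaller than a real engine; the point is only that the core sees the ports and never a concrete backend.

```python
from typing import Optional, Protocol


class StoragePort(Protocol):
    def load_pheromone(self, src: str, dst: str) -> Optional[float]: ...
    def save_pheromone(self, src: str, dst: str, value: float) -> None: ...


class EmbeddingPort(Protocol):
    def embed(self, text: str) -> list[float]: ...


class InMemoryStorage:
    """Adapter for tests; a database-backed adapter would expose the same methods."""
    def __init__(self) -> None:
        self._edges: dict[tuple[str, str], float] = {}

    def load_pheromone(self, src: str, dst: str) -> Optional[float]:
        return self._edges.get((src, dst))

    def save_pheromone(self, src: str, dst: str, value: float) -> None:
        self._edges[(src, dst)] = value


class LengthEmbedding:
    """Toy embedding for tests only; not a meaningful semantic model."""
    def embed(self, text: str) -> list[float]:
        return [float(len(text)), 1.0]


class Engine:
    """Core logic depends on the two ports, not on any concrete backend."""
    def __init__(self, storage: StoragePort, embedding: EmbeddingPort,
                 tau_min: float = 0.05) -> None:
        self._storage = storage
        self._embedding = embedding
        self._tau_min = tau_min

    def pheromone(self, src: str, dst: str) -> float:
        stored = self._storage.load_pheromone(src, dst)
        return self._tau_min if stored is None else stored

    def task_context(self, request: str) -> list[float]:
        return self._embedding.embed(request)
```

Swapping `InMemoryStorage` for another adapter with the same two methods requires no change to `Engine`, which is the property that lets one core run in both a server process and a restricted sandbox.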

7.2 Proactive pre-filtering inside the controlled loop

The system acts before the model decides; it is not a tool the model calls. The model does not decide whether the layer is active, and what the layer puts in front of it is set by the layer, not by the model. This is deterministic by construction, and it is possible only where the consumer owns the tool list. A controlled loop (a custom loop, an agent SDK, or a build-time library embedded in an application) owns its tool list, so a pre-call step can shape what the model sees before it sees anything. The integration points are thin adapters over each SDK’s pre-call layer. I chose this over a reactive host hook because a hook fires after the model has already decided on a call: it can deny or rewrite a decided call, but it cannot shape what the model considers, and it cannot observe automatic skill loading at all.

Owning the pre-call step gives the layer two ways to act, and prompt caching decides the default (Section 3.2). The default keeps the whole tool list in front of the model so the cached block stays warm, and the pre-call step adds the learned path for the task as a small per-turn context after the cache breakpoint. The model then reaches the right capability in fewer turns, which is where the saving comes from, while the tool block is never rewritten. Narrowing the visible set is the second way, kept for catalogues too large to keep whole, where the standing cost outweighs the cost of breaking the cache. Either way, the price of owning the pre-call step is scope: it works only where the consumer owns the tool list, that is, in a custom loop, an agent SDK, or an embedded library. A closed, hosted agent that exposes only a post-decision hook cannot place per-turn guidance ahead of the model, and the reactive-hook fallback recovers only part of the value. This system is for builders who own their loop.
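The default mode can be sketched as a pure function over the request. The `Request` shape here is hypothetical and does not correspond to any particular SDK; the point of the example is only that the hint is appended to the per-turn context while the tool block stays byte-for-byte identical.

```python
import hashlib
import json
from dataclasses import dataclass, field
from typing import Dict, List, Sequence


@dataclass(frozen=True)
class Request:
    """Hypothetical request: a stable tool block plus per-turn context."""
    tools: List[Dict[str, str]]                             # cached prefix
    turn_context: List[str] = field(default_factory=list)   # after the breakpoint


def tool_block_digest(req: Request) -> str:
    """Digest of the tool block, used to check that it was not rewritten."""
    payload = json.dumps(req.tools, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def pre_call(req: Request, learned_path: Sequence[str]) -> Request:
    """Add the learned path as a per-turn hint; never touch the tool block."""
    if not learned_path:
        return req
    hint = "Previously successful route: " + " -> ".join(learned_path)
    return Request(tools=req.tools, turn_context=[*req.turn_context, hint])
```

Comparing `tool_block_digest` before and after the step is one simple way to assert in a test that the cached block has stayed warm.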

7.3 Determinism

Determinism is not optional, because the evaluation in Section 8 is worthless without it. Three choices secure it. The selection RNG is seeded; a fresh engine with the same seed and the same sequence of calls produces identical decisions. All time-dependent computation reads an injectable clock, so tests and the harness can advance time deterministically and fire timers on command. The embedding model is identified by a hash stored alongside the data, so a model change is detected rather than silently corrupting the learned similarities.
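The three choices can be shown together in a small sketch. The class and function names are assumptions for illustration, and the fingerprint here hashes only a model identifier and a dimension count, which is a simplification of whatever the stored hash actually covers.

```python
import hashlib
import random
from typing import List, Tuple


class ManualClock:
    """Injectable clock that only moves when told to."""
    def __init__(self, start: float = 0.0) -> None:
        self._now = start

    def now(self) -> float:
        return self._now

    def advance(self, seconds: float) -> None:
        self._now += seconds


class Selector:
    """Toy selector: a seeded RNG plus an injected clock, no global state."""
    def __init__(self, seed: int, clock: ManualClock) -> None:
        self._rng = random.Random(seed)
        self._clock = clock

    def pick(self, candidates: List[str]) -> Tuple[float, str]:
        return (self._clock.now(), self._rng.choice(candidates))


def model_fingerprint(model_id: str, dims: int) -> str:
    """Assumed fingerprint: a hash over the model identifier and vector size."""
    return hashlib.sha256(f"{model_id}:{dims}".encode("utf-8")).hexdigest()


def check_model(stored: str, current: str) -> None:
    """Fail loudly instead of mixing vectors from two embedding models."""
    if stored != current:
        raise ValueError("embedding model changed; stored vectors need re-embedding")
```

Two `Selector` instances built with the same seed and driven by the same calls return the same picks, which is the property the repeated-run assertion relies on.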

7.4 Local-first and reversible

The decaying field, the vectors, and every per-step selection are computed on the user’s machine. The selection path makes no network call, and a continuous-integration test mocks the network to assert that no implicit egress exists. The only calls that can leave the machine are the explicitly opt-in description enrichment (Section 7.5) and the optional evaluator (Section 6.7), both off the selection hot path and both off by default. This local boundary is both a value and a trust requirement. Any change the system makes to a host configuration is shown as a diff before it is written, the mechanism is opt-in, and it can be fully removed. A skeptical user can run it in observation mode first, inspect the saved field, and reset or uninstall it.
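The diff-before-write step can be sketched with the standard library's `difflib`. The function names and the approval flag are hypothetical; how the system actually presents the diff and collects consent is not specified here.

```python
import difflib


def preview_change(current: str, proposed: str, filename: str) -> str:
    """Return a unified diff of the proposed change, or '' if nothing changes."""
    diff = difflib.unified_diff(
        current.splitlines(keepends=True),
        proposed.splitlines(keepends=True),
        fromfile=f"{filename} (current)",
        tofile=f"{filename} (proposed)",
    )
    return "".join(diff)


def apply_if_approved(current: str, proposed: str, approved: bool) -> str:
    """Write nothing unless the user approved the diff they were shown."""
    return proposed if approved else current
```

Because the diff is computed from the two texts alone, it can be shown, logged, or rejected without the host configuration ever being touched.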

7.5 Optional semantic augmentation

The similarity term η is only as good as the description it embeds. The system can optionally enrich a capability description with a single language-model call, which improves η, and can name strongly reinforced or pinned paths so the field is human-readable. This augmentation is opt-in, runs at capability registration rather than on the selection hot path, and falls back to the raw description on any provider failure with no silent substitution. Without a configured provider the system runs on raw descriptions and identifier-based path names. The selection path is therefore guaranteed free of any external call.
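The fallback rule can be made concrete as follows. This is a sketch, assuming the enrichment provider is a plain callable; the function name and the source labels are invented for the example. Returning the label alongside the text is one way to keep the fallback visible instead of silent.

```python
from typing import Callable, Optional, Tuple

# Assumed provider shape: raw description in, enriched description out.
Enricher = Callable[[str], str]


def describe(raw: str, enricher: Optional[Enricher] = None) -> Tuple[str, str]:
    """Return (description, source) where source records what happened.

    Runs at registration time only, never on the selection path.
    """
    if enricher is None:
        return raw, "raw"                 # no provider configured
    try:
        enriched = enricher(raw)
    except Exception:
        return raw, "raw-fallback"        # provider failed; keep the raw text
    return enriched, "enriched"
```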

7.6 Execution model

The engine runs embedded in the host process; selection is a direct method call with no inter-process hop. A controlled loop is a long-lived process that loads the engine, the database connection, and the embedding model once and keeps them warm, so the embedded model carries no recurring start-up cost. The constraint this accepts is single-writer-at-a-time: one substrate belongs to one writing process at a time. A later daemon transport, reachable through the same interface, would lift that constraint for concurrent multi-surface writing; because the interface is identical, that change is additive rather than a redesign.

8. Evaluation methodology

This is a design paper. The system is implemented, I have run a security review against its design, and the loop-side thin-client libraries that wire the mechanism into a controlled loop are now published as open source on npm. The controlled study that would confirm its central claim is specified here and remains to be run. I state the protocol, the hypotheses with their pivot conditions, and the one preliminary data point I have, and I am explicit about what is not yet measured.

8.1 Hypotheses and pivots

Each hypothesis is falsifiable and carries a pivot if it fails.

H1 (token). The claim is that the field reduces tokens per successfully completed task against a static baseline with no success regression. It is confirmed by at least a 15 percent reduction at a success rate at or above baseline. If it is refuted, the token-reduction premise fails, and I fall back to the reinforcement effect of H2 and re-scope or retire accordingly.
H2 (decay). The claim is that explicit runtime decay beats no decay. It is confirmed when the decay variant is at least as good as the no-decay variant on tokens and success. If it is refuted, I treat parameter-signature reinforcement, not decay, as the core contribution.
H3 (unification). The claim is that one substrate helps across tools, MCP tools, and skills. It is confirmed when a token reduction is measurable in all three types. If it is refuted, I narrow the scope to the type or types that work.
H4 (heterogeneity). The claim is that a different-model evaluator improves signal quality. It is confirmed when a decorrelated evaluator beats a same-model one on tokens and success. If it is refuted, I drop the evaluator as a feature.
H5 (narrow signal). The claim is that a low-bandwidth, non-linguistic signal preserves the gains. It is confirmed when the narrow signal is comparable to a full natural-language one. If it is refuted, I widen the bandwidth.
H6 (scaling). The claim is that gains grow with catalogue size and the selection threshold shifts outward. It is confirmed when accuracy stays stable beyond the previously documented critical size. If it is refuted, I position the system for small and mid-size catalogues.
H7 (integration). The claim is that the layer coexists with an existing memory and tool-calling system without regression. It is confirmed by no regression in retrieval relevance or loop incidence, with success at or above baseline. If it is refuted, I revise the integration toward stricter separation.

H1 and H2 are the primary claims; H3 through H7 are staged.
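The H1 criterion is mechanical enough to write down as a check. The sketch below encodes only what H1 states: at least a 15 percent reduction in tokens per successfully completed task, at a success rate at or above baseline. The `Arm` record and the numbers in the test are illustrative assumptions, not measurements.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arm:
    """Summary of one experimental arm (hypothetical record)."""
    tokens_per_success: float
    success_rate: float


def token_reduction(baseline: Arm, treatment: Arm) -> float:
    """Relative reduction in tokens per successfully completed task."""
    return 1.0 - treatment.tokens_per_success / baseline.tokens_per_success


def h1_confirmed(baseline: Arm, treatment: Arm,
                 min_reduction: float = 0.15) -> bool:
    """True only if the saving is large enough and success did not regress."""
    saving_ok = token_reduction(baseline, treatment) >= min_reduction
    guardrail_ok = treatment.success_rate >= baseline.success_rate
    return saving_ok and guardrail_ok
```

A large token saving bought with a lower success rate fails this check, because the guardrail is evaluated together with the saving and not after it.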

8.2 The measurement harness

The harness is the scientific instrument. It runs versioned workloads against the engine plus at least one loop integration, records tokens, success rate, attempts, and selection quality, and makes ablations reproducible. Three properties make its results publishable.

Determinism. The same configuration, workload, and seed produce identical reports, asserted by repeated runs. External calls (the model and the embeddings) are stubbed or recorded so a replay is exact.

Versioned workloads. A workload is a versioned set of tasks, referenced by a content hash, each task carrying its description, its expected capability class, and its success definition. Workloads come from two sources: recorded, anonymised real sessions from controlled loops, and synthetic tasks with known ground truth. Every report carries the workload hash and the seed.
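One way to derive such a content hash is to serialise the tasks canonically and hash the bytes. This is a sketch, assuming SHA-256 over sorted-key JSON; the `Task` field names follow the three items named above, but the exact schema and hash function are assumptions.

```python
import hashlib
import json
from dataclasses import asdict, dataclass
from typing import Sequence


@dataclass(frozen=True)
class Task:
    """One workload task, with the three fields the text names."""
    description: str
    expected_capability_class: str
    success_definition: str


def workload_hash(tasks: Sequence[Task]) -> str:
    """Content hash over a canonical JSON form of the task list.

    Task order is significant here; that is a choice of this sketch.
    """
    canonical = json.dumps(
        [asdict(t) for t in tasks],
        sort_keys=True,
        separators=(",", ":"),
        ensure_ascii=False,
    )
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()
```

Any edit to a description or a success definition changes the hash, so a report that carries the hash identifies exactly which workload produced it.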

Ablations. An ablation varies one configuration axis with the rest held constant. The mandatory axes are: decay on against off; the multiplicative score against a linear one; Thompson against UCB against plain q0; a half-life sweep per capability type; a τ_min sweep; cost-aware reward on against off; context conditioning on against off; and the heterogeneous evaluator on against off. Each is a single, reproducible diff.
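The one-axis rule can be enforced in code so that an ablation cannot vary two things by accident. The configuration fields below mirror some of the axes just listed; the field names and default values are assumptions for illustration.

```python
from dataclasses import dataclass, fields, replace


@dataclass(frozen=True)
class Config:
    """Hypothetical engine configuration; the defaults are assumed."""
    decay: bool = True
    score: str = "multiplicative"      # or "linear"
    explorer: str = "thompson"         # or "ucb", "q0"
    cost_aware_reward: bool = True
    context_conditioning: bool = True
    heterogeneous_evaluator: bool = False


def ablate(base: Config, axis: str, value: object) -> Config:
    """Return a variant that differs from the base on exactly one axis."""
    variant = replace(base, **{axis: value})
    changed = [f.name for f in fields(base)
               if getattr(base, f.name) != getattr(variant, f.name)]
    if changed != [axis]:
        raise ValueError(f"ablation must change exactly {axis!r}, got {changed}")
    return variant
```

For example, `ablate(Config(), "decay", False)` yields the no-decay arm with everything else held constant, and passing a value equal to the base is rejected because it would not be an ablation at all.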

Two further axes follow from the prompt-caching argument of Section 3.2 and deserve their own measurement, reported qualitatively until the study runs. The first is the cache interaction: the full token bill has to be attributed across the cached tool block and the uncached per-turn context, because a per-step edit to the block that reads as a saving in nominal tokens can cost more once the cache miss is priced in. The harness records cache-hit behaviour alongside raw tokens so the two are not conflated. The second is the path-guidance effect: surfacing a learned path as per-turn guidance against not surfacing it, with the whole tool block held stable in both arms, isolates how much of the gain comes from guiding the model to the right transition rather than from narrowing what it sees. These are separate axes from H1, and I name them now so they are not folded into a single token number later.
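The cache-interaction argument is easiest to see with numbers. The sketch below uses a deliberately simple cost model: a stable block is written to the cache once and read at a discount afterwards, while a block that is rewritten every turn pays the write price every turn. The two price factors are illustrative assumptions, not any provider's price list, and the token counts are invented.

```python
def nominal_tokens(tool_block: int, per_turn_context: int, turns: int) -> int:
    """Raw token count, ignoring how the cache prices each part."""
    return (tool_block + per_turn_context) * turns


def priced_tokens(tool_block: int,
                  per_turn_context: int,
                  turns: int,
                  block_rewritten_each_turn: bool,
                  cache_read_factor: float = 0.1,     # assumed discount
                  cache_write_factor: float = 1.25    # assumed surcharge
                  ) -> float:
    """Token bill in base-price units under the assumed cache model."""
    if turns <= 0:
        return 0.0
    context_cost = float(per_turn_context * turns)    # never cached
    if block_rewritten_each_turn:
        # Every turn is a cache miss on the tool block.
        block_cost = tool_block * cache_write_factor * turns
    else:
        # One write, then discounted reads on the remaining turns.
        block_cost = (tool_block * cache_write_factor
                      + tool_block * cache_read_factor * (turns - 1))
    return block_cost + context_cost
```

With a 10,000-token block held stable plus a 200-token hint over ten turns, against a 4,000-token block rewritten every turn, the narrowed arm looks cheaper in nominal tokens (40,000 against 102,000) but costs more once the misses are priced under these assumed factors (50,000 against 23,500). This is why the harness records the two separately.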

8.3 Baseline and metrics

The baseline is the static condition: full or whitelisted capability loading with no learning. The primary metric is tokens per successfully completed task, reported as a distribution (median and a high percentile) rather than a mean, because the tail is where failed exploration lives. The guardrail metric is the success rate, which must not regress. Secondary metrics are the median number of attempts per task and the first-try-correct rate. The efficiency term in the deposit (Section 6.3) uses a reference baseline cost so that the reward itself is denominated in the same currency the evaluation reports.
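The primary and guardrail metrics can be computed from per-task records as in the sketch below. It assumes each record is a pair of tokens spent and a success flag, reads "tokens per successfully completed task" as the token counts of the successful tasks, and uses a nearest-rank percentile; the 95th percentile is an assumed choice for the high percentile.

```python
import math
import statistics
from typing import Dict, Sequence, Tuple


def nearest_rank(sorted_values: Sequence[float], percentile: float) -> float:
    """Nearest-rank percentile of an already sorted, non-empty sequence."""
    rank = max(1, math.ceil(percentile / 100.0 * len(sorted_values)))
    return sorted_values[rank - 1]


def summarise(runs: Sequence[Tuple[float, bool]],
              high_percentile: float = 95.0) -> Dict[str, float]:
    """Median and tail of tokens over successful tasks, plus success rate."""
    if not runs:
        raise ValueError("no runs to summarise")
    successes = sorted(tokens for tokens, ok in runs if ok)
    if not successes:
        raise ValueError("no successful task; the primary metric is undefined")
    return {
        "median_tokens": statistics.median(successes),
        "high_percentile_tokens": nearest_rank(successes, high_percentile),
        "success_rate": len(successes) / len(runs),
    }
```

Reporting the high percentile next to the median keeps a single expensive exploration visible, where a mean would blur it into the average.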

8.4 Offline pre-training

The harness has a second role beyond measurement. It runs many parallel ants per task at no production cost, lets the colony converge on good paths, and crystallises those as pinned paths that ship with the system. Production then exploits a substrate that is already warm. This is the operational answer to the cost-of-exploration problem (Sections 4.2 and 6.6): convergence happens where walks are free.

8.5 A preliminary data point, stated honestly

I have one number, and it is not the study. A deterministic stub run on a small synthetic workload (its hash and seed recorded in the released harness, Section 8.2), surfacing the top three of six capabilities, shows a 48 percent token reduction together with a success regression of one task in eight, 12.5 percent, which breaches the guardrail of Section 8.3. Read it carefully: on a cold substrate every edge has zero pheromone and empty evidence, so this run exercises the similarity term and the hard top-three cut alone, not the reinforce-and-decay loop, which has no history to act on yet. The 48 percent and the lost task are the same event seen twice, a cut that dropped a capability the task needed, which both saved its tokens and failed it. The number therefore measures the aggressiveness of a cold cut, not the value of learning, which is exactly why the warm-start protocol (Section 8.4) is the real test. I report it as a mechanism check on a stub agent, not as evidence for H1. The figure that would confirm H1, measured on a real workload with a real model and a confidence interval, is exactly what remains to be produced.

8.6 Threats to validity

Several risks bear on the eventual study and I name them in advance. Decay may turn out irrelevant on real workloads if the optimal half-life is effectively infinite; the decay ablation is run early for that reason, and H2 carries a pivot. The heterogeneity advantage can be undermined by correlated errors (Section 5.3); I therefore measure decorrelation rather than assume it. Offline pre-training can fail to match the real distribution; I version workloads from real, anonymised sessions to counter it. And the synthetic data point above shows the cold-start risk concretely, which is why warm-up is part of the protocol rather than an afterthought.

9. Discussion

9.1 What is novel

The individual ingredients are not all new, and I do not claim otherwise. Pheromones on success paths, retrieval of tool descriptions, and progressive disclosure of skills already exist. The contribution is the conjunction described in Section 5.4, realised as one local mechanism: a unified substrate over tools, MCP tools, and skills; decay as a runtime operation rather than a training detail; a narrow, non-linguistic signal that steers loading without occupying context as text; a decorrelated evaluator used only where its independence is measured; and exploration that accounts for the tokens it spends. To my knowledge no published system combines these.

9.2 When it should help, and when it should not

The system targets the regime where the cost of failed exploration is large: many capabilities, repeated tasks, a catalogue past the size where flat selection degrades (Section 3.3). The failed exploration is the primary target because prompt caching has made a stable standing block cheap (Section 3.2), so the main lever is guiding the model to the transition that worked before, not shrinking what it sees. The standing cost becomes a target of its own only at very large catalogues that no longer fit the context, where narrowing the visible set earns back more than the cache miss costs. For a small, stable catalogue the gains will be modest, because there is little to forget and little to get wrong. H6 tests exactly this dependence on scale. The layer is complementary: it sits beneath retrieval, routing, and progressive disclosure and adds feedback to them, so it competes with none of them.

Decay is the part most likely to be doubted, so it is worth one concrete case. Suppose a project migrates from one test runner to another. Without decay, the edges that made the old runner the strong default keep their accumulated pheromone, so the field keeps surfacing the deprecated capability long after every task has switched. With decay, those edges erode within a few unused tasks while the new runner’s edges reinforce, and the field tracks the migration on its own. That is the bet H2 makes: the lever is not that the field accumulates good paths, it is that forgetting old ones keeps it current.
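The migration case can be simulated in a few lines. This is a toy model under stated assumptions: time is counted in tasks, decay is exponential with a half-life, every task after the migration deposits a fixed amount on the new runner's edge, and the starting pheromone of 10, the deposit of 1, and the half-life of two tasks are invented for the example and are not the system's defaults.

```python
from typing import Optional


def tasks_until_switch(old_pheromone: float,
                       half_life: Optional[float],
                       deposit: float = 1.0,
                       max_tasks: int = 1000) -> Optional[int]:
    """Number of tasks until the new edge outweighs the old one.

    half_life is measured in tasks; None disables decay entirely.
    Returns None if the switch does not happen within max_tasks.
    """
    old = old_pheromone
    new = 0.0
    for task in range(1, max_tasks + 1):
        if half_life is not None:
            factor = 0.5 ** (1.0 / half_life)   # one task of exponential decay
            old *= factor
            new *= factor
        new += deposit                           # only the new runner is used
        if new > old:
            return task
    return None
```

Under these assumed numbers the field follows the migration after 4 tasks with decay and only after 11 without it, and the gap widens as the old edge's accumulated pheromone grows.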

9.3 Limitations

The empirical core is pending; this is the honest centre of the paper. Several design choices are accepted constraints rather than solved problems: a single embedding model per substrate, so cross-model sharing needs re-embedding; a single-writer-at-a-time execution model in the current form; and a heuristic topic-change detector. The cold-start effect is real and visible in the one number I have. And the approach assumes that the outcome of a use is cheaply and objectively observable from behaviour, which holds for many tasks but not all; the optional evaluator exists for the cases where it does not.

10. Conclusion

Capability selection in agent loops is stateless, and that is expensive. The definitions are heavy, the choice degrades as the catalogue grows, and the failed exploration that costs the most is never learned from. Stigmergy offers the missing mechanism in a form that has worked in nature for a long time: reinforce what succeeds, forget what is not reused, and explore a little, all through a shared, decaying medium rather than inside any single actor. I have transferred that mechanism to capability selection with attention to where the analogy holds and where it breaks, specified it to the level of its formulas and defaults, built it as a local, deterministic, reversible library, and laid out a falsifiable protocol to test whether it pays off. The decisive question is empirical and is stated plainly: does the decaying, reinforced field lower tokens per successfully completed task without hurting the success rate, and is forgetting the lever rather than mere accumulation? If the answer is yes, the result is a small, portable layer that many agent systems can put to use.

References

Anthropic (2025a). Introducing advanced tool use on the Claude Developer Platform. Anthropic Engineering, 24 November 2025. https://www.anthropic.com/engineering/advanced-tool-use

Anthropic (2025b). Code execution with MCP: building more efficient agents. Anthropic Engineering, November 2025. https://www.anthropic.com/engineering/code-execution-with-mcp

Chari, A., Tiwari, A., Lian, R., Reddy, S., and Zhou, B. (2025). Pheromone-based Learning of Optimal Reasoning Paths. arXiv:2501.19278.

Deneubourg, J.-L., Aron, S., Goss, S., and Pasteels, J. M. (1990). The self-organizing exploratory pattern of the Argentine ant. Journal of Insect Behavior, 3(2), 159-168. DOI: 10.1007/BF01417909

Di Caro, G., and Dorigo, M. (1998). AntNet: Distributed Stigmergetic Control for Communications Networks. Journal of Artificial Intelligence Research, 9, 317-365. DOI: 10.1613/jair.530

Dorigo, M. (1992). Optimization, Learning and Natural Algorithms (in Italian). PhD thesis, Dipartimento di Elettronica, Politecnico di Milano, Milan, Italy.

Dorigo, M., Maniezzo, V., and Colorni, A. (1996). Ant System: Optimization by a colony of cooperating agents. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 26(1), 29-41. DOI: 10.1109/3477.484436

Dorigo, M., and Gambardella, L. M. (1997). Ant colony system: A cooperative learning approach to the traveling salesman problem. IEEE Transactions on Evolutionary Computation, 1(1), 53-66. DOI: 10.1109/4235.585892

Fei, X., Zheng, X., and Feng, H. (2025). MCP-Zero: Active Tool Discovery for Autonomous LLM Agents. arXiv:2506.01056.

Gan, T., and Sun, Q. (2025). RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection via Retrieval-Augmented Generation. arXiv:2505.03275.

Goss, S., Aron, S., Deneubourg, J.-L., and Pasteels, J. M. (1989). Self-organized shortcuts in the Argentine ant. Naturwissenschaften, 76(12), 579-581. DOI: 10.1007/BF00462870

Grassé, P.-P. (1959). La reconstruction du nid et les coordinations interindividuelles chez Bellicositermes natalensis et Cubitermes sp. La théorie de la stigmergie: Essai d’interprétation du comportement des termites constructeurs. Insectes Sociaux, 6(1), 41-80. DOI: 10.1007/BF02223791

Han, B., and Zhang, S. (2025). Exploring Advanced LLM Multi-Agent Systems Based on Blackboard Architecture. arXiv:2507.01701.

Jia, J., and Li, Q. (2025). AutoTool: Efficient Tool Selection for Large Language Model Agents. arXiv:2511.14650. (AAAI 2026.)

Kim, E., Garg, A., Peng, K., and Garg, N. (2025). Correlated Errors in Large Language Models. arXiv:2506.07962. (ICML 2025.)

Li, R., Liu, H., Zhao, L., Li, Z., Li, J., Jiang, J., Xu, L., Zhao, C., Fan, M., and Liang, C. (2025). SwarmSys: Decentralized Swarm-Inspired Agents for Scalable and Adaptive Reasoning. arXiv:2510.10047.

Li, W., Lin, Y., Xia, M., and Jin, C. (2025b). Rethinking Mixture-of-Agents: Is Mixing Different Large Language Models Beneficial? arXiv:2502.00674. (ICML 2025.)

Li, Y., Cai, G., Yang, S., Luo, H., Han, S., He, X., Li, D., and Feng, L. (2026). PhGPO: Pheromone-Guided Policy Optimization for Long-Horizon Tool Planning. arXiv:2602.13691.

Lumer, E., Nizar, F., Gulati, A., Honaganahalli Basavaraju, P., and Subbiah, V. K. (2025). Tool-to-Agent Retrieval: Bridging Tools and Agents for Scalable LLM Multi-Agent Systems. arXiv:2511.01854.

Pan, Y., Li, X., and Wang, H. (2025). Online-Optimized RAG for Tool Use and Function Calling. arXiv:2509.20415.

Parunak, H. V. D. (2026). Stigmergic Swarming Agents for Fast Subgraph Isomorphism. arXiv:2601.02449.

Patel, D. (2026). Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus. arXiv:2604.03809.

Pugachev, S. (2025). CodeCRDT: Observation-Driven Coordination for Multi-Agent LLM Code Generation. arXiv:2510.18893.

Ramesh, V., and Li, K. (2025). Communicating Activations Between Language Model Agents. arXiv:2501.14082. (ICML 2025.)

Rodriguez, R. R. (2026). Emergent Coordination in Multi-Agent Systems via Pressure Fields and Temporal Decay. arXiv:2601.08129.

Stützle, T., and Hoos, H. H. (2000). MAX-MIN Ant System. Future Generation Computer Systems, 16(8), 889-914. DOI: 10.1016/S0167-739X(00)00043-1

Theraulaz, G., and Bonabeau, E. (1999). A Brief History of Stigmergy. Artificial Life, 5(2), 97-116. DOI: 10.1162/106454699568700

Thompson, W. R. (1933). On the likelihood that one unknown probability exceeds another in view of the evidence of two samples. Biometrika, 25(3-4), 285-294. DOI: 10.1093/biomet/25.3-4.285

Single-agent skills (2026). When Single-Agent with Skills Replace Multi-Agent Systems and When They Fail. arXiv:2601.04748.

Willison, S. (2025). Code execution with MCP: Building more efficient agents. simonwillison.net, 4 November 2025. https://simonwillison.net/2025/Nov/4/code-execution-with-mcp/

Xu, R., and Yan, Y. (2026). Agent Skills for Large Language Models: Architecture, Acquisition, Security, and the Path Forward. arXiv:2602.12430.

Yang, Y., Qu, C., Wen, M., Shi, L., Wen, Y., Zhang, W., Wierman, A., and Gu, S. (2026). Understanding Agent Scaling in LLM-Based Multi-Agent Systems via Diversity. arXiv:2602.03794.

claude-code #12241 (2025). Large MCP tools context warning. anthropics/claude-code, GitHub issue. Anecdotal field report of approximately 144,800 tokens of MCP tool definitions.