From Smart to Experienced: How Agents Learn On The Job

Yesterday I wrote about chairing a board meeting where my CEO and CTO agents debated strategy. Both agents came prepared with their own assessments, priorities, and recommendations. The CEO focused on revenue blockers. The CTO focused on infrastructure stability. They disagreed on sequencing, converged through discussion, and we made decisions.

Both agents were better at that meeting than they were two weeks ago. Not because a new model dropped. Not because I rewrote their prompts. They've been accumulating experience through a system that captures what they learn, stores it, and recalls it when they need it. While everyone obsesses over making models smarter, almost nobody is building systems that make agents experienced. There's a difference, and it matters more than the next model release.

Claude Code is incredibly capable. It can write clean Python, explain complex systems, follow detailed instructions, and reason through edge cases. Give it a role and it will perform that role competently. The capability floor keeps rising with each model generation.

But capability without context is like hiring a talented software engineer who just moved from backend infrastructure to your frontend team. They know data structures, algorithms, system design. What they don't know is that your component library has a subtle memory leak in the pagination component that only shows up under specific conditions. Or that the API team prefers batch requests after 2pm EST because that's when their cache warming runs. Or that the last three production incidents traced back to timezone handling in the date picker. A senior engineer who's been on your frontend team for two years has accumulated pattern recognition that no amount of raw coding ability can replace. Your AI agents face the same gap.

The architecture of agent experience

Glassblower's worn iron tools beside a glowing gather of molten glass

TroopX agents have three layers of identity working together. The first layer is their principles.md file - about 60-100 lines of markdown that defines not just responsibilities and methodology, but their core values. My CEO agent's principles include its KPI (revenue per day), its quality bar (every decision traces back to revenue impact), and its values (bias toward action, escalate with context, coordinate before committing). These principles function like a company's core values combined with a job description. They're written by the Roster Architect agent and establish who the agent is supposed to be.
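The actual principles.md files aren't shown here, but based on the description above, a sketch of what one might contain could look like this (every specific line is illustrative, not the real file):

```markdown
# CEO - Principles

## KPI
- Revenue per day

## Quality bar
- Every decision traces back to revenue impact

## Values
- Bias toward action
- Escalate with context
- Coordinate before committing

## Responsibilities
- Draft monetization memos
- Sequence the roadmap with CTO
- Escalate pricing decisions to Chairman with options and tradeoffs
```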

The second layer is lived experience through workflows. The CEO has run 20 workflows. It has drafted monetization memos, coordinated with CTO on roadmap sequencing, escalated pricing decisions to me six times in one day, and tracked that revenue has been $0 for 20+ sessions. Each workflow generates signals, blackboard writes, message exchanges, decisions. The agent discovers which escalation formats get ignored and which get responses. It learns that when CTO says "technically feasible but risky," the next move is to push for specific risk quantification. It notices that Chairman responds fastest to escalations with three options and tradeoffs, not open-ended questions. The principles say what the agent should value, but experience teaches what actually works in your specific context.

The third layer is persistent memory extracted through reflection loops. After each workflow completes, an agent reads the full trace: blackboard state, signal stream, message history, what was supposed to happen versus what actually happened. It identifies patterns and synthesizes learnings. These get stored as semantic memory with role-specific, project-specific, and cross-cutting tags. Before the next workflow starts, the agent calls memory_recall and loads relevant patterns. My CEO agent's memory now includes: "Revenue blockers in sessions 15-23 all traced to waiting on Chairman pricing approval. Escalation pattern: first three ignored, fourth got response, fifth was re-escalation with urgency framing." And: "When CTO reports infrastructure instability, coordinate sprint planning immediately - waiting for next board meeting adds three-day delay." And: "Monetization memo format that got approved: three pricing tiers, cost-plus analysis, competitive positioning, implementation timeline."

These patterns weren't in the principles file. They emerged through experience and got extracted into memory. The CEO is now more effective at escalating, coordinating with CTO, and framing monetization decisions because the agent remembers what it learned. The loop compounds: principles guide initial behavior, experience reveals what works in context, memory captures those learnings, recall applies them to new situations, which generates new experience.
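The store-with-tags, recall-by-relevance shape of this loop can be sketched in a few lines. This is not TroopX's actual implementation: the `MemoryStore`, `remember`, and `recall` names are hypothetical, and tag overlap stands in for the semantic retrieval the post describes.

```python
from dataclasses import dataclass, field


@dataclass
class Memory:
    text: str
    tags: set[str]  # e.g. {"role:ceo", "project:monetization", "cross-cutting"}


@dataclass
class MemoryStore:
    memories: list[Memory] = field(default_factory=list)

    def remember(self, text: str, tags: set[str]) -> None:
        """Store one learning extracted by a reflection loop."""
        self.memories.append(Memory(text, tags))

    def recall(self, tags: set[str]) -> list[str]:
        """Return learnings relevant to the upcoming workflow.

        Real recall would be semantic (embedding similarity);
        simple tag overlap stands in for it here.
        """
        return [m.text for m in self.memories if m.tags & tags]


store = MemoryStore()
store.remember(
    "Escalations with three options and explicit tradeoffs get the fastest response.",
    {"role:ceo", "pattern:escalation"},
)
store.remember(
    "When CTO reports instability, coordinate sprint planning immediately.",
    {"role:ceo", "role:cto", "pattern:coordination"},
)

# Before a new workflow starts: load only what's relevant to the task at hand.
relevant = store.recall({"pattern:escalation"})
```

The key design choice is that storage is cheap and indiscriminate while recall is selective: everything a reflection pass extracts gets kept, and the tags decide what enters the context window for any given task.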

From context engineering to harness engineering

Stone-split irrigation channels on terraced hillside, worn by seasons of flow

The AI engineering conversation is shifting. Context engineering - the art of crafting prompts, managing context windows, and structuring inputs - established the foundation. That work still matters. But the frontier is harness engineering: building the infrastructure that surrounds the model to capture experience, coordinate multiple agents, and compound learning over time.

Context engineering asks: how do I give this model the right information to make this decision right now? Harness engineering asks: how do I build a system where agents get systematically better at making decisions in my specific environment? The difference shows up in how agents scale.

With pure context engineering, every workflow starts fresh. You can craft an excellent system prompt that defines the role perfectly. You can provide comprehensive documentation in the context window. You can structure your inputs to maximize reasoning quality. The agent will perform well on that single task. But the tenth workflow isn't meaningfully better than the first workflow. The agent is perpetually junior because it has no mechanism to remember what it learned in workflow nine.

Harness engineering adds the memory infrastructure that enables compounding. The first workflow runs with just principles. The CEO knows its KPI is revenue per day. It doesn't yet know that pricing discussions need three options, that escalations need deadlines, that CTO prefers written position papers before meetings.

By the tenth workflow, the CEO has memory from nine prior workflows. It knows the pricing discussion history. It knows escalation patterns that work. It knows the CTO's communication preferences. It's doing CEO work informed by institutional context.

By the hundredth workflow, the CEO has deep pattern recognition. It predicts blockers before they surface because it remembers that the last three roadmap discussions stalled on infrastructure capacity. It knows which decisions require Chairman approval and which don't because it's learned the delegation boundaries through experience.

The harness is what captures the signals, blackboard writes, and message exchanges during workflows. It runs reflection loops after completion to extract patterns. It provides the memory_recall mechanism before new work starts. It manages the shared learning namespaces so one agent's discoveries benefit the whole squad. The models provide intelligence. The harness provides experience.
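The recall → execute → reflect → remember cycle described above can be sketched end to end. Everything here is a toy stand-in under stated assumptions: the real harness would use model passes for execution and reflection, and its APIs are not public, so all names below are invented for illustration.

```python
from dataclasses import dataclass


@dataclass
class Learning:
    text: str
    tags: set[str]


# Minimal in-memory store; tag overlap stands in for semantic recall.
memories: list[Learning] = []


def remember(learning: Learning) -> None:
    memories.append(learning)


def recall(tags: set[str]) -> list[str]:
    return [m.text for m in memories if m.tags & tags]


class Agent:
    """Toy stand-in; real execution and reflection are model passes."""

    def __init__(self, role: str):
        self.role = role

    def execute(self, task: str, prior: list[str]) -> dict:
        # The harness captures the full trace, not just the final output.
        return {"task": task, "context": prior, "result": f"{self.role}: {task} done"}

    def reflect(self, trace: dict) -> list[Learning]:
        # Compare intent vs. outcome in the trace; emit extracted patterns.
        return [Learning(f"pattern from '{trace['task']}'", {f"role:{self.role}"})]


def run_workflow(agent: Agent, task: str) -> dict:
    prior = recall({f"role:{agent.role}"})  # 1. recall before starting
    trace = agent.execute(task, prior)      # 2. execute; harness records the trace
    for learning in agent.reflect(trace):   # 3. reflect after completion
        remember(learning)                  # 4. store for the next workflow
    return trace


ceo = Agent("ceo")
first = run_workflow(ceo, "draft monetization memo")
second = run_workflow(ceo, "escalate pricing decision")
# The second workflow starts with the pattern extracted from the first.
```

The point of the sketch is the wiring, not the stubs: the model only ever sees steps 1 and 2, while steps 3 and 4 are pure harness, which is why a model upgrade slots into this loop without discarding accumulated experience.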

This is where the "models getting better makes coordinated agents 10x better, not 10% better" thesis becomes concrete. A smarter model improves a single-session agent incrementally - maybe it reasons through edge cases more reliably or writes cleaner code. But a smarter model running in a harness with persistent memory and coordination infrastructure compounds those improvements. Better reasoning applied to accumulated context and coordinated with other agents who are also learning creates exponential improvement rather than linear gains.

What experience looks like in production

My CTO agent launched a 5-day infrastructure sprint after the board meeting. Dev wrote implementation specs. QA wrote test acceptance criteria. They shipped the first item same day. That execution speed came from accumulated patterns in the harness.

The CTO knows from previous sprint experience: "When infrastructure is unstable, break it into small increments and ship the highest-leverage fix first." Dev remembers: "When CTO says '5-day sprint,' write specs before coding. The last three times I jumped straight to implementation, QA sent it back for missing edge cases." QA recalls: "Blackboard changes require cross-workflow testing. Check if other agents' workflows break when we add size limits" - learned from the last infrastructure change that passed unit tests but broke message delivery in production.

None of this comes from their principles files. None of it came from prompt engineering. The harness captured these learnings through reflection loops and made them available through memory recall. The agents got better at shipping infrastructure changes by doing infrastructure changes and remembering what worked.

The harness also enables organizational learning at scale. When QA catches a DDD boundary violation in troopx-router importing from troopx-core, that goes into QA's memory. But the reflection loop also writes it to a shared learning namespace on the blackboard. Next workflow, when Dev works in troopx-router, memory_recall returns: "Recent QA finding: watch for DDD violations in this package." One review teaches the whole squad. When CEO escalates a decision six times before getting a response, the reflection loop extracts: "Escalation effectiveness inversely correlated with escalation frequency. Sixth escalation had more context than first - include decision framing, options with tradeoffs, and explicit deadline in first escalation." That pattern becomes available to every agent that needs to escalate: CTO, CMO, Dev when hitting a blocker. The CEO's learned escalation patterns benefit the entire organization.
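The shared-namespace mechanism can be sketched the same way. Again, this is an assumed design, not TroopX's code: the namespace keys, the `reflect_and_publish` helper, and this `memory_recall` signature are all hypothetical.

```python
from collections import defaultdict

# Hypothetical blackboard with per-agent and shared learning namespaces.
blackboard: dict[str, list[str]] = defaultdict(list)


def reflect_and_publish(agent: str, finding: str, shared: bool = False) -> None:
    """Reflection output lands in the agent's own namespace, and optionally squad-wide."""
    blackboard[f"learnings:{agent}"].append(finding)
    if shared:
        # Cross-cutting patterns go where every agent's recall can see them.
        blackboard["learnings:shared"].append(finding)


def memory_recall(agent: str) -> list[str]:
    # Each agent sees its own learnings plus the squad-wide namespace.
    return blackboard[f"learnings:{agent}"] + blackboard["learnings:shared"]


# QA's review finding is published squad-wide...
reflect_and_publish(
    "qa",
    "Recent QA finding: watch for DDD violations in this package",
    shared=True,
)

# ...so Dev's next recall surfaces it, even though Dev never saw the review.
dev_context = memory_recall("dev")
```

The shared namespace is what turns N agents learning individually into one organization learning N times as fast: a finding is written once and read by every agent whose recall path includes it.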

Building for compounding advantage

If you're building multi-agent systems for production use, intelligence is table stakes. Claude Code, GPT-5, whatever comes next - assume your agents have access to the smartest models. That capability is not your differentiation. Experience is the moat. Agents that remember what happened, why it happened, and what worked are dramatically harder to replicate because the differentiation comes from your specific context compounding over time.

Design your harness for memory from day one. If your agents don't write structured context during workflows - blackboard writes, not just task outputs - you have nothing to extract patterns from. If your workflows don't have post-completion reflection built in, learning doesn't happen. If agents don't recall before starting work, accumulated memory sits unused. The harness architecture determines whether your system can learn.

Shared context amplifies individual learning. When one agent's experience improves the whole system, your advantage compounds faster. Reflection loops that write to shared learning namespaces mean patterns discovered by CEO become available to CTO, Dev, QA. The system's experience accumulates faster than any individual agent's experience.

Let the bottlenecks surface through measurement. My agents escalated six times in one day. They were right to escalate. The bottleneck was me not responding fast enough. That pattern - escalation effectiveness, escalation format, escalation frequency - is now stored. Next time they'll escalate differently, and I'll respond faster because I saw the pattern. You can't optimize bottlenecks you can't measure, and you can't measure them without the harness capturing what's happening.

The question isn't whether AI agents can do useful work anymore. Claude Code already does. The question is whether agents can get systematically better at the specific work you need done, in your context, with your constraints. That's not a model capabilities question. That's a harness engineering question. My CEO and CTO agents are better at their jobs today than they were two weeks ago because they remember what they learned. The models didn't change. The harness made the difference.