
Your Agent Forgets Everything After Turn 9
Here is the number that should terrify every team building agentic products: LLMs exhibit a 39% average performance drop in multi-turn conversations versus single-turn interactions.
That finding comes from Microsoft Research and Salesforce Research, which tested every major LLM (GPT-4o, Claude, Gemini, DeepSeek) and found the same pattern: single-turn accuracy of 95% collapses in real multi-turn flows, with failure emerging after an average of 9.21 turns.
The three failure modes are consistent across all models:
- Premature assumption: The agent jumps to conclusions based on partial context instead of asking clarifying questions
- Prior response anchoring: The agent over-relies on its own previous answers, compounding early errors
- Correction failure: When users try to correct the agent, it cannot properly integrate the correction into its reasoning
Reasoning models like o3 and DeepSeek-R1 degrade equally. Known remediations (agent-like concatenation, lower temperature) are ineffective. This is not a model problem. It is an architecture problem.
The Multi-Turn Problem in Numbers
Before designing solutions, understand the scale of what breaks when conversations go beyond a single exchange.
| Metric | What it measures | Source |
|---|---|---|
| 39% | Average performance drop, multi-turn vs single-turn | Microsoft Research + Salesforce, 2025 |
| 9.21 | Average turns before a failure mode triggers | Psychology Today, Feb 2026 |
| 40%+ | Agentic AI projects predicted to be canceled by end of 2027 | Gartner, June 2025 |
| 1m 38s | Average chatbot-only conversation length | Fullview, 2025 |
The Gartner prediction is stark: over 40% of agentic AI projects will be canceled by end of 2027 due to escalating costs, unclear value, or inadequate risk controls. Multi-turn failure is a primary driver. When your agent loses context at turn 9, every subsequent interaction costs more (in tokens, latency, and user trust) while delivering less.
Meanwhile, the average chatbot-only conversation lasts 1 minute 38 seconds. With live chat handover, it extends to 15 minutes 21 seconds. Users abandon agents that lose context. They stay with systems that remember.
The Conversation State Machine
Every production agent system, whether it is Salesforce Agentforce, Microsoft Copilot Studio, or a custom build, follows the same fundamental architecture: a perception, reasoning, action, observation loop that runs until a stop condition is met.
| Phase | What it does | Production anchor |
|---|---|---|
| Perception | Interpret user input in full conversation context | Salesforce Atlas Reasoning Engine |
| Reasoning | Plan the response using tools, knowledge, and history | Copilot Studio generative orchestration |
| Action | Execute the chosen action, gated by trust patterns | Intent Preview from Part 2 |
| Observation | Evaluate the result; decide next step or stop | State checkpoint + Action Audit |

The loop continues until the task is complete, the user exits, or an escalation trigger fires.
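The loop can be sketched in a few lines. This is a hedged illustration, not any vendor's API: every function and class name here (AgentState, perceive, reason, act, observe, run_loop) is invented for the example, and each phase is stubbed.

```python
# Minimal sketch of the perception -> reasoning -> action -> observation
# loop. All names are illustrative, not a vendor API; phases are stubbed.
from dataclasses import dataclass, field

@dataclass
class AgentState:
    # Checkpoint log: one entry per turn (plan + result).
    history: list[dict] = field(default_factory=list)

def perceive(state: AgentState, user_input: str) -> dict:
    # Interpret the new message against accumulated history (stubbed).
    return {"input": user_input, "prior_turns": len(state.history)}

def reason(state: AgentState, perception: dict) -> str:
    # Plan an action from tools, knowledge, and history (stubbed).
    return f"respond_to:{perception['input']}"

def act(plan: str) -> str:
    # Execute the chosen action; in production this is where trust
    # patterns such as Intent Preview gate the call.
    return f"executed {plan}"

def observe(state: AgentState, plan: str, result: str) -> bool:
    # Checkpoint the turn, then evaluate stop conditions.
    state.history.append({"plan": plan, "result": result})
    return "exit" in plan or "escalate" in result

def run_loop(state: AgentState, inputs: list[str], max_turns: int = 50) -> AgentState:
    for user_input in inputs[:max_turns]:
        perception = perceive(state, user_input)
        plan = reason(state, perception)
        result = act(plan)
        if observe(state, plan, result):
            break
    return state

state = run_loop(AgentState(), ["hello", "please exit"])
# two turns logged; the loop stops on the user's exit
```

The important structural point is that observation writes a checkpoint before deciding whether to continue, so every turn leaves an auditable trace.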
Perception: Reading the Full Signal
Perception is not just parsing the user's latest message. It is interpreting that message in the context of everything that came before: previous turns, user profile data, active task state, and environmental signals (time of day, device, location).
The design challenge is deciding what context to include. Too little context and the agent makes premature assumptions (failure mode 1). Too much context and you hit token limits, increasing latency and cost.
Production pattern: Salesforce Agentforce solves this with hybrid reasoning, combining deterministic workflows (scripted paths for known scenarios) with flexible LLM reasoning via the Atlas Reasoning Engine. The script handles structured context; the LLM handles ambiguity.
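One way to make the too-little/too-much tradeoff explicit is a priority-ordered context budget: include the most important pieces first and skip anything that would blow the token budget. The sketch below is an illustrative assumption, not a platform feature; the 4-characters-per-token estimate and the priority ordering are placeholders.

```python
# Hedged sketch: assemble perception context under a token budget,
# highest-priority pieces first. The chars-per-token heuristic and the
# priority values are illustrative assumptions, not a real tokenizer.
def estimate_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic

def build_context(pieces: list[tuple[int, str]], budget: int) -> list[str]:
    """pieces: (priority, text) pairs; lower number = more important."""
    selected, used = [], 0
    for _, text in sorted(pieces, key=lambda p: p[0]):
        cost = estimate_tokens(text)
        if used + cost > budget:
            continue  # skip pieces that do not fit; keep trying smaller ones
        selected.append(text)
        used += cost
    return selected

context = build_context(
    [(0, "Current goal: refund order #4521"),
     (1, "Last user message: 'it still has not arrived'"),
     (2, "Rolling summary of turns 1-8"),
     (3, "Full transcript of turns 1-8 ...")],
    budget=30,
)
# goal, last message, and summary fit; the raw transcript is dropped
```

The deterministic pieces (goal, entities) get top priority; the raw transcript is the first thing sacrificed, which mirrors the hybrid split between scripted structure and LLM ambiguity handling.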
Reasoning: Planning Before Acting
The reasoning phase is where the agent decides what to do. In single-turn interactions, this is straightforward. In multi-turn flows, reasoning must account for accumulated context, unresolved threads, and potential contradictions between earlier and later user statements.
Production pattern: Microsoft Copilot Studio's generative orchestration uses an LLM planning layer that interprets intent, breaks down complex requests into steps, selects appropriate tools, and executes multi-step plans with guardrails. The planner operates at a higher abstraction level than the executor.
Action: Executing with Reversibility
The action phase is where the agent changes the world: sending an email, updating a record, making an API call. In Part 2, we covered the Intent Preview pattern that gates actions through user approval. In multi-turn flows, the additional challenge is ensuring that actions from turn 3 remain coherent with context established in turn 1.
Observation: Closing the Loop
After acting, the agent observes the result and decides: is the task complete, should I continue, or should I escalate? This phase is where prior response anchoring (failure mode 2) causes the most damage. The agent observes its own action and over-indexes on it in subsequent reasoning.
Design principle: Treat each observation as a checkpoint. Log the state, the action taken, and the result. This log becomes the Action Audit trail from Part 2 and the foundation for state recovery (covered below).
Context Persistence: The Three Memory Layers
The difference between an agent that works for one turn and an agent that works for fifty turns is memory architecture. Production systems need three distinct memory layers, each serving a different purpose.
| Layer | Scope | Purpose | Production anchor |
|---|---|---|---|
| Short-term memory | Active session | Structured scratchpad for the current conversation | Google Agent Engine Sessions |
| Long-term memory | Cross-session | User profiles and historical patterns that persist | Amazon Bedrock AgentCore Memory |
| Episodic memory | Semantic retrieval | Retrievable past interactions organized by topic | Google Memory Bank |
Short-Term Memory: The Working Scratchpad
Short-term memory holds the active conversation state: what the user said, what the agent planned, what actions were taken, and what remains unresolved. This is the context window in its most direct form.
The problem: Naive implementations dump the entire conversation transcript into the context window. As GetMaxim's research on context management shows, naive truncation (dropping oldest messages) discards still-relevant information, creating experiences where agents "forget" previously discussed topics even when the conversation is well within token limits.
Production pattern: Instead of raw transcript, maintain a structured scratchpad: current goal, active entities, unresolved questions, and a rolling summary of prior turns. This is what Google's Agent Engine implements through its Sessions feature (now GA), keeping working context separate from raw conversation history.
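A structured scratchpad of this kind can be sketched as a small dataclass. The field names, the keep-the-last-N-turns policy, and the naive summarization (truncating displaced turns) are illustrative assumptions, not the Sessions API.

```python
# Sketch of a structured short-term scratchpad: goal, entities, unresolved
# questions, and a rolling summary instead of a raw transcript. Field names
# and the crude fold-into-summary step are illustrative assumptions.
from dataclasses import dataclass, field

@dataclass
class Scratchpad:
    current_goal: str = ""
    active_entities: dict[str, str] = field(default_factory=dict)
    unresolved: list[str] = field(default_factory=list)
    rolling_summary: str = ""
    recent_turns: list[str] = field(default_factory=list)
    keep_verbatim: int = 4  # only the last N turns stay as raw text

    def add_turn(self, turn: str) -> None:
        self.recent_turns.append(turn)
        while len(self.recent_turns) > self.keep_verbatim:
            oldest = self.recent_turns.pop(0)
            # Fold the displaced turn into the summary instead of dropping
            # it, so the agent does not "forget" still-relevant topics.
            self.rolling_summary += f" {oldest[:60]}"

pad = Scratchpad(current_goal="schedule onboarding call")
for i in range(6):
    pad.add_turn(f"turn {i}: ...")
# recent_turns now holds turns 2-5; turns 0-1 live in rolling_summary
```

The contrast with naive truncation is the `while` branch: old turns are compressed rather than discarded, so nothing relevant silently disappears.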
Long-Term Memory: The User Profile
Long-term memory persists across sessions. It stores user preferences, historical patterns, past decisions, and relationship context. When a user returns after a week, the agent should know their communication style, their project context, and their trust level (from the progressive autonomy model in Part 2).
Production pattern: Amazon Bedrock AgentCore Memory provides a managed memory service that eliminates the infrastructure complexity of building cross-session persistence. It separates storage from retrieval, letting agents access long-term context without manual state management.
Episodic Memory: The Retrievable Past
Episodic memory is the most sophisticated layer. It enables the agent to semantically search past interactions for relevant precedents. "Last time you asked about pricing, I recommended the enterprise tier based on your team size. Has that changed?"
This is not keyword search. It is semantic retrieval that understands the meaning of past conversations and surfaces relevant episodes based on the current context.
Production pattern: Google's Memory Bank uses a topic-based approach where agents organize and recall past interactions by subject matter rather than chronology. This means an agent can pull relevant context from a conversation that happened three months ago if the topic matches.
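The retrieval step can be illustrated with a toy scorer. Production systems use learned embedding vectors; plain word-count cosine similarity stands in here so the sketch stays dependency-free. The episode texts are invented for the example.

```python
# Toy sketch of episodic retrieval: score stored episodes against the
# current query by cosine similarity over word counts. Real systems use
# embeddings; word overlap is a dependency-free stand-in. Episodes are
# invented illustrations.
import math
import re
from collections import Counter

def tokenize(text: str) -> Counter:
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def retrieve(episodes: list[str], query: str) -> str:
    qv = tokenize(query)
    return max(episodes, key=lambda e: cosine(tokenize(e), qv))

episodes = [
    "user asked about pricing; recommended enterprise tier for a 40-person team",
    "user reported a login bug on the mobile app",
    "user requested a demo of the analytics dashboard",
]
best = retrieve(episodes, "what pricing tier fits my team now?")
# the pricing episode wins despite sharing no exact sentence with the query
```

Swapping `tokenize` for an embedding model turns this into genuine semantic retrieval; the ranking structure stays the same.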
Handoff Topologies: How Agents Pass Conversations
When a single agent cannot handle a conversation alone (it needs specialized knowledge, human judgment, or simply reaches its capability boundary), the conversation must transfer. The architecture of that transfer determines whether the user experiences a seamless continuation or a frustrating restart.
| Topology | Routing model | Production anchor |
|---|---|---|
| Hub-and-Spoke | Central orchestrator routes all conversations | Microsoft Copilot Studio |
| Mesh | Agents communicate directly with peers | Anthropic MCP |
| Swarm | Lightweight peer-to-peer handoffs | OpenAI Agents SDK |

Whatever the topology, every transfer carries the handoff contract: conversation summary + active entities + current goal + trust level + action history.
Hub-and-Spoke: The Central Orchestrator
A central orchestrator agent receives all conversations and routes them to specialized agents or human operators. The hub maintains the master context and ensures nothing is lost in transit.
Strengths: Complete visibility, consistent context management, easy to audit and monitor. Weaknesses: Single point of failure, potential bottleneck at scale, latency for every routing decision.
Production example: Microsoft Copilot Studio's multi-agent orchestration uses this pattern. A primary agent delegates tasks to specialized agents across Azure AI Agents Service, Microsoft Fabric, and M365. The orchestrator maintains the conversation thread.
Mesh: Direct Agent-to-Agent
In a mesh topology, agents communicate directly with each other without a central coordinator. Each agent knows which peers can handle which capabilities and routes conversations accordingly.
Strengths: No single point of failure, lower latency for direct handoffs, scales horizontally. Weaknesses: Harder to maintain consistent context across the mesh, more complex to audit, potential for circular routing.
Production example: Anthropic's MCP (Model Context Protocol), now under the Linux Foundation, enables this pattern by providing a standardized protocol for agent-to-agent communication and tool use. Agents track inputs, tool outputs, and intermediate states across interactions through the protocol rather than a central hub.
Swarm: Lightweight Peer Handoffs
The swarm pattern uses minimal-overhead handoffs between peer agents. There is no central orchestrator and no complex routing logic. An agent that recognizes it cannot handle a request simply transfers the conversation (with full context) to the appropriate peer.
Strengths: Minimal overhead, fastest handoff latency, simple mental model for developers. Weaknesses: Limited global visibility, context preservation depends on each agent implementing the protocol correctly.
Production example: OpenAI's Agents SDK (released March 2025, production successor to Swarm) implements this as a core primitive. A TriageAgent analyzes the query and calls transfer_to_support() or transfer_to_sales(), passing the full conversation context in the handoff.
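The swarm handoff shape can be sketched without the SDK. This is not the Agents SDK API: the `Agent`, `transfer_to`, and `triage` names, and the keyword-based routing, are illustrative assumptions.

```python
# Sketch of a swarm-style handoff: a triage step transfers the whole
# conversation to a peer agent in one call. All names and the keyword
# routing are illustrative, not the OpenAI Agents SDK's own primitives.
from dataclasses import dataclass

@dataclass
class Agent:
    name: str
    handles: set[str]  # topics this agent can serve

    def respond(self, conversation: list[str]) -> str:
        return f"{self.name} handling: {conversation[-1]}"

def transfer_to(target: Agent, conversation: list[str]) -> str:
    # The full conversation travels with the handoff; the user is
    # never asked to repeat anything.
    return target.respond(conversation)

def triage(message: str, conversation: list[str], peers: list[Agent]) -> str:
    conversation.append(message)
    topic = "sales" if "pricing" in message else "support"
    target = next(a for a in peers if topic in a.handles)
    return transfer_to(target, conversation)

peers = [Agent("SupportAgent", {"support"}), Agent("SalesAgent", {"sales"})]
reply = triage("what is your pricing for 50 seats?", [], peers)
# the sales peer answers with the full conversation in hand
```

The swarm weakness noted above is visible here: context preservation depends entirely on `transfer_to` passing the whole conversation, with no central hub to enforce it.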
| Topology | Control | Resilience | Latency | Best For |
|---|---|---|---|---|
| Hub-and-Spoke | Highest | Lowest (single point of failure) | Higher | Regulated industries, audit requirements |
| Mesh | Medium | Highest (no single point of failure) | Medium | Complex multi-domain systems, resilience-critical |
| Swarm | Lowest | Medium | Lowest | Speed-critical, developer-friendly, rapid iteration |
The Handoff Contract
Regardless of topology, every handoff must include what we call the handoff contract: the minimum context that must transfer for the receiving agent (or human) to continue without asking the user to repeat anything.
The contract includes:
- Conversation summary: What has been discussed, decided, and left unresolved
- Active entities: People, products, dates, and reference numbers mentioned
- Current goal: What the user is trying to accomplish
- Trust level: The user's current autonomy settings and trust phase (from Part 2)
- Action history: What the transferring agent already tried
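The contract can be made concrete as a structured payload. The field names mirror the five elements above; the `to_prompt()` rendering and the example values are illustrative assumptions.

```python
# The handoff contract as a structured payload. Fields mirror the five
# contract elements; to_prompt() and the sample values are illustrative.
from dataclasses import dataclass

@dataclass
class HandoffContract:
    conversation_summary: str
    active_entities: dict[str, str]
    current_goal: str
    trust_level: str          # e.g. "supervised" (Part 2 trust phases)
    action_history: list[str]

    def to_prompt(self) -> str:
        # Render the contract for the receiving agent's context window.
        entities = ", ".join(f"{k}={v}" for k, v in self.active_entities.items())
        return (f"Summary: {self.conversation_summary}\n"
                f"Entities: {entities}\n"
                f"Goal: {self.current_goal}\n"
                f"Trust: {self.trust_level}\n"
                f"Tried: {'; '.join(self.action_history)}")

contract = HandoffContract(
    conversation_summary="Customer wants refund for order #4521; courier lost parcel.",
    active_entities={"order": "#4521", "customer": "J. Rivera"},
    current_goal="issue refund",
    trust_level="supervised",
    action_history=["checked tracking", "confirmed loss with courier"],
)
```

Typing the contract this way makes a broken handoff a validation error at transfer time rather than a confused agent three turns later.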
Production example: Intercom Fin implements automatic handoff when it detects frustration, repeated loops, or explicit human request. Internal notes and the context panel carry the full conversation history to human agents. The user never has to repeat their issue.
State Recovery: When Context Breaks
Context will degrade. Conversations will exceed limits. Agents will accumulate errors that compound across turns. The question is not whether state recovery is needed but how elegantly the system handles it.
The Context Degradation Problem
Research on multi-turn reliability shows three mechanisms by which context degrades:
- Token overflow: The conversation exceeds the model's context window, forcing truncation
- Error compounding: Small inaccuracies in early turns become large errors by turn 10+
- Intent drift: The agent gradually shifts its understanding of what the user wants, anchoring on its own prior responses rather than the user's actual corrections
Four Recovery Strategies
| Strategy | When to Use | How It Works | Production Example |
|---|---|---|---|
| Conversation Summarization | Approaching token limits | Compress older turns into structured summaries preserving key entities and decisions | Amazon Bedrock executionId state tracking |
| State Checkpointing | Before irreversible actions | Save full conversation state at decision points, enabling rollback | Salesforce Agentforce Testing Center replay |
| Context Rewind | After detecting polluted context | Rewind to a known-good conversation point, discarding contaminated turns | Google Vertex AI Agent Builder rewind feature |
| Subagent Spawning | Context limits approaching | Spawn fresh agent with clean context, transfer structured summary (not raw transcript) | Anthropic MCP subagent spawning pattern |
Context Rewind deserves special attention. Google Vertex AI Agent Builder now supports rewinding to any conversation point to remove "polluted" context. When the agent detects that accumulated errors are degrading performance, it can roll back to a checkpoint where the context was still clean, then replay only the essential information.
This is the conversation equivalent of the undo pattern from Part 2's Action Audit, but applied to the agent's own reasoning state rather than its external actions.
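The checkpoint-and-rewind mechanic can be sketched in a few lines. This is a generic illustration, not Vertex AI's implementation; the class and method names are invented, and deciding when context is "polluted" is left to the caller.

```python
# Sketch of state checkpointing with context rewind: snapshot at decision
# points, roll back to the last known-good snapshot when context is judged
# polluted. Names are illustrative; pollution detection is out of scope.
import copy

class ConversationState:
    def __init__(self) -> None:
        self.turns: list[str] = []
        self._checkpoints: list[list[str]] = []

    def add_turn(self, turn: str) -> None:
        self.turns.append(turn)

    def checkpoint(self) -> None:
        # Deep-copy so later turns cannot mutate the saved snapshot.
        self._checkpoints.append(copy.deepcopy(self.turns))

    def rewind(self) -> None:
        # Discard contaminated turns; restore the last clean snapshot.
        if self._checkpoints:
            self.turns = self._checkpoints.pop()

state = ConversationState()
state.add_turn("user: book a flight to Lisbon")
state.checkpoint()                              # known-good point
state.add_turn("agent: booked flight to Lison") # error begins compounding
state.add_turn("agent: confirming Lison itinerary")
state.rewind()                                  # back to the clean context
# state.turns again holds only the original user request
```

Checkpointing before irreversible actions (the second strategy in the table) uses the same snapshot primitive; rewind just chooses when to restore it.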
Designing for Graceful Degradation
Not every conversation needs perfect context for 50 turns. The key design question is: what is the minimum context required for each turn to be useful?
For a customer support agent, the minimum context might be: the customer's name, their issue category, and the last action taken. For a coding assistant, it might be: the current file, the task description, and the last three changes. For a sales agent, it might be: the prospect's company, their stage in the pipeline, and their stated objections.
Design your context persistence layer around these minimum viable contexts rather than trying to maintain the entire conversation history indefinitely.
Building Conversation Flows: The Implementation Stack
Here is how the pieces fit together in a production system. Each layer builds on the one below it:
| Layer | Purpose | Components |
|---|---|---|
| 1. State Machine | Core interaction loop | Perception, Reasoning, Action, Observation cycle with stop conditions |
| 2. Memory | Context persistence | Short-term scratchpad + long-term profiles + episodic retrieval |
| 3. Handoff | Agent-to-agent/human transfer | Topology selection + handoff contract + context packaging |
| 4. Recovery | Graceful degradation | Summarization + checkpointing + rewind + subagent spawning |
| 5. Trust Integration | User control layer | Intent Preview + Autonomy Dial + Escalation Pathway (from Part 2) |
The trust patterns from Part 2 sit on top of the conversation architecture, not alongside it. Intent Preview operates at the Action phase of the state machine. The Autonomy Dial configures how much of the Reasoning phase the user sees. Escalation Pathway is a specialized handoff from agent to human. Trust and conversation architecture are the same system at different levels of abstraction.
What the Major Platforms Get Right (and Wrong)
Salesforce Agentforce
Gets right: Hybrid reasoning (deterministic + LLM), the Testing Center that replays full conversations turn by turn with cumulative history. This is how you test multi-turn flows: by replaying real conversations and verifying that context persists correctly at each turn.
Watch out for: Tight coupling to the Salesforce ecosystem. If your conversation flows span multiple platforms, the context persistence model may not extend cleanly.
Microsoft Copilot Studio
Gets right: Generative orchestration that separates planning from execution. Multi-agent orchestration across Azure services. The hub-and-spoke model provides excellent auditability.
Watch out for: The central orchestrator can become a bottleneck. For high-throughput scenarios, the routing latency adds up.
Google Vertex AI Agent Builder
Gets right: Memory architecture (both short-term and long-term memory now GA). Context rewind is a category-defining feature. The Agent Designer provides a visual canvas for orchestrating agent/subagent flows.
Watch out for: The visual canvas can mask complexity. Multi-turn flows that look simple in the designer can have subtle state management issues that only surface in production.
OpenAI Agents SDK
Gets right: Simplicity. The handoff primitive is elegant: one function call transfers the entire conversation context. Client-side framework gives full control over orchestration and state.
Watch out for: Minimal built-in memory management. You own the context persistence layer entirely, which is both power and responsibility.
Conversation Architecture Is the Product
The models will get better at multi-turn. Context windows will expand. But the fundamental architectural challenges of memory, handoff, and state recovery are not model problems. They are design problems that require deliberate engineering.
The teams that build robust conversation architecture now will own the experience layer of agentic AI. The ones that rely on larger context windows to paper over architectural gaps will keep hitting the same 39% degradation wall, just at turn 19 instead of turn 9.
Ready to architect your agent conversation flows?
- Full-Stack AI Services - Conversation architecture for production agent systems
- Read Part 2: Trust Design Patterns - The six patterns that keep humans in control
- Contact Us - Start your conversation flow audit
