Coheso Team
In Part 1, we explored why building for AI agents represents a paradigm shift in software design. But to build effectively for agents, we need to understand what's happening under the hood.
The short version: LLMs are sophisticated "fill in the blank" systems. Understanding this changes how you think about everything from API design to reliability engineering.
The Fill-in-the-Blank Foundation
At their core, large language models do one thing: predict the next word. You give them a sequence of words, and they predict what comes next.
When you say "what a big blue," very few words come to mind: sky, whale, maybe ocean. LLMs capture that intuition about language patterns in a system that makes these predictions at scale, trained on the entire internet and vast libraries of books.
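A minimal sketch of the idea: a hypothetical model assigns probabilities to candidate next words and picks the most likely one. The probabilities below are invented for illustration, not real model outputs.

```python
# Toy illustration of next-word prediction: score candidate continuations
# for a prompt and pick the most likely one.

def predict_next_word(candidates):
    """Return the highest-probability candidate word."""
    return max(candidates, key=candidates.get)

# Hypothetical distribution for the prompt "what a big blue ..."
next_word_probs = {"sky": 0.46, "whale": 0.31, "ocean": 0.14, "toaster": 0.001}

print(predict_next_word(next_word_probs))  # → sky
```

A real model does this over a vocabulary of tens of thousands of tokens, one token at a time, feeding each prediction back in as input for the next.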
Because LLMs are predictive, they can't jump directly to an answer. They have to reason through the problem first, generating all the intermediate language needed before they can predict a "yes" or "no."
This is why "chain of thought" prompting works. When we ask an LLM to think step by step, we're giving it permission to generate the reasoning it needs to arrive at an accurate prediction. The thinking isn't just for show. It's computationally necessary.
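The difference is visible at the prompt level. This sketch contrasts a direct prompt with a chain-of-thought prompt; the prompt wording is illustrative, and `build_*` are hypothetical helpers rather than any library's API.

```python
# Two ways to prompt for a yes/no judgment. The chain-of-thought version
# gives the model room to generate intermediate reasoning tokens before
# committing to an answer.

def build_direct_prompt(question):
    return f"{question}\nAnswer with yes or no."

def build_cot_prompt(question):
    return f"{question}\nThink step by step, then answer with yes or no."

print(build_cot_prompt("Does this clause survive termination?"))
```

The direct prompt forces the model to predict "yes" or "no" immediately; the second lets it generate the reasoning it needs before that prediction.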
The Stateless Reality
Here's something that surprises many people: LLMs are stateless. Every interaction is completely independent. No previous conversation with that model has any impact on the current one.
When you have a conversation with ChatGPT or Claude, it feels like the AI remembers your chat history. But the entire conversation is being sent as context with every single API call. The AI doesn't have memory. We're just sending it the full transcript each time.
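This pattern is easy to see in code. The sketch below uses a stub in place of a real completion API (`call_llm` is hypothetical), but the structure mirrors how chat clients actually work: the client keeps the transcript and resends all of it on every call.

```python
# Why chat feels stateful even though the model isn't: the client owns
# the transcript and ships the whole thing with every request.

def call_llm(messages):
    # Stand-in for a real completion API; just reports how much
    # context it received.
    return f"(reply based on {len(messages)} prior messages)"

transcript = []

def send(user_text):
    transcript.append({"role": "user", "content": user_text})
    reply = call_llm(transcript)  # the full history goes out every time
    transcript.append({"role": "assistant", "content": reply})
    return reply

send("What's in our NDA template?")
send("And the termination clause?")  # this call carries the first turn too
print(len(transcript))  # → 4
```

Delete `transcript` and the "memory" is gone; the model itself never held any of it.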
This is why chat history can sometimes feel imperfect. Behind the scenes, the system:
- Stores your conversations in a database
- Searches for relevant past conversations when you ask a new question
- Pulls relevant chunks and adds them to the context
- Sends all of this to the LLM
It's simulated memory through clever retrieval, not actual recall.
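The steps above can be sketched as a tiny retrieval pipeline. A production system would rank past conversations with embeddings; to keep this self-contained, the sketch scores relevance by naive word overlap, which is an illustrative simplification.

```python
# Simulated memory: store past conversations, retrieve the most relevant
# one, and prepend it to the new question as context.

def relevance(query, text):
    # Crude stand-in for semantic similarity: count shared words.
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t)

def build_context(query, past_conversations, top_k=1):
    ranked = sorted(past_conversations,
                    key=lambda c: relevance(query, c), reverse=True)
    retrieved = ranked[:top_k]
    return "\n".join(retrieved) + "\n\nQuestion: " + query

past = [
    "User asked about vacation policy; we said 20 days per year.",
    "User asked about the NDA template location.",
]
print(build_context("how many vacation days do I get?", past))
```

The model never "remembers" the earlier exchange; the retrieval layer puts it back in front of the model on each call.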
The Context Window Constraint
If more context leads to better answers, why not just send everything? Because we're compute-limited.
The "fill in the blank" operation involves massive matrix multiplications. These are expensive. To manage costs and computational load, LLMs have context windows: hard limits on how much text you can send in a single interaction.
A 100k token context window sounds like a lot until you're trying to answer questions across a document repository with thousands of contracts. You can't just dump everything in. You need to figure out what's relevant and send only that.
This constraint shapes everything about how we build AI-powered products. We need retrieval systems, relevance scoring, and careful context engineering, all to work within these limits.
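The core move is a budgeting problem: rank candidate passages and pack the best ones under the token limit. This sketch approximates token counts with word counts to stay dependency-free; a real system would use the model's actual tokenizer.

```python
# Context budgeting: greedily pack the highest-relevance passages
# that fit within a token budget.

def pack_context(passages, budget_tokens):
    """passages: list of (score, text); returns the texts that fit."""
    chosen, used = [], 0
    for score, text in sorted(passages, key=lambda p: p[0], reverse=True):
        cost = len(text.split())  # crude token estimate
        if used + cost <= budget_tokens:
            chosen.append(text)
            used += cost
    return chosen

passages = [
    (0.9, "Termination requires 30 days written notice."),
    (0.4, "The company picnic is in June."),
    (0.8, "Either party may terminate for material breach."),
]
print(pack_context(passages, budget_tokens=15))
```

The low-relevance passage gets dropped, not because it's wrong, but because the budget runs out before it's worth including.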
From Workflows to Agents
Understanding these fundamentals helps explain the evolution from LLM workflows to AI agents.
The Workflow Approach (Still Common Today)
In the early days of LLM applications, humans defined every step:
- When a question comes in, use the LLM to condense it
- Search the document index for relevant passages
- Fetch the top results
- Send those passages plus the question to the LLM
- Generate an answer with citations
This is reliable because we've defined the flow. The LLM just handles small, specific tasks within a structure we control. If it's a policy question, run the policy workflow. If it's a contract question, run the contract workflow.
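The five steps above can be sketched as a fixed pipeline. Every helper here is a hypothetical stub standing in for a real LLM call or search index; the point is that the control flow is written by us, not chosen by the model.

```python
# A fixed LLM workflow: humans define the steps, the model fills in
# small pieces inside that structure.

def condense(question):
    return question.lower().rstrip("?")  # stand-in for an LLM rewrite

def search_index(query, index):
    # Stand-in for a document search: match any query word.
    return [doc for doc in index if any(w in doc.lower() for w in query.split())]

def answer_with_citations(question, passages):
    cites = "; ".join(f"[{i+1}] {p}" for i, p in enumerate(passages))
    return f"Answer to '{question}' based on: {cites}"

def qa_workflow(question, index):
    query = condense(question)                        # step 1: condense
    passages = search_index(query, index)[:3]         # steps 2-3: search, fetch top
    return answer_with_citations(question, passages)  # steps 4-5: answer + cite

index = ["Contracts must be renewed annually.", "The office closes at 6pm."]
print(qa_workflow("When must contracts be renewed?", index))
```

Because the sequence is hard-coded, the workflow behaves the same way on every run; the model's only job is the small transformation inside each step.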
The Agentic Approach
With agents, instead of defining complete workflows, we give the LLM tools and let it plan the flow dynamically.
We might provide:
- A tool to fetch relevant documents
- A tool to extract specific passages
- A tool to search across contracts
- A tool to generate summaries
Then we tell the agent: "Answer this question." It decides which tools to use and in what order.
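Structurally, that's a loop: the model picks a tool, we execute it, and the result goes back to the model until it decides to answer. In this sketch the model's decision is a hard-coded `plan` stub and the tool names are hypothetical; in a real agent the LLM makes that choice each turn.

```python
# Sketch of an agentic loop: pick a tool, run it, repeat until the
# planner decides it's ready to answer.

def fetch_documents(q):
    return ["Doc: termination requires 30 days written notice."]

def summarize(docs):
    return "Summary: " + " ".join(docs)

TOOLS = {"fetch_documents": fetch_documents, "summarize": summarize}

def plan(question, scratchpad):
    # Stand-in for the LLM's decision about what to do next.
    if not scratchpad:
        return ("fetch_documents", question)
    if len(scratchpad) == 1:
        return ("summarize", scratchpad[0])
    return ("answer", scratchpad[-1])

def run_agent(question):
    scratchpad = []
    while True:
        tool, arg = plan(question, scratchpad)
        if tool == "answer":
            return arg
        scratchpad.append(TOOLS[tool](arg))

print(run_agent("What notice is required to terminate?"))
```

Swap the stub `plan` for a real LLM call and the flow is no longer fixed: the model might fetch twice, skip the summary, or answer immediately, which is exactly where the flexibility and the risk both come from.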
The upside is flexibility. The agent can handle novel situations that our predefined workflows never anticipated.
The downside is reliability. The agent might:
- Skip steps it should take
- Use tools in the wrong order
- Answer from "common sense" instead of using provided tools
- Make different decisions on similar questions
That last failure mode is especially common: the agent answers from its own training data when it should be grounding its response in the tools we've provided. It should use those tools, but it doesn't always.
The Reliability Problem
This brings us to the core tension in agent design: every additional decision point is a chance for error.
If an agent needs to:
- Recognize it needs to fetch documents
- Call the right document retrieval tool
- Parse the results correctly
- Decide if it needs more context
- Synthesize an answer
...then errors can compound at each step. Even if each individual decision is 95% accurate, five decisions means roughly 77% overall accuracy. That's not good enough.
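The arithmetic behind that figure: with independent decision points, per-step accuracy compounds multiplicatively.

```python
# Compounding error: per-step accuracy p over n independent decision
# points gives roughly p**n end-to-end accuracy.

def end_to_end_accuracy(p, n):
    return p ** n

print(round(end_to_end_accuracy(0.95, 5), 2))  # → 0.77
```

The same math shows why reducing decision points helps so much: at three steps instead of five, 95% per-step accuracy yields about 86% end to end.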
Consider the microwave analogy: if your microwave failed one in twenty times, you'd stop using it. We expect near-perfect reliability from our tools. AI agents need to meet that bar to achieve real adoption.
This reliability problem is driving a fundamental rethink of how we design AI systems, which we'll explore in Part 3.
This is Part 2 of a 3-part series on Building for AI. In Part 3, we'll explore how to design APIs that match human intentions, and why that's the key to reliable agents.
