What are AI agents? Simon Willison crowdsourced a lot of definitions that focus on 1) taking an action on the user’s behalf outside the LLM and 2) using the LLM to do complex loops and if statements.

In essence, we turn the LLM into the engine of a Turing machine: it manages state and memory, changes state based on what is in memory (if statements), and uses external tools like APIs, databases, and computer use to get information and perform actions.

Here is a roadmap of agentic patterns to learn, and resources for learning them (work in progress):

  • Role / Instruction Prompting – Craft concise system‑, role‑ and user‑level instructions so the LLM answers properly and in the right format. This is the entry‑level skill that underpins every other pattern. Practical ChatGPT Prompting: 15 Patterns to Improve Your Prompts
    • C-L-E-A-R
      • Contextualize - Specify a role or persona: “You are a copy editor with years of experience polishing articles and blog posts for web publication.”
      • Limits - Length; format, like three bullet points; tone or style, like concise or like a tech journalist; or scope, like only using facts from this text.
      • Elaborate and give Examples - Explain and provide as much detail and specifics as possible. Use chain of thought and other advanced prompting methods.
      • Audience - Identify the audience the response is addressed to, such as ‘explain like I’m 5’.
      • Reflect or Review - Ask ChatGPT to ask clarifying questions before answering, or give itself space to check its work, such as “think step by step” or “make sure of x before answering”.
    • P-R-E-P-A-R-E-D is another.
    • Or T-C-E-P-F-T. Use what resonates with you.
    • These days, you don’t need to spend hours learning prompt engineering: think about your intention, take a first crack using one of the above frameworks, and then ask your favorite LLM to improve it. (A minimal C-L-E-A-R prompt sketch in code follows this list.)
    • Side quest - proper evals, and prompt optimization with tools like DSPy.
    • Good prompting and good evals are foundational skills. If you have good evals, you will eventually have good prompts and outputs through iteration. If you don’t have good evals, changes in underlying LLMs and assumptions will break your prompts and agents.
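
A minimal sketch of what a C-L-E-A-R-style system prompt might look like in code, using the OpenAI Python SDK. The model name and every line of prompt wording are illustrative assumptions, not a recommended recipe.

```python
# Hypothetical C-L-E-A-R prompt sketch using the OpenAI Python SDK.
from openai import OpenAI

client = OpenAI()

system_prompt = "\n".join([
    # Contextualize: give the model a role or persona
    "You are a copy editor with years of experience polishing articles for web publication.",
    # Limits: length, format, tone, and source constraints
    "Return exactly three bullet points, concisely, using only facts from the provided text.",
    # Elaborate: spell out what a good answer looks like
    "For each bullet, name the specific edit and why it improves readability.",
    # Audience: who the response is for
    "Write for a general web audience with no editing background.",
    # Reflect: give the model room to check itself
    "If the request is ambiguous, ask a clarifying question first; think step by step.",
])

response = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model name
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": "Please edit this paragraph: ..."},
    ],
)
print(response.choices[0].message.content)
```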

  • Tool Use – Expose a catalog of external APIs and let the LLM decide which expert tool to call at each step (search, calculator, SQL, controlling a browser or a Python interpreter through computer use, etc.). Microsoft: Tool Use Design Pattern
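
A hedged sketch of the tool-use loop with the OpenAI chat API: the model decides whether to call a tool, our code executes it and feeds the observation back. The `calculator` tool, its schema, and the model name are illustrative assumptions.

```python
import json
from openai import OpenAI

client = OpenAI()

# Hypothetical tool catalog: one calculator exposed via a JSON schema.
tools = [{
    "type": "function",
    "function": {
        "name": "calculator",
        "description": "Evaluate a basic arithmetic expression.",
        "parameters": {
            "type": "object",
            "properties": {"expression": {"type": "string"}},
            "required": ["expression"],
        },
    },
}]

messages = [{"role": "user", "content": "What is 23 * 48?"}]
response = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)

# A robust agent would loop and check whether the model chose to call a tool at all.
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)
result = eval(args["expression"], {"__builtins__": {}})  # toy executor; never eval untrusted input in production

# Return the observation so the model can compose the final answer.
messages += [response.choices[0].message,
             {"role": "tool", "tool_call_id": call.id, "content": str(result)}]
final = client.chat.completions.create(model="gpt-4o-mini", messages=messages, tools=tools)
print(final.choices[0].message.content)
```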

  • Basic RAG – Give the agent documents and a tool (such as a vector database) to find relevant parts of the documents and respond using them via in-context learning (stuffing the prompt with data to ground the answer, examples etc.).
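
A minimal RAG sketch: a toy in-memory “index”, a retrieval step, and the retrieved text stuffed into the prompt. A real system would use a vector database, embeddings, and proper chunking; the documents, model name, and retrieval heuristic here are illustrative.

```python
from openai import OpenAI

client = OpenAI()

documents = [
    "Our refund policy allows returns within 30 days of purchase.",
    "Shipping to Canada takes 5-7 business days.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    # Toy retrieval: rank documents by word overlap with the query.
    score = lambda doc: len(set(query.lower().split()) & set(doc.lower().split()))
    return sorted(documents, key=score, reverse=True)[:k]

question = "How long do I have to return an item?"
context = "\n".join(retrieve(question))

answer = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": "Answer using only the provided context."},
        {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
    ],
)
print(answer.choices[0].message.content)
```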

With these 3 components you can build highly capable single-turn OpenAI Assistants or Custom GPTs. However, these frameworks (and their equivalents on other platforms) have limitations around multi-turn workflows, tools, and sometimes which models you can use. To get to true agents, we want highly customizable multi-turn workflows that may use many different models, tools and sub-agents.

  • Chain‑of‑Thought Prompting – Elicit step‑by‑step reasoning traces that make the model’s logic explicit and usually boost accuracy on math, logic and multi‑hop tasks. Kind of like ‘rubber duck debugging’: telling the model to explain what it’s doing as it does it forces it to think and improves performance. Paper: Wei et al.
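
The whole pattern fits in a single instruction; here is a sketch of what that instruction might look like (the wording and model name are illustrative):

```python
from openai import OpenAI

client = OpenAI()
response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system",
         "content": "Think step by step. Show your reasoning, then give the final answer on a line starting with 'Answer:'."},
        {"role": "user",
         "content": "A train leaves at 3:40pm and the trip takes 2h 35m. When does it arrive?"},
    ],
)
print(response.choices[0].message.content)
```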

  • ReAct (Reason + Act) Loops – Interleave “Thought → Action → Observation” so the agent both reasons and calls tools (search, code, DB) in the same dialog, allowing complex chains of thoughts and actions. Paper: Yao et al. Matt Webb Simon Willison
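
A bare-bones ReAct-style driver loop, sketched under the assumption of a single hypothetical `search` tool and a plain-text Thought/Action/Observation protocol; frameworks like LangGraph or the OpenAI Agents SDK handle this parsing and looping for you.

```python
from openai import OpenAI

client = OpenAI()

SYSTEM = ("Answer the question. You may use the tool search[query]. "
          "Respond with 'Thought: ...' followed by either 'Action: search[...]' or 'Final Answer: ...'.")

def search(query: str) -> str:
    # Stub tool; swap in a real search API.
    return "Paris is the capital of France."

messages = [{"role": "system", "content": SYSTEM},
            {"role": "user", "content": "What is the capital of France?"}]

for _ in range(5):  # cap the number of Thought/Action turns
    text = client.chat.completions.create(model="gpt-4o-mini",
                                          messages=messages).choices[0].message.content
    messages.append({"role": "assistant", "content": text})
    if "Final Answer:" in text:
        print(text.split("Final Answer:")[-1].strip())
        break
    if "Action: search[" in text:
        query = text.split("Action: search[")[-1].split("]")[0]
        observation = search(query)  # the Observation goes back into the dialog
        messages.append({"role": "user", "content": f"Observation: {observation}"})
```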

  • Prompt Chaining & Sequential Workflows – Break a complex task into ordered sub‑prompts with intermediate validation (“gate checks”) before moving to the next stage. LangChain: Build an Agent
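
A two-step chain with a gate check in between, sketched with illustrative prompts and a deliberately crude validation rule:

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    r = client.chat.completions.create(model="gpt-4o-mini",
                                       messages=[{"role": "user", "content": prompt}])
    return r.choices[0].message.content

outline = ask("Write a 3-bullet outline for a post about agentic design patterns.")

# Gate check: only proceed to drafting if the outline actually looks like a bullet list.
if outline.count("-") >= 3 or outline.count("•") >= 3:
    draft = ask(f"Expand this outline into a 200-word post:\n{outline}")
    print(draft)
else:
    raise ValueError(f"Gate check failed; outline malformed:\n{outline}")
```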

  • Structured Output – Ask the model to return JSON, letting downstream code parse or act on the response safely. The GPT-4.1 models are exceptionally good at returning valid JSON, and you can use Pydantic to specify and validate the schema. Study the prompting guide thoroughly. Since errors compound as you go down the agent’s trajectory, structured outputs and validation are critical.
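
A sketch of JSON mode plus Pydantic validation; the `Invoice` schema, prompt, and model name are illustrative, and a production agent would feed validation errors back to the model for a retry.

```python
import json
from openai import OpenAI
from pydantic import BaseModel, ValidationError

client = OpenAI()

class Invoice(BaseModel):
    vendor: str
    total: float
    currency: str

response = client.chat.completions.create(
    model="gpt-4o-mini",
    response_format={"type": "json_object"},  # ask for JSON only
    messages=[
        {"role": "system",
         "content": f"Extract the invoice as JSON matching this schema: {Invoice.model_json_schema()}"},
        {"role": "user", "content": "Acme Corp billed us $1,200.50 USD."},
    ],
)

try:
    invoice = Invoice.model_validate(json.loads(response.choices[0].message.content))
    print(invoice)
except ValidationError as err:
    print("Invalid structured output:", err)  # in a real agent: retry with the error message
```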

  • Human-in-the-loop – At the current maturity of AI development, fully autonomous agents are typically unachievable in complex, high-stakes environments. The AI can speed things up dramatically but it can also be hit-or-miss, so human supervision is critical. It’s much more realistic to build agentic assistants and copilots that take humans through a structured process than to aim for full autonomy. At key steps the human should evaluate and course-correct as necessary. Time travel to go back to a previous step, adjust, and try again can also be useful. Use AI for what it’s good at, which is parsing lots of information quickly and generating a first draft at a near-human level; use tools for what they are good at, for instance executing simple repeatable processes; and use humans for what they are good at, which is critical thinking.

  • Reflection – After an initial answer, the agent critiques its own work and revises. Can iterate multiple times. Improves reliability without extra finetuning. DeepLearning.ai “Reflection” pattern. Paper: Shinn et al.
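
A draft-critique-revise sketch with a bounded number of rounds; the prompts and round count are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

task = "Write a docstring for a retry decorator."
draft = ask(task)

for _ in range(2):  # bounded reflection rounds
    critique = ask(f"Critique this answer to '{task}'. List concrete problems:\n{draft}")
    draft = ask(f"Task: {task}\nPrevious answer:\n{draft}\nCritique:\n{critique}\n"
                "Rewrite the answer, fixing every problem.")

print(draft)
```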

  • Evaluator‑Optimizer (Generator‑Critic) Loops – Divides the reflection pattern into two steps: one LLM prompt proposes an answer, another scores/criticizes it and provides direction for improvement; iterate until the score crosses a threshold. Anthropic post “Building Effective AI Agents”
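
The same idea with the critic split out and given a numeric threshold; the “Score:” protocol, the threshold, and the retry cap are illustrative assumptions.

```python
import re
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

task = "Summarize the ReAct pattern in two sentences."
answer = ask(task)

for _ in range(3):
    verdict = ask(f"Evaluate this answer to '{task}'. Reply with 'Score: <1-10>' on the first line, "
                  f"then say what to improve:\n{answer}")
    match = re.search(r"Score:\s*(\d+)", verdict)
    score = int(match.group(1)) if match else 0
    if score >= 8:  # quality gate
        break
    answer = ask(f"Task: {task}\nAnswer:\n{answer}\nFeedback:\n{verdict}\nImprove the answer.")

print(answer)
```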

  • Task Routing / Mixture‑of‑Experts – A router runs a classification prompt based on the current state to choose the next action, such as a prompt or sub-agent workflow. Anthropic Agentic Systems - #2. Routing
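
A routing sketch: a cheap classification call picks which handler (a prompt, workflow, or sub-agent) runs next. The route labels and handlers are illustrative.

```python
from openai import OpenAI

client = OpenAI()

HANDLERS = {
    "billing": lambda q: f"[billing workflow handles: {q}]",
    "technical": lambda q: f"[technical-support workflow handles: {q}]",
    "other": lambda q: f"[general assistant handles: {q}]",
}

def route(query: str) -> str:
    label = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user",
                   "content": f"Classify this request as exactly one of: billing, technical, other.\n{query}"}],
    ).choices[0].message.content.strip().lower()
    return HANDLERS.get(label, HANDLERS["other"])(query)

print(route("My invoice was charged twice this month."))
```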

  • Agentic RAG & Specialized Retrieval Teams – Multiple retrieval agents each query their own knowledge pool; an aggregator agent fuses the evidence before final generation. IBM primer “What is Agentic RAG?”
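
A sketch of the shape of agentic RAG: two “retrieval agents”, each with its own (toy) knowledge pool, plus an aggregator prompt that fuses the evidence. The pools, prompts, and retrieval heuristic are illustrative.

```python
from openai import OpenAI

client = OpenAI()

POOLS = {
    "product_docs": ["The Pro plan includes 10 seats.", "SSO is available on Enterprise."],
    "support_tickets": ["Ticket 1432: customer asked how to add seats to Pro."],
}

def retrieval_agent(pool: list[str], question: str) -> str:
    # Each agent would normally query its own index with its own tools.
    hits = [d for d in pool if set(question.lower().split()) & set(d.lower().split())]
    return "\n".join(hits) or "(no relevant documents)"

question = "How many seats come with the Pro plan?"
evidence = {name: retrieval_agent(pool, question) for name, pool in POOLS.items()}

aggregated = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user",
               "content": f"Question: {question}\nEvidence by source:\n{evidence}\n"
                          "Answer using only this evidence and cite the source names."}],
)
print(aggregated.choices[0].message.content)
```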

  • Short‑Term Memory – Keep just enough context (conversation buffer, sliding window, or summary) inside the model’s token limit for coherent multi‑turn chats. Context Windows: The Short‑Term Memory of LLMs
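
A sliding-window buffer is the simplest version; the window size and model name below are illustrative.

```python
from openai import OpenAI

client = OpenAI()
WINDOW = 6  # keep only the last 6 messages

system = {"role": "system", "content": "You are a helpful assistant."}
history: list[dict] = []

def chat(user_text: str) -> str:
    history.append({"role": "user", "content": user_text})
    context = [system] + history[-WINDOW:]  # sliding window over recent turns
    reply = client.chat.completions.create(model="gpt-4o-mini", messages=context)
    text = reply.choices[0].message.content
    history.append({"role": "assistant", "content": text})
    return text

print(chat("Remind me what we decided about the launch date."))
```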

  • Long‑Term Memory – Persist facts or conversation summaries in a vector database or in-memory structure and retrieve them on demand, so the agent “remembers” both within a long session that involves searching through lots of information and across sessions. Pinecone guide to conversational memory with LangChain
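
A sketch of the store-and-recall idea using embeddings; a plain in-memory list stands in for the vector database, and the embedding model name is an illustrative assumption.

```python
import numpy as np
from openai import OpenAI

client = OpenAI()
memory: list[tuple[str, np.ndarray]] = []  # (fact, embedding)

def embed(text: str) -> np.ndarray:
    vec = client.embeddings.create(model="text-embedding-3-small", input=text).data[0].embedding
    return np.array(vec)

def remember(fact: str) -> None:
    memory.append((fact, embed(fact)))

def recall(query: str, k: int = 2) -> list[str]:
    q = embed(query)
    scored = sorted(memory, key=lambda m: float(np.dot(m[1], q)), reverse=True)
    return [fact for fact, _ in scored[:k]]

remember("The user's preferred programming language is Python.")
remember("The user's project deadline is June 15.")
print(recall("What language should code samples use?"))
```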

  • Plan‑and‑Execute (Hierarchical Planning) – First draft a high‑level plan, then execute each sub‑task in order. LangChain: Plan-and-Execute Agents; Paper: Wang et al.
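
A plan-then-execute sketch: one call drafts a numbered plan, each step runs in order with earlier results in context, and a final call combines them. The prompts and step cap are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def ask(prompt: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    ).choices[0].message.content

goal = "Write a short competitive analysis of two note-taking apps."
plan = ask(f"Break this goal into at most 4 numbered steps:\n{goal}")

results: list[str] = []
for step in [s for s in plan.splitlines() if s.strip()[:1].isdigit()]:
    done = "\n".join(results)
    results.append(ask(f"Goal: {goal}\nCompleted so far:\n{done}\nNow do this step:\n{step}"))

print(ask(f"Combine these step results into a final answer for '{goal}':\n\n" + "\n\n".join(results)))
```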

  • Parallelization of Sub‑tasks – In contrast to sequential tasks, we can fan out independent LLM calls asynchronously (map‑reduce, parallel tools) and aggregate results for speed or consensus. We can perform similar tasks in different ways and pick the best one, or take all the outputs and synthesize a response from them. LangChain: How to invoke runnables in parallel
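
A fan-out/fan-in sketch with asyncio and the async OpenAI client; the sub-questions and synthesis prompt are illustrative.

```python
import asyncio
from openai import AsyncOpenAI

client = AsyncOpenAI()

async def ask(prompt: str) -> str:
    r = await client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "user", "content": prompt}],
    )
    return r.choices[0].message.content

async def main() -> None:
    questions = [
        "List 3 pros of monolithic architectures.",
        "List 3 pros of microservice architectures.",
    ]
    answers = await asyncio.gather(*(ask(q) for q in questions))  # fan out in parallel
    print(await ask("Synthesize a balanced recommendation from:\n" + "\n---\n".join(answers)))

asyncio.run(main())
```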

  • Tree of Thoughts (ToT) / Graph of Thoughts (GoT) – Instead of committing to a single reasoning chain, generate multiple candidate “thoughts”, evaluate partial solutions, and backtrack or expand the most promising branches; GoT generalizes the tree into a graph so intermediate thoughts can be merged and reused.

  • Guardrails – A form of reflection: validate inputs and outputs against rules before they are shown to the user or acted on. Some frameworks ship reusable guardrail assertions and processes.
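
A sketch of what a hand-rolled, reusable output guardrail can look like (the rules below are illustrative); dedicated frameworks package checks like these as declarative assertions.

```python
import re

BANNED_PATTERNS = [r"\b\d{3}-\d{2}-\d{4}\b"]  # e.g., something shaped like a US SSN

def output_guardrail(text: str) -> str:
    # Run over every model output before it reaches the user or a tool.
    if len(text) > 2000:
        raise ValueError("Guardrail: output too long")
    for pattern in BANNED_PATTERNS:
        if re.search(pattern, text):
            raise ValueError("Guardrail: possible sensitive data in output")
    return text

print(output_guardrail("Your order total is $42.18."))
```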

  • Orchestrator‑Worker Architecture – A central orchestrator maintains state, assigns work to specialized worker agents, and merges their outputs—a pragmatic bridge to full multi‑agent systems. The LangGraph state graph framework is one pattern. Another pattern would be to use OpenAI Agents toolkit, make each node a tool (including tools that call LLMs), and have a top-level reasoning prompt describing a workflow and telling the LLM to run the workflow using the available tools.
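
A sketch of the orchestrator-worker shape without any framework: a central loop owns the state, hands sub-tasks to role-specific worker prompts, and merges the outputs. The worker roles, prompts, and model name are illustrative.

```python
from openai import OpenAI

client = OpenAI()

def worker(role: str, instruction: str) -> str:
    return client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[{"role": "system", "content": f"You are the {role} worker."},
                  {"role": "user", "content": instruction}],
    ).choices[0].message.content

# The orchestrator owns the shared state and the order of operations.
state = {"task": "Produce release notes for version 2.3"}
state["research"] = worker("research", f"List the key changes for: {state['task']}")
state["draft"] = worker("writer", f"Draft release notes from these changes:\n{state['research']}")
state["final"] = worker("editor", f"Tighten and format these notes:\n{state['draft']}")
print(state["final"])
```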

  • Multi‑Agent Collaboration – Yet another orchestration pattern that uses distinct role‑based agents (e.g., Planner, Coder, Tester) that converse to solve problems that exceed a single model’s capacity. Multi-agent systems can be a bit like using Docker/Kubernetes microservices vs. monolithic architectures. They can provide a helpful decomposition, or make the system complex and hard to reason about. In both cases I would generally advise waiting until you have really understood and solved the problem and need to scale the solution to the next level. AutoGen is a leading multi-agent framework. Wired article.

  • Model Context Protocol and other communications protocols – When you create a tool, in addition to implementing its functionality there’s a sort of semantic layer you have to provide to an LLM so it knows how to use it: the input schema, the output schema, when and why to use it. MCP is a standard for doing this. There are other ways for agents to communicate with the outside world and each other. Another evolving standard is A2A. If one agent calls another agent, it may be a long-running process and a multi-turn chat conversation, unlike a REST call. So there may be a need for a different standard to monitor long-running processes that come back to ask for more information, or that you want to send a sequence of interactive requests to.
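
A sketch of that semantic layer using the FastMCP helper from the official MCP Python SDK (the API is still evolving): the function’s name, type hints, and docstring become the schema and usage guidance an MCP client exposes to the LLM. The tool itself is a hypothetical example.

```python
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("order-tools")

@mcp.tool()
def get_order_status(order_id: str) -> str:
    """Look up the shipping status of an order by its ID.

    Use this when the user asks where their order is or when it will arrive.
    """
    # The signature and docstring above are the "semantic layer"; the body is ordinary code.
    return f"Order {order_id} has shipped and will arrive in 2 days."

if __name__ == "__main__":
    mcp.run()  # serve over stdio so an MCP-capable client can discover and call the tool
```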

Further Reading: