  • Noodling on Autonomy

    WHAT IS THIS?

    This document is meant to informally discuss agentic autonomy in the context of [REDACTED], with the primary purpose of aligning on next steps towards productionization. We already have a great demo; what do we need to make it real? [REDACTED] Answers to these questions and more, coming up.

    Crucially, this is a first step. Once we are aligned on the big rocks that will move the needle towards autonomy, subsequent deep dives will take a closer look at specific use cases (including, at least, [REDACTED]) and the specific systems and features we need to build to support them.

    WHAT IS AUTONOMY?

    It’s simple: stop telling the agent how, and start telling the agent what. This is exactly the same kind of autonomy we expect out of, say, our engineers. Now, of course, it’s a spectrum: one agent can be considered more autonomous than another. So the goal isn’t really autonomy as some fixed goal post, but rather more autonomy relative to our current systems, or even our current plans.

    Take [REDACTED] as a use case. An agent responsible for [REDACTED] might be considered more autonomous if we let it choose its own tools rather than hand-selecting them. That same agent might be considered more autonomous if we freed it from a pre-defined workflow with hardcoded approval steps or computations. But an even more autonomous agent might not even be asked, explicitly, to set [REDACTED]. Instead, that higher-level agent could be asked simply to handle [REDACTED]

    Going even further, perhaps the most autonomous agent is not given tasks at all. Instead, it is given something more like job roles or responsibilities. How and when it converts those responsibilities into action is completely up to it. In this world, agents start to look less like glorified functions with inputs and outputs, and more like Bezos-style mechanisms. Like those mechanisms, these agents must establish feedback loops and improve.

    This brings us to a possible definition for full autonomy: systems that have “the keys to the city”, and use those keys as necessary to carry out any given responsibility as well as possible.

    ITERATIONS TOWARD AUTONOMY

    In this section, we explore some of the high-level features required for a fully autonomous system. And, crucially, we lay out these features in an iterative manner; we do not need to boil the ocean. The purpose of all of this is then to debate which of these features deserve the most short-term development effort.

    Imagine we had just an LLM. I mean — I say “just” but they are obviously very powerful. And very good at what they do. But can they run the business? No. That LLM is pretty much limited to surface-level chat. The road to autonomy from this point is long. For starters, the LLM is reactive, not proactive. It is all talk, without the ability to act. And it really does not understand much about the Private Brands business.

    The obvious next step is that we give the agent a bunch of Private Brands-specific tools via an MCP server. (You might call out knowledge bases separately, but they are close enough in their impact and use that I lump them in with tools.) Our chatbot has grown in both smarts and capability; it can learn and be curious, and bias for action. Are we done? No. There are a few problems. In particular, it’s still just a chatbot. It’s a glorified UX, working on behalf of the User behind the keyboard. Worse, it sometimes makes mistakes that impact the business.
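
    To make this concrete, here is a minimal sketch of what a Private Brands tool on an MCP server could look like, assuming the Python MCP SDK's FastMCP helper; the tool name, payload, and values are invented for illustration.

        # Minimal sketch of a Private Brands MCP server (names are illustrative).
        # Assumes the official Python MCP SDK ("mcp" package) and its FastMCP helper.
        from mcp.server.fastmcp import FastMCP

        mcp = FastMCP("private-brands")

        @mcp.tool()
        def get_product_margin(asin: str) -> dict:
            """Return current cost and margin data for a Private Brands ASIN."""
            # Hypothetical lookup; a real version would call an internal API.
            return {"asin": asin, "cost": 4.20, "margin_pct": 31.5}

        if __name__ == "__main__":
            mcp.run()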

    Let’s address this safety problem by introducing change management. This comes in different forms. The MCP server can be expanded with change management tooling, so that the agent can specifically request approvals and so on. More powerfully, we can provide indirection between the agent-facing tool and the actual business update, giving us a hook to inject change management on an as-needed basis. The decision of whether or not to auto-approve can be driven by simple heuristics, or it can be driven by another agent. The configuration that drives our MCP tools must now support change management customization as well.
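
    To illustrate the indirection, here is a rough sketch (every name and threshold is hypothetical) of a trust barrier that sits between the agent-facing tool and the actual business update, deciding per change whether to auto-approve, defer to a reviewer agent, or queue for a human.

        # Sketch of a change-management "trust barrier"; names and thresholds invented.
        from dataclasses import dataclass
        from typing import Callable

        @dataclass
        class ProposedChange:
            target: str             # e.g. "asin/B00EXAMPLE/list_price"
            new_value: float
            rationale: str
            impact_estimate: float  # rough dollar impact, supplied by the tool layer

        def trust_barrier(change: ProposedChange,
                          apply_change: Callable[[ProposedChange], None],
                          reviewer_agent: Callable[[ProposedChange], bool]) -> str:
            """Route a change to auto-approval, agent review, or a human queue."""
            if change.impact_estimate < 100:      # simple heuristic: low-impact changes
                apply_change(change)
                return "auto-approved"
            if reviewer_agent(change):            # a second agent as reviewer
                apply_change(change)
                return "agent-approved"
            return "queued-for-human-review"      # default to the safe path

    The same hook works whether the approval policy is a heuristic, another agent, or a human rota; the agent-facing tool never has to change.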

    Presumably, the change management system preserves a history of approvals. With this history, we can audit problems as they occur. But approval records alone don’t tell the whole story. They don’t explain why the change was requested in the first place. They don’t show the User prompt or the tool-provided data that provided the context for that update. To understand what is going on across an increasingly intelligent system, we need to centrally track business decisions. These are, in my mind, similar to our Architectural Decision Records (ADRs). Agent-facing update tools will expect the agent to provide rationale, further decorating the intended action in a manner similar to the Command Pattern. The system itself can augment agent-provided rationale with conversational facts like previous tool invocations.
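
    As a sketch of what such a record might carry (the fields are invented, not a committed schema), the Command-Pattern-style decoration could look like this:

        # Sketch of a business decision record decorating an intended action.
        from dataclasses import dataclass, field
        from datetime import datetime, timezone

        @dataclass
        class BusinessDecisionRecord:
            action: str                       # e.g. "update_list_price"
            parameters: dict                  # the command's inputs
            rationale: str                    # agent-provided "why"
            context: list[str] = field(default_factory=list)   # prior prompts/tool calls
            decided_at: datetime = field(default_factory=lambda: datetime.now(timezone.utc))
            parent_decision_id: str | None = None               # lets us trace decision chains

        record = BusinessDecisionRecord(
            action="update_list_price",
            parameters={"asin": "B00EXAMPLE", "price": 19.99},
            rationale="Competitor undercut us by 8%; margin floor still respected.",
            context=["tool:get_competitor_price", "prompt:weekly price review"],
        )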

    At this point, we might find that some of these decisions are going awry because the agent is over-indexing on industry standards and general knowledge. We can introduce the business playbook to capture and share as much tribal knowledge as possible. This can be exposed as a knowledge base or toolset. Or, why not both? The creation and maintenance of this catalog is obviously a massive undertaking, but we can take it incrementally. (And the timing is right, now. With the move to Centric, we need to unearth and update much of this knowledge regardless.)
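
    As a small, hypothetical sketch, a single playbook entry could be served both as retrievable knowledge and as a lookup tool; the entry text and names below are invented.

        # Sketch: one playbook entry exposed both as a knowledge base doc and a tool.
        PLAYBOOK = {
            "pricing/holiday-markdowns": (
                "Markdowns above 20% during Q4 require category-lead sign-off; "
                "never price below landed cost."
            ),
        }

        def search_playbook(query: str) -> list[str]:
            """Naive keyword lookup; a real version would use retrieval/embeddings."""
            q = query.lower()
            return [text for key, text in PLAYBOOK.items()
                    if q in key.lower() or q in text.lower()]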

    Sooner rather than later, the size of the business playbook (and, really, the complexity of the business as a whole) will break a single agent. It will hallucinate, fly through context windows, and make bad choices. We need more than a single agent; we need an entire agent catalog. Each of these agents must be configurable with the prompts, knowledge, and tools to effectively drive a slice of the business. And they must be accompanied by the metadata required for discovery and reuse — not just as User-facing chatbots, but for agent delegation as well.
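
    A catalog entry might carry something like the following (an illustrative sketch, not a committed schema):

        # Sketch of an agent catalog entry with the metadata needed for discovery.
        from dataclasses import dataclass, field

        @dataclass
        class AgentCatalogEntry:
            name: str
            description: str                  # natural-language summary for discovery
            system_prompt: str
            tools: list[str] = field(default_factory=list)            # MCP tool names
            knowledge_bases: list[str] = field(default_factory=list)
            delegatable: bool = True          # may other agents call this one?

        CATALOG = {
            "pricing-agent": AgentCatalogEntry(
                name="pricing-agent",
                description="Owns list-price changes for Private Brands ASINs.",
                system_prompt="You manage pricing within the approved guardrails...",
                tools=["get_product_margin", "update_list_price"],
                knowledge_bases=["pricing/holiday-markdowns"],
            ),
        }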

    Agents calling agents calling agents. It’s agents all the way down. From a system perspective, the behavior of the business is becoming more emergent; more chaotic. One way to think about this is that it is computationally irreducible. That is, there are no shortcuts to figure out even a rough idea of what might happen. The only way to know what will happen is to let the damn thing loose and see what happens. This is actually a form of strong emergence, where even in hindsight we cannot fully understand why the system behaves the way that it does. 

    We can — no, must! — counterweight this chaos by tracking context. That is, every system involved in taking action against our business should understand the context that it is operating in. Then, when a given action is taken, we can trace up this context chain, which effectively amounts to a call stack. Our business is big, but it’s nothing like, say, SCOT. Or the detail page. Stuff is happening, but not that often. Every action taken against our business should be considered a big deal.
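
    One lightweight way to carry that call-stack-like context is sketched below; the frame labels are invented, and a real version would propagate across process boundaries.

        # Sketch of context propagation: every action carries the chain of contexts
        # (mechanism, agent, tool) that led to it, much like a call stack.
        import contextvars

        _context_chain: contextvars.ContextVar[tuple[str, ...]] = contextvars.ContextVar(
            "context_chain", default=()
        )

        def push_context(label: str) -> None:
            """Append a frame such as 'agent:pricing-agent' to the current chain."""
            _context_chain.set(_context_chain.get() + (label,))

        def current_chain() -> tuple[str, ...]:
            return _context_chain.get()

        push_context("mechanism:assortment-review")
        push_context("agent:pricing-agent")
        push_context("tool:update_list_price")
        print(" -> ".join(current_chain()))
        # mechanism:assortment-review -> agent:pricing-agent -> tool:update_list_price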

    If agents are waiting on other agents, we do not want them idling around wasting resources. It’s time for an event-driven architecture. This is what the Agent2Agent protocol fundamentally offers, although we would, ideally, like something that works just as well for other asynchronous work like human-facing tasks or ETL jobs. This effectively amounts to an instance-bound subscription. Again, this should work with any number of building blocks, not just agentic work.
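
    A toy version of an instance-bound subscription might look like the following; the event keys are invented, and a real implementation would sit on a durable event bus rather than in-process futures.

        # Sketch: a piece of work (agent turn, human task, ETL job) parks itself
        # until one specific event instance arrives, instead of polling or idling.
        import asyncio
        from collections import defaultdict

        _waiters: dict[str, list[asyncio.Future]] = defaultdict(list)

        async def wait_for(event_key: str) -> dict:
            """Suspend the caller (no busy-waiting) until event_key is published."""
            fut: asyncio.Future = asyncio.get_running_loop().create_future()
            _waiters[event_key].append(fut)
            return await fut

        def publish(event_key: str, payload: dict) -> None:
            for fut in _waiters.pop(event_key, []):
                fut.set_result(payload)

        async def demo():
            waiter = asyncio.create_task(wait_for("task/123/approved"))
            await asyncio.sleep(0)            # let the subscription register
            publish("task/123/approved", {"approved_by": "reviewer-agent"})
            print(await waiter)

        asyncio.run(demo())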

    With event subscription, we are officially moving away from glorified LLMs towards true agents. That said, we are still missing cognition. This includes features from short-term and long-term memory all the way to planning and cognitive loops. The difference between an agent with and without cognition is like the difference between me writing a document and rambling in a meeting; the latter might be useful for mind-melding and brainstorming, but the former introduces the level of research and rigor required for accurate decision-making.
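
    For intuition only, the shape of such a loop is roughly the following; the llm and tools callables are placeholders, and the loop structure, not the prompts, is the point.

        # Bare-bones sketch of a cognitive loop: plan, act, observe, remember, repeat.
        def cognitive_loop(goal: str, llm, tools, max_steps: int = 10) -> list[str]:
            short_term: list[str] = []                 # working memory for this run
            plan = llm(f"Break the goal into steps: {goal}")
            for _ in range(max_steps):
                next_step = llm(f"Plan: {plan}\nDone so far: {short_term}\nNext step?")
                if next_step == "DONE":
                    break
                observation = tools(next_step)         # act in the world
                short_term.append(f"{next_step} -> {observation}")
            return short_term                          # candidates for long-term memory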

    At this point, the agents are working pretty well, but they are still reactive. They behave like functions — take some input, produce some result (and, perhaps, side effects in the system.) To achieve an autonomous system, we need agents that work more proactively. We need agents that behave like mechanisms rather than functions. We need agents that are given responsibilities rather than tasks. These agents work continuously, using the event-driven architecture described above, to deliver on those responsibilities. 
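
    The contrast with a function-shaped agent is easiest to see in code; in this sketch, events is assumed to be an async stream (say, fed by the event architecture above) and act is an invented callable.

        # Sketch: a mechanism owns a responsibility indefinitely and never "returns".
        async def assortment_mechanism(events, act):
            """Hypothetical responsibility: keep the assortment healthy."""
            async for event in events:                 # runs continuously, event-driven
                if event["type"] in ("sales_dip", "new_competitor", "stockout"):
                    await act(responsibility="assortment-health", trigger=event)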

    To create a true Bezos-style mechanism, we need a feedback loop. We need evaluation. This is not a purely agentic concern; we can just as well evaluate a traditional workflow. Here, we must face Goodhart’s Law: the metrics we leverage as proxies might lead us astray because they lack a strong cause-effect relationship with our north star goals. On the other hand, our north star metrics are noisy and far-removed from the actions under our control. One proven means to address these difficulties is to develop KPI scorecards.
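
    A scorecard could be as simple as a weighted roll-up of proxy metrics, so no single proxy can be gamed in isolation; the metric names and weights below are invented.

        # Sketch of a KPI scorecard; each metric is assumed pre-normalized to 0..1.
        SCORECARD = {
            "margin_delta":         {"weight": 0.4, "value": 0.62},
            "in_stock_rate":        {"weight": 0.3, "value": 0.95},
            "return_rate_inverted": {"weight": 0.2, "value": 0.88},  # fewer returns => higher
            "escalations_inverted": {"weight": 0.1, "value": 0.70},  # fewer escalations => higher
        }

        def score(card: dict) -> float:
            """Weighted sum over normalized metrics."""
            return sum(m["weight"] * m["value"] for m in card.values())

        print(round(score(SCORECARD), 3))   # 0.779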

    With a means to evaluate these systems, we can begin to experiment with different implementations. I like to think of this as pitting agents against each other in a sort of Darwinian dystopia. This might begin as simple A/B testing, but it can be as complex as we need, while maintaining statistical power. In particular, I like to think of exploring the “tool space”, since tool selection seems to be such a critical factor in agent success.
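
    The harness can start very small; a deliberately naive sketch (bucketing logic invented) follows.

        # Sketch of a tiny experiment harness: send a share of tasks to a variant
        # agent (say, a different tool set) and log results against the scorecard.
        import random

        def route(task, control_agent, variant_agent, variant_share: float = 0.1):
            """Naive random split; a real test needs stable bucketing and power analysis."""
            arm = "variant" if random.random() < variant_share else "control"
            agent = variant_agent if arm == "variant" else control_agent
            result = agent(task)
            return arm, result     # persist (arm, scorecard) pairs for significance testing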

    In order to experiment across paradigm boundaries (e.g. agent vs. ML model vs. simple heuristic) we must define what they have in common. These paradigms are implementation choices; what is their interface? This requires that we define non-agentic abstractions for the agentic work. Think tasks, goals, mechanisms, functions, and so on.
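
    Something like the following interface sketch is what I have in mind; the handler shape and names are illustrative.

        # Sketch of a paradigm-neutral interface: the same "task handler" contract
        # can be backed by an agent, an ML model, or a plain heuristic.
        from typing import Protocol

        class TaskHandler(Protocol):
            def handle(self, task: dict) -> dict: ...

        class HeuristicPricer:
            def handle(self, task: dict) -> dict:
                return {"price": task["cost"] * 1.4}          # fixed markup

        class AgentPricer:
            def __init__(self, agent):
                self._agent = agent
            def handle(self, task: dict) -> dict:
                return self._agent(f"Propose a price for: {task}")

        def run(handler: TaskHandler, task: dict) -> dict:
            return handler.handle(task)                       # caller never sees the paradigm

    The workflow, integration, or experiment harness only ever sees the TaskHandler contract; which paradigm sits behind it becomes an implementation detail.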

    With these new abstractions, we can start to plug agents into other aspects of our software. Let me explain. To this point, we have top-level (root) agents in the form of mechanisms, and bottom-level (leaf) agents in the form of User-facing chatbots. We can get more mileage out of our agents by putting them to work in the middle. Have agents work on workflow tasks; have agents perform transformations during integrations; have agents respond to price-update events. This is easy when our workflows and integrations and event policies can integrate with abstractions like mechanisms and tasks — no special integration with agents required.

    Plugging in agents directly into these existing processing paradigms (workflows, functions, policies) is all-or-nothing. It is either implemented by that agent or it isn’t. We need a way to get the best of both worlds: the dynamism and simplicity of agentic implementation; the consistency and safety of human-provided guardrails. The paradigm proposed to solve for this tension is the work set, which behaves something like a workflow where the flow is optional. Flexible when we can, rigid when we can’t.
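
    A work set might be represented as steps with only partial ordering, as in this invented sketch: the rigid dependencies are the guardrails, and the agent chooses the order of everything else.

        # Sketch of a "work set": required steps, optional flow.
        WORK_SET = {
            "draft_price":      {"requires": []},
            "check_margin":     {"requires": ["draft_price"]},               # rigid
            "check_compliance": {"requires": []},                            # flexible
            "submit_change":    {"requires": ["check_margin", "check_compliance"]},
        }

        def ready_steps(done: set[str]) -> list[str]:
            """Steps the agent may pick next; ordering among them is the agent's call."""
            return [step for step, spec in WORK_SET.items()
                    if step not in done and all(r in done for r in spec["requires"])]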

    Phew. This is a lot already. These agents are clearly very powerful. But that doesn’t mean they don’t occasionally need a helping hand. If our business were an episode of Who Wants To Be A Millionaire?, we want our agents to be able to phone a friend. This is especially true for value judgments. This is different from approvals; it is about active collaboration rather than guardrails. My favorite approach for this kind of human-in-the-loop interaction is the question and answer. Flip the script on the traditional chatbot and have the agents reach out to Users with a schema-backed question, almost like a micro-form.
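
    A question like that could be as small as the following sketch (fields invented); the constrained options are what keep the answer machine-usable.

        # Sketch of a schema-backed question: a micro-form the agent raises to a human.
        from dataclasses import dataclass

        @dataclass
        class Question:
            prompt: str
            options: list[str]        # constrained answers, not free-form chat
            context: str              # why the agent is asking
            blocking: bool = True     # does the work pause until answered?

        q = Question(
            prompt="Two suppliers meet spec; which relationship should we prioritize?",
            options=["Supplier A (cheaper)", "Supplier B (faster lead time)"],
            context="Cost and lead time trade off here; this is a value judgment.",
        )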

    We have covered agent-human and agent-agent collaboration already. But what if an agent needs the help of another agent that does not yet exist? This is where code mode comes in. Let the agent configure new building blocks as needed in a centralized workbench. These building blocks can then be shared in the catalog, or used as a one-off. And they can be anything from workflows to agents to tables — anything. This has the potential to explode agent power, much as third-gen programming languages did for Developers.

    This workbench perfectly illustrates the need for a catalog. Earlier, we established the need for an agent catalog; here, we extend this need to a building block catalog. Take ASINs. It is not enough to offer tools for creating, querying, fetching, updating, and deleting ASINs. Those tools (or their underlying APIs) help the agent execute, but they do not help the agent build. For example, an agent might want to make a given ASIN the scope of a workflow task. For that kind of work, the agent must understand the nouns, not just the verbs. This has historical precedent in declarative programming.

    You know what would make the catalog a lot more useful to agents? Natural language metadata. That is, it should be possible for agents and humans alike to talk about the entities in the model so that they can better work with that data. Think something like comments in code; the two live side by side.
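
    Pulling the last two ideas together, a building block catalog entry might describe the noun itself plus natural-language metadata; everything below is an invented example.

        # Sketch of a building-block catalog entry: the entity (noun), its fields,
        # what it can be used as, and natural-language descriptions throughout.
        CATALOG_ENTRY = {
            "entity": "ASIN",
            "description": "A single sellable item; the unit most workflows scope to.",
            "fields": {
                "asin":       "Primary identifier, e.g. B00EXAMPLE.",
                "brand":      "Owning Private Brands label.",
                "list_price": "Current customer-facing price in USD.",
            },
            "usable_as": ["workflow_task_scope", "event_subject", "report_dimension"],
            "related_tools": ["get_product_margin", "update_list_price"],
        }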

    There is one more problem. A traditional catalog can be explored in depth by human builders. Those humans can learn all the idiomatic behavior and interactions of those components as they use them to build something new. However, as we know, an agent is only as good as its tools. If it fails to find the right tools, it’s going to fail to do the job. We can address this through something like a sommelier: a service that excels at zooming in on the specific aspects of a massive business that matter for a specific use case.
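
    Sketched naively below (keyword overlap instead of real retrieval, with invented names), the sommelier is essentially a ranking over the catalog's natural-language metadata.

        # Sketch of a "sommelier": given a use case, surface the few blocks that matter.
        def recommend(use_case: str, catalog: dict[str, str], top_k: int = 5) -> list[str]:
            terms = set(use_case.lower().split())
            ranked = sorted(
                catalog,
                key=lambda name: -len(terms & set(catalog[name].lower().split())),
            )
            return ranked[:top_k]

        tools = {
            "update_list_price":  "change the customer-facing price of an ASIN",
            "get_product_margin": "fetch cost and margin data for an ASIN",
            "create_supplier_po": "raise a purchase order with a supplier",
        }
        print(recommend("lower the price on a low-margin ASIN", tools, top_k=2))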

    Are we done? No! I tricked you earlier when we left the topic of mechanisms and feedback loops. Evaluation and experimentation are useless until we use them to improve. Now, we could do this manually, of course — that’s how Weblabs work today, for example. But an agentic mechanism should improve itself. That means exploration and exploitation. It means reinforcement learning. I will leave the details to a future doc.

    Phew! That’s enough for now. You can see a breakdown of these features below.

    FROM TOY TO REAL BOY

    We just presented an awesome demo aiming towards autonomy from [REDACTED]. You can check out that demo at [REDACTED]. A picture is worth a thousand words, of course…

    [REDACTED]

    So — what would it take to actually leverage this? Ultimately, these are the features (and, therefore, components) that we plan to build now.

    MUST HAVE (1) Change management. Agents should never take action against our systems directly. Those actions (via MCP tools, API calls, or whatever) should always go through a kind of trust barrier, a separate component dedicated to making sure those changes are the right ones. The changes do not need to be human-reviewed, necessarily! They could be reviewed by an agent; they could be let right on through via an established policy. The point is, we ultimately maintain control over the changes happening to the business.

    MUST HAVE (2) Auditing. This is closely tied to change management; when we see data that doesn’t make sense, we need to know what change it was a part of, why it was approved, and the rationale behind the change. This might be recursive! That change might be made in response to another change, or the task that drove the change might be part of a much larger task. This doesn’t have to be perfect, but we need to start with at least some basic auditability. We do not need a formalization of business decisions, but we do need the context to trace through changes and tasks and agents and workflows. On the other hand, we might find that without properly tracking decisions and enforcing a kind of “show your work”, the agents are a little eager to forge ahead under whatever assumptions they make at the time.

    MUST HAVE (3) Business playbook. Today, our agents are “hardcoded” to the degree that they have custom-built prompts and tool selection. A huge part of autonomy is getting out of this business. We would much rather the agent (under the hood, a higher-level supervisor agent) do the research necessary to determine the best prompts and tools for the task at hand. However, this requires an understanding of the business, and that in turn requires a cataloging of our existing workflows, SOPs, and tribal knowledge. Again, this doesn’t need to be perfect. We can iterate on the contents of the playbook over time to facilitate new use cases as they come up. But we do need the container in which the playbook can grow.

    SHOULD HAVE (4) Existing workflow integrations. This might be better bucketed as event-based triggers. The point is, the agents we have so far must, generally, be invoked by a human being. That puts an upper bound on autonomy, because the agents must be told what to do. We can get more mileage (and more CO-LAB-O-RA-TION!) out of these agents by plugging them into our existing workflows. Let them help with the work we are already doing.

    [REDACTED]

    THAT’S ALL FOR TODAY

    Thank you as always for reading!