Are AI Agents Capable of Abductive Reasoning? I Tried to Find Out.
July 8, 2025, 1:09 p.m.
While there's debate about whether AI agents are capable of deductive reasoning (verdict: close enough), and inductive reasoning (verdict: situationally), there's no chatter that I've seen on whether AI agents are capable of a third kind of reasoning, abductive reasoning. So, I devised an informal test. Spoiler: no, not yet. #llms
Humans - and AI agents - have limited resources. We can't efficiently search the entire hypothesis space before deciding to take an action. We rely on abductive reasoning to help us navigate the uncertainties of daily life.
Abductive Reasoning
Let's define abductive reasoning as “the seeking of the best explanation for some observation or circumstance”. Here's a conventional example: you get in your car and press the engine start button, but the car doesn't start. Even for banal everyday scenarios such as this, there is an enormous (some would say infinite) set of explanations for what we observe. In this case alone, we might entertain any of: (a) the battery is dead, (b) the spark plugs are faulty, or (c) space aliens tampered with your car's electronics. Abductive reasoning is the process of sorting through this enormous space to come up with explanations and ordering them by likelihood. This then informs future action (check the battery, check the spark plugs, watch the night sky for trickster space aliens).
We know humans do this all the time, and we consider it a marker of intelligence. Curiously, though, amidst the chaotic debate about the intelligence and reasoning capacity of AI agents, I don’t think I’ve seen abductive reasoning discussed. I can only guess it’s because of how hard it is to test. It certainly poses some unique challenges: abductive reasoning is highly contextual and draws on a world model and intuition. Still, I think agents have repeatedly proven their ability to push beyond what a reductive model of their capabilities would imply. So it’s a worthy question, and one of real significance for anyone trying to push the boundaries of what agents can do.
Testing the Abductive Reasoning Capacity of Agents
How best, then, to test their abductive reasoning capabilities? There are two main challenges:
Controlling the environment: Abductive reasoning is definitionally contextual, so a completely controlled environment is necessary.
Data Contamination: LLM publishers have demonstrated that they can and will train greedily on test data. If this is going to be a good test, it needs to be completely novel.
An idea that meets these criteria (and is convenient for certain other reasons) is the text adventure, like the classic 1980s Zork series. In a text adventure, the player solves puzzles using certain predefined actions in different contexts.
[Screenshot: the original Zork, with a couple of moves already made.]
The entire world is conveyed through text descriptions, which makes text adventures ideal for agents: they are natively text-in, text-out, just as LLMs are. For example, the game might say something like "There is a key behind a locked glass case.", with the idea that the player must check their inventory for a hammer, send a command like "hit glass with hammer", and then "take key".
A text adventure has certain defining characteristics (roughly sketched in code after the list):
An inventory of persistent objects which the player can carry.
A set of predefined actions which the player can use to interact with the environment.
A series of areas with different items the player can take and fixtures which the player can interact with.
A goal, the accomplishing of which terminates the adventure in success.
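For concreteness, here's a rough sketch, in Python, of how those characteristics might map onto a data model. The names are made up for illustration and aren't my actual engine:

```python
# Illustrative sketch of a text-adventure world model (not the actual engine).
from dataclasses import dataclass, field

@dataclass
class Item:
    name: str
    portable: bool = True  # can the player carry it in their inventory?

@dataclass
class Area:
    description: str
    items: list[Item] = field(default_factory=list)      # things the player can take
    fixtures: list[str] = field(default_factory=list)     # things to interact with but not take
    exits: dict[str, str] = field(default_factory=dict)   # direction -> name of the next area

@dataclass
class GameState:
    areas: dict[str, Area]
    location: str                                 # name of the area the player is in
    inventory: list[Item] = field(default_factory=list)
    goal: str = "open the vault"                  # reaching it ends the adventure in success
```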
With this in mind, it should be clear just how good a test of abductive reasoning this is: the whole world model is there for the LLM in text form, and structured outputs or clever prompting can restrict its actions to the predefined set, meeting the LLM on its own terms.
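One way to do that restriction, for instance, is to have the model emit structured output that has to validate against a schema of the game's verbs. A hypothetical Pydantic sketch, where the verb list and field names are purely illustrative:

```python
# Illustrative: constrain the agent's moves to the game's predefined verbs
# via schema-validated structured output. Verbs and fields are made up here.
from typing import Literal, Optional
from pydantic import BaseModel

class Command(BaseModel):
    verb: Literal["go", "look", "take", "open", "hit", "inventory", "help"]
    target: Optional[str] = None      # e.g. "glass case"
    instrument: Optional[str] = None  # e.g. "hammer", as in "hit glass with hammer"

# A response like {"verb": "hit", "target": "glass case", "instrument": "hammer"}
# parses cleanly; anything outside the verb list fails validation and can be retried.
```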
Notably, text adventures often require unconventional thinking about novel situations. In Zork: Grand Inquisitor, for example, a puzzle requires the player to swat a fly with a 4 of hearts playing card in order for a machine to read it as a “5”. If this seems like a difficult task for an agent, consider that a human 8-year-old can solve this puzzle in a few minutes. I know this because I was that 8-year-old.
Agent Design for Text Adventures
There are some practical challenges, however. In all likelihood, every text adventure ever written has been used as training data for LLMs. So I wrote a text adventure engine in which I could write whatever text adventures I wanted, and then I wrote a deliberative agent. The agent has four personas it can select from to get advice on what to do, given the interaction history with the game, its inventory, and so on (a rough sketch of this step follows the list):
planner: helps with strategic planning and goal setting
curious: helps with information gathering and exploration
solver: helps with solving puzzles that need logical analysis
jester: gives creative, unorthodox ideas
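In outline, the persona-advice step looks something like the following; the prompts and plumbing here are illustrative rather than my exact implementation:

```python
# Rough sketch of the persona-advice step (prompts are illustrative).
PERSONAS = {
    "planner": "You help with strategic planning and goal setting.",
    "curious": "You suggest ways to gather more information and explore.",
    "solver":  "You work through puzzles that need logical analysis.",
    "jester":  "You offer creative, unorthodox ideas.",
}

def advise(llm, persona: str, history: str, inventory: list[str]) -> str:
    """Ask one persona for advice given the game transcript and inventory."""
    prompt = (
        f"{PERSONAS[persona]}\n\n"
        f"Game transcript so far:\n{history}\n\n"
        f"Current inventory: {', '.join(inventory) or 'empty'}\n"
        "What should the player do next, and why?"
    )
    return llm(prompt)  # llm: any callable mapping a prompt string to a completion
```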
I used gemma-2-9b served with vLLM to run the agent locally, and LangGraph for orchestration. I would have preferred a larger model (and will use one in the future), but for now that’s probably the biggest model I can fit in VRAM.
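For reference, a minimal version of the local serving side looks roughly like this; I'm assuming the instruction-tuned gemma-2-9b checkpoint on Hugging Face, and the memory setting is just a placeholder to tune for your GPU:

```python
# Minimal local inference sketch with vLLM (model id and settings are assumptions).
from vllm import LLM, SamplingParams

llm = LLM(model="google/gemma-2-9b-it", gpu_memory_utilization=0.90)
params = SamplingParams(temperature=0.7, max_tokens=256)

def complete(prompt: str) -> str:
    """Return a single completion for a prompt."""
    return llm.generate([prompt], params)[0].outputs[0].text
```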
Verdict (for now)
Disappointingly, but perhaps not surprisingly, my agent couldn’t get past the first hurdle. It frequently made up actions it didn’t have access to, despite the `help` command giving it the full inventory of actions it could take, and it often made up items in the text context that simply weren’t there but might have solved the problem had they been.
This first puzzle is, in my opinion, not a very hard one; it only involves practical knowledge about items in the world. I would like to (and probably will) discuss the evolution of the agent and the progress it makes, especially with larger models, in the future, which is why I’m going to keep the puzzle and all the solutions a secret.
However! I have put the adventure on an MCP server. If you want to write an agent yourself and think you can do better than I did, drop me a line and I’ll share the location.