Attacking Dead Goblins
I’ve been playing Dungeons & Dragons, off and on, since Gary Gygax and Dave Arneson published those three little brown booklets in 1974. I was an undergraduate then, and nobody quite knew what to call this strange new thing—not quite a wargame, not quite improvisational theater, something that required dice and graph paper and, crucially, other people willing to pretend.
Then came decades of not playing. Career, family, the usual gravitational forces that pull you away from afternoons spent mapping dungeons.
The pandemic changed that. My daughter, her husband, and I formed a bubble, and we decided our bubble needed a quest. My granddaughter was seven. She had never rolled a d20 in her life. We decided to fix that.
Teaching a child to play D&D is an exercise in remembering why you loved it. The rules matter, but not as much as the willingness to say “yes, and.” She wanted her character to befriend the monster instead of fighting it. Why not? She wanted to search the ceiling for secret doors. Sure, roll perception. The game bent around her imagination, and she bent around the game, learning that actions have consequences, that dice introduce genuine uncertainty, that a story belongs to everyone at the table.
Several years have passed. She played weekly with a group of her school friends, her father serving as Dungeon Master, until junior high crowded out her afternoons. I keep my dice and miniatures ready, called up when they need an extra sword or when she wants Grandpa at the table for something special. I am the reserve player, the one who can drop in because he was there at the beginning—both her beginning and the game’s.
This morning I read a paper from researchers at UC San Diego and Penn who built AI agents that play D&D against each other. Not as a parlor trick, but as a serious evaluation framework. They wanted to know: can large language models maintain game state, follow rules, stay in character, and make tactical decisions over extended play?
The answer is yes, mostly, with a troubling caveat.
They built a combat simulator where different AI instances control the Dungeon Master, the player characters, and the monsters. The AIs had to use specific game functions—rolling dice, checking line of sight, querying hit points—and the system enforced whether those actions were legal. An AI might dramatically announce “my arrow flies true!” but the underlying function call determined whether the attack actually hit. The researchers call this “grounding.” The typed API is the ground truth.
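Here is a minimal sketch of what that grounding might look like, with names I invented for illustration (nothing below is the paper's actual API): the agent can narrate whatever it likes, but the attack only counts if the typed function call is legal against the real game state.

```python
from dataclasses import dataclass

@dataclass
class Creature:
    name: str
    hit_points: int
    position: tuple[int, int]

class CombatState:
    """Ground-truth game state; the agent's narration never modifies it directly."""

    def __init__(self, creatures):
        self.creatures = {c.name: c for c in creatures}

    def attack(self, attacker: str, target: str, roll: int) -> dict:
        """Typed action: legality is decided here, not in the agent's prose."""
        victim = self.creatures.get(target)
        if victim is None or victim.hit_points <= 0:
            return {"ok": False, "reason": f"{target} is not a valid (living) target"}
        hit = roll >= 12          # stand-in armor class check
        if hit:
            victim.hit_points -= 6  # stand-in damage
        return {"ok": True, "hit": hit, "target_hp": victim.hit_points}

state = CombatState([Creature("goblin", 0, (3, 4))])

# The agent may announce "my arrow flies true!", but the call is what counts:
print(state.attack("ranger", "goblin", roll=18))
# -> {'ok': False, 'reason': 'goblin is not a valid (living) target'}
```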
They tested three AI models across twenty-seven combat scenarios and measured six dimensions: whether the AI called the right functions, used correct parameters, stayed in character, made tactically smart choices, tracked the game state accurately, and used its function calls efficiently.
The results were fascinating. All the AIs could handle the basics—attacking enemies, moving around the battlefield, casting spells. Claude performed most consistently, especially on tool use. But every model showed the same pattern: they degraded over time.
By turn seven or eight, the AIs would confidently attack enemies that had died three rounds earlier. They would check whether they had line of sight to a goblin, receive a “no” from the system, and then immediately try to shoot that goblin anyway. They would query a monster’s hit points, see it was at zero, and announce they were attacking it.
The researchers call this “hallucination”—the AI acting on information that isn’t true anymore. But I think that word obscures what’s actually happening. The AIs aren’t seeing things that were never there. They’re losing track of what changed. They’re conditioning on stale data. They’re failing to update their model of the world as the world evolves.
My granddaughter doesn’t make this mistake. She remembers the NPC who betrayed the party six sessions ago. She knows which merchant overcharges and which dungeon was never fully cleared. When she walks into a room, she’s not just processing the DM’s current description—she’s integrating it with everything she’s learned about this fictional world across months of play.
The AIs are always starting fresh. Each turn, they receive a context window containing the recent game history, and they reason from that. But context windows have limits, and attention has costs, and somewhere around turn seven the old information starts to blur. The goblin that died in round two becomes indistinguishable from the goblin still fighting in round five.
I’ve been building something similar for my own work—not for games, but for fiction. I have a novel, 101 chapters, 218,000 words, with six point-of-view characters who need to sound distinct from each other across the entire manuscript. An engineer thinks in tolerances and load-bearing structures. A physicist thinks in quantum coherence and probability amplitudes. A geologist thinks in crystalline matrices and deep time.
I built an agent that reads each chapter and scores it on multiple dimensions: engagement (stakes, resistance, change, reader pull), voice (does this character sound like themselves?), and prose quality (am I telling instead of showing, using crutch words, letting characters explain things to each other that the reader already knows?). The agent flags “round-robin dialogue”—that deadly pattern where characters take turns making speeches without interruption or disagreement, which drains all tension from a scene.
It’s the same fundamental problem the D&D researchers were trying to benchmark. Maintaining coherent state over long contexts. Remembering who each character is, what they know, how they speak. Not attacking dead goblins.
The difference is in what grounds the assessment. The D&D paper’s key insight was separating narration from mechanics through a typed API. The DM can describe events however it wants, but the truth of those events is guaranteed by the underlying function calls. My literary agent can claim a scene is riveting, but the rubric forces specificity: What are the stakes? Score it 0-3. Is there resistance? Score it 0-3. The numbers create an audit trail.
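As a sketch of what that audit trail could look like, again with invented names (the dimension labels and example notes are my stand-ins, not the actual rubric), the idea is that every overall judgment decomposes into 0-3 subscores a human can check.

```python
# Hypothetical rubric scorer: each dimension gets an integer 0-3 plus a note,
# so the overall score can always be traced back to specific claims.
RUBRIC = ("stakes", "resistance", "change", "reader_pull")

def score_scene(subscores: dict[str, tuple[int, str]]) -> dict:
    missing = [d for d in RUBRIC if d not in subscores]
    if missing:
        raise ValueError(f"rubric incomplete, missing: {missing}")
    for dim, (value, _note) in subscores.items():
        if not 0 <= value <= 3:
            raise ValueError(f"{dim} must be scored 0-3, got {value}")
    total = sum(value for value, _ in subscores.values())
    return {"total": total, "max": 3 * len(RUBRIC), "audit": subscores}

print(score_scene({
    "stakes":      (3, "Amara could lose the bridge contract and her license"),
    "resistance":  (1, "nobody pushes back on her plan"),
    "change":      (2, "she ends the scene trusting Margaret less"),
    "reader_pull": (2, "clear open question: will the weld hold?"),
}))
```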
But here’s what neither system can fully replicate: the accumulated imaginative context that a family builds over years.
When my daughter describes a setting, I know her aesthetic. When my son-in-law introduces an NPC, I can guess whether we’re supposed to trust him based on decades of watching the same movies, reading similar books, playing games together. When my granddaughter tries something audacious, we all know whether it’s clever or genre-breaking because we share the genre. We’ve been conditioning on each other for her entire life.
The D&D researchers had to build a typed API precisely because their AI agents don’t have that. Claude and GPT-4o haven’t spent years at our table. They don’t know that our family leans toward whimsy over grimdark, that certain tropes land and others fall flat. So the researchers imposed external structure—function calls, state validators, rule enforcement—to substitute for the shared imaginative history they couldn’t provide.
With those voice profiles I created for my novel—Amara's engineering vocabulary, Margaret's geological metaphors, Susan's biological patience—I'm trying to give Claude enough accumulated context to know what "fits" for each character. The rubric is a substitute for having read the same books together, for knowing what a Susan Harris sentence feels like versus a David Mitchell sentence.
The gap is closing. The AIs in this paper were surprisingly good at tactics and characterization. They made sensible choices. They stayed in voice. They just couldn’t hold it together over time. Turn by turn, their certainty about the game state eroded, until they were swinging swords at corpses and casting spells through walls.
My granddaughter wins because she was there. She conditions on the full history. She carries the campaign in her head not as a context window to be processed but as a lived experience to be remembered.
That’s still something the machines can’t do. Not yet. But when they learn—when an AI can sit at a table and remember not just the rules but the stories, not just the mechanics but the meaning—I wonder what game we’ll be playing then.
I’ll keep my dice ready.
References
- Callison-Burch et al. (2025). "Setting the DC: Tool-Grounded D&D Simulations to Test LLM Agents." NeurIPS 2025 Workshop. https://openreview.net/forum?id=eKL8VUFCEL