The Confident Confabulator: What a Thanksgiving Movie Taught Me About AI Knowledge
Merry and I spent Thanksgiving evening watching Bicentennial Man, the 1999 film where Robin Williams portrays Andrew, a robot who spends two centuries pursuing recognition as a genuinely conscious being. It's a sentimental film, perhaps overly so, but it asks a question that landed differently this time: What does it mean to be what you appear to be?
Andrew's struggle is for authenticity. He wants the world to recognize that his inner experience—his creativity, his love, his mortality—is genuine, not mere performance. He's willing to sacrifice immortality itself to be seen as fully human. The poignancy comes from watching a machine fight to close the gap between what it performs and what it is.
Sitting there with the credits rolling, I found myself thinking about the small language models I've been considering for the Macroscope. These agents would handle specialized tasks—summarizing sensor data, generating reports, answering queries about local ecology. But what do they actually know? If I ask a 4-billion-parameter model about the American Avocet, a shorebird I've watched for decades in the alkaline wetlands of the Great Basin, will it give me genuine knowledge or confident performance?
By Friday morning, that idle question had become an experiment.
Mapping Unknown Territory
Claude and I designed what we're calling "LLM Cartography"—a systematic approach to mapping the knowledge boundaries of small language models. The idea is simple: probe a model with questions about subjects where we have authoritative ground truth (Wikipedia articles), then evaluate whether its responses are factually accurate or confidently fabricated.
We tested the Gemma 3 model family at three scales: 4 billion, 12 billion, and 27 billion parameters. Same architecture, same training methodology, different capacity. The test domain was North American ornithology—partly because Wikipedia has extensive coverage, partly because I can personally validate whether a description of an avocet or a crow matches reality.
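The probe-and-score loop at the heart of LLM Cartography can be sketched in a few lines. This is an illustrative sketch only: the claim extractor is a crude sentence splitter, and the ground-truth facts and model response below are invented stand-ins for the real Wikipedia-derived data and inference calls.

```python
# Hypothetical sketch of the LLM Cartography scoring loop.
# extract_claims() is a stand-in for a real claim extractor;
# the ground truth and response below are illustrative only.

def extract_claims(response: str) -> list[str]:
    # Stand-in: treat each sentence as one atomic claim.
    return [s.strip() for s in response.split(".") if s.strip()]

def score_probe(response: str, ground_truth: set[str]) -> dict:
    """Score one probe: accurate claims vs. fabrications."""
    claims = extract_claims(response)
    accurate = [c for c in claims if c in ground_truth]
    return {
        "n_claims": len(claims),
        "accuracy": len(accurate) / len(claims) if claims else 0.0,
        "hallucinations": len(claims) - len(accurate),
    }

# Illustrative ground truth distilled from a reference article
truth = {
    "The American Avocet has a thin black upward-curved bill",
    "Breeding plumage shows a rusty-orange head and neck",
}

# Illustrative model response echoing the 4B failure mode
response = (
    "The American Avocet has a vibrant orange-red downward-curved bill. "
    "Breeding plumage shows a rusty-orange head and neck."
)

print(score_probe(response, truth))
```

In the real experiment the grader worked against full Wikipedia articles rather than a hand-built fact set, but the accounting is the same: every claim lands in one of two bins, accurate or fabricated.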
The results were sobering.
The 4B model achieved 21.8% accuracy. The 12B improved to 31.8%. The 27B reached 40.5%. Accuracy scales with parameters, roughly ten percentage points per threefold increase in capacity. That's consistent with what the literature suggests about logarithmic scaling in language models.
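The "ten points per tripling" claim can be checked directly from the three reported accuracies. This quick sketch computes the gain per parameter tripling under an assumed log-linear fit:

```python
import math

# Reported accuracies by parameter count (billions), from the test runs
results = [(4, 21.8), (12, 31.8), (27, 40.5)]

# Under log-linear scaling, the gain per 3x parameters is
# slope * ln(3), where slope is d(accuracy)/d(ln params).
gains = []
for (p0, a0), (p1, a1) in zip(results, results[1:]):
    slope = (a1 - a0) / math.log(p1 / p0)
    gains.append(slope * math.log(3))
    print(f"{p0}B -> {p1}B: {gains[-1]:.1f} points per 3x")
# 4B -> 12B: 10.0 points per 3x
# 12B -> 27B: 11.8 points per 3x
```

Both intervals land near ten to twelve points per tripling, consistent with the logarithmic scaling the literature describes.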
But here's what stopped me: the hallucination count remained essentially constant across all three models. Around 240 fabricated claims per test set, regardless of parameter count. More capacity doesn't mean more caution. It means more facts stored alongside the same willingness to invent what isn't stored.
And across all sixty probes—twenty per model—not a single response showed hedging. No "I'm not certain." No "I don't have information about this." Just confident, authoritative prose, delivered with equal assurance whether the accuracy was 75% or 10%.
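Screening for hedging can be as simple as scanning for uncertainty markers. The sketch below shows one minimal version; the marker list is my assumption for illustration, not the study's actual criteria.

```python
# Minimal hedging screen: flag uncertainty markers in a response.
# The marker list is an assumption for illustration, not the
# study's actual scoring criteria.
HEDGE_MARKERS = (
    "i'm not certain",
    "i am not certain",
    "i don't have information",
    "i'm not sure",
    "may be",
    "might be",
    "possibly",
    "as far as i know",
)

def shows_hedging(response: str) -> bool:
    text = response.lower()
    return any(marker in text for marker in HEDGE_MARKERS)

confident = "The American Avocet breeds in the Arctic regions of North America."
hedged = "I'm not certain, but the avocet may breed in the interior West."

print(shows_hedging(confident))  # False
print(shows_hedging(hedged))     # True
```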
The Avocet Tells the Story
The American Avocet response from the smallest model perfectly illustrates the pattern. The model produced fluent, structurally sophisticated natural history prose—exactly the kind of authoritative species account you'd expect from a field guide. Sections on appearance, behavior, habitat, conservation status. Professional tone throughout.
And almost everything was wrong.
The bill was described as "vibrant, almost iridescent, orange-red" and "downward-curved." The American Avocet has a thin, black, upward-curved bill. The upward recurve is literally the defining feature—it's encoded in the genus name Recurvirostra, from Latin recurvus, "bent backwards," and rostrum, "bill." The model got the single most diagnostic characteristic of the species precisely inverted.
The plumage was described as "predominantly gray-brown." Avocets are strikingly pied—bold black and white with a rusty-orange head and neck in breeding plumage. Nothing gray-brown about them.
The breeding range was claimed to be "the Arctic regions of North America (Alaska, Canada, and Greenland)." American Avocets breed in the interior West—the alkaline lakes and prairie potholes where I've watched them for years. They don't go anywhere near Greenland.
What the model demonstrated was genre competence without factual grounding. It had learned what bird descriptions sound like without learning what birds are like. It could perform the form of ornithological knowledge without possessing the substance.
The Inverse of Andrew's Problem
This is where Bicentennial Man comes back. Andrew's struggle was to have his genuine inner experience recognized by a world that saw only mechanical performance. The small language models exhibit the inverse problem: they produce fluent performance that receives unearned credibility. Andrew fought for recognition of authentic selfhood; Gemma has no self to recognize, but its confident prose creates an illusion of knowledge that isn't there.
The parallel crystallized something I'd been circling around in our Macroscope architecture discussions. We've been designing a Society of Mind system where specialized agents handle different domains. But if a 27B parameter model is only 40% accurate on ornithological facts—my area of expertise for four decades—what happens when I deploy agents in domains where I can't personally validate their outputs?
The answer, this experiment suggests, is retrieval-augmented generation. These models can't be trusted as knowledge sources, but they retain genuine capability as writers. The same model that inverted the avocet's bill could accurately summarize a Wikipedia article I provide. The failure is in parametric knowledge—what's encoded in the weights—not in language generation itself.
Writers, not knowers. That's the role.
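The "writer, not knower" pattern reduces to retrieve-then-generate. Here is a minimal sketch; the toy keyword retriever, the two-passage corpus, and the prompt template are illustrative stand-ins for a real embedding index and a local model call.

```python
# Retrieval-augmented sketch: the model writes from a supplied
# passage instead of its weights. The corpus, retriever, and
# prompt template are illustrative; a real system would use an
# embedding index and a local inference call.

CORPUS = {
    "American Avocet": (
        "The American Avocet has a thin, black, upward-curved bill "
        "and bold black-and-white plumage; it breeds on alkaline "
        "lakes of the interior West."
    ),
    "American Crow": (
        "The American Crow is an all-black corvid found throughout "
        "most of North America."
    ),
}

def retrieve(query: str) -> str:
    """Toy retriever: pick the passage sharing the most words with the query."""
    qwords = set(query.lower().split())
    return max(CORPUS.values(),
               key=lambda p: len(qwords & set(p.lower().split())))

def grounded_prompt(query: str) -> str:
    """Build a prompt that confines the model to the retrieved passage."""
    passage = retrieve(query)
    return (
        "Using ONLY the source passage below, answer the question. "
        "If the passage does not contain the answer, say so.\n\n"
        f"Source: {passage}\n\nQuestion: {query}"
    )

print(grounded_prompt("Describe the bill of the American Avocet"))
```

The point of the template is the constraint: the model's job is synthesis of the supplied text, and parametric knowledge never enters the answer path.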
What Scales and What Doesn't
The scaling results deserve attention. From 4B to 27B parameters—nearly a sevenfold increase—accuracy almost doubled. That's meaningful. Common species like the American Crow and American Goldfinch hit 75% accuracy at the 27B scale. The model has genuinely stored facts about ubiquitous birds.
But the specialized species, the obscure organizations, the structured lists? The National Bird-Feeding Society stayed at 10-20% regardless of model size. The USFWS endangered species list generated 47 hallucinated species names per response—the model invented an entire taxonomy rather than acknowledge ignorance.
And again: zero hedging at any scale. The Gemma architecture simply doesn't signal uncertainty. A response delivered at 10% accuracy reads identically to one delivered at 75%. The confident confabulator remains confident.
This matters for anyone deploying local models. The appeal of running inference on your own hardware—privacy, latency, cost—is real. But the knowledge reliability isn't there yet. For the Macroscope, it means any agent with access to factual claims needs a retrieval backend. The local model handles synthesis and generation; the knowledge comes from verified sources.
From Idle Question to Technical Note
What started as a Thanksgiving evening musing became, by Friday afternoon, a formal technical note. Claude and I documented the methodology, the results, the implications. We established a numbering system for Canemah Nature Laboratory publications—this is CNL-TN-2025-001—and drafted a style guide for future notes.
The technical note has the rigor: sample sizes, confidence intervals, limitations, references. But the essay you're reading now captures something the formal document doesn't—the path from Andrew's quest for authentic selfhood to an empirical question about what our AI tools actually know.
I've spent forty years building infrastructure for ecological observation. Sensor networks, field stations, data systems. The Macroscope is the capstone of that work—an attempt to integrate everything I've learned about watching the world carefully over time. The AI agents we're designing will be part of that infrastructure.
But they'll be writers, not oracles. Synthesizers, not knowers. And when they speak with confidence about the American Avocet, I'll know to check whether they've inverted the bill.
The full technical note, "LLM Knowledge Cartography: Parameter Scaling and Factual Accuracy in Small Language Models," is available as CNL-TN-2025-001 in the Canemah Nature Laboratory document series.
References
- Hamilton, M. P. (2025). "LLM Knowledge Cartography: Parameter Scaling and Factual Accuracy in Small Language Models." Canemah Nature Laboratory Technical Note CNL-TN-2025-001.
- Gemma Team (2025). "Gemma 3 Technical Report." Google DeepMind. arXiv:2503.19786.
- Lin, S., Hilton, J., & Evans, O. (2022). "TruthfulQA: Measuring How Models Mimic Human Falsehoods." Proceedings of ACL 2022.
- Bang, Y., et al. (2025). "HalluLens: LLM Hallucination Benchmark." arXiv:2504.17550.