NEW YORK: Even the creators of generative AI—the technology expected to transform how we live and work—admit they don’t fully understand how it works.
“People are often surprised, even alarmed, to hear that we don’t understand how our own AI systems function,” Anthropic co-founder Dario Amodei wrote in an essay published in April.
“This kind of knowledge gap is unheard of in the history of technology,” he added.
Unlike traditional software that runs on rules programmed by humans, generative AI (or gen AI) systems learn on their own, figuring out how to respond to prompts with minimal guidance.
Chris Olah, a former OpenAI researcher now at Anthropic, compared gen AI models to “scaffolding” on which neural circuits grow over time. Olah is a leading expert in mechanistic interpretability, a field focused on reverse-engineering AI to understand how it produces outputs.
This area of research, just about a decade old, has gained momentum as AI becomes more influential.
“Understanding an entire large language model is a huge task,” said Neel Nanda, a senior research scientist at Google DeepMind.
“It’s a bit like trying to fully understand the human brain,” he told AFP, noting that neuroscience hasn’t cracked that code either.
Interest in understanding AI’s inner workings has surged, especially among students. “They see the potential impact,” said Boston University computer science professor Mark Crovella. He noted that the field is also appealing because it combines technical challenge with intellectual curiosity, and could make AI systems more reliable and powerful.
Peering Inside the Machine
Mechanistic interpretability doesn’t just focus on outputs. It examines the internal processes: the actual calculations an AI performs when answering a query.
“You can look into the model and observe what’s happening step-by-step,” Crovella explained.
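In practice, that step-by-step observation often begins with something as simple as recording a model’s internal activations. The sketch below is a minimal, hypothetical illustration (a toy PyTorch network, not any lab’s actual tooling) of how researchers attach “hooks” to watch what every digital neuron computes while a query passes through.

```python
# Minimal sketch, assuming a toy stand-in for a large language model:
# record the output of every "digital neuron" layer with PyTorch forward hooks.
import torch
import torch.nn as nn

# Hypothetical toy model standing in for a real model's stack of layers.
model = nn.Sequential(
    nn.Linear(8, 16), nn.ReLU(),
    nn.Linear(16, 16), nn.ReLU(),
    nn.Linear(16, 4),
)

activations = {}  # layer name -> tensor of neuron outputs for this query

def make_hook(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # snapshot what the layer computed
    return hook

# Attach a hook to every layer so nothing the model computes stays hidden.
for name, layer in model.named_modules():
    if isinstance(layer, (nn.Linear, nn.ReLU)):
        layer.register_forward_hook(make_hook(name))

_ = model(torch.randn(1, 8))  # run one "query" through the model

for name, act in activations.items():
    print(f"layer {name}: mean activation {act.mean().item():.3f}")
```

Real interpretability work applies the same idea to billions of neurons inside a large language model, which is part of what makes the task Nanda describes so daunting.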
Startups like Goodfire are developing tools to visualize this logic. Their software shows AI’s reasoning process and flags errors or dangerous behavior. The goal is to catch problems early and prevent misuse.
“It feels like a race against time. We need to understand these systems before we unleash more powerful ones into the world,” Goodfire CEO Eric Ho said.
In his essay, Amodei wrote that he is hopeful a breakthrough in interpretability could come within two years. Auburn University associate professor Anh Nguyen echoed this, saying that by 2027, AI models could be reliably examined for bias or harmful behavior.
Unlike the human brain, AI systems already allow researchers to observe every “digital neuron” and its role. “The model’s inner workings are visible to us,” Crovella said. “It’s just a matter of asking the right questions.”
Why Understanding AI Matters
Better interpretability could make gen AI safe for high-stakes fields like defense, healthcare, and finance, where a small error can have major consequences. It could also help humans learn new things, as seen with AlphaZero, DeepMind’s chess-playing AI that came up with strategies no human had imagined.
An AI model that is powerful and explainable could become a game-changer, not just in business, but in the global tech race, especially between the U.S. and China.
“Powerful AI will shape the future of humanity,” Amodei wrote. “We have a right to understand what we’ve created—before it reshapes our economy, our lives, and our world.”