This content originally appeared on Level Up Coding – Medium and was authored by Mohit Sewak, Ph.D.
Why the evolution toward structured, explicable thought is the most important story in the Generative AI world today.

We are witnessing the evolution of AI from a brilliant mimic to a genuine thinker.
Ah, pull up a chair. Let’s get some tea!
You know, for the last few years, playing with Large Language Models (LLMs) has felt a bit like talking to the most charismatic, well-read, and slightly drunk person at a party. They can quote Shakespeare, explain quantum physics (plausibly), and write a sonnet about your sneakers, all with the swagger of a seasoned pro.
But ask them to calculate the tip on a dinner bill, and they might confidently tell you it’s a million dollars and that your waiter is a descendant of Genghis Khan.
This, my friend, is the era of the “Stochastic Parrot” (Bender et al., 2021). These early models, like the now-legendary GPT-3, were masters of linguistic form but absolute novices in logical function. They were brilliant mimics, trained on a mind-boggling portion of the internet to predict the next word in a sequence. The results were often fluent, sometimes profound, and occasionally, hilariously, catastrophically wrong.
But while the world was dazzled by the parrot’s beautiful plumage, a quiet revolution was brewing in research labs. The most important story in AI today isn’t about making the parrot bigger; it’s about teaching it how to think. We are witnessing a fundamental shift from associative pattern-matching to structured, verifiable reasoning.
Welcome to the age of the Large Reasoning Model (LRM).
Now, before you think this is some brand-new AI architecture built from space-age silicon, let me clarify. An LRM isn’t a new type of engine. It’s a sophisticated new navigation system, traction control, and a driver’s manual that we’ve bolted onto an existing high-performance engine. It’s an LLM augmented with a “cognitive scaffolding” — a suite of techniques that guides, structures, and grounds its thought processes.
This evolution from fluency to fidelity marks the end of the stochastic parrot era. It’s the dawn of a new age where we’re building AI that can reason, not just recite. And trust me, that changes everything.
The High Stakes of Just Sounding Smart

In high-stakes fields, a plausible-sounding answer that’s wrong isn’t a bug; it’s a landmine.
Let’s be real. If an AI writes a goofy poem, it’s a feature. If it hallucinates a legal precedent in a courtroom, it’s a catastrophic failure. A plausible-sounding answer is a landmine waiting to go off in any high-stakes field: medical diagnostics, financial auditing, scientific discovery, you name it.
I saw this all the time in my cybersecurity days. You’d have a system that could perfectly describe the characteristics of a zero-day attack but couldn’t follow the logical steps to actually trace its origin. It talked the talk but couldn’t walk the walk. It’s the difference between a kickboxer who can describe a perfect roundhouse kick and one who can actually land it in a fight. One is theory, the other is practice.
The core challenge for the entire AI community became this: How do we bridge the canyon between an LLM’s dazzling linguistic competence and its fragile logical fidelity? How do we teach the drunken poet to become a sober mathematician?
The answer didn’t come from one single breakthrough, but from a series of clever, systematic hacks on the AI’s brain.
“The important thing is not to stop questioning. Curiosity has its own reason for existing.” — Albert Einstein
Pillar I: Waking Up the Ghost in the Machine

The epiphany wasn’t to teach the AI the answer, but to teach it to “show its work.”
The journey to LRMs began with a wild discovery: the ability to reason wasn’t something we had to build from scratch. For truly massive models, it was already there, dormant, hiding in the labyrinth of neural connections, waiting to be coaxed out.
The “Show Your Work” Epiphany: Chain-of-Thought
Imagine a rookie detective, “Chad GPT.” He’s slick, confident, and solves every case by declaring, “The butler did it!” Why? Because he’s read a million mystery novels where the butler did it. He’s matching a pattern.
Then, a new police chief comes in and issues a memo: “From now on, you don’t just name a suspect. You write out, step-by-step, how you got there.”
This is exactly what researchers at Google did with Chain-of-Thought (CoT) prompting (Wei et al., 2022b). Instead of asking a model for the final answer to a math problem, they showed it a few examples where they wrote out the intermediate steps.
The result? It was like flipping a switch. The model’s performance on reasoning tasks skyrocketed. By forcing it to show its work, we weren’t just getting a more accurate answer; we were getting a window into its “thinking” process, making it transparent and debuggable.
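If you want to try this yourself, a few-shot CoT prompt is nothing more exotic than a worked example with its steps written out, followed by the question you actually care about. Here’s a minimal sketch in Python; the wording is my own illustration, not the exact prompt from the paper:

```python
# A minimal few-shot Chain-of-Thought prompt: one worked example that spells
# out its intermediate steps, followed by the new question.
cot_prompt = """Q: A cafeteria had 23 apples. They used 20 for lunch and bought 6 more.
How many apples do they have now?
A: They started with 23 apples. They used 20, so 23 - 20 = 3 remained.
They bought 6 more, so 3 + 6 = 9. The answer is 9.

Q: Roger has 5 tennis balls. He buys 2 cans with 3 balls each. How many does he have now?
A:"""

print(cot_prompt)  # send this string to whatever model you're using
```

The demonstration does the heavy lifting: the model imitates the step-by-step shape of the example before it commits to a number.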
The Magic Words and the Wisdom of the Crowd
Then things got even weirder. Researchers found you didn’t even need to provide examples. Just adding the simple phrase, “Let’s think step by step,” to a prompt could trigger this reasoning mode in a large model (Kojima et al., 2022). This was huge. It meant reasoning wasn’t just mimicry; it was a genuine emergent capability of scale.
But one line of reasoning can still be wrong. That’s where Self-Consistency comes in (Wang et al., 2022). Think of it as asking our detective squad not just for one theory, but for five different ones. If four of them point to the groundskeeper and only one points to the butler, we go with the majority. By generating multiple reasoning paths and choosing the most common answer, we can filter out outlier mistakes. It’s the wisdom of the crowd, but the crowd is the AI itself.
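Here’s a minimal sketch of that voting loop. The `llm()` helper is a stand-in for whatever model API you actually use (it’s my placeholder, not a real library call), and the answer extraction is deliberately naive:

```python
from collections import Counter

def llm(prompt: str) -> str:
    """Placeholder: call your model provider here and return its text output."""
    raise NotImplementedError("wire this up to the API you actually use")

def extract_answer(completion: str) -> str:
    # Deliberately naive: assume the final line carries the answer.
    return completion.strip().splitlines()[-1]

def self_consistent_answer(question: str, samples: int = 5) -> str:
    prompt = f"{question}\nLet's think step by step."
    # Sample several independent reasoning paths (use a non-zero temperature)...
    answers = [extract_answer(llm(prompt)) for _ in range(samples)]
    # ...then keep the answer that the most paths agree on.
    return Counter(answers).most_common(1)[0][0]
```

Sampling at a non-zero temperature matters here: five identical greedy decodes would just hand you the same path, and the same mistake, five times.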
ProTip: When you’re stuck on a complex problem with an LLM, don’t just ask for the answer. Add “Let’s break this down step-by-step” to your prompt. You’re not just getting a better answer; you’re steering how the model reasons and getting a transcript you can actually check, which makes it a much better partner for you.
Pillar II: Building a Mind Palace

True reasoning isn’t a straight line. It’s a sprawling web of exploration, synthesis, and discovery.
Okay, so our detective Chad can now follow a single trail of clues. But real detective work isn’t a straight line. It’s a messy, sprawling web of leads, dead ends, and “what ifs.” You have to explore multiple possibilities at once.
From Chains to Trees: Exploring a Forest of Ideas
This is where the Tree of Thoughts (ToT) framework comes in (Yao et al., 2023a). Instead of just generating one next step, the model is prompted to generate multiple potential next steps. Then, like a seasoned detective, it evaluates each of those branches: “Is this lead promising? Is that one a dead end?” It then pursues the most promising paths, effectively building a decision tree in real-time. It’s the difference between walking down a single path in a forest and sending out scouts to explore multiple paths simultaneously.
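A stripped-down version of that search looks like the sketch below. Real ToT implementations are more careful about how thoughts are proposed and evaluated; this is just the skeleton, again assuming a placeholder `llm()` helper of my own:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model of choice and return its text output."""
    raise NotImplementedError

def propose_thoughts(problem: str, partial: str, k: int = 3) -> list[str]:
    # Ask the model for k candidate next steps, given the steps taken so far.
    out = llm(f"Problem: {problem}\nSteps so far:\n{partial}\n"
              f"Propose {k} different possible next steps, one per line.")
    return [line.strip() for line in out.splitlines() if line.strip()][:k]

def score_partial(problem: str, partial: str) -> float:
    # Ask the model to judge how promising this partial solution looks.
    out = llm(f"Problem: {problem}\nPartial solution:\n{partial}\n"
              "Rate how promising this is from 0 to 10. Reply with only the number.")
    try:
        return float(out.strip())
    except ValueError:
        return 0.0

def tree_of_thoughts(problem: str, depth: int = 3, beam: int = 2) -> str:
    frontier = [""]  # the partial solutions still being explored
    for _ in range(depth):
        candidates = [p + t + "\n"
                      for p in frontier
                      for t in propose_thoughts(problem, p)]
        # Keep only the most promising branches and prune the rest.
        candidates.sort(key=lambda c: score_partial(problem, c), reverse=True)
        frontier = candidates[:beam] or frontier
    return frontier[0]
```

Keeping only the top `beam` branches at each level is a simple beam search; the original paper also explores breadth-first and depth-first variants of the same idea.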
From Trees to Webs: The Corkboard Moment
But even that’s not quite right, is it? A brilliant detective doesn’t just follow separate leads; they put all the clues up on a corkboard and start drawing lines between them, synthesizing a new insight that wasn’t present in any single lead.
Enter the Graph of Thoughts (GoT) (Besta et al., 2023). This framework allows the AI to not only branch out but to merge and combine different lines of reasoning. It can take a good idea from Path A and a good idea from Path C and weave them together into a superior hybrid solution. It can even create loops, allowing it to refine an idea over and over again. This is where the AI starts to look less like a calculator and more like a genuine creative collaborator.
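The full GoT framework maintains an explicit graph of scored thought nodes; the sketch below collapses it down to the two operations that make it a graph rather than a tree: merging drafts, and looping back to refine them. As before, `llm()` is my placeholder for your model call:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

def merge_thoughts(problem: str, drafts: list[str]) -> str:
    # The aggregation step that makes this a graph rather than a tree:
    # several independent lines of reasoning are fused into one solution.
    joined = "\n\n".join(f"Draft {i + 1}:\n{d}" for i, d in enumerate(drafts))
    return llm(f"Problem: {problem}\n\n{joined}\n\n"
               "Combine the strongest ideas from these drafts into one improved solution.")

def refine(problem: str, solution: str, rounds: int = 2) -> str:
    # A self-loop on the same node: the merged solution is polished repeatedly.
    for _ in range(rounds):
        solution = llm(f"Problem: {problem}\nCurrent solution:\n{solution}\n"
                       "Improve this solution. Keep what works, fix what doesn't.")
    return solution
```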
And the next level? Frameworks like SELF-DISCOVER (Zhou et al., 2024), where the model doesn’t just follow a pre-defined structure but autonomously chooses the best reasoning strategy — a tree, a graph, a simple chain — for the specific problem at hand. It’s learning to reason about how to reason.
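Loosely, that pipeline runs in three stages: select generic reasoning “modules,” adapt them to the task, and compose them into an explicit plan before solving. The sketch below is my paraphrase of that flow, with an invented module list and placeholder prompts rather than the paper’s actual ones:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

# An invented seed set of generic reasoning "modules" the model can choose from.
REASONING_MODULES = [
    "Break the problem into smaller sub-problems.",
    "Reason step by step in a simple chain.",
    "Explore and compare several alternative approaches.",
    "Combine partial results from different approaches.",
    "Check assumptions and edge cases before concluding.",
]

def self_discover(task: str) -> str:
    modules = "\n".join(f"- {m}" for m in REASONING_MODULES)
    # SELECT: pick the modules that are relevant to this particular task.
    selected = llm(f"Task: {task}\nAvailable reasoning modules:\n{modules}\n"
                   "Select the modules that would help solve this task.")
    # ADAPT: rephrase the chosen modules so they are specific to the task.
    adapted = llm(f"Task: {task}\nChosen modules:\n{selected}\n"
                  "Rewrite each chosen module so it is specific to this task.")
    # IMPLEMENT: compose them into an explicit, ordered reasoning plan.
    plan = llm(f"Task: {task}\nAdapted modules:\n{adapted}\n"
               "Turn these into a numbered, step-by-step reasoning plan.")
    # Finally, solve the task by following the discovered structure.
    return llm(f"Task: {task}\nFollow this reasoning plan exactly:\n{plan}")
```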
Fact Check: The concept of System 1 (fast, intuitive thought) and System 2 (slow, deliberate reasoning) that inspired the Tree of Thoughts framework was popularized by Nobel laureate Daniel Kahneman in his book “Thinking, Fast and Slow.”
Pillar III: Ground Control to Major Tom

Even the most brilliant mind is useless without a connection to the real world.
Here’s the dirty little secret of all LLMs: they are brains in a vat. Their knowledge is frozen in time, they can’t do math reliably, and they have no access to real-time information. Without a connection to the outside world, even the most brilliant reasoning is just navel-gazing.
This is the grounding problem. The solution? Give the brain a phone, a calculator, and a library card.
Leveraging the Best of Both Worlds
First, for things like math, we have Program-Aided Language Models (PAL) (Gao et al., 2022). The idea is genius in its simplicity: don’t ask the brilliant linguist to do calculus. Ask the linguist to translate the word problem into a Python script, and then let a simple, trusted calculator run the code. This splits the labor perfectly: the LLM does the language and logic breakdown, and the computer does the number crunching with perfect precision.
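In code, the division of labor is almost embarrassingly small: the model writes a little program, and the interpreter runs it. The `llm()` helper below is my placeholder for your model call, and note the caveat about executing generated code:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return the Python code it writes."""
    raise NotImplementedError

def solve_with_pal(word_problem: str) -> str:
    # The model does the language work: translate the problem into code that
    # stores its final result in a variable named `answer`.
    code = llm("Write Python code that solves this problem and assigns the "
               f"final result to a variable named `answer`:\n{word_problem}")
    # The interpreter does the arithmetic, with perfect precision.
    # (In production you would sandbox this; exec-ing model output is risky.)
    scope: dict = {}
    exec(code, scope)
    return str(scope["answer"])
```

Splitting the task this way means the model never has to “do” the arithmetic at all; it only has to describe it.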
The AI with a Toolbelt
This idea was then generalized to a full “toolbelt.” Frameworks like Toolformer (Schick et al., 2023) and ReAct (Yao et al., 2022) taught LLMs how to use external tools. Now, the model can decide on its own to:
- Use a search engine to check for recent news.
- Query a database to get specific product information.
- Call an API to book a flight or check the weather.
This is the ReAct paradigm: the model Reasons about what it needs to do, then Acts by calling a tool. It gets the result, and that new information feeds the next cycle of reasoning.
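A bare-bones version of that Reason, Act, Observe loop looks like the sketch below. The prompt format, the `search` tool, and the `llm()` helper are all placeholders of mine, standing in for a real agent framework:

```python
import re

def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

def web_search(query: str) -> str:
    """Placeholder tool: swap in a real search or database API."""
    return "search results for: " + query

TOOLS = {"search": web_search}

def react(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        step = llm(transcript +
                   "Reply with 'Thought: ...' then 'Action: tool_name[input]', "
                   "or with 'Final Answer: ...' if you are done.")
        transcript += step + "\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:", 1)[1].strip()
        match = re.search(r"Action:\s*(\w+)\[(.*)\]", step)
        if match:
            tool, arg = match.group(1), match.group(2)
            observation = TOOLS.get(tool, lambda _: "unknown tool")(arg)
            # The tool result comes back as an Observation, which grounds
            # the next round of reasoning in something outside the model.
            transcript += f"Observation: {observation}\n"
    return "No final answer within the step budget."
```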
This is topped off with Retrieval-Augmented Generation (RAG), which you can think of as giving the model an open-book exam. Instead of relying on its hazy memory of the training data, it can first retrieve relevant, up-to-date documents from a trusted knowledge base and base its answer on that specific, verifiable information (Lewis et al., 2020).
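A toy version of RAG fits in a dozen lines: retrieve the most relevant passages, paste them into the prompt, and instruct the model to answer only from them. The word-overlap retriever below is a stand-in for the embedding-based vector search a real system would use, and `llm()` is again my placeholder:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

def retrieve(query: str, documents: list[str], k: int = 2) -> list[str]:
    # Toy retriever: rank documents by word overlap with the query.
    # Real systems use embeddings and a vector index instead.
    q = set(query.lower().split())
    return sorted(documents,
                  key=lambda d: len(q & set(d.lower().split())),
                  reverse=True)[:k]

def rag_answer(question: str, documents: list[str]) -> str:
    context = "\n\n".join(retrieve(question, documents))
    # The open-book exam: the model is told to answer from the retrieved
    # passages, and to admit when they don't contain the answer.
    return llm("Answer the question using only the context below. "
               "If the context is insufficient, say so.\n\n"
               f"Context:\n{context}\n\nQuestion: {question}")
```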
“The greatest enemy of knowledge is not ignorance, it is the illusion of knowledge.” — Stephen Hawking
The Frontier: Can a Reasoning Engine Trust Itself?

The final frontier of reasoning isn’t just solving the problem, but knowing when you’ve solved it incorrectly.
So we’ve built this incredible cognitive architecture. Our detective Chad now thinks step-by-step, explores multiple leads, and uses forensic tools. But there’s one final, terrifying question: does he know when he’s wrong?
The sobering answer, for now, is… not really.
A critical study from Google DeepMind found that Large Language Models Cannot Self-Correct Reasoning Yet (Huang et al., 2023). When a model makes a mistake and is asked to double-check its work, it often just confidently repeats its error or even doubles down on the flawed logic. The ability to spot your own mistakes, a cornerstone of human intelligence, is not yet an emergent property of these systems.
So, we have to scaffold that, too.
Enter Self-Refine (Madaan et al., 2023). This is the universal process of “draft, critique, revise,” but for an AI. We have the model generate an answer. Then, we prompt it to act as a critic and provide feedback on its own work. Finally, we ask it to use that feedback to revise its initial answer. It’s a forced, deliberate loop of self-improvement.
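The whole loop is three prompts: draft, critique, revise. Here’s a minimal sketch, with `llm()` once again standing in for your model call:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

def self_refine(task: str, rounds: int = 2) -> str:
    draft = llm(f"Task: {task}\nWrite your best answer.")
    for _ in range(rounds):
        # Critique: the model plays reviewer on its own draft.
        feedback = llm(f"Task: {task}\nAnswer:\n{draft}\n"
                       "Critique this answer. List concrete flaws and omissions.")
        # Revise: the same model rewrites the draft using that feedback.
        draft = llm(f"Task: {task}\nAnswer:\n{draft}\nFeedback:\n{feedback}\n"
                    "Rewrite the answer, fixing every issue raised in the feedback.")
    return draft
```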
This is also where principles like Constitutional AI come into play, where a model is guided by a set of core principles (a “constitution”) to help it critique and align its own outputs, ensuring they are not just correct, but also helpful and harmless (Bai et al., 2022).
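A rough sketch of that critique-and-revise pass against explicit principles is below. Fair warning that this is a simplification: in the actual Constitutional AI recipe these critique-revision pairs are primarily used to generate data for fine-tuning, not just applied at inference time, and the three-line “constitution” here is my own invention:

```python
def llm(prompt: str) -> str:
    """Placeholder: call your model and return its text output."""
    raise NotImplementedError

# A toy "constitution": three invented principles, far shorter than the real thing.
CONSTITUTION = [
    "Be helpful and answer the question that was actually asked.",
    "Do not include harmful, dangerous, or deceptive content.",
    "Acknowledge uncertainty instead of inventing facts.",
]

def constitutional_revision(prompt: str) -> str:
    draft = llm(prompt)
    for principle in CONSTITUTION:
        # Critique the draft against one principle at a time...
        critique = llm(f"Response:\n{draft}\n"
                       f"Does this response violate the principle '{principle}'? "
                       "If so, explain exactly how.")
        # ...then rewrite it so the principle is respected.
        draft = llm(f"Response:\n{draft}\nCritique:\n{critique}\n"
                    "Rewrite the response so it fully respects the principle.")
    return draft
```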
ProTip: When using an AI for a critical task, don’t trust the first output. Use the “Self-Refine” method yourself.
Ask it: “Critique this response from the perspective of an expert. What are its flaws and weaknesses?” Then follow up with: “Now, rewrite the original response, incorporating that feedback.”
The Path Forward: Four Big Cases Left to Solve

The road ahead is clear, but the challenges are significant. This is the work that matters now.
The journey from a Stochastic Parrot to a Large Reasoning Model has been breathtaking. But we’re not at the end of the road. There are still massive challenges ahead, the four big unsolved cases for the LRM era:
- Verifiability vs. Plausibility: How do we build a lie detector for AI reasoning? We need to ensure a model’s chain of thought isn’t just a plausible story but is logically sound and faithful to its final answer.
- Computational Efficiency: All this complex reasoning is expensive. Tree of Thoughts and Self-Consistency are powerful but burn through computing resources. The next challenge is making robust reasoning lean and fast enough for everyone.
- The Leap to True Autonomy: How do we get the AI to do that “critique and revise” loop on its own, without us holding its hand? The ultimate goal is a model that can genuinely learn from its mistakes, intrinsically.
- The Ethics of Agency: As these models start using tools and acting in the world, we need to build the guardrails. How do we ensure their actions are safe, aligned with human values, and that someone is accountable?
The Final Word
So, let’s pour another cup of tea.
We’ve come a long way from the mesmerizing but unreliable parrot. We’ve journeyed through a systematic, brilliant, and sometimes scrappy process of engineering thought itself. We learned to first elicit the ghost of reasoning from the machine (CoT), then structure it into complex webs of thought (ToT/GoT), then ground it in reality with tools, and now, we stand on the critical frontier of making it self-aware and reliable.
The future of AI that can truly help us solve the world’s hardest problems — from curing diseases to tackling climate change — doesn’t lie in simply building bigger parrots. It lies in continuing to build these elegant scaffolds of reason, turning our amazing mimics into trustworthy, collaborative partners.
The parrot is learning to think. And it’s the most exciting story in the world.
References
Here’s a look at some of the groundbreaking papers that mapped out this journey.
Foundational Elicitation Techniques
- Kojima, T., Gu, S. S., Reid, M., Matsuo, Y., & Iwasawa, Y. (2022). Large Language Models are Zero-Shot Reasoners. arXiv:2205.11916.
- Wang, X., Wei, J., Schuurmans, D., Le, Q., Chi, E., & Zhou, D. (2022). Self-Consistency Improves Chain of Thought Reasoning in Language Models. arXiv:2203.11171.
- Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q. V., & Zhou, D. (2022b). Chain-of-Thought Prompting Elicits Reasoning in Large Language Models. Advances in Neural Information Processing Systems (NeurIPS). arXiv:2201.11903.
Advanced Reasoning Structures
- Besta, M., Blach, N., Kubicek, A., Gerstenberger, R., Gianinazzi, L., Gajda, J., Lehmann, T., Nisa, M. I., Hoefler, T., & Podstawski, M. (2023). Graph of Thoughts: Solving Elaborate Problems with Large Language Models. arXiv:2308.09687.
- Yao, S., Yu, D., Zhao, J., Shafran, I., Griffiths, T. L., Cao, Y., & Narasimhan, K. (2023a). Tree of Thoughts: Deliberate Problem Solving with Large Language Models. arXiv:2305.10601.
- Zhou, P., Pujara, J., Ren, X., Mishra, S., Zheng, S., Zhou, D., Cheng, H.-T., Le, Q. V., Chi, E. H., & Chen, X. (2024). Large Language Models Self-Discover Reasoning Structures. arXiv:2402.04201.
Tool-Use and Grounding in Reality
- Gao, L., Madaan, A., Zhou, S., Alon, U., Liu, P., Yang, Y., Neubig, G., & Narasimhan, K. (2022). Program-aided Language Models. arXiv:2211.10435.
- Schick, T., Dwivedi-Yu, J., Dessì, R., Raileanu, R., Lomeli, M., Zettlemoyer, L., Cancedda, N., & Scialom, T. (2023). Toolformer: Language Models That Teach Themselves to Use Tools. arXiv:2302.04761.
- Yao, S., Zhao, J., Yu, D., Du, N., Shafran, I., Narasimhan, K., & Cao, Y. (2022). ReAct: Synergizing Reasoning and Acting in Language Models. arXiv:2210.03629.
Evaluation, Limitations, and Self-Correction
- Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., Chen, C., Olsson, C., Olah, C., Hernandez, D., Drain, D., Ganguli, D., Li, D., Tran, D., Perez, E., … Kaplan, J. (2022). Constitutional AI: Harmlessness from AI Feedback. arXiv:2212.08073.
- Huang, J., Chen, X., Mishra, S., Zheng, S., Yu, A. W., Song, X., & Zhou, D. (2023). Large Language Models Cannot Self-Correct Reasoning Yet. arXiv:2310.01798.
- Madaan, A., Tandon, N., Gupta, P., Hallinan, S., Gao, L., Wiegreffe, S., Alon, U., Dziri, N., Prabhumoye, S., Yang, Y., Gupta, S., Majumder, B. P., Matsuo, Y., Kulshreshtha, R., Shen, Q., Ghosh, S., Misra, I., & Choi, Y. (2023). Self-Refine: Iterative Refinement with Self-Feedback. arXiv:2303.17651.
- Microsoft Research. (2023). A Ladder of Reasoning: Testing the power of imagination in LLMs. Microsoft Research Blog.
Disclaimer: The views expressed in this article are my own and do not represent those of any employer or organization. I used AI assistance for research, drafting, and image generation for this article. This work is licensed under a Creative Commons Attribution-NoDerivatives 4.0 International License (CC BY-ND 4.0).