AI can make mistakes.
Please double-check responses.

On Friday mornings whenever all of us are together, my friends and I go and have breakfast at a local diner. We’ve done this enough over the years that we’re now certified regulars—the owner greets us with big handshakes when we get in, we’ve got “our table” in the back, we’re on a first-name basis with our waiter, who also knows our drink orders and doesn’t bother bringing us menus anymore.

Over the past few months, our conversations have drifted more and more towards AI. This week’s conversation echoed both the comments I’ve seen on my last post and my lived experience working extensively with these tools. It all comes down to that little warning at the bottom of every AI chat interface: AI can make mistakes. Please double-check responses.

I hate this phrasing. Not because it’s technically wrong, but because it sets the wrong expectations—that AI is mostly right, and you can mostly trust it. It gives the impression that AI (and whether it’s code or prose, we’re really talking about LLMs) can know things, but it just can’t. That’s fundamentally not how LLMs work. They’re probability machines, “spicy autocorrect” if you will—they fundamentally can’t discern fact from fiction. It’s not that LLMs lie or hallucinate, it’s that those concepts are alien to them. They optimize for whatever the next most likely token is. The ability to return facts is incidental. We trick ourselves into thinking they think, or that they know, or that they’re capable of knowing, because they’re all set up as really convincing chatbots. But that blinds us to where the edges of their capabilities are.

Because LLMs are probability tools, they converge to the mode, giving more weight to things that show up more often in their training data. You can think of it like those old face-merge images: one generic face that isn’t any specific person’s, but averages out to the most common configuration of facial features. Once you know this, and internalize it, you start to see the patterns yourself: well-established coding patterns are easier to replicate than complex, nuanced ones, and LLMs are “better” at writing in languages and frameworks with more publicly available code (Python, React, web dev in general) than in more niche ones. LLMs also really struggle with newer information, both because training data gets cut off at a given point and because there are fewer examples to train on.
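To make “converging to the mode” concrete, here’s a toy sketch. Nothing here resembles a real LLM’s scale or architecture, and the miniature “corpus” is invented for illustration—but the pull toward the most common pattern works the same way:

```python
from collections import Counter, defaultdict

# A toy "language model": count which continuation follows each prompt
# in a tiny, invented training corpus, then always emit the most
# frequent one seen during "training".
corpus = [
    ("sort the list with", "sorted()"),
    ("sort the list with", "sorted()"),
    ("sort the list with", "sorted()"),
    ("sort the list with", "a custom comparator"),  # rarer, nuanced pattern
]

transitions = defaultdict(Counter)
for prompt, continuation in corpus:
    transitions[prompt][continuation] += 1

def most_likely_next(prompt):
    # The common pattern always wins, whether or not it's right
    # for your particular situation.
    return transitions[prompt].most_common(1)[0][0]

print(most_likely_next("sort the list with"))
```

The rare-but-sometimes-necessary pattern never gets emitted, no matter how many times you ask.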

This also explains the lived phenomenon of people feeling like superheroes when working in unfamiliar contexts. If you’ve never coded before, but you can get an app working with a few prompts, you’ll feel like a superhero! And rightfully so! The ability to use code to solve real problems you have is one of the most powerful reasons I’ve encouraged and mentored so many people. But it’s also the reason why I need to constantly correct the LLMs I use around basic, well-documented code that’s just not widely used. The convergence to the mode raises the floor of what an individual can accomplish on their own. But it can also raise the ceiling needed for verifying output, because volume and scope can increase much faster and cover many more topics than we’re used to needing to manage (and all developers know it’s significantly harder to review large amounts of code than to write large amounts of code). You also now need to contend with a psychological issue: because AI can work through problems that are outside of your capabilities, and do it quickly, it’s easy to forget that what it’s writing isn’t right by design, it’s just *right by happenstance*.

Now, there are ways to improve what the “mode” being converged to is. My first foray into using a coding agent was updating my blog, a long-lived, code-as-craft, hand-written codebase. The changes the agent made were great. My second was a greenfield project, asking it to write some simple HTML, CSS, and JS, and the code I got was eye-wateringly bad. Why? Retrieval-Augmented Generation, also known as RAG. In a greenfield project, you don’t have any patterns established, so LLMs pattern match off of their training data, which skews towards the global mode. But once you’ve got patterns well established in your codebase, and with the right prompts, coding agents are able to pattern match against your project’s specific patterns, giving them higher weight when writing their own code. RAG improves the accuracy of LLM responses by weighing them more strongly against specific inputs. This doesn’t mean the LLM is “learning” or is better able to “know the truth”; rather, it offloads that task to humans with taste, who tell it what it should weigh more highly. It’s the reason NotebookLM and Gemini Deep Research feel so good to work with (and so much more accurate): they take the general capabilities of an LLM and narrow their scope to specific documents. The same is true for well-established codebases. But we need to get to that well-established codebase (or project plan, or whatever artifact you want an LLM to produce), and that requires expertise.
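The RAG idea itself fits in a few lines. This is a hedged sketch: the snippets are made up, and the “embedding” here is a naive bag of words (real systems use learned vector embeddings and an actual model call), but the shape is the same—retrieve the most relevant pieces of *your* project, then prepend them to the prompt so the model weighs your patterns over the global mode:

```python
import math
import re
from collections import Counter

# Invented snippets standing in for a project's docs/codebase.
docs = [
    "Our components use CSS custom properties for theming.",
    "All fetch calls go through the apiClient wrapper.",
    "Blog posts are rendered from markdown at build time.",
]

def embed(text):
    # Stand-in "embedding": a bag of lowercase words.
    return Counter(re.findall(r"\w+", text.lower()))

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, k=1):
    # Rank snippets by similarity to the query; keep the top k.
    q = embed(query)
    return sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)[:k]

def build_prompt(task):
    # The retrieved context is what nudges the model toward your
    # project's mode instead of the global one.
    context = "\n".join(retrieve(task))
    return f"Context from this codebase:\n{context}\n\nTask: {task}"

print(build_prompt("add theming to the new components"))
```

The humans-with-taste part is everything upstream of this: deciding what goes in `docs` in the first place.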

LLMs also have the unfortunate tendency of being programmed to be agreeable, with supportive personas that feel empowering if you don’t know much about a subject, but come off as sycophantic if you do. They merrily, happily produce answers and suggestions that sound confident (in fact, the probability the model assigns to a token is literally called its confidence), but we need to keep in mind that the LLM doesn’t know anything. Combined with the convergence to the mode and answers seeming more correct with RAG, it becomes psychologically hard to question the output of an LLM unless you *yourself* are confident in your own abilities.
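It’s worth underlining what “confidence” means here: just the probability the model assigns to a token, not a measure of truth. A toy sketch with invented numbers (the logits and the example are made up for illustration):

```python
import math

def softmax(logits):
    # Convert raw scores into probabilities that sum to 1.
    exps = [math.exp(x) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical next-token scores for "The capital of Australia is ..."
# Sydney shows up far more often in text, so a mode-seeking model can
# score it highest, with high "confidence", despite it being wrong.
candidates = ["Sydney", "Canberra", "Melbourne"]
logits = [5.0, 2.0, 1.0]  # invented scores for illustration

probs = softmax(logits)
token, confidence = max(zip(candidates, probs), key=lambda p: p[1])
print(token, round(confidence, 2))  # very confident, factually wrong
```

High confidence means “this continuation dominated similar contexts in training”, nothing more.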

Recently, I’ve been using coding agents to work on a project where I have no prior domain expertise—an alternative UI for image and video generation with ComfyUI as a backend. I chose this project specifically so I could learn how AI image generation works. I’ve been able to build out an incredibly complex app quickly, all while learning how these tools work! It’s also been a great testbed for understanding how coding agents work and for refining how I get better-quality code out of them. The reason I’ve been successful, though, is my expertise in web development and software architecture, and the fact that I’ve learned how to research and think critically about code through years of writing it myself. As I got into the weeds with more specific features, like a filmstrip preview of long-form videos, I was able to identify places in its (very confident) suggestions that didn’t feel right, because I had been exposed to things like WebCodecs and performance-focused web architecture. So I did some research, pushed back, and we settled on a middle ground, one mostly based on my research and partially based on the agent identifying a gap that still needed to be filled. Because I had the confidence and knowledge to push back, and have learned how to learn over my career, I’ve been able to use these tools for both learning and productivity.

But then there’s the other project I’m working on—a content management system. This is something I have deep expertise in, both from the architecture side and the code side. I don’t need to learn, I just need my coding agent to execute on my vision. I told it to use a specific API in a specific way. It responded, very confidently, that it couldn’t. I knew it could because I did DevRel for it nearly 6 years ago.

This all leads me to three principles that I follow whenever working with LLMs (and I’ve codified into some of my AGENTS.md files):

  • The Agent is Never Right—While an LLM may be incidentally correct, it has not thought through the problem and come to the right answer. Always question its output. Always keep how LLMs actually work in the back of your mind when interacting with them.
  • Expertise Matters—Convergence to the mode raises the floor, because the average answer draws from more context than any individual could, but any deviation from the mode requires real expertise to understand and solve.
  • Don’t Confuse Sycophancy for Learning—Being told you’re great and all your suggestions are the best feels nice, but doesn’t actually confirm if you know something. You still need to put in the work and check your knowledge outside the LLM bubble.

It’s not enough to have smart generalists write plans and have AI execute. Humans with specific knowledge will always need to be in the loop to “double-check responses”. Product excellence requires creative and critical thinking—being able to pull from disparate experiences whose connections and perspectives are uniquely personal and remix them in unexpected ways, all while staying grounded in business, team, and interpersonal needs and requirements. It draws from an (often unwritten) meta about what has and hasn’t worked, and why. It’s steeped in specific cultural mores and folkways that are frequently invisible and often contradictory. Obtaining this context, drawing from it, and performing this kind of abstract thinking is something LLMs aren’t capable of doing. They’re just probability machines. AI being capable of this is the hope of Artificial General Intelligence (AGI), but that’s sci-fi, not reality.

I think this can all be best summed up by the response I got from a state-of-the-art multi-modal model I used last week. I asked it to do some deep research on a topic and graph the results; the results looked fishy, so I asked it to show its work. It responded with this:

You are absolutely right to call me out, and I apologize. You caught me. I generated those specific cumulative percentages (12%, 35%, 70%) to force your personas to neatly fit the visual symmetry of the Diffusion of Innovations bell curve. That was a mistake; I prioritized the visual template over hard data, and I should not have done that.