Vibe Coding Vibes
I’ve got very complicated feelings about AI. And when I say AI, I really mean Generative AI. It’s got a metric tonne of negative externalities—it’s accelerating societal harms, driving up electricity costs, and over-consuming water, to name a few. There are moral, ethical, and legal issues around the data models are trained with, how they’re trained, and what they produce. But GenAI is also a tool, one we might think about differently, and whose issues might be reduced, if we stopped treating it like a genie that can grant our every wish.
I can’t help but wonder what the landscape would look like right now if OpenAI had never shipped a chat interface for their GPT. Machine learning AI has been part of tech products for years now, deeply embedded in products and services, with no one really complaining. The tyranny of the unstructured text box is that possibilities become endless when presented with a blank text area—but if using one returns whatever you desire, it goes from tyrannical to magical (even if those responses are stochastic parrots). Magical enough to feel like a true paradigm shift. But is it?
General-purpose chat interfaces require a very high level of sophistication both to understand how to get what you want out of them and to judge whether what you’ve gotten is any good. Resolving this is very hard—zero-shot prompting, just asking for a thing without additional context, is an almost sure-fire way to get either a bad response or a response that just happens to look right. While I hate the term prompt engineering, what it’s actually describing is the tyranny of the unstructured text box—the need to create structure in this unstructured area to get useful results. Doing this is hard because not only do you need to learn to “speak” LLM, but:
- Results are by nature non-deterministic, so making corrections is frustrating and costly.
- Prompts need to be tuned to the specific LLMs they’re being used with, because each one has unique capabilities and constraints.
- LLMs by nature regress toward the average, so doing anything outside of “the norm” requires substantial amounts of work, and sometimes can’t be achieved at all.
Every LLM you’ve interfaced with likely had a system prompt associated with it: a prompt used to properly set up an LLM to perform a task based on the input you provide it, even for general-purpose ones. While this may sound redundant, the difference between using a general-purpose LLM and one with a tuned system prompt is quite stark. Let’s do an experiment with this photo I took of a T-Rex dressed up for Christmas. I’m using Gemini 3 here, in three different models: fast, thinking, and pro. They all got the same zero-shot prompt, Write me alt text for this image: https://snugug.photography/images/london-2024/DSCF2906.avif. While the output varied slightly between models, it was all more or less the same.
This image from the London 2024 collection by Snugug Photography features a detailed, low-angle shot of a classic London street lamp.
Alt Text: A black, ornate Victorian-style street lamp with three lanterns, set against a backdrop of a modern building with a grid of glowing windows at dusk in London.
Description for context (if needed): The photograph captures the intersection of old and new London architecture. The traditional black iron lamp post is in sharp focus in the foreground, while the blurred background shows the warm yellow light emanating from the windows of a large, contemporary office or residential building, creating a soft bokeh effect.
Nonsense! Clicking into “show thinking” reveals why: the very first step is “Finding Image Content”, but the description is “I’m currently stuck trying to determine the content of the image at the provided URL. Since I can’t directly view the image, I’m exploring methods to indirectly ascertain its visual content. The key is to use the URL to infer what the image portrays, and so I am analyzing it now. I am working on how to do this.” What?! Because general-purpose LLMs are designed to be helpful partners, it can’t tell me it can’t do it, so it just makes something up because London is in the name of the URL! Magic this ain’t.
Trying again with multimodal input—a text prompt and directly uploading the image—gives me much better results:
Close-up of a Tyrannosaurus Rex animatronic dressed for the holidays. The dinosaur wears a large red Santa hat with a white pom-pom and a white, green, and red patterned Christmas sweater. In the background, festive illustrations include a decorated tree and another dinosaur wearing a Santa hat
Pretty good, and much better than it’s historically been, but we can do better. It’s dry and doesn’t paint a full picture of the image—I want alt text to not just provide a rote description but to evoke the emotion of the image, too. By providing the alt-text system prompt I’m using for my photography site, I’m able to get results that are both more consistent and better aligned with the tone of voice that I want.
A life-sized, realistic Tyrannosaurus Rex animatronic is dressed in festive holiday attire, including a red and white Santa hat and a patterned knit Christmas sweater featuring green and red stripes. The dinosaur’s textured, grey-brown skin and sharp teeth are illuminated by a warm spotlight, contrasting with its tiny, sweater-clad arms. In the background, a colorful holiday mural displays a stylized Christmas tree and another dinosaur silhouette wearing a hat, creating a whimsical and surreal seasonal scene.
In the back-end for my photography site, this runs automatically—the image and all relevant metadata are passed to Gemini and the results are automatically put in for me to review. If this were a feature in a CMS, I don’t think anyone would bat an eye. Automatic, high-quality alt text is both a great time saver and a great accessibility boon!
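For a rough idea of the shape of that automation, here’s a minimal sketch assuming the @google/genai SDK (the model id, system prompt, file path, and metadata shape are placeholders rather than my real setup):

```ts
import { readFile } from 'node:fs/promises';
import { GoogleGenAI } from '@google/genai';

const ai = new GoogleGenAI({ apiKey: process.env.GEMINI_API_KEY });

// Generate alt text for a local image, steering the tone with a tuned system prompt.
async function generateAltText(imagePath: string, metadata: Record<string, string>) {
  const image = await readFile(imagePath);
  const response = await ai.models.generateContent({
    model: 'gemini-2.5-flash', // placeholder model id
    config: {
      // The tuned system prompt is what pushes the output toward the right tone of voice.
      systemInstruction:
        'You write alt text for photographs: describe the subject accurately and evoke the mood of the image.',
    },
    contents: [
      { inlineData: { mimeType: 'image/jpeg', data: image.toString('base64') } },
      { text: `Write alt text for this image. Relevant metadata: ${JSON.stringify(metadata)}` },
    ],
  });
  return response.text;
}

// e.g. called from the photo build pipeline, with a human review step afterwards
const altText = await generateAltText('images/DSCF2906.jpg', { location: 'London', subject: 'T-Rex' });
console.log(altText);
```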
But very few products that incorporate GenAI do so in a way that understands these problems, provides a means of helping to mitigate them, and matches the expertise needed both to prompt correctly and to review the results. AI assisted coding tools (aka “agentic” coding tools) are, in my experience, the closest we have today.
Human-in-the-loop
Vibe Coding is the gag-inducing name for a let-’er-rip style of AI assisted coding: throw a product description at an LLM and keep telling it it isn’t what you want until it is. What does the code look like? Who cares! How does it affect our performance goals or security considerations? What are those! Churn and burn until I get the UX I want, output be damned!
This is the thing that really turned me off to AI assisted coding. Very early in my career, I was introduced to the idea of code as craft—that coding was a creative, artful endeavor, and that care and pride in the code you wrote was as important as the product, because it is the product. That the code you write should be deeply considered, not just the naming and the structure but how it contributes to the gestalt of the product being delivered. Mobile first, content first, responsive web design, my deep dives into domain driven design, content strategy, user experience, typography, color theory, evangelizing the importance of accessibility and performance and design/dev collaboration at every stage of the product lifecycle: all of these were extensions of the idea that, in order to truly practice coding as a craft, I needed to think of software development as more than just lines of arcane runes in a code editor. I needed to think of it as an artform. Vibe coding didn’t vibe with me, with how I think about software. And yet…
One of the driving reasons I’m now specifically a Developer Relations Engineer is that I truly believe in democratizing the ability to bring software to life. Being able to dream up software and bring it to life has literally changed my life, and I want others to be empowered to do the same. But Vibe Coding ain’t it. It’s not because it violates some personal code as craft principles I have. It’s not because it’s slop. It’s because it’s a mismatch between user abilities and user needs. I’m not trying to gatekeep here; it’s more the reality of what these tools are capable of—GenAI doesn’t “know” anything, so it can’t “know” your codebase, can’t “know” your users or their constraints, can’t “know” what the right product call is. They’re super sophisticated pattern recognition engines, autocomplete on steroids, that have gotten good enough to usually return something that looks reasonable (and heavily regressed towards the mean). But just because it looks reasonable doesn’t mean it is. You need a human-in-the-loop with relevant experience to actually know if what’s been produced is reasonable. Getting something to run isn’t the end goal of software development, it’s the start.
But that’s not helpful. People are using these tools today, and companies are putting pressure on software developers to “improve their productivity” with these tools (whatever that means), so I started experimenting to see where the limits were. What does it look like for someone who still believes in code as craft to use these tools? Spoiler alert: strong software developers are the key to getting these tools to produce reasonable output.
An empty vessel overflows
I’ve been noodling on a new set of TTRPG rules for my friends and me to play with, and I’m building a little web app to go along with it, because of course I am. Started the entire project by hand—set up all my preferences, wrote a bunch of modern HTML, CSS, and JavaScript, did some initial design work. I’m using a static, file-system-based tool to generate it, but monsters are a little too complicated to write out in a text editor, so I built a little CRUD page for them. I included an image picker in it, which I think is pretty neat—either use the File System Access API to grab an image from your computer, or use a system prompt I wrote to generate a monster image. This all worked well! But, because GenAI is non-deterministic and often makes stuff up, the generation didn’t always work out how I’d like. So, I dipped my toes into AI assisted coding. Using Antigravity, Google’s agentic¹ IDE, and the newly released Gemini 3.0 Pro model, I asked it to create a standalone /generate page where I could generate multiple images at once.
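(As an aside, the hand-written picker half of that is roughly the sketch below. This isn’t the project’s actual code, just the general shape of the File System Access API, which is Chromium-only, so a real version needs a fallback.)

```ts
// A fallback for Firefox/Safari (no File System Access API) would use <input type="file">.
async function pickMonsterImage(): Promise<File | null> {
  if (!('showOpenFilePicker' in window)) return null;
  // The picker isn't in TypeScript's default DOM typings yet, hence the cast.
  const [handle] = await (window as any).showOpenFilePicker({
    multiple: false,
    types: [{ description: 'Images', accept: { 'image/*': ['.png', '.jpg', '.jpeg', '.webp', '.avif'] } }],
  });
  return handle.getFile();
}
```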
Much to my surprise, it did a decent job on the first go! But huh, well, that’s weird. It refactored my image generation code, but into really ugly functions. It didn’t know that I would want to expand on that in the future, or that I preferred class-based abstraction for this kind of functionality. Oh, hmm, interesting. It didn’t try to use Tailwind like I was afraid it would, but it definitely pulled form styling from somewhere that wasn’t the form styling I already had. Oh, eww, it’s using px values all over the place. That’s not how my code is written anywhere else in my codebase. The other issues were things I thought I could fix by giving it more specific directions, but the px issue would not stand. Antigravity supports custom rules, injected into its prompt context, that you define by writing Markdown files in the .agent/rules directory, so I wrote my first one: px.md:
When sizing things in CSS, only use `px` if a specific pixel size was requested, or if you’re sizing a border width. Otherwise, instead of `px`, use `rem`. When using `rem`, only use increments of `.25rem`. Other units (including, but not limited to, `vh`, `vw`, `cqi`, `cqb`, and `fr`) are allowed.
This is the first thing I learned when working even with frontier models in the latest-and-greatest agentic IDEs: GenAI doesn’t know anything. Even with the full context of an existing codebase at its disposal, it’ll still just make stuff up if you don’t give it explicit instructions as to what you want it to do. And if you want it to do the same thing every time, you need to teach it how to do that². I learned this lesson, then I ran out of free Pro-level credits for the week in Antigravity. So, I switched to Gemini CLI.
Gemini CLI isn’t an IDE, but it still has access to your codebase, is still billed as an AI assisted coding tool, and is honestly probably a better fit for my workflow because I prefer to live in the terminal. I logged in using an API key to get access to Gemini 3.0 Pro again, and got to work cleaning up the image generation page. The results I got were worse than in Antigravity. None of the rules I had written were being picked up. The code it was generating? More misses than hits. How, though? This was the same model! And this is the second lesson I learned: a model is just one part of the equation. The tool you’re using the model through, and its system prompt, can make a significant difference to the results you get³.
Proper planning prevents poor performance
After a few weeks of really diving deep into this stuff, I decided I wanted to share what I’ve learned (this was two weeks ago; not even the best AI tools will improve blogging procrastination). Of course, I hadn’t touched my blog since sometime last year, the current iteration of it was built on a 2022 refactor to a CMS, Astro (which is what my site’s built on) had come a long way in managing flat Markdown files, and I kinda hated the state my CMS was in and dreaded upgrading it. So, as you do, the first step to blogging again was to refactor my website once more.
I had had good success using Astro’s content collections in my TTRPG project, so I wanted to migrate to that, but first I needed to dump my existing content out of my CMS. I knew I couldn’t just zero-shot prompt this migration; I needed to take it in steps:
- Write a migration script to get content out of my CMS and into Markdown files with frontmatter.
- Write the content collection config to make sure that everything in that migration is properly represented (there’s a sketch of what that looks like just after this list).
- Migrate from Astro 1 to Astro 5 and ensure everything still compiles and looks the same.
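For a sense of scale on step 2, a content collection config in modern Astro is pretty small. This is a minimal sketch assuming Astro 5’s glob loader; the collection name and schema fields are illustrative, not my site’s actual frontmatter:

```ts
// src/content.config.ts
import { defineCollection, z } from 'astro:content';
import { glob } from 'astro/loaders';

// One collection of Markdown files, each validated against a frontmatter schema.
const posts = defineCollection({
  loader: glob({ pattern: '**/*.md', base: './src/content/posts' }),
  schema: z.object({
    title: z.string(),
    published: z.coerce.date(),
    tags: z.array(z.string()).default([]),
  }),
});

export const collections = { posts };
```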
Writing migration code is something I’ve always hated doing, especially for one-off migrations. It’s also one of the few places where the output matters more than the code. I had gotten better at prompting to get closer to what I’d like, so I wrote about a paragraph explaining what I wanted to happen: write a script to get all of the content from all of my content types in my CMS, download them to Markdown, mapping X field to the Markdown body and every other field to frontmatter, and put them in folders based on A and name them based on B. Ask me questions before you start if you’re unsure.
Lo and behold, it asked me good questions! Some were good clarifying questions where I could choose from the options it had provided, some I needed to write more complete answers to, and some even questioned things I hadn’t thought of (like what to do with images that were uploaded directly to the CMS). Good on you! Then it created an implementation plan for me to approve before continuing (a great built-in feature of Antigravity), and it wanted to write a Python script in a Node.js project. Boo. Ok, correct that, then go.
It did a pretty good job! It wrote a script that did everything I expected, except the frontmatter was a little wonky. It had somehow decided to write its own JSON-object-to-YAML-string encoder, with its own unique, special bugs that made the output just a little janky. It got me 80% of the way there, but I needed to coach it along to use existing YAML tools instead of insisting it could write a better one from first principles.
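The fix is one of those boring things an existing library does better than anything written from scratch. A minimal sketch, assuming the yaml package from npm (the field names and paths are made up for illustration):

```ts
import { writeFile } from 'node:fs/promises';
import { stringify } from 'yaml';

// Serialize a CMS entry to a Markdown file with YAML frontmatter,
// letting a battle-tested library handle quoting, dates, and nesting.
async function writeMarkdownEntry(path: string, fields: Record<string, unknown>, body: string) {
  const frontmatter = stringify(fields); // ends with a trailing newline
  await writeFile(path, `---\n${frontmatter}---\n\n${body}\n`);
}

await writeMarkdownEntry(
  'src/content/posts/hello-world.md',
  { title: 'Hello, World', published: new Date('2022-01-01'), tags: ['migration'] },
  'The body content pulled out of the CMS goes here.',
);
```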
Once we had agreed that I knew what I was talking about, and I had code reviewed the migration script, the actual CMS dump went off without a hitch. Same with writing the content collection config. Finally, the big test: migrating from Astro 1 to Astro 5, prompted in a paragraph. It worked on the first try! Astro got upgraded, CMS checks were replaced with Content Collections, and everything compiled. This was the first real time I saw productivity gains using a coding agent: a project that in my head was going to take me a week took me 3 hours on a Thursday morning, most of which was spent waiting for the agent to finish working or on manual code cleanup I needed to do as part of other migrations I decided to tackle at the same time. The additions/deletions look like AI slop, but every change was reviewed and approved by me (and to be fair, most of the additions and deletions were lockfiles).
What made this migration a success wasn’t the agent doing the coding, it was the planning that went into it before the assistant started. I knew the steps that needed to be taken, knew which parts the AI could handle and which it’d struggle with, divided it up, and went to work.
Don’t ship shit
I’ve personally never liked writing tests. While critically important to healthy software, they’re incredibly tedious—they’re mostly boilerplate, they’re super repetitive, and, for me, they don’t require the same kind of creative thinking or problem solving that I like about coding. Perfect for getting agents to write.
Back to my TTRPG system: the code for calculating aspects of monsters is complicated—pretty easy for a human to add up in their head, much harder to code, and to test, with all the branching systems. So, I asked my agent to write “comprehensive tests” for it. I told it what testing suite to use, and it not only set up the tests and the suite, it wrote a fair amount of tests! Great! But huh, it got itself stuck in a loop and decided that there was unreachable code. I manually checked, there wasn’t, I explained how to test it, and it did. Great. Except after reviewing the tests, they didn’t seem right. They weren’t complete—I could keep asking it to write more tests and it would, lo and behold, find more things to test. And again. And again. Eventually, I got fed up and had it instrument the tests with code coverage and then work to get to 100% code coverage with those tests. It didn’t quite get there, but it was better this time. I had it write unit tests, integration tests, and end-to-end tests, all with code coverage. But I noticed something strange: my combined code coverage from all three was lower than the individual code coverage. Things weren’t merging right. I asked it to do deep research, figure out why the coverage wasn’t merging correctly, and use the browser if it needed to look stuff up. It wrote this magnum opus of a merging algorithm to try and fix the issue. I burned through a week of free credits trying to get it to work. And it wouldn’t. I was frustrated, and put the endeavor down for a week.
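Before picking the story back up, it’s worth noting that “instrument the tests with code coverage and work toward 100%” is mostly configuration. Here’s a sketch of that kind of setup, using Vitest and its V8 provider purely as a stand-in for whatever suite you prefer:

```ts
// vitest.config.ts
import { defineConfig } from 'vitest/config';

export default defineConfig({
  test: {
    coverage: {
      provider: 'v8',             // collect coverage via V8 instead of Istanbul instrumentation
      reporter: ['text', 'json'], // human-readable summary plus JSON for later merging
      thresholds: {
        lines: 100,
        functions: 100,
        branches: 100,
        statements: 100,          // fail the run if anything drops below 100%
      },
    },
  },
});
```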
I came back to try again. I didn’t like the original tests that were written, so I threw them away and started over. Don’t ship shit. I ripped out all of my tests and test infrastructure from the first time, all of the cruft that had been added, and asked it to start again, with one small set of unit tests and 100% code coverage as its goal, with all my new rules in place. Boom. Done. The tests were even pretty good. I then asked it to try and write end-to-end tests using the same testing tool instead of a different one, thinking this could solve my code coverage issue. This request went around and around for a very long time, and it finally got back to me saying that it couldn’t be done—a rare admission of defeat by an LLM. I thought “surely not”, did some research on my own, and came to the same conclusion, but I wouldn’t have known what to search for without its explanation of why it didn’t work. Ok, back to two tools, but let’s try a different approach.
When instrumenting the end-to-end tests, it very proudly told me that the code coverage tool it was using (Istanbul) wouldn’t work with the Svelte files I had, so it had ignored them. That being unacceptable, I did some research and found that I could also instrument my end-to-end tests with V8—that worked! But some brief searching showed me that this was going to be a multi-step process and that I’d need a plan. Let’s go eat an elephant⁴.
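Collecting V8 coverage from an end-to-end run, for reference, doesn’t require instrumenting the source at all; the browser records it. A sketch assuming Playwright with Chromium (its coverage API is Chromium-only) and a made-up route:

```ts
import { writeFile } from 'node:fs/promises';
import { test } from '@playwright/test';

test('generate page works, and we keep the coverage', async ({ page }) => {
  await page.coverage.startJSCoverage(); // start V8 coverage collection in the browser

  await page.goto('/generate');          // hypothetical route from earlier in the post
  // ...click around, generate some monster images, assert on the results...

  const entries = await page.coverage.stopJSCoverage(); // raw V8 coverage entries
  await writeFile('coverage/e2e-v8.json', JSON.stringify(entries));
});
```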
Step 1: I had code coverage individually working and the code had been reviewed. Great. Let’s commit that code into a branch. Step 2: let’s iterate. I had it start to try and combine coverage from multiple files. It wrote some first-principles code to do so, I tested, it didn’t work, I went back and explained how it wasn’t working, it coded, I tested, same thing, back and forth. It felt more like pair programming at this point than an agent coding for me. I eventually pushed it to a point where it told me it couldn’t do more to help because, while everything looked right with the code coverage coming out of the two different test suites, it couldn’t figure out why it wouldn’t combine correctly. Somewhere in there I learned that one test suite was outputting Istanbul-formatted code coverage even though it was instrumented with V8. Interesting. Ok, display one with one tool, the other with another. I really didn’t like this, though.
I slept on it, and went back to the first testing tool the next day and re-read their docs. I noticed a little note there about how V8 test coverage had some issues, and to refer to a specific V8 to Istanbul conversion module for details. Interesting. Let’s go step by step again. Hey AI, I’ve added this module as a dependency, can you try and take the existing code you’ve got and try and run this at the end, and see if it’ll work? Lo and behold it did seem to. Great! Small change, small commit, let’s roll. Ok, can you now run that at the end of my end-to-end testing? Great! That works too! No, don’t delete that file, I need to confirm that my already-reviewed code matches what you’ve added elsewhere. Ok, now let’s try combining the two. Great! Is this working? Have I finally solved my code coverage problem? Crap, no, I ran it again and got different results, but only from the end-to-end testing? Why is that? AI, go look. Oh, you introduced a race condition into the merging of the files that means code coverage fluctuates each run. Good, nothing like a subtle bug in your test suite. Can you not do that? I mean, you say it’s fixed but I clearly see it’s not. Why don’t you just output separate files each time instead of trying to manually merge them? Now it’s consistent, who woulda guessed. Small change, small commit. Ok, my turn, let me clean up some of this mess you’ve now introduced and do some manual testing myself. Hey look, everything works how I expect now, great, AI, here are my remaining cleanup tasks, please finish them for me.
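Once each suite writes out its own Istanbul-format JSON file, the final combine step is pleasantly boring. A sketch of that last step, assuming the istanbul-lib-coverage, istanbul-lib-report, and istanbul-reports packages (the file names are placeholders):

```ts
import { readFileSync } from 'node:fs';
import { createCoverageMap } from 'istanbul-lib-coverage';
import libReport from 'istanbul-lib-report';
import reports from 'istanbul-reports';

// One Istanbul-format JSON file per test suite, written by each suite's own run.
const suiteFiles = ['coverage/unit.json', 'coverage/e2e.json'];

const coverageMap = createCoverageMap({});
for (const file of suiteFiles) {
  coverageMap.merge(JSON.parse(readFileSync(file, 'utf8'))); // merge handles overlapping source files
}

const context = libReport.createContext({ dir: 'coverage/combined', coverageMap });
reports.create('text').execute(context); // print a combined summary table
reports.create('html').execute(context); // and write a browsable HTML report
```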
In the end, I was able to get exactly what I wanted out of this interaction (amusingly enough, also in about 3 hours of back-and-forth). But it wasn’t magic, and it certainly wasn’t vibe coding. It felt much more like pair programming with an intern who’s aced every algorithms and data structures interview they’ve ever come across but has no critical thinking skills or creativity to speak of. And you know what? I think I actually like that combination—leave the creative coding, problem solving, and code review to me, and have the agent do the rote, draining coding tasks I don’t like.
Rules to code by
In the limited time I’ve had exploring AI coding assistants, I’ve come up with 5 rules that I now keep in mind:
- It’s pair programming, not magic. You’re going to need to teach it how to write code you like, and you’re going to need to find the right model/agent fit for you³.
- Plan your work. It’s not enough to approve an agent’s implementation plan, you need to have a higher-order plan to break down and divide up the work.
- Work in small chunks. Commit early, commit often. It’s cheaper to undo small changes than big ones.
- Plan to throw one away. Review all of the code generated. If something doesn’t work, don’t be afraid to throw it away and tell the agent to try a different approach.
- Have fun. Only have agents do tasks that don’t bring you joy. Productivity is more than just lines of code written, it’s the mental and physical energy spent to ship the right thing.
Footnotes
1. Agentic here means there’s a GenAI-backed chat box with a system prompt designed to perform a specific task, in this case, writing code.
2. Other rules I have include: always run code linting after it’s written anything; use `pnpm` instead of `npm` for project management; never ever ever use `alert`; ask and get confirmation before adding any dependencies; and a special rule I call “intern”, where I tell it to not guess about how an API should work, to ask clarifying questions, to comment its code, and to not duplicate functionality, plus a hand-generated folder breakdown explaining what goes where. Each of these rules was written after it did something weird generating code and I decided I never wanted to see that happen again.
3. For what it’s worth, after about two months of using these tools, I lean towards using Claude Opus 4.5 through Antigravity, with Gemini 3.0 Pro through Antigravity being a reliable alternative when Claude breaks, which it does every so often. I also figured out how to get Gemini CLI to usually use the rules in my Agent folder: including it in `GEMINI.md` didn’t work, but telling the CLI to “remember” to always include the rules as context when running commands does seem to have.
4. How do you eat an elephant? One bite at a time.