Issue #3 · July 10, 2025 · 10 min read

Spike & Stabilize with Coding Agents

refactoring · pair programming · rag · llm · code-quality

I've been pair programming with AI coding agents for over 3 months, and I've noticed something crucial: bots generate code so fast that we often skip a vital step. The refactoring step.

When the bot generates code, resist immediately moving on to the next feature. Instead, take a moment to understand what just happened. Read through the code. Question the design decisions. Make it yours.

Yes, you can ask the AI to refactor for you. But unfortunately, LLMs still hallucinate and tend to make large, sweeping changes that introduce subtle bugs. That's why you need solid test coverage before letting any bot refactor the code.

In the previous issue, I talked about how coding agents can refactor the code they generate. That process needs a solid test suite, because coding agents are not as reliable at refactoring as the automated refactorings built into IDEs.

As a long-time practitioner of TDD, I’ve been experimenting with a different way of programming, inspired by Dan North’s Spike and Stabilize technique. Over the course of my career, I’ve mostly worked in large enterprise organizations that move slowly and deliberately. Health insurance and banking are highly regulated industries, so moving fast and taking risks isn’t wise in that context. However, I’ve now had my first startup experience, where there’s high pressure to deliver to users quickly.

Spike and Stabilize in Action

What started out as a spike and proof of concept ended up turning into the real thing. Over the course of a week, my pair, Claude Code, and I created a POC of an LLM chatbot. It had a RAG pipeline that imported the company’s knowledge base articles into a vector database so the LLM could pull from them as context.

LLM Terminology

RAG stands for retrieval-augmented generation. It’s a technique for giving the LLM specific context. Instead of letting the model rely solely on its training data, you give it access only to the data deemed relevant to the provided query. Finding that relevant data is where the vector database comes into play. The initial query is embedded (turned into a vector) and used in a cosine-similarity lookup. The stored embeddings most similar to the query are added to the prompt’s context, and the LLM uses them to produce the response it judges best answers the query.
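
If it’s easier to see in code, here’s a minimal TypeScript sketch of the retrieval step. The embed function, the Chunk shape, and the in-memory similarity search are stand-ins for whatever embedding model and vector database you actually use; a real vector database performs this search for you.

```typescript
// Minimal sketch of the retrieval step in a RAG pipeline.
// embed() stands in for whatever embedding model/API you use.
type Chunk = { text: string; embedding: number[] };

declare function embed(text: string): Promise<number[]>;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const magA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const magB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (magA * magB);
}

async function retrieveContext(query: string, chunks: Chunk[], topK = 5): Promise<string[]> {
  const queryEmbedding = await embed(query);
  return chunks
    .map((chunk) => ({ chunk, score: cosineSimilarity(queryEmbedding, chunk.embedding) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK)
    .map(({ chunk }) => chunk.text);
}

// The top-K chunks are then pasted into the prompt so the LLM answers from
// the knowledge base instead of only its training data.
```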

Spike → prod deploy

I’ve seen spikes turn into production deploys many times over my career and have been warned against it. At one point, I said to my pair, “Now that we’ve seen how to build this, we can start from scratch and use our learnings to build it the right way (using TDD and evolutionary design).” They replied, “Why don’t we just use this? Is this the wrong way?” They were challenging the dogmatic view of spike vs. TDD, and I agreed. Except for the frontend, we used every piece of that original POC and got it ready for production. Without realizing it, we were applying the Spike and Stabilize technique.

Over the next week, we focused exclusively on refactoring. There were parts of the code that were complex and that we didn’t understand, so with a mix of bot-driven refactoring and classic IDE-driven refactoring, we started to make the code our own.

Some of the refactorings we did were:

  • introducing clean architecture (segregated packages for infrastructure, domain, etc.)
  • introducing domain objects to encapsulate complex article-chunking behavior (sketched below)
  • starting to move toward a more generic architecture, since we knew we needed to build at least one more pipeline
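
To give a flavor of the second refactoring, here’s a rough sketch of what a chunking domain object might look like. ArticleChunker and its splitting rules are illustrative, not the real code from our pipeline; the point is that the chunking rules live behind a small domain interface, with no infrastructure concerns in sight.

```typescript
// Hypothetical domain object that hides chunking rules behind a small API,
// keeping infrastructure (vector DB clients, HTTP) out of the domain package.
export class ArticleChunker {
  constructor(private readonly maxChunkLength = 1000) {}

  // Split an article body into chunks no longer than maxChunkLength,
  // breaking on paragraph boundaries where possible.
  chunk(article: { title: string; body: string }): string[] {
    const paragraphs = article.body.split(/\n{2,}/);
    const chunks: string[] = [];
    let current = "";
    for (const paragraph of paragraphs) {
      if (current && (current + "\n\n" + paragraph).length > this.maxChunkLength) {
        chunks.push(current.trim());
        current = paragraph;
      } else {
        current = current ? current + "\n\n" + paragraph : paragraph;
      }
    }
    if (current) chunks.push(current.trim());
    return chunks;
  }
}
```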

We also improved performance by introducing Node streams and making the number of articles pulled at a time configurable.
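
As a rough illustration of that change, a Node streams pipeline with a configurable batch size could look something like this. The fetchArticlePage and upsertEmbeddings helpers, the IMPORT_BATCH_SIZE variable, and the default of 25 are all assumptions for the sketch, not our actual code.

```typescript
import { pipeline } from "node:stream/promises";
import { Readable, Writable } from "node:stream";

type Article = { id: string; title: string; body: string };

// Hypothetical knowledge base and vector DB clients (assumptions for the sketch).
declare function fetchArticlePage(page: number, limit: number): Promise<Article[]>;
declare function upsertEmbeddings(articles: Article[]): Promise<void>;

// Pages through the knowledge base, yielding batchSize articles at a time.
async function* fetchArticles(batchSize: number): AsyncGenerator<Article[]> {
  let page = 0;
  let batch: Article[];
  do {
    batch = await fetchArticlePage(page++, batchSize);
    if (batch.length > 0) yield batch;
  } while (batch.length === batchSize);
}

// The configurable "articles pulled at a time" knob.
const batchSize = Number(process.env.IMPORT_BATCH_SIZE ?? "25");

await pipeline(
  Readable.from(fetchArticles(batchSize)),
  new Writable({
    objectMode: true,
    async write(batch: Article[], _encoding, done) {
      try {
        // Chunk + embed + store one batch at a time; backpressure keeps
        // memory flat instead of loading every article up front.
        await upsertEmbeddings(batch);
        done();
      } catch (err) {
        done(err as Error);
      }
    },
  }),
);
```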

Build in quality last?

Instead of building in quality up front by writing tests first, as I’m used to, we wrote tests as needed. Early on we had only one test 😯! It was an edge-to-edge acceptance test that faked out both the data source and the sink. It gave us roughly 40% coverage all on its own, and it temporarily used approval testing to capture the approved behavior. Before that, we would run the pipeline and import 10 articles to verify no functionality had broken. We didn’t have edge cases covered, because we were focused on the core functionality at that time.
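
For a sense of what that single test looked like, here’s a sketch using Vitest with a snapshot as the approval mechanism. The runImportPipeline entry point and its injectable source and sink are hypothetical names, not our actual API.

```typescript
import { describe, expect, it } from "vitest";
import { runImportPipeline } from "../src/pipeline"; // hypothetical entry point

// In-memory fakes standing in for the real knowledge base (source)
// and vector database (sink).
const fakeSource = {
  async fetchBatch() {
    return [{ id: "1", title: "Password reset", body: "To reset your password..." }];
  },
};

const upserted: unknown[] = [];
const fakeSink = {
  async upsert(chunks: unknown[]) {
    upserted.push(...chunks);
  },
};

describe("knowledge base import pipeline", () => {
  it("imports articles from the source into the sink", async () => {
    await runImportPipeline({ source: fakeSource, sink: fakeSink });

    // Approval-style check: approve the snapshot once the output looks right,
    // then let it guard against regressions while refactoring.
    expect(upserted).toMatchSnapshot();
  });
});
```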

As we started to separate the I/O from the behavior, it became easy to add unit tests covering some of those edge cases. This helped us identify some over-engineering our coding agent had done, and we were able to simplify the logic further.
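
Once the behavior is pure, edge-case tests become cheap to write. Something along these lines, reusing the hypothetical ArticleChunker from the earlier sketch:

```typescript
import { describe, expect, it } from "vitest";
import { ArticleChunker } from "../src/domain/ArticleChunker"; // hypothetical path

describe("ArticleChunker", () => {
  it("returns no chunks for an empty article body", () => {
    const chunker = new ArticleChunker(1000);
    expect(chunker.chunk({ title: "Empty", body: "" })).toEqual([]);
  });

  it("splits a long article across multiple chunks", () => {
    const chunker = new ArticleChunker(50);
    const body = "First paragraph.\n\nA second paragraph that pushes past the limit.";
    expect(chunker.chunk({ title: "Long", body }).length).toBeGreaterThan(1);
  });
});
```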

For untested code, I used automated refactorings in WebStorm or carefully carried out manual refactorings, verifying them by running the pipeline. For tested code, I would ask the coding agent to carry out more complex refactorings that would take me more time to do on my own.

One huge benefit of refactoring the code myself is that I truly understand the generated code and make it my own. Eventually we had to share our work with the rest of the team, so we needed to understand how every part of the pipeline worked and be able to explain it confidently.

One rule I always follow in this new paradigm of programming is:

Never ship code to production that you don’t understand.

Once we refactored the app to have more structure and consistent patterns, the coding agent’s work became more consistent and reliable too. Bots, like humans, tend to generate new code in a similar style to what already exists in the codebase. So if the codebase is unstructured and highly coupled, the bot will generate code that conforms to that norm. However, when software design concepts like ports & adapters, DDD, and encapsulation are applied consistently across the codebase, the bot will add more structured code that aligns with that design.
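
For example, a port and adapter pair might look roughly like this. The names are made up for illustration, but once a few adapters follow the same shape, the agent tends to imitate it when asked to add the next data source.

```typescript
// Hypothetical port (domain layer) and adapter (infrastructure layer);
// names are illustrative, not the real ones from our codebase.
type Article = { id: string; title: string; body: string };

export interface ArticleSource {
  fetchBatch(page: number, limit: number): Promise<Article[]>;
}

// Infrastructure adapter that satisfies the port by wrapping an HTTP API.
export class HttpArticleSource implements ArticleSource {
  constructor(private readonly baseUrl: string) {}

  async fetchBatch(page: number, limit: number): Promise<Article[]> {
    const response = await fetch(`${this.baseUrl}/articles?page=${page}&limit=${limit}`);
    if (!response.ok) throw new Error(`Article fetch failed: ${response.status}`);
    return (await response.json()) as Article[];
  }
}
```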

Make it work → Make it right → Make it fast

In the end, this pipeline, which will see only light usage over the years, had just 60% test coverage. We were okay with that because of the context we were in. If this were an application used daily by thousands of users, we would have poured even more effort into quality. There will always be a long list of improvements, tweaks, and design enhancements that could be made, but the pragmatic engineer knows when enough is enough, ships it, and moves on.

When to use this approach

This spike and stabilize method works best when you're exploring unfamiliar technologies or working in fast-moving environments that prioritize rapid iteration over upfront design. It's particularly effective in startup contexts where market feedback trumps perfect architecture, or when you're prototyping with new AI tools and frameworks where the "right way" isn't yet established. However, avoid this approach for mission-critical systems, heavily regulated environments, or codebases that will immediately scale to thousands of users where comprehensive test coverage is non-negotiable.

I’ve shared my experience with you today as one way to collaborate with coding agents and get the best out of them.

Our workflow looked like this:

  1. Prompt the agent to add a new behavior (no tests)
  2. Run the app to confirm new functionality
  3. Review the code and work with the bot to improve poor design decisions
  4. Evaluate if tests are needed
  5. Refactor the code on our own to understand the theory
  6. Evaluate performance needs

If your context needs a different way of working, then try some other approach, but be willing to experiment, refactor often, and challenge your coding agents to write simpler code that solves the problem at hand.


Want to consistently ship high-quality code that's easy to change and understand? Subscribe to Refactor to Grow for bi-weekly insights delivered to your inbox.
