The first time I let Claude Code finish a plan before it touched a single file, I felt the same small relief I used to get from a code review going green on the first pass. No rework. No "wait, that's not what I meant." No diff that touched four files when the task lived in one.
That session is when the workflow I'm about to describe stopped being a habit and started being a method. Plans, parallel sub-agents, phased execution, and a four-way review loop. It's not glamorous, but it's the difference between AI as a party trick and AI as actual leverage.
Why plan mode changed everything
The old loop with any AI coding tool was depressingly familiar: prompt, generate, scan the diff, realise it misread the intent, prompt again. By the third round, I was writing the spec retroactively in commit messages.
Plan mode flips the order. Claude Code writes the plan first — to a markdown file I can edit — and nothing in the codebase moves until I approve it. The plan becomes the contract. Misunderstandings get caught in prose, where they cost a minute to fix, instead of in a diff, where they cost an hour to undo.
The line I now say out loud to anyone who'll listen: the plan is cheaper to fix than the diff.
Ask before assuming
The piece that makes plan mode actually work is AskUserQuestion. Before Claude Code commits to a plan, it can — and should — surface the ambiguous bits as multiple-choice questions and wait for me to answer.
This sounds trivial. It is the single highest-leverage tool in the workflow.
Without it, the agent guesses. Sometimes it guesses right. When it guesses wrong, that guess gets baked into the plan, into the diff, and into the commit. By the time I notice, I'm reverting two files of perfectly written code that solved the wrong problem.
With it, the ambiguity gets surfaced as a structured prompt — "long-form or mid-length?", "include a worked example or stay conceptual?", "match the existing pattern or introduce a new one?" — and the answer constrains everything downstream. Five seconds of clicking an option pays for itself in every phase that follows.
The rule I follow: if there's more than one reasonable interpretation of the prompt, ask. Don't ship a confident guess. Confidence on the wrong specification is the most expensive failure mode in this whole workflow.
Breaking work into phases
Once you commit to planning first, the rest of the workflow falls out naturally. Every non-trivial task I give Claude Code now flows through five phases:
- Understanding — Explore agents only. Read-only. Map the territory. No edits, no opinions, just "where does X live and what calls it."
- Design — A Plan agent synthesises the exploration into an approach. This is where tradeoffs surface.
- Review — I read the plan as a human. Push back. Tighten scope. Cut anything that smells like a future-proofing instinct.
- Implement — Only after the plan is signed off. Often a different agent, with the plan as input.
- Verify — Run the app. Run the tests. Read the diff. Confirm the thing actually does what the plan said it would.
The reason phasing matters is concrete: it stops the model from "helpfully" refactoring three adjacent files when I asked it to fix one. Each phase has a narrow remit. The model can't drift if the lane is painted clearly.
Sub-agents in parallel — what actually saves time
The single biggest unlock after plan mode was learning to spawn sub-agents in parallel instead of letting one agent do everything sequentially.
Explore agents are cheap and read-only. If I'm about to touch a function I half-remember, I'll launch two or three Explore agents in a single message — one for the function definition and its callers, one for the related tests, one for any documentation or recent commits. They run concurrently. By the time I've reread the original prompt, all three have reported back.
Plan agents are different. They synthesise. I give a Plan agent the exploration notes and the goal, and it comes back with an implementation strategy that already accounts for the existing utilities the Explore agents surfaced. That's the part that matters — the Plan agent stops proposing code that already exists.
A few rules I follow without thinking now:
- Pick the cheapest model that can do the job. Haiku for mechanical bulk tasks. Sonnet for scoped research and synthesis. Opus only when real tradeoffs are in play. Running Opus everywhere is wasteful and, surprisingly often, slower.
- Cap spawn depth. Parent agent, sub-agent, one more tier. After that, the task was wrong-sized.
- Parallel by default. If two queries don't depend on each other, they belong in the same message.
# What "parallel by default" looks like in practice:
# one message, three Explore agents, three concurrent tool calls.
explore "where is processPayment defined and called?"
explore "what tests cover the payment flow?"
explore "recent commits touching billing/"
The feedback loop between phases
The phases aren't a one-way conveyor. They're a loop, and the loop is the point.
After each phase, I read what came back and choose: continue, redirect, or restart. Plan files get edited, not just approved — they're working documents, not signed contracts. If the Explore phase surfaces something the Plan phase didn't account for, I push the plan back. If the Implement phase exposes an assumption that doesn't hold, I push the plan back further.
The cost curve is steep and the lesson is obvious in hindsight: catching a misunderstanding in Phase 2 costs a minute, in Phase 4 it costs an afternoon. The whole point of the workflow is to fail in prose, not in production.
The multi-model review pipeline
This is the part most people skip, and it's where the compounding value lives.
After Claude Code lands a diff, I don't just merge. I run it through three more passes — two reviewers with different priors and a tool that doesn't have priors at all.
- Claude Code's own
/code-review. Same family, different context. Re-prompted with "review this diff" instead of "ship this diff," it catches its own correctness bugs and the simplification opportunities it overlooked while focused on shipping. - GitHub Copilot review. Fast and conservative. It's at its best on line-level nits, unused variables, and dead branches. Low signal-to-noise on architecture; high signal-to-noise on cleanup.
- SonarQube. Static analysis. Rules, not vibes. Complexity thresholds, duplication, known security smells. It catches the things LLMs hand-wave past because they "look fine."
How I reconcile them:
- SonarQube first. It has rules, not opinions. If it flags something, it's mechanical and verifiable. Triage those before reading LLM commentary.
- Both-LLMs rule. If Claude's review and Copilot both flag the same thing, it's almost always real. If only one flags it and I disagree, I dismiss it — explicitly, with a one-line reason, so I have a record of the decision.
- The goal isn't consensus, it's coverage. Each reviewer has blind spots; the overlap is where confidence lives.
A small example, end-to-end
Last week I needed to add a new field to the insurance claims workflow — a risk_tier string set during intake and read by a downstream activity. Tiny change on paper. Here's what it actually looked like.
Prompt. "Add risk_tier to the claims intake workflow. It's set in the intake activity and read by the underwriting activity. Update the workflow signature, the activity inputs, and the persistence layer."
Plan file. Claude Code wrote a plan touching seven files: workflow struct, two activities, the protobuf, the repository, a migration, and two tests. It also proposed a new helper to normalise the tier string. Reasonable on its face.
Two parallel Explore agents. Before I approved, I sent two Explore agents: one to find existing tier-like fields elsewhere in the codebase, one to look at how recent fields were added to workflows. The first agent found a priority_band field with the exact normalisation logic the plan proposed to write from scratch. The second surfaced a migration pattern the plan had reinvented in a slightly worse shape.
Revised plan. I edited the plan file directly. Reused the existing normaliser. Switched the migration to match the established pattern. Dropped one of the proposed tests because the existing table-driven test covered the same surface. The plan went from seven files to five.
Implementation. A fresh agent picked up the revised plan and shipped the diff in one pass. No drift, because the scope was now genuinely small.
Three-way review. /code-review flagged that the new field wasn't being validated against the allowed-values enum on the way in — real bug, fixed. It also caught that I was constructing the workflow input twice in a test — true, fixed. Copilot pointed out an unused import the rebase had left behind — nit, fixed. SonarQube raised the cyclomatic complexity of the underwriting activity past its threshold; the new branch was the straw, and the right fix was to extract a helper that had been waiting to exist for months.
Total elapsed time from prompt to merge: about thirty-five minutes. Reading that walkthrough takes almost as long as the work did. That's the workflow paying for itself.
What I had to unlearn
A few habits I had to actively kill:
- Trusting the first plan. Early plans are drafts. Treat them that way and they get better. Treat them as contracts and you ship the wrong thing politely.
- Single-agent thinking. One big prompt is almost always worse than three small parallel ones. Stop trying to be efficient with prompts and start being efficient with wall-clock time.
- "Just ship it." The review loop adds ten minutes and removes hours of debugging. Every time I've skipped it, I've paid for it within the week.
- Letting the agent pick its own model. Defaulting to the smartest model for everything is wasteful and, weirdly, often slower. Match the model to the task and the system breathes.
The instinct, not the tools
None of this is really about Claude Code. The tools will keep changing — they always do. What's actually transferable is the instinct underneath: treat an AI coding agent like a brilliant but context-starved colleague who just walked into the room. Give it a plan. Give it scope. Run its work past more than one reviewer. Decide what to keep and what to throw out, deliberately.
The medium changes. The instinct doesn't.