LLMs and the Adversarial Loop
LLMs are confident. So much so that if you ask one for the recipe for immortality, it will probably take a stab at it, even though it knows as much about immortality as it does about its own existence. Ask one to write a design spec and it’ll produce something that reads well, covers the obvious cases, and sounds authoritative. Ask it to implement code from that spec and it’ll write something that compiles and handles the happy path. But that’s as good as it gets.
Trouble is, “reads well” and “is correct” aren’t the same thing. LLMs gloss over edge cases, make assumptions without telling you, and produce code that works for the examples they had in mind but breaks for the ones they didn’t consider. And if you ask an LLM to review its own work? It rubber-stamps it. The same blind spots that caused the problem cause it to miss the problem during review.
The solution? Get a second, independent LLM instance to review instead. It’ll catch things the first one missed.
That’s it. If you want to stop reading, you can give up here. Wait, actually, not quite: do the second-LLM thing, but do it in a loop until the reviewer is satisfied (or you hit a limit and move on with whatever you’ve got). OK, now you can give up.
Why This Works
LLMs are stochastic. Two calls with the same prompt will emphasise different aspects, notice different patterns, and follow different reasoning. That’s not like running a linter twice. A linter finds the same issues every time, so re-running it is pointless. Two independent LLM calls will look at different parts of the output and reach different conclusions.
Self-review (same instance, same session) doesn’t help much either, because the session context anchors the model to its previous reasoning. It’s already committed to certain assumptions and isn’t going to challenge them. A fresh instance starts from scratch, and because of stochastic sampling, it’ll evaluate things differently.
You can think of it as harvesting the variance in LLM outputs. One sample might be good or bad. Multiple independent samples, set up as a create-then-critique cycle, use the differences between them to push toward higher quality.
The GAN Analogy
The pattern’s borrowed from GANs. Generator produces outputs, discriminator evaluates them, the tension between them drives quality up. The analogy breaks down in the details (GANs converge via gradient descent; the adversarial loop uses heuristics like severity thresholds and iteration caps), but the principle holds: a creator paired with a critic does better than either alone.
How It Works
Two LLM instances, a “primary” (creator) and an “adversary” (critic), alternate rounds until the work’s good enough.
The Basic Cycle
- Primary produces work. Round 1 is the initial output. Later rounds incorporate the adversary’s feedback.
- Adversary reviews. It produces structured issues with a category, description, and severity (critical/major/minor), plus an overall verdict: “approved” or “needs work”.
- Convergence check. Accept or keep going.
- Repeat or exit.
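The cycle above can be sketched as a plain loop. Everything here is illustrative: `generate` and `review` stand in for your actual LLM calls, and the toy stubs at the bottom simulate a single round of feedback being fixed.

```python
from dataclasses import dataclass

@dataclass
class Issue:
    category: str          # e.g. "correctness", "style"
    description: str
    severity: str          # "critical" | "major" | "minor"

def adversarial_loop(generate, review, max_rounds: int = 10):
    """Alternate a primary (creator) and an adversary (critic) until the
    adversary approves or we hit the round cap."""
    work = generate(None)                       # round 1: initial output
    rounds = 0
    for round_no in range(1, max_rounds + 1):
        rounds = round_no
        issues = review(work)                   # fresh adversary session each round
        if not issues:                          # "approved": clean exit
            break
        work = generate(issues)                 # later rounds incorporate feedback
    return work, rounds                         # best output so far, converged or not

# Toy stand-ins for the real LLM calls:
def toy_generate(issues):
    return "v1" if issues is None else "v2"     # one revision "fixes" everything

def toy_review(work):
    if work == "v2":
        return []
    return [Issue("correctness", "misses empty input", "major")]
```

A real implementation would replace the stubs with API calls and plug the convergence rules (below) into the exit check.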
If the primary and adversary can’t agree (the primary thinks it’s done, the adversary keeps raising issues, or the primary thinks the adversary’s issues aren’t relevant), you’ve got options. The simplest is to let the convergence rules below handle it: cap the rounds, detect stale loops, move on. If you want something smarter, add a third agent to read both sides and make the call.
Session Isolation
The primary reuses the same session across rounds because it’s building on its own work and needs that context. The adversary gets a fresh session every round.
Why? If the adversary kept its context, it’d anchor to previous findings, assume fixes without checking, and focus on what changed rather than the whole picture. Fresh sessions force it to evaluate on its own merits. And because of stochastic sampling, each round’s adversary is effectively a different reviewer. That’s expensive with humans but trivial with LLMs.
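The asymmetry is easy to picture with a toy session model (purely illustrative, not a real SDK): context accumulates within a session, and constructing a new one starts from an empty history.

```python
class Session:
    """Toy stand-in for an LLM chat session that accumulates context."""
    def __init__(self):
        self.history = []

    def ask(self, prompt: str) -> str:
        self.history.append(prompt)          # context carries over within a session
        return f"reply #{len(self.history)}"

primary = Session()                          # one session, reused every round
adversary_histories = []
for round_no in range(1, 4):
    draft = primary.ask(f"revise the draft (round {round_no})")
    adversary = Session()                    # fresh session: no anchoring to prior rounds
    adversary.ask(f"review this draft: {draft}")
    adversary_histories.append(len(adversary.history))
```

After three rounds the primary has seen everything; each adversary has seen exactly one message.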
Convergence
This always varies by application, but suppose we’re building a tool that implements work items. You might set the round cap at 10 (we’ll call that N).
The loop stops when:
- The adversary approves. The clean exit.
- All remaining issues are minor. Naming conventions and comment wording aren’t worth another round.
- No new major issues after round N/2. This is the stale-loop detector. If the adversary keeps raising the same things, the loop’s stuck and continuing just burns budget. Duplicate detection uses fuzzy matching: normalised descriptions compared by word overlap, 80% threshold for longer ones, exact match for anything under 4 words.
- Iteration cap. Hard limit at N rounds by default.
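A sketch of those rules plus the fuzzy duplicate check. The word-overlap comparison here is Jaccard similarity over normalised words, which is one reasonable reading of “compared by word overlap”; the function names are mine, not a fixed API.

```python
from collections import namedtuple

Issue = namedtuple("Issue", ["category", "description", "severity"])

def _words(desc: str) -> set:
    return set(desc.lower().split())         # normalise: lowercase, split on whitespace

def is_duplicate(desc: str, prior: str, threshold: float = 0.8) -> bool:
    """Fuzzy match: exact comparison under 4 words, word overlap otherwise."""
    a, b = _words(desc), _words(prior)
    if len(a) < 4 or len(b) < 4:
        return desc.strip().lower() == prior.strip().lower()
    return len(a & b) / len(a | b) >= threshold

def should_stop(issues, round_no: int, n: int, new_major_recently: bool) -> bool:
    if not issues:
        return True                          # adversary approved: clean exit
    if all(i.severity == "minor" for i in issues):
        return True                          # nothing left worth another round
    if round_no > n // 2 and not new_major_recently:
        return True                          # stale loop: same findings repeating
    return round_no >= n                     # hard iteration cap
```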
The number of rounds will vary depending on how accurate you want your results to be. The more rounds, the “better” your result will be. The fewer, the faster you get to a “good enough” answer.
Escalation
Should you choose to use it, the adversary’s prompt can change over time. The exact thresholds should scale with your N, but as an example with N=10: round 1 is unrestricted (find everything you can), rounds 2-4 are normal (focus on substantive issues), and round 5+ is escalated (only raise genuinely new things, stop re-raising stuff that’s been addressed).
Without this, the adversary tends to repeat itself endlessly. But again, this is subject to tuning.
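The schedule reduces to a few lines. Thresholds here follow the N=10 example above and would scale with your own cap:

```python
def adversary_instructions(round_no: int, n: int = 10) -> str:
    """Escalating review posture as rounds progress; thresholds scale with N."""
    if round_no == 1:
        return "Unrestricted: find everything you can."
    if round_no < n // 2:
        return "Normal: focus on substantive issues."
    return ("Escalated: only raise genuinely new problems; "
            "do not re-raise issues that have been addressed.")
```

This string would be prepended to the adversary’s review prompt each round.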
Prior Rounds Context
Even though the adversary gets a fresh session, it’s not completely blind. Both agents get a summary of prior rounds: how many issues, what severity, how they were resolved. The last 5 rounds are shown in detail; older ones get collapsed into aggregates so the prompt doesn’t grow forever.
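One way to build that summary, assuming each round is recorded as a dict with an issue count, a severity breakdown, and resolution notes (the field names are illustrative):

```python
def prior_rounds_summary(rounds: list, detail_window: int = 5) -> str:
    """Show the last `detail_window` rounds in detail; collapse older rounds
    into an aggregate so the prompt doesn't grow without bound."""
    older, recent = rounds[:-detail_window], rounds[-detail_window:]
    lines = []
    if older:
        total = sum(r["issues"] for r in older)
        lines.append(f"Rounds 1-{len(older)}: {total} issues total (aggregated).")
    for i, r in enumerate(recent, start=len(older) + 1):
        lines.append(f"Round {i}: {r['issues']} issues "
                     f"({r['severity']}), resolved: {r['resolution']}")
    return "\n".join(lines)
```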
Variations
While this process was designed for implementing tasks and work items, it doesn’t end there.
Standard flow is the default for design, work items, and ordering phases. Primary produces, adversary reviews. Nothing fancy.
Inverted flow is used for validation and code review. Here the adversary goes first (does the review), then the primary responds to the findings. This makes more sense when review is the natural starting point.
User-in-the-loop, in our example, is specific to the design phase. After each adversarial review, the user sees the issues and can steer things (“focus on the auth flow”), accept the spec as-is, or just let the default revision happen.
Per-item loops are how I suggest implementation should work. In a tool I created, one of the phases breaks the task down into work items, and that phase runs a separate adversarial loop for each work item. If a work item doesn’t converge, it’s still committed: converging on, say, 23 of 28 items and committing partial progress on the other 5 beats perfect progress on 0.
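The per-item structure is a thin wrapper over the main loop. `run_loop` below is a hypothetical callable (your adversarial loop for one item) returning a result and a converged flag:

```python
def implement_all(work_items, run_loop):
    """Run a separate adversarial loop per work item and commit every result,
    converged or not. Returns the committed results and the items that stalled."""
    committed, unconverged = [], []
    for item in work_items:
        result, converged = run_loop(item)
        committed.append(result)            # commit even without convergence
        if not converged:
            unconverged.append(item)        # flag for human follow-up
    return committed, unconverged
```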
Advantages
The obvious one: it catches real bugs. Missing error handling, unvalidated inputs, race conditions, spec ambiguities. The kind of stuff that survives a single LLM pass and ends up in production. These are the bugs that a human developer tends to avoid introducing.
It’s also reasonably cheap. A review round costs roughly the same as the original generation, so 2-3 rounds might double or triple the cost of a phase. Still a lot less than debugging those bugs in production later.
The structured output format is worth calling out too. It isn’t enforced by the pattern itself, but tools that support it (such as Claude Code) make it easy. The adversary can’t just say “this could be better”. It has to produce a concrete issue with a severity. That makes the feedback actionable, so the primary can make targeted fixes instead of vague rewrites.
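One way to enforce that yourself is to demand a fixed JSON shape from the adversary and validate it before accepting the round. The field names here are illustrative, not a standard:

```python
import json

VALID_SEVERITIES = {"critical", "major", "minor"}

def validate_review(raw: str) -> bool:
    """Reject vague feedback: each issue must carry a category, a concrete
    description, and a recognised severity, or the review round is re-run."""
    review = json.loads(raw)
    if review.get("verdict") not in {"approved", "needs work"}:
        return False
    return all(
        issue.keys() >= {"category", "description", "severity"}
        and issue["severity"] in VALID_SEVERITIES
        for issue in review.get("issues", [])
    )
```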
Fresh sessions give you independence as a structural guarantee, not something you’re hoping for. And the stochastic nature of LLMs (usually treated as a reliability problem) actually helps here. Each round really is a different review, not the same one repeated.
When the loop doesn’t converge, it doesn’t throw the work away. It takes the best output it has and moves on. Some review is better than none.
Disadvantages
Cost is the big one. Every round is at least two LLM calls. A $2 phase becomes $4-6 with adversarial review. Multiply that across 7 phases and it adds up fast.
Time compounds too. Each round is sequential, so the primary has to finish before the adversary can review, and vice versa. A 3-round loop takes roughly 3x as long. Phases that already run 30-60 minutes can end up taking hours.
The adversary isn’t always right, either. It raises false positives, misses real issues, and sometimes fixates on style when correctness matters more. The primary wastes time “fixing” things that weren’t broken.
Loops stall sometimes. The adversary keeps insisting something’s wrong and the primary can’t or won’t fix it. The stale-loop detector catches this and moves on, but it means exiting with unresolved issues. That’s fine if it’s a naming dispute, less fine if it’s a security hole.
By default the adversary never actually runs anything (though that’s not a hard limit; your implementation can change it). It doesn’t execute code, run tests, or check constraints. It just reads and reasons.
Fresh sessions cut both ways. The adversary might re-raise issues that were genuinely fixed, because it doesn’t remember the previous round. The prior-rounds summary helps, but it’s lossy. More context would reduce wasted rounds but increase bias, so the current design picks independence and accepts some waste.
And convergence doesn’t mean correctness. The adversary might’ve missed something critical entirely. The fuzzy duplicate detection has edge cases. These are heuristics, not proofs.
When It’s Worth It
Use it when bugs are expensive (production code, user-facing stuff, sensitive data), when the task is complex enough that a single pass will miss things, or when nobody’s around to review. It’s your quality gate for unattended runs.
Skip it when cost is the main constraint, the task is trivial, or you need a rough draft fast and don’t care about polish.
Based on my own experience, this pattern is more than capable of handling large tasks at a fraction of the cost of a regular developer. Just this week I had it implement a change. It took about 20 hours wall time and cost $1000 in tokens. That might sound terrifying and wasteful, but we’d budgeted a month for this task for a single developer. For a developer to match the LLM, they’d have to finish in 2.5 days and be making less than $104k per year. Good luck finding that person. And if they take the budgeted month instead, assuming an average of 21 work days per month, their salary better be below $13k. Not per month. Per year. And that’s just to break even. It ignores the time advantage.
Before I End…
A quick note of honesty: the adversarial loop improves quality, but it doesn’t prove quality. Convergence is heuristic. The LLM’s output still needed human review and testing before it shipped. That’s true of developer code too, though. Nobody ships without review.
The point isn’t that the loop produces perfect code. It’s that it produces code that’s good enough to review, at a cost and speed that makes it worth trying. And if it saves you even one round of “back to the drawing board”, it’s already paid for itself.