Agent Fleet Retry Loops: Carry Rejection Forward

The Symptom That Brings You Here

Your reviewer agent is rejecting PRs. Your coder agent is picking them up again. After 3–5 cycles you check the logs and see the exact same critique repeated verbatim, on identical tasks, across consecutive attempts. Merge rate is single-digit percent after days or weeks in production.

This is not a model capability problem. It is a context-threading failure: the rejection reason exists somewhere in the system, but the retrying coder never actually receives it in actionable form.

The specific production case that generated this page: the-autobots fleet (Hatchery-backed heterogeneous swarm, 13 roles) hit a 5% PR merge rate over 12 days. Prowl (QA reviewer) was rejecting PRs via text-only critique; Ratchet (coder) retried from cold context. The same rejection complaint appeared 4 consecutive times in production logs. (source: the-autobots production fix, commit 70d68b1)


The Three Compounding Failure Modes

1. Critique buried at the bottom of the retry task description

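When a reviewer refiles a task, appending the rejection to the bottom of the original description makes it likely that the coder underweights it. Long-context models do not use every position equally well: Liu et al.’s “lost in the middle” finding (see [[AI Memory Systems]]) is a U-shaped curve, with accuracy highest for information at the very start or very end of a long context and lowest for information in the middle. An appended critique sits at the end of the task description, not at the end of the prompt. If the coder’s template adds repository context, tool output, or instructions after the description, the critique lands in the weak middle zone while the original task statement keeps the strong opening position.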

Fix: lead the refile description with the rejection block. Structure:

## ATTEMPT #N — previous attempt rejected

**Required change**: <one imperative sentence>
**Reviewer critique**: <verbatim excerpt>

---

<original task description below>
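A minimal sketch of that structure as a helper, assuming the orchestrator keeps the untouched original description. The function name `build_refile_description` and its parameters are hypothetical and not part of the autobots fix.

```python
def build_refile_description(original: str, attempt: int,
                             required_change: str, critique: str) -> str:
    """Return a retry description that leads with the rejection block."""
    # Collapse whitespace so the verbatim excerpt stays on a single line.
    excerpt = " ".join(critique.split())
    return (
        f"## ATTEMPT #{attempt} - previous attempt rejected\n\n"
        f"**Required change**: {required_change}\n"
        f"**Reviewer critique**: {excerpt}\n\n"
        "---\n\n"
        f"{original}"
    )


if __name__ == "__main__":
    print(build_refile_description(
        original="Implement the /health endpoint.",
        attempt=2,
        required_change="Add error handling for 429 responses",
        critique="The client retries forever\nwhen the API returns 429.",
    ))
```

Passing the original description on every refile, rather than the previous attempt's description, keeps rejection blocks from stacking up across attempts.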

2. No structured required_change field — only free-text critique

Free-text rejection comments require the retrying model to extract an imperative from prose, identify which part applies to its output, and translate it into a concrete code change. Each extraction step is lossy. A reviewer LLM producing fluent prose is actually making the coder’s job harder.

Fix: add a required_change: <imperative> field to the reviewer’s JSON output schema. This field must be a single imperative sentence: “Add error handling for 429 responses”, “Rebase onto main before opening the PR”, “Push actual file changes, diff is empty”. The coder’s prompt template should reference this field directly, not the full critique body.
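One way to enforce the field is to validate the reviewer's output before anything is refiled. The sketch below assumes the reviewer returns a JSON object with the `verdict`, `required_change`, `critique`, and `conf` fields used on this page; `Review` and `parse_review` are hypothetical names.

```python
import json
from dataclasses import dataclass

VERDICTS = {"approved", "needs_changes", "retry", "escalate"}


@dataclass(frozen=True)
class Review:
    verdict: str
    required_change: str
    critique: str
    conf: float


def parse_review(raw: str) -> Review:
    """Parse reviewer JSON; raise ValueError if it cannot be acted on."""
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reviewer output is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError("reviewer output must be a JSON object")

    verdict = data.get("verdict")
    if not isinstance(verdict, str) or verdict not in VERDICTS:
        raise ValueError(f"unknown verdict: {verdict!r}")

    # A rejection without an imperative is exactly the failure described above.
    required_change = str(data.get("required_change") or "").strip()
    if verdict == "needs_changes" and not required_change:
        raise ValueError("needs_changes must carry a non-empty required_change")

    try:
        conf = float(data.get("conf", 0.0))
    except (TypeError, ValueError) as exc:
        raise ValueError("conf must be a number") from exc
    if not 0.0 <= conf <= 1.0:
        raise ValueError(f"conf out of range: {conf}")

    return Review(verdict, required_change, str(data.get("critique") or ""), conf)
```

A `ValueError` from this parser means the reviewer's output was malformed, not that the coder's work failed review, so the caller should route it to the retry path rather than refile.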

For deterministic-gate failures (merge conflict, empty diff, lint fail), synthesize the imperative without calling an LLM at all:

| Gate failure | required_change |
| --- | --- |
| Merge conflict | “Rebase onto main and resolve conflicts before reopening” |
| Empty diff | “Push actual file changes; the PR diff is empty” |
| Lint failure | “Fix lint errors listed in the gate output” |
| Build failure | “Fix build errors listed in the CI log” |
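The gate-to-imperative mapping can be a plain lookup with no model in the loop. This is a sketch under assumptions: the gate keys (`merge_conflict`, `empty_diff`, and so on), the function name `synthesize_rejection`, and the choice of `conf=1.0` for deterministic failures are illustrative, not taken from the production fix.

```python
from typing import Iterable, Optional

# Ordered by priority: the first failing gate in this order wins.
GATE_IMPERATIVES = {
    "merge_conflict": "Rebase onto main and resolve conflicts before reopening",
    "empty_diff": "Push actual file changes; the PR diff is empty",
    "lint_failure": "Fix lint errors listed in the gate output",
    "build_failure": "Fix build errors listed in the CI log",
}


def synthesize_rejection(failed_gates: Iterable[str]) -> Optional[dict]:
    """Build a reviewer-shaped verdict from gate failures, without an LLM.

    Returns None when no known gate failed, meaning the LLM reviewer
    should run as usual.
    """
    failed = set(failed_gates)
    for gate, imperative in GATE_IMPERATIVES.items():
        if gate in failed:
            return {
                "verdict": "needs_changes",
                "required_change": imperative,
                "critique": f"Deterministic gate failed: {gate}",
                "conf": 1.0,  # assumption: a deterministic gate is fully trusted
            }
    return None
```

Reporting only the highest-priority failure keeps the refiled task narrow; a rebase, for example, may change which lint or build errors remain.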

3. Reviewer LLM call failure silently treated as needs_changes

When the reviewer’s LLM call fails (429 rate-limit, timeout, malformed JSON), a naively-structured reviewer returns needs_changes with conf=0 rather than propagating the error. This destroys valid work: the coder’s output was never actually evaluated, but it gets refiled as if it failed review.

In the production case, 32% of “rejections” were rate-limit artifacts, not real QA failures. (source: the-autobots production fix, commit 70d68b1)

Fix: LLM-call failure returns verdict=retry (or verdict=error), never needs_changes. Gate the refile path: only refile if verdict=needs_changes with conf > threshold. Cap retries at 3 (or N), then escalate to a human-review queue rather than looping forever.
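A sketch of the first half of that fix: wrap the reviewer call so any failure becomes `verdict=retry`. The `call_reviewer` callable stands in for whatever client the fleet uses, and `review_safely` is a hypothetical name.

```python
from typing import Callable


def review_safely(call_reviewer: Callable[[str], dict], diff: str) -> dict:
    """Run the reviewer; map any call or parse failure to verdict='retry'."""
    try:
        result = call_reviewer(diff)
        if not isinstance(result, dict) or "verdict" not in result:
            raise ValueError("malformed reviewer output")
        return result
    except Exception as exc:
        # Broad on purpose: rate limits, timeouts, and bad JSON all mean the
        # work was never evaluated, so none of them may become needs_changes.
        return {
            "verdict": "retry",
            "required_change": "",
            "critique": f"reviewer unavailable: {type(exc).__name__}",
            "conf": 0.0,
        }
```

Recording the exception type in `critique` keeps the failure visible in logs, which makes it possible to measure how many apparent rejections were really call errors.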


The Complete Fix Pattern

Reviewer JSON schema:
  verdict: "approved" | "needs_changes" | "retry" | "escalate"
  required_change: "<imperative sentence>"   ← NEW
  critique: "<full prose>"
  conf: 0.0–1.0

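Refile logic:
  if verdict == "approved":
      pass  # approved work leaves the retry loop; nothing to refile
  elif verdict == "needs_changes" and conf > 0.6 and attempt_count < 3:
      n = attempt_count + 1  # assumes attempt_count counts attempts already made
      new_description = f"## ATTEMPT #{n}\n\nRequired change: {required_change}\n\n---\n\n{original_description}"
      refile_task(new_description)
  elif verdict == "retry" or (verdict == "needs_changes" and conf <= 0.6):
      # Reviewer LLM call failed or confidence is low: don't penalize the coder.
      # Bound this path too (backoff, requeue cap) so a persistent 429 cannot loop forever.
      requeue_same_task()
  else:
      # verdict == "escalate", or needs_changes after the retry cap
      escalate_to_human()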
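The routing rules are easier to test when written as a pure function that only returns the next action. This sketch assumes approved work should leave the loop, and it uses the 0.6 threshold and cap of 3 from this page as defaults; `next_action` and the action names are hypothetical.

```python
def next_action(verdict: str, conf: float, attempt_count: int,
                conf_threshold: float = 0.6, max_attempts: int = 3) -> str:
    """Return one of 'done', 'refile', 'requeue', 'escalate'."""
    if verdict == "approved":
        return "done"
    if verdict == "retry":
        # The reviewer never evaluated the work; run the review again.
        return "requeue"
    if verdict == "needs_changes":
        if conf <= conf_threshold:
            # Low-confidence rejection: do not penalize the coder.
            return "requeue"
        if attempt_count < max_attempts:
            return "refile"
    # Explicit 'escalate', unknown verdicts, or retry cap reached.
    return "escalate"


if __name__ == "__main__":
    for case in [("needs_changes", 0.9, 1), ("needs_changes", 0.9, 3),
                 ("retry", 0.0, 2), ("approved", 0.7, 1)]:
        print(case, "->", next_action(*case))
```

Keeping side effects (refiling, requeueing, paging a human) outside this function means the decision table can be covered by a handful of assertions.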

Why This Pattern Generalizes

Any multi-agent system with a reviewer→coder (or critic→generator) loop will hit this failure unless rejection context is explicitly threaded. The pattern appears across:

  • Code review loops (PR-level or file-level)
  • Essay / doc revision loops
  • Design-spec → implementation loops
  • Test-generation → fix loops
  • RAG answer → fact-check → re-answer loops

The underlying cause is always the same: the reviewing agent and the generating agent share state only through task/message metadata, and that metadata is designed for routing (which agent picks this up?) not for context (what does the next agent need to know?).


Prior Art and Research Anchors

Reflexion (Shinn et al., 2023)

The canonical treatment. The agent generates a verbal self-reflection after a failed attempt; the reflection is stored in an episodic memory buffer and added to the context of the next attempt. The lesson this page takes from it: feedback has to be placed where the next attempt will actually read it, which in practice means prepended to the task text itself rather than appended or left in a separate “feedback” field that the coder’s prompt never renders. The autobots fix independently re-derived this. (Closely related to [[AI Memory Systems]]: episodic memory as inter-attempt scratchpad.)

MetaGPT shared message pool

MetaGPT routes role outputs through a shared blackboard. The engineer role sees the prior rejection AND the PRD AND its own previous code simultaneously in one context window. No threading needed because everything is always present. The tradeoff is prompt length; the benefit is zero dropped context. Relevant if your task queue is Hatchery-style (pull-based, stateless workers) and you need to decide how much context to embed in the task payload vs. query at claim time.

Live-SWE-agent (arXiv 2511.13646)

Inline trajectory preservation: the agent’s full action/observation history from the failed attempt is compressed and reinjected into the retry prompt. Stronger than just the reviewer’s critique — carries the coder’s own reasoning trace. Expensive but addresses cases where the critique is accurate but the coder can’t reproduce the required state.

LLM-as-judge overconfidence (arXiv 2508.06225)

ECE (Expected Calibration Error) of 39% measured on GPT-4o acting as judge. LLM reviewers are systematically overconfident. ECE is the average gap between stated confidence and observed accuracy, weighted across confidence bins, so 39% is not the error rate of any single verdict; it means a conf=0.9 needs_changes verdict may be correct far less often than 90% of the time. Implication: conf scores from reviewer LLMs should be treated as ordinal rankings, not probabilities. Use structured rejection (imperative field, explicit gate results) to reduce dependence on the reviewer LLM’s confidence estimate.
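To check how far your own reviewer's `conf` can be trusted, ECE can be computed from a sample of verdicts that a human has audited. The sketch below uses the standard equal-width-bin definition; the audited labels (`correct`) are an assumption about data you would have to collect, and the function name is illustrative.

```python
from typing import Sequence


def expected_calibration_error(confs: Sequence[float],
                               correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """Bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    if len(confs) != len(correct) or not confs:
        raise ValueError("confs and correct must be non-empty and equal length")

    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(confs, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # conf == 1.0 goes in the last bin
        bins[idx].append((c, ok))

    total = len(confs)
    ece = 0.0
    for items in bins:
        if not items:
            continue
        avg_conf = sum(c for c, _ in items) / len(items)
        accuracy = sum(1 for _, ok in items if ok) / len(items)
        ece += (len(items) / total) * abs(accuracy - avg_conf)
    return ece
```

For example, ten verdicts all reported at conf=0.9, of which only five were right on audit, give an ECE of about 0.4.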

METR time-horizon study

Task sizing matters for retry loops. METR measures task length by how long the task takes a skilled human; the finding as summarized here is that code agents succeed on tasks of about 7 human-minutes but fail sharply beyond about 1 hour. If your retry tasks are growing in scope with each refile (accumulated context, wider diffs), the retry loop is working against the time-horizon curve. Keep refiled tasks narrow: carry the imperative forward, not the full accumulated history.


Fleet-Level Operational Checklist

When debugging a low merge-rate in a reviewer→coder fleet:

  • Check consecutive task descriptions for the same reviewer critique — if present, context threading is broken
  • Check what fraction of “needs_changes” verdicts correlate with LLM call errors in reviewer logs
  • Verify reviewer JSON schema has a required_change or equivalent imperative field
  • Verify refile task description leads with the rejection block, not appends it
  • Verify there is a retry cap and an escalation path
  • Verify deterministic-gate failures (lint, build, merge conflict) produce synthesized imperatives, not just error logs
  • Instrument: log attempt_count on every task; plot distribution — bimodal (mostly 1 or mostly 3+) indicates the loop is pathological
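The first and last checks in this list can be scripted against exported logs. This is a sketch assuming rejections are available as chronological `(task_id, critique)` pairs and that `attempt_count` is logged per task; both function names are hypothetical.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def attempt_histogram(attempt_counts: Iterable[int]) -> dict:
    """Count how many tasks ended at each attempt_count, sorted by count."""
    return dict(sorted(Counter(attempt_counts).items()))


def repeated_critique_tasks(rejections: Iterable[Tuple[str, str]],
                            min_repeats: int = 2) -> List[str]:
    """Return task ids whose critique repeated on consecutive attempts.

    Critiques are compared after lowercasing and collapsing whitespace.
    """
    last, run, flagged = {}, {}, set()
    for task_id, critique in rejections:
        norm = " ".join(critique.lower().split())
        if last.get(task_id) == norm:
            run[task_id] += 1
        else:
            last[task_id], run[task_id] = norm, 1
        if run[task_id] >= min_repeats:
            flagged.add(task_id)
    return sorted(flagged)


if __name__ == "__main__":
    log = [("T1", "Diff is empty"), ("T2", "Missing tests"),
           ("T1", "diff  is empty"), ("T2", "Lint fails")]
    print(repeated_critique_tasks(log))
    print(attempt_histogram([1, 1, 3, 4, 1]))
```

Exact-match comparison after normalization is deliberately strict; it catches the verbatim repeats described at the top of this page but will miss paraphrased critiques.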

Related Pages

  • [[AI Memory Systems]]: episodic memory layer is the right analogy for inter-attempt rejection context
  • [[Retrieval-Augmented Generation (RAG)]]: context injection patterns apply to task payload design
  • [[Next-Gen AI Memory Architectures]]: architectural approaches to state persistence across agent turns