agentflow · mini-ork · orchestration · solo-founder · arxiv-research

Mini-ork: A year of autonomous parallel feature delivery on a solo-founder codebase

May 4, 2026 · 11 min read

Most teams that ship code through an LLM coding assistant accept the same trade: faster commits, more babysitting. The assistant writes faster than a human; the human still has to be in the loop for every dispatch, every review, every merge decision. For a solo-founder project, the math collapses, because the human-hour is the bottleneck the assistant was supposed to remove.

Mini-ork is the orchestration loop we wrap around Claude Code at SourceShift to take more of the human out of the inner loop: kickoff document goes in, merged pull request comes out, with cost-aware reviewer selection, mid-iteration self-correction, BDD gates, and an auto-merger in between. A year ago it was a single-worker prototype. Today it routinely runs four parallel workers across worktrees while I sleep.

What follows is the year in review: which capabilities shipped, which 2026 papers named the patterns we landed on, and which trade-offs the system is honest about not having solved.

A year ago

A year ago, we introduced mini-ork: a worker-plus-reviewer cycle, gated by an objectively-greppable definition of done, that shipped single epics end-to-end inside a worktree.

Across the year, that loop has broadened from a single-worker prototype to include cost-aware reviewer selection, per-iteration self-correction, a unified state database, and a cross-agent message bus. Barnes et al. (2605.07062) call this the shift “from assistance to agency” in CI/CD pipelines, where the central question moves from whether the agent can do the task to how much decision authority should be delegated and where control should reside.

What follows is a collection of mini-ork’s most significant changes over the past year, anchored against the 2026 papers that named each pattern in the broader literature.

The whole loop in one picture:

flowchart TD
  KICK([kickoff document]) --> DECOMP[decompose into sub-epics]
  DECOMP --> PARALLEL[parallel worker dispatch<br/>across worktrees]
  PARALLEL --> WORKERS[N workers writing code]
  WORKERS --> CLASS{reviewer classifier:<br/>≤3 files? schema-safe?}
  CLASS -- low complexity --> CHEAP[DeepSeek V4 Pro<br/>$0.50 cap]
  CLASS -- high complexity --> OPUS[Anthropic Opus<br/>$1.00 cap]
  CHEAP --> VERDICT{verdict}
  OPUS --> VERDICT
  VERDICT -- approve --> BDD{BDD gate:<br/>live e2e passes?}
  VERDICT -- ≤3 issues --> PATCH[targeted self-correction<br/>$0.10 patch]
  VERDICT -- many issues --> WORKERS
  PATCH --> VERDICT
  BDD -- pass --> MERGE[auto-merge to main]
  BDD -- fail --> WORKERS
  MERGE --> LEDGER[(unified state +<br/>lessons store)]
  LEDGER -.preflight feeds next dispatch.-> DECOMP

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class CLASS,VERDICT,BDD gate
  class DECOMP,PARALLEL,WORKERS,CHEAP,OPUS,PATCH alloc
  class KICK,MERGE serve
  class LEDGER store

Yellow gates are the load-bearing decisions: which reviewer to spend money on, whether the verdict admits a cheap patch, whether the running scenario actually passes. Blue is the work itself. Green is the surface the operator sees. Purple is the state that learns over time. The dotted feedback arrow at the bottom is the part that mattered most: every merged epic enriches the preflight that runs before the next dispatch, so the system gets cheaper to run the more it runs.

Those gates exist to make trade-offs explicit — and the most visible trade-off in the first year of running four parallel workers is that every gate has a price tag.

Driving cost and reliability

Mini-ork has shipped a series of changes that cut per-epic cost while raising delivery reliability.

The biggest single win comes from cost-aware reviewer auto-selection. The reviewer used to be Anthropic’s Opus on every iteration, with a flat $1.00 budget cap per call. The auto-selector now classifies each sub-epic at dispatch time (three files or fewer, no schema or security paths involved, complexity score under the bar) and routes those reviews to DeepSeek V4 Pro on a $0.50 cap. The complex tier stays on Opus. The net effect across a typical week is roughly half the reviewer spend with no measured drop in approval quality, because the cases that needed Opus’s judgment never left it.

The shape is exactly the cost-quality cascade Moslem and Kelleher describe in their dynamic-routing survey (2603.04445): “smaller models suffice for routine queries; complex tasks demand more capable models”, but static deployment cannot tell them apart. RouteNLP (2604.23577) reports that more than 70% of enterprise traffic in their study lands in the small-model band, a number that lines up uncomfortably well with how many of our reviewer dispatches are three-file renames in disguise. Bouchard (2605.06350) gives the cascade a decision-theoretic backbone, treating the deferral threshold as a point on a cost-quality frontier rather than an empirical hyperparameter. Mini-ork’s selector is the engineering version of the same idea.
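As a rough illustration of the routing rule described above, here is a minimal sketch, not the production selector. The three-file limit and the two budget caps come from the post; the sensitive path prefixes, the numeric complexity bar, and the model label strings are assumptions made up for the example.

```python
from dataclasses import dataclass

# Hypothetical prefixes: the post only says "schema or security paths".
SENSITIVE_PREFIXES = ("migrations/", "schema/", "auth/", "security/")
# Assumed threshold: the post mentions a complexity bar but not its value.
COMPLEXITY_BAR = 5


@dataclass(frozen=True)
class SubEpic:
    files: tuple[str, ...]
    complexity: int


@dataclass(frozen=True)
class ReviewerChoice:
    model: str            # illustrative label, not a real API model id
    budget_cap_usd: float


def select_reviewer(epic: SubEpic) -> ReviewerChoice:
    """Route simple sub-epics to the cheap tier, everything else to Opus."""
    touches_sensitive = any(f.startswith(SENSITIVE_PREFIXES) for f in epic.files)
    is_simple = (
        len(epic.files) <= 3
        and not touches_sensitive
        and epic.complexity < COMPLEXITY_BAR
    )
    if is_simple:
        return ReviewerChoice("deepseek-v4-pro", 0.50)
    return ReviewerChoice("opus", 1.00)


rename = SubEpic(files=("a.py", "b.py"), complexity=2)
migration = SubEpic(files=("migrations/001.sql",), complexity=1)
print(select_reviewer(rename))
print(select_reviewer(migration))
```

All three conditions must hold for the cheap tier, so any doubt about a sub-epic sends it to the more capable reviewer.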

“The reviewer that’s right for a three-file rename is not the reviewer that’s right for a schema migration. Until we split the two, every approve was paying for the harder reviewer it did not need.”

A second compounding change is mid-iteration self-correction. When the reviewer returns a change-request verdict with three or fewer issues, zero scope violations, and zero blockers, the orchestrator no longer redispatches the full worker. A targeted patcher emits the fix in place. The per-iteration cost on these cases dropped from roughly $0.50 to $0.10, and the wall-clock time from minutes to seconds.

The empirical question of when self-correction actually helps was an open one until Liu and Meng (2604.22273) framed it as a control-theoretic Markov diagnostic: iterate only when the error-correction rate exceeds the error-introduction rate, conditioned on current accuracy. Mini-ork’s three-issue cap is the engineering analog, an empirical proxy for “iteration is still net-positive”. LLMLOOP (2603.23613) and How Many Tries Does It Take? (2604.10508) both report sharply diminishing returns past a small number of refinement passes, which is why our max-iteration default is three, not ten.
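The patch-or-redispatch decision can be sketched as a small pure function. The three-issue cap, the zero-violation and zero-blocker conditions, and the max-iteration default of three are taken from the post. The `Verdict` and `Action` types are hypothetical, and escalating once the iteration limit is reached is an assumption about what happens next.

```python
from dataclasses import dataclass
from enum import Enum

MAX_ITERATIONS = 3        # default quoted in the post
MAX_PATCHABLE_ISSUES = 3  # three-issue cap quoted in the post


class Action(Enum):
    BDD_GATE = "proceed to BDD gate"
    PATCH = "targeted patch"
    REDISPATCH = "full worker redispatch"
    ESCALATE = "escalate to operator"  # assumed behaviour at the limit


@dataclass(frozen=True)
class Verdict:
    approved: bool
    issues: int
    scope_violations: int = 0
    blockers: int = 0


def next_action(verdict: Verdict, iteration: int) -> Action:
    """Decide what follows a review; `iteration` counts completed reviews."""
    if verdict.approved:
        return Action.BDD_GATE
    if iteration >= MAX_ITERATIONS:
        return Action.ESCALATE
    patchable = (
        verdict.issues <= MAX_PATCHABLE_ISSUES
        and verdict.scope_violations == 0
        and verdict.blockers == 0
    )
    return Action.PATCH if patchable else Action.REDISPATCH


print(next_action(Verdict(approved=False, issues=2), iteration=1))
print(next_action(Verdict(approved=False, issues=2, blockers=1), iteration=1))
```

Keeping the rule as a pure function of the verdict makes the threshold easy to tune if the measured correction rate changes.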

The third reliability piece is stage-cache reuse. The spec-author, spec-reviewer, mutation-adversary, rubric-prescreen, and bdd-runner stages now consult a session cache keyed on the prompt template content before they call the LLM. A re-dispatch against an unchanged input lands in the cache hit path, drops cost to near zero, and shaves the cold-start out of the loop. Cheaper re-dispatch freed budget for capabilities the orchestrator could not afford before: live BDD gates, an agent-to-agent bus, and a failure memory that grows smarter each time the loop runs.
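A content-keyed stage cache of this kind might look like the sketch below. The post says the key is built from the prompt template content; folding the stage name and the stage input into the key is an assumption, as are the `StageCache` class and the `call_llm` callable, which stands in for whatever actually invokes the model.

```python
import hashlib
from typing import Callable


class StageCache:
    """In-memory session cache; a hit skips the LLM call entirely."""

    def __init__(self) -> None:
        self._store: dict[str, str] = {}
        self.hits = 0

    @staticmethod
    def key(stage: str, template: str, stage_input: str) -> str:
        digest = hashlib.sha256()
        for part in (stage, template, stage_input):
            data = part.encode("utf-8")
            # Length prefix keeps ("ab", "c") distinct from ("a", "bc").
            digest.update(len(data).to_bytes(8, "big"))
            digest.update(data)
        return digest.hexdigest()

    def run(
        self,
        stage: str,
        template: str,
        stage_input: str,
        call_llm: Callable[[str, str], str],
    ) -> str:
        k = self.key(stage, template, stage_input)
        if k in self._store:
            self.hits += 1
            return self._store[k]
        result = call_llm(template, stage_input)
        self._store[k] = result
        return result


def fake_llm(template: str, stage_input: str) -> str:
    # Stand-in for a real model call.
    return f"{template}::{stage_input}"


cache = StageCache()
cache.run("spec-author", "Write a spec for {epic}", "epic-12", fake_llm)
cache.run("spec-author", "Write a spec for {epic}", "epic-12", fake_llm)
print(cache.hits)
```

Because the key hashes the template text itself, editing a prompt invalidates the cached entries for that stage without any manual version bump.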

Advancing the orchestration frontier

Beyond cost, mini-ork has grown three new capabilities that change what the loop can ship.

Test-mandatory delivery wired live-environment BDD testing into the reviewer cycle, with specialist overlays per language and a debugger ledger that survives across iterations. The shape is exactly the closed-loop, self-correcting testing framework Naqvi et al. introduce in The Rise of Agentic Testing (2601.02454): a Test Generation Agent, an Execution Agent, and a Feedback Agent operating against execution-aware signal rather than static single-shot output. Lint and type-check used to be the merge gate; today it is lint, type-check, and an actually-running end-to-end scenario before the orchestrator squash-merges to main.

A subtler reliability piece is reviewer-as-separate-session. Mini-ork’s reviewer runs in a fresh Claude Code session with no access to the worker’s conversation history, only the diff and the kickoff. Song (2603.12123) calls this Cross-Context Review and reports it materially raises catch rates on production-session outputs, because the reviewer is no longer anchored to the worker’s framing. The mini-ork implementation predates the paper, but the validation is welcome.

An agent-to-agent message bus introduced a structured way for agents to ask each other questions, even across job boundaries. The bus carries asks through a mediator daemon, enforces an access-control list via epic lineage, depth-caps recursive conversations at three hops, and ledgers every exchange against a daily budget guard. Nine of nine integration tests pass; the bus has already been used to resolve cross-epic questions that previously fell back to a human relay.

The design constraints (resource-bounded, ACL-gated, expressive enough to carry typed questions) match what Mallick and Chebolu propose in μACP (2601.00219), a formal calculus for expressive, resource-constrained agent communication. The depth cap and the per-conversation budget close the gap Dubey (2604.17400) identifies as the dominant source of multi-agent token waste: unrestricted context sharing and unstructured parallel execution. Mini-ork’s bus does neither.
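The admission checks described for the bus can be sketched as a single mediator method. The three-hop cap, the lineage-based ACL, the daily budget guard, and the ledger come from the post. Everything else is an assumption for illustration: the `Ask` and `Mediator` types, the rule that two epics may talk only when they share a root ancestor, and the per-ask cost estimate.

```python
from dataclasses import dataclass, field
from typing import Optional

MAX_HOPS = 3  # depth cap quoted in the post


@dataclass
class Ask:
    sender_epic: str
    receiver_epic: str
    question: str
    hop: int = 1
    est_cost_usd: float = 0.01  # hypothetical per-ask estimate


@dataclass
class Mediator:
    lineage: dict[str, Optional[str]]  # epic -> parent epic, None for a root
    daily_budget_usd: float
    spent_usd: float = 0.0
    ledger: list[tuple[str, str, str]] = field(default_factory=list)

    def _root(self, epic: str) -> str:
        # Assumes the lineage graph is acyclic.
        while self.lineage.get(epic) is not None:
            epic = self.lineage[epic]
        return epic

    def _acl_allows(self, ask: Ask) -> bool:
        known = ask.sender_epic in self.lineage and ask.receiver_epic in self.lineage
        return known and self._root(ask.sender_epic) == self._root(ask.receiver_epic)

    def admit(self, ask: Ask) -> bool:
        """Apply depth cap, ACL, and budget guard; ledger every outcome."""
        if ask.hop > MAX_HOPS:
            reason = "depth-cap"
        elif not self._acl_allows(ask):
            reason = "acl"
        elif self.spent_usd + ask.est_cost_usd > self.daily_budget_usd:
            reason = "budget"
        else:
            reason = "ok"
            self.spent_usd += ask.est_cost_usd
        self.ledger.append((ask.sender_epic, ask.receiver_epic, reason))
        return reason == "ok"


bus = Mediator(lineage={"book": None, "track-a": "book", "track-b": "book"},
               daily_budget_usd=1.0)
print(bus.admit(Ask("track-a", "track-b", "Which schema version do you emit?")))
print(bus.admit(Ask("track-a", "track-b", "follow-up", hop=4)))
```

Rejected asks are ledgered as well as admitted ones, so the record shows what the guards blocked and why.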

“An orchestrator that cannot let two agents talk is an orchestrator that funnels every cross-cutting question through the operator. The bus broke that bottleneck.”

Self-healing on escalation turned failure into a learning surface. A lessons store accumulates fingerprinted failure classes; a memory layer maps fingerprint to remediation; a self-healer composes them into a patch; and an execution gate runs before any worker dispatch to apply known fixes pre-emptively.

flowchart LR
  FAIL([reviewer rejects N times]) --> FP[fingerprint the failure]
  FP --> LOOKUP{match in<br/>lessons store?}
  LOOKUP -- yes --> APPLY[apply known patch<br/>before next worker run]
  LOOKUP -- no --> ESCALATE[escalate to operator]
  ESCALATE --> ANNOT[operator records remediation]
  ANNOT --> STORE[(lessons store)]
  STORE --> LOOKUP
  APPLY --> RECHECK{still failing?}
  RECHECK -- no --> DONE([merged])
  RECHECK -- yes --> ESCALATE
  STORE -.preflight gate fires<br/>before any future dispatch.-> NEXT([next dispatch])

  classDef gate fill:#7a5a1f,stroke:#fff,color:#fff
  classDef alloc fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef store fill:#4a1f7a,stroke:#fff,color:#fff
  class LOOKUP,RECHECK gate
  class FP,APPLY,ANNOT alloc
  class FAIL,DONE,ESCALATE,NEXT serve
  class STORE store

The pattern is the same one Fang et al. capture in Trajectory-Informed Memory Generation (2603.10600): actionable lessons extracted from execution trajectories, applied to future runs without human curation. AgentDevel (2601.04620) reframes the same loop as “release engineering” for self-evolving agents, with the explicit goal of auditable, non-regressing improvement trajectories. Our lessons store carries that property by design: every entry is dated, every healing patch is logged, every escalation that was previously seen produces a measurably cheaper second pass.
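The fingerprint-then-lookup step can be sketched in a few lines. This is an illustrative reduction, not mini-ork's store: the normalisation rule (lowercase, numbers and hex values collapsed), the `LessonsStore` class, and the sample error text are all assumptions. The dated entries and the apply-or-escalate split follow the description above.

```python
import hashlib
import re
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


def fingerprint(error_text: str) -> str:
    """Collapse volatile details so repeats of one failure class collide."""
    normalised = re.sub(r"0x[0-9a-fA-F]+|\d+", "N", error_text.lower())
    normalised = re.sub(r"\s+", " ", normalised).strip()
    return hashlib.sha256(normalised.encode("utf-8")).hexdigest()[:16]


@dataclass(frozen=True)
class Lesson:
    remediation: str
    recorded_on: date  # every entry is dated


@dataclass
class LessonsStore:
    lessons: dict[str, Lesson] = field(default_factory=dict)

    def record(self, error_text: str, remediation: str, today: date) -> None:
        self.lessons[fingerprint(error_text)] = Lesson(remediation, today)

    def lookup(self, error_text: str) -> Optional[Lesson]:
        return self.lessons.get(fingerprint(error_text))


def handle_failure(store: LessonsStore, error_text: str) -> str:
    """Apply a known remediation if one matches, otherwise escalate."""
    lesson = store.lookup(error_text)
    if lesson is None:
        return "escalate"
    return f"apply: {lesson.remediation}"


store = LessonsStore()
print(handle_failure(store, "Timeout after 30 s on port 8080"))
store.record("Timeout after 30 s on port 8080",
             "raise health-check timeout", date(2026, 5, 4))
print(handle_failure(store, "timeout after 45 s on port 9090"))
```

An exact-match lookup like this never generalises beyond failures it has already seen, which is the conservative behaviour a fingerprint-matched healer trades for predictability.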

Yu et al. (2605.09315) issue a caveat worth carrying: “self-evolving agents forget”, and adapting to new distributions can degrade previously acquired capabilities. The mitigation in mini-ork is conservative by default; the healer applies fingerprint-matched patches, not learned policy updates. The trade-off is real and acknowledged. Those trade-offs compound on the operator who has to reason about them; the next challenge was making the loop’s growing surface area accessible without adding a commensurate cognitive load.

Improving the developer-experience surface

The user-facing surface has narrowed as the engine underneath has gotten bigger.

A single delivery command now takes a kickoff and runs the full pipeline: decompose, scaffold worktrees, dispatch in parallel, review with auto-selected reviewer, BDD-gate, self-correct mid-iteration, and auto-merge on approval. The previous flow needed five-plus commands. The new one needs one, with flags for skipping decomposition, backgrounding the run, or dry-running for the cases where finer control matters.

A second DX win followed: a project-aware autopilot skill took the iter-1 evaluation rate from 90.5% to 100% on its benchmark dataset while running 47% faster.

Parallel multi-track delivery through the Book Context-First refactor shipped 11 sub-epics in a single dispatch across four different LLM providers (Kimi, GLM, MiniMax, Sonnet) in a single afternoon. The orchestrator handled the cross-track contracts, a scope-pattern sentinel kept track-A’s worker from writing files owned by track-B, and the auto-merger sequenced the approves so no two epics fought over the main branch. Geng and Neubig (2603.21489) study exactly this regime under the name asynchronous software engineering agents, showing that “long-horizon tasks involving multiple interdependent subtasks” benefit from asynchronous multi-agent collaboration over monolithic dispatch. Agyn (2602.01465) makes the role-and-review separation explicit (workers, reviewers, and coordinators with distinct responsibilities), which is also how mini-ork’s per-epic actor manifest carves up the work. Shipping 11 sub-epics in an afternoon opened an adjacent question: if the same parallel dispatch could be turned on the literature pipeline that feeds the kickoff queue, could mini-ork accelerate its own research intake?
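A scope-pattern sentinel of the kind mentioned above can be approximated with glob matching from the standard library. The track names and path patterns are invented for the example, and the real sentinel may match paths differently.

```python
from fnmatch import fnmatchcase

# Hypothetical ownership map: each track lists the glob patterns it may write.
TRACK_SCOPES = {
    "track-a": ["book/context/*", "tests/context/*"],
    "track-b": ["book/render/*", "tests/render/*"],
}


def out_of_scope(track: str, changed_files: list[str]) -> list[str]:
    """Return the changed files that no pattern owned by `track` covers."""
    patterns = TRACK_SCOPES[track]
    return [
        path
        for path in changed_files
        if not any(fnmatchcase(path, pattern) for pattern in patterns)
    ]


diff = ["book/context/loader.py", "book/render/html.py"]
violations = out_of_scope("track-a", diff)
if violations:
    print(f"track-a wrote outside its scope: {violations}")
```

Note that `*` in `fnmatchcase` also matches path separators, so `book/context/*` covers nested directories as well.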

Scaling research workflows

Mini-ork’s reach has extended beyond pure code into the research stack that informs the code.

A recent example: chapter generation for our technical book pipeline was failing across three Anthropic-compatible providers, each in a different way. One timed out, one fell into a thinking-loop with zero output, and one wrote its artifact to disk successfully but reported a failure. The naive response would have been provider whack-a-mole. Instead, the loop mined our local arXiv corpus (137K papers, embedded with Gemini) for matching failure-mode names and turned up five papers from the last six months that named every observed shape, with composable remediation primitives. The architecture that came out (a planner agent emitting a typed schema, an ICR floor gate on output-to-input ratio, per-provider character-density calibration) is captured in a handoff document and is the next dispatch on deck.
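To make the floor-gate idea concrete, here is a deliberately small sketch. The post does not expand "ICR" or give thresholds, so the sketch assumes only what the text states: the gate compares output length to input length against a floor. The provider names, the floor values, and the use of character counts are hypothetical.

```python
# Hypothetical per-provider floors on the output-to-input character ratio.
RATIO_FLOORS = {
    "provider-a": 0.5,
    "provider-b": 0.8,
}


def passes_floor(provider: str, input_text: str, output_text: str) -> bool:
    """Reject generations whose output is too short relative to the input."""
    if not input_text:
        raise ValueError("input_text must be non-empty")
    ratio = len(output_text) / len(input_text)
    return ratio >= RATIO_FLOORS[provider]


prompt = "x" * 100
print(passes_floor("provider-a", prompt, ""))         # zero-output run
print(passes_floor("provider-a", prompt, "y" * 60))   # plausible chapter
```

A gate like this flags a zero-output run immediately, whatever status the provider itself reported.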

The same loop powers a local arXiv search skill that turned what would have been a day of web-search whack-a-mole into a focused evening of literature mining, including, recursively, the literature mining that produced this post. A loop that mines its own literature is one step from a loop that schedules its own next dispatch — and that gap is exactly where the next year’s work is aimed.

The future of mini-ork

The past year has shown that a solo-founder orchestrator does not need to be a small thing. It needs to be a composable thing: small enough that any individual piece is replaceable in an afternoon, large enough in aggregate that the parallel-delivery economics actually beat sequential human work. Mini-ork has crossed that threshold. The next year is about pulling more of the day-to-day decision surface (what to dispatch, which reviewer to assign, when to escalate) out of the operator’s head and into the loop’s defaults.

We’re excited to extend mini-ork in three directions: deeper integration with the lessons store so escalations seed automatic preflight gates, broader provider coverage so the per-tier router can compose across more of the cost-quality frontier Bouchard (2605.06350) maps, and a tighter feedback path from the production runtime back into the dispatch heuristics so the orchestrator picks epics that have already paid their integration-cost dues. None of those directions were arrived at in isolation — the papers that named each pattern are the reason the roadmap is as concrete as it is.

Acknowledgements

Mini-ork sits on top of Claude Code, the Claude Agent SDK, and the Daytona sandbox.

We gratefully acknowledge the 2026 multi-agent literature that named, validated, or sharpened the patterns mini-ork ships: Moslem and Kelleher (2603.04445) and Bouchard (2605.06350) on dynamic routing and cascades; Liu and Meng (2604.22273), Ravi et al. (2603.23613) and Arimbur (2604.10508) on iterative self-correction; Naqvi et al. (2601.02454) on agentic testing; Mallick and Chebolu (2601.00219) and Dubey (2604.17400) on resource-bounded agent communication; Fang et al. (2603.10600) and Zhang (2601.04620) on trajectory-informed self-evolution; Yu et al. (2605.09315) on capability degradation in self-evolving agents; Song (2603.12123) on cross-context review; Geng and Neubig (2603.21489) and Benkovich and Valkov (2602.01465) on asynchronous and team-based SWE agents; and Barnes et al. (2605.07062) for the framing that opens this post.

The chapter-pipeline architecture mentioned in Scaling research workflows additionally draws on Liu et al. (2604.16736), the VIGIL terminal-commitment paper (2605.08747), Budget-Aware Tool-Use (2511.17006), the DeepNews Knowledge Cliff analysis (2512.10121), and CogWriter (2502.12568).