
What 170 papers agreed on about deep research agents

May 10, 2026 · 14 min read

The first deep research agent we shipped had a fifteen-line plan, a single retrieval call, and a one-paragraph report. The eighth one had a four-stage pipeline, multi-agent verification, and an evaluation gate. In between, we read about 170 papers on the topic. Most of the value was not in any one paper. It was in noticing that the field had been converging on a shape — and that the shape was not the one most agent frameworks ship by default.

This post is the distillation. Five surveys’ worth of papers, eight subfield categories, one consensus pipeline, six recurring failure modes, and the parts the literature still cannot agree on. Where we have shipped a piece of this at LibWit, it gets a cross-link; where we tried and dropped something, that gets named too.

The pipeline everyone ended up at

Huang et al.’s Deep Research Agents survey (Jun 2025) and Zhang et al.’s Deep Research: A Survey of Autonomous Research Agents (Aug 2025) both name the same four-stage shape.

flowchart LR
  Q([user query]) --> P[plan:<br/>decompose into<br/>sub-questions]
  P --> E[explore:<br/>retrieve, browse,<br/>tool-call]
  E --> S[synthesize:<br/>extract claims,<br/>cross-check]
  S --> R[report:<br/>structured<br/>output]
  R --> OUT([final answer<br/>+ citations])

  S -.gap detected.-> P
  E -.evidence weak.-> P

  classDef stage fill:#1f3a7a,stroke:#fff,color:#fff
  classDef serve fill:#1f5e3a,stroke:#fff,color:#fff
  classDef back fill:#7a5a1f,stroke:#fff,color:#fff
  class P,E,S,R stage
  class Q,OUT serve

Plan. Explore. Synthesize. Report. The two arrows that loop back to plan are what separate this from a one-shot RAG pipeline: when the synthesizer detects a gap, or when the explorer surfaces weak evidence, the agent updates the plan instead of finishing.
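The control flow in the diagram can be sketched in a few lines of Python. This is a minimal illustration, not code from either survey: the stage callables (`plan`, `explore`, `synthesize`, `report`), the `Synthesis` record, and the `min_evidence` and `max_replans` parameters are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Synthesis:
    claims: List[str]
    gaps: List[str] = field(default_factory=list)  # sub-questions still unanswered


def run_pipeline(
    query: str,
    plan: Callable[[str, List[str]], List[str]],   # (query, known gaps) -> sub-questions
    explore: Callable[[List[str]], List[str]],     # sub-questions -> evidence snippets
    synthesize: Callable[[List[str]], Synthesis],  # evidence -> claims plus gaps
    report: Callable[[str, List[str]], str],       # (query, claims) -> final text
    min_evidence: int = 2,   # hypothetical threshold for "evidence weak"
    max_replans: int = 3,    # hypothetical cap so the loop always terminates
) -> str:
    gaps: List[str] = []
    evidence: List[str] = []
    claims: List[str] = []
    for _ in range(max_replans + 1):
        sub_questions = plan(query, gaps)
        evidence += explore(sub_questions)
        if len(evidence) < min_evidence:
            # "evidence weak" arrow: return to planning before synthesizing
            gaps = sub_questions
            continue
        result = synthesize(evidence)
        claims = result.claims
        if not result.gaps:
            break
        gaps = result.gaps  # "gap detected" arrow: re-plan around the gaps
    return report(query, claims)
```

The cap on re-plans is the design choice worth noticing: without it, the two feedback arrows turn the pipeline into an unbounded loop.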

Across those same surveys, the variations are mostly naming. Huang et al. call it information acquisition + modular tool use + report generation. Zhang et al. call the stages planning, question developing, web exploration, report generation. From Web Search towards Agentic Deep ReSearch (Jun 2025) frames the same thing as the difference between “one-shot prompt + retrieve” and “an agentic LLM plans a series of steps.” Different vocabulary, same shape.

The reason this converged is that every shorter shape gets bitten in production. Plan-then-execute alone underperforms when the early plan is built without evidence. Retrieve-then-summarize alone fails when the question is multi-hop. Plan + execute + report without a synthesis stage produces shallow outputs because the model has no chance to cross-check before writing. The four-stage shape is the smallest one that survives realistic queries.

The four-stage shape is the consensus, but within it three architecture decisions split the field, and each choice changes which queries succeed and which fail.

Three taxonomy splits worth knowing

Each of the three splits materially affects how the pipeline runs, and each has a “right” answer that depends on the task.

Split	Static workflow	Dynamic workflow
Who decides next step	Pre-defined graph; agent walks it.	Agent decides at runtime based on intermediate state.
When this wins	Bounded research questions with a known shape (e.g. “summarize this paper”).	Open-ended exploration (e.g. “what does the literature say about X”).
Cost shape	Linear, predictable.	Variable, can blow up.
Failure mode	Misses turns that the plan didn’t anticipate.	Drifts off the question.

Split	Single-agent	Multi-agent
Composition	One agent owns the whole pipeline.	Separate agents for planning, search, synthesis, verification.
When this wins	Short queries, tight cost budget.	Complex queries where verification matters more than latency.
Cost shape	One model, many turns.	Many models, fewer turns each.
Failure mode	Confabulation under fatigue.	Coordination cost, message-passing overhead.

Split	API retrieval	Browser retrieval
Mechanism	Structured tool calls to search APIs (Google, Bing, Brave).	Headless browser emulating human web exploration.
When this wins	Authoritative sources (Wikipedia, arXiv, PubMed).	Heterogeneous web content, paywalls, JS-rendered pages.
Cost shape	Per-query cost; rate-limited.	Per-second compute; can be parallelized.
Failure mode	Missing what’s not indexed.	Bot detection, layout drift.

The non-obvious lesson from the survey papers is that no single combination wins. A static-single-agent-API stack is fine for “summarize this paper.” A dynamic-multi-agent-browser stack is what you want for “what is the state of the art on X.” The engineering decision is matching the stack to the query shape — not picking a single architecture and forcing every query through it.
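As a rough sketch of what matching the stack to the query shape could look like, the three splits can be treated as three independent switches. The `Stack` record, the `choose_stack` function, and the three boolean query features are assumptions made for this example; how a real system classifies a query into those features is left open.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stack:
    workflow: str   # "static" or "dynamic"
    agents: str     # "single" or "multi"
    retrieval: str  # "api" or "browser"


def choose_stack(open_ended: bool, needs_verification: bool, sources_indexed: bool) -> Stack:
    """Map three hypothetical query features onto the three splits in the tables."""
    return Stack(
        workflow="dynamic" if open_ended else "static",
        agents="multi" if needs_verification else "single",
        retrieval="api" if sources_indexed else "browser",
    )


# "summarize this paper": bounded, low stakes, source already indexed
summarize = choose_stack(open_ended=False, needs_verification=False, sources_indexed=True)

# "what is the state of the art on X": open-ended, verification matters, sources scattered
survey = choose_stack(open_ended=True, needs_verification=True, sources_indexed=False)
```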

Matching the stack to the query shape is the right call, but the field learned that move the hard way, through six failure modes that recur regardless of which combination you pick.

Six failure modes that bit the field

Every one of the six maps onto something we hit in production. Trehan and Chopra’s Why LLMs Aren’t Scientists Yet (Jan 2026) is the post-mortem the deep research literature needed: they built four end-to-end LLM-scientist pipelines, watched three fail, and documented the six modes.

#	Failure mode	What it looks like in production	Mitigation in current literature
1	Bias toward training data defaults	Agent reaches for a familiar method instead of the one the task needs	Inject domain-specific schemas at the planning stage (see Schema-Constrained Generation)
2	Implementation drift under execution pressure	Agent simplifies the original plan to fit its context budget, then declares success	Make the original plan a checked artifact; flag drift loudly
3	Memory and context degradation across long-horizon tasks	Late stages forget what was decided early; constraints fall out	The same shape that Memory-Augmented LLM Agents (2604.27003) named for chat; it applies just as hard here
4	Overexcitement that declares success despite obvious failures	Final report claims results that the trace does not support	Adversarial verification stage (more below)
5	Insufficient domain intelligence	Agent does not know what a “good” result looks like for the domain	Domain-specific rubrics; cross-model verification
6	Weak scientific taste in experimental design	Agent runs experiments that cannot distinguish hypotheses	Hardest to fix; needs human-in-the-loop or a domain-trained verifier
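The mitigation listed for mode 2, treating the original plan as a checked artifact, is simple to illustrate. The sketch below assumes plan steps can be compared as plain strings and that the run keeps a log of executed steps; `PlanDriftError` and `check_plan_drift` are hypothetical names, not something the paper defines.

```python
from typing import List


class PlanDriftError(RuntimeError):
    """Raised when the executed trace silently dropped planned steps."""


def check_plan_drift(planned_steps: List[str], executed_steps: List[str]) -> None:
    executed = set(executed_steps)
    # Extra executed steps are allowed; missing planned steps are not.
    missing = [step for step in planned_steps if step not in executed]
    if missing:
        raise PlanDriftError(f"planned steps never executed: {missing}")
```

Raising instead of logging is deliberate in this sketch: the failure mode is an agent that declares success, so the check has to be able to block the report.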

[Figure: six-panel grid of the LLM-scientist failure modes from Trehan and Chopra (2601.03315): training data bias, implementation drift, memory degradation, overexcitement, weak domain intelligence, weak scientific taste. Tiles 3 and 6 are shaded darker to mark the failures hardest to mitigate.] Six modes, two rows. Memory degradation and weak scientific taste are the two with no clean fix in the current literature.

Trehan and Chopra’s framing matters because it stops you from celebrating the pipeline working on benchmark queries. Half of these failures are invisible when the benchmark is “did the agent return a plausible-looking report.” All of them are visible when the question is “did the agent return a true report grounded in what it actually retrieved.”

The honest reading of the paper is that the one attempt of four that succeeded did so only by passing AI4Science review, which is itself AI-reviewed. The success is not a clean win.

If one AI-reviewed success is not a clean win, the field needed a structural check that does not inherit the same bias, and the answer the literature converged on is adversarial debate between agents with different bias profiles.

Verification through adversarial debate

The mitigation the field has converged on for several of those failure modes is multi-agent adversarial verification. The simplest shape: one agent generates a claim, another agent argues against it, a third agent adjudicates. The benefit comes from the agents having different bias profiles: a single model debating itself does almost nothing useful, but two different model families (Claude vs Gemini, GPT-4 vs Llama) catch different errors.

The critic role is the load-bearing one in adversarial verification; without it, the researcher’s bias compounds rather than cancels. Autonomous Research via Adversarial Multi-Agent Collaboration (Aris, May 2026) is the clearest current writeup: a researcher agent proposes a hypothesis, a critic agent attacks it with the strongest opposing reading of the evidence, an adjudicator reconciles.
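A single round of the three-role shape can be sketched as follows. This is an illustration of the pattern rather than the Aris implementation: the three roles are passed in as plain callables (in practice each would wrap a model call, with the researcher and critic on different model families), and `Verdict` and `verify_claim` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Verdict:
    claim: str
    critique: str
    accepted: bool


def verify_claim(
    evidence: str,
    researcher: Callable[[str], str],               # evidence -> claim (model family A)
    critic: Callable[[str, str], str],              # (claim, evidence) -> strongest objection (family B)
    adjudicator: Callable[[str, str, str], bool],   # (claim, critique, evidence) -> accept?
) -> Verdict:
    claim = researcher(evidence)
    critique = critic(claim, evidence)
    # The adjudicator sees the claim, the objection, and the raw evidence together.
    accepted = adjudicator(claim, critique, evidence)
    return Verdict(claim=claim, critique=critique, accepted=accepted)
```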

[Figure: three-role adversarial verification. A researcher (model family A) proposes a hypothesis, a critic (model family B, different from A) attacks it with the strongest opposing reading, and an adjudicator reconciles both into a verified claim. Because the bias profiles of A and B differ, errors do not compound.] Three agents, two model families, one adjudicator. The critic role is the load-bearing one. Without it, the researcher’s bias just compounds.

We use a variant of this shape in our prompt-canary allocator and our matroid style resolver: for any non-trivial prompt choice, the prompt that wins has to beat the runner-up across two different scoring models, not one. The adversarial framing is the same principle in a different costume.
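The two-scorer rule can be illustrated with a small selection function. This is a simplified sketch and not the allocator itself: `pick_prompt` is a hypothetical name, the scorers stand in for two different scoring models, and "beating the runner-up" is reduced here to ranking first under every scorer.

```python
from typing import Callable, Optional, Sequence


def pick_prompt(
    candidates: Sequence[str],
    scorers: Sequence[Callable[[str], float]],
) -> Optional[str]:
    """Return the candidate ranked first by every scorer, or None if they disagree.

    Assumes a non-empty candidate list; ties within one scorer go to the
    earliest candidate, which is how max() behaves.
    """
    winners = {max(candidates, key=score) for score in scorers}
    return winners.pop() if len(winners) == 1 else None
```

Returning `None` on disagreement forces the caller to decide what happens next (keep the incumbent, gather more samples) instead of silently trusting one scorer.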

Three agents is the working compromise on debate sizing, but nobody in the literature has a principled bound. Two agents are not enough — the critic just becomes a yes-and-but reader. Five agents are too expensive — each round multiplies cost.

Self-evolution is where the field is actually moving

If you sample the last six months of deep research papers, the largest single category is self-evolving agents. Gao et al.’s 77-page survey (Jul 2025, updated Jan 2026) carved it into three axes that have become the de facto framework: what evolves, when it evolves, how it evolves.

What: model weights / memory / tool library / architecture itself. Most papers focus on memory and tool library because those are cheap to evolve at runtime.
When: intra-test-time (during a single run) versus inter-test-time (between runs). Intra is harder; inter is where most production wins are.
How: scalar rewards (RL-style), textual feedback (LLM-as-judge), multi-agent learning loops, distillation from successful traces.

Self-evolution without a calibrator amplifies the agent’s existing biases; it rewrites its memory using the same biased priors that produced the original output, so the error compounds rather than corrects. That is the headline finding of Your Agent May Misevolve (Sep 2025), the paper that matters most for anyone shipping this in production. The optimistic framing — self-evolving agents close the loop the rest of the pipeline leaves open, the agent gets better at deep research by doing deep research — does not survive contact with that result.

This is the same failure mode the CAM post names for constructivist memory systems — when an LLM writes its own memory, it inherits the biases it had at write time. The convergent fix across both literatures is a write-time calibrator that scores each new memory or skill for bias risk before it lands in the durable store.
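A write-time calibrator amounts to a gate in front of the durable store. The sketch below assumes a hypothetical `bias_risk` scorer that returns a value between 0.0 and 1.0 and an arbitrary `max_risk` threshold; how that scorer is built is the hard part and is not shown.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class CalibratedMemory:
    bias_risk: Callable[[str], float]  # hypothetical scorer: 0.0 (safe) to 1.0 (risky)
    max_risk: float = 0.5              # arbitrary threshold for illustration
    durable: List[str] = field(default_factory=list)
    quarantined: List[str] = field(default_factory=list)

    def write(self, entry: str) -> bool:
        """Store the entry only if it passes the gate; report what happened."""
        if self.bias_risk(entry) <= self.max_risk:
            self.durable.append(entry)
            return True
        # Rejected entries are kept aside for review rather than dropped.
        self.quarantined.append(entry)
        return False
```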

The write-time calibrator guards one layer of the control loop, but the agentic-RAG loop has further failure modes that compound silently at query time, most of which a smarter LLM cannot fix.

Agentic RAG: where the failure modes hide

Yadav and colleagues’ SoK: Agentic RAG (Mar 2026) is the clearest current systematization of how retrieval-augmented agents fail in production. They formalize the agentic-RAG loop as a finite-horizon partially observable Markov decision process — useful as a frame even if you do not work in that vocabulary — and identify four failure modes that compound:

Compounding hallucination propagation: an early hallucinated retrieval shapes the next query, which retrieves something that confirms the first hallucination. The error gets harder to detect with each turn.
Memory poisoning: when the agent writes its retrievals back into a long-term memory, low-quality retrievals contaminate future queries. This is the same hazard CAM closes via its calibrator and FadeMem closes via decay.
Retrieval misalignment: the agent’s query rewriter drifts from the user’s original intent. Each rewrite carries a small bias; chained, they exit the user’s question.
Cascading tool-execution vulnerabilities: when tools call other tools, an error in a leaf tool produces a confident wrong result that the parent agent treats as ground truth.

Three of these four are not failures of the LLM. They are failures of the control loop the agent runs inside. The SoK framing is useful because it stops you from trying to fix them by making the LLM smarter — they are scaffolding bugs, not model bugs.
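Retrieval misalignment is a good example of a scaffolding-level fix: compare every rewritten query against the user's original question, not against the previous rewrite, so small biases cannot chain. The sketch below uses Jaccard token overlap as a deliberately crude stand-in for a semantic similarity measure, and the `min_overlap` threshold is an arbitrary assumption; neither comes from the SoK paper.

```python
def jaccard(a: str, b: str) -> float:
    """Token-set overlap between two strings, in [0.0, 1.0]."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / len(ta | tb) if ta | tb else 1.0


def guard_rewrite(original: str, rewrite: str, min_overlap: float = 0.3) -> str:
    """Accept a rewrite only if it still overlaps with the original question."""
    return rewrite if jaccard(original, rewrite) >= min_overlap else original


kept = guard_rewrite("effects of caffeine on sleep", "caffeine effects on sleep quality")
rejected = guard_rewrite("effects of caffeine on sleep", "best coffee machines 2024")
```

Here `kept` is the rewrite, because it shares four of six distinct tokens with the original, and `rejected` falls back to the original question because it shares none.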

The scaffolding-bug framing drove most of the architectural choices we made at LibWit, some of which held up in production and some of which we dropped after watching them compound the very failures they were meant to prevent.

What we shipped, what we skipped

All three deep-research-shaped agents we built at LibWit (chapter generation, cross-document citation detection, and the in-document chat assistant) started as flat retrieve-then-synthesize loops and converged toward the four-stage shape.

What we shipped	Where it lives
Four-stage pipeline	All three agents. Plan / explore / synthesize / report stages are explicit and individually observable.
Static + dynamic mix	Static plan templates for repeatable queries (chapter generation), dynamic agent control for open exploration (chat)
Multi-agent verification	Cross-model verifier in our matroid resolver and the canary allocator
Memory calibrator	CAM-style bias gate at memory-write time
Substrate separation	Context engine composes per-feature; prompt harness handles routing; agent run substrate handles pause-resume-steer

What we tried and dropped	Why
Five-agent debate	Cost climbed faster than quality. Three-agent (proposer + critic + adjudicator) is the sweet spot.
Pure browser retrieval	Worked well for one domain, brittle across domains. Switched to API-first with browser as fallback.
Full intra-test-time self-evolution	Per Your Agent May Misevolve, the bias-amplification risk on short-horizon evolution is too high. We only evolve between runs, never inside a run.
Single-shot planning	Plans built without retrieval evidence drift fast. Now plans go through one retrieval pass before being committed.
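The API-first, browser-as-fallback arrangement from the table above can be sketched as a thin wrapper. The function and parameter names are hypothetical and the two retrievers are passed in as callables; this illustrates the pattern and is not the production code.

```python
from typing import Callable, List


def retrieve(
    query: str,
    api_search: Callable[[str], List[str]],
    browser_search: Callable[[str], List[str]],
    min_results: int = 1,  # hypothetical threshold for "the API found enough"
) -> List[str]:
    try:
        results = api_search(query)
    except Exception:
        # Broad catch for brevity: rate limits and timeouts both mean "fall back".
        results = []
    if len(results) >= min_results:
        return results
    return browser_search(query)
```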

This post is the meta-view that ties the individual substrate implementation pieces (linked above) together.

Where the literature still cannot agree

If you read all five surveys looking for the open questions, three themes stay unresolved.

How to evaluate. Every survey flags this. The current benchmarks (DeepResearch-Bench, BrowseComp, GAIA, Humanity’s Last Exam) optimize for plausible-looking reports, not for grounded ones. Newer rubrics try to score citation faithfulness and methodology validity, but those are themselves AI-evaluated, which is the metric problem in miniature. Evaluating Deep Research Agents via Academic Survey Tasks (Aug 2025) is the most recent honest attempt; the headline finding is that humans still detect failures the AI evaluators miss.
How to budget cost. Dynamic workflows let cost balloon when the agent re-plans aggressively. Static workflows under-perform on the questions where re-planning would have helped. The compromise — token budgets per stage with hard fail-over — is in production at most shops, but the principled answer is still open.
How to handle adversarial inputs. Hierarchical Attacks for Multi-Modal Multi-Agent Reasoning and Lying with Truths both show that multi-agent setups are more vulnerable to coordinated manipulation than single-agent ones — the same coordination that helps verification can be exploited. This is an open problem.
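The cost compromise mentioned above, per-stage token budgets with a hard fail-over, is easy to sketch even though the principled sizing question is open. `StageBudget`, `BudgetExceeded`, and the example limit are assumptions made for this illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


class BudgetExceeded(RuntimeError):
    """Signals the caller to fail over, for example by reporting with what it has."""


@dataclass
class StageBudget:
    limits: Dict[str, int]                          # stage name -> token cap
    spent: Dict[str, int] = field(default_factory=dict)

    def charge(self, stage: str, tokens: int) -> int:
        """Record token usage for a stage and return the tokens remaining."""
        used = self.spent.get(stage, 0) + tokens
        if used > self.limits[stage]:
            # The charge is rejected, so the recorded spend stays unchanged.
            raise BudgetExceeded(f"{stage}: {used} > {self.limits[stage]} tokens")
        self.spent[stage] = used
        return self.limits[stage] - used
```

A dynamic workflow would call `charge` before each re-plan and catch `BudgetExceeded` at the pipeline level, which turns runaway re-planning into a bounded, observable event.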

These three together mean the field is roughly where databases were in the mid-1990s — the architecture is agreed, the implementations are converging, but the operational discipline (eval, cost, adversarial robustness) is still being figured out in production.

What to take from this regardless of who wins

Three things the convergence makes obvious that no single paper can argue alone:

If you are building a deep research agent and you do not have an explicit four-stage pipeline, you are doing it wrong. The cost of forcing the shape on a one-shot RAG is two days of refactoring. The cost of not having it is unbounded confabulation on multi-hop queries.
Verification is not optional. Adversarial multi-agent debate is the cheapest non-optional addition you can make. Three agents, two model families, one adjudicator. The improvement on grounded-correctness is large enough that the literature has stopped arguing about whether to include it and now argues about how to size it.
Self-evolution is interesting but dangerous. The architecture works. The bias-amplification risk is real. The production-safe move is inter-run evolution with explicit calibration; intra-run self-modification is research-only until the calibration story matures.

The honest closing observation is that none of this was discoverable from any one paper. It is the shape that emerged when 170 of them were stacked. The shape is not radical, it is not surprising, and it is not even particularly clever. It is what falls out of trying to make agents do research and watching them fail until they don’t. The literature is now ahead of most production deployments — if you are still running a one-shot RAG and calling it a research agent, the answer is not in another paper. The answer is in implementing the convergent shape, watching it fail differently, and shipping the next iteration.