llm-as-judge ·multi-judge-panel ·rubric ·evaluation ·rulers ·krippendorff ·arxiv-research

Three LLM judges, but really 1.5: why a same-family panel collapses to noise

Jun 4, 2026 · 8 min read

Two weeks ago I wrote the post-mortem on a five-agent code-review pipeline that converged on a clean, confident verdict — and turned out to be three Claudes in a trench coat. The unanimous validators were running the same internal heuristics. The agreement signal that looked like high confidence was correlated error. That post ended on the rule: if you want a panel, you need cross-family judges or you have nothing.

This is the post where I actually built one and ran it.

Two LLM reviews of the same architecture doc had landed on opposite verdicts. One said the proposal was structurally sound and ready to ship. The other said it was conflating two unrelated concerns and needed to be split. They disagreed on five of the eight dimensions a reasonable reviewer would care about. I needed a tiebreaker, and after the validator-bias write-up I couldn’t justify asking a third Claude. LLM-as-a-judge is the standard playbook for this — run a panel, score independently, average the votes — and the playbook only works if the panel is actually a panel.

Nine Judges, Two Effective Votes (May 2026) measures the same thing the validator-bias post described in prose. When you run N LLM judges over the same evaluation task, the effective independent vote count is closer to two than to nine. The judges’ errors are correlated. Three judges from the same family give you about the same signal as one of them, paid for triple. A panel is a statistical instrument, not a vibe check. The question isn’t “should I run a panel?” — it’s “what is the smallest panel that buys me ≥ 2 independent votes?”

[Figure: hand-drawn diagram on cream paper. Three judge boxes stacked on the left and enclosed by a dotted oval (J1 GLM, J2 KIMI, J3 MINIMAX) each send an arrow to a central LOCKED RUBRIC box, marked with eight ticks for the eight dimensions and a padlock. Two arrows lead from the rubric to boxes labelled REVIEW A and REVIEW B. Red annotations read 'rho approx 0.1 cross-family' and 'N_eff approx 2.5'.]

Caption: Three judges, three families, one locked rubric. The dotted oval is what makes the panel a panel; without it you have three Claudes in a trench coat.

The 1.5-vote panel

With N judges and average pairwise error-correlation $\rho$, the effective independent vote count is approximately:

$$N_{\text{eff}} \approx \frac{N}{1 + (N-1)\rho}$$

Plug in N = 3 same-family judges with intra-family $\rho \approx 0.5$ and you get $N_{\text{eff}} \approx 1.5$: pay for three calls, buy one and a half votes. Plug in N = 3 distinct-family judges with cross-family $\rho \approx 0.1$ and you get $N_{\text{eff}} \approx 2.5$. The cost is identical; the effective vote count is 67 percent higher ($2.5 / 1.5 \approx 1.67$). Nasser et al. 2026 (arxiv:2601.05114) measured an inter-judge Krippendorff’s α of $0.042$ across nine LLM judges over 3,240 evaluations, with agreement below chance on two dimensions. The judges weren’t independent; they were running the same internal heuristics in lockstep.
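The arithmetic is small enough to check yourself. Here is a minimal sketch of the formula, plus a brute-force answer to the "smallest panel that buys two votes" question. The function names `n_eff` and `smallest_panel` are mine, not from any paper, and the ρ values are the rough figures used in this post rather than measurements of any particular model.

```python
from typing import Optional


def n_eff(n: int, rho: float) -> float:
    """Effective independent votes for n judges with average pairwise error correlation rho."""
    return n / (1 + (n - 1) * rho)


def smallest_panel(target: float, rho: float, max_n: int = 100) -> Optional[int]:
    """Smallest n with n_eff(n, rho) >= target, or None if no n up to max_n gets there."""
    for n in range(1, max_n + 1):
        if n_eff(n, rho) >= target:
            return n
    return None


print(n_eff(3, 0.5))              # same-family panel: 1.5
print(round(n_eff(3, 0.1), 2))    # cross-family panel: 2.5
print(smallest_panel(2.0, 0.1))   # 3
print(smallest_panel(2.0, 0.5))   # None
```

The last line is the uncomfortable one. As N grows, $N_{\text{eff}}$ approaches $1/\rho$, so at $\rho = 0.5$ the ceiling is two votes and no finite same-family panel ever reaches it.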

Design rule: three judges, three distinct model families. Not three sizes of the same model. Not three checkpoints of the same family. I picked one model each from three labs with Anthropic-compatible inference endpoints. The selection is incidental; what matters is no two judges share more than a base architecture.

Five more ways the panel lies to you

Cross-family judging is necessary but not sufficient. Five other failure modes compound. Each has a citable fix.

Anchor hallucination. Judges fabricate file:line citations they can’t actually see. I’ve watched two of three model families hallucinate path errors inside a review whose explicit topic was path-hallucination. Fix per CiteAudit: every judge spot-checks a sample of citations from the review under scrutiny and docks the score on failure.

Sycophancy. Judges agree with whatever framing they’re asked to evaluate. Beyond Consensus names this “agreeableness bias.” Fix: an anti-sycophancy clause — name at least one disagreement per review with a cited line. Fail to find one and the bias-resistance dimension is docked.

Position bias. Judges anchor on whichever item is presented first. “Position Bias in Rubric-Based LLM Judge” measures it even with locked rubrics. Fix: randomize A/B labels per judge. If a judge always rates “A” higher regardless of content, you’ve found a position-bias fingerprint and you can throw out that judge.

Rubric drift. The same prompt gives different scores across runs because the LLM re-derives the criteria internally. RULERS pins three fixes: locked rubric text, evidence-anchored scoring, fixed scale. Don’t ask the judge to “consider the criteria” — give them the verbatim anchor table.

Scale calibration. Judges disagree more on 0-100 scales than 0-5 scales — higher granularity invites hallucinated precision. 0-5 Grading Scale Highest H-LLM Alignment measured this against humans: 0-5 wins, 0-100 is noise. Fix: every dimension on 0-5 with anchor sentences per score.
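Two of these fixes are mechanical enough to sketch: locking the rubric text and randomizing A/B labels per judge. What follows is an illustration under stated assumptions, not the harness I ran. The anchor sentences are placeholders, and `render_rubric`, `rubric_fingerprint`, `assign_labels`, and `unblind` are hypothetical helper names.

```python
import hashlib
import random

SCALE = range(0, 6)  # fixed 0-5 scale, one anchor sentence per score

# Placeholder rubric. Anchor wording is invented for illustration only.
RUBRIC = {
    "Empirical Validation": {s: f"<anchor sentence for score {s}>" for s in SCALE},
}


def render_rubric(rubric):
    """Render the rubric as verbatim text to paste into every judge prompt."""
    lines = []
    for dim in sorted(rubric):
        lines.append(dim)
        lines.extend(f"  {s}: {rubric[dim][s]}" for s in SCALE)
    return "\n".join(lines)


def rubric_fingerprint(rubric):
    """Hash of the rendered text, so any edit between runs is detectable."""
    return hashlib.sha256(render_rubric(rubric).encode("utf-8")).hexdigest()


def assign_labels(judge, review_ids, seed):
    """Map blind labels 'A'/'B' to review ids, shuffled reproducibly per judge."""
    rng = random.Random(f"{seed}:{judge}")
    ids = sorted(review_ids)
    rng.shuffle(ids)
    return dict(zip("AB", ids))


def unblind(scores_by_label, labels):
    """Translate a judge's {'A': score, 'B': score} back to real review ids."""
    return {labels[label]: score for label, score in scores_by_label.items()}


for judge in ["judge_1", "judge_2", "judge_3"]:
    print(judge, assign_labels(judge, ["review_x", "review_y"], seed=2026))
print(rubric_fingerprint(RUBRIC)[:12])
```

A random shuffle does not guarantee that any judge sees the swapped order. With only three judges, it may be simpler to assign the swap to one of them deliberately and log which one.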

I ran the panel. Here’s what it said.

The three judges came back with a unanimous verdict. Review A scored 37.67 out of 40. Review B scored 30.33. The margin — Δ = 7.33 on the 0-40 scale — was wider than the per-dimension variance, which means the rubric and the content overwhelmed every bias I’d built countermeasures for. All six countermeasures fired correctly. None of them changed the outcome.

The interesting line is the standard deviation across judges: ≤ 0.58 on every dimension. That’s tight. Nasser’s α of $0.042$ was measured on subtle preference ratings — the close calls where cross-family panels really do fragment. When the underlying quality gap is wide, families agree fast and the variance flags I planned to use as a diagnostic produce no signal because there’s nothing to disagree about. The thing the heterogeneous panel buys you most isn’t a stronger verdict on close calls — it’s an honest variance signal on the close calls and faster convergence on the easy ones.
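For a sense of how tight 0.58 is, here is a rough sketch of the per-dimension spread check. The scores below are made up, the dimension names are placeholders, and the flag threshold of 1.0 is an arbitrary choice of mine. It assumes integer 0-5 scores and a sample standard deviation.

```python
from statistics import mean, stdev

# Hypothetical 0-5 scores: dimension -> one score per judge.
panel = {
    "dimension_1": [5, 5, 4],
    "dimension_2": [4, 4, 4],
    "dimension_3": [5, 2, 4],
}


def summarize(panel, flag_above=1.0):
    """Per-dimension mean and sample standard deviation, flagging wide spreads."""
    report = {}
    for dim, vals in panel.items():
        sd = stdev(vals)
        report[dim] = {
            "mean": round(mean(vals), 2),
            "sd": round(sd, 2),
            "flag": sd > flag_above,
        }
    return report


for dim, row in summarize(panel).items():
    print(dim, row)
```

Under those assumptions, three judges scoring 5, 5, 4 give a standard deviation of about 0.58. A ceiling of 0.58 is therefore what you would see if, on every dimension, at most one judge differed from the other two by a single point.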

The diagnostic I did get was elsewhere. The biggest single dimension gap was Empirical Validation — the one that rewards a judge for citing a real production query or runtime artifact over a theoretical claim. Review A scored +2.67 over Review B on that dimension alone. A 30-second SQL query against production data was worth more to all three judges than fifteen minutes of code-anchor citation work. If you want a panel to grade your review high: run one query. Just one. The empirical anchor outweighs every other dimension by a factor I didn’t expect.

The position-bias countermeasure was the other interesting non-result. One judge saw Review B as “Review A” — labels swapped. If position bias had been load-bearing at this gap, that judge would have favored the swapped-A. It didn’t. The mitigation worked and wasn’t needed. I’d write it in again anyway — finding out which countermeasures aren’t needed is itself a panel output.

The bias I baked in

There is an obvious problem with the verdict I just reported. I designed the rubric. I also wrote one of the two reviews under evaluation.

Every one of the eight dimensions has academic grounding — RULERS for locking the text, Beyond Consensus for the anti-sycophancy clause, CiteAudit for the spot-check mandate, the 0-5 paper for the scale. The literature is real. But picking which eight dimensions matter is itself a choice, and my choice happened to weight Empirical Validation (the dimension where my review scored highest) and Strategy Completeness (likewise) as a quarter of the total score.

A rubric weighted differently (heavier on Engineering-Seam Precision, say, or on Parser-Architecture Correctness) might have produced an equally unanimous verdict in the other direction. The reviews didn’t change. The rubric chose the winner.
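A toy example makes the mechanism concrete. The scores and weights below are invented for illustration and are not the panel's actual numbers. Only the four dimension names come from the rubric, and `weighted_total` and `winner` are hypothetical helpers.

```python
# Hypothetical 0-5 scores on four of the dimensions (not the real panel output).
REVIEW_A = {
    "Empirical Validation": 5,
    "Strategy Completeness": 5,
    "Engineering-Seam Precision": 3,
    "Parser-Architecture Correctness": 4,
}
REVIEW_B = {
    "Empirical Validation": 2,
    "Strategy Completeness": 4,
    "Engineering-Seam Precision": 5,
    "Parser-Architecture Correctness": 5,
}

# Two hypothetical weightings of the same dimensions.
EQUAL = {dim: 1 for dim in REVIEW_A}
SEAM_HEAVY = dict(EQUAL)
SEAM_HEAVY["Engineering-Seam Precision"] = 2
SEAM_HEAVY["Parser-Architecture Correctness"] = 2


def weighted_total(scores, weights):
    """Sum of score times weight over every dimension."""
    return sum(scores[dim] * weights[dim] for dim in scores)


def winner(weights):
    """Return 'A', 'B', or 'tie' under the given weighting."""
    a = weighted_total(REVIEW_A, weights)
    b = weighted_total(REVIEW_B, weights)
    return "A" if a > b else "B" if b > a else "tie"


print(winner(EQUAL))       # A (17 vs 16)
print(winner(SEAM_HEAVY))  # B (24 vs 26)
```

The same scores produce opposite winners. Nothing about the judges or the reviews changed between the two calls, only the weights.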

This is the rubric-selection bias I didn’t see coming, and it’s the load-bearing caveat on the whole exercise. The honest framing of the result is: Review A is better by the rubric this panel applied. Not: Review A is better. A different rubric author would have produced a different panel.

The fix is the same shape as every other fix in this post: break the circularity by removing the author. Either a second LLM designs the rubric, or a human does, but the rubric design and the artifact authorship cannot live in the same head. I’ll do this next time. I’m flagging it now because I think rubric-selection bias is the next failure mode the eval literature will have to catch up to. The moment LLM-as-a-judge becomes the standard playbook, rubric design becomes the new place where the bias hides.

So: the panel works. The six countermeasures fire correctly. The math says you should pay for distinct families. And the entire setup is one rubric-design move away from being a beautifully-instrumented yes-man for whoever holds the pen. If your eval pipeline has the same author writing both the artifacts and the rubrics, you have a panel that always agrees with you — three times, in three different model families, with academic citations.

That might actually be the most expensive way to fool yourself the eval literature has produced.
