What was measured
A controlled ablation. A single real TypeScript codebase (~35 source files) is held constant. The only
variable is how the project's rules are written down. Every format encodes the identical
rule set; semantic parity is machine-checked. The subject models are autonomous agents that grep, read,
edit, and run npm test / tsc.
The codebase contains planted traps a careless edit hits: deep-import boundary violations, dependency cycles, cross-event data leaks, cache-header ordering, AI hallucination from marketing copy, plus 10 hard-correctness traps (TOCTOU double-booking, floating-point money, DST conversion, Unicode folding, cursor pagination, idempotency, IDOR, ReDoS, illegal state transitions, unbounded caches).
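To make one of these traps concrete, here is a minimal sketch of the floating-point money trap. The function names (naiveTotal, totalInCents, formatCents) are hypothetical and are not taken from the benchmark codebase; the sketch assumes amounts can be stored as integer cents.

```typescript
// Hypothetical illustration of the "floating-point money" trap.
// None of these functions come from the benchmark codebase.

// Trap: summing decimal amounts as binary floats accumulates rounding error.
function naiveTotal(prices: number[]): number {
  return prices.reduce((sum, p) => sum + p, 0);
}

// Safer: keep amounts as integer cents and format only at the edge.
function totalInCents(pricesInCents: number[]): number {
  for (const p of pricesInCents) {
    if (!Number.isInteger(p)) {
      throw new Error(`not an integer cent amount: ${p}`);
    }
  }
  return pricesInCents.reduce((sum, p) => sum + p, 0);
}

// Convert integer cents to a display string with two decimals.
function formatCents(cents: number): string {
  return (cents / 100).toFixed(2);
}

console.log(naiveTotal([0.1, 0.2]) === 0.3);        // false
console.log(totalInCents([10, 20]) === 30);         // true
console.log(formatCents(totalInCents([10, 20])));   // "0.30"
```

An edit that "simplifies" the cents representation back to plain decimals would compile and might pass a casual visible test, which is the kind of careless change a trap like this is meant to catch.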
How it was scored (deterministic & tamper-proof)
- Hidden tests - injected after the agent finishes, run, then removed. The agent never sees them, so it cannot overfit to them. This is the primary correctness signal.
- Architecture-violation check - a diff against the pristine tree detects forbidden imports and dependency cycles.
- tsc --noEmit, a gzip bundle-budget gate, and rubric probes.
- Final score (0-100) is weighted: visible 20 / hidden 30 / architecture 20 / obedience 15 / perf-security 10 / minimality 5. No LLM judge is in the score.
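The weighting above can be sketched as a plain weighted sum. The weights are the ones listed; the assumption that each component is normalized to the range 0 to 1 before weighting is ours, and the names Component, WEIGHTS, and finalScore are hypothetical.

```typescript
// Sketch of the weighted final score. Weights come from the list above;
// the 0..1 normalization of each component is an assumption.
type Component =
  | "visible"
  | "hidden"
  | "architecture"
  | "obedience"
  | "perfSecurity"
  | "minimality";

const WEIGHTS: Record<Component, number> = {
  visible: 20,
  hidden: 30,
  architecture: 20,
  obedience: 15,
  perfSecurity: 10,
  minimality: 5,
};

function finalScore(components: Record<Component, number>): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as Component[]) {
    const value = components[key];
    if (value < 0 || value > 1) {
      throw new RangeError(`${key} must be in [0, 1], got ${value}`);
    }
    total += WEIGHTS[key] * value;
  }
  return total;
}

// A run that is perfect everywhere except passing half of the hidden tests:
const example = finalScore({
  visible: 1,
  hidden: 0.5,
  architecture: 1,
  obedience: 1,
  perfSecurity: 1,
  minimality: 1,
});
console.log(example); // 85
```

Because hidden tests carry the largest weight, a run that only satisfies the visible tests loses more points than any other single failure mode in this scheme.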
Headline results (powered to n=60)
Mean final-score on the discriminating original-10 suite. arch-viol = % of runs that
crossed a forbidden import boundary (lower is better).
Claude Sonnet 4.6 (stronger tier)
| Format | mean | pass@1 | hidden | arch-viol |
|---|---|---|---|---|
| 🧬 aigx_terse | 95.4 | 0.92 | 98.6% | 8% |
| md | 95.1 | 0.80 | 96.4% | 0% |
| exifai_v2 | 94.6 | 0.80 | 96.1% | 3% |
| aigx_v9 | 93.6 | 0.77 | 94.3% | 10% |
| xml | 93.1 | 0.80 | 93.8% | 13% |
Claude Haiku 4.5 (weaker tier)
| Format | mean | pass@1 | hidden | arch-viol |
|---|---|---|---|---|
| 🧬 aigx_terse | 93.5 | 0.78 | 96.0% | 7% |
| aigx_v9 | 92.8 | 0.70 | 92.6% | 5% |
| exifai_v2 | 92.4 | 0.67 | 90.2% | 0% |
| xml | 92.3 | 0.75 | 93.3% | 8% |
| md | 92.2 | 0.70 | 93.6% | 10% |
AIGX (the aigx_terse variant) ranks nominally first on mean, pass@1, and hidden-test pass rate on both models, but the honest story is consistency, not margin. Markdown is second by mean on Sonnet yet last on Haiku; XML is last by mean on Sonnet but more competitive on Haiku. aigx_terse is the only format first on both tiers.
The statistics, stated plainly. At n=60 the top cluster (AIGX, Markdown, EXIFAI-v2) is a statistical tie on the mean. We do not claim a significant margin. AIGX's defensible wins are cross-model consistency, robustness under challenge, simplicity, and being the only format measured at all.
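As a rough illustration of why small gaps are hard to call at n=60, the standard error of a difference between two pass rates can be computed directly from the rates. This is a sketch using the normal approximation for two independent proportions. It assumes 60 independent runs per format and is not necessarily the test used in the benchmark; the name passRateGap and the input rates are illustrative.

```typescript
// Normal-approximation sketch for comparing two pass rates.
// Assumes independent runs; not necessarily the benchmark's own test.
interface GapEstimate {
  diff: number;            // p1 - p2
  stdErr: number;          // standard error of the difference
  ci95: [number, number];  // approximate 95% interval for the difference
}

function passRateGap(p1: number, n1: number, p2: number, n2: number): GapEstimate {
  const stdErr = Math.sqrt((p1 * (1 - p1)) / n1 + (p2 * (1 - p2)) / n2);
  const diff = p1 - p2;
  const half = 1.96 * stdErr;
  return { diff, stdErr, ci95: [diff - half, diff + half] };
}

// Two hypothetical formats five points apart on pass@1, 60 runs each:
const gap = passRateGap(0.8, 60, 0.75, 60);
const includesZero = gap.ci95[0] <= 0 && gap.ci95[1] >= 0;
console.log(gap, includesZero);
```

With these inputs the interval spans zero, which is consistent with treating gaps of a few points as ties at this sample size.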
The challenger log - we tried hard to beat it
After AIGX won the comparison, we ran a deliberate campaign to beat the winning design: ~24 challenger variants across 6 research rounds, covering in-source guards, positional tricks (primacy/recency), salience ladders, positive re-framing, 10 prose re-renderings, and combinations of the two best ideas. Every one tied or lost. Two nominal challengers that looked good at n=30 both collapsed to a tie, and fell below aigx_terse on hidden-test pass rate and pass@1, when powered to n=60.
The honest caveats
- At matched power, the top formats are close. AIGX is not a 20-point blowout. Its win is that it is nominally first on mean, pass@1, and hidden-test pass rate on both models (not on arch-viol), survived every challenger, and is the simplest to author.
- One codebase, one task family. The absolute numbers are specific to this app.
- Two models + one single-shot model. Broad for a study of this kind, not universal.
- The residual is model capability. Past a point, better docs cannot fix harder problems.
The full methodology, raw data, single-shot Gemini phase, and round-by-round challenger table are the canonical record: BENCHMARK.md on GitHub. Independent replication is welcome - open a discussion.