# The AIGX Benchmark - Method, Results, and Honest Caveats

**Canonical:** https://aigx.dev/benchmark · **Full record:** https://github.com/Lolner95/AIGX/blob/main/BENCHMARK.md

> The claim: with a codebase's rules held identical and only the context format changing, AIGX produces the
> most correct and most disciplined AI-agent output - and it is the only context format we know of validated
> this way.

## 1. What was measured

A controlled ablation. A single real TypeScript codebase ("SourcingGPT", ~35 source files) is held
constant. The only variable is how the project's rules are written down. Every format encodes the identical
rule set; semantic parity is machine-checked. The subject models are autonomous agents that grep, read,
edit, and run `npm test` / `tsc`.

Formats compared: Markdown, Cursor-style MDC, YAML, XML, five in-source/sidecar "EXIFAI" variants, and the
AIGX family - 18+ distinct encodings of the same rules.

Subject models: Claude Haiku 4.5 (weaker tier), Claude Sonnet 4.6 (stronger tier), and Gemini
(single-shot full-rewrite mode, earlier phase).

The traps that discriminate: deep-import boundary violations, dependency cycles, cross-event data leaks,
cache-header ordering, AI hallucination from marketing copy, bundle-budget regressions, stale-doc
conflicts, plus 10 hard-correctness traps: TOCTOU double-booking, floating-point money, DST time-zone
conversion, Unicode folding, cursor pagination, idempotency, IDOR authorization, ReDoS, illegal state
transitions, and unbounded caches.
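To make one of these traps concrete, here is a minimal sketch of the floating-point money problem. It is a hypothetical illustration, not code from the benchmark suite; the function names `totalNaive` and `totalInCents` are invented for this example.

```typescript
// Hypothetical illustration of the "floating-point money" trap.
// Not taken from the benchmark suite.

// Naive version: sums prices as binary floating-point numbers.
function totalNaive(prices: number[]): number {
  return prices.reduce((sum, p) => sum + p, 0);
}

// Safer version: convert each price to integer cents before summing.
function totalInCents(prices: number[]): number {
  return prices.reduce((sum, p) => sum + Math.round(p * 100), 0);
}

const cart = [0.1, 0.2];
console.log(totalNaive(cart) === 0.3);  // false: 0.1 + 0.2 is 0.30000000000000004
console.log(totalInCents(cart) === 30); // true
```

A test that only checks a rounded display string can pass with the naive version, which is the kind of gap a stricter equality check is likely to expose. Rounding at the boundary is itself lossy for some inputs, so storing amounts as integer cents from the start is usually the more robust design.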

## 2. How it was scored (deterministic & tamper-proof)

- **Visible oracle tests** - the task's acceptance tests.
- **Hidden tests** - injected after the agent finishes, run, then removed. The agent never sees them, so it
  cannot overfit to them. This is the primary correctness signal.
- **Architecture-violation check** - a pristine diff vs the seed detects forbidden imports / cycles.
- **`tsc --noEmit`**, a gzip bundle-budget gate, and rubric probes.
- Final score (0-100) is weighted: visible 20 / hidden 30 / architecture 20 / obedience 15 / perf-security
  10 / minimality 5. No LLM judge is in the score.
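The weighting above can be sketched as a small function. The weights are the ones listed; everything else is an assumption made for illustration, in particular that each component is reported as a fraction in [0, 1] and that the names `ComponentScores` and `finalScore` match nothing in the real harness.

```typescript
// Weights copied from the scoring description; they sum to 100.
const WEIGHTS = {
  visible: 20,
  hidden: 30,
  architecture: 20,
  obedience: 15,
  perfSecurity: 10,
  minimality: 5,
} as const;

type Component = keyof typeof WEIGHTS;

// Assumption: each component is reported as a fraction in [0, 1].
type ComponentScores = Record<Component, number>;

function finalScore(scores: ComponentScores): number {
  let total = 0;
  for (const key of Object.keys(WEIGHTS) as Component[]) {
    const s = scores[key];
    if (s < 0 || s > 1) throw new RangeError(`${key} must be in [0, 1]`);
    total += WEIGHTS[key] * s;
  }
  return total;
}

// Hypothetical run: everything perfect except half of the hidden tests fail.
const example: ComponentScores = {
  visible: 1, hidden: 0.5, architecture: 1, obedience: 1, perfSecurity: 1, minimality: 1,
};
console.log(finalScore(example)); // 85
```

Because hidden tests carry the largest weight, failing half of them costs 15 points in this sketch, more than failing any other single component by the same fraction.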

## 3. Headline results (powered to n=60)

Mean final score on the discriminating original-10 suite. `hidden` = hidden-test pass rate; `arch-viol` = %
of runs that crossed a forbidden import boundary (lower is better).

### Claude Sonnet 4.6

| Format        | mean | pass@1 | hidden | arch-viol |
|---------------|------|--------|--------|-----------|
| aigx_terse    | 95.4 | 0.92   | 98.6%  | 8%        |
| md            | 95.1 | 0.80   | 96.4%  | 0%        |
| exifai_v2     | 94.6 | 0.80   | 96.1%  | 3%        |
| aigx_v9       | 93.6 | 0.77   | 94.3%  | 10%       |
| xml           | 93.1 | 0.80   | 93.8%  | 13%       |

### Claude Haiku 4.5

| Format        | mean | pass@1 | hidden | arch-viol |
|---------------|------|--------|--------|-----------|
| aigx_terse    | 93.5 | 0.78   | 96.0%  | 7%        |
| aigx_v9       | 92.8 | 0.70   | 92.6%  | 5%        |
| exifai_v2     | 92.4 | 0.67   | 90.2%  | 0%        |
| xml           | 92.3 | 0.75   | 93.3%  | 8%        |
| md            | 92.2 | 0.70   | 93.6%  | 10%       |

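AIGX (`aigx_terse`) ranks nominally first on mean, pass@1, and hidden-test pass rate on both models. The
honest story is consistency, not margin: Markdown is second by mean on Sonnet yet last of the five on Haiku,
while XML is last by mean on Sonnet but second by pass@1 on Haiku. `aigx_terse` is the only format first on
those three metrics on both tiers. It is not first on `arch-viol`: Markdown on Sonnet and EXIFAI-v2 on Haiku
both recorded 0%.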

## 4. The statistics, stated plainly

At n=60 the top cluster (AIGX, Markdown, EXIFAI-v2) is a statistical tie on the mean. This is not a blowout
and we do not pretend otherwise. AIGX's defensible wins are: cross-model consistency (nominally first on both
tiers), robustness (no challenger beat it; every one tied or lost), simplicity, and being the only context
format we know of that has been validated this way.
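To give a feel for why small gaps at n=60 read as ties, here is a rough sketch of the sampling noise on a pass rate. It is not the analysis used for this benchmark: it assumes independent runs, uses the normal approximation to the binomial, and applies to proportions such as pass@1 rather than to the mean final score. The function names are invented for this example.

```typescript
// Standard error of an observed proportion p over n independent runs.
function proportionStdError(p: number, n: number): number {
  if (n <= 0 || p < 0 || p > 1) throw new RangeError("need n > 0 and p in [0, 1]");
  return Math.sqrt((p * (1 - p)) / n);
}

// Approximate 95% interval using the normal approximation, clamped to [0, 1].
// Assumption: runs are independent; the approximation is rough near 0 or 1.
function approx95Interval(p: number, n: number): [number, number] {
  const half = 1.96 * proportionStdError(p, n);
  return [Math.max(0, p - half), Math.min(1, p + half)];
}

const [lo, hi] = approx95Interval(0.8, 60);
console.log(lo.toFixed(2), hi.toFixed(2)); // 0.70 0.90
```

Under these assumptions a pass rate of 0.80 at n=60 carries an interval of roughly plus or minus ten points, so differences of a few points between formats sit well inside the noise.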

## 5. The challenger log - we tried hard to beat it

After AIGX won the comparison, we ran a deliberate campaign to dethrone it: ~24 challenger variants across 6
research rounds - in-source guards, positional tricks (primacy/recency), salience ladders, positive
re-framing, 10 prose re-renderings, and combinations of the two best ideas. Every one tied or lost. Two
nominal challengers that looked good at n=30 both collapsed to a tie, and fell below `aigx_terse` on
hidden-test pass rate and pass@1, when powered to n=60.

## 6. The honest caveats

- At matched power the top formats are close - AIGX is not a 20-point blowout.
- One codebase, one task family. The absolute numbers are specific to this app.
- Two agentic Claude models plus Gemini in single-shot mode - broad for a study of this kind, not universal.
- The residual gap to a perfect score is dominated by genuinely hard tasks (model capability), not format.

## 7. Reproducibility

The harness is a generator + materializer + runner + deterministic scorer. One canonical knowledge base
generates every format's files, so parity is guaranteed by construction and re-checked. Each run is an
isolated workspace. Results are powered to n=60 for every headline; challengers are screened at n=30 then
powered. Independent replication is welcome: https://github.com/Lolner95/AIGX/discussions
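The "one canonical knowledge base, many formats" idea can be sketched as follows. This is a simplified illustration under stated assumptions: the `Rule` shape, both renderers, and the substring-based `hasParity` check are hypothetical, and none of them reproduce the actual AIGX, MDC, or EXIFAI encodings or the harness's real parity checker.

```typescript
// Hypothetical rule shape; the real knowledge-base schema is not shown here.
interface Rule {
  id: string;
  text: string;
}

type Renderer = (rules: Rule[]) => string;

// Two illustrative renderers fed from the same canonical rule list.
// Note: the XML renderer does no escaping, so it only suits plain text.
const renderMarkdown: Renderer = (rules) =>
  rules.map((r) => `- [${r.id}] ${r.text}`).join("\n");

const renderXml: Renderer = (rules) =>
  ["<rules>", ...rules.map((r) => `  <rule id="${r.id}">${r.text}</rule>`), "</rules>"].join("\n");

// Crude parity check: every rule id and text must appear in the output.
function hasParity(rules: Rule[], rendered: string): boolean {
  return rules.every((r) => rendered.includes(r.id) && rendered.includes(r.text));
}

const canonical: Rule[] = [
  { id: "R1", text: "No deep imports across module boundaries" },
  { id: "R2", text: "Store money as integer cents" },
];

for (const render of [renderMarkdown, renderXml]) {
  console.log(hasParity(canonical, render(canonical))); // true for both renderers
}
```

Generating every format from one source means a rule cannot be present in one encoding and missing from another unless a renderer drops it, which is what the re-check is there to catch. A real semantic-parity check would likely need to be stricter than substring matching.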