How the audit works, and what to trust about each verdict.
What VERSIO-AI is, and what it isn’t
VERSIO-AI is a reporting checklist, not a benchmark. It does not re-run the model on the paper's task; it reads what the paper itself discloses (or fails to disclose) about which model variant was tested, when the queries were submitted, how completely the model was elicited, and whether the conclusion sentence kept its subject scoped to those tested models. Every verdict the card renders is sourced from a verbatim substring of the abstract or, when an open-access full text resolves, the body. Verdicts that the paper's text does not support are coded undisclosed, never inferred, even when the inference would be defensible from external metadata such as the publication date.
The audit is editorial in register and methodologist in target. The failure modes it surfaces — a paper testing GPT-3.5 in 2025 without saying so; a comparator set spanning only one tier of the field; a findings sentence whose grammatical subject is bare LLMs when only one was tested — are the choices an attentive reviewer would have flagged in 2018 and mostly stopped flagging by 2024 because the cadence of model releases outran the publication cycle. Frontier Lag, the companion paper, is a structural account of why; VERSIO-AI is what the same extraction looks like on a single paper.
From DOI to card, in five steps
- Identity resolution. The DOI is normalised and looked up via OpenAlex, with fallbacks to Crossref, PubMed (E-utilities), and Semantic Scholar for venues that do not deposit abstracts. The pipeline keeps title, authors, venue, year, publication date, OA URL, and a domain inferred from OpenAlex's topic taxonomy.
- Full-text fetch. The audit attempts six OA paths in sequence (arXiv-by-DOI, OpenAlex's OA URL, Unpaywall, EuropePMC, Semantic Scholar author-deposit, arXiv-by-title) and returns the first parseable PDF or HTML body. When all six fail, the audit binds the abstract instead, with an explicit abstract-only read banner above the verdicts.
- Extraction. The binding text is sent to Claude Opus 4.7 (via OpenRouter) with the v1.4 VERSIO-AI prompt. The prompt encodes the 13-item rubric, six borderline-case rules for the capability-frame judgement, calibration norms for the confidence field, and an explicit verbatim-quote requirement per verdict. Every response is parsed as strict JSON; the model is run at temperature 0.
- Frontier comparison. The tested model is canonicalised against a registry of ~160 named models with release dates, Arena Elo scores, the project's elicitable-capability index (ECI), and Artificial Analysis's Intelligence Index. The audit anchors the tested model against the contemporaneous Arena frontier (the top of the leaderboard the month the evaluation ran) and against today's per-axis frontier. When the paper does not disclose an evaluation date, the audit imputes a publication-bounded eval window and renders the gap as a range.
- Card render. Verdicts, quotes, rationales, the qualitative summary, and the full-vs-abstract disclosure delta (when the binding read is the full text) all flow into the audit card. No persistence; every audit recomputes on visit.
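The full-text fetch in step two is a first-success fallback chain, which can be sketched as below. This is a minimal illustration, not the pipeline's code: the path names mirror the order given above, while `bind_text`, the `fetchers` mapping, and the returned dictionary keys are hypothetical names introduced for the example.

```python
from typing import Callable, Optional

# Order taken from the step description; the identifiers are illustrative.
OA_PATHS = [
    "arxiv_by_doi",
    "openalex_oa_url",
    "unpaywall",
    "europepmc",
    "semantic_scholar_author_deposit",
    "arxiv_by_title",
]

# A fetcher takes a DOI and returns parsed body text, or None on failure.
Fetcher = Callable[[str], Optional[str]]


def bind_text(doi: str, abstract: str, fetchers: dict[str, Fetcher]) -> dict:
    """Return the first full-text body that resolves, else fall back to the abstract."""
    for name in OA_PATHS:
        fetch = fetchers.get(name)
        if fetch is None:
            continue
        try:
            body = fetch(doi)
        except Exception:
            body = None  # a failing path is skipped, never fatal
        if body and body.strip():
            return {"text": body, "source": name, "abstract_only": False}
    # All paths failed: bind the abstract and flag it so the card can show the banner.
    return {"text": abstract, "source": "abstract", "abstract_only": True}
```

The `abstract_only` flag is what would drive the abstract-only read banner in this sketch; how the real card tracks that state is not specified here.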
What “months behind frontier at evaluation” actually means
The frontier gap is the most load-bearing number on the card and the easiest to misread. It measures the tested model against the model that sat at the top of the Chatbot Arena leaderboard the month the evaluation ran, not the latest model that exists today. A paper that tested GPT-4 in March 2024 was not behind frontier; the same paper testing GPT-4 in March 2025 was behind by ~12 months on the same metric.
When the paper does not disclose an evaluation date (Item 3 fail), the audit imputes an eval window from the publication date and tested-model release dates, then renders the gap as a range across that window with an imputed badge. The lower bound assumes the eval ran the moment the tested model became available; the upper bound assumes it ran the day before publication. Both numbers are a floor; the actual eval almost certainly ran earlier than the publication date, since papers spend months in review, which would widen the gap.
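The imputed window can be sketched as follows, assuming the window opens on the tested model's release date and closes the day before publication, as described above. `ImputedGap`, `impute_gap`, and the caller-supplied `gap_months_at` function are hypothetical names; the real gap computation depends on the registry and is not reproduced here.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Callable


@dataclass(frozen=True)
class ImputedGap:
    window_start: date   # earliest plausible eval date (model release)
    window_end: date     # latest plausible eval date (day before publication)
    gap_low: float       # smaller of the two endpoint gaps, in months
    gap_high: float      # larger of the two endpoint gaps, in months
    imputed: bool = True  # drives the "imputed" badge in this sketch


def impute_gap(
    model_release: date,
    publication: date,
    gap_months_at: Callable[[date], float],
) -> ImputedGap:
    """Evaluate the frontier gap at both ends of the imputed eval window."""
    if model_release >= publication:
        raise ValueError("tested model must be released before publication")
    start = model_release
    end = publication - timedelta(days=1)
    a, b = gap_months_at(start), gap_months_at(end)
    return ImputedGap(start, end, min(a, b), max(a, b))
```

Passing the gap function in as a parameter keeps the window logic separate from whichever capability axis the gap is measured on.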
What the confidence field means, and why we publish quotes
The extraction prompt requires Opus to report a calibrated confidence between 0 and 1 for every coded value. A confidence of 0.90 means the model would expect 9 of 10 independent extractors to agree. The prompt explicitly forbids defaulting to a flat 0.95; confidences in the audit corpus span the full range, with the capability-frame judgement (Item 5) typically the lowest because it depends on grammatical-subject identification under six borderline-case rules.
Every verdict that the prompt could find a verbatim quote for displays the quote inline. If you disagree with a verdict, the quote is the place to start: either the quote does not say what the verdict claims (extraction error), or the quote does not survive in the body of the paper (abstract-vs-body mismatch; the disclosure-delta panel surfaces these), or the verdict is fairly drawn but you read the quote differently (legitimate disagreement; flag it).
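A reader can check a displayed quote against the paper text with a few lines of code. The sketch below is an assumption about how such a check might look, not the audit's own verifier: it collapses whitespace and straightens curly quotes before the substring test, a leniency chosen here because PDF extraction often alters both. `locate_quote` and its return keys are hypothetical.

```python
import re
from typing import Optional

# Map typographic quotes to ASCII so extraction differences do not cause misses.
_QUOTE_MAP = str.maketrans({
    "\u2018": "'", "\u2019": "'",
    "\u201c": '"', "\u201d": '"',
})


def _normalise(text: str) -> str:
    """Straighten quotes and collapse every whitespace run to one space."""
    return re.sub(r"\s+", " ", text.translate(_QUOTE_MAP)).strip()


def locate_quote(quote: str, abstract: str, body: Optional[str] = None) -> dict:
    """Report where a verdict's quote appears; in_body is None when no body was fetched."""
    q = _normalise(quote)
    if not q:
        raise ValueError("empty quote")
    return {
        "in_abstract": q in _normalise(abstract),
        "in_body": None if body is None else q in _normalise(body),
    }
```

A quote found in the abstract but not in the body corresponds to the abstract-vs-body mismatch case above; a quote found in neither points to an extraction error.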
Where the registry comes from
- Arena Elo trajectory and per-model lookup. Crawled from the Chatbot Arena public leaderboard; refreshed monthly. The trajectory underlies the contemporaneous-frontier comparison and the ECI synthesis.
- Release dates. Vendor announcement pages, model cards, and the curated alias table built from the audit pipeline. Family-anchor fallbacks fire when the paper's string is too generic for an exact lookup (e.g. GPT-4o resolves via the GPT-4o May 2024 anchor when no snapshot is given).
- ECI (elicitable-capability index). An Arena-anchored composite the project derives across the registry. ECI is the unit the gap-in-months conversion ultimately uses, so it is reported alongside Elo to make the conversion auditable.
- Artificial Analysis Intelligence Index. Cross-checked against artificialanalysis.ai with their variant-aware coverage (mini, high effort, reasoning vs. non-reasoning) preserved.
What stops the audit from accepting a misleading frontier-tier verdict
The capability-frame extraction reports frontier_comparator when the LLM judges that any tested model was contemporaneously frontier-tier at the eval date. The audit does not display that verdict directly: a four-step gate runs before the at-frontier hero renders, with each step responsible for a distinct failure mode.
- Capability-tier proximity, axis cascade. The highest-capability tested model the registry can resolve must sit within ~3 months of frontier-progress of the contemporaneous Arena frontier at the anchor date (disclosed eval if present, else publication date). The threshold is derived from the trajectory slope on whichever axis the cascade falls to: ECI is primary (≈2.3-point window on the current slope), Arena Elo is the fallback when ECI is unmeasured for either side (≈23-point window; by construction every trajectory frontier has an Arena Elo, so this closes gaps the ECI table doesn't yet cover, e.g. the gpt-4-1106-preview window 2023-12 → 2024-05), AA Index is the further fallback. When the gate falls to a non-ECI axis the chart defaults to that same axis and a discreet caveat above the chart names the substitution; chart and gate then tell the same story on the same scale.
- Vendor diversity for class-level claims. A claim about "AI" or "LLMs" grounded in five GPT-X variants is still single-vendor evidence; the breadth-mitigation branch on Item 5 therefore requires at least two distinct model families for partial credit. The gate intentionally does not check capability tier: five non-frontier models from five vendors still earns the breadth mitigation, since Item 5 grades subject-tested match, not capability ceiling. Tier critique lives in Item 7.
- Release-after-anchor. When the resolved tested-model release date post-dates the anchor, the gate fails closed: a paper cannot have evaluated a model before it existed. This usually surfaces an extraction error (mis-pinned variant) or a wrongly-stated eval date.
- Release-recency, last resort. When neither ECI nor Elo nor AA resolves for the contemporaneous frontier (pre-Arena 2022 evaluations, totally unrated tested models), the gate falls back to a fixed 6-month release-recency window, looser than capability-tier but defensible for the corner of the corpus no capability axis can reach. The cascade is built so this branch fires rarely; closing the long tail requires real measurements on the upstream registry, not synthetic proxies.
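The numeric parts of the gate (release-after-anchor, the axis cascade, and the release-recency fallback) can be sketched as below. The 2.3-point ECI window, 23-point Elo window, and 6-month recency window come from the text; the AA Index window is not stated there, so it is a required parameter. Everything else is an assumption made for illustration: the names `Scores` and `frontier_gate`, checking release-after-anchor first, and reading "release-recency" as months from the tested model's release to the anchor. The vendor-diversity step concerns Item 5 and is left out.

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

ECI_WINDOW = 2.3     # ~3 months of frontier progress on the ECI axis (from the text)
ELO_WINDOW = 23.0    # the same window expressed in Arena Elo points (from the text)
RECENCY_MONTHS = 6   # fixed last-resort release-recency window (from the text)


@dataclass
class Scores:
    eci: Optional[float] = None
    elo: Optional[float] = None
    aa: Optional[float] = None


def months_between(earlier: date, later: date) -> int:
    """Whole calendar months from earlier to later."""
    return (later.year - earlier.year) * 12 + (later.month - earlier.month)


def frontier_gate(
    tested: Scores,
    frontier: Scores,
    tested_release: date,
    anchor: date,
    aa_window: float,  # not given in the text; the caller must supply it
) -> tuple[bool, str]:
    """Return (passes, deciding_branch) for the at-frontier verdict."""
    # Fail closed: a model cannot be evaluated before it exists.
    if tested_release > anchor:
        return False, "release_after_anchor"
    # Use the first axis on which BOTH sides have a measurement.
    for axis, window in (("eci", ECI_WINDOW), ("elo", ELO_WINDOW), ("aa", aa_window)):
        t, f = getattr(tested, axis), getattr(frontier, axis)
        if t is not None and f is not None:
            return (f - t) <= window, axis
    # No capability axis resolves: fall back to release recency.
    return months_between(tested_release, anchor) <= RECENCY_MONTHS, "release_recency"
```

Returning the deciding branch alongside the verdict lets a caller show which axis was used, in the spirit of the caveat the card displays when the gate falls to a non-ECI axis.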
What this audit does not do
- It does not re-run the paper's evaluation. VERSIO-AI is a reporting audit; benchmark replication is a different category of work and a much larger one. Every verdict about capability is a verdict about the paper's disclosure of capability, never about the underlying model's performance on the task.
- Item 4 (statistical comparison method) is manual-only. The Opus 4.7 extraction does not score the test machinery because methods sections are too varied and a structured-output extraction has not yet earned the calibration the audit corpus requires.
- Frontier comparison is Arena-anchored. Capability axes the leaderboard does not adjudicate (genuinely new modalities, agent harnesses, vertical-task-specific tuning) are out of scope. The card surfaces this honestly when the tested model is not in the registry.
- Single-call extraction has known variance. Per-paper agreement against the gold-standard manual coding sits at ≥0.86 Cohen's kappa across all Core 3 items in the audit corpus pilot; we publish the per-field confidence on the card so you can see when the model itself was uncertain.
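For readers unfamiliar with the agreement statistic in the last item, Cohen's kappa compares observed agreement between two coders with the agreement expected by chance from their label frequencies. The sketch below computes it for one item across papers; the function name and the "pass"/"fail" labels in the usage note are illustrative, and this is the standard two-rater formula rather than the project's evaluation script.

```python
from collections import Counter


def cohens_kappa(coder_a: list[str], coder_b: list[str]) -> float:
    """Cohen's kappa for two coders labelling the same items."""
    if len(coder_a) != len(coder_b) or not coder_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(coder_a)
    # Observed agreement: share of items given the same label by both coders.
    p_o = sum(x == y for x, y in zip(coder_a, coder_b)) / n
    # Chance agreement: product of each coder's marginal label frequencies.
    count_a, count_b = Counter(coder_a), Counter(coder_b)
    p_e = sum(count_a[k] * count_b[k] for k in count_a.keys() | count_b.keys()) / (n * n)
    if p_e == 1.0:
        return 1.0  # both coders used one identical label throughout
    return (p_o - p_e) / (1.0 - p_e)
```

With manual codes `["pass", "pass", "fail", "fail"]` and extracted codes `["pass", "pass", "fail", "pass"]`, observed agreement is 0.75, chance agreement is 0.5, and kappa is 0.5.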