Verdicts below are scored against the paper's abstract. The two homepage exemplars are stored as fixtures so the card visual stays stable while the live pipeline iterates; live audits run the same pipeline against any DOI not in the fixture set.
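The fixture-versus-live split described above can be sketched as a simple lookup with a fallback. This is a minimal illustration, not the project's actual code: the function name `resolve_audit`, the `fixtures` mapping, the `run_live` callable, and the DOI normalisation are all assumptions introduced for the example.

```python
from typing import Callable, Dict


def resolve_audit(
    doi: str,
    fixtures: Dict[str, dict],
    run_live: Callable[[str], dict],
) -> dict:
    """Return the stored fixture for a DOI if one exists, else run the live pipeline.

    All names here are hypothetical; the source only states that fixture DOIs
    are served from storage and every other DOI goes through the live pipeline.
    """
    # Assumption: DOIs are compared case-insensitively with whitespace trimmed.
    key = doi.strip().lower()
    if key in fixtures:
        # Fixture hit: the stored card is returned unchanged, so the visual stays stable.
        return fixtures[key]
    # Fixture miss: the same pipeline is run live for this DOI.
    return run_live(key)
```

Under this sketch, a homepage exemplar never triggers a live run, while any other DOI always does.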
A four-model frontier-tier comparison run within weeks of each model's release; the rare evaluation that does not lag the field. The 100% top-1 accuracy is on expert-validated vignettes, not real patient encounters, and there is no same-study clinician arm; the headline number therefore speaks to recognition under idealised conditions, not to clinical deployment readiness. Cross-vendor breadth and a contemporaneous frontier comparator are the strengths a careful reader should trust; the human-comparator absence is the limitation to hold against the result.
ECI is the project's elicitable-capability composite (Arena-anchored). Elo is the model's latest Arena rating. AA is Artificial Analysis's Intelligence Index. Each axis is anchored to the registry's observed range, so dot positions are commensurable across audits.
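One plausible reading of anchoring an axis to the registry's observed range is min-max scaling. The sketch below assumes that reading; the source does not state the formula, and the function name `axis_position`, its parameters, and the clamping behaviour are hypothetical.

```python
def axis_position(value: float, observed_min: float, observed_max: float) -> float:
    """Map a raw score to a 0..1 dot position on an axis spanning an observed range.

    Assumption: linear min-max scaling, with values outside the range clamped
    to the axis ends. The real card may use a different mapping.
    """
    if observed_max <= observed_min:
        raise ValueError("observed range must have positive width")
    fraction = (value - observed_min) / (observed_max - observed_min)
    # Clamp so a score outside the observed range still lands on the axis.
    return min(1.0, max(0.0, fraction))
```

Because every audit would be scaled against the same registry-wide minimum and maximum, a dot at a given position would mean the same thing on any card, which is what makes positions comparable across audits.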
Abstract-only read: the Elicitation and Capability-frame verdicts can upgrade once the full text discloses thinking effort, prompting, or scope-bounding language; the Model-version verdict is unaffected by the binding source.
"four frontier LLMs (ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.1)"
All four models named at variant level with explicit version numbers. Snapshot IDs absent; variant-pinning satisfies item 1 under v1.1 (cross-checkpoint drift on a single variant is small).
Disclosed: scaffolding, prompting/context, multi-agent setup. Missing load-bearing dimensions: reasoning mode, thinking effort, tool/search. The capability ceiling reported may not reflect what fuller elicitation would yield. Within-family: the strongest variant of the chosen family generation was tested.
"Under idealised vignette conditions, frontier LLMs thus demonstrated high accuracy in recognising Category A bioterrorism syndromes"
Subject ("frontier LLMs") is anaphoric to the four named models tagged earlier as "four frontier LLMs", which under v7.2 BC1 is defensible as pass; but the bare generic-class subject of a capability verb is exactly the phrasing item 5 flags, so the verdict is warn-with-caveat.
Partial disclosures across Core 3.
The complete VERSIO-AI rubric. Twelve items are scored automatically by the live extraction; the remaining one requires a manual reader pass against the paper's methods or supplementary material.