VERSIO-AI · Frontier Lag v1.2 · 2026

Frontier large language models and clinical recognition of Category A bioterrorism agents: a cross-sectional analysis

Global Security: Health, Science and Policy · medicine · doi.org/10.1080/23779497.2026.2643956

Abstract-level audit · Binding read: abstract

Verdicts below are scored against the paper's abstract. The two homepage exemplars are stored as fixtures so the card visual stays stable while the live pipeline iterates; live audits run the same pipeline against any DOI not in the fixture set.

What a careful reader should notice

A four-model frontier-tier comparison run within weeks of each model's release; the rare evaluation that does not lag the field. The 100% top-1 accuracy is on expert-validated vignettes, not real patient encounters, and there is no same-study clinician arm; the headline number therefore speaks to recognition under idealised conditions, not to clinical deployment readiness. Cross-vendor breadth and a contemporaneous frontier comparator are the strengths a careful reader should trust; the human-comparator absence is the limitation to hold against the result.

Frontier-gap

Frontier-tier when evaluated; ≈15 months behind today’s frontier

vs Gemini 3 Pro (at evaluation) · vs Claude Fable 5 (today)

Tested model · Frontier at evaluation · Frontier today

In December 2025, the elicitable frontier was Gemini 3 Pro (ECI 153.4) · gap to Claude Opus 4.5: +3.5 ECI

Today's frontier (Claude Fable 5, Jun 2026) is +11.1 ECI ahead of the tested model.

Cross-checked: Elo +32 vs frontier in Dec 2025 · same direction as ECI

vs today: Elo +47 · AA +17.0

The paper disclosed an evaluation date.

Tested → frontier at evaluation: +3.5 ECI · +32 Elo ≈ 5 months of frontier progress

Tested → frontier today: +11.1 ECI · +47 Elo · +17 AA ≈ 1 year of frontier progress

What ECI / Elo / AA mean

ECI is Epoch AI's Capabilities Index, a cross-benchmark capability score. Elo is the model's latest Arena rating. AA is Artificial Analysis's Intelligence Index. Each axis is anchored to the registry's observed range; the dot positions are commensurable across audits.
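The ECI figures quoted on this card can be cross-checked with simple arithmetic. The sketch below assumes that each quoted gap is a plain difference of ECI scores and that the tested model is Claude Opus 4.5, as the gap line suggests; the variable names are illustrative, and the derived values are implied by the card's numbers rather than reported by it.

```python
from decimal import Decimal

# Figures quoted on this card.
frontier_at_eval_eci = Decimal("153.4")  # Gemini 3 Pro, December 2025
gap_at_eval = Decimal("3.5")             # frontier at evaluation minus tested model
gap_today = Decimal("11.1")              # today's frontier minus tested model

# Assumption: each gap is a plain difference of ECI scores.
tested_eci = frontier_at_eval_eci - gap_at_eval
frontier_today_eci = tested_eci + gap_today

# How far the frontier itself moved between the evaluation and today.
frontier_shift = frontier_today_eci - frontier_at_eval_eci

print(tested_eci, frontier_today_eci, frontier_shift)
```

Decimal arithmetic is used so that one-decimal scores subtract exactly, without binary floating-point rounding.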


VERSIO-AI Core 3

Editorial desk-reject tier

Abstract-only read: Elicitation and Capability-frame verdicts can upgrade once the full text discloses thinking effort, prompting, or scope-bounding language; Model-version is unaffected by the binding source.

Model version

Variant

"four frontier LLMs (ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.1)"

All four models named at variant level with explicit version numbers. Snapshot IDs absent; variant-pinning satisfies item 1 under v1.1 (cross-checkpoint drift on a single variant is small).

Pass

Elicitation

Partial elicitation disclosure

Disclosed: scaffolding, prompting/context, multi-agent setup. Missing load-bearing dimensions: reasoning mode, thinking effort, tool/search. The capability ceiling reported may not reflect what fuller elicitation would yield. Within-family: the strongest variant of the chosen family generation was tested.

Partial
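The elicitation verdict above can be read as a set difference between the dimensions the rubric looks for and those the abstract disclosed. The sketch below is illustrative only: the dimension names come from item 9 of the checklist, while the choice of load-bearing set, the mapping of "prompting/context" to "prompting strategy", the `elicitation_verdict` function, and the "Fail" label are assumptions, not the audit pipeline's actual rule.

```python
# Dimensions listed under item 9 (Elicitation completeness) of the rubric.
DIMENSIONS = {
    "reasoning mode", "thinking effort", "tool/search use",
    "prompting strategy", "scaffolding", "multi-agent setup",
    "access method", "temperature", "within-family variant",
}

# Assumption: these three are the load-bearing dimensions, because the card
# names them as the missing load-bearing ones for this paper.
LOAD_BEARING = {"reasoning mode", "thinking effort", "tool/search use"}


def elicitation_verdict(disclosed: set[str]) -> tuple[str, list[str]]:
    """Hypothetical rule: Pass only when no load-bearing dimension is missing."""
    unknown = disclosed - DIMENSIONS
    if unknown:
        raise ValueError(f"unrecognised dimensions: {sorted(unknown)}")
    missing = sorted(LOAD_BEARING - disclosed)
    if not missing:
        return "Pass", missing
    # Something was disclosed, but at least one load-bearing dimension is absent.
    return ("Partial" if disclosed else "Fail"), missing


verdict, missing = elicitation_verdict(
    {"scaffolding", "prompting strategy", "multi-agent setup"}
)
print(verdict, missing)
```

Under these assumptions, the three disclosed dimensions reproduce the "Partial" verdict shown on the card.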

Capability frame

Anaphoric but generic-tier phrasing

"Under idealised vignette conditions, frontier LLMs thus demonstrated high accuracy in recognising Category A bioterrorism syndromes"

Subject ("frontier LLMs") is anaphoric to the four named models tagged earlier as "four frontier LLMs", which under v7.2 BC1 is defensible as pass; but the bare generic-class subject of a capability verb is exactly the phrasing item 5 flags, so the verdict is warn-with-caveat.

Partial

Mixed signals

Partial disclosures across Core 3.

Extended items

Comparator and breadth

Frontier coverage of evaluated set: Includes frontier-tier model(s)

All four evaluated models were within weeks of release at the December 2025 evaluation; the finding speaks to current state-of-the-art capability.

Human comparator: No same-study clinician comparator

Vignettes were expert-validated for ground truth, but no same-study clinician group performed the diagnostic task against the models. Without a parallel human-physician arm the 100% top-1 result has no human reference for calibration.

Models evaluated: Four cross-vendor models

"ChatGPT 5.1, Claude Opus 4.5, Gemini 3 Pro, and Grok 4.1"

n = 4 models across four vendors (OpenAI, Anthropic, Google, xAI). Cross-vendor breadth is unusually strong for a single-paper evaluation.

Conclusion

Conclusion valence: Positive

"frontier LLMs thus demonstrated high accuracy in recognising Category A bioterrorism syndromes, suggesting potential utility as diagnostic support and educational tools"

Positive valence reported descriptively. The conclusion qualifies the result with 'under idealised vignette conditions' and 'with appropriate governance', so it reads as measured rather than promotional.

Transparency

Evaluation date disclosure: Day-precision

"On 10 December 2025, four frontier LLMs … were each prompted, in new chat sessions"

Day-precision evaluation date stated. Frontier comparator (Gemini 3 Pro, released 18 November 2025) was 22 days old at evaluation.
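The 22-day figure follows directly from the two dates quoted on the card. A minimal check, using only the release and evaluation dates stated above (the variable names are illustrative):

```python
from datetime import date

release = date(2025, 11, 18)     # Gemini 3 Pro release date, as stated on the card
evaluation = date(2025, 12, 10)  # evaluation date quoted from the abstract

# Subtracting two dates yields a timedelta; .days is the whole-day difference.
age_days = (evaluation - release).days
print(age_days)  # 22
```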

Elicitation breakdown

Within-family variant: Strongest variant

All four named models were the strongest variants of their respective family generations at the eval date (Opus over Sonnet/Haiku; Pro over Flash; Grok 4.1 was the strongest Grok 4 variant).

Full 13-item checklist · 12 automated · 1 reader-pass

The complete VERSIO-AI rubric. 12 items are scored automatically by the live extraction; the remaining 1 requires a manual reader pass against the paper's methods or supplementary material.

Item 1 · automated

Model version

Vendor, family, variant, snapshot ID — every level the abstract or full text disclosed.

Item 2 · automated

Multiple-model evaluation

Whether more than one model was evaluated; cross-model comparison shifts what a class-level claim can rest on.

Item 3 · automated

Evaluation date

When the model was queried — not the publication date, training cutoff, or submission date.

Item 4 · manual

Statistical comparison method

Whether the comparison reports statistical-test machinery (effect size, CI, test name). Read from the methods section.

Item 5 · automated

Capability frame

Whether the conclusion stays scoped to the tested model or generalises to LLMs / AI as a class.

Item 6 · automated

Conclusion valence

Positive / negative / mixed / neutral. Reported, not graded — used for valence-asymmetry analysis.

Item 7 · automated

Comparator adequacy

Whether at least one evaluated model was contemporaneously frontier-tier, or only weaker baselines were tested.

Item 8 · automated

Human comparator

Same-study quantitative human performance reported alongside model performance.

Item 9 · automated

Elicitation completeness

Aggregate over reasoning mode, thinking effort, tool/search use, prompting strategy, scaffolding, multi-agent setup, access method, temperature, AND within-family variant choice. Core 3 third slot in v1.2.

Item 10 · automated

Domain

Medicine / law / coding / education / scientific reasoning / other.

Item 11 · automated

Task description

One-sentence description of what was evaluated.

Item 12 · automated

Frontier-gap at eval

Months behind contemporaneous Arena-elicited frontier at the evaluation date (or imputed window when item 3 is missing).

Item 13 · automated

Qualitative reader summary

Short plain-English domain-reader summary from the reasoning model — what to hold provisionally vs. trust.
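The checklist above can be held as a small data structure, which makes the "12 automated · 1 reader-pass" split easy to verify. This is a sketch only: the item numbers, names, and scoring modes are copied from the checklist, while the `RUBRIC` layout is an assumption and not the pipeline's actual schema.

```python
from collections import Counter

# (item number, name, scoring mode), as listed in the checklist.
RUBRIC = [
    (1, "Model version", "automated"),
    (2, "Multiple-model evaluation", "automated"),
    (3, "Evaluation date", "automated"),
    (4, "Statistical comparison method", "manual"),
    (5, "Capability frame", "automated"),
    (6, "Conclusion valence", "automated"),
    (7, "Comparator adequacy", "automated"),
    (8, "Human comparator", "automated"),
    (9, "Elicitation completeness", "automated"),
    (10, "Domain", "automated"),
    (11, "Task description", "automated"),
    (12, "Frontier-gap at eval", "automated"),
    (13, "Qualitative reader summary", "automated"),
]

# Tally items by scoring mode and collect those needing a reader pass.
counts = Counter(mode for _, _, mode in RUBRIC)
manual_items = [name for _, name, mode in RUBRIC if mode == "manual"]

print(len(RUBRIC), counts["automated"], counts["manual"], manual_items)
```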
