Verdicts below are scored against the paper's abstract. The two homepage exemplars are stored as fixtures so the card visual stays stable while the live pipeline iterates; live audits run the same pipeline against any DOI not in the fixture set.
A parallel-extraction pipeline rather than a head-to-head capability test: GPT-5.2 and Gemini 3 Pro each pull the same 48 fields per study, with their agreement functioning as a confidence signal. Cross-vendor frontier-tier design is the strength; the missing query date is the central disclosure failure, since the reader cannot tell whether the comparison was anchored against December 2025 or April 2026 frontier capabilities. Treat the accuracy numbers as informative for the orthopaedic-extraction task as specified, not as a generic LLM-systematic-review claim.
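The idea of agreement as a confidence signal can be sketched as follows. This is a minimal illustration, not the paper's or the audit pipeline's actual method: the function name `agreement_rate`, the field names, and the values are all hypothetical, and exact equality is assumed as the matching rule.

```python
from typing import Any, Mapping


def agreement_rate(a: Mapping[str, Any], b: Mapping[str, Any]) -> float:
    """Fraction of fields (union of both extractions) with identical values."""
    fields = set(a) | set(b)
    if not fields:
        raise ValueError("no fields to compare")
    # A field counts as agreed only if both models returned it with the same value.
    matches = sum(1 for f in fields if f in a and f in b and a[f] == b[f])
    return matches / len(fields)


# Hypothetical extractions for one study (three fields shown instead of 48).
model_a = {"sample_size": 120, "study_design": "RCT", "follow_up_months": 24}
model_b = {"sample_size": 120, "study_design": "RCT", "follow_up_months": 12}

rate = agreement_rate(model_a, model_b)
print(f"agreement: {rate:.2f}")
```

Fields on which the two extractions disagree would be the natural candidates for manual review; a real pipeline would likely need a looser matching rule for free-text fields.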
Eval date undisclosed; the gap range is computed from a publication-bounded window, with the OpenAlex publication date as the latest plausible evaluation date and the strongest tested model's release date as the earliest. This mirrors the audit corpus's pre-registered imputation midpoint of publication date − 180 days.
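The window and the imputed midpoint described above can be sketched like this. It is a rough illustration under stated assumptions: the function names are hypothetical, and the two dates are placeholders, not the actual publication date of the paper or the release date of any model.

```python
from datetime import date, timedelta

# The "pub-date - 180d" imputation rule from the text.
IMPUTATION_OFFSET = timedelta(days=180)


def eval_window(publication_date: date, strongest_model_release: date) -> tuple[date, date]:
    """Return (earliest, latest) plausible evaluation dates."""
    if strongest_model_release > publication_date:
        raise ValueError("model release cannot postdate publication")
    return strongest_model_release, publication_date


def imputed_eval_date(publication_date: date) -> date:
    """Imputed evaluation date: publication date minus 180 days."""
    return publication_date - IMPUTATION_OFFSET


# Placeholder dates, for illustration only.
pub = date(2026, 4, 1)
release = date(2025, 12, 1)

earliest, latest = eval_window(pub, release)
imputed = imputed_eval_date(pub)
print(earliest, latest, imputed)
```

With these placeholder dates the imputed date falls before the model's release. How the live pipeline reconciles that case is not stated in the text, so this sketch leaves the imputed value unclamped.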
Set an evaluation date below and the gap recomputes against that anchor. The URL updates so you can share the override; Item 3 still reports what the paper disclosed.
ECI is the project's elicitable-capability composite (Arena-anchored). Elo is the model's latest Arena rating. AA is Artificial Analysis's Intelligence Index. Each axis is anchored to the registry's observed range, so the dot positions are commensurable across audits.
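One plausible reading of "anchored to the registry's observed range" is a min-max rescaling of each score onto a 0 to 1 axis. The sketch below assumes that reading; the function name `axis_position` and the numeric range are hypothetical, not values from the registry.

```python
def axis_position(value: float, observed_min: float, observed_max: float) -> float:
    """Map a score to [0, 1] relative to the registry's observed range."""
    if observed_max <= observed_min:
        raise ValueError("observed range must have positive width")
    position = (value - observed_min) / (observed_max - observed_min)
    # Clamp so a score outside the observed range still lands on the axis.
    return min(1.0, max(0.0, position))


# Hypothetical Elo range and rating, for illustration only.
pos = axis_position(1300.0, observed_min=1000.0, observed_max=1500.0)
print(f"{pos:.2f}")
```

Because every audit would rescale against the same registry range, a dot at the same position means the same score regardless of which paper is being audited.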
Abstract-only read: Elicitation and Capability-frame verdicts can upgrade once the full text discloses thinking effort, prompting, or scope-bounding language; Model-version is unaffected by the binding source.
"Generative Pre-Trained Transformer 5.2 (GPT-5.2) and Google Gemini 3 Pro"
Both models named at variant level with explicit version numbers spelled out. Snapshot IDs absent; variant-pinning satisfies item 1.
Disclosed: scaffolding. Missing load-bearing dimensions: reasoning mode, thinking effort, tool/search, prompting/context. The reported capability is from a substantially under-specified configuration; stronger elicitation could change the result. Within-family: the strongest variant of the chosen family generation was tested.
"A parallel-LLM approach using GPT-5.2 and Gemini 3 Pro achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review."
Conclusion subject is the named GPT-5.2 + Gemini 3 Pro approach; capability bounded to 'automated data extraction in an orthopaedic systematic review'. No generic-LLM extension.
A single editorial desk-reject signal fires; the rest pass.
The complete VERSIO-AI rubric. Twelve items are scored automatically by the live extraction; the remaining one requires a manual reader pass against the paper's methods or supplementary materials.