VERSIO-AI · Frontier Lag v1.2 · 2026

Automated data extraction for systematic reviews using GPT-5.2 and Google Gemini Pro 3: A dual-large language model approach in orthopaedic research

Knee Surgery, Sports Traumatology, Arthroscopy · medicine · doi.org/10.1002/ksa.70412

Abstract-level audit · Binding read: abstract

Verdicts below are scored against the paper's abstract. The two homepage exemplars are stored as fixtures so the card visual stays stable while the live pipeline iterates; live audits run the same pipeline against any DOI not in the fixture set.

What a careful reader should notice

A parallel-extraction pipeline rather than a head-to-head capability test: GPT-5.2 and Gemini 3 Pro each pull the same 48 fields per study, with their agreement functioning as a confidence signal. Cross-vendor frontier-tier design is the strength; the missing query date is the central disclosure failure, since the reader cannot tell whether the comparison was anchored against December 2025 or April 2026 frontier capabilities. Treat the accuracy numbers as informative for the orthopaedic-extraction task as specified, not as a generic LLM-systematic-review claim.
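The agreement-as-confidence idea described above can be sketched as follows. This is a minimal illustration, not the paper's pipeline: it assumes each model returns a flat mapping of field name to extracted value, and the function name `compare_extractions` and the three example fields are hypothetical.

```python
from typing import Dict, List, Tuple


def compare_extractions(
    a: Dict[str, str], b: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Split fields into those both models agree on and those needing human review."""
    agreed: List[str] = []
    review: List[str] = []
    for field in sorted(set(a) | set(b)):
        va, vb = a.get(field), b.get(field)
        # Assumption: only case and surrounding whitespace are normalised;
        # any other difference, or a missing value, counts as disagreement.
        if va is not None and vb is not None and va.strip().lower() == vb.strip().lower():
            agreed.append(field)
        else:
            review.append(field)
    return agreed, review


# Hypothetical outputs from the two models for one study.
model_a = {"sample_size": "120", "follow_up_months": "24", "graft_type": "Hamstring"}
model_b = {"sample_size": "120", "follow_up_months": "12", "graft_type": "hamstring "}

agreed, review = compare_extractions(model_a, model_b)
print("agreed:", agreed)
print("needs review:", review)
```

Fields in the review list are the ones a human verifier would check first, which matches the first-pass framing of the paper's conclusion.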

Frontier-gap · IMPUTED

≈2 months behind when evaluated; ≈9 months behind today’s frontier

vs Gemini 3 Pro (at evaluation) · vs GPT-5.5 Pro (today)

Tested model · Frontier at evaluation · Frontier today

In December 2025, the elicitable frontier was Gemini 3 Pro (ECI 153.4); gap to GPT-5.2: +1.4 ECI. Today's frontier (GPT-5.5 Pro, Apr 2026) is +6.7 ECI ahead of the tested model.

Cross-checked: Elo +8 · AA +4.0 vs frontier in Dec 2025 · same direction as ECI

Eval date imputed — scrub to apply your own date if you know when the eval ran.

Tested → frontier at evaluation: +1.4 ECI · +8 Elo ≈ 2 months of frontier progress

Tested → frontier today: +6.7 ECI ≈ 9 months of frontier progress
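The two gaps can be checked with simple arithmetic. The sketch below uses only the figures quoted on this card; the tested model's ECI, today's frontier ECI, and the ECI-per-month rates are derived here and are not stated by the audit. The card does not say that the months conversion is linear, so the implied rates are a sanity check only.

```python
# Figures quoted on the card.
FRONTIER_AT_EVAL_ECI = 153.4  # Gemini 3 Pro, December 2025
GAP_AT_EVAL_ECI = 1.4         # frontier at evaluation minus tested model
GAP_TODAY_ECI = 6.7           # frontier today minus tested model
MONTHS_AT_EVAL = 2            # "≈ 2 months of frontier progress"
MONTHS_TODAY = 9              # "≈ 9 months of frontier progress"

# Derived values (assumption: gaps are plain differences on the ECI scale).
tested_eci = round(FRONTIER_AT_EVAL_ECI - GAP_AT_EVAL_ECI, 1)
frontier_today_eci = round(tested_eci + GAP_TODAY_ECI, 1)

# Implied ECI points per month for each quoted conversion.
rate_at_eval = GAP_AT_EVAL_ECI / MONTHS_AT_EVAL
rate_today = GAP_TODAY_ECI / MONTHS_TODAY

print(f"tested model ECI: {tested_eci}")
print(f"frontier today ECI: {frontier_today_eci}")
print(f"implied rate at eval: {rate_at_eval:.2f} ECI/month")
print(f"implied rate today: {rate_today:.2f} ECI/month")
```

The two implied rates are close to each other (roughly 0.7 ECI per month), so the quoted month figures are at least mutually consistent.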

Most generous assumption. The contemporaneous-frontier reference here is fixed at Apr 2026 — the latest plausible eval given publication and tested-model release. The actual eval was almost certainly earlier (papers typically run evals months before submission), and an earlier eval would widen both gaps below. Treat the displayed gap as a floor, not a point estimate.

How the eval date was imputed

Eval date undisclosed; gap range computed from a publication-bounded window with the OpenAlex publication date as the latest plausible eval and the strongest tested model's release as the earliest. Mirrors the audit corpus's pre-registered pub-date − 180d imputation midpoint.
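The publication-bounded window can be sketched as below. This is an illustration under stated assumptions, not the audit's implementation: the function name `impute_eval_window` is hypothetical, both example dates are placeholders rather than the real release or publication dates, and clamping the 180-day point estimate to the model's release date is an assumption added here so the estimate never precedes the model's existence.

```python
from datetime import date, timedelta
from typing import Tuple


def impute_eval_window(
    model_release: date, publication: date, lag_days: int = 180
) -> Tuple[date, date, date]:
    """Return (earliest, latest, point estimate) for an undisclosed eval date."""
    if publication < model_release:
        raise ValueError("publication date precedes the tested model's release")
    # Pre-registered rule quoted on the card: publication date minus 180 days.
    point = publication - timedelta(days=lag_days)
    # Assumption: the model cannot have been queried before it was released.
    point = max(point, model_release)
    return model_release, publication, point


# Placeholder dates for illustration only.
earliest, latest, point = impute_eval_window(
    model_release=date(2025, 12, 1),
    publication=date(2026, 4, 15),
)
print(earliest, latest, point)
```

With these placeholder dates the unclamped estimate would fall before the release date, so the point estimate collapses to the start of the window.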

What if you knew the eval date

Set an evaluation date below and the gap recomputes against that anchor. The URL updates so you can share the override; Item 3 still reports what the paper disclosed.

What ECI / Elo / AA mean

ECI is the project's elicitable-capability composite (Arena-anchored). Elo is the model's latest Arena rating. AA is Artificial Analysis's Intelligence Index. Each axis is anchored to the registry's observed range; the dot positions are commensurable across audits.
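Anchoring an axis to an observed range is commonly done with min-max scaling, sketched below. Whether the audit uses exactly this scaling is an assumption, the function name `axis_position` is hypothetical, and the registry range of 140 to 160 is a placeholder rather than the registry's real range.

```python
def axis_position(value: float, observed_min: float, observed_max: float) -> float:
    """Map a score onto [0, 1] relative to the registry's observed range."""
    if observed_max <= observed_min:
        raise ValueError("observed_max must exceed observed_min")
    fraction = (value - observed_min) / (observed_max - observed_min)
    # Clamp so scores outside the observed range still land on the axis.
    return min(1.0, max(0.0, fraction))


# Placeholder registry range; 152.0 and 153.4 follow from the ECI figures on the card.
ECI_MIN, ECI_MAX = 140.0, 160.0
tested_pos = axis_position(152.0, ECI_MIN, ECI_MAX)
frontier_pos = axis_position(153.4, ECI_MIN, ECI_MAX)
print(f"tested: {tested_pos:.2f}, frontier at eval: {frontier_pos:.2f}")
```

Because every audit would scale against the same registry range, a dot at a given position means the same score regardless of which paper is being audited.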

VERSIO-AI Core 3

Editorial desk-reject tier

Abstract-only read: Elicitation and Capability-frame verdicts can upgrade once the full-text discloses thinking effort, prompting, or scope-bounding language; Model-version is unaffected by the binding source.

Model version

Variant

"Generative Pre-Trained Transformer 5.2 (GPT-5.2) and Google Gemini 3 Pro"

Both models named at variant level with explicit version numbers spelled out. Snapshot IDs absent; variant-pinning satisfies item 1.

Pass

Elicitation

Sparse elicitation disclosure

Disclosed: scaffolding. Missing load-bearing dimensions: reasoning mode, thinking effort, tool/search, prompting/context. The reported capability is from a substantially under-specified configuration; stronger elicitation could change the result. Within-family: the strongest variant of the chosen family generation was tested.

Fail
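The disclosed-versus-missing split in the Elicitation verdict can be sketched as a simple set comparison. The dimension names come from the verdict text; treating scaffolding as one of the load-bearing dimensions, and the function name `elicitation_gaps`, are assumptions made for this illustration. The audit's actual threshold for "sparse" is not stated, so none is encoded here.

```python
from typing import Iterable, List

# Dimensions named in the Elicitation verdict (scaffolding's inclusion is an assumption).
LOAD_BEARING = [
    "reasoning mode",
    "thinking effort",
    "tool/search",
    "prompting/context",
    "scaffolding",
]


def elicitation_gaps(disclosed: Iterable[str]) -> List[str]:
    """Return the load-bearing dimensions the paper did not disclose."""
    disclosed_set = {d.strip().lower() for d in disclosed}
    return [dim for dim in LOAD_BEARING if dim not in disclosed_set]


missing = elicitation_gaps(["scaffolding"])
print(f"{len(missing)} of {len(LOAD_BEARING)} dimensions missing: {missing}")
```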

Capability frame

Model-specific, task-bounded

"A parallel-LLM approach using GPT-5.2 and Gemini 3 Pro achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review."

Conclusion subject is the named GPT-5.2 + Gemini 3 Pro approach; capability bounded to 'automated data extraction in an orthopaedic systematic review'. No generic-LLM extension.

Pass

One Core-3 fail

A single editorial desk-reject signal fires; the rest pass.
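The Core 3 tally above amounts to counting the failing items. A minimal sketch, assuming each item carries a plain "Pass" or "Fail" string; the function name `core3_fails` is hypothetical.

```python
from typing import Dict, List

# Verdicts as reported on this card.
core3 = {
    "Model version": "Pass",
    "Elicitation": "Fail",
    "Capability frame": "Pass",
}


def core3_fails(verdicts: Dict[str, str]) -> List[str]:
    """List the Core 3 items whose verdict is 'Fail', rejecting unknown verdicts."""
    for item, verdict in verdicts.items():
        if verdict not in ("Pass", "Fail"):
            raise ValueError(f"unexpected verdict for {item}: {verdict!r}")
    return [item for item, verdict in verdicts.items() if verdict == "Fail"]


fails = core3_fails(core3)
print(f"{len(fails)} Core-3 fail(s): {fails}")
```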

Extended items

Comparator and breadth

Frontier coverage of evaluated set

Includes frontier-tier model(s)

Both GPT-5.2 and Gemini 3 Pro were contemporaneously frontier-tier within ~4 months of the imputed evaluation window; the cross-vendor parallel design strengthens the frontier-coverage signal.

Human comparator

Pre-defined gold standard

"Eight studies … were used to test extraction accuracy, agreement, and efficiency against a pre-defined gold-standard."

Same-study comparator: each of the 48 fields per study was graded against a pre-defined gold standard derived from a previously published systematic review.

Models evaluated

Two cross-vendor models

"GPT-5.2 and Gemini 3 Pro"

n = 2 models across two vendors (OpenAI, Google). Cross-vendor parallel-extraction design is the paper's central methodological claim.

Conclusion

Conclusion valence

Positive

"achieved strong accuracy with a high degree of efficiency for automated data extraction in an orthopaedic systematic review … supporting the use of a dual-LLM framework as a reliable first-pass tool for human verification."

Positive valence reported descriptively. The 'first-pass tool for human verification' framing keeps the claim bounded.

Transparency

Evaluation date disclosure

Not disclosed

Abstract describes the dual-LLM extraction pipeline against a pre-defined gold standard but does not state when the models were queried. Without a query date, the reader has no anchor for which frontier the comparison should sit against.

Elicitation breakdown

Within-family variant

Strongest variant

GPT-5.2 was the strongest GPT-5 variant at eval; Gemini 3 Pro was the strongest Gemini 3 variant. No within-family weaker-variant flag.

Full 13-item checklist · 12 automated · 1 reader-pass

The complete VERSIO-AI rubric. 12 items are scored automatically by the live extraction; the remaining item requires a manual reader pass against the paper's methods or supplementary material.

Item 1 · automated

Model version

Vendor, family, variant, snapshot ID — every level the abstract or full text disclosed.

Item 2 · automated

Multiple-model evaluation

Whether more than one model was evaluated; cross-model comparison shifts what a class-level claim can rest on.

Item 3 · automated

Evaluation date

When the model was queried — not the publication date, training cutoff, or submission date.

Item 4 · manual

Statistical comparison method

Whether the comparison reports statistical-test machinery (effect size, CI, test name). Read from the methods section.

Item 5 · automated

Capability frame

Whether the conclusion stays scoped to the tested model or generalises to LLMs / AI as a class.

Item 6 · automated

Conclusion valence

Positive / negative / mixed / neutral. Reported, not graded — used for valence-asymmetry analysis.

Item 7 · automated

Comparator adequacy

Whether at least one evaluated model was contemporaneously frontier-tier, or only weaker baselines were tested.

Item 8 · automated

Human comparator

Same-study quantitative human performance reported alongside model performance.

Item 9 · automated

Elicitation completeness

Aggregate over reasoning mode, thinking effort, tool/search use, prompting strategy, scaffolding, multi-agent setup, access method, temperature, AND within-family variant choice. Core 3 third slot in v1.2.

Item 10 · automated

Domain

Medicine / law / coding / education / scientific reasoning / other.

Item 11 · automated

Task description

One-sentence description of what was evaluated.

Item 12 · automated

Frontier-gap at eval

Months behind contemporaneous Arena-elicited frontier at the evaluation date (or imputed window when item 3 is missing).

Item 13 · automated

Qualitative reader summary

Short plain-English domain-reader summary from the reasoning model — what to hold provisionally vs. trust.
