What the confidence score represents

Each verification produces a score from 0 to 100, representing an AI-assisted assessment of methodological credibility, not scientific truth. The score reflects how rigorously a paper reports its methods, handles its data, and acknowledges its limitations. A high score means the methodology is well-documented and internally consistent. It does not mean the findings are correct, replicable, or clinically significant.

The score is derived by evaluating a fixed set of criteria (4 to 16, depending on subscription tier) against the paper's content using Claude AI (Anthropic). Each criterion receives an independent sub-score and explanation, displayed alongside the overall figure.

The overall confidence score is a holistic assessment of the paper's credibility, constrained to within ±20 points of the mean of the per-criterion scores. It reflects the paper's methodological rigour, its publication context, its post-publication record, and the weight of subsequent evidence. A server-side guardrail enforces the ±20 constraint automatically; if the AI-generated score deviates beyond this band, it is clamped and the adjustment is logged for audit.
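
As a minimal sketch of how a guardrail of this kind could work (not VerifyScience's actual implementation, which is not published), the function below clamps an overall score to within 20 points of the per-criterion mean and logs any adjustment. The function name, the logger name, and the additional clamp to the 0 to 100 scale are assumptions made for illustration.

```python
import logging
from statistics import mean

# Hypothetical logger name, used here only for illustration.
logger = logging.getLogger("guardrail")

MAX_DEVIATION = 20  # width of the band on either side of the criterion mean


def clamp_overall_score(overall: float, criterion_scores: list[float]) -> float:
    """Clamp the overall score to within 20 points of the per-criterion mean."""
    if not criterion_scores:
        raise ValueError("at least one per-criterion score is required")
    centre = mean(criterion_scores)
    # Assumption: the band is also cut off at the ends of the 0-100 scale.
    lower = max(0.0, centre - MAX_DEVIATION)
    upper = min(100.0, centre + MAX_DEVIATION)
    clamped = min(max(overall, lower), upper)
    if clamped != overall:
        # Record the adjustment so that it can be audited later.
        logger.info(
            "overall score clamped from %s to %s (criterion mean %s)",
            overall, clamped, centre,
        )
    return clamped
```

For example, with per-criterion scores of 60, 70, 65, and 65 the mean is 65, so the permitted band is 45 to 85: a proposed overall score of 95 would be clamped to 85, while a proposed score of 70 would pass through unchanged.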

VerifyScience assessments are produced by a large language model and are subject to hallucination, misinterpretation, and coverage gaps. They are designed to accelerate human expert review, not to replace it. No automated tool, including VerifyScience, should be the sole basis for decisions about research inclusion, citation, funding, or clinical practice.

Score range | Interpretation | Typical characteristics
80–100 | Strong methodological rigour | The paper demonstrates well-documented methods, appropriate analytical choices, and candid acknowledgement of limitations.
65–79 | Adequate with minor concerns | Broadly sound methodology with one or two addressable weaknesses. Suitable for inclusion with caveats.
50–64 | Moderate: verify before citing | Meaningful methodological concerns identified. Warrants careful reading and expert scrutiny before reliance.
Below 50 | Significant concerns identified | The assessment flagged substantive issues affecting the reliability of findings. Exercise significant caution.
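
The bands above can be expressed as a simple lookup. The sketch below is illustrative only: the function name is hypothetical, and treating each band's lower bound as an inclusive threshold (so that a fractional score such as 79.5 falls in the 65–79 band) is an assumption, since the table lists whole-number ranges.

```python
def interpret_score(score: float) -> str:
    """Map a 0-100 confidence score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    # Thresholds are the lower bounds of the published bands.
    if score >= 80:
        return "Strong methodological rigour"
    if score >= 65:
        return "Adequate with minor concerns"
    if score >= 50:
        return "Moderate: verify before citing"
    return "Significant concerns identified"
```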

What is assessed

VerifyScience evaluates each paper across five analytical domains. Within each domain, the number of dimensions examined and the depth of explanation provided increase with subscription tier. The specific questions asked within each domain, the weighting applied, and the internal prompt structure are proprietary and not disclosed. This is intentional, both to protect the integrity of the assessment and to prevent gaming.

What is disclosed: the five domains, what each domain is designed to answer, and how tier depth affects the assessment.

Domain-aware calibration. VerifyScience automatically detects the research domain and methodology type of each paper, distinguishing, for example, a randomised controlled trial from a qualitative interview study or a computational simulation paper. The assessment criteria are calibrated accordingly. The specific frameworks applied for each domain are proprietary and not disclosed, consistent with the anti-gaming principle above. Professional and Enterprise subscribers see an abstracted domain indicator in their results: "Assessment calibrated for: [domain category]". This indicator reflects the detected research domain, not the name of any underlying checklist or framework.

Design & methodology
Does the study design match its research question? Is the methodology sufficiently documented to assess validity? Are the procedures replicable in principle?
Availability: All tiers · Core dimensions
Statistical rigour
Do the statistical methods support the conclusions drawn? Is significance interpreted appropriately? Higher tiers examine additional dimensions of quantitative reasoning that are not evaluated at entry level.
Availability: All tiers · Depth increases at Starter+
Bias & confounding
What sources of bias are present or absent? How well are confounding factors controlled? Are study limitations acknowledged candidly? This domain draws on established risk-of-bias frameworks.
Availability: All tiers · Depth increases at Starter+
Transparency & reproducibility
Is the research independently verifiable? Is underlying data accessible? Are methods described with sufficient precision? This domain is assessed at Professional tier and above; it is not available at Free or Starter.
Availability: Professional+ · Not available at Free or Starter
Integrity & reporting standards
Are relevant reporting guidelines followed? Are conflicts of interest and ethical approvals disclosed? Are claims proportionate to the evidence presented? Enterprise tier applies the most comprehensive assessment in this domain, including dimensions not assessed at lower tiers.
Availability: Enterprise depth · Partial at Professional

The number of sub-dimensions examined within each domain, and the length and depth of the AI's written explanation per dimension, increase with tier. Free tier: 4 sub-dimensions total. Starter: 8. Professional: 12. Enterprise: 16. The specific sub-dimensions, their relative weighting, and the prompts used to elicit them are not published.

Tier | Domains assessed | Sub-dimensions | Explanation depth | Batch processing
Free | Design & methodology · Statistical rigour · Bias & confounding · Integrity | 4 | Brief | Not available
Starter | All above + extended statistical and bias dimensions | 8 | Standard | Not available
Professional | All above + Transparency & reproducibility domain | 12 | Detailed | Up to 20 papers
Enterprise | All five domains at full depth | 16 | Comprehensive | Up to 50 papers
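
The tier table can also be read as a small configuration structure. The following sketch encodes only the published figures (sub-dimension counts, explanation depth, and batch limits); the class, dictionary, and function names are hypothetical and do not describe VerifyScience's internal code or API.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierLimits:
    sub_dimensions: int
    explanation_depth: str
    batch_limit: Optional[int]  # None means batch processing is not offered


# Values copied from the tier table; the structure itself is an assumption.
TIERS = {
    "Free": TierLimits(4, "Brief", None),
    "Starter": TierLimits(8, "Standard", None),
    "Professional": TierLimits(12, "Detailed", 20),
    "Enterprise": TierLimits(16, "Comprehensive", 50),
}


def can_submit_batch(tier: str, paper_count: int) -> bool:
    """Return True if the tier allows a batch of the given size."""
    limits = TIERS[tier]  # raises KeyError for an unknown tier name
    return limits.batch_limit is not None and 1 <= paper_count <= limits.batch_limit
```

Under these assumptions, a Professional subscriber could submit a batch of 20 papers but not 21, and Free and Starter subscribers could not submit batches at all.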

Underlying model and processing

Verification is performed by Claude AI (Anthropic). The model reads the full paper content (retrieved via URL, DOI, or direct text input) and produces a structured credibility assessment. The AI is instructed to reason solely from the submitted paper's text, not from prior knowledge of the paper's reputation, citation count, or author profile.

Scores are generated as structured output and parsed deterministically; no score is inferred or interpolated by VerifyScience's own code. The internal prompt architecture, per-domain instructions, and weighting model are proprietary and not disclosed.
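
To illustrate what deterministic parsing of structured output can look like, the sketch below validates a JSON response and rejects anything malformed instead of guessing at missing values. The actual output schema is not published, so the field names (`overall_score`, `criteria`, `score`, `explanation`) and the function names are assumptions for illustration only.

```python
import json


def _valid_score(value) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    is_number = isinstance(value, (int, float)) and not isinstance(value, bool)
    return is_number and 0 <= value <= 100


def parse_assessment(raw: str) -> dict:
    """Parse a JSON assessment, raising ValueError on any malformed input."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("assessment must be a JSON object")
    overall = data.get("overall_score")
    criteria = data.get("criteria")
    if not _valid_score(overall):
        raise ValueError("overall_score is missing or outside 0-100")
    if not isinstance(criteria, list) or not criteria:
        raise ValueError("criteria must be a non-empty list")
    for item in criteria:
        if not isinstance(item, dict) or not _valid_score(item.get("score")):
            raise ValueError("each criterion needs a score between 0 and 100")
        if not isinstance(item.get("explanation"), str):
            raise ValueError("each criterion needs a written explanation")
    # Scores are returned exactly as supplied; nothing is inferred or filled in.
    return {"overall_score": overall, "criteria": criteria}
```

The design choice here is to fail loudly: a response with a missing or out-of-range score raises an error instead of being repaired, which matches the stated principle that no score is inferred or interpolated.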

Free tier verifications use Claude Haiku (fast, cost-efficient). Starter, Professional, and Enterprise tiers use Claude Sonnet (higher reasoning depth, longer explanations). The model version is fixed at deployment and updated only with advance notice in the changelog.

Known limitations

Content accessibility
Papers behind paywalls that do not expose full text via URL will be assessed on abstracts alone. Abstract-only analysis materially reduces score reliability across most analytical domains.
Disciplinary variation
The assessment is calibrated for empirical research. Theoretical papers, mathematical proofs, qualitative studies, and computational work will produce less reliable scores, as the analytical framework does not apply uniformly across all research traditions.
LLM hallucination
The AI may misread technical details, mischaracterise study designs, or produce explanations that sound plausible but do not accurately reflect the paper's content. All assessments should be verified against the source by a qualified reader.
Recency and replication
VerifyScience does not monitor papers post-verification. Retractions, corrections, failed replications, or post-publication peer review that emerge after the verification date are not reflected in the score.
Language coverage
Assessment quality is highest for English-language papers. Non-English papers are supported but may exhibit reduced accuracy, particularly where nuanced methodological language is involved.
No ground truth validation
VerifyScience scores have not been formally validated against expert consensus panels or retraction databases at scale. The tool is in active development; accuracy benchmarks will be published as data accumulates.

EU AI Act compliance status

VerifyScience operates under the EU AI Act's transparency obligations for AI-assisted tools used in research and information contexts. Current compliance status: approximately 90%.

Deployed measures include: an AI disclosure banner on every results page, a dedicated transparency page at /ai-info.html, footer AI information links on all pages, EU AI Act disclaimers on every PDF export, and explicit acknowledgement that outputs are AI-generated assessments rather than authoritative judgments. Remaining items (enhanced PDF disclaimers and extended FAQ coverage) are scheduled for completion in Q2 2026.

VerifyScience does not use user-submitted papers to train AI models. Paper content is processed transiently for verification and is not retained beyond the session unless explicitly saved to verification history by the user. Full data handling details are available in the Privacy Policy.

Intended and non-intended use

Use case | Assessment
Rapid first-pass screening of papers for literature reviews | Well suited
Prioritising which papers merit deep expert review | Well suited
Systematic review pre-screening (with human verification) | Well suited
Embedding credibility scores in shared bibliographies | Well suited
Batch assessment of large literature corpora via API | Well suited
Sole basis for funding or grant allocation decisions | Not intended
Replacement for formal peer review | Not intended
Clinical practice guideline development | Not intended
Legal or regulatory evidence assessment | Not intended
Sole basis for retraction or misconduct investigations | Not intended

Questions about methodology?

We welcome scrutiny from researchers and research integrity professionals. Write to us at [email protected] or use the support chat.