A precise account of what the confidence score measures, how it is constructed, where it is reliable, and where it is not. Intended for researchers, research integrity officers, and institutional evaluators.
Each verification produces a score from 0 to 100, representing an AI-assisted assessment of methodological credibility, not scientific truth. The score reflects how rigorously a paper reports its methods, handles its data, and acknowledges its limitations. A high score means the methodology is well-documented and internally consistent. It does not mean the findings are correct, replicable, or clinically significant.
The score is derived by evaluating a fixed set of criteria (4 to 16, depending on subscription tier) against the paper's content using Claude AI (Anthropic). Each criterion receives an independent sub-score and explanation, displayed alongside the overall figure.
The overall confidence score is a holistic assessment of the paper's credibility, constrained to within ±20 points of the mean of the per-criterion scores. It reflects the paper's methodological rigour, its publication context, its post-publication record, and the weight of subsequent evidence. A server-side guardrail enforces the ±20 constraint automatically; if the AI-generated score deviates beyond this band, it is clamped and the adjustment is logged for audit.
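The guardrail described above can be illustrated with a minimal sketch. This is not VerifyScience's server code: the function name `clamp_overall_score`, the logger name, and the decision to also keep the band inside the 0 to 100 scale are assumptions made for illustration.

```python
import logging
from statistics import fmean

# Hypothetical logger name; the real audit log is not described in the text.
logger = logging.getLogger("score_guardrail")

MAX_DEVIATION = 20.0  # the ±20-point band around the criterion mean


def clamp_overall_score(ai_score: float, criterion_scores: list[float]) -> float:
    """Return ai_score limited to within ±20 points of the per-criterion mean."""
    if not criterion_scores:
        raise ValueError("at least one per-criterion score is required")

    centre = fmean(criterion_scores)
    # Assumption: the band is additionally kept inside the 0-100 score scale.
    lower = max(0.0, centre - MAX_DEVIATION)
    upper = min(100.0, centre + MAX_DEVIATION)

    clamped = min(max(float(ai_score), lower), upper)
    if clamped != ai_score:
        # Record the adjustment so it can be audited later.
        logger.info(
            "overall score clamped from %.1f to %.1f (criterion mean %.1f)",
            ai_score, clamped, centre,
        )
    return clamped
```

For example, with per-criterion scores of 60, 70, 65, and 65 the mean is 65, so an AI-generated overall score of 95 would be clamped to 85, while a score of 70 would pass through unchanged.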
| Score range | Interpretation | Typical characteristics |
|---|---|---|
| 80–100 | Strong methodological rigour | The paper demonstrates well-documented methods, appropriate analytical choices, and candid acknowledgement of limitations. |
| 65–79 | Adequate with minor concerns | Broadly sound methodology with one or two addressable weaknesses. Suitable for inclusion with caveats. |
| 50–64 | Moderate: verify before citing | Meaningful methodological concerns identified. Warrants careful reading and expert scrutiny before reliance. |
| Below 50 | Significant concerns identified | The assessment flagged substantive issues affecting the reliability of findings. Exercise significant caution. |
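The score bands in the table can be expressed as a simple lookup. The sketch below is illustrative only: `interpret_score` is a hypothetical helper, and treating a fractional score such as 79.5 as belonging to the lower band is an assumption, since the table lists whole-number ranges.

```python
def interpret_score(score: float) -> str:
    """Map a 0-100 confidence score to the interpretation band from the table."""
    if not 0 <= score <= 100:
        raise ValueError(f"score must be between 0 and 100, got {score}")
    # Thresholds are the lower bounds of each band, checked from the top down.
    if score >= 80:
        return "Strong methodological rigour"
    if score >= 65:
        return "Adequate with minor concerns"
    if score >= 50:
        return "Moderate (verify before citing)"
    return "Significant concerns identified"
```

Checking the bands from highest to lowest keeps each threshold in one place, so a change to a band boundary touches a single line.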
VerifyScience evaluates each paper across five analytical domains. Within each domain, the number of dimensions examined and the depth of explanation provided increase with subscription tier. The specific questions asked within each domain, the weighting applied, and the internal prompt structure are proprietary and not disclosed. This is intentional, both to protect the integrity of the assessment and to prevent gaming.
What is disclosed: the five domains, what each domain is designed to answer, and how tier depth affects the assessment.
Domain-aware calibration. VerifyScience automatically detects the research domain and methodology type of each paper, distinguishing, for example, a randomised controlled trial from a qualitative interview study or a computational simulation paper. The assessment criteria are calibrated accordingly. The specific frameworks applied for each domain are proprietary and not disclosed, consistent with the anti-gaming principle above. Professional and Enterprise subscribers see an abstracted domain indicator in their results: "Assessment calibrated for: [domain category]". This indicator reflects the detected research domain, not the name of any underlying checklist or framework.
The number of sub-dimensions examined within each domain, along with the length and depth of the AI's written explanation per dimension, increases with tier. Free tier: 4 sub-dimensions total. Starter: 8. Professional: 12. Enterprise: 16. The specific sub-dimensions, their relative weighting, and the prompts used to elicit them are not published.
| Tier | Domains assessed | Sub-dimensions | Explanation depth | Batch processing |
|---|---|---|---|---|
| Free | Design & methodology · Statistical rigour · Bias & confounding · Integrity | 4 | Brief | Not available |
| Starter | All above + extended statistical and bias dimensions | 8 | Standard | Not available |
| Professional | All above + Transparency & reproducibility domain | 12 | Detailed | Up to 20 papers |
| Enterprise | All five domains at full depth | 16 | Comprehensive | Up to 50 papers |
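The tier table can be captured as a small configuration structure. The following is a sketch under stated assumptions: `TierConfig`, `TIERS`, and `can_submit_batch` are hypothetical names, and the Free and Starter batch cells are read as "batch processing not offered" and modelled as `None`.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class TierConfig:
    sub_dimensions: int          # total sub-dimensions assessed
    explanation_depth: str       # label from the "Explanation depth" column
    batch_limit: Optional[int]   # max papers per batch; None = no batch processing


# Values taken from the tier table; the structure itself is illustrative.
TIERS = {
    "free": TierConfig(4, "Brief", None),
    "starter": TierConfig(8, "Standard", None),
    "professional": TierConfig(12, "Detailed", 20),
    "enterprise": TierConfig(16, "Comprehensive", 50),
}


def can_submit_batch(tier: str, paper_count: int) -> bool:
    """Return True if a submission of paper_count papers fits the tier's limits."""
    if paper_count < 1:
        raise ValueError("paper_count must be at least 1")
    config = TIERS[tier.lower()]
    if paper_count == 1:
        return True  # assumption: a single verification is possible on every tier
    return config.batch_limit is not None and paper_count <= config.batch_limit
```

A caller could check `can_submit_batch("professional", 20)` before queueing a batch and fall back to single submissions when it returns `False`.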
Verification is performed by Claude AI (Anthropic). The model reads the full paper content (retrieved via URL, DOI, or direct text input) and produces a structured credibility assessment. The AI is instructed to reason solely from the submitted paper's text, not from prior knowledge of the paper's reputation, citation count, or author profile.
Scores are generated as structured output and parsed deterministically; no score is inferred or interpolated by VerifyScience's own code. The internal prompt architecture, per-domain instructions, and weighting model are proprietary and not disclosed.
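To illustrate what deterministic parsing without inference can look like, here is a minimal sketch. The JSON shape (`overall_score`, `criteria`, `name`, `score`, `explanation`) is a hypothetical schema invented for this example; the actual output format is not published. The point is that a missing or malformed score raises an error instead of being guessed or filled in.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class CriterionResult:
    name: str
    score: int
    explanation: str


@dataclass(frozen=True)
class Assessment:
    overall_score: int
    criteria: tuple[CriterionResult, ...]


def _require_score(value: object) -> int:
    """Accept only an integer in 0-100; never coerce, round, or default."""
    # bool is a subclass of int in Python, so it is rejected explicitly.
    if isinstance(value, bool) or not isinstance(value, int):
        raise ValueError(f"score must be an integer, got {value!r}")
    if not 0 <= value <= 100:
        raise ValueError(f"score out of range: {value}")
    return value


def parse_assessment(raw: str) -> Assessment:
    """Parse the model's structured output (hypothetical schema) strictly."""
    data = json.loads(raw)
    criteria = tuple(
        CriterionResult(
            name=str(item["name"]),
            score=_require_score(item["score"]),
            explanation=str(item["explanation"]),
        )
        for item in data["criteria"]  # a missing key raises KeyError
    )
    return Assessment(
        overall_score=_require_score(data["overall_score"]),
        criteria=criteria,
    )
```

Because every field is read directly from the payload and validated, the same input always yields the same result, and invalid input fails loudly.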
Free tier verifications use Claude Haiku (fast, cost-efficient). Starter, Professional, and Enterprise tiers use Claude Sonnet (higher reasoning depth, longer explanations). The model version is fixed at deployment and updated only with advance notice in the changelog.
VerifyScience operates under the EU AI Act's transparency obligations for AI-assisted tools used in research and information contexts. Current compliance status: approximately 90%.
Deployed measures include: an AI disclosure banner on every results page, a dedicated transparency page at /ai-info.html, footer AI information links on all pages, EU AI Act disclaimers on every PDF export, and explicit acknowledgement that outputs are AI-generated assessments rather than authoritative judgments. Remaining items (enhanced PDF disclaimers and extended FAQ coverage) are scheduled for completion in Q2 2026.
VerifyScience does not use user-submitted papers to train AI models. Paper content is processed transiently for verification and is not retained beyond the session unless explicitly saved to verification history by the user. Full data handling details are available in the Privacy Policy.
| Use case | Assessment |
|---|---|
| Rapid first-pass screening of papers for literature reviews | Well suited |
| Prioritising which papers merit deep expert review | Well suited |
| Systematic review pre-screening (with human verification) | Well suited |
| Embedding credibility scores in shared bibliographies | Well suited |
| Batch assessment of large literature corpora via API | Well suited |
| Sole basis for funding or grant allocation decisions | Not intended |
| Replacement for formal peer review | Not intended |
| Clinical practice guideline development | Not intended |
| Legal or regulatory evidence assessment | Not intended |
| Sole basis for retraction or misconduct investigations | Not intended |