Methodology — VerifyScience

What the confidence score represents

Each verification produces a score from 0 to 100, representing an AI-assisted assessment of methodological credibility — not scientific truth. The score reflects how rigorously a paper reports its methods, handles its data, and acknowledges its limitations. A high score means the methodology is well-documented and internally consistent. It does not mean the findings are correct, replicable, or clinically significant.

The score is derived by using Claude AI (Anthropic) to evaluate the paper's content against a set of methodological criteria. Assessment depth increases with subscription tier (Free, Starter, Professional, Enterprise). Each criterion receives an independent sub-score and explanation, displayed alongside the overall figure.

The overall confidence score is a holistic assessment of the paper's credibility, constrained to within ±20 points of the mean of the per-criterion scores. It reflects the paper's methodological rigour, its publication context, its post-publication record, and the weight of subsequent evidence. A server-side guardrail enforces the ±20 constraint automatically: if the AI-generated score falls outside this band, it is clamped to the nearest bound and the adjustment is logged for audit. For example, with a per-criterion mean of 70, the overall score is held between 50 and 90.
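
The ±20 guardrail described above can be sketched in a few lines of Python. This is an illustrative sketch only, not VerifyScience's server code: the function name `clamp_overall`, the logger name, and the choice to also keep the result within 0 to 100 are assumptions made for the example.

```python
import logging
from statistics import mean

logger = logging.getLogger("guardrail")  # hypothetical logger name

BAND = 20  # maximum allowed deviation from the per-criterion mean


def clamp_overall(ai_score: float, criterion_scores: list[float]) -> tuple[float, bool]:
    """Return (final_score, was_clamped) for an AI-generated overall score."""
    if not criterion_scores:
        raise ValueError("at least one criterion score is required")
    centre = mean(criterion_scores)
    # Assumption: the band is also kept inside the 0 to 100 score range.
    low = max(0, centre - BAND)
    high = min(100, centre + BAND)
    final = min(max(ai_score, low), high)
    clamped = final != ai_score
    if clamped:
        # The adjustment is recorded so it can be audited later.
        logger.info("overall score clamped from %s to %s", ai_score, final)
    return final, clamped
```

With sub-scores of 60, 70, and 80 the mean is 70, so a generated score of 95 would be reduced to 90, while a generated score of 75 would pass through unchanged.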

VerifyScience assessments are produced by a large language model and are subject to hallucination, misinterpretation, and coverage gaps. They are designed to accelerate human expert review — not to replace it. No automated tool, including VerifyScience, should be the sole basis for decisions about research inclusion, citation, funding, or clinical practice.

Score range	Interpretation	Typical characteristics
80 – 100	Strong methodological rigour	The paper demonstrates well-documented methods, appropriate analytical choices, and candid acknowledgement of limitations.
65 – 79	Adequate with minor concerns	Broadly sound methodology with one or two addressable weaknesses. Suitable for inclusion with caveats.
50 – 64	Moderate — verify before citing	Meaningful methodological concerns identified. Warrants careful reading and expert scrutiny before reliance.
Below 50	Significant concerns identified	The assessment flagged substantive issues affecting the reliability of findings. Exercise significant caution.
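
For readers consuming scores programmatically, the bands in the table above amount to a simple threshold lookup. The sketch below is one way to express it; the function name `band_for` and the short labels are hypothetical identifiers chosen for the example, not values returned by VerifyScience.

```python
# (lower bound, hypothetical short label) pairs, highest band first.
BANDS = [
    (80, "strong"),                # 80 to 100
    (65, "adequate"),              # 65 to 79
    (50, "moderate"),              # 50 to 64
    (0, "significant_concerns"),   # below 50
]


def band_for(score: int) -> str:
    """Map a 0 to 100 confidence score to its interpretation band."""
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    for lower, label in BANDS:
        if score >= lower:
            return label
    raise AssertionError("unreachable: the last band starts at 0")
```

Listing the bands from highest to lowest means the first matching lower bound decides the result, so the boundaries (80, 65, 50) each belong to the higher band, as in the table.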

What is assessed

VerifyScience evaluates each paper across five analytical domains. Within each domain, the number of dimensions examined and the depth of explanation provided increase with subscription tier. The specific questions asked within each domain, the weighting applied, and the internal prompt structure are proprietary and not disclosed. This is intentional, both to protect the integrity of the assessment and to prevent gaming.

What is disclosed: the five domains, what each domain is designed to answer, and how tier depth affects the assessment.

Domain-aware calibration. VerifyScience automatically detects the research domain and methodology type of each paper, distinguishing, for example, a randomised controlled trial from a qualitative interview study or a computational simulation paper. The assessment criteria are calibrated accordingly. The specific frameworks applied for each research domain are proprietary and not disclosed, consistent with the anti-gaming principle above. Professional and Enterprise subscribers see an abstracted domain indicator in their results: "Assessment calibrated for: [domain category]". This indicator reflects the detected research domain, not the name of any underlying checklist or framework.

Design & methodology

Does the study design match its research question? Is the methodology sufficiently documented to assess validity? Are the procedures replicable in principle?

All tiers · Core dimensions

Statistical rigour

Do the statistical methods support the conclusions drawn? Is significance interpreted appropriately? Higher tiers examine additional dimensions of quantitative reasoning that are not evaluated at the Free tier.

All tiers · Depth increases at Starter+

Bias & confounding

What sources of bias are present or absent? How well are confounding factors controlled? Are study limitations acknowledged candidly? This domain draws on established risk-of-bias frameworks.

All tiers · Depth increases at Starter+

Transparency & reproducibility

Is the research independently verifiable? Is underlying data accessible? Are methods described with sufficient precision? This domain is assessed at Professional tier and above; it is not available at Free or Starter.

Professional+ · Not available at Free or Starter

Integrity & reporting standards

Are relevant reporting guidelines followed? Are conflicts of interest and ethical approvals disclosed? Are claims proportionate to the evidence presented? Enterprise tier applies the most comprehensive assessment in this domain, including dimensions not assessed at lower tiers.

Enterprise depth · Partial at Professional

The number of sub-dimensions examined within each domain — and the length and depth of the AI's written explanation per dimension — increases with tier. Free tier: 4 sub-dimensions total. Starter: 8. Professional: 12. Enterprise: 16. The specific sub-dimensions, their relative weighting, and the prompts used to elicit them are not published.

Tier	Domains assessed	Sub-dimensions	Explanation depth	Batch processing
Free	Design & methodology · Statistical rigour · Bias & confounding · Integrity	4	Brief	—
Starter	All above + extended statistical and bias dimensions	8	Standard	—
Professional	All above + Transparency & reproducibility domain	12	Detailed	Up to 20 papers
Enterprise	All five domains at full depth	16	Comprehensive	Up to 50 papers
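
Client code that needs to reason about tier limits could encode the table above as data. The following is a minimal sketch under that assumption: the `Tier` class and `can_batch` helper are hypothetical names for illustration, and only the figures shown in the table are used.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    sub_dimensions: int
    explanation_depth: str
    batch_limit: Optional[int]  # None means batch processing is not offered


# Values copied from the tier table; the structure itself is illustrative.
TIERS = {
    t.name: t
    for t in (
        Tier("Free", 4, "Brief", None),
        Tier("Starter", 8, "Standard", None),
        Tier("Professional", 12, "Detailed", 20),
        Tier("Enterprise", 16, "Comprehensive", 50),
    )
}


def can_batch(tier_name: str, n_papers: int) -> bool:
    """Return True if a batch of n_papers fits within the tier's batch limit."""
    limit = TIERS[tier_name].batch_limit
    return limit is not None and 1 <= n_papers <= limit
```

Under these figures, a Professional subscriber could submit 20 papers in one batch but not 21, and Free and Starter have no batch option at all.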

Underlying model and processing

Verification is performed by Claude AI (Anthropic). The model reads the full paper content — retrieved via URL, DOI, or direct text input — and produces a structured credibility assessment. The AI is instructed to reason solely from the submitted paper's text, not from prior knowledge of the paper's reputation, citation count, or author profile.

Scores are generated as structured output and parsed deterministically — no score is inferred or interpolated by VerifyScience's own code. The internal prompt architecture, per-domain instructions, and weighting model are proprietary and not disclosed.
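
To illustrate what deterministic parsing of structured output can look like, the sketch below validates a JSON payload and returns the scores exactly as supplied, rejecting anything malformed instead of guessing. The payload shape (`overall_score`, `criteria`, `score`, `explanation`) is an assumption made for this example; the actual output format is not published.

```python
import json


def _is_score(value: object) -> bool:
    # bool is a subclass of int in Python, so it is excluded explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and 0 <= value <= 100


def parse_assessment(raw: str) -> dict:
    """Parse a hypothetical assessment payload; raise ValueError if it is invalid."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("payload must be a JSON object")
    overall = data.get("overall_score")
    criteria = data.get("criteria")
    if not _is_score(overall):
        raise ValueError("overall_score must be an integer from 0 to 100")
    if not isinstance(criteria, list) or not criteria:
        raise ValueError("criteria must be a non-empty list")
    for item in criteria:
        if not isinstance(item, dict) or not _is_score(item.get("score")):
            raise ValueError("each criterion needs an integer score from 0 to 100")
        if not isinstance(item.get("explanation"), str):
            raise ValueError("each criterion needs a text explanation")
    # Nothing is inferred or interpolated: values are returned as received.
    return {"overall_score": overall, "criteria": criteria}
```

The design choice worth noting is that invalid or missing values cause a failure rather than a default, which is what keeps the displayed score traceable to the model's structured output.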

Free tier verifications use Claude Haiku (fast, cost-efficient). Starter, Professional, and Enterprise tiers use Claude Sonnet (higher reasoning depth, longer explanations). The model version is fixed at deployment and updated only with advance notice in the changelog.

Known limitations

Content accessibility

Papers behind paywalls that do not expose full text via URL will be assessed on abstracts alone. Abstract-only analysis materially reduces score reliability across most analytical domains.

Disciplinary variation

The assessment is calibrated for empirical research. Theoretical papers, mathematical proofs, qualitative studies, and computational work will produce less reliable scores, as the analytical framework does not apply uniformly across all research traditions.

LLM hallucination

The AI may misread technical details, mischaracterise study designs, or produce explanations that sound plausible but do not accurately reflect the paper's content. All assessments should be verified against the source by a qualified reader.

Recency and replication

VerifyScience does not monitor papers post-verification. Retractions, corrections, failed replications, or post-publication peer review that emerge after the verification date are not reflected in the score.

Language coverage

Assessment quality is highest for English-language papers. Non-English papers are supported but may exhibit reduced accuracy, particularly where nuanced methodological language is involved.

No ground truth validation

VerifyScience scores have not been formally validated against expert consensus panels or retraction databases at scale. The tool is in active development; accuracy benchmarks will be published as data accumulates.

EU AI Act compliance status

VerifyScience operates under the EU AI Act's transparency obligations for AI-assisted tools used in research and information contexts. Current compliance status: approximately 90% of the planned measures are in place.

Deployed measures include: an AI disclosure banner on every results page, a dedicated transparency page at /ai-info.html, footer AI information links on all pages, EU AI Act disclaimers on every PDF export, and explicit acknowledgement that outputs are AI-generated assessments rather than authoritative judgments. Remaining items — enhanced PDF disclaimers and extended FAQ coverage — are scheduled for completion in Q2 2026.

VerifyScience does not use user-submitted papers to train AI models. Paper content is processed transiently for verification and is not retained beyond the session unless explicitly saved to verification history by the user. Full data handling details are available in the Privacy Policy.

Intended and non-intended use

Use case	Assessment
Rapid first-pass screening of papers for literature reviews	Well suited
Prioritising which papers merit deep expert review	Well suited
Systematic review pre-screening (with human verification)	Well suited
Embedding credibility scores in shared bibliographies	Well suited
Batch assessment of large literature corpora via API	Well suited
Sole basis for funding or grant allocation decisions	Not intended
Replacement for formal peer review	Not intended
Clinical practice guideline development	Not intended
Legal or regulatory evidence assessment	Not intended
Sole basis for retraction or misconduct investigations	Not intended

Questions about methodology?

We welcome scrutiny from researchers and research integrity professionals. Write to us at [email protected] or use the support chat.
