How assessing AI tools for systematic reviews protects rigor

18 Jun 2026

Assessing AI tools for systematic reviews protects rigor and speeds up reviews through pre-use checks and transparent reporting.

Assessing AI tools for systematic reviews helps teams gain speed without losing rigor. Use a clear pre-use check, confirm evidence and data sources, and set performance thresholds. Pair AI with strong human oversight and transparent reporting. Stop using a tool if it fails validation, harms usability, or cannot meet predefined standards.

Artificial intelligence can help with searching, screening, and drafting. But good science needs care. Start by assessing AI tools for systematic reviews before you add them to your workflow. You are still responsible for methods, results, and how AI is used. The goal is simple: faster work without weaker evidence.

Assessing AI tools for systematic reviews: what to check before you start

Clarify purpose and scope

What task will the tool perform (e.g., title screening, data extraction, text drafting)?

Is this task high risk for bias or error in your review?

Check data sources and validation

Where did training, testing, and validation data come from?

Is there peer‑reviewed validation in a context like yours?

Test performance and risks

Are results reproducible across datasets?

Are known risks manageable with human checks?

Usability and user capability

Can your team use the tool correctly after standard onboarding?

Does the interface fit your workflow and timelines?

Transparency, licensing, availability

Are methods, versioning, and documentation clear?

Do terms of use protect your data (e.g., opt‑out from training)?

Red flags: when to walk away

No relevant validation or only developer-led claims

Results that others cannot reproduce

Legal, policy, or ethics conflicts

Terms that allow your content to train models without opt‑out

No monitoring, audit trail, or human-in-the-loop design

Poor developer transparency or slow support

If you see any of these issues, do not proceed. If the evidence is decent but incomplete, you may proceed with mitigations and plan additional checks.
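The go/no-go rule above can be sketched as a small decision function. This is an illustration only: the flag keys, the `pre_use_decision` name, and the three outcome strings are assumptions made for this example and do not come from RAISE or Cochrane.

```python
from typing import List

# Short keys for the red flags listed above (the key names are hypothetical).
RED_FLAGS = (
    "no_relevant_validation",
    "not_reproducible",
    "legal_policy_or_ethics_conflict",
    "training_on_content_without_opt_out",
    "no_monitoring_or_human_in_the_loop",
    "poor_transparency_or_support",
)


def pre_use_decision(observed_flags: List[str], evidence_complete: bool) -> str:
    """Return a go/no-go outcome for a candidate tool."""
    unknown = [flag for flag in observed_flags if flag not in RED_FLAGS]
    if unknown:
        raise ValueError(f"Unknown red flags: {unknown}")
    if observed_flags:
        # Any single red flag is enough to walk away.
        return "do not proceed"
    if not evidence_complete:
        # Decent but incomplete evidence: add human checks or a SWaR.
        return "proceed with mitigations"
    return "proceed"


print(pre_use_decision(["not_reproducible"], evidence_complete=True))
print(pre_use_decision([], evidence_complete=False))
```

Recording the observed flags alongside the outcome gives the team an audit trail for the decision.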

Show the tool won’t weaken your review

RAISE guidance categorises tool uses across review stages. Some uses may be acceptable with disclosure. Others require human verification or validation within your review (a “study within a review,” or SWaR). For large language model tools, current advice is to proceed with mitigations: use human checks for critical outputs or run a SWaR.

Real-world validation: learn from platform studies

Cochrane’s CESAR project tests AI tools across multiple review updates. Tools stay only if they meet predefined thresholds for screening, data extraction, and usability. This approach shows how assessing AI tools for systematic reviews can protect quality while generating evidence about what “good enough” looks like.

When speed backfires

AI should save time. If, after normal onboarding, a tool slows you down, increases confusion, or lacks support, that is a strong reason to stop using it.

Set thresholds that matter

Define performance thresholds before you test a tool. Two types help decision-making:

Futility boundary: the lowest performance you will accept. Falling below it means stop.

Non‑inferiority margin: the level you aim for. If even optimistic estimates cannot reach it, the tool is not promising enough.

Examples inspired by current practice:

Screening sensitivity: stop if below about 80%, and aim for near 95% on upper confidence limits.

Full‑text specificity: stop if it drops near 50%, and target around 60% or higher on upper limits.

Data extraction sensitivity: stop if under roughly 92%, and seek near 97% on upper limits.

Major extraction errors: stop if these exceed around 3%, and aim for at most about 2% on upper limits.

Usability (SUS score): stop if the score is clearly poor (around the high 50s). Aim for a good experience (mid‑70s or higher).

These numbers are examples to guide thinking. Your thresholds may differ by task and risk. Set them prospectively, justify them, and stick to them.
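A minimal sketch of how the two thresholds could be applied to a "higher is better" metric such as screening sensitivity. It assumes a Wilson score interval for the upper confidence limit and a two-step rule (stop if the point estimate is below the futility boundary, or if the upper limit cannot reach the non-inferiority margin). The article does not specify the interval method or the exact rule, and the names `wilson_interval` and `judge_tool` are hypothetical. Error-rate metrics, where lower is better, would need the comparisons reversed.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return max(0.0, centre - half), min(1.0, centre + half)


def judge_tool(successes: int, n: int, futility: float, margin: float) -> str:
    """Apply a futility boundary and a non-inferiority margin to one metric."""
    estimate = successes / n
    _, upper = wilson_interval(successes, n)
    if estimate < futility:
        return "stop: below futility boundary"
    if upper < margin:
        # Even the optimistic estimate cannot reach the level aimed for.
        return "stop: margin out of reach"
    return "continue"


# Illustrative counts only: relevant records correctly included out of 100.
for found in (70, 85, 96):
    print(found, judge_tool(found, 100, futility=0.80, margin=0.95))
```

With the example thresholds of 80% and 95%, 70 of 100 falls below the futility boundary, 85 of 100 clears it but has an upper limit short of the margin, and 96 of 100 passes both checks.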

Report AI use with precision

Readers must know what you used, how you used it, and why. Include:

Tool name, version, and date of use

Exact purpose in the review process

Parameters, prompts, or custom settings

Human oversight steps: review, verification, or override

How you judged the tool methodologically sound

How you validated or calibrated it for your context

Known limitations, biases, and ethical issues

Links to protocols, guides, or supplementary methods
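One way to keep these items together is a structured record that is completed before publication. The sketch below is an assumption about how a team might do this, not a reporting standard: the `AIUseReport` class, its field names, and every example value (including the tool name "ExampleScreener") are hypothetical.

```python
from dataclasses import dataclass, fields
from typing import Dict, List


@dataclass
class AIUseReport:
    """One record per AI tool used in a review (field names are illustrative)."""
    tool_name: str
    version: str
    date_of_use: str
    purpose: str
    settings: Dict[str, str]      # parameters, prompts, or custom settings
    human_oversight: List[str]    # review, verification, or override steps
    soundness_rationale: str      # how the tool was judged methodologically sound
    validation: str               # validation or calibration in this context
    limitations: List[str]        # known limitations, biases, ethical issues
    links: List[str]              # protocols, guides, supplementary methods

    def missing_items(self) -> List[str]:
        """Names of reporting items that are still empty."""
        return [f.name for f in fields(self) if not getattr(self, f.name)]


report = AIUseReport(
    tool_name="ExampleScreener",  # hypothetical tool
    version="",                   # left blank to show the completeness check
    date_of_use="2026-06-18",
    purpose="title and abstract screening",
    settings={"inclusion_threshold": "0.5"},
    human_oversight=["second reviewer checks all exclusions"],
    soundness_rationale="independent validation in a similar review context",
    validation="study within a review on a pilot sample",
    limitations=["not validated for non-English records"],
    links=[],
)
print(report.missing_items())  # ['version', 'links']
```

A check like `missing_items` makes gaps visible early, so no reporting item is silently dropped from the methods section.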

Human oversight is non-negotiable

People remain accountable for choices and outcomes. Use public information, published evidence, and your own verification or validation to justify every use of AI. Document decisions, monitor outputs, and be ready to stop if a tool misses your thresholds or introduces risk.

Strong practice grows when teams share results and publish evaluations. Platform studies and community standards will refine thresholds over time. Until then, lead with clarity and caution.

Good evidence comes from good choices. By assessing AI tools for systematic reviews, setting clear thresholds, and reporting transparently, you gain efficiency while protecting rigor.

(Source: https://www.cochrane.org/about-us/news/right-tool-right-job-deciding-when-not-use-ai-tool)

FAQ

Q: What is the first step when assessing AI tools for systematic reviews?

A: When assessing AI tools for systematic reviews, start by using the responsible handover framework from RAISE 3 to clarify the tool’s purpose, provenance of training and validation data, validation status, usability, and transparency/licensing. This assessment can rely on public information and may include contacting developers for missing details before you decide to proceed.

Q: What red flags should make me decide not to use an AI tool?

A: When assessing AI tools for systematic reviews, watch for red flags such as lack of relevant validation, validation that is not replicable, performance claims based only on developer‑led studies, legal or policy conflicts, terms that permit reuse of your content for training without opt‑out, inadequate human oversight, or poor developer transparency and support. If you see any of these issues, the guidance is to stop and not proceed with the tool.

Q: How can I demonstrate an AI tool will not compromise the methodological rigour of my review?

A: When assessing AI tools for systematic reviews, you should use evidence gathered via the responsible handover framework and then carry out human verification or validation as needed. RAISE guidance categorises tool uses from acceptable to not acceptable and recommends disclosure, human verification, or validation within the review (a SWaR) depending on the category, with current advice to proceed with mitigations for large language models.

Q: What performance thresholds should teams set before validating an AI tool?

A: When assessing AI tools for systematic reviews, set performance thresholds prospectively by defining a futility boundary (minimum acceptable) and a non‑inferiority margin (target or aim). CESAR gives example thresholds: for screening sensitivity, stop if under about 80% and aim for an upper confidence limit near 95%; for full‑text specificity, stop near 50% and aim around 60%; for data‑extraction sensitivity, stop near 92% and aim near 97%; for major extraction errors, stop if above about 3% and aim for about 2% on upper limits; and for usability (SUS), stop if in the high‑50s and aim for mid‑70s. Use these examples as guides, justify your own thresholds for the task and risk level, and adhere to them during validation.

Q: What should I include in my review to transparently report AI use?

A: When assessing AI tools for systematic reviews, fully disclose the AI system name, version, date of use, developer, specific purpose in the review, any customisations or parameters, and the degree of human oversight. Also report how you judged it methodologically sound, any validation or calibration performed in your context, known limitations or biases, and links to protocols or supplementary methods.

Q: Is human oversight always required when using AI in evidence synthesis?

A: Yes. Expectation four in Cochrane’s guidance is that AI must be used with human oversight, because people remain accountable for research decisions and outcomes. When assessing AI tools for systematic reviews, you should document monitoring, verification or override steps and be prepared to stop use if outputs fail your checks or thresholds.

Q: When is it acceptable to proceed with an AI tool that has incomplete validation or evidence gaps?

A: When assessing AI tools for systematic reviews, you may proceed with mitigations if a tool is promising but has evidence gaps or moderate, monitorable risks, provided you plan additional verification or validation. Such mitigations can include human verification for critical outputs or running a study within a review (SWaR) to validate the tool’s performance in your context.

Q: What practical signs should make teams stop using an AI tool even if it passes validation?

A: A tool should be stopped if it fails to meet predefined performance thresholds, harms usability by slowing workflows or increasing confusion after standard onboarding, or lacks adequate support or auditability. When assessing AI tools for systematic reviews, these practical usability and support failures are valid reasons to discontinue use even if prior validation existed.