Performance and usability of machine learning for screening in systematic reviews: a comparative evaluation of three tools.

Allison Gates, Samantha Guitard, Jennifer Pillay, Sarah A Elliott, Michele P Dyson, Amanda S Newton, Lisa Hartling
Author Information
  1. Allison Gates: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  2. Samantha Guitard: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  3. Jennifer Pillay: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  4. Sarah A Elliott: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  5. Michele P Dyson: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  6. Amanda S Newton: Department of Pediatrics, University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada.
  7. Lisa Hartling: Department of Pediatrics, Alberta Research Centre for Health Evidence and the University of Alberta Evidence-based Practice Center, University of Alberta, 11405 87 Ave NW, Edmonton, Alberta, T6G 1C9, Canada. hartling@ualberta.ca.

Abstract

BACKGROUND: We explored the performance of three machine learning tools designed to facilitate title and abstract screening in systematic reviews (SRs) when used to (a) eliminate irrelevant records (automated simulation) and (b) complement the work of a single reviewer (semi-automated simulation). We evaluated user experiences for each tool.
METHODS: We subjected three SRs to two retrospective screening simulations. In each tool (Abstrackr, DistillerSR, RobotAnalyst), we screened a 200-record training set and downloaded the predicted relevance of the remaining records. We calculated the proportion missed and workload and time savings compared to dual independent screening. To test user experiences, eight research staff tried each tool and completed a survey.
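The abstract does not spell out the formulas behind the three metrics; the following is a minimal Python sketch of how they could be computed, assuming a fixed per-record screening rate and a dual independent screening baseline. The function names, the 0.5 min/record default, and the example counts are illustrative assumptions, not the authors' values.

  # Minimal sketch of the three evaluation metrics (illustrative only;
  # the 0.5 min/record screening rate is an assumption, not a reported value).

  def proportion_missed(relevant_excluded: int, relevant_total: int) -> float:
      """Percent of truly relevant records the simulation would have excluded."""
      return 100 * relevant_excluded / relevant_total

  def workload_savings(records_auto_excluded: int, records_total: int) -> float:
      """Percent of records a human would no longer need to screen."""
      return 100 * records_auto_excluded / records_total

  def time_savings_hours(records_auto_excluded: int,
                         minutes_per_record: float = 0.5,
                         reviewers: int = 2) -> float:
      """Screening hours avoided relative to dual independent screening."""
      return records_auto_excluded * minutes_per_record * reviewers / 60

  # Example: a review of 10,000 records where a tool excludes 8,500 of them.
  print(workload_savings(8500, 10000))   # 85.0 (percent)
  print(time_savings_hours(8500))        # ~141.7 hours at 30 s/record, 2 reviewers
  print(proportion_missed(3, 60))        # 5.0 (percent) if 3 of 60 relevant are lost

In the semi-automated simulation, predictions replace only the second reviewer, so plausibly the same formulas apply with reviewers = 1, which would roughly halve the savings and is consistent with the 32 to 49 percent figures reported below.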
RESULTS: Using Abstrackr, DistillerSR, and RobotAnalyst, respectively, the median (range) proportion missed was 5 (0 to 28) percent, 97 (96 to 100) percent, and 70 (23 to 100) percent for the automated simulation and 1 (0 to 2) percent, 2 (0 to 7) percent, and 2 (0 to 4) percent for the semi-automated simulation. The median (range) workload savings was 90 (82 to 93) percent, 99 (98 to 99) percent, and 85 (85 to 88) percent for the automated simulation and 40 (32 to 43) percent, 49 (48 to 49) percent, and 35 (34 to 38) percent for the semi-automated simulation. The median (range) time savings was 154 (91 to 183), 185 (95 to 201), and 157 (86 to 172) hours for the automated simulation and 61 (42 to 82), 92 (46 to 100), and 64 (37 to 71) hours for the semi-automated simulation. Abstrackr identified 33 to 90 percent of the records missed by a single reviewer. RobotAnalyst performed less well, and DistillerSR provided no relative advantage. User experiences depended on user friendliness, qualities of the user interface, features and functions, trustworthiness, ease and speed of obtaining predictions, and practicality of the export file(s).
CONCLUSIONS: The workload savings afforded in the automated simulation came with an increased risk of missing relevant records. Supplementing a single reviewer's decisions with relevance predictions (semi-automated simulation) sometimes reduced the proportion missed, but performance varied by tool and SR. Designing tools based on reviewers' self-identified preferences may improve the tools' compatibility with existing workflows.
SYSTEMATIC REVIEW REGISTRATION: Not applicable.

Keywords

Automation; Machine learning; Systematic reviews; Usability; User experience

Grants

  1. 290-2015-00001-I/Agency for Healthcare Research and Quality

MeSH Terms

Abstracting and Indexing
Humans
Information Storage and Retrieval
Machine Learning
Reproducibility of Results
Software
Systematic Reviews as Topic
Time Factors
Workload
