
Exploration of Analysis Methods for Diagnostic Imaging Tests: Problems with ROC AUC and Confidence Scores in CT Colonography

Susan Mallett 1*, Steve Halligan 2, Gary S. Collins 3, Doug G. Altman 3

1 Department of Primary Care Health Sciences, University of Oxford, Oxford, United Kingdom; 2 Centre for Medical Imaging, University College London, London, United Kingdom; 3 Centre for Statistics in Medicine, University of Oxford, Oxford, United Kingdom

Abstract

Background: Different methods of evaluating diagnostic performance when comparing diagnostic tests may lead to different results. We compared two such approaches, sensitivity and specificity with area under the Receiver Operating Characteristic curve (ROC AUC), for the evaluation of CT colonography for the detection of polyps, either with or without computer assisted detection.

Methods: In a multireader multicase study of 10 readers and 107 cases we compared sensitivity and specificity, using radiological reporting of the presence or absence of polyps, to ROC AUC calculated from confidence scores concerning the presence of polyps. Both methods were assessed against a reference standard. Here we focus on five readers, selected to illustrate issues in design and analysis. We compared diagnostic measures within readers, showing that differences in results are due to statistical methods.

Results: Reader performance varied widely depending on whether sensitivity and specificity or ROC AUC was used. There were problems using confidence scores: in assigning scores to all cases; in the use of zero scores when no polyps were identified; in the bimodal, non-normal distribution of scores; in fitting ROC curves, due to extrapolation beyond the study data; and in the undue influence of a few false positive results. Variation due to the use of different ROC methods exceeded differences between test results for ROC AUC.

Conclusions: The confidence scores recorded in our study violated many assumptions of ROC AUC methods, rendering these methods inappropriate. The problems we identified will apply to other detection studies using confidence scores. We found sensitivity and specificity to be a more reliable and clinically appropriate method to compare diagnostic tests.

Citation: Mallett S, Halligan S, Collins GS, Altman DG (2014) Exploration of Analysis Methods for Diagnostic Imaging Tests: Problems with ROC AUC and Confidence Scores in CT Colonography. PLoS ONE 9(10): e107633. doi:10.1371/journal.pone.0107633

Editor: Jason Mulvenna, Queensland Institute of Medical Research, Australia

Received January 27, 2014; Accepted August 19, 2014; Published October 29, 2014

Copyright: © 2014 Mallett et al. This is an open-access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.

Funding: This work was funded by the United Kingdom (UK) Department of Health via a National Institute for Health Research (NIHR) programme grant (RP-PG ), a Cancer Research UK programme grant (C5529) and the Medical Research Council (G ). A proportion of this work was undertaken at UCLH/UCL, which receives a proportion of funding from the NIHR Comprehensive Biomedical Research Centre funding scheme. The views expressed in this publication are those of the authors and not necessarily those of the UK Department of Health. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing Interests: The authors have read the journal's policy and have the following conflicts: The data collection was funded by Medicsight PLC. Medicsight PLC funded a research grant at the Centre for Statistics in Medicine in Oxford, but had no control over the research that was funded. Professor Halligan was a remunerated research consultant for Medicsight PLC. There are no other patents, products in development or marketed products to declare. This does not alter the authors' adherence to all the PLOS ONE policies on sharing data and materials.

Introduction

Comparisons of diagnostic tests aim to inform healthcare providers and patients which tests are most accurate. The ideal test would give all patients a correct diagnosis, in a short time and with minimal inconvenience to the patient. Unfortunately no test is perfect, and in practice some patients with the target disease will be missed (false negative result), and some patients without disease will be diagnosed incorrectly with disease (false positive result).

Measuring diagnostic performance

There are three main approaches for comparing diagnostic test accuracy that use different statistical measures. In a previous paper we discussed these approaches with illustrative examples [1]. The first approach is to use paired measures at specific test thresholds: either sensitivity and specificity, positive predictive value and negative predictive value (PPV and NPV), or positive likelihood ratio and negative likelihood ratio (LR+ and LR−). A second approach is to examine test performance across all diagnostic test thresholds, using summary measures such as ROC AUC or the diagnostic odds ratio (DOR). A third approach gives an overall measure at a specific threshold (or series of thresholds), reported alongside the paired measures, for example using a weighted comparison measure [2,3] or net benefit [4,5]; a single measure can simplify comparison of overall results, compared with paired measures that are likely to change in different directions, of which sensitivity and specificity are the best known examples. (A short code sketch of the paired measures follows the next subsection.)

Multi-Reader Multi-Case designs

In radiology, multi-reader multi-case (MRMC) studies are often used to compare the diagnostic accuracy of alternative imaging approaches, and this design is currently required by the United States Food and Drug Administration (FDA) for pre-market evaluation [6]. Key attributes of good study design are uncontroversial and include interpretation of medical images from the clinical population of interest by radiologists typical of those who would read the test in clinical practice, unaware of patient disease status or the prevalence of abnormality. Studies often compare test interpretation by the same radiologists in the same patients, with the only difference being the diagnostic test. Multi-reader multi-case studies can use either a fully crossed design, where all readers interpret all patient images, or a split-plot design [7]. Learning and order bias are reduced by presenting images and tests to each reader in random order. Interpretation of the same case is often separated by at least one month to reduce the potential for recall bias.
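To make the paired measures above concrete, the following minimal Python sketch computes them from a single 2×2 table at a fixed test threshold. It is an illustration only: the function name and counts are hypothetical and not taken from the study (though the group sizes echo the 60 polyp patients and 47 normal patients described in Methods).

```python
def paired_measures(tp, fp, fn, tn):
    """Paired accuracy measures from a 2x2 table at one fixed threshold.

    tp/fn: diseased patients correctly/incorrectly classified;
    fp/tn: non-diseased patients incorrectly/correctly classified.
    """
    sensitivity = tp / (tp + fn)               # proportion of diseased detected
    specificity = tn / (tn + fp)               # proportion of non-diseased cleared
    ppv = tp / (tp + fp)                       # P(disease | positive test)
    npv = tn / (tn + fn)                       # P(no disease | negative test)
    lr_pos = sensitivity / (1 - specificity)   # LR+
    lr_neg = (1 - sensitivity) / specificity   # LR-
    return {"sens": sensitivity, "spec": specificity,
            "PPV": ppv, "NPV": npv, "LR+": lr_pos, "LR-": lr_neg}

# Hypothetical counts for one reader: 45 of 60 diseased patients detected,
# 40 of 47 normal patients correctly reported as normal.
print(paired_measures(tp=45, fp=7, fn=15, tn=40))
```

Note how the measures move as a pair: raising the threshold typically trades sensitivity against specificity, which is why single-number summaries such as ROC AUC are attractive yet, as this paper argues, can mislead.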
Clinical utility of CT colonography

Computed tomography (CT) colonography is a CT scanning technique used to identify colon polyps, the precursor of colon cancer. Diagnostic improvement occurs when correct detection of patients with polyps increases (false negative results are reduced), corresponding to an increase in sensitivity, without an unacceptable increase in false positive diagnoses, corresponding to a decrease in specificity. It is important to take disease prevalence into account when balancing changes in sensitivity and specificity. We have recently measured the relative value that patients and clinicians place on false positive results compared to false negative results using discrete choice experiments [8]. Both patients and medical professionals considered reducing false negative results (increasing sensitivity) more desirable than reducing false positive results (decreasing specificity), for both colon polyps and colon cancer [8]. Similarly, in mammography screening, women will exchange 500 false positives for one additional cancer detected [9]. This is pertinent to ROC AUC, where the analysis automatically sets a weighting of the relative importance of diagnoses [1].

Sensitivity and specificity are usually direct measures calculated from diagnostic data reported by radiologists in normal clinical practice, namely the presence or absence of polyps. By contrast, ROC AUC is a summary measure of performance across all potential diagnostic thresholds for positivity, rather than performance at any specific threshold. As such, ROC AUC is classified as a surrogate endpoint [10].

ROC AUC requires confidence scores

ROC AUC is derived from confidence scores, usually assigned by radiologists to indicate confidence in their diagnosis. Confidence scores may or may not form part of the normal clinical report. They can be assigned either to individual lesions within a patient or to an overall patient diagnosis. In imaging studies there are two broad types of clinical scenario in which confidence scores can be assigned to enable calculation of ROC AUC. In classification studies, visualised lesions are classified according to morphological characteristics perceived by the radiologist; for example, in mammography studies lesions are either benign or malignant, and the strength of the radiologist's belief is captured using a confidence scale such as benign, probably benign, equivocal, probably malignant, or definitely malignant. If there is a lesion on every image presented, then the task is purely classification. In some studies the confidence score is adapted from a measure used in clinical practice, such as the BI-RADS scale [11]. In detection (presence versus absence) studies, readers are asked to record their confidence regarding the presence or absence of a lesion rather than its nature; often a scale such as 0 to 100 is used. These confidence scores are often recorded in clinical trials solely to calculate ROC AUC. It has been suggested that lesion size could act as a confidence score for presence/absence studies linked to normal clinical practice. However, this approach is flawed, as lesion size cannot be measured when there is no lesion. Many studies are hybrids of these two scenarios. For example, not all images may contain a lesion, and readers may be asked to classify lesions when present and use a different confidence score when not. Similarly, detection studies may require readers to report confidence scores for abnormalities that they do not classify as lesions.

Aim of research

In this paper we compare two statistical methods for measuring diagnostic performance, namely sensitivity and specificity versus ROC AUC.
When used to compare two diagnostic tests, these methods may estimate diagnostic performance differently. In this article we investigate why this can happen, using data from a previously published clinical study [12], and examine which aspects of study design and characteristics of the data contributed to ROC AUC method assumptions being considered inappropriate. We illustrate the issues using a study comparing CT colonography with and without Computer Assisted Detection (CAD) to identify colon polyps [12]. We compare the diagnostic measure area under the Receiver Operating Characteristic curve (ROC AUC) to sensitivity and specificity. This work was motivated by an FDA strong presumption in favour of using ROC AUC to measure diagnostic accuracy for licensing of CAD in radiological imaging [6]. We identify and present the problems encountered when using ROC AUC to measure diagnostic performance.

Methods

Study design

Full methods for the study are described in the original study publication [12]. In brief, ten radiologists each read CT colonography images from the same 107 patients, reading images with and without CAD assistance to detect colon polyps. Each read was separated by two months to avoid potential recall bias, with both test and patient order randomised for each reader. The reference standard was a consensus of two from a panel of three experienced and independent radiologists who read each case in combination with colonoscopy reports: 60 patients had polyps and 47 were normal. Each reader identified polyps, noting their diameter and location. In addition, they recorded whether they believed the patient case was normal (i.e. no polyps were seen) or abnormal (polyps were reported). All statistical measures were calculated per patient, since a positive CT colonography will mean subsequent colonoscopy (where the entire colon is examined and polyps removed).

Sensitivity was the percentage of patients identified by radiologists as having a polyp, whether through true positive or false positive polyp identifications, among patients positive according to the reference standard. This definition of sensitivity reflects that patients are referred based on identification of polyps in the clinical referral pathway. Specificity was the percentage of patients in whom no polyps were reported by radiologists, among those classified as negative by the reference standard.

Table 1 shows the steps used to calculate ROC AUC in this study. A confidence score between 1 and 100 was reported for each potential polyp identified, with readers instructed to use scores of 25 or above for polyps identified with high confidence and scores of 1 to 24 for abnormalities believed more likely to be something else. Where no confidence score was recorded by radiologists, a zero score was introduced during statistical analysis. Where more than one polyp was recorded per patient, the highest confidence score recorded for that patient was used for analysis. ROC AUC calculations used DBM MRMC v2.1 (uchicago.edu/krl/krl_roc/software_index6.htm) and PROPROC v0.0 [13]. DBM MRMC fits ROC curves based on parametric binormal methods [14]. PROPROC fits ROC curves based on maximum-likelihood estimation using a proper binormal distribution [13]. In this paper, for illustrative purposes, we selected the five of the ten readers that best demonstrate issues when comparing sensitivity and specificity versus ROC AUC.
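The per-patient derivations just described can be sketched in a few lines of Python. The data structures and names below are hypothetical illustrations, not the study's analysis code: each patient carries the reader's lesion-level confidence scores (possibly none), a zero is assigned when no score was recorded, the maximum lesion score represents the patient, and a patient counts as test-positive whenever any polyp was reported.

```python
def patient_score(lesion_scores):
    """Highest lesion confidence represents the patient; zero if none recorded."""
    return max(lesion_scores) if lesion_scores else 0

def sensitivity_specificity(readings, reference):
    """Per-patient sensitivity and specificity, as defined in Methods.

    readings:  dict patient_id -> list of lesion confidence scores (1-100);
               an empty list means the reader reported no polyps.
    reference: dict patient_id -> True if the reference standard found polyps.
    """
    tp = sum(1 for p, has in reference.items() if has and readings.get(p))
    fn = sum(1 for p, has in reference.items() if has and not readings.get(p))
    fp = sum(1 for p, has in reference.items() if not has and readings.get(p))
    tn = sum(1 for p, has in reference.items() if not has and not readings.get(p))
    return tp / (tp + fn), tn / (tn + fp)

# Hypothetical four-patient example.
readings = {"p1": [80, 30], "p2": [], "p3": [15], "p4": []}
reference = {"p1": True, "p2": True, "p3": False, "p4": False}
print(sensitivity_specificity(readings, reference))   # (0.5, 0.5)
print([patient_score(s) for s in readings.values()])  # [80, 0, 15, 0]
```

Note that in this per-patient framing even a low-confidence score (p3's 15) makes the patient test-positive, while the zero assigned to unscored patients exists only to feed the ROC analysis; this asymmetry is central to the problems reported below.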
Results

Different diagnostic performance

We compared the diagnostic performance of two tests to detect colonic polyps, CT colonography either with or without CAD, using the difference in diagnostic accuracy measured by (i) the number of patients correctly diagnosed and (ii) ROC AUC. We expected diagnostic performance to increase when CAD was used. However, we observed no clear relationship between these two measures of diagnostic performance, despite readers and cases being identical (Figure 1A). We then investigated the relationship between the difference in sensitivity and specificity and the difference in ROC AUC (ΔROC AUC) for individual readers, focussing on five of the ten readers as illustrative examples (Figure 1B & 1C). Readers 2, 3 and 5 exhibited clear gains in sensitivity of 21, 22 and 21%, along with decreases in specificity of 15, 11 and 8% respectively. Reader 5 had the best performance, followed by readers 3 and 2. Reader 4 also had a 13% increase in sensitivity with a smaller 4% decrease in specificity. Reader 1, by contrast, showed no increase in sensitivity but, unusually, a 4% increase in specificity.

Use of CAD improved clinical diagnosis in readers 2 to 5 but not in reader 1, based on the large increases in sensitivity when using CAD. As noted above, these are considered more important by both clinicians and patients than smaller reductions in specificity [8]. By contrast, the change in ROC AUC (Figure 1C) defines a positive benefit of CAD in readers 1 and 5 and a negative benefit in readers 2, 3 and 4. Perversely, reader 1 had one of the highest increases in ROC AUC (Figure 1A and 1C), even though CAD had no influence on sensitivity, the most clinically important aspect, and also had little impact on specificity.

Problems recording confidence scores that cause zero values

During our study, readers encountered several problems when assigning the confidence scores needed to derive ROC AUC. A key problem was that radiologists only reported confidence scores for regions of the colon where they identified polyps, despite instructions to use confidence scores between 1 and 24 to report irregularities that were, on balance, likely not polyps. CT colonography of a normal colon identifies many potential abnormalities that are ultimately proven not to be polyps, and it was impracticable to score all of these or to select a meaningful subset to score. Further, when an abnormality believed to be a polyp was encountered, it tended to be reported with high confidence. In order to include all patients in the study, the statistician or data manager assigned a value of zero when a confidence score was not assigned by a radiologist.

Figure 2 shows the distribution of confidence scores for five readers. The most common score for every reader was zero. This zero-inflated spike accompanies a second distribution of the confidence scores assigned to abnormalities believed to be polyps. The result is a bimodal distribution of confidence scores that cannot be transformed to a normal distribution by the simple data transformations used in the standard open source software [15] developed for these analyses (Figure 2). Despite instructions in which distinct ranges of scores were linked to descriptions of confidence, each reader interpreted the guidance differently and used the scores in different ways.

Examples of distributions of confidence scores from the literature

Very few published articles using MRMC ROC AUC report the distribution of confidence scores.
We identified only two examples from the literature where individual reader scores were reported, and another where the distribution of scores across the group of readers was shown (Figure 3) [16-18]. These examples show clearly that the distribution of confidence scores is not close to normality in either patient group, with or without monotonic transformation. However, they generally have one peak (unimodal) rather than the bimodal, zero-inflated distribution observed in our study.

Table 1. Steps in calculation of ROC AUC.

Step A: Assigning confidence scores
- Confidence scores were assigned by radiologists. Missing values were assigned a value of zero by the data manager or statistician.

Step B: Building the ROC curve from confidence scores and calculating ROC AUC
- Distributions of the confidence scores were examined, with evaluation of potential limitations arising from non-normal distributions or extreme values.
- Real data points generated directly from the confidence scores were presented in ROC space.
- ROC curves were fitted using both parametric and nonparametric methods, and differences in the resulting ROC AUCs examined. The sensitivity of the ROC curve to key values (such as the confidence scores of false positives) was evaluated, which is especially important as there were few false positive results.
- ROC AUC was calculated using both parametric and nonparametric methods.

Step C: ROC AUC averaged across multiple readers and cases
- Different models using fixed and random effects were used to model the data.
- Random effects with 95% confidence intervals were modelled by resampling (bootstrap [40]). Alternative methods include the jackknife [41], permutation [42] or probabilistic methods [43].

doi:10.1371/journal.pone.0107633.t001

Figure 1. Difference in diagnostic performance of two tests showing readers from a multi-reader study. Change in diagnostic performance of CT colonography for the detection of polyps; difference with computer assisted detection (CAD) minus without CAD. Results from individual readers. A. Comparison of the increase in the number of patients with a correct diagnosis with the change in ROC AUC. The five readers selected for illustrative purposes as examples for the rest of the article are labelled 1 to 5. B. Arrows indicate values of sensitivity and specificity for each reader, the arrow base showing unassisted read values and the arrow head the CAD-assisted read values for the same reader. C. Difference in ROC AUC using two methods for fitting ROC curves. ROC AUC could not be calculated for reader 4 using the LabMRMC method.
doi:10.1371/journal.pone.0107633.g001
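The nonparametric route named in Table 1 (Steps B and C) can be sketched briefly. The empirical (trapezoidal) ROC AUC equals the Mann-Whitney statistic: the probability that a randomly chosen reference-positive patient receives a higher confidence score than a randomly chosen reference-negative patient, with ties counted as one half. The Python sketch below uses made-up, zero-inflated scores echoing the pattern in Figure 2, not study data, and adds a percentile bootstrap over patients for a 95% interval; the study's own curve fits used the parametric DBM MRMC and PROPROC software instead.

```python
import random

def empirical_auc(pos_scores, neg_scores):
    """Nonparametric ROC AUC via the Mann-Whitney statistic:
    P(score of diseased > score of non-diseased), ties counted as 1/2."""
    wins = sum((p > n) + 0.5 * (p == n) for p in pos_scores for n in neg_scores)
    return wins / (len(pos_scores) * len(neg_scores))

def bootstrap_ci(pos_scores, neg_scores, n_boot=2000, seed=1):
    """Percentile bootstrap CI for AUC, resampling patients within each group."""
    rng = random.Random(seed)
    aucs = []
    for _ in range(n_boot):
        pos = rng.choices(pos_scores, k=len(pos_scores))
        neg = rng.choices(neg_scores, k=len(neg_scores))
        aucs.append(empirical_auc(pos, neg))
    aucs.sort()
    return aucs[int(0.025 * n_boot)], aucs[int(0.975 * n_boot)]

# Made-up per-patient scores: many zeros (no polyp reported) plus a cluster
# of high-confidence scores, mimicking the zero-inflated spike of Figure 2.
pos = [0, 0, 0, 25, 60, 80, 85, 90, 95, 100]   # reference-positive patients
neg = [0] * 8 + [30, 70]                        # reference-negative patients
print(empirical_auc(pos, neg))   # point estimate
print(bootstrap_ci(pos, neg))    # 95% percentile interval
```

With zero-inflated data of this kind, the many tied zero scores each contribute one half to the Mann-Whitney count, so the zero-assignment convention introduced at analysis leaks directly into the AUC; this is one concrete way the distributional problems described above distort the summary measure.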