Published at www.nejm.org October 1, 2008 (10.1056/NEJMoa0803545) |
Background The sensitivity of screening mammography for the detection of small breast cancers is higher when the mammogram is read by two readers rather than by a single reader. We conducted a trial to determine whether the performance of a single reader using a computer-aided detection system would match the performance achieved by two readers.
Methods The trial was designed as an equivalence trial, with matched-pair comparisons between the cancer-detection rates achieved by single reading with computer-aided detection and those achieved by double reading. We randomly assigned 31,057 women undergoing routine screening by film mammography at three centers in England to double reading, single reading with computer-aided detection, or both double reading and single reading with computer-aided detection, at a ratio of 1:1:28. The primary outcome measures were the proportion of cancers detected according to regimen and the recall rates within the group receiving both reading regimens.
Results The proportion of cancers detected was 199 of 227 (87.7%) for double reading and 198 of 227 (87.2%) for single reading with computer-aided detection (P=0.89). The overall recall rates were 3.4% for double reading and 3.9% for single reading with computer-aided detection; the difference between the rates was small but significant (P<0.001). The estimated sensitivity, specificity, and positive predictive value for single reading with computer-aided detection were 87.2%, 96.9%, and 18.0%, respectively. The corresponding values for double reading were 87.7%, 97.4%, and 21.1%. There were no significant differences between the pathological attributes of tumors detected by single reading with computer-aided detection alone and those of tumors detected by double reading alone.
Conclusions Single reading with computer-aided detection could be an alternative to double reading and could improve the rate of detection of cancer from screening mammograms read by a single reader. (ClinicalTrials.gov number, NCT00450359.)
Computer-aided detection systems use computer algorithms to analyze digital mammographic images. They identify and mark potentially suspicious regions to attract the reader's attention to features that might have been overlooked or dismissed as normal.6,11,12,13,14,15,16,17,18,19,20,21,22,23,24 In the United States, where single reading is standard practice, commercial computer-aided detection systems have been widely adopted to improve reader performance25,26,27,28,29,30,31,32,33,34,35,36 and are used in 25 to 30% of all mammogram readings.37 However, the benefit of computer-aided detection in screening mammography was recently questioned in an observational study by Fenton et al.35 and remains controversial.38
There have been calls for robust evidence from multicenter, randomized trials involving large numbers of participants and readers to inform screening programs that are considering an investment in computer-aided detection.36,39,40,41,42 We conducted the prospective, randomized Computer-Aided Detection Evaluation Trial II (CADET II), in which single reading with computer-aided detection was compared with double reading during routine screening mammography in the United Kingdom National Health Service Breast Screening Programme (NHSBSP). Arbitration was used in cases of disagreement between the two readers, since this is United Kingdom practice and has been shown to reduce recall rates while maintaining high cancer-detection rates.43,44,45
Methods
Study Design
The study was designed as an equivalence trial, with matched comparisons between the sensitivity of single reading with computer-aided detection and the sensitivity of double reading. A total of 31,057 women were recruited between September 2006 and August 2007 from the static and mobile units of three NHSBSP centers in England (Manchester; Nottingham; and Warwickshire, Solihull, and Coventry). Information about the trial was mailed, 3 to 6 weeks before their appointments, to women who were invited to undergo routine screening by two-view film mammography. Written informed consent was obtained at the mammography appointment. The study was approved by the South East Multi-Centre Research Ethics Committee.
Randomization
Batches of films from each screening session were randomly assigned, at a ratio of 1:1:28, to one of three reading regimens. Group A, with a planned number of 1000 subjects, was assigned to receive a double reading only; group B, with a planned number of 1000 subjects, was assigned to receive a single reading with computer-aided detection only; group C, with a planned number of 28,000 subjects, was assigned to receive both a single reading with computer-aided detection and a double reading. Assignments were made with the aid of a random-number generator and provided to each center in sequentially numbered, sealed envelopes.
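As a rough illustration of the 1:1:28 allocation, the sketch below computes the planned group sizes for 30,000 subjects and draws a weighted random group label for a handful of film batches. The function names, the seed, and the use of Python's `random.choices` are assumptions made for illustration only; the trial used a random-number generator with sequentially numbered, sealed envelopes, and its exact procedure is not described in the text.

```python
import random

# Allocation ratio from the text: A = double reading only,
# B = single reading with CAD only, C = both regimens.
RATIO = {"A": 1, "B": 1, "C": 28}


def planned_sizes(total, ratio=RATIO):
    """Split a planned total across groups in proportion to the ratio."""
    parts = sum(ratio.values())
    return {group: total * weight // parts for group, weight in ratio.items()}


def assign_batches(n_batches, ratio=RATIO, seed=None):
    """Draw one group label per film batch, weighted by the allocation ratio."""
    rng = random.Random(seed)  # hypothetical stand-in for the trial's generator
    groups = list(ratio)
    weights = [ratio[g] for g in groups]
    return rng.choices(groups, weights=weights, k=n_batches)


print(planned_sizes(30_000))
print(assign_batches(10, seed=1))
```

With a planned total of 30,000, the proportional split reproduces the planned group sizes of 1000, 1000, and 28,000 given above.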
Film Digitization and Computer-Aided Detection Output
Trial mammograms were digitized and analyzed by the software detection algorithms of an ImageChecker DMax computer-aided detection system, version 8.1 (Hologic/R2 Technology). The prompting markers were viewed on a CheckMate flat-panel display screen superimposed on an image of the corresponding mammogram at a reduced resolution. The PeerView facility provided the reader with a high-resolution image of any region of interest that was identified by the computer-aided detection algorithm.46 Markers were generated for masses (asterisks) and microcalcifications (solid triangles). If both a mass and microcalcification were detected, a four-pointed star was displayed. The sizes of the markers were related to the likelihood of cancer, as determined by the algorithms. In this study, the software-detection algorithm thresholds were set to operate at a detection sensitivity of approximately 88% for masses and 95% for calcifications, with corresponding false marker rates of 1.5 and 1.0, respectively, per four-film examination.46
Film Readers
The film readers were 17 radiologists, 2 breast-cancer clinicians, and 8 trained film-reading technologists (radiographers). All film readers met the requirements of the NHSBSP quality-assurance guidelines by reading at least 5000 screening mammograms per year and participating in the annual PERFORMS self-assessment test.47 Readers acting as single readers with computer-aided detection or as the first of the two readers in a double reading had a median of 6 years of mammography experience (interquartile range, 4 to 14 years); the second readers had slightly less experience, with a median of 5 years (interquartile range, 2 to 10 years). Before the study, the readers completed a training course in the use of computer-aided detection involving 300 to 400 subjects.42,48
Film Reading
Films were independently double read or single read with computer-aided detection in separate reading sessions by different readers, who had no access to the decision or the recorded outcome from the other reading regimen. A reader who was one of the two double readers was not permitted to read the same subject's films with computer-aided detection. Mammograms from a previous round were mounted for viewing.
The readers recorded the results for each subject as "recall for further assessment" or "return to routine screening." Discordant results from double readings were arbitrated by a third reader or, in Manchester, by another pair of readers. In single reading with computer-aided detection, the readers viewed the mammogram and then accessed the computer image with prompts and reviewed any areas marked by the computer-aided detection system before recording in the National Breast Screening System their overall recommendation for "recall for further assessment" or "return to routine screening."
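The double-reading rule described above (concordant decisions stand, discordant decisions go to arbitration) can be sketched as a small function. The `Decision` enum, the function name, and the example arbiter are hypothetical names introduced here for illustration; the actual arbitration was a human judgment by a third reader or, in Manchester, by another pair of readers.

```python
from enum import Enum


class Decision(Enum):
    RECALL = "recall for further assessment"
    ROUTINE = "return to routine screening"


def double_reading_outcome(first, second, arbitrate):
    """Return the shared decision, or the arbiter's decision on disagreement.

    `arbitrate` is a callable standing in for the third reader (or second
    pair of readers); it is called only when the two readers disagree.
    """
    if first == second:
        return first
    return arbitrate(first, second)


def cautious_arbiter(first, second):
    """Hypothetical arbiter that recalls every discordant case."""
    return Decision.RECALL


print(double_reading_outcome(Decision.ROUTINE, Decision.ROUTINE, cautious_arbiter).value)
print(double_reading_outcome(Decision.ROUTINE, Decision.RECALL, cautious_arbiter).value)
```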
Outcome Measures
The primary outcome measures were the proportion of cancers detected by the two reading regimens and the recall rate. These were analyzed in a matched-pair comparison.
Statistical Analysis
The trial was designed as an equivalence study,49,50 with a matched-pair comparison of the two regimens in the group that received both single reading with computer-aided detection and double reading. Equivalence was defined as a 95% confidence interval that ruled out a difference of more than 10% in either direction in the rate of detection of cancers. The 10% equivalence criterion is more rigorous than the 15% criterion often used in pharmaceutical equivalence studies.51,52 Within the group receiving both regimens, the differences between recall rates for all subjects and for those with cancer were compared with the use of McNemar's method.53 This method takes into account the fact that the decisions resulting from single reading with computer-aided detection and from double reading are made on the same subjects. For the primary end point, detection of cancers, the 95% confidence interval for the difference between detection rates was calculated. If the 95% confidence interval did not extend beyond ±10%, the two regimens were considered to be equivalent.
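A minimal sketch of the matched-pair logic follows, using the detection counts reported in the Results (199 and 198 of 227 cancers). The discordant counts are derived here on the assumption that every one of the 227 cancers was detected by at least one regimen. The uncorrected McNemar chi-square and the Wald interval for a paired difference are choices made for this sketch and are not necessarily the exact formulas the authors used.

```python
import math


def mcnemar_chi2(b, c):
    """McNemar chi-square (no continuity correction) and two-sided P value."""
    stat = (b - c) ** 2 / (b + c)
    # Survival function of a chi-square variable with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value


def paired_diff_ci(b, c, n, z=1.959964):
    """Wald 95% CI for the difference between two paired proportions."""
    diff = (b - c) / n
    se = math.sqrt(b + c - (b - c) ** 2 / n) / n
    return diff, diff - z * se, diff + z * se


n_cancers = 227
double_detected, cad_detected = 199, 198

# Assumption: each cancer was found by at least one of the two regimens.
both = double_detected + cad_detected - n_cancers
double_only = double_detected - both
cad_only = cad_detected - both

stat, p = mcnemar_chi2(double_only, cad_only)
diff, lower, upper = paired_diff_ci(double_only, cad_only, n_cancers)

print(f"discordant pairs: {double_only} double only, {cad_only} CAD only")
print(f"McNemar chi-square = {stat:.3f}, P = {p:.2f}")
print(f"difference = {diff:.3%}, 95% CI {lower:.3%} to {upper:.3%}")
print("within the 10% margin:", -0.10 < lower and upper < 0.10)
```

Under these assumptions the P value is consistent with the reported P=0.89, and the interval stays inside the ±10% equivalence margin.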
In 2004, the rate of detection of invasive and in situ breast cancers in the United Kingdom NHSBSP with the use of double reading was 6.91 cancers per 1000 women screened.54 Increasing this rate by 11% because each mammogram was read according to two regimens in our study would result in a detection rate of 7.67 per 1000. Power calculations on the assumption of a 20% disagreement rate between the two regimens indicated a requirement of 215 cancers to establish equivalence according to our criterion of 10%. Thus, 28,000 mammograms would have to be read according to both regimens to achieve a power of 90%. In addition, to minimize reader bias, so that a reader using one of the regimens would not know whether the mammogram would be read according to the other regimen, a further 2000 mammograms, randomly assigned to either double reading or single reading with computer-aided detection, were included.55
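The arithmetic behind the planned sample size can be checked with a short sketch. The first part uses only figures stated above. The power function is a simplified normal approximation assumed for this sketch (true difference of zero, standard error of the paired difference taken as the square root of the disagreement rate divided by the number of cancers) and may differ from the authors' own calculation.

```python
import math
from statistics import NormalDist

baseline_rate = 6.91    # cancers per 1000 screened, NHSBSP 2004 (from the text)
uplift = 0.11           # 11% increase from reading under two regimens
planned_both = 28_000   # mammograms to be read under both regimens

expected_rate = baseline_rate * (1 + uplift)             # per 1000 screened
expected_cancers = planned_both * expected_rate / 1000


def approx_power(n_cancers, disagreement=0.20, margin=0.10, alpha=0.05):
    """Rough power to keep a two-sided 95% CI inside the margin (sketch only)."""
    se = math.sqrt(disagreement / n_cancers)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return NormalDist().cdf(margin / se - z)


print(f"expected detection rate: {expected_rate:.2f} per 1000")
print(f"expected cancers in {planned_both} mammograms: {expected_cancers:.0f}")
print(f"approximate power with 215 cancers: {approx_power(215):.2f}")
```

Under these simplifying assumptions, the approximation lands close to the stated power of 90%.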
Results
Figure 1 shows a flow diagram of the trial. In the group receiving both single reading with computer-aided detection and double reading, 227 cancers were detected among 28,204 subjects, for an overall screening-detection rate of 8.0 per 1000. The corresponding numbers for the group receiving double reading only and the group receiving single reading with computer-aided detection only were 12 cancers among 1152 subjects (10.4 per 1000) and 8 cancers among 1182 subjects (6.8 per 1000), respectively. Table 1 shows the distribution of subjects in the group receiving both single reading with computer-aided detection and double reading according to age, type of screening (prevalence or incidence screening), and center.
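The screening-detection rates quoted above follow directly from the counts; a short check, with the helper name `rate_per_1000` chosen here for illustration:

```python
def rate_per_1000(cancers, screened):
    """Cancers detected per 1000 women screened."""
    return 1000 * cancers / screened


# Counts taken from the paragraph above: (cancers detected, subjects).
groups = {
    "both regimens": (227, 28_204),
    "double reading only": (12, 1_152),
    "single reading with CAD only": (8, 1_182),
}

for name, (cancers, screened) in groups.items():
    print(f"{name}: {rate_per_1000(cancers, screened):.1f} per 1000")
```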
Table 3 shows the recall rates according to center and screening regimen for subjects participating in the trial and for women not participating in the trial who were undergoing routine contemporaneous screening, with double reading of their mammograms. The recall rates were similar for trial subjects whose mammograms were read by a single reader with computer-aided detection and for nontrial subjects at the Nottingham center and the Warwickshire, Solihull, and Coventry center. However, at Manchester the recall rate among trial subjects undergoing single reading with computer-aided detection was 1.4 percentage points higher than the recall rate among trial subjects undergoing double reading and 1.2 percentage points higher than the recall rate for nontrial subjects. Overall, the recall rate was 0.3 percentage point higher for mammograms that were given a single reading with computer-aided detection than for routine screening mammograms of nonparticipating women. The recall decision was arbitrated for 365 of the 28,723 women (1.3%) whose mammograms received a single reading with computer-aided detection; these women included 6 of the 227 with cancer (2.6%).
Discussion
The trial demonstrated that the cancer-detection rates of a single reader using computer-aided detection and of two readers are similar (7.02 per 1000 and 7.06 per 1000, respectively). However, the recall rate for mammograms given a single reading with computer-aided detection was 0.5 percentage point higher than that for mammograms given a double reading (3.9% and 3.4%, respectively), a relative difference of 15%.
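The per-regimen figures can be reproduced from numbers given elsewhere in the article: 198 and 199 cancers detected among the 28,204 women whose mammograms were read under both regimens, and recall rates of 3.9% and 3.4%. The variable names below are illustrative only.

```python
screened_both = 28_204  # women whose mammograms were read under both regimens

cad_rate = 1000 * 198 / screened_both      # single reading with CAD
double_rate = 1000 * 199 / screened_both   # double reading

recall_cad, recall_double = 3.9, 3.4       # recall rates, percent
absolute_diff = recall_cad - recall_double         # percentage points
relative_diff = absolute_diff / recall_double      # relative to double reading

print(f"detection: {cad_rate:.2f} vs {double_rate:.2f} per 1000")
print(f"recall difference: {absolute_diff:.1f} points, {relative_diff:.0%} relative")
```

The relative difference is taken with double reading as the reference, which is what yields the 15% figure.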
There has been much recent debate regarding the use of computer-aided detection in clinical practice. Eight studies making the comparison between single reading with computer-aided detection and double reading were reviewed by Bennett et al.,56 but methodologic differences precluded any definitive conclusions. Most of the studies were small, had a high proportion of subjects with cancer, and involved a limited number of readers with variable training in the use of computer-aided detection.23 Taylor and Potts used pooled estimates of effect sizes from two meta-analyses7: one included 10 U.S. studies in which the comparison was made between single reading alone and single reading with computer-aided detection, and the second included 17 studies comparing single reading with double reading. The authors concluded that there was no significant difference in cancer-detection rates between single reading with computer-aided detection and double reading with arbitration in cases of disagreement, but they found that the recall rate with double reading was significantly lower than the rate with single reading with computer-aided detection. However, differences between the screening programs of the United States and the United Kingdom make these evaluations difficult to interpret.57 Our large, prospective study helps to clarify some of these issues. The small increase in the number of women recalled for assessment in the group given a single reading with computer-aided detection as compared with the group given a double reading is consistent with the results of previously published studies.26,30,31,58 Furthermore, our results are in line with the pooled estimates predicted by Taylor and Potts7 in a comparison between double reading with arbitration in cases of disagreement and single reading with computer-aided detection.
The strengths of this study are that it was a multicenter, prospective trial with more than 30,000 women randomly assigned to one of three reading groups and that the readers were unaware of the reading assignments. Reflecting practice in the United Kingdom, the readers were all high-volume readers (reading more than 5000 mammograms per year) and were breast radiologists, breast-cancer clinicians, or film-reading technologists. The single readers using computer-aided detection and the first readers of mammograms given double readings had almost identical experience in reading mammograms, with a median of 6 years of experience (interquartile range, 4 to 14 years) for both groups of readers. Thus, our study was not biased by a difference in the amount of experience of the two groups of readers, but the conclusion that the two reading regimens are equivalent applies only if the single reader with computer-aided detection has experience similar to that of a first reader in a double reading. We used double reading with arbitration in cases of disagreement, which is routine practice in the United Kingdom, because it has been shown to reduce the recall rate.1,2,3,4,5,6,7 In the group assigned to single reading with computer-aided detection, 1.3% of the cases were arbitrated by another reader (or pair of readers in Manchester). The recall rates at all sites were within the limits acceptable to the United Kingdom NHSBSP (<7% for prevalence screenings and <5% for incidence screenings)47 and other European countries.8 The recall rates for trial subjects assigned to double reading (and at two centers, for those assigned to single reading with computer-aided detection) were similar to those for nontrial subjects assigned to double reading, a result indicating that the behavior of readers participating in double reading was not substantially altered by participation in the trial.
The main study design was a matched comparison of the results of the application of the two reading regimens to the same subject. Some interval cancers were missed by both reading regimens (i.e., the number of false negatives was the same for both regimens), but the failure to detect these cancers did not alter the absolute difference in sensitivity between the two regimens. The advantage of the matched design is that it enables a valid comparison of sensitivities to be performed before the interval cancers come to light. In the mammograms in which breast cancer was detected after discordance between the recall decisions of the two reading regimens, we observed no significant differences between the pathological attributes of the tumors detected by the two regimens.
The film-reading and arbitration procedures were not standardized among the three centers but reflected current practice at each center and were within national guidelines. Similarly, each center applied a pragmatic approach to the protocol for single reading with computer-aided detection. In two centers, there was no significant difference in recall rates between mammograms given a single reading with computer-aided detection and those given a double reading; however, the Manchester center had a significant increase of 1.4 percentage points in the rate of recalls with single reading with computer-aided detection as compared with double reading. This difference was not associated with the amount of experience or the staff category of the readers or whether the mammograms were obtained at a prevalence or an incidence screening. However, of the three centers, Manchester also had both the highest recall rates for double reading and the highest cancer-detection rates in the trial. For Warwickshire, Solihull, and Coventry; Manchester; and Nottingham, the overall cancer-detection rates within the trial were 7.1, 10.0, and 7.0 per 1000 screened, respectively (P=0.03), a result suggesting that the increased recall rate with single reading with computer-aided detection in Manchester is due to a more cautious recall policy in general in that center. During the period of the trial, the time between mammography screenings was greater than 3 years in Manchester, an interval that might have permitted more cancers to become available for detection than a shorter period might have done.
Our results suggest that single reading with computer-aided detection is an alternative to double reading; whether to adopt this technology is a question of cost-effectiveness. The additional costs of the computer-aided–detection equipment and the costs associated with an increase in recall must be balanced against the potential savings in reader time. Clearly, the cost-effectiveness of computer-aided detection requires investigation.6,23,37 Furthermore, comparison between the performance of computer-aided detection in full-field digital mammography and its performance in film mammography (which was used in this study) will be required.59
Double reading, which is recognized as the best method for the detection of small invasive cancers,60 is often difficult to achieve in practice because of costs and the need for two readers. The results of this study are applicable to programs in which double reading is standard practice.8 Where single reading is standard practice, computer-aided detection has the potential to improve cancer-detection rates to the level achieved by double reading.
Supported by a grant from Cancer Research UK (CRUK/03/010) and by the National Health Service Cancer Screening Programme.
Professor Gilbert reports being vice-chairman of the Royal College of Radiologists Breast Group, which advises on mammography standards in the United Kingdom; Dr. Wallis reports being president of the European Society for Breast Imaging and a member of the Royal College of Radiologists Breast Group, the National Health Service Breast Screening Programme (NHSBSP) Radiology Quality Assurance Group, the NHSBSP Evaluation Group, and the NHSBSP Association of Breast Surgeons at the British Association of Surgical Oncology Audit Group, all of which advise on and set standards for breast screening and breast imaging; Dr. Astley and Dr. Boggis report receiving fees from VuComp for digitizing and transmitting data, and consulting with Hologic with no remuneration; Dr. Astley reports being a past member of the former NHSBSP Digital Steering Group; and Drs. Gilbert, Astley, and Boggis report being loaned two CAD systems by R2 Technology (now Hologic) for 12 months in 2003 for the retrospective CADET I study. No other potential conflict of interest relevant to this article was reported.
We thank E. Fawcett-Gough, P. Shann, S. Berry, and S. Ferguson at Manchester and M. Wheaton and S. Wright at Warwickshire, Solihull, and Coventry for local assistance and data management; J. Matthews and L. Checkley at Nottingham, and the radiographers at all three centers for assistance with recruitment; the screening office managers and the radiographic, clerical, and secretarial staff at all three centers for their assistance with data collection; and the women at all three centers who consented to their mammograms being used in this study.
Source Information
From the Aberdeen Biomedical Imaging Centre, University of Aberdeen, Aberdeen (F.J.G., M.G.C.G.); the Department of Imaging Science and Biomedical Engineering, University of Manchester, Manchester (S.M.A.); the Department of Epidemiology, Mathematics, and Statistics, Wolfson Institute of Preventive Medicine, London (O.F.A., S.W.D.); the Cambridge Breast Unit, Addenbrookes Hospital, Cambridge (M.G.W.); the Nottingham Breast Institute, Nottingham City Hospital, Nottingham (J.J.); and the Nightingale Breast Screening Unit, Wythenshawe Hospital, Manchester (C.R.M.B.) — all in the United Kingdom.
This article (10.1056/NEJMoa0803545) was published at www.nejm.org on October 1, 2008. It will appear in the October 16 issue of the Journal.
Address reprint requests to Dr. Gilbert at the Aberdeen Biomedical Imaging Centre, University of Aberdeen, Lilian Sutton Bldg., Foresterhill, Aberdeen AB25 2ZD, Scotland, United Kingdom, or at f.j.gilbert@abdn.ac.uk.
The members of the CADET II group were as follows: Analysis and Writing Committee: F.J. Gilbert, S.M. Astley, M.G.C. Gillan, O.F. Agbaje, M.G. Wallis, J. James, C.R.M. Boggis, S.W. Duffy; Trial Management and Site Coordination Group: F.J. Gilbert, M.G.C. Gillan, S.W. Duffy, O.F. Agbaje, S.M. Astley, M.G. Wallis, J. James, C.R.M. Boggis, H. Flight, K. Heer, J. Cooper; Statistical Analysis: S.W. Duffy, O.F. Agbaje; Film readers at participating NHSBSP centers (Warwickshire, Solihull, and Coventry; Manchester; and Nottingham): M.G. Wallis, C.R.M. Boggis, J. James, R.R. Ramachandra, M. Wilson, H.C. Burrell, C. Lewis, U.M. Beetles, E.J. Cornford, S. Garnett, E. Hurley, L.J. Hamilton, A.A. Duncan, A.K. Jain, A.J. Evans, V. Gaur, N.B. Barr, R. Wilson, R. Walker, M. Griffiths, A. Luck, L. Chapman, J. Johnson, J. Patel, J. Hackney, R. Roberts, S. Bundred.