Background: Despite the proved value of mammography in screening for breast cancer, its efficacy depends on radiologists' interpretations. The variability in such interpretations is not well understood.
Methods: Using a technique of stratified random sampling, we selected 150 mammograms obtained in 1987: 27 from women with histopathologically confirmed breast cancer and 123 from women with no evidence of breast cancer after three years of follow-up examinations. Ten radiologists, who were unaware of the diagnoses and research hypothesis, each interpreted the 150 mammograms. Disagreement was analyzed within pairs of the 10 radiologists, as well as for the group of 150 women as a whole.
Results: The diagnostic consistency between pairs of radiologists was moderate, with a median weighted percentage of agreement of 78 percent (weighted kappa, 0.47). The frequency of the radiologists' recommendations for an immediate workup ranged from 74 to 96 percent for mammograms from the women with cancer and from 11 to 65 percent for films from the women without cancer. A substantial disagreement in management recommendations -- in which one radiologist recommended routine follow-up and another recommended a biopsy for the same patient -- occurred in 3 percent of the pairwise comparisons but in 25 percent of the comparisons for the group of women as a whole. When two or more radiologists recommended a biopsy for the same patient, a disagreement in the stated location (right or left breast) occurred in 2 percent of the pairwise comparisons among the radiologists but in 9 percent of comparisons for the group of women as a whole. Because some disagreement was likely, given that 10 radiologists read each film, the pairwise comparison is a more conservative estimate of disagreement.
Conclusions: Although mammography is of value in screening women for breast cancer, radiologists can differ, sometimes substantially, in their interpretations of mammograms and in their recommendations for management. Efforts to improve accuracy and reduce variability in interpretation may increase the effectiveness of mammography in detecting early breast cancers.
Methods
Case Selection
From mammograms obtained in 1987 at Yale-New Haven Hospital, 150 were randomly selected from three strata of diagnostic findings ("normal," "abnormal, probably benign," and "abnormal, suggestive of cancer") and from the definitive designation of breast cancer or the absence of cancer. The definitive diagnosis was breast cancer if it was histopathologically confirmed within three years after the 1987 mammogram. The three-year period was allowed because there may have been some false negative readings of the 1987 films.6 A definitive designation of the absence of cancer required both no evidence of breast cancer during the three-year period after 1987 and a designation of normal findings or abnormal, probably benign findings on mammograms obtained in 1990. To avoid bias associated with the workup,7 a biopsy was not required to verify the absence of cancer.
All the patients were women, and those with benign findings on previous breast biopsies were eligible for inclusion in the study. Patients were ineligible if the 1987 mammogram had not been definitively interpreted. After other exclusions, discussed below, the mammograms for each eligible patient were reviewed for technical quality without knowledge of the patient's identity, the 1987 diagnostic interpretation, or the patient's outcome.
Mammography was performed with a standard film-screen technique on a Senographe 500T Unit (Thompson CGR Medical, Columbia, Md.), with the use of Kodak Ortho M film and MIN-R screens. Mediolateral oblique and craniocaudal views of original films, rather than copies, were available for each breast. The films were considered to be of good technical quality by 1987 standards. To maintain confidentiality, all mammograms were coded by number. The protocol was approved by the Yale-New Haven Medical Center Human Investigation Committee.
Participating Radiologists
Board-certified diagnostic radiologists from community and academic practices were invited to participate in the study if they routinely read mammograms. The radiologists were unaware of the objectives and design of the study and of the number of cases in which cancer was subsequently diagnosed. Each radiologist received a small honorarium.
Data Acquisition
In phase 1 of the research, each of the 10 radiologists independently read the mammograms for the 150 patients. In phase 2, which occurred five months later, the same films were reviewed again, this time in a new, randomly arranged sequence. The radiologists were told that all the patients were female, with no history of previous breast cancer, but they were not told that they would be seeing the same films a second time.
In both phase 1 and phase 2, films for 50 patients, shown with the patient's age but no clinical information, were used to assess intraobserver variability. Of the films for the remaining 100 patients, half were shown with a detailed clinical history in phase 1 but with only the patient's age in phase 2; this sequence was reversed for the other half. The detailed clinical history included signs and symptoms (e.g., a palpable lump or nipple discharge), the location of a previous breast biopsy if there had been one, menopausal status, estrogen use, history of other cancers, and family history of breast cancer. We performed a separate analysis of the effect of clinical histories on mammographic interpretation (data not shown).
For each mammogram, the radiologists used a checklist to indicate observations, diagnostic interpretations, and recommendations for management. The checklist was reviewed with the participating radiologists before phase 1, and those who were unfamiliar with the form received brief written instructions. The radiologists were asked to interpret the films as they would in their own clinical practice, with no time limits for the readings.
In descriptive observations, the radiologists could note specific abnormalities (e.g., a mass or calcification) and their location (right or left side, with the position demarcated according to the hours on a standard 12-hour clock). If more than one abnormality was noted in the same patient, the radiologists were asked to indicate which side contained the most serious abnormality (i.e., the most suggestive of cancer). Diagnostic interpretations could be chosen from one of four categories: normal; abnormal, probably benign; abnormal, indeterminate (for an uncertain finding); or abnormal, suggestive of cancer. The radiologists were asked to avoid the category "abnormal, indeterminate" whenever possible.
Management recommendations could be chosen from the following categories: routine mammographic follow-up (i.e., screening with mammography), mammographic follow-up after a short interval (e.g., within six months), additional mammographic views, ultrasound examination, or biopsy.
Data Analysis
The original data were double-entered and verified for electronic coding. Statistical analyses were performed either with electronic hand calculators or with programs in the Statistical Analysis System.8
Interobserver variability in interpretations of the mammograms for the 150 patients was appraised from the readings of pairs of radiologists in phases 1 and 2. Intraobserver variability was assessed for the 50 mammograms that were shown in both phases with the age of the patient as the only additional information provided. The indexes of variability used in the paired comparisons were the percentage of agreement and the kappa statistic.9,10 For ordinal measures, both indexes were weighted with integers representing the number of ordinal categories separating the two readings, that is, the distance from perfect agreement.11,12
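As an illustration only, the two indexes can be sketched in Python for a single pair of readers. The sketch assumes linear weights, in which each pair of readings is credited with 1 minus the number of categories separating them, divided by the largest possible separation. The function name, the coding of the four diagnostic categories as integers 0 to 3, and the toy readings are hypothetical; they are not taken from the study data or from the programs used in the study.

```python
from collections import Counter


def weighted_agreement_and_kappa(ratings_a, ratings_b, n_categories):
    """Return (weighted percentage agreement as a fraction, weighted kappa).

    Ratings are integers 0 .. n_categories - 1 on an ordinal scale.
    Linear weights are an assumption of this sketch.
    """
    n = len(ratings_a)
    if n == 0 or n != len(ratings_b):
        raise ValueError("both readers must rate the same non-empty set of films")
    max_distance = n_categories - 1

    def weight(i, j):
        # Full credit for identical categories, none for the two extremes.
        return 1.0 - abs(i - j) / max_distance

    # Observed weighted agreement across the paired readings.
    observed = sum(weight(a, b) for a, b in zip(ratings_a, ratings_b)) / n

    # Agreement expected by chance, from each reader's marginal frequencies.
    count_a, count_b = Counter(ratings_a), Counter(ratings_b)
    expected = sum(
        weight(i, j) * count_a[i] * count_b[j]
        for i in range(n_categories)
        for j in range(n_categories)
    ) / (n * n)

    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement is complete")
    kappa = (observed - expected) / (1.0 - expected)
    return observed, kappa


# Hypothetical readings of eight films by two readers
# (0 = normal ... 3 = abnormal, suggestive of cancer).
reader_x = [0, 0, 1, 1, 2, 3, 3, 0]
reader_y = [0, 1, 1, 2, 2, 3, 2, 0]
agreement, kappa = weighted_agreement_and_kappa(reader_x, reader_y, 4)
print(f"weighted agreement = {agreement:.2f}, weighted kappa = {kappa:.2f}")
```

Kappa rescales the observed agreement by the agreement expected from the two readers' marginal frequencies, which is why a high percentage of agreement can coexist with only a moderate kappa.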
The frequency of agreement was evaluated for each of the four categories of diagnostic interpretation. The accuracy (sensitivity and specificity) of the diagnosis of cancer was determined for films interpreted as "abnormal, suggestive of cancer," as compared with the accuracy for a combination of the other three diagnostic categories. The percentage of patients with recommendations for a workup was evaluated separately for patients with cancer and those without cancer.
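The dichotomy described above can be sketched as follows, purely as an illustration. A reading counts as positive only when it falls in the category "abnormal, suggestive of cancer"; the other three categories are pooled as negative. The function name, the category labels as strings, and the toy readings are hypothetical and do not reproduce any radiologist's actual results.

```python
POSITIVE_CATEGORY = "abnormal, suggestive of cancer"


def sensitivity_specificity(interpretations, has_cancer):
    """Return (sensitivity, specificity) for one reader.

    interpretations: one diagnostic category (string) per patient.
    has_cancer: one boolean per patient (the definitive outcome).
    """
    tp = fn = tn = fp = 0
    for interpretation, cancer in zip(interpretations, has_cancer):
        positive = interpretation == POSITIVE_CATEGORY
        if cancer and positive:
            tp += 1
        elif cancer:
            fn += 1
        elif positive:
            fp += 1
        else:
            tn += 1
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need at least one patient with and one without cancer")
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical reader: 4 patients with cancer, 6 without.
readings = (
    [POSITIVE_CATEGORY] * 3 + ["abnormal, probably benign"]      # with cancer
    + ["normal"] * 4 + ["abnormal, indeterminate"]               # without cancer
    + [POSITIVE_CATEGORY]                                        # without cancer
)
outcomes = [True] * 4 + [False] * 6
sens, spec = sensitivity_specificity(readings, outcomes)
print(f"sensitivity = {sens:.2f}, specificity = {spec:.2f}")
```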
For any pair of radiologists reading the same patient's mammogram, a substantial clinical disagreement was considered to exist if the films were interpreted as normal by one radiologist and as abnormal and suggestive of cancer by the other, or if routine mammographic surveillance was recommended by one radiologist and a biopsy was recommended by the other. If both members of a pair recommended an immediate workup (or a biopsy), a substantial disagreement was considered to exist if the most abnormal lesion was located in the right breast by one radiologist and in the left breast by the other. A description of bilateral abnormalities by one radiologist and of a unilateral abnormality by the other radiologist was not counted as a disagreement.
The proportions of substantial clinical disagreements were calculated in two ways: first, within the maximal number of pairwise comparisons, and second, for the overall group of 150 patients (per-patient comparisons). For the pairwise calculation, the numerator was the number of disagreements between the members of a pair of radiologists. The denominator was the maximal number of possible pairwise comparisons for the 10 radiologists: 45 pairs for an individual patient (10 x 9 divided by 2), and 6750 pairs (45 x 150) for all 150 patients. For the per-patient calculations, the numerator was the number of patients about whom at least two radiologists had a substantial disagreement, and the denominator was the total number of patients (150). Since immediate workup and biopsies were not recommended for all patients, disagreement about the location of abnormalities was calculated with denominators consisting of pairwise (or per-patient) comparisons in which at least two radiologists had recommended such a workup.
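The two denominators can be contrasted in a short Python sketch. It is a minimal illustration, assuming each patient's record is a list of one management recommendation per radiologist and that a substantial disagreement is a pairing of routine follow-up with biopsy. The function names, string labels, and toy data are hypothetical.

```python
from itertools import combinations

ROUTINE = "routine follow-up"
BIOPSY = "biopsy"


def is_substantial(rec_x, rec_y):
    """One reader recommends routine follow-up and the other a biopsy."""
    return {rec_x, rec_y} == {ROUTINE, BIOPSY}


def disagreement_proportions(readings):
    """Return (pairwise proportion, per-patient proportion).

    readings: list of per-patient lists, each holding one recommendation
    per radiologist.
    """
    total_pairs = 0
    disagreeing_pairs = 0
    disagreeing_patients = 0
    for recommendations in readings:
        pairs = list(combinations(recommendations, 2))
        n_disagree = sum(1 for x, y in pairs if is_substantial(x, y))
        total_pairs += len(pairs)          # 45 per patient for 10 readers
        disagreeing_pairs += n_disagree
        if n_disagree > 0:                 # any single disagreeing pair counts
            disagreeing_patients += 1
    return disagreeing_pairs / total_pairs, disagreeing_patients / len(readings)


# Two hypothetical patients read by 10 radiologists each.
toy_readings = [
    [ROUTINE] * 9 + [BIOPSY],   # one dissenting reader
    [ROUTINE] * 10,             # complete agreement
]
pairwise, per_patient = disagreement_proportions(toy_readings)
print(f"pairwise = {pairwise:.2f}, per-patient = {per_patient:.2f}")
```

In the toy data a single dissenting reader produces 9 disagreeing pairs out of 90, yet half of the patients are counted as having a disagreement, which shows why the per-patient proportion runs higher than the pairwise proportion.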
A simulated screening population was composed of the subgroup of asymptomatic women whose mammograms were accompanied by a clinical history. Regardless of a possible family history of breast cancer or a previous breast biopsy with benign results, these women reported no symptoms related to the breast at the time the mammography was performed. The results in this subgroup were assessed separately for variability, substantial clinical disagreements, and the percentage of patients without cancer for whom an immediate workup was recommended.
Assessment of Sources of Variability
After the phase 2 analysis, the participating radiologists were assembled to review specific mammograms chosen as examples of disagreement and to comment on the variability.
Results
Selection of Cases
The patients were selected from those with mammograms in 1987 from three strata of diagnostic findings: normal (>3000 women), abnormal but probably benign (567), or abnormal, suggestive of cancer (124). Random sampling within these categories resulted in 275 eligible women with normal findings and 207 with abnormal but probably benign findings plus all 124 considered to have abnormalities suggestive of cancer. As is customary in studies of variability among observers, the groups were selected to provide an enriched spectrum rather than the predominance of normal cases in actual screening, which would not be challenging and could also lead to artificially low results for kappa indexes of concordance.13
Of the 606 eligible patients, 271 had adequate follow-up (pathological studies or additional mammograms) for the determination of a definitive outcome: 89 were in the original group with normal findings, 110 had abnormal but probably benign findings, and 72 had abnormalities considered suggestive of cancer. From this group, 121 patients were then excluded for the following reasons: a history of breast cancer (24); large breasts, requiring special films or more than one film per breast (24); previous cosmetic breast surgery (8); radiopaque skin marker on the breast (6); films of inadequate technical quality (11); or films not available or only one view available (48). In the final group of 150 patients, the diagnostic interpretations from 1987 (and the subsequent outcome status) were as follows: 54 with normal findings (1 with cancer), 61 with abnormal but probably benign findings (7 with cancer), and 35 with abnormal findings suggestive of cancer (19 with cancer).
Of the 150 patients, 95 (63 percent) were at least 50 years old (range, 33 to 83), and 40 (27 percent) had previously undergone a breast biopsy, with negative results. Breast cancer had been diagnosed in 22 patients within one year after the 1987 mammography, and in 5 more patients during the two ensuing years. For 85 of the 127 patients who had reported no symptoms, the mammograms were read with the clinical history available, and this group of 85 women became the simulated screening subgroup; 9 were subsequently found to have cancer.
Participating Radiologists
Of the 10 participating board-certified radiologists, 7 were in private practice and 3 held full-time academic positions in New York or Connecticut. All 10 had received training in mammography during residency or in continuing medical education courses (or both), and 9 would have qualified to read mammograms in a facility accredited by the American College of Radiology.14 The group had a median of 7 years of clinical experience reading mammograms (range, 1.5 to 20) and had interpreted a median of 1900 mammograms in the year before the study (range, 200 to 6000). The median percentage of time spent reading mammograms in clinical practice was 23 percent (range, 8 to 50 percent). Seven of the radiologists later said they were not aware that the same patients' mammograms had been shown in phases 1 and 2, two said they had recognized less than 3 percent of the mammograms in phase 2, and one claimed to have recognized about 25 percent of the films.
Observer Variability
Observations, Interpretations, and Recommendations for Management
The observations, diagnostic interpretations, and recommendations for management in phase 1 are shown in Table 1. Among the 27 patients with cancer, a mass was reported in a median of 54 percent (range, 37 to 78 percent) by the 10 radiologists. The diagnostic interpretation of an abnormality suggestive of cancer was reported for 37 to 85 percent of the patients (median, 70), and biopsies were recommended for 33 to 81 percent (median, 65). Figure 1 shows three sets of films for which there was 100 percent agreement on recommended management. Figure 2 shows a set of films from a patient without cancer for which interpretations differed. Interobserver variability in phase 2 closely resembled that in phase 1 and is not presented here.
Although all 10 radiologists agreed on the diagnostic interpretation in only 10 cases (7 percent), at least 5 of the 10 radiologists agreed in 129 cases (86 percent). In most instances in which at least five radiologists agreed, the films were interpreted either as normal or as abnormal and suggestive of cancer. The median agreement (Table 2) was 78 percent for interobserver variability in diagnostic interpretations and 85 percent for the recommendation of a biopsy. The corresponding median kappa values were 0.47 for the diagnostic interpretation and 0.49 for the biopsy recommendation. As expected, the corresponding results were higher for intraobserver agreement than for interobserver agreement.
The 10 radiologists differed widely in their recommendations for management (Table 3). For example, the percentage of patients with cancer for whom an immediate workup was recommended was highest (96 percent) for radiologist A and lowest (74 percent) for radiologist J. Radiologist A recommended an immediate workup for almost all the patients with cancer, but also for 64 percent of the patients without cancer. In general, the radiologists who recommended workups most frequently did so for both the patients with cancer and those without cancer. The radiologists varied particularly in requesting additional mammographic views (range for patients without cancer, 6 to 55 percent) but were more consistent in requesting ultrasound examinations and biopsies. The diagnostic interpretations of the 10 radiologists had a median sensitivity of 70 percent (range, 37 to 85 percent) and a median specificity of 94 percent (range, 85 to 99 percent).
Substantial diagnostic disagreement occurred in 2 percent of the pairwise comparisons, and at least one pair of radiologists had such a disagreement for 19 percent of the patients (Table 4). Disagreements regarding biopsy recommendations occurred in 3 percent of the pairwise comparisons and 25 percent of the per-patient comparisons. When at least two radiologists recommended an immediate workup for the same patient, they disagreed on the side (right or left breast) for the workup in 9 percent of pairwise comparisons and 33 percent of per-patient comparisons. In the subgroup of patients for whom biopsies were recommended, the stated side of the abnormality differed in 2 percent of the pairwise comparisons and 9 percent of the per-patient comparisons (four patients).
In the simulated screening group of 85 asymptomatic patients, the results were similar to the overall results for the 150 patients. For the diagnostic interpretations, the median agreement was 78 percent (kappa index, 0.38), and for the biopsy recommendations, it was 86 percent (kappa index, 0.43). Among the 76 patients without cancer in the simulated screening group, additional mammographic views were recommended for 8 to 67 percent, ultrasound studies for 5 to 17 percent, and biopsies for 4 to 18 percent. The proportion of substantial clinical disagreements in the simulated screening group was also similar to that in the overall group of patients.
Explanations for Variability
Three sources of variability were identified. First, there were differences in visual observation. Some radiologists noted masses where others saw only normal parenchyma, and some missed subtle findings, such as tiny microcalcifications. For example, on being shown a lesion, one radiologist stated, "I just didn't see it." Second, in some cases the same abnormality was perceived differently. Thus, some radiologists regarded an obvious mass as well defined and therefore probably benign, whereas others believed portions of the margin were obscured, making it suggestive of cancer. Third, there were different thresholds of concern about perceived abnormalities. Although some radiologists were willing merely to follow what they regarded as probably benign abnormalities, others sought additional evaluation for almost every perceived abnormality.
Discussion
No study can perfectly mimic the real world. There are some notable differences between the interpretation of mammograms in routine clinical practice and that in our study. The testing situation may have led the radiologists to identify more subtle abnormalities than would otherwise be noted. Previous mammograms were not available for comparison, and a clinical history was not presented for every patient. In addition, the group of patients included many more women with breast cancer or other mammographic abnormalities than would be found in a screening population. Because of this deliberate enrichment, additional workup was recommended in a much larger proportion of patients than the 5 to 14 percent customarily recalled for further evaluation after routine screening.6,15,16
The effects of the study situation are difficult to determine. The radiologists may have paid more attention than usual, so that any possible lesion (regardless of its clinical importance) was reported, with a recommendation for further evaluation. On the other hand, knowing these were not films of their actual patients, the radiologists may have taken less care than usual.
Although the usual two views of each breast were evaluated, as they would be during screening, the goal of screening mammography is to separate normal from potentially abnormal mammograms rather than to establish a definitive diagnosis. Therefore, analysis of the radiologists' recommendations for management may be more clinically pertinent than their diagnostic interpretations. In addition, measures of sensitivity and specificity may not truly reflect the diagnostic accuracy because of the criteria used. For example, a true positive result for sensitivity was defined as a finding designated as abnormal and suggestive of cancer, with a subsequent confirmation of cancer. Accuracy might have been better appraised if true positive results had been defined as recommendations for immediate workups in patients with cancer (range, 74 to 96 percent). Analogously, false positive results could have been defined as recommendations for immediate workups in patients without cancer (range, 11 to 65 percent).
In the absence of an ideal way to calculate clinical disagreements, results were examined for both the pairwise and the per-patient assessments. Because 10 radiologists participated, the per-patient assessments inflate the amount of variability by counting a disagreement for a patient if any one radiologist disagreed with any one of the other nine, even if those nine were in perfect agreement with one another. The pairwise analysis avoids the excesses of the per-patient method but may underestimate the amount of variation, because disagreements were determined with a denominator of 6750 (or 45 x 150), which is the maximal number of possible pairwise agreements. Although reported here for easy understanding, this denominator is nearly twice the number of possible pairwise disagreements, which is 25 per patient and 3750 (or 25 x 150) for the total. In addition, the data were not matched for all analyses according to the specific patient, the side of the abnormality (right or left breast), or the stated location of the abnormality within the breast. For example, if two radiologists each recommended a biopsy for the same patient, the recommendations would be scored as an agreement, even though the abnormality could have been located in the left breast by one radiologist and in the right by the other.
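The counts quoted above can be checked with a few lines of Python. The sketch assumes that a substantial disagreement involves two opposing readings, so that the readers split into two camps of k and 10 - k and the number of disagreeing pairs is k x (10 - k). The constant and function names are hypothetical.

```python
from math import comb

N_RADIOLOGISTS = 10
N_PATIENTS = 150


def max_disagreeing_pairs(n_readers):
    """Largest number of pairs that can disagree when readers split into
    two opposing camps of sizes k and n_readers - k."""
    return max(k * (n_readers - k) for k in range(n_readers + 1))


# All possible pairs of readers for one patient: 10 x 9 / 2.
pairs_per_patient = comb(N_RADIOLOGISTS, 2)
# Pairs that can disagree at most: an even 5-versus-5 split.
max_disagreements_per_patient = max_disagreeing_pairs(N_RADIOLOGISTS)

print(pairs_per_patient, pairs_per_patient * N_PATIENTS)
print(max_disagreements_per_patient, max_disagreements_per_patient * N_PATIENTS)
```

Because at most 25 of the 45 pairs for a patient can disagree, a pairwise proportion computed over all 45 pairs can never reach 100 percent, which is the sense in which the pairwise figures may understate variability.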
The review after the study indicated that the participating radiologists may have observed the same thing (e.g., a mass in the left breast) but interpreted it differently. They also may have made the same diagnostic interpretation (e.g., an abnormality that was probably benign) but differed in their recommendations for follow-up because of different thresholds of concern. For each radiologist, the threshold may reflect not only personal estimates of the probability of cancer but also subjective factors (such as fiscal, physical, and psychological risks for the patient, as well as the risk of malpractice for the physician). Future efforts to reduce variability must include accuracy as a goal, because accuracy cannot be improved if radiologists agree on a diagnosis that is wrong.
Our results show that radiologists can differ, sometimes substantially, in their mammographic interpretations and recommendations for management. These results should not be regarded as casting doubt on the efficacy of mammography, the value of which has been well documented.17,18,19,20,21,22,23 The variability found in this study, however, indicates that among radiologists who read mammograms there is a wide range of accuracy. Recent efforts to improve the accuracy of mammographic interpretation and to standardize it include the accreditation program of the American College of Radiology,14 publication of a lexicon for describing mammographic findings,24,25 decision-making aids,26 guidelines for mammographic reporting,27 and a description of auditing procedures.15,28 A reduction in variability will probably require more consistent criteria for diagnostic interpretation and standards for recommending subsequent evaluation. In addition, examinations of performance will probably be required, rather than merely the dissemination of criteria; moreover, collaborative efforts, self-auditing procedures, and specialized education will be needed.
Supported by a grant (RD-346) from the American Cancer Society. Dr. Elmore was a Robert Wood Johnson Clinical Scholar.
We are indebted to Drs. R.I. Horwitz and J.F. Jekel for their thoughtful comments and to the 10 radiologists who participated in the study and who prefer to remain anonymous.
Source Information
From the Departments of Internal Medicine (J.G.E., C.K.W., D.H.H., A.R.F.), Diagnostic Radiology (C.H.L.), and Epidemiology and Public Health (A.R.F.), Yale University, New Haven, Conn. Presented in part at the meeting of the American Association of Physicians, Washington, D.C., April 30-May 3, 1993.
Address reprint requests to Dr. Elmore at the Primary Care Center, Yale University School of Medicine, 20 York St., New Haven, CT 06504.
Related Letters:
Variability in the Interpretation of Mammograms. Silen W., Gaskin T. A., D'Orsi C. J., Swets J. A., Hall F. M., Elmore J. G., Wells C. K., Feinstein A. R. N Engl J Med 1995; 332:1171-1173, Apr 27, 1995. Correspondence.
The New England Journal of Medicine is owned, published, and copyrighted © 2009 Massachusetts Medical Society. All rights reserved.