Unfortunately, these technological advances also create confusion that may ultimately be harmful to patients. Consider the case of prostate cancer. Although the prevalence of clinically apparent prostate cancer in men 60 to 70 years of age is only about 1 percent,1 over 40 percent of men in their 60s with normal rectal examinations have been found to have histologic evidence of the disease2. Consequently, because the prostate is studied increasingly by transrectal ultrasonography3,4 and MRI, which can detect tumors too small to palpate, the reported prevalence of prostate cancer increases. In addition, the increased detection afforded by imaging can confuse the evaluation of therapeutic effectiveness. As the spectrum of detected prostate cancer becomes broader with the addition of tumors too small to palpate, the reported survival from the time of diagnosis improves regardless of the actual effect of the new tests and treatments5.
In this article, we explain how advances in diagnostic imaging create confusion in two crucial areas of medical decision making: establishing how much disease there is and defining how well treatment works. Although others have described these effects in the narrow context of mass screening6,7 and in a few clinical situations, such as the staging of lung cancer,5 these consequences of modern imaging increasingly pervade everyday medicine. Besides describing the misperceptions of disease prevalence and therapeutic effectiveness, we explain how the increasing use of sophisticated diagnostic imaging promotes a cycle of increasing intervention that often confers little or no benefit. Finally, we offer suggestions that may minimize these problems.
Prevalence of Disease
As a general principle, the observed prevalence of any disease increases with the observer's ability to detect the abnormalities associated with the disease. In the case of diagnostic imaging, the ability to detect an anatomical abnormality is closely related to the size of the abnormality. Thus, as technological advances make it possible for imaging equipment to detect smaller abnormalities, more are found and the reported prevalence of the associated disease increases accordingly. The problem is illustrated by the deceptively simple question, "How many islands surround Britain's coast?"8 There is no one correct answer, because the number of islands increases with the resolution of the map on which they are counted. There are analogous relations between prevalence and diagnostic scrutiny for several familiar diseases.
Abdominal Aortic Aneurysms
Two decades ago, abdominal aortic aneurysms were detected mainly by palpation. The threshold in size for palpation of an aneurysm is about 5 cm when the clinician is directed to look for the abnormality in the investigational setting and considerably higher when the clinician is not so directed9. Today, however, most abdominal aortic aneurysms are detected by ultrasonography or CT, which have a detection threshold well below 3 cm, the most commonly used criterion for the diagnosis of these aneurysms.
A recent screening study of 201 patients at high risk (men between the ages of 60 and 75 years with hypertension or coronary artery disease) showed how much ultrasonography can affect the reported prevalence of abdominal aortic aneurysms. As shown in Figure 1, five aneurysms were detected on physical examination as a "definite pulsatile mass," and four of the five were 5 cm in diameter or larger9. However, 18 aneurysms 3.6 cm or more in diameter were detected in the same population by ultrasound. Of the 13 aneurysms not detected by physical examination, 1 measured at least 5 cm, 5 measured at least 4 cm, and 13 measured at least 3.6 cm. Thus, from the perspective of the clinician performing physical examinations, the prevalence of abdominal aortic aneurysms in this high-risk population was only 2 percent, and the modal (i.e., the most common) size was larger than 5.0 cm. From the perspective of the ultrasonographer, however, the prevalence of aneurysms in this population was 9 percent, and the modal size was less than 4.0 cm.
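The two prevalence figures follow directly from the counts reported for this study. A minimal sketch of the arithmetic in Python is given below; the function and variable names are illustrative only and are not part of the cited study.

```python
def prevalence_percent(cases: int, population: int) -> float:
    """Prevalence expressed as a percentage of the screened population."""
    return 100.0 * cases / population


POPULATION = 201          # high-risk men screened in the study
FOUND_BY_PALPATION = 5    # "definite pulsatile mass" on physical examination
FOUND_BY_ULTRASOUND = 18  # aneurysms of 3.6 cm or more seen on ultrasound

palpation = prevalence_percent(FOUND_BY_PALPATION, POPULATION)
ultrasound = prevalence_percent(FOUND_BY_ULTRASOUND, POPULATION)

print(f"physical examination: {palpation:.1f} percent")
print(f"ultrasonography:      {ultrasound:.1f} percent")
print(f"ratio:                {ultrasound / palpation:.1f}")
```

The same population yields a prevalence more than three times as high when the detection threshold is lowered from palpation to ultrasonography.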
Cancer
For diseases defined on the basis of macroscopic size criteria, such as abdominal aortic aneurysms, the amount of detectable subclinical disease has been fairly well defined by the imaging techniques in current use. Therefore, their prevalences and incidences cannot increase much with future improvements in imaging. However, for diseases defined microscopically, such as cancer, the reservoir of detectable subclinical disease is huge. The prevalence and incidence of cancer have the potential to rise continually as detection thresholds are lowered by advances in imaging.
This potential is most clearly illustrated in the case of the thyroid gland, which has probably been more closely scrutinized for cancer than any other internal organ. According to the Connecticut Tumor Registry,1 the prevalence of clinically apparent thyroid cancer (tumor size >2 cm) is only about 0.1 percent in adults between the ages of 50 and 70 years. By slicing the thyroid at 2.5-mm intervals at autopsy, however, Harach et al.11 found at least one papillary carcinoma in 36 percent of Finnish adults of comparable ages. In addition, Harach et al. realized that the probability that they would observe under the microscope a tumor with a diameter smaller than the distance between slices was equal to the diameter of the tumor divided by the distance between slices. For example, given the 2.5-mm interval between slices, they reasoned that all tumors larger than 2.5 mm in diameter but only one fifth of tumors with a diameter of 0.5 mm (i.e., 0.5/2.5) would be seen under the microscope. Applying this reasoning to the size distribution of the tumors they observed, Harach et al. reconstructed the likely size distribution for thyroid cancer in their patients studied at autopsy (Figure 2) and concluded that the prevalence of histologically verifiable papillary carcinoma was close to, if not equal to, 100 percent if one could look at thin enough slices of the gland.
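The sampling argument of Harach et al. can be sketched in a few lines of Python. The detection probability (tumor diameter divided by slice interval, capped at 1) is taken from the reasoning described above; the observed counts are hypothetical placeholders chosen only to show the correction and are not data from the autopsy study.

```python
def detection_probability(diameter_mm: float, slice_interval_mm: float) -> float:
    """Chance that at least one slice passes through a tumor of this diameter."""
    return min(1.0, diameter_mm / slice_interval_mm)


def corrected_count(observed: int, diameter_mm: float, slice_interval_mm: float) -> float:
    """Estimate the true number of tumors from the number actually seen."""
    return observed / detection_probability(diameter_mm, slice_interval_mm)


SLICE_INTERVAL_MM = 2.5  # interval between slices reported in the text

# Hypothetical observed counts by tumor diameter in mm (NOT study data).
observed_by_size = {0.5: 10, 1.0: 10, 2.5: 10, 5.0: 10}

for diameter, seen in observed_by_size.items():
    p = detection_probability(diameter, SLICE_INTERVAL_MM)
    estimate = corrected_count(seen, diameter, SLICE_INTERVAL_MM)
    print(f"{diameter:>4} mm: p = {p:.2f}, seen = {seen}, estimated true = {estimate:.0f}")
```

Under these assumptions, each 0.5-mm tumor seen stands for about five that are present, which is why the reconstructed size distribution is weighted so heavily toward the smallest tumors.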
Breast Cancer
Before the widespread use of mammography, most breast cancers were discovered on physical examination, as palpable lumps. In one of the few studies to assess directly the accuracy of physical examination in screening for breast cancer, only 27 percent of tumors more than 1.0 cm in diameter and 10 percent of those less than 1.0 cm in diameter were detected by physical examination13. However, the mean size of breast cancers detected by state-of-the-art screening mammography is about 1.0 cm,14,15 and many of the cancers detected as microcalcifications are only a few millimeters in size.
Again, prevalence depends on the degree of scrutiny. According to the Connecticut Tumor Registry, clinically apparent breast cancer afflicts about 1 percent of all women between the ages of 40 and 50 years1. In a recent medicolegal autopsy study, however, small foci of breast cancer were found in 39 percent of women in this age group16. Most cancers were in the form of ductal carcinoma in situ. Furthermore, over 45 percent of the women with cancer had two or more lesions, and over 40 percent had bilateral lesions. Although it has been argued that such small in situ lesions are not detected by and are therefore irrelevant to screening mammography, about half the lesions in that study16 were detected, usually as microcalcifications, on postmortem plain-film radiography of the resected breasts. Because of continual technical improvements and increasingly broad criteria for the interpretation of mammograms, the detection threshold for breast cancer has fallen considerably since the time of the Breast Cancer Screening Project of the Health Insurance Plan of Greater New York17 (1963 to 1975). This can explain the increased prevalence of cancer on mammographic screening, from 2.7 per 1000 examinations17 to 7.6 per 1000 examinations14 (with the incidence increasing from 1.5 per 1000 examinations17 to 3.2 per 1000 examinations14). The lower detection threshold can also explain the increase in the percentage of carcinomas in situ (stage 0) among all mammographically detected cancers -- from 12.7 percent17 to over 30 percent15,18,19. The principal indication for biopsy has changed from suspicious mass to suspicious microcalcifications. This can explain why the reported incidence of breast cancer has increased and why most of the increase is in smaller lesions, particularly ductal carcinoma in situ20.
As the foregoing examples illustrate, there are large reservoirs of clinically occult disease. In fact, this is true of most primary cancers and metastases that have been closely scrutinized2,11,16,21,22,23,24,25,26. For such diseases, the observed prevalence can increase considerably as detection thresholds are lowered by advances in imaging (Table 1).
Not only are advances in imaging changing physicians' perspectives on the prevalence of disease, but they are also distorting their perceptions of the natural history of disease and its response to medical intervention. Because only a tiny fraction of the tests and treatment strategies used in routine practice have been subjected to randomized trials, physicians must rely heavily on less rigorous methods of evaluation, which usually entail comparisons with historical controls. Whether these comparisons are based on published series in the literature or on the physician's recollections of personal experience, they are subject to lead-time and length biases5,6,7,27.
Lead-Time Bias
Lead-time bias pertains to comparisons that are not adjusted for the timing of the diagnosis. If survival is measured from the time of diagnosis, as is usual, then the comparison between patients who are given diagnoses earlier on the basis of the test and those given diagnoses on the basis of clinical findings is a biased one, regardless of the real effect of the earlier diagnosis. In the simple case in which earlier diagnosis has no real effect on the length of survival, the new test will appear to prolong survival (by the amount of time between detection with the test and clinical diagnosis). Therefore, the comparison should be adjusted by subtracting the lead time from the group with test-based diagnoses. In general, however, this adjustment cannot be made when a new test becomes available, because the rate of disease progression and hence the lead time afforded by testing are unknown. Furthermore, this adjustment for lead time assumes that test-detected cases progress at the same rate as those that eventually present clinically. When there is variability in the rate of disease progression, as is usually the case, then this assumption is incorrect and introduces a second bias.
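The simple case described above, in which earlier diagnosis has no real effect on the length of survival, can be illustrated with a short Python sketch. The patient's ages and the lead time are hypothetical values chosen for illustration; the age at death is deliberately held fixed.

```python
from dataclasses import dataclass


@dataclass
class Patient:
    age_at_clinical_diagnosis: float
    age_at_death: float


def survival_from_diagnosis(patient: Patient, lead_time_years: float = 0.0) -> float:
    """Years from diagnosis to death when diagnosis is advanced by the lead time.

    The age at death does not change, i.e. earlier diagnosis has no real effect.
    """
    age_at_diagnosis = patient.age_at_clinical_diagnosis - lead_time_years
    return patient.age_at_death - age_at_diagnosis


# Hypothetical patient and lead time (assumptions, not data from any study).
patient = Patient(age_at_clinical_diagnosis=65.0, age_at_death=68.0)
LEAD_TIME_YEARS = 2.0

clinical = survival_from_diagnosis(patient)
test_based = survival_from_diagnosis(patient, LEAD_TIME_YEARS)
adjusted = test_based - LEAD_TIME_YEARS  # subtract the lead time to remove the bias

print(f"survival after clinical diagnosis:   {clinical:.1f} years")
print(f"survival after test-based diagnosis: {test_based:.1f} years")
print(f"test-based survival, adjusted:       {adjusted:.1f} years")
```

The adjustment is possible here only because the lead time is known by construction; in practice it usually is not.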
Length Bias
Length bias pertains to comparisons that are unadjusted for the rate of progression of disease. The probability that a disease will be detected by testing is directly proportional to the length of its detectable preclinical phase, which is inversely related to its rate of progression (Figure 3). Therefore, disease detected by testing tends to progress less rapidly than disease that would ultimately present clinically in the absence of testing. Furthermore, the effect of the length bias increases in magnitude as the detection threshold of the test is lowered and the spectrum of detected disease is broadened to include the cases progressing the least rapidly (Figure 4). Among these may be cases that would regress, remain stable, or progress too slowly to become clinically apparent during the patient's lifetime. Some authors have described these as cases of "pseudodisease"28 and consider this aspect of length bias separately, as a bias of overdiagnosis29.
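A small simulation can make the length bias concrete. The Python sketch below assumes a single screening examination, disease onsets spread uniformly over a fixed window, and an equal mixture of rapidly and slowly progressing cases with detectable preclinical phases of 1 and 5 years; all of these values are hypothetical and serve only to illustrate the proportionality described above.

```python
import random
from statistics import mean


def simulate_length_bias(n_cases=100_000, window_years=50.0,
                         screen_time=25.0, seed=0):
    """Compare preclinical durations of all cases with those found by one screen."""
    rng = random.Random(seed)
    all_durations = []
    detected_durations = []
    for _ in range(n_cases):
        onset = rng.uniform(0.0, window_years)  # start of detectable phase
        duration = rng.choice([1.0, 5.0])       # fast or slow case (hypothetical)
        all_durations.append(duration)
        # A case is found only if the screen falls inside its preclinical phase.
        if onset <= screen_time < onset + duration:
            detected_durations.append(duration)
    return mean(all_durations), mean(detected_durations)


mean_all, mean_detected = simulate_length_bias()
print(f"mean preclinical phase, all cases:      {mean_all:.2f} years")
print(f"mean preclinical phase, screen-detected: {mean_detected:.2f} years")
```

Although fast and slow cases are equally common in this simulated population, slow cases dominate among those detected by the screen, so the detected group progresses less rapidly on average than the population from which it was drawn.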
This constraint is particularly relevant to the detection of subclinical cancer. As the detection threshold of diagnostic imaging decreases to the level of pathological inspection, the upper bound of the probability of dying of detected cancer becomes small. Table 1 shows how small these upper bounds would become given the relatively constant probabilities of eventually dying of breast,31 prostate,31 and thyroid32 cancer.
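One way to read this upper bound, assumed here because the statement of the constraint is not reproduced in this text, is that the probability of dying of a cancer, given that it has been detected, cannot exceed the overall probability of dying of that cancer divided by the prevalence of detected disease. The Python sketch below applies that reading to the breast cancer prevalences quoted earlier; the lifetime probability of death is a hypothetical placeholder and is not taken from Table 1.

```python
def max_probability_of_death(lifetime_death_probability: float,
                             prevalence: float) -> float:
    """Upper bound on the chance that a person with detected cancer dies of it.

    Follows from P(death) >= P(death | detected) * P(detected).
    """
    if prevalence <= 0:
        raise ValueError("prevalence must be positive")
    return min(1.0, lifetime_death_probability / prevalence)


LIFETIME_DEATH_PROBABILITY = 0.03  # hypothetical placeholder, NOT from Table 1

# Breast cancer prevalences quoted in the text for women aged 40 to 50 years.
thresholds = [("clinically apparent", 0.01), ("autopsy", 0.39)]

for label, prevalence in thresholds:
    bound = max_probability_of_death(LIFETIME_DEATH_PROBABILITY, prevalence)
    print(f"{label:>20}: prevalence {prevalence:.2f}, upper bound {bound:.2f}")
```

Under these assumptions the bound is uninformative at the clinical detection threshold but falls below 10 percent at the threshold of pathological inspection.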
Apparent, Real, and Spurious Effects
From the perspective of the clinician reviewing case series in the literature or from personal experience, in which patients are tracked from the time of diagnosis, the apparent effect (usually positive) of a new diagnostic test that lowers the detection threshold is equal to the real effect (variable) plus the spurious effects (always positive) of the lead-time and length biases27:
Apparent effect = Real effect + Spurious effect.
The individual effects of the lead-time and length biases may be impossible to disentangle and quantify. Two recent randomized trials demonstrate, however, that the combined effect of these biases -- the spurious effect -- can be the chief component of the apparent effect, with the real effect being zero or negative. In the Malmö mammographic screening trial, women over the age of 45 were randomly assigned to either regular mammography or no screening. The case fatality rate tracked from the time of diagnosis was 15 percent for breast cancers detected in the control group and only 3 percent for breast cancers detected at the time of screening33. This apparent reduction of 80 percent in mortality was entirely attributed to lead-time and length biases, however, because there was no difference in mortality from breast cancer tracked from the time of randomization. In the Czech lung cancer screening trial,34 men at high risk were randomly assigned to either chest radiography twice a year or no screening. The five-year survival was 23 percent for lung cancers diagnosed in the study group and 0 percent for those diagnosed in the control group. Again, this apparent improvement was entirely attributed to lead-time and length biases, because mortality from lung cancer was actually higher in the screened group, indicating that the real effect of screening and subsequent intervention was negative.
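The 80 percent figure for the mammographic screening trial, and its decomposition into real and spurious components, can be checked with a short Python sketch. Expressing all three effects as relative reductions in case fatality is an assumption made here for illustration; the real effect is set to zero because no difference in mortality was found from the time of randomization.

```python
def relative_reduction(control_rate: float, screened_rate: float) -> float:
    """Relative reduction in case fatality, as a fraction of the control rate."""
    return (control_rate - screened_rate) / control_rate


CONTROL_CASE_FATALITY = 0.15   # cancers detected in the control group
SCREENED_CASE_FATALITY = 0.03  # cancers detected at the time of screening

apparent_effect = relative_reduction(CONTROL_CASE_FATALITY, SCREENED_CASE_FATALITY)
real_effect = 0.0  # no mortality difference from the time of randomization
spurious_effect = apparent_effect - real_effect  # apparent = real + spurious

print(f"apparent effect: {apparent_effect:.0%}")
print(f"real effect:     {real_effect:.0%}")
print(f"spurious effect: {spurious_effect:.0%}")
```

In this example the whole of the apparent effect is spurious.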
It should be emphasized that these studies do not prove screening mammography and chest radiography to be futile. Had the study conditions -- the methods of testing, the modes of therapy, the linkages between interpretation of the film and treatment, or the targeted populations -- been different, the effects of screening might have been better (or worse). These studies do, however, demonstrate examples of screening that was ineffective and yet appeared to be highly effective from the ordinary clinical perspective (that in which patients were tracked from the time of diagnosis). These disparities between the real and apparent effects of screening mammography and chest radiography are especially disturbing given that these are the only screening strategies using diagnostic imaging that have ever been evaluated in a randomized trial.
Lead-time and length biases pertain not only to changes that lower the threshold for detecting disease, but also to new treatments that are applied at the same time. Whether or not new therapy is more effective than old therapy, patients given diagnoses with the use of lower detection thresholds will appear to have better outcomes than their historical controls because of these biases. Consequently, new therapies often appear promising35 and could even replace older therapies that are more effective or have fewer side effects. Because the decision to treat or to investigate the need for treatment further is increasingly influenced by the results of diagnostic imaging, lead-time and length biases increasingly pervade medical practice.
The Cycle of Increasing Intervention
Misperceptions of disease prevalence and therapeutic effectiveness can promote a cycle of increasing medical intervention, despite the best intentions of all parties. The cycle usually begins with some form of increased testing that lowers the threshold for detecting disease, such as technical improvement in imaging tests, more frequent testing, or closer scrutiny of the images. This immediately leads to a higher diagnostic yield of the disease and a spectrum of milder cases. These effects are almost always interpreted as indicating progress and provide immediate reinforcement for the increased testing, despite the caveat that earlier detection is a double-edged sword7. Unfortunately, the assessment of diagnostic accuracy often contributes to the confusion, because the conventional gold standards are surgical or pathological inspection rather than outcomes for patients. Tests that are more sensitive (at a fixed rate of false positive results) are accepted as better, even though they detect a broader spectrum of disease that includes a subgroup whose natural history and response to intervention are unknown. Consequently, the assessment legitimizes the use of the more sensitive imaging test and becomes a distraction from the fundamental question: How should patients with this newly detectable subclinical disease be treated?
Over time, the reported incidence and prevalence of the detected disease increase. In addition, because of lead-time and length biases, the patients' outcomes usually appear to improve, whether or not there is real improvement. Thus, the apparent increase in the number of detected cases and the apparent improvement in the outcome per case detected reinforce the initial increase in testing and treatment and encourage even more use in the future36. Unless they are interrupted by astute clinicians, testing and treatment may become even more frequent as long as there remain undetected cases of disease and new means of detecting them. This cycle pertains both to individual patients, who may get caught in a cascade of interventions,37 and to large populations of patients, who may be subjected to increasingly intensive screening.
That this cycle can occur in the absence of any benefit is strongly supported by studies of screening for lung cancer. Higher rates of detection, resectability, and five-year survival (from the time of diagnosis) in the screened populations were reported during the early phases of the four most recent randomized trials38,39,40,41. Each of these trials eventually demonstrated, however, that screening did not reduce mortality from lung cancer34,42,43,44. Despite these findings in lung cancer, the same potentially misleading measures of success are being used today to justify aggressive testing and treatment for other diseases, such as breast cancer. Increased detection of minimal cancers is almost always reported as progress,45 and longer survival (from the time of diagnosis) is commonly used to justify mammographic screening in women under 50 years of age,18 who constitute about half of all women screened in the United States14.
Although we have focused on a few specific diseases in asymptomatic patients, these misperceptions about disease prevalence and therapeutic effectiveness are relevant to a wide range of conditions. Patients are now more likely to be given diagnoses of a variety of conditions, such as gallstones,46 herniated disks,47 meniscal tears,48 deep venous thrombosis,49 and pulmonary embolism,50 whether or not they are symptomatic, whether or not their diagnoses are responsible for the symptoms, and whether or not the patients benefit from medical intervention. In addition, advances in imaging can result in upward migration of the disease stage, regardless of its severity5. This greatly complicates the assessment of therapeutic effectiveness because it makes stage-specific historical comparisons invalid. For example, Feinstein et al.5 have demonstrated how the illusion of increased survival for patients at all stages of lung cancer was created by the diagnostic techniques developed in the 1970s, which have since been replaced by even newer techniques. In fact, the benefit of more accurate staging for cancer in general has not been directly demonstrated, and there is indirect evidence that the overall benefit has been small or nonexistent51.
Conclusions
The past two decades have produced dramatic technological advances in diagnostic imaging. Undoubtedly, many patients have benefited from these advances, particularly those that permit the faster and safer diagnosis of symptomatic, treatable disease. However, technological progress has also created confusion, which needs to be recognized and dealt with. Despite clinicians' best intentions, many patients may have been labeled with diseases they do not really have, and many have been given therapy they do not really need.
Much of the confusion resulting from advances in diagnostic imaging could be eliminated if diseases were categorized more carefully according to size or anatomical extent. Data on prevalence, natural history, and therapeutic effectiveness should be explicitly related to size. For the sake of consistency and precision, size should be recorded in standard dimensional units, such as centimeters and cubic centimeters, as opposed to subjective impressions and overly broad categorizations, such as present versus absent. Stratification according to size and adjustment for the sensitivity of the detection method would make statistics on prevalence less dependent on the constantly changing methods of detection. By minimizing lead-time and length biases, stratification according to size would improve the reliability of historical comparisons used to assess the effectiveness of new tests and treatment strategies. In addition, this approach would help to define those newly detectable strata of disease for which the effectiveness of intervention is unknown, thereby providing the opportunity for prospective trials of alternative interventions, including watchful waiting52,53,54. Finally, size stratification of prevalence and effectiveness would foster evaluations of diagnostic accuracy that are similarly stratified. The accuracy of imaging tests should be measured with regard to the anatomical extent of disease, not simply its presence or absence55. This would facilitate the integration of data on prevalence, effectiveness, and diagnostic accuracy for the determination of probabilities of disease56 and clinical usefulness.
All these recommendations will take time to implement. Meanwhile, clinicians can heed the following advice. First, expect the incidence and prevalence of diseases detectable by imaging to increase in the future. Some increases may be predictable on the basis of autopsy studies or other intensive cross-sectional prevalence studies in sample populations. Others may not be so predictable. All types of increases should be expected. The temptation to act aggressively must be tempered by the knowledge that the natural history of a newly detectable disease is unknown. For many diseases, the overall mortality rate has not changed, and the increased prevalence means that the prognosis for any given patient with the diagnosis has actually improved.
Second, expect that advances in imaging will be accompanied by apparent improvements in therapeutic outcomes. The effect of lead-time and length biases may be potent, and clinicians should be skeptical of reported improvements that are based on historical and other comparisons not controlled for the anatomical extent of disease and the rate of progression. Clinicians may even consider that the opposite may be true -- i.e., real outcomes may have worsened because of more aggressive interventions.
Finally, consider maintaining conventional clinical thresholds for treating disease until well-controlled trials prove the benefit of doing otherwise. This will require patience. A well-designed randomized clinical trial takes time. So does accumulating enough experience on outcomes from nonexperimental methods that can be used to control for the extent of disease and the rate of progression. From the point of view of both patients and policy, it is time well spent.
Dr. Welch was supported by the Department of Veterans Affairs Career Development Program, Health Services Research and Development.
We are indebted to Drs. R. Peter Mogielnicki, Robert F. Nease, Jr., Harold M. Swartz, John H. Wasson, and John E. Wennberg for their critique of this manuscript.
Source Information
From the Center for the Evaluative Clinical Sciences, Dartmouth Medical School, Hanover, N.H. (W.C.B., H.G.W.); the Department of Radiology, Dartmouth-Hitchcock Medical Center, Lebanon, N.H. (W.C.B.); and the Medical Service, Veterans Affairs Hospital, White River Junction, Vt. (H.G.W.).
Address reprint requests to Dr. Black at the Center for the Evaluative Clinical Sciences, Dartmouth Medical School, 318 Strausenburgh Hall, Hanover, NH 03755-3863.
References
3.0 cm) renal parenchymal tumor: detection, diagnosis, and controversies. Radiology 1991;179:307-317. [Erratum, Radiology 1991;181:289.]
The New England Journal of Medicine is owned, published, and copyrighted © 2008 Massachusetts Medical Society. All rights reserved.