Within this overall picture, several puzzles have emerged. First, the estimated rates of inappropriate use provide little explanation for widespread differences between geographic regions in the rates of use of specific treatments2,3. Furthermore, in the Rand Health Insurance Study, the rates of inappropriate treatment did not vary among insurance plans, despite wide differences in both the generosity of the insurance and the actual amounts of health care used by subjects covered by the various plans11. One possible explanation is that the process of sorting candidates for medical and surgical interventions does not work well, but the results presented here suggest that another factor -- flawed estimates of the rates of inappropriate treatment -- may account for these findings.
Ratings of the appropriateness of medical interventions have been used to support practice guidelines12,13 and have been suggested for preoperative screening and even for studies of rates of inappropriate use of interventions by individual physicians12. Researchers using indicators of appropriate care have investigated how such indicators are created, but there has been little analysis of the fundamental characteristics of the methods. With this paper I hope to open a discussion of these methods, with the aim of improving them, sharpening their application, and stimulating further research to resolve some of the issues raised here.
Methods
The particular methods used in appropriateness studies differ according to the researcher, but they share a common approach. First, the researcher defines a medical intervention (e.g., carotid endarterectomy) for analysis and reviews the literature to find sets of clinical indications -- often numbering in the hundreds -- that have been suggested as sufficient to justify the intervention. This review of the literature, often involving a meta-analysis, also assists the expert panel that comes next in the process. Second, the researcher convenes a panel of experts to rank each of the indications on a scale of appropriateness, with scores commonly ranging from 1 (certainly inappropriate) to 9 (certainly appropriate). Third, the researcher employs a group of investigators, typically nurses, to abstract the records of patients in various institutions who received the intervention, looking for information relevant to the previously defined indications. Finally, the researcher matches each patient's abstracted record to the closest possible indication and assigns to that patient's treatment the appropriateness score associated with the indication. These or similar methods have formed the basis for most published studies of the appropriateness of medical interventions.
The methods used to study appropriateness have a number of intrinsic problems. They study only interventions that have already occurred (hence ignoring issues of inappropriate failure to perform an intervention), they ignore patients' preferences, and at best they can reflect only the consensus of experts, who may have little clinical science on which to base their judgments. These problems have been addressed elsewhere,13 so this paper focuses on an as yet unexplored issue: methods used to study appropriateness can lead to biased estimates of the rates of inappropriate treatment that may differ markedly from the true rates.
To analyze the problem of biased estimates, one can regard the methods used to study appropriateness as diagnostic tests that attempt to classify patients treated by community physicians as having been treated appropriately or inappropriately. Like any diagnostic test, these methods can create two types of errors: they can have false positive results (classifying treatments as inappropriate when they were appropriate) at a rate of 1 minus specificity per truly appropriate treatment and false negative results (classifying truly inappropriate treatments as appropriate) at a rate of 1 minus sensitivity per truly inappropriate treatment. The Appendix shows the relation between the estimated and true rates of inappropriate treatment as, first (equation 1),
Estimated rate = (true rate × sensitivity) + [(1 - true rate) × (1 - specificity)]
and second (equation 2), in terms relative to the true rate,
Estimated rate/true rate = sensitivity + [(1 - true rate)/true rate] × (1 - specificity).
This process can label a treatment as inappropriate either correctly, when it was in fact inappropriate, or incorrectly, when the community doctors proceeded appropriately but the method mislabeled the treatment. Correct labeling of inappropriate treatment occurs at an overall rate equal to the true rate times the sensitivity. Incorrect labeling occurs at an overall rate equal to (1 minus the true rate) times (1 minus the specificity). Equation 1 shows that the estimated rate of inappropriate treatment is the sum of these correct and incorrect labels, thus creating the potential for biased estimates.
To compare this with a familiar clinical problem, suppose a physician screened 1000 patients for a disease present in 4 percent of the population, using a test with 95 percent sensitivity and 95 percent specificity. The test would, on average, correctly identify 38 of the 40 truly sick patients and falsely identify 48 of the 960 healthy patients as sick. The test would therefore produce positive results 86 times out of 1000. In the notation of equation 1, the estimated rate would be 0.086, and in equation 2, the ratio of the estimated to the true rate would be 2.15 (0.086/0.04). In other words, the estimated rate would be more than double the true rate. As with any diagnostic test, the predictive value of a positive result depends heavily on the underlying rate of occurrence of the event measured by the test. False positives commonly outnumber true positives when the underlying rate of occurrence is low or the false positive rate is very far from zero (or both).
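The arithmetic of the screening example can be checked with a minimal Python sketch of equations 1 and 2. The function names `estimated_rate` and `bias_ratio` are illustrative labels chosen here, not terms from the appropriateness literature.

```python
def estimated_rate(true_rate, sensitivity, specificity):
    """Equation 1: share of cases the method labels positive (inappropriate)."""
    correct_labels = true_rate * sensitivity                # true positives
    incorrect_labels = (1 - true_rate) * (1 - specificity)  # false positives
    return correct_labels + incorrect_labels


def bias_ratio(true_rate, sensitivity, specificity):
    """Equation 2: estimated rate divided by the true rate."""
    return estimated_rate(true_rate, sensitivity, specificity) / true_rate


# Screening example from the text: 1000 patients, 4 percent prevalence,
# 95 percent sensitivity and 95 percent specificity.
n = 1000
true_rate, sens, spec = 0.04, 0.95, 0.95
correctly_flagged = n * true_rate * sens            # sick patients identified
falsely_flagged = n * (1 - true_rate) * (1 - spec)  # healthy patients flagged

print(round(correctly_flagged), round(falsely_flagged))  # 38 48
print(round(estimated_rate(true_rate, sens, spec), 3))   # 0.086
print(round(bias_ratio(true_rate, sens, spec), 2))       # 2.15
```

Because the false positive term is multiplied by (1 - true rate), it dominates whenever the true rate is small, which is the situation the example illustrates.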
With a perfect diagnostic device (sensitivity = specificity = 1), the estimated rate equals the true rate. We have no reason to believe, however, that the methods used to study appropriateness have perfect accuracy, since few if any diagnostic methods have ever achieved such accuracy, particularly for problems such as those considered in appropriateness studies. The members of the expert panels that create appropriateness ratings for each of the many indications for a particular treatment often disagree (i.e., different ratings of appropriateness are assigned by different panel members). On the few occasions when independent ratings from a number of panels have been applied to similar populations, they reveal some disagreement,14,15,16 but no "pure" tests exist in the literature. However, these studies strongly suggest that the methods used to study appropriateness cannot always have perfect sensitivity and specificity.
Results
If one plots the estimated rate as a function of the true rate (equation 1), the result is a straight line with the vertical intercept equal to the false positive rate and the slope equal to the true positive rate minus the false positive rate. Figure 1 shows such a graph for a sensitivity and specificity of 80 percent, as well as the diagonal line representing a perfectly accurate test. As the figure demonstrates, the two lines cross: when the true rate exceeds the crossover point, the estimated rate falls below it, and for true rates below the crossover point, the estimated rate exceeds the true rate. If sensitivity and specificity are equal, then the crossover point always occurs when the true rate is 0.5, a rate exceeding all estimated rates of inappropriate treatment. As intuition suggests, the methods used to study appropriateness generally understate the true rate only when the false positive rate of the method is very small, the true rate of inappropriate treatment is quite large, or both.
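The intercept, slope, and crossover point follow directly from equation 1. Setting the estimated rate equal to the true rate and solving gives a crossover at (1 - specificity) / [(1 - sensitivity) + (1 - specificity)]. The sketch below computes these quantities; the function names are illustrative, and the second pair of sensitivity and specificity values is an assumed example, not an estimate from any study.

```python
def line_parameters(sensitivity, specificity):
    """Intercept and slope of the estimated rate as a function of the true rate."""
    intercept = 1 - specificity            # false positive rate
    slope = sensitivity - (1 - specificity)  # true positive minus false positive rate
    return intercept, slope


def crossover_true_rate(sensitivity, specificity):
    """True rate at which the estimated rate equals the true rate."""
    miss_rate = 1 - sensitivity
    false_positive_rate = 1 - specificity
    if miss_rate + false_positive_rate == 0:
        return None  # perfect test: the estimate equals the truth everywhere
    return false_positive_rate / (miss_rate + false_positive_rate)


# The case plotted in Figure 1: sensitivity = specificity = 0.80.
print(crossover_true_rate(0.80, 0.80))

# Assumed illustrative case: a more specific method lowers the crossover point.
print(round(crossover_true_rate(0.80, 0.95), 3))
```

With equal sensitivity and specificity the two error rates cancel in the ratio, so the crossover sits at 0.5 regardless of how accurate the method is.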
It bears emphasizing that there are no estimates of the misclassification rates of methods used to measure appropriateness. To estimate the accuracy of these methods in the traditional fashion of evaluating diagnostic tests, one would need a gold standard of truth that we cannot know. Fortunately, several methods have been developed to analyze cases such as these. They require the simultaneous application of independent tests (i.e., independent panels on appropriateness) to the same population or (even better) to different populations. These maximum-likelihood methods allow estimation of both the true prevalence rates and the misclassification rates for diagnostic tests, in addition to the interrater reliability rates that such studies commonly provide17,18.
Existing methods also allow the estimation of complete receiver-operating-characteristic curves, which show the various combinations of true positive and false positive rates that occur when one selects different cutoff values for diagnostic tests. These methods,19 like those previously discussed,17,18 require the use of more than one diagnostic test when there is no true gold standard. This technique could potentially be used in analyzing methods to study appropriateness, although some commentators urge caution because it relies on a consensus gold standard that may itself be biased20.
It is also worth noting that, once receiver-operating-characteristic curves have been estimated with appropriate methods,19 the arbitrary cutoff points usually chosen to define appropriateness (i.e., 1 to 3, inappropriate; 4 to 6, equivocal; and 7 to 9, appropriate) may potentially be improved by taking account of the costs of false positive and false negative errors and the underlying frequency of inappropriate treatment. The methods for this are well known21,22 but have not yet been applied to the study of appropriateness. They would suggest, for example, classifying any intervention with a score lower than, say, 5 (rather than the customary cutoff of 3) as inappropriate if the costs of false negative classifications were relatively large. Similarly, this approach suggests that the cutoff point be shifted in the other direction if the costs of false positive mistakes are relatively large (labeling treatments as inappropriate only if the scores are, say, 2 or lower). Although such an approach will not improve the accuracy of methods used to classify appropriateness, it will reduce the costs of misclassification errors.
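The logic of choosing a cutoff by weighing error costs can be sketched as follows. Every number in this example is hypothetical: no receiver-operating-characteristic curve has been estimated for appropriateness ratings, so the operating points, the true rate, and the cost weights are invented solely to show the mechanics.

```python
# HYPOTHETICAL operating points, invented for illustration only.
# Key: cutoff score; a treatment is labeled inappropriate when its
# rating is at or below the cutoff. Value: (sensitivity, specificity).
operating_points = {
    2: (0.40, 0.98),
    3: (0.60, 0.95),
    4: (0.75, 0.88),
    5: (0.85, 0.78),
}


def expected_cost(true_rate, sensitivity, specificity, cost_fn, cost_fp):
    """Expected misclassification cost per treatment reviewed."""
    false_negatives = true_rate * (1 - sensitivity)
    false_positives = (1 - true_rate) * (1 - specificity)
    return cost_fn * false_negatives + cost_fp * false_positives


def best_cutoff(points, true_rate, cost_fn, cost_fp):
    """Cutoff with the lowest expected cost among the supplied operating points."""
    return min(
        points,
        key=lambda c: expected_cost(true_rate, *points[c], cost_fn, cost_fp),
    )


true_rate = 0.20  # assumed true rate of inappropriate treatment
print(best_cutoff(operating_points, true_rate, cost_fn=1, cost_fp=1))
print(best_cutoff(operating_points, true_rate, cost_fn=10, cost_fp=1))
print(best_cutoff(operating_points, true_rate, cost_fn=1, cost_fp=10))
```

Under these assumed values, equal costs favor the customary cutoff, costly false negatives push the cutoff upward, and costly false positives push it downward, which is the direction of adjustment described in the text.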
A different approach would assess changes in the health of patients classified as appropriately or inappropriately treated. Both "healthy" and "incurable" patients treated inappropriately should have less improvement in health than those treated appropriately. Thus, studies of the changes in health status of patients categorized as inappropriately or appropriately treated should illuminate the validity of the process. If correctly classified, inappropriately treated patients should have no improvement in health, whereas appropriately treated patients should, at least on average, have some improvement.
Discussion
The overall value and credibility of methods to assess the appropriateness of medical interventions cannot be determined until studies estimate the sensitivity and specificity of this "diagnostic test." The nature of diagnostic tests makes the chances of biased estimates quite high. The bias can occur in either direction, but the nature of the problem suggests that the magnitude of upward bias will be more severe if it occurs, and it is perhaps more likely to occur than downward bias. Only studies allowing estimates of the sensitivity and specificity of these methods can illuminate the direction and magnitude of any biases. There are currently no estimates of the misclassification rates of these methods, so any consideration of the consequences of using the appropriateness method must remain speculative.
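To see how strongly any correction would depend on the unknown misclassification rates, one can invert equation 1 algebraically. The sketch below is not a method proposed in this paper; it assumes that sensitivity and specificity are known, which is exactly what is currently missing, and the values passed in are illustrative assumptions.

```python
def implied_true_rate(estimated_rate, sensitivity, specificity):
    """Solve equation 1 for the true rate, given assumed error rates."""
    denominator = sensitivity + specificity - 1
    if denominator <= 0:
        raise ValueError("requires sensitivity + specificity > 1")
    true_rate = (estimated_rate - (1 - specificity)) / denominator
    # Clamp to [0, 1]: sampling noise or wrong assumptions can push the
    # algebraic solution outside the range of a valid rate.
    return min(1.0, max(0.0, true_rate))


# The screening example in reverse: an estimated rate of 0.086 with
# 95 percent sensitivity and specificity implies a true rate of 0.04.
print(round(implied_true_rate(0.086, 0.95, 0.95), 3))

# Assumed alternative: the same estimate with slightly lower specificity.
print(round(implied_true_rate(0.086, 0.95, 0.92), 3))
```

A change of a few percentage points in the assumed specificity moves the implied true rate severalfold, which is why the direction and size of the bias cannot be judged without direct estimates of these quantities.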
If the methods used to study appropriateness do suffer from the problems identified here, that would offer one explanation for the lack of correlation between estimated rates of inappropriate treatment and overall rates of treatment identified in the literature2,3 and across experimental plans in the Rand Health Insurance Study11. Returning to equation 1, if true rates of inappropriateness are low, then estimated rates can be dominated by even small rates of false positive results, and hence may show little correlation with actual treatment rates, as these studies found.
The results from New York8,9,10 are the only anomaly in the general finding of high rates of inappropriate medical intervention1,2,3,4,5,6,7. Several issues bear mention. First, the authors of those studies noted that the regulatory environment in New York may lead to important differences in rates of inappropriate care8,9,10.
Second, these results highlight the possible vulnerability of ratings to apparently small decisions. For example, the treatment of 33 patients with unexplained cardiomegaly or congestive heart failure who underwent angiography was classified as uncertain in the 1990 ratings, but would have been declared inappropriate under previous criteria10. This single modification in the criteria shifted the rate of inappropriate care from 6.5 to 4 percent, showing that apparently innocuous decisions can alter estimated rates substantially.
Similarly, the results for percutaneous transluminal coronary angioplasty reveal the potential importance of a panel's composition; 38 percent of patients underwent treatment of uncertain appropriateness, "[mostly] because the median panel rating was within the uncertain range (i.e., between 4-6)"9. In the parallel New York angioplasty study, 20 percent of the cases were classified as uncertain. In such settings, the shift of a single panel member's ratings from, say, 4 to 3 can readily alter estimated rates of appropriateness by shifting the median rating for specific indications from uncertain to inappropriate.
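The effect of one panel member's rating on the median can be illustrated with a short sketch. The nine ratings below are hypothetical, and the treatment of a median falling between categories (for example, 3.5 with an even number of raters) is an assumption made here, not a rule taken from the studies cited.

```python
from statistics import median


def classify(ratings):
    """Classify an indication by its median panel rating on the 1-to-9 scale."""
    m = median(ratings)
    if m <= 3:
        return "inappropriate"
    if m <= 6:  # assumption: medians between 3 and 4 count as uncertain
        return "uncertain"
    return "appropriate"


# HYPOTHETICAL ratings from a nine-member panel for one indication.
panel = [2, 3, 3, 3, 4, 5, 5, 6, 7]

# One member moves a single rating from 4 to 3.
shifted = panel.copy()
shifted[shifted.index(4)] = 3

print(median(panel), classify(panel))
print(median(shifted), classify(shifted))
```

Every patient matched to this indication would be reclassified by that single change, so the effect on estimated rates scales with how common the indication is in the abstracted records.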
Relatively frequent use of the "uncertain" category may also lower the sensitivity of the appropriateness method, increasing the chance that the estimated rate contains a downward bias that acts to offset the upward bias arising from the presence of false positives (equation 1). This is most common when the "uncertain" and "appropriate" categories are combined, as is frequently done.
Rates of inappropriate care have most often been estimated for geographic regions, but there is interest in applying the method to smaller units of observation as well. Brook notes that "if appropriateness is to be improved, it will have to be assessed directly at the level of each patient, hospital, and physician"12. Most troublesome for individual physicians would be the problem of poor specificity, which often labels an appropriate intervention as inappropriate. Troublesome for patients and payers would be poor sensitivity, which labels treatments as appropriate when in fact they are inappropriate. Of course, these methods cannot identify patients who could have received a beneficial treatment but did not, potentially a greater source of concern to patients.
Finally, the methods used to study appropriateness merely provide a refined way of recording conventional wisdom about the efficacy of medical therapies, wisdom that often stands without strong scientific support. They cannot substitute for careful analysis of the actual effectiveness of medical treatments. Methods based on reaching a consensus among experts do not create new scientific data; they only codify old beliefs. Greatly increased investments to provide a scientific basis for understanding when various treatments work, and for whom, will provide the best possible information for decision making. Decisions guided by scientific data must be better than those based only on consensus. Major new investments in studies of the effectiveness of medical treatments could perhaps accomplish this goal, and the expected payoff from such studies exceeds the costs of conducting them by several orders of magnitude23,24.
Supported in part by a grant (R01-5477) from the Agency for Health Care Policy and Research.
Source Information
From the Department of Community and Preventive Medicine, University of Rochester School of Medicine and Dentistry, 601 Elmwood Ave., Box 644, Rochester, NY 14642, where reprint requests should be addressed to Dr. Phelps.
Appendix
This Appendix employs the concept of the "true" state of health, unknown to both the community doctors whose practices are evaluated and the expert panel that provides the basis for that evaluation. Relative to this standard, both the community doctors and the expert panel make errors of judgment. In the following equations, A represents the fact that the community doctor has treated the patient, B the fact that the expert panel says the treatment is inappropriate, S the true (gold standard) condition of "sick" (i.e., treatment will benefit the patient), and H the true (gold standard) condition of "healthy" (i.e., treatment will not benefit the patient). The conditional probabilities can be defined as follows: P(B|A) is the estimated rate of inappropriate treatment, P(H|A) is the true rate of inappropriate treatment (so P(S|A) is 1 minus the true rate), P(B|H,A) is the sensitivity of the appropriateness method, and P(B|S,A) is 1 minus the specificity of the appropriateness method. Then, assuming the stochastic independence of A and B,
P(B|A) = P(B,S|A) + P(B,H|A) = [P(S|A) P(B|S,A) + P(H|A) P(B|H,A)] = (1 - true rate) x (1 - specificity) + true rate x sensitivity.
If the events A and B are not independent, then the joint distributions of P(B,S|A) and P(B,H|A) must be used, complicating the expression but generally not altering the basic insight into the problem.
Related Letters: Appropriateness Studies. Black N., Park R. E., Brook R. H., Dubois R. W., Hall M. A., Phelps C. E., Kassirer J. P. N Engl J Med 1994;330:432-434 (Feb 10, 1994). Correspondence.
The New England Journal of Medicine is owned, published, and copyrighted © 2009 Massachusetts Medical Society. All rights reserved.