The New England Journal of Medicine
Special Article
Volume 329:1241-1245 October 21, 1993 Number 17

The Methodologic Foundations of Studies of the Appropriateness of Medical Care
Charles E. Phelps

 

As health care costs continue to increase rapidly, both health care providers and consumers have expressed concern that the additional resources used for health services do not provide commensurate increases in health benefits. Adding fuel to this concern, a number of disquieting studies have estimated the rates of "inappropriate" use, in a variety of settings, of procedures such as coronary angiography, carotid endarterectomy, endoscopy, and coronary-artery bypass graft surgery1,2,3,4. The estimated rates of inappropriate treatment have ranged from about 15 to 30 percent, reaching as high as 40 percent for particular procedures at individual institutions. Recent studies estimated a rate of 16 percent for inappropriate hysterectomy in seven health maintenance organizations,5 a rate of 24 percent for inappropriate days spent in a Canadian children's hospital,6 and a rate of 23 percent for inappropriate hospitalizations for measles7. The only results that run counter to these high rates have come from recent studies of coronary-artery bypass graft surgery (2 percent inappropriate use),8 percutaneous transluminal coronary angioplasty (4 percent),9 and coronary angiography (4 percent)10 in New York. One study concluded that "a substantial fraction of hospitalization is potentially avoidable"11. A leader in studies of medical appropriateness has stated, "If one could extrapolate from the available literature, then perhaps one fourth of hospital days, one fourth of procedures, and two fifths of medications could be done without"12. If this is true, then the country's annual health care bill could be cut by perhaps $100 billion without harm to the public.

Within this overall picture, several puzzles have emerged. First, the estimated rates of inappropriate use provide little explanation for widespread differences between geographic regions in the rates of use of specific treatments2,3. Furthermore, in the Rand Health Insurance Study, the rates of inappropriate treatment did not vary among insurance plans, despite wide differences in both the generosity of the insurance and the actual amounts of health care used by subjects covered by the various plans11. One possible explanation is that the process of sorting candidates for medical and surgical interventions does not work well, but the results presented here suggest that another factor -- flawed estimates of the rates of inappropriate treatment -- may account for these findings.

Ratings of the appropriateness of medical interventions have been used to support practice guidelines12,13 and have been suggested for preoperative screening and even for studies of rates of inappropriate use of interventions by individual physicians12. Researchers using indicators of appropriate care have investigated how such indicators are created, but there has been little analysis of the fundamental characteristics of the methods. With this paper I hope to open a discussion of these methods, with the aim of improving them, sharpening their application, and stimulating further research to resolve some of the issues raised here.

Methods

The particular methods used in appropriateness studies differ according to the researcher, but they share a common approach. First, the researcher defines a medical intervention (e.g., carotid endarterectomy) for analysis and reviews the literature to find sets of clinical indications -- often numbering in the hundreds -- that have been suggested as sufficient to justify the intervention. This review of the literature, often involving a meta-analysis, also assists the expert panel that comes next in the process. Second, the researcher convenes a panel of experts to rank each of the indications on a scale of appropriateness, with scores commonly ranging from 1 (certainly inappropriate) to 9 (certainly appropriate). Third, the researcher employs a group of investigators, typically nurses, to abstract the records of patients in various institutions who received the intervention, looking for information relevant to the previously defined indications. Finally, the researcher matches each patient's abstracted record to the closest possible indication and assigns to that patient's treatment the appropriateness score associated with the indication. These or similar methods have formed the basis for most published studies of the appropriateness of medical interventions.

The methods used to study appropriateness have a number of intrinsic problems. They study only interventions that have already occurred (hence ignoring inappropriate failures to perform an intervention), they ignore patients' preferences, and at best they can reflect only the consensus of experts, who may have little clinical science on which to base their judgments. These problems have been addressed elsewhere,13 so this paper focuses on an as yet unexplored issue: methods used to study appropriateness can lead to biased estimates of the rates of inappropriate treatment that may differ markedly from the true rates.

To analyze the problem of biased estimates, one can regard the methods used to study appropriateness as diagnostic tests that attempt to classify patients treated by community physicians as having been treated appropriately or inappropriately. Like any diagnostic test, these methods can create two types of errors: they can have false positive results (classifying treatments as inappropriate when they were appropriate) at a rate of 1 minus specificity per truly appropriate treatment and false negative results (classifying truly inappropriate treatments as appropriate) at a rate of 1 minus sensitivity per truly inappropriate treatment. The Appendix shows the relation between the estimated and true rates of inappropriate treatment as, first (equation 1),

Estimated rate = true rate x sensitivity + (1 - true rate) x (1 - specificity)

and second (equation 2), in terms relative to the true rate,

Estimated rate/true rate = sensitivity + [(1 - true rate)/true rate] x (1 - specificity).

This process can label a treatment as inappropriate either correctly, when it was inappropriate, or incorrectly, when the community doctors proceeded appropriately but the method mislabeled the treatment as inappropriate. The correct labeling of inappropriate treatment occurs at an overall rate equal to the true rate times the sensitivity. Incorrect labeling occurs at an overall rate equal to (1 minus the true rate) times (1 minus the specificity). Equation 1 shows that the estimated rate of inappropriate treatment combines both the correct and the incorrect labeling results, thus creating the potential for biased estimates.

To compare this with a familiar clinical problem, suppose a physician screened 1000 patients for a disease present in 4 percent of the population, using a test with 95 percent sensitivity and specificity. The test would, on average, correctly identify 38 of the 40 truly sick patients and falsely identify 48 of the 960 healthy patients as sick. The test would produce positive results 86 times out of 1000. In the notation of equation 1, the estimated rate would be 0.086, and in equation 2, the ratio of the estimated to the true rate would be 2.15 (0.086/0.04). In other words, the estimated rate would be more than double the true rate. As with any diagnostic test, the predictive value of a positive result depends heavily on the underlying rate of occurrence of the event measured by the test. False positives commonly outnumber true positives when the underlying rate of occurrence is low or the false positive rate is very far from zero (or both).
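The screening arithmetic above can be checked with a short sketch. The function names are illustrative and not part of the original analysis; the inputs are the values given in the example (prevalence 4 percent, sensitivity and specificity 95 percent).

```python
def estimated_rate(true_rate, sensitivity, specificity):
    """Equation 1: share of cases flagged, whether correctly or not."""
    true_positives = true_rate * sensitivity
    false_positives = (1 - true_rate) * (1 - specificity)
    return true_positives + false_positives


def bias_ratio(true_rate, sensitivity, specificity):
    """Equation 2: estimated rate relative to the true rate."""
    return estimated_rate(true_rate, sensitivity, specificity) / true_rate


def positive_predictive_value(true_rate, sensitivity, specificity):
    """Share of flagged cases that are truly positive."""
    flagged = estimated_rate(true_rate, sensitivity, specificity)
    return true_rate * sensitivity / flagged


n_patients = 1000
prevalence, sens, spec = 0.04, 0.95, 0.95

print("true positives :", round(n_patients * prevalence * sens))
print("false positives:", round(n_patients * (1 - prevalence) * (1 - spec)))
print("estimated rate :", round(estimated_rate(prevalence, sens, spec), 3))
print("bias ratio     :", round(bias_ratio(prevalence, sens, spec), 2))
print("predictive value of a positive result:",
      round(positive_predictive_value(prevalence, sens, spec), 2))
```

Under these inputs fewer than half of the positive results are true positives, which is the sense in which false positives outnumber true positives.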

With a perfect diagnostic device (sensitivity = specificity = 1), the estimated rate equals the true rate. We have no reason to believe, however, that the methods used to study appropriateness have perfect accuracy, since few if any diagnostic methods have ever achieved such accuracy, particularly for problems such as those considered in appropriateness studies. The members of the expert panels that create appropriateness ratings for each of the many indications for a particular treatment often disagree (i.e., different ratings of appropriateness are assigned by different panel members). On the few occasions when independent ratings from a number of panels have been applied to similar populations, they reveal some disagreement,14,15,16 but no "pure" tests exist in the literature. However, these studies strongly suggest that the methods used to study appropriateness cannot always have perfect sensitivity and specificity.

Results

If one plots the estimated rate as a function of the true rate (equation 1), the result is a straight line with the vertical intercept equal to the false positive rate and the slope equal to the true positive rate minus the false positive rate. Figure 1 shows such a graph for a sensitivity and specificity of 80 percent, as well as the diagonal line representing a perfectly accurate test. As the figure demonstrates, when the true rate becomes sufficiently large, the estimated rate falls below it. For true rates below this cutoff point, the estimated rate exceeds the true rate. If sensitivity and specificity are equal, then the crossover point always occurs when the true rate is 0.5, a rate exceeding all estimated rates of inappropriate treatment. As intuition suggests, the methods used to study appropriateness generally understate the true rate only when the false positive rate of the method is very small, the true rate of inappropriate treatment is quite large, or both.
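The crossover point follows from setting the estimated rate equal to the true rate in equation 1 and solving, which gives a true rate of (1 - specificity) / [(1 - sensitivity) + (1 - specificity)]. The sketch below is a minimal check of that algebra; the function names are illustrative.

```python
def estimated_rate(true_rate, sensitivity, specificity):
    """Equation 1."""
    return true_rate * sensitivity + (1 - true_rate) * (1 - specificity)


def crossover_true_rate(sensitivity, specificity):
    """True rate at which the estimated rate equals the true rate.

    Solves t = t * sens + (1 - t) * (1 - spec) for t.
    """
    false_positive_rate = 1 - specificity
    false_negative_rate = 1 - sensitivity
    total_error = false_positive_rate + false_negative_rate
    if total_error == 0:
        raise ValueError("perfect test: the two rates coincide everywhere")
    return false_positive_rate / total_error


t_star = crossover_true_rate(0.80, 0.80)
print("crossover true rate:", round(t_star, 3))
print("estimated rate there:", round(estimated_rate(t_star, 0.80, 0.80), 3))
```

With equal sensitivity and specificity the two error rates cancel and the crossover sits at 0.5, as stated above.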


Figure 1. Relation between the Estimated and True Rates of Inappropriate Treatment, Assuming a Sensitivity and Specificity of 80 Percent.

For a given true positive rate, decreasing the false positive rate reduces the intercept and increases the slope of the line for an imperfect test, thus creating a lower crossover point at which the estimated rate understates the true rate. For a given false positive rate, increasing the true positive rate increases the slope of the line for an imperfect test while leaving the intercept unchanged, which raises the crossover point.

 
To show the effects of classification errors, Figure 2 and Figure 3 (using equation 2) show the ratio of the estimated rate of inappropriateness to the true rate for various underlying true rates ranging from 0.05 to 0.2. Figure 2 shows the results when the methods used to study appropriateness have a high sensitivity (95 percent). In this case, the methods almost always overstate the true rate, and any understatement is trivially small. In Figure 3, the sensitivity is lower (80 percent), as would occur, for example, if the method used to study appropriateness went to great lengths to avoid falsely labeling doctors as having high rates of inappropriate treatment. Both figures show a flat line (the "no bias line") where the ratio of the estimated to the true rate equals 1, to assist in determining combinations of true rate and specificity that lead to upward and downward biases.


Figure 2. Errors in Estimating Four True Rates of Inappropriate Treatment, Assuming a Sensitivity of 95 Percent.

The no-bias line indicates a ratio of the estimated to the true rate of 1. The vertical intercept for the four sloping lines is the method's sensitivity (95 percent in this figure). The ratio of the estimated to the true rate exceeds 1 (i.e., the method overstates the true rate) for all but the very smallest false positive rates.

 

Figure 3. Errors in Estimating Four True Rates of Inappropriate Treatment, Assuming a Sensitivity of 80 Percent.

The no-bias line indicates a ratio of the estimated to the true rate of 1. The vertical intercept of the four sloping lines is the method's sensitivity (80 percent in this figure). The ratio of the estimated to the true rate is less than 1 (i.e., the method understates the true rate) only for false positive rates below a threshold that ranges from about 1 percent (for a true rate of 0.05) to 5 percent (for a true rate of 0.20). For all higher false positive rates, the ratio exceeds 1 (i.e., the method overstates the true rate).

 
These figures demonstrate a common pattern. First, the estimated rate understates the true rate only when the specificity is quite high, and the degree of understatement, when it occurs, is relatively small. Second, the estimated rate overstates the true rate as the specificity falls. Third, the problem gets worse as the true rate falls; in other words, better actual practice leads to larger relative errors in the estimated rates of inappropriate treatment.
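The pattern can be reproduced numerically from equation 2. The sketch below assumes the four true rates are 0.05, 0.10, 0.15, and 0.20 (the text gives only the range) and uses an illustrative grid of false positive rates; the function names are hypothetical. It also computes the break-even false positive rate, (1 - sensitivity) x true rate / (1 - true rate), below which the ratio drops under 1.

```python
def bias_ratio(true_rate, sensitivity, false_positive_rate):
    """Equation 2, written in terms of the false positive rate."""
    return sensitivity + (1 - true_rate) / true_rate * false_positive_rate


def break_even_fpr(true_rate, sensitivity):
    """False positive rate at which the ratio in equation 2 equals 1."""
    return (1 - sensitivity) * true_rate / (1 - true_rate)


TRUE_RATES = (0.05, 0.10, 0.15, 0.20)          # assumed values within the stated range
FALSE_POSITIVE_RATES = (0.00, 0.02, 0.05, 0.10, 0.20)  # illustrative grid

for sensitivity in (0.95, 0.80):
    print(f"sensitivity = {sensitivity:.2f}")
    for true_rate in TRUE_RATES:
        ratios = ", ".join(
            f"{bias_ratio(true_rate, sensitivity, fpr):.2f}"
            for fpr in FALSE_POSITIVE_RATES
        )
        print(f"  true rate {true_rate:.2f}: ratios [{ratios}], "
              f"break-even FPR {break_even_fpr(true_rate, sensitivity):.4f}")
```

Reading across a row shows the ratio rising as specificity falls; reading down a column shows the ratio rising as the true rate falls.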

It bears emphasizing that there are no estimates of the misclassification rates of methods used to measure appropriateness. To estimate the accuracy of these methods in the traditional fashion of evaluating diagnostic tests, one would need a gold standard of truth that we cannot know. Fortunately, several methods have been developed to analyze cases such as these. They require the simultaneous application of independent tests (i.e., independent panels on appropriateness) to the same population or (even better) to different populations. These maximum-likelihood methods allow estimation of both the true prevalence rates and the misclassification rates for diagnostic tests, in addition to the interrater reliability rates that such studies commonly provide17,18.
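To make the idea concrete, the following is a minimal sketch of the kind of likelihood such methods maximize, assuming two independent panels applied to two populations, errors that are independent between panels given the true state, and error rates that are the same in both populations. These assumptions, the function names, and every number are hypothetical illustrations, not estimates from any study. A full analysis would maximize this function numerically over the two prevalences and the four error rates.

```python
import math
from itertools import product


def cell_probs(prevalence, sens, spec):
    """Probability of each pair of panel verdicts (1 = rated inappropriate).

    sens and spec are (panel 1, panel 2) tuples; verdicts are assumed
    independent given the true state of the treatment.
    """
    probs = {}
    for verdicts in product((1, 0), repeat=2):
        p_if_inappropriate = prevalence
        p_if_appropriate = 1 - prevalence
        for flagged, se, sp in zip(verdicts, sens, spec):
            p_if_inappropriate *= se if flagged else 1 - se
            p_if_appropriate *= (1 - sp) if flagged else sp
        probs[verdicts] = p_if_inappropriate + p_if_appropriate
    return probs


def log_likelihood(counts_by_population, prevalences, sens, spec):
    """Multinomial log-likelihood summed over populations."""
    total = 0.0
    for counts, prevalence in zip(counts_by_population, prevalences):
        probs = cell_probs(prevalence, sens, spec)
        total += sum(n * math.log(probs[cell]) for cell, n in counts.items())
    return total


# Hypothetical truth, used only to generate expected cross-classified counts.
true_prev, true_sens, true_spec = (0.10, 0.30), (0.85, 0.90), (0.90, 0.95)
expected = [
    {cell: 1000 * p for cell, p in cell_probs(prev, true_sens, true_spec).items()}
    for prev in true_prev
]

ll_true = log_likelihood(expected, true_prev, true_sens, true_spec)
ll_optimistic = log_likelihood(expected, true_prev, (0.95, 0.95), (0.99, 0.99))
print("log-likelihood at generating values:", round(ll_true, 2))
print("log-likelihood assuming better panels:", round(ll_optimistic, 2))
```

On expected counts the generating values score at least as well as any alternative, which is what allows the error rates to be recovered without a gold standard under the stated assumptions.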

Existing methods also allow the estimation of complete receiver-operating-characteristic curves, which show the various combinations of true positive and false positive rates that occur when one selects different cutoff values for diagnostic tests. These methods,19 like those previously discussed,17,18 require the use of more than one diagnostic test when there is no true gold standard. This technique could potentially be used in analyzing methods to study appropriateness, although some commentators urge caution because it relies on a consensus gold standard that may itself be biased20.

It is also worth noting that, once receiver-operating-characteristic curves have been estimated with appropriate methods,19 the arbitrary cutoff points usually chosen to define appropriateness (i.e., 1 to 3, inappropriate; 4 to 6, equivocal; and 7 to 9, appropriate) may potentially be improved by taking account of the costs of false positive and false negative errors and the underlying frequency of inappropriate treatment. The methods for this are well known21,22 but have not yet been applied to the study of appropriateness. They would suggest, for example, classifying any intervention with a score lower than, say, 5 (rather than the customary 3) as inappropriate if the costs of false negative classifications were relatively large. Similarly, this approach suggests that the cutoff point be shifted in the other direction if the costs of false positive mistakes are relatively large (labeling treatments as inappropriate only if the scores are, say, 2 or lower). Although such an approach will not improve the accuracy of methods used to classify appropriateness, it will reduce the costs of misclassification errors.
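A minimal sketch of this cutoff selection follows. The sensitivity and specificity attached to each cutoff are invented for illustration (no such estimates exist, as noted above), as are the true rate, the relative costs, and the function names. A cutoff of c here means that scores of c or lower are labeled inappropriate.

```python
def expected_cost(true_rate, sensitivity, specificity, cost_fn, cost_fp):
    """Expected misclassification cost per treated patient."""
    false_negatives = true_rate * (1 - sensitivity)
    false_positives = (1 - true_rate) * (1 - specificity)
    return false_negatives * cost_fn + false_positives * cost_fp


# Hypothetical operating points: cutoff -> (sensitivity, specificity).
OPERATING_POINTS = {
    2: (0.50, 0.99),
    3: (0.70, 0.95),
    4: (0.85, 0.85),
    5: (0.95, 0.70),
}


def best_cutoff(true_rate, cost_fn, cost_fp):
    """Cutoff with the lowest expected cost among the tabulated points."""
    return min(
        OPERATING_POINTS,
        key=lambda c: expected_cost(true_rate, *OPERATING_POINTS[c], cost_fn, cost_fp),
    )


true_rate = 0.15  # assumed underlying rate of inappropriate treatment
print("false negatives 5x as costly:", best_cutoff(true_rate, cost_fn=5, cost_fp=1))
print("false positives 5x as costly:", best_cutoff(true_rate, cost_fn=1, cost_fp=5))
```

The chosen cutoff moves up when false negatives are expensive and down when false positives are expensive; the accuracy at each cutoff is unchanged, only the expected cost of the errors.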

A different approach would assess changes in the health of patients classified as appropriately or inappropriately treated. Both "healthy" and "incurable" patients treated inappropriately should have less improvement in health than those treated appropriately. Thus, studies of the changes in health status of patients categorized as inappropriately or appropriately treated should illuminate the validity of the process. If correctly classified, inappropriately treated patients should have no improvement in health, whereas appropriately treated patients should, at least on average, have some improvement.

Discussion

The overall value and credibility of methods to assess the appropriateness of medical interventions cannot be determined until studies estimate the sensitivity and specificity of this "diagnostic test." The nature of diagnostic tests makes the chances of biased estimates quite high. The bias can occur in either direction, but the nature of the problem suggests that the magnitude of upward bias will be more severe if it occurs, and it is perhaps more likely to occur than downward bias. Only studies allowing estimates of the sensitivity and specificity of these methods can illuminate the direction and magnitude of any biases. There are currently no estimates of the misclassification rates of these methods, so any consideration of the consequences of using the appropriateness method must remain speculative.

If the methods used to study appropriateness do suffer from the problems identified here, that would offer one explanation for the lack of correlation between estimated rates of inappropriate treatment and overall rates of treatment identified in the literature2,3 and across experimental plans in the Rand Health Insurance Study11. Returning to equation 1, if true rates of inappropriateness are low, then estimated rates can be dominated by even small rates of false positive results, and hence may show little correlation with actual treatment rates, as these studies found.

The results from New York8,9,10 are the only anomaly in the general finding of high rates of inappropriate medical intervention1,2,3,4,5,6,7. Several issues bear mention. First, the authors of those studies noted that the regulatory environment in New York may lead to important differences in rates of inappropriate care8,9,10.

Second, these results highlight the possible vulnerability of ratings to apparently small decisions. For example, the treatment of 33 patients with unexplained cardiomegaly or congestive heart failure who underwent angiography was classified as uncertain in the 1990 ratings, but would have been declared inappropriate under previous criteria10. This single modification in the criteria shifted the rate of inappropriate care from 6.5 to 4 percent, showing that apparently innocuous decisions can alter estimated rates substantially.

Similarly, the results for percutaneous transluminal coronary angioplasty reveal the potential importance of a panel's composition; 38 percent of patients underwent treatment of uncertain appropriateness, "[mostly] because the median panel rating was within the uncertain range (i.e., between 4-6)"9. In the parallel New York angiography study, 20 percent of the cases were classified as uncertain. In such settings, the shift of a single panel member's ratings from, say, 4 to 3 can readily alter estimated rates of appropriateness by shifting the median rating for specific indications from uncertain to inappropriate.
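The fragility of a median rating can be illustrated with a small sketch. The nine ratings below are hypothetical and the function name is illustrative; the category boundaries follow the 1 to 3, 4 to 6, and 7 to 9 convention described earlier.

```python
from statistics import median


def classify(ratings):
    """Map a panel's median rating (1-9 scale) to an appropriateness category."""
    m = median(ratings)
    if m <= 3:
        return "inappropriate"
    if m <= 6:
        return "uncertain"
    return "appropriate"


# Hypothetical nine-member panel rating a single indication.
before = [2, 3, 3, 3, 4, 5, 5, 6, 7]
after = [2, 3, 3, 3, 3, 5, 5, 6, 7]   # one member moves from 4 to 3

print(median(before), classify(before))
print(median(after), classify(after))
```

A one-point change by one member moves the indication, and every patient matched to it, from uncertain to inappropriate.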

Relatively frequent use of the "uncertain" category may also lower the sensitivity of the appropriateness method, increasing the chance that the estimated rate contains a downward bias that acts to offset the upward bias arising from the presence of false positives (equation 1). This is most common when the "uncertain" and "appropriate" categories are combined, as is frequently done.

Rates of inappropriate care have most often been estimated for geographic regions, but there is interest in applying the method to smaller units of observation as well. Brook notes that "if appropriateness is to be improved, it will have to be assessed directly at the level of each patient, hospital, and physician"12. Most troublesome for individual physicians would be the problem of poor specificity, which often labels an appropriate intervention as inappropriate. Troublesome for patients and payers would be poor sensitivity, which labels treatments as appropriate when in fact they are inappropriate. Of course, these methods cannot identify patients who could have received a beneficial treatment but did not, potentially a greater source of concern to patients.

Finally, the methods used to study appropriateness merely provide a refined way of recording conventional wisdom about the efficacy of medical therapies, wisdom that often stands without strong scientific support. They cannot substitute for careful analysis of the actual effectiveness of medical treatments. Methods based on reaching a consensus among experts do not create new scientific data; they only codify old beliefs. Greatly increased investments to provide a scientific basis for understanding when various treatments work, and for whom, will provide the best possible information for decision making. Decisions guided by scientific data must be better than those based only on consensus. Major new investments in studies of the effectiveness of medical treatments could perhaps accomplish this goal, and the expected payoff from such studies exceeds the costs of conducting them by several orders of magnitude23,24.

Supported in part by a grant (R01-5477) from the Agency for Health Care Policy and Research.


Source Information

From the Department of Community and Preventive Medicine, University of Rochester School of Medicine and Dentistry, 601 Elmwood Ave., Box 644, Rochester, NY 14642, where reprint requests should be addressed to Dr. Phelps.

References

  1. Brook RH, Park RE, Chassin MR, Solomon DH, Keesey J, Kosecoff J. Predicting the appropriate use of carotid endarterectomy, upper gastrointestinal endoscopy, and coronary angiography. N Engl J Med 1990;323:1173-1177.
  2. Chassin MR, Kosecoff J, Park RE, et al. Does inappropriate use explain geographic variations in the use of health care services? A study of three procedures. JAMA 1987;258:2533-2537.
  3. Leape LL, Park RE, Solomon DH, Chassin MR, Kosecoff J, Brook RH. Does inappropriate use explain small-area variations in the use of health care services? JAMA 1990;263:669-672.
  4. Winslow CM, Kosecoff JB, Chassin M, Kanouse DE, Brook RH. The appropriateness of performing coronary artery bypass surgery. JAMA 1988;260:505-509.
  5. Bernstein SJ, McGlynn EA, Siu AL, et al. The appropriateness of hysterectomy: a comparison of care in seven health plans. JAMA 1993;269:2398-2402.
  6. Gloor JE, Kissoon N, Joubert GI. Appropriateness of hospitalization in a Canadian pediatric hospital. Pediatrics 1993;91:70-74.
  7. Havens PL, Butler JC, Day SE, Mohr BA, Davis JP, Chusid MJ. Treating measles: the appropriateness of admission to a Wisconsin children's hospital. Am J Public Health 1993;83:379-384.
  8. Leape LL, Hilborne LH, Park RE, et al. The appropriateness of use of coronary artery bypass graft surgery in New York State. JAMA 1993;269:753-760.
  9. Hilborne LH, Leape LL, Bernstein SJ, et al. The appropriateness of use of percutaneous transluminal coronary angioplasty in New York State. JAMA 1993;269:761-765.
  10. Bernstein SJ, Hilborne LH, Leape LL, et al. The appropriateness of use of coronary angiography in New York State. JAMA 1993;269:766-769.
  11. Siu AL, Sonnenberg FA, Manning WG, et al. Inappropriate use of hospitals in a randomized trial of health insurance plans. N Engl J Med 1986;315:1259-1266.
  12. Brook RH. Practice guidelines and practicing medicine: are they compatible? JAMA 1989;262:3027-3030.
  13. Audet AM, Greenfield S, Field M. Medical practice guidelines: current activities and future directions. Ann Intern Med 1990;113:709-714.
  14. Brook RH, Kosecoff JB, Park RE, Chassin MR, Winslow CM, Hampton JR. Diagnosis and treatment of coronary artery disease: comparison of doctors' attitudes in the USA and the UK. Lancet 1988;1:750-753.
  15. Merrick NJ, Fink A, Brook RH, et al. Indications for selected medical and surgical procedures: a literature review and ratings of appropriateness: carotid endarterectomy. Santa Monica, Calif.: RAND, 1986. (Report no. R-3204/6.)
  16. Park RE, Fink A, Brook RH, et al. Physician ratings of appropriate indications for six medical and surgical procedures. Am J Public Health 1986;76:766-772.
  17. Hui SL, Walter SD. Estimating the error rates of diagnostic tests. Biometrics 1980;36:167-171.
  18. Walter SD, Irwig LM. Estimation of test error rates, disease prevalence and relative risk from misclassified data: a review. J Clin Epidemiol 1988;41:923-937.
  19. Henkelman RM, Kay I, Bronskill MJ. Receiver operator characteristic (ROC) analysis without truth. Med Decis Making 1990;10:24-29.
  20. Begg CB, Metz CE. Consensus diagnoses and "gold standards." Med Decis Making 1990;10:29-30. [Erratum, Med Decis Making 1990;10:149.]
  21. Swets JA, Pickett RM. Evaluation of diagnostic systems: methods from signal detection theory. New York: Academic Press, 1982.
  22. Phelps CE, Mushlin AI. Focusing technology assessment using medical decision theory. Med Decis Making 1988;8:279-289.
  23. Phelps CE, Parente ST. Priority setting in medical technology and medical practice assessment. Med Care 1990;28:703-723. [Erratum, Med Care 1992;30:744-51.]
  24. Phelps CE, Mooney C. Correction and update on "Priority setting in medical technology assessment." Med Care 1992;30:744-751.
Appendix

This Appendix employs the concept of the "true" state of health, unknown to both the community doctors whose practices are evaluated and the expert panel that provides the basis for that evaluation. Relative to this standard, both the community doctors and the expert panel make errors of judgment. In the following equations, A represents the fact that the community doctor has treated the patient, B the fact that the expert panel says the treatment is inappropriate, S the true (gold standard) condition of "sick" (i.e., treatment will benefit the patient), and H the true (gold standard) condition of "healthy" (i.e., treatment will not benefit the patient). The conditional probabilities can be defined as follows: P(B|A) is the estimated rate of inappropriate treatment, P(H|A) is the true rate of inappropriate treatment (so P(S|A) is 1 minus the true rate), P(B|H,A) is the sensitivity of the appropriateness method, and P(B|S,A) is 1 minus the specificity of the appropriateness method. Then, assuming the stochastic independence of A and B,

P(B|A) = P(B,S|A) + P(B,H|A) = [P(S|A) P(B|S,A) + P(H|A) P(B|H,A)] = (1 - true rate) x (1 - specificity) + true rate x sensitivity.

If the events A and B are not independent, then the joint distributions of P(B,S|A) and P(B,H|A) must be used, complicating the expression but generally not altering the basic insight into the problem.
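As an informal check on the Appendix result, the sketch below simulates treated patients, draws the true state and the panel's verdict for each, and compares the simulated share labeled inappropriate with equation 1. The parameter values, the seed, and the function names are arbitrary choices for illustration.

```python
import random


def simulate_estimated_rate(true_rate, sensitivity, specificity,
                            n_patients, seed=0):
    """Share of simulated treated patients whom the panel labels inappropriate."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(n_patients):
        # True state H (treatment will not benefit) with probability true_rate.
        truly_inappropriate = rng.random() < true_rate
        # Panel flags with P(B|H,A) = sensitivity or P(B|S,A) = 1 - specificity.
        p_flag = sensitivity if truly_inappropriate else 1 - specificity
        if rng.random() < p_flag:
            flagged += 1
    return flagged / n_patients


def equation_1(true_rate, sensitivity, specificity):
    return true_rate * sensitivity + (1 - true_rate) * (1 - specificity)


simulated = simulate_estimated_rate(0.10, 0.80, 0.90, n_patients=200_000)
print("simulated estimated rate:", round(simulated, 3))
print("equation 1              :", round(equation_1(0.10, 0.80, 0.90), 3))
```

With a large simulated sample the two values should agree closely, differing only by sampling noise.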


 


Related Letters:

Appropriateness Studies
Black N., Park R. E., Brook R. H., Dubois R. W., Hall M. A., Phelps C. E., Kassirer J. P.
N Engl J Med 1994; 330:432-434, Feb 10, 1994. Correspondence


The New England Journal of Medicine is owned, published, and copyrighted © 2009 Massachusetts Medical Society. All rights reserved.