(Chest. 2000;117:182S-185S.)
© 2000 American College of Chest Physicians

Panel Methodology*

Analytic Principles in Evaluating the Performance Characteristics of Diagnostic Tests for Ventilator-Associated Pneumonia

Steven H. Woolf, MD, MPH

* From the Department of Family Practice, Medical College of Virginia, Fairfax, VA 22033.


    Introduction
 
The panel followed a systematic approach to reviewing the evidence and developing its recommendations, which consisted of the following: (1) definition of focus, (2) review of evidence, (3) assessment of expert opinion and feasibility issues, (4) development of recommendations, and (5) outside review.


    Definition of Focus
 
In Spring 1996, the panel defined the following: the target condition, patient population, and providers to which the guidelines would apply; the specific interventions to be evaluated; and the types of scientific evidence to be reviewed. Time and resource constraints compelled the panel to define relatively narrow boundaries, excluding important topic areas beyond the scope of the panel’s mandate. It should not be inferred that the excluded topics are irrelevant to the management of patients with ventilator-associated pneumonia (VAP) or that they do not require careful review. Other groups are encouraged to examine the evidence in these areas and to make recommendations to help complete the scientific knowledge base.

Target Condition
The condition of interest was VAP. Pneumonia in patients not receiving ventilatory support (even those in critical care settings) was beyond the scope of the panel review. Although this report focuses on the sensitivity and specificity of tests in detecting VAP, an equally important outside consideration is the accuracy of these tests in identifying pathologic organisms of clinical significance. The use of some tests that are highly sensitive in detecting VAP results in overdiagnosis due to the identification of pathologic organisms that are unrelated to the patient’s illness, which, thereby, triggers unnecessary antibiotic therapy.

Patient Population
Included in the patient population were immunocompetent adults receiving ventilatory support in hospital or long-term care settings. Children, adolescents, and immunocompromised patients, including patients with AIDS, were excluded.

Providers
The target audience for the guidelines included pulmonary medicine specialists and other physicians, such as internists, surgeons, anesthesiologists, and infectious-disease specialists whose critical care patients require ventilatory assistance. Other providers who care for such patients, such as critical care nurses and respiratory technicians, also may find the guidelines useful.

Interventions
The panel limited its review to the following diagnostic areas: clinical features; chest radiography; culture or Gram’s stain; endotracheal aspiration specimens; antibody coating; elastin fiber assessment; bronchoscopic BAL specimens; protected-specimen brush (PSB) specimens; and blinded invasive diagnostic procedures. Other diagnostic areas were not covered. Although the effectiveness of treatment for VAP was beyond the scope of the project, it is critically important in evidence-based guidelines for VAP and cannot be ignored in evaluating the use of diagnostic tests. The indications for testing and effectiveness of test procedures are linked to whether the test information will influence treatment and patient outcomes, and if so, in what way. These matters and the investigation of treatment failures in VAP are discussed in this document.

Measures of Effectiveness
The panel could not address whether testing improves health outcomes or whether the benefits of testing outweigh risks, although some chapters discuss the potential risks of certain diagnostic procedures. Ultimately, the health outcomes of diagnostic testing must be addressed to formulate evidence-based guidelines on treating VAP.

Admissible Evidence
Evidence considered relevant to the review consisted of prospective or retrospective studies of diagnostic testing in immunocompetent adults with VAP, published after 1966 in English-language reports. To be admissible, studies had to provide data on sensitivity or specificity, or the raw data for calculating them, and a reference standard for how VAP was defined. Also included were studies of the epidemiology of VAP and the risk factors involved. Non-English-language publications were excluded.


    Review of Evidence
 
Pairs of panelists reviewed and summarized the evidence for specific topic areas according to the methods outlined in the following chapters. The Panel Chair edited the final draft. Examination of the evidence involved the three steps discussed below.

Literature Search
The MEDLINE database was searched for articles published from 1966 through 1995 by exploding the term "pneumonia" and the MeSH terms "cross infection/artificial respiration" or the text words "ventilator associated pneumonia." Citations in this set were cross-referenced with articles retrieved by exploding the text word "diagnosis," MeSH terms "sensitivity and specificity," and text words "BAL," "bronchoscopy," "protected brush catheters," "predictive value," and "likelihood ratio." Results of the computerized search were supplemented by examining personal files, other studies known to panel members, and reference lists of all primary studies and review articles retrieved in locating relevant studies.

Analysis of Individual Studies
The quality of individual studies was judged using specific criteria for evaluating internal and external validity. Criteria for judging internal validity included the following: sample size, selection bias, definition of interventions and outcomes, and confounding variables. Criteria for judging external validity related to how well the results could be generalized to patients and conditions outside the study settings. Several central principles in evaluating diagnostic test performance, outlined below in the section "Principles for Evaluating Diagnostic Test Performance," were especially important in judging study quality.

Grading systems for judging the quality of evidence typically identify randomized, controlled trials as the "gold standard," followed by controlled observational studies, descriptive epidemiology studies, and case reports. This paradigm is not useful in evaluating studies of test accuracy, because randomized, controlled trials are not necessarily the best setting for evaluating diagnostic test performance. Therefore, this report relies on narrative descriptions of study quality, rather than on rating schemes.

Synthesis of the Results
The evidence was summarized in narrative text and evidence tables. In addition to presenting the results of the studies, the tables compare the study designs according to the panel’s criteria for judging quality. Data on the sensitivity and specificity of tests were not pooled through meta-analysis to obtain an overall estimate of test performance. The significant variability in research methods, study populations, and definitions across studies made such a synthesis invalid.


    Development of Recommendations
 
Recommendations were developed in group discussions based on consideration of the evidence and, when direct evidence was lacking, on expert opinion. They were graded as follows:

A: Recommendation based on direct scientific evidence;

B: Recommendation based on scientific evidence, supplemented by expert opinion;

C: Recommendation based on expert opinion alone; and

D: There is no definitive evidence or consensus opinion.


    Outside Review
 
The recommendations were reviewed by the Health and Science Policy Committee of the American College of Chest Physicians and were referred for peer review by content experts.


    Principles for Evaluating Diagnostic Test Performance
 
Performance characteristics of diagnostic tests are typically measured in terms of validity and reproducibility (reliability). Validity is the extent to which a test measures what it intends to measure. Reproducibility is the consistency of results between measurements. A test that lacks reproducibility but is valid produces inconsistent results that, on average, are accurate. A test that is reproducible but invalid produces consistently incorrect results.
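The distinction can be made concrete with a small numerical sketch. The readings below are hypothetical repeated measurements of a quantity whose true value is assumed to be 100; they are not drawn from any study in this review, and the function names `bias` and `spread` are ours.

```python
from statistics import mean, pstdev

TRUE_VALUE = 100.0  # assumed true value (hypothetical)

# Hypothetical repeated measurements of the same sample by two tests.
valid_but_unreproducible = [80.0, 120.0, 90.0, 110.0, 100.0]
reproducible_but_invalid = [129.0, 130.0, 131.0, 130.0, 130.0]


def bias(readings, truth=TRUE_VALUE):
    """Average error of the readings; close to zero for a valid test."""
    return mean(readings) - truth


def spread(readings):
    """Population standard deviation; close to zero for a reproducible test."""
    return pstdev(readings)


for name, readings in [
    ("valid but not reproducible", valid_but_unreproducible),
    ("reproducible but invalid", reproducible_but_invalid),
]:
    print(f"{name}: bias = {bias(readings):.1f}, spread = {spread(readings):.1f}")
```

The first test scatters widely around the true value but is right on average; the second clusters tightly around a value that is consistently wrong.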


    Validity
 
Sensitivity and Specificity
The standard measures of validity are sensitivity and specificity. Sensitivity is the proportion of persons with the condition under consideration who have positive test results (the denominator is the number of tested persons who have the condition). Specificity is the proportion of persons without the condition who correctly have negative test results (the denominator is the number of tested persons who do not have the condition).

Two closely related measures are positive and negative predictive value. Positive predictive value (PPV) is the proportion of patients with an abnormal test result who have the condition (the denominator is the number of persons with positive test results). Negative predictive value is the proportion of patients with normal test results who do not have the condition (the denominator is the number of persons with negative test results). The formulas for these measures are presented in Table 1.
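As a minimal sketch of the four definitions above, the function below computes each measure from the cells of a two-by-two table. The function name `diagnostic_measures`, the argument names, and the counts in the usage example are illustrative assumptions, not data from the studies reviewed.

```python
def diagnostic_measures(tp, fp, fn, tn):
    """Compute the four measures from two-by-two counts.

    tp: condition present, test positive   fp: condition absent, test positive
    fn: condition present, test negative   tn: condition absent, test negative
    """
    return {
        "sensitivity": tp / (tp + fn),  # denominator: all with the condition
        "specificity": tn / (tn + fp),  # denominator: all without the condition
        "ppv": tp / (tp + fp),          # denominator: all positive results
        "npv": tn / (tn + fn),          # denominator: all negative results
    }


# Hypothetical counts for 1,000 tested patients, 100 of whom have the condition.
measures = diagnostic_measures(tp=90, fp=90, fn=10, tn=810)
for name, value in measures.items():
    print(f"{name}: {value:.1%}")
```

Sensitivity and specificity are computed down the columns of the table (by true disease status), whereas the predictive values are computed across the rows (by test result), which is why only the latter depend on how common the condition is.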



 
Table 1. Two-By-Two Table for Sensitivity, Specificity, and PPV*

 
Often there is an inverse relationship between sensitivity and specificity, in which an increase in one parameter is accompanied by a decrease in the other. This is especially true for tests that use threshold criteria. Setting a higher threshold increases specificity (ie, yields fewer false positives) but decreases sensitivity (ie, more cases are missed).

For example, Sutherland et al[12] found that the presence of any one of four criteria (fever, leukocytosis, asymmetric radiographic infiltrates, or purulent tracheal aspiration specimens) had a sensitivity of 100% in detecting VAP (ie, no cases were missed). However, because these criteria occur commonly in other diseases, specificity was only 4%. If two of these criteria were required to make the diagnosis, sensitivity decreased to 69% but specificity increased to 18%. If all four criteria were required, sensitivity was only 6% (because few patients with VAP have all four clinical features) but specificity was 96% (patients without VAP are unlikely to have all four features).
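The trade-off can be sketched with a toy data set. The criteria counts below are invented for illustration and do not reproduce the figures reported in the study cited above; `sens_spec` is a hypothetical helper that calls the test positive when at least a given number of clinical criteria are present.

```python
# Number of clinical criteria present (0 to 4) in each hypothetical patient.
with_vap = [1, 2, 2, 3, 3, 4, 1, 2, 3, 4]
without_vap = [0, 1, 1, 2, 0, 1, 3, 2, 1, 0]


def sens_spec(threshold):
    """Sensitivity and specificity when `threshold` or more criteria are required."""
    true_pos = sum(n >= threshold for n in with_vap)
    true_neg = sum(n < threshold for n in without_vap)
    return true_pos / len(with_vap), true_neg / len(without_vap)


results = {k: sens_spec(k) for k in range(1, 5)}
for k, (sens, spec) in results.items():
    print(f"{k} criteria required: sensitivity {sens:.0%}, specificity {spec:.0%}")
```

As the required number of criteria rises, sensitivity can only fall or stay the same and specificity can only rise or stay the same, because every patient who meets a stricter threshold also meets each looser one.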

The sensitivity and specificity rates reported in this review vary dramatically across studies, in part because of differences and imperfections in study design. Because sensitivity and specificity rates are fractions, underreporting in the numerator or denominator can distort true values. For example, sensitivity can be overestimated if the number of patients with the condition who have negative test results is underrepresented.

Consider a study of the sensitivity of chest radiography in detecting autopsy-confirmed pneumonia in 100 critical care patients. The chest radiographs detect pneumonia in 90 patients, giving a sensitivity of 90%. These patients, however, are necessarily drawn from the subset of critical care patients who died, whereas the question of interest is the sensitivity of the test in all critical care patients, not just those who die. Patients who survive are less likely to have pneumonia or abnormal findings on chest radiographs. Suppose that the 100 patients were drawn from a total of 500 critical care patients, and only 50 of the 400 patients who lived had chest radiograph results that were consistent with pneumonia. The true sensitivity of the test would be only 28% (140/500; Table 2).
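The arithmetic of this example, using only the figures given in the text, can be written out as follows. The labels "apparent" and "corrected" are ours.

```latex
\[
\text{apparent sensitivity} = \frac{90}{100} = 90\%,
\qquad
\text{corrected estimate} = \frac{90 + 50}{100 + 400} = \frac{140}{500} = 28\%.
\]
```

The numerator grows only from 90 to 140 while the denominator grows from 100 to 500, which is why restricting the study to patients who died inflates the estimate.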



 
Table 2. Illustration of Selection Bias in Determining Test Sensitivity

 
Numerator Error
The numerator for sensitivity and specificity calculations is subject to error if the criteria for an abnormal result are imprecise, subjective, or poorly standardized. "Purulent" tracheal specimens, "worsened" infiltrates, and other subjective descriptions do not represent hard end points and are defined differently within and across studies. Even objective criteria lack standardization across studies. Definitions for "positive" cultures include qualitative descriptors (presence or absence of bacteria), semiquantitative measures, and quantitative measures with highly variable thresholds for significant colony counts (eg, from 100 to 1 million cfu/mL). The unit of analysis in calculating the sensitivity and specificity of culture results may be patients or microorganisms, yielding results that are not comparable across studies.

Other factors that affect the validity of the numerator include the administration of antibiotic treatment to patients when cultures are taken, temporal separation in measurement (eg, using chest radiographs performed several days before death to measure correlations between radiographic and autopsy findings), and inadequate measurements (eg, not documenting the presence of squamous epithelial cells in BAL fluid or PSB samples to indicate the degree of upper airway contamination).

Sensitivity and specificity are best determined when results are classified in a binary fashion as positive or negative. In some studies reviewed in this report, investigators included a third category, "indeterminate results," thereby confounding sensitivity and specificity calculations. To achieve consistency in our review, we treated indeterminate results as negative and recalculated the sensitivity and specificity accordingly. Therefore, our calculations sometimes differ from those of authors who treated these as positive results or ignored them.
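The effect of this choice can be sketched as follows. The counts are hypothetical, and `recalc` is an illustrative helper rather than the panel's actual procedure; it shows how the same data yield different sensitivity and specificity depending on whether indeterminate results are counted as negative (the convention adopted in this review), counted as positive, or ignored.

```python
# Hypothetical test results among patients with and without the condition.
diseased = {"positive": 60, "negative": 20, "indeterminate": 20}
healthy = {"positive": 10, "negative": 70, "indeterminate": 20}


def recalc(policy):
    """Return (sensitivity, specificity) under one handling of indeterminates.

    policy: "negative" (convention used in this review), "positive", or "exclude".
    """
    tp, fn = diseased["positive"], diseased["negative"]
    fp, tn = healthy["positive"], healthy["negative"]
    if policy == "negative":
        fn += diseased["indeterminate"]
        tn += healthy["indeterminate"]
    elif policy == "positive":
        tp += diseased["indeterminate"]
        fp += healthy["indeterminate"]
    elif policy != "exclude":
        raise ValueError(f"unknown policy: {policy!r}")
    return tp / (tp + fn), tn / (tn + fp)


for policy in ("negative", "positive", "exclude"):
    sens, spec = recalc(policy)
    print(f"indeterminate as {policy}: sensitivity {sens:.1%}, specificity {spec:.1%}")
```

Counting indeterminate results as negative gives the lowest sensitivity and the highest specificity of the three conventions, which is one reason the recalculated values in this review can differ from those published by the original authors.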

Denominator Error: The Reference Standard
The validity of reported sensitivity and specificity rates is highly dependent on the quality of the reference standard, the test or criteria used to define the presence or absence of the target condition. This is an important problem in pneumonia. If the presence of pneumonia is defined on the basis of arguable criteria (eg, air bronchogram signs, high colony count on BAL or PSB culture, or clinical improvement with antibiotic treatment), 95% sensitivity of a diagnostic test may have little meaning. The same test might perform poorly if a more reliable reference standard, such as autopsy confirmation, were used. In this review, we assume that the presence of a radiologic infiltrate or purulent sputum captures all patients with VAP, but we recognize the limitation of this assumption.

The chapters that follow include examples of the limitations of such inferences. Studies of the sensitivity and specificity of clinical criteria for diagnosing VAP (eg, fever, leukocytosis, and purulent secretions) often rely on the presence of one or more of these findings as the reference standard. The fact that patients with autopsy-confirmed VAP often are not treated with antibiotics[37,38] underscores the limitations of relying on clinical criteria for defining the presence of disease. Similarly, an abnormal finding on a chest radiograph is an imperfect reference standard. For example, VAP can occur in patients with normal findings on chest radiographs, and findings suggestive of VAP are common in patients without the disease.

Blinding
The interpretation of test results and chest radiographs can be biased if observers know the clinical circumstances of the case or the diagnostic suspicions of the treating physician. In most studies of diagnostic accuracy based on endotracheal specimens or BAL, interpreters of the test results knew whether the reference standard was abnormal, which is a potential source of measurement bias. Better studies of test accuracy, therefore, include blinding, in which the evaluators are unaware of the patient’s identity and clinical history. Measurement validity also can be improved, especially for subjective parameters, by having independent observers perform multiple assessments. Unless stated otherwise, investigators in this review were not blinded: ie, those who interpreted diagnostic test results may have known whether the patient had or did not have VAP.

PPV
PPV, the proportion of those with a positive test result who have the disease of interest, is dependent on the prevalence (or pretest probability) of the disease in the population being tested. A test that has high PPV (ie, a low proportion of false positives) in settings where the disease is common may have a low PPV when used to test patients at low risk for the disease. The dramatic influence of pretest probability on PPV is illustrated in Table 3 .



 
Table 3. Comparison of PPV in Settings With Different Prevalences*

 
In this example, a test with a sensitivity and specificity of 90% has a PPV of 50% (one false positive for each case detected) when the prevalence of the disease is 10%. However, the PPV for the same test falls to 8% (11 false positives for each case detected) when the prevalence is 1%.
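The dependence of PPV on prevalence follows directly from the definitions. The sketch below, in which the function names `ppv` and `false_positives_per_case` are ours, applies the example's sensitivity and specificity of 90% to the two prevalences discussed.

```python
def ppv(sensitivity, specificity, prevalence):
    """Positive predictive value for a given prevalence (pretest probability)."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)


def false_positives_per_case(sensitivity, specificity, prevalence):
    """Expected number of false positives for each true case detected."""
    false_pos = (1 - specificity) * (1 - prevalence)
    return false_pos / (sensitivity * prevalence)


for prevalence in (0.10, 0.01):
    value = ppv(0.9, 0.9, prevalence)
    ratio = false_positives_per_case(0.9, 0.9, prevalence)
    print(f"prevalence {prevalence:.0%}: PPV {value:.1%}, "
          f"{ratio:.1f} false positives per case detected")
```

Because the false-positive term scales with the proportion of patients who do not have the disease, it dominates the denominator as prevalence falls, even though the test itself is unchanged.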

Thus, unlike sensitivity and specificity, PPV is not a constant value that applies to the test from place to place. It can only be extrapolated to populations with a similar prevalence or pretest probability. PPV values reported without prevalence data, therefore, have little meaning. The incidence of VAP in study populations ranges from 15 to 74%,[11,12,20,38–41] so reports of the accuracy of tests must be interpreted with caution. The potential for selection bias deserves special attention. Several studies of the diagnostic accuracy of chest radiography, for example, included in the denominator only cases of suspected VAP,[10] introducing a selection bias that would tend to exaggerate the PPV.

Likelihood Ratio
A useful tool for integrating this information is the likelihood ratio (LR), which is defined as the sensitivity divided by (1 − specificity). The LR for the example in Table 3 would be 0.9/0.1, or 9.0. This means that an abnormal result on this test would be nine times more likely in patients with pneumonia than in patients without pneumonia. Like sensitivity and specificity, the LR is independent of the prevalence of the disease. It is useful because it demonstrates the added predictive value of a test as thresholds change, as is shown in Table 4.[41a]
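A brief sketch of the calculation follows. The LR formula is the one defined above; the conversion from pretest to posttest probability through odds is the standard Bayesian step and is our addition, not something the text spells out. The function names are illustrative.

```python
def positive_lr(sensitivity, specificity):
    """Likelihood ratio for a positive result: sensitivity / (1 - specificity)."""
    return sensitivity / (1 - specificity)


def post_test_probability(pretest_probability, lr):
    """Update a pretest probability with a likelihood ratio via odds."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_odds = pretest_odds * lr
    return post_odds / (1 + post_odds)


lr = positive_lr(0.9, 0.9)
print(f"LR = {lr:.1f}")
for pretest in (0.10, 0.01):
    post = post_test_probability(pretest, lr)
    print(f"pretest probability {pretest:.0%} -> posttest probability {post:.1%}")
```

The LR itself does not change with prevalence, but the posttest probability it produces does, and for a positive result that posttest probability equals the PPV at the corresponding prevalence.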



 
Table 4. Illustration of LRs at Different Test Levels*

 

    Reproducibility
 
Reproducibility, or reliability, is the ability of a test to yield consistent results when repeated by the same observer (intrarater reliability) or by different observers (interrater reliability) at different times. A test with perfect reproducibility should give the same result when repeated on the same sample or patient. Sources of variation include random error and nonrandom sources of disagreement. There are reasons to question the reproducibility of VAP measures associated with wide intrarater and interrater variation, such as the detection of purulent secretions, the interpretation of abnormal radiographic findings (eg, atelectasis), and the interpretation of PSB culture samples.
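One common way to quantify interrater agreement, not used or named in this report, is Cohen's kappa, which corrects the observed agreement between two raters for the agreement expected by chance. The sketch below is a plain implementation for illustration; the ratings are hypothetical readings by two observers of the same ten samples (1 = "purulent", 0 = "not purulent").

```python
def cohen_kappa(rater_a, rater_b):
    """Cohen's kappa for two raters who classified the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("ratings must be non-empty and of equal length")
    n = len(rater_a)
    # Proportion of items on which the two raters agree.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Agreement expected by chance from each rater's marginal frequencies.
    categories = set(rater_a) | set(rater_b)
    expected = sum(
        (rater_a.count(c) / n) * (rater_b.count(c) / n) for c in categories
    )
    if expected == 1:
        raise ValueError("kappa is undefined when chance agreement is 100%")
    return (observed - expected) / (1 - expected)


# Hypothetical ratings of ten samples by two observers.
observer_1 = [1, 1, 0, 0, 1, 0, 1, 1, 0, 0]
observer_2 = [1, 0, 0, 0, 1, 0, 1, 1, 1, 0]
print(f"kappa = {cohen_kappa(observer_1, observer_2):.2f}")
```

A kappa of 1 indicates perfect agreement and a kappa of 0 indicates agreement no better than chance; the same calculation applied to two readings by one observer gives a measure of intrarater agreement.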


    Footnotes
 
Abbreviations: LR = likelihood ratio; PPV = positive predictive value; PSB = protected-specimen brush; VAP = ventilator-associated pneumonia



