(Chest. 2003;124:1543-1579.)
© 2003 American College of Chest Physicians

Home Diagnosis of Sleep Apnea: A Systematic Review of the Literature*

An Evidence Review Cosponsored by the American Academy of Sleep Medicine, the American College of Chest Physicians, and the American Thoracic Society

W. Ward Flemons, MD; Michael R. Littner, MD, FCCP; James A. Rowley, MD, FCCP; Peter Gay, MD, FCCP; W. McDowell Anderson, MD, FCCP; David W. Hudgel, MD, FCCP; R. Douglas McEvoy, MBBS, MD and Daniel I. Loube, MD, FCCP

* From the Faculty of Medicine (Dr. Flemons), University of Calgary, Calgary, AB, Canada; David Geffen School of Medicine (Dr. Littner), University of California Los Angeles, Los Angeles, CA; Division of Pulmonary, Critical Care, and Sleep Medicine (Dr. Rowley), Department of Medicine, Wayne State University School of Medicine, Detroit, MI; Pulmonary, Critical Care, and Sleep Medicine (Dr. Gay), Mayo Clinic, Rochester, MN; the University of South Florida College of Medicine (Dr. Anderson), Tampa, FL; Case Western Reserve University (Dr. Hudgel), Cleveland, OH; Adelaide Institute for Sleep Health (Dr. McEvoy), Repatriation General Hospital, Daw Park, Australia; and the Swedish Medical Center (Dr. Loube), Seattle, WA.

Correspondence to: W. Ward Flemons, MD, Faculty of Medicine, University of Calgary, 1403 Twenty-Ninth St NW, Calgary, AB, Canada T2N 2T9; e-mail: flemons@ucalgary.ca

Key Words: diagnosis • likelihood ratios • polysomnography/methods • research design • review literature • sensitivity and specificity • sleep apnea syndromes


1.0 INTRODUCTION/BACKGROUND
Sleep apnea is a common disorder that affects both children and adults. It is characterized by periods of breathing cessation (apnea) and periods of reduced breathing (hypopnea). Both types of events have similar pathophysiology and are generally considered to be equal with respect to their impact on patients.1 The most common form of sleep apnea, called obstructive sleep apnea, is caused by the partial or complete collapse of the upper airway. There are several methods of quantifying the severity of the disorder, such as measuring the number of apneas and hypopneas per hour of sleep (ie, the apnea-hypopnea index [AHI]), the severity of oxygen desaturation during sleep, or the severity of the most commonly associated symptom, daytime somnolence. The prevalence of an AHI of ≥ 5 was 24% in men and 9% in women aged 30 to 60 years in the Wisconsin Sleep Cohort Study.2 The prevalence of symptomatic sleep apnea (ie, AHI of ≥ 5 with excessive daytime somnolence) for men and women was 4% and 2%, respectively.2 The standard approach to diagnosis is in-laboratory, technician-attended polysomnography that monitors, at a minimum, sleep time and respiration. Polysomnography requires technical expertise, and is labor-intensive and time-consuming. Timely access is a problem for many patients, the majority of whom continue to have undiagnosed sleep apnea. In the Wisconsin Sleep Cohort Study,3 93% of women and 82% of men with moderate-to-severe sleep apnea had not received a diagnosis. Thus, there is growing interest in alternative approaches to diagnosis, such as portable monitoring, that have been proposed as substitutes for polysomnography in the diagnostic assessment of patients with suspected sleep apnea. The term portable monitoring encompasses a wide range of devices, from those that record as many signals as attended polysomnography to those that record only one signal, such as oximetry (see section 1.1).
When EEG and electromyogram (EMG) signals are recorded, sleep staging can be performed that provides a denominator for the AHI. More commonly, EEG and EMG signals are not recorded by portable monitors, in which case breathing events are usually quantified per hour of monitoring time as a respiratory disturbance index (RDI). The use of portable monitoring to assess patients suspected of having sleep apnea is controversial and has been the subject of previous reviews of the literature.4 5 6 7 8 Since the last review was completed, there have been additional research studies published and more standardized methods developed for rating the evidence of studies on diagnostic tests.
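The difference between the two indexes comes down to the denominator, which a small calculation makes concrete. The following is a minimal sketch and is not taken from any of the reviewed studies; the function names and the example night are hypothetical.

```python
def events_per_hour(n_events, minutes):
    """Return the number of breathing events per hour of the given time base."""
    if minutes <= 0:
        raise ValueError("time base must be positive")
    return n_events / (minutes / 60.0)


def ahi(apneas, hypopneas, sleep_minutes):
    """Apnea-hypopnea index: events per hour of sleep (needs sleep staging)."""
    return events_per_hour(apneas + hypopneas, sleep_minutes)


def rdi(apneas, hypopneas, monitoring_minutes):
    """Respiratory disturbance index: events per hour of monitoring time."""
    return events_per_hour(apneas + hypopneas, monitoring_minutes)


# Hypothetical night: 120 events, 480 min recorded, 360 min asleep.
print(ahi(40, 80, 360))  # 20.0
print(rdi(40, 80, 480))  # 15.0
```

When the same events are counted, monitoring time is at least as long as sleep time, so the RDI cannot exceed the AHI. This is one reason the two indexes are not interchangeable near a diagnostic cutoff.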

The American Thoracic Society (ATS), the American College of Chest Physicians (ACCP), and the American Academy of Sleep Medicine (AASM) individually planned to review and update the evidence on the diagnostic validity of portable monitors for diagnosing sleep apnea in adults. At a conference hosted by the ACCP in September 2000, all three organizations discussed an initial proposal to collaborate on this project. That discussion eventually led to a formal agreement to cosponsor a working group and to hire an evidence-based practice center to produce a detailed literature search and evidence review on the use of portable monitors for investigating patients with suspected sleep apnea. Two other organizations, the National Association for the Medical Direction of Respiratory Care and the Australasian Sleep Association, agreed to participate as liaison organizations and appointed members to the committee structure. Detailed conflict-of-interest guidelines were established that prevented anyone with a link to industries that made commercially available sleep apnea portable monitors from working on this project (details available on request). The ACCP accepted administrative responsibility for the working group. The following three committees were created with at least one representative from each sponsoring organization: (1) Steering Committee, Nancy Collop (Chair), Patrick Strollo, and John Shepard; (2) Evidence Review Committee (ERC), Ward Flemons (Chair), James Rowley, Michael Littner, William Anderson, David Hudgel, Dan Loube, Peter Gay, and Doug McEvoy; and (3) Guideline Committee, Andrew Chesson (Chair), Allan Pack, and Richard Berry.

Funding for this project, including an evidence review that was performed under contract by a team of evidence-based researchers at RTI International and the University of North Carolina at Chapel Hill (RTI-UNC), was provided completely and jointly by the ATS, the ACCP, and the AASM.

Three previous reviews of sleep apnea portable monitoring devices have been published. In 1994, the AASM (formerly the American Sleep Disorders Association) published a description of 23 studies4 that reported some features of portable monitoring. In this review, sleep studies were categorized into the following four types: type 1, standard polysomnography; type 2, comprehensive portable polysomnography; type 3, modified portable sleep apnea testing; and type 4, continuous single-bioparameter or dual-bioparameter recording (see section 1.1.1). In 1997, the AASM published practice parameters5 and a review6 for indications for polysomnography and related procedures that included a section on type 3 and type 4 studies. Based on the review, the practice parameters recommended that attended type 3 studies were potentially appropriate in patients with a high pretest probability (eg, > 70%) of sleep apnea. The parameters recommended that negative type 3 monitor studies in symptomatic patients be followed up with a full polysomnogram. The parameters did not recommend type 4 studies for the investigation of suspected sleep apnea. In 1997, the Agency for Healthcare Research and Quality (AHRQ) [formerly, the Agency for Health Care Policy and Research] in the United States commissioned a systematic review of the research on the diagnosis of sleep apnea.7 8 Part of that review focused on studies of portable monitors (25 studies), including oximetry (12 studies), and included articles published from 1980 to November 1, 1997.7 8 As part of this systematic review, the quality of each reviewed study was rated using a scale that the authors developed. This was a potentially helpful addition to the AASM reviews because it attempted to identify and account for biases that may undermine the validity of the findings and conclusions of a study.

Over the past decade, there has been increasing interest in developing methods to rate the quality of research studies, especially when a systematic review is undertaken. There has been more work published on the methods for rating the research evidence of therapeutics studies than the rating of diagnostic testing studies. The ACCP/ATS/AASM working group decided it was important to update the literature review from 1997 as well as to update the system used to rate the quality of the research evidence on portable monitoring. The method published by Sackett et al9 in 2000 for rating evidence of research on diagnostic tests was used because it closely aligns with the accepted methods used for rating the quality of articles on therapeutics and prognosis. In addition, it focuses on the following key aspects of design for studies of diagnostic tests: avoiding selection bias (by using a consecutively referred sample of patients); blinding of the interpreters; and avoidance of verification bias (by performance of the reference standard on all subjects).

An increasing amount of research has been published comparing some type of portable monitoring for sleep apnea with polysomnography. From 1990 to 2001, a total of 51 articles that met the preselected inclusion/exclusion criteria for this latest systematic review of portable monitoring for sleep apnea were published in the English-language literature. These articles were rated with respect to the level of evidence (ie, I, II, III, or IV) based, in part, on the approach published by Sackett et al9 (see section 1.5). The majority of studies (30 of 51 studies) were of higher quality (ie, levels I and II), but there is not yet a trend of this percentage increasing over time. Figure 1 shows the number of level I studies (best quality) and level II studies, as well as the total number of studies published on portable monitoring, over time.



Figure 1. Quality of published studies on portable monitoring for sleep apnea, 1990 to 2001.

 
The goal of a systematic review is to summarize a body of literature to aid in reaching conclusions about a particular practice in medicine. A common approach used to synthesize evidence is meta-analysis. This approach was used in the AHRQ-commissioned review, and the results were reported in the form of summary receiver operating characteristic (ROC) curves.8 The current working group decided against a meta-analysis of results because there was too much heterogeneity between studies with respect to types of signals measured (Tables 1, 1A, and 1B [references 10-60] and section 1.1.1), criteria used to define a breathing event (section 1.1.2), how signals from portable monitors were scored (section 1.1.3), and study quality (section 1.5). Therefore, the working group elected to summarize and report the details of each study to allow for conclusions to be drawn about the evidence without combining results across studies in a formal meta-analysis. Study data were synthesized into tables and were categorized as follows: (1) monitor type (section 1.1); (2) location of the study (unattended at home vs attended in the sleep laboratory) [section 1.4.1]; and (3) evidence level and quality rating (section 1.5).



 
Table 1. Breathing Event Definitions and Study Information*

 


 
Table 1A. Continued*

 


 
Table 1B. Continued*

 
Three primary and four secondary areas are addressed in this report. The primary areas are as follows:
  1. The utility of portable monitors in reducing the probability that a patient has an abnormal AHI (rule out the disorder) [section 4.1.1];
  2. The utility of portable monitors in increasing the probability that a patient has an abnormal AHI (rule in the disorder) [section 4.1.2];
  3. The utility of portable monitors in both reducing and increasing the probability that a patient has an abnormal AHI (rule out and rule in the disorder) [section 4.1.3].

The secondary areas are as follows:

  1. The reproducibility of portable monitor results [section 4.2.1];
  2. The cost benefit of portable monitors [section 4.2.2];
  3. The failure rates of portable monitors [section 4.2.3];
  4. The patient populations studied and the generalizability of findings [section 4.2.4].

Finally, it was the goal of this working group to outline the deficiencies in the current evidence on portable monitors for the investigation of patients with suspected sleep apnea, to describe opportunities for future research, and to highlight key methodological issues that should be addressed by future researchers, journal editors, reviewers, and readers of this literature.

1.1. Portable Monitoring
1.1.1 Types of monitors
Portable monitors were classified according to the approach used in the 1994 American Sleep Disorders Association review.4 Type 1 (standard polysomnography) was considered the reference standard to which the other monitor types were compared. The physiologic signals that were recorded and used to define a breathing event on a portable monitor varied among studies and across monitor types (Table 1 [references 10-60]). As detailed below, type 2 monitors incorporate sleep staging as well as respiratory measures, type 3 monitors use at least three respiratory channels, and type 4 monitors use at least one respiratory channel, usually either oxygen saturation or airflow.

1.1.1.1. Type 2: comprehensive portable polysomnography
These monitors incorporate a minimum of seven channels, including EEG, electrooculogram, chin EMG, ECG or heart rate, airflow, respiratory effort, and oxygen saturation. This type of monitor allows for sleep staging and therefore for the calculation of an AHI.

1.1.1.2. Type 3: modified portable sleep apnea testing
This type of monitor incorporates a minimum of four monitored channels, including ventilation or airflow (at least two channels of respiratory movement, or respiratory movement and airflow), heart rate or ECG, and oxygen saturation.

1.1.1.3. Type 4: continuous single or dual bioparameters
Most monitors of this type measured a single parameter or two parameters, for example, oxygen saturation or airflow. A monitor that did not meet the criteria for type 3 (ie, a monitor that measured one to three channels or did not include airflow despite having four channels) was classified as type 4.
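The channel-based classification described in sections 1.1.1.1 to 1.1.1.3 can be written as a simple decision rule. The following is only a sketch under stated assumptions: the channel labels are hypothetical names invented for illustration, and the rule follows the reading given in section 1.1.1.3 that a monitor without airflow is classified as type 4 even if it records four channels.

```python
def classify_monitor(channels):
    """Return the portable monitor type (2, 3, or 4) for a set of channel labels.

    The labels ("eeg", "eog", "chin_emg", "ecg", "heart_rate", "airflow",
    "effort", "spo2") are hypothetical names chosen for this sketch.
    """
    channels = set(channels)
    has_cardiac = bool(channels & {"ecg", "heart_rate"})

    # Type 2: at least seven channels, including those needed for sleep staging.
    type2_required = {"eeg", "eog", "chin_emg", "airflow", "effort", "spo2"}
    if type2_required <= channels and has_cardiac and len(channels) >= 7:
        return 2

    # Type 3: at least four channels, including airflow (assumed mandatory
    # here), respiratory effort, heart rate or ECG, and oxygen saturation.
    type3_required = {"airflow", "effort", "spo2"}
    if type3_required <= channels and has_cardiac and len(channels) >= 4:
        return 3

    # Everything else (one to three channels, or no airflow) is type 4.
    return 4


print(classify_monitor({"eeg", "eog", "chin_emg", "ecg", "airflow", "effort", "spo2"}))  # 2
print(classify_monitor({"airflow", "effort", "heart_rate", "spo2"}))  # 3
print(classify_monitor({"effort", "effort_2", "heart_rate", "spo2"}))  # 4
print(classify_monitor({"spo2"}))  # 4
```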

1.1.2. Signals used for detecting events
A challenge for the working group was to identify similarities and differences in the ways that different monitors recorded signals and how those signals were used to define a breathing event. As with polysomnography, there was heterogeneity with respect to defining an abnormal breathing event on a portable monitor (Table 1). The most common methods of detecting breathing events were a reduction in airflow, measured by a thermistor or by a nasal pressure signal, and oxygen desaturation (several different approaches). In some circumstances, these methods were combined.

1.1.2.1. Flow
A reduction in airflow or tidal volume is the standard method for defining an apnea or hypopnea. The recommended criterion for defining a hypopnea is a reduction to < 50% of baseline in a valid measurement.1 The best method for quantifying flow is a pneumotachograph. However, no portable monitors use this technology.

1.1.2.1.1. Thermistor
Thermistors sense differences in temperature, and their signal does not have a linear relationship with true airflow. Therefore, they may not be sensitive for detecting hypopneas. For these reasons, it has been recommended that thermistors not be used in polysomnography for clinical research purposes.1 However, they are capable of sensing airflow through both the nose and mouth, and they remain the most common method for defining breathing events based on a flow measurement (Table 1).

1.1.2.1.2. Nasal pressure
Nasal pressure provides a linear approximation of airflow across its complete range except at extremes. The linear relationship can be improved with a square root transformation of the signal. However, this may not be necessary if the primary use of the measure is event detection. Nasal pressure may not be as accurate as a thermistor in distinguishing an apnea from a hypopnea; however, in routine clinical use this distinction is not thought to be important.1 The signal could produce false-positive events if the patient was intermittently mouth breathing, or it could be of poor quality if the patient was mouth breathing for long periods of time. This may require visual confirmation of apparent apneas and hypopneas, making nasal pressure potentially difficult to use in an unattended study.11 12 13 14 16 17
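The square root transformation mentioned above can be illustrated in a few lines. This is a minimal sketch, assuming a baseline-corrected pressure signal in arbitrary units in which positive and negative values correspond to the two directions of flow; the function name is hypothetical and no particular monitor is implied.

```python
import math


def linearize_nasal_pressure(pressure):
    """Apply a signed square root transform to a nasal pressure signal.

    The sign of each sample is kept so that the direction of flow is preserved.
    """
    return [math.copysign(math.sqrt(abs(p)), p) for p in pressure]


# Hypothetical samples in arbitrary units.
print(linearize_nasal_pressure([4.0, 1.0, 0.0, -1.0, -4.0]))
# [2.0, 1.0, 0.0, -1.0, -2.0]

# A pressure amplitude that falls to 25% of baseline corresponds, after the
# transform, to an estimated flow amplitude of 50% of baseline.
print(math.sqrt(0.25))  # 0.5
```

One consequence, under this assumption, is that a fixed percentage threshold applied to the untransformed signal is not the same as that threshold applied to estimated flow.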

1.1.2.2. Respiratory inductance plethysmography
Respiratory inductance plethysmography, when properly calibrated, can provide a measure of tidal volume. Uncalibrated, it can still be useful to detect breathing disturbances. It is used primarily during polysomnography. It was used as a secondary signal in only one study on portable monitoring (Table 1).

1.1.2.3. Oxygen saturation
Oximeters differ from other devices in important ways, particularly in the sampling frequencies and algorithms used to record oxygen saturation. Some oximeters take multiple readings, store them in memory, average them, and report a value every 3 to 12 s. Others sample and report each value at a frequency of up to 10 Hz.33 A sampling rate of 1/12 Hz has been shown in one study to provide oxygen desaturation rates with a low number of artifacts.38 Methods of automated analysis of the oxygen saturation signal are also variable. Most methods rely on the detection of a drop in oxygen saturation, some detect resaturation, while others use both criteria (Table 1). Some automated analyses define what baseline oxygen saturation is, but most do not. Some studies have measured the percentage of cumulative time that a patient has an oxygen saturation of < 90% to determine whether it identifies patients with sleep apnea. Other studies have derived a "delta index" that quantifies the variability in oxygen saturation over an entire study. These last two methods do not identify specific events but instead identify patients who are likely to be experiencing apneas or hypopneas.
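To show why the averaging behavior of an oximeter matters, the sketch below computes the percentage of recording time spent below 90% saturation from the same hypothetical trace before and after block averaging. The 1-Hz sampling rate, the 12-sample averaging window, and the saturation values are assumptions made for illustration; they do not describe any specific device or study.

```python
def average_in_windows(samples, window):
    """Average consecutive, non-overlapping blocks of `window` samples."""
    if window <= 0:
        raise ValueError("window must be positive")
    blocks = [samples[i:i + window] for i in range(0, len(samples), window)]
    return [sum(block) / len(block) for block in blocks]


def percent_time_below(samples, threshold=90.0):
    """Percentage of samples with oxygen saturation below `threshold`."""
    if not samples:
        raise ValueError("no samples")
    below = sum(1 for s in samples if s < threshold)
    return 100.0 * below / len(samples)


# Hypothetical 24-s trace at 1 Hz with a brief 3-s dip to 86%.
raw = [96.0] * 9 + [86.0] * 3 + [96.0] * 12
averaged = average_in_windows(raw, window=12)  # one value every 12 s

print(percent_time_below(raw))       # 12.5
print(averaged)                      # [93.5, 96.0]
print(percent_time_below(averaged))  # 0.0
```

In this constructed example the brief dip disappears after averaging, which is one way in which two oximeters could report different desaturation measures for the same patient.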

Oximetry analysis that is designed to detect transient drops in oxygen saturation should be more sensitive in situations in which the baseline oxygen saturation is lower because of the shape and thresholds of the oxyhemoglobin desaturation curve. Thus, patients who are studied at altitude or patients with underlying lung disease (eg, COPD) may show more desaturations, which could improve the sensitivity of a monitor but would likely adversely affect its specificity. Two studies have been published that evaluated COPD patients.45 50 However, those studies did not determine how the presence of COPD affected the sensitivity and specificity of the portable monitor.

1.1.2.4. Other
One report57 used snoring as a primary method for event detection and combined it with oxygen desaturation as a second required criterion for event detection. Other studies43 44 have used snoring in conjunction with heart rate variability as criteria for event detection. Spectral analysis of heart rate was used in one study,58 and a single study reported the use of pharyngoesophageal pressure measurement59 as a method for detecting breathing events (Table 1).

1.1.3. Methods for scoring events
Studies differed in the physiologic channels monitored, the criteria used to define events, and the methods used to score events (Table 1). The majority of studies of monitors in which flow was measured by thermistor used manual scoring, while most studies of monitors in which flow was measured by nasal pressure used automated scoring. Some monitors provide automated scoring with either a computer or printed output that also allows for manual checking or editing. Some authors were explicit about how events were scored (ie, automated, automated with manual scoring, or only manual scoring) and, when there was a component of manual scoring, by whom it was performed. However, most studies were not explicit about this. Automated scoring has the advantage that it eliminates a source of variability in results, the human recognition of events. However, polysomnograms that are used as the reference standard for defining patients with and without sleep apnea are manually scored. Therefore, an argument could be made that the automated scoring of a portable monitor is not comparable. In addition, polysomnography scoring can include an arousal from sleep as a secondary criterion (Table 1). Portable monitoring scoring, when done manually, is frequently performed in a compressed time frame of 2 to ≥ 10 min. Several studies have used automated scoring with manual editing and have reported results as various combinations of these different approaches to scoring (Table 1). Variability (interrater and intrarater) in manual scoring has not been reported (section 4.2.1). Some users may be concerned with automated scoring systems that are "black boxes," that is, systems that fail to identify the scored events on a record of appropriate resolution so that a technician or a clinician can review them, assess and edit the scoring and artifacts, and assess the quality of the study. Details about the ability of specific monitors to display breathing events for a technician or clinician to review were not always reported in the studies that used automated scoring.

1.2. Study Location and Attendance
Portable monitors can be used in a variety of settings, including a hospital, a sleep laboratory, or the patient’s home. Portable monitors can be attended by a technician or left unattended. The role of the technician is to determine whether the portable monitor is functioning properly, to provide guidance to the patient (such as encouraging patients to sleep on their backs), and to ensure safety in case there is an untoward event. With few exceptions, the research studies performed in the sleep laboratory were attended and performed simultaneously with the polysomnogram, while those performed in the home were unattended. One small study25 performed in a sleep laboratory examined the performance of a portable type 2 monitor, attended and unattended. A second study48 performed in a hospital was partially attended, but not simultaneously with polysomnography. One portable monitor study29 performed at home had the opportunity for technicians to observe the study remotely and to intervene by calling the patient if there were technical problems.

1.3. Measuring Agreement
A detailed discussion of the methods for measuring agreement can be found in an accompanying publication (see page 1535). Several methods exist for evaluating the extent of agreement between two methods designed to measure the same phenomenon, including the Pearson correlation coefficient, the intraclass correlation coefficient, the approach of Bland and Altman of mean differences and limits of agreement, and sensitivity/specificity/likelihood ratios (LRs). Although the Pearson correlation coefficient is widely used, it is not recommended because it is a measure of association, not agreement.61 Intraclass correlation coefficients can be used to assess agreement;62 however, this approach is not familiar to most clinicians and is not commonly used. The approach of Bland and Altman of calculating mean differences between two measurements is preferable to the Pearson correlation coefficient; however, the limits of agreement, which are the key descriptor of how well the measures agree, can also be misleading if not calculated properly. Sensitivity, specificity, and LRs have the advantage that they are in common use and are easier to understand. They address the more fundamental question of the proper classification of patients rather than how closely two methods agree.
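The contrast between association and agreement can be shown numerically. The sketch below uses the commonly taught form of the Bland and Altman calculation (mean difference with limits of agreement at 1.96 SDs of the differences) alongside a Pearson correlation coefficient. The data are hypothetical: a portable monitor that always reports exactly half of the polysomnographic AHI.

```python
import math
import statistics


def bland_altman(test, reference, z=1.96):
    """Return (mean difference, lower limit, upper limit) of agreement."""
    if len(test) != len(reference) or len(test) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    diffs = [t - r for t, r in zip(test, reference)]
    bias = statistics.mean(diffs)
    sd = statistics.stdev(diffs)  # sample standard deviation of differences
    return bias, bias - z * sd, bias + z * sd


def pearson_r(x, y):
    """Pearson correlation coefficient, written out for transparency."""
    mx, my = statistics.mean(x), statistics.mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical indexes (events per hour) for four patients.
psg_ahi = [10.0, 20.0, 30.0, 40.0]
portable_rdi = [5.0, 10.0, 15.0, 20.0]

bias, lower, upper = bland_altman(portable_rdi, psg_ahi)
print(round(pearson_r(portable_rdi, psg_ahi), 3))  # 1.0
print(bias)                                        # -12.5
print(round(lower, 2), round(upper, 2))
```

The correlation is perfect even though the portable monitor underestimates the reference by 12.5 events per hour on average, which is the sense in which correlation measures association rather than agreement.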

Using sensitivity/specificity/LRs demands that a patient be classified as having or not having the disorder based on an arbitrary cutoff for the AHI that is variable across studies. There is a wide spectrum of the severity of breathing events at night, and the AHI captures only a single dimension. Since a substantial number of patients have indexes around the usual cutoff point, it is possible that a patient’s classification might change due to expected variability in the measure (section 4.2.1). In addition, there are legitimate questions as to whether the AHI, which is derived from sleep laboratory-based polysomnography, is the correct reference standard. It is the reference standard that is most commonly used and the metric of sleep apnea severity for which there is the most published data relating to morbidity (eg, neurocognitive dysfunction, hypertension, and quality of life). For these reasons, it formed the basis for the systematic review that has been conducted by the ATS/AASM/ACCP working group.

The analysis of results using sensitivity, specificity, and LRs should take into account the precision of the estimates (ie, the calculation of confidence intervals), which is a direct reflection of sample size and study design. Studies rated as level IV evidence and those with small patient numbers (and wide confidence intervals) should be interpreted with caution (see Tables 3 and 4). Sensitivity, specificity, and LRs are descriptors of the operating characteristics of a test (ie, the degree to which the probability of disease is changed by a positive or a negative result). However, since a clinician needs to know the actual probability that the patient does or does not have a disorder (ie, the posttest probability or predictive value), the operating characteristics of a test have to be interpreted with the knowledge of the pretest probability (or prevalence) of the disorder (Table 2). The utility of a diagnostic test as a substitute for polysomnography in patients with suspected sleep apnea can be viewed as the percentage of patients who have either a positive or negative test result and the percentage of those who have a false-positive or false-negative result, respectively (see Tables 3, 4, and 5). Since the number of true-positive results is governed by the sensitivity and the number of false-positive results is governed by the specificity, both dictate the utility of a test, in addition to the pretest probability.
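The relationships among sensitivity, specificity, LRs, pretest probability, and posttest probability can be worked through with a short calculation. The following is an illustrative sketch only: the 2 x 2 counts are hypothetical, the function names are invented for this example, and the Wilson score interval is one of several accepted ways to compute a confidence interval for a proportion.

```python
import math


def diagnostic_summary(tp, fp, fn, tn):
    """Sensitivity, specificity, and likelihood ratios from a 2 x 2 table."""
    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("need patients both with and without the disorder")
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    # Guard against division by zero at the extremes.
    lr_pos = sensitivity / (1 - specificity) if specificity < 1 else math.inf
    lr_neg = (1 - sensitivity) / specificity if specificity > 0 else math.inf
    return {"sensitivity": sensitivity, "specificity": specificity,
            "lr_pos": lr_pos, "lr_neg": lr_neg}


def posttest_probability(pretest, lr):
    """Convert a pretest probability to a posttest probability via odds."""
    if not 0 < pretest < 1:
        raise ValueError("pretest probability must be between 0 and 1")
    if math.isinf(lr):
        return 1.0
    posttest_odds = pretest / (1 - pretest) * lr
    return posttest_odds / (1 + posttest_odds)


def wilson_interval(successes, n, z=1.96):
    """Wilson score confidence interval for a proportion."""
    p = successes / n
    denom = 1 + z ** 2 / n
    center = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return center - half, center + half


# Hypothetical study: 100 patients with and 100 without an abnormal AHI.
s = diagnostic_summary(tp=90, fp=20, fn=10, tn=80)
print(round(s["lr_pos"], 2), round(s["lr_neg"], 3))       # 4.5 0.125
print(round(posttest_probability(0.5, s["lr_pos"]), 3))   # 0.818
print(round(posttest_probability(0.5, s["lr_neg"]), 3))   # 0.111

# The same sensitivity (90%) is far less precise with 10 patients than with 100.
print(wilson_interval(9, 10))
print(wilson_interval(90, 100))
```

With a pretest probability of 50%, a positive result in this hypothetical example raises the probability of disease to about 82%, and a negative result lowers it to about 11%. The same LRs applied to a different pretest probability would give different posttest probabilities.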



 
Table 3. Best-Reported Sensitivity and Calculated Negative LRs*

 


 
Table 4. Best-Reported Specificity and Calculated Positive LRs*

 


 
Table 2. Number of Studies Published on Use of Portable Monitors for Diagnosing Sleep Apnea

 


 
Table 5. Studies With Both High (> 5.0) and Low (< 0.2) LRs*

 
1.4. Validating Portable Monitors
Several approaches can be used to validate portable monitoring. The standard approach, and indeed what has been done to date, has been to compare portable monitoring with a reference standard, as described in the previous section. The limitation of this approach is that it assumes that sleep laboratory-based polysomnography is the optimal approach for diagnosing sleep apnea. However, this is not completely true for several reasons. From a technical perspective, patients frequently do not sleep as well in a laboratory as they do at home, and they likely spend more time on average sleeping supine. From a pragmatic perspective, the AHI correlates poorly with outcomes that are important to patients, such as quality of life and daytime sleepiness, and does not predict very well those patients who ultimately will use and thereby benefit from therapy. Therefore, a more appropriate validation study would compare the impact of portable monitoring and polysomnography on a physician’s decision-making ability and outcomes important to patients. To date, there have been no studies published that have used this approach to validate the use of a portable monitor.

There are several aspects of study design and methods that, if not carefully controlled, can threaten the validity of findings and conclusions. In this review, we assigned an evidence level and quality rating for each study based on how well its design controlled possible bias (section 2.3). Other aspects of study design that affect the interpretation of findings are reviewed below.

1.4.1. Attended/nonattended monitors
The evaluation of a portable monitor in an attended setting (most often in a sleep laboratory) allows an assessment of its performance under ideal circumstances, eliminating important sources of possible differences that have nothing to do with the portable monitor, such as night-to-night variability. Simultaneous assessment with polysomnography answers the important question of whether the monitor can work. If it is not also tested in an unattended setting, preferably the patient’s home, the question of whether it works in the setting for which it was intended remains unanswered. When the data from a monitor used at home are compared with those from polysomnography performed in a laboratory, the limitations of polysomnography as a reference standard must be kept in mind.

1.4.2. Study methodology
1.4.2.1. Describing the study population
A sufficient description of the population of patients who were studied is essential to assist readers in deciding whether the results are generalizable to their own patient population. Ideally, a broad spectrum of patients (eg, with respect to disease severity, age, race, and sex) is used without the investigators participating in the selection of patients. This latter point helps to avoid selection bias. If investigators study a group of patients that they have participated in selecting (eg, patients referred to a sleep laboratory that the investigators refer patients to), their findings on prevalence, sensitivity, specificity, LRs, and limits of agreement could be affected and cannot be generalized to any other population of patients. There should be a clear description of who refers patients to the sleep center, the volume of referrals, as well as a description of the type of sleep center (eg, community-based, university setting, or Veterans Affairs hospital). If some patients are not going to be included in the study, preselected exclusion criteria should be used, with justification for those criteria. Reports should state the number of patients who were referred during the time of recruitment, the number who were eligible for participation, the number who actually entered the study (with a listing of the number of patients and the reasons why patients did not participate), the number who completed the study (with the numbers of and reasons for study drop-outs), and, finally, the percentage of cases that had uninterpretable data.

1.4.2.2. Describing portable monitor and polysomnography methods
There should be a clear description of the type of equipment used to record the signals that were used by the portable monitor and the polysomnogram. The definitions of a breathing event on polysomnography and on the portable monitor should be detailed enough to allow someone else to replicate the methods. If indexes such as the RDI are derived from scoring events, there also needs to be a clear description of these definitions. A statement about how events were scored (eg, automated scoring, manual scoring, epoch length, monitor type, or automated scoring with a manual review that allows for editing of results) needs to be included. Use of reference to a previous study for methods is not acceptable when the methods are part of the evaluation of the portable monitor.

1.4.2.3. Repeatability
There are many sources of variability that can limit the generalizability of a study’s results. One of the most important is variability in the human recognition of events. If a monitor has an automated analysis algorithm that does not allow for manual editing, then this is not an issue. However, if there is some element of manual scoring, then the agreement between two scorers and the agreement of a single scorer repeating a review of previous scoring should be checked and reported. This intermeasurement and intrameasurement repeatability is most appropriately reported as a κ coefficient. Another source of variability that is important to study and report is night-to-night changes. This can be analyzed and reported using the approach of Bland and Altman61 or an intraclass correlation coefficient.62 Pearson product correlation coefficients often are used to report repeatability, but, as with reporting agreement between two different methods, this is not a recommended approach.
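
As an illustration of the κ coefficient mentioned above, the following is a minimal sketch of Cohen's κ for two scorers who have each labeled the same epochs. The function name, the label values, and the sample data are hypothetical and are not taken from any study in this review.

```python
from collections import Counter


def cohen_kappa(scorer_a, scorer_b):
    """Cohen's kappa for two equal-length lists of categorical labels."""
    if len(scorer_a) != len(scorer_b) or not scorer_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(scorer_a)
    # Observed proportion of epochs on which the two scorers agree.
    observed = sum(a == b for a, b in zip(scorer_a, scorer_b)) / n
    # Agreement expected by chance, from each scorer's label frequencies.
    freq_a = Counter(scorer_a)
    freq_b = Counter(scorer_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        raise ValueError("kappa is undefined when chance agreement equals 1")
    return (observed - expected) / (1.0 - expected)


# Hypothetical epoch-by-epoch labels from two scorers (illustrative only).
scorer_1 = ["event", "event", "none", "none", "event",
            "none", "none", "none", "event", "none"]
scorer_2 = ["event", "none", "none", "none", "event",
            "none", "event", "none", "event", "none"]
kappa = cohen_kappa(scorer_1, scorer_2)
print(f"kappa = {kappa:.2f}")
```

The same function can be applied to one scorer's first and repeat readings to describe intrameasurement repeatability.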

1.4.2.4. Avoiding bias
Several aspects of study design will dictate whether the results are more likely or less likely to be valid. There is evidence that studies of diagnostic tests with flawed designs tend to overestimate the accuracy of the test.63 Using an appropriate series of consecutive patients controls for selection bias. Verification bias occurs when the results of one study determine whether the second study will be performed, and is avoided by ensuring that the reference standard and the diagnostic test are completed on all eligible patients. Equally important is having the diagnostic test and the reference standard interpreted and scored separately in a manner that is blinded to the results of the other test.

The post hoc analysis of results allows the investigators to optimize the apparent utility of the test. Performing multiple analyses reduces the reliability of statistical tests since each time an analysis is performed there is a probability by chance alone that the result will be positive (ie, the use of multiple analyses increases the probability of a spurious result). While there are statistical approaches to adjust for this, preselecting thresholds for a positive/negative monitoring test result prior to the study is recommended. Ideally, these thresholds have been defined in an initial study and confirmed in an independent, prospective study. Another approach is to develop thresholds using part of a patient population and to validate them on the remaining patients. Very few of the studies have adopted these approaches, and few have attempted to adjust for multiple analyses. Only two studies30 35 validated the thresholds they used to estimate the probability of sleep apnea.
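
The effect of multiple analyses can be made concrete with a short sketch. It assumes that the analyses are independent and that each is tested at the same significance level, which is a simplification; the Bonferroni correction shown is one common adjustment and is named here only as an example, not as the approach used by any study in this review.

```python
def prob_any_spurious_result(alpha, n_analyses):
    """Probability of at least one chance-positive result among
    n_analyses independent tests, each run at significance level alpha."""
    return 1.0 - (1.0 - alpha) ** n_analyses


def bonferroni_alpha(alpha, n_analyses):
    """Per-analysis significance level under a Bonferroni correction."""
    return alpha / n_analyses


for k in (1, 5, 10, 20):
    print(k, round(prob_any_spurious_result(0.05, k), 3))
```

With ten post hoc thresholds each tested at 0.05, the chance of at least one spurious positive finding is roughly 40% under these assumptions.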

1.4.2.5. Reporting of results
The results of studies on diagnostic test accuracy should provide the number of patients who had both tests and results that clearly establish the prevalence of the disease in question, as well as the number of true-positive results, false-positive results, true-negative results, and false-negative results that will allow the calculation of sensitivity, specificity, LRs, as well as positive and negative predictive values. If different thresholds are used for the reference standard to define the presence/absence of disease, the authors should explicitly state the effect this had on prevalence as well as the operating characteristics of the diagnostic test. If multiple thresholds for a positive/negative diagnostic test result are reported, the effect of varying the threshold should be reported with a ROC curve and/or the calculation of an LR for each threshold. Finally, as with all statistical estimates, the 95% confidence intervals for the estimate (ie, sensitivity, specificity, and LRs) should be reported. Small patient numbers yield imprecise estimates, which are reflected by very wide confidence intervals.
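
The quantities listed above all follow from the four cells of a 2 x 2 table. The sketch below shows the arithmetic, together with a Wilson score interval as one of several accepted ways to obtain a 95% confidence interval for a proportion. The function names and the counts are hypothetical.

```python
import math


def diagnostic_accuracy(tp, fp, tn, fn):
    """Operating characteristics from true/false positive/negative counts."""
    if min(tp + fn, tn + fp, tp + fp, tn + fn) == 0:
        raise ValueError("each row and column of the 2 x 2 table must be non-empty")
    n = tp + fp + tn + fn
    sensitivity = tp / (tp + fn)
    specificity = tn / (tn + fp)
    return {
        "n": n,
        "prevalence": (tp + fn) / n,
        "sensitivity": sensitivity,
        "specificity": specificity,
        # LRs are infinite or zero at the extremes; report inf rather than divide by zero.
        "lr_positive": sensitivity / (1 - specificity) if specificity < 1 else math.inf,
        "lr_negative": (1 - sensitivity) / specificity if specificity > 0 else math.inf,
        "ppv": tp / (tp + fp),
        "npv": tn / (tn + fn),
    }


def wilson_interval(successes, total, z=1.96):
    """Wilson score confidence interval for a proportion (z=1.96 gives about 95%)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = successes / total
    denom = 1 + z * z / total
    center = (p + z * z / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z * z / (4 * total * total)) / denom
    return center - half, center + half


# Hypothetical study of 100 patients (illustrative counts only).
result = diagnostic_accuracy(tp=45, fp=9, tn=36, fn=10)
sens_low, sens_high = wilson_interval(45, 45 + 10)
print(result)
print(f"sensitivity 95% CI: {sens_low:.2f} to {sens_high:.2f}")
```

Running `wilson_interval` with the same proportion but a smaller denominator shows directly how small patient numbers widen the interval.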

1.5. Rating Levels of Evidence
The AHRQ review7 8 on the diagnosis of sleep apnea that was released in 1999 was the first systematic review to evaluate the quality of published research in this field. The methods for rating the level of evidence of studies published on diagnostic tests have not been widely used, and the authors of the AHRQ report established their own approach. They assigned points if the study met predefined criteria for quality (a total of 44 points for 18 criteria). They decided to exclude the 20% of articles with the lowest quality scores from further analysis.

Other publications have addressed the question of how to assess research studies on diagnostic testing. The Journal of the American Medical Association has published numerous user guides to help clinicians recognize high-quality clinical research. The guides for understanding diagnostic testing list several important criteria for judging the validity of the results of a study.64 65 The primary guides are as follows: (1) there was an independent, blind comparison with a reference standard; and (2) the patient sample included an appropriate spectrum of patients to whom the test was applied. The secondary guides include the following: (1) the results of the test being evaluated did not influence the decision to perform the reference standard (verification bias); and (2) the methods for performing the test were described in sufficient detail to permit replication. Although the articles in the Journal of the American Medical Association are useful guides for clinicians, they do not provide a methodology or scoring system for rating studies. The approach published by Sackett et al9 used a small but essential set of study design features for rating research on diagnostic tests: (1) an independent blind comparison with a reference standard; (2) an appropriate spectrum of consecutively referred patients (ie, avoidance of selection bias); and (3) the use of a reference standard applied to all study patients. These criteria have been organized into levels of evidence (I through V), with level I evidence considered to be the best.9

The ATS/AASM/ACCP working group elected not to follow the methodology of the AHRQ review on sleep apnea for rating the quality of research evidence because of several concerns. First, the rationale for assigning some criteria a large number of points (eg, randomized controlled trial design, 10 points; study test readers blind to clinical status, 5 points) and others a small number of points (eg, verification bias [results of the study test do not determine who gets a polysomnogram], 1 point; patients included with a wide spectrum of sleep apnea severity, 1 point) was not clear. Second, they included a criterion for study quality (ie, randomized control design) that is not listed as a criterion in either the series in the Journal of the American Medical Association or the article by Sackett et al,9 and they provided a score that was 10 times greater than other important quality criteria such as avoidance of verification bias.7 There is an important distinction between a randomized controlled trial and the random assignment of subjects to the order of having either the portable monitoring test or polysomnography first. The latter may be important if testing is likely to have an order effect. For example, if there is a first-night effect for polysomnography that is different from that of portable monitoring testing, this could influence the results of the comparison. The ATS/AASM/ACCP working group adapted the method proposed by Sackett et al9 to rate the level of evidence of the articles included in this systematic review (see section 2.3).

Subsequent to the ATS/AASM/ACCP working group completing its evaluation of the literature on portable monitoring for sleep apnea, the AHRQ issued a report entitled "Systems to Rate the Strength of Scientific Evidence."66 The report outlines the following five key domains and elements for systems to rate the quality of individual articles on diagnostic test studies: (1) study population; (2) adequate description of the test; (3) appropriate reference standard; (4) blinded comparison of the test and the reference; and (5) avoidance of verification bias. We assume that the first domain (study population) refers to an appropriate spectrum of consecutively referred patients (ie, avoidance of selection bias). Therefore, the criteria published by Sackett et al9 include four of these elements (elements 1, 3, 4, and 5). To be part of this current systematic review on portable monitoring for investigating sleep apnea, the results had to be compared to polysomnography. It was not possible to rate the different methods/definitions used for performing polysomnography. Each article was considered equal in this regard. The working group assessed whether there was an adequate description of the methods used to record the signals used by the portable monitor, to define an event (including definitions of events), and whether the method used to score the event was properly described (section 2.3.2). Thus, these two quality criteria are reflected in the quality rating score. Of 30 studies that were rated as having evidence level I or II, only 1 study failed to meet both criteria and had a quality rating of "d" (Table 2). Seven other articles failed to meet one of these criteria, one of which had a quality rating of d, while three each were rated a or b. We do not believe that incorporating the domain of "adequate description of the test" into the evidence level would have affected the results or conclusions of this report.

In addition to defining important domains for systems to rate the quality of evidence for randomized clinical trials, observation studies and diagnostic test studies, the AHRQ report7 8 also has listed the important domains for systematic reviews and for systems for grading the strength of a body of evidence. For systematic reviews, the AHRQ report recommends that the following 11 domains be addressed:

  1. Study question;
  2. Search strategy;
  3. Inclusion and exclusion criteria;
  4. Interventions;
  5. Outcomes;
  6. Data extraction;
  7. Study quality and validity;
  8. Data synthesis and analysis;
  9. Results;
  10. Discussion;
  11. Funding.

In this systematic review, we have outlined the study questions (sections 1.0, 4.1, and 4.2) and the search strategy, including inclusion and exclusion criteria (section 2.1). We reviewed portable monitoring for sleep apnea that is performed in a sleep laboratory (ie, attended setting, usually performed simultaneously with polysomnography) and/or at home (ie, unattended setting). The outcomes reviewed are diagnostic accuracy (ie, sensitivity, specificity, and LRs), the results of testing on a population of patients suspected of having sleep apnea (ie, the percentage of patients with a result that is either positive or negative and the percentage of those who are misclassified by the portable monitor), as well as cost, failure rate, and repeatability. A description of data extraction (section 2.2), and a detailed process for evaluating study quality and validity (section 2.3) have been included. Data from 51 studies have been synthesized, and the results have been presented in a series of tables. Although a formal meta-analysis was not performed, the summaries provided in these tables make it possible to determine how much evidence there is regarding a particular question, how "good" the evidence is, and whether there was a consistent finding among higher quality studies. Recommendations for improving the quality of research on diagnostic methods for sleep apnea are discussed in section 5.0.


    2.0. MATERIALS AND METHODS
2.1. Literature Review
The ATS/ACCP/AASM working group contracted with the RTI-UNC evidence practice center to conduct a systematic review of the literature and to abstract data in a standard fashion from relevant studies that allowed summaries of their findings to be generated by the ERC. The RTI-UNC team followed the recommended methods for conducting systematic reviews, which emphasized comprehensive literature search and evaluation, and used standardized procedures for the review (and its documentation) of selected articles.

A systematic review of the literature on the diagnosis of sleep apnea was completed in 1997 by the AHRQ.7 Our literature search focused on articles published since 1997. The initial search was completed on June 26, 2001. The bibliographies from two American Sleep Disorders Association reviews4 6 also were searched for relevant articles. Several search strategies were used, focusing on screening, diagnosis, and costs. The search strategy used the headings "Screening" (including the terms "Reproducibility of Results," or "Predictive Value of Tests," or "Sensitivity and Specificity") and "Diagnosis" for finding citations involving the terms "Sleep Apnea Syndromes," "Sleep Apnea (Obstructive)," "Oximetry," "Polysomnography," "Monitoring Physiologic," "Airway Resistance," "Upper Airway Resistance Syndrome," "Respiratory Disturbance Index," "Autoset," "Snoring," or "Respiratory Event-Related Arousals." The term "Home Care Services" also was used to identify citations. For the heading "Screening," the MEDLINE search identified 157 citations, and for "Diagnosis" the MEDLINE search identified 180 citations. The use of the terms "Home Care Services" and "Polysomnography" identified 14 additional citations.

For costs, the MeSH heading "Costs and Cost Analysis" was exploded to include the terms "Cost Benefit Analysis," "Cost Allocation," "Cost Control," "Cost Savings," "Cost Sharing," "Cost of Illness," "Health Care Costs," and "Health Expenditures." The MEDLINE search was conducted from 1997 to the present and identified 35 citations.

The inclusion criteria were as follows:

• Male/female patients, ages >= 18 years, with ANY diagnosis of obstructive sleep apnea;
• Study published in English, no race or gender restrictions;
• Portable device used for diagnosis;
• Polysomnography or other acceptable objective test used for the diagnosis of sleep apnea;
• After completion of the study, each analysis group was >= 10 subjects.

The exclusion criteria were as follows:

• Studies in children;
• Studies in languages besides English;
• Reviews, meta-analyses, case reports, abstracts, letters, and editorials.

The titles of retrieved articles were reviewed, and the abstract of any article the title of which mentioned diagnosis of sleep apnea was reviewed for relevance to this review. If there was ambiguity about the study meeting the inclusion/exclusion criteria, the full article was reviewed. The reference lists of articles included in this review were scanned to determine other possible articles that should be included.

2.2. Evidence Tables
RTI-UNC worked closely with the ERC to identify the key questions, to develop an abstract review form, to identify the key extraction elements, and then to develop a data extraction elements form (see "Appendix"; available online at http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1). Two evidence practice center reviewers then abstracted complete data independently from each study. The reviewers then compared their results for each element on the data extraction form for each study, and in situations in which there was disagreement a consensus was reached among the reviewers. The final data abstraction forms then were completed by the evidence practice center and were sent to the members of the ERC, who decided on how the evidence would be summarized. The ERC elected to have the search updated to include articles up to December 31, 2001; that identified two additional articles, which members of the ERC abstracted.

2.3. Rating the Quality of Research Articles
The assessment of the study quality was performed by rating 10 separate features of each article that allowed categorization of the evidence level of an article as level I, II, III, or IV (based on three of these items), and then by using a further rating of study quality (a, b, c, or d) that was based on the remaining items. Each of the 10 items was independently rated by two RTI-UNC reviewers and by two members of the ERC. A final evidence level and quality rating was determined by consensus of the ERC, based on preselected standard definitions.

2.3.1. Evidence level (I, II, III, and IV)
The ERC relied on the presence or absence of three key indicators of quality that dictated the assignment of evidence level based on an approach published by Sackett et al.9 The definitions of these evidence levels are listed below as follows:

I, blinded comparison, consecutive patients, reference standard performed on all patients;
II, blinded comparison, nonconsecutive patients, reference standard performed on all patients;
III, blinded comparison, consecutive patients, reference standard not performed on all patients;
IV, reference standard was not applied blindly or independently.

The definitions of the three indicators used to assign level of evidence were as follows:

Blinded comparison: the portable monitor and polysomnogram were scored separately and without knowledge of the results of the other investigation. If the investigators failed to mention whether or not the scorers were blinded, this criterion was deemed not to have been met.

Consecutive patients: the investigators did not participate in deciding what patients were included in the study. This criterion was met if patients were referred to a sleep clinic rather than a sleep laboratory (unless the investigators explicitly stated that they did not participate in selecting the patients referred to the laboratory).

Reference standard was performed on all patients: all patients entered into the study must have undergone both a portable monitor test and a polysomnogram. If the results of one test influenced the decision to perform the other, then this criterion was deemed not to have been met.
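
The assignment of evidence level from the three indicators can be sketched as a small lookup. The function name is hypothetical. Note that the list of levels above does not define a level for a blinded study with nonconsecutive patients in which the reference standard was not performed on all patients; the sketch returns None for that combination rather than guessing.

```python
def evidence_level(blinded, consecutive, reference_on_all):
    """Map the three quality indicators to evidence level I, II, III, or IV.

    Returns None for the one combination that the definitions do not cover.
    """
    if not blinded:
        # Level IV: reference standard not applied blindly or independently.
        return "IV"
    if consecutive and reference_on_all:
        return "I"
    if not consecutive and reference_on_all:
        return "II"
    if consecutive and not reference_on_all:
        return "III"
    # Blinded, nonconsecutive, reference standard incomplete: not defined above.
    return None


print(evidence_level(blinded=True, consecutive=True, reference_on_all=True))
print(evidence_level(blinded=False, consecutive=True, reference_on_all=True))
```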

2.3.2. Quality rating (a, b, c, d)
Seven other aspects of a study’s methodology were scored, and a quality rating was assigned based on the number of indicators for which the study met the criteria. Although the random assignment of testing was an important indicator, it was not applicable to studies that had studied a portable monitor simultaneously with polysomnography. Thus, in some circumstances studies were rated on six indicators rather than seven. The quality indicator (a to d) was based on the number of indicators for which that study did not meet the criteria, as follows:

a, zero or one quality indicators not met;
b, two quality indicators not met;
c, three quality indicators not met;
d, four or more quality indicators not met.

The seven indicators and their definitions are listed below as follows:

  1. Prospective recruitment of patients: the portable monitoring test and the polysomnogram were performed as patients were recruited into the study rather than reviewing a series of patients who had previously been studied.
  2. Random order of testing: patients were assigned to undergo portable monitoring testing or polysomnography first at random rather than at the discretion of the investigators. If the portable monitoring study was performed simultaneously with the polysomnogram, this indicator was not rated.
  3. Low data loss (< 10%): there were < 10% of patients whose results could not be compared because of the loss of polysomnography or portable monitoring data.
  4. High percentage completed (> 90%): of the patients who were initially enrolled into the study (not counting a priori exclusions), > 90% completed the study protocol.
  5. Polysomnography methodology/definitions fully described: the polysomnography methods must include the following:
    a. characterization of the equipment used;
    b. definitions and criteria of all types of breathing events scored and used in comparisons.

  6. Portable monitor methodology/definitions fully described: the portable monitor methods must include the following:
    a. characterization of the equipment used;
    b. definitions and criteria of all types of breathing events scored and used in comparisons.

  7. Portable monitor scoring fully described: includes a clear statement of whether manual or automated scoring was used, and, if automated, whether there was manual review/revision done.
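
The quality rating rule can be expressed in a few lines. In this sketch, each indicator is recorded as True (met), False (not met), or None (not rated, as for random order of testing in simultaneous studies); the indicator keys and the function name are hypothetical.

```python
def quality_rating(indicators):
    """Return quality rating a, b, c, or d from a dict of indicator results.

    Values are True (met), False (not met), or None (not rated).
    Only indicators that were rated and not met count against the study.
    """
    not_met = sum(1 for value in indicators.values() if value is False)
    if not_met <= 1:
        return "a"
    if not_met == 2:
        return "b"
    if not_met == 3:
        return "c"
    return "d"


# Hypothetical simultaneous (in-laboratory) study: random order is not rated.
example = {
    "prospective_recruitment": True,
    "random_order": None,
    "low_data_loss": False,
    "high_percentage_completed": True,
    "psg_methods_described": True,
    "monitor_methods_described": False,
    "monitor_scoring_described": True,
}
print(quality_rating(example))
```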

2.4. Approach to Summarizing the Evidence on Portable Monitors
When evaluating the diagnostic accuracy of portable monitors, almost all studies chose to report results as sensitivity and specificity. Many studies examined multiple thresholds for defining a positive result that gave combinations of sensitivity and specificity. When trying to address the issue of whether a portable monitor could reduce the probability that a patient had sleep apnea (section 4.1.1), articles were examined for their best-reported sensitivity, since this should provide the lowest number of false-negative results and the lowest LRs. In circumstances in which various combinations of sensitivity and specificity were reported and two sensitivities were close in value, the "best-reported sensitivity" was taken as the value with the higher corresponding specificity. If the authors had reported different definitions for sleep apnea, the working group selected an AHI definition of <= 15, when it was reported, because it was thought that most clinicians would want to know the probability that a patient had an AHI below this level, since theoretically they would then be making a decision on whether or not to offer a trial of therapy. Conversely, when the working group addressed the issue of whether a portable monitor could increase the probability of sleep apnea, the research studies were examined for their best-reported specificity, since this should provide the highest LR and the lowest number of false-positive results. In circumstances in which various combinations of specificity and sensitivity were reported, and two specificities were close in value, the "best-reported specificity" was taken as the value with the higher corresponding sensitivity.

When the best-reported sensitivity and the best-reported specificity used different thresholds (ie, different points on a ROC curve), then some patients would meet one or the other criteria, but some would meet neither and therefore would have an indeterminate result. It is important to examine the percentage of patients who meet the criteria for a negative result (ie, best-reported sensitivity) and the percentage who meet the criteria for a positive result (ie, best-reported specificity), which are reported for each study in Tables 3 and 4, respectively. It is also important to determine the percentage of patients meeting the criteria who had a false result (ie, were misclassified by the diagnostic test). This information is reported in Tables 3 and 4, and will be affected by the prevalence as well as the operating characteristics of the test (ie, sensitivity, specificity, and LRs). The number of studies, summarized by their level of LRs, monitor type, study location, and evidence level for the best-reported sensitivity and the best-reported specificity, are presented in Tables 6 and 7, respectively (available online at http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1). A similar summary for articles with both high and low LRs is presented in Table 8 (available online at http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1).

Ideally, a portable monitor test would have a single cutoff that has both a high sensitivity and a high specificity, so that patients are either negative or positive and there is no "gray zone." To address the question of whether some monitors came close to being an "ideal test" that had thresholds that minimized the number of patients in the gray zone and the misclassification rate, studies were examined that had both a high and low calculated LR (results are reported in Table 5 and are summarized in section 4.1.3). As with Tables 3 and 4, Table 5 highlights the percentage of patients with a false result and the percentage of patients who did not meet the criteria for a positive or negative result (ie, those in the gray zone). If a study had a single threshold for best-reported sensitivity and specificity, then the percentage of patients without a negative or positive result is 0.
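
The relationship between two thresholds, the gray zone, and the misclassification rate can be illustrated with a short sketch. It assumes that a portable monitor RDI below a lower cutoff counts as a negative result and an RDI at or above an upper cutoff counts as a positive result; the function name, the cutoffs, and the patient data are hypothetical.

```python
def summarize_two_threshold_test(records, rule_out_below, rule_in_at_or_above):
    """Summarize results for a test with separate negative and positive cutoffs.

    records: list of (portable_rdi, has_apnea_on_reference) pairs.
    """
    if not records:
        raise ValueError("records must not be empty")
    if rule_out_below > rule_in_at_or_above:
        raise ValueError("the negative cutoff must not exceed the positive cutoff")
    negatives = [has_apnea for rdi, has_apnea in records if rdi < rule_out_below]
    positives = [has_apnea for rdi, has_apnea in records if rdi >= rule_in_at_or_above]
    n = len(records)
    false_negatives = sum(1 for has_apnea in negatives if has_apnea)
    false_positives = sum(1 for has_apnea in positives if not has_apnea)

    def pct(part, whole):
        return 100.0 * part / whole if whole else 0.0

    return {
        "pct_negative": pct(len(negatives), n),
        "pct_positive": pct(len(positives), n),
        "pct_gray_zone": pct(n - len(negatives) - len(positives), n),
        # Misclassification among patients who received each kind of result.
        "pct_negative_results_false": pct(false_negatives, len(negatives)),
        "pct_positive_results_false": pct(false_positives, len(positives)),
    }


# Hypothetical patients: (portable monitor RDI, apnea on polysomnography).
patients = [(2, False), (3, False), (4, True), (6, False), (8, True),
            (12, False), (14, True), (18, True), (25, True), (40, True)]
print(summarize_two_threshold_test(patients, rule_out_below=5, rule_in_at_or_above=15))
print(summarize_two_threshold_test(patients, rule_out_below=10, rule_in_at_or_above=10))
```

In the second call the two cutoffs coincide, so the gray zone is empty, at the cost of more false results on both sides of the single threshold in this made-up sample.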

The 95% confidence intervals are reported in Tables 3 and 4 for the best-reported sensitivity and best-reported specificity, respectively, as well as for the corresponding LRs. The 95% confidence intervals were calculated from the reported sensitivity, specificity, prevalence, and number of patients according to the method of Simel et al.67 In some studies, the prevalence was not reported but was estimated from figures.
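
The following is a minimal sketch of how confidence intervals for LRs can be derived from a reported sensitivity, specificity, prevalence, and number of patients. It uses the standard log-scale approximation for the ratio of two proportions, which is the approach generally attributed to Simel et al; whether it matches the calculation performed for Tables 3 and 4 in every detail is an assumption. The function name and the input values are hypothetical.

```python
import math


def lr_confidence_intervals(sensitivity, specificity, prevalence, n, z=1.96):
    """Point estimates and approximate CIs for the positive and negative LRs."""
    if not (0 < sensitivity < 1 and 0 < specificity < 1 and 0 < prevalence < 1):
        raise ValueError("sensitivity, specificity, and prevalence must lie strictly between 0 and 1")
    n_disease = prevalence * n          # patients with sleep apnea on the reference test
    n_no_disease = n - n_disease        # patients without sleep apnea
    lr_pos = sensitivity / (1 - specificity)
    lr_neg = (1 - sensitivity) / specificity
    # Standard errors of ln(LR), equivalent to sqrt(1/a - 1/(a+c) + 1/b - 1/(b+d)).
    se_pos = math.sqrt((1 - sensitivity) / (n_disease * sensitivity)
                       + specificity / (n_no_disease * (1 - specificity)))
    se_neg = math.sqrt(sensitivity / (n_disease * (1 - sensitivity))
                       + (1 - specificity) / (n_no_disease * specificity))
    return {
        "lr_positive": (lr_pos, lr_pos * math.exp(-z * se_pos), lr_pos * math.exp(z * se_pos)),
        "lr_negative": (lr_neg, lr_neg * math.exp(-z * se_neg), lr_neg * math.exp(z * se_neg)),
    }


# Hypothetical study: sensitivity 0.90, specificity 0.80, prevalence 0.55, 100 patients.
for name, (estimate, low, high) in lr_confidence_intervals(0.90, 0.80, 0.55, 100).items():
    print(f"{name}: {estimate:.2f} (95% CI, {low:.2f} to {high:.2f})")
```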

When assessing the evidence on portable monitoring, it is important to keep in mind the following points to avoid misinterpreting the data:

• How low does the best-reported sensitivity (ie, low LR) reduce the probability of sleep apnea?
• How high does the best-reported specificity (ie, high LR) increase the probability of sleep apnea?
• Can the portable monitor both substantially reduce the probability that a patient has sleep apnea (if the test result is negative) and increase the probability that a patient has sleep apnea (if the test result is positive)?
• What percentage of patients in the study actually met the thresholds for a positive or negative test result?
• What percentage of patients who met the thresholds were misclassified (ie, had a false result)?
• How likely is it that the results of a study are valid (ie, evidence level and quality rating)?
• How precise are the estimates (ie, width of the confidence intervals) for sensitivity, specificity, and LRs?
• Were the RDI thresholds used to define a positive and negative result preselected or determined by post hoc (retrospective) analysis?
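
The first two questions turn on how an LR converts a pretest probability into a posttest probability, which is done by way of odds. The sketch below shows that conversion. The pretest probability of 0.55 is illustrative (roughly the average prevalence among the studies reviewed), and the LR values of 0.1 and 10 are hypothetical.

```python
def post_test_probability(pre_test_probability, likelihood_ratio):
    """Convert a pretest probability to a posttest probability using an LR."""
    if not 0 < pre_test_probability < 1:
        raise ValueError("pretest probability must lie strictly between 0 and 1")
    if likelihood_ratio < 0:
        raise ValueError("likelihood ratio must not be negative")
    pre_odds = pre_test_probability / (1 - pre_test_probability)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)


pre_test = 0.55  # illustrative pretest probability
print(f"negative result, LR 0.1: {post_test_probability(pre_test, 0.1):.2f}")
print(f"positive result, LR 10: {post_test_probability(pre_test, 10):.2f}")
```

Because the conversion depends on the pretest probability, the same LR yields different posttest probabilities in populations with a lower prevalence of sleep apnea.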

Using different RDI thresholds for a positive or negative result, and using different thresholds to define best-reported sensitivity and specificity, can be difficult to understand. A detailed discussion and example of the effect of varying RDI thresholds, and thus the best-reported sensitivity/specificity, on the number of nondiagnostic and false-positive/false-negative results can be found in an accompanying paper (see page 1535).


    3.0. LITERATURE SEARCH RESULTS
3.1. Number of Articles Reviewed/Rejected
The initial literature search resulted in 59 original research articles being identified that met the inclusion criteria. Of these, 46 articles were selected for review by the evidence practice center.10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 29 30 31 32 33 34 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 Thirteen articles were not included68 69 70 71 72 73 74 75 76 77 78 79 80 for the following reasons: they were reports of older monitors for which more recent research had been published (one article) or were known to no longer be commercially available (one article); they evaluated technology that was not portable (eg, the static charge-sensitive bed) [five articles]; they studied technology that was not widely used or available (four articles); they did not use technology that was involved with monitoring a physiologic signal (one article); or patients had been tested following a surgical intervention (one article).

The search was first extended to October 31, 2001, which identified an additional three articles,28 58 60 and then to December 31, 2001, which yielded an additional two articles35 59 for a total of 51 articles included in this review.

3.2. Type of Monitor/Home- vs Sleep Laboratory-Based Studies/Evidence Rating
There were four studies describing five groups of patients published on type 2 monitors (comprehensive polysomnography) [Table 2]. Three of the four studies were rated as having level IV evidence, and one study was rated as having level II evidence.

There were 12 published studies on type 3 monitors, describing 14 groups of patients. There were five assessments of type 3 monitors at home and nine assessments in the sleep laboratory. Overall, the studies were of higher quality (at home, three level II studies; in the sleep laboratory, three level I studies and five level II studies) than the studies of type 2 monitors (Table 2).

The majority of the published studies were on type 4 monitors (35 of the 51 studies), with 29 reports of patients studied in the sleep laboratory and 9 reports of patients studied at home. Approximately 50% of the studies on type 4 monitors had level I or II evidence (Table 2).

3.3. Sensors to Detect Breathing
The categorization of portable monitors according to type may not be completely relevant. It is clear that type 2 monitors are different from other monitors because of their ability to measure EEG and EMG signals. The distinction between type 3 and type 4 monitors is less clear. Regardless of monitor type, usually only one signal is used to define a breathing event; occasionally a second signal is used, and rarely a third channel is used. However, all type 3 monitors have the option of using more channels, and many incorporate a body-position sensor. Many type 4 monitors use only one channel. It may be more important to distinguish between the types of signals that are monitored (eg, flow measured by thermistor, flow measured by nasal pressure, or oxygen saturation) and how they are used to detect breathing disturbances. Table 1 groups studies based on the primary signal that was used to define breathing disturbances during sleep. It also indicates how events were defined by polysomnography, differences in oxygen saturation sampling frequency (when oxygen saturation was used to define an event), as well as the type of portable monitor scoring that was used (ie, automated, manual, or a combination).

The most common signal used in portable monitors was airflow measured by a thermistor. Ten of 15 studies that used this as the primary scoring channel were rated as having level I or II evidence. Only two of these used the same criteria for defining a breathing event (Table 1). Flow measured by nasal pressure was used in eight studies (four were rated as having level II evidence). All four of the level I/II studies used the same portable monitor and the same criterion for defining a breathing event. An oxygen saturation signal was used in 22 studies as the primary signal to define a breathing event, 13 of which were rated as having level I or II evidence. In these 13 studies, there were 11 different criteria used for scoring an event. Automatic scoring was used more often in studies whose primary signal was flow measured by nasal pressure or oxygen saturation than in studies that used flow measured by thermistor, which used manual scoring or a combination.

3.4. Bland and Altman Analysis
Some studies reported how well the data from portable monitoring agreed with those of polysomnography using mean differences between the two methods and the limits of agreement (Table 9; available online at http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1). A total of 24 studies reported this analysis (type 2 monitors, 2 studies; type 3 monitors, 6 studies; type 4 monitors, 16 studies). A few authors reported confidence intervals rather than limits of agreement (Table 9). The limits of agreement tended to be quite wide, suggesting that the two methods did not agree particularly well. However, the limits were wider at higher levels of RDI, where obtaining exactly the same value matters less than it does at lower levels of RDI. The limits of agreement can be adjusted by using a logarithmic transformation of the differences, but investigators rarely did this. Thus, it is challenging to interpret these data and to make a recommendation about the utility of portable monitoring based on this analysis alone.
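The quantities described above (the mean difference and the limits of agreement) can be illustrated with a minimal sketch. The paired values below are hypothetical and are not taken from any reviewed study; the function name `bland_altman` and the use of 1.96 sample standard deviations for the limits are assumptions reflecting the conventional form of the analysis.

```python
from statistics import mean, stdev

def bland_altman(portable, psg):
    """Return (bias, lower_limit, upper_limit) for paired measurements.

    bias is the mean of (portable - psg); the limits of agreement are
    taken as bias +/- 1.96 sample standard deviations of the differences.
    """
    if len(portable) != len(psg) or len(portable) < 2:
        raise ValueError("need two equal-length series with at least 2 pairs")
    diffs = [p - q for p, q in zip(portable, psg)]
    bias = mean(diffs)
    sd = stdev(diffs)  # sample standard deviation (n - 1 denominator)
    return bias, bias - 1.96 * sd, bias + 1.96 * sd

# Hypothetical paired results (events per hour), for illustration only.
portable_rdi = [4.0, 12.0, 18.0, 35.0, 60.0]
psg_ahi = [6.0, 10.0, 22.0, 30.0, 72.0]

bias, lower, upper = bland_altman(portable_rdi, psg_ahi)
print(f"bias={bias:.1f}, limits of agreement=({lower:.1f}, {upper:.1f})")
```

In this made-up data set the largest disagreement occurs at the highest index value, which is the pattern that widens the limits of agreement.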


4.0. DISCUSSION OF THE EVIDENCE ON PORTABLE MONITORS
4.1. Primary Questions
There are many causes of variability in results between portable monitors and polysomnography. A simultaneous comparison of portable monitoring with polysomnography (sleep laboratory-attended) controls for a number of conditions that nonsimultaneous studies do not and, as such, provides an estimate of what might ideally be expected during home-unattended portable monitoring. The comparison of portable monitoring performed in the home-unattended setting with sleep laboratory polysomnography may not capture differences that favor one environment (ie, home or laboratory) over the other.

Conclusions regarding the utility of portable monitors are most applicable to the population of patients, and the methods that the portable monitors used for detecting events, that are the focus of this report. As detailed in the following sections, the prevalence of sleep apnea was high, averaging about 55%. Patients were predominantly male and generally were selected for studies by practitioners with expertise in evaluating patients with sleep apnea. The methods for scoring respiratory events on polysomnography varied from study to study.

A number of studies examined multiple RDI thresholds for determining an abnormal AHI. These optimal thresholds were determined using post hoc analyses and, therefore, may not necessarily be reproducible. LRs calculated from the reported sensitivity and specificity data from many studies had wide 95% confidence intervals, indicating a lack of precision in these estimates. A meta-analysis using summary ROC curves would have allowed for pooled results and narrower 95% confidence intervals. However, the ERC thought that there was not enough similarity between studies to warrant using this approach.
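As an illustration of why LRs derived from small studies carry wide confidence intervals, the following sketch computes positive and negative LRs from a 2x2 table together with approximate 95% confidence intervals. The interval uses a common log-based approximation, which is an assumption here and not necessarily the method used by the ERC; the counts are hypothetical.

```python
import math

def likelihood_ratios(tp, fp, fn, tn, z=1.96):
    """Return {'LR+': (estimate, low, high), 'LR-': (estimate, low, high)}.

    Uses the approximate standard error of ln(LR). The approximation is
    undefined when any cell is zero (for example, a sensitivity of 100%
    gives an LR- of 0 with no interval under this method).
    """
    if min(tp, fp, fn, tn) <= 0:
        raise ValueError("all four cells must be positive for this method")
    diseased = tp + fn
    healthy = fp + tn
    sens = tp / diseased
    spec = tn / healthy
    lr_pos = sens / (1 - spec)
    lr_neg = (1 - sens) / spec
    se_pos = math.sqrt(1 / tp - 1 / diseased + 1 / fp - 1 / healthy)
    se_neg = math.sqrt(1 / fn - 1 / diseased + 1 / tn - 1 / healthy)

    def interval(lr, se):
        return (lr, lr * math.exp(-z * se), lr * math.exp(z * se))

    return {"LR+": interval(lr_pos, se_pos), "LR-": interval(lr_neg, se_neg)}

# Hypothetical study: 50 patients with and 50 without an abnormal AHI.
result = likelihood_ratios(tp=45, fp=10, fn=5, tn=40)
lr_neg, lr_neg_low, lr_neg_high = result["LR-"]
print(f"LR- = {lr_neg:.3f} (95% CI {lr_neg_low:.3f} to {lr_neg_high:.3f})")
```

Because the standard error depends on the reciprocal of each cell count, a study with only a handful of false-negative results produces a wide interval around the negative LR.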

Using time in bed as the denominator for portable monitors generally led to a slightly higher AHI by polysomnography than RDI by type 3 monitors, with inconsistent effects for type 4 monitors (Table 10; available online at http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1). In general, the effect of the higher AHI is to reduce sensitivity and increase specificity.
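The denominator effect can be shown with simple arithmetic. This sketch assumes that both methods detect the same number of events and that polysomnography divides by total sleep time while the portable monitor divides by time in bed; the numbers are hypothetical.

```python
def events_per_hour(event_count, hours):
    """Return an hourly event index (AHI or RDI) for a given denominator."""
    if hours <= 0:
        raise ValueError("hours must be positive")
    return event_count / hours

# Hypothetical night: identical event counts, different denominators.
events = 120
total_sleep_time_h = 6.0   # assumed polysomnography denominator
time_in_bed_h = 8.0        # assumed portable monitor denominator

ahi_psg = events_per_hour(events, total_sleep_time_h)
rdi_portable = events_per_hour(events, time_in_bed_h)
print(ahi_psg, rdi_portable)  # 20.0 15.0
```

With a fixed diagnostic threshold, the lower portable index means that some patients just above the threshold on polysomnography fall below it on the portable monitor.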

4.1.1. Evidence that portable monitors can be used to reduce the probability that a patient has an abnormal AHI
If a portable monitor is going to be used to exclude a diagnosis of sleep apnea, the monitor needs to have a high sensitivity/low LR for a negative result. The percentage of patients who will have a negative result is determined by (1) the pretest probability or prevalence of the disease (a characteristic of the population being studied) and (2) the sensitivity/specificity or negative LR (the operating characteristics) of the portable monitor being used. The research conducted to date has been in sleep clinic populations that have a very high prevalence of disease. The result is that, even with a portable monitor that has very good operating characteristics, only a small percentage of patients will be classified as negative, and those with a negative result have a higher chance of being classified incorrectly (ie, a false-negative result). Details about the studies covered in this section can be found in Tables 3 (page 1557) and 6 (http://www.chestjournal.org/cgi/content/full/124/4/1543/DC1).
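The dependence of negative results on prevalence can be sketched numerically. The function below is illustrative only: the sensitivity and specificity values are hypothetical, the 55% prevalence echoes the average reported in this review, and the 20% comparison prevalence is an assumption chosen to represent a lower-risk population.

```python
def negative_result_summary(prevalence, sensitivity, specificity):
    """Return (fraction_negative, false_negative_share, post_test_probability).

    fraction_negative: proportion of all patients with a negative test.
    false_negative_share: proportion of negative tests that are false.
    post_test_probability: probability of disease after a negative test,
    computed from the pretest odds and the negative LR.
    """
    if not 0 < prevalence < 1:
        raise ValueError("prevalence must be strictly between 0 and 1")
    if not 0 < specificity <= 1 or not 0 <= sensitivity <= 1:
        raise ValueError("sensitivity/specificity out of range")
    false_neg = prevalence * (1 - sensitivity)
    true_neg = (1 - prevalence) * specificity
    fraction_negative = false_neg + true_neg
    false_negative_share = false_neg / fraction_negative

    lr_neg = (1 - sensitivity) / specificity
    pretest_odds = prevalence / (1 - prevalence)
    posttest_odds = pretest_odds * lr_neg
    post_test_probability = posttest_odds / (1 + posttest_odds)
    return fraction_negative, false_negative_share, post_test_probability

# Same hypothetical monitor in a high- and a lower-prevalence population.
high = negative_result_summary(prevalence=0.55, sensitivity=0.90, specificity=0.80)
low = negative_result_summary(prevalence=0.20, sensitivity=0.90, specificity=0.80)
print(f"prevalence 55%: {high[0]:.1%} negative, {high[1]:.1%} of those false")
print(f"prevalence 20%: {low[0]:.1%} negative, {low[1]:.1%} of those false")
```

The false-negative share and the post-test probability are the same quantity reached by two routes, which is why a low negative LR alone does not guarantee a low false-negative rate when the pretest probability is high.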

4.1.1.1. Type 2 monitors
There were three studies that reported results on sensitivity and specificity.20 25 32

4.1.1.1.1. Sleep laboratory-attended
There were two studies performed in the sleep laboratory, with one rated as having level II evidence but reporting on a small number of patients.25 The calculated LR for a negative result was not very low (0.22), and the confidence intervals were very wide.

4.1.1.1.2. Home-unattended
One study was performed at home20 but was given a low evidence rating. The study did not report a very low LR for a negative result (0.19), and, of the patients who had a negative result, a substantial proportion of the results (15%) was false-negative.

4.1.1.2. Type 3 monitors
4.1.1.2.1. Sleep laboratory-attended
There were nine studies performed on patients in a sleep laboratory setting where the portable monitoring occurred simultaneously with the polysomnogram. Eight of these studies18 21 22 23 27 28 29 31 were rated as having either level I or II evidence, all of which had very low or reasonably low LRs (range, 0 to 0.15). Two studies21 29 reported a sensitivity of 100% resulting in an LR of 0, but both had small numbers of patients. All eight studies reported a substantial proportion of patients (range, 20 to 73%) having a negative result on portable monitoring, with a small percentage of those results being false-negative (range, 4 to 8%). In seven of those studies, the portable monitor used flow measured by thermistor to detect events. Four of the studies included oxygen desaturation as a necessary criterion for hypopneas.

4.1.1.2.2. Home-unattended
There were four studies reporting data on portable monitoring performed in the home setting, two of which26 29 were rated as having level II evidence. Both of these studies and one of the level IV evidence studies24 reported low LRs in a modest number of patients. Two of these studies24 26 had a very high prevalence, so that the number of patients who had a negative result was small (9% and 18%, respectively). The study with a higher percentage of negative results (32%) had a substantial percentage of false-negative results (17% of those with a negative result).29 All three of these studies used flow measured by thermistor, with two of the studies including oxygen desaturation as a necessary criterion for hypopneas.

The majority of the evidence from attended monitoring indicates that type 3 monitors using flow measured by thermistor can substantially reduce the probability of sleep apnea in a large percentage of patients. This approach is not as well validated in the home setting with this type of monitor.

4.1.1.3. Type 4 monitors
The methods used to analyze type 4 portable monitors were diverse. Devices m