* From the Clinical Epidemiology and Health Service Evaluation Unit (Drs. Manser, Byrnes, and Campbell), Royal Melbourne Hospital, Parkville, Victoria; and Department of Respiratory Medicine (Mr. Rochford and Dr. Pierce), Austin and Repatriation Medical Center, Heidelberg, Victoria, Australia.
Correspondence to: Renee L. Manser, MBBS, Clinical Epidemiology and Health Service Evaluation Unit, Ground Floor, Charles Connibere Building, Royal Melbourne Hospital, Parkville, Victoria 3050; e-mail: ManserRL{at}mh.org.au
Abstract
Design: Retrospective analysis of 48 diagnostic polysomnographic records.
Setting: Tertiary-hospital sleep-disorders clinic.
Measurements: AHIs were derived from three different methods for scoring hypopneas. The hypopnea definitions used incorporated different combinations and threshold values of respiratory signal changes in addition to differences in the requirement for associated oxygen desaturation or arousal. The level of agreement between different scoring methods was assessed by constructing Bland-Altman plots and calculating intraclass correlation coefficients (ICCs). κ statistics were used to assess agreement between the different methods using varying thresholds of AHI to categorize sleep apnea (AHI > 5, AHI > 15, and AHI > 20).
Results: The random-effects ICC for the three methods was 0.89, suggesting that the different scoring methods tended to rank patients fairly consistently. However, the point prevalence of disease estimated by using different thresholds of AHI was found to vary depending on the method used to score sleep studies (κ, 0.30 to 0.95).
Conclusions: These findings have implications for case finding, population-prevalence estimates, and grading of disease severity for access to government-funded continuous positive airway pressure services. Guidelines for standardizing the measurement and reporting of sleep studies in clinical practice should be implemented.
Key Words: diagnosis; hypopnea; polysomnography; sleep apnea syndromes
Introduction
Despite these limitations, the apnea-hypopnea index (AHI) remains the primary measurement of sleep-disordered breathing. The lack of standardization affects sleep apnea diagnosis, grading of disease severity, treatment decisions, and research. Importantly, in Victoria and elsewhere, reimbursement for government-funded continuous positive airway pressure (CPAP) services is linked to the severity of sleep-disordered breathing as defined by the AHI. This study was therefore designed to assess the effect of "real-world" variations in hypopnea measurement on the AHI.
We report the level of agreement between AHIs derived from three different hypopnea scoring methods by rescoring a random sample of diagnostic sleep studies. The methods used were chosen to reflect those used by Victorian sleep laboratories currently, and include different combinations of respiratory signals and thresholds and differences in the requirement for associated changes in oxygen saturation or arousal. In this study, we aimed to explore the effect of using different scoring criteria for hypopneas in the scoring of polysomnographic studies (1) by estimating the level of agreement between AHIs derived from different scoring methods, and (2) by examining the point prevalence of disease using different scoring methods and different AHI thresholds.
Materials and Methods
Exclusion Criteria for Sleep Studies
Studies were excluded if the sleep efficiency was < 50% or if
poor signals were noted for > 10% of the total sleep time.
Sleep Staging and Scoring
Sleep stages were scored according to the criteria of Rechtschaffen and Kales.5 For each method, apneas were defined as cessation of oronasal airflow (such that no breaths were discernible in the airflow signal) for ≥ 10 s in association with a ≥ 2% oxygen desaturation. Arousals were defined according to the American Sleep Disorders Association criteria.6 The three different methods for scoring hypopneas are as follows:
Method A:
A ≥ 50% reduction in one or more of three respiratory signals (airflow, thoracic, or abdominal respiratory inductive plethysmography) compared with baseline breathing level for > 10 s and associated with an oxygen desaturation ≥ 2% compared with baseline.
Method B:
Either 1 and 3 or 2 and 3 of the following: (1) a reduction of ≥ 50% from baseline in the amplitude of respiratory inductive plethysmography signals; the reduction must be in both the thoracic and abdominal movement channels (which were recorded on separate channels for this study); (2) a clear amplitude reduction that does not reach the above-mentioned criterion but is associated with either an oxygen desaturation of ≥ 3% or an arousal; (3) event lasting ≥ 10 s.
Method C:
Any discernible reduction in airflow lasting ≥ 10 s associated with a ≥ 3% oxygen desaturation (with or without arousal).
The methods were selected to reflect the "extremes" in the range of different methods used by Victorian sleep laboratories.1 Method B is based on published guidelines.7 Some of the Victorian laboratories were using slight variations of method B.
Intrarater Reliability
In order to assess intraobserver variability for each scoring
method, six sleep studies were randomly selected and rescored by the
same technologist. The time interval between rescores was 3 months, and
the technologist was blinded to the previous results.
Equipment and Instrumentation
For each sleep study, polysomnographic signals were recorded
using a computerized polysomnographic system (Sleepwatch; Compumedics;
Melbourne, Australia). The montage consisted of the following signals:
central EEG (C3/A2), chin electromyogram, eye electro-oculographic
activity (right and left electro-oculograms) and ECG, ribcage and
abdominal excursions by uncalibrated inductance plethysmography,
nasal-oral airflow with thermistors, and arterial oxygen saturation by
pulse oximetry (Radiometer OXI; Radiometer; Copenhagen, Denmark) using
finger probes.
Statistical Analysis
Statistical analysis was performed using statistical software (Stata version 6.0; Stata Corporation; College Station, TX).8 Repeated-measures analysis of variance was used to assess whether the mean AHIs derived from each of the three methods were significantly different. Pearson's correlation coefficient was used to assess the correlation between AHIs derived from different hypopnea definitions. The Pearson correlation coefficient tests the strength of the linear relationship between the variables and is not a measure of agreement; it may not detect systematic differences between the methods.9
An intraclass correlation coefficient (ICC) was calculated using both the random-effect and fixed-effect methods described by Streiner and Norman.9 Bias-corrected confidence intervals (CIs) were estimated by using 1,000 bootstrap repetitions. The ICC is a ratio of the true variance between patients to the observed variance of scores and is equivalent to a weighted κ with quadratic weights.9
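The distinction between the random-effect and fixed-effect ICC can be sketched as follows. This is a minimal illustration that assumes the common two-way analysis-of-variance formulation (one AHI per patient per scoring method); the text cites Streiner and Norman without giving formulas, so the function name `icc_two_way`, the exact estimator, and the AHI values are all assumptions made for illustration.

```python
def icc_two_way(scores):
    """Return (random-effect ICC, fixed-effect ICC) for single measurements.

    scores: list of rows, one row per patient, one column per scoring method.
    Uses the two-way ANOVA mean squares (an assumed formulation, see lead-in).
    """
    n = len(scores)        # number of patients
    k = len(scores[0])     # number of scoring methods
    grand = sum(sum(row) for row in scores) / (n * k)
    row_means = [sum(row) / k for row in scores]
    col_means = [sum(row[j] for row in scores) / n for j in range(k)]

    ss_patients = k * sum((m - grand) ** 2 for m in row_means)
    ss_methods = n * sum((m - grand) ** 2 for m in col_means)
    ss_total = sum((x - grand) ** 2 for row in scores for x in row)
    ss_error = ss_total - ss_patients - ss_methods

    ms_patients = ss_patients / (n - 1)
    ms_methods = ss_methods / (k - 1)
    ms_error = ss_error / ((n - 1) * (k - 1))

    # Random effect: systematic differences between methods count as disagreement.
    icc_random = (ms_patients - ms_error) / (
        ms_patients + (k - 1) * ms_error + k * (ms_methods - ms_error) / n
    )
    # Fixed effect: systematic differences between methods are ignored.
    icc_fixed = (ms_patients - ms_error) / (ms_patients + (k - 1) * ms_error)
    return icc_random, icc_fixed


# Hypothetical AHIs for four patients scored by three methods that differ
# only by a constant offset (not data from the study).
example = [[10, 15, 5], [20, 25, 15], [30, 35, 25], [40, 45, 35]]
icc_random, icc_fixed = icc_two_way(example)
print(round(icc_random, 3), round(icc_fixed, 3))
```

In this contrived example the methods rank patients identically, so the fixed-effect ICC is 1, while the constant offsets between methods pull the random-effect ICC below 1. That is the same direction of difference as the two ICCs reported in this study.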
Agreement between the different scoring methods was further evaluated using Bland-Altman plots and limits of agreement.10 The Bland-Altman method10 measures the absolute level of agreement between two measurement techniques and is useful for comparing measures on the same scale. The limits of agreement are equal to the mean difference plus or minus twice the SD of the differences. The precision of the estimated limits of agreement was determined by calculating CIs. Bland-Altman plots were constructed for each of the three paired comparisons by plotting the difference between the two methods against the mean of the two methods.
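A minimal sketch of the limits-of-agreement calculation for one pair of methods is given below. It uses the SD of the paired differences, which is the usual Bland-Altman convention; the function name `bland_altman` and the AHI values are hypothetical and serve only to illustrate the steps.

```python
from statistics import mean, stdev


def bland_altman(ahi_x, ahi_y, multiplier=2.0):
    """Summarize agreement between two paired sets of AHI values.

    Returns the mean difference, the SD of the differences, the lower and
    upper limits of agreement, and the (mean, difference) points that would
    be drawn on a Bland-Altman plot.
    """
    diffs = [x - y for x, y in zip(ahi_x, ahi_y)]
    pair_means = [(x + y) / 2 for x, y in zip(ahi_x, ahi_y)]
    d_bar = mean(diffs)
    sd = stdev(diffs)  # sample SD of the paired differences
    return {
        "mean_diff": d_bar,
        "sd_diff": sd,
        "lower": d_bar - multiplier * sd,
        "upper": d_bar + multiplier * sd,
        "points": list(zip(pair_means, diffs)),
    }


# Hypothetical AHIs for four patients under two scoring methods.
result = bland_altman([10, 20, 30, 40], [12, 25, 31, 48])
print(result["mean_diff"], result["lower"], result["upper"])
```

Plotting `result["points"]` with horizontal lines at the mean difference and at the two limits gives the plot described in the text.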
κ statistics were used to assess agreement between the different methods using varying thresholds of AHI to categorize sleep apnea (AHI > 5, AHI > 15, and AHI > 20). In clinical practice, thresholds for defining disease vary between > 5 events per hour and > 15 events per hour. In Victoria, patients (without comorbidities) are eligible for government-funded CPAP services for sleep apnea if they have an AHI of > 20 events per hour.
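The threshold-based comparison can be sketched as follows: each AHI is dichotomized at a cutoff and Cohen's κ is computed on the resulting classifications. This is an illustrative sketch only; the function name `kappa_at_threshold` and the AHI values are hypothetical, and the unweighted two-category κ is assumed.

```python
def kappa_at_threshold(ahi_x, ahi_y, threshold):
    """Return (observed agreement, Cohen's kappa) after classifying each
    patient as having sleep apnea when AHI > threshold under each method."""
    n = len(ahi_x)
    pos_x = [v > threshold for v in ahi_x]
    pos_y = [v > threshold for v in ahi_y]

    # Proportion of patients classified the same way by both methods.
    p_observed = sum(a == b for a, b in zip(pos_x, pos_y)) / n

    # Agreement expected by chance from each method's marginal prevalence.
    prev_x = sum(pos_x) / n
    prev_y = sum(pos_y) / n
    p_expected = prev_x * prev_y + (1 - prev_x) * (1 - prev_y)

    if p_expected == 1.0:
        # Kappa is undefined when both methods put every patient in one category.
        return p_observed, float("nan")
    return p_observed, (p_observed - p_expected) / (1 - p_expected)


# Hypothetical AHIs for six patients under two scoring methods.
method_x = [3, 8, 18, 25, 40, 12]
method_y = [6, 14, 22, 30, 45, 16]
for cutoff in (5, 15, 20):
    print(cutoff, kappa_at_threshold(method_x, method_y, cutoff))
```

Because the same patients are reclassified at each cutoff, both the raw agreement and κ can change from one threshold to the next even though the underlying AHIs are unchanged.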
Results
Summary of Polysomnographic Findings
Some variables were nonnormally distributed (for example, the
arousal index). Descriptive data therefore include median and range
values, where appropriate. Tests of normality (Kolmogorov-Smirnov) were
performed on the AHI measures derived from each of the three scoring
methods. The distributions of these variables were not significantly
different from normal (p = 0.20).
The mean ± SD non-rapid eye movement sleep time was 234 ± 44 min (range, 120 to 331 min), and the mean rapid eye movement sleep time was 53 ± 25 min (range, 8 to 107 min). The median sleep efficiency was 74% (range, 50 to 96%). The median arousal index was 32 (range, 5 to 87). An average AHI was calculated by averaging the score from each of the three scoring methods. The mean AHI was 36.7 ± 22 (range, 1.6 to 90). The mean (SD; range) hypopnea indexes for methods A, B, and C were 30 (21; 0 to 81), 35 (19; 4 to 83), and 24 (20; 0 to 83), respectively. The AHIs for methods A, B, and C were 37.5 (23; 0 to 91), 42 (21; 4 to 92), and 30.5 (22; 0 to 87), respectively. The differences in mean AHI scores among the three methods for scoring hypopneas were statistically significant using repeated-measures analysis of variance (p < 0.0005).
Correlation Between Different Scoring Methods
Pearson correlation coefficients were calculated for paired
comparisons between the different AHIs derived from three methods for
scoring hypopneas. The correlation between method A and method B was
0.96. The correlation between method A and method C was 0.97. The
correlation between method B and method C was 0.93.
ICC
ICCs were calculated using the three different methods for scoring
hypopneas. When the scoring method was treated as a random factor, the
ICC was 0.89 (95% CI, 0.85 to 0.9). When the scoring method was
treated as a fixed factor, the ICC was 0.95 (95% CI, 0.94 to 0.95).
The result suggests that 89% of the variance in scores arises from
true variance in the patients.
Limits of Agreement and Bland-Altman Plots
The mean ± SD difference between method A and method B was -4.55 ± 6.48 (95% CI, -6.43 to -2.67). The lower limit of agreement was -17.5 (95% CI, -20.77 to -14.25), and the upper limit of agreement was 8.41 (95% CI, 5.15 to 11.67). The mean difference between method A and method C was 6.97 ± 5.9 (95% CI, 5.26 to 8.69), the lower limit of agreement was -4.83 (95% CI, -7.8 to -1.86), and the upper limit of agreement was 18.77 (95% CI, 15.8 to 21.74). The mean difference between method B and method C was 11.53 ± 8.38 (95% CI, 9.09 to 13.96), the lower limit of agreement was -5.23 (95% CI, -9.44 to -1.01), and the upper limit of agreement was 28.29 (95% CI, 24.07 to 32.5).
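As an arithmetic check, the figures reported for method A vs method B can be reproduced from the mean difference and SD with n = 48. The worked steps below assume a t-based CI with 47 degrees of freedom (critical value of about 2.01) and the usual Bland-Altman approximation for the SE of a limit of agreement; the text does not state which formulas were used, so both are assumptions, although they match the reported values after rounding.

```latex
\begin{align*}
\text{limits of agreement} &= \bar{d} \pm 2s = -4.55 \pm 2(6.48) = -17.51,\ 8.41 \\[4pt]
\mathrm{SE}(\bar{d}) &= \frac{s}{\sqrt{n}} = \frac{6.48}{\sqrt{48}} \approx 0.94,
  \qquad \bar{d} \pm t_{47}\,\mathrm{SE}(\bar{d}) \approx -4.55 \pm 1.88 = -6.43,\ -2.67 \\[4pt]
\mathrm{SE}(\text{limit}) &\approx \sqrt{\frac{3s^{2}}{n}} = \frac{\sqrt{3}\,(6.48)}{\sqrt{48}} = 1.62,
  \qquad -17.51 \pm t_{47}(1.62) \approx -17.51 \pm 3.26 = -20.77,\ -14.25
\end{align*}
```

The same steps applied to the other two comparisons give limits of 6.97 ± 2(5.9) = -4.83, 18.77 and 11.53 ± 2(8.38) = -5.23, 28.29.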
The Bland-Altman plots were constructed for each of the paired comparisons. Figure 1 is a Bland-Altman plot comparing method A and method B. It shows that, on average, method A produced an AHI about 5 events per hour lower than method B, but there is considerable scatter at any average level of AHI. Similar plots were constructed for the comparisons between method B and method C, and between method A and method C (not displayed). On average, method B produced an AHI about 11 events per hour higher than method C, and method A produced an AHI about 7 events per hour higher than method C, but again there was considerable scatter at any average level of AHI.
κ Statistics and Disease Prevalence With Different AHI Thresholds
κ statistics for method A vs method B were 0.66 (agreement, 98%), 0.48 (agreement, 85%), and 0.63 (agreement, 88%) for AHI thresholds of > 5, > 15, and > 20, respectively. κ statistics for method A vs method C were 0.54 (agreement, 94%), 0.95 (agreement, 98%), and 0.77 (agreement, 90%) for AHI thresholds of > 5, > 15, and > 20, respectively. κ statistics for method B vs method C were 0.30 (agreement, 89%), 0.44 (agreement, 81%), and 0.44 (agreement, 77%) for AHI thresholds of > 5, > 15, and > 20, respectively. Thus, the agreement between methods varies considerably depending on the methods and threshold values applied (fair to very good agreement); however, the value of κ statistics may be influenced by the proportion of subjects in each category (prevalence), and it may be misleading to make comparisons between them.11
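The effect of prevalence can be illustrated with a hypothetical 2 × 2 table for 48 subjects. The counts are illustrative assumptions and are not taken from Table 1: 46 subjects are classified as positive by both methods, 1 as negative by both, and 1 as positive by the second method only. The marginal totals are then 46 positive and 2 negative for the first method, and 47 positive and 1 negative for the second.

```latex
\begin{align*}
p_o &= \frac{46 + 1}{48} = \frac{2256}{2304} \approx 0.979 \\[4pt]
p_e &= \frac{46}{48}\cdot\frac{47}{48} + \frac{2}{48}\cdot\frac{1}{48} = \frac{2164}{2304} \approx 0.939 \\[4pt]
\kappa &= \frac{p_o - p_e}{1 - p_e} = \frac{2256 - 2164}{2304 - 2164} = \frac{92}{140} \approx 0.66
\end{align*}
```

With almost every subject in the positive category, chance agreement is already about 94%, so raw agreement of about 98% corresponds to a κ of only about 0.66.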
The clinical significance of these results may be assessed by examining the frequency table (Table 1), showing the number of patients (of 48 patients total) identified as having disease when different scoring methods and AHI thresholds are applied to the polysomnographic results. It is clear from Table 1 that method B gives a higher frequency of disease, regardless of the threshold value of AHI, compared with method A or method C, and method A gives a higher frequency of disease compared with method C. For example, in this sample of 48 subjects, when method B is used instead of method C, 11, 9, and 4 additional subjects receive a diagnosis of sleep apnea using AHI thresholds of 20, 15, and 5, respectively. The differences between the methods are less when the threshold value of AHI is lower.
Discussion
Agreement was further assessed by calculating an ICC for the three scoring methods. An ICC of 0.89 is relatively high; however, this indicates that there is consistent ranking of subjects by the different scoring methods but does not necessarily imply good absolute agreement. Furthermore, the adequacy of agreement is a practical matter; for some tests, it may need to be very high.12 The value of the ICC may be influenced by the selection of subjects over which it is defined. Where subjects are highly variable, the value of the ICC will tend to be high.13 Nevertheless, in the context of this study, it is useful to contrast the ICC obtained by comparing the different methods for scoring hypopneas with the intrarater reliability. The variation between methods is noted to be greater than the intrarater variability.
This study was designed specifically to examine the impact of varying hypopnea definitions on the evaluation of sleep apnea. In order to limit other sources of variation, a single, experienced sleep technologist undertook all the scoring of polysomnographic records. The scorer was blinded to previous scores, and the studies were scored in a random sequence to avoid potential bias from an order effect. This approach is somewhat artificial, and the results cannot necessarily be generalized to clinical practice. Indeed, the very high levels of intrarater reliability compared with other studies reflect the idealized conditions under which this study was conducted.14 Variability between methods or between scorers is likely to be greater in practice.
A further possible limitation of this study is that the subjects were highly selected and selection was retrospective. Our exclusion and inclusion criteria, however, were developed in advance. Subject selection was stratified according to previously determined disease severity; therefore, subjects with more severe sleep apnea are overrepresented in the sample. This method was chosen so that we could specifically examine the effect of variations in hypopnea definitions on the assessment of sleep-disordered breathing in participants with moderate sleep apnea. The results of studies15,16 examining other sources of measurement error in the assessment of sleep apnea hypopnea syndrome suggest that variability tends to be greater in the midrange of disease. In the clinical setting, variability in the classification of patients with mild-to-moderate sleep apnea has implications for assessing treatment options and, in some circumstances, eligibility for government-funded CPAP services, such as the Victorian CPAP program. It is likely that if these scoring methods were applied to a more heterogeneous population, the level of agreement would improve.9
The methods of defining hypopneas chosen for this study were based on those currently used by laboratories in Victoria and may not be in widespread use elsewhere. Method B was adapted from published guidelines.7 These guidelines acknowledge the limitation to the evidence base for this proposed approach. Where possible, they recommend that changes in thoracoabdominal signals (respiratory inductive plethysmography) be based on the sum signal. Victorian sleep laboratories do not currently use the sum signal; therefore, we based our hypopnea scoring criteria for method B on changes in dual thoracic and abdominal signals.
Previous researchers have also examined the agreement between different approaches to determining AHIs. Tsai et al17 compared a hypopnea definition that incorporated a 4% desaturation with one that included corroborative changes in either desaturation (4%) or arousal. They found that the addition of arousal-based scoring criteria for hypopnea caused only small changes in AHI. Redline et al18 recently reported prevalence data from the Sleep Heart Health Study, in which they examined the effect of using 11 different criteria for scoring hypopneas on the prevalence of disease in a large community-based sample. Redline et al18 concluded that different approaches for measuring AHI could result in substantial variability in identifying and classifying sleep-disordered breathing. There were a large number of methods assessed in this study, and limits of agreement were not presented. To our knowledge, the present study is the first to examine the impact of altering methods of assessing respiratory signal changes on hypopnea scoring.
The findings of this study suggest that based on current practices in Victoria, the AHIs reported by different laboratories may not be comparable. The study highlights the need to standardize the methods used to evaluate and define sleep-disordered breathing in clinical practice. This is an issue that needs to be addressed at both the national and international levels. Consideration should be given to adopting the currently available guidelines for research in clinical practice.7 In the absence of such guidelines, polysomnographic reports should include a description of the measurement techniques and methods used to derive AHIs. Adjustments could then be made to the scores depending on the method used. The results of this study suggest that the agreement between the methods could be improved if a correction factor is applied to adjust for the systematic differences between the methods. Thus, when the scoring method is treated as a fixed factor, the ICC is 0.95. Although this represents an improvement, a small proportion of patients would still be misclassified even after adjusting the scores.
We have examined only a few of the methods used by sleep laboratories in Victoria. Further research evaluating different approaches to diagnosing sleep apnea is required. The best method for validating diagnostic criteria is to determine which methods of defining respiratory events predict adverse health outcomes. Some longitudinal studies, such as the Sleep Heart Health Study,19 are currently being conducted, but given the diversity of approaches to diagnosing sleep-disordered breathing in current practice, further studies are required.
Conclusion
Footnotes
The study was funded by the Victorian Department of Human Services.
This study was performed at the Austin and Repatriation Medical Center, Heidelberg, Victoria, Australia.
Received for publication December 1, 2000. Accepted for publication April 12, 2001.