doi:10.1378/chest.06-2088
(Chest. 2007; 131:628-632)
© 2007 American College of Chest Physicians

Documenting Research in Scientific Articles: Guidelines for Authors*

3. Reporting Multivariate Analyses

Tom Lang, MA

* From Tom Lang Communications and Training, Davis, CA.

Correspondence to: Tom Lang, MA, 1925 Donner Ave, No. 3, Davis, CA 95618; e-mail: tomlangcom{at}aol.com

Multivariate analyses include two broad statistical techniques, regression analysis and analysis of variance (ANOVA). The reporting guidelines for each are similar and here have been condensed from the book How To Report Statistics in Medicine.1

Reporting Regression Analysis

Regression analysis attempts to predict or estimate the value of a response variable or outcome from the known values of one or more explanatory variables or predictors. The type of regression analysis is determined by the number of explanatory (or independent) variables and of the response (or dependent) variables, as well as by the "level of measurement" of these variables.

The phrase level of measurement refers to the kind of information collected about a variable. Nominal data are categorical data with no inherent ranking, such as blood type (eg, A, B, AB, and O); ordinal data are categorical data that do have an inherent ranking, such as severity categories (eg, mild, moderate, and severe); and continuous data are measurements made on a continuous scale of equal intervals. The level of measurement can also be set by the researcher. For example, data on BP can be collected as a nominal variable (hypertensive or not hypertensive), an ordinal variable (hypotensive, normotensive, or hypertensive), or a continuous variable (systolic BP measured in millimeters of mercury).
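The BP example can be sketched in a few lines of Python. This is only an illustration: the function name and the cutoffs of 90 and 140 mm Hg are assumptions made for the example, not values given in this article.

```python
def classify_bp(systolic_mm_hg, low_cutoff=90, high_cutoff=140):
    """Express one systolic BP reading at three levels of measurement.

    The cutoffs are illustrative assumptions, not clinical guidance.
    """
    if systolic_mm_hg < low_cutoff:
        ordinal = "hypotensive"
    elif systolic_mm_hg < high_cutoff:
        ordinal = "normotensive"
    else:
        ordinal = "hypertensive"
    # The nominal version collapses the ranking into two unordered categories.
    nominal = "hypertensive" if ordinal == "hypertensive" else "not hypertensive"
    return {"continuous": systolic_mm_hg, "ordinal": ordinal, "nominal": nominal}


reading = classify_bp(150)
```

Moving from the continuous value to the ordinal or nominal category discards information, which is one reason the level of measurement should be reported.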

The most common types of regression analyses are as follows:

Simple linear regression is used to assess the relationship between a single continuous explanatory variable and a single continuous response variable that varies linearly over a range of values.
Multiple linear regression is used to assess the linear relationship between two or more continuous or categorical explanatory variables and a single continuous response variable.
Simple logistic regression is used to assess the relationship between a single continuous or categorical explanatory variable and a single categorical response variable, usually a binary variable, such as whether or not a heart attack has occurred.
Multiple logistic regression is used to assess the relationship between two or more continuous or categorical explanatory variables and a single categorical response variable.
Nonlinear regression is used to assess variables that are not linearly related and that cannot be transformed into a linear relationship. These equations model more complex relationships than the other forms of regression analysis.
Polynomial regression can be used for any of the above combinations of explanatory and response variables when the relationship among the variables is curvilinear, which requires, say, squaring or cubing one or more explanatory variables in the model.
Cox proportional hazards regression, an aspect of time-to-event (survival) analysis, is used to assess the relationship between two or more continuous or categorical explanatory variables and a single continuous response variable (the time to the event). Typically, the event (usually death) has not yet occurred for all participants in the sample, which creates censored observations.

Guideline: Describe the Relationship of Interest or the Purpose of the Analysis

In addition to predicting one value from one or more others, regression analysis can be used to "control for" the potential confounding effects of explanatory variables that are associated with the response variable. Regression analysis can separate the effects of, say, age and sex on survival after surgery, for example.

Regression analysis can also be used to create risk scores. Here, the variables of the risk score are those of the regression equation, and the score itself is the value predicted by the regression model.

Guideline: Identify the Variables Used in the Analysis and Summarize Each With Descriptive Statistics

Continuous variables should be summarized with medians and ranges or interquartile ranges (or means and SDs if the data are normally distributed), and categorical data can be summarized with counts or percentages.
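As a minimal sketch of these summaries, the following Python functions use only the standard library. The function names are hypothetical, and the quartiles follow the default method of `statistics.quantiles`; other software may use slightly different quartile definitions.

```python
import statistics
from collections import Counter


def summarize_continuous(values):
    """Median, interquartile range, range, mean, and SD of a continuous variable."""
    q1, median, q3 = statistics.quantiles(values, n=4)
    return {
        "median": median,
        "iqr": (q1, q3),
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 in the denominator)
    }


def summarize_categorical(values):
    """Count and percentage for each category."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}
```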

Guideline: Confirm That the Assumptions of the Analysis Were Met and State How Each Was Checked

A statement that the assumptions were verified and by which methods is all that need be included. There are both formal checks (eg, hypothesis tests) and informal checks (eg, inspection of graphs of residuals) for these assumptions. Sometimes, data that violate the assumptions can be adjusted (eg, with data transformations) to meet the assumptions. If such adjustments were made, they should be identified.

Guideline: Report How Any Missing Data Were Treated in the Analyses

Missing data can be a problem in multivariate analysis because they reduce the sample size unless corrective measures are taken. To create a model for predicting weight from age and height, for example, values for each of these variables must be collected for each patient. If age is missing for one patient, the patient is excluded from the analysis, and the sample size is reduced by one. In regression models with several variables, losses to missing data can be common.

However, missing data can be replaced in a process called imputation. Simple imputation methods include using the mean of all observed values for all people in place of the missing value; using the mean observed value for the same person in other time periods; using the mean of the previous and following values for the person, if they exist; or using the most recent observed value for the person (called the last-observation-carried-forward method, which is commonly used in pharmaceutical research). Other methods of imputing data are possible, but they should be based on sound judgment.
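The exclusion of incomplete records and two of the simple imputation methods named above can be sketched as follows. Missing values are represented by `None`, and the function names are assumptions made for this illustration.

```python
def complete_cases(rows):
    """Listwise deletion: keep only records with no missing (None) values."""
    return [row for row in rows if all(value is not None for value in row)]


def impute_mean(values):
    """Replace each missing value with the mean of all observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_locf(values):
    """Last observation carried forward for one person's repeated measurements.

    Values missing before the first observation stay missing (None).
    """
    filled, last = [], None
    for v in values:
        if v is not None:
            last = v
        filled.append(last)
    return filled
```

Whichever method is used, the report should name it and state how many values were imputed.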

Guideline: Report How Any Outlying Values Were Treated in the Analysis

Outliers are extreme values that appear to be anomalies. Outliers cannot be ignored: even a single outlier can have a profound effect on the relationship derived from the regression line.2,3 All outliers must be reported, but it is permissible to report the results with and without the outliers to indicate their effect on the results.
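A small sketch shows how one outlier can change a fitted line. The data are invented for the illustration (five points lying exactly on y = 2x plus one anomalous point), and `fit_line` is an ordinary least-squares fit written out by hand.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


# Hypothetical data: the last point is the outlier.
xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 40]

with_outlier = fit_line(xs, ys)               # slope is 6 with the outlier
without_outlier = fit_line(xs[:-1], ys[:-1])  # slope is 2 without it
```

Reporting both fits, as the guideline permits, lets readers see how much the conclusion depends on the single extreme value.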

Guideline: Report the Regression Model

A simple linear regression equation can be reported in the text or in a scatter plot of the data. Multiple linear regression models can be reported as equations (Fig 1) or in tables (Table 1); logistic regression models are typically reported in tables because the equations are so complex (Table 2).


Figure 1. A multiple linear regression equation. In this example, the model predicts overall function score, Y, for patients with multiple sclerosis based on: disease severity, X1; ambulatory ability (measured as the rate of walking in laps per minute), X2; and number of lesions, X3. Here, X1, X2, and X3 are explanatory variables (sometimes called risk factors); the numbers in front of the X values are called regression coefficients or ß-weights. (40.8 is the Y intercept point, where the line crosses the Y axis.) Coefficients are interpreted as follows: if X1 and X3 are held constant (or "controlling for" disease severity and number of lesions), then mean function score increases by about 1.22 points (the coefficient for X2) for each additional lap per minute. The final model had a coefficient of multiple determination, R2, of 0.58, indicating that the three variables in the model explain 58% of the variation in the response variable.
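The figure image itself is not reproduced here. Judging from the caption alone, the equation has the general form sketched below; the symbols b_1 and b_3 are placeholders for the coefficients of X1 and X3, whose values and signs the caption does not state.

```latex
Y = 40.8 + b_1 X_1 + 1.22\, X_2 + b_3 X_3, \qquad R^2 = 0.58
```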

 


 
Table 1. A Table for Reporting a Multiple Linear Regression Model With Three Explanatory Variables*

 


 
Table 2. A Table for Reporting a Multiple Logistic Regression Model With Four Explanatory Variables*

 
Guideline: Report the Actual p Value and the 95% Confidence Interval for the Regression Coefficient(s) of the Explanatory Variable(s), and in Logistic Regression, Report the Odds Ratio and the Associated 95% Confidence Interval

In regression analysis, the regression coefficient for an explanatory variable indicates how much the average value of the response variable, Y, varies with each unit change in the explanatory variable, X. The coefficient, or ß-weight, is an estimate and so should be accompanied by a confidence interval that indicates its precision.
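For simple linear regression, the coefficient and an approximate confidence interval can be computed as sketched below. The function name is hypothetical, and the interval uses a large-sample normal approximation as a simplifying assumption; with small samples, statistical packages use a t distribution with n - 2 degrees of freedom, which gives a wider interval.

```python
import math
from statistics import NormalDist


def slope_with_ci(xs, ys, level=0.95):
    """Least-squares slope and an approximate confidence interval for it."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    sse = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = math.sqrt(sse / (n - 2) / sxx)
    # Normal quantile (about 1.96 for a 95% interval); an approximation.
    z = NormalDist().inv_cdf(0.5 + level / 2)
    return slope, (slope - z * standard_error, slope + z * standard_error)
```

A narrow interval indicates a precisely estimated coefficient; an interval that includes zero is consistent with no linear relationship.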

Odds ratios are widely used in logistic regression analysis. For a binary explanatory variable, the odds ratio is the ratio of the odds that an event will occur in one group to the odds that the event will occur in the other group. An odds ratio of 1 means that the odds of the event (eg, a heart attack) are the same in both groups. The larger the odds ratio above 1, the more likely the event is to occur in the group used in the numerator.
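For a single binary explanatory variable, the odds ratio and its confidence interval can be computed directly from a 2 x 2 table, as in the sketch below. The function name and cell labels are assumptions for the illustration, and the interval uses the common log-based (Woolf) approximation; in a multiple logistic regression model, the package derives adjusted odds ratios from the fitted coefficients instead.

```python
import math
from statistics import NormalDist


def odds_ratio(a, b, c, d, level=0.95):
    """Odds ratio and approximate confidence interval from a 2 x 2 table.

    a: group 1 with the event      b: group 1 without the event
    c: group 2 with the event      d: group 2 without the event
    All four counts must be greater than zero.
    """
    ratio = (a / b) / (c / d)
    se_log = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    z = NormalDist().inv_cdf(0.5 + level / 2)
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, (lower, upper)


# Hypothetical counts: 20 of 100 in group 1 and 10 of 100 in group 2 had the event.
ratio, interval = odds_ratio(20, 80, 10, 90)
```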

Guideline: Specify How the Explanatory Variables That Appear in the Final Regression Model Were Chosen

One of the first steps in building a multiple regression model is to identify the explanatory variables that are significantly related to the response variable.4 Several dozen variables may be considered one at a time in this process, called univariate analysis. Often, a less-restrictive α-level, such as 0.1, is used in the univariate analysis to identify a broad range of explanatory variables that might be associated with the response variable. That is, variables with p values less than 0.1 on univariate analysis are considered for inclusion in the model.

The second step in building a regression model is to identify the best combination of explanatory variables to include in the model. In simultaneous regression, all of the explanatory variables are included in the model and are tested as a group. In hierarchical regression, the investigator defines the number of explanatory variables and the order in which they are entered into the model. Common automated procedures for selecting variables are forward, backward, stepwise, and best-subset techniques.

Guideline: In Multiple Regression Models, Specify Whether All Potential Explanatory Variables Were Assessed for Collinearity (Nonindependence)

The explanatory variables in a multiple linear regression equation should be independent of one another.4 If two or more explanatory variables are highly correlated with one another ("collinear"), they are not independent. Collinear variables add much the same information to the model, so only one is needed. The variable with the strongest relationship with the response variable should be considered for inclusion in the final model.
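One informal screen for collinearity is to compute the correlation between every pair of explanatory variables and flag the pairs that are strongly correlated. The sketch below does this with the Pearson correlation coefficient; the function names, the example data, and the threshold of 0.8 are assumptions for the illustration, and other diagnostics (such as variance inflation factors) are also used in practice.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)


def collinear_pairs(predictors, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| at or above the threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(predictors.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


# Hypothetical data: height recorded twice, in centimeters and in inches.
predictors = {
    "height_cm": [150, 160, 170, 180, 190],
    "height_in": [59.1, 63.0, 66.9, 70.9, 74.8],
    "age": [40, 25, 60, 33, 51],
}
flagged = collinear_pairs(predictors)
```

Here the two height variables carry the same information, so only one of them would be a candidate for the model.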

Guideline: In Multiple Regression Models, Specify Whether the Explanatory Variables Were Tested for Interaction

Two explanatory variables are said to interact if the effect of one explanatory variable on the response variable depends on the level of the second explanatory variable. Interaction implies that the variables should be considered together, not separately. So, for example, if alcohol interacts with antibiotics in the blood, the model should have a variable for blood alcohol level, one for blood antibiotic level, and an interaction term that expresses the relationship between blood alcohol level and blood antibiotic level.
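In equation form, a model with an interaction term can be sketched as follows, where A is blood alcohol level, B is blood antibiotic level, and the b values are regression coefficients. The symbols are chosen for this illustration only; the interaction is represented, as is conventional, by the product of the two variables.

```latex
Y = b_0 + b_1 A + b_2 B + b_3 (A \times B)
\quad\Longrightarrow\quad
\frac{\partial Y}{\partial A} = b_1 + b_3 B
```

The second expression shows why the variables must be considered together: when b_3 is not zero, the change in Y per unit of A depends on the value of B.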

Guideline: Provide a Measure of the "Goodness of Fit" of the Model to the Data

The predictive value of a regression model is affected by how well it "fits" the data.5,6 Thus, a measure of goodness of fit is useful because it reveals how well the model reflects the data on which it was created.

Simple linear regression analysis can be thought of as an extension of correlation analysis, except that now one variable is being used to predict the other with the addition of a regression line. As in correlation analysis, scatter plots can be useful for showing this relationship. The correlation coefficient itself can indicate indirectly how well the model can predict. Correlations have to be high, say, above 0.7, as well as statistically significant, if a simple linear regression model is to predict with any degree of accuracy.

In simple linear regression analysis, the correlation coefficient associated with the scatter plot is also useful in the form of the coefficient of determination (r2). This coefficient indicates how much of the variability in the response variable is explained by the explanatory variable. For example, if the correlation between skin-fold thickness and body fat is 0.8, then r2 = 0.64, or 64%. That is, 64% of the variability in body fat can be accounted for by skin-fold thickness. In multiple linear regression analysis, the coefficient of multiple determination (R2) has the same function.

A residual is the difference between the value predicted by the model and the actual value of the data point as collected. The smaller the residual, the better the prediction. Residuals can also be graphed to determine how well the assumption of linearity was met. Thus, a graph of residuals (one kind of "model diagnostic plot") in which the values are small for all values of X, meaning that they stay close to an average difference of zero, indicates that the assumption of linearity was met and that the model predicts reasonably well. Outlier assessments work the same way as residual assessments, in that outliers and their associated residuals are apparent on the graph as data points to investigate.
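Residuals and the coefficient of determination can be computed from a fitted line as sketched below. The function names are hypothetical, the intercept and slope are assumed to come from a model that has already been fitted, and residuals are taken as observed minus predicted values (the usual sign convention).

```python
def residuals(xs, ys, intercept, slope):
    """Observed value minus the value predicted by y = intercept + slope * x."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def r_squared(ys, resid):
    """Proportion of the variability in y explained by the model."""
    mean_y = sum(ys) / len(ys)
    total_ss = sum((y - mean_y) ** 2 for y in ys)
    residual_ss = sum(e ** 2 for e in resid)
    return 1 - residual_ss / total_ss


# Hypothetical data checked against the line y = 1 + 2x.
xs = [1, 2, 3, 4]
ys = [3, 5, 7, 10]
resid = residuals(xs, ys, intercept=1, slope=2)
fit = r_squared(ys, resid)
```

For a least-squares line, this value equals the square of the correlation coefficient. Plotting `resid` against `xs` gives the residual plot described above.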

Formal goodness-of-fit tests calculate a p value. If the p value is statistically significant, the model does not appropriately fit the data.

Guideline: Specify Whether the Model Was Validated

Regression models can be validated or tested against a similar set of data to show that they explain what they seek to explain. One method used when the sample is large is to develop the model on, say, 75% of the data, then to create another model on the remaining 25% of the data, and determine whether the models are similar. Another method involves removing the data from one subject at a time and recalculating the model. The coefficients and the predictive validity of all the models can then be assessed. Such methods are called jack-knife procedures. A third method involves developing another model on a separate set of similar data and determining whether the models differ.
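The second method, removing one subject at a time and recalculating the model, can be sketched for a simple linear regression as follows. The function names are assumptions for the illustration, and only the slope is tracked here; a fuller check would also compare intercepts and predictions.

```python
def fit_slope(xs, ys):
    """Least-squares slope of y on x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    return sxy / sxx


def jackknife_slopes(xs, ys):
    """Refit the model once per subject, each time leaving that subject out."""
    return [
        fit_slope(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        for i in range(len(xs))
    ]


# Hypothetical data; one slope is returned for each omitted subject.
slopes = jackknife_slopes([1, 2, 3, 4, 5], [2.1, 3.9, 6.2, 7.8, 10.1])
spread = max(slopes) - min(slopes)
```

A small spread among the recalculated slopes suggests that no single subject is driving the model.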

Guideline: Name the Statistical Package or Program Used in the Analysis

Although commercial statistical programs generally are validated and updated, and have met the test of time, the performance characteristics of privately developed programs are often unknown.

Reporting ANOVA

ANOVA is a form of hypothesis testing for studies involving two or more variables. It is closely related to regression analysis and should be reported according to the same general guidelines. Usually, ANOVA is used to assess categorical explanatory variables, whereas regression analysis is used to assess continuous explanatory variables. When a study includes both continuous and categorical explanatory variables, the analysis may be called multiple regression or analysis of covariance.

ANOVA is a "group comparison" that determines whether a statistically significant difference exists somewhere among the groups studied. If a significant difference is indicated, ANOVA is usually followed by a multiple comparison procedure that compares combinations of groups to examine further any differences among them.

The most common ANOVA procedures used in biomedical research are as follows:

One-way ANOVA assesses the effect of a single (hence the "one-way" designation) categorical explanatory variable (sometimes called a factor) on a single continuous response variable. Note, too, that the factor (category) has three or more alternatives (or "levels" or "values"; eg, blood type is A, B, AB, or O). When there are only two alternatives (two groups), this analysis reduces to the Student t test.
Two-way ANOVA assesses the effect of two categorical explanatory variables (again, sometimes called factors) on a single continuous response variable.
Multiway ANOVA assesses the effect of three or more categorical explanatory variables (still called factors) on a single continuous response variable.
Analysis of covariance assesses the effect of one or more categorical explanatory variables while controlling for the effects of some other (possibly continuous) explanatory variables (now called covariates) on a single continuous response variable.
Repeated-measures ANOVA is used to assess several, or repeated, measurements of the same participants under different conditions (such as BP measurements taken while the patient is supine, sitting, or standing) or at different points over time (such as muscle strength measured 1, 5, 10, and 20 days after surgery).

ANOVA is typically used to compare three or more group means on a certain response variable. It can also be expanded to include additional explanatory variables and can assess their simultaneous effects on the response variable. Whereas the purpose of regression analyses is usually to predict the value of the response variable, the purpose of ANOVA is usually to compare groups for differences in the means of the response variable. ANOVA models are also usually reported in tables (Table 3).
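The comparison of group means in a one-way ANOVA rests on the F statistic, the ratio of the variability between groups to the variability within groups. A minimal sketch follows; the function name and data are invented for the illustration, and the p value is not computed here because it requires the F distribution, which a statistical package would supply.

```python
def one_way_anova_f(groups):
    """F statistic and degrees of freedom for a one-way ANOVA.

    groups: a list of lists, one list of response values per level of the factor.
    """
    k = len(groups)
    n = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n
    group_means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of individual values around their own group mean.
    ss_within = sum(
        (value - m) ** 2 for g, m in zip(groups, group_means) for value in g
    )
    df_between = k - 1
    df_within = n - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


# Hypothetical response values for three groups.
f_statistic, df_between, df_within = one_way_anova_f(
    [[1, 2, 3], [2, 3, 4], [5, 6, 7]]
)
```

The sums of squares, degrees of freedom, and F statistic computed here are the quantities usually shown in an ANOVA table.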



 
Table 3. A Table for Presenting the Results of a Two-Way ANOVA for Analyzing the Two Factors Group and Age*

 

Acknowledgements

This article draws heavily from How To Report Statistics in Medicine, by Tom Lang and Michelle Secic.1

Footnotes

The author receives royalties from the sale of How To Report Statistics in Medicine, from which this article is taken. He has no other conflicts of interest with the publication of this article.

Received for publication August 21, 2006. Accepted for publication August 24, 2006.

References

  1. Lang T, Secic M. How to report statistics in medicine. 2nd ed. Philadelphia, PA: American College of Physicians; 2006.
  2. Godfrey K. Simple linear regression in medical research. In: Bailar JC, Mosteller F, eds. Medical uses of statistics. 2nd ed. Boston, MA: NEJM Books; 1992:201-232.
  3. Altman DG, Gore SM, Gardner MJ, et al. Statistical guidelines for contributors to medical journals. BMJ 1983;286:1489-1493.
  4. Shutty M. Guidelines for presenting multivariate statistical analyses in rehabilitation psychology. Rehabil Psych 1994;39:141-144.
  5. Bagley SC, White H, Golomb BA. Logistic regression in the medical literature: standards for use and reporting, with particular attention to one medical domain. J Clin Epidemiol 2001;54:979-985.
  6. Hosmer DW, Taber S, Lemeshow S. The importance of assessing the fit of logistic regression models: a case study. Am J Public Health 1991;81:1630-1635.



