* From Tom Lang Communications and Training, Davis, CA.
Correspondence to: Tom Lang, MA, 1925 Donner Ave, No. 3, Davis, CA 95618; e-mail: tomlangcom{at}aol.com
Multivariate analyses include two broad statistical techniques, regression analysis and analysis of variance (ANOVA). The reporting guidelines for each are similar and here have been condensed from the book How To Report Statistics in Medicine.1
Reporting Regression Analysis
Regression analysis attempts to predict or estimate the value of a response variable or outcome from the known values of one or more explanatory variables or predictors. The type of regression analysis is determined by the number of explanatory (or independent) variables and of the response (or dependent) variables, as well as by the "level of measurement" of these variables.
The phrase level of measurement refers to the kind of information collected about a variable. Nominal data are categorical data with no inherent ranking, such as blood type (eg, A, B, AB, and O); ordinal data are categorical data that do have an inherent ranking, such as severity categories (eg, mild, moderate, and severe); and continuous data are measurements made on a continuous scale of equal intervals. The level of measurement can also be set by the researcher. For example, data on BP can be collected as a nominal variable (hypertensive or not hypertensive), an ordinal variable (hypotensive, normotensive, or hypertensive), or a continuous variable (systolic BP measured in millimeters of mercury).
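The BP example can be sketched in a few lines of Python. The function name and the cutoffs of 90 and 140 mm Hg are illustrative assumptions, not clinical definitions taken from this article.

```python
def bp_levels(systolic_mm_hg, low_cutoff=90, high_cutoff=140):
    """Express one systolic BP reading at three levels of measurement.

    The cutoffs are hypothetical values chosen only for illustration.
    """
    if systolic_mm_hg < low_cutoff:
        ordinal = "hypotensive"
    elif systolic_mm_hg < high_cutoff:
        ordinal = "normotensive"
    else:
        ordinal = "hypertensive"
    # The nominal version keeps only two unranked categories.
    nominal = "hypertensive" if systolic_mm_hg >= high_cutoff else "not hypertensive"
    return {"continuous": systolic_mm_hg, "ordinal": ordinal, "nominal": nominal}


print(bp_levels(152))
```

Each step from continuous to ordinal to nominal discards information, which is one reason the level of measurement should be reported.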
The most common types of regression analyses are as follows:
Guideline: Describe the Relationship of Interest or the Purpose of the Analysis
In addition to predicting one value from one or more others, regression analysis can be used to "control for" the potential confounding effects of explanatory variables that are associated with the response variable. For example, regression analysis can separate the effects of age and sex on survival after surgery.
Regression analysis can also be used to create risk scores. Here, the variables of the risk score are those of the regression equation, and the score itself is the value predicted by the regression model.
Guideline: Identify the Variables Used in the Analysis and Summarize Each With Descriptive Statistics
Continuous variables should be summarized with medians and ranges or interquartile ranges (or means and SDs if the data are normally distributed), and categorical data can be summarized with counts or percentages.
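A minimal sketch of these summaries, using only the Python standard library, might look as follows. The function names and the sample values are made up for illustration, and quartile definitions differ slightly among statistical packages.

```python
import statistics
from collections import Counter


def summarize_continuous(values):
    """Return the descriptive statistics named in the guideline."""
    q1, _median, q3 = statistics.quantiles(values, n=4)  # quartile cut points
    return {
        "median": statistics.median(values),
        "iqr": (q1, q3),
        "range": (min(values), max(values)),
        "mean": statistics.mean(values),
        "sd": statistics.stdev(values),  # sample SD (n - 1 denominator)
    }


def summarize_categorical(values):
    """Return {category: (count, percentage)}."""
    counts = Counter(values)
    total = len(values)
    return {category: (n, 100 * n / total) for category, n in counts.items()}


ages = [34, 41, 45, 52, 58, 63, 71]                 # hypothetical ages, y
blood_types = ["A", "O", "O", "B", "A", "O", "AB"]  # hypothetical blood types

print(summarize_continuous(ages))
print(summarize_categorical(blood_types))
```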
Guideline: Confirm That the Assumptions of the Analysis Were Met and State How Each Was Checked
A statement that the assumptions were verified and by which methods is all that need be included. There are both formal checks (eg, hypothesis tests) and informal checks (eg, inspection of graphs of residuals) for these assumptions. Sometimes, data that violate the assumptions can be adjusted (eg, with data transformations) to meet the assumptions. If such adjustments were made, they should be identified.
Guideline: Report How Any Missing Data Were Treated in the Analyses
Missing data can be a problem in multivariate analysis because they reduce the sample size unless corrective measures are taken. To create a model for predicting weight from age and height, for example, values for each of these variables must be collected for each patient. If age is missing from one patient, the patient is excluded from the analysis, and the sample size is reduced by one. In regression models with several variables, losses to missing data can be common.
However, missing data can be replaced in a process called imputation. Simple imputation methods include using the mean of all observed values for all people in place of the missing value; using the mean observed value for the same person in other time periods; using the mean of the previous and following values for the person, if they exist; or using the most recent observed value for the person (called the last-observation-carried-forward method, which is commonly used in pharmaceutical research). Other methods of imputing data are possible, but they should be based on sound judgment.
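Two of the simple methods above, mean imputation and last observation carried forward, can be sketched as follows. Missing values are represented by None, and the function names are hypothetical.

```python
def impute_mean(values):
    """Replace each missing value (None) with the mean of the observed values."""
    observed = [v for v in values if v is not None]
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]


def impute_locf(series):
    """Carry the last observed value forward through one person's time series.

    Values missing before the first observation stay missing, because there
    is nothing to carry forward yet.
    """
    filled = []
    last = None
    for v in series:
        if v is not None:
            last = v
        filled.append(last)
    return filled


print(impute_mean([70, None, 80]))
print(impute_locf([None, 5, None, None, 7, None]))
```

Whichever method is used, the report should name it and state how many values were imputed.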
Guideline: Report How Any Outlying Values Were Treated in the Analysis
Outliers are extreme values that appear to be anomalies. Outliers cannot be ignored: even a single outlier can have a profound effect on the relationship derived from the regression line.2,3 All outliers must be reported, but it is permissible to report the results with and without the outliers to indicate their effect on the results.
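The effect of a single outlier can be illustrated by fitting a least-squares line with and without it. The data below are invented for the illustration, and fit_line is a hypothetical helper, not a function from any statistical package.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]  # the last point is a deliberate outlier

slope_with, _ = fit_line(xs, ys)
slope_without, _ = fit_line(xs[:-1], ys[:-1])
print(f"slope with outlier:    {slope_with:.2f}")
print(f"slope without outlier: {slope_without:.2f}")
```

In this made-up example, one point out of six more than doubles the estimated slope, which is why results are best reported both ways.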
Guideline: Report the Regression Model
A simple linear regression equation can be reported in the text or in a scatter plot of the data. Multiple linear regression models can be reported as equations (Fig 1 ) or in tables (Table 1 ); logistic regression models are typically reported in tables because the equations are so complex (Table 2 ).
In regression analysis, the regression coefficient for an explanatory variable indicates how much the average value of the response variable, Y, varies with each unit change in the explanatory variable, X. The coefficient, or β-weight, is an estimate and so should be accompanied by a confidence interval that indicates its precision.
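For simple linear regression, the coefficient and its confidence interval can be sketched as below. The standard-error formula is the usual least-squares one and is not taken from this article. The data are invented, and the critical value is supplied by the caller because the Python standard library has no t distribution; 2.306 is the two-sided 95% t value for 8 degrees of freedom (n - 2 with 10 observations).

```python
import math


def slope_with_ci(xs, ys, critical_value):
    """Return (slope, lower, upper) for a simple linear regression.

    critical_value is the t quantile for n - 2 degrees of freedom,
    looked up by the caller.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    # Residual sum of squares around the fitted line.
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    standard_error = math.sqrt(ss_res / (n - 2) / sxx)
    half_width = critical_value * standard_error
    return slope, slope - half_width, slope + half_width


xs = list(range(1, 11))
ys = [2.1, 3.9, 6.2, 7.8, 10.1, 12.2, 13.8, 16.1, 18.0, 19.9]  # made-up data

slope, lower, upper = slope_with_ci(xs, ys, critical_value=2.306)
print(f"slope = {slope:.2f} (95% CI, {lower:.2f} to {upper:.2f})")
```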
Odds ratios are widely used in logistic regression analysis. For a binary explanatory variable, the odds ratio is the ratio of the odds that an event (eg, a heart attack) will occur in one group to the odds that the event will occur in the other group. An odds ratio of 1 means that the odds of the event are the same in both groups. The larger the odds ratio, the more likely the event is expected to occur in the group used in the numerator.
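The definition can be made concrete with a 2 x 2 table of counts. The sketch below computes an unadjusted odds ratio, not one estimated by a logistic regression model. The confidence interval uses the common log (Woolf) method, which is a choice made for this illustration; the counts are invented.

```python
import math


def odds_ratio(events_a, nonevents_a, events_b, nonevents_b, z=1.96):
    """Return (odds ratio, lower, upper), with group A in the numerator.

    The interval is computed on the log scale; z = 1.96 gives an
    approximate 95% confidence interval.
    """
    ratio = (events_a / nonevents_a) / (events_b / nonevents_b)
    se_log = math.sqrt(
        1 / events_a + 1 / nonevents_a + 1 / events_b + 1 / nonevents_b
    )
    lower = math.exp(math.log(ratio) - z * se_log)
    upper = math.exp(math.log(ratio) + z * se_log)
    return ratio, lower, upper


# Hypothetical counts: 20 of 100 patients in group A and 10 of 100 in group B
# had the event.
ratio, lower, upper = odds_ratio(20, 80, 10, 90)
print(f"OR = {ratio:.2f} (95% CI, {lower:.2f} to {upper:.2f})")
```

An interval that includes 1 indicates that the data are compatible with equal odds in the two groups.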
Guideline: Specify How the Explanatory Variables That Appear in the Final Regression Model Were Chosen
One of the first steps in building a multiple regression model is to identify the explanatory variables that are significantly related to the response variable.4 Several dozens of variables may be considered one at a time in this process, called univariate analysis. Often, a less-restrictive α-level, such as 0.1, is used in the univariate analysis to identify a broad range of explanatory variables that might be associated with the response variable. That is, variables with p values less than 0.1 on univariate analysis are considered for inclusion in the model.
The second step in building a regression model is to identify the best combination of explanatory variables to include in the model. In simultaneous regression, all of the explanatory variables are included in the model and are tested as a group. In hierarchical regression, the investigator defines the number and order in which the explanatory variables are entered into the model. Common automated procedures for selecting variables are forward, backward, stepwise, and best-subset techniques.
Guideline: In Multiple Regression Models, Specify Whether All Potential Explanatory Variables Were Assessed for Collinearity (Nonindependence)
The explanatory variables in a multiple linear regression equation should be independent of one another.4 If two or more explanatory variables are correlated, that is, if their regression lines are parallel or "collinear," then they are not independent. Collinear variables add much the same information to the model, so only one is needed. The variable with the strongest relationship with the response variable should be considered for inclusion in the final model.
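One informal screen for collinearity is to compute the correlation between every pair of candidate explanatory variables. The sketch below assumes a flagging threshold of 0.8, which is an arbitrary illustrative choice, and uses invented data; other diagnostics, such as variance inflation factors, are not shown.

```python
import math
from itertools import combinations


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def flag_collinear(variables, threshold=0.8):
    """Return (name_a, name_b, r) for each pair with |r| >= threshold."""
    flagged = []
    for (name_a, a), (name_b, b) in combinations(variables.items(), 2):
        r = pearson_r(a, b)
        if abs(r) >= threshold:
            flagged.append((name_a, name_b, r))
    return flagged


candidates = {  # hypothetical explanatory variables for five patients
    "height_cm": [150, 160, 170, 180, 190],
    "arm_span_cm": [152, 159, 171, 182, 189],
    "age_y": [40, 25, 60, 33, 51],
}
for name_a, name_b, r in flag_collinear(candidates):
    print(f"{name_a} and {name_b}: r = {r:.2f}")
```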
Guideline: In Multiple Regression Models, Specify Whether the Explanatory Variables Were Tested for Interaction
Two explanatory variables are said to interact if the effect of one explanatory variable on the response variable depends on the level of the second explanatory variable. Interaction implies that the variables should be considered together, not separately. So, for example, if alcohol interacts with antibiotics in the blood, the model should have a variable for blood alcohol level, one for blood antibiotic level, and an interaction term that expresses the relationship between blood alcohol and blood antibiotic levels.
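In a linear model, an interaction term is commonly entered as the product of the two variables. The sketch below shows why the effect of one variable then depends on the level of the other; the coefficients are hypothetical and have no clinical meaning.

```python
def predict(alcohol, antibiotic, b0, b_alcohol, b_antibiotic, b_interaction):
    """Predicted response from a model with two main effects and their product."""
    return (
        b0
        + b_alcohol * alcohol
        + b_antibiotic * antibiotic
        + b_interaction * alcohol * antibiotic
    )


def alcohol_effect(antibiotic, b_alcohol, b_interaction):
    """Change in the predicted response per unit increase in alcohol level."""
    return b_alcohol + b_interaction * antibiotic


# Hypothetical coefficients, chosen only to make the arithmetic easy to follow.
coefs = dict(b0=1.0, b_alcohol=0.5, b_antibiotic=0.25, b_interaction=2.0)

print(alcohol_effect(0, coefs["b_alcohol"], coefs["b_interaction"]))
print(alcohol_effect(2, coefs["b_alcohol"], coefs["b_interaction"]))
```

With no antibiotic present, a unit increase in alcohol changes the prediction by the main-effect coefficient alone; at an antibiotic level of 2, the same increase changes it by a larger amount because of the interaction term.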
Guideline: Provide a Measure of the "Goodness of Fit" of the Model to the Data
The predictive value of a regression model is affected by how well it "fits" the data.5,6 Thus, a measure of goodness of fit is useful because it reveals how well the model reflects the data on which it was created.
Simple linear regression analysis can be thought of as an extension of correlation analysis, except that now one variable is being used to predict the other with the addition of a regression line. As in correlation analysis, scatter plots can be useful for showing this relationship. The correlation coefficient itself can indicate indirectly how well the model can predict. Correlations have to be high, say, above 0.7, as well as statistically significant, if a simple linear regression model is to predict with any degree of accuracy.
In simple linear regression analysis, the correlation coefficient associated with the scatter plot is also useful in the form of the coefficient of determination (r²). This coefficient indicates how much of the variability in the response variable is explained by the explanatory variable. For example, if the correlation between skin-fold thickness and body fat is 0.8, then r² = 0.64, or 64%. That is, 64% of the variability in body fat can be accounted for by skin-fold thickness. In multiple linear regression analysis, the coefficient of multiple determination (R²) has the same function.
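The arithmetic of the skin-fold example, and the more general calculation of the coefficient of determination from observed and predicted values, can be sketched as follows. The function name is hypothetical, and the formula (1 minus the ratio of residual to total sum of squares) is the standard definition rather than one given in this article.

```python
def r_squared(observed, predicted):
    """Proportion of variability in the response explained by the model."""
    mean_y = sum(observed) / len(observed)
    ss_total = sum((y - mean_y) ** 2 for y in observed)
    ss_residual = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    return 1 - ss_residual / ss_total


# The example from the text: a correlation of 0.8 explains 64% of variability.
r = 0.8
print(f"r^2 = {r ** 2:.2f}")

# A model that predicts every point exactly explains all of the variability;
# a model that always predicts the mean explains none of it.
print(r_squared([1, 2, 3], [1, 2, 3]))
print(r_squared([1, 2, 3], [2, 2, 2]))
```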
A residual is the difference between the value predicted by the model and the actual value of the data point as collected. The smaller the residual, the better the prediction. Residuals can also be graphed to determine how well the assumption of linearity was met. Thus, a graph of residuals (one kind of "model diagnostic plot") in which the values are small for all values of X, meaning that they stay close to an average difference of zero, indicates that the assumption of linearity was met and that the model predicts reasonably well. Outliers can be assessed in the same way: they and their large residuals stand out on the graph as data points to investigate.
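Residuals are simple to compute once a model has been fitted. The sketch below uses a hypothetical fitted line (slope 2, intercept 1) and an arbitrary limit to flag points for investigation; both values, and the function names, are assumptions made for the illustration.

```python
def residuals(xs, ys, slope, intercept):
    """Observed value minus the value predicted by the line, for each point."""
    return [y - (intercept + slope * x) for x, y in zip(xs, ys)]


def flag_large(resids, limit):
    """Return the positions of residuals whose absolute value exceeds limit."""
    return [i for i, r in enumerate(resids) if abs(r) > limit]


xs = [1, 2, 3]
ys = [3, 5, 9]  # the third point lies above the hypothetical line y = 1 + 2x

resids = residuals(xs, ys, slope=2, intercept=1)
print(resids)
print(flag_large(resids, limit=1))
```

Plotting these residuals against X, rather than listing them, gives the diagnostic graph described above.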
Formal goodness-of-fit tests calculate a p value. If the p value is statistically significant, the model does not appropriately fit the data.
Guideline: Specify Whether the Model Was Validated
Regression models can be validated or tested against a similar set of data to show that they explain what they seek to explain. One method used when the sample is large is to develop the model on, say, 75% of the data, then to create another model on the remaining 25% of the data, and to determine whether the models are similar. Another method involves removing the data from one subject at a time and recalculating the model. The coefficients and the predictive validity of all the models can then be assessed. Such methods are called jack-knife procedures. A third method involves developing another model on a separate set of similar data and determining whether the models differ.
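The jack-knife idea of removing one subject at a time can be sketched for a simple linear regression as follows. The helper names and the data are invented for the illustration; only the slope is tracked here, although intercepts and predictions can be compared in the same way.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit of y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def jackknife_slopes(xs, ys):
    """Refit the line once per subject, each time leaving that subject out."""
    slopes = []
    for i in range(len(xs)):
        xs_without_i = xs[:i] + xs[i + 1:]
        ys_without_i = ys[:i] + ys[i + 1:]
        slopes.append(fit_line(xs_without_i, ys_without_i)[0])
    return slopes


xs = [1, 2, 3, 4, 5, 6]
ys = [2, 4, 6, 8, 10, 30]  # made-up data; the last subject is unusual

for left_out, slope in enumerate(jackknife_slopes(xs, ys)):
    print(f"without subject {left_out}: slope = {slope:.2f}")
```

If the recalculated coefficients are similar across all of the refits, the model is stable; a refit that differs sharply, as happens here when the last subject is removed, points to an influential observation.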
Guideline: Name the Statistical Package or Program Used in the Analysis
Although commercial statistical programs generally are validated and updated, and have met the test of time, the performance characteristics of privately developed programs are often unknown.
Reporting ANOVA
ANOVA is a form of hypothesis testing for studies involving two or more variables. It is closely related to regression analysis and should be reported according to the same general guidelines. Usually, ANOVA is used to assess categorical explanatory variables, whereas regression analysis is used to assess continuous explanatory variables. When a study includes both continuous and categorical explanatory variables, the analysis may be called multiple regression or analysis of covariance.
ANOVA is a "group comparison" that determines whether a statistically significant difference exists somewhere among the groups studied. If a significant difference is indicated, ANOVA is usually followed by a multiple comparison procedure that compares combinations of groups to examine further any differences among them.
The most common ANOVA procedures used in biomedical research are as follows:
ANOVA is typically used to compare three or more group means on a certain response variable. It can also be expanded to include additional explanatory variables and can assess their simultaneous effects on the response variable. Whereas the purpose of regression analyses is usually to predict the value of the response variable, the purpose of ANOVA is usually to compare groups for differences in the means of the response variable. ANOVA models are also usually reported in tables (Table 3 ).
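The comparison of group means in a one-way ANOVA rests on the F statistic, the ratio of between-group to within-group variability. A minimal sketch of that calculation follows; the three groups are invented, the function name is hypothetical, and the p value is omitted because the Python standard library has no F distribution.

```python
def one_way_anova_f(groups):
    """Return (F, df_between, df_within) for a one-way ANOVA."""
    k = len(groups)
    n_total = sum(len(g) for g in groups)
    grand_mean = sum(sum(g) for g in groups) / n_total
    group_means = [sum(g) / len(g) for g in groups]
    # Variability of the group means around the grand mean.
    ss_between = sum(
        len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, group_means)
    )
    # Variability of individual values around their own group mean.
    ss_within = sum(
        (v - m) ** 2 for g, m in zip(groups, group_means) for v in g
    )
    df_between = k - 1
    df_within = n_total - k
    f_statistic = (ss_between / df_between) / (ss_within / df_within)
    return f_statistic, df_between, df_within


groups = [[1, 2, 3], [2, 3, 4], [5, 6, 7]]  # hypothetical responses, 3 groups
f_statistic, df_between, df_within = one_way_anova_f(groups)
print(f"F({df_between}, {df_within}) = {f_statistic:.2f}")
```

A report would give the F statistic with both degrees of freedom and the corresponding p value, followed by the results of any multiple comparison procedure.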
Acknowledgements
This article draws heavily from How To Report Statistics in Medicine, by Tom Lang and Michelle Secic.1
Footnotes
The author receives royalties from the sale of How To Report Statistics in Medicine, from which this article is taken. He has no other conflicts of interest with the publication of this article.
Received for publication August 21, 2006. Accepted for publication August 24, 2006.
References