
Advanced Statistical Methods

This lecture extends simple linear regression to prediction, introduces multiple regression and logistic regression, and covers the correlation coefficient and R².

In simple linear regression (SLR), two types of inference are made from a fitted line. A point estimate of the mean response at a given x value is accompanied by a confidence interval; a point prediction of an individual response is accompanied by a prediction interval. Prediction intervals are always wider than CIs because they must account for both the uncertainty in the estimated mean (captured by the CI) and the additional variation of an individual around that mean (the error term). Both intervals are narrowest at the mean of the observed x values and widen as x moves away from it. Extrapolation (using the regression equation for x values outside the observed data range) should be avoided, because the model may not hold outside that range.
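
A minimal sketch of the two interval types, assuming statsmodels is available; the data are simulated and the variable names are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
x = rng.uniform(20, 60, 50)                   # e.g. ages of 50 subjects
y = 2.0 + 0.5 * x + rng.normal(0, 3, 50)      # linear signal plus noise

fit = sm.OLS(y, sm.add_constant(x)).fit()

# Intervals at x = 40 (inside the observed range; no extrapolation)
x_new = sm.add_constant(np.array([40.0]), has_constant='add')
pred = fit.get_prediction(x_new)
print(pred.summary_frame(alpha=0.05))
# mean_ci_lower / mean_ci_upper: 95% CI for the mean response at x = 40
# obs_ci_lower / obs_ci_upper:   95% prediction interval for an individual
#                                (always wider than the CI)
```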

Multiple regression extends SLR to include several explanatory variables x₁, x₂, ..., xₖ: Y = β₀ + β₁x₁ + ... + βₖxₖ + ε. Applications include adjusting for confounding variables, identifying important predictors, prediction, and describing associations. Each βⱼ is interpreted as the change in mean Y for a unit increase in xⱼ, holding all other predictors constant. The error variance is estimated with df = n − k − 1 (one degree of freedom lost for each estimated parameter, including the intercept). Including statistically non-significant variables adds noise; variable selection removes them.
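
A minimal sketch of a multiple regression fit, again with statsmodels and simulated, hypothetical data:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(50, 10, n)                    # e.g. age
x2 = rng.normal(25, 4, n)                     # e.g. BMI
y = 120 + 0.4 * x1 + 0.9 * x2 + rng.normal(0, 5, n)   # e.g. systolic BP

X = sm.add_constant(np.column_stack([x1, x2]))
fit = sm.OLS(y, X).fit()
print(fit.summary())
print(fit.df_resid)   # n - k - 1 = 100 - 2 - 1 = 97
# Each fitted slope is the change in mean y per unit increase in that
# predictor, holding the other predictor constant.
```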

The correlation coefficient r measures the strength and direction of a linear relationship between two continuous variables, ranging from −1 to +1. It is symmetric (the correlation of X with Y equals the correlation of Y with X) and, unlike regression, cannot be used for prediction. The coefficient of determination R² = r² is the proportion of variation in Y explained by the regression model, ranging from 0 (the model explains nothing) to 1 (the model explains all variation). Multiple regression yields a larger R² than SLR when the additional predictors are meaningful; note that R² never decreases when predictors are added, so an increase alone does not justify a larger model.
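
A small illustration with scipy of the identity R² = r² in SLR, using simulated data:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
x = rng.normal(0, 1, 80)
y = 3 * x + rng.normal(0, 2, 80)

r, p = stats.pearsonr(x, y)        # symmetric: pearsonr(y, x) gives the same r
res = stats.linregress(x, y)       # SLR of y on x
print(f"r = {r:.3f}, R^2 = {r**2:.3f}")
assert np.isclose(r**2, res.rvalue**2)   # R^2 equals the squared correlation
```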

Logistic regression is used when the outcome variable is binary. It models the log-odds (logit) of the outcome probability π as a linear function of the predictors: logit(π) = ln(π / (1 − π)) = β₀ + β₁x. Odds are constrained to be positive, but the log-odds range over the whole real line (−∞ to +∞), so a linear model can be fitted on that scale. Parameters are estimated by maximum likelihood rather than least squares. The hypothesis test H₀: β₁ = 0 uses a test statistic that follows a standard normal (Z) distribution rather than t, because logistic regression does not estimate a residual variance σ². Logistic regression can also be applied with a binary predictor, which is equivalent to analysing a 2×2 contingency table.
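
A minimal sketch of a logistic fit with statsmodels, showing maximum-likelihood estimation and the Wald Z statistic; the data and variable names are hypothetical:

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(3)
n = 200
x = rng.normal(50, 10, n)                         # e.g. age
log_odds = -5 + 0.1 * x                           # true logit(pi)
y = rng.binomial(1, 1 / (1 + np.exp(-log_odds)))  # binary outcome

fit = sm.Logit(y, sm.add_constant(x)).fit()   # maximum likelihood, not least squares
print(fit.summary())           # 'z' column: Wald statistic beta_hat / SE,
                               # referred to N(0, 1) since no sigma^2 is estimated
print(np.exp(fit.params[1]))   # odds ratio per unit increase in x
```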
