Regression and ANOVA
Lesson 18 of 20
Notes
Regression models describe the relationship between a response (dependent) variable and one or more explanatory (independent, predictor) variables. They are used to test hypotheses, make predictions, and control for confounding.
Simple linear regression (SLR) models the relationship between a continuous outcome (Y) and a continuous explanatory variable (x) using a straight line: Y = β₀ + β₁x + e. β₀ is the y-intercept (the mean of Y when x = 0); β₁ is the slope (the change in mean Y per unit increase in x); e is the error term, assumed to follow e ~ N(0, σₑ²). The variance σₑ² describes how much individual responses vary around their sub-population mean (the mean of Y at a given x).
Parameters are estimated by the method of least squares, which minimises the sum of squared residuals (the squared differences between observed values yᵢ and fitted values ŷᵢ). The residuals (estimated error terms), yᵢ − ŷᵢ, can be positive or negative.
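As a concrete illustration, here is a minimal Python sketch that simulates data from the model above and recovers β₀ and β₁ with the closed-form least-squares formulas. All parameter values (β₀ = 2, β₁ = 0.5, σ = 1, n = 50) are illustrative assumptions, not values from the lesson.

```python
# Minimal sketch: simulate Y = b0 + b1*x + e and estimate the coefficients
# with the closed-form least-squares formulas. Parameter values are assumed.
import numpy as np

rng = np.random.default_rng(42)
n, b0, b1, sigma = 50, 2.0, 0.5, 1.0
x = rng.uniform(0, 10, n)
y = b0 + b1 * x + rng.normal(0, sigma, n)   # e ~ N(0, sigma^2)

# Least squares: b1_hat = Sxy / Sxx, b0_hat = ybar - b1_hat * xbar
xbar, ybar = x.mean(), y.mean()
b1_hat = np.sum((x - xbar) * (y - ybar)) / np.sum((x - xbar) ** 2)
b0_hat = ybar - b1_hat * xbar

# Residuals y_i - yhat_i; with an intercept in the model they sum to ~0
residuals = y - (b0_hat + b1_hat * x)
print(f"b0_hat={b0_hat:.3f}, b1_hat={b1_hat:.3f}, SSR={np.sum(residuals**2):.2f}")
```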
The LINE assumptions of SLR: Linearity (mean response is a linear function of x); Independence (responses are independent); Normality (error terms are normally distributed); Equal variance (homoscedasticity — error terms have the same variance regardless of x). Failure of linearity is most critical (all conclusions invalid); failure of normality is least critical (especially with large samples). Homoscedasticity failure invalidates CIs and test results; independence failure requires sophisticated modelling.
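A rough sketch of how these assumptions are commonly checked in practice (this goes beyond the lesson itself): plot residuals against fitted values to eyeball linearity and equal variance, and apply a Shapiro–Wilk test for normality. The simulated data mirror the previous sketch.

```python
# Informal LINE checks on the residuals from a fitted SLR. Data and fit are
# simulated the same way as in the previous sketch.
import numpy as np
from scipy import stats
import matplotlib.pyplot as plt

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)
b1 = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
b0 = y.mean() - b1 * x.mean()
fitted = b0 + b1 * x
resid = y - fitted

# Linearity + equal variance: residuals vs fitted should be a patternless band
plt.scatter(fitted, resid)
plt.axhline(0, ls="--")
plt.xlabel("fitted values")
plt.ylabel("residuals")
plt.show()

# Normality: Shapiro-Wilk test on the residuals (least critical for large n)
print(stats.shapiro(resid))
```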
Hypothesis testing in SLR: H₀: β₁ = 0 (no linear relationship); H_A: β₁ ≠ 0. The test statistic is T = β̂₁ / SE(β̂₁), following a t-distribution with df = n − 2. The CI for β₁ uses the t-multiplier with df = n − 2.
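The slope test and CI can be sketched with scipy.stats.linregress, which reports SE(β̂₁) and the two-sided p-value for H₀: β₁ = 0; the data are simulated as in the earlier sketches.

```python
# Minimal sketch: t-test and 95% CI for the slope via scipy.stats.linregress.
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
x = rng.uniform(0, 10, 50)
y = 2.0 + 0.5 * x + rng.normal(0, 1.0, 50)

res = stats.linregress(x, y)
t_stat = res.slope / res.stderr                 # T = b1_hat / SE(b1_hat)
df = len(x) - 2
p = 2 * stats.t.sf(abs(t_stat), df)             # matches res.pvalue
tmult = stats.t.ppf(0.975, df)                  # t-multiplier with df = n - 2
ci = (res.slope - tmult * res.stderr, res.slope + tmult * res.stderr)
print(f"t={t_stat:.2f}, p={p:.4f}, 95% CI=({ci[0]:.3f}, {ci[1]:.3f})")
```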
Strength of evidence vs decision errors: p < 0.01 is strong evidence against H₀; 0.01 < p < 0.05 is some evidence against H₀. A type I error (false positive) is rejecting H₀ when it is true; its probability is α. A type II error (false negative) is failing to reject H₀ when it is false; its probability is β. Power = 1 − β, the probability of correctly rejecting a false H₀. Power increases with larger sample size n, larger effect size (true difference), lower variability σ, and larger significance level α. The type I error rate is controlled by setting α; the type II error rate is controlled through sample-size calculation, and power should typically be 80–90%.
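A hedged sketch of the power and sample-size calculations mentioned above, using statsmodels' TTestIndPower for a two-sample t-test (the same qualitative trade-offs apply to the slope test in SLR); the effect size of 0.5 (Cohen's d) and α = 0.05 are illustrative assumptions.

```python
# Power/sample-size trade-off for a two-sample t-test. Effect size and alpha
# are assumed for illustration.
from statsmodels.stats.power import TTestIndPower

analysis = TTestIndPower()

# Sample size per group needed to reach 80% and 90% power
for target in (0.80, 0.90):
    n = analysis.solve_power(effect_size=0.5, alpha=0.05, power=target)
    print(f"power={target:.0%}: n per group ~ {n:.1f}")

# Power achieved at a fixed n; it rises with n, effect size, and alpha,
# and falls with higher variability (which shrinks the effect size)
print(analysis.power(effect_size=0.5, nobs1=64, alpha=0.05))
```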
Analysis of variance (ANOVA) compares the means of three or more groups (continuous response, categorical predictor). H₀: all group means are equal; H_A: not all means are equal. Total variation (TSS) = between-group variation (GSS, also called explained SS) + within-group variation (RSS). The F statistic = (GSS/df_GSS) / (RSS/df_RSS), where df_GSS = k − 1 and df_RSS = n − k for k groups and n total observations. Larger F gives more evidence against H₀. ANOVA assumptions: independence, normality, equal variance (homoscedasticity). Blocking removes between-block variation from the residual variation, improving power.
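To make the F statistic concrete, here is a sketch that computes GSS, RSS, and F by hand for three simulated groups and checks the result against scipy.stats.f_oneway; the group means and sizes are assumed for illustration.

```python
# One-way ANOVA on three illustrative groups: F computed by hand
# (between-group MS / within-group MS) and verified against scipy.
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
groups = [rng.normal(mu, 1.0, 20) for mu in (5.0, 5.5, 6.2)]  # assumed means

all_y = np.concatenate(groups)
grand = all_y.mean()
k, n = len(groups), len(all_y)

gss = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)   # between groups
rss = sum(((g - g.mean()) ** 2).sum() for g in groups)        # within groups
F = (gss / (k - 1)) / (rss / (n - k))
p = stats.f.sf(F, k - 1, n - k)

F_sp, p_sp = stats.f_oneway(*groups)                          # same result
print(f"F={F:.3f} (scipy: {F_sp:.3f}), p={p:.4f}")
```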