Biostatistics Fundamentals
Lesson 8 of 11
Introduction to Biostatistics
Biostatistics is the application of statistical methods to biological, health, and medical data. For public health and clinical medicine, a working knowledge of biostatistics is essential for interpreting research evidence, evaluating screening tests, understanding epidemiological studies, and applying evidence-based practice. This lecture covers descriptive statistics, measures of central tendency and spread, the normal distribution, hypothesis testing, confidence intervals, and diagnostic test characteristics.
Measures of Central Tendency
Three principal measures describe the centre of a data distribution:
Mean (arithmetic mean): the sum of all values divided by the number of values. Sensitive to outliers: a single extreme value pulls the mean substantially. Best used when data are normally distributed (symmetric).
Example: blood pressures of 5 patients (mmHg): 110, 120, 125, 130, 170.
Mean = (110+120+125+130+170) / 5 = 655 / 5 = 131 mmHg.
Note: 170 pulls the mean above the middle value.
Median: the middle value when data are arranged in ascending order. If n is even, the median is the average of the two middle values. Robust to outliers: not influenced by extreme values. Best used for skewed distributions (income, length of hospital stay, survival times).
Example: ordered data = 110, 120, 125, 130, 170. Median = 125 mmHg (the third value).
Mode: the most frequently occurring value. Can be used for any data type including categorical data. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal.
When to use each: symmetrical distributions → mean; skewed distributions → median; categorical data → mode. For skewed clinical data such as income or hospital wait times, the median is almost always preferable.
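These measures can be computed with Python's standard `statistics` module; the sketch below reuses the blood-pressure example (the mode list is a separate hypothetical sample, since the original data have no repeated values):

```python
import statistics

# Systolic BP readings (mmHg) from the worked example above
bp = [110, 120, 125, 130, 170]

mean_bp = statistics.mean(bp)      # (110+120+125+130+170) / 5 = 131
median_bp = statistics.median(bp)  # middle of the sorted values = 125

# The mode needs repeated values to be meaningful; this list is hypothetical
mode_bp = statistics.mode([110, 120, 120, 130, 170])  # most frequent value

print(mean_bp, median_bp, mode_bp)
```

Because 170 is an outlier, the computed mean (131) sits above the median (125), illustrating why the median is preferred for skewed data.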
Measures of Spread (Variability)
Range: maximum minus minimum. Simple but highly sensitive to outliers.
Interquartile range (IQR): the range of the middle 50% of data (Q3 − Q1). Robust to outliers. Reported alongside the median. Example: IQR for the above data = 130 − 120 = 10 mmHg.
Variance: the average of the squared deviations from the mean. Units are squared (e.g., mmHg²), so variance is not directly interpretable.
Variance = Σ(xᵢ − x̄)² / (n − 1) [sample variance uses n − 1 for unbiasedness]
Standard deviation (SD): the square root of the variance. Same units as the original data, so directly interpretable. For a normal distribution: ~68% of observations lie within ±1 SD; ~95% within ±2 SD; ~99.7% within ±3 SD (the "68-95-99.7 rule").
Standard error of the mean (SEM): SEM = SD / √n. Measures how precisely the sample mean estimates the population mean. As n increases, SEM decreases: larger samples give more precise estimates. SEM is used to construct confidence intervals for the mean.
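The spread measures follow directly in Python; note that `statistics.quantiles` with `method="inclusive"` reproduces the hand calculation of the IQR above (the default "exclusive" method interpolates differently on such a small sample):

```python
import math
import statistics

bp = [110, 120, 125, 130, 170]  # same illustrative data as above
n = len(bp)

variance = statistics.variance(bp)  # sample variance, divides by n - 1
sd = statistics.stdev(bp)           # square root of the variance, in mmHg
sem = sd / math.sqrt(n)             # standard error of the mean

# IQR: "inclusive" matches the hand calculation above (130 - 120 = 10)
q1, q2, q3 = statistics.quantiles(bp, n=4, method="inclusive")
iqr = q3 - q1

print(variance, round(sd, 1), round(sem, 1), iqr)
```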
The Normal Distribution
The normal (Gaussian) distribution is a symmetric, bell-shaped distribution characterised by its mean (μ) and standard deviation (σ). Many biological variables approximate normality (height, blood pressure, serum creatinine in a healthy population).
Properties:
- Symmetric about the mean (mean = median = mode)
- Defined by μ and σ alone
- Approximately 68% of values fall within μ ± 1σ
- Approximately 95% fall within μ ± 1.96σ
- Approximately 99.7% fall within μ ± 3σ
The z-score standardises any observation: z = (x − μ) / σ. A z-score of +2 means the observation is 2 SD above the mean, placing it in approximately the top 2.3% of a normal distribution.
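The z-score and its tail probability can be checked without external libraries; the sketch below uses the complementary error function to get the upper-tail area of the standard normal (the function names and BP numbers are ours, for illustration):

```python
import math

def z_score(x, mu, sigma):
    """Standardise an observation against a normal distribution."""
    return (x - mu) / sigma

def upper_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical example: systolic BP with population mean 120, SD 15
z = z_score(150, 120, 15)       # 2.0, i.e. two SDs above the mean
print(round(upper_tail(z), 4))  # roughly the top 2.3% quoted above
```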
Confidence Intervals
A confidence interval (CI) is a range of plausible values for a population parameter (e.g., mean, proportion, relative risk, odds ratio) estimated from a sample. A 95% CI means that if we repeated the study 100 times with different random samples, approximately 95 of the 100 resulting CIs would contain the true population parameter.
95% CI for a mean: x̄ ± 1.96 × SEM = x̄ ± 1.96 × (SD/√n)
Interpretation: the CI does NOT mean there is a 95% probability the true mean lies within this specific interval; in the frequentist framework, the true mean either is or is not in the interval. The CI tells us our uncertainty about the estimate given this sample.
Wider CI = more uncertainty (small sample, high variability). Narrower CI = more precision (large sample, low variability).
Clinical interpretation: if a 95% CI for an odds ratio crosses 1.0 (e.g., OR = 1.4, 95% CI 0.8–2.4), the result is not statistically significant at α = 0.05, because the null value (OR = 1) remains plausible. If the CI does not cross the null (e.g., OR = 1.4, 95% CI 1.1–1.8), the result is statistically significant.
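A minimal sketch of the CI formula, applied to the illustrative blood-pressure sample (n = 5 is far too small for the normal approximation in real work; the numbers are for arithmetic only):

```python
import math
import statistics

def ci_95_mean(data):
    """95% CI for a mean: x-bar +/- 1.96 * SD/sqrt(n) (normal approximation)."""
    sem = statistics.stdev(data) / math.sqrt(len(data))
    mean = statistics.mean(data)
    return mean - 1.96 * sem, mean + 1.96 * sem

lo, hi = ci_95_mean([110, 120, 125, 130, 170])
print(round(lo, 1), round(hi, 1))  # wide interval: small n, high variability
```

The wide interval reflects the point made above: small samples and high variability both inflate the SEM and hence the CI.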
Hypothesis Testing and P-values
Null hypothesis (H₀): the default assumption of no effect or no difference. Example: H₀ = "the new drug has the same effect as placebo."
Alternative hypothesis (H₁): what we are testing for. Example: H₁ = "the new drug reduces systolic BP more than placebo."
Test statistic: a numerical summary of the data that measures how far the observed results are from what H₀ predicts. Examples: t-statistic (t-test), z-statistic, chi-squared statistic.
P-value: the probability of observing results as extreme as (or more extreme than) those obtained, assuming H₀ is true. A small p-value indicates the observed results are unlikely under H₀, providing evidence against H₀.
Significance threshold (α): conventionally α = 0.05. If p < 0.05, we reject H₀ and conclude the result is "statistically significant."
Critical misconceptions:
- p-value is NOT the probability that H₀ is true
- p-value is NOT the probability that the result occurred by chance
- Statistical significance ≠ clinical significance (a large trial may give p < 0.001 for a trivially small effect)
- A non-significant result (p > 0.05) does NOT prove H₀; it means there is insufficient evidence to reject it
Type I error (α): rejecting H₀ when it is true, i.e. a false positive. Probability = α (conventionally 0.05).
Type II error (β): failing to reject H₀ when it is false, i.e. a false negative. Probability = β (often 0.20).
Power: 1 − β = the probability of correctly detecting a real effect (often set to 0.80 or 0.90 in study planning).
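The pieces above fit together in a one-sample z-test. This sketch (hypothetical numbers, and a known-population-SD assumption that a real analysis would replace with a t-test) computes a two-sided p-value and compares it to α:

```python
import math

def z_test_p(sample_mean, mu0, sd, n):
    """Two-sided p-value for a one-sample z-test (population SD assumed known)."""
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)

# Hypothetical trial: observed mean SBP 126 mmHg in n = 100 patients,
# against H0 mean 130 mmHg with SD 15 mmHg
p = z_test_p(126, 130, 15, 100)
alpha = 0.05
print(round(p, 4), p < alpha)  # reject H0 when p < alpha
```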
Sensitivity, Specificity, and Predictive Values
These measures characterise diagnostic test performance:
Sensitivity: the proportion of true positives correctly identified by the test.
Sensitivity = TP / (TP + FN) = TP / all diseased
A highly sensitive test has few false negatives, so it is good for ruling OUT disease ("SnNOut": if a Sensitive test is Negative, rule Out disease).
Specificity: the proportion of true negatives correctly identified.
Specificity = TN / (TN + FP) = TN / all disease-free
A highly specific test has few false positives, so it is good for ruling IN disease ("SpPIn": if a Specific test is Positive, rule In disease).
Positive predictive value (PPV): the probability that a positive test result truly reflects disease.
PPV = TP / (TP + FP) = TP / all test-positive
PPV depends strongly on disease prevalence: in low-prevalence settings, even a specific test has low PPV (most positives are false positives).
Negative predictive value (NPV): the probability that a negative test result truly reflects absence of disease.
NPV = TN / (TN + FN) = TN / all test-negative
NPV is higher when prevalence is low.
Likelihood ratios (LR): more stable across populations (they do not depend on prevalence):
LR+ = Sensitivity / (1 − Specificity): how much a positive result increases the odds of disease
LR− = (1 − Sensitivity) / Specificity: how much a negative result decreases the odds of disease
Correlation and Regression
Pearson correlation coefficient (r): measures the strength and direction of linear association between two continuous variables. Range: −1 to +1. r = +1: perfect positive linear relationship; r = −1: perfect negative; r = 0: no linear relationship.
Interpreting r: |r| < 0.3 = weak; 0.3–0.7 = moderate; > 0.7 = strong. Correlation does NOT imply causation.
Linear regression: models the relationship between a continuous outcome (Y) and one or more predictors (X). Simple linear regression: Y = α + βX + ε. β (the regression coefficient) is the expected change in Y per unit increase in X.
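Both quantities reduce to sums of deviations from the means. A self-contained sketch, with hypothetical age/BP data for illustration (function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linreg(xs, ys):
    """Least-squares fit Y = a + bX; returns (intercept a, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical data: age (years) vs systolic BP (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 129, 136, 142]

r = pearson_r(age, sbp)
a, b = linreg(age, sbp)
print(round(r, 3), round(a, 1), round(b, 2))
```

Here the slope b is read exactly as the text describes: the expected change in SBP per additional year of age in this (made-up) sample.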