Biostatistics Fundamentals
Lesson 8 of 11
Introduction to Biostatistics
Biostatistics is the application of statistical methods to biological, health, and medical data. For public health and clinical medicine, a working knowledge of biostatistics is essential for interpreting research evidence, evaluating screening tests, understanding epidemiological studies, and applying evidence-based practice. This lecture covers descriptive statistics, measures of central tendency and spread, the normal distribution, hypothesis testing, confidence intervals, and diagnostic test characteristics.
Measures of Central Tendency
Three principal measures describe the centre of a data distribution:
Mean (arithmetic mean): the sum of all values divided by the number of values. Sensitive to outliers: a single extreme value pulls the mean substantially. Best used when data are normally distributed (symmetric).
Example: blood pressures of 5 patients (mmHg): 110, 120, 125, 130, 170.
Mean = (110+120+125+130+170) / 5 = 655 / 5 = 131 mmHg.
Note: 170 pulls the mean above the middle value.
Median: the middle value when data are arranged in ascending order. If n is even, the median is the average of the two middle values. Robust to outliers: not influenced by extreme values. Best used for skewed distributions (income, length of hospital stay, survival times).
Example: ordered data = 110, 120, 125, 130, 170. Median = 125 mmHg (the third value).
Mode: the most frequently occurring value. Can be used for any data type including categorical data. A distribution can be unimodal (one peak), bimodal (two peaks), or multimodal.
When to use each: symmetrical distributions → mean; skewed distributions → median; categorical data → mode. For skewed clinical data such as income or hospital wait times, the median is almost always preferable.
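These measures can be computed with Python's standard `statistics` module; the sketch below reuses the blood-pressure example (the mode list is a separate hypothetical sample, since the original data have no repeated values):

```python
import statistics

# Systolic BP readings (mmHg) from the worked example above
bp = [110, 120, 125, 130, 170]

mean_bp = statistics.mean(bp)      # (110+120+125+130+170) / 5 = 131
median_bp = statistics.median(bp)  # middle of the sorted values = 125

# The mode needs repeated values to be meaningful; this list is hypothetical
mode_bp = statistics.mode([110, 120, 120, 130, 170])  # most frequent value

print(mean_bp, median_bp, mode_bp)
```

Because 170 is an outlier, the computed mean (131) sits above the median (125), illustrating why the median is preferred for skewed data.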
Measures of Spread (Variability)
Range: maximum minus minimum. Simple but highly sensitive to outliers.
Interquartile range (IQR): the range of the middle 50% of data (Q3 − Q1). Robust to outliers. Reported alongside the median. Example: IQR for the above data = 130 − 120 = 10 mmHg.
Variance: the average of the squared deviations from the mean. Units are squared (e.g., mmHg²), so variance is not directly interpretable.
Variance = Σ(xᵢ − x̄)² / (n − 1) [sample variance uses n − 1 for unbiasedness]
Standard deviation (SD): the square root of the variance. Same units as the original data, so directly interpretable. For a normal distribution: ~68% of observations lie within ±1 SD; ~95% within ±2 SD; ~99.7% within ±3 SD (the "68-95-99.7 rule").
Standard error of the mean (SEM): SEM = SD / √n. Measures how precisely the sample mean estimates the population mean. As n increases, SEM decreases: larger samples give more precise estimates. SEM is used to construct confidence intervals for the mean.
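The spread measures follow directly in Python; note that `statistics.quantiles` with `method="inclusive"` reproduces the hand calculation of the IQR above (the default "exclusive" method interpolates differently on such a small sample):

```python
import math
import statistics

bp = [110, 120, 125, 130, 170]  # same illustrative data as above
n = len(bp)

variance = statistics.variance(bp)  # sample variance, divides by n - 1
sd = statistics.stdev(bp)           # square root of the variance, in mmHg
sem = sd / math.sqrt(n)             # standard error of the mean

# IQR: "inclusive" matches the hand calculation above (130 - 120 = 10)
q1, q2, q3 = statistics.quantiles(bp, n=4, method="inclusive")
iqr = q3 - q1

print(variance, round(sd, 1), round(sem, 1), iqr)
```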
The Normal Distribution
The normal (Gaussian) distribution is a symmetric, bell-shaped distribution characterised by its mean (μ) and standard deviation (σ). Many biological variables approximate normality (height, blood pressure, serum creatinine in a healthy population).
Properties:
- Symmetric about the mean (mean = median = mode)
- Defined by μ and σ alone
- Approximately 68% of values fall within μ ± 1σ
- Approximately 95% fall within μ ± 1.96σ
- Approximately 99.7% fall within μ ± 3σ
The z-score standardises any observation: z = (x − μ) / σ. A z-score of +2 means the observation is 2 SD above the mean, placing it in approximately the top 2.3% of a normal distribution.
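The z-score and its tail probability can be checked without external libraries; the sketch below uses the complementary error function to get the upper-tail area of the standard normal (the function names and BP numbers are ours, for illustration):

```python
import math

def z_score(x, mu, sigma):
    """Standardise an observation against a normal distribution."""
    return (x - mu) / sigma

def upper_tail(z):
    """P(Z > z) for a standard normal, via the complementary error function."""
    return 0.5 * math.erfc(z / math.sqrt(2))

# Hypothetical example: systolic BP with population mean 120, SD 15
z = z_score(150, 120, 15)       # 2.0, i.e. two SDs above the mean
print(round(upper_tail(z), 4))  # roughly the top 2.3% quoted above
```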
Confidence Intervals
A confidence interval (CI) is a range of plausible values for a population parameter (e.g., mean, proportion, relative risk, odds ratio) estimated from a sample. A 95% CI means that if we repeated the study 100 times with different random samples, approximately 95 of the 100 resulting CIs would contain the true population parameter.
95% CI for a mean: x̄ ± 1.96 × SEM = x̄ ± 1.96 × (SD/√n)
Interpretation: the CI does NOT mean there is a 95% probability the true mean lies within this specific interval; in the frequentist framework, the true mean either is or is not in the interval. The CI tells us our uncertainty about the estimate given this sample.
Wider CI = more uncertainty (small sample, high variability). Narrower CI = more precision (large sample, low variability).
Clinical interpretation: if a 95% CI for an odds ratio crosses 1.0 (e.g., OR = 1.4, 95% CI 0.8–2.4), the result is not statistically significant at α = 0.05, because the null value (OR = 1) remains plausible. If the CI does not cross the null (e.g., OR = 1.4, 95% CI 1.1–1.8), the result is statistically significant.
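A minimal sketch of the CI formula, applied to the illustrative blood-pressure sample (n = 5 is far too small for the normal approximation in real work; the numbers are for arithmetic only):

```python
import math
import statistics

def ci_95_mean(data):
    """95% CI for a mean: x-bar +/- 1.96 * SD/sqrt(n) (normal approximation)."""
    sem = statistics.stdev(data) / math.sqrt(len(data))
    mean = statistics.mean(data)
    return mean - 1.96 * sem, mean + 1.96 * sem

lo, hi = ci_95_mean([110, 120, 125, 130, 170])
print(round(lo, 1), round(hi, 1))  # wide interval: small n, high variability
```

The wide interval reflects the point made above: small samples and high variability both inflate the SEM and hence the CI.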
Hypothesis Testing and P-values
Null hypothesis (H₀): the default assumption of no effect or no difference. Example: H₀ = "the new drug has the same effect as placebo."
Alternative hypothesis (H₁): what we are testing for. Example: H₁ = "the new drug reduces systolic BP more than placebo."
Test statistic: a numerical summary of the data that measures how far the observed results are from what H₀ predicts. Examples: t-statistic (t-test), z-statistic, chi-squared statistic.
P-value: the probability of observing results as extreme as (or more extreme than) those obtained, assuming H₀ is true. A small p-value indicates the observed results are unlikely under H₀, providing evidence against H₀.
Significance threshold (α): conventionally α = 0.05. If p < 0.05, we reject H₀ and conclude the result is "statistically significant."
Critical misconceptions:
- p-value is NOT the probability that H₀ is true
- p-value is NOT the probability that the result occurred by chance
- Statistical significance ≠ clinical significance (a large trial may give p < 0.001 for a trivially small effect)
- A non-significant result (p > 0.05) does NOT prove H₀; it means there is insufficient evidence to reject it
Type I error (α): rejecting H₀ when it is true, i.e. a false positive. Probability = α (conventionally 0.05).
Type II error (β): failing to reject H₀ when it is false, i.e. a false negative. Probability = β (often 0.20).
Power: 1 − β = the probability of correctly detecting a real effect (often set to 0.80 or 0.90 in study planning).
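The pieces above fit together in a one-sample z-test. This sketch (hypothetical numbers, and a known-population-SD assumption that a real analysis would replace with a t-test) computes a two-sided p-value and compares it to α:

```python
import math

def z_test_p(sample_mean, mu0, sd, n):
    """Two-sided p-value for a one-sample z-test (population SD assumed known)."""
    z = (sample_mean - mu0) / (sd / math.sqrt(n))
    return math.erfc(abs(z) / math.sqrt(2))  # = 2 * P(Z > |z|)

# Hypothetical trial: observed mean SBP 126 mmHg in n = 100 patients,
# against H0 mean 130 mmHg with SD 15 mmHg
p = z_test_p(126, 130, 15, 100)
alpha = 0.05
print(round(p, 4), p < alpha)  # reject H0 when p < alpha
```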
Sensitivity, Specificity, and Predictive Values
These measures characterise diagnostic test performance:
Sensitivity: the proportion of true positives correctly identified by the test.
Sensitivity = TP / (TP + FN) = TP / all diseased
A highly sensitive test has few false negatives, so it is good for ruling OUT disease ("SnNOut": if a Sensitive test is Negative, rule Out disease).
Specificity: the proportion of true negatives correctly identified.
Specificity = TN / (TN + FP) = TN / all disease-free
A highly specific test has few false positives, so it is good for ruling IN disease ("SpPIn": if a Specific test is Positive, rule In disease).
Positive predictive value (PPV): the probability that a positive test result truly reflects disease.
PPV = TP / (TP + FP) = TP / all test-positive
PPV depends strongly on disease prevalence: in low-prevalence settings, even a specific test has low PPV (most positives are false positives).
Negative predictive value (NPV): the probability that a negative test result truly reflects absence of disease.
NPV = TN / (TN + FN) = TN / all test-negative
NPV is higher when prevalence is low.
Likelihood ratios (LR): more stable across populations (they do not depend on prevalence):
LR+ = Sensitivity / (1 − Specificity): how much a positive result increases the odds of disease
LR− = (1 − Sensitivity) / Specificity: how much a negative result decreases the odds of disease
Correlation and Regression
Pearson correlation coefficient (r): measures the strength and direction of linear association between two continuous variables. Range: −1 to +1. r = +1: perfect positive linear relationship; r = −1: perfect negative; r = 0: no linear relationship.
Interpreting r: |r| < 0.3 = weak; 0.3–0.7 = moderate; > 0.7 = strong. Correlation does NOT imply causation.
Linear regression: models the relationship between a continuous outcome (Y) and one or more predictors (X). Simple linear regression: Y = α + βX + ε. β (the regression coefficient) is the expected change in Y per unit increase in X.
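Both quantities reduce to sums of deviations from the means. A self-contained sketch, with hypothetical age/BP data for illustration (function names are ours):

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length samples."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def linreg(xs, ys):
    """Least-squares fit Y = a + bX; returns (intercept a, slope b)."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b

# Hypothetical data: age (years) vs systolic BP (mmHg)
age = [30, 40, 50, 60, 70]
sbp = [118, 124, 129, 136, 142]

r = pearson_r(age, sbp)
a, b = linreg(age, sbp)
print(round(r, 3), round(a, 1), round(b, 2))
```

Here the slope b is read exactly as the text describes: the expected change in SBP per additional year of age in this (made-up) sample.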