Molecular Genetics & the Genomic Toolbox

Genetics and Genomics, Lecture 4. This lecture covers the core molecular techniques used in clinical and research genetics: PCR, Sanger sequencing, next-generation sequencing (NGS), variant classification, and pharmacogenomics.

POLYMERASE CHAIN REACTION (PCR)

PCR is the foundational technique for amplifying specific DNA sequences from minute quantities of starting material. The reaction requires: a DNA template; two oligonucleotide primers (approximately 18-25 bp) that flank the target region; a heat-stable DNA polymerase (classically Taq polymerase from Thermus aquaticus, which evolved to function optimally at 72°C — the extension temperature); deoxynucleotide triphosphates (dNTPs: dATP, dCTP, dGTP, dTTP); and a buffered reaction containing Mg²⁺ (essential cofactor for Taq polymerase activity).

The PCR cycle has three steps repeated typically 30-40 times:

Step 1 (denaturation): the temperature is raised to approximately 94-95°C for 20-30 seconds. The hydrogen bonds holding the double-stranded DNA together are disrupted, separating the two strands into single-stranded template.

Step 2 (annealing): the temperature is lowered to approximately 55-65°C for 20-30 seconds. The exact temperature depends on the primer melting temperature (Tm) and is typically set about 5°C below the calculated Tm. The primers anneal to their complementary sequences on the single-stranded template by Watson-Crick base pairing. Specificity is determined here: a primer binds stably to the target sequence only if there is sufficient complementarity. Mismatches at the 3' end of the primer are particularly disruptive because the polymerase extends from that end.

Step 3 (extension): the temperature is raised to 72°C (optimal for Taq) for 30 seconds to several minutes depending on amplicon size (approximately 1 minute per 1 kb). Taq polymerase extends from the 3' end of the primer, synthesising the complementary strand by incorporating dNTPs. The polymerase reads the template 3'→5' and synthesises the new strand 5'→3'.

After n cycles, the number of copies of the target region is approximately 2^n. After 30 cycles, a single template molecule yields approximately 10^9 copies — sufficient for downstream analysis.
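The doubling arithmetic can be checked with a short sketch. The function name `pcr_copies` and the `efficiency` parameter are illustrative assumptions, not part of the lecture; an efficiency of 1.0 reproduces the ideal 2^n case, and lower values model the incomplete doubling seen in real reactions.

```python
def pcr_copies(initial_copies: float, cycles: int, efficiency: float = 1.0) -> float:
    """Expected copy number after `cycles` rounds of PCR.

    efficiency = 1.0 means perfect doubling every cycle (the 2^n case);
    values below 1.0 are an illustrative model of incomplete doubling.
    """
    if not 0.0 <= efficiency <= 1.0:
        raise ValueError("efficiency must be between 0 and 1")
    return initial_copies * (1.0 + efficiency) ** cycles


if __name__ == "__main__":
    for n in (10, 20, 30):
        print(f"{n} cycles, ideal doubling: {pcr_copies(1, n):.3g} copies")
    print(f"30 cycles at 90% efficiency: {pcr_copies(1, 30, efficiency=0.9):.3g} copies")
```

With ideal doubling, 30 cycles from one template gives 2^30, a little over 10^9 copies. At 90% efficiency the yield is only about a fifth of that, which is one reason the 2^n figure is best read as an upper bound.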

Primer design principles: primers should be 18-25 bp; approximately 50-60% GC content; Tm 55-65°C; avoid self-complementarity (hairpins) or inter-primer complementarity (primer dimers); the 3' end should be specific for the intended target and typically end in G or C (GC clamp) for stable extension.
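The design rules above can be turned into a rough automated check. This is a minimal sketch: the function names and the example primer are hypothetical, and the Tm estimate uses the Wallace rule (2°C per A/T, 4°C per G/C), a crude approximation that was not given in the lecture; real primer-design tools use nearest-neighbour thermodynamics. Hairpins and primer dimers are not checked here.

```python
def gc_fraction(seq: str) -> float:
    """Fraction of bases that are G or C."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)


def wallace_tm(seq: str) -> int:
    """Rough Tm in °C by the Wallace rule: 2*(A+T) + 4*(G+C)."""
    seq = seq.upper()
    at = seq.count("A") + seq.count("T")
    gc = seq.count("G") + seq.count("C")
    return 2 * at + 4 * gc


def check_primer(seq: str) -> dict:
    """Apply the length, GC-content and GC-clamp rules from the notes."""
    seq = seq.upper()
    return {
        "length_ok": 18 <= len(seq) <= 25,         # 18-25 bases
        "gc_ok": 0.50 <= gc_fraction(seq) <= 0.60,  # approximately 50-60% GC
        "gc_clamp": seq[-1] in "GC",                # 3' end is G or C
        "tm_wallace": wallace_tm(seq),              # compare with the 55-65°C target
    }


if __name__ == "__main__":
    # Hypothetical 20-mer, invented for illustration only.
    print(check_primer("AGCTGACCTGAGATCCATGC"))
```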

Reverse transcriptase PCR (RT-PCR): RNA is first converted to complementary DNA (cDNA) by reverse transcriptase enzyme, then amplified by PCR. Used to detect gene expression (mRNA) rather than genomic DNA; detects spliced transcripts; useful for RNA viruses (e.g., SARS-CoV-2 RT-PCR testing). Allows amplification across exon-exon boundaries, which is not possible from genomic DNA.

Quantitative PCR (qPCR / real-time PCR): monitors PCR product accumulation in real time using a fluorescent reporter. The cycle threshold (Ct value) is the PCR cycle at which fluorescence crosses a set threshold. Ct is inversely related to the starting template quantity: it falls linearly as the logarithm of the starting quantity rises, so a lower Ct means more starting material (with perfect doubling, a 10-fold increase in template lowers Ct by about 3.3 cycles). Standard curves (serial dilutions of known concentrations) allow absolute quantification. Applications: viral load quantification (HIV, CMV, hepatitis B); gene expression analysis; copy number variation analysis; minimal residual disease monitoring in leukaemia. SYBR Green binds double-stranded DNA non-specifically; TaqMan probes are sequence-specific (higher specificity, preferred when quantifying a specific target in a complex background).
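Absolute quantification from a standard curve can be sketched as a straight-line fit of Ct against log10 of the known quantities. The function names and the dilution-series numbers below are invented for illustration. The efficiency formula 10^(-1/slope) - 1 is the standard conversion from slope to amplification efficiency, but it was not stated in the lecture.

```python
import math


def fit_standard_curve(quantities, cts):
    """Least-squares fit of Ct = slope * log10(quantity) + intercept."""
    xs = [math.log10(q) for q in quantities]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(cts) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, cts))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def quantity_from_ct(ct, slope, intercept):
    """Invert the standard curve to estimate the starting quantity."""
    return 10 ** ((ct - intercept) / slope)


def amplification_efficiency(slope):
    """Efficiency implied by the slope; 1.0 corresponds to perfect doubling."""
    return 10 ** (-1.0 / slope) - 1.0


if __name__ == "__main__":
    # Invented 10-fold dilution series (copies per reaction) and Ct values.
    standards = [1e6, 1e5, 1e4, 1e3, 1e2]
    measured_ct = [15.00, 18.32, 21.64, 24.96, 28.28]
    slope, intercept = fit_standard_curve(standards, measured_ct)
    print(f"slope = {slope:.2f} cycles per 10-fold dilution")
    print(f"efficiency = {amplification_efficiency(slope):.2f}")
    print(f"unknown with Ct 23.3 = {quantity_from_ct(23.3, slope, intercept):.3g} copies")
```

A slope near -3.3 indicates close to perfect doubling; shallower or steeper slopes suggest the assay needs optimisation before the curve is used for quantification.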

SANGER SEQUENCING

Sanger sequencing (chain termination sequencing, developed by Frederick Sanger, 1977) was the gold standard for DNA sequencing for three decades and remains in use for targeted validation. Principle: a primer-extension reaction similar to PCR is set up, but with a single sequencing primer, and in addition to normal dNTPs a small amount of chain-terminating dideoxynucleotide triphosphates (ddNTPs) is included. ddNTPs lack the 3'-OH group required for phosphodiester bond formation, so when a ddNTP is incorporated, chain elongation terminates. In modern (automated) Sanger sequencing, the four ddNTPs (ddATP, ddCTP, ddGTP, ddTTP) are each labelled with a different fluorescent dye. The reaction generates a mixture of fragments of all possible lengths, each ending with a fluorescently labelled ddNTP at its 3' end. These fragments are separated by size using capillary electrophoresis (high voltage applied across a thin capillary), and a fluorescence detector reads the dye colour at the end of each fragment as it passes, producing an electropherogram (chromatogram) with coloured peaks corresponding to each base. The sequence is read from the electropherogram in 5'→3' order.

Reading the electropherogram: each peak represents a nucleotide. The sequence is called from left (5') to right (3'). A heterozygous variant appears as two overlapping peaks at a single position. Limitations: maximum read length approximately 800-900 bp (signal degrades over distance); cannot detect variants present in a minority of cells (sensitivity approximately 20% allele fraction); technically requires pure PCR product; cost-inefficient for large-scale sequencing.

Clinical applications: confirmatory sequencing of a specific variant identified by another method; carrier testing in a family with a known pathogenic variant; sequencing of small genes or individual exons (e.g., BRCA1 exon 11, which is longer than a single Sanger read and is therefore covered by several overlapping amplicons); sequencing of PCR products to confirm NGS findings.

NEXT-GENERATION SEQUENCING (NGS)

NGS (also called massively parallel sequencing) enables simultaneous sequencing of millions to billions of DNA fragments. The key conceptual advance is that the entire sequencing reaction is parallelised: instead of one sequencing reaction per capillary, millions of reactions occur simultaneously on a solid surface (flow cell), giving throughput many orders of magnitude greater than Sanger sequencing.

Library preparation: the starting DNA (or RNA, after reverse transcription) is fragmented (mechanically by sonication, or enzymatically), size-selected to typically 150-300 bp, end-repaired, and ligated to sequencing adapters at both ends. Each adapter contains a universal primer binding site and a sample-specific index sequence (sample barcode) that allows multiple samples to be pooled and sequenced together (multiplexing), with reads assigned back to their sample afterwards. The library is then PCR-amplified to generate sufficient quantity.

Cluster amplification (Illumina platform): the library is loaded onto a flow cell whose surface is coated with oligonucleotides complementary to the sequencing adapters. Each DNA fragment anneals to the surface, and bridge amplification generates clonal clusters of approximately 1000 copies of each fragment on the flow cell — providing sufficient signal for detection.

Sequencing by synthesis (SBS — Illumina): DNA synthesis proceeds using reversible terminator nucleotides. Each dNTP is labelled with a fluorescent dye and blocked at the 3'-OH so that only one base is incorporated per cycle. After each incorporation, the identity of the added base is determined by fluorescence imaging; the dye and blocking group are then chemically cleaved, allowing the next cycle. This process is repeated for the length of the read — typically 100-300 bp per read in paired-end mode (both ends of the fragment sequenced).

Short-read vs long-read sequencing: Illumina (short-read, 100-300 bp per read): high accuracy (>99.9% per base after quality filtering); very high throughput; excellent for SNV and indel detection; lower cost per base. Oxford Nanopore Technologies (ONT, long-read, reads of 10-100+ kb): single-molecule sequencing by measuring ionic current changes as DNA passes through a protein pore; lower per-base accuracy but excellent for structural variant detection, repeat expansion sizing, phasing, and direct methylation detection; the portable MinION device allows sequencing outside a core laboratory, and higher-throughput nanopore instruments can sequence a whole human genome within hours.

Bioinformatics pipeline (FASTQ to VCF): raw sequencing data is output as FASTQ files (text files containing read sequences and per-base quality scores encoded as ASCII characters). The analysis pipeline is:

(1) Quality control, trimming and filtering of the FASTQ reads (FastQC for quality reports, Trimmomatic for trimming).
(2) Alignment (mapping) to the reference genome (hg38/GRCh38) using BWA-MEM or a similar aligner; produces a BAM (binary alignment map) file.
(3) Duplicate marking: PCR duplicates are identified by identical start/end coordinates and marked by Picard/GATK.
(4) Base quality score recalibration (BQSR).
(5) Variant calling: identifies SNVs (single nucleotide variants) and indels (insertions/deletions) relative to the reference genome (GATK HaplotypeCaller for germline; Mutect2 for somatic); produces a VCF (variant call format) file.
(6) Variant annotation (ANNOVAR, VEP): adds gene names, predicted effect, population frequency, and existing ClinVar/dbSNP classifications.
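To make the FASTQ quality encoding concrete, the sketch below parses four-line FASTQ records and decodes the ASCII quality string into Phred scores. It assumes the Phred+33 offset used by current Illumina output and the usual relation Q = -10 log10(P error), neither of which is spelled out in the notes. The function names and the example read are invented, and real pipelines would use an established parser.

```python
import io


def read_fastq(handle):
    """Yield (read_id, sequence, quality_string) from 4-line FASTQ records."""
    while True:
        header = handle.readline().rstrip("\n")
        if not header:
            return  # end of file (or a blank line) ends parsing in this sketch
        seq = handle.readline().rstrip("\n")
        handle.readline()  # the '+' separator line
        qual = handle.readline().rstrip("\n")
        yield header[1:], seq, qual  # drop the leading '@'


def phred_scores(qual, offset=33):
    """Decode an ASCII quality string into integer Phred scores."""
    return [ord(char) - offset for char in qual]


def error_probability(q):
    """Probability that a base call with Phred score q is wrong."""
    return 10 ** (-q / 10)


if __name__ == "__main__":
    example = io.StringIO("@read1\nACGT\n+\nII5#\n")  # invented record
    for read_id, seq, qual in read_fastq(example):
        for base, q in zip(seq, phred_scores(qual)):
            print(read_id, base, q, f"{error_probability(q):.4f}")
```

A Phred score of 20 corresponds to a 1 in 100 chance of a wrong base call and 30 to 1 in 1,000, which is why trimming tools typically cut reads where quality drops below a chosen threshold.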

WHOLE GENOME vs WHOLE EXOME vs GENE PANEL

Whole genome sequencing (WGS): sequences all ~3.2 billion base pairs of the human genome. Coverage typically 30× (each base sequenced an average of 30 times for germline). Detects: SNVs, indels, structural variants (SVs), copy number variants (CNVs), non-coding variants affecting regulatory elements and splicing, repeat expansions. Applications: rare undiagnosed disease, research, comprehensive cancer genomics. Cost: ~NZD $1,000-2,000 for sequencing; bioinformatics costs additional. Diagnostic yield in rare disease: ~25-40%.

Whole exome sequencing (WES): enriches and sequences only the protein-coding regions (exons) of the genome (roughly 1-2% of the genome; about 30 Mb of coding sequence, with capture designs that add flanking regions covering somewhat more). Coverage typically 100×. Detects: SNVs and indels in coding regions. Misses: non-coding variants, deep intronic splicing variants, and most structural variants; CNVs are detected only unreliably. Cost: approximately half of WGS. Diagnostic yield for Mendelian disease: ~25-35%, similar to WGS for protein-coding variants.
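The coverage figures for WGS and WES follow from simple arithmetic: mean coverage is the total number of sequenced bases divided by the size of the target. The sketch below assumes 150 bp reads, a 3.2 Gb genome and a 30 Mb exome target. It ignores duplicates, off-target reads and uneven capture, so real experiments need more reads than this lower bound. The function names are illustrative.

```python
import math


def mean_coverage(n_reads, read_length_bp, target_size_bp):
    """Average number of times each target base is sequenced."""
    return n_reads * read_length_bp / target_size_bp


def reads_needed(target_coverage, read_length_bp, target_size_bp):
    """Minimum number of reads for a given mean coverage (no losses assumed)."""
    return math.ceil(target_coverage * target_size_bp / read_length_bp)


if __name__ == "__main__":
    genome = 3_200_000_000  # ~3.2 billion bp
    exome = 30_000_000      # ~30 Mb, assumed target size
    print("WGS 30x: ", reads_needed(30, 150, genome), "reads")
    print("WES 100x:", reads_needed(100, 150, exome), "reads")
```

Under these assumptions a 30× genome needs about 640 million reads while a 100× exome needs about 20 million, which illustrates why exome sequencing costs less despite its deeper coverage.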

Gene panels: NGS targeting a curated list of genes relevant to a specific clinical indication (e.g., a hereditary breast/ovarian cancer panel: BRCA1, BRCA2, PALB2, ATM, CHEK2, BRIP1, RAD51C, RAD51D; a cardiomyopathy panel: MYH7, MYBPC3, TNNT2, etc.). Coverage very high (>200×), low cost per test, fast turnaround, straightforward clinical interpretation. Limitation: limited to known genes; misses variants in genes not on the panel.

ACMG VARIANT CLASSIFICATION

The American College of Medical Genetics and Genomics (ACMG) and the Association for Molecular Pathology (AMP) published a landmark 2015 framework for classifying sequence variants into 5 tiers:

Pathogenic: strong evidence that the variant causes disease.
Likely pathogenic: evidence supports pathogenicity but is not definitive.
Variant of uncertain significance (VUS): insufficient evidence for classification.
Likely benign: evidence supports benignity.
Benign: strong evidence that the variant does not cause disease.

Classification uses a system of weighted criteria. Key criteria include:

PVS1: null variant (nonsense, frameshift, canonical splice site) in a gene where loss-of-function is the known disease mechanism (very strong pathogenic evidence).
PS1: same amino acid change as a known pathogenic variant (strong pathogenic evidence).
PM2: variant absent or at very low frequency in population databases such as gnomAD (moderate pathogenic evidence).
PP3: multiple computational tools predict a damaging effect (SIFT, PolyPhen-2, CADD, REVEL) (supporting pathogenic evidence).
BA1: variant present at >5% allele frequency in a large, diverse population database (stand-alone benign).
BS4: variant fails to segregate with disease in affected family members, i.e. affected relatives do not carry it (strong benign evidence).
BP7: synonymous variant with no predicted splice impact (supporting benign evidence).

Rules for combining criteria determine the final classification. VUS is the default when evidence does not reach the threshold for likely pathogenic or likely benign. VUS rates are higher for non-European populations because variant databases are predominantly of European ancestry.
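The combining rules are not reproduced in these notes. As an illustration of how weighted criteria turn into a tier, the sketch below encodes a simplified reading of the 2015 ACMG/AMP combining table: it counts criteria by strength prefix (PVS, PS, PM, PP, BA, BS, BP) and treats conflicting pathogenic and benign evidence as VUS. The function name is hypothetical, and the sketch ignores later refinements such as modified strength levels, so it should be checked against the published guideline before any real use.

```python
from collections import Counter


def classify(criteria):
    """Return an ACMG/AMP tier for a list of met criteria, e.g. ["PVS1", "PM2"]."""
    # Strip the trailing number so "PVS1" counts as one "PVS" criterion.
    counts = Counter(code.rstrip("0123456789") for code in criteria)
    pvs, ps, pm, pp = (counts[k] for k in ("PVS", "PS", "PM", "PP"))
    ba, bs, bp = (counts[k] for k in ("BA", "BS", "BP"))

    pathogenic = (
        (pvs >= 1 and (ps >= 1 or pm >= 2 or (pm >= 1 and pp >= 1) or pp >= 2))
        or ps >= 2
        or (ps >= 1 and (pm >= 3 or (pm >= 2 and pp >= 2) or (pm >= 1 and pp >= 4)))
    )
    likely_pathogenic = (
        (pvs >= 1 and pm >= 1)
        or (ps >= 1 and pm >= 1)
        or (ps >= 1 and pp >= 2)
        or pm >= 3
        or (pm >= 2 and pp >= 2)
        or (pm >= 1 and pp >= 4)
    )
    benign = ba >= 1 or bs >= 2
    likely_benign = (bs >= 1 and bp >= 1) or bp >= 2

    # Evidence meeting both a pathogenic and a benign rule is contradictory.
    if (pathogenic or likely_pathogenic) and (benign or likely_benign):
        return "VUS"
    if pathogenic:
        return "Pathogenic"
    if likely_pathogenic:
        return "Likely pathogenic"
    if benign:
        return "Benign"
    if likely_benign:
        return "Likely benign"
    return "VUS"  # default when no threshold is reached


if __name__ == "__main__":
    for evidence in (["PVS1", "PM2"], ["PVS1", "PM2", "PP3"], ["PM2", "PP3"], ["BA1"]):
        print(evidence, "->", classify(evidence))
```

The third example shows why VUS is so common: a rare missense variant with damaging predictions (PM2 plus PP3) does not reach the likely pathogenic threshold on its own.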

PHARMACOGENOMICS

Pharmacogenomics (PGx) is the study of how genetic variation influences drug response, including drug metabolism, efficacy, and adverse effects.

CYP2D6 (codeine and opioid analgesia): CYP2D6 is responsible for metabolising approximately 25% of clinically used drugs. It converts the prodrug codeine to its active metabolite morphine (O-demethylation). Four pharmacokinetic phenotypes are recognised based on CYP2D6 genotype:

Poor metabolisers (PMs): carry two non-functional alleles (e.g., *4/*4, *5/*4); cannot convert codeine to morphine and receive minimal analgesia from codeine; the response to tramadol and some antidepressants (TCAs, paroxetine) is also affected.
Intermediate metabolisers (IMs): carry reduced-function alleles (for example, one reduced-function allele paired with a non-functional allele); reduced conversion.
Extensive metabolisers (EMs, also called normal metabolisers): two functional alleles; the most common phenotype; expected drug response.
Ultra-rapid metabolisers (UMs): carry duplicated functional alleles (e.g., *1xN); convert codeine to morphine very rapidly and completely, with a risk of morphine toxicity (overdose, respiratory depression). This is particularly dangerous in breastfeeding mothers, because morphine passes into breast milk and can cause neonatal respiratory depression; fatal cases have been reported.

CYP2D6 is also relevant to tamoxifen (requires conversion to active endoxifen) and to many antidepressants and antipsychotics.
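The genotype-to-phenotype logic can be sketched with an activity-score approach: each allele is given a function value, the values are summed across the diplotype, and the total is binned into a phenotype. Everything specific in the sketch is an assumption for illustration: the allele values, the inclusion of *41 as a reduced-function allele, the score thresholds, and the convention of writing duplications with an explicit copy number such as `*1x2`. Clinical genotype translation should rely on current guideline tables, not on this example.

```python
# Assumed activity values per allele copy (illustrative only).
ALLELE_ACTIVITY = {
    "*1": 1.0,   # normal function
    "*41": 0.5,  # reduced function (assumed example allele)
    "*4": 0.0,   # no function
    "*5": 0.0,   # no function
}


def activity_score(diplotype):
    """Sum allele activities for a diplotype such as '*1/*4' or '*1/*1x2'."""
    total = 0.0
    for allele in diplotype.split("/"):
        name, _, copies = allele.partition("x")  # '*1x2' -> ('*1', 'x', '2')
        n_copies = int(copies) if copies else 1
        total += ALLELE_ACTIVITY[name] * n_copies
    return total


def phenotype(diplotype):
    """Bin the activity score into a phenotype (thresholds are assumptions)."""
    score = activity_score(diplotype)
    if score == 0:
        return "poor metaboliser"
    if score < 1.25:
        return "intermediate metaboliser"
    if score <= 2.25:
        return "extensive (normal) metaboliser"
    return "ultra-rapid metaboliser"


if __name__ == "__main__":
    for genotype in ("*4/*4", "*4/*41", "*1/*1", "*1/*1x2"):
        print(genotype, "->", phenotype(genotype))
```

For codeine, the two extreme bins carry the clinical message: poor metabolisers get little analgesia, and ultra-rapid metabolisers are at risk of morphine toxicity.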

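HLA pharmacogenomics: certain human leucocyte antigen (HLA) alleles predispose to immune-mediated adverse drug reactions. The HLA molecule encoded by the risk allele presents the drug or one of its metabolites (or a drug-altered peptide) to T cells, which become activated and drive the reaction.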

HLA-B*5701 and abacavir: abacavir (NRTI antiretroviral) causes a severe, potentially fatal hypersensitivity reaction (AHR) in approximately 5-8% of unscreened patients. It is characterised by fever, rash, gastrointestinal symptoms, and respiratory symptoms, typically within 6 weeks of initiation; symptoms resolve on stopping the drug, but re-exposure causes a life-threatening immediate reaction. HLA-B*5701 (written HLA-B*57:01 in current nomenclature) has a negative predictive value approaching 100% for immunologically confirmed AHR: patients who lack the allele essentially do not develop the reaction, although not every carrier reacts if exposed. Prospective HLA-B*5701 screening before prescribing abacavir has been adopted as mandatory in New Zealand and most Western countries, virtually eliminating AHR. HLA-B*5701 prevalence varies by ethnicity: ~5-8% in Europeans, lower in Asian and African populations.

HLA-B*1502 and carbamazepine: HLA-B*1502 is strongly associated with carbamazepine-induced Stevens-Johnson syndrome (SJS) and toxic epidermal necrolysis (TEN) in Han Chinese, Thai, and other Southeast and East Asian populations (~8% frequency). The association is not seen in Europeans. Pre-prescription screening for HLA-B*1502 before carbamazepine in at-risk Asian populations is recommended and required in several countries (Singapore, Hong Kong, Taiwan). Alternative anticonvulsants should be used in HLA-B*1502-positive individuals.

Additional pharmacogenomic examples:

TPMT and NUDT15: azathioprine and 6-mercaptopurine toxicity (see earlier lesson).
G6PD deficiency: haemolytic anaemia triggered by primaquine, rasburicase, dapsone, and high-dose aspirin; relevant in Pacific, African, and Mediterranean populations in New Zealand; G6PD testing is required before primaquine for malaria relapse prevention.
SLCO1B1: statin-induced myopathy; the SLCO1B1*5 variant impairs hepatic uptake of simvastatin, increasing plasma concentration and muscle toxicity risk; relevant to prescribing high-dose simvastatin.
