Pipeline Phases

A deep dive into each of the six analysis phases, from basecalling to reporting.

Phase Overview

Phase 1: Basecalling

Convert raw signal (FAST5/POD5) to sequences (FASTQ).

  • Tool: Dorado Duplex
  • Duration: 2-4 hours
  • Instance Type: g4dn.xlarge (GPU)

Phase 2: Quality Control

Assess read quality metrics and filter low-quality reads.

  • Tool: PycoQC, NanoPlot
  • Duration: 10-15 minutes
  • Instance Type: t3.large

Phase 3: Host Removal

Align reads to the Sus scrofa genome and remove host DNA.

  • Tool: Minimap2, SAMtools
  • Duration: 30-60 minutes
  • Instance Type: r5.xlarge

Phase 4: Pathogen Detection

Multi-database screening for PMDA pathogens.

  • Tool: Kraken2, BLAST, Diamond
  • Duration: 1-2 hours
  • Instance Type: r5.4xlarge

Phase 5: Quantification

Absolute copy number calculation with spike-in normalization.

  • Tool: Custom Python scripts
  • Duration: 15-30 minutes
  • Instance Type: t3.large

Phase 6: Reporting

Generate PMDA-compliant reports in PDF, JSON, and HTML.

  • Tool: ReportLab, WeasyPrint
  • Duration: 10-15 minutes
  • Instance Type: t3.medium
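
As a rough planning aid, the per-phase durations listed above can be summed into an end-to-end range. The sketch below assumes the six phases run strictly one after another with no queueing or instance start-up time; the function names are illustrative and not part of the pipeline.

```python
# Per-phase duration ranges in minutes, copied from the phase overview.
PHASE_MINUTES = {
    "Basecalling": (120, 240),
    "Quality Control": (10, 15),
    "Host Removal": (30, 60),
    "Pathogen Detection": (60, 120),
    "Quantification": (15, 30),
    "Reporting": (10, 15),
}


def total_runtime(phases):
    """Return (min_minutes, max_minutes) summed over all phases."""
    low = sum(lo for lo, _ in phases.values())
    high = sum(hi for _, hi in phases.values())
    return low, high


def fmt(minutes):
    """Format a minute count as 'H h MM min'."""
    hours, mins = divmod(minutes, 60)
    return f"{hours} h {mins:02d} min"


low, high = total_runtime(PHASE_MINUTES)
print(f"End-to-end: {fmt(low)} to {fmt(high)}")
```

Under these assumptions a run takes roughly four to eight hours, with basecalling accounting for about half of the total.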

Phase 1: Basecalling

Dorado Duplex Mode

Q30+ accuracy basecalling

Converts raw electrical signals from FAST5/POD5 files into nucleotide sequences (FASTQ) using Oxford Nanopore's Dorado basecaller in duplex mode for maximum accuracy.

Key Features

  • Duplex Mode: Basecalls both strands of each DNA molecule and combines them for up to 99.9% accuracy (Q30+)
  • GPU Acceleration: NVIDIA T4 GPU on g4dn.xlarge instance
  • Real-time Processing: Can process data as it's generated
  • Quality Filtering: Automatically filters reads below Q9
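
To make the Q9 quality filter concrete, here is a minimal sketch of a mean-quality filter over FASTQ records. It is not the pipeline's own implementation: the function names are hypothetical, Phred+33 encoding is assumed, and the mean is taken over error probabilities (the usual convention for read-level Q-scores) rather than over the raw Q values.

```python
import math


def mean_qscore(quality_string, offset=33):
    """Mean Phred quality of a read, averaged in error-probability space."""
    if not quality_string:
        return 0.0
    probs = [10 ** (-(ord(ch) - offset) / 10) for ch in quality_string]
    mean_error = sum(probs) / len(probs)
    return -10 * math.log10(mean_error)


def filter_fastq(lines, min_q=9.0):
    """Yield (header, seq, qual) for records whose mean quality is >= min_q.

    `lines` is any iterable of FASTQ lines (for example an open file handle);
    records are assumed to span exactly four lines each.
    """
    records = zip(*[iter(lines)] * 4)
    for header, seq, _plus, qual in records:
        header, seq, qual = (s.rstrip("\n") for s in (header, seq, qual))
        if mean_qscore(qual) >= min_q:
            yield header, seq, qual
```

Averaging error probabilities matters: a two-base read with qualities Q40 and Q0 has a mean of about Q3 under this definition, not Q20, so a single very poor base cannot hide behind a good one.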

Script Example

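basecall_duplex.sh
#!/usr/bin/env bash
# Basecalling with Dorado Duplex
set -euo pipefail

DORADO_BIN=/opt/dorado/bin/dorado
# Dorado takes a model name or a path to a downloaded model directory,
# not a Guppy-style .cfg file. In recent releases "sup" selects the
# super-accuracy model for the detected chemistry.
MODEL=sup
INPUT_DIR=/data/fast5
OUTPUT_DIR=/data/fastq

mkdir -p "$OUTPUT_DIR"

# Run duplex basecalling (Dorado writes BAM by default; --emit-fastq requests FASTQ)
"$DORADO_BIN" duplex \
  --device cuda:0 \
  --emit-fastq \
  "$MODEL" \
  "$INPUT_DIR" > "$OUTPUT_DIR/basecalled.fastq"

# Generate sequencing summary
python3 generate_summary.py \
  --input "$OUTPUT_DIR/basecalled.fastq" \
  --output "$OUTPUT_DIR/sequencing_summary.txt"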

Output Metrics

  • Total Reads: 50,000+
  • Mean Quality: Q10.5
  • Total Bases: 150 Mb
  • N50 Length: 3.5 kb

Phase 2: Quality Control

PycoQC & NanoPlot

Read quality assessment

Comprehensive quality control analysis to ensure data meets minimum standards before proceeding to analysis phases.

Quality Thresholds

  • ✓ Minimum reads: 10,000
  • ✓ Mean quality score: Q9+
  • ✓ Q30 reads: >10%
  • ✓ Read length N50: >2 kb
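
The four thresholds above can be expressed as a small pass/fail check. The sketch below is illustrative only: `n50` and `check_qc` are hypothetical helpers, and the inputs (read lengths, mean quality, fraction of Q30 reads) are assumed to have been extracted from the QC reports beforehand.

```python
def n50(lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input


def check_qc(read_lengths, mean_quality, q30_fraction):
    """Evaluate the documented QC thresholds and return per-check results."""
    checks = {
        "min_reads": len(read_lengths) >= 10_000,   # Minimum reads: 10,000
        "mean_quality": mean_quality >= 9.0,        # Mean quality score: Q9+
        "q30_fraction": q30_fraction > 0.10,        # Q30 reads: >10%
        "n50": n50(read_lengths) > 2_000,           # Read length N50: >2 kb
    }
    checks["passed"] = all(checks.values())
    return checks
```

Returning each check separately, rather than a single boolean, makes it easier to report which threshold caused a run to fail.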

QC Reports Generated

# PycoQC HTML report
pycoQC -f sequencing_summary.txt -o pycoQC_report.html

# NanoPlot visualization
NanoPlot --fastq basecalled.fastq --plots kde --legacy hex dot

# Output files:
# - NanoPlot-report.html
# - Read length distribution
# - Quality score distribution
# - Yield over time

Phase 3: Host Genome Removal

Minimap2 Alignment

Sus scrofa genome depletion

Aligns reads to the porcine reference genome (Sus scrofa, assembly Sscrofa11.1) and removes host reads to enrich for pathogen sequences.

Alignment & Filtering

# Align to host genome
minimap2 -ax map-ont \
  /data/references/sus_scrofa_11.1.mmi \
  basecalled.fastq > aligned.sam

# Extract unmapped reads (potential pathogens); -f 4 keeps only records flagged as unmapped
samtools fastq -f 4 aligned.sam > unmapped.fastq

# Calculate depletion statistics
# A FASTQ record spans 4 lines, so divide line counts by 4 to get read counts
python3 calculate_depletion_stats.py \
  --total $(( $(wc -l < basecalled.fastq) / 4 )) \
  --unmapped $(( $(wc -l < unmapped.fastq) / 4 ))
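
The internals of calculate_depletion_stats.py are not shown here, so the following is only a sketch of what such a calculation might look like. The function names and output keys are assumptions; the one fixed fact it relies on is that an uncompressed FASTQ record occupies four lines.

```python
def count_fastq_reads(lines):
    """Count reads in an iterable of FASTQ lines (four lines per record)."""
    line_count = sum(1 for _ in lines)
    return line_count // 4


def depletion_stats(total_reads, unmapped_reads):
    """Summarise how many reads were removed as host and how many remain."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= unmapped_reads <= total_reads:
        raise ValueError("unmapped_reads must be between 0 and total_reads")
    host_reads = total_reads - unmapped_reads
    return {
        "total_reads": total_reads,
        "host_reads": host_reads,
        "non_host_reads": unmapped_reads,
        "host_fraction": host_reads / total_reads,
        "non_host_fraction": unmapped_reads / total_reads,
    }
```

For example, `depletion_stats(50_000, 2_500)` reports a host fraction of 0.95, meaning 5% of reads are carried forward to pathogen detection.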

Expected Depletion

Phase 4: Pathogen Detection

Multi-Database Search

Kraken2, BLAST, Diamond, PERV-specific

Comprehensive pathogen screening using multiple complementary databases and detection methods.

Detection Pipeline

# 1. Kraken2 classification (rapid screening)
kraken2 --db /data/kraken2_db \
  --threads 16 \
  --report kraken_report.txt \
  unmapped.fastq > kraken_output.txt

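# 2. BLAST search against PMDA database
# blastn expects FASTA input, so convert the FASTQ reads first
awk 'NR % 4 == 1 {print ">" substr($0, 2)} NR % 4 == 2 {print}' \
  unmapped.fastq > unmapped.fasta
blastn -query unmapped.fasta \
  -db /data/pmda_pathogens \
  -num_threads 16 \
  -outfmt 6 -out blast_results.txt

# 3. Diamond viral protein search
diamond blastx \
  --query unmapped.fastq \
  --db /data/rvdb.dmnd \
  --threads 16 \
  --outfmt 6 --out diamond_results.txt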

# 4. PERV-specific detection
bash perv_analysis.sh \
  --input unmapped.fastq \
  --output perv_results.json

PERV Critical Detection

Result Integration

# Integrate results from all methods
python3 integrate_results.py \
  --kraken kraken_output.txt \
  --blast blast_results.txt \
  --diamond diamond_results.txt \
  --perv perv_results.json \
  --output integrated_pathogens.json

# Validate against PMDA checklist
python3 pmda_check.py \
  --results integrated_pathogens.json \
  --checklist /data/pmda_91_pathogens.json \
  --output pmda_compliance.json
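
How integrate_results.py combines the individual outputs is not documented here. As one plausible building block, the sketch below reduces BLAST tabular output (the default 12 columns of `-outfmt 6`) to the best hit per read and counts reads per subject. The identity and e-value cut-offs are illustrative assumptions, not values from the pipeline.

```python
import csv

# Default column order of BLAST tabular output (-outfmt 6).
BLAST6_COLUMNS = [
    "qseqid", "sseqid", "pident", "length", "mismatch", "gapopen",
    "qstart", "qend", "sstart", "send", "evalue", "bitscore",
]


def best_hits(lines, min_identity=90.0, max_evalue=1e-10):
    """Return {read_id: hit_row}, keeping the highest-bitscore hit per read."""
    best = {}
    reader = csv.DictReader(lines, fieldnames=BLAST6_COLUMNS, delimiter="\t")
    for row in reader:
        # Drop weak hits before comparing bitscores (thresholds are assumptions).
        if float(row["pident"]) < min_identity or float(row["evalue"]) > max_evalue:
            continue
        current = best.get(row["qseqid"])
        if current is None or float(row["bitscore"]) > float(current["bitscore"]):
            best[row["qseqid"]] = row
    return best


def reads_per_subject(hits):
    """Count how many reads have each subject sequence as their best hit."""
    counts = {}
    for hit in hits.values():
        counts[hit["sseqid"]] = counts.get(hit["sseqid"], 0) + 1
    return counts
```

Passing an open handle on blast_results.txt to `best_hits` and the result to `reads_per_subject` yields per-pathogen read counts of the kind the quantification phase consumes.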

Circular Genome Handling (v2.1)

Phase 5: Quantification

Spike-in Normalization

Absolute copy number calculation

Converts read counts to absolute pathogen copy numbers using a PhiX174 spike-in as an internal standard.

Quantification Formula

Pathogen copies/mL = (Pathogen reads / Spike-in reads) × Spike-in copies/mL

Calculation Script

# Absolute quantification with confidence intervals
python3 absolute_quantification.py \
  --pathogen-reads 1250 \
  --spike-in-reads 1000 \
  --spike-in-copies 1000000 \
  --confidence 0.95 \
  --output quantification.json

# Output:
# {
#   "copies_per_ml": 1250000,
#   "log10_copies": 6.097,
#   "ci_lower": 1180000,
#   "ci_upper": 1320000,
#   "confidence_level": 0.95
# }
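
The formula and example values above can be reproduced in a few lines. The confidence interval method used by absolute_quantification.py is not stated, so the sketch below assumes a normal approximation to Poisson counting error on the pathogen read count only; this is an assumption that happens to agree with the rounded example output, and the real script may do something different.

```python
import math
from statistics import NormalDist


def quantify(pathogen_reads, spike_in_reads, spike_in_copies_per_ml, confidence=0.95):
    """Spike-in normalised copies/mL with an approximate Poisson interval."""
    if spike_in_reads <= 0:
        raise ValueError("spike_in_reads must be positive")
    # copies/mL represented by one read, derived from the spike-in
    scale = spike_in_copies_per_ml / spike_in_reads
    # two-sided z value, e.g. about 1.96 for 95% confidence
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Poisson standard deviation of a count n is sqrt(n)
    half_width = z * math.sqrt(pathogen_reads)
    copies = pathogen_reads * scale
    return {
        "copies_per_ml": copies,
        "log10_copies": math.log10(copies) if copies > 0 else None,
        "ci_lower": max(0.0, (pathogen_reads - half_width) * scale),
        "ci_upper": (pathogen_reads + half_width) * scale,
        "confidence_level": confidence,
    }


result = quantify(1250, 1000, 1_000_000)
```

With these inputs the estimate is 1,250,000 copies/mL and the interval spans roughly 1.18 to 1.32 million copies/mL. Note that this interval ignores counting error in the spike-in reads, so it is narrower than one that propagates both sources of uncertainty.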

Circular Genome Coverage (v2.1)

Phase 6: Report Generation

PMDA-Compliant Reports

PDF, JSON, HTML formats

Generates comprehensive analysis reports in multiple formats for different audiences.

Report Types

  • PDF Report: Comprehensive report with visualizations for human review
  • JSON Report: Machine-readable PMDA 91 pathogen checklist
  • HTML Report: Interactive web-based report with search

Report Generation

# Generate all report formats
python3 generate_reports.py \
  --run-id RUN-2024-001 \
  --results integrated_pathogens.json \
  --quantification quantification.json \
  --pmda pmda_compliance.json \
  --output-dir /data/reports

# Send notifications
python3 send_notifications.py \
  --run-id RUN-2024-001 \
  --reports /data/reports \
  --recipients alerts@example.com

Report Contents

  • ✓ Executive summary
  • ✓ PMDA 91 pathogen checklist
  • ✓ Detailed pathogen detection results
  • ✓ Quantification data (copies/mL)
  • ✓ Quality control metrics
  • ✓ PERV analysis section
  • ✓ Pipeline execution log
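
To illustrate how the sections listed above might map onto the machine-readable JSON report, here is a minimal sketch. The field names and the structure of the inputs are hypothetical; the actual schema produced by generate_reports.py is not documented in this section.

```python
import json
from datetime import datetime, timezone


def build_report(run_id, pathogens, quantification, pmda_compliance, qc_metrics):
    """Assemble a report dictionary with one key per documented section."""
    return {
        "run_id": run_id,
        "generated_at": datetime.now(timezone.utc).isoformat(),
        "executive_summary": {
            "pathogens_detected": len(pathogens),
            # Assumes each pathogen entry carries a "name" field.
            "perv_detected": any(
                p.get("name", "").startswith("PERV") for p in pathogens
            ),
        },
        "pmda_checklist": pmda_compliance,
        "pathogen_detection": pathogens,
        "quantification": quantification,
        "quality_control": qc_metrics,
    }


def write_json_report(report, path):
    """Write the report as indented, key-sorted JSON for stable diffs."""
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(report, handle, indent=2, sort_keys=True)
```

Keeping the report as a plain dictionary until the final write step means the same structure could feed PDF and HTML renderers as well.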