Pipeline Phases
Deep dive into each of the 6 analysis phases from basecalling to reporting.
Total Pipeline Duration
Protocol 12 v2.1: Sample Preparation Update
Phase Overview
Basecalling
Convert raw signal (FAST5/POD5) to sequences (FASTQ)
Quality Control
Assess read quality metrics and filter low-quality reads
Host Removal
Align reads to the Sus scrofa genome and remove host DNA
Pathogen Detection
Multi-database screening for PMDA pathogens
Quantification
Absolute copy number calculation with spike-in normalization
Reporting
Generate PMDA-compliant reports in PDF, JSON, and HTML
Phase 1: Basecalling
Dorado Duplex Mode
Q30+ accuracy basecalling
Converts raw electrical signals from FAST5/POD5 files into nucleotide sequences (FASTQ) using Oxford Nanopore's Dorado basecaller in duplex mode for maximum accuracy.
Key Features
- Duplex Mode: Sequences both DNA strands for 99.9% accuracy (Q30+)
- GPU Acceleration: NVIDIA T4 GPU on g4dn.xlarge instance
- Real-time Processing: Can process data as it's generated
- Quality Filtering: Automatically filters reads below Q9
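To make the Q9 filter concrete, here is a minimal sketch of how a mean read quality can be computed from a FASTQ quality string. It assumes Phred+33 encoding and averages in error-probability space, which is the usual convention for nanopore reads. The function names are hypothetical, and the pipeline's own filter may be implemented differently.

```python
import math


def mean_read_quality(quality_string: str, phred_offset: int = 33) -> float:
    """Mean Phred quality of one read, averaged over error probabilities.

    Averaging the Phred scores directly would overstate quality, because
    a few very poor bases dominate the expected error rate.
    """
    if not quality_string:
        return 0.0
    error_probs = [
        10 ** (-(ord(char) - phred_offset) / 10) for char in quality_string
    ]
    mean_error = sum(error_probs) / len(error_probs)
    return -10 * math.log10(mean_error)


def passes_quality_filter(quality_string: str, min_q: float = 9.0) -> bool:
    """Return True when the read's mean quality is at least min_q (assumed Q9)."""
    return mean_read_quality(quality_string) >= min_q
```

A read with one Q40 base and one Q0 base has an arithmetic mean of Q20, but its mean quality under this definition is about Q3, so it would be filtered out.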
Script Example
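#!/usr/bin/env bash
# Basecalling with Dorado Duplex
DORADO_BIN=/opt/dorado/bin/dorado
# NOTE: Dorado expects a model name or model directory; the ".cfg" suffix is a
# Guppy convention, so adjust this value to the model installed on your system.
MODEL=dna_r10.4.1_e8.2_400bps_sup.cfg
INPUT_DIR=/data/fast5
OUTPUT_DIR=/data/fastq
# Run duplex basecalling (Dorado writes BAM by default; --emit-fastq requests FASTQ)
"$DORADO_BIN" duplex \
--device cuda:0 \
--emit-fastq \
"$MODEL" \
"$INPUT_DIR" > "$OUTPUT_DIR/basecalled.fastq"
# Generate sequencing summary
python3 generate_summary.py \
--input "$OUTPUT_DIR/basecalled.fastq" \
--output "$OUTPUT_DIR/sequencing_summary.txt"
Output Metrics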
Phase 2: Quality Control
PycoQC & NanoPlot
Read quality assessment
Comprehensive quality control analysis to ensure data meets minimum standards before proceeding to analysis phases.
Quality Thresholds
- ✓ Minimum reads: 10,000
- ✓ Mean quality score: Q9+
- ✓ Q30 reads: >10%
- ✓ Read length N50: >2 kb
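The thresholds above can be expressed as a simple pass/fail check. The sketch below is illustrative only: the function names are hypothetical, the run-level mean quality is assumed to be the plain average of per-read mean qualities, and "Q30 reads" is assumed to mean reads whose mean quality is at least Q30.

```python
from typing import Dict, Sequence


def read_length_n50(lengths: Sequence[int]) -> int:
    """Length L such that reads of length >= L hold at least half of all bases."""
    if not lengths:
        return 0
    half_total = sum(lengths) / 2
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= half_total:
            return length
    return 0


def check_qc_thresholds(
    lengths: Sequence[int], mean_qualities: Sequence[float]
) -> Dict[str, bool]:
    """Evaluate each documented threshold; one entry per check."""
    n_reads = len(lengths)
    mean_q = sum(mean_qualities) / n_reads if n_reads else 0.0
    q30_fraction = (
        sum(1 for q in mean_qualities if q >= 30) / n_reads if n_reads else 0.0
    )
    return {
        "min_reads": n_reads >= 10_000,          # Minimum reads: 10,000
        "mean_quality": mean_q >= 9.0,           # Mean quality score: Q9+
        "q30_fraction": q30_fraction > 0.10,     # Q30 reads: >10%
        "n50": read_length_n50(lengths) > 2000,  # Read length N50: >2 kb
    }
```

A run proceeds to later phases only if every value in the returned dictionary is true, for example `all(check_qc_thresholds(lengths, quals).values())`.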
QC Reports Generated
# PycoQC HTML report
pycoQC -f sequencing_summary.txt -o pycoQC_report.html
# NanoPlot visualization
NanoPlot --fastq basecalled.fastq --plots kde --legacy hex dot
# Output files:
# - NanoPlot-report.html
# - Read length distribution
# - Quality score distribution
# - Yield over time
Phase 3: Host Genome Removal
Minimap2 Alignment
Sus scrofa genome depletion
Aligns reads to the porcine reference genome (Sus scrofa 11.1) and removes host DNA to enrich for pathogen sequences.
Alignment & Filtering
# Align to host genome
minimap2 -ax map-ont \
/data/references/sus_scrofa_11.1.mmi \
basecalled.fastq > aligned.sam
# Extract unmapped reads (potential pathogens)
samtools view -f 4 aligned.sam | \
samtools fastq - > unmapped.fastq
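# Calculate depletion statistics
# (a FASTQ record spans 4 lines, so line counts are divided by 4 to get read counts)
python3 calculate_depletion_stats.py \
--total $(( $(wc -l < basecalled.fastq) / 4 )) \
--unmapped $(( $(wc -l < unmapped.fastq) / 4 ))
Expected Depletion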
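As a rough illustration of what a depletion calculation involves, the sketch below counts reads in a FASTQ file and derives the host and retained fractions. It is not the pipeline's `calculate_depletion_stats.py`; the function names and output keys are hypothetical, and it assumes standard four-line FASTQ records with no wrapped sequence lines.

```python
from typing import Dict


def count_fastq_reads(path: str) -> int:
    """Count reads in an uncompressed FASTQ file (4 lines per record assumed)."""
    with open(path) as handle:
        line_count = sum(1 for _ in handle)
    return line_count // 4


def depletion_stats(total_reads: int, unmapped_reads: int) -> Dict[str, float]:
    """Summarize how much of the run was host-derived and how much was retained."""
    if total_reads <= 0:
        raise ValueError("total_reads must be positive")
    if not 0 <= unmapped_reads <= total_reads:
        raise ValueError("unmapped_reads must be between 0 and total_reads")
    host_reads = total_reads - unmapped_reads
    return {
        "host_reads": host_reads,
        "host_fraction": host_reads / total_reads,
        "retained_fraction": unmapped_reads / total_reads,
    }
```

For instance, `depletion_stats(1000, 50)` reports 950 host reads, a host fraction of 0.95, and a retained fraction of 0.05.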
Phase 4: Pathogen Detection
Multi-Database Search
Kraken2, BLAST, Diamond, PERV-specific
Comprehensive pathogen screening using multiple complementary databases and detection methods.
Detection Pipeline
# 1. Kraken2 classification (rapid screening)
kraken2 --db /data/kraken2_db \
--threads 16 \
--report kraken_report.txt \
unmapped.fastq > kraken_output.txt
# 2. BLAST search against PMDA database (blastn reads FASTA, so convert the FASTQ first)
awk 'NR%4==1 {print ">" substr($0, 2)} NR%4==2 {print}' unmapped.fastq > unmapped.fasta
blastn -query unmapped.fasta \
-db /data/pmda_pathogens \
-num_threads 16 \
-outfmt 6 -out blast_results.txt
# 3. Diamond viral protein search
diamond blastx \
--query unmapped.fastq \
--db /data/rvdb.dmnd \
--threads 16 \
--outfmt 6 --out diamond_results.txt
# 4. PERV-specific detection
bash perv_analysis.sh \
--input unmapped.fastq \
--output perv_results.json
PERV Critical Detection
Critical Alert Trigger
Result Integration
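Conceptually, integration merges the per-tool hits into one record per pathogen and then compares that record against the PMDA checklist. The sketch below illustrates the idea only. The input shapes, function names, and status labels are assumptions, and the actual formats used by `integrate_results.py` and `pmda_check.py` are not described here.

```python
from typing import Dict, Iterable, Set


def integrate_detections(
    detections_by_tool: Dict[str, Iterable[str]]
) -> Dict[str, Set[str]]:
    """Map each detected pathogen name to the set of tools that reported it."""
    integrated: Dict[str, Set[str]] = {}
    for tool, pathogen_names in detections_by_tool.items():
        for name in pathogen_names:
            integrated.setdefault(name, set()).add(tool)
    return integrated


def checklist_status(
    integrated: Dict[str, Set[str]], checklist: Iterable[str]
) -> Dict[str, str]:
    """Label every checklist entry as detected or not_detected."""
    return {
        name: "detected" if name in integrated else "not_detected"
        for name in checklist
    }
```

Keeping the supporting tools for each pathogen makes it easy to see which calls are backed by more than one method. The commands that follow run the pipeline's own scripts for this step.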
# Integrate results from all methods
python3 integrate_results.py \
--kraken kraken_output.txt \
--blast blast_results.txt \
--diamond diamond_results.txt \
--perv perv_results.json \
--output integrated_pathogens.json
# Validate against PMDA checklist
python3 pmda_check.py \
--results integrated_pathogens.json \
--checklist /data/pmda_91_pathogens.json \
--output pmda_compliance.json
Circular Genome Handling (v2.1)
Phase 5: Quantification
Spike-in Normalization
Absolute copy number calculation
Converts read counts to absolute pathogen copy numbers using PhiX174 spike-in as internal standard.
Quantification Formula
Pathogen copies/mL = (Pathogen reads / Spike-in reads) × Spike-in copies/mL
Calculation Script
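The formula can be checked with a few lines of Python. This is a minimal sketch of the point estimate only: the function name is hypothetical, and the confidence-interval method used by the pipeline is not specified here, so it is left out.

```python
import math


def absolute_copies_per_ml(
    pathogen_reads: int, spike_in_reads: int, spike_in_copies_per_ml: float
) -> float:
    """Pathogen copies/mL = (pathogen reads / spike-in reads) x spike-in copies/mL."""
    if spike_in_reads <= 0:
        raise ValueError("spike_in_reads must be positive; no spike-in reads recovered")
    return pathogen_reads / spike_in_reads * spike_in_copies_per_ml


copies = absolute_copies_per_ml(1250, 1000, 1_000_000)
print(copies, round(math.log10(copies), 3))  # prints: 1250000.0 6.097
```

With 1,250 pathogen reads, 1,000 spike-in reads, and 1,000,000 spike-in copies/mL, the estimate is 1,250,000 copies/mL (log10 of about 6.097).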
# Absolute quantification with confidence intervals
python3 absolute_quantification.py \
--pathogen-reads 1250 \
--spike-in-reads 1000 \
--spike-in-copies 1000000 \
--confidence 0.95 \
--output quantification.json
# Output:
# {
# "copies_per_ml": 1250000,
# "log10_copies": 6.097,
# "ci_lower": 1180000,
# "ci_upper": 1320000,
# "confidence_level": 0.95
# }
Circular Genome Coverage (v2.1)
Phase 6: Report Generation
PMDA-Compliant Reports
PDF, JSON, HTML formats
Generates comprehensive analysis reports in multiple formats for different audiences.
Report Types
Report Generation
# Generate all report formats
python3 generate_reports.py \
--run-id RUN-2024-001 \
--results integrated_pathogens.json \
--quantification quantification.json \
--pmda pmda_compliance.json \
--output-dir /data/reports
# Send notifications
python3 send_notifications.py \
--run-id RUN-2024-001 \
--reports /data/reports \
--recipients alerts@example.com
Report Contents
- ✓ Executive summary
- ✓ PMDA 91 pathogen checklist
- ✓ Detailed pathogen detection results
- ✓ Quantification data (copies/mL)
- ✓ Quality control metrics
- ✓ PERV analysis section
- ✓ Pipeline execution log