Core Concepts

Pipeline Phases

A detailed look at each of the six analysis phases, from basecalling to reporting.

Total Pipeline Duration

A complete analysis typically takes 4-8 hours, depending on sample size and complexity. The phases run sequentially, with automated transitions between them.

Protocol 12 v2.1: Sample Preparation Update

Laboratory sample preparation now includes Step 2.5 (circular DNA linearization + ssDNA conversion) to enable detection of PCV2, PCV3, TTV, and PPV. The bioinformatics pipeline (Phases 4 and 5) automatically handles circular genome references, using a reference-duplication strategy so that junction reads map correctly.

Phase Overview

Phase 1: Basecalling
Convert raw signal (FAST5/POD5) to sequences (FASTQ)
Tool: Dorado Duplex
Duration: 2-4 hours
Instance Type: g4dn.xlarge (GPU)

Phase 2: Quality Control
Assess read quality metrics and filter low-quality reads
Tool: PycoQC, NanoPlot
Duration: 10-15 minutes
Instance Type: t3.large

Phase 3: Host Removal
Align reads to Sus scrofa genome and remove host DNA
Tool: Minimap2, SAMtools
Duration: 30-60 minutes
Instance Type: r5.xlarge

Phase 4: Pathogen Detection
Multi-database screening for PMDA pathogens
Tool: Kraken2, BLAST, Diamond
Duration: 1-2 hours
Instance Type: r5.4xlarge

Phase 5: Quantification
Absolute copy number calculation with spike-in normalization
Tool: Custom Python scripts
Duration: 15-30 minutes
Instance Type: t3.large

Phase 6: Reporting
Generate PMDA-compliant reports in PDF, JSON, and HTML
Tool: ReportLab, WeasyPrint
Duration: 10-15 minutes
Instance Type: t3.medium
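
As a quick consistency check, the per-phase durations listed in the overview can be summed to see how they relate to the 4-8 hour total. The sketch below assumes the phases run strictly back to back with no queueing or instance start-up time; the dictionary and function names are illustrative only.

```python
# Phase durations in minutes as (shortest, longest), copied from the Phase Overview.
PHASE_DURATIONS_MIN = {
    "Basecalling": (120, 240),
    "Quality Control": (10, 15),
    "Host Removal": (30, 60),
    "Pathogen Detection": (60, 120),
    "Quantification": (15, 30),
    "Reporting": (10, 15),
}


def total_duration_hours(durations):
    """Sum the per-phase bounds, assuming phases run one after another."""
    low = sum(shortest for shortest, _ in durations.values())
    high = sum(longest for _, longest in durations.values())
    return low / 60, high / 60


low_h, high_h = total_duration_hours(PHASE_DURATIONS_MIN)
print(f"Estimated total: {low_h:.2f}-{high_h:.2f} hours")
```

Under these assumptions the bounds come out at roughly 4.1 and 8.0 hours, in line with the stated range.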

Phase 1: Basecalling

Dorado Duplex Mode

Q30+ accuracy basecalling

Converts raw electrical signals from FAST5/POD5 files into nucleotide sequences (FASTQ) using Oxford Nanopore's Dorado basecaller in duplex mode for maximum accuracy.

Key Features

• Duplex Mode: Sequences both DNA strands for 99.9% accuracy (Q30+)
• GPU Acceleration: NVIDIA T4 GPU on g4dn.xlarge instance
• Real-time Processing: Can process data as it's generated
• Quality Filtering: Automatically filters reads below Q9
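
The quality figures above (Q9, Q30) are Phred scores, which map to per-base accuracy through the standard relation accuracy = 1 - 10^(-Q/10). A minimal sketch of that conversion follows; the function names are illustrative and not part of the pipeline.

```python
import math


def phred_to_accuracy(q):
    """Per-base accuracy implied by a Phred quality score Q."""
    return 1.0 - 10 ** (-q / 10.0)


def accuracy_to_phred(accuracy):
    """Inverse conversion: Phred score implied by a per-base accuracy."""
    return -10.0 * math.log10(1.0 - accuracy)


# Q9 is the filtering cutoff and Q30 the duplex target quoted above.
for q in (9, 30):
    print(f"Q{q}: {phred_to_accuracy(q):.2%} accuracy")
```

Q30 corresponds to 99.9% accuracy, while the Q9 filter cutoff corresponds to roughly 87% accuracy.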

Script Example

basecall_duplex.sh

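#!/usr/bin/env bash
# Basecalling with Dorado Duplex
set -euo pipefail

DORADO_BIN=/opt/dorado/bin/dorado
# NOTE: the ".cfg" suffix is a Guppy-style config name. Dorado expects a model
# name or model directory, so adjust this value to the model installed locally.
MODEL=dna_r10.4.1_e8.2_400bps_sup.cfg
INPUT_DIR=/data/fast5
OUTPUT_DIR=/data/fastq

mkdir -p "$OUTPUT_DIR"

# Run duplex basecalling (Dorado writes BAM by default; --emit-fastq requests FASTQ)
"$DORADO_BIN" duplex \
  --device cuda:0 \
  --emit-fastq \
  "$MODEL" \
  "$INPUT_DIR" > "$OUTPUT_DIR/basecalled.fastq"

# Generate sequencing summary
python3 generate_summary.py \
  --input "$OUTPUT_DIR/basecalled.fastq" \
  --output "$OUTPUT_DIR/sequencing_summary.txt"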

Output Metrics

Total Reads: 50,000+

Mean Quality: Q10.5

Total Bases: 150 Mb

N50 Length: 3.5 kb

Phase 2: Quality Control

PycoQC & NanoPlot

Read quality assessment

Comprehensive quality control analysis to ensure data meets minimum standards before proceeding to analysis phases.

Quality Thresholds

✓ Minimum reads: 10,000
✓ Mean quality score: Q9+
✓ Q30 reads: >10%
✓ Read length N50: >2 kb
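
The thresholds above could be checked programmatically along the following lines. This is a sketch only: the function names, the dictionary layout, and the Q30 fraction used in the example are assumptions, and how the pipeline actually derives these metrics from PycoQC or NanoPlot output is not specified here.

```python
def n50(read_lengths):
    """Length L such that reads of length >= L contain at least half of all bases."""
    total = sum(read_lengths)
    running = 0
    for length in sorted(read_lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Thresholds copied from the list above.
THRESHOLDS = {
    "min_reads": 10_000,
    "min_mean_q": 9.0,
    "min_q30_fraction": 0.10,  # "Q30 reads: >10%"
    "min_n50_bp": 2_000,       # "Read length N50: >2 kb"
}


def check_qc(total_reads, mean_q, q30_fraction, n50_bp, thresholds=THRESHOLDS):
    """Return (passed, per-check results) for one sequencing run."""
    checks = {
        "reads": total_reads >= thresholds["min_reads"],
        "mean_quality": mean_q >= thresholds["min_mean_q"],
        "q30_fraction": q30_fraction > thresholds["min_q30_fraction"],
        "n50": n50_bp > thresholds["min_n50_bp"],
    }
    return all(checks.values()), checks


# Example using the Phase 1 output metrics; the Q30 fraction is a hypothetical value.
passed, details = check_qc(total_reads=50_000, mean_q=10.5, q30_fraction=0.12, n50_bp=3_500)
print(passed, details)
```

Returning the per-check results alongside the overall verdict makes it easier to report which threshold a failing run missed.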

QC Reports Generated

# PycoQC HTML report
pycoQC -f sequencing_summary.txt -o pycoQC_report.html

# NanoPlot visualization
NanoPlot --fastq basecalled.fastq --plots kde --legacy hex dot

# Output files:
# - NanoPlot-report.html
# - Read length distribution
# - Quality score distribution
# - Yield over time

Phase 3: Host Genome Removal

Minimap2 Alignment

Sus scrofa genome depletion

Aligns reads to the porcine reference genome (Sus scrofa, assembly Sscrofa11.1) and removes host-derived reads to enrich for pathogen sequences.

Alignment & Filtering

# Align to host genome
minimap2 -ax map-ont \
  /data/references/sus_scrofa_11.1.mmi \
  basecalled.fastq > aligned.sam

# Extract unmapped reads (potential pathogens); -f 4 keeps reads flagged as unmapped
samtools view -b -f 4 aligned.sam | \
  samtools fastq - > unmapped.fastq

# Calculate depletion statistics
# A FASTQ record spans 4 lines, so divide line counts by 4 to get read counts
python3 calculate_depletion_stats.py \
  --total $(( $(wc -l < basecalled.fastq) / 4 )) \
  --unmapped $(( $(wc -l < unmapped.fastq) / 4 ))

Expected Depletion

Typical depletion efficiency: for blood samples, 90-99% of reads are expected to map to the host genome. The remaining 1-10% (unmapped reads) proceed to pathogen detection.
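
The depletion figures can be derived from two read counts. The sketch below is not the pipeline's `calculate_depletion_stats.py`; the function names and the returned fields are assumptions, and the example counts are hypothetical.

```python
def count_fastq_reads(path):
    """Count records in an uncompressed FASTQ file (4 lines per record)."""
    with open(path) as handle:
        return sum(1 for _ in handle) // 4


def depletion_stats(total_reads, unmapped_reads):
    """Summarize how many reads were assigned to the host genome."""
    if total_reads <= 0 or not 0 <= unmapped_reads <= total_reads:
        raise ValueError("expected total_reads > 0 and 0 <= unmapped_reads <= total_reads")
    host_reads = total_reads - unmapped_reads
    host_fraction = host_reads / total_reads
    return {
        "host_reads": host_reads,
        "host_fraction": host_fraction,
        "unmapped_fraction": unmapped_reads / total_reads,
        # 90-99% host reads is the range quoted above for blood samples.
        "within_expected_range": 0.90 <= host_fraction <= 0.99,
    }


# Hypothetical run: 50,000 reads in total, 2,500 left unmapped after alignment.
stats = depletion_stats(total_reads=50_000, unmapped_reads=2_500)
print(f"host: {stats['host_fraction']:.1%}, unmapped: {stats['unmapped_fraction']:.1%}")
```

A host fraction outside the expected range does not necessarily mean a failed run, but it may be worth flagging for manual review.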

Phase 4: Pathogen Detection

Multi-Database Search

Kraken2, BLAST, Diamond, PERV-specific

Comprehensive pathogen screening using multiple complementary databases and detection methods.

Detection Pipeline

# 1. Kraken2 classification (rapid screening)
kraken2 --db /data/kraken2_db \
  --threads 16 \
  --report kraken_report.txt \
  unmapped.fastq > kraken_output.txt

# 2. BLAST search against PMDA database
# blastn reads FASTA, not FASTQ, so convert first (keep header and sequence lines only)
awk 'NR % 4 == 1 { print ">" substr($0, 2) } NR % 4 == 2 { print }' \
  unmapped.fastq > unmapped.fasta
blastn -query unmapped.fasta \
  -db /data/pmda_pathogens \
  -num_threads 16 \
  -outfmt 6 -out blast_results.txt

# 3. Diamond viral protein search (Diamond's output option is --out, not -out)
diamond blastx \
  --query unmapped.fastq \
  --db /data/rvdb.dmnd \
  --threads 16 \
  --outfmt 6 --out diamond_results.txt

# 4. PERV-specific detection
bash perv_analysis.sh \
  --input unmapped.fastq \
  --output perv_results.json

PERV Critical Detection

Critical Alert Trigger

Any PERV detection (PERV-A, PERV-B, or PERV-C) triggers an immediate SNS notification to the alert recipients. PERV is the highest-priority pathogen for xenotransplantation safety.
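
The alert trigger could be sketched as follows. The structure of `perv_results.json` is not described here, so this example works on a plain list of detected pathogen names; the function names, the message wording, and the injected `publish` callable are all assumptions made for illustration.

```python
PERV_SUBTYPES = ("PERV-A", "PERV-B", "PERV-C")


def perv_hits(detected_names):
    """Return the PERV subtypes mentioned in a list of detected pathogen names."""
    upper_names = [name.upper() for name in detected_names]
    return [s for s in PERV_SUBTYPES if any(s in name for name in upper_names)]


def build_alert(run_id, detected_names):
    """Build an alert payload, or return None when no PERV subtype was detected."""
    hits = perv_hits(detected_names)
    if not hits:
        return None
    return {
        "Subject": f"[CRITICAL] PERV detected in {run_id}",
        "Message": f"Run {run_id}: detected {', '.join(hits)}. Immediate review required.",
    }


def send_alert(alert, publish):
    """Send the alert through the supplied publish callable; return True if sent."""
    if alert is None:
        return False
    publish(**alert)
    return True


# Stand-in publisher that only records what would have been sent.
outbox = []
alert = build_alert("RUN-2024-001", ["Porcine circovirus 2", "PERV-A", "perv-c env"])
send_alert(alert, lambda **kwargs: outbox.append(kwargs))
print(outbox)
```

In an AWS deployment, `publish` might be `functools.partial(boto3.client("sns").publish, TopicArn=topic_arn)`, where the topic ARN is deployment-specific. Passing the publisher in keeps the detection logic testable without network access.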

Result Integration

# Integrate results from all methods
python3 integrate_results.py \
  --kraken kraken_output.txt \
  --blast blast_results.txt \
  --diamond diamond_results.txt \
  --perv perv_results.json \
  --output integrated_pathogens.json

# Validate against PMDA checklist
python3 pmda_check.py \
  --results integrated_pathogens.json \
  --checklist /data/pmda_91_pathogens.json \
  --output pmda_compliance.json

Circular Genome Handling (v2.1)

Protocol 12 v2.1 Update: Circular virus genomes (PCV2, PCV3, TTV) are detected using duplicated references (e.g., PCV2: 1768 bp → 3536 bp) to properly map junction reads created during random DNase I linearization. This ensures accurate detection without split alignments.
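
To illustrate why duplication helps, the sketch below uses a toy 14 bp "circular" sequence (not a real viral genome) and plain substring search as a stand-in for the aligner. A read that spans the junction of the linear reference fails to match the single-copy reference but matches the duplicated one contiguously. All names are illustrative.

```python
def duplicate_reference(genome):
    """Concatenate a circular genome with itself to build the mapping reference."""
    return genome + genome


def linearize(genome, cut):
    """Linear molecule produced by opening the circle at 0-based position `cut`."""
    return genome[cut:] + genome[:cut]


genome = "ACGTTGCAAGGCTA"           # toy circular genome as stored in a linear FASTA
molecule = linearize(genome, 10)     # random linearization away from the FASTA origin
read = molecule[:9]                  # read covering the end and start of the FASTA record

print(read in genome)                          # False: no contiguous match on one copy
print(duplicate_reference(genome).find(read))  # 10: contiguous match across the junction
```

The same logic gives the reference sizes quoted above: a 1768 bp genome yields a 3536 bp duplicated reference.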

Phase 5: Quantification

Spike-in Normalization

Absolute copy number calculation

Converts read counts to absolute pathogen copy numbers using a PhiX174 spike-in as the internal standard.

Quantification Formula

Pathogen copies/mL = (Pathogen reads / Spike-in reads) × Spike-in copies/mL

Calculation Script

# Absolute quantification with confidence intervals
python3 absolute_quantification.py \
  --pathogen-reads 1250 \
  --spike-in-reads 1000 \
  --spike-in-copies 1000000 \
  --confidence 0.95 \
  --output quantification.json

# Output:
# {
#   "copies_per_ml": 1250000,
#   "log10_copies": 6.097,
#   "ci_lower": 1180000,
#   "ci_upper": 1320000,
#   "confidence_level": 0.95
# }
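
The example output above can be reproduced, to the rounding shown, with the formula plus a simple interval. The sketch below is not the pipeline's `absolute_quantification.py`: the interval method is an assumption (a normal approximation to Poisson counting error on the pathogen reads only, treating the spike-in count as exact), chosen because it matches the example values.

```python
import math
from statistics import NormalDist


def quantify(pathogen_reads, spike_in_reads, spike_in_copies_per_ml, confidence=0.95):
    """Spike-in normalized copies/mL with an approximate confidence interval."""
    if pathogen_reads <= 0 or spike_in_reads <= 0:
        raise ValueError("read counts must be positive")
    copies = pathogen_reads / spike_in_reads * spike_in_copies_per_ml
    # Two-sided z-value for the requested confidence level (about 1.96 for 0.95).
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # Assumption: relative error of a Poisson count is 1 / sqrt(count).
    relative_error = z / math.sqrt(pathogen_reads)
    return {
        "copies_per_ml": copies,
        "log10_copies": math.log10(copies),
        "ci_lower": copies * (1 - relative_error),
        "ci_upper": copies * (1 + relative_error),
        "confidence_level": confidence,
    }


result = quantify(pathogen_reads=1250, spike_in_reads=1000, spike_in_copies_per_ml=1_000_000)
print(round(result["copies_per_ml"]), round(result["log10_copies"], 3))
```

A more conservative interval would also propagate the counting error of the spike-in reads, which widens the bounds.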

Circular Genome Coverage (v2.1)

For circular genomes (PCV2, PCV3, TTV), coverage is calculated only on the first half of duplicated references to avoid double-counting. Quantification uses actual genome sizes (e.g., PCV2 = 1768 bp), not duplicated reference sizes, ensuring accurate copy number calculations.
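
A minimal sketch of the first-half rule, assuming per-base depth over the duplicated reference is already available as a list (for example, parsed from a depth table). The function name and return fields are illustrative.

```python
def circular_coverage(depth_on_duplicated_ref, genome_length):
    """Coverage statistics restricted to the first copy of a duplicated reference."""
    if len(depth_on_duplicated_ref) != 2 * genome_length:
        raise ValueError("depth vector must span exactly two genome copies")
    first_copy = depth_on_duplicated_ref[:genome_length]
    covered = sum(1 for depth in first_copy if depth > 0)
    return {
        # Denominators use the actual genome size, not the duplicated reference size.
        "breadth": covered / genome_length,
        "mean_depth": sum(first_copy) / genome_length,
    }


# Toy 4 bp genome: depth over the first copy is [2, 2, 0, 1]; the second copy is ignored.
print(circular_coverage([2, 2, 0, 1, 1, 0, 0, 0], genome_length=4))
```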

Phase 6: Report Generation

PMDA-Compliant Reports

PDF, JSON, HTML formats

Generates comprehensive analysis reports in multiple formats for different audiences.

Report Types

PDF Report: Comprehensive report with visualizations for human review

JSON Report: Machine-readable PMDA 91 pathogen checklist

HTML Report: Interactive web-based report with search

Report Generation

# Generate all report formats
python3 generate_reports.py \
  --run-id RUN-2024-001 \
  --results integrated_pathogens.json \
  --quantification quantification.json \
  --pmda pmda_compliance.json \
  --output-dir /data/reports

# Send notifications
python3 send_notifications.py \
  --run-id RUN-2024-001 \
  --reports /data/reports \
  --recipients alerts@example.com

Report Contents

✓ Executive summary
✓ PMDA 91 pathogen checklist
✓ Detailed pathogen detection results
✓ Quantification data (copies/mL)
✓ Quality control metrics
✓ PERV analysis section
✓ Pipeline execution log