Getting Started¶

Introduction¶

ODHL_AR is a bioinformatics best-practice pipeline for detecting antimicrobial resistance (AMR) genes and assessing bacterial genomic characteristics from whole-genome sequencing (WGS) data.

The pipeline is built using Nextflow, a workflow tool that efficiently runs tasks across multiple compute infrastructures in a portable and scalable manner. It uses Docker/Singularity containers, making installation trivial and ensuring reproducible results.

Pipeline Summary¶

Read QC: FastQC – Provides quality metrics for raw sequencing reads.
Trimming reads: Fastp – Trims low-quality reads and adapter sequences.
Taxonomic Classification: Kraken2 – Identifies bacterial species present in the sequencing data.
Genome Assembly: SPAdes – Performs de novo genome assembly.
Quality Assessment of Assembly: QUAST – Evaluates assembly quality.
Plasmid Identification: PlasmidFinder – Detects plasmids in the assembled genome.
MLST Typing: MLST – Assigns multilocus sequence types (MLST) to bacterial isolates.
AMR Gene Detection: AMRFinderPlus – Identifies antimicrobial resistance genes.
Virulence Gene Detection: Gamma – Predicts bacterial virulence factors.
Genome Annotation: Prokka – Annotates genomic features.
Whole-Genome Comparisons:
- Mash – Estimates genome distances for clustering.
- FastANI – Calculates Average Nucleotide Identity (ANI) for species identification.
Final Report: MultiQC – Summarizes results across all samples in a single interactive report.

Entry Points¶

Currently, there are several entry points for the AR pipeline:

arBASESPACE: Downloads files directly from Illumina BaseSpace.
arANALYSIS: Downloads files from Illumina BaseSpace, if needed; performs quality control, genome assembly, and taxonomic classification; detects antimicrobial resistance genes, plasmids, and virulence factors; generates unique WGS IDs; prepares the NCBI submission.
DBProcessing: Performs quality control and compiles NCBI IDs into a single user file.

In addition, there are several entry points for AR outbreak analysis:

outbreakANALYSIS: Performs additional analysis for outbreak detection.
outbreakREPORTING: Generates reports depending on the type of outbreak analysis required.
NFCORE_OUTBREAK: Executes outbreakANALYSIS and outbreakREPORTING as a second end-to-end workflow.
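
As a rough sketch of how one of these entry points might be selected: Nextflow's `-entry` option picks a named workflow from a DSL2 script. The helper below only prints the command rather than running it. The function name `build_entry_cmd`, the script name `main.nf`, and the `docker` profile are assumptions made for illustration and are not taken from the pipeline itself.

```shell
# Hypothetical helper: print (not run) the Nextflow command for one entry point.
# Assumptions: the workflow script is main.nf and a "docker" profile exists.
build_entry_cmd() {
    entry="$1"
    case "$entry" in
        arBASESPACE|arANALYSIS|DBProcessing|outbreakANALYSIS|outbreakREPORTING|NFCORE_OUTBREAK)
            printf 'nextflow run main.nf -entry %s -profile docker\n' "$entry"
            ;;
        *)
            # Reject names that are not in the documented entry point list.
            printf 'unknown entry point: %s\n' "$entry" >&2
            return 1
            ;;
    esac
}

# Example: show the command for the main analysis workflow.
build_entry_cmd arANALYSIS
```

Printing the command first makes it easy to review the entry point and profile before committing to a long run.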

Processes¶

Quality Control & Preprocessing¶

FastQC: Provides sequencing quality metrics.
Fastp: Trims low-quality reads and removes adapters.
Kraken2: Classifies bacterial species from sequencing data.

Genome Assembly & Assessment¶

SPAdes: Performs de novo genome assembly.
QUAST: Evaluates assembly quality.
PlasmidFinder: Identifies plasmid sequences in assembled genomes.

Genotyping & Comparative Genomics¶

MLST: Identifies bacterial strain types.
Mash: Estimates genetic distances between isolates.
FastANI: Computes Average Nucleotide Identity (ANI) for species classification.

Antimicrobial Resistance & Virulence Detection¶

AMRFinderPlus: Detects AMR genes in bacterial genomes.
Gamma: Predicts virulence factors from genomic data.

Genome Annotation¶

Prokka: Annotates bacterial genomes.

Final Reports¶

MultiQC: Compiles a summary of all quality control and analysis results.

Dependencies¶

Software & Tools¶

Ensure that the following software is installed before running the pipeline:

Nextflow (>=21.04.0)
Docker or Singularity
Conda (Optional)
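
The prerequisites above can be checked with a short script. This is a minimal sketch: `version_ge` is a hypothetical helper that relies on `sort -V`, and the parsing of `nextflow -version` assumes its output contains an `x.y.z` version string.

```shell
# Return success when version $1 is at least version $2 (uses sort -V ordering).
version_ge() {
    [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Minimum Nextflow version stated in the list above.
MIN_NEXTFLOW="21.04.0"

# Report which tools are on PATH; this loop only prints and never aborts.
for tool in nextflow docker singularity conda; do
    if command -v "$tool" >/dev/null 2>&1; then
        echo "found:   $tool"
    else
        echo "missing: $tool"
    fi
done

# Assumption: `nextflow -version` prints a line containing an x.y.z version.
if command -v nextflow >/dev/null 2>&1; then
    installed="$(nextflow -version 2>/dev/null | grep -oE '[0-9]+\.[0-9]+\.[0-9]+' | head -n 1)"
    if version_ge "$installed" "$MIN_NEXTFLOW"; then
        echo "nextflow $installed satisfies >= $MIN_NEXTFLOW"
    else
        echo "nextflow $installed is older than $MIN_NEXTFLOW"
    fi
fi
```

Only one of Docker or Singularity is needed, so a single "missing" line for either of them is not necessarily a problem.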

Core Software Versions¶

The pipeline requires the following tools:

- Python 3.9
- Samtools 1.21
- FastQC 0.12.1
- Fastp 0.23.4
- Kraken2 2.1.2
- SPAdes 3.15.5
- QUAST 5.0.2
- PlasmidFinder 2.1
- MLST 2.23.0
- Mash 2.3
- FastANI 1.33
- AMRFinderPlus 3.10
- Gamma 2.2
- Prokka 1.14.5
- MultiQC 1.21

Reference Databases¶

Kraken2 DB¶

The database can be downloaded from Ben Langmead's repository, which links directly to the database archives. Use the latest version of the 8 GB standard database, then reformat it with the bin/reformat_kraken.sh script.

Pipeline Execution¶

The ODHL_rAcecaR pipeline is implemented using Nextflow, which allows for execution on local machines, HPC clusters, or cloud environments.

1. Install Nextflow¶

curl -s https://get.nextflow.io | bash
chmod +x nextflow
mkdir -p ~/bin
mv nextflow ~/bin/

2. Clone the Repository¶

git clone https://github.com/ODHL/ODHL_rAcecaR.git
cd ODHL_rAcecaR

3. Configure the Pipeline¶

Modify the Nextflow configuration file (nextflow.config) to specify reference databases and execution profiles.

Example modification:

params {
    kraken2_db = "/home/ubuntu/refs/k2/k2_standard_08gb_202412.tar.gz"
    amrfinder_db = "/home/ubuntu/refs/amrfinderplus/latest"
    plasmidfinder_db = "/home/ubuntu/refs/plasmidfinder/latest"
}
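
Before launching a run, it can help to confirm that the configured reference paths actually exist. The sketch below reuses the three example paths from the params block; `check_ref` is a hypothetical helper, not part of the pipeline.

```shell
# Hypothetical helper: report whether a reference path exists.
check_ref() {
    label="$1"
    ref_path="$2"
    if [ -e "$ref_path" ]; then
        echo "ok:      $label -> $ref_path"
    else
        echo "missing: $label -> $ref_path"
        return 1
    fi
}

# Paths copied from the example params block; adjust them to your system.
missing=0
check_ref kraken2_db       "/home/ubuntu/refs/k2/k2_standard_08gb_202412.tar.gz" || missing=$((missing + 1))
check_ref amrfinder_db     "/home/ubuntu/refs/amrfinderplus/latest"              || missing=$((missing + 1))
check_ref plasmidfinder_db "/home/ubuntu/refs/plasmidfinder/latest"              || missing=$((missing + 1))
echo "$missing reference path(s) missing"
```

A missing path reported here is cheaper to fix than a pipeline failure partway through a run.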

4. (Optional) Install BaseSpace¶

The pipeline can download data automatically from BaseSpace. To use this feature, add the basespace executable to your $PATH.

# Install basespace
## Docs
## https://developer.basespace.illumina.com/docs/content/documentation/cli/cli-overview
if [[ ! -d $HOME/tools/ ]]; then mkdir -p $HOME/tools/; fi
wget "https://launch.basespace.illumina.com/CLI/latest/amd64-linux/bs" -O $HOME/tools/basespace
chmod u+x $HOME/tools/basespace
# Make the executable available on $PATH for the current shell
export PATH="$HOME/tools:$PATH"
basespace auth
### follow path to website and sign in
### should display "Welcome [name of user]"

Reproducibility¶

To ensure consistent results, specify the pipeline version when running:

nextflow run ODHL/ODHL_AR -r 1.0.0

You can check for the latest version on the ODHL/ODHL_AR GitHub Releases page.

Last update: 2025-01-29