nf-core/createtaxdb
Parallelised and automated construction of metagenomic classifier databases of different tools

database, database-builder, metagenomic-profiling, metagenomics, profiling, taxonomic-profiling

This is the development version of the pipeline.

Launch development version https://github.com/nf-core/createtaxdb

Introduction

This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.

The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.

Pipeline overview

The pipeline is built using Nextflow and processes data using the following steps:

MultiQC - Aggregate report describing versions and methods text for your pipeline run
Pipeline information - Report metrics generated during the workflow execution
Bracken - Database files for Bracken
ganon - Database files for ganon
Centrifuge - Database files for Centrifuge
DIAMOND - Database files for DIAMOND
Kaiju - Database files for Kaiju
KMCP - Database files for KMCP
Kraken2 - Database files for Kraken2
KrakenUniq - Database files for KrakenUniq
MALT - Database files for MALT
sourmash - Database files for sourmash
sylph - Database files for sylph

The pipeline can also generate downstream pipeline input samplesheets. These are stored in <outdir>/downstream_samplesheets.

MultiQC

Output files

multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files

pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
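
As a minimal sketch of how the trace file can help with troubleshooting, the Python function below lists every task that did not finish successfully. It assumes the trace is tab-separated and contains the name and status columns that Nextflow writes by default; the function name unfinished_tasks is hypothetical and not part of the pipeline.

```python
import csv
from pathlib import Path


def unfinished_tasks(trace_path):
    """Return (name, status) for every task that is neither COMPLETED nor CACHED.

    Assumes a tab-separated Nextflow trace file with 'name' and 'status' columns.
    """
    ok_statuses = {"COMPLETED", "CACHED"}
    with Path(trace_path).open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            (row["name"], row["status"])
            for row in reader
            if row["status"] not in ok_statuses
        ]
```

Calling unfinished_tasks("pipeline_info/execution_trace.txt") on a finished run should return an empty list; any entries point to the processes worth inspecting first.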

Bracken

Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.

Output files

bracken/
- <db_name>/
  - database100mers.kmer_distrib: Bracken kmer distribution file
  - database100mers.kraken: Bracken index file
  - database.kraken: Bracken database file
  - hash.k2d: Kraken2 hash database file
  - opts.k2d: Kraken2 opts database file
  - taxo.k2d: Kraken2 taxo database file
  - library/: Intermediate Kraken2 directory containing FASTAs and related files of added genomes
  - taxonomy/: Intermediate Kraken2 directory containing taxonomy files of added genomes
  - seqid2taxid.map: Intermediate Kraken2 file mapping sequence IDs to taxonomy IDs of added genomes

Note that all intermediate files are required for a Bracken database, even though Kraken2 itself only requires the *.k2d files.

The resulting <db_name>/ directory can be given to Bracken itself with bracken -d <your_database_name> etc.
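
Because Bracken needs every entry listed above, it can be worth checking a database directory before moving or sharing it. The following Python sketch does this under the assumption that the file names match the list above exactly (for example database100mers.*, which may differ for other read lengths); the helper name missing_bracken_entries is hypothetical.

```python
from pathlib import Path

# Entries taken from the Bracken output list above; a trailing slash marks a directory.
BRACKEN_DB_ENTRIES = [
    "database100mers.kmer_distrib",
    "database100mers.kraken",
    "database.kraken",
    "hash.k2d",
    "opts.k2d",
    "taxo.k2d",
    "library/",
    "taxonomy/",
    "seqid2taxid.map",
]


def missing_bracken_entries(db_dir):
    """Return the expected entries that are absent from db_dir, in list order."""
    db_dir = Path(db_dir)
    missing = []
    for entry in BRACKEN_DB_ENTRIES:
        path = db_dir / entry.rstrip("/")
        # Directories and regular files are checked separately.
        present = path.is_dir() if entry.endswith("/") else path.is_file()
        if not present:
            missing.append(entry)
    return missing
```

An empty return value means all listed files and directories are in place.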

Centrifuge

Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.

Output files

centrifuge/
- database-centrifuge/
  - <database>.*.cf: Centrifuge database files

The directory and the shared basename of the cf files can be given to the Centrifuge command with centrifuge -x /<path>/<to>/<cf_files_basename> etc.
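
The value passed to -x is a path prefix, not a file. As a hedged illustration, this Python sketch derives that prefix from a directory of cf files. It assumes the wildcard in <database>.*.cf is a number (e.g. <database>.1.cf); the function name centrifuge_index_prefix is hypothetical.

```python
import re
from pathlib import Path


def centrifuge_index_prefix(db_dir):
    """Return the path prefix shared by the <database>.<N>.cf files in db_dir."""
    db_dir = Path(db_dir)
    prefixes = set()
    for path in db_dir.glob("*.cf"):
        # Strip the numeric part and the .cf extension to recover the basename.
        match = re.fullmatch(r"(.+)\.\d+\.cf", path.name)
        if match:
            prefixes.add(match.group(1))
    if len(prefixes) != 1:
        raise ValueError(
            f"expected exactly one index basename in {db_dir}, found {sorted(prefixes)}"
        )
    return str(db_dir / prefixes.pop())
```

The returned string can then be used as the argument to centrifuge -x.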

Ganon

ganon classifies genomic sequences against large sets of references efficiently, with integrated download and update of databases (refseq/genbank), taxonomic profiling (ncbi/gtdb), binning and hierarchical classification, customized reporting and more.

Output files

ganon/
- <database>.hibf: main bloom filter index file
- <database>.tax: taxonomy tree used for taxonomy assignment

The directory containing these two files can be given to ganon itself, using the database name as a prefix, e.g., ganon classify -d /<path>/<to>/<database name without extensions>.

Diamond

DIAMOND is an accelerated BLAST-compatible local sequence aligner particularly used for protein alignment.

Output files

diamond/
- <database>.dmnd: DIAMOND dmnd database file

The dmnd file can be given to one of the DIAMOND alignment commands with diamond blast<x/p> -d <your_database>.dmnd etc.

Kaiju

Kaiju is a fast and sensitive taxonomic classifier for metagenomics utilising nucleotide-to-protein translations.

Output files

kaiju/
- <database_name>.fmi: Kaiju FMI index file

The fmi file can be given to Kaiju itself with kaiju -f <your_database>.fmi etc.

KMCP

KMCP is a metagenomic profiling tool focused on prokaryotic and viral sequences.

Output files

kmcp/
- database-kmcp-index/: directory containing KMCP index files

The database-kmcp-index/ directory can be given to KMCP itself with kmcp search --db-dir <your_database>/ etc.; see the kmcp search documentation. Note that the pipeline does not output files from kmcp-compute, as these are not used by downstream tools.

Kraken2

Kraken2 is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.

Output files

kraken2/
- <db_name>/
  - hash.k2d: Kraken2 hash database file
  - opts.k2d: Kraken2 opts database file
  - taxo.k2d: Kraken2 taxo database file
  - library/: Intermediate directory containing FASTAs and related files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
  - taxonomy/: Intermediate directory containing taxonomy files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
  - seqid2taxid.map: Intermediate file mapping sequence IDs to taxonomy IDs of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)

The resulting <db_name>/ directory can be given to Kraken2 itself with kraken2 --db <your_database_name> etc.

KrakenUniq

KrakenUniq is a metagenomics classifier with unique k-mer counting for more specific results.

Output files

krakenuniq/
- <db_name>/
  - database-build.log: KrakenUniq build process log
  - database.idx: KrakenUniq index file
  - database.kdb: KrakenUniq database file
  - taxDB: KrakenUniq taxonomy information file

Note that there may be additional files in this directory; however, the ones listed above are reportedly the required ones.

MALT

MALT is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.

Output files

malt/
- malt_index/: directory containing MALT index files

The malt_index directory can be given to MALT itself with malt-run --index <your_database>/ etc.

sourmash

sourmash is a command-line tool and Python/Rust library for metagenome analysis and genome comparison using k-mers.

Output files

sourmash/
- <your_database>-sourmash-dna-<kmersize>mer.sbt.zip: Default sourmash DNA database file
- <your_database>-sourmash-protein-<kmersize>mer.sbt.zip: Default sourmash AA database file

The database name by default distinguishes the sequence type (dna or protein) and the k-mer size for which the index was created.
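
Since the sequence type and k-mer size are encoded in the file name, they can be recovered programmatically. The Python sketch below parses the default naming pattern shown above; the names SourmashDbName and parse_sourmash_db_name are hypothetical, and it assumes the default file name has not been customised.

```python
import re
from typing import NamedTuple


class SourmashDbName(NamedTuple):
    database: str
    seq_type: str   # "dna" or "protein"
    kmer_size: int


# Mirrors <your_database>-sourmash-<dna|protein>-<kmersize>mer.sbt.zip
_SOURMASH_NAME = re.compile(
    r"(?P<database>.+)-sourmash-(?P<seq_type>dna|protein)-(?P<kmer_size>\d+)mer\.sbt\.zip"
)


def parse_sourmash_db_name(filename):
    """Split a default sourmash database file name into its components."""
    match = _SOURMASH_NAME.fullmatch(filename)
    if match is None:
        raise ValueError(f"not a default sourmash database name: {filename!r}")
    return SourmashDbName(
        match["database"], match["seq_type"], int(match["kmer_size"])
    )
```

This is useful, for example, when choosing the matching k-mer size for query sketches without opening the database itself.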

sylph

sylph is a program that performs ultrafast (1) ANI querying or (2) metagenomic profiling for metagenomic shotgun samples.

Output files

sylph/
- <your_database>-sylph.syldb: sylph multi-genome sketch database file

The <your_database>-sylph.syldb file can be given to sylph itself with sylph profile <your_database>-sylph.syldb <...> etc.

Downstream samplesheets

The pipeline can also generate input files for the following downstream pipelines:

nf-core/taxprofiler

Output files

downstream_samplesheets/
- taxprofiler.csv: Partially filled out nf-core/taxprofiler --databases CSV with paths to database directories relative to the results directory, e.g. nextflow run nf-core/taxprofiler -profile docker --input samplesheet.csv --databases <createtaxdb_outdir>/downstream_samplesheets/<database_name>.csv

Warning

Any generated downstream samplesheet is provided on a ‘best effort’ basis and is not guaranteed to work straight out of the box! It may not be complete (e.g. some columns may need to be manually filled in).

Tip

We highly recommend moving all created database directories to a central ‘cache’ location before running downstream pipelines. This ensures that the database files are not lost if the pipeline is re-run, and also allows you to share the database files with other users.

If you do so, make sure to update the paths in the corresponding downstream samplesheet files accordingly.
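
Updating the paths by hand is error-prone, so a small script may help. The Python sketch below rewrites one path column of a samplesheet after the databases have been moved. The column name db_path is an assumption about the samplesheet layout, and the function name relocate_database_paths is hypothetical; check the header of your own file before using it.

```python
import csv
from pathlib import Path


def relocate_database_paths(samplesheet_in, samplesheet_out, old_root, new_root,
                            path_column="db_path"):
    """Rewrite paths under old_root so that they point into new_root instead.

    path_column is an assumed column name; rows whose path lies outside
    old_root (or is empty) are written back unchanged.
    """
    old_root, new_root = Path(old_root), Path(new_root)
    with open(samplesheet_in, newline="") as handle:
        reader = csv.DictReader(handle)
        fieldnames = reader.fieldnames
        rows = list(reader)

    for row in rows:
        try:
            relative = Path(row[path_column]).relative_to(old_root)
        except ValueError:
            continue  # path is not under old_root, leave it untouched
        row[path_column] = str(new_root / relative)

    with open(samplesheet_out, "w", newline="") as handle:
        writer = csv.DictWriter(handle, fieldnames=fieldnames)
        writer.writeheader()
        writer.writerows(rows)
```

Writing to a separate output file keeps the original samplesheet intact, so the result can be compared before it is passed to the downstream pipeline.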
