nf-core/createtaxdb
Parallelised and automated construction of metagenomic classifier databases of different tools
Introduction
This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
Pipeline overview
The pipeline is built using Nextflow and processes data using the following steps:
- MultiQC - Aggregate report describing results and QC from the whole pipeline
- Pipeline information - Report metrics generated during the workflow execution
- Bracken - Database files for Bracken
- Centrifuge - Database files for Centrifuge
- DIAMOND - Database files for DIAMOND
- Kaiju - Database files for Kaiju
- Kraken2 - Database files for Kraken2
- KrakenUniq - Database files for KrakenUniq
- MALT - Database files for MALT
MultiQC
Output files
multiqc/
- multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
- multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
- multiqc_plots/: directory containing static images from the report in various formats.
MultiQC is a visualization tool that generates a single HTML report summarising all samples in your project. Most of the pipeline QC results are visualised in the report and further statistics are available in the report data directory.
Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.
Pipeline information
Output files
pipeline_info/
- Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
- Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
- Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.
- Parameters used by the pipeline run: params.json.
Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. These reports allow you to troubleshoot errors in a pipeline run, and also provide other information such as launch commands, run times and resource usage.
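As one way to use these reports for troubleshooting, the trace file can be scanned for tasks that did not finish successfully. The following is a minimal sketch, not part of the pipeline. It assumes execution_trace.txt is a tab-separated table with a header row that includes the name, status and exit columns, and that COMPLETED and CACHED are the only statuses that indicate success. The function name failed_tasks is hypothetical.

```python
import csv
from pathlib import Path

# Statuses treated as successful (assumption: cached tasks from a resumed run count as fine).
OK_STATUSES = {"COMPLETED", "CACHED"}


def failed_tasks(trace_path):
    """Return (name, status, exit) for every task whose status is not successful.

    Assumes a tab-separated trace file whose header row contains at least
    the 'name', 'status' and 'exit' columns.
    """
    with Path(trace_path).open(newline="") as handle:
        reader = csv.DictReader(handle, delimiter="\t")
        return [
            (row["name"], row["status"], row["exit"])
            for row in reader
            if row["status"] not in OK_STATUSES
        ]
```

Calling failed_tasks on the trace file in pipeline_info/ returns an empty list when every task succeeded. Otherwise, the returned task names show where to start looking in the execution report.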
Bracken
Bracken (Bayesian Reestimation of Abundance with KrakEN) is a highly accurate statistical method that computes the abundance of species in DNA sequences from a metagenomics sample.
Output files
bracken/
- <db_name>/
  - database100mers.kmer_distrib: Bracken kmer distribution file
  - database100mers.kraken: Bracken index file
  - database.kraken: Bracken database file
  - hash.k2d: Kraken2 hash database file
  - opts.k2d: Kraken2 opts database file
  - taxo.k2d: Kraken2 taxo database file
  - library/: Intermediate Kraken2 directory containing FASTAs and related files of added genomes
  - taxonomy/: Intermediate Kraken2 directory containing taxonomy files of added genomes
  - seqid2taxid.map: Intermediate Kraken2 file mapping sequence IDs of added genomes to taxonomy IDs
Note that all intermediate files are required for a Bracken database, even though Kraken2 itself only requires the *.k2d files.
The resulting <db_name>/ directory can be given to Bracken itself with bracken -d <your_database_name> etc.
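Bracken's read length option (-r) selects the k-mer distribution file to use, so it can help to check which read lengths a database directory supports before running it. The sketch below is illustrative only. It assumes the distribution files follow the database<N>mers.kmer_distrib naming shown in the file list above. The function name available_read_lengths is hypothetical.

```python
import re
from pathlib import Path

# Matches names such as database100mers.kmer_distrib and captures the read length.
_KMER_DISTRIB = re.compile(r"^database(\d+)mers\.kmer_distrib$")


def available_read_lengths(db_dir):
    """Return the sorted read lengths that have a k-mer distribution file in db_dir."""
    lengths = []
    for path in Path(db_dir).iterdir():
        match = _KMER_DISTRIB.match(path.name)
        if match:
            lengths.append(int(match.group(1)))
    return sorted(lengths)
```

If the read length of your data is not in the returned list, the database would need a matching k-mer distribution file before Bracken can use it at that length.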
Centrifuge
Centrifuge is a very rapid and memory-efficient system for the classification of DNA sequences from microbial samples.
Output files
centrifuge/
- <database>.*.cf: Centrifuge database files
The directory path and the shared basename of the cf files can be given to the Centrifuge command with centrifuge -x /<path>/<to>/<cf_files_basename> etc.
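Because centrifuge -x expects the shared basename rather than any single file, a small helper can derive it from the files on disk. This is a minimal sketch, assuming the wildcard in <database>.*.cf is a numeric part such as .1.cf or .2.cf. The function name centrifuge_index_basenames is hypothetical.

```python
import re
from pathlib import Path

# Trailing ".<number>.cf" that distinguishes the individual index files.
_CF_SUFFIX = re.compile(r"\.\d+\.cf$")


def centrifuge_index_basenames(db_dir):
    """Return the sorted, unique index basenames (path minus '.<N>.cf') found in db_dir."""
    basenames = set()
    for path in Path(db_dir).glob("*.cf"):
        stripped, replaced = _CF_SUFFIX.subn("", str(path))
        if replaced:  # skip .cf files that do not follow the numbered pattern
            basenames.add(stripped)
    return sorted(basenames)
```

Each returned string is a value that could be passed to -x. A directory holding a single database yields a one-element list.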
DIAMOND
DIAMOND is an accelerated BLAST-compatible local sequence aligner particularly used for protein alignment.
Output files
diamond/
- <database>.dmnd: DIAMOND dmnd database file
The dmnd file can be given to one of the DIAMOND alignment commands with diamond blast<x/p> -d <your_database>.dmnd etc.
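To make the choice between blastx and blastp explicit when scripting, the command can be assembled as an argument list. This is a sketch under stated assumptions: the query and output paths are placeholders, only DIAMOND's -d (database), -q (query) and -o (output) options are used, and the function name diamond_command is hypothetical.

```python
def diamond_command(mode, database, query, output):
    """Build an argument list for a DIAMOND alignment run.

    mode must be 'blastx' (nucleotide queries) or 'blastp' (protein queries).
    """
    if mode not in ("blastx", "blastp"):
        raise ValueError(f"unsupported DIAMOND mode: {mode!r}")
    return ["diamond", mode, "-d", str(database), "-q", str(query), "-o", str(output)]
```

The returned list can be passed to subprocess.run(..., check=True) on a system where DIAMOND is installed. Building a list rather than a shell string avoids quoting problems with paths that contain spaces.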
Kaiju
Kaiju is a fast and sensitive taxonomic classifier for metagenomics that uses nucleotide-to-protein translations.
Output files
kaiju/
- <database_name>.fmi: Kaiju FMI index file
The fmi file can be given to Kaiju itself with kaiju -f <your_database>.fmi etc.
Kraken2
Kraken2 is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds.
Output files
kraken2/
- <db_name>/
  - hash.k2d: Kraken2 hash database file
  - opts.k2d: Kraken2 opts database file
  - taxo.k2d: Kraken2 taxo database file
  - library/: Intermediate directory containing FASTAs and related files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
  - taxonomy/: Intermediate directory containing taxonomy files of added genomes (only present if --build_bracken or --kraken2_keepintermediate supplied)
  - seqid2taxid.map: Intermediate file mapping sequence IDs of added genomes to taxonomy IDs (only present if --build_bracken or --kraken2_keepintermediate supplied)
The resulting <db_name>/ directory can be given to Kraken2 itself with kraken2 --db <your_database_name> etc.
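Since the intermediate files are only kept in some configurations, it can be useful to check what a given database directory contains. The following is a minimal sketch based solely on the file names listed above. The function name describe_kraken2_db and the returned dictionary keys are hypothetical.

```python
from pathlib import Path

# Files Kraken2 itself needs, and the optional intermediates, as listed in this section.
REQUIRED_K2D = ("hash.k2d", "opts.k2d", "taxo.k2d")
INTERMEDIATES = ("library", "taxonomy", "seqid2taxid.map")


def describe_kraken2_db(db_dir):
    """Report which required *.k2d files are missing and which intermediates are present."""
    db = Path(db_dir)
    missing = [name for name in REQUIRED_K2D if not (db / name).is_file()]
    present = [name for name in INTERMEDIATES if (db / name).exists()]
    return {"missing_k2d": missing, "intermediates_present": present}
```

An empty missing_k2d list suggests the directory is usable with kraken2 --db. The intermediates_present list shows whether the intermediate files were retained.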
KrakenUniq
KrakenUniq is a metagenomics classifier with unique k-mer counting for more specific results.
Output files
krakenuniq/
- <db_name>/
  - database-build.log: KrakenUniq build process log
  - database.idx: KrakenUniq index file
  - database.kdb: KrakenUniq database file
  - taxDB: KrakenUniq taxonomy information file
Note that there may be additional files in this directory; however, the ones listed above are reportedly the required ones.
MALT
MALT is a fast replacement for BLASTX, BLASTP and BLASTN, and provides both local and semi-global alignment capabilities.
Output files
malt/
- malt_index/: directory containing MALT index files
The malt_index directory can be given to MALT itself with malt-run --index <your_database>/ etc.