Reports & Summaries

Overview
Sample Reads, Assemblies, & Taxonomy
Cluster Assignments
Core Genome Alignments
Recombination
SNP Variants
Quality Control
Distances
Summary
Phylogeny
Reports

This page describes results saved to --outdir. Files written to the BigBacter database (--db) are described on the BigBacter Database page.

Overview

All outputs intended for routine interpretation and reporting are organized under a run-specific subdirectory named by Unix timestamp (${timestamp}). This allows results from multiple runs to be saved to a common --outdir without overwriting previous results and also provides baked-in traceability 🥖.

Below is an overview of the standard outputs produced by BigBacter.

${outdir}/
├── ${timestamp}
│   ├── ${sample}
│   │   ├── asm
│   │   │   └── ${sample}.fa.gz
│   │   ├── taxa
│   │   │   └── ${sample}_gambit.csv
│   │   └── reads
│   │       ├── ${sra}_1.fastq.gz
│   │       └── ${sra}_2.fastq.gz
│   ├── ${taxa}
│   │   ├── ${cluster}
│   │   │   ├── aln
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.aln
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_full.aln
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_full.csv
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.masked.aln
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_full.masked.aln
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}_full.masked.csv
│   │   │   ├── dist
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_db-dist.csv
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_snp-dist.csv
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_db-dist.masked.csv
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}_snp-dist.masked.csv
│   │   │   ├── qc
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_cg-plot.html
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}_cg-plot.masked.html
│   │   │   ├── recomb
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.vcf
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.bed
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}.per_branch_statistics.csv
│   │   │   ├── reports
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.microreact
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}.masked.microreact
│   │   │   ├── summary
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}_summary.csv
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}_summary.masked.csv
│   │   │   ├── tree
│   │   │   │   ├── ${timestamp}-${taxa}-${cluster}.nwk
│   │   │   │   └── ${timestamp}-${taxa}-${cluster}.masked.nwk
│   │   │   └── var
│   │   │       └── ${sample}.tar.gz
│   │   └── cluster
│   │       ├── clusters.csv
│   │       └── global_containment.csv
│   └── multiqc
│       └── multiqc_report.html
└── pipeline_info
    └── ...
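
Because run directories are named by Unix timestamp, the most recent run can be located by sorting the directory names numerically. The sketch below is illustrative only: the helper names (`list_runs`, `latest_run`, `summary_files`) are hypothetical and not part of BigBacter, and it assumes that run directories are the only all-digit directory names under --outdir.

```python
from pathlib import Path


def list_runs(outdir):
    """Return run directories under outdir, oldest first.

    Assumption: run directories are the only directories whose names
    consist entirely of digits (Unix timestamps). Other directories,
    such as pipeline_info, are skipped.
    """
    runs = [p for p in Path(outdir).iterdir() if p.is_dir() and p.name.isdigit()]
    return sorted(runs, key=lambda p: int(p.name))


def latest_run(outdir):
    """Return the most recent run directory, or None if there are none."""
    runs = list_runs(outdir)
    return runs[-1] if runs else None


def summary_files(run_dir):
    """Return unmasked per-cluster summary tables for one run.

    Follows the layout ${timestamp}/${taxa}/${cluster}/summary/ shown above.
    The pattern ends in "_summary.csv", so "*_summary.masked.csv" is excluded.
    """
    return sorted(Path(run_dir).glob("*/*/summary/*_summary.csv"))
```

For example, `summary_files(latest_run("results"))` would list the unmasked summary tables of the newest run, where `results` stands in for whatever path was passed to --outdir.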

Sample Reads, Assemblies, & Taxonomy

Raw reads from NCBI SRA, de novo genome assemblies from GenBank or created via Shovill, and taxonomic classifications determined via GAMBIT are published in sample-specific subdirectories.

│   ├── ${sample}
│   │   ├── asm
│   │   │   └── ${sample}.fa.gz
│   │   ├── taxa
│   │   │   └── ${sample}_gambit.csv
│   │   └── reads
│   │       ├── ${sra}_1.fastq.gz
│   │       └── ${sra}_2.fastq.gz

File	Description
`${sample}.fa.gz`	Compressed genome assembly in FASTA format
`${sample}_gambit.csv`	GAMBIT taxonomic classification results
`${sra}_1.fastq.gz`	Forward reads downloaded from NCBI SRA
`${sra}_2.fastq.gz`	Reverse reads downloaded from NCBI SRA

Cluster Assignments

Samples are assigned to clusters within each taxon using MinHash-based dissimilarity. Cluster assignments and global containment scores are published at the taxon level.

│   └── cluster
│       ├── clusters.csv
│       └── global_containment.csv

File	Description
`clusters.csv`	Per-sample cluster assignments for the current run
`global_containment.csv`	Per-sample MinHash global containment scores for the nearest matching cluster

Core Genome Alignments

Core genome SNP alignments produced by Polycore are published per cluster. Files with .masked in the filename are produced using the recombination-masked alignment from Gubbins.

│   └── aln
│       ├── ${timestamp}-${taxa}-${cluster}.aln
│       ├── ${timestamp}-${taxa}-${cluster}_full.aln
│       ├── ${timestamp}-${taxa}-${cluster}_full.csv
│       ├── ${timestamp}-${taxa}-${cluster}.masked.aln
│       ├── ${timestamp}-${taxa}-${cluster}_full.masked.aln
│       └── ${timestamp}-${taxa}-${cluster}_full.masked.csv

File	Description
`*.aln`	Core genome SNP alignment in FASTA format
`*_full.aln`	Full core genome alignment including invariant sites
`*_full.csv`	Per-site summary of the full alignment
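
Since the alignments are plain FASTA, the number of variable sites can be counted without extra dependencies. The following is a rough sketch under stated assumptions: the helper names are hypothetical, only unambiguous A/C/G/T calls are compared, and gaps, N, and other ambiguity codes are ignored when deciding whether a column varies.

```python
def read_fasta(path):
    """Read a FASTA file into a dict of {sequence name: sequence}."""
    seqs = {}
    name = None
    with open(path) as handle:
        for line in handle:
            line = line.strip()
            if not line:
                continue
            if line.startswith(">"):
                name = line[1:].split()[0]   # keep the ID, drop any description
                seqs[name] = []
            elif name is not None:
                seqs[name].append(line.upper())
    return {key: "".join(parts) for key, parts in seqs.items()}


def count_variable_sites(seqs):
    """Count alignment columns with more than one distinct A/C/G/T base."""
    lengths = {len(seq) for seq in seqs.values()}
    if len(lengths) > 1:
        raise ValueError("sequences differ in length; not an alignment")
    variable = 0
    for column in zip(*seqs.values()):
        bases = {base for base in column if base in "ACGT"}
        if len(bases) > 1:
            variable += 1
    return variable
```

Running this on both the standard and the `.masked` alignment of a cluster gives a simple view of how many variable sites were removed by recombination masking.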

Recombination

Recombinant regions identified by Gubbins are published per cluster. These files are only produced when recombination masking is enabled and the cluster has sufficient samples.

│   └── recomb
│       ├── ${timestamp}-${taxa}-${cluster}.vcf
│       ├── ${timestamp}-${taxa}-${cluster}.bed
│       └── ${timestamp}-${taxa}-${cluster}.per_branch_statistics.csv

File	Description
`*.vcf`	Recombinant SNPs identified by Gubbins in VCF format
`*.bed`	Recombinant regions in BED format
`*.per_branch_statistics.csv`	Per-branch recombination statistics
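
To estimate how much of the alignment is covered by recombinant regions, the BED intervals can be merged before summing, so that overlapping regions are not counted twice. This sketch assumes standard BED conventions (whitespace-separated columns, 0-based start, exclusive end); whether the BigBacter BED file contains overlapping intervals is not stated here, and the helper names are hypothetical.

```python
from collections import defaultdict


def read_bed(path):
    """Read BED intervals into {chrom: [(start, end), ...]}."""
    intervals = defaultdict(list)
    with open(path) as handle:
        for line in handle:
            # Skip blank lines, comments, and optional BED header lines.
            if not line.strip() or line.startswith(("#", "track", "browser")):
                continue
            fields = line.split()
            intervals[fields[0]].append((int(fields[1]), int(fields[2])))
    return intervals


def covered_length(intervals):
    """Return the number of positions covered after merging overlaps."""
    total = 0
    for spans in intervals.values():
        current_start = current_end = None
        for start, end in sorted(spans):
            if current_end is None or start > current_end:
                # Close the previous merged interval and start a new one.
                if current_end is not None:
                    total += current_end - current_start
                current_start, current_end = start, end
            else:
                # Overlapping or adjacent: extend the current interval.
                current_end = max(current_end, end)
        if current_end is not None:
            total += current_end - current_start
    return total
```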

SNP Variants

Per-sample Snippy output tarballs are published per cluster. These files are also published to the BigBacter database (--db) when using --push true.

│   └── var
│       └── ${sample}.tar.gz

File	Description
`${sample}.tar.gz`	Compressed Snippy output directory for each sample
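
The contents of a tarball can be inspected without unpacking it. A minimal sketch using the standard `tarfile` module follows; `list_tarball_files` is a hypothetical helper, and no particular file names inside the archive are assumed.

```python
import tarfile


def list_tarball_files(path):
    """Return the sorted names of regular files inside a .tar.gz archive."""
    with tarfile.open(path, "r:gz") as tar:
        return sorted(member.name for member in tar.getmembers() if member.isfile())
```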

Quality Control

A per-cluster core genome plot is produced by Polycore, and a MultiQC report aggregating per-sample QC metrics is produced at the run level. Files with .masked in the filename are produced using the recombination-masked alignment from Gubbins.

│   ├── qc
│   │   ├── ${timestamp}-${taxa}-${cluster}_cg-plot.html
│   │   └── ${timestamp}-${taxa}-${cluster}_cg-plot.masked.html
└── multiqc
    └── multiqc_report.html

File	Description
`*_cg-plot.html`	Interactive plot of core genome size and SNP density per cluster
`multiqc_report.html`	Aggregated QC report including FastQC and fastp metrics for all samples

Distances

Pairwise MinHash and core SNP distance matrices are published per cluster. Files with .masked in the filename are produced using the recombination-masked alignment from Gubbins.

│   └── dist
│       ├── ${timestamp}-${taxa}-${cluster}_db-dist.csv
│       ├── ${timestamp}-${taxa}-${cluster}_snp-dist.csv
│       ├── ${timestamp}-${taxa}-${cluster}_db-dist.masked.csv
│       └── ${timestamp}-${taxa}-${cluster}_snp-dist.masked.csv

File	Description
`*_db-dist.csv`	Pairwise MinHash distances computed by Floc during clustering
`*_snp-dist.csv`	Pairwise core SNP distances
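
A common use of the distance files is listing sample pairs within a distance threshold. The sketch below makes a strong assumption that should be checked against an actual file first: it expects a square matrix with sample IDs in the header row and in the first column. If the files use a different layout (for example one row per pair), the reader will need to be adapted. Both helper names are hypothetical.

```python
import csv


def read_square_matrix(path):
    """Read a square distance matrix into {row_id: {col_id: distance}}.

    Assumed layout (not confirmed by this page): the first row holds
    column sample IDs and the first column holds row sample IDs.
    """
    with open(path, newline="") as handle:
        rows = list(csv.reader(handle))
    columns = rows[0][1:]
    matrix = {}
    for row in rows[1:]:
        matrix[row[0]] = {col: float(val) for col, val in zip(columns, row[1:])}
    return matrix


def pairs_within(matrix, threshold):
    """Return (sample_a, sample_b, distance) for pairs at or below threshold.

    Each unordered pair is reported once, sorted by increasing distance.
    """
    pairs = []
    for a, row in matrix.items():
        for b, dist in row.items():
            if a < b and dist <= threshold:
                pairs.append((a, b, dist))
    return sorted(pairs, key=lambda pair: pair[2])
```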

Summary

A summary table combining cluster assignments, QC metrics, and SNP distances is published per cluster. Files with .masked in the filename are produced using the recombination-masked alignment from Gubbins.

│   └── summary
│       ├── ${timestamp}-${taxa}-${cluster}_summary.csv
│       └── ${timestamp}-${taxa}-${cluster}_summary.masked.csv

File	Description
`*_summary.csv`	Per-sample summary including cluster assignment, core genome size, and closest neighbor

Phylogeny

A maximum likelihood phylogenetic tree is produced per cluster for clusters with sufficient samples. Files with .masked in the filename are produced using the recombination-masked alignment from Gubbins.

│   └── tree
│       ├── ${timestamp}-${taxa}-${cluster}.nwk
│       └── ${timestamp}-${taxa}-${cluster}.masked.nwk

File	Description
`*.nwk`	Maximum likelihood phylogenetic tree in Newick format produced by IQ-TREE
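
To check which samples ended up in a tree, the leaf names can be pulled from the Newick string with a small scanner. This is a simplified sketch rather than a full Newick parser: `newick_leaves` is a hypothetical helper, and it assumes unquoted labels and no bracketed comments. Labels that follow a closing parenthesis (internal node names or support values) are skipped.

```python
def newick_leaves(text):
    """Return leaf labels from a Newick string, in order of appearance."""
    leaves = []
    prev = ""   # last structural character seen: "(", ")", ",", ";" or ""
    i = 0
    while i < len(text):
        ch = text[i]
        if ch in "(),;":
            prev = ch
            i += 1
        elif ch == ":":
            # Skip the branch length that follows the colon.
            i += 1
            while i < len(text) and text[i] not in "(),;":
                i += 1
        elif ch.isspace():
            i += 1
        else:
            # Read a label up to the next structural character.
            j = i
            while j < len(text) and text[j] not in "(),:;":
                j += 1
            label = text[i:j].strip()
            # A label directly after "(" or "," (or at the very start) is a leaf.
            if prev in ("(", ",", ""):
                leaves.append(label)
            i = j
    return leaves
```

For instance, `newick_leaves("((A:0.1,B:0.2)95:0.05,C:0.3);")` returns `["A", "B", "C"]`.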

Reports

A Microreact report is produced per cluster, combining the phylogenetic tree, Floc and SNP distance matrices, per-sample summary, and core genome plot. When recombination masking is enabled, two reports are produced: one using the standard outputs and one using the masked outputs.

│   └── reports
│       ├── ${timestamp}-${taxa}-${cluster}.microreact
│       └── ${timestamp}-${taxa}-${cluster}.masked.microreact

File	Description
`*.microreact`	Microreact project file for interactive visualization at microreact.org
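
Microreact project files are JSON documents, so their top-level structure can be inspected before uploading. The sketch below assumes the file parses as a single JSON object and makes no assumption about which keys it contains; `microreact_sections` is a hypothetical helper.

```python
import json


def microreact_sections(path):
    """Return the sorted top-level keys of a .microreact project file."""
    with open(path, encoding="utf-8") as handle:
        project = json.load(handle)
    if not isinstance(project, dict):
        raise ValueError("expected a JSON object at the top level")
    return sorted(project)
```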