BigBacter Database
Overview
BigBacter maintains a structured database of genomic files required for routine bacterial surveillance analysis. Each execution compares new samples from the input samplesheet (--input) against historical samples stored in the database (--db).
Samples are only saved to the database when you run BigBacter using the --push true parameter.
This page details the database structure, file formats, and the purpose of each component within a BigBacter database.
Database Structure
The BigBacter database follows a hierarchical organization based on taxonomic classification and clustering results:
${db}/
└── ${taxa}/
    ├── clusters/
    │   └── ${cluster}/
    │       ├── asm/
    │       │   ├── ${sample}.fa.gz
    │       │   └── ref.fa.gz
    │       ├── aux/
    │       │   └── ref.json.gz
    │       └── var/
    │           └── ${sample}.tar.gz
    └── sig/
        └── ${sample}.sig.gz
Directory Structure Explanation
- ${db}: Root database directory
- ${taxa}: Species-level taxonomic identifier (e.g., *Escherichia coli*)
- clusters/${cluster}: Cluster-specific data grouped by genomic similarity using Floc (e.g., clusters/1)
- sig/: Species-level signature files for cluster assignment
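As an illustration of the layout described above, the following Python sketch builds the expected file paths for a single sample. The function name `sample_paths`, the dictionary keys, and the example values (including the underscore in the taxon directory name) are hypothetical and chosen only for this example; BigBacter itself may construct or name these paths differently.

```python
from pathlib import Path


def sample_paths(db: str, taxa: str, cluster: str, sample: str) -> dict[str, Path]:
    """Build the expected database paths for one sample (hypothetical helper)."""
    taxa_dir = Path(db) / taxa
    cluster_dir = taxa_dir / "clusters" / cluster
    return {
        # Taxon-level file
        "signature": taxa_dir / "sig" / f"{sample}.sig.gz",
        # Cluster-level files
        "assembly": cluster_dir / "asm" / f"{sample}.fa.gz",
        "reference": cluster_dir / "asm" / "ref.fa.gz",
        "reference_info": cluster_dir / "aux" / "ref.json.gz",
        "variants": cluster_dir / "var" / f"{sample}.tar.gz",
    }


# Example values are placeholders, not names taken from a real database.
paths = sample_paths("db", "Escherichia_coli", "1", "Sample01")
for kind, path in paths.items():
    print(f"{kind}: {path}")
```

Only the signature lives at the taxon level; every other file depends on the cluster the sample was assigned to.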
Taxon-Level Files
Floc Signatures (sig/)
Purpose
Species-wide genomic signatures used for cluster assignment of new samples.
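Contents
| File | Description |
|---|---|
| ${sample}.sig.gz | Gzip-compressed Sourmash signature (JSON) for an individual sample |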
Example
[
  {
    "class": "sourmash_signature",
    "email": "",
    "hash_function": "0.murmur64",
    "filename": "1",
    "name": "Sample01",
    "license": "CC0",
    "signatures": [...],
    "version": 0.4
  }
]
Floc repurposes the filename field in the Sourmash signature to store cluster information.
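A minimal sketch of reading the cluster assignment back out of a signature file, assuming the file is gzip-compressed JSON with the structure shown in the example above. The function name `clusters_from_signature` is hypothetical, and the demo writes a stand-in record containing only the two fields it needs rather than a real signature.

```python
import gzip
import json
import os
import tempfile


def clusters_from_signature(sig_path: str) -> dict[str, str]:
    """Map each record's sample name to the cluster stored in its 'filename' field."""
    with gzip.open(sig_path, "rt", encoding="utf-8") as handle:
        records = json.load(handle)
    return {record["name"]: record["filename"] for record in records}


# Demo with a stand-in record (not a complete Sourmash signature).
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Sample01.sig.gz")
    with gzip.open(demo_path, "wt", encoding="utf-8") as handle:
        json.dump([{"name": "Sample01", "filename": "1"}], handle)
    print(clusters_from_signature(demo_path))  # {'Sample01': '1'}
```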
Cluster-Level Files
Each cluster maintains three distinct data types organized into separate directories:
Assembly Files (asm/)
Purpose
Storage of genomic assemblies in FASTA format for reference and sample comparison.
Contents
| File | Description |
|---|---|
| ref.fa.gz | Cluster-specific reference assembly used for core genome analysis |
| ${sample}.fa.gz | Individual sample assemblies (not currently used; retained for future development) |
Reference assemblies are immutable (cannot be changed)!
Auxiliary Files (aux/)
Purpose
Storage of auxiliary (supplemental) information.
Contents
| File | Description |
|---|---|
| ref.json.gz | Reference metadata, including the original sample name, number of contigs, and length |
Example:
{
  "length": 4840898,
  "n_contigs": 3,
  "name": "Sample01"
}
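One way to use this metadata is as a consistency check against the reference assembly. The sketch below counts contigs and bases in a gzip-compressed FASTA file and compares them with the JSON values. It assumes that "length" is the total number of bases across all contigs, which the example above does not state explicitly; the function names `fasta_stats` and `reference_matches` are hypothetical and the demo files are tiny stand-ins.

```python
import gzip
import json
import os
import tempfile


def fasta_stats(fasta_gz: str) -> dict[str, int]:
    """Count contigs and total bases in a gzip-compressed FASTA file."""
    n_contigs = 0
    length = 0
    with gzip.open(fasta_gz, "rt", encoding="utf-8") as handle:
        for line in handle:
            if line.startswith(">"):
                n_contigs += 1  # each header line starts a new contig
            else:
                length += len(line.strip())
    return {"n_contigs": n_contigs, "length": length}


def reference_matches(fasta_gz: str, json_gz: str) -> bool:
    """Return True if the JSON metadata agrees with the FASTA contents."""
    with gzip.open(json_gz, "rt", encoding="utf-8") as handle:
        meta = json.load(handle)
    stats = fasta_stats(fasta_gz)
    # Assumption: "length" means total bases summed over all contigs.
    return all(meta[key] == stats[key] for key in ("n_contigs", "length"))


# Demo with stand-in files: two contigs, eight bases in total.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = os.path.join(tmp, "ref.fa.gz")
    json_path = os.path.join(tmp, "ref.json.gz")
    with gzip.open(fasta_path, "wt", encoding="utf-8") as handle:
        handle.write(">contig1\nACGT\nAC\n>contig2\nGG\n")
    with gzip.open(json_path, "wt", encoding="utf-8") as handle:
        json.dump({"length": 8, "n_contigs": 2, "name": "Demo"}, handle)
    print(reference_matches(fasta_path, json_path))  # True
```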
Variant Files (var/)
Purpose
Per-sample variant data used for core genome phylogenetics.
Contents
| File | Description |
|---|---|
| ${sample}.tar.gz | Gzip-compressed tarball containing the minimal Snippy output needed for core SNP analysis |
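To see what a given tarball holds without unpacking it, the archive can be listed with Python's standard `tarfile` module. This is only an inspection sketch: the function name `list_variant_files` is hypothetical, and the demo archive contains a placeholder file, because the exact set of Snippy files that BigBacter retains is not listed here.

```python
import io
import os
import tarfile
import tempfile


def list_variant_files(tar_path: str) -> list[str]:
    """Return the names of regular files inside a gzip-compressed tarball."""
    with tarfile.open(tar_path, "r:gz") as archive:
        return [member.name for member in archive.getmembers() if member.isfile()]


# Demo with a stand-in archive; the member name is a placeholder,
# not an actual Snippy output file.
with tempfile.TemporaryDirectory() as tmp:
    demo_path = os.path.join(tmp, "Sample01.tar.gz")
    payload = b"placeholder"
    info = tarfile.TarInfo(name="Sample01/placeholder.txt")
    info.size = len(payload)
    with tarfile.open(demo_path, "w:gz") as archive:
        archive.addfile(info, io.BytesIO(payload))
    print(list_variant_files(demo_path))  # ['Sample01/placeholder.txt']
```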