BigBacter Database

  1. Overview
  2. Database Structure
    1. Directory Structure Explanation
  3. Taxon-Level Files
    1. Floc Signatures (sig/)
      1. Purpose
      2. Contents
  4. Cluster-Level Files
    1. Assembly Files (asm/)
      1. Purpose
      2. Contents
    2. Auxiliary Files (aux/)
      1. Purpose
      2. Contents
    3. Variant Files (var/)
      1. Purpose
      2. Contents

Overview

BigBacter maintains a structured database of genomic files required for routine bacterial surveillance analysis. Each execution compares new samples from the input samplesheet (--input) against historical samples stored in the database (--db).

Samples are saved to the database only when you run BigBacter with --push true.

This page details the database structure, file formats, and the purpose of each component within a BigBacter database.

Database Structure

The BigBacter database follows a hierarchical organization based on taxonomic classification and clustering results:

${db}/
└── ${taxa}/
    ├── clusters/
    │   └── ${cluster}/
    │       ├── asm/
    │       │   ├── ${sample}.fa.gz
    │       │   └── ref.fa.gz
    │       ├── aux/
    │       │   └── ref.json.gz
    │       └── var/
    │           └── ${sample}.tar.gz
    └── sig/
        └── ${sample}.sig.gz

Directory Structure Explanation

  • ${db}: Root database directory
  • ${taxa}: Species-level taxonomic identifier (e.g., *Escherichia coli*)
  • clusters/${cluster}: Cluster-specific data grouped by genomic similarity using Floc (e.g., clusters/1)
  • sig/: Species-level signature files for cluster assignment
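The layout above can be walked with a few lines of Python. The following is a minimal sketch, not part of BigBacter itself: the helper names list_clusters and cluster_samples are hypothetical, and it assumes only the directory structure shown on this page.

```python
from pathlib import Path


def list_clusters(db: Path) -> dict:
    """Map each taxon directory under `db` to its sorted cluster names.

    Hypothetical helper: assumes the ${db}/${taxa}/clusters/${cluster} layout.
    """
    layout = {}
    for taxon_dir in sorted(p for p in db.iterdir() if p.is_dir()):
        clusters_dir = taxon_dir / "clusters"
        clusters = []
        if clusters_dir.is_dir():
            # Cluster names are sorted as strings, so "10" sorts before "2".
            clusters = sorted(c.name for c in clusters_dir.iterdir() if c.is_dir())
        layout[taxon_dir.name] = clusters
    return layout


def cluster_samples(db: Path, taxon: str, cluster: str) -> list:
    """List samples with an assembly in a cluster, excluding ref.fa.gz."""
    suffix = ".fa.gz"
    asm_dir = db / taxon / "clusters" / cluster / "asm"
    return sorted(
        f.name[: -len(suffix)]
        for f in asm_dir.glob("*" + suffix)
        if f.name != "ref.fa.gz"
    )
```

For example, list_clusters(Path("/path/to/db")) would return a dictionary keyed by taxon directory name, where the path is a placeholder for your own --db location.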

Taxon-Level Files

Floc Signatures (sig/)

Purpose

Species-wide genomic signatures used for cluster assignment of new samples.

Contents

File               Description
${sample}.sig.gz   Sourmash MinHash signatures generated by Floc, stored as gzip-compressed JSON

Example

[
  {
    "class": "sourmash_signature",
    "email": "",
    "hash_function": "0.murmur64", 
    "filename": "1",
    "name": "Sample01",
    "license": "CC0",
    "signatures": [...],
    "version": 0.4
  }
]

Floc repurposes the filename field in the Sourmash signature to store the sample's cluster assignment. In the example above, "filename": "1" indicates that Sample01 belongs to cluster 1.
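To look up which cluster a stored sample belongs to, the signature file can be read directly. This is a minimal sketch that assumes the file is gzip-compressed JSON shaped like the example above; read_signature is a hypothetical helper, not a Floc or Sourmash function.

```python
import gzip
import json
from pathlib import Path


def read_signature(path: Path) -> list:
    """Return the sample name and cluster for each record in a signature file.

    Assumes a gzip-compressed JSON list whose records carry "name" and
    "filename" fields, with "filename" holding the cluster (see above).
    """
    with gzip.open(path, "rt", encoding="utf-8") as handle:
        records = json.load(handle)
    return [{"sample": r["name"], "cluster": r["filename"]} for r in records]
```

Only the standard library is used here, so the sketch does not depend on a particular Sourmash version.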


Cluster-Level Files

Each cluster maintains three distinct data types organized into separate directories:

Assembly Files (asm/)

Purpose

Storage of genomic assemblies in FASTA format for reference and sample comparison.

Contents

File              Description
ref.fa.gz         Cluster-specific reference assembly used for core genome analysis
${sample}.fa.gz   Individual sample assemblies (not currently used; retained for future development)

Reference assemblies are immutable: they cannot be changed after they are created.

Auxiliary Files (aux/)

Purpose

Storage of auxiliary (supplemental) information.

Contents

File          Description
ref.json.gz   Reference metadata, including the original sample name, number of contigs, and length

Example

{
  "length": 4840898,
  "n_contigs": 3,
  "name": "Sample01"
}
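The reference metadata for every cluster of a taxon can be collected in one pass. The sketch below assumes the aux/ref.json.gz location from the directory tree and the JSON fields from the example above; reference_metadata is a hypothetical helper name.

```python
import gzip
import json
from pathlib import Path


def reference_metadata(db: Path, taxon: str) -> dict:
    """Map each cluster of `taxon` to the parsed contents of its aux/ref.json.gz."""
    metadata = {}
    clusters_dir = db / taxon / "clusters"
    for ref_json in sorted(clusters_dir.glob("*/aux/ref.json.gz")):
        # The path is .../clusters/${cluster}/aux/ref.json.gz,
        # so the cluster name is two levels above the file.
        cluster = ref_json.parent.parent.name
        with gzip.open(ref_json, "rt", encoding="utf-8") as handle:
            metadata[cluster] = json.load(handle)
    return metadata
```

Clusters without a ref.json.gz file are simply absent from the result.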

Variant Files (var/)

Purpose

Per-sample variant data used for core genome phylogenetics.

Contents

File               Description
${sample}.tar.gz   Compressed tarball containing the minimal Snippy output needed for core SNP analysis
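The contents of a variant archive can be inspected without extracting it. This minimal sketch uses Python's standard tarfile module and makes no assumption about which Snippy files the archive holds; list_variant_files is a hypothetical helper name.

```python
import tarfile
from pathlib import Path


def list_variant_files(archive: Path) -> list:
    """Return the sorted names of the regular files inside a .tar.gz archive."""
    with tarfile.open(archive, "r:gz") as tar:
        return sorted(m.name for m in tar.getmembers() if m.isfile())
```

Listing members this way avoids writing anything to disk, which is useful when checking many samples in a large database.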

