About
Data Science Training Team: Data Catalog
Business need
- Genomic sequencing and related Advanced Molecular Detection (AMD) work at the Washington State Department of Health (WADOH) requires extensive interdisciplinary collaboration.
- AMD data are generated, transformed, analyzed, and shared by many different teams, including:
  - Bioinformatics: this team generates assembly data from raw reads produced by the Washington State Public Health Laboratory (PHL), validates new pipelines, and submits assembly data to public repositories (such as those hosted by NCBI: GenBank, RefSeq, and others).
  - Data Integration and Quality Assurance: this team extracts and transforms sequencing metadata submitted by sequencing laboratories and links these data to case data within the state public health surveillance system.
  - Molecular Epi: this team analyzes transformed assembly data from multiple sources and generates reports for internal and external stakeholders.
- Data are stored in siloed environments with limited access and little or no information on how to request it. Teams often have no visibility into which data are being generated and owned by other teams at WADOH.
- There is a need for greater visibility around AMD data at WADOH. Data need to be discoverable and explorable. Downstream users need to understand the lineage of the data they are using.
Project goals
We assembled a Data Science Training Team with representation from three distinct AMD teams at WADOH (described above). This team set out to build a model data catalog that demonstrates one way to address the need for greater visibility into AMD data and their lineage.
The goals of the data catalog were to:
- Explore metadata fields required for a minimum viable product (MVP)
- Visualize data lineage through a Directed Acyclic Graph (DAG)
- Build an example infrastructure that could be used to make AMD data explorable and discoverable at WADOH
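To make the first two goals concrete, here is a minimal sketch of how catalog entries and their lineage could be modeled as a DAG. It is written in Python for illustration only and is not the catalog's actual implementation, which is built in R. The field names (`owner_team`, `description`, `upstream`), the dataset names, and the edges between them are hypothetical; they are loosely based on the team descriptions above.

```python
from dataclasses import dataclass
from graphlib import TopologicalSorter


@dataclass(frozen=True)
class DatasetRecord:
    """One catalog entry. All field names are hypothetical MVP metadata fields."""
    name: str
    owner_team: str
    description: str
    upstream: tuple[str, ...] = ()  # names of the datasets this one is derived from


# Illustrative entries only; the real catalog's datasets and edges may differ.
CATALOG = {
    record.name: record
    for record in [
        DatasetRecord("raw_reads", "Public Health Laboratory", "Raw sequencing reads"),
        DatasetRecord("assemblies", "Bioinformatics", "Assembly data",
                      upstream=("raw_reads",)),
        DatasetRecord("sequencing_metadata", "Data Integration and Quality Assurance",
                      "Metadata submitted by sequencing laboratories"),
        DatasetRecord("linked_case_data", "Data Integration and Quality Assurance",
                      "Sequencing metadata linked to case data",
                      upstream=("sequencing_metadata",)),
        DatasetRecord("epi_reports", "Molecular Epi", "Stakeholder reports",
                      upstream=("assemblies", "linked_case_data")),
    ]
}


def lineage(name: str, catalog: dict[str, DatasetRecord]) -> list[str]:
    """Return every upstream ancestor of `name`, with sources listed before
    the datasets derived from them."""
    # Collect all ancestors with a depth-first walk over the upstream edges.
    ancestors: set[str] = set()
    stack = [name]
    while stack:
        current = stack.pop()
        for parent in catalog[current].upstream:
            if parent not in ancestors:
                ancestors.add(parent)
                stack.append(parent)

    # Map each ancestor to its predecessors, then order them topologically.
    # TopologicalSorter raises graphlib.CycleError if the graph is not acyclic.
    graph = {node: catalog[node].upstream for node in sorted(ancestors)}
    return list(TopologicalSorter(graph).static_order())


if __name__ == "__main__":
    for dataset in lineage("epi_reports", CATALOG):
        print(f"{dataset} (owned by {CATALOG[dataset].owner_team})")
```

Storing only the direct upstream datasets on each record keeps the metadata small, and the full lineage for any dataset can still be derived on demand. The same edge list could feed a DAG visualization.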
Infrastructure
- Built with Shinylive for R, a package that runs Shiny applications entirely in the browser. It uses WebAssembly to execute the R code client-side, so no Shiny server is required.
- Packaged into a Quarto website. This allows for easy web design, theming, and page organization. Quarto supports running R code blocks (along with other languages) and has built-in Observable JS support.
- Deployed with GitHub Actions and published on GitHub Pages. A GitHub Actions workflow renders the site and updates GitHub Pages automatically whenever a production release is published in the GitHub repository. GitHub Pages hosts the resulting static site, so no application server is needed.
Disclaimer
This work was executed as part of the CSTE Data Science Training Team program. The primary goal of this work was to increase data science skills and knowledge. This Data Catalog is not intended to be used as a final or scalable product.