Back to Article
Template Submitters Script
Download Notebook

Template Submitters Script

Author

DSSU/DIQA

Published

March 26, 2024

In [1]:
knitr::opts_chunk$set(eval = FALSE)

Overview

Current Run Schedule: M, W, F @ 11:00AM
This script must run prior to fuzzy matching and roster compile scripts on roster days.

Sometimes adhoc submissions are added the morning of a roster, so run time should allow some opportunity for external parties to drop new files to be included in the day’s roster.

Where do the input files come from? This script receives patient-level sequencing information submitted by laboratories in a standard template format as .csv or .xlsx files.

What happens during the script? The patient-level information contained in the submission files is joined to COVID-19 genomic sequencing results in the WDRS ELR Entire table. Records are matched on accession number (exact match) and specimen collection date (within a margin of 14 days). Then, entries undergo cleaning, variables of interest are generated, QA checks are performed, and the results are transformed into the sequencing roster format.

What is output from this script?

  • fuzzy_match: Records that contain demographic information but cannot be matched using clinical accession number are sent to the FUZZY_MATCHING script

  • write_roster_here: Valid matched records are compiled and output to the write_roster_here folder for inclusion in the sequencing roster

  • For_Review/to_process: Invalid submissions that do not have a sequence status or FAILED or LOW QUALITY are sent to For_Review/to_process folder in the manual review format

  • keep_na: Records that meet the criteria for the keep_na process are appended to the existing data object to be reconsidered for matching at a later date

  • Completed_Submissions/Template_Submitters/last_run.csv: This is a reference list used internally by this script (i.e. not output to external sources) to document which files were processed in the prior run. Sometimes the final chunk of code fails to delete submission files that are open in the R session. This list is updated at the end of a code run and indicates which files were processed. When the script scans the Submissions folder for new files at the beginning of a code run, it uses this reference list to complete the process of removing those files if any are lingering. Using this approach, no manual action is required to move files.

At the end of the script, the processed files are moved from their locations within the Submissions folder to the Completed Submissions folder. An email receipt is also generated at this time.

Note: This process is not an ideal method for receiving records. Ultimately we hope to transition all submitters to electronic laboratory reporting (ELR), which is preferable to receiving piecemeal .csv documents. However, using a template does improve consistency among submitters that have not adopted ELR, which allows us to prepare their records for rostering in a somewhat standardized manner. The incoming .csv files can be subject to formatting or submitter error and sometimes require manual edits before the script can properly ingest them.

Developer notes

In order to process files from a new submitter, a few actions need to be taken to update this script:

  1. If a new sender is validated to submit patient-level sequencing information following the template format, the laboratory name needs to be added to the lab_vars object via the write_lab_vars_here.R script. There is a code chunk listing all confirmed template submitters (lab_names_template), and the spelling must match that of the SUBMITTING_LAB column in the submission; simply add the new name to the list! Other scripts beyond this one use that shared object to access the acceptable values for a given field, so making a change in the lab_vars object will populate downstream.

  2. If the new sender very rarely submits information, it may be appropriate to drop their new files into the Submissions/adhoc subfolder. If they are regularly sending sequencing data via SFT, a designated sub-folder should be created for that submitter in the Submissions folder. Make sure code Chunk 10 of this script includes all of the Submissions sub-folders that contribute Template Submissions for ingestion into this script. (Upon completion of the script, files are moved from their Submissions sub-folder to a corresponding Completed Submissions sub-folder – this could be created at the same time, but won’t be used until later!)

  3. Then, examine the format of the GISAID_IDs included in submissions from the new laboratory to determine what modifications should be made to the validation steps in Chunk 14. If the GISAID_IDs follow the standard “^(hCoV-19/)?USA/WA-.*/[[:digit:]]{4}$” regex pattern, update the list of other submitters to include any new Submissions sub-folders created in the previous step. If the GISAID_IDs do not follow the expected pattern (likely due to a text prefix or unusual number of characters), a new set of code must be written to indicate the typical GISAID_ID format for that submitter for future validation – this has already been done for Aegis, Helix, and Labcorp, so there is an example to follow! In a related step, Chunk 17 must be updated to standardize GISAID_IDs for the new submitter if they are partial or do not follow the usual regex pattern.

  4. Next, updates need to be made to Chunk 20 where columns for the final sequencing roster are created. The SEQUENCE_LAB variable is derived from the SUBMITTING_LAB variable included in the submissions. The variable should be updated to mutate this variable properly for the new submitter. Note that submitters sometimes use different values in the SEQUENCE_LAB column (they’re not supposed to, but it happens), but only one value per submitter should be output. This is in part addressed by the code in Chunk 20 using str_detect to identify the lab – real-life example: str_detect(toupper(SUBMITTING_LAB), “UW VIROLOGY”) ~ “UW Virology” will catch submissions that have been sent with a SUBMITTING_LAB of “UW Virology” or “UW Virology Lab”. Similarly, str_detect(toupper(SUBMITTING_LAB), “AEGIS”) ~ “Aegis” detects submissions from “Aegis” or “Aegis Sciences Corporation”. However, multiple rows may need to be written per submitter if they use acronyms or other names that do not contain shared text – example: the United States Air Force has sent submissions with the name “USAFSA” and “United States Air Force School of Aerospace” in the SUBMITTING_LAB column; two lines of code are needed for each variant of the name, but both output the same value to roster: “USAFSA”.

  5. After the script has run, files need to be moved from their sub-folders within the Submissions folder to Completed Submissions. Chunk 30 contains code to move these files. The name of any new sub-folders in Submissions and Completed Submissions need to be included here.

These steps may not be comprehensive, and files from new submitters may contain their own idiosyncrasies - but this should offer a decent starting place if adding a newly validated laboratory to our suite of template submitters!

Script setup

Libraries

In [2]:
library(DBI)
library(odbc)
library(lubridate)
library(tidyverse)
library(readxl)
library(here)
library(fs)
library(vroom)

Import objects

Data objects

This part of the script calls several data objects that contain information required for this process. The lab_vars object is curated by Sequencing Project script runners and serves as a master reference for validated variable input. Adding new submitters, sequencing reasons, etc should be done in this object so the changes populate across all scripts that use these variables.

In [3]:
# read in r_creds.RDS
r_creds <-readRDS(file.path(Sys.getenv("USERPROFILE"), "Projects/Sequencing/Data_Objects", "r_creds.RDS")) 

# Read in keep_na
keep_na_running <- read_csv("keep_na/keep_na.csv",
                   col_types = cols(.default = "c"),
                   na = c("", "NA", "N/A")) 

# Read in lineages
lineages <- read_csv("Data_Objects/Lineages/Lineages.csv",
                   col_names = TRUE,
                   col_types = cols(.default = "c"),
                   na = c("", "NA", "N/A")) 

# Read in the Washington_cumulative file (contains records submitted by CDC contracted labs)
wa_cdc <- read_csv("Submissions/CDC_Cumulative/Washington_cumulative.csv",
                   col_types = cols(.default = "c"),
                   na = c("", "NA", "N/A")) 

# Read in lab_vars
lab_vars <- readRDS("Data_Objects/lab_variables.rds")

# Read in quality_filters
source(file.path(Sys.getenv("USERPROFILE"), "Projects/Sequencing/Roster_scripts/quality_filters.R"))

# Remove submissions that were duplicated in Completed Submissions at the end of the last run
last_run <- read.csv("Completed_Submissions/Template_Submitters/Template_Submitters_File_Completed.csv")

for (i in 1:nrow(last_run)) {
  if (file.exists(paste0(last_run[i,1]))) { 
    file.remove(paste0(last_run[i,1]))
  }
}

Access WDRS

Connect

Establish a connection to WDRS using credentials stored in a data object.

In [4]:
# connect
connection <- DBI::dbConnect(odbc::odbc(), 
                             Driver = r_creds$conn_list[1], 
                             Server = r_creds$conn_list[2], 
                             Database = r_creds$conn_list[3], 
                             Trusted_connection = r_creds$conn_list[4], 
                             ApplicationIntent = r_creds$conn_list[5])

Query WDRS

After connecting to the WDRS database, create ENTIRE and FLATTENED tables in R global environment

  • ENTIRE table contains all WDRS SARS-CoV-2 entries
  • FLATTENED table contains WDRS SARS-CoV-2 entries that have been included in the sequencing roster
In [5]:
# WDRS ENTIRE TABLE
    ## Pull CASE_ID, SEQUENCE_CLINICAL_ACCESSION AND COLLECTION_DATE FROM the [DD_ELR_DD_ENTIRE] table on WDRS
    ## This table will be joined against to pull CASE_ID's using the SEQUENCE_CLINICAL_ACCESSION
wdrs_ent <- dbGetQuery(connection, "
                    SELECT Distinct CASE_ID,
                      [FILLER__ORDER__NUM] as FILLER__ORDER__NUM__WDRS,
                      SPECIMEN__COLLECTION__DTTM as SPECIMEN__COLLECTION__DTTM__WDRS
                    FROM [dbo].[DD_ELR_DD_ENTIRE]
                    WHERE CODE = 'SARS' AND STATUS != '6' 
                    ")

  # transform COLLECTION_DATE to character format
wdrs_ent$SPECIMEN__COLLECTION__DTTM__WDRS <- as.character(wdrs_ent$SPECIMEN__COLLECTION__DTTM__WDRS) 

# WDRS FLATTENED TABLE
    ## Pull all SEQUENCE_ACCESSION from the [DD_GCD_COVID_19_FLATTENED] table on WDRS. (these are specimens that have already been rostered)
    ## The final roster will be filtered against this table to remove any potential duplicates
wdrs_flat <- dbGetQuery(connection, "
                    SELECT DISTINCT CDC_N_COV_2019_SEQUENCE_ACCESSION_NUMBER,
                    CDC_N_COV_2019_SEQUENCE_CLINICAL_ACCESSION_NUMBER
                    FROM [dbo].[DD_GCD_COVID_19_FLATTENED]
                    ")

Create matching variables

SEQUENCE_ACCESSION and SEQUENCE_CLINICAL_ACCESSION values are extracted from WDRS. These unique identifiers will be used to match genomic sequencing records to the incoming submissions with patient-level information.

In [6]:
# Create a vector containing individual SEQUENCE_ACCESSIONS
    # This variable is located in the first column of the WDRS FLATTENED table 

    ## Extract comma separated values if fields have multiple SEQUENCE_ACCESSIONS
wdrs_sa_flat_split <- unlist(str_split(wdrs_flat[[1]], ","))
    
    ## Omit any NAs
wdrs_sa_flat_clean <- wdrs_sa_flat_split[!is.na(wdrs_sa_flat_split)] %>%
    ## Remove "hCoV-19/" from the beginning of SEQUENCE_ACCESSION if it has been appended (replace with "")
    str_replace("hCoV-19/", "") %>%
    ## Trim white space resulting from str_split and remove " " values  
    str_trim("both")

    ## Remove any values equal to ""
wdrs_sa_flat_values <- wdrs_sa_flat_clean[wdrs_sa_flat_clean != ""]
In [7]:
# Create a vector containing individual SEQUENCE_CLINICAL_ACCESSIONS
    # This variable is located in the second column of the WDRS FLATTENED table

    ## Extract comma separated values if fields have multiple SEQUENCE_CLINICAL_ACCESSIONS
wdrs_sca_flat_split <- unlist(str_split(wdrs_flat[[2]], ","))

    ## Omit any NAs
wdrs_sca_flat_clean <- wdrs_sca_flat_split[!is.na(wdrs_sca_flat_split)] %>%
    ## Trim white space resulting from str_split and remove " " values  
    str_trim("both")

    ## Remove any values equal to ""
wdrs_sca_flat_values <- wdrs_sca_flat_clean[wdrs_sca_flat_clean != ""]

Ingest files

Compile submissions

New laboratory submissions are dropped in their corresponding sub-folders in the Submissions folder.

This code chunk examines all Submissions sub-folders that have been validated to accept Template Submissions and compiles a list of paths to all files present in these folders.

In [8]:
# Identify paths to folders containing Template Submissions
submissions <- fs::dir_ls(c("Submissions/Adhoc",
                            "Submissions/Aegis",
                            "Submissions/Altius",
                            "Submissions/ASU",
                            "Submissions/Boise_VA",
                            "Submissions/Fulgent_Genetics",
                            "Submissions/Gravity Diagnostics",
                            "Submissions/Helix",
                            "Submissions/Infinity_Biologix",
                            "Submissions/Kaiser",
                            "Submissions/Labcorp",
                            "Submissions/Lauring_Lab",
                            "Submissions/NW_Genomics",
                            "Submissions/Quest",
                            "Submissions/UnitedStatesAirForceSchoolofAerospace",
                            "Submissions/UW_Virology"),
                          recurse = TRUE, type = "file"
                          )

# Create a list of metadata for the new submission dataframes
files <- submissions %>% 
  purrr::map_if(str_sub(., -4, -1) == "xlsx", 
         function(x) read_xlsx(x, col_types = "text", na = c("NA", "N/A", "", "None", "none")),
         .else =  function(x) vroom(x, col_types = cols(.default = "c"), na = c("NA", "N/A", "", "None", "none"))
         )

# Create a backup of the submitted files in their original state
files_original <- files

Remove empty files

This code chunk examines the new submissions and identifies any empty files for removal

In [9]:
# Confirm that a non-zero number of submissions were received
# Confirm that the files received contain data, not just an empty template
    ## Occasionally, a submitter may send a file with empty rows or missing observations aside from SUBMITTING_LAB
    ## Files without any complete observations are errors by the submitter part and should be removed
if (length(files) > 0) {
  # For every df in files
  for (i in 1:length(files)) {
    # Filter to rows where the sum of fields not missing data across all columns (1:12) is greater than 1
    files[[i]] <- filter(files[[i]], rowSums(!is.na(files[[i]][, 1:ncol(files[[i]])])) > 1)
  }
}

# Initialize an empty vector to hold the filepaths of submissions that do not contain any data (empty)
empty_submissions <- c()
# For every df in files
for (i in 1:length(files)) {
  # If there are 0 rows of data (file is empty)
  if (length(files) == 0) {
    # Append the filepath to the empty_submissions
    empty_submissions <- c(empty_submissions, names(files[i]))
  }
}

# Remove any empty submissions from the files object
if (length(empty_submissions) > 0) {
  files <- files[names(files) != empty_submissions]
}

Send an email and stop here if there are no files or records to process

In [10]:
# Prepare email details and message

  ## Note that the files object had empty files removed in the previous chunk -- therefore, it will have a length equal to 0 if there were no files received OR if there were no populated files received
if (length(files) == 0) {
  email_from <- ""
  email_to <- ""
  email_subj <- "SEQUENCING - Template Submitters"
  
## If there are no new files
ifelse(length(files_original) == 0,
  email_body_no_files <- paste0("There are no new template submitter files to process on ", today()),
  email_body_no_files <- paste0(""))
  
## If there are only empty files
ifelse(length(files_original) > 0,
  email_body_only_empty <- paste0("All of the template submitter files received on ", today(), " were empty (i.e. contained no records): \n", 
                                  print(empty_submissions), 
                                  "\n\n No records will be processed by the template submitters script today."),
  email_body_only_empty <- paste0(""))

## Compose message
email_body <- paste0(email_body_no_files, "\r", email_body_only_empty)

# Send email
sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj,
                    msg = email_body,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = "")
                    )

stop("Stop script here.", 
     "\n A total of ", length(files_original), " files were received today. \n", 
     length(empty_submissions), " files were empty (i.e. contained no records).", 
     "\n An email receipt has been sent.")
}

# If there are empty files submitted alongside populated files, the script can proceed. However, we want to inform stakeholders of the empty files so they can be addressed. This message will be appended to any future emails generated by this script
ifelse(length(empty_submissions) > 0,
  email_body_empty_files <- paste0(length(empty_submissions), " file(s) received today was/were empty (i.e. contained no records) and may require manual follow up: \n", print(empty_submissions)),
  email_body_empty_files <- paste0(""))

All files processed beyond this point will contain a non-zero number of observations.

Remove invalid files

This section of code examines the new submissions and identifies any files that do not follow the template provided to laboratories for submitting records. Files that do not contain the variables specified below, with identical column names and orders, may not meet the criteria for ingestion and could encounter errors with subsequent steps in this script to prepare them for rostering. These files likely need manual review and may require corrections by the script runner or laboratory submitter.

Define valid submission criteria

The following chunk establishes the parameters expected of submission files following the template. The incoming files will be compared against these criteria to determine whether they can be processed by the remainder of the script as-is.

In [11]:
# Create a vector containing the template submission variable names in the correct order
template_col_names <- c(
  "LAB_ACCESSION_ID", 
  "GISAID_ID", 
  "SPECIMEN_COLLECTION_DATE", 
  "SUBMITTING_LAB", 
  "SEQUENCE_REASON", 
  "SEQUENCE_STATUS", 
  "PANGO_LINEAGE", 
  "FIRST_NAME", 
  "LAST_NAME", 
  "MIDDLE_NAME", 
  "DOB", 
  "ALTERNATIVE_ID"
)

# Create a vector naming validated template submitters (SUBMITTING_LAB)
submitting_lab_values <- lab_vars$lab_names_template

# Create a vector containing valid values for SEQUENCE_REASON
  # Note that NA and blanks are accepted
sequence_reason_values <- lab_vars$seq_reason_template

# Create a vector containing valid values for SEQUENCE_STATUS
  # Note that NA and blanks are not accepted
sequence_status_values <- lab_vars$seq_status_template

# Extract a list of valid lineages from the lineages_list object (this should be updated daily via an automated script)
  # Append "" and NA, as records with a LOW QUALITY or FAILED sequence status will not contain lineage information
lineages_list <- c(
  lineages[[1]], 
  "",
  "UNASSIGNED",
  "Unassigned",
  NA
)

Many of the valid values are located within the lab_vars data object – this is meant to serve as a centralized source of validated information across multiple scripts, so updates to these variables only need to be made once. To add, remove, or otherwise edit relevant variables, make changes to the Roster_scripts//write_lab_variables.R script and update the data object that is output.

Prepare for validation checks

Invalid files will be identified as those having formats or columns that do not match the template criteria outlined in the previous chunk.

The following 8 items are checked for errors:

  • format - variable names and order are correct
  • gisaid - GISAID_ID variable is the expected format (varies by submitter) and there are no GISAID_IDs assigned to records with a SEQUENCE_STATUS that is not COMPLETE
  • lab - SUBMITTING_LAB values are valid
  • reason - SEQUENCE_REASON values are valid
  • status - SEQUENCE_STATUS values are valid
  • collection date - SPECIMEN_COLLECTION_DATE is in a valid format
  • lineage - PANGO_LINEAGE values are valid

Invalid files cannot be processed until all errors identified in this quality check are corrected.

The next code chunk creates a dataframe to locate which errors are present in each submission file.

In [12]:
# Initialize a dataframe to identify work-stop errors within files
invalid_submission_reason <- as.data.frame(matrix(
  nrow = length(files),
  ncol = 8,
  dimnames = list(c(), c(
    "submission", # identify the submission filepath
    "format_check", 
    "gisaid_check", 
    "lab_check", 
    "reason_check", 
    "status_check", 
    "coll_date_check", 
    "lineage_check"))
  )
)

# Assign each file to a row in the df
invalid_submission_reason[,1] <- names(files)

# FORMAT CHECK
## For every submission in files, check that the formatting of the dataframe has not been altered; if so, populate error message in invalid_submissions_reason
## Sometimes there appears to be additional columns - can use this to remove them. Replace with number of file. files[[7]] <- files[[7]][1:12]
for (i in 1:length(files)) { 
    if(!(all(names(files[[i]]) == template_col_names))) {
    invalid_submission_reason[i, 2] <- "Invalid submission format - file variables do not match template"
  }
}

# GISAID_ID CHECK
## GISAID_ID formats vary by submitter. for every submission in files, check that the GISAID_ID values are in the typical format for each submitter and there are not GISAID_ID values present for samples with a sequence status that is not COMPLETE
  ## Note: Aegis, Helix, Labcorp, and Quest are currently submitting via ELR (per 03.11.2022); Because ad-hoc corrections or other records from these labs are sometimes manually dropped into Submissions, we have retained the code to process these records

for (i in 1:length(files)) {

  # Aegis
  if (str_detect(names(files[i]), "Submissions/Aegis")) {
    for (j in 1:nrow(files[[i]])) {
      if (any((!(
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "COMPLETE" & str_detect(files[[i]][j,]$GISAID_ID, "^ASC([[:digit:]])*-B.*"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "LOW QUALITY" & str_detect(files[[i]][j,]$GISAID_ID, "^ASC([[:digit:]])*-B.*"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) != "COMPLETE" & is.na(files[[i]][j,]$GISAID_ID))
      )))) {
        invalid_submission_reason[i, 3] <- "Invalid GISAID_IDs - check for unexpected format by submitter or GISAID_IDss assigned to records where SEQUENCE_STATUS is not COMPLETE"
      }
    }
  }
  
  # Helix
  else if (str_detect(names(files[i]), "Submissions/Helix")) {
    for (j in 1:nrow(files[[i]])) {
      if (any((!(
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "COMPLETE" & str_detect(files[[i]][j,]$GISAID_ID, "USA/WA-CDC-STM-.*"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "LOW QUALITY" & str_detect(files[[i]][j,]$GISAID_ID, "USA/WA-CDC-STM-.*"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) != "COMPLETE" & is.na(files[[i]][j,]$GISAID_ID))
      )))) {
        invalid_submission_reason[i, 3] <- "Invalid GISAID_IDs - check for unexpected format by submitter or GISAID_IDss assigned to records where SEQUENCE_STATUS is not COMPLETE"
      }
    }
  }

  # Labcorp
  else if (str_detect(names(files[i]), "Submissions/Labcorp")) {
    for (j in 1:nrow(files[[i]])) {
            if (any((!(
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "COMPLETE" & str_detect(files[[i]][j,]$GISAID_ID, "^LC[[:digit:]]*$"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "LOW QUALITY" & str_detect(files[[i]][j,]$GISAID_ID, "^LC[[:digit:]]*$"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) != "COMPLETE" & is.na(files[[i]][j,]$GISAID_ID))
      )))) {
        invalid_submission_reason[i, 3] <- "Invalid GISAID_IDs - check for unexpected format by submitter or GISAID_IDss assigned to records where SEQUENCE_STATUS is not COMPLETE"
      }
    }
  }

  # Other submitters
  else if (str_detect(names(files[i]), "Submissions/(
                      Adhoc|
                      Altius|
                      ASU|
                      Boise_VA|
                      Fulgent_Genetics|
                      Gravity Diagnostics|
                      Infinity_Biologix|
                      Kaiser|
                      Lauring_Lab|
                      NW_Genomics|
                      Quest|
                      UnitedStatesAirForceSchoolofAerospace|
                      UW_Virology
                      )")) {
    for (j in 1:nrow(files[[i]])) {
      if (any((!(
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "COMPLETE" & str_detect(files[[i]][j,]$GISAID_ID, "^(hCoV-19/)?USA/WA-.*/[[:digit:]]{4}$"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) == "LOW QUALITY" & str_detect(files[[i]][j,]$GISAID_ID, "^(hCoV-19/)?USA/WA-.*/[[:digit:]]{4}$"))|
        (toupper(files[[i]][j,]$SEQUENCE_STATUS) != "COMPLETE" & is.na(files[[i]][j,]$GISAID_ID))
      )))) {
        invalid_submission_reason[i, 3] <- "Invalid GISAID_IDs - check for unexpected format by submitter or GISAID_IDss assigned to records where SEQUENCE_STATUS is not COMPLETE"
      }
    }
  }
}

# LAB CHECK
## For every submission in files, check that the submitting lab variable contains valid values; if not, populate error message in invalid_submissions_reason
    ## Note: Sometimes labs will switch the name they submit under without warning -- examples: "UW Virology Lab" -> "UW Virology", "Aegis" -> "Aegis Sciences Corporation". When a submitter makes a consistent change like this, the write_lab_vars.R script has typically been updated to accept both phrasings
for (i in 1:length(files)) { 
  if(!(all(files[[i]]$SUBMITTING_LAB %in% submitting_lab_values))) {
    invalid_submission_reason[i, 4] <- "Invalid values in SUBMITTING_LAB variable"
  }
}

# REASON CHECK
## For every submission in files, check that the sequence reason variable contains valid values; if not, populate error message in invalid_submissions_reason
for (i in 1:length(files)) { 
  if(!(all(toupper(files[[i]]$SEQUENCE_REASON) %in% sequence_reason_values))) {
    invalid_submission_reason[i, 5] <- "Invalid values in SEQUENCE_REASON variable"
  }
}

# STATUS CHECK
## For every submission in files, check that the sequence status variable contains valid values; if not, populate error message in invalid_submissions_reason
for (i in 1:length(files)) { 
  if(!(all(toupper(files[[i]]$SEQUENCE_STATUS) %in% sequence_status_values))) {
    invalid_submission_reason[i, 6] <- "Invalid values in SEQUENCING_STATUS variable"
  }
}

# COLLECTION DATE CHECK
## For every submission in files, check that the specimen collection date is in a valid format; if not, populate error message in invalid_submissions_reason
for (i in 1:length(files)) { 
  if(!(all(!(is.na(files[[i]]$SPECIMEN_COLLECTION_DATE)) & 
           str_detect(files[[i]]$SPECIMEN_COLLECTION_DATE,
                      "[[:digit:]]{1,2}[[:punct:]][[:digit:]]{1,2}[[:punct:]][[:digit:]]{4}|[[:digit:]]{4}[[:punct:]][[:digit:]]{1,2}[[:punct:]][[:digit:]]{1,2}|[[:digit:]]{5}")))) {
    invalid_submission_reason[i, 7] <- "Invalid SPECIMEN_COLLECTION_DATE - check for incorrect date format"
  }
}

# LINEAGE CHECK
## For every submission in files, check that the PANGO lineage variable contains valid values; if not, populate error message in invalid_submissions_reason
for (i in 1:length(files)) { 
  if(!(all(toupper(files[[i]]$PANGO_LINEAGE) %in% lineages_list))) {
    invalid_submission_reason[i, 8] <- "Invalid values in PANGO_LINEAGE variable"
  }
}

Separate valid and invalid submissions

In [13]:
# Extract invalid files (i.e. triggered QA check errors, do not meet the criteria to be processed)
invalid_files_subset <- invalid_submission_reason %>%
  filter(!(rowSums(is.na(invalid_submission_reason[, c(2:7)])) == 6))

# Extract valid files (i.e. no QA check errors, meet all criteria to be processed)
valid_files_subset <- invalid_submission_reason %>%
  filter(rowSums(is.na(invalid_submission_reason[, c(2:7)])) == 6)

# Bind valid_files into a single dataframe
valid_files <- files[names(files) %in% valid_files_subset[[1]]]
files_bind <- bind_rows(valid_files, .id = "column_label")

# Built-in stop in case no valid files; Print summary
  ## Note: sometimes a batch of files from a single submitter containing identical errors will be received, and often it just requires a simple fix. It's preferable to stop here and address the error manually rather than send out a premature email stating that all files cannot be processed today
  ## If errors are corrected at this point, rerun from Chunk 13 (validation checks) to see if fixed! 
stop(paste0("NOTES FOR SCRIPT RUNNER:",
            "\n", "Stop script to review files.",
            "\n\n", length(files), " files were detected in the Submissions folders"),
            "\n", "   Valid files: ", nrow(valid_files_subset),
            "\n", "   Invalid files: ", nrow(invalid_files_subset),
            "\n\n", "Proceed with script or perform manual review as necessary.")

Stop script and send email if all incoming files are invalid.

In [14]:
# Initialize vectors to hold messages
invalid_files_message <- c()
final_invalid_files_message <- c()

# For all invalid files, note the filepath and the errors that were identified
for (i in 1:nrow(invalid_files_subset)) {
  for (i in 1:nrow(invalid_files_subset)) {
    invalid_files_message[i] <- paste0(invalid_files_subset[i,][!is.na(invalid_files_subset[i,])], collapse="\n     - ")
  }
  
  # Collapse invalid_files that have been processed by line
  invalid_files_message <-  paste0(invalid_files_message, collapse="\n")
  
  # Compose final message
  final_invalid_files_message <- paste0("The following file(s) are invalid and have not been processed: ",
                                        "\n", invalid_files_message)
}

# If there are files to process but none are valid (i.e all files are invalid), send an email stating so
if ((length(files) > 0) & (length(valid_files) == 0)) {
  email_from <- ""
  email_to <- ""
  email_subj <- "SEQUENCING - Template Submitters"
  email_body <- paste0("There are submissions to process, however, the submission files are invalid and require correction. There are no valid submissions that can be processed ", today(), ".",
  "\n", final_invalid_files_message,
  "\n\n", email_body_empty_files) # Append message regarding any empty files

# Send email
sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj,
                    msg = email_body,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = "")
                    )
}

# Stop the script if all incoming submissions are invalid 
if ((length(files) > 0) & (length(valid_files) == 0)) {
  stop("All files are invalid. File(s) require correction. Script will stop here.")
}

Match records

Prepare for matching

The next code chunk performs data cleaning and mutation to prepare the valid patient-level records for matching to WDRS entries.

In [15]:
records_mut <- files_bind %>%
  
# Standardize format of SPECIMEN_COLLECTION_DATE as ymd
  mutate(SPECIMEN_COLLECTION_DATE = case_when(
    # Parse as ymd when SPECIMEN_COLLECTION_DATE is a 5 digit numeric string (Excel date)
    str_detect(SPECIMEN_COLLECTION_DATE, "^[[:digit:]]{5}$") ~ as.character(openxlsx::convertToDate(SPECIMEN_COLLECTION_DATE)),
    # Parse as ymd when SPECIMEN_COLLECTION_DATE is not a 5 digit numeric string - in a mdy (template submission) or ymd (Fulgent) format
    !str_detect(SPECIMEN_COLLECTION_DATE, "^[[:digit:]]{5}$") ~ as.character(ymd(parse_date_time(SPECIMEN_COLLECTION_DATE, c("mdy", "ymd"))))
  )) %>%
  
# Transform and standardize GISAID_ID by submitter
  ## Note: Aegis, Helix, Labcorp, and Quest are currently submitting via ELR (per 03.11.2022); Because ad-hoc corrections or other records from these labs are sometimes manually dropped into Submissions, we have retained the code to process these records
  mutate(GISAID_ID = case_when(
    
    # Aegis submits a partial GISAID_ID with excess string at the end: remove '-BXXXXXX', append 'hCoV-19/' prefix, append '/[year of collection date]' suffix
    SUBMITTING_LAB == "Aegis" & 
      replace_na(str_detect(toupper(SEQUENCE_STATUS), "COMPLETE"), FALSE) & 
      !is.na(GISAID_ID) 
    ~ paste0("USA/WA-CDC-", str_extract(GISAID_ID, "[^-]+"), "/", year(SPECIMEN_COLLECTION_DATE)),

    # Helix is submitting a full GISAID_ID so no edits necessary - however keeping this here in case future adjustments are necessary     
    SUBMITTING_LAB == "Helix" & 
      replace_na(str_detect(toupper(SEQUENCE_STATUS), "COMPLETE"), FALSE) & 
      !is.na(GISAID_ID) 
    ~ paste0(GISAID_ID),
    
    # Labcorp submits a partial GISAID_ID: append 'hCoV-19/' prefix, append '/[year of collection date]' suffix
    SUBMITTING_LAB == "Labcorp" & 
      replace_na(str_detect(toupper(SEQUENCE_STATUS), "COMPLETE"), FALSE) & 
      !is.na(GISAID_ID) 
    ~ paste0("USA/WA-CDC-", GISAID_ID, "/", year(SPECIMEN_COLLECTION_DATE)),
    
    TRUE ~ GISAID_ID
  ))

Perform matching

CASE_IDs from all submitted records will be matched to the WDRS_ENTIRE table using SEQUENCE_CLINICAL_ACCESSION (exact match) and COLLECTION_DATE (+ or - 14 days). The inexact matching is conducted during the quality filters in a later chunk.

In [16]:
# Join WDRS CASE_ID to submissions by exact matching on sequence accession number
records_matched <- left_join(records_mut, 
                           wdrs_ent,
                           by = c("LAB_ACCESSION_ID" = "FILLER__ORDER__NUM__WDRS"),
                            na_matches = "never")

# Extract unique records
  ## Necessary step because CASE_ID can be repeated multiple times in WDRS. The join process can generate duplicate rows where each variable value is repeated.
  ## This step identifies duplicates using all variables aside from column_label, which contains the filepath of each record and could falsely indicate that a record is unique if it was sent in two separate files
records_matched_dedup <- records_matched[!duplicated(records_matched[,2:14]),]

The records_matched_dedup dataframe retains unmatched records in addition to those that were matched. Unmatched records are identifiable by the fact that they have not been assigned a CASE_ID. Next steps will divert the unmatched records to fuzzy matching (if they contain the requisite demographic information) or keep_na. The matched records will then be compiled into the initial roster and undergo subsequent QA checks.

Extract records for fuzzy matching

The next code chunk scans the unnmatched observations and extracts those eligible for fuzzy matching based on the following criteria:

Ineligible for fuzzy matching, proceed towards roster & keep_na: * CASE_ID is not missing (i.e. successfully matched) OR * All demographic columns are NA: FIRST_NAME, LAST_NAME, MIDDLE_NAME, DOB

To fuzzy matching: * CASE_ID is missing (i.e. unmatched) AND * All demographic columns are not NA: FIRST_NAME, LAST_NAME, MIDDLE_NAME, DOB

In [17]:
# Compile matched records AND unmatched records without demographic information
roster_initial <- records_matched_dedup[
  (!is.na(records_matched_dedup$CASE_ID)) |
  (rowSums(is.na(records_matched_dedup[, c(9:12)])) == 4)
, ]

# Extract unmatched records with demographic information for fuzzy matching
fuzzy_match <- records_matched_dedup[
  (is.na(records_matched_dedup$CASE_ID)) &
  (!rowSums(is.na(records_matched_dedup[, c(9:12)])) == 4)
, ]

# If records have been flagged for fuzzy matching, transform and generate output
if(nrow(fuzzy_match) > 0) {
fuzzy_match_transform <- fuzzy_match %>%
  select(-c(14:15)) 
}

Prepare roster

Format roster

The next chunk transforms the roster_initial records into the output format used for rostering sequencing results. This dataframe contains matched records AND unmatched records without demographic information. Once mutated into the final roster format, the latter can be diverted to the keep_na list.

In [18]:
if(nrow(roster_initial) > 0) {
  roster_initial_transform <- roster_initial %>%
    
  # mutate CASE_ID, as number
  mutate(CASE_ID = as.numeric(CASE_ID)) %>%
    
  # mutate SEQUENCE_SGTF, populate as blank
  mutate(SEQUENCE_SGTF = "") %>%
    
  # mutate SEQUENCE_SPECIMEN, populate as "YES"
  mutate(SEQUENCE_SPECIMEN = "YES") %>%
    
  # mutate SEQUENCE_REASON, uppercase
  mutate(SEQUENCE_REASON = toupper(SEQUENCE_REASON)) %>%
    
  # mutate SEQUENCE_DATE, populate as blank
  mutate(SEQUENCE_DATE = "") %>%
    
  # mutate SUBMITTING_LAB, populate with corresponding values
  mutate(SEQUENCE_LAB = case_when(
    str_detect(toupper(SUBMITTING_LAB), "AEGIS") ~ "Aegis",
    str_detect(toupper(SUBMITTING_LAB), "ALTIUS") ~ "Altius",
    str_detect(toupper(SUBMITTING_LAB), "ASU") ~ "ASU",
    str_detect(toupper(SUBMITTING_LAB), "ATLAS") ~ "Atlas Genomics",
    str_detect(toupper(SUBMITTING_LAB), "BOISE VA") ~ "Boise VA",
    str_detect(toupper(SUBMITTING_LAB), "FULGENT") ~ "Fulgent Genetics",
    str_detect(toupper(SUBMITTING_LAB), "GRAVITY") ~ "Gravity",
    str_detect(toupper(SUBMITTING_LAB), "HELIX") ~ "Helix",
    str_detect(toupper(SUBMITTING_LAB), "INFINITY") ~ "Infinity",
    str_detect(toupper(SUBMITTING_LAB), "KAISER") ~ "KP WA Research Inst",
    str_detect(toupper(SUBMITTING_LAB), "LABCORP") ~ "Labcorp",
    str_detect(toupper(SUBMITTING_LAB), "LAURING LAB") ~ "Lauring Lab",
    str_detect(toupper(SUBMITTING_LAB), "NW GENOMICS") ~ "NW Genomics",
    str_detect(toupper(SUBMITTING_LAB), "QUEST") ~ "Quest",
    str_detect(toupper(SUBMITTING_LAB), "USAFSA") ~ "USAFSA",
    str_detect(toupper(SUBMITTING_LAB), "UNITED STATES AIR FORCE SCHOOL OF AEROSPACE") ~ "USAFSAM",
    str_detect(toupper(SUBMITTING_LAB), "UW VIROLOGY") ~ "UW Virology",
    TRUE ~ SUBMITTING_LAB)) %>%
    
  # mutate SEQUENCE_STATUS, uppercase
  # mutate(SEQUENCE_STATUS = toupper(SEQUENCE_STATUS)) %>%
  # mutate(SEQUENCE_STATUS = if_else(PANGO_LINEAGE == "Unassigned", "LOW QUALITY", toupper(SEQUENCE_STATUS))) %>%
    mutate(SEQUENCE_STATUS = case_when(
      PANGO_LINEAGE == "Unassigned" ~ "LOW QUALITY", 
      TRUE ~ toupper(SEQUENCE_STATUS))) %>%
    
  # mutate SEQUENCE_REPOSITORY, populate with "GISAID"
  mutate(SEQUENCE_REPOSITORY = "GISAID") %>% 
    
  # mutate SEQUENCE_ACCESSION, remove "hCoV-19/" from GISAID_ID
  mutate(SEQUENCE_ACCESSION = str_replace(GISAID_ID, "hCoV-19/", "")) %>%
  
  # mutate SEQUENCE_VARIANT_OPEN_TEXT, populate with PANGO_LINEAGE
  mutate(SEQUENCE_VARIANT_OPEN_TEXT = case_when(
    toupper(PANGO_LINEAGE) %in% c(lineages_list, "INVALID") ~ PANGO_LINEAGE,
    TRUE ~ ""
  )) %>% 
    
  # mutate SEQUENCE_CLINICAL_ACCESSION, populate with LAB_ACCESSION_ID
  mutate(SEQUENCE_CLINICAL_ACCESSION = LAB_ACCESSION_ID) %>%
    
  # mutate SEQUENCE_SPECIMEN_COLLECTION_DATE, populate in mdy format
  mutate(SEQUENCE_SPECIMEN_COLLECTION_DATE = as.character(format(as.Date(roster_initial$SPECIMEN_COLLECTION_DATE), "%m/%d/%Y"))) %>%
    
  # mutate SEQUENCE_NOTES
  mutate(SEQUENCE_NOTES = case_when(
    # when PANGO_LINEAGE is not NA, populate into a message with PANGO_LINEAGE and date of processing
    !is.na(PANGO_LINEAGE) ~ paste0("Lineage identified as ", PANGO_LINEAGE, " on ", today(), ". Lineage assignments may change over time."),
    # when PANGO_LINEAGE is not NA, populate as blank
    is.na(PANGO_LINEAGE) ~ ""
    )) %>% 
    
  # mutate SEQUENCE NOTES for Unassigned records
    mutate(SEQUENCE_NOTES = if_else(!is.na(SEQUENCE_VARIANT_OPEN_TEXT) & SEQUENCE_VARIANT_OPEN_TEXT!= "Unassigned", SEQUENCE_NOTES, NULL)) %>%

  # mutate SEQUENCE_REVIEWED, populate as blank
  mutate(SEQUENCE_REVIEWED = "") %>%
    
  # mutate Case.Note, populate with message
  mutate(Case.Note = "External data question package updated by COVID19 Sequencing Roster.") %>% 
    
  # select the variables required in the roster
  select(CASE_ID, 
         SEQUENCE_SGTF, 
         SEQUENCE_SPECIMEN, 
         SEQUENCE_REASON, 
         SEQUENCE_DATE,
         SEQUENCE_LAB, 
         SEQUENCE_STATUS,
         SEQUENCE_REPOSITORY, 
         SEQUENCE_ACCESSION, 
         SEQUENCE_VARIANT_OPEN_TEXT,
         SEQUENCE_CLINICAL_ACCESSION, 
         SEQUENCE_SPECIMEN_COLLECTION_DATE,
         SEQUENCE_NOTES, 
         SEQUENCE_REVIEWED, 
         Case.Note
  )
}

Extract records for keep_na

The next code chunk examines the preliminary roster and removes records for keep_na. Records that qualify for keep_na are unmatched or do not contain sufficient data to be included in the final sequencing roster – they will be retained in the keep_na running list and re-processed by the keep_na script at a future date.

Records designated for keep_na are identified using the following criteria:

  • CASE_ID is missing

  • SEQUENCE_CLINICAL_ACCESSION is missing

  • SEQUENCE_STATUS is COMPLETE and SEQUENCE_ACCESSION is not missing

  • SEQUENCE_STATUS is FAILED, NOT DONE, LOW QUALITY, or HIGH CT and SEQUENCE_ACCESSION is missing

    AND

  • SEQUENCE_STATUS is not PENDING

The keep_na records are then checked against the WDRS_FLATTENED table and the running keep_na file to exclude rostered or previously processed records. A variable indicating the date is added to determine the retention time for these records once added to the keep_na file.

The records that are not diverted to keep_na will be sent to undergo final pre-roster QA checks in subsequent chunks.

In [19]:
# Applies a series of logic checks to filter down records don't meet criteria to the keep_na file (often missing CASE_ID)
keep_na <- roster_initial_transform[
  
    # CASE_ID is NA or
  ((is.na(roster_initial_transform$CASE_ID)) |
     
    # SEQUENCE_CLINICAL_ACCESSION is NA or 
    (is.na(roster_initial_transform$SEQUENCE_CLINICAL_ACCESSION)) |
     
    # must NOT meet ANY of the conditions to be rostered: 
      ## (SEQUENCE_STATUS is COMPLETE & SEQUENCE_ACCESSION is not NA) or
      ## (SEQUENCE_STATUS is FAILED & SEQUENCE_ACCESSION is NA) or
      ## (SEQUENCE_STATUS is NOT DONE & SEQUENCE_ACCESSION is NA) or
      ## (SEQUENCE_STATUS is LOW QUALITY & SEQUENCE_ACCESSION is NA) or
      ## (SEQUENCE_STATUS is HIGH CT & SEQUENCE_ACCESSION is NA)
    !((roster_initial_transform$SEQUENCE_STATUS == "COMPLETE") & (!is.na(roster_initial_transform$SEQUENCE_ACCESSION)) |
    (roster_initial_transform$SEQUENCE_STATUS == "FAILED") & (is.na(roster_initial_transform$SEQUENCE_ACCESSION)) |
    (roster_initial_transform$SEQUENCE_STATUS == "NOT DONE") & (is.na(roster_initial_transform$SEQUENCE_ACCESSION)) |
      
      # (roster_initial_transform$SEQUENCE_STATUS == "LOW QUALITY") & (is.na(roster_initial_transform$SEQUENCE_ACCESSION)) | 
    # Check if there are any records with LOW QUALITY that sill have a SEQUENCE_ACCESSION
    ((roster_initial_transform$SEQUENCE_STATUS == "LOW QUALITY") & (roster_initial_transform$SEQUENCE_VARIANT_OPEN_TEXT == "Unassigned"))  |
      
      
    (roster_initial_transform$SEQUENCE_STATUS == "HIGH CT") & (is.na(roster_initial_transform$SEQUENCE_ACCESSION))
    )
  ) &
    
  # SEQUENCE_STATUS is not PENDING 
  (roster_initial_transform$SEQUENCE_STATUS != "PENDING")  
  , ]

# Filter out records that exist in the WDRS_FLATTENED table or in the running keep_na file, then add a variable containing date processed
keep_na_final <- keep_na[
  # SEQUENCE_ACCESSION does not already exist within the WDRS_FLATTENED table and
  (!(keep_na$SEQUENCE_CLINICAL_ACCESSION %in% wdrs_sca_flat_values)) &
    
  # SEQUENCE_CLINICAL_ACCESSION does not already exist within the keep_na  
  (!(keep_na$SEQUENCE_CLINICAL_ACCESSION %in% keep_na_running$SEQUENCE_CLINICAL_ACCESSION)) 
  , ] %>%
  
  # Add a variable documenting the date that the record was processed
  mutate(DATE_PROCESSED = as.character(today()))

Perform quality checks

This chunk applies a series of logic checks to identify records that meet criteria to be uploaded and flag those that need manual review (For_Review/to_process). These QA checks are performed by the custom functions found in quality_filters.R.

In [20]:
roster_initial_transform_qa <- roster_initial_transform %>%
 # Add COLLECTION_DATE_WDRS column needed for quality filters
  mutate(COLLECTION_DATE_WDRS = as.character(format(as.Date(roster_initial$SPECIMEN__COLLECTION__DTTM__WDRS), "%m/%d/%Y"))) %>%
 # Remove keep_na records
  anti_join(keep_na)  
  
# Applies a series of logic checks to filter down records that meet the criteria to be uploaded into the WDRS_FLATTENED table
temp_sub_quality <- roster_filters(
                      roster_initial_transform_qa, 
                      lab_vars, 
                      wdrs_flat$CDC_N_COV_2019_SEQUENCE_ACCESSION_NUMBER, 
                      wdrs_flat$CDC_N_COV_2019_SEQUENCE_CLINICAL_ACCESSION_NUMBER, 
                      lineages_list)

Create final roster

In [21]:
# Remove any rows that failed quality checks to create final roster
roster_final <- select(temp_sub_quality, -COLLECTION_DATE_WDRS) %>%
  filter(sum==0) %>% 
  select(CASE_ID:Case.Note)

Extract records for manual review

In [22]:
# Extract records that triggered quality filter errors to be sent to the For_Review folder
for_review <- filter(temp_sub_quality, sum>0) %>%
  filter(SEQUENCE_STATUS != "FAILED" &
         SEQUENCE_STATUS != "LOW QUALITY") %>%
  filter(!(SEQUENCE_ACCESSION %in% keep_na$SEQUENCE_ACCESSION))

In [23]:
# Built-in stop before moving any files
stop("Stop script before any files are moved. Opportunity to review file validity, final roster, and records sent to keep_na or fuzzy_match.")

Generate output

For all the output files above, create vectors containing the filepaths and write all output files to their respective destination

In [24]:
# Initialize vectors to contain the filepaths for all output files in the chunks above
roster_final_filepath <- c()
fuzzy_match_filepath <- c()
for_review_filepath <- c()

# If there are records in roster_final, output file as a '.csv'
if(nrow(roster_final > 0)) {
  # create filepath
  roster_final_filepath <- paste0("write_roster_here/", "Template_Submitters_", format(now(), "%Y-%m-%d-%H%M%S"), ".csv")
  # write to csv
  write_csv(roster_final, roster_final_filepath , na = "")
}

# If there are records in fuzzy_match, output file as a '.csv'
if (nrow(fuzzy_match) > 0) {
  # create filepath
  fuzzy_match_filepath <- paste0("Submissions/Fuzzy_Match/", "Template_Submitters_Fuzzy_Match_Required_", format(now(), "%Y-%m-%d-%H%M%S"), ".csv")
  # write to csv
  write_csv(fuzzy_match_transform, fuzzy_match_filepath, na = "")
}

# If there are records in for_review, output file as a '.csv'
if(nrow(for_review > 0)) {
  # create filepath
  for_review_filepath <- paste0("For_Review/to_process/", "For_Review_Template_Submitters_", format(now(), "%Y-%m-%d-%H%M%S"), ".csv")
  # write to csv
  write_csv(for_review, for_review_filepath, na = "")
}

# If there are records in keep_na_final, append the records to the running keep_na file 
if(nrow(keep_na_final) > 0) {
  keep_na_final %>%
    write_csv(paste0("keep_na/keep_na.csv"), na = "", append = TRUE)
} 

Ensure that records were appended to the keep_na file by comparing the number of records in the past and current versions of the object. The difference should equal the nrow of the keep_na_final dataframe created during this script run. If the number of records do not match, output the keep_na_final object to a holding folder to be appended later.

In [25]:
if(nrow(keep_na_final) > 0) {

  # Read in the updated keep_na 
  keep_na_running_update <- read_csv("keep_na/keep_na.csv",
                   col_types = cols(.default = "c"),
                   na = c("", "NA", "N/A")) 
  
  # If the difference in nrow between the previous iteration of keep_na and the updated version does not equal the number of rows in keep_na_final,
  if ( (!( (nrow(keep_na_running_update)) - (nrow(keep_na_running)) ) == nrow(keep_na_final) ) ) {
    # Output keep_na_final as a separate file into a holding folder to ensure these data are reviewed
    keep_na_final %>%
      write_csv(paste0("keep_na/Add_Holding/", "Template_Submitters_", format(now(), "%Y-%m-%d-%H%M%S"), ".csv"), na = "")
  }
}

Email receipt

Compile an email message by concatenating the vectors created in previous chunks containing relevant messages.

In [26]:
# Prepare email message

## Valid/invalid files
valid_files_message <- c()
valid_files_message <-  paste0(valid_files_subset[,1], collapse="\n")
final_valid_files_message <- paste0("The following file(s) are valid and have been processed:\n", valid_files_message, collapse = "\n\n")

## Empty files
empty_submissions_message <-  paste0(empty_submissions, collapse="\n")

# Final message
  ## If the script makes it to this point it will have valid files that were processed -- start with this message and append other potential details
message <- paste0(final_valid_files_message, "\n\n", "A total of ", nrow(roster_final), " records were sent to write_roster_here.")

  ## If there were invalid files, append it to the message
if ((nrow(invalid_files_subset) > 0)) {
  message <- paste0(message,"\n\n" , final_invalid_files_message)
}

  ## If there were empty files, append it to the message
if ((length(empty_submissions) > 0)) {
  message <- paste0(message, "\n\n", "The following file(s) were received but contained no data:", "\n", empty_submissions_message)
}

  ## If there were records that required fuzzy matching, append it to the message
if ((nrow(fuzzy_match) > 0)) {
  message <- paste0(message, "\n\n", "There are a subset of records from the valid files that were processed that require fuzzy matching. These records do not match to an event in WDRS by an accession ID but do contain viable demographic information:", "\n", fuzzy_match_filepath)
}

message <- paste0(message, "\n\n", "Note: This is an automated message. Please enable your outlook to include extra line breaks to view this message in its proper format.")
In [27]:
# Email details
email_from <- ""
email_to <- ""
email_subj <- "SEQUENCING - Template Submitters Automated Email"
email_body <- paste0("Template submission(s) have been processed ", today(), ".", "\n\n", message)

# Send email
sendmailR::sendmail(from = email_from,
                    to = email_to,
                    subject = email_subj,
                    msg = email_body,
                    headers= list("Reply-To" = email_from),
                    control = list(smtpServer = "")
                    )

Move processed files

In [28]:
# Create vector of valid files to move
move_valid_files <- valid_files_subset[[1]]
# Create vector of valid files and empty files to move
move_files <- c(move_valid_files, empty_submissions)

# For every file processed, copy from Submissions to the corresponding Completed_Submissions folder and delete original
  ## All files should be created successfully, but the script may be unable to delete some files that it considers in use by the R environment. These will be removed by Chunk 10 in the next run prior to identifying new submissions for processing
for (i in 1:length(move_files)) {
  
  ## Adhoc
  if (str_detect(move_files[i], "Adhoc")) {
    file.copy(move_files[i], "Completed_Submissions/Adhoc")
    unlist(move_files[i])

  ## Aegis
  } else if (str_detect(move_files[i], "Aegis")) {
    file.copy(move_files[i], "Completed_Submissions/Aegis")
    unlist(move_files[i])
  
  ## Altius
  } else if (str_detect(move_files[i], "Altius")) {
    file.copy(move_files[i], "Completed_Submissions/Altius")
    unlist(move_files[i])
    
  ## ASU
  } else if (str_detect(move_files[i], "ASU")) {
    file.copy(move_files[i], "Completed_Submissions/ASU")
    unlist(move_files[i])

  ## Fulgent Genetics
  } else if (str_detect(move_files[i], "Fulgent_Genetics")) {
    file.copy(move_files[i], "Completed_Submissions/Fulgent_Genetics")
    unlist(move_files[i])
    
  ## Gravity Diagnostics
  } else if (str_detect(move_files[i], "Gravity Diagnostics")) {
    file.copy(move_files[i], "Completed_Submissions/Gravity_Diagnostics")
    unlist(move_files[i])
    
  ## Helix  
  } else if (str_detect(move_files[i], "Helix")) {
    file.copy(move_files[i], "Completed_Submissions/Helix")
    unlist(move_files[i])
    
  ## Infinity Biologix
  } else if (str_detect(move_files[i], "Infinity_Biologix")) {
    file.copy(move_files[i], "Completed_Submissions/Infinity_Biologix")
    unlist(move_files[i])

  ## Kaiser
  } else if (str_detect(move_files[i], "Kaiser")) {
    file.copy(move_files[i], "Completed_Submissions/Kaiser")
    unlist(move_files[i])
  
  ## Labcorp
  } else if (str_detect(move_files[i], "Labcorp")) {
    file.copy(move_files[i], "Completed_Submissions/Labcorp")
    unlist(move_files[i])

  ## Lauring Lab
  } else if (str_detect(move_files[i], "Lauring_Lab")) {
    file.copy(move_files[i], "Completed_Submissions/Lauring_Lab")
    unlist(move_files[i])
  
  ## NW Genomics
  } else if (str_detect(move_files[i], "NW_Genomics")) {
    file.copy(move_files[i], "Completed_Submissions/NW_Genomics")
    unlist(move_files[i])
    
  ## United States Air Force
  } else if (str_detect(move_files[i], "UnitedStatesAirForceSchoolofAerospace")) {
    file.copy(move_files[i], "Completed_Submissions/USAFSAM")
    unlist(move_files[i])
    
  ## UW Virology
  } else if (str_detect(move_files[i], "UW_Virology")) {
    file.copy(move_files[i], "Completed_Submissions/UW_Virology")
    unlist(move_files[i])
  }
}

# Update completed files list for next time
write_csv(as.data.frame(move_files), "Completed_Submissions/Template_Submitters/Template_Submitters_File_Completed.csv")