library(pacman)
p_load(
  reticulate,
  fs,
  lubridate,
  dplyr,
  stringr,
  magrittr,
  readr,
  httr
)
Washington Department of Health
Data Integration/Quality Assurance
September 1, 2023
October 7, 2024
Summary
This project uses pacman::p_load() to load libraries. p_load() will also install any packages that aren’t already installed in the user’s environment.
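For readers who have not used pacman, the install-if-missing behavior described above can be approximated in base R. This is only a rough sketch of the idea, not pacman's actual implementation; the helper name load_or_install is hypothetical.

```r
# Hypothetical helper: a rough base-R approximation of the
# install-if-missing-then-load behavior (not pacman's own code)
load_or_install <- function(pkgs) {
  for (pkg in pkgs) {
    # install only when the package is not already available
    if (!requireNamespace(pkg, quietly = TRUE)) {
      install.packages(pkg)
    }
    # attach the package by its name given as a string
    library(pkg, character.only = TRUE)
  }
  invisible(pkgs)
}

# 'stats' and 'utils' ship with R, so nothing is installed here
load_or_install(c("stats", "utils"))
```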
CDC provides a list of COVID-19 lineages that we will pull into our R environment. We use httr::GET() to connect to the URL and pull the text, and httr::content() to pull all of the text out of the response.
Response [https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt]
Date: 2024-04-18 20:49
Status: 200
Content-Type: text/plain; charset=utf-8
Size: 243 kB
Lineage Description
A One of the two original haplotypes of the pandemic (A and B). Many sequence...
A.1 USA lineage
A.2 Mostly Spanish lineage now includes South and Central American sequences,...
A.2.2 Australian lineage
A.2.3 Scottish lineage
A.2.4 Panama lineage
A.2.5 Central American/ USA lineage
A.2.5.1 Costa Rica lineage, from #33
A.2.5.2 Predominantly Italy lineage
...
[1] "Lineage\tDescription \nA\tOne of the two original haplotypes of the pandemic (A and B). Many sequences originating from China and many global exports; including to South East Asia Japan South Korea Australia the USA and Europe represented in this lineage\nA.1\tUSA lineage\nA.2\tMostly Spanish lineage now includes South and Central American sequences, other European countries and Kazakhstan.\nA.2.2\tAustralian lineage\nA.2.3\tScottish lineage\nA.2.4\tPanama lineage\nA.2.5\tCentral American/ USA lineage\nA.2.5.1\t"
Note that the text is not formatted and looks like garbage. We’ll clean that up in subsequent steps.
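The request itself is not shown in the output above, so here is a minimal sketch of how it might look, assuming the URL printed in the response. The helper name fetch_lineage_text is hypothetical; lineage_content is the object name the later steps expect.

```r
library(httr)

# URL taken from the response printed above
lineage_url <- "https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt"

# Hypothetical helper: request the file and return its body as one string
fetch_lineage_text <- function(url) {
  response <- httr::GET(url)
  # raise an R error if the server did not return a success status
  httr::stop_for_status(response)
  # extract the body as a single character string
  httr::content(response, as = "text", encoding = "UTF-8")
}
```

With network access, lineage_content <- fetch_lineage_text(lineage_url) should give a single string like the one printed above.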
Now we will process the data so we can clean and wrangle it. Notice how the output above is delimited by \n. We’ll use that delimiter to split the string into a list of individual strings and unpack that list into individual rows. Within each row the lineage and its description are separated by \t, so splitting on that gives us a matrix with two columns, Lineage and Description.
# split by "\n" to get a list by row, unlist
lineages_string_by_row <- lineage_content %>%
  stringr::str_split("\n") %>%
  unlist()
# create a matrix from lineages_string_by_row with a second var
# created from splitting by "\t"
lineages_mat <- lineages_string_by_row %>%
  str_split("\t", simplify = TRUE)
This is what the first 6 rows of the matrix will look like:
[,1]
[1,] "Lineage"
[2,] "A"
[3,] "A.1"
[4,] "A.2"
[5,] "A.2.2"
[6,] "A.2.3"
[,2]
[1,] "Description "
[2,] "One of the two original haplotypes of the pandemic (A and B). Many sequences originating from China and many global exports; including to South East Asia Japan South Korea Australia the USA and Europe represented in this lineage"
[3,] "USA lineage"
[4,] "Mostly Spanish lineage now includes South and Central American sequences, other European countries and Kazakhstan."
[5,] "Australian lineage"
[6,] "Scottish lineage"
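To see the two splits in isolation, here is a small sketch on a made-up string. The lineage names X.1 and X.2 and their descriptions are placeholders, not real data.

```r
library(stringr)

# Made-up stand-in for the downloaded text: rows end in "\n", fields in "\t"
toy_content <- "Lineage\tDescription\nX.1\tFirst example\nX.2\tSecond example"

# first split: one string per row
toy_rows <- unlist(str_split(toy_content, "\n"))

# second split: one column per field; simplify = TRUE returns a character matrix
toy_mat <- str_split(toy_rows, "\t", simplify = TRUE)

print(toy_mat)
```

The header row is still the first row of the matrix at this point, which is why the next step removes it.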
Now we can define columns, filter, mutate and deduplicate the dataset. slice() will remove the first row that contains variable names. Some lineages have a * symbol attached (often the withdrawn ones), so we remove it. A distinct() statement then deduplicates the rows.
lineages_df <- lineages_mat %>%
  as_tibble(.name_repair = "unique") %>%
  # Rename columns
  rename(lineage_extracted = `...1`, description = `...2`) %>%
  # Remove the first row that contains variable names
  slice(-1) %>%
  # Drop rows where both the lineage and the description are empty
  filter(!(lineage_extracted == "" & description == "")) %>%
  # create a 'status' var that is populated with either 'Withdrawn' or 'Active'
  mutate(status = case_when(
    # when 'withdrawn' is detected in the string then populate status with 'Withdrawn'
    str_detect(tolower(description), "withdrawn") ~ "Withdrawn",
    # otherwise populate status with 'Active'
    TRUE ~ "Active"
  )) %>%
  # extract the string from the beginning up until the first space or end of string
  # and assign it as lineage_extracted. This step is at times necessary due to errors
  # in the file pulled. Within each row the lineage and description should be separated
  # by '\t' (see the str_split("\t") step above). However, at times a white space is
  # mistakenly entered instead
  mutate(lineage_extracted = str_extract(
    lineage_extracted,
    ".+?(?=$|[[:space:]])"
  )) %>%
  # remove '*' from the lineage_extracted variable. Often withdrawn lineages
  # are denoted with a '*' at the beginning
  mutate(lineage_extracted = str_remove_all(lineage_extracted, "\\*")) %>%
  # deduplicate any rows where lineage_extracted and status are the same
  # (description is not a priority)
  distinct(lineage_extracted, status, .keep_all = TRUE)
New names:
• `` -> `...1`
• `` -> `...2`
Let’s take a look at the result:
# A tibble: 6 × 3
  lineage_extracted description                                           status
  <chr>             <chr>                                                 <chr>
1 A                 One of the two original haplotypes of the pandemic (… Active
2 A.1               USA lineage                                           Active
3 A.2               Mostly Spanish lineage now includes South and Centra… Active
4 A.2.2             Australian lineage                                    Active
5 A.2.3             Scottish lineage                                      Active
6 A.2.4             Panama lineage                                        Active
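The cleaning rules are easier to follow on a few made-up rows. The sketch below applies the same status rule, lookahead pattern, and asterisk removal to placeholder values; the lineage names and descriptions are invented for illustration.

```r
library(dplyr)
library(stringr)

# Placeholder rows: a clean value, a '*' prefix, and a space typed instead of a tab
toy_df <- tibble(
  lineage_extracted = c("X.1", "*X.2", "X.3 typed with a space"),
  description = c("Example lineage", "Withdrawn: example", "")
)

toy_clean <- toy_df %>%
  # status comes from the description text, ignoring case
  mutate(status = case_when(
    str_detect(tolower(description), "withdrawn") ~ "Withdrawn",
    TRUE ~ "Active"
  )) %>%
  # lazy match that stops just before the first whitespace or the end of the string
  mutate(lineage_extracted = str_extract(lineage_extracted, ".+?(?=$|[[:space:]])")) %>%
  # drop any '*' markers
  mutate(lineage_extracted = str_remove_all(lineage_extracted, "\\*"))

print(toy_clean)
```

In the third row the description is empty because the description text ended up in the lineage column. The status rule reads only the description column, so such a row is labeled Active whatever that text says.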
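Now we can deduplicate the whole dataframe. After the distinct() step, a lineage can still appear twice if its two rows have different statuses. We’ll find the lineages that are duplicated and drop the copy that has status = “Active”, keeping the “Withdrawn” one.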
if (any(duplicated(lineages_df$lineage_extracted))) {
  # lineages that appear more than once
  dup_lineages <- lineages_df$lineage_extracted[duplicated(lineages_df$lineage_extracted)]
  # identify all records where lineage_extracted is duplicated
  dup_records <- lineages_df[lineages_df$lineage_extracted %in% dup_lineages, ]
  # identify records where status == "Active"
  dup_records_active <- dup_records %>%
    filter(status == "Active")
  # remove records where lineage_extracted is duplicated but status == "Active"
  # for final output
  lineages_df_final <- anti_join(lineages_df, dup_records_active)
# else, if no lineage_extracted value is duplicated
} else {
  # lineages_df is the final output
  lineages_df_final <- lineages_df
}
Joining with `by = join_by(lineage_extracted, description, status)`
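As a quick check of this logic, here is a sketch on made-up rows where one placeholder lineage appears with both statuses. Passing the join columns explicitly via by is an optional addition that silences the "Joining with" message.

```r
library(dplyr)

# Placeholder data: X.9 appears twice with different statuses
toy_df <- tibble(
  lineage_extracted = c("X.1", "X.9", "X.9"),
  description = c("Example lineage", "Withdrawn: example", "Example lineage"),
  status = c("Active", "Withdrawn", "Active")
)

# lineages that occur more than once
dup_lineages <- toy_df$lineage_extracted[duplicated(toy_df$lineage_extracted)]

# the "Active" copies of those duplicated lineages
dup_active <- toy_df %>%
  filter(lineage_extracted %in% dup_lineages, status == "Active")

# drop them, so only the "Withdrawn" copy of X.9 remains
toy_final <- anti_join(
  toy_df, dup_active,
  by = c("lineage_extracted", "description", "status")
)

print(toy_final)
```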
Here’s what the final dataframe should look like
Write the file to this repository
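The write step is not shown here, so the following is only a sketch of one way to do it with readr and fs. The helper name write_lineages, the data/lineages.csv location, and the example rows are assumptions; the demo writes under tempdir() so that it does not touch a real repository.

```r
library(readr)
library(fs)

# Hypothetical helper: create the target folder if needed, then write a CSV
write_lineages <- function(df, path) {
  fs::dir_create(fs::path_dir(path))
  readr::write_csv(df, path)
  invisible(path)
}

# Placeholder rows standing in for lineages_df_final
lineages_example <- data.frame(
  lineage_extracted = c("X.1", "X.2"),
  description = c("Example lineage", "Withdrawn: example"),
  status = c("Active", "Withdrawn")
)

# Assumed output location; swap in a path inside the repository as needed
out_path <- file.path(tempdir(), "data", "lineages.csv")
write_lineages(lineages_example, out_path)
```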