Lineages Extract

Look under the hood of the lineages.R script
Author
Affiliations

Frank Aragona

Washington Department of Health

Data Integration/Quality Assurance

Published

September 1, 2023

Modified

October 7, 2024

Summary

  • Pull lineage data from the CDC
  • Transform and clean the data
  • Output to a csv file

1 Load libraries

This project uses pacman::p_load() to load libraries. p_load() also installs any listed packages that aren’t already installed in the user’s environment.

Code
library(pacman)
p_load(
  reticulate,
  fs,
  lubridate,
  dplyr,
  stringr,
  magrittr,
  readr,
  httr
)
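
For readers who do not have pacman available, the install-if-missing behavior described above can be approximated in base R. This is a minimal sketch and not pacman's actual implementation; load_or_install() is a hypothetical helper name introduced only for illustration.

```r
# Hypothetical helper (not part of lineages.R or pacman):
# load a package, installing it first if it is not available.
load_or_install <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)
  }
  # character.only = TRUE lets library() accept the name as a string
  library(pkg, character.only = TRUE)
  invisible(pkg %in% loadedNamespaces())
}

# 'stats' ships with R, so this call never triggers an install
loaded <- load_or_install("stats")
```

To mirror the p_load() call above, the helper could be applied to each package name in turn, for example with lapply().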

2 Pull CDC Lineages via API

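CDC provides a list of COVID-19 lineages that we will pull into our R environment. The URL used below points to the lineage_notes.txt file in the cov-lineages/pango-designation GitHub repository.

  1. Provide the URL of the raw lineage_notes.txt file
  2. Use httr::GET() to send the request and retrieve the response
  3. Use httr::content() to extract the body of the response as a single text string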
Code
# input the url
url <- 'https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt'

# get the response of the url 
data <- httr::GET(url)
data
Response [https://raw.githubusercontent.com/cov-lineages/pango-designation/master/lineage_notes.txt]
  Date: 2024-04-18 20:49
  Status: 200
  Content-Type: text/plain; charset=utf-8
  Size: 243 kB
Lineage Description 
A   One of the two original haplotypes of the pandemic (A and B). Many sequence...
A.1 USA lineage
A.2 Mostly Spanish lineage now includes South and Central American sequences,...
A.2.2   Australian lineage
A.2.3   Scottish lineage
A.2.4   Panama lineage
A.2.5   Central American/ USA lineage
A.2.5.1 Costa Rica lineage, from #33
A.2.5.2 Predominantly Italy lineage
...
Code
# extract the content of the response (all the txt you see in the url)
lineage_content <- httr::content(data, as = "text")
substr(lineage_content,1,500)
[1] "Lineage\tDescription \nA\tOne of the two original haplotypes of the pandemic (A and B). Many sequences originating from China and many global exports; including to South East Asia Japan South Korea Australia the USA and Europe represented in this lineage\nA.1\tUSA lineage\nA.2\tMostly Spanish lineage now includes South and Central American sequences, other European countries and Kazakhstan.\nA.2.2\tAustralian lineage\nA.2.3\tScottish lineage\nA.2.4\tPanama lineage\nA.2.5\tCentral American/ USA lineage\nA.2.5.1\t"

Note that the text is one long, unformatted string with embedded \t and \n characters. We’ll clean that up in subsequent steps.
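
The request above assumes the download succeeds. A slightly more defensive variant might look like the sketch below. The helper name fetch_lineage_text() is hypothetical and not part of lineages.R; it only combines httr::GET(), httr::stop_for_status() and httr::content().

```r
# Hypothetical helper (not part of lineages.R): download a text file and
# fail loudly if the server does not return a successful status code.
fetch_lineage_text <- function(url) {
  response <- httr::GET(url)

  # stop_for_status() raises an R error for 4xx/5xx responses
  httr::stop_for_status(response)

  # declare the encoding explicitly so httr does not have to guess it
  httr::content(response, as = "text", encoding = "UTF-8")
}

# Usage (requires network access):
# lineage_content <- fetch_lineage_text(url)
```

Stopping early on a failed request avoids parsing an error page as if it were lineage data.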


3 Process the data

Now we will process the data so we can clean and wrangle it. Notice how the rows in the output above are delimited by \n. We’ll use that delimiter to split the string into a list of individual strings and unpack that list into a character vector with one element per row. Within each row, the lineage and its description are separated by \t, so splitting on \t gives us a matrix with two columns, Lineage and Description.

Code
# split by "\n" to get a list by row, unlist
lineages_string_by_row <- lineage_content %>%
  stringr::str_split("\n") %>%
  unlist()

# create a matrix from lineages_string_by_row with a second var 
# created from splitting by "\t"
lineages_mat <- lineages_string_by_row %>%
  str_split("\t", simplify = TRUE)


This is what the first 6 rows of the matrix will look like:

Code
head(lineages_mat)
     [,1]     
[1,] "Lineage"
[2,] "A"      
[3,] "A.1"    
[4,] "A.2"    
[5,] "A.2.2"  
[6,] "A.2.3"  
     [,2]                                                                                                                                                                                                                                  
[1,] "Description "                                                                                                                                                                                                                        
[2,] "One of the two original haplotypes of the pandemic (A and B). Many sequences originating from China and many global exports; including to South East Asia Japan South Korea Australia the USA and Europe represented in this lineage"
[3,] "USA lineage"                                                                                                                                                                                                                         
[4,] "Mostly Spanish lineage now includes South and Central American sequences, other European countries and Kazakhstan."                                                                                                                  
[5,] "Australian lineage"                                                                                                                                                                                                                  
[6,] "Scottish lineage"                                                                                                                                                                                                                    
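
To see the two splits in isolation, here is a small sketch on a made-up string (toy_content is illustrative data, not the real file). It also shows a side effect worth knowing: a trailing \n produces an empty final row, which str_split(simplify = TRUE) pads with empty strings.

```r
library(stringr)

# Toy input in the same shape as the real file: header, two rows, trailing newline
toy_content <- "Lineage\tDescription\nA\tRoot lineage\nA.1\tUSA lineage\n"

# Split into rows; the trailing "\n" yields a final empty string
toy_rows <- unlist(str_split(toy_content, "\n"))

# Split each row on the tab; simplify = TRUE returns a character matrix
# and pads short rows with ""
toy_mat <- str_split(toy_rows, "\t", simplify = TRUE)

dim(toy_mat)
toy_mat[nrow(toy_mat), ]
```

Rows that are empty in both columns, like the last one here, are what a filter on empty lineage and description values would remove.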

4 Transform the dataframe

Now we can name the columns, then filter, mutate and deduplicate the dataset:

  1. The columns for Lineage and Description still have placeholder names (...1 and ...2). We’ll rename them
  2. slice() will remove the first row that contains variable names
  3. We want to filter out empty rows, where both the lineage and the description are missing, since they won’t be of use to us
  4. Find lineages with withdrawn or active status and label them in a new column
  5. Extract the lineage name itself, which is everything up to the first whitespace
  6. Sometimes lineages have a * symbol attached, remove this
  7. Deduplicate rows that share the same lineage and status with a distinct() statement
Code
lineages_df <- lineages_mat %>%
  as_tibble(.name_repair = "unique") %>%
  
  # Rename columns
  rename(lineage_extracted = `...1`, description = `...2`) %>%
  
  # Remove the first row that contains variable names
  slice(-1) %>%
  
  # Filter out rows where both lineage and description are empty
  filter(!(lineage_extracted == "" & description == "")) %>%
  
  # create a 'status' var that is populated with either withdrawn or active
  mutate(status = case_when(
    # when 'withdrawn' is detected in the string then populate status with 'Withdrawn'
    str_detect(tolower(description), "withdrawn") ~ "Withdrawn",
    # otherwise populate status with 'Active'
    TRUE ~ "Active"
  )) %>%
  
  # extract the string from the beginning up to the first whitespace or the end
  # of the string and assign it to lineage_extracted. This step is sometimes
  # necessary because of errors in the file pulled. Within each row the lineage
  # and description should be separated by '\t' (see the split in section 3).
  # However, at times a white space is mistakenly entered instead
  mutate(lineage_extracted = str_extract(
    lineage_extracted,
    ".+?(?=$|[[:SPACE:]])"
    )
  ) %>%
  
  # remove '*' from the lineage_extracted variables. Often withdrawn lineages 
  # are denoted with a '*' at the beginning
  mutate(lineage_extracted = str_remove_all(lineage_extracted, "\\*")) %>%
  
  # deduplicate any rows where lineage_extracted and status are the same 
  # (description is not a priority)
  distinct(lineage_extracted, status, .keep_all = TRUE)
New names:
• `` -> `...1`
• `` -> `...2`


Let’s take a look at the result:

Code
head(lineages_df)
# A tibble: 6 × 3
  lineage_extracted description                                           status
  <chr>             <chr>                                                 <chr> 
1 A                 One of the two original haplotypes of the pandemic (… Active
2 A.1               USA lineage                                           Active
3 A.2               Mostly Spanish lineage now includes South and Centra… Active
4 A.2.2             Australian lineage                                    Active
5 A.2.3             Scottish lineage                                      Active
6 A.2.4             Panama lineage                                        Active
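
The string-cleaning steps in the pipeline can be tried on a few made-up values. The sketch below uses illustrative toy data only; it writes the POSIX class in its conventional lowercase form, [[:space:]], and uses ifelse() instead of case_when() for brevity.

```r
library(stringr)

# Toy values (made up): the second lineage has a leading '*' and a space
# where the tab separator should have been
toy_lineage <- c("A.1", "*B.1.1 misplaced text", "XBB.1.5")
toy_description <- c("USA lineage", "", "Withdrawn: reassigned to another lineage")

# Label each row from its description (case-insensitive match on "withdrawn")
toy_status <- ifelse(
  str_detect(tolower(toy_description), "withdrawn"),
  "Withdrawn",
  "Active"
)

# Keep everything up to the first whitespace (or the end of the string)
toy_extracted <- str_extract(toy_lineage, ".+?(?=$|[[:space:]])")

# Drop any '*' characters
toy_cleaned <- str_remove_all(toy_extracted, "\\*")

toy_cleaned
toy_status
```

Note that the status is derived from the description column only, so text that ended up in the lineage column because of a missing tab does not influence it.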

5 Deduplicate

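Now we can deduplicate the whole dataframe. After the distinct() step, a lineage can still appear twice: once with status = “Active” and once with status = “Withdrawn”. We’ll find those duplicated lineages and remove the “Active” records, so that only the “Withdrawn” record remains.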

Code
if (any(duplicated(lineages_df$lineage_extracted))) {
  
  # identify all records where lineage_extracted is duplicated 
  dup_records <- lineages_df[lineages_df$lineage_extracted %in% lineages_df$lineage_extracted[duplicated(lineages_df$lineage_extracted)],]
  
  # identify records where status == "Active"
  dup_records_active <- dup_records %>%
    filter(status == "Active")
  
  # remove records where lineage_extracted is duplicated and status == "Active"
  # from the final output (the "Withdrawn" record is kept)
  lineages_df_final <- anti_join(lineages_df, dup_records_active)
  
  # otherwise, no lineage_extracted value is duplicated
} else {
  # lineages_df is already the final output
  lineages_df_final <- lineages_df
}
Joining with `by = join_by(lineage_extracted, description, status)`
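
The same rule can be illustrated on a toy table. This sketch uses made-up rows and a single filter() instead of the anti_join() shown above; it is an equivalent formulation for illustration, not the code used in lineages.R.

```r
library(dplyr)

# Toy data (made up): B.1 appears once as Active and once as Withdrawn
toy_df <- tibble(
  lineage_extracted = c("A.1", "B.1", "B.1", "C.1"),
  description = c("USA lineage", "Old description",
                  "Withdrawn: reassigned", "Another lineage"),
  status = c("Active", "Active", "Withdrawn", "Active")
)

# Lineage names that occur more than once
dup_values <- toy_df$lineage_extracted[duplicated(toy_df$lineage_extracted)]

# Drop the Active record of any duplicated lineage; keep everything else
toy_final <- toy_df %>%
  filter(!(lineage_extracted %in% dup_values & status == "Active"))

toy_final
```

Here A.1 and C.1 are untouched, and B.1 survives only as the Withdrawn record.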

6 Review outputs

Here’s what the final dataframe should look like

Code
glimpse(lineages_df_final)
Rows: 4,204
Columns: 3
$ lineage_extracted <chr> "A", "A.1", "A.2", "A.2.2", "A.2.3", "A.2.4", "A.2.5…
$ description       <chr> "One of the two original haplotypes of the pandemic …
$ status            <chr> "Active", "Active", "Active", "Active", "Active", "A…
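
Beyond eyeballing the glimpse() output, a few automated checks can catch regressions when the source file changes. The sketch below is a hypothetical helper, check_lineages(), written in base R; the expectations it encodes are assumptions inferred from the steps above, not checks performed by lineages.R.

```r
# Hypothetical helper: stop with an error if the table breaks an expectation
check_lineages <- function(df) {
  stopifnot(
    # the three expected columns are present
    all(c("lineage_extracted", "description", "status") %in% names(df)),
    # each lineage appears at most once
    anyDuplicated(df$lineage_extracted) == 0,
    # status only takes the two known values
    all(df$status %in% c("Active", "Withdrawn")),
    # no leftover '*' or whitespace in the lineage names
    !any(grepl("[*[:space:]]", df$lineage_extracted))
  )
  invisible(TRUE)
}

# Toy data (made up) that satisfies every check
toy_final <- data.frame(
  lineage_extracted = c("A.1", "B.1"),
  description = c("USA lineage", "Withdrawn: reassigned"),
  status = c("Active", "Withdrawn"),
  stringsAsFactors = FALSE
)
check_result <- check_lineages(toy_final)
```

On the real data the call would be check_lineages(lineages_df_final), placed just before the file is written.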

7 Write the file

Write the final dataframe to data/lineages.csv in this repository. Missing values are written as empty strings.

Code
readr::write_csv(lineages_df_final, "data/lineages.csv", na = "")
# save(lineages, file = paste0("data-raw/data_", make.names(Sys.time()), ".Rda"))
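
To confirm how na = "" behaves, the sketch below writes a made-up table to a temporary file and reads it back. The toy data and the temporary path are illustrative only; the real script writes to data/lineages.csv.

```r
library(readr)

# Toy data (made up) with one missing description
toy_df <- data.frame(
  lineage_extracted = c("A.1", "B.1"),
  description = c("USA lineage", NA),
  status = c("Active", "Withdrawn"),
  stringsAsFactors = FALSE
)

# Write to a temporary file so nothing in the repository is touched
tmp <- tempfile(fileext = ".csv")
write_csv(toy_df, tmp, na = "")

# The missing value is written as an empty field
raw_lines <- readLines(tmp)
raw_lines

# Reading the file back turns the empty field into NA again
round_trip <- read_csv(tmp, show_col_types = FALSE)
round_trip
```

Writing NA as an empty field keeps the csv free of literal "NA" strings for missing values.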
