dataframe_matcher

dataframe_matcher

Classes

Name Description
DataFrameMatcher A utility class for matching records.

DataFrameMatcher

dataframe_matcher.DataFrameMatcher(
    self,
    df_src: pl.DataFrame,
    df_ref: pl.DataFrame,
    first_name: str | tuple[str, str],
    last_name: str | tuple[str, str],
    dob: str | tuple[str, str],
    spec_col_date: str | tuple[str, str],
    key: str | list | None = None,
    threshold: int | float = 80,
    day_max: int | None = None,
    business_day_max: int | None = None,
)

A utility class for matching records.

This class provides functionality to match submissions to cases (epi data) based on exact matching via accessions or fuzzy matching based on patient demographics

Parameters

df_src : pl.DataFrame

Source dataframe containing any Key(s) and patient demographics.

df_ref : pl.DataFrame

Reference queried dataframe containing patient demographics.

first_name : str | tuple[str, str]

The first name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.

last_name : str | tuple[str, str]

The last name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.

dob : str | tuple[str, str]

The birth date demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.

spec_col_date : str | tuple[str, str]

The specimen collection date column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.

key : str | list | None = None

The key (or list of keys) which group to a distinct source record. Only one match can be returned per distinct key. If no key given, each row in the source df will be treated as a distinct record to be matched.

threshold : int | float = 80

The inclusive fuzzy scoring threshold used to filter fuzzy matches. Matches with a score at or above the threshold will be returned in the fuzzy matched object. Defaults to 80.

day_max : int | None = None

The max number of days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match

business_day_max : int | None = None

The max number of business days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match. Business days are counted as weekdays (holidays are not accounted for).

Returns

fuzzy_matched_review : pl.DataFrame

records that successfully fuzzy matched

fuzzy_without_demo : pl.DataFrame

records missing demographics and can’t be matched

fuzzy_matched_none : pl.DataFrame

records that did not have any match

fuzzy_matched_roster : pl.DataFrame

records that had an exact match to reference dataframe

Examples

Step 1: Import the packages:

from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import date

Step 2: The fuzzy matching functions need two dataframes to match. Bring your dataframe and the reference dataframe:

# Create example data
your_df = pl.DataFrame({
    'submission_number': [453278555, 453278555, 887730141],
    'first_name': ['DAVIS', 'DAVIS', 'GRANT'],
    'last_name': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
    'sub_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
    'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})

reference_df = pl.DataFrame({
    'CASE_ID': [100000032, 100000041, 100020000],
    'first_name_reference': ['DAVID', 'DAVID', 'TRASH'],
    'last_name_reference': ['SMITDAVIS', 'SMITDAVIS', 'PANDA'],
    'ref_collection_date': [date(2024, 11, 29), date(2024, 8, 31), date(2024, 8, 31)],
    'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})

Step 3: Initalize the fuzzy matching class and input which dataframes and columns you are matching on

fuzzy_init = dfm.DataFrameMatcher(
    df_src=your_df,
    df_ref=reference_df,
    first_name=('first_name', 'first_name_reference'),
    last_name=('last_name', 'last_name_reference'),
    dob='birth_date',
    spec_col_date=('sub_collection_date', 'ref_collection_date'),
    key='submission_number',
    threshold=80  # set what kind of fuzzy threshold you want, 100 being exact match
)

Step 4: Run fuzzy matching! This will output data on what matched and what didn’t match.

result = fuzzy_init.match()
Success: No data leaks detected. Insert victory cigar
0 exact matches
1 fuzzy matched
1 no match found
0 records without demo

Total unique persons processed:  2 
Original submissions to fuzzy (including no_match rematch attempt): 3

You can also examine the output dataframes by pulling them out of the result class, like this:

result.fuzzy_unmatched
index submission_number first_name last_name sub_collection_date birth_date first_name_clean last_name_clean submitted_collection_date submitted_dob CASE_ID first_name_reference last_name_reference ref_collection_date birth_date_right first_name_clean_right last_name_clean_right reference_collection_date first_name_result last_name_result reverse_first_name_result reverse_last_name_result match_ratio reverse_match_ratio
0 887730141 GRANT MITHCELL 2024-12-02 1990-06-21 GRANT MITHCELL 2024-12-02 1990-06-21 100020000 TRASH PANDA 2024-08-31 1990-06-21 TRASH PANDA 2024-08-31 40 0 40 31 20.0 35.5
result.fuzzy_matched
index submission_number first_name last_name sub_collection_date birth_date first_name_clean last_name_clean submitted_collection_date submitted_dob CASE_ID first_name_reference last_name_reference ref_collection_date birth_date_right first_name_clean_right last_name_clean_right reference_collection_date first_name_result last_name_result reverse_first_name_result reverse_last_name_result match_ratio reverse_match_ratio day_count business_day_count
0 453278555 DAVIS SMITHDAVIS 2024-11-29 1989-07-15 DAVIS SMITHDAVIS 2024-11-29 1989-07-15 100000032 DAVID SMITDAVIS 2024-11-29 1989-07-15 DAVID SMITDAVIS 2024-11-29 80 95 71 53 87.5 62.0 0 0




Methods

Name Description
fuzzy_match Where the magic happens. Do the fuzzy matching to the dataframe
fuzzy_match
dataframe_matcher.DataFrameMatcher.fuzzy_match(dob_match)

Where the magic happens. Do the fuzzy matching to the dataframe

Parameters
dob_match :

the dataframe that has records grouped by their dob match

Returns
fuzzy_matched : pl.DataFrame

dataframe with matches that met or exceeded the fuzzy matching score threshold

fuzzy_unmatched : pl.DataFrame

dataframe with matches that failed to meet the fuzzy matching score threshold

Examples
from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import date

# Create example data
df = pl.DataFrame({
    'submission_number': [453278555, 453278555, 887730141],
    'first_name_clean': ['DAVIS', 'DAVIS', 'GRANT'],
    'last_name_clean': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
    'submitted_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
    'submitted_dob': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)],
     'CASE_ID': [100000032, 100000041, None],
    'first_name_clean_right': ['DAVID', 'DAVID', None],
    'last_name_clean_right': ['SMITDAVIS', 'SMITDAVIS', None],
     'reference_collection_date': [date(2024, 11, 29), date(2024, 8, 31), None]
})

# Init dataframe matcher
# (this is not how to use the instance but input data not used in this example)
instance = dfm.DataFrameMatcher(df, df)
fuzzy_matched, fuzzy_unmatched = instance.fuzzy_match(dob_match=df)

Fuzzy match found:

helpers.gt_style(df_inp=fuzzy_matched)

no matches found:

helpers.gt_style(df_inp=fuzzy_matched_none)