dataframe_matcher

dataframe_matcher

Classes

Name	Description
DataFrameMatcher	A utility class for matching records.

DataFrameMatcher

dataframe_matcher.DataFrameMatcher(
    self,
    df_src: pl.DataFrame,
    df_ref: pl.DataFrame,
    first_name: str | tuple[str, str],
    last_name: str | tuple[str, str],
    dob: str | tuple[str, str],
    spec_col_date: str | tuple[str, str],
    key: str | list | None = None,
    threshold: int | float = 80,
    day_max: int | None = None,
    business_day_max: int | None = None,
)

A utility class for matching records.

This class provides functionality to match submissions to cases (epi data) based on exact matching via accessions or fuzzy matching based on patient demographics

Parameters

df_src : pl.DataFrame: Source dataframe containing any Key(s) and patient demographics.
df_ref : pl.DataFrame: Reference queried dataframe containing patient demographics.
first_name : str | tuple[str, str]: The first name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
last_name : str | tuple[str, str]: The last name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
dob : str | tuple[str, str]: The birth date demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
spec_col_date : str | tuple[str, str]: The specimen collection date column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
key : str | list | None = None: The key (or list of keys) which group to a distinct source record. Only one match can be returned per distinct key. If no key given, each row in the source df will be treated as a distinct record to be matched.
threshold : int | float = 80: The inclusive fuzzy scoring threshold used to filter fuzzy matches. Matches with a score at or above the threshold will be returned in the fuzzy matched object. Defaults to 80.
day_max : int | None = None: The max number of days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match
business_day_max : int | None = None: The max number of business days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match. Business days are counted as weekdays (holidays are not accounted for).

Returns

fuzzy_matched_review : pl.DataFrame: records that successfully fuzzy matched
fuzzy_without_demo : pl.DataFrame: records missing demographics and can’t be matched
fuzzy_matched_none : pl.DataFrame: records that did not have any match
fuzzy_matched_roster : pl.DataFrame: records that had an exact match to reference dataframe

Examples

Step 1: Import the packages:

from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import date

Step 2: The fuzzy matching functions need two dataframes to match. Bring your dataframe and the reference dataframe:

# Create example data
your_df = pl.DataFrame({
    'submission_number': [453278555, 453278555, 887730141],
    'first_name': ['DAVIS', 'DAVIS', 'GRANT'],
    'last_name': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
    'sub_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
    'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})

reference_df = pl.DataFrame({
    'CASE_ID': [100000032, 100000041, 100020000],
    'first_name_reference': ['DAVID', 'DAVID', 'TRASH'],
    'last_name_reference': ['SMITDAVIS', 'SMITDAVIS', 'PANDA'],
    'ref_collection_date': [date(2024, 11, 29), date(2024, 8, 31), date(2024, 8, 31)],
    'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})

Step 3: Initalize the fuzzy matching class and input which dataframes and columns you are matching on

fuzzy_init = dfm.DataFrameMatcher(
    df_src=your_df,
    df_ref=reference_df,
    first_name=('first_name', 'first_name_reference'),
    last_name=('last_name', 'last_name_reference'),
    dob='birth_date',
    spec_col_date=('sub_collection_date', 'ref_collection_date'),
    key='submission_number',
    threshold=80  # set what kind of fuzzy threshold you want, 100 being exact match
)

Step 4: Run fuzzy matching! This will output data on what matched and what didn’t match.

result = fuzzy_init.match()

Success: No data leaks detected. Insert victory cigar
0 exact matches
1 fuzzy matched
1 no match found
0 records without demo

Total unique persons processed:  2 
Original submissions to fuzzy (including no_match rematch attempt): 3

You can also examine the output dataframes by pulling them out of the result class, like this:

result.fuzzy_unmatched

index	submission_number	first_name	last_name	sub_collection_date	birth_date	first_name_clean	last_name_clean	submitted_collection_date	submitted_dob	CASE_ID	first_name_reference	last_name_reference	ref_collection_date	birth_date_right	first_name_clean_right	last_name_clean_right	reference_collection_date	first_name_result	last_name_result	reverse_first_name_result	reverse_last_name_result	match_ratio	reverse_match_ratio
0	887730141	GRANT	MITHCELL	2024-12-02	1990-06-21	GRANT	MITHCELL	2024-12-02	1990-06-21	100020000	TRASH	PANDA	2024-08-31	1990-06-21	TRASH	PANDA	2024-08-31	40	0	40	31	20.0	35.5

result.fuzzy_matched

index	submission_number	first_name	last_name	sub_collection_date	birth_date	first_name_clean	last_name_clean	submitted_collection_date	submitted_dob	CASE_ID	first_name_reference	last_name_reference	ref_collection_date	birth_date_right	first_name_clean_right	last_name_clean_right	reference_collection_date	first_name_result	last_name_result	reverse_first_name_result	reverse_last_name_result	match_ratio	reverse_match_ratio	day_count	business_day_count
0	453278555	DAVIS	SMITHDAVIS	2024-11-29	1989-07-15	DAVIS	SMITHDAVIS	2024-11-29	1989-07-15	100000032	DAVID	SMITDAVIS	2024-11-29	1989-07-15	DAVID	SMITDAVIS	2024-11-29	80	95	71	53	87.5	62.0	0	0

Methods

Name	Description
fuzzy_match	Where the magic happens. Do the fuzzy matching to the dataframe

fuzzy_match

dataframe_matcher.DataFrameMatcher.fuzzy_match(dob_match)

Where the magic happens. Do the fuzzy matching to the dataframe

Parameters

dob_match :: the dataframe that has records grouped by their dob match

Returns

fuzzy_matched : pl.DataFrame: dataframe with matches that met or exceeded the fuzzy matching score threshold
fuzzy_unmatched : pl.DataFrame: dataframe with matches that failed to meet the fuzzy matching score threshold

Examples

from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import date

# Create example data
df = pl.DataFrame({
    'submission_number': [453278555, 453278555, 887730141],
    'first_name_clean': ['DAVIS', 'DAVIS', 'GRANT'],
    'last_name_clean': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
    'submitted_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
    'submitted_dob': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)],
     'CASE_ID': [100000032, 100000041, None],
    'first_name_clean_right': ['DAVID', 'DAVID', None],
    'last_name_clean_right': ['SMITDAVIS', 'SMITDAVIS', None],
     'reference_collection_date': [date(2024, 11, 29), date(2024, 8, 31), None]
})

# Init dataframe matcher
# (this is not how to use the instance but input data not used in this example)
instance = dfm.DataFrameMatcher(df, df)
fuzzy_matched, fuzzy_unmatched = instance.fuzzy_match(dob_match=df)

Fuzzy match found:

helpers.gt_style(df_inp=fuzzy_matched)

no matches found:

helpers.gt_style(df_inp=fuzzy_matched_none)