from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import datedataframe_matcher
dataframe_matcher
Classes
| Name | Description |
|---|---|
| DataFrameMatcher | A utility class for matching records. |
DataFrameMatcher
dataframe_matcher.DataFrameMatcher(
self,
df_src: pl.DataFrame,
df_ref: pl.DataFrame,
first_name: str | tuple[str, str],
last_name: str | tuple[str, str],
dob: str | tuple[str, str],
spec_col_date: str | tuple[str, str],
key: str | list | None = None,
threshold: int | float = 80,
day_max: int | None = None,
business_day_max: int | None = None,
)A utility class for matching records.
This class provides functionality to match submissions to cases (epi data) based on exact matching via accessions or fuzzy matching based on patient demographics
Parameters
df_src : pl.DataFrame-
Source dataframe containing any Key(s) and patient demographics.
df_ref : pl.DataFrame-
Reference queried dataframe containing patient demographics.
first_name : str | tuple[str, str]-
The first name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
last_name : str | tuple[str, str]-
The last name demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
dob : str | tuple[str, str]-
The birth date demographic column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
spec_col_date : str | tuple[str, str]-
The specimen collection date column name in the source and reference dataframes. If the names are different, they should be provided in a tuple containing the source name first, followed by the reference name.
key : str | list | None = None-
The key (or list of keys) which group to a distinct source record. Only one match can be returned per distinct key. If no key given, each row in the source df will be treated as a distinct record to be matched.
threshold : int | float = 80-
The inclusive fuzzy scoring threshold used to filter fuzzy matches. Matches with a score at or above the threshold will be returned in the fuzzy matched object. Defaults to 80.
day_max : int | None = None-
The max number of days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match
business_day_max : int | None = None-
The max number of business days between reference and source specimen collection dates a fuzzy matched record can have and be returned as a match. Business days are counted as weekdays (holidays are not accounted for).
Returns
fuzzy_matched_review : pl.DataFrame-
records that successfully fuzzy matched
fuzzy_without_demo : pl.DataFrame-
records missing demographics and can’t be matched
fuzzy_matched_none : pl.DataFrame-
records that did not have any match
fuzzy_matched_roster : pl.DataFrame-
records that had an exact match to reference dataframe
Examples
Step 1: Import the packages:
Step 2: The fuzzy matching functions need two dataframes to match. Bring your dataframe and the reference dataframe:
# Create example data
your_df = pl.DataFrame({
'submission_number': [453278555, 453278555, 887730141],
'first_name': ['DAVIS', 'DAVIS', 'GRANT'],
'last_name': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
'sub_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})
reference_df = pl.DataFrame({
'CASE_ID': [100000032, 100000041, 100020000],
'first_name_reference': ['DAVID', 'DAVID', 'TRASH'],
'last_name_reference': ['SMITDAVIS', 'SMITDAVIS', 'PANDA'],
'ref_collection_date': [date(2024, 11, 29), date(2024, 8, 31), date(2024, 8, 31)],
'birth_date': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)]
})Step 3: Initalize the fuzzy matching class and input which dataframes and columns you are matching on
fuzzy_init = dfm.DataFrameMatcher(
df_src=your_df,
df_ref=reference_df,
first_name=('first_name', 'first_name_reference'),
last_name=('last_name', 'last_name_reference'),
dob='birth_date',
spec_col_date=('sub_collection_date', 'ref_collection_date'),
key='submission_number',
threshold=80 # set what kind of fuzzy threshold you want, 100 being exact match
)Step 4: Run fuzzy matching! This will output data on what matched and what didn’t match.
result = fuzzy_init.match()Success: No data leaks detected. Insert victory cigar
0 exact matches
1 fuzzy matched
1 no match found
0 records without demo
Total unique persons processed: 2
Original submissions to fuzzy (including no_match rematch attempt): 3
You can also examine the output dataframes by pulling them out of the result class, like this:
result.fuzzy_unmatched| index | submission_number | first_name | last_name | sub_collection_date | birth_date | first_name_clean | last_name_clean | submitted_collection_date | submitted_dob | CASE_ID | first_name_reference | last_name_reference | ref_collection_date | birth_date_right | first_name_clean_right | last_name_clean_right | reference_collection_date | first_name_result | last_name_result | reverse_first_name_result | reverse_last_name_result | match_ratio | reverse_match_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 887730141 | GRANT | MITHCELL | 2024-12-02 | 1990-06-21 | GRANT | MITHCELL | 2024-12-02 | 1990-06-21 | 100020000 | TRASH | PANDA | 2024-08-31 | 1990-06-21 | TRASH | PANDA | 2024-08-31 | 40 | 0 | 40 | 31 | 20.0 | 35.5 |
result.fuzzy_matched| index | submission_number | first_name | last_name | sub_collection_date | birth_date | first_name_clean | last_name_clean | submitted_collection_date | submitted_dob | CASE_ID | first_name_reference | last_name_reference | ref_collection_date | birth_date_right | first_name_clean_right | last_name_clean_right | reference_collection_date | first_name_result | last_name_result | reverse_first_name_result | reverse_last_name_result | match_ratio | reverse_match_ratio | day_count | business_day_count |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 453278555 | DAVIS | SMITHDAVIS | 2024-11-29 | 1989-07-15 | DAVIS | SMITHDAVIS | 2024-11-29 | 1989-07-15 | 100000032 | DAVID | SMITDAVIS | 2024-11-29 | 1989-07-15 | DAVID | SMITDAVIS | 2024-11-29 | 80 | 95 | 71 | 53 | 87.5 | 62.0 | 0 | 0 |
Methods
| Name | Description |
|---|---|
| fuzzy_match | Where the magic happens. Do the fuzzy matching to the dataframe |
fuzzy_match
dataframe_matcher.DataFrameMatcher.fuzzy_match(dob_match)Where the magic happens. Do the fuzzy matching to the dataframe
Parameters
dob_match :-
the dataframe that has records grouped by their dob match
Returns
fuzzy_matched : pl.DataFrame-
dataframe with matches that met or exceeded the fuzzy matching score threshold
fuzzy_unmatched : pl.DataFrame-
dataframe with matches that failed to meet the fuzzy matching score threshold
Examples
from wadoh_raccoon import dataframe_matcher as dfm
import polars as pl
from datetime import date
# Create example data
df = pl.DataFrame({
'submission_number': [453278555, 453278555, 887730141],
'first_name_clean': ['DAVIS', 'DAVIS', 'GRANT'],
'last_name_clean': ['SMITHDAVIS', 'SMITHDAVIS', 'MITHCELL'],
'submitted_collection_date': [date(2024, 11, 29), date(2024, 11, 29), date(2024, 12, 2)],
'submitted_dob': [date(1989, 7, 15), date(1989, 7, 15), date(1990, 6, 21)],
'CASE_ID': [100000032, 100000041, None],
'first_name_clean_right': ['DAVID', 'DAVID', None],
'last_name_clean_right': ['SMITDAVIS', 'SMITDAVIS', None],
'reference_collection_date': [date(2024, 11, 29), date(2024, 8, 31), None]
})
# Init dataframe matcher
# (this is not how to use the instance but input data not used in this example)
instance = dfm.DataFrameMatcher(df, df)
fuzzy_matched, fuzzy_unmatched = instance.fuzzy_match(dob_match=df)Fuzzy match found:
helpers.gt_style(df_inp=fuzzy_matched)no matches found:
helpers.gt_style(df_inp=fuzzy_matched_none)