The Probabilistic matching notebook in AI Workbench examines a set of customer profiles to find likely, or “fuzzy,” matches among them. Fuzzy matches are profiles that likely belong to the same person even though not all of their fields have exactly the same values.
Before you begin
To get started, contact your BlueConic Customer Success Manager to have this notebook plugin added to your BlueConic tenant.
Add a Probabilistic matching notebook
Navigate to More > AI Workbench > Add notebook.
Choose Probabilistic matching notebook from the pop-up window.
Give your notebook a name.
Save your settings.
Set the Probabilistic matching notebook parameters
Select the Parameters tab.
Select a customer segment to refine the model's search for likely matches, reducing dataset size and improving accuracy.
Select at least 3 profile properties for matching. Use diverse property types (e.g., phone, email) to improve accuracy and reduce false matches.
(Optional) Choose profile properties that must match exactly. These must be a subset of the properties selected for matching. Leave this field empty if no exact matches are required.
Select a profile property for the merge ID. This ID helps automatically merge matching profiles using BlueConic's profile merge rules.
Set the maximum Damerau-Levenshtein edit distance to define how similar profiles must be to match (excluding exact-match properties). A value of 1 is recommended.
(Optional) Set a date to include only profiles updated since then (e.g., order added, contact info updated). Leave empty to include all profiles.
(Recommended) Enable All profile properties must have a value to ensure profiles are only checked if all selected properties have a value. This improves match reliability.
Set a limit for testing or benchmarking. Use 0 to process all profiles, subject to segment size and value requirements.
Click Save.
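The constraints in the steps above can be summarized as a small checklist in code. This is only an illustrative sketch: the MatchingParameters class, its field names, and the example property names are hypothetical and are not part of any BlueConic API. It simply encodes the rules stated above (at least 3 matching properties, exact-match properties as a subset, 0 as "no limit").

```python
from dataclasses import dataclass, field
from datetime import date
from typing import Optional


@dataclass
class MatchingParameters:
    """Hypothetical mirror of the Parameters tab (not a BlueConic API)."""
    segment: str
    match_properties: list
    merge_id_property: str
    exact_match_properties: list = field(default_factory=list)
    max_edit_distance: int = 1           # a value of 1 is recommended
    updated_since: Optional[date] = None  # None includes all profiles
    require_all_values: bool = True       # recommended setting
    profile_limit: int = 0                # 0 processes all profiles

    def validate(self) -> list:
        """Return a list of problems; an empty list means the settings are consistent."""
        problems = []
        if len(set(self.match_properties)) < 3:
            problems.append("Select at least 3 profile properties for matching.")
        extra = set(self.exact_match_properties) - set(self.match_properties)
        if extra:
            problems.append(
                f"Exact-match properties must be selected for matching: {sorted(extra)}"
            )
        if self.max_edit_distance < 0:
            problems.append("The maximum edit distance cannot be negative.")
        if self.profile_limit < 0:
            problems.append("The profile limit cannot be negative.")
        return problems


# Example with made-up segment and property names.
params = MatchingParameters(
    segment="All customers",
    match_properties=["email", "phone", "last_name"],
    merge_id_property="merge_id",
    exact_match_properties=["last_name"],
)
print(params.validate())
```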
Run a Probabilistic matching notebook
Select the Schedule and run history tab.
Click Run now to run the notebook analysis manually.
To schedule the notebook to run at a future date, activate Enable scheduling.
Click the Settings icon and choose a scheduling option from the drop-down list.
Set a time for the run. Click OK.
Save your settings.
View your results
When run manually in AI Workbench, the notebook displays execution times (rounded to 0.1 min) for key operations, which scale linearly with profile count. This helps estimate runtimes for larger datasets. Scheduled runs provide limited timing data in status updates. Both manual and scheduled runs log profile counts, exact matches, and fuzzy matches.
For matched profiles, the notebook assigns a merge ID to a selected profile property. If included in BlueConic’s merge rules, profiles with the same ID merge automatically. Matches are also logged in a CSV file with profile IDs and examined property values.
Next steps
Tie the notebook directly to your profile merging rules.
FAQs
What is an example of probabilistic matching?
A common example is when two profiles have the same first and last name, but their phone numbers differ by one digit—possibly due to a typo. The notebook identifies matches by recognizing common typos, misspellings, or intentional variations across multiple profiles. This approach is probabilistic rather than exact, allowing it to find likely matches even when some details differ. It helps answer questions like:
“Which profiles belong to the same person despite minor spelling differences?”
“How common are 'fuzzy' matches in my profile database?”
“How many exact matches are missed by my current merge rules?”
What profile properties are best for probabilistic matching?
For accurate probabilistic matching, include at least one highly diverse property, like a phone number or email, to reduce false matches. Less diverse properties, like age, are fine as long as a diverse one is also included. Without a diverse property, false matches increase exponentially: many John Smiths share the same age, for example, but adding a near-matching phone number makes a true match more likely. Also, properties like age should match exactly, since a small typo (e.g., 29 → 92) counts as a single edit yet changes the value completely.
How can I increase the accuracy of probabilistic profile matches?
Choose profile properties carefully for fuzzy matching, ensuring at least one is highly diverse. A simple rule: short codes (e.g., two-letter values) lack uniqueness in large datasets. Some properties, like names and addresses, seem diverse but have common values that can skew results.
Improve accuracy by requiring exact matches for key properties or ignoring profiles with empty values—both also speed up processing. The notebook’s runtime scales linearly with profile count (e.g., doubling profiles doubles runtime). In one test, analyzing 3.4M profiles took ~20 minutes to find 14K matches; 1.7M profiles took 10 minutes.
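Because runtime scales linearly with profile count, a rough estimate for another dataset size is a simple proportion. The sketch below uses the test figures quoted above (3.4M profiles in about 20 minutes) as a reference point; the function name is made up for illustration, and actual runtimes will depend on your tenant, selected properties, and settings.

```python
def estimate_runtime_minutes(profile_count: int,
                             ref_profiles: int = 3_400_000,
                             ref_minutes: float = 20.0) -> float:
    """Linear extrapolation from one reference run (an approximation only)."""
    return profile_count * ref_minutes / ref_profiles


# Half the reference profile count gives half the reference runtime.
print(estimate_runtime_minutes(1_700_000))
```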
How does edit-distance improve accuracy in fuzzy matching?
The notebook uses the Damerau-Levenshtein edit-distance measure to assess similarity between values. The distance increases by 1 for each deletion, insertion, substitution, or swap of two adjacent characters, all of which are common in typos or intentional changes (e.g., slight phone number variations). Identical values have an edit-distance of 0. Capitalization, hyphens, and special characters are ignored.
For example, “Jennifer” vs. “Jenifer” (one letter deleted) and “Michael” vs. “Micheal” (adjacent letters swapped) both have an edit-distance of 1. To minimize false matches, an edit-distance of 1 is recommended, as increasing the threshold can exponentially raise both false matches and processing time.
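To make the measure concrete, here is a minimal sketch of the restricted Damerau-Levenshtein distance (also called optimal string alignment). It is not the notebook's own code, and which Damerau-Levenshtein variant the notebook uses is not stated here. The normalize helper is an assumption about how capitalization, hyphens, and special characters are ignored: it lowercases the value and keeps only letters and digits.

```python
def normalize(value: str) -> str:
    """Assumed normalization: lowercase, keep only letters and digits."""
    return "".join(ch for ch in value.lower() if ch.isalnum())


def edit_distance(a: str, b: str) -> int:
    """Restricted Damerau-Levenshtein (optimal string alignment) distance."""
    rows, cols = len(a) + 1, len(b) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i  # delete all of a[:i]
    for j in range(cols):
        d[0][j] = j  # insert all of b[:j]
    for i in range(1, rows):
        for j in range(1, cols):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(
                d[i - 1][j] + 1,         # deletion
                d[i][j - 1] + 1,         # insertion
                d[i - 1][j - 1] + cost,  # substitution (or match)
            )
            # Swap of two adjacent characters counts as one edit.
            if (i > 1 and j > 1
                    and a[i - 1] == b[j - 2]
                    and a[i - 2] == b[j - 1]):
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)
    return d[-1][-1]


def value_distance(a: str, b: str) -> int:
    """Distance between two property values after normalization."""
    return edit_distance(normalize(a), normalize(b))


print(value_distance("Jennifer", "Jenifer"))  # one letter deleted
print(value_distance("Michael", "Micheal"))   # adjacent letters swapped
print(value_distance("29", "92"))             # swapped digits in an age
```

The last call shows why a property like age is better matched exactly: a digit swap is a single edit even though the two values are far apart.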
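Edit-distance is calculated per profile property, meaning the total edit-distance between two profiles can be up to (number of fuzzy-matched properties) × (allowed edit-distance) while still qualifying as a fuzzy match. For example, with four fuzzy-matched properties and an allowed edit-distance of 1, two profiles can differ by up to four edits in total. Properties that must match exactly do not add to this total. The notebook uses the SymSpell algorithm for rapid matching within the allowed edit-distance.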
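The core idea behind SymSpell, often called symmetric delete, can be sketched in a few lines. Instead of comparing every value with every other value, each value is indexed under itself and under every string obtained by deleting one character; values that land in the same bucket become candidate matches. The sketch below is a simplified illustration for an edit-distance of 1, not the SymSpell library or the notebook's implementation, and the function names are made up.

```python
from collections import defaultdict
from itertools import combinations


def delete_variants(value: str) -> set:
    """The value itself plus every string made by deleting one character."""
    return {value} | {value[:i] + value[i + 1:] for i in range(len(value))}


def candidate_pairs(values) -> set:
    """Pairs of distinct values that share at least one delete variant."""
    index = defaultdict(set)
    for value in set(values):
        for variant in delete_variants(value):
            index[variant].add(value)
    pairs = set()
    for bucket in index.values():
        # Sorting gives each pair a stable (smaller, larger) order.
        for a, b in combinations(sorted(bucket), 2):
            pairs.add((a, b))
    return pairs


names = ["jennifer", "jenifer", "michael", "micheal", "robert"]
for pair in sorted(candidate_pairs(names)):
    print(pair)
```

Candidates found this way are a superset of the true matches: "abc" and "bcd" share the variant "bc" but are two edits apart. Each candidate pair therefore still needs to be checked with the actual edit-distance before it counts as a match.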