The Science Behind Matching

You might think matching and duplicate data detection is a straightforward exercise, and many times it is. Want to see if a data file contains duplicate account numbers? That’s easy. A middle-schooler could write a routine that accomplishes this task. It involves no variables other than simple issues like formatting or leading zero suppression.
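
As a rough illustration of how little the simple case needs, here is a minimal Python sketch. The function names and sample values are hypothetical, and the normalization rules (drop punctuation, drop leading zeros) are assumptions about what "formatting" means for a given file.

```python
from collections import defaultdict

def normalize_account(raw: str) -> str:
    """Drop punctuation and spaces, then drop leading zeros (assumed rules)."""
    cleaned = "".join(ch for ch in raw if ch.isalnum())
    return cleaned.lstrip("0").upper() or "0"

def find_duplicates(accounts):
    """Group raw values by normalized form and keep groups with 2+ members."""
    groups = defaultdict(list)
    for raw in accounts:
        groups[normalize_account(raw)].append(raw)
    return {key: vals for key, vals in groups.items() if len(vals) > 1}

# Hypothetical sample data
sample = ["00123-45", "12345", "98-765", "0098765", "55555"]
duplicates = find_duplicates(sample)
```

With this sample, "00123-45" and "12345" land in one group and "98-765" and "0098765" in another, while "55555" stands alone.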

But what if you’re analyzing data files of customer contact information constructed in different time periods, by different organizations, or with different rules? This data may have names, customer behavior, history, and other contact information in various formats. The data values are probably inconsistent from file to file. A standard match routine will not recognize that James Arnold, Jim Arnold, JR Arnold, Ross Arnold, Junior Arnold, and Arnold James could be the same person.

For more sophisticated matching, you’ll need software developed by data scientists. These routines may use deterministic methods of match detection, probabilistic methods, or both.

Deterministic Matching
Deterministic matching seeks equal values for data fields from one data record to another. This may sound like the simple account number matching example we mentioned above, but sophisticated deterministic matching uses scoring to decide how strong a match it has made. The software will also account for the presence or absence of data values. This is more sophisticated than a simple byte-by-byte comparison.

One hundred percent positive matches occur when the values of all inspected data fields are the same in both data records. When data fields exist in both compared records but the values differ, the software decides the records do not match exactly and assigns a weighted score based on the strength of the match.

The score for a pair of data records ultimately reflects both its matching and its non-matching fields. Field-to-field scoring can use word or phrase similarity, noise word removal, and cross-field comparisons, and each field’s score is weighted according to how much it should contribute to the overall record score.
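
To make weighted field scoring concrete, here is a minimal Python sketch. It is not how any particular product computes scores: the noise word list, the field weights, and the choice of word-set overlap (Jaccard similarity) as the similarity measure are all assumptions made for illustration, and cross-field comparison is left out for brevity.

```python
import re

# Assumed values for illustration only
NOISE_WORDS = {"the", "inc", "llc", "co", "corp"}
WEIGHTS = {"name": 0.5, "street": 0.3, "zip": 0.2}

def tokens(value):
    """Lowercase, split into words, and drop noise words."""
    words = re.findall(r"[a-z0-9]+", value.lower())
    return {w for w in words if w not in NOISE_WORDS}

def field_similarity(a, b):
    """Jaccard overlap of the two word sets: 1.0 identical, 0.0 disjoint or blank."""
    ta, tb = tokens(a), tokens(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)

def record_score(rec_a, rec_b, weights=WEIGHTS):
    """Weighted average of per-field similarities, scaled to 0..1."""
    total = sum(weights.values())
    score = sum(
        w * field_similarity(rec_a.get(f, ""), rec_b.get(f, ""))
        for f, w in weights.items()
    )
    return score / total

# Hypothetical records
a = {"name": "The Arnold Supply Co.", "street": "12 Main Street", "zip": "54601"}
b = {"name": "Arnold Supply", "street": "12 Main St", "zip": "54601"}
pair_score = record_score(a, b)
```

Here the names agree fully once noise words are removed, the ZIP codes agree, and the street lines share two of four distinct words, so the pair scores about 0.85.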

Users decide the thresholds for taking action. If the score falls below the low threshold, the matching software will not merge the data. Scores above the high threshold may be considered positive matches and cause the data to be combined. Scores between the two thresholds may be tagged for manual review.
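
The three-way decision described above can be sketched in a few lines of Python. The threshold values and the action labels are hypothetical; in practice users would tune them to their own data.

```python
def decide(score, low=0.60, high=0.85):
    """Map a 0..1 record score to an action using two user-set thresholds."""
    if not 0.0 <= low <= high <= 1.0:
        raise ValueError("thresholds must satisfy 0 <= low <= high <= 1")
    if score >= high:
        return "merge"      # treated as a positive match
    if score < low:
        return "no_match"   # records are kept separate
    return "review"         # ambiguous, send to a person

actions = [decide(s) for s in (0.92, 0.70, 0.30)]
```

Raising the high threshold trades fewer false merges for a larger manual review queue, which is usually the deciding factor when setting these values.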

Probabilistic Matching
With probabilistic matching, the software computes a matching score that determines the probability of a match. To use our example from above, matching “James Arnold” with “Jim Arnold” would yield a higher score than matching “James Arnold” with “Junior Arnold”. “Jim” is a common nickname for “James” but “Junior” is not. We can’t rule out the possibility without additional data, however. If social security numbers for James and Junior are different, the software won’t make the match. Conversely, if supporting information such as matching birthdates, spouse names, or street addresses exists, the match score for “Junior Arnold” could rise.
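
A minimal Python sketch of nickname-aware first-name scoring follows. The tiny nickname table and the score values (1.0, 0.9, 0.0) are assumptions chosen for illustration; real alias tables are far larger. A score of 0.0 here covers only the first name, so agreement on other fields could still raise the overall record score.

```python
# Tiny, assumed nickname table mapping a nickname to its formal name
NICKNAMES = {"jim": "james", "jimmy": "james", "bill": "william", "bob": "robert"}

def canonical(first_name):
    """Return the formal name for a known nickname, else the cleaned name itself."""
    name = first_name.strip().lower()
    return NICKNAMES.get(name, name)

def first_name_score(a, b):
    """1.0 for identical names, 0.9 for a known nickname pair, 0.0 otherwise."""
    if a.strip().lower() == b.strip().lower():
        return 1.0
    if canonical(a) == canonical(b):
        return 0.9
    return 0.0

jim_score = first_name_score("James", "Jim")
junior_score = first_name_score("James", "Junior")
```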

To be most effective, the probabilistic method considers many data fields. The more pieces of data the software compares, the more accurate the results. Probabilistic matching is sometimes referred to as “fuzzy matching” because it includes educated guesses, not exact matches. A scoring system helps software avoid matching records where the ambiguity is too high.
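
One widely used way to turn field agreements into a match probability is the Fellegi-Sunter model, sketched below in Python. This is offered as an example of the general approach, not as the method any specific product uses. The m and u probabilities (how often a field agrees among true matches and among non-matches) and the prior are made-up values; in practice they are estimated from data.

```python
import math

# Assumed (m, u) per field: P(agree | same person), P(agree | different people)
FIELD_PROBS = {
    "last_name": (0.95, 0.01),
    "birthdate": (0.97, 0.003),
    "street": (0.90, 0.02),
}

def field_weight(field, agrees):
    """Log2 likelihood ratio contributed by one field's agreement or disagreement."""
    m, u = FIELD_PROBS[field]
    if agrees:
        return math.log2(m / u)
    return math.log2((1 - m) / (1 - u))

def match_weight(agreements):
    """Sum the per-field weights; higher totals mean stronger evidence of a match."""
    return sum(field_weight(f, a) for f, a in agreements.items())

def match_probability(agreements, prior=0.001):
    """Combine the evidence with an assumed prior share of true matches."""
    prior_odds = prior / (1 - prior)
    odds = prior_odds * 2 ** match_weight(agreements)
    return odds / (1 + odds)

strong = match_probability({"last_name": True, "birthdate": True, "street": True})
mixed = match_probability({"last_name": True, "birthdate": False, "street": True})
weak = match_probability({"last_name": False, "birthdate": False, "street": False})
```

Each additional compared field adds its own weight to the total, which is why comparing more fields tends to separate true matches from coincidental ones more cleanly.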

Great Matching Takes Both
In most complex matching scenarios, data scientists combine deterministic and probabilistic matching to make data merging decisions. The two methods complement one another.

Laypersons may believe they should rely only on deterministic matches because they seem like more of a sure thing, but probabilistic matching methods can add value to a deterministic-based task. Adding probabilistic methods expands the scope of the matching or consolidation project.

Consider a case where the primary match criterion is a data field well-suited to deterministic matching, such as an email address. If some data sources do not include email addresses for all records, deterministic-only routines might skip valuable information from that data source. Consequently, an organization might lose data such as internet browsing patterns or customer buying history, simply because the data records containing this information lacked an email address matching the master record.

By adding probabilistic matching, the software can compare several data elements, even if the data values vary, and match the records with an acceptable level of certainty. Important customer data will be retained, allowing the organization to use this information to enhance future customer experiences and run more effective marketing campaigns.
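
The email scenario above can be sketched in Python as a deterministic check with a fuzzy fallback. The field names, the 0.85 threshold, and the use of the standard library's difflib.SequenceMatcher as the similarity measure are all assumptions for illustration; a production routine would use weighted, field-specific comparisons.

```python
from difflib import SequenceMatcher

def similarity(a, b):
    """Character-level similarity in 0..1; blank values never count as agreement."""
    if not a or not b:
        return 0.0
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def link(master, candidate, threshold=0.85):
    """Use email when both records have one; otherwise compare other fields."""
    m_email = master.get("email", "").strip().lower()
    c_email = candidate.get("email", "").strip().lower()
    if m_email and c_email:
        return ("deterministic", m_email == c_email)
    fields = ("name", "street", "zip")
    score = sum(
        similarity(master.get(f, ""), candidate.get(f, "")) for f in fields
    ) / len(fields)
    return ("fuzzy", score >= threshold)

# Hypothetical records: the candidate has purchase history but no email
master = {"email": "j.arnold@example.com", "name": "James Arnold",
          "street": "12 Main Street", "zip": "54601"}
candidate = {"name": "James Arnold", "street": "12 Main St", "zip": "54601"}
result = link(master, candidate)
```

A deterministic-only routine would skip the candidate record for lack of an email address, while the fallback links it on name, street, and ZIP code.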

Firstlogic Match/Consolidate Methods
Firstlogic’s Match/Consolidate® software combines deterministic and probabilistic methods to give customers the best possible performance while allowing them complete control over the matching process.

Controls within the software allow for deterministic settings such as:

  1. Create simple match/no match rules
  2. Define rules for what to do when data fields in one or both compared records are blank
  3. Rank the records in a match group based on completeness of fields in the record
  4. Specify a weighted score for each field
  5. Set match vs no match thresholds

Firstlogic uses probabilistic matching most often when comparing names of people or firms. Our Match/Consolidate® software uses name aliases and cross-compares known alternate names. We also cross-compare first and middle names, knowing that Hubert James Smith is likely to call himself James Smith. We can even unscramble some names. If a data record listed a misspelled customer name as “James Anrold”, the Firstlogic software would recognize the likely character transposition and score the match to “James Arnold” appropriately. Our probabilistic approach also removes noise words that can detract from the matching process, allowing us to identify all the possible matches.
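
For readers curious about what checks like these involve, here is a generic Python sketch of two of them: spotting a swap of two neighboring characters, and comparing first and middle names across records. It does not show the Match/Consolidate® implementation. The function names are hypothetical, and the name check assumes each name has at least one word with the surname last.

```python
def is_adjacent_transposition(a, b):
    """True when b is a with exactly one pair of neighboring characters swapped."""
    a, b = a.lower(), b.lower()
    if len(a) != len(b) or a == b:
        return False
    diffs = [i for i in range(len(a)) if a[i] != b[i]]
    return (
        len(diffs) == 2
        and diffs[1] == diffs[0] + 1
        and a[diffs[0]] == b[diffs[1]]
        and a[diffs[1]] == b[diffs[0]]
    )

def given_names_overlap(full_a, full_b):
    """True when surnames agree and any first or middle name is shared."""
    *given_a, last_a = full_a.lower().split()
    *given_b, last_b = full_b.lower().split()
    return last_a == last_b and bool(set(given_a) & set(given_b))

swapped = is_adjacent_transposition("Arnold", "Anrold")
shared = given_names_overlap("Hubert James Smith", "James Smith")
```

Both checks return True for the examples in the text. In a scoring system, a positive result would typically raise the name score rather than declare a match outright.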

Understanding how matching works is important in evaluating your data quality requirements and selecting the right tools for the job. Employing both deterministic and probabilistic matching methods, Firstlogic’s Match/Consolidate® software generates results consistent with our customers’ varied requirements for data matching, duplicate recognition, and data consolidation.