A Business Guide to Fuzzy Matching
Let’s look at two simple data records one might find in any company’s customer database:
RECORD 1: John Smith Jr, 123 Washingtn St SW, Atlanta, GA, 30315
RECORD 2: Jonathan K Smith Jr, 123 SW Washington St, Atlanta, GA, 30315
If it were your job to find duplicates manually using common sense, you would likely decide that data from these two records could be merged.
You used fuzzy matching to make that decision. The records are not exact duplicates. The names and street spellings differ, and the street directional and street name appear in a different order. And yet, you would probably believe it was safe to assume these two data records referred to a single individual. You know that John and Jonathan are variations of the same name, even though they are spelled differently. You also know that in some neighborhoods, street directionals come before the street name and in others, they come after, and that people sometimes mix them up. Your human brain recognized that the missing letter in Record 1 is probably a typo.
Fuzzy matching is probabilistic. This means the matching software takes variations into account before it decides whether a pair of data records are duplicates. This approach is beneficial to businesses because the matching software can find probable matches even if the data is not identical. Fuzzy matching takes the degree of variance into account, much as a human would.
If you had relied on exact matches, or deterministic matching, to find duplicate customers based on names and addresses, both records would remain in the customer database. Every time you mailed something to your customers, Mr. Smith would get two copies. Some of his purchases from your company might be recorded in each of the CRM records, as would a log of customer communications. The count of customers in Atlanta would be inflated.
Inaccuracies caused by duplicate data can affect your company in many ways, none of them positively. You can see why businesses of all types must take steps to eliminate or combine duplicates.
Match IQ® from Firstlogic offers businesses full-featured match, merge/purge, and data consolidation capabilities, including fuzzy matching. Read on to see how fuzzy matching works and when to use it to improve operations throughout a business enterprise.
How Do Businesses End Up with Duplicate Data?
Businesses deal with an influx of data from various sources. Customer interactions, transactions, third-party data, and employee records are just some sources of data that enter an organization. Handling this sheer volume can be a challenge for even the most organized businesses. Mergers and acquisitions are also potential sources of duplicate data.
Duplicate data comes into existence in several ways. The most common scenarios include:
- Data Entry Errors: Data entry is perhaps the most basic and common cause of duplicate data. Mistakes while entering data, such as typos or incorrect formatting, can create multiple versions of the same record.
- Multiple Data Sources: When an organization combines data from different sources, the same records might be duplicated.
- System Migrations & Upgrades: As businesses upgrade or switch their systems, there may be instances where the same data is copied to the new system, thereby creating duplicates.
- Processing Errors: Mistakes happen. Sometimes they cause records to be loaded to a database multiple times.
No organization can count on the data entering its enterprise to be clean, standardized, and unique. Far from it. If a company accepts orders directly from consumers, it may receive data through mobile apps, desktop web browser forms, handwritten order forms, or phone calls. Each of these communication channels has the potential to create duplicate data. Customers who can’t remember their login credentials tend to just create a new account, manufacturing duplicate data. Handwriting may be hard to read. Fat fingers or auto-complete may be responsible for slightly different data that enters the business.
Matching unstructured data is a perfect application for fuzzy matching.
What is Fuzzy Matching?
Fuzzy matching compares data containing variations or inconsistencies. The technique is useful when dealing with large datasets that have unstructured or semi-structured data. Businesses use fuzzy matching to find and merge similar or related records, even if they are not an exact match. Fuzzy matching software makes matches by measuring the similarity between two strings or sets of data and assigning a similarity score.
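To make the idea of a similarity score concrete, here is a minimal sketch using Python's standard-library `difflib.SequenceMatcher`. It is only an illustration of scoring two strings, not a description of how Match IQ or any other product computes its scores. The function name `similarity` is an assumption made for this example; the two records are the ones from the opening of this guide.

```python
from difflib import SequenceMatcher

def similarity(a: str, b: str) -> float:
    """Return a score from 0.0 (nothing in common) to 1.0 (identical)."""
    # Lowercase both strings so that letter case does not affect the score.
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

record_1 = "John Smith Jr, 123 Washingtn St SW, Atlanta, GA, 30315"
record_2 = "Jonathan K Smith Jr, 123 SW Washington St, Atlanta, GA, 30315"

print(similarity(record_1, record_1))            # identical strings score 1.0
print(round(similarity(record_1, record_2), 2))  # high, but below 1.0
```

An exact comparison would only ever report "same" or "different" for these two records. A score gives the business a number it can reason about.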
Matching algorithms and techniques calculate the similarity between sets of data. These fuzzy matching algorithms take various factors, such as spelling variations, phonetic similarities, and partial matches, into account. The process involves breaking data into smaller units, such as words or characters, and comparing them based on their similarity. Companies control the matching to decrease false positives. They set parameters that specify the allowable degree of similarity between two records.
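As a rough sketch of "breaking data into smaller units," the example below splits an address into word tokens and measures how many tokens two addresses share (Jaccard similarity). The helper names `tokens` and `jaccard` are assumptions for illustration, and real matching software is likely to be considerably more sophisticated.

```python
import re

def tokens(text: str) -> set[str]:
    """Lowercase the text and split it into a set of word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: str, b: str) -> float:
    """Share of distinct tokens that the two strings have in common."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)

# Same words in a different order: token comparison ignores the order.
print(jaccard("123 Washington St SW", "123 SW Washington St"))  # 1.0

# The typo "Washingtn" no longer equals the token "Washington".
print(jaccard("123 Washingtn St SW", "123 SW Washington St"))   # 0.6
```

Word-level comparison handles the reordered street directional well but is thrown off by a single typo, which is one reason matching tools combine several techniques.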
Fuzzy matching can use techniques such as phonetic algorithms, string distance algorithms, and probabilistic algorithms. Phonetic algorithms, such as Soundex or Metaphone, convert words or names into phonetic codes and compare the codes to find similarities. These algorithms are useful for data that came from spoken conversations or speech-to-text applications. String distance algorithms measure the number of edits required to transform one string into another. Probabilistic algorithms calculate the similarity based on the presence or absence of certain elements in the data.
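The sketch below shows one phonetic algorithm and one string distance algorithm in plain Python: a simplified American Soundex and the Levenshtein edit distance. Both are textbook versions written for illustration. The function names are assumptions, and production tools may use different variants or rules.

```python
# Soundex digit for each consonant; vowels, h, w, and y carry no code.
CODES = {}
for letters, digit in (("bfpv", "1"), ("cgjkqsxz", "2"), ("dt", "3"),
                       ("l", "4"), ("mn", "5"), ("r", "6")):
    for ch in letters:
        CODES[ch] = digit

def soundex(name: str) -> str:
    """Encode a name as its first letter plus three digits."""
    letters = [c for c in name.lower() if c.isalpha()]
    if not letters:
        return ""
    encoded = letters[0].upper()
    prev = CODES.get(letters[0], "")
    for ch in letters[1:]:
        code = CODES.get(ch, "")
        if code and code != prev:
            encoded += code
        if ch not in "hw":  # h and w do not separate repeated codes
            prev = code
    return (encoded + "000")[:4]

def levenshtein(a: str, b: str) -> int:
    """Count the single-character edits needed to turn a into b."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]

print(soundex("Smith"), soundex("Smyth"))      # S530 S530
print(soundex("O'Neal"), soundex("Oneal"))     # O540 O540
print(levenshtein("Washingtn", "Washington"))  # 1
```

Note that Soundex gives "John" and "Jonathan" different codes (J500 and J535), so a phonetic code alone would not link the two records at the top of this guide. Nickname and name-variation matching needs additional logic.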
The choice of algorithm depends on the specific requirements of the data and the desired level of accuracy. Some fuzzy matching algorithms are more suitable for comparing short strings, while others are better for analyzing longer texts. Factors such as the size of the dataset, the complexity of the data, and the computational resources available can influence the performance of fuzzy matching algorithms.
Once the fuzzy matching software calculates the similarity scores, a pre-set threshold determines whether it considers two records a match. Companies can adjust this threshold based on the desired level of precision.
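A threshold decision can be sketched in a few lines. The scoring function here is Python's standard-library `difflib.SequenceMatcher`, and the default threshold of 0.85 is an arbitrary value chosen for illustration, not a recommended setting.

```python
from difflib import SequenceMatcher

def is_match(a: str, b: str, threshold: float = 0.85) -> bool:
    """Treat two strings as a match when their score reaches the threshold."""
    score = SequenceMatcher(None, a.lower(), b.lower()).ratio()
    return score >= threshold

print(is_match("Nancy O'Neal", "Nancy Oneal"))                  # True
print(is_match("Nancy O'Neal", "Mark Jones"))                   # False
# A stricter threshold rejects the same pair that matched above.
print(is_match("Nancy O'Neal", "Nancy Oneal", threshold=0.99))  # False
```

Raising the threshold reduces false positives at the cost of missing some true duplicates; lowering it does the opposite.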
Fuzzy Data Matching Reduces Redundancy
Your business may maintain millions or perhaps billions of data entries. Some of them are duplicates—every company has them. Duplicate data can cause problems when organizations attempt to consolidate data, do statistical analysis, migrate from one system to another, or use the data to drive business operations.
Fuzzy matching allows companies to merge data. As a result, they can:
- Lower Costs: The most obvious savings come from producing and distributing fewer customer communications. Clean, correct data also supports better-informed judgments about capital investments, promotional campaigns, store locations, staffing, and many other decisions. Eliminating duplicate data reduces processing time, data transmission time, and data storage requirements.
- Improve Campaign ROI: Computing the effectiveness of campaigns is an important part of the marketing process. Duplicate offers add cost without adding responses, so they make a campaign look less effective than it really is.
- Improve Customer Experience: Fuzzy matching allows organizations to combine data from multiple sources and build 360-degree customer views, which makes all subsequent customer interactions better.
Fuzzy Matching Scenarios
An online retailer accumulates hundreds of thousands of records from users signing up from multiple sources such as mobile apps, websites, or promotional events. Over time, duplicates creep into the databases. John Doe signs up on the website as John Doe and later uses his Facebook account to log in on the mobile app as Jonny Doe. Fuzzy matching can identify that John Doe and Jonny Doe might be the same person based on matching data points like customer names along with IP addresses or phone numbers.
A hotel group manages guest data from their properties around the globe. Two guests, “A. Smith from New York” and “Anthony Smith from NY, USA” make reservations in two different properties. Mr. Smith’s corporate travel desk made one reservation, and he made the other himself using a travel site on the internet. Here, by analyzing data besides the name and city, fuzzy matching can help point out that these two entries may be for the same person.
In a third scenario, consider a healthcare institution dealing with patient records. The records for “Nancy O’Neal” and “Nancy Oneal” may belong to the same person, but because of a simple typographical error during data entry, they appear as two separate individuals. Personal health information laws require the healthcare company to avoid merging the health records of two different patients, so the fuzzy matching software flags these records as potential duplicates to be investigated and corrected rather than merging them automatically.
One of Firstlogic’s customers, a cruise ship company, requires passengers on international cruises to present a passport when boarding the ship. Cruise company employees compare the documents passengers bring with them to names on their reservation lists. If the names don’t match, passenger boarding can be delayed or even denied, a severe downturn in the customer experience! Embarkation day issues can be avoided by comparing passenger-supplied names on the reservations to legal names on file with the cruise company. Fuzzy matching allows those matches to occur, even if the names are slightly different in the two lists.
These illustrative fuzzy matching scenarios highlight a crucial point: wherever a large dataset with potential for duplication and human error exists, fuzzy matching can prove to be an invaluable tool.
When Should You Use Fuzzy Matching?
Companies should think about using fuzzy matching when dealing with datasets that contain unstructured or semi-structured data. This tool is incredibly beneficial when the data’s inconsistency poses a challenge because of misspellings, typos, or varying formats. Fuzzy matching will help spot these subtle similarities and connect related data records efficiently. For instance, the same physical address written differently across various records could still be matched up, thanks to fuzzy matching!
Deterministic matching is your go-to method when dealing with structured data where exact matches are vital. If there’s a stringent requirement for 100% accurate matches, such as with financial or sensitive data, deterministic matching takes precedence. It is a safer bet as it avoids the risk of false positive matches that fuzzy matching might create.
Deterministic matching usually applies in situations where the key data elements, such as social security numbers or account numbers, are tightly structured. Two or more instances with duplicate values in data fields like these almost always indicate an error.
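For contrast with fuzzy matching, deterministic matching on a tightly structured key can be sketched as a simple grouping on exact values. The records, the field names, and the `exact_duplicates` helper are all hypothetical and exist only for this example.

```python
from collections import defaultdict

# Hypothetical customer records keyed by account number.
records = [
    {"id": 1, "account": "AC-1001", "name": "John Smith"},
    {"id": 2, "account": "AC-1002", "name": "Jonathan Smith"},
    {"id": 3, "account": "AC-1001", "name": "J. Smith"},
]

def exact_duplicates(rows: list[dict], key: str) -> dict[str, list[int]]:
    """Group record ids by the exact value of one field; keep groups of 2+."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row["id"])
    return {value: ids for value, ids in groups.items() if len(ids) > 1}

print(exact_duplicates(records, "account"))  # {'AC-1001': [1, 3]}
```

Records 1 and 3 are linked because their account numbers are identical, regardless of how the names are spelled. Record 2 is not linked, even though its name resembles the others.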
Match IQ offers both deterministic and probabilistic matching techniques.
| Matching Type | Best Application |
| --- | --- |
| Fuzzy Matching | Best suited for unstructured or semi-structured data where exact match isn’t necessary. It’s useful for handling large volumes of data to improve quality through identifying and merging similar or related records. |
| Deterministic Matching | Ideal for structured data where exact matches are crucial, particularly in matters involving financial or sensitive data. It prioritizes accuracy to avoid risks of false positive matches. |
Choosing between fuzzy and deterministic matching hinges on the nature and requirements of the task.
Cleanse and Match: Which Comes First?
Organizations faced with an obvious duplication problem in large data sets may be tempted to resolve the duplicates immediately, resulting in a smaller volume of data to handle in later processes. In most circumstances though, this approach is a mistake. Matching duplicate records before cleansing the data can lead to incorrect connections and conclusions. The matching software might erroneously connect distinct records or overlook genuinely similar records. The accuracy of the matching process is directly influenced by the quality of the data being matched.
In almost every case, organizations should cleanse the data before performing fuzzy matching processes. Data cleansing can correct formatting errors, such as reversing names listed as last-first instead of first-last. Specialized software standardizes postal addresses according to USPS requirements and can add or remove formatting items like dashes in phone numbers. Cleansing removes anomalies and discrepancies that can cause fuzzy matching to misinterpret the data.
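Two of the cleansing steps mentioned above, reversing last-first names and standardizing phone number formatting, can be sketched as follows. These helpers are simplified assumptions for illustration: they assume 10-digit US phone numbers and a single comma in last-first names, and they do not attempt postal address standardization, which requires specialized software.

```python
import re

def normalize_name(name: str) -> str:
    """Turn 'Last, First' into 'First Last' and collapse extra spaces."""
    name = " ".join(name.split())
    if "," in name:
        last, first = (part.strip() for part in name.split(",", 1))
        name = f"{first} {last}"
    return name

def normalize_phone(phone: str) -> str:
    """Format a 10-digit phone number as 404-555-0123."""
    digits = re.sub(r"\D", "", phone)  # drop everything except digits
    if len(digits) == 10:
        return f"{digits[:3]}-{digits[3:6]}-{digits[6:]}"
    return digits  # unexpected length: leave for manual review

print(normalize_name("Smith,  John"))     # John Smith
print(normalize_phone("(404) 555 0123"))  # 404-555-0123
print(normalize_phone("404.555.0123"))    # 404-555-0123
```

Once both copies of a record pass through the same cleansing rules, differences that were only formatting disappear before the matching step ever sees them.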
Matching software relies on data to be arranged in sequence. Clean and standardized files allow the software to order the data as necessary to find and evaluate possible data record pairs. See this whitepaper for a detailed explanation of the capabilities of data cleansing software like Firstlogic’s DataRight IQ®.
The matching process compares the cleansed data with other datasets, using algorithms to detect similarities and differences. The result of the matching process is a grouping of similar records that help answer specific business questions or support decision-making processes. An organization’s objective could be as simple as identifying duplicate customer records or as complex as relating all activities of a particular customer across multiple business areas.
Cleansing the data before matching ensures the data being compared is as accurate as possible.
Data cleansing is a continuous process because new data enters an organization all the time, so some degree of matching normally follows each round of cleansing. A typical sequence:
- Cleanse data first to remove any obvious errors and duplicates.
- Upon completing the data-cleansing stage, start the fuzzy matching process.
- Periodically check and refresh the data cleanse and deduping process to support data quality.
Harnessing the full potential of fuzzy matching requires a balanced and disciplined approach to both data cleansing and matching. By keeping the integrity of corporate data at the forefront, companies will be best positioned to leverage the powerful capabilities of fuzzy matching.
Challenges and Limitations of Fuzzy Matching
Fuzzy matching, while a powerful technique for data management, has its limitations. One of the key challenges is the computational complexity involved in performing fuzzy matching on large datasets. As the size of the dataset increases, the time and resources required to perform fuzzy matching also increase. This can lead to longer processing times and higher costs for organizations.
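The growth in workload is easy to quantify: comparing every record with every other record requires n(n-1)/2 comparisons. The sketch below computes that count and then shows blocking, a commonly used way to reduce it by comparing only records that share a key such as ZIP code. Blocking is offered here as a general illustration, not as a description of any particular product; the records and function names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations

def all_pairs_count(n: int) -> int:
    """Number of comparisons when every record is compared with every other."""
    return n * (n - 1) // 2

def blocked_pairs(rows: list[dict], key: str):
    """Yield only the pairs of records that share the same blocking key."""
    blocks = defaultdict(list)
    for row in rows:
        blocks[row[key]].append(row)
    for block in blocks.values():
        yield from combinations(block, 2)

# Hypothetical records: three in one ZIP code, two in another.
records = [
    {"id": 1, "zip": "30315"}, {"id": 2, "zip": "30315"},
    {"id": 3, "zip": "30315"}, {"id": 4, "zip": "10001"},
    {"id": 5, "zip": "10001"},
]

print(all_pairs_count(len(records)))              # 10
print(len(list(blocked_pairs(records, "zip"))))   # 4
print(all_pairs_count(1_000_000))                 # 499999500000
```

The trade-off is that two duplicates with different blocking keys, for example because of a mistyped ZIP code, are never compared. That is one more reason to cleanse data before matching.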
Companies may face challenges in balancing the trade-off between accuracy and efficiency. Fuzzy matching algorithms aim to find the best possible matches between records, but this can be a time-consuming process. To improve efficiency, some companies may adjust fuzzy matching algorithms and sacrifice accuracy by using approximate matching techniques. While this can speed up the matching process, it may also result in some false positives or false negatives.
Fuzzy matching is also sensitive to the choice of matching criteria. Fuzzy matching algorithms may use different similarity measures or weighting schemes, and the choice of these factors can affect the results. Data analysts may need to fine-tune and experiment to find the optimal set of guidelines for a dataset and matching task.
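To illustrate how a weighting scheme changes the outcome, the sketch below scores each field separately and combines the field scores using weights. The field names, the weights, and the use of `difflib.SequenceMatcher` as the per-field measure are all assumptions chosen for this example.

```python
from difflib import SequenceMatcher

def field_similarity(a: str, b: str) -> float:
    """Score one pair of field values from 0.0 to 1.0."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()

def weighted_score(rec_a: dict, rec_b: dict, weights: dict) -> float:
    """Combine per-field scores into one score using the given weights."""
    total = sum(weights.values())
    combined = sum(weight * field_similarity(rec_a[field], rec_b[field])
                   for field, weight in weights.items())
    return combined / total

a = {"name": "John Smith", "street": "123 Washingtn St SW", "zip": "30315"}
b = {"name": "Jonathan K Smith", "street": "123 SW Washington St", "zip": "30315"}

# Illustrative weights only: one scheme trusts names, the other ZIP codes.
name_heavy = {"name": 0.8, "street": 0.1, "zip": 0.1}
zip_heavy = {"name": 0.1, "street": 0.1, "zip": 0.8}

print(round(weighted_score(a, b, name_heavy), 2))
print(round(weighted_score(a, b, zip_heavy), 2))
```

The same pair of records scores higher under the ZIP-heavy scheme because the ZIP codes agree exactly while the names do not. Whether the pair clears the match threshold can therefore depend on the weights alone.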
Certain data types, such as highly unstructured or free text, can be challenging data for fuzzy matching. Data that lacks consistent patterns or structures makes it difficult for fuzzy matching algorithms to identify and match similar records. Companies may need to add preprocessing or feature extraction techniques to handle such data effectively.
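One simple form of preprocessing for free text is to strip punctuation and case, break the text into overlapping three-character sequences (trigrams), and compare the sets of trigrams. The sketch below is a hedged illustration of that idea; the sample notes and function names are made up for the example.

```python
import re

def trigrams(text: str) -> set[str]:
    """Normalize the text and return its set of 3-character sequences."""
    cleaned = " ".join(re.findall(r"[a-z0-9]+", text.lower()))
    return {cleaned[i:i + 3] for i in range(len(cleaned) - 2)}

def trigram_similarity(a: str, b: str) -> float:
    """Share of distinct trigrams that the two texts have in common."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0  # text too short to compare
    return len(ta & tb) / len(ta | tb)

# Hypothetical free-text notes from a customer service log.
note_a = "Customer called re: late delivery to 123 Washington St"
note_b = "cust. phoned about delivery being late, 123 Washington Street"
note_c = "Invoice paid in full by check"

print(round(trigram_similarity(note_a, note_b), 2))  # related notes
print(round(trigram_similarity(note_a, note_c), 2))  # unrelated notes
```

The two notes about the late delivery share far more trigrams than the unrelated note, even though they are worded and punctuated differently.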
Embrace the Power of Fuzzy Matching
Fuzzy matching stands as a highly effective tool for efficient data management. This powerful technology streamlines data-related tasks. It also vastly improves the quality of the information companies rely upon to make strategic business decisions and deliver superior customer service. Fuzzy matching can uncover similarities and links among unrelated or seemingly disjointed data, culminating in informed decisions and strategic insights.
Though not without its challenges, fuzzy matching’s utility and benefits far outweigh any limitations, providing immense value in managing complex datasets. In the automated, AI-enabled business environment in which companies now operate, data is the currency. Effectively harnessing and leveraging data is crucial. Fuzzy matching paves the way toward precise data interpretation, fewer redundancies, and better overall data accuracy.
Leverage the capabilities of fuzzy matching within your organization’s data management processes with tools like Match IQ® to transform data into valuable, actionable knowledge, driving business optimization and growth. Fuzzy matching is more than a technique—the technology is a strategic advantage.