Data Analytics Approach to Name Matching
Anyone can write a simple query to compare two lists:
SELECT a.[name] FROM [list1] a INNER JOIN [list2] b ON a.[name] = b.[name]
But could that query find
Jon McAndrews from a list including
John McAndrews or
Jon Mc Andrews or even
Jonathon McAndrews? They're all the same person but without utilizing some form of fuzzy logic they'll never be detected. The problem is that human beings are natural pattern matching machines with an enormous swath of cognitive tools at our disposal whereas a computer is only as capable as the person who programmed it. Fortunately, with a combination of basic and advanced name-matching techniques, a methodology can be assembled to automatically and efficiently identify people between lists while providing a metric of confidence in the match.
Basic Name Matching
At its core, name matching is simply the act of identifying a one-to-one relationship between two separate lists of names. Name matching is a task performed everywhere from national security terror watchlists, financial services anti-money laundering watchlists and even simple user management tasks within an ERP. While seemingly a very simple task, name matching can be amongst the most aggravating tasks presented to an IT department. It is fundamental from the perspective that as far as data goes, lists of names are composed of very simple data elements like
username and sometimes unique numeric id. These are simple data elements because they are but four fields that do not require derivation through formula, circumstance or any other upstream data dependency.
That said, name matching is difficult for a variety of reasons. Often times the requirement is to match against a third party list which uses a different format, e.g.
fullname may be represented as
first mi last or
last, first. Another problem is the many-to-many correlation between names because last names are common and many parents are not creative with their children's first names. Expanding on the human factor, many individual's use nicknames or even change their name which introduces a mismatch based on how old various lists are. Last, but certainly not least, names are entered into lists and databases by people which introduces the inevitable likelihood of human error.
Unfortunately, the bridge between requests from the business department to the corresponding understanding of that request by the IT department is a gap that is understood by many yet spanned by few. Typically, the business knows what function they desire; for example, take our list of account holders and identify who is on the OFAC watchlist
Another common failure is lacking the understanding of data quality and its effect on the downstream process. All too often, the business believes "we couldn't have made it this far with bad data, so it must be good" and proceeds with an IT effort. To name a few examples within a name matching exercise, bad data consist of misspelled names, duplicate names without a differentiating identifier, incomplete fields, duplicate records, incorrect employee ids, or non-adherence to a defined naming convention. All of these factors affect the ability to marry a list of names against a third party list.
One of the most significant failures is a lack of follow-up validation. After two distinct lists are compared, many businesses proceed without performing a sanity check on what the computer produced. Matches are not sampled to confirm one-to-one validity. Non-matched names are also typically not confirmed to ensure they were not overlooked in the target list. Problems like this result in embarrassment (and possibly lawsuits) at airports for do-not-fly list "confirmation", permit criminals to slip their accounts into "non-risky" categories at banks and drop good employees from ERP systems or have their pay and benefits disrupted
When there are real stakes in play, it is important to have a defined, yet adaptable, methodology to appropriately address the risk of incorrect and missed name matches. The reason for having a framework is to address points of weakness in the name matching assessment early to identify additional mitigating matching logic. By appropriately constructing tests for paring down the possibilities of false positives (incorrect matches) and orphans (missed matches), the reconciliation process at the exercise's end becomes much simpler and the automated portion of the matching can be accepted with a greater degree of confidence.
The following bullets define fundamental steps within a name matching methodology:
- Data Quality Assessment
- Basic Matching
- Advanced Matching
- Score Assignment
- Manual Review
There are six core elements to examine during a data quality check and they are all applicable to a name matching operation:
- Accuracy - data elements are correct and match the system of record
- John "T-Bone" Smith will not have
T-Bonepopulated in the first name field
- John "T-Bone" Smith will not have
- Completeness - there are no missing values within a record
- no record will be missing a first name or surname
- Consistency - data adheres to formatting conventions throughout the system
- formatting means strings like
St.will never have spaces or not have the period i.e.
StMartinare inconsistent with that convention
- formatting means strings like
- Timeliness - data elements are updated within a useful timeframe
- surnames are appropriately updated following a name change
- Uniqueness - there are no unnecessary record duplications
- perhaps there are many John Smiths, but only one with an employee id of 123
- Validity - the value within the data element conforms to rules
- employee ids are not mistakenly recorded in the name fields
If an operator is prone to making a mistake 0.1% of the time, then it can be expected that a list of 10 million names (an easy expectation when dealing with lists of accounts) yields the possibility of ten thousand data errors before the matching operation even begins. Tackling data quality issues before commencing with matching reduces the chance of missing valid matches but especially reduces instances of false positives. Unfortunately, this step is frequently the most ignored component of any data operation and usually results in the biggest problems at the point where it's too late to effectively clean the data.
Perhaps calling the next techniques "advanced" is a misnomer. It may be more accurate to call them "neglected" techniques. Regardless of their perceived level of complexity, these techniques are useful in sweeping up unmatched orphans.
However, performing extra row by row transformations and complete table scanning is not a "cost effective" solution in terms of the processing, especially when the remaining data in question is already suspected of being composed of unmatched orphans. For example, simply creating a fifty character, UTF16 field to hold a single name transformation across 10 million rows requires a gigabyte of storage. Furthermore, performing an unoptimized table scan cross match between two separate lists of 10 million (non-indexed) names requires 100,000,000,000,000 comparison operations. This is, of course, worst case performance but it is clearly obvious that optimizations are necessary as these resource and time costs are multiplied by every additional transformed field and every extra detection algorithm.
As such, each additional test algorithm requires an appropriate assessment of the match value added against the computational intensity of the operation. Sometimes a fundamentally simple test, like applying some normalization filters before matching, incur a significant hit on pre-match preparation. Other tests like SOUNDEX() involve detailed string scanning and table lookups yet can be implemented within only a few lines of code. It is important to understand how alternative matching algorithms work in order to properly assess their appropriate applicability to the task at hand.
Name mangling can also be considered to be a normalization operation. There are a variety of naming conventions that may not be consistent within the data. For example, names like
St. Martin and
St. Lawrence may be recorded differently within the source list than the comparison list where they may not have spaces or punctuation such as
St Martin or
St.Lawrence. Depending on the personnel entering the names into the original list, these inconsistencies might even exist in the source.
Normalization is essentially a series of string replacements designed to detect and correct names deviating from a defined pattern. Which list is to be normalized depends on the application. In the event of comparing customer accounts against a federal list, it is likely the source list will be normalized against the third party convention. However, if the intent were to compare human resources records of an ERP system against a variety of enterprise applications, one would likely treat the ERP names as "golden" and perform transformations on the applications. Another consideration is that transformations are costly in terms of time so list size is a significant consideration. A few basic examples, normalizing prefixes, suffixes and modifiers:
UPDATE tbl_source SET [last_name] = REPLACE([last_name], 'St. ', 'St.') UPDATE tbl_source SET [last_name] = REPLACE([last_name], 'III', '') UPDATE tbl_source SET [last_name] = REPLACE([last_name], 'JR', '')
An important consideration is that an SQL UPDATE statement essentially performs a complete table scan and each transformation multiplies the cost of that run. Identifying all the necessary transformations in advance of the operation can reduce processing time by forming complex, nested REPLACE() logic to perform all replacements in a single pass or perhaps even using other string parsing technologies like regular expressions.
It is good practice to profile a list's names to identify possible anomalies as it is wasteful to perform transformations on nonexistent problems. Punctuation within names needs to be considered. As in the aforementioned example of
St. Martin, the name contains both periods and spaces. Hyphens are also very common within compound names. Modifications that may appear associated with names include academic titles, honorary or professional distinctions and generational indicators
- PhD, M.D.
- Esq, D.D., Rev.
- Jr., Sr., III (and further variations of Roman Numerals)
Whereas computers are incredibly useful for performing literal comparisons very accurately and very rapidly, they are awful at subjective analysis. For example, without special programming, a computer will not assess that
Matthew are more similar than
Joseph. In the early 20th century, a system called Soundex was developed for phonetically describing surnames to track generational name changes in census data.
Soundex describes a name with a simple code beginning with a letter and followed by three numbers
Washington as an example, the first letter of the Soundex code is simply the lead character of the string -
w. Then simply strip away the letters A, E, I, O, U, H, W and Y leaving only -
wsngtn. The subsequent three digits are identified by parsing the sequence of letters from the string and recording their corresponding sound code so
2. The final Soundex for
|1||B, F, P, V|
|2||C, G, J, K, Q, S, X, Z|
Soundex is a relatively simple algorithm to implement but there are more peculiar rules based on its origin as a surname matching system. Such rules pertain to double letters (McCormick), prefixes (VanHeusen), etc. Fortunately, a SOUNDEX() comparison on two strings starting with different consonants as it is a given they'll not match. Performance can be improved by keeping the lists sorted alphabetically, possibly with a clustered index, and restricting SOUNDEX() comparisons only if the first characters of each string match.
Another useful tool for matching logic is the Levenshtein Distance algorithm. According to the National Institute of Standards and Learning, the Levenshtein Distance represents "the smallest number of insertions, deletions, and substitutions required to change one string or tree into another"
The Levenshtein Distance is useful for identifying likely candidates for a match by detecting minor variance. For example,
Mat differ by only a single
t resulting in a distance of 1 which could be the result of a typographic error during entry into an ERP. The algorithm helps to relate
St. Martin to
StMartin with a Levenshtein Distance of 2.
Using Microsoft SQL Server's Transact-SQL syntax, a stored procedure based on the Levenshtein Distance algorithm can be employed to look for similarities between names
IF OBJECT_ID ('tbl_matrix', 'U') IS NOT NULL DROP TABLE tbl_matrix GO CREATE TABLE tbl_matrix ( [x] INT, [y] INT, [value] INT ) GO -- create a stored procedure for computing levenshtein distance -- IF OBJECT_ID ('sp_levenshtein', 'P') IS NOT NULL DROP PROCEDURE sp_levenshtein GO CREATE PROCEDURE sp_levenshtein @source VARCHAR(100), @target VARCHAR(100) AS -- clear the working table TRUNCATE TABLE tbl_matrix -- establish the length of the source string DECLARE @len_source INT SET @len_source = (SELECT LEN(@source)) -- establish the length of the target string DECLARE @len_target INT SET @len_target = (SELECT LEN(@target)) DECLARE @counter INT -- initialize the first row 0..len_source SET @counter = 0 WHILE @counter <= @len_source + 1 BEGIN INSERT INTO tbl_matrix VALUES (@counter, 0, @counter) SET @counter = @counter + 1 END -- initialize the first column 0..len_target SET @counter = 1 WHILE @counter <= @len_target + 1 BEGIN INSERT INTO tbl_matrix VALUES (0, @counter, @counter) SET @counter = @counter + 1 END -- evaluate character distance DECLARE @cost INT DECLARE @up INT DECLARE @left INT DECLARE @diag INT DECLARE @source_index INT DECLARE @target_index INT SET @source_index = 1 SET @target_index = 1 WHILE @source_index <= @len_source + 1 BEGIN WHILE @target_index <= @len_target + 1 BEGIN SET @cost = (CASE WHEN SUBSTRING(@source, @source_index, 1) = SUBSTRING(@target, @target_index, 1) THEN 0 ELSE 1 END) SET @up = (SELECT [value] FROM tbl_matrix WHERE [x] = @source_index AND [y] = @target_index - 1) + 1 SET @left = (SELECT [value] FROM tbl_matrix WHERE [x] = @source_index - 1 AND [y] = @target_index) + 1 SET @diag = (SELECT [value] FROM tbl_matrix WHERE [x] = @source_index - 1 AND [y] = @target_index - 1) + @cost SET @cost = (CASE WHEN (@up <= @left) AND (@up <= @diag) THEN @up WHEN (@left <= @up) AND (@left <= @diag) THEN @left WHEN (@diag <= @up) AND (@diag <= @left) THEN @diag END) INSERT INTO tbl_matrix VALUES (@source_index, @target_index, @cost) SET @target_index = @target_index + 1 END SET @target_index = 1 SET @source_index = @source_index + 1 END GO DECLARE @test1 VARCHAR(100) DECLARE @test2 VARCHAR(100) SET @test1 = 'MarkMCB' SET @test2 = 'VnutZ' EXEC sp_levenshtein @test1, @test2 SELECT [value] FROM tbl_matrix WHERE [x] = LEN(@test1) AND [y] = LEN(@test2)
However, as with any advanced application of fuzzy matching, the Levenshtein Distance should be coupled with additional logic to tighten its scope. For example, the names
Mark are only a distance of 2 moves apart, just like
St. Martin and
StMartin. It is therefore naive to simply look for low distances as a probable indicator of matching. Using Levenshtein Distance in conjunction with SOUNDEX() and additional strings of the name yield a much stronger likelihood that a small distance equates to a valid match.
It almost goes without saying that when human interaction is involved with data entry that errors are bound to manifest themselves. One of the more common mistakes is for name fields to be juxtaposed, e.g. the last name is recorded in the first name field and vice versa. While corrective actions can be taken against typographic errors, swapped fields will yield a mismatch every time. Of the possible errors, however, swapped fields are usually detected in the course of business as customers complain to service representatives or employees complain to human resources about their names being wrong.
For the few cases that persist, however, the remedy is quite simple - perform the match by swapping the first and last name fields. Since there are likely to be very few matches based on juxtaposition, the field swaps and comparisons should be restricted to the subset of names that haven't been matched yet. Typically, this operation is most effective in identity management testing where the unmatched list has been whittled down over time and is less effective when matches weren't necessarily expected as in testing against watchlists which result in a large unmatched data set.
The heart of advanced name matching is not actually in the additional techniques themselves but in the analysis and compilation of those results. Until the day artificial intelligence can truly pass a lexical Turing Test, it does not matter how much attention data quality receives, how tightly standard matching worked or how many orphans the advanced matching found pairs for - there is still an element of manual review that is necessary. This is a sanity check to confirm the autonomous matching from a human perspective. Scoring is a summarizing technique that assigns weight to matches made from the different available mechanisms.
For example, consider matching a human resources ERP system against a separate LDAP resource. 5 points could be assigned for a primary key (like an employee ID) match, 5 points could be assigned for a direct first and last name match and then 1 point could be assigned for each additional match test like a Levenshtein Distance within tolerance, a SOUNDEX() match or a normalization match.
In this situation, an exact match of employee x13318,
Dean Remington would result in 13 points. Now consider a common, repeating name. Employee x12381,
John Smith matches one of the LDAP entries by name but differs in ID. This match would be scored as an 8 (losing 5 points on the primary key match). As a final example, consider employee x13545,
Daryl St.George whose ID was input incorrectly into the LDAP system and whose last name was entered as
St. George with a space. He would only score 3 points for Levenstein Distance, normalization and SOUNDEX().
The scoring system can be used to break out the names into stratified batches. For this example, it would make sense to say that a score of 10 or more does not really need human confirmation, so many tests came up positive that the match is highly likely to be valid. On the other hand, it is clear that a sampling of strings within the lower threshold do warrant a look because
John Smith was a false positive and
Daryl St.George is in fact a valid match. Defining the weight of each test, identifying the stratification indexes of confidence and choosing an appropriate sample size for review are the critical steps to completing a name matching exercise.
Name matching is a simple task with a lot of complicated nuances. In order to effectively meet the objectives of a name matching exercise, it is important to adhere to a planned and thorough methodology tailored to suit the level of acceptable risk. Data quality should be assessed to reduce matching errors and to profile the lists to determine the most appropriate types of additional matching logic to incorporate. As list sizes grow, optimizations are critical for keeping performance under control from both perspectives of data size and processing speed. Using a variety of meaningful algorithms can cut down on false positive matches while simultaneously providing accurate fuzzy matches. Finally, it is important to aggregate the separate algorithms' results together in a meaningful way, using a scoring technique to evaluate a degree of match confidence. Appropriately putting these steps together reduce the amount of manual review necessary for establishing comfort in the final match results.
- Office of Foreign Assets Control, US Treasury accessed January 2009 from http://www.treas.gov/offices/enforcement/ofac/
- 4-Year-Old Turns Up On Government ‘No-Fly' List, MSNBC accessed January 2009 from http://www.msnbc.msn.com/id/10725741/
- Suffix (name), Wikipedia accessed January 2009 from http://en.wikipedia.org/wiki/Suffix_(name)
- The Soundex Indexing System, National Archives accessed January 2009 from http://www.archives.gov/genealogy/census/soundex.html
- SOUNDEX, Microsoft Developer Network accessed January 2009 from http://msdn.microsoft.com/en-us/library/aa259235(SQL.80).aspx
- Levenshtein Distance, NIST accessed January 2009 from http://www.nist.gov/dads/HTML/Levenshtein.html
- Levenshtein Distance, Wikipedia accessed January 2009 from http://en.wikipedia.org/wiki/Levenshtein_distance