Humans err. And when it comes to spelling names, they err like crazy . They don’t seem to see programmers dealing with misheard and mistyped names later on, working endless overtime-hours to come up with algorithms fixing this kind of problems. So did we. Until we realized…
…the fact that there are basically only three types of errors in typed names:
- Mishearing a name – can be caused by unusual pronounciation, different accent or some distraction when a name is spoken to a person writing it down. For example Eric Clapton can be easily heard as Eric Clayton. There are also names that have multiple textual representations, for example Ján Kováč , a typical slovak name is pronounced the same way as Ján Kovács, a hungarian variation of the surname.
- Mistyping a name – this kind of errors can be caused by distraction, fatigue, or dyslexia when inputting a name. It results in mistyped, exchanged or missing letters. As an example imagine someone really tired writing David Coperfeld instead of David Copperfield. By merely seeing these two names, you are pretty sure they represent the same name. But how do you convince your program to feel the same way about it?
- Different representations – when you intergrate data from different sources, you may find the same names represented differently. Some of them may use accent marks and diacriticals, others may not. One source holds string as UTF-8 strings, other uses local encodings.
OK. But I have two names and I need to figure out an algorithm to see if they represent the same real name or not. What should I do?
Solve all the three problems above, I say. Once you don’t know, from which type of errors your data suffer the most, you have to design all three algorithms. Before you get scared, just look how simple it is:
- Misheard names – this is basically just an acoustic problem. If two different names sound very similarly, they are pretty likely to be interchanged when interpreted by a listener. And vice versa – names like Rafael Nadal and Andy Roddick have a very little chance of being misheard in favor of one another. The good news is, that there are a few known algorithms to determine whether two words sound the same or not. See soundex, or Metaphone algorithm entries on Wikipedia. What happens here is, that every name is normalized into a code. Names that sound similar share the same code and that is the solution to our problem (but be sure to check if the algorithm is suitable for the language of your names).
- Mistyped names – this kind of errors is even easier to figure out. Let’s just set up a metric we basically already defined above – let’s say that two strings differ by N letters, if exactly N letters have to be removed, added, or interchanged in the first string in order to end up with the second one. For example John Doe differs from Jon Do in 2 letters (remove h; remove e). More precisely, this metric is called edit distance. The greater the edit distance between two strings is, the more typing errors had to be made during the input. And that’s it. Let’s say people have 10% chance of mistyping a name. So what is the change of two typing errors in one name? Yep, it is less, 1% or so. What we can deduct from this fact is, that the lower the edit distance of two names is, the bigger the chance that they represent the same name.
- Different representations – this is pretty simple to solve. Once you convert both names to the same encoding, normalize them to unaccented and lowercase strings. Be aware that this may create some type I errors making names Emil Sanchéz the same as Emil Sanchez. But we believe that it eliminates substantially bigger amount of type II errors, so it should be a benefit for you in general.
So much for the theory. I presented three approaches to compare names against mistyped, misheard, and “misrepresented” characters.
To keep this post concise, I decided to split it in two parts. Next time, I will show a practical implementation of all the theory above. Bet you’re curious about the Obama question
.
Well, take my name for example. Because nobody speaks arabic around the parts I live, I have become Salem, Sale, even Szalay. And whenever you hear some news about somebody named Salih, Saleh, Salah … even Selah (which is quite different name, I agree), I wonder, whether someone in my remote and extended family got into trouble