It Runs in the Family: Searching for Similar Names using Digitized Family Trees
Searching for a person's name is a common online activity. However, web search engines return few accurate results for queries that contain names. These poor results stem from the many legitimate spelling variations of a given name, whereas regular text typically has a single correct spelling. Today, most techniques suggest related names based on pattern matching and phonetic encoding, and they frequently perform poorly. Here, we propose a novel approach to the problem of similar name suggestion. Our algorithm combines historical data collected from genealogy websites with graph algorithms. In contrast to previous approaches, which suggest similar names based on encoded representations or patterns, we propose a general approach that suggests similar names based on the construction and analysis of family trees. Combining this historical information with network algorithms yields a large name-based graph that offers many suggestions drawn from the names of a person's ancestors. Similar names are extracted from the graph using generic ordering functions; these outperform algorithms that suggest names along a single dimension, which limits their performance. Using a large-scale online genealogy dataset with over 17M profiles and more than 200K unique first names, we constructed a large name-based graph. Using this graph along with 7,399 labeled given names and their true synonyms, we evaluated our approach against other algorithms, including phonetic and string similarity algorithms, and showed that it provides superior performance in terms of accuracy, F1, and precision. We suggest our algorithm as a useful tool for suggesting similar names based on a name-based graph constructed from family trees.
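To make the idea of a name-based graph concrete, the following is a minimal Python sketch, not the authors' implementation. It assumes that family trees have already been reduced to pairs of first names (an ancestor and a descendant), that candidates are the names reachable within a small number of hops in the resulting graph, and that the ordering function is a plain string-similarity ratio. The toy data, the function names, the hop limit, and the similarity threshold are all hypothetical; the abstract does not specify them.

```python
from collections import defaultdict, deque
from difflib import SequenceMatcher

# Hypothetical toy data: (ancestor_first_name, descendant_first_name) pairs,
# as might be read off parent-child links in digitized family trees.
ANCESTOR_PAIRS = [
    ("Elizabeth", "Eliza"),
    ("Elizabeth", "Elisabeth"),
    ("Eliza", "Liza"),
    ("Elizabeth", "John"),
    ("John", "Jon"),
    ("John", "Johnny"),
    ("Elizabeth", "Eliza"),  # the same pair can recur across trees
]


def build_name_graph(pairs):
    """Undirected weighted graph: graph[a][b] counts the links joining a and b."""
    graph = defaultdict(lambda: defaultdict(int))
    for ancestor, descendant in pairs:
        if ancestor == descendant:
            continue  # a name is trivially similar to itself
        graph[ancestor][descendant] += 1
        graph[descendant][ancestor] += 1
    return graph


def candidates_within(graph, name, max_hops=2):
    """Breadth-first search: map each reachable name to its hop distance."""
    hops = {name: 0}
    queue = deque([name])
    while queue:
        current = queue.popleft()
        if hops[current] == max_hops:
            continue  # do not expand beyond the hop limit
        for neighbor in graph.get(current, {}):
            if neighbor not in hops:
                hops[neighbor] = hops[current] + 1
                queue.append(neighbor)
    del hops[name]
    return hops


def string_similarity(a, b):
    """Assumed ordering function: case-insensitive difflib similarity ratio."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def suggest(graph, name, k=3, max_hops=2, min_similarity=0.5):
    """Rank graph candidates by similarity, breaking ties by fewer hops."""
    hops = candidates_within(graph, name, max_hops)
    scored = [(string_similarity(name, c), -hops[c], c) for c in hops]
    kept = sorted((t for t in scored if t[0] >= min_similarity), reverse=True)
    return [candidate for _, _, candidate in kept[:k]]


graph = build_name_graph(ANCESTOR_PAIRS)
print(suggest(graph, "Elizabeth"))
```

In this sketch the graph supplies candidates that actually co-occur in the same families, and the ordering function filters out relatives with unrelated names (here, "John" is a neighbor of "Elizabeth" in the graph but falls below the similarity threshold). Other ordering functions, such as phonetic codes or edge weights, could be swapped in at `string_similarity` without changing the graph construction.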