While working on my securities database project (see https://github.com/davidkellis/securitiesdb) I ran into a problem, I needed to match up vendor-specific data to a common security name. For example, data vendor A might associate EOD data with a company called “ABC Inc” while data vendor B might associate Fundamental data with a company called “ABC Incorporated”; perhaps another vendor might reference “ABC Corporation“. In each case, the data vendors are referring to the same entity, just with slightly different spelling or phrasing.
That’s where approximate string matching comes in. I need to be able to figure out that ABC Inc is the same as ABC Incorporated is the same as ABC Corporation.
While googling for “fuzzy search” and “approximate string search”, I discovered a tool called SimString, and its accompanying paper, “Simple and Efficient Algorithm for Approximate Dictionary Matching”. I’m not exactly sure where SimString sits on the state-of-the-art continuum, but it was published in 2011, so I figure it’s closer to being state of the art than some brute force compute-the-levenschtein-distance-to-every-string-in-the-database method that I might have come up with on my own.
In any case, I finished up a Ruby implementation (see https://github.com/davidkellis/simstring) of the algorithms given in the http://www.aclweb.org/anthology/C10-1096 paper. I originally tried to port the C++ reference implementation to Ruby, but quickly discovered that the C++ version is both hideously complex and doesn’t really mirror the pseudocode implementations described in the paper, so I gave up on that front. It turns out, if you carefully read and re-read the paper over and over, it’s enough to guide you to an implementation. At first I thought it was a pretty bad paper, but after re-reading it over and over, and tinkering with my implementation, I think I got it all figured out.
Next steps, performance testing, and probably a port to Scala.