Finding Likely Duplicate Strings

Question


I have an existing database of contact information for various contacts at specified offices across the country (a "lead" list if you will). This database contains information such as first name, last name, etc. In an effort to refresh the database with current information, I have done some manual research and data logging and have compiled a new, separate data set of current contact information for contacts at the same specified offices.

When updating the existing database with the new data, I've noticed that I'm creating quite a few "duplicate" contact records. The update algorithm only looks for an exact match between the contact's name in the new data set and the contact's name in the existing database. It concludes that "Gregory Smith" is not in the database because there is no exact match, but on closer inspection "Gregory" IS already there as "Greg Smith".

Instead of manually looking through the database as I update the data and "de-duping" things myself, I was wondering whether there is a MATLAB function that can compare two strings and return how likely it is that they refer to the same thing. For example, the computer would flag "Gregory Smith" when the database already contains "Greg Smith". Having the computer do this kind of preprocessing would save a lot of time. Any help would be greatly appreciated. Thanks.
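
To make the request concrete, here is a rough sketch of the kind of score I have in mind, using base MATLAB only (it relies on a local function in a script, so it assumes R2016b or later). It computes the Levenshtein edit distance and normalizes it by the length of the longer string. The choice of metric, the helper name nameSimilarity, the sample names, and the 0.5 cutoff are all my own assumptions for illustration, not an existing MATLAB function or a tested solution.

```matlab
% Sketch only: flag likely duplicates with a normalized edit-distance score.
% nameSimilarity and the 0.5 threshold are hypothetical, chosen for illustration.
existing = {'Greg Smith', 'Anna Jones'};   % names already in the database
incoming = 'Gregory Smith';                % name from the new data set

threshold = 0.5;                           % arbitrary cutoff; tune on real data
for k = 1:numel(existing)
    s = nameSimilarity(incoming, existing{k});
    if s >= threshold
        fprintf('Possible duplicate: "%s" vs "%s" (score %.2f)\n', ...
            incoming, existing{k}, s);
    end
end

function s = nameSimilarity(a, b)
% Returns a score in [0, 1]; 1 means identical ignoring case.
    a = lower(char(a));
    b = lower(char(b));
    m = numel(a);
    n = numel(b);
    if max(m, n) == 0
        s = 1;                             % two empty strings count as identical
        return
    end
    % D(i+1, j+1) = Levenshtein distance between a(1:i) and b(1:j)
    D = zeros(m + 1, n + 1);
    D(:, 1) = (0:m)';
    D(1, :) = 0:n;
    for i = 1:m
        for j = 1:n
            cost = double(a(i) ~= b(j));
            D(i+1, j+1) = min([D(i, j+1) + 1, ...   % deletion
                               D(i+1, j) + 1, ...   % insertion
                               D(i, j) + cost]);    % substitution
        end
    end
    s = 1 - D(m+1, n+1) / max(m, n);
end
```

With these sample names, "Gregory Smith" vs "Greg Smith" differ by three deletions over 13 characters, so the score is 1 - 3/13, about 0.77, and the pair is flagged, while "Anna Jones" falls well below the cutoff. A plain edit distance like this would probably miss nicknames that share few letters (for example "Bill" and "William"), so I'd welcome anything better suited to names.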