Finding Likely Duplicate Strings
I have an existing database of contact information for various contacts at specified offices across the country (a "lead" list if you will). This database contains information such as first name, last name, etc. In an effort to refresh the database with current information, I have done some manual research and data logging and have compiled a new, separate data set of current contact information for contacts at the same specified offices.
When updating the existing database with the new data, I've noticed that I'm creating "duplicate" contact records quite a bit. The updating algorithm simply looks for an exact match when it references the contact's name in the new, current data set against the contact's name in the old, existing database. The algorithm thinks "Gregory Smith" is not currently in the database because there isn't an exact match, but upon closer inspection "Gregory" IS already in the database as "Greg Smith".
Instead of manually looking through the database as I update the data and "de-duping" things myself, I was wondering whether there is a MATLAB function that can compare two strings and return how likely it is that they refer to the same thing. For example, the computer would flag "Gregory Smith" when the database already contains "Greg Smith". Having the computer do this type of preprocessing would save a lot of time. Any help would be greatly appreciated. Thanks.
1 Comment
Zachary Messaglia
on 7 May 2018
Were you able to solve this?
Answers (1)
Jan
on 12 Mar 2014
0 votes
It is a good strategy to search the File Exchange first.
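For later readers: one common way to score how alike two names are is the Levenshtein (edit) distance, scaled by the length of the longer string. The following is only a minimal sketch in plain MATLAB, not taken from any particular File Exchange submission. The function name `nameSimilarity` is made up for illustration, and the choice to ignore case and surrounding whitespace is an assumption about what counts as "the same" contact.

```matlab
function s = nameSimilarity(a, b)
% NAMESIMILARITY  Hypothetical helper: similarity of two names in [0, 1].
%   1 means identical (ignoring case and leading/trailing blanks),
%   0 means every character would have to change.
a = lower(strtrim(char(a)));   % char() also accepts string scalars
b = lower(strtrim(char(b)));
m = numel(a);
n = numel(b);
if max(m, n) == 0
    s = 1;                     % two empty names count as identical
    return
end

% D(i+1, j+1) holds the edit distance between a(1:i) and b(1:j).
D = zeros(m + 1, n + 1);
D(:, 1) = (0:m)';              % delete all i characters
D(1, :) = 0:n;                 % insert all j characters
for i = 1:m
    for j = 1:n
        cost = double(a(i) ~= b(j));              % 0 if characters match
        D(i + 1, j + 1) = min([D(i, j + 1) + 1, ...   % deletion
                               D(i + 1, j) + 1, ...   % insertion
                               D(i, j) + cost]);      % substitution
    end
end

% Scale the distance by the longer length to get a similarity score.
s = 1 - D(m + 1, n + 1) / max(m, n);
end
```

For example, `nameSimilarity('Greg Smith', 'Gregory Smith')` returns 1 - 3/13, about 0.77, because three insertions turn one name into the other. Pairs scoring above some cutoff could be flagged for manual review instead of being inserted as new records; a suitable cutoff depends on the data and would need to be tuned by hand.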