tricky problem when comparing strings using edit distances (1:N relation)
1 visualizzazione (ultimi 30 giorni)
Mostra commenti meno recenti
Hello,
I am working on the comparison of strings against each other basing upon edit distances. So how many steps does it take to convert string B into string A. This is not that difficult. But:
1. Consider B to be an array of possible matches for A
A = 'wonderful house, 1234 new york';
B = {'the greatest houses in new york';'houses in new orleans'};
2. Consider that the overall edit distance is computed by deriving the distance for each word and then summing up all distances and using the substring method. (Since deriving the standard distance gives worse results anyway)
The edit distance for B(1,:) is therefor:
'The' = 2; 'greatest' = 7; 'houses' = 1; 'in' = 1; 'new' = 0; 'york'=0;
Score_b1 = 2+7+1+1+0+0 = 11
And the distance for B(2,:) is:
'houses' = 1; 'in' = 1; 'new' = 0; 'orleans' = 5;
Score_b2 = 1+1+0+5 = 7
------------------------------------------------------
As you can directly see is that B(2) gets a way better score although B(1) is a far better match in this case.
So my question is how can I "tell" matlab to choose the correct match instead?
1. Intuitively I tried to relate the score to the number of words. This gives a better result but doesn't change anything (of course did I choose an example where it doesn't work but this is very often the case actually).
new_score_b1 = 11 / 6 ~= 1.83 new_score_b2 = 7 / 4 ~= 1.75
2. I tried to give perfect word matches a greater impact by counting the number of perfect and near perfect (score <= 1) matches and relate them to the number of words in the potential match:
new_score_b1 = 11 / (4/6) = 16.5 new_score_b2 = 7 / (3/4) = 9 1/3
So in conclusion, the more words we have the higher the edit distance in the case of a non perfect match for each word. So how can I compare short potential matches with long potential matches? Secondly, the same problem arises with the word length. Consider the following potential match ' e y e s o n y o u' this would result in the absolut perfect match but is nonsense of course. Hence smaller strings are more likely to get smaller edit distances although they do not match in any way.
Tricky right?
1 Commento
Walter Roberson
il 9 Gen 2013
This appears to be an algorithm question rather than a MATLAB question.
Risposte (0)
Vedere anche
Categorie
Scopri di più su Characters and Strings in Help Center e File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!