Comparing lists of years for similarity

Question

James Ryan on 14 Dec 2016


https://it.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity

Commented: Guillaume on 15 Dec 2016

My problem involves calibrating a numerical model that predicts whether some event happens in each year. It could be economic events, coral bleaching, or many other things. I want to compare the similarity of results from different model versions, or with real-world historical data.

The models are expected to miss quite often, so looking for exact matches won't do. The size of the error matters, so the Wilcoxon rank-sum test won't do. The lists will often be different in length, and they could be quite a bit longer than my examples below.

Examples of what is subjectively "good" and "bad".

A = [1968 1972 1991 1993 2001 2010]
B = [1968 1972 1993 2001 2010]
C = [1969 1973 1991 1995 2001 2011]
D = [1950 1960 1991 1993 2001 2050]
E = [1968 1972 1991 1993 2001 2010 2050]

Consider A to be "correct"

B is missing one year entirely, but this is not disastrous.
C has only two matching values, but the others are close, I'd call this better than B.
D has three exact matches, but the others are way off.  I'd consider this the worst.
E has six exact matches and one really bad extra point.  Again, not disastrous.

Of course I don't expect an algorithm to match my subjective evaluation all the time. I just want it to take the things I have mentioned into account.

If I were to make up an algorithm off the cuff, I'd probably try to look for points with near neighbors in the other list and score their distances root-mean-square style, with some maximum value counted against any points left with no neighbor. This is really crude, and there must be a better way.
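The off-the-cuff idea above can be sketched in a few lines. This is only one possible reading of it: the cap `maxDist` and the choice to score both directions are assumptions, and each year is matched to its nearest neighbor without forcing a one-to-one pairing.

```matlab
% Sketch of the nearest-neighbor RMS idea; maxDist is an assumed cap.
A = [1968 1972 1991 1993 2001 2010];   % reference list
C = [1969 1973 1991 1995 2001 2011];   % list to score
maxDist = 10;                          % assumed cap (years) for points with no near neighbor

% Distance from every year in A to every year in C.
% bsxfun keeps this working on releases without implicit expansion.
d = abs(bsxfun(@minus, A(:), C(:).'));   % numel(A)-by-numel(C)

% Nearest-neighbor distance in each direction, capped at maxDist.
dA = min(min(d, [], 2), maxDist);   % each year in A to its nearest year in C
dC = min(min(d, [], 1), maxDist);   % each year in C to its nearest year in A

% Symmetric root-mean-square score; 0 means both lists hold the same years.
score = sqrt(mean([dA(:); dC(:)].^2));
```

Scoring both directions means a year that is missing from one list and a spurious year in the other are both penalized; the cap keeps a single far-off year such as 2050 from dominating the result.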

Suggestions, please!


Answer 1

Guillaume on 14 Dec 2016


https://it.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247124

It sounds like you need some sort of edit distance calculation. A pure edit distance algorithm would rank B as better than C (1 deletion vs 4 substitutions), but you can weight deletions more heavily than substitutions and weight each substitution by how far it is from the original value.

There is an edit distance function on the File Exchange. No idea of its quality.
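As a rough illustration of the weighted variant described in this answer, here is a minimal dynamic-programming sketch. It is not the File Exchange function; `gapCost` and `subCap` are assumed tuning values, and both lists are assumed to be sorted.

```matlab
% Sketch of a weighted edit distance between two sorted lists of years.
% gapCost and subCap are assumed tuning values, not taken from the thread.
A = [1968 1972 1991 1993 2001 2010];   % reference list
B = [1968 1972 1993 2001 2010];        % list to compare
gapCost = 5;    % assumed cost of an insertion or deletion (unmatched year)
subCap  = 10;   % assumed upper limit on the cost of one substitution

n = numel(A);
m = numel(B);
D = zeros(n + 1, m + 1);        % D(i+1,j+1): cost of aligning A(1:i) with B(1:j)
D(:, 1) = (0:n).' * gapCost;    % A(1:i) against an empty list
D(1, :) = (0:m) * gapCost;      % empty list against B(1:j)

for i = 1:n
    for j = 1:m
        % Substitution cost grows with the gap in years, up to subCap.
        subCost = min(abs(A(i) - B(j)), subCap);
        D(i+1, j+1) = min([D(i, j)   + subCost, ...   % match or substitute
                           D(i, j+1) + gapCost, ...   % A(i) left unmatched
                           D(i+1, j) + gapCost]);     % B(j) left unmatched
    end
end

dist = D(end, end);   % 0 for identical lists; larger means less similar
```

The relative size of `gapCost` and the substitution costs decides how a missing year compares with several small shifts, so those values would need to be tuned against the subjective ranking in the question.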

2 Comments

James Ryan on 14 Dec 2016

Thanks. This definitely moves me closer to a solution. The only difference is that in that algorithm (designed for strings), replacing one letter with another has the same "cost" regardless of the letter. With dates, the size of the replacement matters. Maybe I can tweak it to work.

Guillaume on 15 Dec 2016

Yes, as I said you can modify the standard algorithm to give different weight to substitutions depending on how far they are from the original value.

The concept of what you are trying to do is definitely one of an edit distance, so I'm sure you can find an algorithm already developed somewhere.


Answer 2

KSSV on 14 Dec 2016


https://it.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247120

You can try ismembertol: https://in.mathworks.com/help/matlab/ref/ismembertol.html. You set a tolerance and it reports which elements of one set of numbers lie within that tolerance of an element of the other set. You can handle your different scenarios by adjusting the tolerance.
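A minimal usage sketch follows; the two-year tolerance `tolYears` is an assumption chosen for illustration. By default ismembertol scales the tolerance by the magnitude of the data, so the `'DataScale'` option is set to 1 here to make the tolerance an absolute number of years.

```matlab
A = [1968 1972 1991 1993 2001 2010];   % reference list
C = [1969 1973 1991 1995 2001 2011];   % close to A, mostly shifted by a year or two
D = [1950 1960 1991 1993 2001 2050];   % three exact matches, the rest far off
tolYears = 2;                          % assumed tolerance in years

% 'DataScale', 1 makes tolYears an absolute tolerance instead of a relative one.
nearInC = ismembertol(A, C, tolYears, 'DataScale', 1);   % A(k) within 2 years of some year in C?
nearInD = ismembertol(A, D, tolYears, 'DataScale', 1);   % A(k) within 2 years of some year in D?

fracMatchedC = mean(nearInC);   % fraction of A with a near match in C
fracMatchedD = mean(nearInD);   % fraction of A with a near match in D
```

The result is a logical mask, so it says whether a year has a near match but not how far away that match is.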

1 Comment

James Ryan on 14 Dec 2016

Another step in the right direction. Perhaps I could count exact matches, then near matches, and then count years which don't have a near match. Each count could be weighted differently to create a "nearness" score. Thanks.
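The counting scheme in this comment could look something like the sketch below. The window `tolYears` and the penalty weights `w` are assumptions, and years are counted from both lists, so an exact match contributes twice.

```matlab
A = [1968 1972 1991 1993 2001 2010];   % reference list
D = [1950 1960 1991 1993 2001 2050];   % list to score
tolYears = 2;                          % assumed "near" window in years
w = [0 1 5];                           % assumed penalties: exact, near, unmatched

% Distance from each year to its nearest year in the other list.
dAll = abs(bsxfun(@minus, A(:), D(:).'));
nearest = [min(dAll, [], 2); min(dAll, [], 1).'];   % A-to-D, then D-to-A

nExact     = sum(nearest == 0);                          % years with an exact partner
nNear      = sum(nearest > 0 & nearest <= tolYears);     % years with a near partner only
nUnmatched = sum(nearest > tolYears);                    % years with no partner in the window

penalty = w * [nExact; nNear; nUnmatched];   % lower is better
```

Because the counts come from plain distance comparisons, the same code works for any window size, and the weights can be adjusted until the scores agree with the subjective ranking of B through E.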


Answer 3

Image Analyst on 15 Dec 2016


https://it.mathworks.com/matlabcentral/answers/316793-comparing-lists-of-years-for-similarity#answer_247224

What about ismember() and/or setdiff()? You don't need ismembertol() if all your numbers (years) are integers. setdiff() tells you which numbers are in one vector but not in the other, and ismember() tells you which numbers are in both vectors. Neither one cares about position, but I don't think that matters to you - you only care whether the number(s) is/are present in the array.
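For reference, a short sketch of these two functions applied to lists A and D from the question. The variable names are illustrative only.

```matlab
A = [1968 1972 1991 1993 2001 2010];   % reference list
D = [1950 1960 1991 1993 2001 2050];   % list to compare

inBoth  = A(ismember(A, D));   % years of A that also occur in D
onlyInA = setdiff(A, D);       % years of A that are missing from D
onlyInD = setdiff(D, A);       % years of D that are missing from A
```

Note that setdiff is one-directional, so it is called twice here to capture both the missed years and the spurious ones. These exact set operations report which years differ but not by how much.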