How to identify duplicate rows between tables

Question

0 voti

I'm using R2020b, and I want to set up a master table for appending new data to - and as part of this I want to identify any duplicate rows in the new, incoming table to filter them out before appending. Ideally, the master table will live in a related directory in a .mat file, and the new data will be read in directly from a set-name, set-location .csv using e.g.

fullname = fullfile('relativepath','newdata.csv');
% grab column headers from input sheet
opts = detectImportOptions(fullname);
% set all variable types to categorical
opts.VariableTypes(:) = {'categorical'};
% read in new data
T = readtable(fullname,opts);
% make any modifications to new data headers to match old data
T = renamevars(T,"NewLabel","OldLabel");
% clean new table headers to match originally-wizard-imported headers (I'd ask why these exhibit different behaviour, but that's a separate tragedy, and this current fix works - I think)
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, ' ', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '(', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, ')', '');
T.Properties.VariableNames = regexprep(T.Properties.VariableNames, '_', '');

I found the solution suggested here: https://au.mathworks.com/matlabcentral/answers/514921-finding-identical-rows-in-2-tables, but having done a quick test via:

foo = T(str2double(string(T.Year))<1943,:);     % not my actual query, but structurally the same; this gave me ~40% of my original data
bar = T(str2double(string(T.Year))>1941,:);     % similar, gave me ~70% of the original data
baz = ismember(foo,bar);                        % similar, gives the overlap for 1 particular year (should be about 14% of my original data)
blah = T(str2double(string(T.Year))==1942,:);   % to directly extract the number of rows I am looking for 
sum(baz)                                        % What I expect here is the number of rows in the overlap
ans = 
     0

I found that ismember was not finding any duplicates (which were there by construction).

Note: due to categorical data I actually used T(str2double(string(T.Year))...)

Replacing

baz = ismember(foo,bar,'rows');
sum(baz)
ans =
     0

results in the same not finding any duplicates. Using double quotes "rows" does not change the behaviour.

On the other hand, using the function to assess single variables gives the expected behaviour (to some degree):

testest = ismember(foo.var1,bar.var1)
sum(testest)

The sum is now non-zero, and (because single variables are repeated more often than their combinations) gives more like 30% of the original data, which seems reasonable (the number of unique entries in the original set in that variable was about 40% of the total).

I guess I could create a logical index based on the product of multiple calls of this kind, but that seems rather... inefficient... and sensitive to the exact construction of the table/variables used in the filter. I'd rather have a generic solution for full table rows that will be robust if the overall table changes over the long term (or if/when I functionalise the code and use it for other work). Whilst most of the time, a couple of key variables can be used to identify unique rows, occasionally more information is required to distinguish pathological cases. I will probably use this approach if a more elegant solution doesn't appear, though, and put some thought into which groups of variables are 100% correlated (and therefore useless for this distinction) to cut down the Boolean product.

I could also throw good coding practice to the winds and just write two nested loops (one for rows, one for variables) and exhaustively test every combination, but I suspect that would be even less efficient (although I wonder whether the scaling order would be the same given the nature of the comparisons required).

If it is pertinent, I imported all (>25) data columns from a .csv file as categorical variables. The original data before that were a mix of number and general columns from an Excel sheet; I could have used any or all of {double,string,categorical,datetime} to store the various variables, but there are some data which are best stored as categorical to avoid character trimming and consequent data cleaning / returning to original state steps.

Digging further, I also found this: https://au.mathworks.com/matlabcentral/answers/1775400-how-do-i-find-all-indexes-of-duplicate-names-in-a-table-column-then-compare-the-row-values-for-each which appears to imply that ismember should have the functionality I need here.

Similarly, methods using unique (see e.g. https://au.mathworks.com/matlabcentral/answers/1999193-find-duplicated-rows-in-matlab-without-for-loop or https://au.mathworks.com/matlabcentral/answers/1571588-table-find-duplicate-rows-double-char-datetime or https://au.mathworks.com/matlabcentral/answers/305987-identify-duplicate-rows-in-a-matrix) give:

size(unique([foo;bar],'rows'),1) == size(foo,1)+size(bar,1)
ans = 
  logical
   1

instead of the expected 0 due to the lower amount of actual full-row matches. (Same for "rows" again.)

I've also looked into outerjoin/join/innerjoin, but those don't seem to remove duplicates like I need.

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

dpb il 26 Ago 2024

Modificato: dpb il 26 Ago 2024

Apri in MATLAB Online

size(unique([foo;bar],'rows'),1) == size(foo,1)+size(bar,1)
ans = 
  logical
   1

Then there are not actually any duplicates. They may be close, but that's not the definition of "duplicate".

NOTA BENE: "If A is a table or timetable, it must have the same variable names as B (except for order). For tables, row names are ignored, so that two rows that have the same values, but different names, are considered."

Above one of your comments is "...(because single variables are repeated more often than their combinations)" from which one can infer your definition of "duplicate" is NOT that all elements of a row are the same but that only some variables within the row are. If that is the case, then you will have to compare on the variables themselves, not the row, because that's a totally different condition than that the rows are duplicated.

foo = T(str2double(string(T.Year))<1943,:);
bar = T(str2double(string(T.Year))>1941,:);
baz = ismember(foo,bar);
sum(baz)            % What I expect here is the number of rows in the overlap
ans = 
     0

That is a false expectation; what you should have expected (and got) is the number of identical rows of foo contained in bar. To only consider the year as the overlapping case, you would have to have written

baz = ismember(foo.Year,bar.Year);
nnz(baz)   

As @Divyajyoti Nayak says, it will take either example data to illustrate and an explanation as to why/what you would expect or a a specific definition of what constitutes your working definition of "duplicate"

Daniel il 27 Ago 2024

Apri in MATLAB Online

@Divyajyoti Nayak, agreed. I'm trying to work up some dummy data that exhibits the same behaviour, but struggling at present. The actual data is sensitive, so I can't just drop some here.

Some of the fields are <undefined>, if that affects things. [I.e. ismissing(T.missingdata(cellref)) = logical 1]

@dpb, given the construction from a root table (T), using the logical indexing based on T.Year explicitly designed to have one year of overlap, I expect to see the outputs being effective row-truncated copies of T with the same column/variable data in rows corresponding to that overlapping year.

I.e. if size(T) is (m x n), then size(foo) should be (k x n) with k < m, and size(bar) should be (j x n), with j < m, and j+k > m. Note that the same columns are present throughout. This is borne out in my workspace.

I've now tried explicitly looping to find comparisons, but it is very slow, not really feasible on the scale of data I have.

Done a quick search for the first row of foo in bar (and I can find that relatively quickly), which we'll call rowid (i.e., bar{rowid,:} is the same line of data as foo{1,:}). Running

ismember(foo(1,:),bar(rowid,:))
ans = 
  logical
   0

tells me that something isn't right. Then running

ismember(foo{1,:},bar{rowid,:})
ans = 
  1xn logical array
   1   1   ...   1   1   0   1   1   1   0   1   1   1   

and investigating further, the 0 entries correspond precisely to entries with missing data. Digging further, at least one field in every single row of my data is missing (which for my purposes is fine, since I don't require them to be populated - but I need to know if the incoming data matches for populated fields to avoid duplicates).

The manual page on ismember does not mention that it can't handle missing data.

dpb il 27 Ago 2024

Modificato: dpb il 27 Ago 2024

Apri in MATLAB Online

m = missing returns a missing value displayed as <missing>. You can set an element of an array or table to missing to represent missing data. The value of missing is then automatically converted to the standard missing value native to the data type of the array or table variable. For example, core data types that support missing and their corresponding standard missing values after assignment are as follows:

double, single, duration, and calendarDuration convert missing to NaN
datetime converts missing to NaT
categorical converts missing to <undefined>
string converts missing to <missing>

Consequently,

x=missing;
y=x;
x==y
ans = logical
   0

NaN, NaT, etc., don't compare even to themselves, it's not just ismember, it's the fundamental rule of comparison.

Use fillmissing to set a finite missing value indicator is probably easiest way.

But, if your base database has random missing values in whatever column/variable, then it is probable that even that comparison is going to fail because individual rows may have a different set of missing values. The question is just what it is that is the criterion for keeping a new row.

Daniel il 28 Ago 2024

Modificato: Daniel il 28 Ago 2024

I see the use cases both ways for treating missing variables as either equivalent or not (e.g. depending on whether you expect to need to be able to impute different values for them in different rows, or whether they are simply blank data and for your table purposes, blank is equivalent).

It might be nice in a future update if ismember (or a broader setting) acquires an option to set the treatment of missing data - my gut would keep the default as the current behaviour, but allow setting of handling to equivalent at the user's choice, within a specific function call. (In the same vein, I'd probably write this into each comparative function rather than going for a global setting which may cause havoc elsewhere in code... I can definitely see the potential for a whole lot more going wrong if NaN is globally set to equivalence to itself.)

I did find fillmissing, and it has indeed worked and resulted in the desired behaviour regarding ismember. Mildly annoying in that I can't use the inbuilt <undefined> without re-converting back and forth every time, but I can deal with that.

Thanks for the discussion and help.

dpb il 28 Ago 2024

I agree that an enhancement (or different function to ensure don't break compatability) would be an option that allowed missing values to be compared positionally. That NaN doesn't compare is part of IEEE-754, no way/no how is it feasible to change it; that <missing> is NaN for double was an obvious choice but has the observed ramifications. There is no standard definition that is different and unique.

Daniel il 29 Ago 2024

Modificato: Daniel il 29 Ago 2024

Agreed, having looked into it a little bit I can see why IEEE-754 is set up the way it is. There are other ways around it, e.g. posit schemes, such as Gustafson's unum representation (which even had a MATLAB library developed around it), but I'm not familiar enough with them to know what other issues that would cause if the missing standard was changed. That scheme had a representation of not a real NaR, with the property NaR = NaR, which would fit my use case.

Like I said, I suspect a specific exception within the ismember function would be one way to handle it (if one wanted to), and even then, well-documented and only at the user's insistence for that one call. Or, potentially, a specific exception for a given array to use NaR for <missing> for doubles, in addition to allowing ismember to interpret that as I described. If wishes were fishes... and I absolutely appreciate Chesterton's Fence with regard to this.

Alternatively, as you say, development of a specifically positional <missing> comparison function, and the ability to either run that in tandem with handling the nonmissing data, or call within existing functions (e.g. ismember) might be a better way to handle it. Given the propensity of users to call eval() though, in spite of all the threads about that, I can understand some reluctance on the dev team's part.

I have the fillmissing() workaround in the meantime, and I'll write some code to verify that the categorical used there is nowhere present in any incoming new data to be checked before I run fillmissing() over it in advance of the ismember() call, with an error thrown if it is. That should suffice as a control over the vetting/cleaning/appending process.

@dpb, if you want to summarise this thread as an answer, I'd be very happy to accept/upvote it, and remove mine, to recognise your contribution.

Accedi per commentare.

Accedi per rispondere a questa domanda.

Follow Question

Answer 1

Daniel il 28 Ago 2024

Modificato: Daniel il 28 Ago 2024

0 voti

The fundamental issue experienced was due to missing data.

Use of fillmissing to replace <undefined> with specific phrasing not used anywhere in the data led to the desired behaviour from ismember() to compare the table data.

Thanks to @Divyajyoti Nayak and @dpb for their comments and discussion.

0 Commenti
Mostra -2 commenti meno recenti Nascondi -2 commenti meno recenti

Accedi per commentare.

How to identify duplicate rows between tables

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

Risposta accettata

0 Commenti
Mostra -2 commenti meno recenti Nascondi -2 commenti meno recenti

Più risposte (0)

Categorie

Prodotti

Release

Tag

Community Treasure Hunt

How to identify duplicate rows between tables

7 Commenti Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

Risposta accettata

0 Commenti Mostra -2 commenti meno recenti Nascondi -2 commenti meno recenti

Più risposte (0)

Categorie

Prodotti

Release

Tag

Vedere anche

Community Treasure Hunt

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

0 Commenti
Mostra -2 commenti meno recenti Nascondi -2 commenti meno recenti