Collating data from different sources

4 visualizzazioni (ultimi 30 giorni)
Stephen
Stephen il 4 Ott 2023
Modificato: Jon il 5 Ott 2023
I have two csv datasets t1 and t2:
t1: t2:
A B C D A B C D
1 5 4.1 "a" 4 3 4.2 "e"
2 3 3.8 "g" 7 5 2.5 "c"
3 2 3.6 "d" 9 4 4.3 "b"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"
So t1 is missing some values but is the 'master' table. t2 has only the rows with missing values from t1, but with the data filled (from another source).
I am trying to fill the missing values in t1 using the values from t2, where variable 'A' is like an index (having unique values for each observation).
I have tried using outerjoin() but the result is:
t3:
tleft_A tleft_B tleft_C tleft_D tright_A tright_B tright_C tright_D
1 5 4.1 "a" NaN NaN NaN NaN
2 3 3.8 "g" NaN NaN NaN NaN
3 2 3.6 "d" NaN NaN NaN NaN
4 NaN 4.2 "e" 4 3 2.5 "e"
5 6 4.2 "h" NaN NaN NaN NaN
6 2 3.6 "d" NaN NaN NaN NaN
7 NaN 2.5 "c" 7 5 2.5 "c"
8 5 1.8 "e" NaN NaN NaN NaN
9 NaN 4.3 "b" 9 4 4.3 "b"
10 6 2.6 "a" NaN NaN NaN NaN
What I am trying to achieve, however, is:
t3:
A B C D
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 3 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
10 6 2.6 "a"
Is there a simple way to accomplish this, or do I need to extract, index and combine the variables, or approach it programmatically via a loop or something similar? None of the interpolation-type fill methods seem to be appropriate either.
I'm fairly new to MatLab, so if there is any other information I am missing which would help clarify the problem, I'll be happy to update the question with it.
Thanks in advance for any insights you can provide.
  2 Commenti
Dyuman Joshi
Dyuman Joshi il 4 Ott 2023
Please attach the csv files, use the paperclip button to do so.
Stephen
Stephen il 4 Ott 2023
Thanks for your response Dyuman.
The 'data' in the question is abstracted, the csv files are 1000+ records in length with more fields. It's actually the Titanic dataset from Kaggle. I simply extracted the records with missing data and filled them in with data collected from other expert sources. I was primarily hoping for a generalised strategy, to solve the type of problem the abstracted data exposed, since I anticipate running into the same type of issue with other datasets in the future. I will include the csv files, for your edification, but just figured I must be missing something in terms of strategy.

Accedi per commentare.

Risposte (1)

Jon
Jon il 4 Ott 2023
You could use this approach
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
9 4 4.3 "b"};
t1 = cell2table(c1(2:end,:));
t1.Properties.VariableNames = c1(1,:)
t1 = 10×4 table
A B C D __ ___ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 NaN 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 NaN 2.5 "c" 8 5 1.8 "e" 9 NaN 4.3 "b" 10 6 2.6 "a"
t2 = cell2table(c2(2:end,:));
t2.Properties.VariableNames = c2(1,:)
t2 = 3×4 table
A B C D _ _ ___ ___ 4 3 4.2 "e" 7 5 2.5 "c" 9 4 4.3 "b"
% Find the indices of the missing data in t1
idxMissing = find(isnan(t1.B))
idxMissing = 3×1
4 7 9
% Fill the values from t2
t3 = t1;
[~,idx2] = ismember(idxMissing,t2.A)
idx2 = 3×1
1 2 3
t3(idxMissing,:) = t2(idx2,:)
t3 = 10×4 table
A B C D __ _ ___ ___ 1 5 4.1 "a" 2 3 3.8 "g" 3 2 3.6 "d" 4 3 4.2 "e" 5 6 4.2 "h" 6 2 3.6 "d" 7 5 2.5 "c" 8 5 1.8 "e" 9 4 4.3 "b" 10 6 2.6 "a"
  7 Commenti
Stephen
Stephen il 5 Ott 2023
I'll give it a little time, and if no one else has a more succinct method, I'll be happy to accept this as the answer. If I come up with something else in my researching, I'll update in a comment here so you can get the notification.
Cheers again for your input Jon
Jon
Jon il 5 Ott 2023
Modificato: Jon il 5 Ott 2023
Here is another approach, that utilizes operations with timetables, rather than indexing. Still a little arcane, but maybe you prefer this. You may want to remove a few more semicolons at end of statements to understand better what each step is doing
c1 = {
'A' 'B' 'C' 'D'
1 5 4.1 "a"
2 3 3.8 "g"
3 2 3.6 "d"
4 NaN 4.2 "e"
5 6 4.2 "h"
6 2 3.6 "d"
7 NaN 2.5 "c"
8 5 1.8 "e"
9 NaN 4.3 "b"
10 6 2.6 "a"};
c2 = {
'A' 'B' 'C' 'D'
4 3 4.2 "e"
7 5 2.5 "c"
8 5 1.8 "e"
9 4 4.3 "b"
15 8 6.2 "g"};
% Put data into timetables, using A as the time variable
t1 = cell2table(c1(2:end,:),'VariableNames',c1(1,:));
t1.A = days(t1.A); % convert first column to a duration
t1 = table2timetable(t1);
t2 = cell2table(c2(2:end,:),'VariableNames',c2(1,:));
t2.A = days(t2.A); % convert first column to a duration
t2 = table2timetable(t2);
% Combine the tables (makes union of timetables) putting missing data
% wherever data is not found for a given time
t3 = synchronize(t1,t2)
t3 = 11×6 timetable
A B_t1 C_t1 D_t1 B_t2 C_t2 D_t2 _______ ____ ____ _________ ____ ____ _________ 1 day 5 4.1 "a" NaN NaN <missing> 2 days 3 3.8 "g" NaN NaN <missing> 3 days 2 3.6 "d" NaN NaN <missing> 4 days NaN 4.2 "e" 3 4.2 "e" 5 days 6 4.2 "h" NaN NaN <missing> 6 days 2 3.6 "d" NaN NaN <missing> 7 days NaN 2.5 "c" 5 2.5 "c" 8 days 5 1.8 "e" 5 1.8 "e" 9 days NaN 4.3 "b" 4 4.3 "b" 10 days 6 2.6 "a" NaN NaN <missing> 15 days NaN NaN <missing> 8 6.2 "g"
% Replace the missing values using values from t2
% note by defualt, min omits missing, e.g. NaN
t3.B_t1 = min(t3.B_t1,t3.B_t2);
% Remove columns from t2
t3 = t3(:,1:3);
% Remove any rows that were in t2 but not t1
% These will have missing data from the synchronization
t3 = rmmissing(t3);
% Optionally clean up the variable names
t3.Properties.VariableNames = c1(1,2:end)
t3 = 10×3 timetable
A B C D _______ _ ___ ___ 1 day 5 4.1 "a" 2 days 3 3.8 "g" 3 days 2 3.6 "d" 4 days 3 4.2 "e" 5 days 6 4.2 "h" 6 days 2 3.6 "d" 7 days 5 2.5 "c" 8 days 5 1.8 "e" 9 days 4 4.3 "b" 10 days 6 2.6 "a"

Accedi per commentare.

Categorie

Scopri di più su Data Type Conversion in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by