Categorical bins incorrectly ordered after saving table as .csv

2 visualizzazioni (ultimi 30 giorni)
I created the following table and then exported it as a csv:
gs_pks_threshold = groupsummary(regularizedData_calibrated_pks, 'vmRaw', [0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000 30000 31000 32000], 'nnz', {'pks','pkswt', 'threshold', 'pks_final'})
writetable(gs_pks_threshold, 'gs_pks_threshold.csv')
Next, I import the data (as I am moving from a remote server to my local laptop) with the bins as a categorical variable, as this is how the variable is originally stored.
%% Set up the Import Options and import the data
opts = delimitedTextImportOptions("NumVariables", 6);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
% Specify column names and types
opts.VariableNames = ["disc_vmRaw", "GroupCount", "nnz_pks", "nnz_pkswt", "nnz_threshold", "nnz_pks_final"];
opts.VariableTypes = ["categorical", "double", "double", "double", "double", "double"];
% Specify file level properties
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
% Specify variable properties
opts = setvaropts(opts, "disc_vmRaw", "EmptyFieldRule", "auto");
% Import the data
gspksP1 = readtable("/Users/shuff/Documents/GitHub/axivity/process_ax6_cwa_files/gs_pks_P041.csv", opts);
When I go to plot this table, if I use the same code that I use to plot the table within matlab, the bins become ordered such that in ascending order they are plotted as folows:
'[0, 1000)'
'[1000, 2000)'
'[10000, 11000)'
'[11000, 12000)'
'[12000, 13000)'
'[13000, 14000)'
'[14000, 15000)'
'[15000, 16000)'
'[16000, 17000)'
'[17000, 18000)'
'[18000, 19000)'
'[19000, 20000)'
'[2000, 3000)'
'[20000, 21000)'
'[3000, 4000)'
'[4000, 5000)'
'[5000, 6000)'
'[6000, 7000)'
'[7000, 8000)'
'[8000, 9000)'
'[9000, 10000)'
How can I import the data so that the bins are ordered correctly?
  2 Commenti
Eric Sofen
Eric Sofen il 18 Mar 2022
If you're just saving to file to move the data between computers, can you save as a MAT file? That will preserve the table with the categorical bins rather than dealing with the "lossy" aspects of CSV.

Accedi per commentare.

Risposte (1)

Siraj
Siraj il 12 Set 2023
Hi! It is my understanding that you are using the “groupsummary” function to group the data, you have specified "vmRaw" as the grouping variable and applied the "GROUPBINS" to discretize the "vmRaw" data. As a result, a column named "disc_vmRaw" is created in the grouped table.
The grouped table is saved using the "writetable" function. However, when the data is read again using “readtableand plotted with "disc_vmRaw" on the x-axis, the categories on the x-axis are sorted lexicographically, that’s why '[10000, 11000)' appear before '[2000, 3000)'.
I was able to replicate this issue using a simple example, and I'm unsure of the exact reason behind it. However, you can try the following workaround. After reading the table, plot the desired columns excluding the discrete column ("disc_vmRaw" in your case). Then, use the "xticklabels" function to relabel the x-ticks using the categories present in the discrete column. This should help address the sorting issue you encountered.
To gain a better understanding, you can execute the provided example below and observe the outcomes.
Profit = [20 13 11 9 19 25 31 4 26 30]';
ItemsSold = [14 13 8 5 10 16 8 6 7 11]';
TT = table(Profit,ItemsSold)
TT = 10×2 table
Profit ItemsSold ______ _________ 20 14 13 13 11 8 9 5 19 10 25 16 31 8 4 6 26 7 30 11
G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")
G = 7×3 table
disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31
%saving the grouped table.
writetable(G,'GroupedTable.csv');
opts = delimitedTextImportOptions("NumVariables", 3);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
opts.VariableTypes = ["categorical", "double", "double"];
opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];
%reading the saved table.
G_Read = readtable('GroupedTable.csv', opts)
G_Read = 7×3 table
disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31
%plotting the original grouped table.
figure;
plot(G.disc_Profit, G.GroupCount)
title("Grouped Table plot")
%plotting the grouped table that is read from "gs_pks_threshold.csv"
figure;
plot(G_Read.disc_Profit, G_Read.GroupCount)
title("Grouped Table plot read from the csv")
%Workaround.
figure,plot(G_Read.GroupCount)
title("Workaround Grouped Table plot read from the csv")
xticklabels(G_Read.disc_Profit)
Refer to the link below to learn more about “xticklabels.
Hope this helps.
  2 Commenti
Stephen23
Stephen23 il 12 Set 2023
Modificato: Stephen23 il 12 Set 2023
"...I'm unsure of the exact reason behind it"
The cause is simply that by default categorical arrays are created from text with their categories in Lexicographic order. Clearly the CSV file does not store the categories themselves anywhere (where would it?) which means that the original category order gets lost as soon as the file is written. This is easy to demonstrate by simply viewing the CSV file.
Exporting to file is a lossy operation because the categories (and hence their order) are not exported, much like when categorical data are converted to text e.g. string.
Profit = [20 13 11 9 19 25 31 4 26 30]';
ItemsSold = [14 13 8 5 10 16 8 6 7 11]';
TT = table(Profit,ItemsSold);
G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")
G = 7×3 table
disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31
categories(G.disc_Profit) % original category order
ans = 7×1 cell array
{'[1, 6)' } {'[6, 11)' } {'[11, 16)'} {'[16, 21)'} {'[21, 26)'} {'[26, 31)'} {'[31, 36]'}
%saving the grouped table.
writetable(G,'GroupedTable.csv'); % the category order is actually lost here
% Note: the order of the rows does NOT determine the category order !!!!
opts = delimitedTextImportOptions("NumVariables", 3);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
opts.VariableTypes = ["categorical", "double", "double"];
opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];
%reading the saved table.
G_Read = readtable('GroupedTable.csv', opts)
G_Read = 7×3 table
disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31
categories(G_Read.disc_Profit) % Lexicographic sorted categories
ans = 7×1 cell array
{'[1, 6)' } {'[11, 16)'} {'[16, 21)'} {'[21, 26)'} {'[26, 31)'} {'[31, 36]'} {'[6, 11)' }
Note that fiddling around with XTICKLABELS does not fix the category order, that categorical variable will be completely incorrect for all other operations using it. To fix the category order you will need to use e.g.
The NATSORT documentation already includes examples of how to do this:
See the HTML file chapter "Example: Categorical Categories".
Siraj
Siraj il 13 Set 2023
Yes, this seems to work, thanks for help. The provided code below is an updated workaround that addresses the issue.
%% Workaround.
G_Read.disc_Profit = reordercats(G_Read.disc_Profit, natsort(categories(G_Read.disc_Profit)));
figure;
plot(G_Read.disc_Profit,G_Read.GroupCount)
title("Workaround Grouped Table plot read from the csv")
To use "natsort" function, download and unzip the required file from the following link.

Accedi per commentare.

Categorie

Scopri di più su Tables in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by