Categorical bins incorrectly ordered after saving table as .csv

Question

Samantha Huff il 15 Mar 2022

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/1672394-categorical-bins-incorrectly-ordered-after-saving-table-as-csv

Commentato: Siraj il 13 Set 2023

I created the following table and then exported it as a csv:

gs_pks_threshold = groupsummary(regularizedData_calibrated_pks, 'vmRaw', [0 1000 2000 3000 4000 5000 6000 7000 8000 9000 10000 11000 12000 13000 14000 15000 16000 17000 18000 19000 20000 21000 22000 23000 24000 25000 26000 27000 28000 29000 30000 31000 32000], 'nnz', {'pks','pkswt', 'threshold', 'pks_final'})
writetable(gs_pks_threshold, 'gs_pks_threshold.csv')

Next, I import the data (as I am moving from a remote server to my local laptop) with the bins as a categorical variable, as this is how the variable is originally stored.

%% Set up the Import Options and import the data
opts = delimitedTextImportOptions("NumVariables", 6);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
% Specify column names and types
opts.VariableNames = ["disc_vmRaw", "GroupCount", "nnz_pks", "nnz_pkswt", "nnz_threshold", "nnz_pks_final"];
opts.VariableTypes = ["categorical", "double", "double", "double", "double", "double"];
% Specify file level properties
opts.ExtraColumnsRule = "ignore";
opts.EmptyLineRule = "read";
% Specify variable properties
opts = setvaropts(opts, "disc_vmRaw", "EmptyFieldRule", "auto");
% Import the data
gspksP1 = readtable("/Users/shuff/Documents/GitHub/axivity/process_ax6_cwa_files/gs_pks_P041.csv", opts);

When I go to plot this table, if I use the same code that I use to plot the table within matlab, the bins become ordered such that in ascending order they are plotted as folows:

'[0, 1000)'

'[1000, 2000)'

'[10000, 11000)'

'[11000, 12000)'

'[12000, 13000)'

'[13000, 14000)'

'[14000, 15000)'

'[15000, 16000)'

'[16000, 17000)'

'[17000, 18000)'

'[18000, 19000)'

'[19000, 20000)'

'[2000, 3000)'

'[20000, 21000)'

'[3000, 4000)'

'[4000, 5000)'

'[5000, 6000)'

'[6000, 7000)'

'[7000, 8000)'

'[8000, 9000)'

'[9000, 10000)'

How can I import the data so that the bins are ordered correctly?

2 Commenti
Mostra NessunoNascondi Nessuno

Stephen23 il 15 Mar 2022

Modificato: Stephen23 il 15 Mar 2022

Perhaps this might help: https://www.mathworks.com/matlabcentral/fileexchange/47433-natural-order-row-sort

Eric Sofen il 18 Mar 2022

If you're just saving to file to move the data between computers, can you save as a MAT file? That will preserve the table with the categorical bins rather than dealing with the "lossy" aspects of CSV.

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

Siraj il 12 Set 2023

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/1672394-categorical-bins-incorrectly-ordered-after-saving-table-as-csv#answer_1308066

Apri in MATLAB Online

Hi! It is my understanding that you are using the “groupsummary” function to group the data, you have specified "vmRaw" as the grouping variable and applied the "GROUPBINS" to discretize the "vmRaw" data. As a result, a column named "disc_vmRaw" is created in the grouped table.

The grouped table is saved using the "writetable" function. However, when the data is read again using “readtable” and plotted with "disc_vmRaw" on the x-axis, the categories on the x-axis are sorted lexicographically, that’s why '[10000, 11000)' appear before '[2000, 3000)'.

I was able to replicate this issue using a simple example, and I'm unsure of the exact reason behind it. However, you can try the following workaround. After reading the table, plot the desired columns excluding the discrete column ("disc_vmRaw" in your case). Then, use the "xticklabels" function to relabel the x-ticks using the categories present in the discrete column. This should help address the sorting issue you encountered.

To gain a better understanding, you can execute the provided example below and observe the outcomes.

Profit = [20 13 11 9 19 25 31 4 26 30]';

ItemsSold = [14 13 8 5 10 16 8 6 7 11]';

TT = table(Profit,ItemsSold)

TT = 10×2 table

Profit ItemsSold ______ _________ 20 14 13 13 11 8 9 5 19 10 25 16 31 8 4 6 26 7 30 11

G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")

G = 7×3 table

disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31

%saving the grouped table.

writetable(G,'GroupedTable.csv');

opts = delimitedTextImportOptions("NumVariables", 3);

% Specify range and delimiter

opts.DataLines = [2, Inf];

opts.Delimiter = ",";

opts.VariableTypes = ["categorical", "double", "double"];

opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];

%reading the saved table.

G_Read = readtable('GroupedTable.csv', opts)

G_Read = 7×3 table

disc_Profit GroupCount mean_Profit ___________ __________ ___________ [1, 6) 1 4 [6, 11) 1 9 [11, 16) 2 12 [16, 21) 2 19.5 [21, 26) 1 25 [26, 31) 2 28 [31, 36] 1 31

%plotting the original grouped table.

figure;

plot(G.disc_Profit, G.GroupCount)

title("Grouped Table plot")

%plotting the grouped table that is read from "gs_pks_threshold.csv"

figure;

plot(G_Read.disc_Profit, G_Read.GroupCount)

title("Grouped Table plot read from the csv")

%Workaround.

figure,plot(G_Read.GroupCount)

title("Workaround Grouped Table plot read from the csv")

xticklabels(G_Read.disc_Profit)

Refer to the link below to learn more about “xticklabels”.

https://www.mathworks.com/help/matlab/ref/xticklabels.html

Hope this helps.

2 Commenti
Mostra NessunoNascondi Nessuno

Stephen23 il 12 Set 2023

Modificato: Stephen23 il 12 Set 2023

Apri in MATLAB Online

"...I'm unsure of the exact reason behind it"

The cause is simply that by default categorical arrays are created from text with their categories in Lexicographic order. Clearly the CSV file does not store the categories themselves anywhere (where would it?) which means that the original category order gets lost as soon as the file is written. This is easy to demonstrate by simply viewing the CSV file.

Exporting to file is a lossy operation because the categories (and hence their order) are not exported, much like when categorical data are converted to text e.g. string.

Profit = [20 13 11 9 19 25 31 4 26 30]'; 
ItemsSold = [14 13 8 5 10 16 8 6 7 11]'; 
TT = table(Profit,ItemsSold);
G = groupsummary(TT,"Profit",[1 6 11 16 21 26 31 36],{"mean"},"Profit")
G = 7×3 table
    disc_Profit    GroupCount    mean_Profit
    ___________    __________    ___________

     [1, 6)            1               4    
     [6, 11)           1               9    
     [11, 16)          2              12    
     [16, 21)          2            19.5    
     [21, 26)          1              25    
     [26, 31)          2              28    
     [31, 36]          1              31    
categories(G.disc_Profit) % original category order
ans = 7×1 cell array
    {'[1, 6)'  }
    {'[6, 11)' }
    {'[11, 16)'}
    {'[16, 21)'}
    {'[21, 26)'}
    {'[26, 31)'}
    {'[31, 36]'}
%saving the grouped table.
writetable(G,'GroupedTable.csv'); % the category order is actually lost here
% Note: the order of the rows does NOT determine the category order !!!!
opts = delimitedTextImportOptions("NumVariables", 3);
% Specify range and delimiter
opts.DataLines = [2, Inf];
opts.Delimiter = ",";
opts.VariableTypes = ["categorical", "double", "double"];
opts.VariableNames = ["disc_Profit","GroupCount","mean_Profit"];
%reading the saved table. 
G_Read = readtable('GroupedTable.csv', opts)
G_Read = 7×3 table
    disc_Profit    GroupCount    mean_Profit
    ___________    __________    ___________

     [1, 6)            1               4    
     [6, 11)           1               9    
     [11, 16)          2              12    
     [16, 21)          2            19.5    
     [21, 26)          1              25    
     [26, 31)          2              28    
     [31, 36]          1              31    
categories(G_Read.disc_Profit) % Lexicographic sorted categories
ans = 7×1 cell array
    {'[1, 6)'  }
    {'[11, 16)'}
    {'[16, 21)'}
    {'[21, 26)'}
    {'[26, 31)'}
    {'[31, 36]'}
    {'[6, 11)' }

Note that fiddling around with XTICKLABELS does not fix the category order, that categorical variable will be completely incorrect for all other operations using it. To fix the category order you will need to use e.g.

https://www.mathworks.com/help/matlab/ref/categorical.reordercats.html

The NATSORT documentation already includes examples of how to do this:

https://www.mathworks.com/matlabcentral/fileexchange/34464-customizable-natural-order-sort

See the HTML file chapter "Example: Categorical Categories".

Siraj il 13 Set 2023

Apri in MATLAB Online

Yes, this seems to work, thanks for help. The provided code below is an updated workaround that addresses the issue.

%% Workaround.
G_Read.disc_Profit = reordercats(G_Read.disc_Profit, natsort(categories(G_Read.disc_Profit)));
figure;
plot(G_Read.disc_Profit,G_Read.GroupCount)
title("Workaround Grouped Table plot read from the csv")

To use "natsort" function, download and unzip the required file from the following link.

https://www.mathworks.com/matlabcentral/fileexchange/34464-customizable-natural-order-sort

Accedi per commentare.

Categorical bins incorrectly ordered after saving table as .csv

2 Commenti
Mostra NessunoNascondi Nessuno

Risposte (1)

2 Commenti
Mostra NessunoNascondi Nessuno

Vedere anche

Categorie

Tag

Community Treasure Hunt

Categorical bins incorrectly ordered after saving table as .csv

2 Commenti Mostra NessunoNascondi Nessuno

Risposte (1)

2 Commenti Mostra NessunoNascondi Nessuno

Vedere anche

Categorie

Tag

Community Treasure Hunt

2 Commenti
Mostra NessunoNascondi Nessuno

2 Commenti
Mostra NessunoNascondi Nessuno