Retain dummy variable labels from converting categorical to dummyvar
Dhruv Ghulati
on 24 Dec 2015
Commented: Nichole Tyler
on 25 Oct 2017
Hi there,
I have 19 categorical columns, and I have converted each one so that every category is represented by a number. I now want to expand these into one dummy column per category. The problem is that I cannot tell which dummy column corresponds to which category. I need that mapping to make the result interpretable, e.g. to say that whether or not a user is from Thailand is significant in a logistic regression.
Here is my code:
%categoricalnbs is the number converted version for all the categorical
%variables. Some columns in that table have categories 1-200, some just
%have categories 1 to 20.
categoricalnbsarray = table2array(categoricalnbs);
% categoricalnbsarray = table2array(finalnbs(:,[9:26,28]));
%finalnbs keeps the actual category names, which I thought could help with
%generating the column labels for the dummyvars, but using that line
%doesn't help.
[~, ~, ugroupA] = unique(categoricalnbsarray(:,2));
dummyvars=dummyvar(ugroupA);
array2table(dummyvars);
This increases the columns in categoricalnbs from 19 to 200, and retains the same number of rows. But how do I interpret the output...
1 Comment
Alessandro Roux
on 29 Dec 2015
Hi Dhruv,
I'm curious what output you expect to receive from "dummyvars".
If I've understood correctly, you have an array of values where each value is the numerical representation of one of twenty categories.
You call "unique" on this array of values and request three outputs from "unique". The first output, which you ignore, contains the unique values of "categoricalnbs" in sorted order (if there are 20 categories numbered 1-20, I'd expect this to be an array from 1 to 20).
The second output, which you also ignore, contains the indices into "categoricalnbs" that return each unique value in sorted order (i.e. the first output of your "unique" call).
The final output, which you call "ugroupA", contains the indices that, when applied to the first output, reconstruct all of the values of "categoricalnbs" as a single column.
In your code, you are looking for the dummy variables of an index reconstruction of "categoricalnbs".
If I have understood correctly, there will be as many dummy variables as there are unique categories in "categoricalnbs". So, if column 2 of "categoricalnbs" contains 200 possible categories, then you would expect a "dummyvars" output with 200 columns (i.e. 200 dummy variables).
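To make the three outputs of "unique" concrete, here is a minimal sketch. The vector "codes" is a made-up stand-in for one column of "categoricalnbsarray", not data from the question.

```matlab
% Hypothetical category codes standing in for one column of categoricalnbsarray
codes = [3; 1; 3; 7; 1];

% C:  sorted unique values
% ia: index of one occurrence of each unique value in codes
% ic: for every row of codes, its position in C
[C, ia, ic] = unique(codes);

sameUnique  = isequal(codes(ia), C);   % true: ia picks the unique values out of codes
sameRebuild = isequal(C(ic), codes);   % true: ic rebuilds codes from C

% dummyvar on the index vector gives one column per unique value
D = dummyvar(ic);
disp(size(D))                          % one row per observation, numel(C) columns
```

Because "ic" only takes values from 1 to numel(C), column k of "D" lines up with C(k).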
Do you mind clarifying what, in particular, confuses you about the output that you are receiving?
Alessandro
Accepted Answer
Sean de Wolski
on 30 Dec 2015
Edited: Sean de Wolski
on 30 Dec 2015
I wrote a function that does this. Here you go:
function Tdummy = dummytable(T)
% Tdummy = dummytable(T) - convert categorical variables in table to dummy
% variables
%
% This function takes the categorical variables in a table and converts
% them to separate dummy variables with intelligent names. This way they
% can be used in the Classification Learner App and the variable names make
% sense for feature selection, etc.
%
% Usage:
%
% Tdummy = dummytable(T)
%
% Inputs:
%
% T: Table with categoricals or categorical variable
%
% Outputs:
%
% Tdummy: T with categorical variables turned into dummy variables with
% intelligent names
%
% Example:
%
% % Simple Table
% T = table(rand(10,1),categorical(cellstr('rbbgbgbbgr'.')),...
% 'VariableNames',{'Percent','Color'});
% disp(T)
%
% % Turn it into a dummy table
% Tdummy = dummytable(T);
% disp(Tdummy)
%
% See Also: dummyvar, table, categorical, classificationLearner
% Copyright 2015 The MathWorks, Inc.
% Sean de Wolski Apr 13, 2014
% Error checking
narginchk(1,1)
validateattributes(T,{'categorical', 'table'},{},mfilename,'T',1);
% If it's a categorical, do our best to convert it to a table with an
% intelligent variable name
if iscategorical(T)
    % Try to use existing variable name
    cname = inputname(1);
    if isempty(cname)
        % It's a MATLAB expression, default to Var1
        cname = 'Var1';
    end
    T = table(T,'VariableNames',{cname});
end
% Identify categoricals and their names
cats = varfun(@iscategorical,T,'OutputFormat','uniform');
% Short circuit if there are no categoricals
if ~any(cats)
    Tdummy = T;
    return
end
% Store everything in a cell. w will be the total width of the table
% with each variable dummyvar'd
w = nnz(~cats)+sum(varfun(@(x)numel(categories(x)),T(:,cats),'OutputFormat','uniform'));
% Preallocate storage
datastorage = cell(1,w);
namestorage = cell(1,w);
% Engine
idx = 0; % Start nowhere in cell
for ii = 1:width(T)
    idx = idx+1;
    % Loop over table deciding what to do with each variable
    if cats(ii)
        % It's a categorical: extract it, keep its categories, and build
        % its dummy variables
        Tii = T{:,ii};
        categoriesii = categories(Tii)';
        ncatii = numel(categoriesii); % How many?
        % Build dummy variables as a row cell with one column in each cell
        dvii = num2cell(dummyvar(Tii), 1); % Dummy var then cell
        % Build names as <VariableName>_<Category>
        namesii = strcat(T.Properties.VariableNames{ii}, '_', categoriesii);
        % Insert
        datastorage(idx:(idx+ncatii-1)) = dvii;
        namestorage(idx:(idx+ncatii-1)) = namesii;
        % Increment (the loop adds the final +1 on the next pass)
        idx = idx+ncatii-1;
    else
        % Extract non categorical into current storage location
        datastorage{idx} = T{:,ii};
        namestorage(idx) = T.Properties.VariableNames(ii);
    end
end
% Build Tdummy with comma separated list expansion
Tdummy = table(datastorage{:},'VariableNames',matlab.lang.makeValidName(namestorage));
end
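For a single variable, the naming step at the heart of the function can be sketched inline as below. This is an illustration rather than a replacement for the function: the variable "Country" and its values are invented, and only the "VariableName_Category" naming scheme is taken from the code above.

```matlab
% Hypothetical data; 'Country' and its values are made up for illustration
Country = categorical({'Thailand'; 'India'; 'Thailand'; 'Japan'});

cats  = categories(Country)';        % 1-by-K cell of category names
names = strcat('Country_', cats);    % e.g. 'Country_Thailand'
D     = dummyvar(Country);           % N-by-K indicator matrix, columns in category order

% Attach the names so each dummy column is self-describing
Tdummy = array2table(D, 'VariableNames', matlab.lang.makeValidName(names));
disp(Tdummy)
```

Since the columns of "dummyvar" follow the order of "categories", the name built for position k always describes column k.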
More Answers (2)
jgg
on 30 Dec 2015
Edited: jgg
on 30 Dec 2015
I think you can make sense of this by following the documentation: "When group is a numeric vector, dummyvar assumes that the groups and their order are 1:max(group)." In other words, column order corresponds to the order of the levels. For nominal arrays, the default order is ascending alphabetical.
So, the first dummy variable will be the category with the lowest value, the second will be the second lowest value, etc. You can check this to be sure by comparing the dummy value with the category.
An easy way to make your data more intelligible is to sort by the category before creating the dummies. Then, it should have a nice pattern to it which will be easy to understand.
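The column-to-level correspondence can be checked with a short sketch like the following. The codes and the "Code_" name prefix are assumptions made up for the example.

```matlab
codes = [20; 5; 20; 11];          % hypothetical numeric category codes

% Passing the raw codes to dummyvar would create one column for every value
% in 1:max(codes). Re-indexing through unique gives one column per code that
% actually occurs.
[C, ~, ic] = unique(codes);       % C holds the sorted codes, ic maps rows to 1..numel(C)
D = dummyvar(ic);                 % column k of D flags the rows where codes == C(k)

% Name each column after the code it represents ('Code_' prefix is made up)
names = arrayfun(@(c) sprintf('Code_%d', c), C', 'UniformOutput', false);
T = array2table(D, 'VariableNames', names);

% Check the correspondence column by column
for k = 1:numel(C)
    assert(isequal(D(:,k), double(codes == C(k))))
end
```

If a lookup from code to the original category name is available, the same loop position k can be used to pull the readable label instead of the numeric code.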
Image Analyst
on 30 Dec 2015
There is a function to do this, if I understand you correctly, in the Statistics and Machine Learning Toolbox. It's called ordinal().
1 Comment
Sean de Wolski
on 30 Dec 2015
That's the old way, and its use is discouraged now that categoricals are in base MATLAB.