readtable (excel) behavior for string type with empty cell vs spaces
Mostra commenti meno recenti
I am trying to make an excel import function robust to treat either empty cells or cells containing only spaces in a specific way, namely to fill with the string
""
(I know, this may not seem useful, but it is for me...)
% define the columns of data variable type and name
Variables = [...
"double","Index";
"string","StringData";
"string","Comment"];
% create import options for a table structure that has
% 1st row as a column type descriptor
% 2nd row as the traditional "header" with column names
% 3rd + rows as data
opts = spreadsheetImportOptions(...
"NumVariables",size(Variables,1),...
"VariableNames",Variables(:,2)',...
"VariableTypes",Variables(:,1)',...
"DataRange","A3",...
"VariableNamesRange","A2",...
"VariableDescriptionsRange","A1",...
"RowNamesRange","A3",...
"ImportErrorRule","fill");
% set the variable options for the string data column
opts = setvaropts(opts,"StringData","FillValue","");
out = readtable("example_empty_cell.xlsx",opts)
For the attached excel file, this returns
out =
6×3 table
Index StringData Comment
_____ __________ ______________________
1 1 <missing> "single space"
2 2 <missing> "three spaces"
3 3 <missing> "empty"
4 4 <missing> "apostrophe and space"
5 5 <missing> "apostrophe only"
6 6 "hello" "non empty string"
It's not what I would expect...but if my replacement string is instead
% set the variable options for the string data column
opts = setvaropts(opts,"StringData","FillValue"," ");
I get the expected
out =
6×3 table
Index StringData Comment
_____ __________ ______________________
1 1 " " "single space"
2 2 " " "three spaces"
3 3 " " "empty"
4 4 " " "apostrophe and space"
5 5 " " "apostrophe only"
6 6 "hello" "non empty string"
Perhaps this issue is not specific to readtable/setvaropts, but rather some general internal conversion of "" into <missing>?
One clue is that if I set the fill value to strings(0), it produces an error saying
Expected a value which can be converted to a character vector.
In any case, I'm wondering if there is a one-shot way to achieve what I want without having to post-process the imported data one way or another?
Risposta accettata
Più risposte (2)
Jemima Pulipati
il 21 Dic 2020
Modificato: Jemima Pulipati
il 21 Dic 2020
Hello,
From my understanding, you want to fill the missing/empty values in table with an empty character.
Starting from R2020a, the behaviour of readtable has been modified with respect to the usage of import options (Check Compatibility Considerations). By default readtable uses the import options which implies all missing values are represented using "<missing>". So in your case, when you are trying to set the missing values using
opts = setvaropts(opts,"StringData","FillValue","");
This considers that empty character to be a missing value and uses the standard missing value representation of string i.e '<missing>'. But when you use:
opts = setvaropts(opts,"StringData","FillValue"," ");
This considers the fill value as a valid string and just sets the missing values.
This above behaviour occurs because of import options being used by readtable.
The behaviour of readtable() upto R2019b was different i.e it used the empty character representation instead of '<missing>' to represent missing values in tables.
The suggested workarounds are:
- You may have to use 'Fomat: auto' to use the empty character representation (specifying this option chooses the behaviour that was present upto R2019b and does not by default consider the import options).
Example:
out = readtable("example_empty_cell.xlsx","Format","auto")
- readtable can use the import options and after reading the table you may set the column values explicitly to display according to the use case.
Example :
out = readtable("example_empty_cell.xlsx",opts);
out(1:end-1,2) = {''};
- Modifying the variable type to 'char' instead of 'string' for the column.
Variables = [...
"double","Index";
"char","StringData";
"string","Comment"];
This shows the empty character as a representation for the missing values as this is the representation for the 'char' datatype.
NaN — double, single, duration, and calendarDuration
NaT — datetime
<missing> — string
<undefined> — categorical
' ' — char
{''} — cell of character arrays
1 Commento
J. Alex Lee
il 21 Dic 2020
Modificato: J. Alex Lee
il 21 Dic 2020
Julian Hapke
il 3 Giu 2021
Modificato: Julian Hapke
il 3 Giu 2021
0 voti
If you do not use the FillValue property of the VariableImportOptions, but set EmptyFieldRule to "auto", missing strings are imported as "" instead of <missing>. Since you can set this for every variable separately, this does not make the use of "Format", "auto" necessary.
Edit: unfortunately in my test case, not all missing values are replaced by an empty string, some still come back as missing. this is really frustrating.
3 Commenti
J. Alex Lee
il 3 Giu 2021
Julian Hapke
il 3 Giu 2021
yes, after checking my real world data instead of a simple test case, I must agree (see edit), it's not really useful.
J. Alex Lee
il 3 Giu 2021
Modificato: J. Alex Lee
il 3 Giu 2021
Categorie
Scopri di più su Tables in Centro assistenza e File Exchange
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!