How to split a huge string array efficiently

Hi everyone,
I'm trying to split a huge string array (~8.5 MB, ~11,500 rows x ~400 columns) efficiently, but I can only do it with a quite slow "for" loop that I have not managed to remove.
The number of columns may change from one file to another, so I cannot fix a single file format up front and import according to it.
%% getting data from .txt => really fast
tic
disp('importing file');
a = string(textread([pwd '\test.txt'],'%s','headerlines',1)); %#ok<*DTXTRD>
toc
%% splitting each row into columns by delimiter ";" => slow
tic
disp('splitting each row by ";"');
% preallocate using the column count of the first row
b = strings(length(a),length(strsplit(a(1),';')));
for k=1:length(a)
    b(k,:) = strsplit(a(k),';');
end
toc
%% date(str) to datenum => really fast
tic
disp('conv date to datenum');
dat1 = datenum(b(:,1),'yyyy-mm-dd');
toc
%% str to logical => really fast
tic
disp('converting data to logical array')
dat2 = logical(strcmp(b(:,2:end),'1')); %super fast
%dat2 = str2double(b(:,2:end)); %very slow
toc
% disp('converting data to logical array - 2'); %super fast as well
% tic
% dat2 = zeros(size(b));
% dat2(strcmp(b(:,2:end),'1')) = 1;
% toc
Thanks everyone! :)
Source file sample
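For reference, the loop in the question can also be avoided without changing the import step, because split() operates on a whole string array at once. This is a minimal sketch, assuming every row contains the same number of ";" delimiters (split raises an error otherwise); the three sample rows are hypothetical stand-ins for the contents of `a`.

```matlab
% Hypothetical stand-in for the rows read from the file
a = ["2020-07-24;1;0;1"; "2020-07-25;0;0;1"; "2020-07-26;1;1;0"];

% split processes the whole column in one call; each row must break
% into the same number of pieces
b = split(a, ";");            % 3-by-4 string array

dat1 = datenum(b(:,1), 'yyyy-mm-dd');
dat2 = b(:,2:end) == "1";     % element-wise comparison gives a logical array
```

Note that a scalar string (a file with a single data row) comes back from split as a column rather than a row, so that case would need a transpose.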
  3 Comments
endystrike on 24 Jul 2020
Thanks Walter, I fixed it by following your advice! :)
tic
a = readtable([pwd '\test.txt'],'delimiter',';');
dat1 = datenum(string(a{:,1}),'yyyy-mm-dd');
dat2 = logical(strcmp(string(a{:,2:end}),'1'));
toc
endystrike on 24 Jul 2020
If you want to put it as an answer, I'll accept it: you helped me a lot and I fixed the issue! :)


Accepted Answer

Walter Roberson on 24 Jul 2020
Why not use readtable()?
I would also point out that textscan() can process character vectors in which the lines are separated by newlines.
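A minimal sketch of the textscan() route, assuming the whole file is already in memory as one character vector. The sample text, the variable names, and the choice of building the format from the header line are illustrative assumptions, not part of the original answer.

```matlab
% Hypothetical in-memory text: one header line, then ';'-separated rows
txt = sprintf('date;s1;s2\n2020-07-24;1;0\n2020-07-25;0;1\n');

% Count the fields in the header line so the format adapts to the file
headerLine = extractBefore(txt, newline);
body       = extractAfter(txt, newline);      % drop the header line
nCols      = count(headerLine, ';') + 1;
fmt        = ['%s' repmat('%f', 1, nCols - 1)];

% textscan parses the newline-separated character vector directly
c = textscan(body, fmt, 'Delimiter', ';');

dat1 = datenum(c{1}, 'yyyy-mm-dd');   % c{1} is a cell array of char vectors
dat2 = logical([c{2:end}]);           % numeric columns joined side by side
```

Because the format is derived from the header, the same code copes with files that have a different number of columns.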
Note: in your release, if you use detectImportOptions() it would probably figure out automatically that the first column is a date and convert it to datetime format.
It will probably also figure out that the other columns are numeric, in which case strcmp() would not be needed, just
dat2 = logical(a{:,2:end});
You might need to pass 'HeaderLines', 1, 'ReadVariableNames', false
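A sketch of how this could look end to end, assuming automatic type detection behaves as described; the temporary sample file and its contents are hypothetical, and the isdatetime() branch is a guard added here in case the first column is not detected as a date.

```matlab
% Write a small hypothetical sample file so the sketch is self-contained
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'date;s1;s2\n2020-07-24;1;0\n2020-07-25;0;1\n');
fclose(fid);

% Let MATLAB detect the layout, then import with the detected options
opts = detectImportOptions(fname, 'Delimiter', ';');
t = readtable(fname, opts);

% The first column is probably detected as datetime; fall back to text parsing
dat1 = t{:,1};
if isdatetime(dat1)
    dat1 = datenum(dat1);
else
    dat1 = datenum(string(dat1), 'yyyy-mm-dd');
end

% Assumes the remaining columns were detected as numeric
dat2 = logical(t{:,2:end});

delete(fname);   % clean up the temporary file
```

Inspecting `opts.VariableTypes` before calling readtable() shows what was actually detected for each column.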

Release

R2019a
