Reading and processing data from text file to matlab variable quickly
Paolo Binetti
on 25 Feb 2017
Edited: per isakson
on 3 Mar 2017
I use the following code to read data from a text file and process it into two cell arrays. It works, but can it be done faster? Although I currently need the cell array format for the downstream code that uses the data, I am also open to considering other data types if they help read the text file more quickly.
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cell2mat(nodes);
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
2 Comments
dpb
on 25 Feb 2017
The time overhead is likely not in the file reading portion but in the regexp processing afterwards; regexp is pretty notorious for not being a speed demon. You're just reading the file into memory and splitting it into a cellstr array, so I suspect that's not the issue.
Try breaking out the fileread from the surrounding regexp and profile the result; I'll be quite surprised if the above supposition doesn't turn out to be true.
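One way to follow this suggestion is to time each stage separately. The sketch below is only an illustration: it generates a hypothetical 3000-line test file in an assumed "node -> e1,e2,e3" format (the real sample_input.txt is not shown in the thread) and then times fileread, the line split, and the regexp matching on their own.

```matlab
% Build a hypothetical test file; the line format is an assumption.
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
for k = 1:3000
    fprintf(fid, '%d -> %d,%d,%d\r\n', k, k+1, k+2, k+3);
end
fclose(fid);

% Stage 1: read the whole file into one char vector.
tic
txt = fileread(fname);
t_read = toc;

% Stage 2: split into lines and drop empty entries.
tic
adjlist = regexp(txt, '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
t_split = toc;

% Stage 3: the two regexp passes from the question.
tic
nodes = regexp(adjlist, '\w*(?= )', 'match');
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
t_match = toc;

fprintf('fileread: %.4f s, split: %.4f s, match: %.4f s\n', ...
    t_read, t_split, t_match);
delete(fname);
```

For a finer breakdown, the same steps can be run under the MATLAB profiler (profile on, run the code, profile viewer).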
Accepted Answer
per isakson
on 26 Feb 2017
Edited: per isakson
on 26 Feb 2017
"Reading and processing data from text file to matlab variable quickly"   The short answer is that using textscan to read the file and do most of the parsing is faster and gives cleaner code.
It's a bit tricky to measure the speed of reading small files, since the file will be available in the system cache after the first test. However, it's safe to claim that in this case textscan is faster.
Run this
>> [nodes,edges,cac] = cssm();
Elapsed time is 0.054037 seconds.
Elapsed time is 0.009937 seconds.
>> cac(:)
ans =
{3001x1 cell}
{3001x1 cell}
where
function [nodes,edges,cac] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
toc
end
 
A fairer comparison:
>> [nodes,edges,n2,e2] = cssm();
Elapsed time is 0.047859 seconds.
Elapsed time is 0.014726 seconds.
>> edges{1}
ans =
'3' '5' '9'
>> e2{1}
ans =
'3' '5' '9'
where three lines are added to produce the data in the same format
function [nodes,edges,n2,e2] = cssm()
tic
adjlist = regexp(fileread('sample_input.txt'), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp( adjlist, '\w*(?= )', 'match' );
% nodes = cell2mat(nodes);
% Error using cell2mat (line 52)
% CELL2MAT does not support cell arrays containing cell arrays or objects.
nodes = cat( 1, nodes{:} );
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');
toc
tic
fid = fopen( 'sample_input.txt' );
cac = textscan( fid, '%s%*s%[^\r\n]', 'Delimiter',' ' );
[~] = fclose( fid );
n2 = cac{1}; % new
e2 = regexp( cac{2}, ',', 'split' ); % new
e2 = reshape( e2, 1,[] ); % new
toc
end
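To check that both approaches really produce the same data, a small self-contained test can help. The following is a minimal sketch, assuming the input lines look like "1 -> 3,5,9" with CRLF line endings (the actual file is not shown, so this format and the three sample lines are assumptions). It writes a temporary file, parses it both ways, and compares the results.

```matlab
% Write a tiny hypothetical input file (assumed format, CRLF line endings).
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '%s\r\n', '1 -> 3,5,9', '2 -> 4', '3 -> 1,2');
fclose(fid);

% Approach 1: fileread followed by regexp, as in the question.
adjlist = regexp(fileread(fname), '\r\n', 'split');
adjlist(cellfun('isempty', adjlist)) = [];
nodes = regexp(adjlist, '\w*(?= )', 'match');
nodes = cat(1, nodes{:});
edges = regexp(adjlist, '(?<=( |,))\w*', 'match');

% Approach 2: textscan does most of the parsing.
fid = fopen(fname);
cac = textscan(fid, '%s%*s%[^\r\n]', 'Delimiter', ' ');
fclose(fid);
n2 = cac{1};
e2 = regexp(cac{2}, ',', 'split');
e2 = reshape(e2, 1, []);

% Both approaches should agree on this input.
same_nodes = isequal(nodes, n2);
same_edges = isequal(edges, e2);
fprintf('nodes equal: %d, edges equal: %d\n', same_nodes, same_edges);
delete(fname);
```

If the real file uses a different separator or line ending, the format string and the split pattern would need to be adjusted accordingly.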
7 Comments
dpb
on 1 Mar 2017
The final line of strsplit after all the preprocessing is
% Split.
[c, matches] = regexp(str, aDelim, 'split', 'match');
so guess it stands to reason it's going to be slower... :)
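To see the difference on a given machine, the two calls can be compared directly. This is a rough sketch using a hypothetical edge string; timeit results for such short calls vary between runs and MATLAB releases, so only the relative order is of interest.

```matlab
s = '3,5,9';                          % hypothetical edge list

% Both calls should return the same 1x3 cell array.
a = strsplit(s, ',');
b = regexp(s, ',', 'split');
same = isequal(a, b);

% timeit calls each function handle repeatedly and returns a typical time.
t_strsplit = timeit(@() strsplit(s, ','));
t_regexp   = timeit(@() regexp(s, ',', 'split'));
fprintf('strsplit: %.2e s, regexp split: %.2e s\n', t_strsplit, t_regexp);
```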
per isakson
on 2 Mar 2017
Edited: per isakson
on 3 Mar 2017
"more efficient way to store strings of different lengths"   I guess there is no one-size-fits-all answer. It depends on several things:
- "Efficient" regarding memory use and "efficient" regarding computational speed may conflict.
- The number of strings to store.
- The variation in length of the strings, as Walter pointed out.
- Which operations will be done on the set of strings.
- Whether or not the data is strictly "write-once-read-many".
- Whether the cost of developing the program/code counts.
- And more ...
Regarding character arrays: "'first','second','third'" should be stored with one string per column, padded with blanks, as
fst
ieh
rci
sor
tnd
 d
since MATLAB is column-major. This is tricky to read when debugging.
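The column layout above can be reproduced with a short sketch. char pads the shorter strings with trailing blanks, and transposing the result puts each string in its own column, so its characters are contiguous in memory. The variable names here are illustrative only.

```matlab
% char() builds a padded 3x6 matrix, one string per row.
c = char('first', 'second', 'third');

% Transpose so that each string occupies one column (6x3).
ct = c.';
disp(ct)

% Recover a single string: take its column, transpose, strip the padding.
second = strtrim(ct(:, 2).');

% cellstr() converts the row-wise matrix back to a cell array and
% removes the trailing blanks.
names = cellstr(c);
```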
I recently had a problem:
- a fraction of a million valid Matlab variable names. Most names are short, but some are long. (No, I don't use them in expressions with EVAL.)
- searches typically return a dozen names
Solution:
- store all names in one row separated by char(31), huge_str. char(31) is displayed as space by editors.
- store the positions of char(31) to avoid repeated use of strfind(huge_str)
- use STRFIND and REGEXP in searches
My resulting code is fast and memory efficient, but it did require some debugging.
Is this an undocumented use of char(31) that might not survive the next MATLAB release? I don't think the use of char(31) is mentioned in the MATLAB documentation.
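The storage scheme described in this comment might look roughly like the following. It is a sketch under stated assumptions, not the author's actual code: the names, the leading and trailing separators, and the search pattern are all hypothetical.

```matlab
% Hypothetical names; the real data set is not shown in the thread.
names = {'alpha', 'beta_long_variable_name', 'gamma', 'alphabet'};
sep = char(31);                          % the separator from the comment

% One long row with a separator before the first and after every name
% (this framing is an assumption that simplifies the indexing below).
huge_str = [sep, strjoin(names, sep), sep];

% Store the separator positions once instead of calling strfind repeatedly.
pos = strfind(huge_str, sep);

% Extract the k-th name directly from the stored positions.
k = 2;
kth = huge_str(pos(k)+1 : pos(k+1)-1);

% Search: every name that starts with 'alpha'. The lookbehind anchors the
% match at the start of a name, and [^sep]* stops at the next separator.
pattern = ['(?<=' sep ')alpha[^' sep ']*'];
hits = regexp(huge_str, pattern, 'match');
```

Keeping everything in one char row avoids the per-element overhead of a large cell array, at the cost of the less readable indexing shown here.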