Program architecture for handling large .csv files

Good afternoon.
So I am a bit concerned. I just spent a week writing a MATLAB program that takes in a large amount of data from a .csv file.
When I started the project I opted for the table variable type because of its ease of use with functions like readtable().
However, I am now stuck with an unusable program: it takes hours for MATLAB to process my .csv files.
It appears MATLAB has not yet improved performance for the table datatype, with no improvement in sight.
I have considered the following options:
1: re-write the entire program using arrays.
Problem: the data in my file is all hex values. I will need to convert them to decimal numbers using the hex2dec function, which only works with a char data type, so using an array of doubles is out of the question. Not sure where to go here.
2: try to re-write the program using the Parallel Computing Toolbox.
Thoughts?

Answers (5)

Peter Perkins
Peter Perkins on 28 Jul 2021
Edited: Peter Perkins on 28 Jul 2021
With no code at all to go on, it's pretty hard to give specific advice.
Hex issues aside, the first advice I would give would be to write vectorized code. The fact that you have 50k calls to subscripting suggests that you are doing scalar operations in a tight loop. That's not the best way to write code in MATLAB. Again, not much information to go on, so hard to say. So many calls to brace subscripting suggests that you are doing assignments to one variable using braces, which, as Walter points out, is slower than using dot (because it has to do a lot more in general). That's an unfortunate difference that is not highlighted in the doc, but perhaps should be. In any case, no code to go on, so ...
Walter is correct that since 18b, performance, especially for assignments into large tables, has improved quite a bit.
If you can't vectorize, and you can't upgrade to a newer version, it's probably not necessary to "re-write the entire program using arrays". It's usually possible to focus only on the tight loop and "hoist" some of the variables out of the table for that part of the code, then put them back in. Use tables for the organization and convenience they provide, use raw numeric in small doses for performance. Tables have a lot going on, and will never be as fast as double arrays. That doesn't mean you should avoid them.
It really helps to provide concrete code examples.
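Peter's "hoist" suggestion might look something like the following sketch; the table contents, variable names, and the smoothing loop are all made up for illustration:

```matlab
% Hypothetical example of hoisting: pull one table variable out into a
% plain double array, run the tight loop on the array, write back once.
T = array2table(rand(1e5, 3), 'VariableNames', {'A', 'B', 'C'});

b = T.B;                  % hoist: a single dot access returns a double array
for k = 2:numel(b)
    b(k) = 0.5*b(k) + 0.5*b(k-1);   % scalar work on raw numeric data
end
T.B = b;                  % one assignment puts the result back in the table
```

The loop never touches the table, so none of the per-iteration table subscripting overhead is paid; the table is used only for organization before and after.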

3 Comments

I have followed several of your posts, and although I understand the need to see code in general, you have to understand that in a professional environment it is not possible to share code.
This is expressly why I wanted to discuss program architecture at a high level and not the specifics of functions.
Let's stick with my question of general practice for program architecture.
Are you unwilling or unable to share code with anyone outside your team or organization (even with MathWorks Technical Support possibly under a non-disclosure agreement) or are you unwilling or unable to share code in a public forum like MATLAB Answers? If the latter, contact Technical Support and ask if they can meet your terms (like an NDA) to share the code and work with you to determine how to improve the performance of your application.
Or if the application is a large and/or complicated one, perhaps you could retain our consulting department's services to help you optimize your application.
It is difficult to discuss high level program architecture without some understanding of the kinds of operations that need to be performed.
For example, there are some needs for which the most efficient method would be to use a multidimensional hypercube of characters, with the text for any one "word" stored as a column (instead of the typical row). But there are other needs for which you might want the "words" stored as rows in a hypercube.
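As a concrete (hypothetical) illustration of the column-wise layout: fixed-width, uppercase hex "words" stored as the columns of a char array can be converted to decimal in one vectorized pass, with no per-element hex2dec call:

```matlab
% Assumes fixed-width, uppercase hex; each COLUMN of 'words' is one value.
words = ['1A'; 'FF'; '07'; 'C3'].';    % 2-by-4 char array

d = double(words) - '0';               % '0'..'9' -> 0..9
d(d > 9) = d(d > 9) - 7;               % 'A'..'F' -> 10..15

vals = 16.^(size(words, 1)-1:-1:0) * d;   % weighted sum down each column
% vals is [26 255 7 195]
```

Lowercase hex would need one more adjustment line; the point is only that the column layout makes the conversion a single matrix product.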

Sign in to comment.

In some cases, depending on the format of the file, you have some options for how to proceed:
  • readtable() and readmatrix() and readcell() all permit a 'Format' option, using the same format specifications as are used by textscan() -- including the potential to use the %x format (possibly with a length specification, if your fields are fixed width).
  • you could use textscan() directly, since you are working with text files
  • you could use lower-level I/O commands, including fscanf(), or fgetl() with sscanf(), depending on how complicated your files are. If your format is complicated enough that you effectively need to read one line at a time, then this approach might be slower.
  • when your file is not super-complicated but does have different sections, there are surprisingly high performance gains to be had by reading the entire file in as text, using regexp() to break it up into subsections, and then running textscan() or sscanf() on the subsections. Performance gains relative to looping over and testing each line, that is.
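For instance, a minimal sketch of the textscan() route for a comma-separated file of bare hex fields; the filename and the three-column layout are assumptions:

```matlab
% Read three comma-separated hex columns; %x converts the hex text to
% numbers during the scan, so no hex2dec pass is needed afterwards.
fid = fopen('data.csv', 'r');
C = textscan(fid, '%x%x%x', 'Delimiter', ',');
fclose(fid);
M = double([C{:}]);   % n-by-3 numeric matrix of the decimal values
```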

4 Comments

Robert Scott's incorrectly posted "Answer" moved here:
I think maybe I did not explain well enough.
My problem is not readtable().
My problem is that, after looking at the code analyzer,
the act of storing data in a table takes a ridiculous amount of time
when doing something like the code below:
So essentially, because of this terrible MATLAB performance, I will need to redesign my entire program. Below is a screenshot from the code analyzer.
table{1,2} = hex2dec('1a')
It appears MATLAB has not yet improved performance for the table datatype, with no improvement in sight.
You have marked that you are using R2018b. If I recall correctly, MathWorks improved table performance in at least two releases after that.
You wrote that
the data in my file is all hex values. I will need to convert them to decimal numbers using the hex2dec function which only works with a char data type.
and I described a number of approaches that avoid using hex2dec() but still convert the values from hex to numeric.
I guess I am unclear as to the resolution you were hoping for?
I wonder what relative timings are?
T = array2table(randi(9, 10000, 30));
timeit(@() brace_idx_assign(T), 0)
ans = 7.3777e-04
timeit(@() brace_var_assign(T), 0)
ans = 7.1777e-04
timeit(@() dot_var_assign(T), 0)
ans = 1.1239e-04
function brace_idx_assign(T)
T{1,2} = 26;
end
function brace_var_assign(T)
T{1, 'Var2'} = 26;
end
function dot_var_assign(T)
T.Var2(1) = 26;
end
Interesting, using dot syntax is on the order of 6 times faster, which I personally would not have expected. Could you experiment with using dot indexing instead of {} indexing to see whether your code performance improves substantially?
Thanks for your comments. I could possibly give that a shot.

Sign in to comment.

Jeremy Hughes
Jeremy Hughes on 28 Jul 2021
Edited: Jeremy Hughes on 29 Jul 2021
First, it would help to see an example file, and some sample code that demonstrates the problem.
Here's the best I can say with what I see.
If the entire variable is in hex format, you can use import options to do that conversion quickly on import.
detectImportOptions will recognize 0x1a as a hex value, but not a bare value without the prefix; however, a bare value can still be read as hex if you ask for it.
opts = detectImportOptions(filename,"Delimiter",",")
% varNamesOrNumbers = [1 3 5]
% or varNamesOrNumbers = ["Var1","Var3"]
opts = setvaropts(opts, varNamesOrNumbers, "NumberSystem", "hex", "type", "auto");
% You can also improve reading performance by selecting only the columns
% you would like (this is optional)
opts.SelectedVariableNames = opts.VariableNames([1 3 5 7 9]);
T = readtable(filename,opts)

12 Comments

I have followed several of your posts, and although I understand the need to see code in general, you have to understand that in a professional environment it is not possible to share code.
This is expressly why I wanted to discuss program architecture at a high level and not the specifics of functions.
Let's stick with my question of general practice for program architecture.
From what you have shared so far, it seems evident that the issues are at the level of specifics of functions and syntax, not at overall program architecture.
I hope we've provided relevant suggestions.
Robert Scott
Robert Scott on 28 Jul 2021
Edited: Robert Scott on 28 Jul 2021
I disagree. It is well documented through many posts that the MATLAB table data type lacks any reasonable performance. Using MathWorks' very own code analyzer points to a MATLAB internal function holding up execution by a significant margin. Your insistence that the problem is with my code/syntax, despite my providing code analyzer data showing that the issue is with an internal MATLAB function call, is subpar at best.
Again, hence my original ask to discuss program architecture suggestions for handling large tables of data.
I assure you, I'm trying to help. The internal functions are certainly causing slowdowns, but I see over 50000 calls to table/subsref. There are many anti-patterns of access that result in poor performance, but I'm mostly speculating about your workflow.
I can understand not wanting to share specifics of proprietary code; Steve Lord's comments might be the most applicable.
I think vectorization is one of the high level suggestions that would help.
Here's an example:
N = 10000;
hexchars = '0123456789ABCDEF';
%Make a bunch of random hex strings.
hexvals = cellstr(hexchars(randi([1 16],N,2)));
% put them in a table.
t = table(hexvals)
t = 10000×1 table
    hexvals
    _______
    {'56'}
    {'AF'}
    {'FF'}
    {'E6'}
    {'F8'}
    {'B4'}
    {'2B'}
    {'01'}
    {'AA'}
    {'E2'}
    {'BD'}
    {'47'}
    {'6D'}
    {'C3'}
    {'3D'}
    {'E4'}
Here's some examples of access pattern difference that might help you see what I'm talking about.
tic
% access using {} indexing in a loop
for i = 1:N
t{i,2} = hex2dec(hexvals{i});
end
Warning: The new variables being added to the table have fewer rows than the table. They have been extended with rows containing default values.
toc
Elapsed time is 6.911162 seconds.
% access using {} but vectorized
tic
t{:,3} = hex2dec(hexvals);
toc
Elapsed time is 0.040975 seconds.
% access using .VarName access.
tic
t.Var4 = hex2dec(hexvals);
toc
Elapsed time is 0.014183 seconds.
As you can see, there are orders of magnitude of difference in these access patterns.
The other option for working with very big tables would be either datastore or tall. But which of these you'd need depends on the size of the data and the types of operations you need to do.
Please provide a few lines of working code to set the variable type as hex.
I don't see anything in the opts object that allows for that.
I am getting a "too many input arguments" error on your setvartype function.
Unfortunately the NumberSystem option was not supported until R2020a, but you appear to be using R2018b.
Robert Scott
Robert Scott on 29 Jul 2021
Edited: Robert Scott on 29 Jul 2021
I actually just upgraded today; I'm on R2020b.
Also, Robert,
could you explain the following:
I am trying to concatenate two cells of a cell array together using something like
c1 = {1}
c2 = {2}
answer = [c1;c2]
and that works for a 1-D cell array.
I need to concatenate the columns of a cell array but stay on a single row.
Suggestions?
I was trying
[1;1:2] with no success.
Thanks
Jeremy had posted
opts = setvartype(opts, varNamesOrNumbers, "NumberSystem", "hex", "type", "auto");
but it needs to be
opts = setvaropts(opts, varNamesOrNumbers, "NumberSystem", "hex", "type", "auto");
I need to concat the cols of a cell array but stay on a single row
I am not clear as to what is being asked here?
c1 = {1}
c1 = 1×1 cell array
{[1]}
c2 = {2}
c2 = 1×1 cell array
{[2]}
answer = [c1, c2]
answer = 1×2 cell array
{[1]} {[2]}
[1;1:2] no success
{1, 1:2}
ans = 1×2 cell array
{[1]} {[1 2]}
Jeremy Hughes
Jeremy Hughes on 29 Jul 2021
Edited: Jeremy Hughes on 29 Jul 2021
Yes, sorry, that was a typo. I updated it.
So I have come full circle back to this.
I think I have a solution.
I have switched to readmatrix and ditched tables in general.
I have found that most of the table functions and methods work with the output of readmatrix.
I am finding in my code analyzer that the majority of my time is spent in a function called hex2binaryvector,
which got me thinking about your post. I want to use readmatrix on a csv and read it in ahead of time as hex.
How can I do that? I tried to use your above code; it works for readtable but does not seem to work for readmatrix.
If I can bring it in ahead of time as a hex value I will be in much better shape.
Can one of you offer a suggestion?
"If I can bring it in ahead of time as a hex value I will be in much better shape. Can one of you offer a suggestion?"
This has already been done above by Walter.
NB: the new readXXX functions like readmatrix all pass the heavy lifting off to readtable in the end; they are just front ends that give it some clues about the content of the file and specify how the data are to be returned to the caller.
These may provide a small performance boost over readtable; what they mostly do is return an array or cell array instead of a table. If all operations are on an array basis, then there's probably no advantage at all in using tables.
However, using detectImportOptions and customizing an import options object, including setting the variable types as hex where appropriate, will almost certainly give benefits -- the primary one being importing the data as decimal values instead of as char, cellstr, or string that would then have to be converted -- instead, the work is passed off to the system I/O library.
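A sketch of that approach combined with readmatrix, which (in R2020a and later) also accepts an import options object; the filename and the choice of converting every variable are assumptions:

```matlab
% Configure the hex conversion once in the import options, then let the
% import machinery hand back a numeric matrix directly.
opts = detectImportOptions('data.csv', 'Delimiter', ',');
opts = setvaropts(opts, opts.VariableNames, ...
    'NumberSystem', 'hex', 'Type', 'auto');
M = readmatrix('data.csv', opts);   % already decimal, no hex2dec needed
```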

Sign in to comment.

I want to get back to this thread.
So I have a significant time investment in this project and wanted to thank everyone who chimed in.
I have found my own answer, and I'm really surprised no one mentioned this that I can see.
The answer is, you can't!!! You can't read in big files. The way MATLAB brings these files into RAM has too much overhead. I rewrote my program with every trick in the book and managed nearly a 1000X speed increase over my rev 1! Well, guess what, it still wasn't good enough. When I asked my program, now blazing fast with all kinds of tricks (vectorizing, parfor loops, etc.), to run a 10k-line file, I got wonderful speeds: 500 µs a line! It was excellent. Well, guess what happens when you read in a file that has 500k lines: my 500 µs is now 500 ms. The answer is you just can't. You can't bring in all the data locally and not impact performance. You need to read it one line at a time rather than pulling the whole file into memory, so MATLAB doesn't slow down.
Here is the answer, which is the point of this post: use fopen and fgets. fgets is pretty fast.
Read the file line by line as you parse it. That is the only way. This also keeps you from overfilling your RAM.
I hope my pain and suffering may help someone in the future. Don't even bother with tall arrays and datastores. All of that stuff has way too much overhead, and what is meant as a solution is actually your demise.
fopen and fgets, and read it line by line. fgets executes on a µs scale and never gets slowed down by RAM.
What a waste of 10 days this has been for me. Thanks to everyone who supported this learning experiment.
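A minimal sketch of that line-by-line pattern (using fgetl, the variant of fgets that drops the trailing newline; the filename and comma-separated hex format are assumptions):

```matlab
% Read and parse one line at a time; only the current line is in memory.
fid = fopen('data.csv', 'r');
rows = {};
line = fgetl(fid);
while ischar(line)                        % fgetl returns -1 at end of file
    line(line == ',') = ' ';              % sscanf splits on whitespace
    rows{end+1} = sscanf(line, '%x').';   %#ok<AGROW> one row of decimals
    line = fgetl(fid);
end
fclose(fid);
```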

3 Comments

I have some tests to do, but first I need to get details of the file format.
If I understand correctly, each line consists of repeated entries that are 4 hex digits representing signed 16 bit numbers. You have up to a million such lines, and "a few hundred" such entries per line. For testing purposes, can I set the upper limit to 500 entries per line, or can a lower limit be set?
You mention csv files. For any given line are the entries comma separated with no space and no other characters? Such as
DEAD,BEEF,ABCD,DEF0
or are there spaces after the commas? Instead of commas, is there a single space, like
DEAD BEEF ABCD DEF0
Can we assume that every line is exactly the same size, or do we need to worry about a different number of entries on each line?
For the section you are timing above, does that only include converting the data to 16 bit signed integer, or does that timing include something else?
If all of the lines have the same number of entries, then is it acceptable to return an int16 array rather than cell arrays of values?
Do you have at least 8 gigabytes of available memory?
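If the entries are indeed 4-hex-digit signed 16-bit numbers, the unsigned values that hex parsing produces can be reinterpreted as signed without any arithmetic; a sketch:

```matlab
% hex2dec reads each row as an unsigned value 0..65535; typecast then
% reinterprets the same 16 bits as signed int16.
u = hex2dec(['DEAD'; 'BEEF'; '7FFF']);   % 57005, 48879, 32767 (doubles)
s = typecast(uint16(u), 'int16');        % -8531, -16657, 32767
```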
"I hope my pain and suffering may help someone in the future ... and I'm really surprised no one mentioned this ..." @Robert Scott
Much (if not all) of that pain, and the reason nobody suggested the alternative, had mostly to do with your insistence on being so secretive about the objectives and file structure of the input, and the complete absence of any hint that your processing was only record-by-record. If you'll note, at the same time I wrote a filter for another poster for which it was apparent the processing was, in fact, the same type of processing, at <Splitting-a-file-into-multiple-files-based-on-trigger-words-in-the-first-column>, which uses the same idea.
In general, computations on data files mostly need all of the data in order to do those computations, and hence having all the data in memory, or memory-mapped so the whole file can be accessed in pieces, is needed. That takes memory, and one pays for it with the overhead needed to handle it.
I pointed you towards the lower-level I/O but never got any responses that would lead to knowing that the point was simply to move records from one place to another. You were so focused on just the one conversion part of the process that "we couldn't see the forest for the trees," having been told explicitly not to worry about details, that you'd take care of that.
"... each line consists of repeated entries that are 4 hex digits representing signed 16 bit numbers..." @Walter Roberson
I finally got a response that all entries are hex per record except for the first and last elements of each record, but you never responded as to what those two are. All the other details are still opaque, given the refusal to provide any further useful details...

Sign in to comment.

This case is closed. I have thanked you all for your help, and I am not able to share or discuss this any further. I have now said numerous times that I am unable to share ANYTHING other than general MATLAB principles and discussion of how MATLAB itself works in general terms. I have found a suitable solution using fgets. Good day to you all.

Products

Release

R2018b

Asked:

on 27 Jul 2021

Commented:

on 23 Sep 2021
