Slow 2020a performance writing to a Table in a loop

15 visualizzazioni (ultimi 30 giorni)
I have run into a problem using tables in that writing to them takes so much time that their use become unfeasible. I developed the application using small ~1000 row writes to the table by creating a structure and using struct2table() to load the table. I expected a linear scaling in time as I moved to 10^6 row but have found an exponential scaling making the application unusable.
I developed this piece of example code to help illustrate the problem. I have found that writing to a subsection of a large (10^6) row table scales exponentially with the number of rows being written when using a for loop. Apparently the struct2table() function has a similar problem.
clear all;
var = { 'A' 'B' 'C' 'D' };
varType = {'double' 'double' 'double' 'double'};
rows = [100 1000 10000 100000 200000 300000 400000 500000];
d1=zeros(size(rows)); d2=zeros(size(rows));
for i = 1:length(rows)
T = table('Size',[1e6,length(var)],'VariableNames',var, 'VariableTypes', varType);
tic
T.B(1:rows(i)) = rand([rows(i),1]);
d1(i) = toc;
fprintf('1) Vector writing %.0f rows takes %.4f seconds\n', rows(i),d1(i));
tic
for j = 1:rows(i)
T.A(j) = rand;
end
d2(i)=toc;
fprintf('2) Loop writing %.0f rows takes %.4f seconds\n\n',rows(i),d2(i));
end
figure
loglog(rows,[d1; d2])
legend('Vector Write', 'Loop Write');
On my Macbook I am seeing this:
1) Vector writing 100 rows takes 0.0077 seconds
2) Loop writing 100 rows takes 0.0190 seconds
1) Vector writing 1000 rows takes 0.0029 seconds
2) Loop writing 1000 rows takes 0.0916 seconds
1) Vector writing 10000 rows takes 0.0005 seconds
2) Loop writing 10000 rows takes 0.9005 seconds
1) Vector writing 100000 rows takes 0.0009 seconds
2) Loop writing 100000 rows takes 12.1660 seconds
1) Vector writing 200000 rows takes 0.0021 seconds
2) Loop writing 200000 rows takes 30.3729 seconds
1) Vector writing 300000 rows takes 0.0024 seconds
2) Loop writing 300000 rows takes 67.7139 seconds
1) Vector writing 400000 rows takes 0.0030 seconds
2) Loop writing 400000 rows takes 161.0859 seconds
1) Vector writing 500000 rows takes 0.0038 seconds
2) Loop writing 500000 rows takes 220.1511 seconds
I had planned on using tall tables for an application that might reach 10^10 rows so any advice on how to best approach writing to sections of tables would be appreciated.
  1 Commento
Sindar
Sindar il 15 Mag 2020
my suggestion would be to go ahead and test out tall tables. It's not unlikely that the scaling is completely different

Accedi per commentare.

Risposte (1)

Campion Loong
Campion Loong il 18 Lug 2020
Modificato: Campion Loong il 22 Lug 2020
Hi Charles,
Thanks for bringing up this performance question. I assume from your code snippet the timing is done either in the base workspace (i.e. directly under the MATLAB command window) or as a script.
There is important and sometimes very significant performance implication between base-workspace/script and functions. For example, here I am placing your code inside a function:
function SwensonTiming
var = { 'A' 'B' 'C' 'D' };
varType = {'double' 'double' 'double' 'double'};
rows = [100 1000 10000 100000 200000 300000 400000 500000];
d1=zeros(size(rows)); d2=zeros(size(rows));
for i = 1:length(rows)
T = table('Size',[1e6,length(var)],'VariableNames',var,'VariableTypes',varType);
tic
T.B(1:rows(i)) = rand([rows(i),1]); % note this is also timing rand()'s scaling with size
d1(i) = toc;
fprintf('1) Vector writing %.0f rows takes %.4f seconds\n', rows(i),d1(i));
tic
for j = 1:rows(i)
T.A(j) = rand;
end
d2(i)=toc;
fprintf('2) Loop writing %.0f rows takes %.4f seconds\n\n',rows(i),d2(i));
end
figure
loglog(rows,[d1; d2])
legend('Vector Write', 'Loop Write');
Note that both your code and the above function (with the exact same code) include rand() in their timings, although that is not at all a significant contribution to runtime.
On my ~4-year old Core-i5/16GB ram desktop, running MATLAB R2020a Update 4, the 'Loop Write' part scales linearly as expected - as also shown by the plot your code (when inside a function) produces:
>> SwensonTiming
1) Vector writing 100 rows takes 0.0003 seconds
2) Loop writing 100 rows takes 0.0073 seconds
1) Vector writing 1000 rows takes 0.0003 seconds
2) Loop writing 1000 rows takes 0.0666 seconds
1) Vector writing 10000 rows takes 0.0005 seconds
2) Loop writing 10000 rows takes 0.6360 seconds
1) Vector writing 100000 rows takes 0.0023 seconds
2) Loop writing 100000 rows takes 6.8393 seconds
1) Vector writing 200000 rows takes 0.0057 seconds
2) Loop writing 200000 rows takes 14.4702 seconds
1) Vector writing 300000 rows takes 0.0068 seconds
2) Loop writing 300000 rows takes 20.4186 seconds
1) Vector writing 400000 rows takes 0.0128 seconds
2) Loop writing 400000 rows takes 26.2447 seconds
1) Vector writing 500000 rows takes 0.0115 seconds
2) Loop writing 500000 rows takes 32.3546 seconds
Of course, you still get much better performance vectorizing, but the scalar loop assignment in your example scales as anticipated.
In short, take advantage of the best performance table currently offers in a function (rather than in base workspace or script). Loren's Blog has published a helpful post about recent advance in table performance in R2020a, as well as some common patterns/anti-patterns for performance.
If you are still running into performance issue with your use case, please get in touch. We'd like to learn more about your workflows.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by