Why does fread of a 2 GB file need more than 8 GB of RAM?

4 views (last 30 days)
Gabriel on 4 Jun 2013
textscan is too slow.
Thus, I want to load a 2 GB file in RAM with fread (fast), then scan it.
fread works well with small files, but if I try fread(fid, '*char') on a 2 GB file, RAM usage spikes past my 8 GB limit for some reason and I get an out-of-memory error.
Ideas?
2 Comments
Jan on 4 Jun 2013
Please post the full code, because there might be unexpected problems.
Gabriel on 4 Jun 2013
Well, the code is simple:
fid = fopen(filename);
test = fread(fid, '*char');


Answers (3)

Jan on 4 Jun 2013
Reading a 2 GB file into a char array requires 4 GB of RAM, because MATLAB uses 2-byte chars. Then, depending on how you store the data, the contents of a temporary array may be copied, so 8 GB is the expected memory consumption. Actually, I'd expect that this copy could be avoided, so it would help if you showed us the code fragment.
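One way to halve that footprint (a sketch only, not the poster's code; filename stands in for the actual file) is to read the raw bytes as uint8 and convert to 2-byte chars only for the slices you actually inspect:

fid = fopen(filename, 'r');
raw = fread(fid, '*uint8');   % 1 byte per element: a 2 GB file -> ~2 GB array
fclose(fid);
% Convert a slice to text only when needed, e.g. the first line:
firstLine = char(raw(1:find(raw == 10, 1) - 1))';

Converting only small slices with char() keeps the 2-byte-per-char cost limited to the text you are currently looking at.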
2 Comments
Gabriel on 4 Jun 2013
Precisely: I expect it to require 4 GB, yet, watching the system monitor, the whole thing goes over 8 GB and into swap.
I also get the copied-into-functions part, etc. But shouldn't fread be able to load a 2 GB file into a 4 GB char array without needing more than 8 GB of RAM?
Jan on 4 Jun 2013
Edited: Jan on 4 Jun 2013
I've seen equivalent behavior in another FREAD implementation (not in MATLAB): the required final size was not determined by FSEEK; instead, the file was read in chunks until the buffer was filled, and the buffer was then re-allocated at double its size. After the obvious drawbacks had been mentioned in a discussion, the author decided to replace the doubling method with a smarter Fibonacci sequence. :-)
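In MATLAB, that growing-buffer pattern can at least be avoided by determining the size up front and allocating once; a minimal sketch (filename is a placeholder):

fid = fopen(filename, 'r');
fseek(fid, 0, 'eof');                 % jump to end of file
nBytes = ftell(fid);                  % file size in bytes
fseek(fid, 0, 'bof');                 % rewind to the beginning
data = fread(fid, nBytes, '*uint8');  % one allocation, no doubling
fclose(fid);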

Accedi per commentare.


Iain on 4 Jun 2013
As Jan implied, passing variables around often leads to memory duplication: 2 GB arrays get COPIED when passed into functions.
The out-of-memory error normally comes up when MATLAB cannot find a single contiguous chunk of RAM big enough for a variable.
Use much smaller chunks of memory: read and parse the file in chunks of, say, 64 MB.
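A minimal sketch of that chunked approach (chunkBytes and filename are placeholders, and the parsing step is left open):

chunkBytes = 64 * 1024^2;   % 64 MB per read
fid = fopen(filename, 'r');
while ~feof(fid)
    chunk = fread(fid, chunkBytes, '*uint8');
    % ... parse `chunk` here; remember to carry any partial
    %     trailing line over to the next iteration ...
end
fclose(fid);

Note that a line of text can straddle two chunks, so the parser has to stitch the boundary back together.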
2 Comments
Walter Roberson on 4 Jun 2013
The arrays will only get copied if they are modified; otherwise the data pointer will point to the original storage.
Gabriel on 4 Jun 2013
I think I did not express myself well; I apologize. Parsing is not the issue. I fully expect scanning functions to be (relatively) memory hungry.
With fread, on the other hand, I don't quite get why it needs so much overhead just to load a 2 GB+ file into the workspace.



Gabriel on 4 Jun 2013
Edited: Gabriel on 4 Jun 2013
In any case, I have found a workaround for textscanning large ASCII files (4 GB and beyond) that contain numbers.
The trick is padding the numbers with Perl or sed before trying to read them into MATLAB. If you pad your numbers with leading 0s, every line has the same number of chars, so fread is easy to execute in chunks.
For example (lineWidth is the fixed line length after padding, nLines the number of lines per chunk, and process stands in for your own function):
fid = fopen(filename);
while ~feof(fid)
    tmp = fread(fid, nLines * lineWidth, '*char')';
    data = textscan(tmp, '%f');
    process(data{1});
end
fclose(fid);
With this trick, I went from 3 MB/sec to 130 MB/sec for processing a file.
