- Read the entire file into a character vector.
- Replace "#@#@#" with a newline.
- Replace "'~'" with a comma. (I use "" to avoid escape characters.)
- Parse the resulting string with textscan(). (readtable() does not accept text held in a variable, only file names.)
How to read in large text file with special delimiters?
Ricardo Lopez A.
on 23 Apr 2021
Commented: Ricardo Lopez A.
on 26 Apr 2021
Hi,
Is there a way to read data from a text file with the following format per row:
A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#
Thus, the column delimiter is '~', and the row delimiter is #@#@#.
Further, missing/null values are represented by two columns delimiters '~''~', for example:
A'~'216772930'~'Birdbox'~''~'1'~'5'~''~''~''~''~''~''~''~''~''~''~''~''~''~'1'~'213'~'#@#@#
Is there any way to specify your own row and column delimiters to be able to read in this data?
Thanks a lot in advance!
0 Comments
Accepted Answer
per isakson
on 24 Apr 2021
Edited: per isakson
on 24 Apr 2021
"Is there any way to specify your own row and column delimiters to be able to read in this data?" No, I don't think so.
How large is the text file? I assume it fits in a fraction of the physical memory (RAM) of your computer.
I assume that the column delimiter is the three-character vector: '~'
Workaround
Demo
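%% Read the entire file into a character vector
chr = fileread( 'cssm.txt' );
%% Replace the row delimiter with newline and the column delimiter with comma
str = strrep( chr, "#@#@#", newline );
str = strrep( str, "'~'", "," );
%% Parse the text: one text column followed by eight numeric columns
% Empty numeric fields are returned as NaN.
cac = textscan( str, '%s%f%f%f%f%f%f%f%f', 'Delimiter',',' );
% Show the first three rows of columns 1, 2 and 6
cac{1}(1:3)
cac{2}(1:3)
cac{6}(1:3)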
where cssm.txt contains twenty-five copies of your first example on a single line.
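For anyone who wants to reproduce the demo, a test file like the one described can be built as sketched below. The row is copied from the question; the count of 25 copies and writing into the current folder are assumptions.

```matlab
% Build cssm.txt: the first example row repeated on a single line.
% The row comes from the question; the count of 25 is an assumption.
row = "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#";
txt = repmat( char( row ), 1, 25 );      % each copy is 48 characters long

fid = fopen( 'cssm.txt', 'w' );
assert( fid ~= -1, 'Could not open cssm.txt for writing.' )
fprintf( fid, '%s', txt );               % no newline is written anywhere
fclose( fid );

sad = dir( 'cssm.txt' );
sad.bytes                                % size of the file on disk, in bytes
```

Because every row ends with "#@#@#" and no newline is written, the whole file is a single line, which is the situation the question asks about.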
5 Comments
per isakson
on 25 Apr 2021
Edited: per isakson
on 25 Apr 2021
"work with some even larger text files"
I wonder whether cssm_tweaked() uses memory more efficiently. (I've excluded the strings.)
cac = cssm_tweaked( 'cssm.txt' );
cac{1}(1:3)
cac{2}(1:3)
cac{6}(1:3)
At least, it returns the same result.
Another approach is to use memmapfile(). I haven't tested it with a really large data file, but as far as I can interpret the result of profile('-memory','on'), it works as expected. (Actually better than expected, so the result is probably not true.) However, be warned that it changes the data file on disk, and it uses an undocumented feature of strrep().
cac = cssm_memmap( 'cssm_large.txt' );
cac{1}(1:3)
cac{2}(1:3)
cac{6}(1:3)
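The source of cssm_memmap() is not shown in this thread, so the following is only a rough, read-only sketch of the memory-mapping idea, not that function. It maps a small test file, copies one block of bytes into memory and parses the complete rows in that block. The temporary file, the block size of 200 bytes and the nine-column format are assumptions, and unlike the function described above this sketch does not modify the file.

```matlab
% Small test file: 25 copies of the example row (48 bytes each).
row      = char( "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#" );
fileName = [ tempname, '.txt' ];                 % hypothetical temporary file
fid = fopen( fileName, 'w' );
fprintf( fid, '%s', repmat( row, 1, 25 ) );
fclose( fid );

% Map the file as bytes. The mapping is read-only by default.
m = memmapfile( fileName, 'Format', 'uint8' );

% Copy only the first block into memory and convert it to a char row vector.
blockEnd = min( 200, numel( m.Data ) );          % assumed block size in bytes
buf = char( m.Data( 1:blockEnd ) )';

% Keep the complete rows, i.e. everything up to the last row delimiter.
ix  = strfind( buf, '#@#@#' );
buf = buf( 1 : ix(end) + 4 );

% Same replace-and-parse steps as in the demo.
txt = strrep( strrep( buf, '#@#@#', newline ), char( "'~'" ), ',' );
cac = textscan( txt, '%s%f%f%f%f%f%f%f%f', 'Delimiter', ',' );

clear m                                          % release the mapping
delete( fileName )
```

A full reader would repeat this for successive blocks and carry the incomplete row at the end of each block over to the next one.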
I assume that something like cssm_loop() is possible:
cac = cssm_loop( 'cssm_large.txt' );
whos cac
cac{1,1}(1:3)
cac{1,2}(1:3)
cac{1,6}(1:3)
Two tasks remain. The small one is to concatenate the result, cac. The somewhat bigger one is to handle varying row lengths. (In my case the row length is 48, the same row repeated.) The chunk parsed by textscan() must end with a newline. Read a large chunk, e.g. 1e9 bytes, cut off the characters after the last newline, and save them for the following iteration.
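The chunking and concatenation described above could look roughly like the sketch below. It is not the cssm_loop() function mentioned earlier, whose source is not shown here. It cuts each chunk after the last "#@#@#" before the replacement (equivalent to cutting after the last newline afterwards), carries the remainder over, and concatenates the columns at the end. The temporary test file, the deliberately tiny chunk size and the nine-column format are assumptions.

```matlab
% Small test file: 25 copies of the example row (48 characters each).
row      = char( "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#" );
fileName = [ tempname, '.txt' ];             % hypothetical temporary file
fid = fopen( fileName, 'w' );
fprintf( fid, '%s', repmat( row, 1, 25 ) );
fclose( fid );

chunkSize = 100;                             % tiny on purpose; e.g. 1e9 for real data
rowDelim  = '#@#@#';
colDelim  = char( "'~'" );
fmt       = '%s%f%f%f%f%f%f%f%f';            % assumes nine columns in every row

parts = cell( 0, 9 );                        % one row of nine columns per chunk
rest  = '';                                  % incomplete row carried to the next chunk
fid   = fopen( fileName, 'r' );
while ~feof( fid )
    buf = [ rest, fread( fid, chunkSize, '*char' )' ];
    ix  = strfind( buf, rowDelim );
    if isempty( ix )                         % no complete row in the buffer yet
        rest = buf;
        continue
    end
    last = ix(end) + numel( rowDelim ) - 1;  % last character of the last complete row
    rest = buf( last+1 : end );
    txt  = strrep( buf( 1:last ), rowDelim, newline );
    txt  = strrep( txt, colDelim, ',' );
    parts( end+1, : ) = textscan( txt, fmt, 'Delimiter', ',' ); %#ok<SAGROW>
end
fclose( fid );
delete( fileName )

% Concatenate the chunks column by column into one 1x9 cell array.
cac = cell( 1, size( parts, 2 ) );
for k = 1 : numel( cac )
    cac{k} = vertcat( parts{ :, k } );
end
```

Any text after the final row delimiter is left in rest and is not parsed, so a file that ends with an incomplete row would need one extra step.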
The safest solution, I guess, is to
- Use GSplit to split the huge file into a number of 1e9-byte files. (Cut after newline, i.e. after "#@#@#".)
- Put all the small files into a separate folder.
- Use datastore() to access the data.
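One way the last step might be sketched is with fileDatastore() and a custom read function, since the files still contain the special delimiters. This is an assumption about how the datastore could be set up, not something tested on large data. The temporary folder, the file names part_01.txt and part_02.txt, and the nine-column format are hypothetical.

```matlab
% Hypothetical folder with the split files. Two tiny files are created here
% so that the sketch runs on its own.
row    = char( "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#" );
folder = tempname;
mkdir( folder );
for k = 1 : 2
    fid = fopen( fullfile( folder, sprintf( 'part_%02d.txt', k ) ), 'w' );
    fprintf( fid, '%s', repmat( row, 1, 3 ) );    % three complete rows per file
    fclose( fid );
end

% Custom reader: the same replace-and-textscan steps as in the demo.
fmt     = '%s%f%f%f%f%f%f%f%f';                   % assumes nine columns
toLines = @(chr) strrep( strrep( chr, '#@#@#', newline ), char( "'~'" ), ',' );
readFcn = @(file) textscan( toLines( fileread( file ) ), fmt, 'Delimiter', ',' );

% Each call to read() parses one file and returns its 1x9 cell array.
ds    = fileDatastore( folder, 'ReadFcn', readFcn, 'FileExtensions', '.txt' );
nRows = 0;
while hasdata( ds )
    cac   = read( ds );
    nRows = nRows + numel( cac{2} );              % process one file at a time
end
```

Only one file is held in memory at a time, which is why each split file must end with a complete row.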
P.S. Characters are represented by two bytes in contemporary MATLAB, so the text of cssm.txt takes twice as much space in MATLAB as it does on disk.
chr = fileread( 'cssm.txt' );
sad = dir('cssm.txt');
sad.bytes
whos chr
@Ricardo Lopez A. I'm interested in how these functions work with huge data files and would appreciate hearing about your experience with them.