How to read in large text file with special delimiters?

Hi,
Is there a way to read data from a text file with the following format per row:
A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#
Thus, the column delimiter is '~', and the row delimiter is #@#@#.
Further, missing/null values are represented by two consecutive column delimiters '~''~', for example:
A'~'216772930'~'Birdbox'~''~'1'~'5'~''~''~''~''~''~''~''~''~''~''~''~''~''~'1'~'213'~'#@#@#
Is there any way to specify your own row and column delimiters to be able to read in this data?
Thanks a lot in advance!

Accepted Answer

per isakson on 24 Apr 2021
Edited: per isakson on 24 Apr 2021
"Is there any way to specify your own row and column delimiters to be able to read in this data?" No, I don't think so.
How large is the text file? I assume it fits in a fraction of the physical memory (RAM) of your computer.
I assume that the column delimiter is the three-character vector '~' (quote, tilde, quote).
Workaround
  1. read the entire file to a character vector
  2. replace "#@#@#" by newline
  3. replace "'~'" by comma (I use double-quoted strings to avoid having to escape the single quotes)
  4. parse the resulting string with textscan() (for some reason readtable() doesn't accept text held in a string)
Demo
%%
chr = fileread( 'cssm.txt' );
%%
str = strrep( chr, "#@#@#", newline );
str = strrep( str, "'~'", "," );
%%
cac = textscan( str, '%s%f%f%f%f%f%f%f%f', 'Delimiter',',' );
cac{1}(1:3)
ans = 3×1 cell array
{'A'} {'A'} {'A'}
cac{2}(1:3)
ans = 3×1
648387 648387 648387
cac{6}(1:3)
ans = 3×1
NaN NaN NaN
where cssm.txt contains twenty-five copies of your first example in a single row.
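To reproduce the demo without the original file, a test file can be generated first. The following is only a minimal sketch: it assumes the file is the first example row repeated 25 times with nothing after the last row delimiter, and the temporary file name and the variables row and fname are made up for illustration.

```matlab
% Build a small test file (assumption: 25 copies of the first example row, no line breaks).
row   = char("A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#");
fname = [tempname, '.txt'];          % hypothetical temporary file name
fid   = fopen(fname, 'w');
fwrite(fid, repmat(row, 1, 25));     % each character is written as a single byte
fclose(fid);

% Same workaround as in the answer: map the custom delimiters to newline and comma.
chr = fileread(fname);
str = strrep(chr, "#@#@#", newline);
str = strrep(str, "'~'", ",");
cac = textscan(char(str), '%s%f%f%f%f%f%f%f%f', 'Delimiter', ',');

delete(fname);                       % remove the temporary file again
```

Empty fields come back as NaN in the numeric columns, which is why cac{6} shows NaN above while cac{7} holds 14.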
  5 Comments
per isakson on 25 Apr 2021
Edited: per isakson on 25 Apr 2021
"work with some even larger text files"
I wonder whether cssm_tweaked() uses memory more efficiently. (I've excluded the strings.)
cac = cssm_tweaked( 'cssm.txt' );
cac{1}(1:3)
ans = 3×1 cell array
{'A'} {'A'} {'A'}
cac{2}(1:3)
ans = 3×1
648387 648387 648387
cac{6}(1:3)
ans = 3×1
NaN NaN NaN
At least, it returns the same result.
Another approach is to use memmapfile(). I haven't tested it with a really large data file, but as far as I can interpret the result of profile('-memory','on'), it works as expected. (Actually better than expected, so the measurement is probably not trustworthy.) However, be warned that it changes the data file on disk, and it uses an undocumented feature of strrep().
cac = cssm_memmap( 'cssm_large.txt' );
cac{1}(1:3)
ans = 3×1 cell array
{'A'} {'A'} {'A'}
cac{2}(1:3)
ans = 3×1
648387 648387 648387
cac{6}(1:3)
ans = 3×1
NaN NaN NaN
I assume that something like cssm_loop(), which parses the file chunk by chunk, is possible:
cac = cssm_loop( 'cssm_large.txt' );
whos cac
  Name      Size        Bytes  Class    Attributes
  cac       480x9     7073280  cell
cac{1,1}(1:3)
ans = 3×1 cell array
{'A'} {'A'} {'A'}
cac{1,2}(1:3)
ans = 3×1
648387 648387 648387
cac{1,6}(1:3)
ans = 3×1
9999 9999 9999
Two tasks remain. The small one is to concatenate the result, cac. The somewhat bigger one is to handle varying row lengths. (In my case the row length is 48, the same row repeated.) The chunk parsed by textscan() must end with a newline. Read a large chunk, e.g. 1e9 bytes, cut off the characters after the last newline, and save them for the following iteration.
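The chunk-wise idea could be sketched roughly as follows. This is not cssm_loop() itself (its code is not shown in this thread), only an illustration under stated assumptions: the file is plain ASCII, every row has nine fields, and the file ends with a row delimiter. Because the raw file contains "#@#@#" rather than newlines, the sketch cuts each chunk after the last "#@#@#". The demo file, the tiny chunk size and all variable names are hypothetical.

```matlab
% Demo file (assumption): 1000 copies of the first example row.
row   = char("A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#");
fname = [tempname, '.txt'];
fid   = fopen(fname, 'w');
fwrite(fid, repmat(row, 1, 1000));
fclose(fid);

rowDelim   = '#@#@#';
chunkChars = 1000;           % tiny on purpose so that chunks split rows; use e.g. 1e8 for real files
parts = cell(0, 9);          % one textscan() result (1x9 cell) per parsed chunk
carry = '';                  % incomplete trailing row saved for the next iteration

fid = fopen(fname, 'r');
while true
    buf = fread(fid, [1 chunkChars], '*char');
    if isempty(buf), break; end
    buf = [carry, buf];
    pos = strfind(buf, rowDelim);
    if isempty(pos)
        carry = buf;         % no complete row in the buffer yet, keep reading
        continue
    end
    cut   = pos(end) + numel(rowDelim) - 1;   % last character of the last complete row
    carry = buf(cut+1:end);
    txt   = strrep(buf(1:cut), rowDelim, newline);
    txt   = strrep(txt, "'~'", ",");
    parts(end+1, :) = textscan(char(txt), '%s%f%f%f%f%f%f%f%f', 'Delimiter', ',');
end
fclose(fid);
delete(fname);

% Concatenate the per-chunk results column by column.
cac = cell(1, 9);
for k = 1:9
    cac{k} = vertcat(parts{:, k});
end
```

Anything left in carry after the loop is an unterminated final row (or a trailing line break) and would need separate handling.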
The safest solution, I guess, is to
  • Use GSplit to split the huge file into a number of files of about 1e9 bytes each. (Cut after a row delimiter, i.e. after "#@#@#".)
  • Put all the small files into a separate folder.
  • Use datastore() to access the data.
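One way the last step might look is sketched below. It assumes fileDatastore() with a custom read function, which is only one of several datastore types and not necessarily the one meant above; the folder, the file names and parseFile are hypothetical. Each split file is assumed to contain only complete rows.

```matlab
% Demo data (assumption): three small files, each holding 10 complete rows.
row    = char("A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#");
folder = tempname;                   % hypothetical folder for the split files
mkdir(folder);
for k = 1:3
    fid = fopen(fullfile(folder, sprintf('part%02d.txt', k)), 'w');
    fwrite(fid, repmat(row, 1, 10));
    fclose(fid);
end

% Custom reader: same delimiter replacement and textscan() call as in the workaround.
parseFile = @(f) textscan( ...
    char(strrep(strrep(fileread(f), "#@#@#", newline), "'~'", ",")), ...
    '%s%f%f%f%f%f%f%f%f', 'Delimiter', ',');

fds   = fileDatastore(folder, 'ReadFcn', parseFile);
parts = cell(0, 9);
while hasdata(fds)
    parts(end+1, :) = read(fds);     % only one file is held as text at a time
end

% Concatenate the per-file results column by column.
cac = cell(1, 9);
for k = 1:9
    cac{k} = vertcat(parts{:, k});
end
rmdir(folder, 's');                  % remove the demo folder again
```

For truly large data the per-file results would be processed or reduced inside the loop instead of being concatenated in memory.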
P.S. Characters are represented by two bytes in contemporary MATLAB, so the text of cssm.txt takes twice as much space in MATLAB as it does on disk.
chr = fileread( 'cssm.txt' );
sad = dir('cssm.txt');
sad.bytes
ans = 1154
whos chr
  Name      Size        Bytes  Class    Attributes
  chr       1x1154       2308  char
@Ricardo Lopez A. I'm interested in how these functions work with huge data files and would appreciate hearing about your experience using them.
Ricardo Lopez A. on 26 Apr 2021
Hi Per,
Again, thanks a lot for your help!
I am working towards a deadline, but once that is past, I will certainly test your code with some large text files to see how this works! For now I am just using your initial workaround since that is working.
I will let you know as soon as I test your code, thanks a lot again!

Release

R2021a
