How to read in large text file with special delimiters?

Question

Ricardo Lopez A. on 23 Apr 2021


https://it.mathworks.com/matlabcentral/answers/811720-how-to-read-in-large-text-file-with-special-delimiters

Commented: Ricardo Lopez A. on 26 Apr 2021

Accepted Answer: per isakson


Hi,

Is there a way to read data from a text file with the following format per row:

A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#

Thus, the column delimiter is '~', and the row delimiter is #@#@#.

Further, missing/null values are represented by two consecutive column delimiters '~''~', for example:

A'~'216772930'~'Birdbox'~''~'1'~'5'~''~''~''~''~''~''~''~''~''~''~''~''~''~'1'~'213'~'#@#@#

Is there any way to specify your own row and column delimiters to be able to read in this data?

Thanks a lot in advance!


Answer 1

per isakson on 24 Apr 2021


https://it.mathworks.com/matlabcentral/answers/811720-how-to-read-in-large-text-file-with-special-delimiters#answer_683410

Edited: per isakson on 24 Apr 2021


cssm.txt

"Is there any way to specify your own row and column delimiters to be able to read in this data?" No, I don't think so.

How large is the text file? I assume it fits in a fraction of the physical memory (RAM) of your computer.

I assume that the column delimiter is the three-character vector '~' (quote, tilde, quote).

Workaround

Read the entire file into a character vector.
Replace "#@#@#" with newline.
Replace "'~'" with a comma (I use "" to avoid escape characters).
Parse the resulting string with textscan() (for some reason readtable doesn't take strings).

Demo

%%
chr = fileread( 'cssm.txt' );
%%
str = strrep( chr, "#@#@#", newline );
str = strrep( str, "'~'", "," );
%%
cac = textscan( str, '%s%f%f%f%f%f%f%f%f', 'Delimiter',',' );
cac{1}(1:3)
ans = 3×1 cell array
    {'A'}
    {'A'}
    {'A'}
cac{2}(1:3)
ans = 3×1
      648387
      648387
      648387
cac{6}(1:3)
ans = 3×1
   NaN
   NaN
   NaN

where cssm.txt contains twenty-five copies of your first example in a single row.
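The format string in the demo treats every column after the first as numeric, which fits the first example row. The second example in the question has text ('Birdbox') in the third column, so a variant that keeps every field as text and converts selected columns afterwards may be easier to adapt. The following is only a sketch under assumptions: the second sample row is a shortened, hypothetical version of the one in the question, every row is assumed to have the same number of fields, and the whole text is assumed to fit in memory.

```matlab
% Two rows in the question's format, 9 fields each.
% The second row is a shortened, hypothetical variant of the 'Birdbox' example.
raw = "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#" + ...
      "A'~'216772930'~'Birdbox'~''~'1'~'5'~''~''~'#@#@#";

% Split into rows on the row delimiter; drop the empty piece after the last delimiter.
rows = split(raw, "#@#@#");
rows = rows(strlength(rows) > 0);

% Split every row on the three-character column delimiter.
% For an N-by-1 string array, split returns an N-by-M string array,
% provided every row yields the same number of pieces.
fields = split(rows, "'~'");

% Convert a column known to be numeric; empty fields become NaN.
col2 = str2double(fields(:, 2));
col6 = str2double(fields(:, 6));
```

Keeping everything as strings costs more memory than textscan() with a numeric format, so this is better suited to files of moderate size or to individual chunks of a larger file.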


per isakson on 25 Apr 2021

Edited: per isakson on 25 Apr 2021


"work with some even larger text files"

I wonder whether cssm_tweaked() uses memory more efficiently. (I've excluded the strings.)

cac = cssm_tweaked( 'cssm.txt' );
cac{1}(1:3)
ans = 3×1 cell array
    {'A'}
    {'A'}
    {'A'}
cac{2}(1:3)
ans = 3×1
      648387
      648387
      648387
cac{6}(1:3)
ans = 3×1
   NaN
   NaN
   NaN

At least, it returns the same result.

Another approach is to use memmapfile(). I haven't tested it with a really large data file, but as far as I can interpret the result of profile('-memory','on'), it works as expected. (Actually better than expected, i.e. the result is probably too good to be true.) However, be warned that it changes the data file on disk, and it relies on an undocumented feature of strrep().

cac = cssm_memmap( 'cssm_large.txt' );
cac{1}(1:3)
ans = 3×1 cell array
    {'A'}
    {'A'}
    {'A'}
cac{2}(1:3)
ans = 3×1
      648387
      648387
      648387
cac{6}(1:3)
ans = 3×1
   NaN
   NaN
   NaN

I assume that something like cssm_loop() is possible:

cac = cssm_loop( 'cssm_large.txt' );
whos cac
  Name        Size              Bytes  Class    Attributes

  cac       480x9             7073280  cell               
cac{1,1}(1:3)
ans = 3×1 cell array
    {'A'}
    {'A'}
    {'A'}
cac{1,2}(1:3)
ans = 3×1
   648387
   648387
   648387
cac{1,6}(1:3)
ans = 3×1
   9999
   9999
   9999

Two tasks remain. The small one is to concatenate the result, cac. The somewhat bigger one is to handle varying row lengths. (In my case the row length is 48, the same row repeated.) The chunk parsed by textscan() must end with a newline. Read a large chunk, e.g. 1e9 bytes, cut off the characters after the last newline, and save them for the following iteration.
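The chunk-and-carry idea described above could look roughly like the sketch below. It is not the cssm_loop() function from this comment (whose source is not shown here), only an illustration under assumptions: the text is single-byte (ASCII), all columns after the first are numeric as in the first example, and the variable names (chunkSize, leftover, parts) are made up for the example. It cuts each chunk at the last complete row delimiter rather than at a newline, since the delimiter has not been replaced yet at that point.

```matlab
% Build a small test file: 24 copies of the 48-character example row.
row   = "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#";
fname = [tempname, '.txt'];
fid   = fopen(fname, 'w');
fwrite(fid, char(join(repmat(row, 1, 24), "")), 'char');
fclose(fid);

rowDelim  = '#@#@#';
chunkSize = 100;   % deliberately tiny so rows straddle chunks; far larger for real data
parts     = {};    % one textscan() result per chunk
leftover  = '';    % incomplete trailing row, carried to the next iteration

fid = fopen(fname, 'r');
while ~feof(fid)
    buf = [leftover, fread(fid, [1, chunkSize], '*char')];
    ix  = strfind(buf, rowDelim);
    if isempty(ix)
        leftover = buf;          % no complete row yet, keep reading
        continue
    end
    cut      = ix(end) + numel(rowDelim) - 1;   % end of the last complete row
    leftover = buf(cut+1:end);
    txt = strrep(strrep(buf(1:cut), rowDelim, newline), '''~''', ',');
    parts{end+1} = textscan(txt, '%s%f%f%f%f%f%f%f%f', 'Delimiter', ','); %#ok<AGROW>
end
fclose(fid);
delete(fname);

% Concatenate one column across all chunks (the "small task").
tmp  = cellfun(@(c) c{2}, parts, 'UniformOutput', false);
col2 = vertcat(tmp{:});
```

If leftover is not empty after the loop, the file ended with an incomplete row, which would need separate handling.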

The safest solution, I guess, is to

Use GSplit to split the huge file into a number of files of about 1e9 bytes each. (Cut after newline, i.e. after "#@#@#".)
Put all the small files into a separate folder.
Use datastore() to access the data.
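As a rough illustration of the last step, the sketch below uses fileDatastore with a custom read function. That specific choice is an assumption on my part; the comment only says datastore(). The example also assumes that each small file ends on a complete row, that the files are plain ASCII, and that the columns match the first example row. The folder and file names are created just for the demonstration.

```matlab
% Create a temporary folder with two small files, three example rows each.
folder = tempname;
mkdir(folder);
row = "A'~'648387'~'3238157'~'9'~'20'~''~'14'~''~'#@#@#";
for k = 1:2
    fid = fopen(fullfile(folder, sprintf('part%d.txt', k)), 'w');
    fwrite(fid, char(join(repmat(row, 1, 3), "")), 'char');
    fclose(fid);
end

% Read function: the same replace-then-textscan workaround, applied per file.
parseFile = @(f) textscan( ...
    strrep(strrep(fileread(f), '#@#@#', newline), '''~''', ','), ...
    '%s%f%f%f%f%f%f%f%f', 'Delimiter', ',');

% One read() call returns the parsed content of one file.
fds   = fileDatastore(folder, 'ReadFcn', parseFile, 'FileExtensions', '.txt');
nRows = 0;
while hasdata(fds)
    cac   = read(fds);
    nRows = nRows + numel(cac{2});
end

rmdir(folder, 's');
```

Only one file is held in memory at a time, so the peak memory use is governed by the size chosen when splitting the original file.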

P.S. Characters are represented by two bytes each in contemporary MATLAB. The text of cssm.txt therefore takes twice as much space in MATLAB as it does on disk.

chr = fileread( 'cssm.txt' );
sad = dir('cssm.txt');
sad.bytes
ans = 1154
whos chr
  Name      Size              Bytes  Class    Attributes

  chr       1x1154             2308  char               

@Ricardo Lopez A. I'm interested in how these functions work with huge data files and would appreciate hearing about your experience using them.

Ricardo Lopez A. on 26 Apr 2021

Hi Per,

Again, thanks a lot for your help!

I am working towards a deadline, but once that is past, I will certainly test your code with some large text files to see how this works! For now I am just using your initial workaround since that is working.

I will let you know as soon as I test your code, thanks a lot again!
