Importing data using textscan from a large dataset

Question

Sebastiano delre il 26 Mag 2016

0
Link

Link diretto a questa domanda

https://it.mathworks.com/matlabcentral/answers/285917-importing-data-using-textscan-from-a-large-dataset

Commentato: Sebastiano delre il 30 Mag 2016

example.csv

I would like to import data from a csv file. The data looks like in the example attached. There are 14 variables, with a header in the first row. NA indicates empty values. I would like variables 1-11 and 13 to be numeric, whereas variables 12 and 14 to be string. Notice that in the example.csv file only variables 2, 3 and 13 are written as numeric. The file is very big (3.5 GB), I would like to import data efficiently. Thank you for your help.

4 Commenti
Mostra 2 commenti meno recentiNascondi 2 commenti meno recenti

Stephen23 il 26 Mag 2016

Modificato: Stephen23 il 26 Mag 2016

This question is a continuation of this discussion:

http://www.mathworks.com/matlabcentral/answers/285239-using-regexp-to-create-dataset

@Sebastiano delre: it is useful for us volunteers when you put links to earlier questions on the same topic, then we know what information and code you have already been given, what you have already tried, and what explanation you have given. It makes our job easier!

Sebastiano delre il 26 Mag 2016

Modificato: Sebastiano delre il 26 Mag 2016

Yes, it is. Nevertheless, things are a bit different. And the example file is significantly different.

Accedi per commentare.

Accedi per rispondere a questa domanda.

Answer 1

per isakson il 26 Mag 2016

0
Link

Link diretto a questa risposta

https://it.mathworks.com/matlabcentral/answers/285917-importing-data-using-textscan-from-a-large-dataset#answer_223521

Modificato: per isakson il 28 Mag 2016

Apri in MATLAB Online

Who created this file? I know there isn't a strict csv-standard. Anyhow after some trial and error, I came up with this format string on R2013b

>> str = '"7",746540138,9,"573348359","78599","1341119513","573346802","3","0","0","1341111281","-2,-1,-1",-1.33333333333333,"world, asia"';
>> cac = textscan( str, '"%f",%f,%f,"%f","%f","%f","%f","%f","%f","%f","%f","%[^"]",%f,"%[^"]"' )
cac = 
  Columns 1 through 9
    [7]    [746540138]    [9]    [573348359]    [78599]    [1.3411e+09]    [573346802]    [3]    [0]
  Columns 10 through 14
    [0]    [1.3411e+09]    {1x1 cell}    [-1.3333]    {1x1 cell}
>> cac{14}
ans = 
    'world, asia'

"The file is very big (3.5 GB)" &nbsp asks for a big enough physical memory (RAM).
add 'TreatAsEmpty','NA'
"%[^"]" because I failed to make %q work (with R2013b)
" I would like to import data efficiently" &nbsp I guess textscan is the most efficient way.
It should (my reading of the documentation) work to remove the commas, ",", from the format string and add 'Delimiter',','. You might want to try. However, I failed.

&nbsp

Continuation a day later:

The format string above returns error rather than empty for ,"",. Work around: Treating " as a whitespace character or replace it by space isn't feasible because of strings like "-2,-1,-1". Replacing "" by "NA" seems to work.

>> cac = cssm( 'example.csv' )
cac = 
    [9x11 double]    {9x1 cell}    [9x1 double]    {9x1 cell}
>> cac{1}(:,7)
ans =
         NaN
         NaN
   573315745
         NaN
         NaN
         NaN
   573346802
   573315745
         NaN

where

function        cac = cssm( filespec )
    str = fileread( filespec );
    str = strrep( str, '""', '"NA"' );
    %
    frm = '"%f",%f,%f,"%f","%f","%f","%f","%f","%f","%f","%f","%[^"]",%f,"%[^"]"';
    cac = textscan( str,  frm, 'HeaderLines',1, 'TreatAsEmpty',{'NA'}, 'CollectOutput',true );
end

This approach requires a large physical memory. However, mapreduce, Programming technique for analyzing data sets that do not fit in memory, which was Introduced in R2014b, might make it possible to modify the function, cssm, to run with less memory.

1 Commento
Mostra -1 commenti meno recentiNascondi -1 commenti meno recenti

Sebastiano delre il 30 Mag 2016

Thank you, this helps.

Accedi per commentare.

Importing data using textscan from a large dataset

4 Commenti
Mostra 2 commenti meno recentiNascondi 2 commenti meno recenti

Risposta accettata

1 Commento
Mostra -1 commenti meno recentiNascondi -1 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Community Treasure Hunt

Importing data using textscan from a large dataset

4 Commenti Mostra 2 commenti meno recentiNascondi 2 commenti meno recenti

Risposta accettata

1 Commento Mostra -1 commenti meno recentiNascondi -1 commenti meno recenti

Più risposte (0)

Vedere anche

Categorie

Tag

Community Treasure Hunt

4 Commenti
Mostra 2 commenti meno recentiNascondi 2 commenti meno recenti

1 Commento
Mostra -1 commenti meno recentiNascondi -1 commenti meno recenti