Reading in ugly data files

Question

Ryan Egan on 26 Oct 2012

Direct link to this question:

https://it.mathworks.com/matlabcentral/answers/51986-reading-in-ugly-data-files

I created a simple data processing script using importdata. I am trying to process a new txt file using this script, but the structure of the data is very different, and importdata is getting tripped up somehow. I've decided to try and change the program a bit to use something more flexible, like textscan.

First, what do you recommend for reading in text file data that has both strings and numerical data? Is textscan really the best option?

Second, how do I deal with HUGE swaths of empty data cells in this particular text file?

Edit: I know it says not to do this, but it has become obvious that I should say I am very new to MATLAB, so I don't really know what you mean when you say "EmptyValue" and "TreatAsEmpty." How do I properly use these parameters when calling the textscan function?

Comments

Ryan Egan on 29 Oct 2012

Yes, I did read the documentation, and I don't understand it very well at all.

I suppose I expected an answer that would provide syntax that doesn't get tripped up by empty cells or the locations of them and would just take them in so I can then manipulate the resulting matrices and remove the empty data.

I tried reading a small text file with textscan and it was fine. But this is a larger file with large inconsistencies in where the data appear. None of the examples I found on Answers applied to my situation, where the locations of the empty data are inconsistent. Like you said, all rows need to have the same format or it becomes a bit tricky.

If I look at the data file in an Excel spreadsheet, the first 10 rows or so have null value positions that are different from the other 100 or so rows of data. But within the 10 rows, and within the 100 rows, the null locations are consistent.

I'm not able to fit the first ten lines on here, but here is one complete line, I think.

HWFC_Car_P New,1,1,<?xml version="1.0"?>\n<Clock xmlns:dt="urn:schemas-microsoft-com:datatypes"><Description dt:dt="string">E-Prime Primary Realtime Clock</Description><StartTime><Timestamp dt:dt="int">0</Timestamp><DateUtc dt:dt="string">2012-10-19T17:02:29Z</DateUtc></StartTime><FrequencyChanges><FrequencyChange><Frequency dt:dt="r8">2337939</Frequency><Timestamp dt:dt="r8">3612440190334</Timestamp><Current dt:dt="r8">0</Current><DateUtc dt:dt="string">2012-10-19T17:02:29Z</DateUtc></FrequencyChange></FrequencyChanges></Clock>\n,59.827,1,372845373,10-19-2012,13:02:30,5:02:30 PM,1,NULL,NULL,NULL,NULL,33.jpg,p,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,,NULL,NULL,NULL,slowTrialProcedure,2782,123,SlowPracticeList,33.jpg,1,1,1,1,p,-999999,18,537406,p,1174,538580,0,p,,0,0,p,,0,Configuration2,0,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL,NULL

Matt Kindig on 29 Oct 2012

Hi Ryan,

From this line, what data do you need to extract? I'm thinking that regular expressions (regexp() function) would be better for you. In my experience working with irregularly structured text files, regexp() is more flexible/efficient than textscan() or the like. From this line, what form of the output data do you expect?

Ryan Egan on 30 Oct 2012

Well, that's just the thing. I don't need ANY data from the first 10 rows (this is consistent across all the data files I would be reading into MATLAB). So if I could just figure out a way for MATLAB to skip the first 10 rows and then start reading, while ignoring empty cells, that would be perfect.

Of any row in general, I would need values from 3 columns, two of which are numbers and one of which is a string. The remaining 96 columns are not important.


Answer 1 (accepted answer)

Argon on 30 Oct 2012

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/51986-reading-in-ugly-data-files#answer_63738

I don't know how efficient that is, but I would try something like this:

preallocate your data variable
open the text file with fopen
start a loop
read line by line with fgetl
ignore the first 10 lines
use something like regexp(line, ',', 'split')
extract the columns you need, apply trimming and type conversion, ignore a cell if it's empty, and do any other post-processing of the cell values
end loop
call fclose
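The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the file contents, the column positions in cols, the number of skipped lines, and the variable names rt, img and score are all made up here, and the arrays are grown inside the loop for brevity instead of being preallocated.

```matlab
% Build a small stand-in file: 2 header lines to skip, then 3 data rows.
% (Hypothetical contents; the files in the question would need nSkip = 10.)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, 'header one\nheader two\n');
fprintf(fid, 'A,1.5,NULL,33.jpg,10\n');
fprintf(fid, 'B,NULL,NULL,34.jpg,20\n');
fprintf(fid, 'C,2.5,,35.jpg,30\n');
fclose(fid);

nSkip = 2;         % number of leading lines to ignore
cols  = [2 4 5];   % assumed positions: number, string, number

rt = []; img = {}; score = [];
fid = fopen(fname, 'r');
lineNo = 0;
line = fgetl(fid);
while ischar(line)                 % fgetl returns -1 at end of file
    lineNo = lineNo + 1;
    if lineNo > nSkip
        parts  = regexp(line, ',', 'split');   % one cell per field
        picked = strtrim(parts(cols));         % keep only the wanted fields
        rt(end+1, 1)    = str2double(picked{1});   % 'NULL' or '' becomes NaN
        img{end+1, 1}   = picked{2};
        score(end+1, 1) = str2double(picked{3});
    end
    line = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Rows with missing numbers can then be dropped afterwards, for example with keep = ~isnan(rt).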

1 Comment

Ryan Egan on 31 Oct 2012

This worked great! I was able to read in all the data I needed from a raw text file and do all the calculations I needed to do with them. I might have a bit more difficulty now that I am adding loops to read in multiple data files and put them all into the same variables, but I think I just need to learn a bit more about arrays. Thanks!
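The multi-file step mentioned in this comment could look roughly like the sketch below. It is not the code the poster ended up with; the two temporary files, their contents, and the names allNum and allStr are assumptions made for illustration. The idea is simply to read each file in turn and append its columns to variables that live outside the loop.

```matlab
% Create two small stand-in files (hypothetical contents).
files = {[tempname '.txt'], [tempname '.txt']};
rows  = {'1,a.jpg\n2,b.jpg\n', '3,c.jpg\n'};
for k = 1:numel(files)
    fid = fopen(files{k}, 'w');
    fprintf(fid, rows{k});        % fprintf expands the \n escapes
    fclose(fid);
end

allNum = [];   % numeric column from every file, stacked
allStr = {};   % string column from every file, stacked
for k = 1:numel(files)
    fid = fopen(files{k}, 'r');
    C = textscan(fid, '%f %s', 'Delimiter', ',');
    fclose(fid);
    allNum = [allNum; C{1}];      % append this file's numbers
    allStr = [allStr; C{2}];      % append this file's strings
    delete(files{k});
end
```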


Answer 2

per isakson on 26 Oct 2012

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/51986-reading-in-ugly-data-files#answer_63391

Edited: per isakson on 26 Oct 2012

textscan is a good alternative for "... both strings and numerical data".
With textscan, all data rows need to have the same format; otherwise it becomes a bit tricky.
The options EmptyValue and TreatAsEmpty will take care of "empty data cells".
HUGE means different things to different people. The number of empty cells shouldn't be a problem.
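Since the question asks how these two options are actually passed, here is a minimal sketch on made-up in-memory text, not on the asker's file. TreatAsEmpty names placeholder text (here 'NULL') that should count as an empty numeric field, and EmptyValue is the number returned for empty numeric fields. Both apply to numeric fields only.

```matlab
% Three made-up rows: number, number (sometimes missing), string.
txt = sprintf('1,NULL,a.jpg\n2,,b.jpg\n3,7.5,c.jpg\n');

C = textscan(txt, '%f %f %s', ...
             'Delimiter', ',', ...
             'TreatAsEmpty', {'NULL'}, ...  % text to regard as an empty field
             'EmptyValue', NaN);            % value stored for empty fields

first  = C{1};   % numeric column
second = C{2};   % numeric column, NaN where the field was NULL or blank
names  = C{3};   % cell array of strings
```

The NaN entries can then be located with isnan(second) and removed or ignored in later calculations.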


Answer 3

Kevin on 30 Oct 2012

Direct link to this answer:

https://it.mathworks.com/matlabcentral/answers/51986-reading-in-ugly-data-files#answer_63742

Edited: Kevin on 30 Oct 2012

I have been doing something similar recently.

I think textscan should work:

% Open the file (replace datapath with your file location).
fid = fopen(datapath);

% Skip the first ten lines (increase BufSize if it is not big enough).
% raw will contain the first ten lines; pos is the current position in the file.
[raw, pos] = textscan(fid, '%[^\n]', 10, 'delimiter', ',', 'BufSize', 100000);

% Now read in the first three columns and discard the rest of each line.
% Change the order of %f %f %s to match your data types.
data = textscan(fid, '%f %f %s %*[^\n]', 'delimiter', ',', 'BufSize', 100000);

% Close the file.
fclose(fid);

data should now contain the first three columns of your data.

Hopefully I have that correct!! No doubt there is a quicker way to do this.
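As one possible variation on this approach, textscan can skip the leading lines itself through the HeaderLines option, and fields that are not needed can be skipped with %*s in the format. The sketch below runs on made-up text with 2 header lines and 6 fields per row; the field positions and the use of 'NULL' as the placeholder are assumptions taken from the sample line in the question, not a tested recipe for the real files.

```matlab
% Made-up stand-in text: 2 header lines, then rows of 6 comma-separated fields.
txt = sprintf(['junk header 1\njunk header 2\n' ...
               'x,1.5,NULL,33.jpg,NULL,10\n' ...
               'y,2.5,NULL,34.jpg,NULL,NULL\n']);

% Keep fields 2 (number), 4 (string) and 6 (number); %*s skips a field.
fmt = '%*s %f %*s %s %*s %f';

C = textscan(txt, fmt, ...
             'Delimiter', ',', ...
             'HeaderLines', 2, ...          % the real files would use 10
             'TreatAsEmpty', {'NULL'});     % NULL in a numeric field becomes NaN
```

For a file with many unwanted columns, a run of skipped fields can be generated with repmat('%*s ', 1, n) instead of being typed out by hand.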

