Finding problem row of data with TabularTextDatastore

16 visualizzazioni (ultimi 30 giorni)
I'm using a tall array / datastore to process large csv file. Sometimes the input csv files have the header line repeated in the middle of the file. Using the following code:
clear;
ds = datastore('C:\temp\data.csv');
tt = tall(ds);
x = tt.Hours(2);
gather(x)
gives
Evaluating tall expression using the Local MATLAB Session:
- Pass 1 of 2: 42% complete
Evaluation 20% complete
Error using matlab.io.datastore.TabularTextDatastore/readData (line 81)
Mismatch between file and format character vector.
Trouble reading 'Numeric' field from file (row number 2530, field number 2) ==> Hours;MID;ID1
Description;ID2;ID3;PN;ID4;Group;Model;Description;Level Number;Type;Occurrence Count
(Incremental);Timestamp;Location;City;State;Country;Software\n
Learn more about errors encountered during GATHER.
Error in matlab.io.datastore.TabularDatastore/read (line 120)
[t, info] = readData(ds);
Error in tall/gather (line 50)
[varargout{:}] = iGather(varargin{:});
Error in tallArray (line 5)
gather(x)
Caused by:
Reading the variable name 'Hours' using format '%f' from file: 'C:\temp\data.csv' starting at offset 363378899.
The errant line was on row (including header) 1568509 of 3734377. I was only able to find out errant row number out by hunting and pecking, i.e. deleting/re-adding rows and trying again.
My question is: How can I find the errant row more easily? Is there a way to use the "row number" given (2530), the given offset (363378899) to figure out where in the large csv the error is (1568509) ?
additional info:
ds =
TabularTextDatastore with properties:
Files: {
'C:\temp\data.csv'
}
FileEncoding: 'UTF-8'
AlternateFileSystemRoots: {}
ReadVariableNames: true
VariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
Text Format Properties:
NumHeaderLines: 0
Delimiter: ';'
RowDelimiter: '\r\n'
TreatAsMissing: ''
MissingValue: NaN
Advanced Text Format Properties:
TextscanFormats: {'%q', '%f', '%f' ... and 17 more}
TextType: 'char'
ExponentCharacters: 'eEdD'
CommentStyle: ''
Whitespace: ' \b\t'
MultipleDelimitersAsOne: false
Properties that control the table returned by preview, read, readall:
SelectedVariableNames: {'SerialNumber', 'Hours', 'MID' ... and 17 more}
SelectedFormats: {'%q', '%f', '%f' ... and 17 more}
ReadSize: 20000 rows
  1 Commento
Gunes ERTUNC
Gunes ERTUNC il 5 Feb 2023
Catching possible missing entries via 'TreatAsMissing' helped for me. You can try as follows.
ds = datastore('C:\temp\data.csv','TreatAsMissing',{'-','NA','');

Accedi per commentare.

Risposte (0)

Categorie

Scopri di più su Large Files and Big Data in Help Center e File Exchange

Prodotti


Release

R2018a

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by