How to loop fopen/fread/fseek to analyze large binary files in shorter segments ?

I have a large bin file (10Gb) which contains binary data from 16 channels in a single array, plus one value for the time variable (basically a single sample is made by 17 binary values). The size of each sample should be 8bit.
I am already able to extract a segment of this file and analyze it channel by channel. I achieved this by loading the all file, reading it and creating a sub-array sized properly.
[logfname, pathname] = uigetfile('*.mat','Pick a log file');
logfullpathname = [pathname logfname];
load(logfullpathname);
datafullpathname = [pathname datafilename];
FID = fopen(datafullpathname,'r');
fwrite(FID,[(start_seg*60*Fs):(end_seg*60*Fs-1)]); %Fs = sampling freq = 20Khz (20000)
%%load selected data segment and select channel to analyze
currData = fread(FID,'double'); %%<- THIS IS THE COMMAND THAT SLOWS EVERYTHING DOWN
time = currData((1:(n_channels+1):end), 1);
SIGNAL = currData(((activechannel+1):(n_channels+1):end), 1);
trace = [time SIGNAL];
crop = trace((start_seg*60*Fs):(end_seg*60*Fs-1),:); % select data range of active channel to analyze
seg = transpose(crop(:, 2)); % should output column 2 values (SIGNAL) within the time range defined by "crop"
But this takes a minute or so for each segment (each segment is 1 minute).
what I would like to do is just loading a segment at a time and loop the analysis (the analysis per se does not take long).
however I am stuck: I am using ftell(FID) to get the indexes of my bin file just for testing purposes but it gives me always the same number and it does not loop.
this is my code for a 1hour recording file in which I select a segment from the 45th to 46th minute, I have tried both a "while" and a "for" cycle:
this is the WHILE cycle
%%WHILE CYCLE
file=('C:\Users\Admin\Documents\MATLAB\EOD examples\Examples 16ch\copy.mat');
fileID = fopen(file,'r');
feof(fileID)
index = 0;
n_channels = 16;
Fs = 20000; %sampling frequency
seg = Fs*60*(n_channels+1); %in the bin file data are organized in a vector [t(i),Ch(1)..Ch(i)], with t = time
size_of_double = 8;
while ~feof(fileID)
fseek(fileID,index*size_of_double,'bof'); %this should look for the first data point in the file
position = ftell(fileID); %this should report the current index
position
currData = fread(fileID, seg,'double');
currData
index=index+seg;
end
this is the FOR cycle %% FOR CYCLE
file=('C:\Users\Admin\Documents\MATLAB\EOD examples\Examples 16ch\copy.mat');
Fs = 20000;
n_channels = 16; %number of active channels
index=2; %index of the first data point
segsize = 60*Fs*(n_channels+1); % this is the length of 1 min segment, 20.000 samples per second
% get filesize
fileID = fopen(file,'r');
fseek(fileID, 0, 1); % move to end
file_length_in_byte = ftell(fileID); % read end position
size_of_double = 8;
file_length_in_double_elements = file_length_in_byte / size_of_double;
feof(fileID);
%step = 1; % in elements
for i = 1:segsize:file_length_in_double_elements-1; % until end of file
fileindex = (i-1)*segsize*size_of_double; % this is where we wanna go
fseek(fileID,fileindex,'bof'); % go to i-th data point
current_index = ftell(fileID); % get file index
current_index % this is where we went
currData = fread(fileID, 2,'double');
currData % here is what we read
end
fclose(fileID);
Basically, before adding the analysis part I would like to have as an output the indexes in the file corresponding to the starting points of each segment. I am stuck and I do not know where the issue is: the first cycle reports just a sequence of values that does not look to have the right interval (which should be 20000*60*17, if the segment is 1 minute long). The second just prompts one number, as if the loop would just run one time. Thanks for your help in advance !

5 Commenti

Couple comments and some questions...
...
FID = fopen(datafullpathname,'r');
fwrite(FID,[(start_seg*60*Fs):(end_seg*60*Fs-1)]);
You opened the file for read access then try to write to it. Whassup w/ that? What are you trying to do here?
Your description of the file format says "The size of each sample should be 8bit." but then you write
currData = fread(FID,'double'); %%COMMAND THAT SLOWS EVERYTHING DOWN
which reads the file as 8- BYTE floating-point doubles.
So which is it, a floating point double or an 8-bit A/D sampled data stream?
Also, the reason that fread statement takes a while is that it is reading the entire file; if you want to process a subset of the file (which is easily-enough done once we ascertain the actual data format), you have to specify how many elements you wish read with the optional second sizeA parameter.
while ~feof(fileID) fseek(fileID, ,index*size_of_double,'bof'); %this should look for, the first data point in the file position = ftell(fileID); %this should report the current index position
That code risks infinite loop. You are seeking to a constant location relative to the beginning of the file. If feof is not true before the loop and if it is not set by the first fseek then you infinite loop.
POSIX specifically says that fseek clears feof status. POSIX permits seeking past end of file because POSIX permits extending files by seeking and then writing.
But MATLAB does not follow POSIX behavior in this regard, and positions to end of file instead. But whether it would set feof in that case is not as clear.
So your code is either infinite loop or else relies upon an edge case together with relying on the file not to be too large...
Thanks for your comments! @dpb: some lines here and there might be leftovers of something that was changed and simply I did not remove completely. I am new to Matlab, I apologize if some lines make no sense. It was 8 BYTES, not bits, I correct the typo sorry.
@Walter: I tried to apply what I found on the online documentation material. Probably I did apply it wrong. If that's the case I would appreciate if you could underline the mistakes and provide eventually a solution for them. This while cycle does not go to infinite, I tried. It just gives as output a list of ca. 5040 numbers (if I am not wrong). However, this list should be of 60 numbers, as I am trying to divide 1 hour recording file in 1 minute segments (for each minute I have 20.000*17*60 samples: sampling rate = 20.000 Hz, with 16 channels plus time = 17, 1 min = 60 sec).
I think in principle these cycles should work but perhaps there is something wrong I am doing with the size of the segment (in samples)/size of the file (in bytes).
Walter, I do not really need to use the while loop... I just tried to approach two different solutions for the same problem.. I will choose what works best :) thanks anyway
@Livio Oboti: What exactly is the problem?
How to loop fopen/fread/fseek to analyze large binary files in
shorter segments?
This should be easy. With segsize = 60*Fs*(n_channels+1) it should be trivial to use fseek(fid, n * segsize * 8) to move the file pointer to the wanted position.
I know :) but if I run my code (I think is the same as you wrote), I don't get 60 numbers as output, just way more.
this is my code
file=('C:\Users\Admin\Documents\MATLAB\EOD examples\Examples 16ch\copy.mat');
fileID = fopen(file,'r');
feof(fileID)
index = 0;
n_channels = 16;
Fs = 20000; %sampling frequency
seg = Fs*60*(n_channels+1); %in the bin file data are organized in a vector [t(i),Ch(1)..Ch(i)], with t = time
size_of_double = 8;
while ~feof(fileID)
fseek(fileID,index*size_of_double*seg,'bof'); %this should look for the first data point in the file
position = ftell(fileID); %this should report the current index
position
currData = fread(fileID, seg,'double');
currData
index=index+seg;
end

Accedi per commentare.

 Risposta accettata

file=('C:\Users\Admin\Documents\MATLAB\EOD examples\Examples 16ch\copy.mat');
fid=fopen(file,'r');
n_ch=16; % number channels (exclusive of timestamp)
Fs = 20000; % sampling frequency
index = 0;
while ~feof(fid)
index=index+1; % minute number
currData=fread(fid,[n_ch+1,Fs*60],'double').'; % one minute of data in 2D array
% NB: must orient [time+N Channels X TimeSteps] and then transpose to
% order data properly by column-major order internally.
disp(sprintf('Processing minute: %d',index))
% process one minute worth of data here
...
end
fid=fclose(fid);
The key is to just read 20k*60 by 1+16 chnls each pass through the loop and tell fread to return in that orientation. Time vector will be currData(:,1), the channels currData(:,2:end)
Might look at memmapfile object as a higher-level abstraction if intending to read other than sequentially.
ADDENDUM/ERRATUM
Corrected initial above [sizeA} argument order to properly reflect data are in channel-order by time step in the file so must read the partial file indicating the nCH+1 dimension first for the desired number of time steps, then transpose to have as column-wise in internal data storage order.
If were reading the whole file, Matlab would do that automagically but a portion of the file must be treated as a separate grouping of the time plus data channels.

16 Commenti

thanks DB, running your code just gives a second as an output. Also plotting the data just gives a line which means currData is just taking the time values I guess?... I would like to actually see the starting point of each minute segment in a hour long recording (the file is a bin file loaded by loading a corresponding mat file, sorry the first line is confusing).
you are right about sizing the segment as you did, though, is also what I was doing before.
"your code just give a second as an output."
Well, yes, if you're going to process a file piecewise, then that's the logical thing to do... :)
"I would like to actually see the starting point of each minute segment in a hour long recording"
Well, you can compute that directly as
fseek(fid,0,'eof') % position to EOF
ix=1:Fs*60.*(n_ch+1)*8:ftell(fid); % compute segment positions
Or, you can do it without even opening the file
d=dir(file); % get file info
ix=1:Fs*60.*(n_ch+1)*8:d.bytes; % compute segment positions
Of course, one can also recast the algebra to use the number of seconds in the datafile if that is known a priori and not have to query the file at all.
Or, as noted, you can use memmapfile and map the file into the array where you can simply use the indexing directly in terms of records where each record would be the one-second of data and quit counting bytes entirely.
Matlab has many higher-level abstractions to avoid having to do low-level manipulations such as this; productivity will increase markedly if spend some time learning how to make use of them.
there is something wrong... it should stop at 60 secs but it goes up to 40k and more...
the second code just makes a variable with the value '1' in it (ix=1)
What is "it"???
Well, looks ok to me for the algebra but didn't have a file to actually play with so let's just make up what should be:
bytes.sec=Fs*(n_ch+1)*8; % number of bytes/second
bytes.min=60*bytes.sec; % number of bytes/minute
Now let's compute the beginning position for a number of minutes...
N=10; % say 10 minutes
ix=0:bytes.min:N*bytes.min; % for first ten minutes
>> fprintf('%10d\n',ix) % show what are exactly
0
163200000
326400000
489600000
652800000
816000000
979200000
1142400000
1305600000
1468800000
>>
Obviously for any particular minute just use
N*bytes.min
So to follow up. I am trying the following code.
%%THIS SECTION JUST OPENS THE FILE
Fs = 20000;
n_channels = 16; %number of active channels
index=2; %index of the first data point
segsize = 60*Fs*(n_channels+1); % this is the length of 1 min segment, 20.000 samples per second, this is "step" in your script, I think
[logfname, pathname] = uigetfile('*.mat','Pick a log file');
logfullpathname = [pathname logfname];
load(logfullpathname);
datafullpathname = [pathname datafilename];
FID = fopen(datafullpathname,'r');
% THIS SECTION GETS THE FILES SIZE IN BYTES and SAMPLES
fseek(FID, 0, 1); % moves to end of file
file_length_in_byte = ftell(FID); % reads end position
size_of_double = 8;
file_length_in_double_elements = file_length_in_byte / size_of_double;
feof(FID);
% THIS IS THE CYCLE THAT SHOULD READ 1 MIN SEGMENTS OF THE DATA FILE
for i = 1:segsize:file_length_in_double_elements; % until end of file
fileindex = (i)*Fs; % this is where we wanna go
fseek(fileID,fileindex,'bof'); % go to i-th data point
currData=fread(fileID,[Fs*60, (n_channels+1)],'double');
fileindex % these are the indexes of each 1 minute segment that should be read in the recording
end
the code should just give the start indexes of each segment and load each time the proper amount of data in currData. Now the loading, does not work. I am actually wondering whether would be the case to select some channels already here or first load a segment (1min) with 17 channels and then select properly...
To read 1-minute segments of the file sequentially do NOT even think about computing locations or moving the file pointer...just read the proper number of elements via the [sizeA] parameter as I demonstrated.
All you're going to do otherwise is add overhead and screw up the position pointer to either cause a read failure entirely or get the wrong data associated with the returned array by being off. The file pointer will be precisely where it needs to be after each read without your help.
ONLY if you are going to try to read from somewhere other than the first entry and skip time periods is there any reason whatsoever to do anything at all with the file pointer and I showed above how to compute that for each and every minute totally independently of the file pointer positioning.
IF you are intending something like the later, then I again reiterate to look at memmapfile and organize the file object as the array and you can then simply reference the desired minute directly without byte counting.
You have at your disposal a 4G high-level rapid development toolset with many features to do such manipulations; take advantage of it and use them instead of trying to write C-style procedural code in Matlab; that's totally counterproductive.
I apologize for my ignorance but I do not follow completely. I thought I applied the sizing as you suggested. I am pretty sure I am doing something wrong, I am not saying you are not right. I just did not get what I should change. Thanks for being so patient :)
Why can't you just use the Answer already provided?
I didn't try to figure out just what you had done so not sure where the actual error may be but if, as it appears, the intent is to just go through the file from the beginning a minute at a time, then all the stuff about the file pointer machination is simply wasted motion--there's no need for it at all.
"running your code just gives a second as an output"
20kHz= 20,000 samples/sec * 60 sec --> 1,200,000 samples/1 min
currData should have
size(currData) --> 1.2E6,17
is that not what you get?
To plot, remember the time data is the first column; use
plot(currData(:,1),currData(:,2:end))
DOH! I just see I used "second" in the output display instead of "minute". What was I thinking!!?? :)
It's written for 60 sec at Fs per iteration even though for some reason I wrote "second" in the output. I have no idea why; senior moment.
One second of sample time would just be [20000,17] for sizeA.
"I didn't try to figure out just what you had done so not sure where the actual error may be..."
for i=1:...
fileindex = (i)*Fs; % this is where we wanna go
fseek(fileID,fileindex,'bof'); % go to i-th data point
...
NO!!!! 20,000 bytes from the beginning of the file is in the middle of somewhere, not the beginning of the file nor the beginning of any group of data; there are 17*8=136 bytes per time sample so 2500/136=18.3824... or 18 and some fractional set of seconds into the file.
As noted multiple times before, to read from the file sequentially YOU DO NOT NEED TO COMPUTE WHERE THE FILE POINTER IS SUPPOSED TO BE-- DON'T EVEN TRY!!!! fread is smart enough on its own to know where it quit the last time and as long as you don't muck it up, it'll be there waiting for you for the next call.
ONLY if you're adamant about skipping around in the file from one place to another is there any reason whatsoever to even care what the file pointer is. And for that I showed you straightforward code to do it that is much simpler way to approach it.
you are right, I think last night I was just dumb. I went home and tried your script on a dummy file (I generated a file of the same dimensions of my recording) and it seem to work ! I wonder why it didn't on the real one (same N of channels, same bytes, same everything). I will check soon and let you know ! Also you were right about the memmapfile command, is very useful, although for my purposes not necessary as I will just access the file in consecutive segments of the same length (if I got it right :)). Hope to update soon with good news ! I will post the working script here once I will get it running :)
Sometimes when one gets on a given track it's hard to get back...
Good luck, you should be on your way with the base code provided.
So here is the update: the script seems to run correctly. I had to change the format of the fread "window" to
currData=fread(FID,[Fs*60*(n_channels+1),1],'double');
in order to give each segment the correct sizing. One minor thing: at the end of the cycle the scripts tries to read an additional minute which is not present and it stops because of "exceeding array bounds".. I think this might happen because the while cycle does not know a priori when to stop. Is there a way to prevent this ? Should I use a for cycle instead ? Thanks a lot for helping so far !
OK I just added this line before the end of the while loop and it works !
if index > 59, break, end
thanks a lot DB !
OH! One oversight/mistake on my part...since the data are written sequentially by time step, reverse the order of the indices in the [sizeA] argument and transpose to get the 2D array in proper sequence in memory.
currData=fread(FID,[(n_channels+1),Fs*60],'double').';
The
while ~feof(fid)
loop construct I used should work to read the full file. If you want only a counted number of time steps then you'll have to ensure there are sufficient data in the file to read.
that's what indeed I did, as in my comment. thanks :)

Accedi per commentare.

Più risposte (0)

Categorie

Richiesto:

LO
il 13 Set 2018

Modificato:

dpb
il 16 Set 2018

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by