Read text file after a specific text line but avoiding only the next line

Question

Jorge Luis Paredes Estacio on 7 May 2023


https://it.mathworks.com/matlabcentral/answers/1959334-read-text-file-after-a-specific-text-line-but-avoiding-only-the-next-line

Commented: dpb on 8 May 2023

Hello, I am collecting the data that follow the line "# HHE HHN HHZ". In the examples below I copy only the first 3 rows after that line, although a real file can contain hundreds, and the position of the three columns can vary from file to file. I have written a script for one specific text file layout (see Example 1).

Example1:

# 
# 4. COMMENTS
# BASELINE CORRECTED
# 
# 5. ACCELERATION DATA
# HHE HHN HHZ
-0.02104708      -0.02134472       0.00412299
-0.00340606       0.08357343       0.02083563
-0.02940362       0.00093856       0.00505147

The script below handles one combination of columns. The combinations are defined as textline1, textline2 and so on, which are necessary so that the data can be rearranged into one fixed column order in the output:

textline1 = '# HNE HNN HNZ';
%First mixed data%
if index==0
    index = strcmp(tline,textline1);  %%EO NS UD
    if index ==1; index=1; end
elseif index ==1
    tmp=sscanf(tline,'%f %f %f %f');
    tmp1 = [tmp(1); tmp(2); tmp(3)];   % rearrange to EO=X NS=Y UD=Y
    Output = [Output; tmp1'];
end

However, some records use the following text format, where there is a "T" line before the data to be collected (after "# HHE HHN HHZ"):

# 
# 4. COMMENTS
# BASELINE CORRECTED
# 
# 5. ACCELERATION DATA
# HHE HHN HHZ
T
     -0.02104708      -0.02134472       0.00412299
     -0.00340606       0.08357343       0.02083563
     -0.02940362       0.00093856       0.00505147

Any help fixing the code for that case would be appreciated. Thank you very much.

11 Comments

Jorge Luis Paredes Estacio on 7 May 2023

Edited: per isakson on 8 May 2023


Thank you for your response. It works perfectly when I run it alone. However, when I put the code together with the rest of my script there are some issues: the output is still empty. The whole function is shown in detail below, where "filename" is the name of the file. The function extracts 3 pieces of information for each record: fs, Output and STATION. Output is the acceleration record extracted after the "# HHE HHN HHZ" line and the "T" line.

function [fs, Output, STATION] = import_from_CISMID_new4(filename)
textline1 = '# HNE HNN HNZ';
textline2 = '# HNE HNZ HNN';
textline3 = '# HNN HNE HNZ';
textline4 = '# HNN HNZ HNE';
textline5 = '# HNZ HNE HNN';
textline6 = '# HNZ HNN HNE';
fid = fopen(filename,'r');
tline = fgetl(fid);
i = 1;
Output = [];
index = 0;
index_fs = 0;
index_station = 0; %index added for stations
while ischar(tline)
    
    %new condition added for stations
    if index_station == 0
        if strfind(tline,'# STATION: ') > 0
            index_station = 1;
            t = extractAfter(tline,"# STATION: ");
            if length(t)> 10
                index_braket = strfind(tline,"("); %include character until where it should be considered the name of the station
                STATION = tline(length('# STATION: ')+1 : index_braket-2);
            else
                STACODE = extractAfter(tline,"# STATION: ");
                STATION = STACODE(find(~isspace(STACODE)));       
            end
        end
    end
    
    %find sampling frequency
    
    if length(tline)>27 %%COUNT NUMBER OF CHARACTERS AND CHANGE IT AFTER >%%%%%%
        index_fs = strcmp(tline(1:27),'# SAMPLING FREQUENCY (Hz): ');
        if index_fs == 1
            str_output = remove_letters_1(tline);
            fs = str2double(str_output);
            index_fs = 0;
        end
    end
    
    %Getting acceleration
    %First mixed data%
    if index==0
        index = strcmp(tline,textline1);  %%EO NS UD
        if index ==1; index=1; end
    elseif index ==1
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:); 
        tmp1 = [tmp(:,1) tmp(:,2) tmp(:,3)];   % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp1];
        
    end
    
    
    %Second mixed data%
    if index==0
        index = strcmp(tline,textline2); %% EO UD NS
        if index ==1; index=2; end
    elseif index==2
        
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:);
        tmp2 = [tmp(1); tmp(3); tmp(2)];   % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp2];
    end
    
    %Third mixed data%
    if index==0
        index = strcmp(tline,textline3); %% NS EO UD
        if index ==1; index=3; end
    elseif index==3
        
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:);        
        tmp3 = [tmp(2); tmp(1); tmp(3)];  % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp3];
    end
    
    %Fourth mixed data%
    if index==0
        index = strcmp(tline,textline4); % NS UD EO
        if index ==1; index=4; end
    elseif index==4
        
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:);
        tmp4 = [tmp(3); tmp(1); tmp(2)];  % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp4];
    end
    
    %Fifth mixed data%
    if index==0
        index = strcmp(tline,textline5); % UD EO NS
        if index ==1; index=5; end
    elseif index==5
        
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:);
        tmp5 = [tmp(2); tmp(3); tmp(1)];  % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp5];
    end
    
    %Sixth mixed data%
    if index==0
        index = strcmp(tline,textline6); % UD NS EO
        if index ==1; index=6; end
    elseif index==6
        %fid=fopen(filename,'r');        % open the file for low-level i/o
        n=0;                                                    % initialize line counter
        tline='';                                                   % preset line content to nothing
        while ~contains(tline,'ACCELERATION DATA')                  % look for the acceleration data section
            tline=fgetl(fid);
            n=n+1;
        end
        for ii=1:3                                               % after found it, look for the data with, without a "T" record
            tline=fgetl(fid);
            if strcmp(tline(1),blanks(1)) | ii>5; break; n=n-1; end    % test for the record beginning of data; bail out if something goes wrong
            n=n+1;
        end
        %fid=fclose(fid);                                        % ok, close the file and do high-level read
        data=readmatrix(filename,'NumHeaderLines',n);
        whos data
        tmp=data(1:end,:);
        tmp6 = [tmp(3); tmp(2); tmp(1)];  % rearrange to EO=X NS=Y UD=Y
        Output = [Output; tmp6];
    end
    
    tline = fgetl(fid);
    i = i+1;
end
fclose(fid);
end

dpb on 8 May 2023

Edited: dpb on 8 May 2023


Encapsulate the various parts as functions; don't repeat the same code over and over again inline. That is very time-consuming to write initially and makes the code impossible to maintain, modify or debug later.

I asked for the complete requirements initially and didn't get anything back in response except to read the numeric array after the given header. As suspected, more than that is needed.

Don't build the data into the code; read the data and use it to make the decisions. Start by locating the pieces of information needed and build a table record for each file that identifies it, including reading the channel record. You can then reorder the columns from the order found in each file into one specific order, to build a consistent dataset for analysis. Depending on how the analyses will be carried out, one could save either the Nx3 array as a whole or three Nx1 vectors, one per channel name.

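function [chn,idx]=getChannels(fid)       % presume file already open, pass handle
  % find channel record of form
  %     # CHANNEL: HNE HNN HNZ
  % return identified channels and alphabetical order to rearrange data columns by
  MATCHSTR='# CHANNEL: ';
  l=fgetl(fid);
  while ischar(l) && ~startsWith(l,MATCHSTR)   % stop at end of file, where fgetl returns -1
    l=fgetl(fid);
  end
  if ~ischar(l), error('getChannels:notFound','No channel record found.'); end
  chn=strtrim(extractAfter(l,MATCHSTR));   % text following the match string
  chn=split(chn);                          % one cell per channel name
  [chn,idx]=sort(chn);                     % idx maps sorted order back to file columns
end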
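A minimal sketch of how the idx output of getChannels could be used to reorder the data columns. The channel record text and the numeric values here are made up for illustration, and the same steps are applied to a string rather than to an open file so the snippet runs on its own:

```matlab
% Hypothetical channel record and a small data block (illustrative values only)
rec  = '# CHANNEL: HNZ HNE HNN';
data = [3 1 2; 6 4 5];                  % columns in file order: HNZ, HNE, HNN

% Same steps as getChannels, applied to the string instead of a file record
MATCHSTR = '# CHANNEL: ';
chn = strsplit(strtrim(rec(length(MATCHSTR)+1:end)));
[chn, idx] = sort(chn);                 % alphabetical order: HNE, HNN, HNZ

ordered = data(:, idx);                 % columns now follow the sorted channel names
```

With this approach there is no need for one branch per column permutation; the sort index does the rearranging for whatever order the file happens to use.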

When this is done, then move on to finding the sampling frequency in similar fashion. While the given file shows it is the next record and likely will always be, don't presume that to always be the case; it looks as though the file structure is one that can be somewhat flexible so there may be some that have other information as well (unless there is a document that describes the format that says otherwise).
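A sketch of the same search pattern for the sampling frequency, assuming the record has the form "# SAMPLING FREQUENCY (Hz): value" used in the function posted above. The temporary file and its contents (including the station name DEMO) are hypothetical and exist only to make the snippet self-contained:

```matlab
% Write a tiny sample header to a temporary file (contents are illustrative)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# STATION: DEMO\n# SAMPLING FREQUENCY (Hz): 200\n# 5. ACCELERATION DATA\n');
fclose(fid);

MATCHSTR = '# SAMPLING FREQUENCY (Hz): ';
fs = NaN;                               % stays NaN if the record is never found
fid = fopen(fname, 'r');
l = fgetl(fid);
while ischar(l)                         % fgetl returns -1 at end of file
  if strncmp(l, MATCHSTR, length(MATCHSTR))
    fs = str2double(l(length(MATCHSTR)+1:end));   % convert the text after the match string
    break
  end
  l = fgetl(fid);
end
fclose(fid);
delete(fname);
```

Because the loop scans until it finds the record, it does not depend on the sampling frequency being on any particular line.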

I'd probably also save the date/time data as well as the magnitude and locations; you will likely want them later for annotation, if nothing else.

You might choose to also return the channel string as it exists before splitting/sorting; you could then use that as the key to find the beginning of the acceleration data.

The question about whether the "T" exists in each file is still open: is it the case that some files have it and some don't? The key difficulty is that you can't search for what isn't there except by an exhaustive search that fails, which is very expensive. You can, of course, first presume it isn't there, try to convert the first record, and catch the failure when the conversion doesn't work. The pain with reading data record by record is that there isn't a very convenient way to resynchronize to the beginning of the record just read, once you have found the data, in order to read the whole data set in one fscanf operation. When the "T" does exist and the conversion fails, the data begin with the next record and it's easy; when it doesn't exist and the conversion succeeds, read the rest and concatenate that result onto the first record.
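The try-the-first-record idea can be sketched as follows. This is only an illustration under stated assumptions: the temporary file and its values are made up, the channel line is taken to be "# HNE HNN HNZ", and the failure is detected through the count returned by sscanf (which does not throw an error on non-numeric text) rather than with try/catch:

```matlab
% Build a small sample file with a "T" record before the data (illustrative values)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# 5. ACCELERATION DATA\n# HNE HNN HNZ\nT\n');
fprintf(fid, '%12.8f %12.8f %12.8f\n', [0.1 0.2 0.3; 0.4 0.5 0.6]');
fclose(fid);

fid = fopen(fname, 'r');
l = fgetl(fid);
while ischar(l) && isempty(strfind(l, '# HNE HNN HNZ'))   % advance to the channel line
  l = fgetl(fid);
end
first = fgetl(fid);                     % either the "T" record or the first data row
row = []; cnt = 0;
if ischar(first)
  [row, cnt] = sscanf(first, '%f');     % cnt is the number of values converted
end
rest = fscanf(fid, '%f', [3 Inf])';     % remaining records as an N-by-3 array
fclose(fid);
delete(fname);

if cnt == 3
  data = [row'; rest];                  % no "T": the first record was already data
else
  data = rest;                          % "T" present: data start with the next record
end
```

The same code handles a file without the "T" record, because in that case the first record converts to three values and is concatenated in front of the rest.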

You'll have much better success if you factorize the code into small pieces, each of which does its one task and then hands off to the next.

Jorge Luis Paredes Estacio on 8 May 2023

Thank you very much for your detailed explanation. I really appreciate it. I am going to modify the code as you suggested and try to fix the issue of getting more data.

dpb on 8 May 2023

NOTA BENE: In the initial code above there was a typo/mismatch between the returned indexing variable and the variable used as the return value in the sort call. I have fixed it above, but the original would have had an issue.


Answer 1

dpb on 7 May 2023


https://it.mathworks.com/matlabcentral/answers/1959334-read-text-file-after-a-specific-text-line-but-avoiding-only-the-next-line#answer_1230694

Edited: dpb on 8 May 2023


CISMID_SC_SCARQ_NEW_TOCHECH.txt

fn='https://www.mathworks.com/matlabcentral/answers/uploaded_files/1376874/CISMID_SC_SCARQ_NEW_TOCHECH.txt';
data=readmatrix(fn,'CommentStyle',{'#','T'});
whos data
  Name          Size              Bytes  Class     Attributes

  data      49626x4             1588032  double              
[data(1:5,:); nan(1,size(data,2)) ; data(end-4:end,:)]
ans = 11×4
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
       NaN       NaN       NaN       NaN
       NaN  -20.2870       NaN       NaN
       NaN       NaN       NaN       NaN
    0.0448   -0.0758   -0.0541       NaN
   -0.0259   -0.0098    0.0058       NaN
   -0.0848    0.0277   -0.0031       NaN
   -0.0596    0.0094   -0.0153       NaN

Well, that's a spectacular failure: the published/documented comment style didn't seem to work well at all. I would have to delve into that some more, but it may be worth a support ticket if there isn't an obvious cause that I'm missing from just looking at the file in the browser.

BTW, since there isn't anything after the data section, you could shorten the file significantly before posting and not lose anything; I was presuming there were probably other sections after the data.

Anyways, let's do something a little different...

opt=detectImportOptions(fn,'Readvariablenames',0,'ExpectedNumVariables',3)

opt =

  DelimitedTextImportOptions with properties:

   Format Properties:
                    Delimiter: {'\t' ' '}
                   Whitespace: '\b'
                   LineEnding: {'\n' '\r' '\r\n'}
                 CommentStyle: {}
    ConsecutiveDelimitersRule: 'join'
        LeadingDelimitersRule: 'ignore'
       TrailingDelimitersRule: 'ignore'
                EmptyLineRule: 'skip'
                     Encoding: 'UTF-8'

   Replacement Properties:
                  MissingRule: 'fill'
              ImportErrorRule: 'fill'
             ExtraColumnsRule: 'ignore'

   Variable Import Properties: Set types by name using setvartype
                VariableNames: {'Var1', 'Var2', 'Var3'}
                VariableTypes: {'double', 'double', 'double'}
        SelectedVariableNames: {'Var1', 'Var2', 'Var3'}
              VariableOptions: Show all 3 VariableOptions
        Access VariableOptions sub-properties using setvaropts/getvaropts
           VariableNamingRule: 'modify'

   Location Properties:
                    DataLines: [15 Inf]
            VariableNamesLine: 0
               RowNamesColumn: 0
            VariableUnitsLine: 0
     VariableDescriptionsLine: 0
        To display a preview of the table, use preview

data=readmatrix(fn,opt);

whos data

  Name          Size              Bytes  Class     Attributes

  data      49631x3             1191144  double

data(1:5,:)

ans = 5×3

       NaN       NaN       NaN
       NaN    2.0000       NaN
       NaN       NaN       NaN
       NaN       NaN       NaN
       NaN       NaN  -20.2870

Well, now we've again illustrated that the import detection tool isn't all that great sometimes, particularly for text files. I always like to try the higher-level tools first, but when they don't work, revert to brute force to find the header:

fid=fopen('CISMID_SC_SCAR...W_TOCHECH.txt','r');        % open the file for low-level i/o
n=0;                                                    % initialize line counter
l='';                                                   % preset line content to nothing
while ischar(l) && ~contains(l,'ACCELERATION DATA')     % look for the acceleration data section; stop at end of file
  l=fgetl(fid);
  n=n+1;
end
for i=1:3                                               % after finding it, check up to three records for the start of the data, with or without a "T" record
  l=fgetl(fid);
  if ~ischar(l) || strncmp(l,' ',1); break; end         % data records begin with a blank; also bail out at end of file
  n=n+1;                                                % count this record as another header line
end
l = '# HNE HNN HNZ'
l = 'T'
l = '      0.02165885      -0.06615625       0.00254670'
ans = 32
n = 37
fid=fclose(fid);                                        % ok, close the file and do high-level read
data=readmatrix(fn,'NumHeaderLines',n);
whos data
  Name          Size              Bytes  Class     Attributes

  data      49608x3             1190592  double              
data(1:5,:)
ans = 5×3
    0.0217   -0.0662    0.0025
    0.1372   -0.0853   -0.0040
    0.0745   -0.0395    0.0133
   -0.0195    0.0550    0.0390
   -0.0766    0.0929    0.0681

One could also use a low-level read to scan the rest of the file from that point on, but it's somewhat of a pain to resynchronize the file pointer to the beginning of the previous record to rescan it, so I just saved the header line count and read with the high-level routine.
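For completeness, a sketch of what that low-level alternative might look like: remember the file position with ftell before each fgetl, and fseek back to it once a record that converts to three numbers is found. The temporary file and its values are made up for illustration, and the test for "first data record" (exactly three values converted by sscanf) is an assumption that would need checking against real headers:

```matlab
% Build a small sample file (illustrative values only)
fname = [tempname '.txt'];
fid = fopen(fname, 'w');
fprintf(fid, '# 5. ACCELERATION DATA\n# HNE HNN HNZ\nT\n');
fprintf(fid, '%12.8f %12.8f %12.8f\n', [0.1 0.2 0.3; 0.4 0.5 0.6]');
fclose(fid);

fid = fopen(fname, 'r');
pos = ftell(fid);                       % position of the record about to be read
l = fgetl(fid);
while ischar(l)
  [~, cnt] = sscanf(l, '%f');           % how many numbers does this record hold?
  if cnt == 3, break; end               % first numeric record found
  pos = ftell(fid);                     % remember where the next record starts
  l = fgetl(fid);
end
fseek(fid, pos, 'bof');                 % rewind to the start of the first data record
data = fscanf(fid, '%f', [3 Inf])';     % read all remaining records in one call
fclose(fid);
delete(fname);
```

This avoids reopening the file, at the cost of a per-record conversion attempt while scanning the header.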