How to force textscan to include the custom EOL character
dymitr ruta on 20 Sep 2022
Commented: dymitr ruta on 22 Sep 2022
Hi Folks,
I am loading multiline chunks of text from very big files into separate cells; each chunk starts with an '@' character. Here is my code to do that:
y=textscan(x,'%s',1e7,'EndOfLine','@'); y=y{1};
The result is good and fast, except that I also want to keep the opening '@'. I am running this against massive (TB-scale) files in blocks of 1e7 chunks. I know I can do y=strcat('@',y) afterwards, but this post-processing step takes longer than the textscan call itself. Is there a way to force textscan to also include the specified EOL character, or is there a faster solution?
Here is a test snippet in which I create a big multiline string to simulate the file:
x=repmat(['@abc:1:abc:1:2:3:4\ndef:1:abc:1:2:3:4\n'],1,1e7); tic; 
y=textscan(x,'%s','EndOfLine','@'); y=y{1}; 
t(1)=toc; 
y=strcat('@',y); 
t(2)=toc
Note: I want to retain the ability to quickly put the filtered string/file back together with x=[y{:}];
Help much appreciated.
Accepted Answer
Walter Roberson on 20 Sep 2022
        No, textscan() will always eat the EndOfLine delimiters.
One approach: keep a buffer of unprocessed text, initially empty, and append to it as the file is read in blocks:
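buffer = '';   %unprocessed text carried over between reads
while ~feof(fid)
    %read the next block as a char row vector and append it
    buffer = [buffer, fread(fid, CHUNKSIZE, '*char').'];
    if isempty(buffer) %end of file
        break
    elseif buffer(1) ~= '@'
        %something is wrong with the input stream, we expected a @ at the
        %beginning
        error('Expected ''@'' at the start of the buffer');
    else
        %each match runs from one @ up to (not including) the next @
        parts = regexp(buffer, '@[^@]*', 'match');
        buffer = parts{end};   %last chunk may be incomplete, keep it
        parts(end) = [];
        %now process the chunks in cell array parts
    end
end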
At this point, buffer should be non-empty and should hold the last chunk. Because the last chunk is not followed by @, we cannot know at the time we read it that it is complete: the end of a chunk can only be recognized by peeking ahead and seeing @ as the next character, or by detecting end of file. So expect buffer to hold the last chunk after the loop.
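To see the splitting step in isolation, here is a minimal sketch on a small in-memory string instead of a file. The sample text and the variable names carry and rebuilt are made up for illustration; only the regexp pattern comes from the loop above.

```matlab
% Small in-memory stand-in for one block read from the file (hypothetical data).
buffer = sprintf('@abc:1\ndef:2\n@ghi:3\njkl:4\n@mno:5\n');

% Each match starts at an '@' and runs up to, but not including, the next '@',
% so the leading '@' stays attached to its chunk.
parts = regexp(buffer, '@[^@]*', 'match');

% The last match may be incomplete, so set it aside for the next read.
carry = parts{end};
parts(end) = [];

% Nothing is dropped: the complete chunks plus the carry rebuild the input.
rebuilt = [parts{:}, carry];
disp(isequal(rebuilt, buffer))
```

Because every character of the input lands in exactly one match, the [parts{:}] style of reassembly asked for in the question keeps working without a separate step to put the '@' back.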
More Answers (1)
dpb on 20 Sep 2022
x=repmat(['@abc:1:abc:1:2:3:4\ndef:1:abc:1:2:3:4\n'],1,1e7); tic; 
y=textscan(x,'%s','EndOfLine','@'); y=y{1}; 
t(1)=toc; 
y=strcat('@',y); 
t(2)=toc; disp([t sum(t)])
tic; 
y=textscan(x,'%s','EndOfLine','@'); y=string(y{1}); 
t(1)=toc; 
y="@"+y; 
t(2)=toc; disp([t sum(t)])
This trades the very expensive strcat call for direct addition with the newer string class -- it takes about a second longer to convert to string, but saves 15 seconds or so in the concatenation.
Direct concatenation of the cellstr array with the cellfun variant was even slower than strcat, at least without more effort than I had time to give at the moment, but the above may lead to some other ideas on direct memory manipulation. Presuming the character strings in the real application aren't of uniform length, conversion to a plain char() array is probably not the way to go, so I didn't even look at that variant.
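If the string route is taken, the chunks still need to go back together into one char vector, as the question requires. Below is a small sketch of two ways to do that. The two-element cell array is a made-up stand-in for what textscan returns (chunks with the '@' already stripped), and I have not timed either variant on large inputs.

```matlab
% Hypothetical stand-in for the textscan output: chunks without their '@'.
y = {'abc:1\ndef:2\n'; 'ghi:3\njkl:4\n'};

% Prepend '@' to every element using string addition.
ys = "@" + string(y);

% Option 1: join the column of strings with an empty delimiter,
% then convert the resulting scalar string to a char vector.
x1 = char(join(ys, ""));

% Option 2: go back to a cell array of char vectors and concatenate.
c  = cellstr(ys);
x2 = [c{:}];

disp(isequal(x1, x2))
```

Which of the two is faster at 1e7 elements would need to be measured with tic/toc on the real data.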