Reading/fetching text from text/PDF file for pre-processing

I have text/PDF files that contain millions of words. If I use str = extractFileText(filename), MATLAB first becomes very slow and sometimes hangs. A single variable also cannot hold that much data.
I want to read the file word by word so that I can filter the text and build a smaller array of filtered data. Alternatively, I want to write the filtered data to a temporary file for the next processing step (as it will be small).
I need help with this. If you have any other solution to my problem, please reply.
  2 Comments
Ive J on 21 Mar 2021
Edited: Ive J on 21 Mar 2021
Did you try calling the function with a name-value pair?
% pages: vector of page numbers to read, e.g. 1:10 (PDF input only)
for i = 1:numel(pages)
    str = extractFileText(filename, 'Pages', pages(i)); % read only one page at a time
    % do whatever you want with str
end
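
As a rough illustration of what the "do whatever you want with str" step might look like, the sketch below splits each page's text into words, keeps only those that pass a filter, and accumulates them in a small array. It is only a sketch under assumptions: `pageTexts` is a stand-in for the strings that `extractFileText` would return page by page, and the `minLen` length filter and the letters-only word definition are hypothetical choices, not something from the question.

```matlab
% Stand-in for per-page text; in practice each entry would come from
% extractFileText(filename, 'Pages', p) as in the loop above.
pageTexts = {'The quick brown fox.', 'It jumps over the lazy dog!'};

minLen = 4;            % hypothetical filter: keep words with >= 4 letters
kept = cell(0, 1);     % small array holding only the filtered words

for p = 1:numel(pageTexts)
    str = pageTexts{p};
    % Extract runs of letters, lowercased (a deliberately simple word definition).
    words = regexp(lower(str), '[a-z]+', 'match');
    % Keep only the words that pass the length filter.
    mask = cellfun(@numel, words) >= minLen;
    kept = [kept; words(mask).']; %#ok<AGROW>
end

disp(kept);
```

Because only one page is held in `str` at a time, memory use depends on the largest page and on the size of the filtered result, not on the whole document.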


Accepted Answer

moin khan on 21 Mar 2021
I first tried extractFileText on my file (a text file with 19 million words). It was really slow and did not work because everything went into a single variable. I now read the data line by line and save it in a cell array. It takes a few seconds, which is fine for such a large file.
Code:
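fid = fopen(filename, 'r');
if fid == -1
    error('Could not open file: %s', filename);
end
inputData = cell(0,1);
tline = fgetl(fid);
while ischar(tline)                      % fgetl returns -1 (numeric) at end of file
    if ~isempty(tline)                   % skip empty lines
        inputData{end+1,1} = tline; %#ok<AGROW>
    end
    tline = fgetl(fid);
end
fclose(fid);
clear('fid','tline');
documents = tokenizedDocument(inputData); % requires Text Analytics Toolbox
clear('inputData');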
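
The question also asks about writing the filtered data to a temporary file. One possible way to combine that with the line-by-line loop above is sketched here: each line is filtered as it is read and the result is appended to a temporary file, so the full text is never held in memory. The sample input, the `stopWords` list, and the whitespace-based word splitting are assumptions made for illustration only.

```matlab
% Create a small sample input so the sketch runs on its own;
% in practice inFile would be the large text file.
inFile = [tempname '.txt'];
fid = fopen(inFile, 'w');
fprintf(fid, 'The cat sat on the mat\n\nAnother line of sample text\n');
fclose(fid);

outFile = [tempname '.txt'];        % temporary file for the filtered text
stopWords = {'the', 'on', 'of'};    % hypothetical words to drop

fin = fopen(inFile, 'r');
fout = fopen(outFile, 'w');
if fin == -1 || fout == -1
    error('Could not open input or output file.');
end

tline = fgetl(fin);
while ischar(tline)                 % fgetl returns -1 (numeric) at end of file
    % Split the lowercased line on whitespace.
    words = regexp(lower(tline), '\S+', 'match');
    if ~isempty(words)
        words = words(~ismember(words, stopWords));
    end
    if ~isempty(words)
        fprintf(fout, '%s\n', strjoin(words, ' '));
    end
    tline = fgetl(fin);
end
fclose(fin);
fclose(fout);

filtered = fileread(outFile);       % the filtered file is small enough to load at once
delete(inFile);                     % remove the sample input
```

The filtered file can then be read back line by line and passed to tokenizedDocument in the same way as in the answer above.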
