How to fix my attempt to vectorize counts of strings and regexpPatterns in a text file?

REVISED:
Hello Folks,
I am having difficulty vectorizing the count of lines in a data file, File_1_rev1.txt, that contain search terms, where each term is either a literal string or a regular expression pattern. The attached file is small for the purpose of this example. The actual file I want to parse is typically 2 TB in size, so I want to perform the counts as efficiently as possible.
Objective:
Minimize the processing time for counting the lines in File_1_rev1.txt that contain each string or regexpPattern, and output the counts in a table.
Desired output:
Code Issue:
The output I get from the code provided below is incorrect. How do I define the variable <C> correctly to count lines containing regular expression patterns so that I get the desired output shown above?
clear
clc
SearchTerms = {...
'Term_1', 'Blanket';...
'Term_2', 'blah';...
'Term_3', 'of';...
'Term_4', '(dat|not)\d{1}';...
'Term_5', '(dat|not)\d{23}'...
};
Term_IDs = SearchTerms(:,1); % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2); % string/regexpPattern to count
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Lines = Text{1,1};
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______
      0         0         0         0         0
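A likely reason for the all-zero result is that categorical compares each element with the value set for exact equality, so a line only gets a category when the whole line equals one of the terms; the terms are never treated as substrings or regular expressions. The following minimal sketch illustrates this with three made-up lines and two of the terms (the sample data is an assumption, not taken from File_1_rev1.txt):

```matlab
% Made-up sample lines; only the third one equals a search term exactly.
Lines = ["Blanket Blanket Blanket"; "dat3 field blah"; "blah"];

% categorical maps a line to a category only on an exact match with the value set.
C = categorical(Lines, ["Blanket"; "blah"], ["Term_1"; "Term_2"]);

Unmatched = isundefined(C);   % lines 1 and 2 are <undefined>
Counts = countcats(C);        % Term_1 -> 0, Term_2 -> 1
```

Lines that merely contain a term end up as <undefined> and are not counted, which is consistent with the zeros shown above.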
1 Comment
Jude
Jude on 28 Dec 2023
I have also tried the following, without success:
1. The trouble with the line below is getting the regexpPattern terms to work:
C = categorical(Lines, Term_Patterns,Term_IDs,"Ordinal",true);
2. The line below looked workable, but I am having trouble with the implementation:
C = discretize(Lines, contains(Lines, regexpPattern(Term_Patterns)), 'categorical', Term_IDs')
3. I am currently looking into using the dictionary function to convert <Lines> into a line-by-line representation of <Term_IDs> where applicable, followed by the categorical and histcounts functions to get the counts.
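One point the attempts above run into is that a categorical array holds a single category per line, while a line can contain several terms at once. A minimal sketch of an alternative, assuming made-up sample lines and a logical match matrix M (both hypothetical, not from the original post), is to test every line against every pattern and sum the columns:

```matlab
% Made-up sample lines; line 2 matches two different terms.
Lines = ["Blanket Blanket Blanket"; "dat3 field blah"; "not mean anything."; "blah Blah"];
Term_Patterns = {'Blanket'; 'blah'; '(dat|not)\d{1}'};

% M(i,k) is true when line i contains a match for pattern k.
M = false(numel(Lines), numel(Term_Patterns));
for k = 1:numel(Term_Patterns)
    M(:,k) = contains(Lines, regexpPattern(Term_Patterns{k}));
end

% Number of matching lines per term.
TermCounts = sum(M, 1);   % 1 2 1
```

Here the second line is counted under both 'blah' and '(dat|not)\d{1}', which a one-category-per-line encoding could not represent.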


Accepted Answer

Stephen23
Stephen23 on 28 Dec 2023
Edited: Stephen23 on 28 Dec 2023
SearchTerms = {...
'Term_1', 'Blanket';...
'Term_2', 'blah';...
'Term_3', 'of';...
'Term_4', '(dat|not)\d{1}';...
'Term_5', '(dat|not)\d{23}'...
};
Term_IDs = SearchTerms(:,1); % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2); % string/regexpPattern to count
L = readlines('File_1_rev1.txt')
L = 5729×1 string array
    "Blanket Blanket Blanket"
    ""
    "This"
    "is a test"
    "a test Of your"
    "testing system"
    "this text does"
    "not mean anything."
    "! Do not5 mind spe$cial charac7er5~"
    "not mean anything."
    ""
    ""
    "this text does"
    "testing system"
    "a test of your"
    "is a test"
    "This"
    "55 !! && Test"
    "dat3 field blah"
    "blah Blah"
    "case sensitive or not"
    "might want to create counts"
    "for each maybe not. This"
    "is the end oF an example,"
    "instead of having actual"
    "data with millions of lines"
    "of text. "
    ""
    ""
    "This"
P = regexpPattern(Term_Patterns);
F = @(p)nnz(contains(L,p));
V = arrayfun(F,P)
V = 5×1
     1
   424
   848
   424
     0
T = unstack(table(V,Term_IDs),'V','Term_IDs')
T = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______
      1        424       848       424        0
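The question mentions files of around 2 TB, which readlines would have to load into memory in one go. A minimal, unbenchmarked sketch of the same contains/regexpPattern counting applied block by block is given below. The function name countInChunks and the chunkLines parameter are hypothetical and not part of the answer above; the block reads with the same textscan call used in the question, limited to chunkLines lines per pass.

```matlab
function V = countInChunks(filename, Term_Patterns, chunkLines)
% Count lines matching each pattern while reading the file in blocks.
% filename      : path to the text file
% Term_Patterns : cell array of regular expressions (as in the question)
% chunkLines    : number of lines per block (hypothetical tuning parameter)
P = regexpPattern(Term_Patterns);
V = zeros(numel(P), 1);

fid = fopen(filename, 'r');
assert(fid ~= -1, 'Could not open %s', filename);
closer = onCleanup(@() fclose(fid));   % close the file even if an error occurs

while ~feof(fid)
    % Read at most chunkLines lines into a cell array of character vectors.
    block = textscan(fid, '%s', chunkLines, 'Delimiter', '\n');
    if isempty(block{1})
        break   % nothing left to read
    end
    L = string(block{1});
    for k = 1:numel(P)
        V(k) = V(k) + nnz(contains(L, P(k)));
    end
end
end
```

Saved as countInChunks.m, it could be called as V = countInChunks('File_1_rev1.txt', Term_Patterns, 1e6) and the result passed to the unstack call above. A suitable block size depends on the available memory and line length.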

Release

R2023a