How to fix my attempt to vectorize counts of strings and regexpPatterns in a text file?

Question

Jude on 27 Dec 2023

https://it.mathworks.com/matlabcentral/answers/2064526-how-to-fix-my-attempt-to-vectorize-counts-of-strings-and-regexppatterns-in-a-text-file

Commented: Jude on 28 Dec 2023

Accepted Answer: Stephen23

REVISED:

Hello Folks,

I am having difficulty vectorizing the counting of lines in a data file, File_1_rev1.txt, that contain search terms, where each term is either a string or a regular expression pattern. The attached file is small for the purpose of this example. The actual file I want to parse is typically 2 TB in size, so I want to perform the counts as efficiently as possible.

Objective:

Minimize the processing time for counting the lines in File_1_rev1.txt that contain occurrences of strings or regexpPatterns, and output the count results in a table.

Desired output:

Code Issue:

The output I get from the code provided below is incorrect. How do I define the variable <C> correctly to count lines containing regular expression patterns so that I get the desired output, shown above?

clear
clc
SearchTerms = {...
                'Term_1', 'Blanket';...
                'Term_2', 'blah';...
                'Term_3', 'of';...
                'Term_4', '(dat|not)\d{1}';...
                'Term_5', '(dat|not)\d{23}'...
              };
Term_IDs = SearchTerms(:,1);       % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
Num_SearchTerms = height(SearchTerms);
fid = fopen('File_1_rev1.txt');
Text = textscan(fid, '%s', 'Delimiter', '\n');
fclose(fid);
Lines = Text{1,1};
C = categorical(Lines, Term_Patterns, Term_IDs);
[TermCounts,Categories] = histcounts(C);
Result = cell2table(cell(0,Num_SearchTerms), 'VariableNames', Term_IDs');
Result = [Result; num2cell(TermCounts)]
Result = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      0         0         0         0         0   
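
A note on why every count above is zero: categorical(A, valueset, catnames) compares each element of A with the entries of valueset for exact equality. It does not search inside the text and does not interpret the entries as regular expressions, so any line that is not identical to one of the patterns becomes <undefined>. The following is a minimal sketch of that behavior, using three made-up lines (not the attached file) and a reduced pattern list:

```matlab
% Illustrative data only; these lines are not read from File_1_rev1.txt.
Lines    = {'dat3 field blah'; 'blah'; 'of text.'};
Patterns = {'blah'; 'of'};
IDs      = {'Term_2'; 'Term_3'};

% Each line is compared with each pattern for exact, whole-string equality.
C = categorical(Lines, Patterns, IDs);

% Only the line that IS exactly 'blah' gets a category; the rest are undefined.
NumUndefined = nnz(isundefined(C))
Counts       = countcats(C)
```

Only the second line is assigned a category, even though the first line also contains "blah" and the third contains "of". Counting lines that contain a term therefore needs a substring or regular expression test such as contains rather than a category lookup.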

1 Comment
Jude on 28 Dec 2023

I have also tried the following, without success:

1. The trouble with the line below is getting regexpPattern to work:

C = categorical(Lines, Term_Patterns,Term_IDs,"Ordinal",true);

2. The line below looked workable, but I am having trouble with the implementation:

C = discretize(Lines, contains(Lines, regexpPattern(Term_Patterns)), 'categorical', Term_IDs')

3. I am currently looking into using the dictionary function to convert <Lines> into a line-by-line representation of <Term_IDs> where applicable, and then following up with the categorical and histcounts functions to get the counts.
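
One thing the three attempts above have in common is that they map each line to a single category, while a line can contain several search terms at once. A minimal sketch of the alternative, assuming three made-up lines and a reduced pattern list (neither is taken from the attached file), is a logical matrix with one row per line and one column per term:

```matlab
% Illustrative data only; not read from File_1_rev1.txt.
Lines    = ["dat3 field blah"; "blah Blah"; "of text."];
Patterns = {'blah'; 'of'; '(dat|not)\d{1}'};

% M(i,k) is true when line i contains a match for pattern k.
M = false(numel(Lines), numel(Patterns));
for k = 1:numel(Patterns)
    M(:,k) = contains(Lines, regexpPattern(Patterns{k}));
end

% Column sums give the number of matching lines per term.
Counts = sum(M, 1)
```

Here the first line matches both 'blah' and '(dat|not)\d{1}', which a one-category-per-line representation cannot record.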

Answer 1

Stephen23 on 28 Dec 2023

https://it.mathworks.com/matlabcentral/answers/2064526-how-to-fix-my-attempt-to-vectorize-counts-of-strings-and-regexppatterns-in-a-text-file#answer_1379551

Edited: Stephen23 on 28 Dec 2023

File_1_rev1.txt

SearchTerms = {...
    'Term_1', 'Blanket';...
    'Term_2', 'blah';...
    'Term_3', 'of';...
    'Term_4', '(dat|not)\d{1}';...
    'Term_5', '(dat|not)\d{23}'...
    };
Term_IDs      = SearchTerms(:,1);  % ID of string/regexpPattern to search for
Term_Patterns = SearchTerms(:,2);  % string/regexpPattern to count
L = readlines('File_1_rev1.txt')
L = 5729×1 string array
    "Blanket Blanket Blanket"
    ""
    "This"
    "is a test"
    "a test Of your"
    "testing system"
    "this text does"
    "not mean anything."
    "! Do not5 mind spe$cial charac7er5~"
    "not mean anything."
    ""
    ""
    "this text does"
    "testing system"
    "a test of your"
    "is a test"
    "This"
    "55 !! && Test"
    "dat3 field blah"
    "blah Blah"
    "case sensitive or not"
    "might want to create counts"
    "for each maybe not.  This"
    "is the end oF an example,"
    "instead of having actual"
    "data with millions of lines"
    "of text. "
    ""
    ""
    "This"
P = regexpPattern(Term_Patterns);  % one pattern object per search term
F = @(p)nnz(contains(L,p));        % number of lines containing a match for p
V = arrayfun(F,P)
V = 5×1
     1
   424
   848
   424
     0
T = unstack(table(V,Term_IDs),'V','Term_IDs')
T = 1×5 table
    Term_1    Term_2    Term_3    Term_4    Term_5
    ______    ______    ______    ______    ______

      1        424       848       424        0   
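
The question mentions files of about 2 TB, which readlines would try to load into memory in one go. The following is a rough sketch, not part of the answer above, of how the same contains-based count might be accumulated over fixed-size blocks of lines. The names fileName, chunkLines and buf, the block size, and the sample text are all assumptions made for illustration; reading with fgetl keeps the sketch simple but may not be the fastest way to read a file of that size, so it would need profiling on real data.

```matlab
% Sketch only: fileName, chunkLines and the sample text are illustrative
% assumptions, not taken from the thread.
fileName = [tempname '.txt'];
sample = strjoin({'Blanket Blanket', '', 'dat3 field blah', ...
                  'blah Blah', 'of text.'}, newline);
fid = fopen(fileName, 'w');
fprintf(fid, '%s\n', sample);
fclose(fid);

Term_Patterns = {'Blanket'; 'blah'; 'of'; '(dat|not)\d{1}'; '(dat|not)\d{23}'};
P = regexpPattern(Term_Patterns);

% Same counting idea as in the answer, applied to one block of lines.
F = @(block) arrayfun(@(p) nnz(contains(block, p)), P);

chunkLines = 2;                 % lines held in memory at once (tune for real data)
buf = strings(chunkLines, 1);   % reusable buffer for the current block
n = 0;                          % number of lines currently in the buffer
V = zeros(size(P));             % running count per search term

fid = fopen(fileName, 'r');
txt = fgetl(fid);
while ischar(txt)               % fgetl returns -1 (not char) at end of file
    n = n + 1;
    buf(n) = string(txt);
    if n == chunkLines          % buffer full: count it, then start refilling
        V = V + F(buf);
        n = 0;
    end
    txt = fgetl(fid);
end
fclose(fid);
if n > 0                        % count the final, partially filled block
    V = V + F(buf(1:n));
end
delete(fileName);
V
```

Because each line is counted exactly once and the per-block counts are simply added, the totals should match what a single pass over the whole file would give, while memory use stays bounded by chunkLines.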

2 Comments
Dyuman Joshi on 28 Dec 2023

+1 for readlines()

Jude on 28 Dec 2023

@Stephen23, thank you for sharing your solution with me. I like this vectorized approach.
