Find indices of multiple strings within another string

I am trying to efficiently find which strings (character vectors) match between two cell arrays.
One cell array contains ~1000 equations written as strings that I'm trying to parse by matching to strings in another array (100,000 items). I need to know the indices from the 100,000 items that are found within the ~1000 equations. There may be multiple of the 100,000 items found within each of the 1000 equations.
I'm currently implementing this as such:
Equations.Equation % this is a list of ~1000 equations, a cell array of character vectors
OutputData.DataName % list of ~100,000 possible strings I'm looking for in the equations (my variable names)
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
This is fairly slow. Is there a way to more efficiently find within Equations(ii).Equation which items within OutputData.DataName are found and the index of those items?
4 Comments
Paul on 9 Apr 2022
Something's not working with this example data and the code in the question. Is there a typo somewhere?
Equations.Equation = { '(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'}
Equations = struct with fields:
Equation: {3×1 cell}
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName);
    indices = find(matches);
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
Error using cellfun
Non-scalar in Uniform output, at index 1, output 1.
Set 'UniformOutput' to false.
Voss on 9 Apr 2022
It seems like Equations is actually a struct array:
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'})
Equations = 3×1 struct array with fields:
Equation
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
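A minimal sketch of why the two definitions behave differently, assuming the same sample equations. The names S1, S2, r1 and r2 are hypothetical and only used for illustration. Assigning a cell array to a field creates a single struct whose field holds the whole cell array, so contains returns one result per equation and cellfun reports a non-scalar output. Passing the cell array to struct() creates one struct element per equation, so contains returns a scalar.

```matlab
% Sample equations from the comments above
eqs = {'(123X + 123Y).^2'; ...
       '500 + 456X + 123Z'; ...
       '200 * abs(789Z * pi) + 123X'};

% Field assignment: 1x1 struct whose field is a 3x1 cell array
S1.Equation = eqs;

% struct() with a cell array value: 3x1 struct array, one equation each
S2 = struct('Equation', eqs);

% contains on the whole cell array returns a 3x1 logical (non-scalar),
% which is what triggers the cellfun 'UniformOutput' error
r1 = contains(S1(1).Equation, '123X');

% contains on a single character vector returns a scalar logical
r2 = contains(S2(1).Equation, '123X');
```

With the struct array form, length(Equations) is 3 and the loop in the question visits each equation in turn.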


Accepted Answer

Paul on 10 Apr 2022
It looks like using string variables with an inner loop is much faster than a cell array with cellfun, at least here on Answers with the data provided.
Original code, modified by @_
Equations = struct('Equation',{ ...
'(123X + 123Y).^2'; ...
'500 + 456X + 123Z'; ...
'200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
for ii = 1:length(Equations)
    matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).'
    indices = find(matches)
    % do some other stuff with the matches found, then move onto the next iteration of the loop
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
Convert the cell arrays to string arrays, and implement an inner loop to compute matches. Verify the results are the same.
equations = string({Equations.Equation});
dataname = string(OutputData.DataName);
matches = false(1,numel(dataname)); % preallocate as a logical row vector
for ii = 1:numel(equations)
    for jj = 1:numel(dataname)
        matches(jj) = contains(equations(ii),dataname(jj));
    end
    matches
    indices = find(matches)
end
matches = 1×12 logical array
0 0 0 1 1 0 0 0 0 0 0 0
indices = 1×2
4 5
matches = 1×12 logical array
0 0 0 0 0 1 1 0 0 0 0 0
indices = 1×2
6 7
matches = 1×12 logical array
0 0 0 1 0 0 0 0 0 0 0 1
indices = 1×2
4 12
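The comparison above is done by eye. A minimal sketch of checking it programmatically with isequal, assuming the same sample data; the names viaCellfun, viaLoop and same are hypothetical.

```matlab
% Sample data from the thread
Equations = struct('Equation', { ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
                       '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

equations = string({Equations.Equation});
dataname  = string(OutputData.DataName);

same = true;  % stays true only if both methods agree for every equation
for ii = 1:numel(equations)
    % Method 1: cellfun over the cell array of character vectors
    viaCellfun = cellfun(@(x) contains(Equations(ii).Equation, x), ...
                         OutputData.DataName).';

    % Method 2: explicit inner loop over the string array
    viaLoop = false(1, numel(dataname));
    for jj = 1:numel(dataname)
        viaLoop(jj) = contains(equations(ii), dataname(jj));
    end

    same = same && isequal(viaCellfun, viaLoop);
end
```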
Wrap an outer loop around the original code to test timing.
ntrials = 1e5;
tic
for trials = 1:ntrials
    for ii = 1:length(Equations)
        matches = cellfun(@(x) contains(Equations(ii).Equation,x),OutputData.DataName).';
        indices = find(matches);
        % do some other stuff with the matches found, then move onto the next iteration of the loop
    end
end
toc
Elapsed time is 15.236180 seconds.
tic
for trials = 1:ntrials
    for ii = 1:numel(equations)
        for jj = 1:numel(dataname)
            matches(jj) = contains(equations(ii),dataname(jj));
        end
        matches;
        indices = find(matches);
    end
end
toc
Elapsed time is 2.448469 seconds.
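As an alternative to a manual outer loop with tic/toc, timeit runs a function handle several times and returns a typical run time in seconds. The sketch below is only an illustration for a single equation. The names eqChar, eqStr, namesCell, namesStr, tCell and tStr are hypothetical, and arrayfun stands in for the explicit inner loop because timeit needs a function handle, so the numbers would not be directly comparable to the timings above.

```matlab
% One sample equation and the sample names, in both representations
eqChar    = '200 * abs(789Z * pi) + 123X';
namesCell = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
             '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};
eqStr     = string(eqChar);
namesStr  = string(namesCell);

% Cell array of character vectors with cellfun
fCell = @() cellfun(@(x) contains(eqChar, x), namesCell);

% String array with arrayfun (assumption: a stand-in for the inner loop)
fStr  = @() arrayfun(@(p) contains(eqStr, p), namesStr);

tCell = timeit(fCell);  % typical run time in seconds
tStr  = timeit(fStr);
```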
I was actually surprised that there isn't a string function that can replace that inner loop, but I couldn't find one. Maybe it can be done using a particular pattern, but I couldn't figure that out either.
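One possible way to avoid the inner loop, offered only as a sketch: extract candidate variable names from each equation with regexp, then look them up with ismember. This assumes every name follows the convention in the sample data (digits followed by letters); the expression in tokenPattern and the names tokenPattern, tokens and allIndices are hypothetical. It also matches whole tokens rather than substrings, so it can differ from contains when one name is a substring of another. It has not been benchmarked here.

```matlab
% Sample data from the thread
Equations = struct('Equation', { ...
    '(123X + 123Y).^2'; ...
    '500 + 456X + 123Z'; ...
    '200 * abs(789Z * pi) + 123X'});
OutputData.DataName = {'123A'; '123B'; '123C'; '123X'; '123Y'; '123Z'; ...
                       '456X'; '456Y'; '456Z'; '789X'; '789Y'; '789Z'};

% Assumed naming convention: one or more digits followed by letters
tokenPattern = '\d+[A-Za-z]+';

allIndices = cell(numel(Equations), 1);
for ii = 1:numel(Equations)
    % Tokenize the equation once instead of testing every name against it
    tokens = regexp(Equations(ii).Equation, tokenPattern, 'match');

    % Logical mask over DataName marking the names found in this equation
    matches = ismember(OutputData.DataName, tokens).';

    allIndices{ii} = find(matches);
end
```

For the sample data this gives the same indices as the loops above (4 5, then 6 7, then 4 12). Whether it is faster on 100,000 names would need to be measured.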


Release

R2016b
