Why can nested quantifiers in REGEXP can cause inefficient failures in MATLAB 6.5 (R13)?

10 visualizzazioni (ultimi 30 giorni)
If:
1. A regular expression contains a parenthesized subexpression which is quantified, and the inner expression is quantified as well (for example, '(.+)+') and
2. The text string contains a long series of matches of the subexpression and
3. The regular expression does not match the string,
it can take REGEXP a long time to determine that a match cannot be made. For example,
str = 'ltbwithupdatedbendsandmatchedoptics3/8/04,';
[s f t] = regexp(str,'^\d*(\w+\d*)*:,');
runs for several minutes without returning a result.

Risposta accettata

MathWorks Support Team
MathWorks Support Team il 17 Apr 2024 alle 0:00
Modificato: MathWorks Support Team il 17 Apr 2024 alle 15:53
This behavior is inherent in the NFA algorithm for regular expressions, used by most systems that support regular expressions, such as Perl, Emacs, and grep. MATLAB uses an optimized NFA algorithm.
The computational complexity of this problem is exponential with the number of characters in the string that must be tested. Expressions that fit this pattern are, by their nature, redundant. When no match is possible, the engine must explore all exponentially-many possibilities before concluding there is no match.
These expressions simplified by removing the outer quantifier --- for example, using '(.+)' instead of '(.+)+'. By reducing the number of duplicate string expansions, you can reduce the work required when no match is possible without affecting which strings the expression matches.
To improve the performance of this regular expression, remove redundancy in the quantifiers. For example, in this command:
[s f t] = regexp(str,'^\d*(\w+\d*)*:,');
the regular expression can be simplified from:
'^\d*(\w+\d*)*:,'
to:
'^\d*(\w*):,'
The non-redundant quantifier produces the same results in parsing the expression as the redundant quantifier, but is much more efficient when no match exists.
Using this expression in the original example, REGEXP reports no matches in well under 1 second:
str = 'ltbwithupdatedbendsandmatchedoptics3/8/04,';
[s f t] = regexp(str,'^\d*(\w*):,');
For more information on the NFA algorithm and building efficient regular expressions, refer to a regular expression reference such as the one listed below.
References

Più risposte (0)

Categorie

Scopri di più su Characters and Strings in Help Center e File Exchange

Prodotti


Release

Non è stata ancora inserita alcuna release.

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by