regexp: what am I missing from the documentation?

I have tried to carefully read the regexp documentation, and I am able to successfully implement regexp in the simplest cases. For example, given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
I can use the following code to retrieve each of the separate names, without the trailing numeral and/or whitespace:
exp = '\w*[^1-9\s]';
MyMatch = regexp(test, exp, 'match')
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Dongo'} {'Chloe'} {'Billgo'}
Columns 7 through 8
{'Marie'} {'Aaron'}
However, despite much effort, I cannot achieve a more complex result (example provided below). I try to limit the number of questions I post to the community, but here is a situation where I would like to ask whether the experts can point out where I am erring in my use of regexp to produce a (slightly more complex) result. Note that this is not a specific problem I am trying to solve. I merely invented a 'random' problem in an effort to become more adept in my use of regexp.
For the following example, assume that all name instances in a character vector test have one of two possible problems.
  1. A single digit immediately follows the name (e.g., James7)
  2. The name has 'go' appended to its end.
NB: We know in advance there are no name instances in test that would require us to consider the possibility that 'go' is just the natural ending of a name instance (e.g., Hugogo).
Thus, given the character vector:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8'
The desired output is:
MyMatch = 1×8 cell array
Columns 1 through 6
{'John'} {'Ron'} {'James'} {'Don'} {'Chloe'} {'Bill'}
Columns 7 through 8
{'Marie'} {'Aaron'}
Examples of attempted (and failed) solutions:
% Given the documentation's statement, "If you specify a lookahead assertion before an expression,
% the operation is equivalent to a logical AND."
MyMatch = regexp(test, '(?<=\w*[^*go\s)\w*[^1-9\s]', 'match')
% Attempts to implement 'OR' logic: (exp|exp)
% (1)
[tok, mat] = regexp(test, '(\w+)([^*go\s]|[^1-9\s])', 'tokens', 'match');
vertcat(tok{:}) % then extract col1
% (2)
[tok, mat] = regexp(test, '((\w+)([^*go\s]))|((\w+)([^1-9\s]))', 'tokens', 'match')
vertcat(tok{:}) % then extract col1
% ...
And so on and so forth...
  1. What is your approach/solution (using regexp) to the above? Is it better to take a multipronged approach (e.g., convert to a cell array first, use two regexp calls, etc.)?
  2. What is your approach/solution (using regexp) given:
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo' % note Hugogo
% we want the 'MyMatch' or 'MyTokens' cell array to contain 'Hugo'
Thanks for your time, and Happy New Year!
Sincerely,
Ray
  4 Comments
Stephen23
Stephen23 on 28 Dec 2019
Edited: Stephen23 on 28 Dec 2019
The regexp documentation focuses on the function rather than on the regular expression syntax. For more detailed explanations of the syntax see:
You might also like to download my FEX submission iregexp, which creates an interactive figure for trying different regular expressions and parse strings, and seeing regexp's outputs:
Raymond MacNeil
Raymond MacNeil on 28 Dec 2019
Thanks, Stephen. I have previously examined these additional pages, but I should probably dig into them more.


Accepted Answer

Stephen23
Stephen23 on 28 Dec 2019
Edited: Stephen23 on 28 Dec 2019
A direct interpretation of your description "assume that all name instances in a character vector test have one of two possible problems. 1. A single digit immediately follows the name (e.g., James7) 2. The name has 'go' appended to its end." is to use one lookahead assertion:
>> test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';
>> regexp(test,'\w+(?=(\d|go)\>)','match')
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
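For readers wondering why this works where the bracketed attempts in the question did not, here is a small annotated sketch. It is only an illustration: the variable names (names, single, don) are mine, not part of the answer. The key point is that square brackets define a character class that matches one character at a time, so [^go] means "one character that is neither g nor o", not "anything except the sequence go". The literal sequence has to be written as a group or alternation instead.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% \w+        one or more word characters (greedy, backtracks as needed)
% (?=...)    lookahead: the text that follows must match, but is not consumed
% (\d|go)    either a single digit or the literal sequence 'go'
% \>         end of a word
names = regexp(test, '\w+(?=(\d|go)\>)', 'match');
disp(names)

% A character class matches single characters, not sequences:
single = regexp('Dongo', '[^go]', 'match');   % returns {'D','n'}

% The literal sequence 'go' at the end of a word, via a lookahead:
don = regexp('Dongo', '\w+(?=go\>)', 'match'); % returns {'Don'}
```

Because \w+ is greedy it first swallows the whole word and then backtracks until the lookahead succeeds, which is why 'Hugogo' yields 'Hugo' rather than 'Hu'.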
Or similarly using a token followed by a non-capturing group:
>> tkn = regexpi(test,'(\w+)(?:\d|go)\>','tokens');
>> [tkn{:}]
ans =
'John' 'Ron' 'James' 'Don' 'Chloe' 'Bill' 'Marie' 'Aaron' 'Hugo'
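Regarding the "multipronged approach" raised in the question, a two-step alternative is also possible. The following is a minimal sketch under the same assumptions as above (each name ends in a single digit or in 'go'); it is not part of the original answer, and the variable names cleaned and names are hypothetical. It first strips the unwanted suffixes with regexprep and then splits on whitespace with strsplit.

```matlab
test = 'John3 Ron4 James7 Dongo Chloe8 Billgo Marie7 Aaron8 Hugogo';

% Step 1: delete a single digit or the sequence 'go' when it ends a word.
cleaned = regexprep(test, '(\d|go)\>', '');

% Step 2: split the cleaned text on whitespace (the default delimiter).
names = strsplit(cleaned);
disp(names)
```

This gives the same nine names as the single-pattern versions. Whether it is "better" is mostly a matter of taste: the single regexp call is more compact, while the two-step form keeps each pattern simpler to read.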

Release

R2018b
