Extract numbers from a cell array of strings

I have the following cell array:
s =
'HI_B2_ *TTT4009*_D452_07052016.xlsx'
'HI_H2G_ *TTT4002*_D259_070516.xlsx'
'HI_B2C_ *4008*_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ *TTT4004*.xlsx'
'HI__ *TTT4003*_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_ *4006*_148_07052016.xlsx'
I would like to extract all the bold numbers into a two-column matrix, like so:
ExM=
'4009', 'HI_B2_ TTT4009_D452_07052016.xlsx'
'4002', 'HI_H2G_ TTT4002_D259_070516.xlsx'
'4008', 'HI_B2C_ 4008_D1482_070516.xlsx'
'4004','HI_A1C_468_070516_ TTT4004.xlsx'
'4003','HI__ TTT4003_862_07052016_G1C.xlsx'
'4006','HI_KA6_ 4006_148_07052016.xlsx'
that holds the extracted numbers and their corresponding file names. Note that every extracted number begins with "400", and many of them also follow the letters "TTT".
I tried
regexpi(s, '[\w\s,]*400[\w\s,]*[_;]+','match')
But it did not work correctly, and I am also not sure how to build the two-column matrix without empty strings.
I would appreciate any input or material to learn from. Thank you very much!
  1 Comment
Guillaume on 6 Jul 2016
Note that within a character group (delimited by []) you can't use character classes such as \w and \s. The way to write [\w\s,] would be
(?:\w|\s|,)
or just expand the character classes:
[a-zA-Z0-9_ \f\n\r\t\v,]


Accepted Answer

Azzi Abdelmalek on 6 Jul 2016
s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
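% Capture "400" plus any digits that follow; names without a match give an empty cell
a=regexp(s,'.*(400\d*).*','tokens','once')
% Logical index of the names that produced a token
idx=~cellfun(@isempty,a)
% Two-column cell array: extracted number, corresponding file name
out=[[a{:}]' s(idx)]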
  3 Comments
chlor thanks on 7 Jul 2016
Edited: chlor thanks on 7 Jul 2016
The stars were only there because I wanted to make the numbers look bold, but somehow the formatting turned into stars. So I tried without the stars, but it still gives me this:
s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
>> hi = regexp(s, '(?:TTT)?)400\d', 'match', 'once')
hi =
''
''
''
''
''
''
''
''
Still, thank you for the awesome explanation! I have been trying to learn on my own for so many days but it is very hard to find examples which are best to learn from, a lot of times I get stuck over and over again. Your notes help me a lot, thank you so so so much Guillaume!
Guillaume on 7 Jul 2016
MATLAB's regexp engine does not throw an error when the regular expression is not valid (unfortunately). Your expression has unbalanced parentheses and so is not valid.
'(?:TTT)?400\d' would have worked. This would return the TTT portion in the match if it is present. Probably not what you want.
'(?<=(?:TTT)?)400\d' would also work. The TTT would not be returned in the match. It is just a requirement that the match be preceded by TTT. However, since that requirement is optional (sic!) because of the last ?, it actually serves no purpose and may just as well be omitted. (In my regex, the star was part of the requirement and was not optional).
So, '400\d' is probably what you need then.
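The difference between the patterns discussed above can be sketched as follows. This is a minimal illustration on a three-name subset of the question's data, not the thread's own code: the variable names (withTTT, afterTTT, num, ExM) are chosen here for illustration, and the lookbehind is written as a mandatory (?<=TTT) rather than the optional form quoted above.

```matlab
% Subset of the file names from the question (stars removed).
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'};

% Optional non-capturing group: TTT becomes part of the match when present.
withTTT = regexp(s, '(?:TTT)?400\d', 'match', 'once');

% Mandatory lookbehind: TTT must precede the number but is not returned,
% so names without TTT do not match at all.
afterTTT = regexp(s, '(?<=TTT)400\d', 'match', 'once');

% Plain pattern: the number alone, with or without a TTT prefix.
num = regexp(s, '400\d', 'match', 'once');

% Drop the names that did not match and pair each number with its file name.
idx = ~cellfun(@isempty, num);
ExM = [num(idx), s(idx)];
```

On this subset, ExM should come out as a 2-by-2 cell array with the numbers in the first column and the file names in the second, which is the layout asked for in the question.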
If you want to know whether or not the 400\d is actually preceded by TTT:
regexp(s, '(TTT)?(400\d)', 'tokens', 'once')
and if there's a match, see if the first token is empty (no TTT) or not.
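A possible way to turn that check into a logical flag per file name is sketched below. It assumes that an unmatched optional token is reported as an empty string; the numel test is an added guard in case the token is omitted instead. The names matched, hasTTT and num are hypothetical and not from the thread.

```matlab
% Subset of the file names from the question (stars removed).
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'};

% One cell of tokens per name; empty when the name has no 400x number.
tok = regexp(s, '(TTT)?(400\d)', 'tokens', 'once');

% Names that contain a 400x number at all.
matched = ~cellfun(@isempty, tok);

% A match counts as TTT-prefixed only if two tokens came back
% and the first one is non-empty.
hasTTT = false(size(s));
hasTTT(matched) = cellfun(@(t) numel(t) == 2 && ~isempty(t{1}), tok(matched));

% The number is always the last token of a match.
num = cellfun(@(t) t{end}, tok(matched), 'UniformOutput', false);
```

Here hasTTT lines up with s row by row, while num only has entries for the names that matched.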
