Extract numbers from a cell array of strings

Question

chlor thanks on 6 Jul 2016

https://it.mathworks.com/matlabcentral/answers/294008-extract-numbers-from-a-cell-array-of-strings

Commented: Guillaume on 7 Jul 2016

I have the following cell array:

s = 
'HI_B2_ *TTT4009*_D452_07052016.xlsx'
'HI_H2G_ *TTT4002*_D259_070516.xlsx'
'HI_B2C_ *4008*_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ *TTT4004*.xlsx'
'HI__ *TTT4003*_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_ *4006*_148_07052016.xlsx'

I would like to extract all the bold numbers into a two-column matrix, like so:

ExM=
'4009', 'HI_B2_ TTT4009_D452_07052016.xlsx'
'4002', 'HI_H2G_ TTT4002_D259_070516.xlsx'
'4008', 'HI_B2C_ 4008_D1482_070516.xlsx'
'4004','HI_A1C_468_070516_ TTT4004.xlsx'
'4003','HI__ TTT4003_862_07052016_G1C.xlsx'
'4006','HI_KA6_ 4006_148_07052016.xlsx'

so that each row holds an extracted number and its corresponding file name. Note that every extracted number begins with "400", and many of them follow the letters "TTT".

I tried

regexpi(s, '[\w\s,]*400[\w\s,]*[_;]+','match')

But it did not work correctly and I am also not sure how to make the matrix of two columns without empty strings.

I will appreciate any input or help material to learn from. Thank you very much!!

Guillaume on 6 Jul 2016

Note that within a character group (delimited by []) you can't use character classes such as \w and \s. The way to write [\w\s,] would be

(?:\w|\s|,)

or just expand the character classes:

[a-zA-Z0-9_ \f\n\r\t\v,]

Answer 1

Azzi Abdelmalek on 6 Jul 2016

https://it.mathworks.com/matlabcentral/answers/294008-extract-numbers-from-a-cell-array-of-strings#answer_227917

s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
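% Capture the digits that start with 400; strings without a match return an empty cell
a = regexp(s, '.*(400\d*).*', 'tokens', 'once');
% Logical index of the strings that produced a match
idx = ~cellfun(@isempty, a);
% Two-column cell array: extracted number, original file name
out = [[a{:}]' s(idx)]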

chlor thanks on 7 Jul 2016

Edited: chlor thanks on 7 Jul 2016

The stars were only there because I wanted the numbers to appear bold, but the formatting somehow turned into literal stars. I tried again without the stars, but it still gives me this:

s ={'HI_B2_ TTT4009_D452_07052016.xlsx'
'HI_H2G_ TTT4002_D259_070516.xlsx'
'HI_B2C_ 4008_D1482_070516.xlsx'
'HI_B2C_ 008_D1482_070516.xlsx'
'HI_A1C_468_070516_ TTT4004.xlsx'
'HI__ TTT4003_862_07052016_G1C.xlsx'
'HI_B2C_ 008_D1487_070516.xlsx'
'HI_KA6_4006_148_07052016.xlsx'};
>> hi = regexp(s, '(?:TTT)?)400\d', 'match', 'once')
hi = 
    ''
    ''
    ''
    ''
    ''
    ''
    ''
    ''

Still, thank you for the awesome explanation! I have been trying to learn on my own for so many days but it is very hard to find examples which are best to learn from, a lot of times I get stuck over and over again. Your notes help me a lot, thank you so so so much Guillaume!

Guillaume on 7 Jul 2016

MATLAB's regexp engine does not throw an error when the regular expression is invalid (unfortunately). Your expression has unbalanced parentheses, so it is not valid.

'(?:TTT)?400\d' would have worked. This would return the TTT portion in the match if it is present. Probably not what you want.

'(?<=(?:TTT)?)400\d' would also work. The TTT would not be returned in the match. It is just a requirement that the match be preceded by TTT. However, since that requirement is optional (sic!) because of the last ?, it actually serves no purpose and may just as well be omitted. (In my regex, the star was part of the requirement and was not optional).
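To make the difference between these patterns concrete, here is a minimal sketch on three of the file names from the question. The variable names are made up for illustration, and the lookbehind is shown in its mandatory form, `(?<=TTT)`, rather than the optional form discussed above, so that the prefix requirement has a visible effect.

```matlab
% Hypothetical three-entry sample, taken from the file names in the question.
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'};

% Optional non-capturing group: TTT becomes part of the match when present.
withPrefix = regexp(s, '(?:TTT)?400\d', 'match', 'once');

% Mandatory lookbehind: TTT must precede the number but is not returned.
onlyAfterTTT = regexp(s, '(?<=TTT)400\d', 'match', 'once');

% No prefix condition at all: just the four-digit number.
plainNumber = regexp(s, '400\d', 'match', 'once');
```

On this sample, the first pattern should return the TTT together with the number for the first file name, the lookbehind should match only that first file name, and the plain pattern should match the first two. The third file name contains no "400" and yields an empty result in all three cases.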

So, '400\d' is probably what you need then.
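As a rough sketch of how that pattern could produce the two-column result asked for in the question, without the empty rows: the data below is copied from the question with the bold markers removed, and the names `num`, `hasMatch` and `ExM` are chosen only for this example.

```matlab
% Sample data from the question (bold markers removed).
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_ 4006_148_07052016.xlsx'};

% With 'match' and 'once', regexp returns one char vector per input string;
% strings without a match give an empty char vector.
num = regexp(s, '400\d', 'match', 'once');

% Keep only the rows where something was found.
hasMatch = ~cellfun(@isempty, num);

% N-by-2 cell array: extracted number in column 1, file name in column 2.
ExM = [num(hasMatch), s(hasMatch)];
```

Filtering with a logical index keeps the numbers and the file names aligned, because the same index is applied to both columns.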

If you want to know whether or not the 400\d is actually preceded by TTT:

regexp(s, '(TTT)?(400\d)', 'tokens', 'once')

and if there's a match, see if the first token is empty (no TTT) or not.
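The token check described above might look like the following sketch. It is written defensively: it takes the number from the last token and only reports a TTT prefix when two tokens come back and the first is non-empty, so it does not depend on exactly how an unmatched optional token is returned. The names `tok`, `found`, `number`, `hasTTT` and `result` are illustrative only.

```matlab
% Sample data from the question (bold markers removed).
s = {'HI_B2_ TTT4009_D452_07052016.xlsx'
     'HI_H2G_ TTT4002_D259_070516.xlsx'
     'HI_B2C_ 4008_D1482_070516.xlsx'
     'HI_B2C_ 008_D1482_070516.xlsx'
     'HI_A1C_468_070516_ TTT4004.xlsx'
     'HI__ TTT4003_862_07052016_G1C.xlsx'
     'HI_B2C_ 008_D1487_070516.xlsx'
     'HI_KA6_ 4006_148_07052016.xlsx'};

% One cell of tokens per string; strings without a match give an empty cell.
tok = regexp(s, '(TTT)?(400\d)', 'tokens', 'once');

% Drop the strings that did not match at all.
found = ~cellfun(@isempty, tok);
tok = tok(found);

% The number is always the last token.
number = cellfun(@(t) t{end}, tok, 'UniformOutput', false);

% TTT was present only if a non-empty first token came back as well.
hasTTT = cellfun(@(t) numel(t) == 2 && ~isempty(t{1}), tok);

% Three columns: number, file name, flag for the TTT prefix.
result = [number, s(found), num2cell(hasTTT)];
```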
