How to extract numerical values from a text in an array or table?
- In some columns, the values are string like '15 units' or '15-25 units'. I want to be able to create a new column with only numbers. In the former, this should obviously convert to 15. In the latter I want this to also be 15 (i.e. only extract the first number before the hyphen). Any ideas on how to do this?
- In other columns, I have long messy strings that sometimes contain valuable numbers, and sometimes don't. When the numbers are valuable, they are usually followed by either " or 'secs'. Is there a way to extract only the numbers in these strings that immediately precede this?
4 Commenti
Risposta accettata
2 Commenti
Più risposte (1)
Hi @Rahul ,
To achieve the extraction of numerical values from your data table in MATLAB, you can utilize regular expressions along with the cellfun function for efficient processing. Below is a concise approach to handle both scenarios you described.
Extracting Numbers from '15 units' or '15-25 units': You can use the following code snippet to create a new column with only the first number:
% Sample data data = {'15 units'; '15-25 units'; '20 units'; 'No data'}; % Extracting the first number extractedNumbers = cellfun(@(x) regexp(x, '(\d+)', 'match', 'once'), data, 'UniformOutput', false); % Convert to numeric numericValues = cellfun(@str2double, extractedNumbers);
Extracting Numbers Preceding " or 'secs": For this, you can modify the regular expression to capture numbers that precede the specified suffixes:
% Sample data data2 = {'30 secs'; '45"'; 'No value'; '20 mins'}; % Extracting numbers before " or 'secs' extractedSecs = cellfun(@(x) regexp(x, '(\d+)(?=\s*["'']|secs)', 'match', 'once'), data2, 'UniformOutput', false); % Convert to numeric numericSecs = cellfun(@str2double, extractedSecs);
Please see attached.
By applying these methods, you can efficiently clean your data and create new columns with the desired numerical values, minimizing manual processing.
2 Commenti
Hi @Rahul,
After reviewing your comments, I comprehend your concern. In my previous code, I noticed that the original regular expression (\d+)(?=\s*["]|secs) is designed to match one or more digits (\d+) that are followed by either a double quote or the word "secs". However, the use of the alternation operator (|) within the lookahead assertion was not correctly capturing both cases. So, I adjusted the regular expression to ensure it captures numbers followed by either a double quote or the word "secs". Here is the complete and updated MATLAB code that addresses the extraction of numeric values from the specified strings:
% Sample data data2 = {'30 secs'; '45"'; 'No value'; '20 mins'; '50 secs'};
% Extracting numbers before " or 'secs' extractedSecs = cellfun(@(x) regexp(x, '(\d+)(?=\s*["]|(?=\s*secs))', 'match', 'once'), data2, 'UniformOutput', false);
% Convert to numeric numericSecs = cellfun(@str2double, extractedSecs);
% Display results disp(numericSecs);
Please see attached.
In the above modified code, the variable data2 contains a cell array of strings, some of which include numeric values followed by "secs" or a double quote.The cellfun function applies the regexp function to each element of data2. The updated regular expression captures the desired numeric values.The extracted string values are converted to numeric format using str2double. This function will return NaN for any non-numeric strings.Finally, the results are displayed using disp, which will show the numeric values extracted from the strings.
Hope this should help resolve your problem.
Vedere anche
Categorie
Community Treasure Hunt
Find the treasures in MATLAB Central and discover how the community can help you!
Start Hunting!