Correction of misspelled words in data source

Sandeep Kapour on 14 Apr 2021
Commented: Walter Roberson on 15 Apr 2021
Hello,
I want to extract some data, and I am using the extractAfter function, which works very well. However, my measurement data has a problem. For example, I call extractAfter(data, 'Signal1') on data like this:
Signal1: 5, Signal2: 6
Signal1: 6, Signal2: 5
Sinal1: 8, Signal2: 5
Signal1: 10, Sigal2: 3
The problem is that Sinal1 and Sigal2 are misspelled. Is it possible to change Sinal1 to Signal1 and Sigal2 to Signal2 automatically? My data set is very large. I am using MATLAB R2019b.

Answers (2)

Cris LaPierre on 14 Apr 2021
If you are using extractAfter, your data must be text. If so, have you tried using the replace function?
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"]
data = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Sinal1: 8, Signal2: 5"
    "Signal1: 10, Sigal2: 3"
replace(data,["Sinal1","Sigal2"],["Signal1","Signal2"])
ans = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
  3 Comments
Cris LaPierre on 14 Apr 2021
Edited: Cris LaPierre on 14 Apr 2021
I'd suggest replaceBetween with pattern matching, but unfortunately, pattern was introduced in R2020b. Since you are on R2019b, try using regexprep instead.
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w*(?=1:|2:)',"Signal")
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
Walter Roberson on 15 Apr 2021
Variants:
data = ["Signal1: 5, Signal2: 6";"Signal1: 6, Signal2: 5";"Sinal1: 8, Signal2: 5";"Signal1: 10, Sigal2: 3"];
newStr = regexprep(data,'\<S\w+(\d+):','Signal$1:')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data,'\<S\w+(?=\d+:)','Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
newStr = regexprep(data, '\<S\D+', 'Signal')
newStr = 4×1 string array
    "Signal1: 5, Signal2: 6"
    "Signal1: 6, Signal2: 5"
    "Signal1: 8, Signal2: 5"
    "Signal1: 10, Signal2: 3"
The first variation explicitly captures the sequence of digits and appends it after 'Signal' in the replacement. The characters up to that point must be "word-building" characters, which are letters, digits, and underscore. For example, 'S1gn_l2:' would match, but 'S1gn-l1:' would not, because '-' is not word-building.
The second variation stops the match where it finds digits followed by a colon, and replaces everything up to there. It differs from Cris's suggestion in that it handles any sequence of digits, not just '1' or '2'. Again, the characters matched must be word-building.
The third variation matches any non-digits after the S, stopping at the first digit. For example, 'S1gn_l2' would not match, because a digit immediately follows the S, whereas 'S!gn-l2' would be happily matched. But 'Signal: 5', with the digit missing before the colon, would be replaced with 'Signal5'. And if the input were one continuous character vector instead of a cell array of character vectors or a string array, then \D+ would happily cross line boundaries to find the digits it expects. For example, 'Signal?: Nan\nSignal1: 5' would be replaced by 'Signal1: 5', swallowing the first line, because as far as \D+ is concerned, newline is a valid non-digit character. But as you can see, the code is shorter, and sometimes the variations to be matched are well-defined enough that you can get away with it.
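To make the differences between these patterns concrete, here is a small sketch that runs Cris's pattern and the three variations on a few made-up labels. The sample strings and the variable names (samples, vc, v1, v2, v3) are illustrative assumptions, not part of the original data. The expected results in the comments come from reasoning through the patterns, not from the original thread.

```matlab
% Hypothetical edge cases, one label per cell (not from the measurement data)
samples = {'S1gn_l2: 7'; 'S!gn-l2: 7'; 'Signal: 5'; 'Sgnal3: 4'};

% Cris's pattern: only labels ending in 1: or 2: are repaired
vc = regexprep(samples, '\<S\w*(?=1:|2:)', 'Signal');
% vc: {'Signal2: 7'; 'S!gn-l2: 7'; 'Signal: 5'; 'Sgnal3: 4'}

% Variation 1: capture the digits and put them back after 'Signal'
v1 = regexprep(samples, '\<S\w+(\d+):', 'Signal$1:');
% v1: {'Signal2: 7'; 'S!gn-l2: 7'; 'Signal: 5'; 'Signal3: 4'}

% Variation 2: lookahead for digits plus colon, digits are not consumed
v2 = regexprep(samples, '\<S\w+(?=\d+:)', 'Signal');
% v2: same as v1

% Variation 3: any run of non-digits after S, whatever the characters are
v3 = regexprep(samples, '\<S\D+', 'Signal');
% v3: {'S1gn_l2: 7'; 'Signal2: 7'; 'Signal5'; 'Signal3: 4'}
```

Note how the third variation is the only one that repairs 'S!gn-l2', but it is also the only one that damages 'Signal: 5'.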



Walter Roberson on 14 Apr 2021
If you are dealing with a text file, I would suggest rewriting in terms of regexp() with named tokens
S = sprintf('Signal1: 5, Signal2: 6.2\nSignal1: 6e-3, Signal2: 5\nSinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3')
S =
    'Signal1: 5, Signal2: 6.2
     Signal1: 6e-3, Signal2: 5
     Sinal1: 8, Signal2: 5
     Signal1: 10, Sigal2: 3'
parts = regexp(S, 'Sig?n?al1: (?<s1>[\d.eE+-]+), Sig?n?al2: (?<s2>[\d.eE+-]+)', 'names')
parts = 1×4 struct array with fields:
    s1
    s2
s1 = str2double({parts.s1})
s1 = 1×4
5.0000 0.0060 8.0000 10.0000
s2 = str2double({parts.s2})
s2 = 1×4
6.2000 5.0000 5.0000 3.0000
Your example only shows integer values. If that is all that is permitted, then change the
[\d.eE+-]+
to
\d+
The version I coded permits positive and negative values and decimals and exponentiation using either 'e' or 'E' ... but does not permit complex numbers.
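As a sketch of the integer-only variant described above, the same named-token approach can be run on the integer data from the question. The input string here is rebuilt from the question's four lines and is an assumption about what the file contents look like. Note that 'Sig?n?al' only tolerates a missing 'g' or a missing 'n', so other misspellings would not be matched.

```matlab
% Integer-only data, joined with newlines as if read from a text file
S = sprintf('Signal1: 5, Signal2: 6\nSignal1: 6, Signal2: 5\nSinal1: 8, Signal2: 5\nSignal1: 10, Sigal2: 3');

% \d+ accepts unsigned integers only; g and n are each optional in the label
parts = regexp(S, 'Sig?n?al1: (?<s1>\d+), Sig?n?al2: (?<s2>\d+)', 'names');

% Convert the captured text to numeric row vectors
s1 = str2double({parts.s1});   % [5 6 8 10]
s2 = str2double({parts.s2});   % [6 5 5 3]
```

With \d+, a line containing a negative or decimal value would not produce a match for that line, so it is worth comparing numel(parts) with the expected number of lines.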
