Read csv strings, keep or create surrounding whitespace

6 visualizzazioni (ultimi 30 giorni)
I have a list of stop words that currently exists as a comma-separated list in a .txt file. The goal is to use that list to remove those words from some target text, but only when a given word (e.g. "and") appears by itself - remove "and", but don't make "sand" into "s". To that end, I tried manually putting spaces around all the words in the list, so "a,able,about" became " a , able , about ". However, the txtscan function stripped the spaces out. Is there a way to prevent it from doing that? Alternatively, if I use the original form of the list, can I tell txtscan to surround each string with spaces?
  1 Commento
Cedric
Cedric il 20 Giu 2014
Modificato: Cedric il 20 Giu 2014
Could you give an example, like a sample file, and indicate precisely what you want to achieve? This seems to be a task for REGEXPREP.

Accedi per commentare.

Risposta accettata

Cedric
Cedric il 20 Giu 2014
Modificato: Cedric il 20 Giu 2014
Here is an example that I can refine if you provide more information. It writes some keywords in upper case..
key = {'lobster', 'and'} ;
str = 'Lobster anatomy includes the cephalothorax which fuses the head and the thorax, both of which are covered by a chitinous carapace, and the abdomen. The lobster''s head bears antennae, antennules, mandibles, the first and second maxillae, and the first, second, and third maxillipeds. Because lobsters live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.' ;
for kId = 1 : length( key )
pat = sprintf( '(?<=\\W?)%s(?=(s |\\W))', key{kId} ) ;
str = regexprep( str, pat, upper( key{kId} ), 'ignorecase' ) ;
end
Running this, you get
>> str
str =
LOBSTER anatomy includes the cephalothorax which fuses the head AND the thorax, both of which are covered by a chitinous carapace, AND the abdomen. The LOBSTER's head bears antennae, antennules, mandibles, the first AND second maxillae, AND the first, second, AND third maxillipeds. Because LOBSTERs live in a murky environment at the bottom of the ocean, they mostly use their antennae as sensors.
The REXEXP-based approach makes it possible to code for..
  • only if framed by non alphanumeric characters (e.g. ,),
  • unless following character is an 's',
  • unless at the beginning of the string.
  21 Commenti
Ben
Ben il 23 Giu 2014
Ah, I hadn't realized that regexp functions don't do their work all at once, as stringrep does. That should do it. Thank you so much!
Cedric
Cedric il 23 Giu 2014
Modificato: Cedric il 23 Giu 2014
You're welcome! Note that it could do its job all at once if you were passing a pattern which contains all keywords in an OR operation. Yet, it's often more efficient to apply several times a simple pattern than passing once an extra-long/complex one. That could/should be profiled for your specific case though if you wanted to optimize.

Accedi per commentare.

Più risposte (0)

Categorie

Scopri di più su Language Support in Help Center e File Exchange

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by