Operations on variables with specific naming patterns

I am reading in a large number of text files. The files contain numerical values that I want, and text that I do not. Using the matlab.lang.makeValidName command, I am able to save the numbers into arrays with names like:
  • A123
  • B123
  • C123
  • A234
  • B234
  • C234
  • A345
  • B345
  • C345
  • ...
(It's a lot more complex than this in reality. The variable names are a combination of the filename from which the data was read and the name of the values from the file...but let's try to keep it simple in the example! :)
What I am trying to do now is to run calculations on each of the variables with "A" in the title. Using who('-regexp','A') I get a cell that contains the names of all of the variables in my workspace with "A" in the title, but I can't quite figure out what to do next with that data.
If I wanted to add all of the variables with A in the title, what would the proper command be? Likewise, if I wanted to create a much larger matrix of [A123 ; A234 ; A345 ...] what would that command be? The sizes of all of the "A" variables are the same, so there's nothing to worry about there.
Thanks for the help!

7 Comments

See TUTORIAL: Why Variables Should Not Be Named Dynamically (eval). After reading this tutorial, are you sure you want to proceed with your current design?
I hadn't seen that tutorial before, per isakson. If it makes you feel any better, this was the first time I'd tried to use eval in code that I'd written to create variable names on the fly. I can certainly try to figure out a different way to do what I need to do.
What I did originally was read each of the files I was analyzing into a cell. Then I searched the cell for the variable names that I needed, and saved those off as char.
The next step was going through and looking for the lines of numbers that corresponded to each group of variable names. Any line that started with a (1) corresponded to the first set of variable names, (2) corresponded to the second set, and so on... The number of sets varied with the input file, as did the number of variable names in each set. Do a str2num on each of those lines to save them into an array, and then save each column of that large array into a separate vector.
Once that's done with the first file, proceed onto the second file. Continue doing this until each output file is analyzed. Each file saves down the data [A123 B123 C123 ...] [A234 B234 C234 ...] [A345 B345 C345 ...]
What I actually need to do is manipulate and write out the data in the groups [A123 A234 A345 ...] [B123 B234 B345 ...].
Working with cells has always been one of my weakest areas in MATLAB, so maybe this is a chance to get a lot better with that!
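The read-and-regroup procedure described above could be sketched roughly as follows. The filename, the "(1)"-style line prefixes, and the assumption that the number lines are str2num-friendly are all taken from the description, not from a tested parser:

```matlab
% Rough sketch of the parsing described above. The filename, the "(1)"
% line prefix, and str2num-friendly number lines are assumptions taken
% from the description, not a tested parser.
txt   = fileread('output_file.txt');          % whole file as one char vector
lines = strsplit(txt, '\n');                  % one cell per line
sets  = {};                                   % sets{k} collects set k's rows
for k = 1:numel(lines)
    idx = sscanf(lines{k}, '(%d)');           % leading set number, if any
    if ~isempty(idx)
        row = str2num(regexprep(lines{k}, '^\(\d+\)', '')); %#ok<ST2NM>
        if idx > numel(sets), sets{idx} = []; end
        sets{idx} = [sets{idx}; row];         % append this row to its set
    end
end
```

Each `sets{k}` then holds one 2-D array per variable set; the columns can be regrouped afterwards without ever naming a variable dynamically.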
per isakson
per isakson on 13 Jul 2017
Edited: per isakson on 13 Jul 2017
Yes, your answer makes me feel better :-) And not only me. Regulars at this forum have devoted a lot of time to discourage the use of eval.
I don't fully understand your problem and would like to pose some questions.
Regarding "large number of text files"
  • Do all files have identical format, i.e. only the actual numbers differ?
  • Would it simplify the analysis if the data from all files were in memory at the same time? Or is it possible to process one file at a time?
  • Does the data from all files fit in memory (and leave enough memory for the analysis)?
All the arrays, A123,B123,... do they have the same size? And what size are they?
"[A123 B123 C123 ...] [A234 B234 C234 ...] [A345 B345 C345 ...]" Does this denote three 3D arrays?
I appreciate the help. I'll try to answer your questions as best I can.
The files are all text. The formats are similar, but not quite identical depending on who generated them. Sometimes, the data is stored in the order ABC. When someone else runs these cases, their input files may create the output in the order BAC or CBA. Fortunately, the variable names and the order in which they are generated are stored in the text file itself, so I can do a find to figure out what order the variables are in. The variable names are consistent, which is why (initially) I thought it would be a good idea to dynamically name them as varName1, varName2, etc... I have recently been shown the error of my ways. :-)
The arrays for each variable name have different sizes. A might have seven values, while B has 12 and C has five. The number of values, as well as the second dimension of the matrix, are consistent throughout the analysis, regardless of the person who generated the data files. When finally generated, the arrays are 2-D. I store each row in a 1-D vector, and then reorganize them into the groups I need. Reading your comments, I suppose I could just make one larger 3-D array for all of the A values instead of several 2-D arrays for the A values in each file, and then split them up that way.
The original data files themselves are each on the order of 30-40 MB, and I analyze them in groups of 60, so figure out about 2 GB for reading all of the data from a group of files in at once. The actual data that I need from the file would probably take up about an extra 50%, so figure about 3 GB of data total. That's a somewhat larger data set than I usually deal with for MATLAB cases, but not so large that I can't load it into memory on my current PC. Despite the sheer size of the data set, would just LOAD ALL THE FILES be the way to go about avoiding the eval issue?
Stephen23
Stephen23 on 17 Jul 2017
Edited: Stephen23 on 20 Jul 2017
" Fortunately, the variable names and the order in which they are generated are stored in the text file itself"
In general store your data in the simplest arrays possible. In practice this means that you should put them into one numeric array if possible, or you could consider a cell array if sizes/classes are different between data sets, or a structure if the data names are significant.
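One possible shape for this advice is a struct array whose fields hold both the values and the metadata. The field names and array sizes below are invented for illustration:

```matlab
% A struct array where the group letter and id are ordinary data,
% not part of any variable name (field names and sizes are invented).
S(1).name = 'A';  S(1).id = 123;  S(1).values = rand(7, 4);
S(2).name = 'B';  S(2).id = 123;  S(2).values = rand(12, 4);
S(3).name = 'A';  S(3).id = 234;  S(3).values = rand(7, 4);

isA    = strcmp({S.name}, 'A');            % select by metadata, not by name
Astack = vertcat(S(isA).values);           % [A123 ; A234 ; ...] as one matrix
Atotal = sum(cat(3, S(isA).values), 3);    % element-wise sum of the A arrays
```

The `cat(3, ...)` sum works here because, as stated in the question, all the "A" arrays have the same size.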
"just LOAD ALL THE FILES be the way to go about avoiding the eval issue?"
The question is not clear: the problems with eval are not because of lack of memory. And loading data can be done as you wish: badly into dynamically named variables, or neatly into simpler arrays or structures, so what difference would loading all of the files make?
"The variable names are consistent" What does that mean? Do all files contain exactly the same variables?
"The next step was going through and looking for the lines of numbers that corresponded to each group of variable names. Any line that started with a (1) corresponded to the first set of variable names, (2) corresponded to the second set, and so on.."
This indicates that the variable names contain metadata. Note that metadata is data. Data should be stored in variables, not in variable names. Once you put your metadata into variables, accessing and processing that metadata will be much faster and more efficient than any hack code you could come up with that accesses the variable names. Read the tutorial linked in the first comment (Why Variables Should Not Be Named Dynamically) carefully.
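As a sketch of "metadata stored in variables, not in names", the same idea can also be expressed with a table. The file names, variable names, and sizes below are all made up:

```matlab
% Metadata (source file, variable name, id) kept as ordinary table
% columns next to the numeric values (contents are illustrative only):
T = table({'run1';'run1';'run2'}, {'A';'B';'A'}, [123;123;234], ...
          {rand(1,7); rand(1,12); rand(1,7)}, ...
          'VariableNames', {'file','name','id','values'});

Arows = T(strcmp(T.name, 'A'), :);   % query the metadata columns directly
Adata = vertcat(Arows.values{:});    % stack the matching value vectors
```

Selecting "every A variable" is then a query on a data column instead of a pattern match on workspace variable names.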
Stephen,
Thanks for your input. My "LOAD ALL THE FILES" comment was the easy way of saying "load all of the data in the files, put them into one really big cell, and manipulate them from there" which matches up with your "you should put them into one numeric array if possible" comment.
The files do contain the same variables, but they may be in a different order. Some files may have them stored in ABC, while others generated by a different person may have them BAC or CBA. This is why I can't just say "Take every third row in each of the files, and make that Array1."
The dynamically naming structure fields looks helpful. https://blogs.mathworks.com/loren/2005/12/13/use-dynamic-field-references/ in particular looks like something that I could use, and still (presumably) practice good coding and avoid the evil eval command.
In my opinion, dynamically named fields are just as bad as eval. You're still encoding metadata in variable (field) names, and this is not the way you should solve your problem.
There are two orthogonal issues at hand:
  • Parsing of the files, so that whatever order the variables come in, you know what they are
  • Storing of these variables and storing of the metadata
The first one can be solved in any number of ways, with more or less complexity depending on how robust you want your parser to be.
For storage, dynamically named anything is not a good idea. If speed is the focus, then as Stephen said the simplest storage is best: matrices or cell arrays. Otherwise, you could go fancier with maps (unfortunately, not very well implemented in MATLAB) or other containers.
Certainly when I see that you want to have variables A123, B123, A234, etc, what I read is that you need
container(A, 1, 2, 3) %where A could be categorical
container(B, 1, 2, 3)
container(A, 2, 3, 4)
It is then trivial to get all 'A' variables
container(A, :, :, :)
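One concrete reading of the `container(...)` pseudocode above is a `containers.Map` keyed by the name string. The keys and array sizes here are illustrative only:

```matlab
% containers.Map keyed by the variable name; the metadata lives in the
% key strings as queryable data (keys and sizes are illustrative).
m = containers.Map();
m('A123') = rand(7, 4);
m('A234') = rand(7, 4);
m('B123') = rand(12, 4);

k      = keys(m);                     % all stored names
Akeys  = k(strncmp(k, 'A', 1));       % names beginning with 'A'
Avals  = values(m, Akeys);            % cell array of the 'A' matrices
Astack = vertcat(Avals{:});           % stacked into one tall matrix
```

Getting "all the A variables" is then a `strncmp` over the key list rather than an introspection of the workspace.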

Sign in to comment.

 Accepted Answer

per isakson
per isakson on 16 Jul 2017
Edited: per isakson on 29 Jul 2017
It's easier to say that dynamically named variables and eval are not a good idea than to make recommendations regarding a design. Nevertheless, I'll make some comments in the order they come to mind.
  • The requirements on your program will depend on whether you yourself will use the program a number of times during the next few weeks, or whether the program will be used by more users over a longer period of time.
  • You describe "groups of 60 files", file sizes of "30-40 MB" and 1-D arrays with lengths of 5, 7, and 12 elements. That makes a few zillion 1-D arrays to keep track of. Would mistakes be costly?
  • "The files contain numerical values that I want, and text that I do not." The text contains the metadata.
  • Do you foresee other analyses on these data?
  • Each value of a cell or structure array has an overhead of a little over 100 bytes. That could add up.
num = magic( 7 );
cac = num2cell( num, 2 );
whos num cac
  Name      Size            Bytes  Class     Attributes

  cac       7x1              1176  cell
  num       7x7               392  double
cac{3,1}
ans =
46 6 8 17 26 35 37
  • I often chose structure arrays over cell arrays because they help me make the code more readable. Field names are more meaningful than numbers.
master_list = {'A123','B123','C123','A234','B234','C234','A345','B345','C345'};
isA = not( cellfun( @isempty, regexp(master_list,'^A\d{3}$','match'), 'uni',true ) );
A_list = master_list( isA );
for item = A_list
    num = S.(item{:});
end
isA = strncmp( master_list, 'A', 1 );   % simpler alternative to the regexp above
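Building on the loop above: assuming the data was read into a scalar struct S with fields S.A123, S.A234, and so on (S itself is hypothetical here), the two operations asked about in the question become short expressions:

```matlab
% Assumes a scalar struct S with one field per data array, e.g.
% S.A123, S.A234, S.B123, ... (S is hypothetical in this sketch).
names  = fieldnames(S);
A_list = names(strncmp(names, 'A', 1));           % fields starting with 'A'
vals   = cellfun(@(f) S.(f), A_list, 'uni', false);

Atotal = sum(cat(3, vals{:}), 3);     % element-wise sum of all the A arrays
Astack = vertcat(vals{:});            % the [A123 ; A234 ; A345 ...] matrix
```

Both rely on the fact, stated in the question, that all the "A" arrays have the same size.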

More Answers (0)
