Relatively complex matrix indexing question

Hi,
I have a reasonably complex matrix indexing question that I'm struggling to solve. I have an nxn matrix of float values. The matrix is fairly large (n~30000). The matrix is upper (or lower) triangular with the other values being NaN. Each of the rows represented has a vector associated with it, so the matrix rows has a secondary matrix associated with it that is nxk, where each row has k characteristics. Row 1 has the same characteristics as column 1, row 2 has the same characteristics as column2, etc.
I want to plot a histogram of all of the elements in the nxn matrix whose rows/columns match certain characteristics.
To illustrate, let's say I had the following 4x4 value matrix
Values = [...
NaN 4.0 3.1 2.3
NaN NaN 1.2 5.2
NaN NaN NaN 7.1
NaN NaN NaN NaN]
Each row/column has three characteristics and this characteristics matrix has the values
Charac = [...
4 2 3
1 4 5
-3 5 1
10 6 2]
I only want to get histogram elements for rows/columns that have the second characteristic greater than 1 and the third characteristic less than 4. In this case only the first, third and fourth rows would meet the criterion and the values that I would take from the matrix Values are 3.1, 2.3 and 7.1.
Is there a compact and computationally efficient way to logically index this?
Thanks in advance!

8 Commenti

Why "the values that I would take from the matrix Values are 3.1, 2.3 and 7.1."?? How is the result related to "the first, third and fourth rows would meet the criterion"?
Hi Sara,
So since the 1st, 3rd and 4th rows are "valid" from the characteristics matrix criteria, the values that we want to take from the Values matrix would be all valid non-repeating two-element combinations of these numbers, i.e., (1,3), (1,4), and (3,4). The values corresponding to these indices are 3.1, 2.3 and 7.1, respectively.
4.0 at (1,2) is also in the first row. Why is it not to be selected?
Cedric
Cedric il 10 Giu 2014
Modificato: Cedric il 10 Giu 2014
Your logic for defining columns is still unclear to me. You get valid rows, and then you take all possible pairs of these row IDs which define rows/columns in the upper triangle? Your initial matrix takes ~7.2GB, which could indicate that there is an issue with the design of the computation. If not, you may want to profile both the dense and the sparse versions of this computation. How large is k by the way?
Hi Cedric,
I actually will probably scale the first matrix so that it is a uint8 so that the size of the matrix is more manageable. k is about 10-15. Unfortunately, the matrix is not sparse.
Cedric
Cedric il 10 Giu 2014
Modificato: Cedric il 10 Giu 2014
Not sparse, not sparse, I'd say that it is half full at least ;-) I still don't get your columns selection; is it what I describe above? Seeing Image Analyst's solution, it seems that your statement is not clear to the two of us. Back to sparsity, uint8 is a great solution but many built-ins won't support it, so you'll see how far you go with that. If you are stuck with doubles, storing your large matrix as sparse would divide the size in memory by a little less than 2, and building it originally as sparse might be simpler than building is as dense (it happens sometimes, as you can build sparse matrices using vectors of row and column indices, as well as a vector of values).
So the columns and rows are selected by the same criteria, so if the valid indices for the rows are 1,3 and 4, the only valid (row,column) combinations are (1,3), (1,4) and (3,4). In other words, column 2 and row 2 get selected out. I will look into using sparse matrices. My understanding was that since the memory had to store the value and the indices of each element that you only save memory if the matrix has much more than 50% zeros. I would have to have a uint16 values to store the indices of the matrix (so row and column would mean an extra 4 bytes), whereas each element is only 1 byte. So I think the fraction of zeros would have to be something greater than 80% to save any memory.
See my answer below.

Accedi per commentare.

 Risposta accettata

Cedric
Cedric il 10 Giu 2014
Modificato: Cedric il 10 Giu 2014
You are correct about sparse matrices, and I had completely misunderstood your setup! So the lower triangular part is NaN, but the upper part has a density of 1(?) If so, there is no point in using sparse matrices. Reading your last explanation, I think that what you are looking for is along the line of what follows.
EDIT - I provide a second approach (B), which could work if you had only non-NaN elements in a vector. It is just a quick test though and it would have to be tested/profiled.
values = [NaN 4.0 3.1 2.3; ...
NaN NaN 1.2 5.2; ...
NaN NaN NaN 7.1; ...
NaN NaN NaN NaN] ;
charac = [4 2 3; ...
1 4 5; ...
-3 5 1; ...
10 6 2] ;
%%%A. Values is square num. array with NaNs.
% - Find valid row IDs.
isValid = charac(:,2) > 1 & charac(:,3) < 4 ;
validId = find( isValid ) ;
% - Build all possible row/col pairs.
validRC = nchoosek( validId, 2 ) ;
isValid = validRC(:,2) > validRC(:,1) ;
validRC = validRC(isValid,:) ;
% - Extract relevant values.
validId = sub2ind( size(values), validRC(:,1), validRC(:,2)) ;
extract = values(validId) ;
%%%B. Values is lin num array with non-NaNs only.
% - Build what could be values if you were not storing NaNs. Makes little
% sense to build it this way (by creating the large, square array and
% reducing it), but maybe you can build this vector directly instead
% of the array.
valuesLin = values(~isnan(values)) ;
% - Build sub2linInd function.
cSum = [0, cumsum( 1 : (size(values,2)-1) )] ;
sub2linInd = @(row, col) row + cSum(col-1).' ;
% - Extract relevant values.
validId = sub2linInd( validRC(:,1), validRC(:,2) ) ;
extractLin = valuesLin(validId) ;

7 Commenti

I wonder if I need to do the nchoosek operation or will the sequence of statements by Image Analyst do the trick (nchoosek does make sure we don't get the diagonal elements, but since the diagonal elements are NaN anyways, it could be filtered out isnan)?
I'm not sure but I'll do some tests with some larger matrices to see if one of the approaches misses some elements (or if they do both catch all of the elements, which one is more computationally efficient).
If I understood well your statement, Image Analyst's solution will return unwanted elements, typically Values(1,2) in the example.
Image Analyst's original solution would do that, but I believe my edit replacing one of his lines with extractedRows = Values(rowsToExtract,rowsToExtract), eliminates that issue.
I think things work but I realized that uint8 doesn't support NaN, so now I'll have to index such that only when the column index is higher than the row index since I can't use isnan. So I might have to use your kind of solution since I'm not sure how I could get Image Analyst's solution to work easily with that.
Cedric
Cedric il 10 Giu 2014
Modificato: Cedric il 10 Giu 2014
Are there zeros in the upper part? I you scale and/or map the floats onto a discrete set of numbers, e.g. uint8, you can reserve one value that you would use the way Image Analyst uses NaNs. Name it u (e.g. u=0 or u=255), and the call to ISNAN can just be replaced by
extractedValues = extractedRows(extractedRows ~= u) ;
Unfortunately there are zeros, but I was thinking perhaps it's not a big deal. Your solution A I think would work. But do I need these two lines?
isValid = validRC(:,2) > validRC(:,1) ;
validRC = validRC(isValid,:) ;
Won't validId already be in ascending order and then won't nchoosek already generate the pairs such that the first column is always less than the second column?
Cedric
Cedric il 11 Giu 2014
Modificato: Cedric il 11 Giu 2014
Yes, you are absolutely right, I thought that it was "with replacement" and it is not (sorry, I should have checked!).
How are you building the first large matrix (?), would it be difficult to build directly a vector of non-NaN values ( double or even uintx) and avoid building the large array?
If you must build an array, could you use single elements instead of double? The single type/class supports NaN.
If you need e.g. uint8 because you are short on RAM, do you really need the 256 possible values; couldn't you, as mentioned above, use e.g. 255 to code for NaN?
All solutions have advantages and bottlenecks: memory for the double/NaN approach, practicality and speed for the nchoosek/non-NaN approach (nchoosek won't work if you have too many rows matching conditions, even with a k as small as 2), etc. I guess that at this stage, we could proceed by elimination: how much RAM do you have available, and how many rows matching conditions could you have?
I'd say that your short term best bet (reasonable memory usage, precision, computation time, and development time) is likely to be Image Analyst's approach, with your modification, applied to a class single Values array.
I think I'll use the uint8 because of RAM (I might need to go to 30000x30000). I have 16 GB of RAM. With the check of making sure the columns are greater than the rows, I think I don't need the NaN on the lower triangular part (since I will ignore any values there anyway). That way I can also use all 0 to 255 values.
Thanks. I think you and Image Analyst have solved my question!

Accedi per commentare.

Più risposte (1)

Try this and see if it's what you want:
Values = [...
NaN 4.0 3.1 2.3
NaN NaN 1.2 5.2
NaN NaN NaN 7.1
NaN NaN NaN NaN]
Charac = [...
4 2 3
1 4 5
-3 5 1
10 6 2]
% Find rows where column 2 > 1 and column 3 < 4
rowsToExtract = Charac(:, 2) > 1 & Charac(:, 3) < 4
% Extract only those rows and no others.
extractedRows = Values(rowsToExtract,:)
% Get the values from that which are not NaN.
extractedValues = extractedRows(~isnan(extractedRows(:)))

3 Commenti

I think this works if I replace your line extractedRows = Values(rowsToExtract,:)
with
extractedRows = Values(rowsToExtract,rowsToExtract)
I'll run some tests and verify...
I don't see why my way would not work. Your suggestion will end up eliminating some columns also, and you didn't say anything about eliminating columns.
So the rows and columns are "equivalent" with the same characteristics. That is why in the example given, only 3.1, 2.3 and 7.1 would survive because the columns are subject to the same constraints as the rows.
So I think your solution works but I had just not explained the problem clearly enough. I will test tomorrow to make sure everything works.
Thanks!

Accedi per commentare.

Richiesto:

Sam
il 9 Giu 2014

Commentato:

Sam
il 11 Giu 2014

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by