delete rows- tanimoto matrix

Question

0 voti

test2.csv

Hi everyone;

with help of some of you, we could write two scripts to delete rows, in tanimoto matrix which we create based on RDKit fingerprints.

the idea was to delete those rows (= molecules) which are similar to another molecules if :

1.tanimoto index > 0.7 but less than 1.
2. the sum of all tanimoto indexes in a row is larger than the compared molecule.

*in the first solution we used for loop, with a vector called "good_ones"; the rows that got zero we delete and we don't compare them again :

   [M,text,alldata]=xlsread('test2.csv'); 
    [r c]=size(M);
    S=sum(M,2); %sum rows
    good_ones = ones(1,size(M,1));
    % loop over rows 
    for row=1:r;
        for col=row+1:c; 
            if good_ones(row)==1
                if (M(row,col) >= 0.7) & (M(row,col) <1.0 ); %if the value between 0.7 to 1 then we compare the sum column
                    if S(row)>S(col) % if the sum of the i line is larger we delete this line
                        good_ones(row) = 0;
                      else % the sum of the j line is larger so we delete the other line
                          good_ones(col) = 0;
                      end
                  end
                  % mark lines for deletion afterwards            
              end
          end
      end
    new_M = M(find(good_ones),:);

in the second script we used the vector way, but in this solution, rows that we compare and mark to delete, are compared again to another molecules, and in this way we loose more molecules,so the question how i can add a condition here like i did previous and check only the rows that i didn't have check before???:*

an example file of the matrix attached

[M,text,alldata]=xlsread('test2.csv.csv') ;% M is numeric date 
                                       % text is the text :-)
                                       % all data is both M + text
[i j]=size(M);
S=sum(M,2);
triM = tril(M); %lower diagonal
[r, c] = find(triM >= 0.7 & triM < 1.0); %find position of all values in range
deleterow = S(r) > S(c); %compare row r with respective row c
%a true in deleterow means delete respective r, otherwise delete respective c
todelete = unique([r(deleterow) ;c(~deleterow)]);
nM = M;
nM(todelete, :) = [];
text2=text(2:i+1,1);
text2(todelete, :) = [];

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

dpb il 7 Dic 2014

Modificato: dpb il 7 Dic 2014

Apri in MATLAB Online

I've not seen any previous thread that this looks familiar to so I'm coming in cold and I have absolutely no klew what a "tanimoto" matrix is altho I gather it's a classification statistic of some sort so these are perhaps naive queries, but--

At least with the file test2.csv you linked to, the sum already is in the file as the last column so doesn't the

S=sum(M,2);

double it over what should be? Or, does your actual data file not include the sum column?

Also, should the sum actually be the full row sum since it is a symmetric array or should it not be the sum of the elements in the upper (or lower) triangular array? And, should the sums actually include the diagonal 1 value?

With those questions resolved, does the first script produce the desired output?

I'm sure a vectorized version can be written; I think I'd do it just slightly differently than the author of the above but similar idea; just need the actual definition of right/wrong calculation clarified before spending any time writing code. Again, don't know if I'm right or totally off-base, but seems to me you'd want

triM=tril(M,-1);      % the lower triang excluding diagonal
S=sum(triM,2);        % sums w/o the diagonal and not double-counting

The above assumes also that M doesn't already include a summation column.

dpb il 8 Dic 2014

Apri in MATLAB Online

That's a start altho I'm not sure what you mean precisely by "we want the sum of the whole row -1". If as suggested you pick up only the upper/lower triangular and sum that there is no one included in the summation to begin with as the diagonal elements aren't included. Or is there some other reason that one wants S-1 "like it's a a DOF adjustment or the like?"

Altho, on reflection, it looks like the magnitude of the sums doesn't matter as it's only relative so including the 1 or not makes no difference. The doubling of the symmetry could affect the ordering, however.

Also, explain more precisely the desired logic of which are the desired results -- I don't quite follow the reasoning behind the column vs the row marking in the looping solution. The block

if S(row)>S(col) % delete this line
  good_ones(row) = 0;
else % the sum of the j line is larger so we delete the other line
  good_ones(col) = 0;
end

doesn't make sense to me. The else clause is executed for S(row)<=S(col), not just S(row)<S(col). While it may not matter that much, there is an equality condition at least theoretically possible. What should happen then?

Actually, on yet further consideration, I think this is a case that the sequential solution via the loop may well be the optimal as you are eliminating rows from further consideration whereas the vector solution will operate over the entire array.

I'd have to think about the equivalent vectorized solution and then one would have to time the two for representative datasets to see about performance. How large a set would be typical and what kind of run times do you see?

Shayma il 9 Dic 2014

Apri in MATLAB Online

Hi, each row contains a compare for one molecule with the whole data, so we need the sum of the whole row, minus 1 its the compare to the molecule itself = diagonal, so we can - 1 or not it won't change a lot! but if we take only the upper part- for the last molecule it will appear only one number- one compare, that's why we take the whole row! hope now it's clear.

**in the case of S(row)=S(col), we have to delete one of them, in this case it will be the second molecule, maybe here we can add an additional condition to decide, yet that's how we do it!

we change also a little bit in our script :) and now it double check the molecules , before continue in the loop:

   % loop over rows 
  for row=1:r-1; %last row will be=1 only
      if good_ones(row)==1;
          for col=row+1:c;  % go over upper triangle only !
              if good_ones(col)==1;
                  if (M(row,col) >= 0.7); %if the value >= 0.7 then we compare the sum vector
                      if S(row)>S(col); % if the sum of the row line is larger we delete this line
                          good_ones(row) = 0;
                          break % & we don't continue comparing this line
                      else % the sum of the col line is larger so we delete it
                          good_ones(col) = 0;
                      end
                  end
              end
          end
      end
  end

maybe it won't matter in time if i have only hundreds of molecules, but if you have time i would like to see your solution :)

many thanks for your time

dpb il 9 Dic 2014

OK, I do now grok the summation and yes, the "1" is immaterial as long as it's simply a ranking and not an absolute value.

"...them, in this case it will be the second molecule, maybe here we can add an additional condition to decide..."

In that case, change the > to >= and kill the row that lets you abort the loop earlier. It seems that one still needs to finish out the row otherwise.

I won't have time until perhaps the weekend (and can't promise that, even) to devote to trying the vector solution but given the condition of the decision process I don't think it's likely to be a case where it will save a lot of time. With the JIT compiler, loops aren't always the overhead they used to be. Unless you actually run into severe computational bottlenecks I'd suspect your best utilization of time/resources is to just "get on with it" at this point.

dpb il 9 Dic 2014

Apri in MATLAB Online

Another possible modification that might save some time would be to test whether there are any more on the row greater than the limit on the alternative test instead of continuing the loop on a "by one" step.

Oh, alternatively, instead of the inner loop, use a logical array test on the row at once...

if ~any(M(row,:)>0.7)  % this row passes
  break
else
  % deal with them here...
end

Accedi per commentare.

Accedi per rispondere a questa domanda.

Follow Question

Answer 1

dpb il 10 Dic 2014

Modificato: dpb il 11 Dic 2014

Apri in MATLAB Online

1 voto

"...but if you have time i would like to see your solution :)"

Well, I've not tested this extensively but it does reproduce your script results for the sample dataset --

Presuming

a) S is the summation column vector, and
b) M is the tridiagonal form w/o the diagonal
>> M=M>0.7;                                       % logical array
>> R=squareform(pdist(S,@(x,y) rdivide(x,y)))>1;  % sum ratios to match
>> good = ~any(M&R,2) & ~any(M&~R,1).';
>> all(good==Good)
ans = 
    1

1 Commento
Mostra -1 commenti meno recenti Nascondi -1 commenti meno recenti

dpb il 10 Dic 2014

Modificato: dpb il 11 Dic 2014

Apri in MATLAB Online

From the command line session I used to debug...

>> [~any(MR&R,2)  ~any(MR&~R,1).' [~any(MR&R,2) & ~any(MR&~R,1).'] Good]
ans =
   1     1     1     1
   0     1     0     0
   1     1     1     1
   0     0     0     0
   1     1     1     1
   1     1     1     1
   1     0     0     0
   1     1     1     1
   1     1     1     1
>> all(ans(:,3)==ans(:,4))
ans =
   1
>>

Good is a local variable from the output from your script that I saved for comparison purposes. MR is MR=M>0.7 logical array; while testing I kept the second variable but there's no need in the actual code it would seem.

As noted above, I have no extensive debugging here; it just seems like the correct logic to match up the rows/columns that are the result of the if...else...end clause but that's as far as I went in depth.

You might want to reverse some of the logic so it's not a negative in the end...that is particularly M<0.7 instead of M>0.7; the sum ratio is a wash. Altho it occurred to me that perhaps if kept the full array M (excluding the diagonal, of course) that perhaps R and 1/R for the two triangulars just might simplify it but again I didn't pursue that thought for lack of time to devote to the problem.

One last thought if you have a problem with this back on the looping solution. One could potentially do something like

pool=find(any(M>0.7,2)).';
for r=pool  % only rows with high magnitude, not all
  ...

How much difference, if any, one might see would be highly dependent upon the dataset I think.

Also, however, one optimization that should make some difference is to convert the M array to the ratio before the loop instead of doing the calculation every pass...

M=M>0.7;
for r=1:nR
  ...
  for c=r+1:nC
    if M(r,c)  % logical flag check instead of computation
      ...

I don't know if it would make any real difference in timing but I'd think the conversion I showed earlier of using a logical array instead of the 0,1 numeric and explicit comparison might save a few cycles...if nothing else it saves a find when you're done because can use the logic array directly as in indexing expression instead of finding the positions.

Accedi per commentare.

delete rows- tanimoto matrix

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

Risposta accettata

1 Commento
Mostra -1 commenti meno recenti Nascondi -1 commenti meno recenti

Più risposte (0)

Categorie

Tag

Community Treasure Hunt

delete rows- tanimoto matrix

7 Commenti Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

Risposta accettata

1 Commento Mostra -1 commenti meno recenti Nascondi -1 commenti meno recenti

Più risposte (0)

Categorie

Tag

Vedere anche

Community Treasure Hunt

7 Commenti
Mostra 5 commenti meno recenti Nascondi 5 commenti meno recenti

1 Commento
Mostra -1 commenti meno recenti Nascondi -1 commenti meno recenti