Subsets of uncorrelated features

Given a N by N correlation matrix of N features, how to find ALL subsets of pariwise uncorrelated features if we assume two features are uncorrelated if their correlation score is less than a threshold Alpha. There is no restriction on the number of features making the subsets. All features making a subset need to be pairwise uncorrelated.

 Risposta accettata

Jeff Miller
Jeff Miller il 12 Lug 2021
Modificato: Jeff Miller il 12 Lug 2021
N = 5;
R = rand(N); % We will ignore the lower triangular part of this array
rCutoff = 0.4;
% Make a cell array that holds all possible combinations of 2, 3, 4, ... features
combos = cell(0,0);
for i=2:N
iCombos = nchoosek(1:N,i);
for j=1:size(iCombos,1)
combos{end+1} = iCombos(j,:);
end
end
ncells = numel(combos);
% Check each cell to make sure that all of the pairwise correlations are
% less than the cutoff
qualifies = true(1,ncells);
for icell=1:ncells
features = combos{icell};
nfeatures = numel(features);
for ifeature=1:nfeatures-1
for jfeature=ifeature+1:nfeatures
iifeature = features(ifeature);
jjfeature = features(jfeature);
if abs(R(iifeature,jjfeature)) > rCutoff
qualifies(icell) = false;
end
end
end
end

5 Commenti

Tried it but it does not seem to work. You seem to only check the correlations between features of combos with 2 features, but not in combos of 3, 4,...,features:
if abs(R(ifeature,jfeature)) > rCutoff
qualifies(icell) = false;
end
Sorry, my bad. I forgot to pick out the feature numbers from the feature array. Try it again after the edit adding the iifeature and jjfeature lines.
That works. I think one could simplify the outer for loop and avoid two inner loops:
for icell = 1:ncells
features = combos{icell};
if sum(nonzeros(triu(abs(R(features,features)),1)) > rCutoff)
qualifies(icell) = false;
end
end
You may well be right, that but "if sum" line is cognitively impenetrable to me. :)
Thanks for accepting my answer.
I am looking for a pairwise uncorrleation between ALL features iofn the "feature" variable". The line:
sum(nonzeros(triu(abs(R(features,features)),1)) > rCutoff)
will result into a matrix of logical values showing which feature pairs are correlated and which are not (triu is only there to reduce the symmetric correlation matrix). If any of the values of the matrix is true (equivalently, sum of values of the matrix is different from zero), The subset unqualifies as an uncorrelated subset.

Accedi per commentare.

Più risposte (2)

Ive J
Ive J il 11 Lug 2021
Modificato: Ive J il 12 Lug 2021
Let R be the pairwise correlation matrix:
N = 10;
R = rand(N);
R(logical(eye(N))) = 1;
for i = 1:size(R, 1) - 1
for j = i+1:size(R, 1)
R(j, i) = R(i, j);
end
end
disp(R)
cutoff = 0.4; % independent features
idx = R < cutoff;
idx = triu(idx); % R(i, j) == R(j, i) in pairwise correlation matrix
features = "feature" + (1:N); % feature names
% there may be a simpler way to do this
indepFeatures = [];
for i = 1:N
indepFeatures = [indepFeatures, arrayfun(@(x)[x, features(i)], features(idx(i, :)), 'uni', false)];
end
indepFeatures = vertcat(indepFeatures{:});
% find all cliques of this set
nodes = zeros(size(indepFeatures, 1), 1);
[~, nodes(:, 1)] = ismember(indepFeatures(:, 1), features);
[~, nodes(:, 2)] = ismember(indepFeatures(:, 2), features);
G = graph(nodes(:, 1), nodes(:, 2));
M = maximalCliques(adjacency(G));
indepSets = cell(size(M, 2), 1);
for i = 1:numel(indepSets)
indepSets{i} = features(M(:, i) ~= 0);
end
indepSets(cellfun(@numel, indepSets) < 2) = []; % this can be further unified with indepFeatures
You can find maximalCliques in FEX.

12 Commenti

Kais
Kais il 11 Lug 2021
Thanks! This will give pairs of uncorrelated features. What I am looking for is the list of all subsets of 2 or more uncorrelated features. For example, a subset of uncorrelated features could be something like {'feature1', 'feature2', 'features4'} where the feature pairwise correlation (i.e. feature 1 vs feature2, feature2 vs feature3 and feature 1 vs feature4) in the subset is always less than the thershold (0.4 in your exmaple).
@Kais, Why do you need this? What is the use case? What will you do with the information after this? Have you considered principal components analysis?
Kais
Kais il 11 Lug 2021
Feature selection for a clinical application. I am running trials of cluster analyses and I need uncorrelated features to run these analyses. Features need to be interpretable so PCA won't do.
Ive J
Ive J il 11 Lug 2021
@Kais So, have you looked at feature selection? There are quite copule of approaches to do so, especially for clinical/medical applications (I assumed you'll use them in a regression analysis afterwards). For instance, you can use simple F-test approaches like fsrftest, or penalized regression (e.g. lasso or ridge).
Kais
Kais il 12 Lug 2021
I did. I am interested in running unsupervised feature selections (F-test and other methods require a response variable). The ultimate goal is to cluster data using the subsets of uncorrelated features.
Ive J
Ive J il 12 Lug 2021
@Kais So, you can try my modified answer; this should do the job (but note the problem may get complicated with large number of features). But, I'm sure you are aware of the drawbacks of this approach: the simplest scenario would be regression analysis when you don't know which [correlated] features better explain the response variable, and you may incorrectly exclude those features. Say A and B are highly correlated but A is a better predictor of response, but you select B (simply because you don't check the amount of response variance A or B explain).
Kais
Kais il 12 Lug 2021
This is the very reason I am running trials of cluster analyses using multilpe feature subsets. Correlated features will be seperated into subsets of uncorrelated features. Since I am dealing with an unsupervised learning problem, regression analyses aren't an option.
@Ive J I tried your solution on a 10 by 10 validation correlation matrix. The subsets are correct but incomplete. There should be 22 subsets (17 made of pairs and 5 made of triples) of uncorrelated features in the matrix below. Your code leads to 11 subsets (6 paird and 5 triples).
cutoff = 0.4;
R=[
1.0000 0.8735 -0.4498 0.7882 0.8260 -0.3310 0.7521 0.8991 -0.3789 0.7062
0.8735 1.0000 -0.4168 0.9182 0.6789 -0.3210 0.7074 0.8244 -0.5138 0.7961
-0.4498 -0.4168 1.0000 -0.3241 -0.4364 0.7018 -0.3825 -0.3257 0.4335 -0.3814
0.7882 0.9182 -0.3241 1.0000 0.4857 -0.2513 0.6033 0.8936 -0.5378 0.5715
0.8260 0.6789 -0.4364 0.4857 1.0000 -0.2878 0.6495 0.5794 -0.2881 0.8435
-0.3310 -0.3210 0.7018 -0.2513 -0.2878 1.0000 -0.3040 -0.2360 0.2695 -0.2606
0.7521 0.7074 -0.3825 0.6033 0.6495 -0.3040 1.0000 0.6469 -0.3280 0.6357
0.8991 0.8244 -0.3257 0.8936 0.5794 -0.2360 0.6469 1.0000 -0.3919 0.5148
-0.3789 -0.5138 0.4335 -0.5378 -0.2881 0.2695 -0.3280 -0.3919 1.0000 -0.3725
0.7062 0.7961 -0.3814 0.5715 0.8435 -0.2606 0.6357 0.5148 -0.3725 1.0000]
Ive J
Ive J il 12 Lug 2021
Modificato: Ive J il 12 Lug 2021
As I commented on the last line, indepSets and indepFeatures (lenght = 22 with your data) should be merged, and in your example there are 7 (and not 5) triples. So, if you only keep sets with a length > 2, you can then merge this with indepFeatures,which has been already generated:
indepSets(cellfun(@numel, indepSets) < 3) = []; % my original example is < 2
So, as I said there are 7 triples:
"feature1" "feature6" "feature9"
"feature2" "feature6" "feature9"
"feature4" "feature6" "feature9"
"feature5" "feature6" "feature9"
"feature6" "feature7" "feature9"
"feature6" "feature8" "feature9"
"feature6" "feature9" "feature10"
Kais
Kais il 12 Lug 2021
Modificato: Kais il 12 Lug 2021
Got it, thanks. I find 5 and not 7 triples because I am taking the absolute value of correlations:
idx = abs(R) < cutoff;
Kais
Kais il 14 Lug 2021
Modificato: Kais il 15 Lug 2021
@Ive J I tried your code with cutoff = .7. I get 32 pairs, 2 triples, 2 quadruples, and 5 quintuples. While the number of pairs and quintuples seem to be correct, the numbers of triples and quadruples are not. For example, the triple [6, 7, 9] is missing.
There should be a total of 112 uncorrelated features. You code finds 44 only. Any clue why is that?
Kais
Kais il 15 Lug 2021
Never mind the last question. I figured the algorithm finds max cliques so it won't count subgraphs within larger subgraphs ([6, 7,9] won't show up because it's part of the larger subset [4, 6,7,9]), which significantly reduces the number of subsets.

Accedi per commentare.

Image Analyst
Image Analyst il 11 Lug 2021

0 voti

Would stepwise regression be of any help?
Otherwise, just make an N by N table of correlation coefficients by corelating every feature with every other feature.

Prodotti

Release

R2021a

Richiesto:

il 10 Lug 2021

Commentato:

il 15 Lug 2021

Community Treasure Hunt

Find the treasures in MATLAB Central and discover how the community can help you!

Start Hunting!

Translated by