I have a 45x484 matrix, but when I compute coeff with the pca function I get a 484x44 coeff matrix, which causes errors in biplot. Why do the rows and columns switch places?

%from struct to matrix using function
T1 = createDataMatrix(REC);
x=ismissing(T1);
y=any(x,1);
z=T1(:,~y);
a=z;
% scaling data for each column using standardised Z
ZM=zscore(a);
ZM=ZM-mean(ZM);
%PCA using Matlab built-in function
[coeff,score,latent,~,explained,~]=pca(ZM);

Accepted Answer

The answer lies in your data, not in MATLAB's pca function. You have 45 observations of 484 variables, so after centering (which you have already done) there are only 44 degrees of freedom, and that is the maximum number of PCs with nonzero variance. Look at the total variance explained and keep the PCs that explain most of it (say 90%); I highly doubt that the number of PCs needed to explain that much variance would exceed half the number of variables in a real-world case (though I admit it depends on the nature of the problem).
Bottom line: the pca function works just fine.
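
As a rough illustration of the point above, the following sketch runs pca on random data with the same shape as in the question. The data, the seed, and the variable name X are assumptions made for the example; only the pca call and its output sizes matter here.

```matlab
rng(0);                              % fixed seed so the sketch is repeatable
X = randn(45, 484);                  % hypothetical data: 45 observations, 484 variables

% pca treats rows as observations and columns as variables
[coeff, score, latent] = pca(X);

disp(size(coeff))                    % 484 x 44: one row per variable, one column per PC
disp(size(score))                    % 45 x 44: one row per observation
disp(numel(latent))                  % 44 = 45 - 1 components with nonzero variance
```

So the rows of coeff always correspond to the variables (columns) of the input, and the number of components is capped at one less than the number of observations when there are fewer observations than variables.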

13 Comments

Hi, thank you so much for the comment. I am totally new to MATLAB.
Actually, I have 44 variables and 484 observations, which I extracted from struct arrays using the 'interp1' function; that is how the matrix ended up in this shape. I just tried transposing T1 into a 484x44 matrix and running it again, and the coeff matrix became 23x23.
Now it gives me the error below, because my 'vbls' has 44 variable labels, which causes a mismatch.
Is there a way to find out which of the original 44 variables correspond to the 23 variables now plotted from the 'coeff' and 'score' values in the new PC dimensions, so that I can label them accordingly? Thanks!
Error using biplot (line 166)
'VarLabels' value must be a string array, a character array, or a cell array of character vectors
with one label for each row of the coefficients matrix.
Error in testrun2 (line 29)
h=biplot(coeff(:,1:2),'Scores',score(:,1:2), 'VarLabels',vbls);
So, if ZM is 45X484, what's the size of coeff you get from this?
[coeff,score,latent,~,explained,~]=pca(ZM.');
TBH I don't understand why you get a 23x23 matrix for coeff, and that's where the error lies: VarLabels must contain one label per row of coeff, i.e. size(coeff, 1) labels.
size(coeff, 1) == numel(vbls) % must be true for biplot to work
MATLAB documents that if you do pca(X) and X is a 2D matrix, then (in the usual case where X has more rows than columns) you will get out an array that is size(X,2) by size(X,2). So if you got out a 23 x 23 matrix, that implies you passed a matrix with 23 columns to pca().
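
A quick way to see the mismatch is to compare the number of rows of coeff with the number of labels before calling biplot. This is only a sketch: the random data and the label names are made up, and the sizes (23 surviving variables, 44 original labels) are taken from the numbers mentioned in this thread.

```matlab
rng(2);                              % fixed seed for a repeatable sketch
ZM   = zscore(randn(487, 23));       % hypothetical cleaned data: 23 variables survive
vbls = "var" + (1:44);               % hypothetical labels: still one per ORIGINAL variable

coeff = pca(ZM);                     % 23 x 23 here, since there are more rows than columns

fprintf('coeff has %d rows, vbls has %d labels\n', size(coeff, 1), numel(vbls));
labelsMatch = size(coeff, 1) == numel(vbls);   % false: biplot would reject these labels
```

If labelsMatch is false, the labels need to be pruned with the same index that was used to drop columns from the data.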
Hi Ive, sorry, 45x484 is my T1 after transposing it. Then, when I run zscore, it becomes a 487x23 matrix, which is the variable ZM. I would like to know which variables these 23 are, and their labels, because I initially had 44 variables.
Is there a function that can track the changes in the variables, so that I can make a new vbls with only the 23 relevant names?
Regards,
Hi Walter, thanks for the input. It seems my coeff matrix size is correct now, as I got a size(X,2) by size(X,2) matrix.
Then when I run the zscore and it became 487x23 matrix...
How does this happen?
Also, note that centering the data again after zscore is pointless: zscore already centers and scales each variable, so subtracting the mean a second time changes nothing.
% ZM=zscore(a);
% ZM=ZM-mean(ZM); % ?? meaningless
x = randi([1, 100], 4, 1);
zx = zscore(x);
all(zx == (zx - mean(zx)))
ans = logical
1
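
For a matrix, an exact == comparison may fail because of floating-point round-off, so a tolerance-based check is a safer way to make the same point. The sketch below uses made-up data; the variable names are assumptions for illustration.

```matlab
rng(3);                                  % fixed seed for a repeatable sketch
X = randi([1, 100], 50, 6);              % hypothetical data: 50 observations, 6 variables
Z = zscore(X);                           % each column: mean 0, standard deviation 1

maxColMean = max(abs(mean(Z)));          % column means are zero up to round-off
maxChange  = max(abs(Z - (Z - mean(Z))), [], 'all');   % effect of centering a second time

fprintf('largest column mean: %g, largest change: %g\n', maxColMean, maxChange);
```

Both numbers should be at the level of round-off error, which confirms that the second centering step has no practical effect.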
zscore says
  • If X is a matrix, then Z is a matrix of the same size as X, and each column of Z has mean 0 and standard deviation 1.
So if you are getting a 487 x 23 matrix out of zscore, that implies that you passed a 487 x 23 matrix into zscore.
Hi Ive, correction: the matrix changes from 487x45 to 487x23 because the columns containing NaN are removed. The Z-mean(Z) is just an extra line I put in accidentally.
Yes Walter, that is correct: the matrix changes from 487x45 to 487x23 because the columns containing NaN are removed. I explained it wrongly in my last reply.
Ok then, to wrap things up so far:
A = randn(487, 45); % raw data with 487 observations and 45 variables
A(randi([1, numel(A(:))], 20, 1)) = nan; % add some missing values to raw data
varNames = "x" + (1:size(A, 2));
fprintf('original matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
original matrix size: 487 observations and 45 features
% remove columns with nan values (I assume in your case, you have only columns
% with all values as nan, otherwise it's better to remove only nan observations).
nanIdx = any(isnan(A), 1);
A(:, nanIdx) = [];
varNames(nanIdx) = []; % remove nan labels
fprintf('pruned matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
pruned matrix size: 487 observations and 32 features
Az = zscore(A);
[coeff, score, latent, ~, explained, ~] = pca(Az);
% show me what you've done
% 1- variance explained
figure;
pareto(explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
% 2- biplot
figure;
biplot(coeff(:, 1:2), 'Scores', score(:, 1:2), 'Varlabels', varNames);
Hi Ive, your summary is extremely helpful! Thank you!
I didn't know I could use an index to remove the NaN variable names!
One last question: how can I remove only the NaN observations? The truth is that not all of my columns are full of NaN; some have only a couple of NaN values, but I don't know how to remove just those observations.
Here is what I have done for the plotting part. Since each of my PCs explains only about 10 percent, I put many of them into the biplot so that together they cover more than 80% in the scree test.
h=biplot(coeff(:,1:3),'Scores',score(:,1:3));
hold on
biplot(coeff(:,4:6),'Scores',score(:,4:6));
biplot(coeff(:,7:9),'Scores',score(:,7:9));
biplot(coeff(:,10:12),'Scores',score(:,10:12));
biplot(coeff(:,13:14),'Scores',score(:,13:14));
hold off
axis tight
grid on
xlabel('PC1'), ylabel('PC2')
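
Rather than stacking biplots until enough variance is covered, the number of components needed to reach a threshold can be read directly from the explained output of pca. The sketch below assumes random standardized data and an 80% threshold; the names cumExplained and nKeep are made up for the example.

```matlab
rng(4);                                      % fixed seed for a repeatable sketch
Az = zscore(randn(487, 23));                 % hypothetical standardized data
[coeff, score, ~, ~, explained] = pca(Az);

cumExplained = cumsum(explained);            % running total of variance explained (%)
nKeep = find(cumExplained >= 80, 1);         % smallest number of PCs reaching 80%

fprintf('%d PCs explain %.1f%% of the variance\n', nKeep, cumExplained(nKeep));
```

Note that a single biplot shows only two or three components at a time, so overlaying several of them on axes labelled PC1 and PC2 may be hard to interpret; plotting cumExplained against the component index is usually easier to read.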
To remove only the samples with missing values, you can apply something like this to your dataset:
A = randn(487, 45); % raw data with 487 observations and 45 variables
A(randi([1, numel(A(:))], 20, 1)) = nan; % add some missing values to raw data
fprintf('original matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
original matrix size: 487 observations and 45 features
nanObsIdx = any(isnan(A), 2); % samples (rows) with at least one missing value in any feature (column)
A(nanObsIdx, :) = [];
fprintf('pruned matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
pruned matrix size: 467 observations and 45 features
Obviously this doesn't affect your features but only samples.
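
An alternative sketch uses the built-in rmmissing function, which returns both the cleaned matrix and a logical vector marking what was removed, so labels can be kept in sync. The data, the seed, and the names varNames and keptNames are assumptions for illustration.

```matlab
rng(1);                                   % fixed seed for a repeatable sketch
A = randn(487, 45);                       % hypothetical data: 487 observations, 45 variables
A(randperm(numel(A), 20)) = NaN;          % 20 distinct missing entries
varNames = "x" + (1:size(A, 2));          % hypothetical variable labels

% Option 1: drop incomplete observations (rows); the labels stay as they are
[Arows, rowRemoved] = rmmissing(A);

% Option 2: drop incomplete variables (columns) and prune the labels to match
[Acols, colRemoved] = rmmissing(A, 2);
keptNames = varNames(~colRemoved);

fprintf('rows removed: %d, columns removed: %d\n', nnz(rowRemoved), nnz(colRemoved));
```

Which option is preferable depends on the data: dropping rows keeps all variables but loses observations, while dropping columns keeps all observations but loses variables.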
Hi Ive, sorry for the late reply. My code is more or less working now. Thank you so much for your input!


Release

R2021a

Asked:

ack
on 31 Aug 2021

Commented:

ack
on 4 Sep 2021
