% from struct to matrix using function
T1 = createDataMatrix(REC);
x = ismissing(T1);
y = any(x,1);
z = T1(:,~y);
a = z;
% scaling data for each column using standardised Z
ZM = zscore(a);
ZM = ZM-mean(ZM);
% PCA using MATLAB built-in function
[coeff,score,latent,~,explained,~] = pca(ZM);

Well, you should look for the answer in your problem, not in the MATLAB pca function. You have 45 observations of 484 variables, so the degrees of freedom (you already centered your variables) in your case are 44, and that is the maximum number of PCs with nonzero variance. You need to look at the total variance explained and pick the PCs that explain most of it (let's say 90%). I highly doubt that the number of PCs explaining that much variance exceeds half the number of variables in a real-world case (though I admit it depends on the nature of the problem). Bottom line: the pca function works just fine.
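As a minimal sketch of the selection rule described above, the snippet below uses random numbers as a stand-in for the 45x484 data (not the poster's data) and picks the smallest number of PCs whose cumulative explained variance reaches a threshold. The 90% threshold and the variable names `threshold`, `nPCs` and `k` are illustrative assumptions.

```matlab
% Synthetic stand-in for the 45-by-484 data (random numbers, not the real data)
rng(0);                              % fixed seed so the run is repeatable
X = randn(45, 484);                  % 45 observations (rows), 484 variables (columns)

[coeff, score, latent, ~, explained] = pca(zscore(X));

% With 45 observations, pca returns at most 45 - 1 = 44 components by default
nPCs = size(coeff, 2);

% Smallest number of components whose cumulative explained variance
% reaches the threshold (in percent; an illustrative choice)
threshold = 90;
k = find(cumsum(explained) >= threshold, 1);

fprintf('%d components returned, %d needed for %d%% of the variance\n', ...
    nPCs, k, threshold);
```

The columns `coeff(:, 1:k)` and `score(:, 1:k)` would then be the ones to keep for further analysis.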

I have 45x484 matrix but when I calculate coeff pca function, I am ...

ack on 31 Aug 2021

Hi, thank you so much for the comment. I am totally new to MATLAB.

Actually, I have 44 variables and 484 observations, which I extracted from struct arrays using the 'interp1' function. The resulting matrix came out like this. I just tried transposing T1 into a 484x44 matrix and running it again, and the coeff matrix became 23x23.

Now it is giving me this error because my 'vbls' has 44 variable labels, which causes a mismatch.

Is there a way to find out which of the original 44 variables correspond to the 23 variables plotted using the 'coeff' and 'score' values in the new PC dimensions, so that I can label them accordingly? Thanks!

Error using biplot (line 166)
'VarLabels' value must be a string array, a character array, or a cell array of character vectors
with one label for each row of the coefficients matrix.

Error in testrun2 (line 29)
h=biplot(coeff(:,1:2),'Scores',score(:,1:2), 'VarLabels',vbls);

Ive J on 1 Sep 2021

So, if ZM is 45X484, what's the size of coeff you get from this?

[coeff,score,latent,~,explained,~]=pca(ZM.');

TBH I don't understand why you get a 23x23 matrix for coeff, and that's where the error lies: 'VarLabels' must contain as many labels as coeff has rows, i.e. size(coeff, 1).

size(coeff, 1) == numel(vbls) % this must be true for biplot to work

Walter Roberson on 1 Sep 2021

MATLAB documents that if you do pca(X) and X is a 2D matrix, then you will get out an array that is size(X,2) by size(X,2) . So if you got out a 23 x 23 matrix, then that implies you passed in a matrix with 23 columns to pca().
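A small sketch of how the size of coeff depends on the shape of the input, using random matrices as an assumption in place of the real data. With more observations than variables, coeff is size(X,2) by size(X,2); with fewer observations than variables, the default 'Economy' setting of pca drops the components that have zero variance, which would explain the 484x44 result in the original question.

```matlab
rng(1);                                    % fixed seed for repeatability
tall = randn(484, 45);                     % more observations than variables
wide = tall.';                             % 45 observations, 484 variables

coeffTall = pca(tall);                     % 45-by-45
coeffWide = pca(wide);                     % 484-by-44 (default 'Economy', true)
coeffFull = pca(wide, 'Economy', false);   % 484-by-484, zero-variance PCs included

disp(size(coeffTall))
disp(size(coeffWide))
disp(size(coeffFull))
```

In every case the number of rows of coeff equals the number of columns of the input, so that is the number of labels biplot expects.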

ack on 1 Sep 2021

Hi Ive, sorry, 45x484 is my T1 after transposing it. Then, when I ran zscore, it became a 487x23 matrix, which is the variable ZM. I would like to know what these 23 variables and their labels are, because I had 44 variables initially.

Is there a function that can track the changes in the variables so that I can make a new vbls with only the 23 relevant names?

Regards,

ack on 1 Sep 2021

Hi Walter, thanks for the input. It seems my coeff matrix size is correct now, as I got a size(X,2) by size(X,2) matrix.

Ive J on 1 Sep 2021

Edited: Ive J on 1 Sep 2021

Then when I run the zscore and it became 487x23 matrix...

How does this happen?

Also, note that centering after standardization (which already centers and scales the variables) is pointless: you've already standardized your data, so what's the point of subtracting the mean again?!

% ZM=zscore(a);
% ZM=ZM-mean(ZM); % ?? meaningless
x = randi([1, 100], 4, 1);
zx = zscore(x);
all(zx == (zx - mean(zx)))
ans = logical
   1
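The same point can be checked on a whole matrix with a floating-point tolerance instead of exact equality, which is usually the safer comparison. This is only a sketch on random data; the 487x23 size mirrors the matrix discussed in the thread, and the tolerance `tol` is an assumed value.

```matlab
rng(4);                              % fixed seed for repeatability
X  = randn(487, 23);                 % stand-in for the cleaned data matrix
Z  = zscore(X);                      % each column: mean 0, standard deviation 1
Zc = Z - mean(Z);                    % the redundant second centering step

tol = 1e-12;                         % assumed tolerance for round-off error
maxColumnMean = max(abs(mean(Z)));   % column means are already ~0
maxChange     = max(abs(Z(:) - Zc(:)));   % centering again changes ~nothing

fprintf('largest column mean: %g, largest change: %g\n', maxColumnMean, maxChange);
```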

Walter Roberson on 1 Sep 2021

zscore says

If X is a matrix, then Z is a matrix of the same size as X, and each column of Z has mean 0 and standard deviation 1.

So if you are getting a 487 x 23 matrix out of zscore, that implies that you passed a 487 x 23 matrix into zscore.

ack on 1 Sep 2021

Hi Ive, correction: the matrix changes from 487x45 to 487x23 because the columns containing NaN are removed. The Z-mean(Z) is just an extra line I put in accidentally.

ack on 1 Sep 2021

Yes, Walter, that is correct: the matrix changes from 487x45 to 487x23 because the columns containing NaN are removed. I explained it wrongly in my last reply.

Ive J on 1 Sep 2021

Ok then, to wrap things up so far:

A = randn(487, 45); % raw data with 487 observations and 45 variables
A(randi([1, numel(A(:))], 20, 1)) = nan; % add some missing values to raw data
varNames = "x" + (1:size(A, 2));
fprintf('original matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
original matrix size: 487 observations and 45 features
% remove columns with nan values (I assume that in your case the affected columns
% are all nan; otherwise it's better to remove only the observations with nan).
nanIdx = any(isnan(A), 1);
A(:, nanIdx) = [];
varNames(nanIdx) = []; % remove the labels of the dropped columns
fprintf('pruned matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
pruned matrix size: 487 observations and 32 features
Az = zscore(A);
[coeff, score, latent, ~, explained, ~] = pca(Az);
% show me what you've done
% 1- variance explained
figure;
pareto(explained)
xlabel('Principal Component')
ylabel('Variance Explained (%)')
% 2- biplot
figure;
biplot(coeff(:, 1:2), 'Scores', score(:, 1:2), 'Varlabels', varNames);

ack on 1 Sep 2021

Hi Ive, your summary is extremely helpful! Thank you!

I didn't know I could use an index to remove the NaN variable names!

One last question: how can I remove only the NaN observations? The truth is that not all of my columns are full of NaN; some have only a couple of NaNs, but I don't know how to remove only those observations.

Here is what I have done for the plotting part. Since my PCs only explain about 10 percent, I put lots of them in the biplot so that the scree test gets over 80%.

h=biplot(coeff(:,1:3),'Scores',score(:,1:3));
hold on
biplot(coeff(:,4:6),'Scores',score(:,4:6));
biplot(coeff(:,7:9),'Scores',score(:,7:9));
biplot(coeff(:,10:12),'Scores',score(:,10:12));
biplot(coeff(:,13:14),'Scores',score(:,13:14));
hold off
axis tight
grid on 
xlabel('PC1'), ylabel('PC2')
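When many PCs are needed to reach 80%, a table of the variables with the largest loadings on each retained PC may be easier to read than several overlaid biplots. The following is a rough sketch on random data, not the method used in this thread; `varNames`, `nTop` and `topVars` are hypothetical names, and the 487x23 size is an assumption based on the matrix described earlier.

```matlab
rng(2);                                   % fixed seed for repeatability
A = randn(487, 23);                       % stand-in for the cleaned data matrix
varNames = "x" + (1:size(A, 2));          % hypothetical labels, one per column

[coeff, ~, ~, ~, explained] = pca(zscore(A));

% Number of components needed to reach 80% cumulative explained variance
k = find(cumsum(explained) >= 80, 1);

% For each retained component, list the variables with the largest
% absolute loadings (rows of coeff correspond to the original variables)
nTop = 3;
topVars = strings(k, nTop);
for pc = 1:k
    [~, order] = sort(abs(coeff(:, pc)), 'descend');
    topVars(pc, :) = varNames(order(1:nTop));
end
disp(topVars)
```

Row i of `topVars` names the variables that contribute most to PC i, which helps map the new dimensions back to the original labels.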

Ive J on 2 Sep 2021

Edited: Ive J on 2 Sep 2021

To remove only the samples with missing values, you can apply something like this to your dataset:

A = randn(487, 45); % raw data with 487 observations and 45 variables
A(randi([1, numel(A(:))], 20, 1)) = nan; % add some missing values to raw data
fprintf('original matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
original matrix size: 487 observations and 45 features
nanObsIdx = any(isnan(A), 2); % samples with at least one missing value in any feature (column)
A(nanObsIdx, :) = [];
fprintf('pruned matrix size: %d observations and %d features\n', size(A, 1), size(A, 2))
pruned matrix size: 467 observations and 45 features

Obviously this doesn't affect your features but only samples.
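As an alternative sketch, the built-in rmmissing function can do the same row or column pruning and also reports what it removed, which makes it easy to keep labels in sync. The data below are random and the names `rowsKept`, `colsKept` and `keptNames` are illustrative assumptions.

```matlab
rng(3);                                    % fixed seed for repeatability
A = randn(487, 45);                        % stand-in raw data
A(randi(numel(A), 20, 1)) = NaN;           % sprinkle in some missing values
varNames = "x" + (1:size(A, 2));           % hypothetical labels, one per column

% Remove observations (rows) that contain at least one missing value
[rowsKept, rowRemoved] = rmmissing(A);

% Remove variables (columns) instead, and drop the matching labels
[colsKept, colRemoved] = rmmissing(A, 2);
keptNames = varNames(~colRemoved);

fprintf('row pruning: %d x %d, column pruning: %d x %d\n', ...
    size(rowsKept, 1), size(rowsKept, 2), size(colsKept, 1), size(colsKept, 2));
```

Row pruning keeps all 45 variables, so the original labels stay valid; column pruning needs `keptNames` for biplot.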

ack on 4 Sep 2021

Hi Ive, sorry for the late reply. My code is sort of working now. Thank you so much for your input!

I have a 45x484 matrix, but when I compute coeff with the pca function, I get a 484x44 coeff matrix, which causes errors in biplot. Why do the rows and columns switch places?
