Evaluation Criteria for Missing Data Imputation Techniques

Question

Tiago Dias on 28 Jun 2018


https://it.mathworks.com/matlabcentral/answers/407885-evaluation-criteria-for-missing-data-imputation-techniques

Answered: Tiago Dias on 5 Jul 2018

Hello,

I have 5 methods for missing data imputation, because my original data set is industrial data and has missing values. To perform a PCA with positive eigenvalues, I need the covariance matrix to be positive definite.

I used the 5 methods to impute the missing data, so now I have 5 new X_imputed matrices.

Question: How can I measure the performance of each one? What criteria should I use?

I read about calculating the RMSE, but the formula takes the square root of the mean squared difference between Xi observed and Xi imputed. The authors can do that calculation because their initial X is complete and they introduce a percentage of missing data (MD) themselves. My problem is that I already start with missing data.


Answer 1

Jeff Miller on 4 Jul 2018


https://it.mathworks.com/matlabcentral/answers/407885-evaluation-criteria-for-missing-data-imputation-techniques#answer_327372


You can't evaluate the performance of the different imputation methods with respect to your actual data set, for exactly the reason you mention. You can only compare their performance across simulations where you know the values of each of the missing points (i.e., your simulation pretends that some simulated points are missing). Such a simulation would require very detailed assumptions about the multivariate situation that your data came from, including the reasons why some points are missing.
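A minimal sketch of such a simulation follows. Everything in it is an assumption made for illustration rather than something taken from this thread: the data are synthetic, entries are hidden completely at random (which real industrial data may well violate), and column-mean imputation is only a placeholder for whichever method is being tested.

```matlab
% Sketch: score an imputation method on artificially hidden entries.
% Assumptions (not from the thread): synthetic data, values hidden
% completely at random, column-mean imputation as a placeholder method.
rng(0);                                  % reproducible random numbers
n = 200; p = 5;
Xtrue = randn(n, p) * randn(p, p);       % complete, correlated synthetic data

missFrac = 0.30;                         % fraction of entries to hide
mask = rand(n, p) < missFrac;            % true where a value is hidden
Xmiss = Xtrue;
Xmiss(mask) = NaN;

% Placeholder imputation: replace each NaN with its column mean.
Ximp = Xmiss;
for j = 1:p
    col = Xmiss(:, j);
    col(isnan(col)) = mean(col(~isnan(col)));
    Ximp(:, j) = col;
end

% RMSE over the hidden entries only, where the true values are known.
err = Xtrue(mask) - Ximp(mask);
rmse = sqrt(mean(err.^2));
fprintf('RMSE on hidden entries: %.4f\n', rmse);
```

Repeating this for each candidate method with the same mask gives comparable RMSE values, but the ranking is only as trustworthy as the assumptions about the data and the missingness mechanism.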

It might be better to perform the PCA without imputing any missing data (check the pca documentation). Did you try the following?

coeff = pca(X,'Rows','pairwise');

This essentially computes each entry in the covariance matrix using whichever of your original data rows/cases have values for both relevant variables.
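To make the pairwise idea concrete, here is a rough sketch that builds such a covariance matrix by hand. The small matrix X is made up, and the loop follows one common definition of pairwise-complete covariance; it is not claimed to reproduce the internals of pca exactly. The pca documentation notes that a covariance matrix built this way might not be positive definite, so the sketch also checks that with chol.

```matlab
% Sketch: pairwise-complete covariance, computed entry by entry.
% The small matrix X below is made up for illustration.
X = [1.0 2.0 NaN;
     2.0 NaN 1.0;
     3.0 4.5 2.0;
     4.0 6.1 NaN;
     5.0 8.2 4.0];

p = size(X, 2);
C = NaN(p, p);
for i = 1:p
    for j = i:p
        rows = ~isnan(X(:, i)) & ~isnan(X(:, j));   % rows observed in both columns
        if nnz(rows) >= 2
            c = cov(X(rows, i), X(rows, j));        % 2-by-2 covariance of the pair
            C(i, j) = c(1, 2);
            C(j, i) = c(1, 2);                      % keep C symmetric
        end
    end
end

% Positive-definiteness check: chol returns flag = 0 only on success.
if any(isnan(C(:)))
    isPosDef = false;                               % some pair had too few shared rows
else
    [~, flag] = chol(C);
    isPosDef = (flag == 0);
end
```

If isPosDef comes out false, some eigenvalues of C are zero or negative, which is the situation the original question is trying to avoid.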

2 Comments

Tiago Dias on 4 Jul 2018


Thanks for your input, but I need to impute the missing data. Since I have missing values (~30%, industrial data), I can compute the covariance, but because the covariance contains NaNs, I can't calculate scores and loadings.

Since I have my matrix X and my matrix Ximputed (obtained using a PCA model, so all the entries are recalculated, even the non-missing values), can I use

sum((X(i,j) - X_imp(i,j)).^2) as a criterion?

Jeff Miller on 5 Jul 2018

Sorry, I do not know whether your suggestion is reasonable or not.

If the data do not even allow the covariances to be estimated, then you probably don't have enough data to decide which is the best imputation method or to do PCA afterwards.

Can you select out a subset of the variables for which you can get a complete set of covariances? You might just do PCA on this subset.


Answer 2

Tiago Dias on 5 Jul 2018


https://it.mathworks.com/matlabcentral/answers/407885-evaluation-criteria-for-missing-data-imputation-techniques#answer_327561

I can't really make a subset, because all variables have missing data. But I found an article where they compute the residuals X (with MD) - Ximputed, only for the entries (i,j) that have values in X, so I will go that way.
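A possible way to write that residual calculation is sketched below. The matrices X and Ximp are small made-up examples, and the variable names obs, res, sse and rmseObs are illustrative only; the article mentioned above may define its criterion differently.

```matlab
% Sketch: residuals between X and its imputed version, observed entries only.
% X and Ximp below are small made-up matrices for illustration.
X    = [1.0 NaN 3.0;
        NaN 2.0 1.0;
        4.0 5.0 NaN];
Ximp = [1.1 2.4 2.8;
        2.0 2.1 1.0;
        3.7 5.2 2.2];

obs = ~isnan(X);                    % true where X holds a measured value
res = X(obs) - Ximp(obs);           % residuals on observed entries only
sse = sum(res.^2);                  % sum of squared residuals
rmseObs = sqrt(mean(res.^2));       % RMSE over the observed entries
```

Note that this measures how well the model reproduces the values that were observed. As pointed out in Answer 1, it does not directly measure accuracy on the entries that were actually missing, since their true values are unknown.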