Evaluation Criteria for Missing Data Imputation Techniques
I have 5 methods for missing data imputation, because my original data set has missing values (it is industrial data). To perform PCA and get positive eigenvalues, I need the covariance matrix to be positive definite.
I used the 5 methods to impute the missing data, so now I have 5 new X_imputed matrices.
Question: How can I measure the performance of each one? What criteria should I use?
I read about calculating the RMSE, but the formula takes the square root of the mean squared difference between Xi observed and Xi imputed. The authors can compute it because their initial X is complete and they introduce a percentage of missing data themselves. My problem is that I already start with missing data.
Jeff Miller on 4 Jul 2018
You can't evaluate the performance of the different imputation methods with respect to your actual data set, for exactly the reason you mention. You can only compare their performance across simulations where you know the values of each of the missing points (i.e., your simulation pretends that some simulated points are missing). Such a simulation would require very detailed assumptions about the multivariate situation that your data came from, including the reasons why some points are missing.
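As a rough illustration of that kind of simulation, the sketch below hides some values that are actually known, imputes them, and computes the RMSE on the hidden entries only. Everything in it is an assumption made for illustration: the data are synthetic, the 10% masking rate is arbitrary, impute_fn is a hypothetical stand-in (column-mean fill) for whichever of your 5 methods you want to score, and hiding entries uniformly at random assumes the values are missing completely at random, which may not hold for industrial data.

```matlab
% Sketch: score an imputation method by hiding values whose truth is known.
% All names and numbers here are hypothetical, not taken from the question.
rng(0);                                     % reproducible masking
X = randn(200, 4);                          % synthetic stand-in for the real data
X(rand(size(X)) < 0.10) = NaN;              % entries that are "really" missing

observed = ~isnan(X);                       % entries whose true value is known
hide = observed & (rand(size(X)) < 0.10);   % hide about 10% of the known entries
X_masked = X;
X_masked(hide) = NaN;

% Stand-in imputation: replace each NaN with the mean of its column.
impute_fn = @(A) fillmissing(A, 'constant', mean(A, 1, 'omitnan'));
X_imputed = impute_fn(X_masked);

% RMSE over the artificially hidden entries only.
err = X_imputed(hide) - X(hide);
rmse = sqrt(mean(err.^2));
fprintf('RMSE on hidden entries: %.4f\n', rmse);
```

Repeating this with several random masks and with each imputation method in place of impute_fn gives a distribution of RMSE values per method, which is easier to compare than a single run.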
It might be better to perform the PCA without imputing any missing data (check the pca documentation). Have you tried the following?
coeff = pca(X,'Rows','pairwise');
This essentially computes each entry in the covariance matrix using whichever of your original data rows/cases have values for both relevant variables.
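A small sketch of what this looks like in practice, using synthetic data with an arbitrary 15% of entries set to NaN (both are assumptions for illustration). It runs pca with the pairwise option, then builds a pairwise covariance with cov(X,'partialrows') and checks its eigenvalues. The pca documentation notes that a covariance matrix built this way is not guaranteed to be positive definite, so the eigenvalue check is worth doing on your own data. The cov call is only meant to show the idea; it is not claimed to reproduce the exact matrix pca uses internally.

```matlab
% Sketch: PCA on data with NaNs, plus an eigenvalue check on a pairwise covariance.
rng(1);
X = randn(100, 3);                  % synthetic stand-in for the real data
X(rand(size(X)) < 0.15) = NaN;      % hypothetical missing entries

% PCA without imputation (requires Statistics and Machine Learning Toolbox).
[coeff, ~, latent] = pca(X, 'Rows', 'pairwise');

% Pairwise covariance: each entry uses the rows where both variables are present.
C = cov(X, 'partialrows');
eigvals = sort(eig(C), 'descend');

disp(latent');                      % component variances returned by pca
disp(eigvals');                     % eigenvalues of the pairwise covariance
isPosDef = all(eigvals > 0);        % true if C is positive definite
fprintf('Pairwise covariance positive definite: %d\n', isPosDef);
```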