Replacing missing values

Finally, there is one more useful method which, although not strictly a preprocessing method, can be treated as one. It replaces missing values in your dataset with approximated ones.

The method uses a PCA-based approach described in this paper. The main idea is to fit the dataset with a PCA model (for example, the NIPALS algorithm for PCA can work even if the data contain missing values) and then approximate the missing values as if they were lying in the PC space.
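The idea can be illustrated with a minimal sketch in base R. It is not the code used by pca.mvreplace: the function name impute_pca_sketch and its arguments ncomp, niter and tol are hypothetical, and it relies on a plain SVD with a fixed number of components instead of NIPALS with a limit on explained variance. Missing cells are first filled with column means and then repeatedly replaced by their low-rank PCA reconstruction until the values stop changing.

```r
# Hypothetical helper for illustration only, not part of mdatools.
# Iteratively replaces missing values with their rank-ncomp PCA reconstruction.
impute_pca_sketch = function(x, ncomp = 1, niter = 500, tol = 1e-8) {
   miss = is.na(x)

   # initial guess: column means computed from the observed values
   cmeans = colMeans(x, na.rm = TRUE)
   xf = x
   xf[miss] = cmeans[col(x)[miss]]

   for (i in seq_len(niter)) {
      # center the data using the current column means
      m = colMeans(xf)
      xc = sweep(xf, 2, m)

      # reconstruct the centered data from the first ncomp components
      s = svd(xc, nu = ncomp, nv = ncomp)
      xhat = s$u %*% diag(s$d[seq_len(ncomp)], ncomp) %*% t(s$v)
      xhat = sweep(xhat, 2, m, "+")

      # update only the cells that were missing, keep observed values as is
      delta = sum((xf[miss] - xhat[miss])^2)
      xf[miss] = xhat[miss]

      # stop when the filled values no longer change noticeably
      if (delta < tol) break
   }

   xf
}

# noise-free example: three perfectly correlated variables
s = 1:6
x = cbind(X1 = s, X2 = 2 * s, X3 = 4 * s)
xm = x
xm[5, 2] = xm[2, 3] = NA
show(round(impute_pca_sketch(xm, ncomp = 1), 2))
```

Because the columns in this toy dataset are exactly proportional, one component is enough to describe it, so the filled values should end up close to the ones that were removed. With real data the number of components matters much more, which is why the actual method selects it from the explained variance.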

The method has the same parameters as any PCA model. However, instead of specifying the number of components, you specify another parameter, expvarlim, which sets the share of variance the model must explain. The default value is 0.95, which corresponds to 95% of explained variance. You can also specify whether the data must be centered (default TRUE) and scaled/standardized (default FALSE). For more details, run ?pca.mvreplace.

The example below shows a trivial case. First we generate a simple dataset. Then we replace some of the numbers with missing values (NA) and then apply the method to approximate them.

library(mdatools)

# generate a matrix with correlated variables
s = 1:6
odata = cbind(s, 2*s, 4*s)

# add some noise and labels for columns and rows
set.seed(42)
odata = odata + matrix(rnorm(length(odata), 0, 0.1), dim(odata))
colnames(odata) = paste0("X", 1:3)
rownames(odata) = paste0("O", 1:6)

# make a matrix with missing values
mdata = odata
mdata[5, 2] = mdata[2, 3] = NA

# replace missing values with approximated
rdata = pca.mvreplace(mdata, scale = TRUE)

# show all matrices together
show(round(cbind(odata, mdata, round(rdata, 2)), 3))
##       X1     X2     X3    X1     X2     X3   X1    X2    X3
## O1 1.137  2.151  3.861 1.137  2.151  3.861 1.14  2.15  3.86
## O2 1.944  3.991  7.972 1.944  3.991     NA 1.94  3.99  7.51
## O3 3.036  6.202 11.987 3.036  6.202 11.987 3.04  6.20 11.99
## O4 4.063  7.994 16.064 4.063  7.994 16.064 4.06  7.99 16.06
## O5 5.040 10.130 19.972 5.040     NA 19.972 5.04 10.21 19.97
## O6 5.989 12.229 23.734 5.989 12.229 23.734 5.99 12.23 23.73

# show the difference between original and approximated values
show(round(odata - rdata, 3))
##    X1     X2    X3
## O1  0  0.000 0.000
## O2  0  0.000 0.465
## O3  0  0.000 0.000
## O4  0  0.000 0.000
## O5  0 -0.076 0.000
## O6  0  0.000 0.000

As you can see, the method estimates the two missing values as 7.51 and 10.21, while the original values were 7.97 and 10.13.
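To put these numbers in perspective, the short sketch below converts the two differences into relative errors. The values are copied by hand from the output above, and the variable names are chosen only for this illustration.

```r
# original values and differences (original - approximated) taken from the output above
original = c(O2.X3 = 7.972, O5.X2 = 10.130)
difference = c(O2.X3 = 0.465, O5.X2 = -0.076)

# relative error of each approximated value, in percent
rel_error = 100 * abs(difference) / original
show(round(rel_error, 1))
```

The error for the value in X3 is noticeably larger than the one for X2, which is worth keeping in mind when only a few rows are available to build the model.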

The method works if the total share of missing values does not exceed 20% (10% if the dataset is small).
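Before applying the method it can be useful to check how many values are actually missing. The sketch below does this in base R. The helper missing_share is hypothetical (it is not part of mdatools), and the 20% limit in the check is simply the figure quoted above, not a value read from the package.

```r
# Hypothetical helper, not part of mdatools: share of missing cells in a matrix
missing_share = function(x) mean(is.na(x))

# small example: 6 x 3 matrix with two missing values
x = matrix(as.numeric(1:18), nrow = 6)
x[5, 2] = x[2, 3] = NA

share = missing_share(x)
cat(sprintf("Missing values: %.1f%%\n", 100 * share))

# 0.20 is the limit mentioned in the text, used here as an assumption
if (share > 0.20) {
   warning("Too many missing values for a PCA based replacement.")
}
```

The same share can be computed per column with colMeans(is.na(x)), which helps to spot a single variable that carries most of the gaps.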