Randomized PCA algorithms
Both SVD and NIPALS are inefficient when the number of rows in a dataset is very large (e.g. hundreds of thousands or more). Such datasets are easily obtained, for example, from hyperspectral images. Applying the traditional algorithms directly to such data often leads to memory problems and long computational times.
One solution is to use probabilistic algorithms, which reduce the number of values needed to estimate the principal components. Starting from version 0.9.0, one such probabilistic approach is also implemented in mdatools. The original idea can be found in this paper and some examples of using the approach for PCA analysis of hyperspectral images are described here.
The approach is based on computing a matrix \(B\), which has many fewer rows than the original data but captures its action. Several parameters are used to produce proper values for \(B\). The first is the rank of the original data, \(k\), if it is known. The second is an oversampling parameter, \(p\), which matters when the rank is underestimated: \(p\) is used to deliberately overestimate the rank, which makes the solution more stable. The third parameter, \(q\), is the number of power iterations used to make \(B\) more robust. Usually \(p = 5\) and \(q = 1\) work well on most datasets and, at the same time, take much less time than the conventional methods. By default, \(k\) is set to the number of components used for the PCA decomposition.
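To make the procedure more concrete, here is a minimal sketch of the randomized SVD idea in plain R. This is only an illustration of the general method, not the actual mdatools implementation; the function name rsvd_sketch is made up for this example, and the data matrix is assumed to be already centered.

# sketch a random subspace of size k + p, refine it with q power
# iterations, then run a small exact SVD on the reduced matrix B
rsvd_sketch = function(X, k, p = 5, q = 1) {
   Omega = matrix(rnorm(ncol(X) * (k + p)), nrow = ncol(X))
   Q = qr.Q(qr(X %*% Omega))
   for (i in seq_len(q)) {
      # power iterations make the basis Q more robust
      Q = qr.Q(qr(crossprod(X, Q)))   # t(X) %*% Q
      Q = qr.Q(qr(X %*% Q))
   }
   # B has only k + p rows but captures the action of X
   B = crossprod(Q, X)
   s = svd(B, nu = 0, nv = k)
   # right singular vectors of B approximate the loadings of X;
   # eigenvalues follow from the singular values (X must be centered)
   list(loadings = s$v, eigenvals = s$d[1:k]^2 / (nrow(X) - 1))
}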
The example below uses two methods (classic SVD and randomized SVD) for PCA decomposition of a 100 000 x 300 dataset and compares the computational time and the main outcomes (loadings and explained variance).
# create a dataset as a linear combination of three sin curves with random "concentrations"
n = 100000
X = seq(0, 29.9, by = 0.1)
S = cbind(sin(X), sin(10 * X), sin(5 * X))
C = cbind(runif(n, 0, 1), runif(n, 0, 2), runif(n, 0, 3))
D = C %*% t(S)

# add random noise
D = D + matrix(runif(300 * n, 0, 0.5), ncol = 300)
show(dim(D))
## [1] 100000 300
# conventional SVD
t1 = system.time({m1 = pca(D, ncomp = 2)})
show(t1)

##    user  system elapsed
##  20.074   0.415  20.524
# randomized SVD with p = 5 and q = 1
t2 = system.time({m2 = pca(D, ncomp = 2, rand = c(5, 1))})
show(t2)
##    user  system elapsed
##   2.885   0.300   3.197
# compare the main outcomes of the two models
summary(m1)

##
## Summary for PCA model (class pca)
## Type of limits: ddmoments
## Alpha: 0.05
## Gamma: 0.01
##
##        Eigenvals Expvar Cumexpvar Nq Nh
## Comp 1   112.539  62.07     62.07  4  3
## Comp 2    49.978  27.56     89.63  6  5

summary(m2)

##
## Summary for PCA model (class pca)
##
## Parameters for randomized algorithm: p = 5, q = 1
## Type of limits: ddmoments
## Alpha: 0.05
## Gamma: 0.01
##
##        Eigenvals Expvar Cumexpvar Nq Nh
## Comp 1   112.539  62.07     62.07  4  3
## Comp 2    49.978  27.56     89.63  6  5
# compare the loadings (first ten values for each component)
show(m1$loadings[1:10, ])

##               Comp 1        Comp 2
##  [1,]  7.973072e-05  1.775574e-05
##  [2,]  3.865753e-02  6.927834e-02
##  [3,]  6.826246e-02  7.516323e-02
##  [4,]  8.135736e-02  1.241170e-02
##  [5,]  7.459189e-02 -6.123400e-02
##  [6,]  4.932195e-02 -7.792343e-02
##  [7,]  1.167070e-02 -2.290599e-02
##  [8,] -2.892920e-02  5.313597e-02
##  [9,] -6.217601e-02  7.988179e-02
## [10,] -7.993952e-02  3.244552e-02

show(m2$loadings[1:10, ])

##               Comp 1        Comp 2
##  [1,] -7.973072e-05 -1.775574e-05
##  [2,] -3.865753e-02 -6.927834e-02
##  [3,] -6.826246e-02 -7.516323e-02
##  [4,] -8.135736e-02 -1.241170e-02
##  [5,] -7.459189e-02  6.123400e-02
##  [6,] -4.932195e-02  7.792343e-02
##  [7,] -1.167070e-02  2.290599e-02
##  [8,]  2.892920e-02 -5.313597e-02
##  [9,]  6.217601e-02 -7.988179e-02
## [10,]  7.993952e-02 -3.244552e-02
As you can see, the explained variance, eigenvalues and loadings are identical in the two models (the loadings differ only in sign, which is arbitrary in SVD), while the second method is several times faster.
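Since the sign of a principal component is arbitrary, a quick way to verify that two sets of loadings agree up to sign is to look at the absolute correlation between corresponding columns. This small check is not part of mdatools, just base R applied to the two models above:

# absolute correlation between matching components should be close to 1
abs(diag(cor(m1$loadings, m2$loadings)))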
It is possible to make PCA decomposition even faster if only loadings and scores are needed. In this case you can use the method pca.run()
and skip the other steps, such as calculation of distances, variances and critical limits. However, the data matrix must then be centered (and scaled, if necessary) manually prior to the decomposition. Here is an example using the data generated in the previous code.
D = scale(D, center = TRUE, scale = FALSE)
# conventional SVD
t1 = system.time({m1 = pca.run(D, method = "svd", ncomp = 2)})
show(t1)
##    user  system elapsed
##  18.690   0.117  18.808
# randomized SVD with p = 5 and q = 1
t2 = system.time({m2 = pca.run(D, method = "svd", ncomp = 2, rand = c(5, 1))})
show(t2)
##    user  system elapsed
##   1.657   0.025   1.683
# compare the loadings (first ten values for each component)
show(m1$loadings[1:10, ])

##                [,1]          [,2]
##  [1,]  7.973072e-05  1.775574e-05
##  [2,]  3.865753e-02  6.927834e-02
##  [3,]  6.826246e-02  7.516323e-02
##  [4,]  8.135736e-02  1.241170e-02
##  [5,]  7.459189e-02 -6.123400e-02
##  [6,]  4.932195e-02 -7.792343e-02
##  [7,]  1.167070e-02 -2.290599e-02
##  [8,] -2.892920e-02  5.313597e-02
##  [9,] -6.217601e-02  7.988179e-02
## [10,] -7.993952e-02  3.244552e-02

show(m2$loadings[1:10, ])

##                [,1]          [,2]
##  [1,] -7.973072e-05 -1.775574e-05
##  [2,] -3.865753e-02 -6.927834e-02
##  [3,] -6.826246e-02 -7.516323e-02
##  [4,] -8.135736e-02 -1.241170e-02
##  [5,] -7.459189e-02  6.123400e-02
##  [6,] -4.932195e-02  7.792343e-02
##  [7,] -1.167070e-02  2.290599e-02
##  [8,]  2.892920e-02 -5.313597e-02
##  [9,]  6.217601e-02 -7.988179e-02
## [10,]  7.993952e-02 -3.244552e-02
As you can see, the loadings are still the same (again up to the sign) and the probabilistic algorithm is about ten times faster. The exact speed-up depends on your computer of course, so running these examples yourself will give slightly different numbers; however, the randomized algorithm will always be a clear winner.
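If the data also needs standardization, the same manual preprocessing step covers it. Here is a small sketch based on the example above (the model name m3 is introduced only for this illustration):

# autoscale (center and standardize) manually, then decompose;
# scale() is base R, not part of mdatools
D = scale(D, center = TRUE, scale = TRUE)
m3 = pca.run(D, method = "svd", ncomp = 2, rand = c(5, 1))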