Partial least squares regression

Partial least squares regression (PLS) is a linear regression method that uses principles similar to PCA: the data is decomposed using latent variables. Because in this case there are two datasets, a matrix with predictors (\(\mathbf{X}\)) and a matrix with responses (\(\mathbf{Y}\)), the decomposition is done for both, computing scores, loadings and residuals: \(\mathbf{X} = \mathbf{TP}^\mathrm{T} + \mathbf{E}_x\), \(\mathbf{Y} = \mathbf{UQ}^\mathrm{T} + \mathbf{E}_y\). In addition, the orientation of the latent variables in PLS is selected to maximize the covariance between the X-scores, \(\mathbf{T}\), and the Y-scores, \(\mathbf{U}\). This makes it possible to work with datasets where the more traditional Multiple Linear Regression fails: when the number of variables exceeds the number of observations, or when the X-variables are mutually correlated. In the end, however, a PLS model is a linear model, where the response value is just a linear combination of the predictors, so the main outcome is a vector with regression coefficients.
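As a minimal sketch of this idea, the code below fits a PLS model with mdatools on simulated data where MLR would struggle (50 predictors, 30 observations) and shows the regression coefficients as the main outcome. The simulated data and the chosen number of components are illustrative assumptions, not from the original text.

```r
library(mdatools)

set.seed(42)
n <- 30
p <- 50
X <- matrix(rnorm(n * p), n, p)       # more variables than observations
y <- X[, 1] - 0.5 * X[, 2] + rnorm(n, 0, 0.1)

m <- pls(X, y, ncomp = 3)             # fit a PLS model with 3 latent variables
summary(m)

plotRegcoeffs(m)                      # regression coefficients, the main outcome
```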

There are two main algorithms for PLS, NIPALS and SIMPLS; in mdatools only the latter is implemented. PLS model and PLS result objects have many properties and performance statistics, which can be visualized via plots. Besides that, it is also possible to compute the selectivity ratio (SR) and VIP scores, which can be used for selecting the most important variables. Another option is a randomization test, which helps to select the optimal number of components. We will discuss most of the methods in this chapter; you can get the full list using ?pls.
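A hedged sketch of these extra tools, continuing the example above. The function names vipscores(), selratio() and randtest() are taken from the mdatools documentation (see ?pls) and are assumed here; older versions of the package may use different names.

```r
# refit with full (leave-one-out) cross-validation: cv = 1
m <- pls(X, y, ncomp = 5, cv = 1)

vip <- vipscores(m, ncomp = 3)        # VIP scores for a 3-component model
sr  <- selratio(m, ncomp = 3)         # selectivity ratio for the same model

# randomization test, used to select the optimal number of components
rt <- randtest(X, y, ncomp = 5)
summary(rt)
```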