Models and results

In mdatools, any method for data analysis, such as PCA, PLS regression or DD-SIMCA classification, can create two types of objects: a model and a result. Every time you build a model you get a model object; every time you apply the model to a dataset you get a result object. Thus for PCA the objects have classes pca and pcares, respectively.

Each object includes a list with the object's properties (e.g. loadings for a model; scores and explained variance for a result) and provides a number of methods for their visualization and exploration.

Model calibration

Let’s see how this works using a simple example, the People data. We already used this dataset when playing with plots. It consists of 32 objects (persons from Scandinavian and Mediterranean countries, 16 male and 16 female) and 12 variables (height, weight, shoe size, annual income, beer and wine consumption, and so on). More information about the data can be found by running ?people. We will first load the data matrix:

library(mdatools)
data(people)

Now let’s calibrate the model:

m = pca(people, 7, scale = TRUE, info = "People PCA model")
m = selectCompNum(m, 5)

Here pca is a function that builds (calibrates, fits, trains) a PCA model and returns the model object. In programming terms, the function pca() is a constructor of the pca model object. The constructor has one mandatory argument, a matrix or data frame with the data values. All other arguments are optional, as they have default values. In this case we use three additional arguments: 7 is the maximum number of components, scale = TRUE specifies that the data values should be standardized (centering is applied by default), and info adds a short text with information about the model, which can be shown later.

Here people and 7 are positional arguments, meaning their role is defined by their position: the pca() function always expects the first argument to be a data matrix or data frame and the second to be the number of components, so if you swap them you will get an error. The other arguments in the example above are named arguments; their role is defined by name, and therefore they can be provided in any order. The full list of arguments for the pca() function and their default values can be seen by running ?pca.
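The same convention holds for any R function. Here is a minimal base R illustration (using round(), not an mdatools function) of how positional and named arguments behave:

```r
# positional arguments: matched by position, round(x, digits)
round(3.14159, 2)               # 3.14

# named arguments: matched by name, so the order does not matter
round(digits = 2, x = 3.14159)  # 3.14
```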

The function selectCompNum() sets an “optimal” number of components for the model. In our case we calibrated the model with 7 principal components; however, after investigating the explained variance, say, we found that 5 components are optimal. We then have two choices: either recalibrate the model using 5 components, or keep the already calibrated model but “tell” it that 5 components is the optimal number. With the second option the model keeps all results calculated for 7 components but uses the optimal number when necessary, for example, when showing a distance plot for the model.
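The two choices can be sketched as follows (assuming the People data and mdatools are loaded, as above):

```r
library(mdatools)
data(people)

# choice 1: recalibrate the model from scratch with 5 components
m1 = pca(people, 5, scale = TRUE)

# choice 2: keep the 7 calculated components but mark 5 as optimal
m2 = pca(people, 7, scale = TRUE)
m2 = selectCompNum(m2, 5)

# m2 still holds results for all 7 components but uses 5 by default
m2$ncomp           # number of calculated components
m2$ncomp.selected  # number of selected (optimal) components
```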

The function print() shows information about the model object and its structure:

print(m)
## 
## PCA model (class pca)
## 
## Info:
## People PCA model
## 
## Call:
## pca(x = people, ncomp = 7, scale = TRUE, info = "People PCA model")
## 
## Major model fields:
##  $loadings - matrix with loadings
##  $eigenvals - eigenvalues for components
##  $ncomp - number of calculated components
##  $ncomp.selected - number of selected components
##  $center - values for centering data
##  $scale - values for scaling data
##  $alpha - significance level for critical limits
##  $gamma - significance level for outlier limits
##  $limParams - parameters for distribution of distances
##  $res - list with model results (calibration, test)

As you can see, there are no scores, explained variance values, distances and so on. This is because they are not actually part of the PCA model; they are results of applying the model to the calibration set. Loadings, eigenvalues, the number of calculated and selected principal components, the vectors for centering and scaling the data, and the significance levels and corresponding critical limits for distances, on the other hand, are all parameters of the model.
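To make the role of the centering and scaling vectors concrete, here is a small base R sketch (with a tiny made-up matrix, not the People data) of what standardization stores and does:

```r
# a tiny made-up data matrix: 3 objects, 2 variables
X = matrix(c(1, 2, 3, 10, 20, 30), ncol = 2)

# autoscaling: subtract column means, divide by column standard deviations
Xs = scale(X, center = TRUE, scale = TRUE)

# the values used are kept as attributes, just like a PCA model keeps
# them in $center and $scale in order to treat any new data the same way
attr(Xs, "scaled:center")  # column means: 2 and 20
attr(Xs, "scaled:scale")   # column standard deviations: 1 and 10
```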

Here is an example of how to get values from the loadings matrix using the model object:

m$loadings[1:4, 1:4]
##              Comp 1     Comp 2      Comp 3      Comp 4
## Height   -0.3752858 -0.1354585 -0.07237117  0.07431853
## Weight   -0.3811357 -0.1114472 -0.06810860 -0.03313762
## Hairleng  0.3377735  0.1501634  0.07901752  0.11431601
## Shoesize -0.3776970 -0.1508061 -0.00127806  0.06569227

One can also notice that the model object has a special field, res, which is in fact a list of PCA result objects containing the results of applying the model to the calibration set (cal) and, if available, a validation/test set (test). In the case of regression and classification, res can also contain an object with cross-validation results (cv). For example, here is the description of the object with calibration results:

print(m$res$cal)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
##  $scores - matrix with score values
##  $T2 - matrix with T2 distances
##  $Q - matrix with Q residuals
##  $ncomp.selected - selected number of components
##  $expvar - explained variance for each component
##  $cumexpvar - cumulative explained variance

And if we want to look at the scores, here is how:

m$res$cal$scores[1:4, 1:4]
##           Comp 1    Comp 2     Comp 3     Comp 4
## Lars   -5.332528 0.6765698  1.0673076  1.1008056
## Peter  -3.114319 0.2934016 -0.6707091 -1.3099103
## Rasmus -2.996848 0.3604689 -0.2118207 -1.1169333
## Lene    1.084240 1.8449249 -0.4091936 -0.1228956

Both model and result objects also have related functions (methods), first of all for visualizing various values (e.g. a scores plot, a loadings plot, etc.). Some of these functions will be discussed later in this chapter; the full list can be found in the help text for the corresponding constructor (e.g. ?pca).

A result object is also created every time you apply a model to new data. As with many built-in R methods, the method predict() is used for this. The first argument of the method is always the model object. Here is a PCA example (assuming we have already built the model):

# create data matrix with three new measurements
Xt = rbind(
   c(180, 80, -1, 44, 35, 35000, 200, 100, -1, 96, -1, 105),
   c(170, 67,  1, 40, 41, 29500, 100, 200,  1, 94,  1, 105),
   c(175, 70,  1, 41, 28, 21500, 110, 120,  1, 97,  1, 115)
)

# project the new data to the previous PCA model and show the result object
res = predict(m, Xt)
print(res)
## 
## Results for PCA decomposition (class pcares)
## 
## Major fields:
##  $scores - matrix with score values
##  $T2 - matrix with T2 distances
##  $Q - matrix with Q residuals
##  $ncomp.selected - selected number of components
##  $expvar - explained variance for each component
##  $cumexpvar - cumulative explained variance
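Conceptually, the projection that predict() performs amounts to applying the model's centering and scaling vectors to the new data and multiplying by the loadings. Here is a base R sketch with made-up data (the names X, P and Xnew are illustrative, not part of mdatools):

```r
set.seed(42)
X  = matrix(rnorm(40), nrow = 10, ncol = 4)          # "calibration" data
mu = colMeans(X)                                     # centering vector
s  = apply(X, 2, sd)                                 # scaling vector
P  = svd(scale(X, center = mu, scale = s))$v[, 1:2]  # loadings, 2 components

Xnew = matrix(rnorm(8), nrow = 2, ncol = 4)          # two "new" measurements
Tnew = scale(Xnew, center = mu, scale = s) %*% P     # their scores
dim(Tnew)                                            # 2 objects by 2 components
```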