Calibration of PLS-DA model

Calibration of PLS-DA model is very similar to conventional PLS with one difference — you need to provide information about class membership of each object instead of a matrix or a vector with response values. This can be done in two different ways. If you have multiple classes, it is always recommended to provide your class membership data as a factor with predefined labels or a vector with class names as text values. The labels/values in this case will be used as class names. It is also acceptable to use numbers as labels but it will make interpretation of results less readable and can possible cause problems with performance statistics calculations. So use names!

It is very important that you use the same labels/names for e.g. calibration and test set, because this is the way model will identify which class an object came from. And if you have e.g. a typo in a label value, model will assume that the corresponding object is a stranger.

So let’s prepare our data.

data(iris)

cal.ind = c(1:25, 51:75, 101:125)
val.ind = c(26:50, 76:100, 126:150)

Xc = iris[cal.ind, 1:4]
Xv = iris[val.ind, 1:4]

cc.all = iris[cal.ind, 5]
cv.all = iris[val.ind, 5]

In this case, the fifth column of dataset Iris is already factor, otherwise we have to make it as a factor explicitely. Lets check if it is indeed correct.

show(cc.all)

##  [1] setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa    
## [16] setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     setosa     versicolor versicolor versicolor versicolor versicolor
## [31] versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor versicolor
## [46] versicolor versicolor versicolor versicolor versicolor virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
## [61] virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica  virginica 
## Levels: setosa versicolor virginica

However, for the model with just one class, virginica, we need to prepare the class variable in a different way. In this case it is enought to provide a vector with logical values, where TRUE will correspond to a member and FALSE to a non-member of the class. Here is an example how to do it (we will make two — one for calibration and one for validation subsets).

cc.vir = cc.all == "virginica"
cv.vir = cv.all == "virginica"
show(cc.vir)

##  [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
## [29] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE
## [57]  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE  TRUE

Now we can calibrate the models:

m.all = plsda(Xc, cc.all, 3, cv = 1)
m.vir = plsda(Xc, cc.vir, 3, cv = 1, classname = "virginica")

You could notice one important difference. In case when parameter c is a vector with logical values you also need to provide a name of the class. If you do not do this, a default name will be used, but it may cause problems when you e.g. validate your model using test set where class membership is a factor as we have in this example.

Let’s look at the summary for each of the model. As you can see below, summary for multi class PLS-DA simply shows one set of results for each class. The performance statistics include explained X and Y variance (cumulative), values for confusion matrix (true positives, false positives, true negatives, false negatives) as well as specificity, sensitivity and accuracy values.

summary(m.all)

## 
## PLS-DA model (class plsda) summary
## ------------------------------------
## Info: 
## Number of selected components: 1
## Cross-validation: full (leave one out)
## 
## Class #1 (setosa)
##     X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal       91.97       46.36 25  0 50  0     1     1        1
## Cv           NA          NA 25  0 50  0     1     1        1
## 
## Class #2 (versicolor)
##     X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal       91.97       46.36  0  0 50 25  1.00     0    0.667
## Cv           NA          NA  0  3 47 25  0.94     0    0.627
## 
## Class #3 (virginica)
##     X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal       91.97       46.36 24  5 45  1  0.90  0.96    0.920
## Cv           NA          NA 24  6 44  1  0.88  0.96    0.907

Dealing with the multi-class PLS-DA model is similar to dealing with PLS2 models, when you have several y-variables. Every time you want to show a plot or results for a particular class, just provide the class number using parameter nc. For example this is how to show summary only for the third class (virginica).

 summary(m.all, nc = 3)

## 
## PLS-DA model (class plsda) summary
## ------------------------------------
## Info: 
## Number of selected components: 1
## Cross-validation: full (leave one out)
## 
## Class #3 (virginica)
##     X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal       91.97       46.36 24  5 45  1  0.90  0.96    0.920
## Cv           NA          NA 24  6 44  1  0.88  0.96    0.907

You can also show statistics only for calibration or only for cross-validation parts, in this case you will see details about contribution of every component to the model.

 summary(m.all$calres, nc = 3)

## 
## PLS-DA results (class plsdares) summary:
## Number of selected components: 1
## 
## Class #3 (virginica):
##        X expvar X cumexpvar Y expvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Comp 1   91.969      91.969   46.356      46.356 24  5 45  1  0.90  0.96    0.920
## Comp 2    5.500      97.468    6.878      53.233 21  5 45  4  0.90  0.84    0.880
## Comp 3    2.186      99.654    4.816      58.049 24  3 47  1  0.94  0.96    0.947

For one class models, the behaviour is actually similar, but there will be always one set of results — for the corresponding class. Here is the summary.

summary(m.vir)

## 
## PLS-DA model (class plsda) summary
## ------------------------------------
## Info: 
## Number of selected components: 3
## Cross-validation: full (leave one out)
## 
## Class #1 (virginica)
##     X cumexpvar Y cumexpvar TP FP TN FN Spec. Sens. Accuracy
## Cal       98.53       61.31 24  3 47  1  0.94  0.96    0.947
## Cv           NA          NA 24  3 47  1  0.94  0.96    0.947

Like in SIMCA you can also get a confusion matrix for particular result. Here is an example for multiple classes model.

getConfusionMatrix(m.all$calres)

##            setosa versicolor virginica None
## setosa         25          0         0    0
## versicolor      0          0         5   20
## virginica       0          0        24    1

And for the one-class model.

getConfusionMatrix(m.vir$calres)

##           virginica None
## virginica        24    1
## None              3   47