`../vignettes/v03-classification.Rmd`

`v03-classification.Rmd`

Consider the Titanic dataset obtained from the Kaggle competition: https://www.kaggle.com/c/titanic/data. The dataset consists of data about 889 passengers, and the goal of the competition is to predict the survival (yes/no) based on features including the class of service, the sex, the age etc.

`## [1] 891 12`

`head(dat)`

```
## PassengerId Survived Pclass
## 1 1 0 3
## 2 2 1 1
## 3 3 1 3
## 4 4 1 1
## 5 5 0 3
## 6 6 0 3
## Name Sex Age SibSp Parch
## 1 Braund, Mr. Owen Harris male 22 1 0
## 2 Cumings, Mrs. John Bradley (Florence Briggs Thayer) female 38 1 0
## 3 Heikkinen, Miss. Laina female 26 0 0
## 4 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35 1 0
## 5 Allen, Mr. William Henry male 35 0 0
## 6 Moran, Mr. James male NA 0 0
## Ticket Fare Cabin Embarked
## 1 A/5 21171 7.2500 <NA> S
## 2 PC 17599 71.2833 C85 C
## 3 STON/O2. 3101282 7.9250 <NA> S
## 4 113803 53.1000 C123 S
## 5 373450 8.0500 <NA> S
## 6 330877 8.4583 <NA> Q
```

Logistic regression is one of powerful tool to tackle this problem. In statistics, the logistic regression is used to model the probability of a certain class or event existing such as alive/dead, pass/fail, win/lose, or healthy/sick. Logistic regression function is an \(s\)-shaped curve modeling the posterior probability \(p\) via a linear combination of the features. The curve is defined as \(p = \frac{1}{1+\exp(-\eta)}\) where \(\eta = \boldsymbol\beta_0+x\boldsymbol\beta\) and \(x\) are predictors, and \(\boldsymbol\beta_0, \boldsymbol\beta\) are coefficients to be learned from data. The logistic regression model has this form: \[ \log(p/(1-p)) = \boldsymbol\beta_0 + x\boldsymbol\beta. \] The quantity \(\log(p/(1-p))\) is called the logarithm of the odd, also called log-odd or logit. The best subset selection for logistic regression aim to balance model accuracy and model complexity, where the former is achieves by maximizing the log-likelihood function and the latter is characterized by a constraint: \(\| \boldsymbol\beta \|_0 \leq s\) and \(s\) can be determined in a data driven way.

The `abess()`

function in the `abess`

package allows users to perform best subset selection in a highly efficient way. Users can call the `abess()`

function using formula just like what users do with `lm()`

. Or users can specify the design matrix `x`

and the response `y`

. As an example, the Titanic dataset is used to demonstrated the usage of `abess`

package.

A glance at the dataset finds there is any missing data. The `na.omit()`

function allows us to delete the rows that contain any missing data. After that, we get a total of 714 samples left.

`## [1] 712 8`

Then we change the factors into dummy variables with the `model.matrix()`

function. Note that the `abess`

function will automatically include the intercept, and thus, we exclude the first column of `dat`

object.

```
dat <- model.matrix(~., dat)[, -1]
dat <- as.data.frame(dat)
```

We split the dataset into a training set and a test set. The model is going to be built on the training set and later We will test the model performance on the test set.

```
train_index <- 1:round((712*2)/3)
train <- dat[train_index, ]
test <- dat[-train_index, ]
```

`abess`

package
We use `abess`

package to perform best subset selection for the preprocessed Titanic dataset by setting `family = "binomial"`

. The cross validation technique is employed to tune the support size by setting `tune.type = "cv"`

.

```
library(abess)
abess_fit <- abess(x = train[, -1], y = train$Survived,
family = "binomial", tune.type = "cv")
```

After getting the estimator, we can further do more exploring work. The output of `abess()`

function contains the best model for all the candidate support sizes in the `support.size`

. Users can use some generic functions to quickly draw some information of those estimators. Typical examples include:

i. print the fitted model:

`abess_fit`

```
## Call:
## abess.default(x = train[, -1], y = train$Survived, family = "binomial",
## tune.type = "cv")
##
## support.size dev cv
## 1 0 321.9560 64.59866
## 2 1 246.5820 49.99585
## 3 2 229.6078 46.65819
## 4 3 229.5308 46.99849
## 5 4 228.9702 45.95916
## 6 5 220.1834 45.34507
## 7 6 220.0022 46.33645
## 8 7 219.9659 46.31131
## 9 8 219.9210 46.44845
```

- draw the estimated coefficients on all candidate support size by
`coef()`

function:

`coef(abess_fit)`

```
## 9 x 9 sparse Matrix of class "dgCMatrix"
## 0 1 2 3 4
## (intercept) -0.3530971 1.158781 2.9213405 3.051182650 3.261927124
## Pclass . . -0.7952709 -0.833342489 -0.814632212
## Sexmale . -2.545075 -2.5818944 -2.593870577 -2.590161192
## Age . . . . .
## SibSp . . . . .
## Parch . . . . .
## Fare . . . -0.001187223 -0.001680436
## EmbarkedQ . . . . .
## EmbarkedS . . . . -0.305263064
## 5 6 7 8
## (intercept) 4.85381911 5.112857497 5.091004680 5.079012307
## Pclass -1.00849770 -1.076530668 -1.083610537 -1.085091543
## Sexmale -2.59656529 -2.610623446 -2.609378045 -2.601337079
## Age -0.03672237 -0.037469721 -0.037437108 -0.037259374
## SibSp -0.35770273 -0.339011356 -0.340217327 -0.350406209
## Parch . . . 0.053009526
## Fare . -0.001998009 -0.001938335 -0.002224556
## EmbarkedQ . . 0.196096062 0.210160986
## EmbarkedS -0.21507393 -0.245371516 -0.210145242 -0.214986229
```

- get the deviance of the estimated model on all candidate support sizes via
`deviance()`

function:

`deviance(abess_fit)`

```
## [1] 321.9560 246.5820 229.6078 229.5308 228.9702 220.1834 220.0022 219.9659
## [9] 219.9210
```

- visualize the change of models with the change of support size via
`plot()`

function:

`plot(abess_fit, label=T)`

The graph shows that, beginning from the most dense model, the second variable (`Sex`

) is included in the active set until the support size reaches 0. We can also generate a graph about the tuning value.

`plot(abess_fit, type = "tune")`

The tuning value reaches the lowest point at 4, which implies the best model consists of four variables.

Finally, to extract the specified model from the `abess`

object, we can call the `extract()`

function with a given `support.size`

. If `support.size`

is not provided, the model with the best tuning value will be returned. Here we extract the model with support size equals 6.

```
## List of 7
## $ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:4] 0 1 5 7
## .. ..@ p : int [1:2] 0 4
## .. ..@ Dim : int [1:2] 8 1
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:8] "Pclass" "Sexmale" "Age" "SibSp" ...
## .. .. ..$ : chr "4"
## .. ..@ x : num [1:4] -0.81463 -2.59016 -0.00168 -0.30526
## .. ..@ factors : list()
## $ intercept : num 3.26
## $ support.size: num 4
## $ support.vars: chr [1:4] "Pclass" "Sexmale" "Fare" "EmbarkedS"
## $ support.beta: num [1:4] -0.81463 -2.59016 -0.00168 -0.30526
## $ dev : num 229
## $ tune.value : num 46
```

The return is a list containing the basic information of the estimated model.

Prediction is allowed for all the estimated models. Just call `predict.abess()`

function with the `support.size`

set to the size of model users are interested in. If `support.size`

is not provided, prediction will be made based on the best tuning value. The `predict.abess()`

can provide both `link`

, standing for the linear predictors, and the `response`

, standing for the fitted probability. Here We will predict the probability of survival on the `test.csv`

data.

`fitted.results <- predict(abess_fit, newx = test, type = 'response')`

If we chose 0.5 as the cut point, i.e, we predict the person survives the sinking of the Titanic if the fitted probability is greater than 0.5, the accuracy will be 0.80.

```
fitted.results <- ifelse(fitted.results > 0.5, 1, 0)
misClasificError <- mean(fitted.results != test$Survived)
print(paste('Accuracy',1-misClasificError))
```

`## [1] "Accuracy 0.810126582278481"`

We can also generate an ROC and calculate the AUC value. On this dataset, the AUC is 0.87, which is quite close to 1.

```
library(ROCR)
fitted.results <- predict(abess_fit, newx = test, type = 'response')
pr <- prediction(fitted.results, test$Survived)
prf <- performance(pr, measure = "tpr", x.measure = "fpr")
plot(prf)
```

```
auc <- performance(pr, measure = "auc")
auc <- auc@y.values[[1]]
auc
```

`## [1] 0.8751124`

When the number of classes is more than 2, we call it multi-class classification task. Logistic regression can be extended to model several classes of events such as determining whether an image contains a cat, dog, lion, etc. Each object detected in the image would be assigned a probability between 0 and 1, with a sum of one. The extended model is multinomial logistic regression.

To arrive at the multinomial logistic model, one can imagine, for \(K\) possible classes, running \(K-1\) independent logistic regression models, in which one class is chosen as a ``pivot’’ and then the other \(K-1\) classes are separately regressed against the pivot outcome. This would proceed as follows, if class K (the last outcome) is chosen as the pivot: \[\ln(\mathbb{P}(y = 1) / \mathbb{P}(y = K)) = x^\top \boldsymbol\beta^{(1)}, \] \[\cdots \cdots\] \[\ln(\mathbb{P}(y = K - 1) / \mathbb{P}(y = K)) = x^\top \boldsymbol\beta^{(K - 1)}.\] Then, the probability to choose the \(j\)-th class can be easily derived to be: \[\mathbb{P}(y = j) = \frac{\exp{(x^\top \boldsymbol\beta^{(j)})}}{1 + \sum_{k=1}^{K-1} \exp{(x^\top \boldsymbol\beta^{(k)})}}, \] and subsequently, we would predict the \(j^{*}\)-th class if the \(j^* = \arg\max_{j} \mathbb{P}(y = j)\). Notice that, for \(K\) possible classes case, there are \(p \times (K - 1)\) unknown parameters: \(\boldsymbol\beta^{(1)}, \ldots, \boldsymbol\beta^{(K-1)}\) to be estimated. Because the number of parameters increase as \(K\), it is even more urge to constrain the model complexity. And the best subset selection for multinomial logistic regression aims to maximize the log-likelihood function and control the model complexity by restricting \(B = (\boldsymbol\beta^{(1)}, \ldots, \boldsymbol\beta^{(K-1)})\) with \(\| B \|_{0, 2} \leq s\) where \(\| B \|_{0, 2} = \sum_{i = 1}^{p} I(B_{i\cdot} = {\bf 0})\), \(B_{i\cdot}\) is the \(i\)-th row of coefficient matrix \(B\) and \({\bf 0} \in R^{K - 1}\) is an all zero vector. In other words, each row of \(B\) would be either all zero or all non-zero.

`abess`

Package
We shall conduct Multinomial logistic regression on an artificial dataset for demonstration. The `generate.data()`

function provides a simple way to generate suitable for this task. The assumption behind is the response vector following a multinomial distribution. The artifical dataset contain 100 observations and 20 predictors but only five predictors have influence on the three possible classes.

```
library(abess)
n <- 100
p <- 20
support.size <- 5
dataset <- generate.data(n, p, support.size, family = "multinomial", class.num = 3)
head(dataset$y)
```

`## [1] 1 2 1 1 0 0`

`dataset$beta`

```
## [,1] [,2] [,3]
## [1,] 0.000000 0.000000 0.000000
## [2,] 0.000000 0.000000 0.000000
## [3,] 0.000000 0.000000 0.000000
## [4,] 0.000000 0.000000 0.000000
## [5,] 0.000000 0.000000 0.000000
## [6,] 16.244248 6.605285 4.958967
## [7,] 0.000000 0.000000 0.000000
## [8,] 0.000000 0.000000 0.000000
## [9,] 0.000000 0.000000 0.000000
## [10,] 0.000000 0.000000 0.000000
## [11,] 3.462001 6.774016 20.454220
## [12,] 0.000000 0.000000 0.000000
## [13,] 18.987979 13.133145 17.211335
## [14,] 8.034160 1.117907 2.904334
## [15,] 0.000000 0.000000 0.000000
## [16,] 0.000000 0.000000 0.000000
## [17,] 0.000000 0.000000 0.000000
## [18,] 23.307996 17.576156 11.922882
## [19,] 0.000000 0.000000 0.000000
## [20,] 0.000000 0.000000 0.000000
```

To carry out best subset selection for multinomial logistic regression, users can call the `abess()`

function with `family`

specified to `multinomial`

. Here is an example.

```
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
family = "multinomial", tune.type = "cv")
extract(abess_fit)[["support.vars"]]
```

`## [1] "x6" "x11" "x13" "x14" "x18"`

Notice that the `abess()`

correctly identifies the support set of the ground truth coefficient matrix.