../vignettes/v07-advancedFeatures.Rmd
v07-advancedFeatures.Rmd
When analyzing the real world datasets, we may have the following
targets:
1. certain variables must be selected when some prior information is
given;
2. selecting the weak signal variables when the prediction performance
is main interest;
3. identifying predictors when group structure are provided;
4. pre-excluding a part of predictors when datasets have ultra
high-dimensional predictors.
In the following content, we will illustrate the statistic methods to
reach these targets in a one-by-one manner, and give quick examples to
show how to perform the statistic methods with the abess
package. Actually, in the abess
package, there targets can
be properly handled by simply change the default arguments in the
abess()
function.
Nuisance regression refers to best subset selection with some prior information that some variables are required to stay in the active set. For example, if we are interested in a certain gene and want to find out what other genes are associated with the response when this particular gene shows effect.
In the abess()
function, the argument
always.include
is designed to realize this goal. user can
pass a vector containing the indexes of the target variables to
always.include
. Here is an example.
##
## Thank you for using abess! To acknowledge our work, please cite the package:
##
## Zhu J, Wang X, Hu L, Huang J, Jiang K, Zhang Y, Lin S, Zhu J (2022). 'abess: A Fast Best Subset Selection Library in Python and R.' Journal of Machine Learning Research, 23(202), 1-7. https://www.jmlr.org/papers/v23/21-1060.html.
n <- 100
p <- 20
support.size <- 3
dataset <- generate.data(n, p, support.size)
dataset$beta
## [1] 0.00000 0.00000 0.00000 0.00000 0.00000 84.67713 0.00000 0.00000
## [9] 0.00000 0.00000 0.00000 0.00000 0.00000 85.82187 0.00000 0.00000
## [17] 0.00000 47.40922 0.00000 0.00000
abess_fit <- abess(dataset[["x"]], dataset[["y"]], always.include = 6)
coef(abess_fit, support.size = abess_fit$support.size[which.min(abess_fit$tune.value)])
## 21 x 1 sparse Matrix of class "dgCMatrix"
## 4
## (intercept) -4.767241
## x1 .
## x2 .
## x3 .
## x4 .
## x5 .
## x6 82.424118
## x7 .
## x8 .
## x9 .
## x10 .
## x11 .
## x12 .
## x13 .
## x14 89.031689
## x15 .
## x16 -12.622583
## x17 .
## x18 43.343247
## x19 .
## x20 .
In some cases, especially under low signal-to-noise ratio (SNR)
setting or predictors are highly correlated, the vanilla type of \(L_0\) constrained model may not be
satisfying and a more sophisticated trade-off between bias and variance
is needed. Under this concern, the abess
package provides
option of best subset selection with \(L_2\) norm regularization called the
regularized best subset selection (BESS). The model has this following
form:
\[\begin{align} \min_{\boldsymbol\beta} \mathcal{L_n}({\boldsymbol\beta}) + \lambda \|\boldsymbol\beta\|_2^2. \end{align}\]
To implement the regularized BESS, user need to specify a value to
the lambda
in the abess()
function. This
lambda
value corresponds to the penalization parameter in
the model above. Here we give an example.
library(abess)
n <- 100
p <- 30
snr <- 0.05
dataset <- generate.data(n, p, snr = snr, seed = 1, beta = rep(c(1, rep(0 ,5)), each = 5), rho = 0.8, cortype = 3)
data.test <- generate.data(n, p, snr = snr, beta = dataset$beta, seed = 100)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], lambda = 0.7)
Let’s test the regularized best subset selection against the no-regularized one over 100 replications in terms of prediction performance.
M <- 100
err.l0 <- rep(0, M)
err.l0l2 <- rep(0, M)
for(i in 1:M){
dataset <- generate.data(n, p, snr = snr, seed = i, beta = rep(c(1, rep(0 ,5)), each = 5), rho = 0.8, cortype = 3)
data.test <- generate.data(n, p, snr = snr, beta = dataset$beta, seed = i+100)
abess_fit <- abess(dataset[["x"]], dataset[["y"]], lambda = 0.7)
coef(abess_fit, support.size = abess_fit$support.size[which.min(abess_fit$tune.value)])
pe.l0l2 <- norm(data.test$y - predict(abess_fit, newx = data.test$x),'2')
err.l0[i] <- pe.l0l2
abess_l0 <- abess(dataset[["x"]], dataset[["y"]])
coef(abess_l0, support.size = abess_l0$support.size[which.min(abess_l0$tune.value)])
pe.l0 <- norm(data.test$y -predict(abess_l0, newx = data.test$x), '2')
err.l0l2[i] <- pe.l0
}
mean(err.l0)
## [1] 128.4383
mean(err.l0l2)
## [1] 141.4478
We see that the regularized best subset select indeed reduces the prediction error.
Group linear model is a linear model which the \(p\) predictors are separated into \(J\) pre-determined non-overlapping groups,
\[ y = \sum_{j=1}^J X_{G_j} \boldsymbol{\beta_{G_j}}+\epsilon, \] where \(\{G_j\}_{j=1}^J\) are the group indices of the \(p\) predictors.
Best group subset selection (BGSS) aims to choose a small part of groups to achieve the best interpretability on the response variable. BGSS is practically useful for the analysis of variables with certain group structures.
The BGSS can be achieved by solving: \[\begin{equation}\label{eq:constraint}
\min\limits_{\beta \in \mathbb{R}^p}
\mathcal{L_n}({\boldsymbol\beta}),\quad s.t.\
\|\boldsymbol{\beta}\|_{0,2}\leqslant s,
\end{equation}\] where the \(\ell_{0,2}\) (-pseudo) norm is defined as
\(\|{{\boldsymbol\beta}}\|_{0,2} =
\sum_{j=1}^J \mathrm{I} (\|\boldsymbol{\beta_{G_j}}\|_2 \neq 0)\)
and model size \(s\) is a positive
integer to be determined from data. Regardless of the NP-hard of this
problem, Zhang et al develop a certifiably polynomial algorithm with a
high probability to solve it. This algorithm is integrated in the
abess
package, and user can handily select best group
subset by assigning a proper value to the group.index
arguments in abess()
function.
We use a synthetic dataset to demonstrate its usage. The dataset consists of 100 observations. Each observation has 20 predictors but only the first three predictors among them have a impact on the response.
set.seed(1234)
n <- 100
p <- 20
support.size <- 3
dataset <- generate.data(n, p, beta = c(10, 5, 5, rep(0, 17)))
Supposing we have some prior information that the second and third
variables belong to the same group, then we can set the
group_index
as:
Then, pass this group_index
to the abess()
function:
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
group.index = group_index)
str(extract(abess_fit))
## List of 7
## $ beta :Formal class 'dgCMatrix' [package "Matrix"] with 6 slots
## .. ..@ i : int [1:3] 0 1 2
## .. ..@ p : int [1:2] 0 3
## .. ..@ Dim : int [1:2] 20 1
## .. ..@ Dimnames:List of 2
## .. .. ..$ : chr [1:20] "x1" "x2" "x3" "x4" ...
## .. .. ..$ : chr "2"
## .. ..@ x : num [1:3] 10.2 4.92 4.88
## .. ..@ factors : list()
## $ intercept : num 0.0859
## $ support.size: int 2
## $ support.vars: chr [1:3] "x1" "x2" "x3"
## $ support.beta: num [1:3] 10.2 4.92 4.88
## $ dev : num 8.3
## $ tune.value : num 225
The fitted result suggest that only two groups are selected (since
support.size
is 2) and the selected variables are matched
with the ground truth setting.
In the case of ultra high-dimensional data, it is more effective to
do feature screening first before carrying out feature selection. The
abess
package allowed users to perform the sure independent
screening to pre-exclude some features according to the marginal
likelihood.
abess
package
Ultra-high dimensional predictors increase computational cost but
reduce estimation accuracy for any statistical procedure. To reduce
dimensionality from high to a relatively acceptable level, a fairly
general asymptotic framework, named feature screening (sure independence
screening) is proposed to tackle even exponentially growing dimension.
The feature screening can theoretically maintain all effective
predictors with a high probability, which is called the sure screening
property. In the abess()
function, to carrying out the
Integrate SIS, user need to passed an integer smaller than the number of
the predictors to the screening.num
. Then the
abess()
function will first calculate the marginal
likelihood of each predictor and reserve those predictors with the
screening.num
largest marginal likelihood. Then, the ABESS
algorithm is conducted only on this screened subset. Here is an
example.
library(abess)
n <- 100
p <- 1000
support.size <- 3
dataset <- generate.data(n, p, support.size)
abess_fit <- abess(dataset[["x"]], dataset[["y"]],
screening.num = 100)