The ABESS algorithm, which employs the "splicing" technique, solves the general best-subset selection problem exactly in polynomial time. The aim of this page is to provide complete and coherent documentation of the ABESS algorithm and its variants, so that users can easily understand the algorithm and thereby make better use of the abess software. We describe the splicing approach for the linear model in detail below; its variants can be applied in similar ways.
Consider the \(\ell_{0}\)-constrained minimization problem \[ \min _{\boldsymbol{\beta}} \mathcal{L}_{n}(\boldsymbol{\beta}), \quad \text { s.t. }\|\boldsymbol{\beta}\|_{0} \leq \mathrm{s}, \] where \(\mathcal{L}_{n}(\boldsymbol{\beta})=\frac{1}{2 n}\|y-X \boldsymbol{\beta}\|_{2}^{2}\). Without loss of generality, we consider \(\|\boldsymbol{\beta}\|_{0}=\mathrm{s}\). Given any initial set \(\mathcal{A} \subset \mathcal{S}=\{1,2, \ldots, p\}\) with cardinality \(|\mathcal{A}|=s\), denote \(\mathcal{I}=\mathcal{A}^{\mathrm{c}}\) and compute \[ \hat{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol{\beta}). \] We call \(\mathcal{A}\) and \(\mathcal{I}\) the active set and the inactive set, respectively.
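To make the restricted minimization concrete, here is a minimal R sketch that fits least squares on the columns indexed by an active set while keeping the remaining coefficients at zero. The simulated data and the helper name `fit_active` are illustrative only (they are not part of the abess package); the same data and helper are reused in the later sketches on this page.

```r
# Minimal sketch: least squares restricted to a given active set (illustrative only).
set.seed(1)
n <- 100; p <- 50
X <- matrix(rnorm(n * p), n, p)
y <- drop(X[, 1:3] %*% c(3, -2, 1) + rnorm(n))   # only the first 3 variables are relevant

fit_active <- function(X, y, active) {
  beta <- numeric(ncol(X))                        # coefficients on the inactive set stay 0
  XA <- X[, active, drop = FALSE]
  beta[active] <- solve(crossprod(XA), crossprod(XA, y))  # OLS on the active columns
  beta
}

beta_hat <- fit_active(X, y, active = c(1, 2, 3))
```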
Given the active set \(\mathcal{A}\) and \(\hat{\boldsymbol{\beta}}\), we can define the following two types of sacrifices:

1. Backward sacrifice: for any \(j \in \mathcal{A}\), the magnitude of discarding variable \(j\) is \[ \xi_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A} \backslash\{j\}}\right)-\mathcal{L}_{n}(\hat{\boldsymbol{\beta}})=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\hat{\beta}_{j}\right)^{2}, \] where \(\hat{\boldsymbol{\beta}}^{\mathcal{A} \backslash\{j\}}\) is the estimate restricted to \(\mathcal{A} \backslash\{j\}\).
2. Forward sacrifice: for any \(j \in \mathcal{I}\), the magnitude of adding variable \(j\) is \[ \zeta_{j}=\mathcal{L}_{n}(\hat{\boldsymbol{\beta}})-\min _{t} \mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}+t e_{j}\right)=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{\hat{d}_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2}, \] where \(e_{j}\) is the \(j\)-th standard basis vector and \(\hat{d}_{j}=X_{j}^{\top}(y-X \hat{\boldsymbol{\beta}}) / n\).

Intuitively, a large \(\xi_{j}\) (respectively \(\zeta_{j}\)) indicates that variable \(j\) is potentially important.
Unfortunately, these two sacrifices are incomparable because they correspond to different support sizes. However, if we exchange some "irrelevant" variables in \(\mathcal{A}\) with some "important" variables in \(\mathcal{I}\), the quality of the solution may improve. This intuition motivates our splicing method. Specifically, given any splicing size \(k \leq s\), define
\[ \mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\} \] to represent the \(k\) least relevant variables in \(\mathcal{A}\) and \[ \mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\} \] to represent the \(k\) most relevant variables in \(\mathcal{I}\). Then, we splice \(\mathcal{A}\) and \(\mathcal{I}\) by exchanging \(\mathcal{A}_{k}\) and \(\mathcal{I}_{k}\) and obtain a new active set \[ \tilde{\mathcal{A}}=\left(\mathcal{A} \backslash \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}. \] Let \(\tilde{\mathcal{I}}=\tilde{\mathcal{A}}^{c}\), \(\tilde{\boldsymbol{\beta}}=\arg \min _{\boldsymbol{\beta}_{\tilde{\mathcal{I}}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})\), and let \(\tau_{s}>0\) be a threshold. If \(\tau_{s}<\mathcal{L}_{n}(\hat{\boldsymbol\beta})-\mathcal{L}_{n}(\tilde{\boldsymbol\beta})\), then \(\tilde{\mathcal{A}}\) is preferable to \(\mathcal{A}\). The active set can be updated iteratively until the loss function cannot be improved by splicing. Once the algorithm has recovered the true active set, splicing some irrelevant variables may still decrease the loss function slightly; the threshold \(\tau_{s}\) avoids this unnecessary calculation. Typically, \(\tau_{s}\) is relatively small, e.g. \(\tau_{s}=0.01 s \log (p) \log (\log n) / n.\)
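As an illustration of a single splicing exchange, the sketch below computes the backward sacrifices \(\xi_j\) on the active set and the forward sacrifices \(\zeta_j\) on the inactive set (via \(d_j = X_j^{\top}(y - X\beta)/n\)), then swaps the \(k\) least relevant active variables with the \(k\) most relevant inactive ones. It reuses the illustrative `fit_active` helper from the previous sketch; `splice_once` is likewise a hypothetical name, not a package function.

```r
# One splicing exchange (illustrative sketch, reusing X, y, fit_active from above).
splice_once <- function(X, y, active, k) {
  n <- nrow(X); p <- ncol(X)
  inactive <- setdiff(seq_len(p), active)
  beta <- fit_active(X, y, active)
  d <- as.vector(crossprod(X, y - X %*% beta)) / n        # d_j = X_j^T (y - X beta) / n
  xtx <- colSums(X^2)                                     # X_j^T X_j for each j
  xi   <- xtx[active]   / (2 * n) * beta[active]^2        # backward sacrifices on A
  zeta <- xtx[inactive] / (2 * n) * (d[inactive] / (xtx[inactive] / n))^2  # forward sacrifices on I
  A_k <- active[order(xi)][seq_len(k)]                    # k least relevant active variables
  I_k <- inactive[order(zeta, decreasing = TRUE)][seq_len(k)]  # k most relevant inactive variables
  sort(c(setdiff(active, A_k), I_k))                      # spliced active set
}

splice_once(X, y, active = c(1, 2, 5), k = 1)             # should typically recover {1, 2, 3}
```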
The algorithm for best-subset selection with a given support size \(s\) (denoted BESS.Fixed\((s)\)) proceeds as follows. Given an initial active set \(\mathcal{A}^{0}\) with \(\left|\mathcal{A}^{0}\right|=s\) and \(\mathcal{I}^{0}=\left(\mathcal{A}^{0}\right)^{c}\), initialize \[\begin{align*} &\boldsymbol{\beta}_{\mathcal{I}^{0}}^{0}=0,\\ &\boldsymbol{d}_{\mathcal{A}^{0}}^{0}=0,\\ &\boldsymbol{\beta}_{\mathcal{A}^{0}}^{0}=\left(\boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{X}_{\mathcal{A}^{0}}\right)^{-1} \boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{y},\\ &\boldsymbol{d}_{\mathcal{I}^{0}}^{0}=\boldsymbol{X}_{\mathcal{I}^{0}}^{\top}\left(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}^{0}\right). \end{align*}\]
For \(m=0,1, \ldots, m_{\max}\), do
\[\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)= \text{Splicing} \left(\boldsymbol{\beta}^{m}, \boldsymbol{d}^{m}, \mathcal{A}^{m}, \mathcal{I}^{m}, k_{\max }, \tau_{s}\right).\]
If \(\left(\mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)=\left(\mathcal{A}^{m}, \mathcal{I}^{m}\right)\), then stop
End For
Output \((\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right).\)
The Splicing subroutine called above proceeds as follows.

Input: \(\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I}, k_{\max }\), and \(\tau_{\mathrm{s}} .\)
Initialize \(L_{0}=L=\frac{1}{2 n}\|y-X \beta\|_{2}^{2}\), and set \(\xi_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\beta_{j}\right)^{2}, \zeta_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{d_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2}, j=1, \ldots, p.\)
For \(k=1,2, \ldots, k_{\max }\), do
\[\mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\}\]
\[\mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\}\]
Let \(\tilde{\mathcal{A}}_{k}=\left(\mathcal{A} \backslash \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}, \tilde{\mathcal{I}}_{k}=\left(\mathcal{I} \backslash \mathcal{I}_{k}\right) \cup \mathcal{A}_{k}\) and solve
\[\tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{A}}_{k}}=\left(\boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}^{\top} \boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}\right)^{-1} \boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}^{\top} y, \quad \tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{I}}_{k}}=0,\]
\[\tilde{\boldsymbol d}_{\tilde{\mathcal{I}}_{k}}=X_{\tilde{\mathcal{I}}_{k}}^{\top}(y-X \tilde{\boldsymbol\beta}) / n,\quad \tilde{\boldsymbol d}_{\tilde{\mathcal{A}}_{k}} = 0.\]
Compute \(\mathcal{L}_{n}(\tilde{\boldsymbol\beta})=\frac{1}{2 n}\|y-X \tilde{\boldsymbol\beta}\|_{2}^{2}.\)
If \(L>\mathcal{L}_{n}(\tilde{\boldsymbol\beta})\), then
\[(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{d}}, \tilde{\mathcal{A}}_{k}, \tilde{\mathcal{I}}_{k}\right)\]
\[L=\mathcal{L}_{n}(\tilde{\boldsymbol\beta}).\]
End For
If \(L_{0}-L<\tau_{s}\), then \((\hat{\boldsymbol\beta}, \hat{\boldsymbol d}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=(\boldsymbol\beta, \boldsymbol d, \mathcal{A}, \mathcal{I}).\)
Output \((\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})\).
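The two algorithms above translate almost line by line into R. The sketch below is an illustrative re-implementation (the names `splicing_step` and `bess_fixed` are hypothetical, and the initial active set is chosen here by marginal correlation, one simple choice); it reuses the `fit_active` helper defined earlier and stops when the active set no longer changes.

```r
# Illustrative R translation of the Splicing and BESS.Fixed algorithms above.
splicing_step <- function(X, y, active, k_max, tau_s) {
  n <- nrow(X); p <- ncol(X)
  inactive <- setdiff(seq_len(p), active)
  beta <- fit_active(X, y, active)
  d <- as.vector(crossprod(X, y - X %*% beta)) / n
  L0 <- sum((y - X %*% beta)^2) / (2 * n)
  xtx <- colSums(X^2)
  xi   <- xtx / (2 * n) * beta^2                     # backward sacrifices (used on A)
  zeta <- xtx / (2 * n) * (d / (xtx / n))^2          # forward sacrifices (used on I)
  best <- list(active = active, beta = beta, loss = L0)
  for (k in seq_len(min(k_max, length(active), length(inactive)))) {
    A_k <- active[order(xi[active])][seq_len(k)]
    I_k <- inactive[order(zeta[inactive], decreasing = TRUE)][seq_len(k)]
    new_active <- sort(c(setdiff(active, A_k), I_k))
    new_beta <- fit_active(X, y, new_active)
    new_loss <- sum((y - X %*% new_beta)^2) / (2 * n)
    if (new_loss < best$loss) {                      # keep the best splice found so far
      best <- list(active = new_active, beta = new_beta, loss = new_loss)
    }
  }
  # If the improvement is below the threshold tau_s, keep the original active set.
  if (L0 - best$loss < tau_s) best <- list(active = active, beta = beta, loss = L0)
  best
}

bess_fixed <- function(X, y, s, k_max = 2, m_max = 20,
                       tau_s = 0.01 * s * log(ncol(X)) * log(log(nrow(X))) / nrow(X)) {
  # Initialize with the s variables most correlated with y (one simple choice).
  active <- order(abs(as.vector(crossprod(X, y))), decreasing = TRUE)[seq_len(s)]
  for (m in seq_len(m_max)) {
    fit <- splicing_step(X, y, active, k_max, tau_s)
    if (setequal(fit$active, active)) break          # active set unchanged: converged
    active <- fit$active
  }
  fit
}

bess_fixed(X, y, s = 3)$active                        # should typically recover {1, 2, 3}
```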
In practice, the support size is usually unknown, and we use a data-driven procedure to determine \(s\). For any active set \(\mathcal{A}\), define an \(\operatorname{SIC}\) as follows: \[ \operatorname{SIC}(\mathcal{A})=n \log \mathcal{L}_{\mathcal{A}}+|\mathcal{A}| \log (p) \log \log n, \] where \(\mathcal{L}_{\mathcal{A}}=\min _{\boldsymbol\beta_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol\beta)\) and \(\mathcal{I}=\mathcal{A}^{c}\). To identify the true model, the model-complexity penalty is \(\log p\), and the slowly diverging rate \(\log \log n\) is set to prevent underfitting. Theorem 4 states that the following ABESS algorithm selects the true support size via SIC.
Let \(s_{\max }\) be the maximum support size. We suggest \(s_{\max }=o\left(\frac{n}{\log p}\right)\) as the maximum possible recovery size. Typically, we set \(s_{\max }=\left[\frac{n}{\log (p) \log \log n}\right]\) where \([x]\) denotes the integer part of \(x\).
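For concreteness, here is a small sketch (again with hypothetical helper names, reusing `fit_active` and the simulated data above) that evaluates the SIC of a candidate active set and computes the suggested maximum support size:

```r
# Illustrative SIC and maximum support size (reusing X, y, fit_active from above).
sic <- function(X, y, active) {
  n <- nrow(X); p <- ncol(X)
  beta <- fit_active(X, y, active)
  loss <- sum((y - X %*% beta)^2) / (2 * n)       # L_A, the loss restricted to A
  n * log(loss) + length(active) * log(p) * log(log(n))
}

s_max <- floor(nrow(X) / (log(ncol(X)) * log(log(nrow(X)))))   # [n / (log(p) * log(log(n)))]
```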
The ABESS algorithm with a data-driven support size then proceeds as follows.

Input: \(X, y\), and the maximum support size \(s_{\max } .\)
For \(s=1,2, \ldots, s_{\max }\), do
\[\left(\hat{\boldsymbol{\beta}}_{s}, \hat{\boldsymbol{d}}_{s}, \hat{\mathcal{A}}_{s}, \hat{\mathcal{I}}_{s}\right)= \text{BESS.Fixed}(s).\]
End For
Compute the minimum of SIC:
\[s_{\min }=\arg \min _{s} \operatorname{SIC}\left(\hat{\mathcal{A}}_{s}\right).\]
Output \(\left(\hat{\boldsymbol{\beta}}_{s_{\min }}, \hat{\boldsymbol{d}}_{s_{\min }}, \hat{\mathcal{A}}_{s_{\min }}, \hat{\mathcal{I}}_{s_{\min }}\right) .\)
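A sketch of this support-size selection loop, built from the illustrative `bess_fixed` and `sic` helpers above:

```r
# Illustrative selection of the support size by minimizing SIC.
fits <- lapply(seq_len(s_max), function(s) bess_fixed(X, y, s))
sic_values <- vapply(fits, function(fit) sic(X, y, fit$active), numeric(1))
best_fit <- fits[[which.min(sic_values)]]
best_fit$active                                   # selected active set
```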
The table below summarizes the correspondence between the notation used in the algorithms above and the parameters of the \(\mathtt{abess()}\) function.
| Notation in algorithm | Parameter in \(\mathtt{abess}\) | Definition |
|---|---|---|
| \(m_{\max}\) | max.splicing.iter | The maximum number of splicing iterations |
| \(k_{\max}\) | c.max | The maximum splicing size |
| \(1,\ldots,s_{\max}\) | support.size | A sequence of candidate support sizes |
| \(\operatorname{SIC}\) | tune.type | The type of criterion for choosing the support size |
| \(\{G_1, \ldots, G_J\}\) | group.index | A vector indicating the group that each variable belongs to |
| \(\lambda\) | lambda | The value of the regularization parameter for regularized best subset selection |
| Golden-section search | gs.range | Upper and lower bounds of the golden-section search range |
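In practice none of this needs to be hand-coded: the \(\mathtt{abess()}\) function runs the whole procedure. A minimal usage sketch on the simulated data from above (the parameter values are illustrative, and tune.type = "gic" is assumed here to play the role of the SIC-type criterion described earlier):

```r
library(abess)

# Best subset selection on the simulated data above, tuning the support size
# over 0:10 with an information criterion.
fit <- abess(X, y,
             family = "gaussian",
             tune.type = "gic",          # assumed to correspond to the SIC-type criterion above
             support.size = 0:10,        # candidate support sizes s
             c.max = 2,                  # k_max, the maximum splicing size
             max.splicing.iter = 20)     # m_max, the maximum number of splicing iterations
fit
```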