Introduction

The ABESS algorithm, which employs a "splicing" technique, can exactly solve the general best subset selection problem in polynomial time. The aim of this page is to provide complete and coherent documentation of the ABESS algorithm and its variants, so that users can easily understand them and thereby make better use of the abess software. We detail the splicing approach for the linear model below; its variants can be applied in similar ways.

Linear model

Sacrifices

Consider the $\ell_{0}$-constrained minimization problem
$$\min_{\boldsymbol{\beta}} \mathcal{L}_{n}(\boldsymbol{\beta}), \quad \text{s.t.} \quad \|\boldsymbol{\beta}\|_{0} \leq s,$$
where $\mathcal{L}_{n}(\boldsymbol{\beta})=\frac{1}{2n}\|y-X \boldsymbol{\beta}\|_{2}^{2}$. Without loss of generality, we consider $\|\boldsymbol{\beta}\|_{0}=s$. Given any initial set $\mathcal{A} \subset \mathcal{S}=\{1,2, \ldots, p\}$ with cardinality $|\mathcal{A}|=s$, denote $\mathcal{I}=\mathcal{A}^{c}$ and compute
$$\hat{\boldsymbol{\beta}}=\arg \min_{\boldsymbol{\beta}_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol{\beta}).$$
We call $\mathcal{A}$ and $\mathcal{I}$ the active set and the inactive set, respectively.

Given the active set $\mathcal{A}$ and $\hat{\boldsymbol{\beta}}$, we can define the following two types of sacrifices:

  1. Backward sacrifice: For any $j \in \mathcal{A}$, the magnitude of discarding variable $j$ is
$$\xi_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A} \setminus\{j\}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\hat{\beta}_{j}\right)^{2}.$$
  2. Forward sacrifice: For any $j \in \mathcal{I}$, the magnitude of adding variable $j$ is
$$\zeta_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+\hat{t}^{\{j\}}\right)=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{\hat{d}_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2},$$
where $\hat{t}=\arg \min_{t} \mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+t^{\{j\}}\right)$ and $\hat{d}_{j}=X_{j}^{\top}(y-X \hat{\boldsymbol{\beta}}) / n$. Intuitively, for $j \in \mathcal{A}$ (or $j \in \mathcal{I}$), a large $\xi_{j}$ (or $\zeta_{j}$) implies that the $j$th variable is potentially important.
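To make the two quantities concrete, here is a minimal R sketch (not the packaged implementation, which is written in C++) that computes both sacrifices for a candidate active set on simulated data; all names are illustrative.

```r
# Minimal sketch: backward (xi) and forward (zeta) sacrifices for an
# active set A, following the formulas above.
set.seed(1)
n <- 100; p <- 10; s <- 3
X <- matrix(rnorm(n * p), n, p)
y <- X[, 1:s] %*% rep(2, s) + rnorm(n)

A <- 1:s                          # a candidate active set
I <- setdiff(1:p, A)              # its inactive complement
beta <- numeric(p)
beta[A] <- solve(crossprod(X[, A]), crossprod(X[, A], y))  # restricted OLS
d <- as.vector(crossprod(X, y - X %*% beta)) / n           # X_j^T residual / n

xtx <- colSums(X^2)                            # X_j^T X_j for each j
xi <- xtx / (2 * n) * beta^2                   # backward sacrifice (relevant on A)
zeta <- xtx / (2 * n) * (d / (xtx / n))^2      # forward sacrifice (relevant on I)
```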

Algorithm

Best-Subset Selection with a Given Support Size

Unfortunately, these two sacrifices are incomparable because they are computed on support sets of different sizes. However, if we exchange some "irrelevant" variables in $\mathcal{A}$ with some "important" variables in $\mathcal{I}$, we may obtain a higher-quality solution. This intuition motivates our splicing method. Specifically, given any splicing size $k \leq s$, define

$$\mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\}$$
to represent the $k$ least relevant variables in $\mathcal{A}$, and
$$\mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\}$$
to represent the $k$ most relevant variables in $\mathcal{I}$. Then we splice $\mathcal{A}$ and $\mathcal{I}$ by exchanging $\mathcal{A}_{k}$ and $\mathcal{I}_{k}$, obtaining the new active set
$$\tilde{\mathcal{A}}=\left(\mathcal{A} \setminus \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}.$$
Let $\tilde{\mathcal{I}}=\tilde{\mathcal{A}}^{c}$, $\tilde{\boldsymbol{\beta}}=\arg \min_{\boldsymbol{\beta}_{\tilde{\mathcal{I}}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})$, and let $\tau_{s}>0$ be a threshold. If $\tau_{s}<\mathcal{L}_{n}(\hat{\boldsymbol{\beta}})-\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})$, then $\tilde{\mathcal{A}}$ is preferable to $\mathcal{A}$. The active set can be updated iteratively until the loss function cannot be improved by splicing. Once the algorithm recovers the true active set, splicing in irrelevant variables may still decrease the loss slightly; the threshold $\tau_{s}$ reduces this unnecessary computation. Typically, $\tau_{s}$ is relatively small, e.g. $\tau_{s}=0.01 s \log (p) \log (\log n) / n$.
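The exchange step itself takes only a few lines. The following R helper (a sketch with hypothetical names; `xi` and `zeta` are assumed computed as in the earlier snippet) builds $\mathcal{A}_k$, $\mathcal{I}_k$, and the spliced active set:

```r
# Sketch of one splice of size k: drop the k least relevant active variables
# and add the k most relevant inactive ones.
splice_once <- function(A, I, xi, zeta, k) {
  A_k <- A[rank(xi[A], ties.method = "first") <= k]     # k smallest backward sacrifices
  I_k <- I[rank(-zeta[I], ties.method = "first") <= k]  # k largest forward sacrifices
  A_new <- sort(c(setdiff(A, A_k), I_k))                # spliced active set
  list(A = A_new, I = setdiff(seq_along(xi), A_new))
}
```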

Algorithm 1: BESS.Fix($s$): Best-Subset Selection with a given support size $s$.
  1. Input: $X, y$, positive integers $k_{\max}, m_{\max}$, and a threshold $\tau_{s}$.
  2. Initialize $\mathcal{A}^{0}=\left\{j: \sum_{i=1}^{p} \mathrm{I}\left(\left|\frac{X_{j}^{\top} y}{\sqrt{X_{j}^{\top} X_{j}}}\right| \leq\left|\frac{X_{i}^{\top} y}{\sqrt{X_{i}^{\top} X_{i}}}\right|\right) \leq s\right\}$, $\mathcal{I}^{0}=\left(\mathcal{A}^{0}\right)^{c}$, and $\left(\boldsymbol{\beta}^{0}, d^{0}\right)$:

๐›ƒโ„00=0,d๐’œ00=0,๐›ƒ๐’œ00=(๐—๐’œ0โŠค๐—๐’œ0)โˆ’1๐—๐’œ0โŠค๐ฒ,dโ„00=Xโ„0โŠค(๐ฒโˆ’๐—๐›ƒ0).\begin{align*} &\boldsymbol{\beta}_{\mathcal{I}^{0}}^{0}=0,\\ &d_{\mathcal{A}^{0}}^{0}=0,\\ &\boldsymbol{\beta}_{\mathcal{A}^{0}}^{0}=\left(\boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{X}_{\mathcal{A}^{0}}\right)^{-1} \boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{y},\\ &d_{\mathcal{I}^{0}}^{0}=X_{\mathcal{I}^{0}}^{\top}\left(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}^{0}\right). \end{align*}

  3. For $m=0,1, \ldots, m_{\max}$, do

    $$\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)=\text{Splicing}\left(\boldsymbol{\beta}^{m}, \boldsymbol{d}^{m}, \mathcal{A}^{m}, \mathcal{I}^{m}, k_{\max}, \tau_{s}\right).$$

    If (๐’œm+1,โ„m+1)=(๐’œm,โ„m)\left(\mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)=\left(\mathcal{A}^{m}, \mathcal{I}^{m}\right), then stop

    End for

  4. Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right).$
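A compact R rendering of Algorithm 1 might look as follows. This is a didactic sketch, not the optimized implementation shipped with abess; it assumes a `splicing()` helper implementing Algorithm 2 (one possible version is sketched after Algorithm 2 below), and its default for `tau_s` follows the suggestion above.

```r
# Sketch of Algorithm 1 (BESS.Fix). Assumes splicing() from the
# Algorithm 2 sketch below.
bess_fix <- function(X, y, s, k_max = s, m_max = 20,
                     tau_s = 0.01 * s * log(ncol(X)) * log(log(nrow(X))) / nrow(X)) {
  n <- nrow(X); p <- ncol(X)
  score <- abs(as.vector(crossprod(X, y))) / sqrt(colSums(X^2))  # marginal screening
  A <- sort(order(score, decreasing = TRUE)[1:s])                # initial active set A^0
  I <- setdiff(1:p, A)
  beta <- numeric(p)
  beta[A] <- solve(crossprod(X[, A]), crossprod(X[, A], y))
  d <- as.vector(crossprod(X, y - X %*% beta)) / n
  d[A] <- 0
  for (m in 1:m_max) {
    fit <- splicing(X, y, beta, d, A, I, k_max, tau_s)
    if (setequal(fit$A, A)) break                  # active set unchanged: stop
    beta <- fit$beta; d <- fit$d; A <- fit$A; I <- fit$I
  }
  list(beta = beta, d = d, A = A, I = I)
}
```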

Algorithm 2: Splicing $\left(\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I}, k_{\max}, \tau_{s}\right)$
  1. Input: $\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I}, k_{\max}$, and $\tau_{s}$.

  2. Initialize $L_{0}=L=\frac{1}{2 n}\|y-X \boldsymbol{\beta}\|_{2}^{2}$, and set
$$\xi_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\beta_{j}\right)^{2}, \quad \zeta_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{d_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2}, \quad j=1, \ldots, p.$$

  3. For $k=1,2, \ldots, k_{\max}$, do

    $$\mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\}$$

    โ„k={jโˆˆโ„:โˆ‘iโˆˆโ„I(ฮถjโ‰คฮถi)โ‰คk}\mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\}

    Let ๐’œฬƒk=(๐’œโˆ–๐’œk)โˆชโ„k,โ„ฬƒk=(โ„โˆ–โ„k)โˆช๐’œk\tilde{\mathcal{A}}_{k}=\left(\mathcal{A} \backslash \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}, \tilde{\mathcal{I}}_{k}=\left(\mathcal{I} \backslash \mathcal{I}_{k}\right) \cup \mathcal{A}_{k} and solve

    ๐›ƒฬƒ๐’œk=(๐—๐’œkโŠค๐—๐’œk)โˆ’1๐—๐’œkโŠคy,๐›ƒฬƒโ„k=0\tilde{\boldsymbol{\beta}}_{{\mathcal{A}}_{k}}=\left(\boldsymbol{X}_{\mathcal{A}_{k}}^{\top} \boldsymbol{X}_{{\mathcal{A}}_{k}}\right)^{-1} \boldsymbol{X}_{{\mathcal{A}_{k}}}^{\top} y, \quad \tilde{\boldsymbol{\beta}}_{{\mathcal{I}}_{k}}=0

    $$\tilde{\boldsymbol{d}}_{\tilde{\mathcal{I}}_{k}}=X_{\tilde{\mathcal{I}}_{k}}^{\top}(y-X \tilde{\boldsymbol{\beta}}) / n, \quad \tilde{\boldsymbol{d}}_{\tilde{\mathcal{A}}_{k}}=0.$$

    Compute $\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})=\frac{1}{2 n}\|y-X \tilde{\boldsymbol{\beta}}\|_{2}^{2}.$

    If $L>\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})$, then

    $$(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{d}}, \tilde{\mathcal{A}}_{k}, \tilde{\mathcal{I}}_{k}\right),$$

    $$L=\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}}).$$

    End for

  4. If $L_{0}-L<\tau_{s}$, then $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=(\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I}).$

  5. Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})$.
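For completeness, here is one way to write Algorithm 2 in R; again a didactic sketch under the same illustrative conventions as the earlier snippets, not the package's C++ implementation.

```r
# Sketch of Algorithm 2 (Splicing): try exchanges of size k = 1, ..., k_max,
# keep the best one, and revert if the total gain falls below tau_s.
splicing <- function(X, y, beta, d, A, I, k_max, tau_s) {
  n <- nrow(X); p <- ncol(X)
  L0 <- L <- sum((y - X %*% beta)^2) / (2 * n)
  xtx <- colSums(X^2)
  xi <- xtx / (2 * n) * beta^2                    # backward sacrifices
  zeta <- xtx / (2 * n) * (d / (xtx / n))^2       # forward sacrifices
  best <- list(beta = beta, d = d, A = A, I = I)
  for (k in 1:k_max) {
    A_k <- A[rank(xi[A], ties.method = "first") <= k]
    I_k <- I[rank(-zeta[I], ties.method = "first") <= k]
    A_t <- sort(c(setdiff(A, A_k), I_k))          # spliced active set
    I_t <- setdiff(1:p, A_t)
    beta_t <- numeric(p)
    beta_t[A_t] <- solve(crossprod(X[, A_t]), crossprod(X[, A_t], y))
    res <- y - X %*% beta_t
    d_t <- numeric(p)
    d_t[I_t] <- crossprod(X[, I_t], res) / n
    L_t <- sum(res^2) / (2 * n)
    if (L > L_t) {                                # better splice found
      best <- list(beta = beta_t, d = d_t, A = A_t, I = I_t)
      L <- L_t
    }
  }
  if (L0 - L < tau_s) {                           # gain below threshold: keep input
    best <- list(beta = beta, d = d, A = A, I = I)
  }
  best
}
```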

Determining the Best Support Size with SIC

In practice, the support size is usually unknown. We use a data-driven procedure to determine $s$. For any active set $\mathcal{A}$, define the $\operatorname{SIC}$ as follows:
$$\operatorname{SIC}(\mathcal{A})=n \log \mathcal{L}_{\mathcal{A}}+|\mathcal{A}| \log (p) \log \log n,$$
where $\mathcal{L}_{\mathcal{A}}=\min_{\beta_{\mathcal{I}}=0} \mathcal{L}_{n}(\beta)$ and $\mathcal{I}=\mathcal{A}^{c}$. To identify the true model, the model-complexity penalty is $\log p$, and the slowly diverging rate $\log \log n$ is set to prevent underfitting. Theorem 4 states that the following ABESS algorithm selects the true support size via SIC.

Let $s_{\max}$ be the maximum support size. We suggest $s_{\max}=o\left(\frac{n}{\log p}\right)$ as the maximum possible recovery size. Typically, we set $s_{\max}=\left[\frac{n}{\log (p) \log \log n}\right]$, where $[x]$ denotes the integer part of $x$.
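In R, the criterion and the suggested default are straightforward to write down; a sketch with illustrative names follows (the constant offset from the $\frac{1}{2n}$ factor in $\mathcal{L}_{\mathcal{A}}$ does not affect which $s$ minimizes SIC).

```r
# Sketch: SIC for a fitted active set A, and the suggested default s_max.
sic <- function(X, y, A) {
  n <- nrow(X); p <- ncol(X)
  beta <- numeric(p)
  beta[A] <- solve(crossprod(X[, A]), crossprod(X[, A], y))  # restricted OLS
  L_A <- sum((y - X %*% beta)^2) / (2 * n)                   # restricted loss
  n * log(L_A) + length(A) * log(p) * log(log(n))
}
default_s_max <- function(n, p) floor(n / (log(p) * log(log(n))))
```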

Algorithm 3: ABESS.
  1. Input: $X, y$, and the maximum support size $s_{\max}$.

  2. For $s=1,2, \ldots, s_{\max}$, do

    $$\left(\hat{\boldsymbol{\beta}}_{s}, \hat{\boldsymbol{d}}_{s}, \hat{\mathcal{A}}_{s}, \hat{\mathcal{I}}_{s}\right)=\text{BESS.Fix}(s).$$

    End for

  3. Compute the support size minimizing SIC:

    $$s_{\min}=\arg \min_{s} \operatorname{SIC}\left(\hat{\mathcal{A}}_{s}\right).$$

  4. Output $\left(\hat{\boldsymbol{\beta}}_{s_{\min}}, \hat{\boldsymbol{d}}_{s_{\min}}, \hat{\mathcal{A}}_{s_{\min}}, \hat{\mathcal{I}}_{s_{\min}}\right).$
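Combining the helper sketches above, Algorithm 3 reduces to a loop over candidate support sizes (illustrative, assuming `bess_fix()`, `sic()`, and `default_s_max()` as sketched earlier):

```r
# Sketch of Algorithm 3 (ABESS): fit every support size and pick the
# one minimizing SIC.
abess_sketch <- function(X, y) {
  s_seq <- seq_len(default_s_max(nrow(X), ncol(X)))
  fits <- lapply(s_seq, function(s) bess_fix(X, y, s))
  sics <- vapply(fits, function(f) sic(X, y, f$A), numeric(1))
  fits[[which.min(sics)]]                         # fit at s_min
}
```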

The corresponding notations in the abess() function

| Notation in the algorithm | Parameter in abess() | Definition |
|---|---|---|
| $m_{\max}$ | max.splicing.iter | The maximum number of splicing iterations |
| $k_{\max}$ | c.max | The maximum splicing size |
| $1, \ldots, s_{\max}$ | support.size | A sequence of candidate support sizes |
| $\operatorname{SIC}$ | tune.type | The type of criterion for choosing the support size |
| $\{G_1, \ldots, G_J\}$ | group.index | A vector indicating the group to which each variable belongs |
| $\lambda$ | lambda | A value for regularized best subset selection |
| Golden-section search | gs.range | Upper and lower bounds of the search range |
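As a usage illustration, a typical call to the R function might look like the following. The parameter names come from the table above, while the simulated data and the chosen values are only assumptions; consult the abess documentation for the current signature and for how tune.type options map to the SIC described above.

```r
library(abess)

# Simulated data: 3 truly relevant predictors out of 500.
set.seed(1)
n <- 100; p <- 500
x <- matrix(rnorm(n * p), n, p)
y <- x[, 1:3] %*% c(3, -2, 1.5) + rnorm(n)

fit <- abess(x, y,
             support.size = 1:20,       # candidate support sizes s = 1, ..., 20
             c.max = 2,                 # maximum splicing size k_max
             max.splicing.iter = 20,    # maximum number of splicing iterations m_max
             tune.type = "gic")         # information criterion used to select s
```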