Introduction

The ABESS algorithm, which employs a “splicing” technique, can exactly solve the general best subset selection problem in polynomial time. The aim of this page is to provide complete and coherent documentation for the ABESS algorithm, so that users can easily understand the algorithm and its variants, thereby facilitating the use of the abess software. We detail the splicing approach for the linear model below; its variants can be derived in similar ways.

Linear model

Sacrifices

Consider the $\ell_{0}$-constrained minimization problem
$$\min_{\boldsymbol{\beta}} \mathcal{L}_{n}(\boldsymbol{\beta}), \quad \text{s.t. } \|\boldsymbol{\beta}\|_{0} \leq s,$$
where $\mathcal{L}_{n}(\boldsymbol{\beta})=\frac{1}{2n}\|y-X \boldsymbol{\beta}\|_{2}^{2}$. Without loss of generality, we consider $\|\boldsymbol{\beta}\|_{0}=s$. Given any initial set $\mathcal{A} \subset \mathcal{S}=\{1,2, \ldots, p\}$ with cardinality $|\mathcal{A}|=s$, denote $\mathcal{I}=\mathcal{A}^{c}$ and compute
$$\hat{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol{\beta}).$$
We call $\mathcal{A}$ and $\mathcal{I}$ the active set and the inactive set, respectively.
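To make the restricted fit concrete, here is a minimal R sketch (the helper name `fit_active` is ours, not part of the abess package) that computes $\hat{\boldsymbol{\beta}}$ with all coefficients outside a given active set fixed at zero:

```r
# Minimal sketch: least squares restricted to an active set A.
# X is the n x p design matrix, y the response, A an integer vector of indices.
fit_active <- function(X, y, A) {
  beta <- rep(0, ncol(X))
  XA <- X[, A, drop = FALSE]
  beta[A] <- solve(crossprod(XA), crossprod(XA, y))  # (X_A^T X_A)^{-1} X_A^T y
  beta
}
```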

Given the active set $\mathcal{A}$ and $\hat{\boldsymbol{\beta}}$, we can define the following two types of sacrifices:

  1. Backward sacrifice: for any $j \in \mathcal{A}$, the magnitude of discarding variable $j$ is
$$\xi_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A} \setminus \{j\}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\hat{\beta}_{j}\right)^{2}.$$
  2. Forward sacrifice: for any $j \in \mathcal{I}$, the magnitude of adding variable $j$ is
$$\zeta_{j}=\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}\right)-\mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+\hat{t}^{\{j\}}\right)=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{\hat{d}_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2},$$
where $\hat{t}=\arg\min_{t} \mathcal{L}_{n}\left(\hat{\boldsymbol{\beta}}^{\mathcal{A}}+t^{\{j\}}\right)$ and $\hat{d}_{j}=X_{j}^{\top}(y-X \hat{\boldsymbol{\beta}})/n$. Intuitively, for $j \in \mathcal{A}$ (or $j \in \mathcal{I}$), a large $\xi_{j}$ (or $\zeta_{j}$) implies that the $j$th variable is potentially important.
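As an illustration, both sacrifices can be computed for all variables at once. The sketch below (the helper name `sacrifices` is ours) follows the two formulas above, given the restricted estimate from `fit_active()`:

```r
# Sketch: backward sacrifices xi_j (meaningful for j in A) and forward
# sacrifices zeta_j (meaningful for j in I), per the formulas above.
sacrifices <- function(X, y, beta) {
  n <- nrow(X)
  d <- as.vector(crossprod(X, y - X %*% beta)) / n  # d_j = X_j^T (y - X beta) / n
  xtx <- colSums(X^2)                               # X_j^T X_j for each j
  xi <- xtx / (2 * n) * beta^2                      # backward sacrifice
  zeta <- xtx / (2 * n) * (d / (xtx / n))^2         # forward sacrifice
  list(xi = xi, zeta = zeta)
}
```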

Algorithm

Best-Subset Selection with a Given Support Size

These two sacrifices are incomparable because they correspond to support sets of different sizes. However, if we exchange some “irrelevant” variables in $\mathcal{A}$ for some “important” variables in $\mathcal{I}$, we may obtain a higher-quality solution. This intuition motivates our splicing method. Specifically, given any splicing size $k \leq s$, define

$$\mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\}$$
to represent the $k$ least relevant variables in $\mathcal{A}$, and
$$\mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\}$$
to represent the $k$ most relevant variables in $\mathcal{I}$. Then we splice $\mathcal{A}$ and $\mathcal{I}$ by exchanging $\mathcal{A}_{k}$ and $\mathcal{I}_{k}$, obtaining a new active set
$$\tilde{\mathcal{A}}=\left(\mathcal{A} \setminus \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}.$$
Let $\tilde{\mathcal{I}}=\tilde{\mathcal{A}}^{c}$, $\tilde{\boldsymbol{\beta}}=\arg\min_{\boldsymbol{\beta}_{\tilde{\mathcal{I}}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})$, and let $\tau_{s}>0$ be a threshold. If $\tau_{s}<\mathcal{L}_{n}(\hat{\boldsymbol{\beta}})-\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})$, then $\tilde{\mathcal{A}}$ is preferable to $\mathcal{A}$. The active set is updated iteratively until the loss function can no longer be improved by splicing. Once the algorithm recovers the true active set, splicing some irrelevant variables may still decrease the loss slightly; the threshold $\tau_{s}$ eliminates this unnecessary computation. Typically, $\tau_{s}$ is relatively small, e.g., $\tau_{s}=0.01 s \log (p) \log (\log n) / n$.
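In code, selecting $\mathcal{A}_{k}$ and $\mathcal{I}_{k}$ amounts to ranking the sacrifices. A small R sketch (the helper name `select_splice` is ours; ties are ignored):

```r
# Sketch: pick the k least relevant active variables and the k most relevant
# inactive variables (xi and zeta as computed by sacrifices()), then splice.
select_splice <- function(A, I, xi, zeta, k) {
  A_k <- A[order(xi[A])][seq_len(k)]     # k smallest backward sacrifices in A
  I_k <- I[order(-zeta[I])][seq_len(k)]  # k largest forward sacrifices in I
  union(setdiff(A, A_k), I_k)            # spliced active set A-tilde
}
```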

Algorithm 1: BESS.Fix(s): Best-Subset Selection with a given support size s.
  1. Input: $X$, $y$, positive integers $k_{\max}$ and $m_{\max}$, and a threshold $\tau_{s}$.
  2. Initialize $\mathcal{A}^{0}=\left\{j: \sum_{i=1}^{p} \mathrm{I}\left(\left|\frac{X_{j}^{\top} y}{\sqrt{X_{j}^{\top} X_{j}}}\right| \leq \left|\frac{X_{i}^{\top} y}{\sqrt{X_{i}^{\top} X_{i}}}\right|\right) \leq s\right\}$, $\mathcal{I}^{0}=\left(\mathcal{A}^{0}\right)^{c}$, and compute $\left(\boldsymbol{\beta}^{0}, d^{0}\right)$ by:

\begin{align*}
&\boldsymbol{\beta}_{\mathcal{I}^{0}}^{0}=0,\\
&d_{\mathcal{A}^{0}}^{0}=0,\\
&\boldsymbol{\beta}_{\mathcal{A}^{0}}^{0}=\left(\boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{X}_{\mathcal{A}^{0}}\right)^{-1} \boldsymbol{X}_{\mathcal{A}^{0}}^{\top} \boldsymbol{y},\\
&d_{\mathcal{I}^{0}}^{0}=X_{\mathcal{I}^{0}}^{\top}\left(\boldsymbol{y}-\boldsymbol{X} \boldsymbol{\beta}^{0}\right)/n.
\end{align*}

  3. For $m=0,1, \ldots, m_{\max}$, do

    $$\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)=\text{Splicing}\left(\boldsymbol{\beta}^{m}, \boldsymbol{d}^{m}, \mathcal{A}^{m}, \mathcal{I}^{m}, k_{\max}, \tau_{s}\right).$$

    If $\left(\mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right)=\left(\mathcal{A}^{m}, \mathcal{I}^{m}\right)$, then stop.

    End For

  4. Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\boldsymbol{\beta}^{m+1}, \boldsymbol{d}^{m+1}, \mathcal{A}^{m+1}, \mathcal{I}^{m+1}\right).$
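A compact R sketch of Algorithm 1 (helper names are ours; `splicing()` is sketched after Algorithm 2 below):

```r
# Sketch of BESS.Fix(s): correlation-screening initialization as in step 2,
# then iterate the splicing step until the active set stabilizes.
bess_fix <- function(X, y, s, k_max, m_max, tau_s) {
  score <- abs(as.vector(crossprod(X, y))) / sqrt(colSums(X^2))  # |X_j^T y| / sqrt(X_j^T X_j)
  A <- order(-score)[seq_len(s)]                                 # s largest marginal scores
  for (m in seq_len(m_max)) {
    A_new <- splicing(X, y, A, k_max, tau_s)
    if (setequal(A_new, A)) break                                # (A, I) unchanged: stop
    A <- A_new
  }
  list(A = sort(A), beta = fit_active(X, y, A))
}
```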

Algorithm 2: Splicing $\left(\boldsymbol{\beta}, d, \mathcal{A}, \mathcal{I}, k_{\max}, \tau_{s}\right)$
  1. Input: $\boldsymbol{\beta}$, $\boldsymbol{d}$, $\mathcal{A}$, $\mathcal{I}$, $k_{\max}$, and $\tau_{s}$.

  2. Initialize $L_{0}=L=\frac{1}{2 n}\|y-X \boldsymbol{\beta}\|_{2}^{2}$, and set
$$\xi_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\beta_{j}\right)^{2}, \quad \zeta_{j}=\frac{X_{j}^{\top} X_{j}}{2 n}\left(\frac{d_{j}}{X_{j}^{\top} X_{j} / n}\right)^{2}, \quad j=1, \ldots, p.$$

  3. For $k=1,2, \ldots, k_{\max}$, do

    $$\mathcal{A}_{k}=\left\{j \in \mathcal{A}: \sum_{i \in \mathcal{A}} \mathrm{I}\left(\xi_{j} \geq \xi_{i}\right) \leq k\right\}$$

    $$\mathcal{I}_{k}=\left\{j \in \mathcal{I}: \sum_{i \in \mathcal{I}} \mathrm{I}\left(\zeta_{j} \leq \zeta_{i}\right) \leq k\right\}$$

    Let $\tilde{\mathcal{A}}_{k}=\left(\mathcal{A} \setminus \mathcal{A}_{k}\right) \cup \mathcal{I}_{k}$, $\tilde{\mathcal{I}}_{k}=\left(\mathcal{I} \setminus \mathcal{I}_{k}\right) \cup \mathcal{A}_{k}$, and solve

    $$\tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{A}}_{k}}=\left(\boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}^{\top} \boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}\right)^{-1} \boldsymbol{X}_{\tilde{\mathcal{A}}_{k}}^{\top} y, \quad \tilde{\boldsymbol{\beta}}_{\tilde{\mathcal{I}}_{k}}=0,$$

    $$\tilde{\boldsymbol{d}}_{\tilde{\mathcal{I}}_{k}}=X_{\tilde{\mathcal{I}}_{k}}^{\top}(y-X \tilde{\boldsymbol{\beta}}) / n, \quad \tilde{\boldsymbol{d}}_{\tilde{\mathcal{A}}_{k}}=0.$$

    Compute $\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})=\frac{1}{2 n}\|y-X \tilde{\boldsymbol{\beta}}\|_{2}^{2}$.

    If $L>\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}})$, then

    $$(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=\left(\tilde{\boldsymbol{\beta}}, \tilde{\boldsymbol{d}}, \tilde{\mathcal{A}}_{k}, \tilde{\mathcal{I}}_{k}\right),$$

    $$L=\mathcal{L}_{n}(\tilde{\boldsymbol{\beta}}).$$

    End for

  4. If $L_{0}-L<\tau_{s}$, then $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})=(\boldsymbol{\beta}, \boldsymbol{d}, \mathcal{A}, \mathcal{I}).$

  5. Output $(\hat{\boldsymbol{\beta}}, \hat{\boldsymbol{d}}, \hat{\mathcal{A}}, \hat{\mathcal{I}})$.
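Putting the pieces together, here is a minimal R sketch of Algorithm 2 (function and variable names are ours) that tracks only the active set and the loss, reusing `fit_active()`, `sacrifices()`, and `select_splice()` from above:

```r
# Sketch of Splicing: try k = 1, ..., k_max exchanges, keep the best one,
# and revert to the input active set if the total gain is below tau_s.
splicing <- function(X, y, A, k_max, tau_s) {
  n <- nrow(X)
  I <- setdiff(seq_len(ncol(X)), A)
  loss <- function(b) sum((y - X %*% b)^2) / (2 * n)  # L_n(beta)
  beta <- fit_active(X, y, A)
  sac <- sacrifices(X, y, beta)
  L0 <- L <- loss(beta)
  A_best <- A
  for (k in seq_len(min(k_max, length(A), length(I)))) {
    A_tilde <- select_splice(A, I, sac$xi, sac$zeta, k)
    L_tilde <- loss(fit_active(X, y, A_tilde))
    if (L > L_tilde) { A_best <- A_tilde; L <- L_tilde }  # accept better splice
  }
  if (L0 - L < tau_s) A else A_best  # threshold check of step 4
}
```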

Determining the Best Support Size with SIC

In practice, the support size is usually unknown, so we use a data-driven procedure to determine $s$. For any active set $\mathcal{A}$, define the SIC as
$$\operatorname{SIC}(\mathcal{A})=n \log \mathcal{L}_{\mathcal{A}}+|\mathcal{A}| \log (p) \log \log n,$$
where $\mathcal{L}_{\mathcal{A}}=\min_{\boldsymbol{\beta}_{\mathcal{I}}=0} \mathcal{L}_{n}(\boldsymbol{\beta})$ and $\mathcal{I}=\mathcal{A}^{c}$. To identify the true model, the model complexity penalty is $\log p$, and the slowly diverging rate $\log \log n$ is set to prevent underfitting. Theorem 4 states that the following ABESS algorithm selects the true support size via SIC.

Let $s_{\max}$ be the maximum support size. We suggest $s_{\max}=o\left(\frac{n}{\log p}\right)$ as the maximum possible recovery size. Typically, we set $s_{\max}=\left[\frac{n}{\log (p) \log \log n}\right]$, where $[x]$ denotes the integer part of $x$.
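The two displays above translate directly into code; a small R sketch (helper names ours, reusing `fit_active()`):

```r
# Sketch: SIC of an active set A, and the suggested default s_max.
sic <- function(X, y, A) {
  n <- nrow(X); p <- ncol(X)
  L_A <- sum((y - X %*% fit_active(X, y, A))^2) / (2 * n)  # restricted loss
  n * log(L_A) + length(A) * log(p) * log(log(n))
}
s_max_default <- function(n, p) floor(n / (log(p) * log(log(n))))
```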

Algorithm 3: ABESS.
  1. Input: $X$, $y$, and the maximum support size $s_{\max}$.

  2. For $s=1,2, \ldots, s_{\max}$, do

    $$\left(\hat{\boldsymbol{\beta}}_{s}, \hat{\boldsymbol{d}}_{s}, \hat{\mathcal{A}}_{s}, \hat{\mathcal{I}}_{s}\right)=\text{BESS.Fix}(s).$$

    End for

  3. Compute the support size that minimizes the SIC:

    $$s_{\min}=\arg\min_{s} \operatorname{SIC}\left(\hat{\mathcal{A}}_{s}\right).$$

  4. Output $\left(\hat{\boldsymbol{\beta}}_{s_{\min}}, \hat{\boldsymbol{d}}_{s_{\min}}, \hat{\mathcal{A}}_{s_{\min}}, \hat{\mathcal{I}}_{s_{\min}}\right)$.
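Combining the sketches above gives a compact, illustrative version of Algorithm 3 (this is a sketch, not the package implementation; `abess_sketch` is our own name):

```r
# Sketch of ABESS: run BESS.Fix(s) over s = 1, ..., s_max and return the fit
# whose active set minimizes SIC; tau_s follows the suggestion given earlier.
abess_sketch <- function(X, y, s_max, k_max = 2, m_max = 20) {
  n <- nrow(X); p <- ncol(X)
  fits <- lapply(seq_len(s_max), function(s) {
    tau_s <- 0.01 * s * log(p) * log(log(n)) / n  # suggested threshold
    bess_fix(X, y, s, k_max, m_max, tau_s)
  })
  sics <- vapply(fits, function(f) sic(X, y, f$A), numeric(1))
  fits[[which.min(sics)]]                          # fit at s minimizing SIC
}
```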

The corresponding notations in the abess() function

| Notation in the algorithm | Parameter in `abess` | Definition |
| --- | --- | --- |
| $m_{\max}$ | `max.splicing.iter` | The maximum number of iterations of the splicing algorithm |
| $k_{\max}$ | `c.max` | The maximum splicing size |
| $1, \ldots, s_{\max}$ | `support.size` | A sequence of candidate support sizes |
| $\operatorname{SIC}$ | `tune.type` | The type of criterion for choosing the support size |
| $\{G_1, \ldots, G_J\}$ | `group.index` | A vector indicating the group each variable belongs to |
| $\lambda$ | `lambda` | A value for regularized best subset selection |
| Golden-section searching | `gs.range` | Upper and lower bounds of the search range |
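For reference, a short usage example mapping the table onto an `abess()` call; the simulated data and the specific argument values are illustrative:

```r
library(abess)

# Toy data: only the first three variables are truly active.
set.seed(1)
n <- 100; p <- 20
x <- matrix(rnorm(n * p), n, p)
y <- drop(x[, 1:3] %*% c(3, -2, 1.5)) + rnorm(n)

fit <- abess(x, y,
             support.size = 1:10,     # candidate support sizes s
             c.max = 2,               # k_max, the maximum splicing size
             max.splicing.iter = 20,  # m_max, the maximum number of splicing iterations
             tune.type = "gic")       # criterion used to choose the support size
```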