This vignette introduces adaptive best-subset-selection robust principal component analysis and shows how it works on an artificial example using the abess package.

PCA

Principal component analysis (PCA) is an important method in data science: it reduces the dimension of the data and thereby simplifies later models. It solves the optimization problem:

$$\max_v \; v^T \Sigma v, \quad s.t.\ v^Tv=1,$$ where $\Sigma = X^TX/(n-1)$ and $X\in \mathbb{R}^{n\times p}$ is the centered sample matrix with each row containing one observation of $p$ variables.
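
As a quick illustration, the first principal component can be computed directly from the eigen decomposition of $\Sigma$. The sketch below uses base R only and variable names of our own choosing; prcomp would give the same direction.

set.seed(1)
X <- matrix(rnorm(50 * 5), 50, 5)
X <- scale(X, center = TRUE, scale = FALSE)  # center each column
Sigma <- crossprod(X) / (nrow(X) - 1)        # Sigma = X^T X / (n - 1)
v1 <- eigen(Sigma)$vectors[, 1]              # maximizes v^T Sigma v subject to v^T v = 1
head(X %*% v1)                               # scores on the first principal component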

Robust-PCA (RPCA)

However, the original PCA is sensitive to outliers, which may be unavoidable in real data:

  • A subject shows extreme values by chance, but behaves normally in repeated measurements;

  • Errors in observation, recording, or computation, e.g. missing or dead pixels, X-ray spikes.

In these situations, PCA may devote too much of its attention to these abnormal entries. That is why Robust PCA (RPCA) was proposed: it recovers the (low-rank) sample matrix for subsequent processing.

Mathematically, RPCA divides the sample matrix $X$ into two parts: $$X = S + L,$$ where $S$ is the sparse "outlier" matrix and $L$ is the "information" matrix with a low rank. To guarantee a unique solution, we additionally suppose that $S$ is not low-rank and $L$ is not sparse.

Formally, RPCA solves $$\min_{S, L} \|X-S-L\|_{F}, \quad s.t.\ \operatorname{rank}(L)=r,\ \|S\|_{0} \leq s,$$ where $s$ is the sparsity level (the number of non-zero entries) of $S$. After RPCA, the information matrix $L$ can be used in further analysis.
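
To make the decomposition concrete, here is a tiny simulated example (all names below are ours, not part of abess) that builds a rank-2 matrix plus a few sparse spikes:

set.seed(1)
n0 <- 20; p0 <- 20
L0 <- matrix(rnorm(n0 * 2), n0, 2) %*% matrix(rnorm(2 * p0), 2, p0)  # low-rank part, rank 2
S0 <- matrix(0, n0, p0)
S0[sample(n0 * p0, 10)] <- rnorm(10, sd = 5)                         # 10 sparse "outlier" entries
X0 <- L0 + S0                                                        # observed matrix
c(rank = qr(L0)$rank, outliers = sum(S0 != 0))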

Note that RPCA does NOT deal with "noise", which may remain in $L$ and require further processing.

Hard Impute

To solve a sub-problem of RPCA, namely RPCA with known outlier positions, we follow a procedure called "Hard Impute". The main idea is to re-estimate the outlier values from a rank-$K$ PCA approximation, where $K = r$.

Here are the steps:

  1. Input $X$, outliers, $M$, $\varepsilon$, where outliers records the non-zero positions in $S$, $M$ is the maximum number of iterations, and $\varepsilon$ is the convergence tolerance;
  2. Initialize $X_{new} \leftarrow \mathbf{0}$ with the same shape as $X$;
  3. For $i = 1, 2, \ldots, M$:
  • $X_{old} = \begin{cases} X_{new}, & \text{for outlier positions} \\ X, & \text{otherwise} \end{cases}$;
  • Perform rank-$K$ PCA on $X_{old}$ with $K = r$, and let $v$ denote the matrix of leading eigenvectors;
  • $X_{new} = X_{old} \cdot v \cdot v^{T}$;
  • If $\|X_{new} - X_{old}\| < \varepsilon$, break;
  4. Return $X_{new}$ as $L$.

The final $X_{new}$ is the estimate of $L$ under the given outlier positions; a minimal R sketch of this procedure is given below.
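
Here is a minimal R sketch of the loop above, assuming the outlier positions are supplied as a logical matrix of the same shape as $X$ (the function name and interface are ours, not part of abess):

hard.impute <- function(X, outliers, r, M = 100, eps = 1e-6) {
  X.new <- matrix(0, nrow(X), ncol(X))          # step 2: start from the zero matrix
  for (i in seq_len(M)) {                       # step 3: at most M iterations
    X.old <- X
    X.old[outliers] <- X.new[outliers]          # outlier entries take the current estimate
    v <- svd(X.old, nu = 0, nv = r)$v           # top-r right singular vectors (rank-r PCA)
    X.new <- X.old %*% v %*% t(v)               # project X.old onto the rank-r subspace
    if (norm(X.new - X.old, "F") < eps) break   # converged
  }
  X.new                                         # estimate of the low-rank part L
}

Given the outlier positions (e.g. S != 0 for a known sparse matrix S), the function returns an estimate of the low-rank part.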

RPCA Application

Recently, RPCA has been applied more and more widely, for example:

  • Video decomposition: in a surveillance video, the background may remain unchanged for a long time while only a few pixels (e.g. people) change. To store and analyze the video efficiently, we decompose it into background and foreground. Since the background is unchanged, it can be represented by a low-rank matrix, while the foreground, which usually occupies only a small area, can be captured by a sparse matrix. That is exactly what RPCA does.

  • Face recognition: due to complex lighting conditions, a small part of the facial features may be obscured (e.g. by shadows). In face recognition we want to remove the effect of the shadows and focus on the face itself. Since the face data is almost unchanged (for one person) and the shadows affect only a small part of it, this is again a suitable setting for RPCA.

Simulated Data Example

Fitting model

Now we generate an example with 100 rows and 100 columns containing 200 outliers, and we would like to recover it with a low rank of 10.

library(abess)
## 
##  Thank you for using abess! To acknowledge our work, please cite the package:
## 
##  Zhu J, Wang X, Hu L, Huang J, Jiang K, Zhang Y, Lin S, Zhu J (2022). 'abess: A Fast Best Subset Selection Library in Python and R.' Journal of Machine Learning Research, 23(202), 1-7. https://www.jmlr.org/papers/v23/21-1060.html.
n <- 100     # rows
p <- 100     # columns
s <- 200     # outliers
r <- 10      # rank(L)

dat <- generate.matrix(n, p, r, s)
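
A quick sanity check of the simulated data (the components dat$x, dat$S and dat$rank are the ones used later in this vignette):

dim(dat$x)       # dimension of the observed matrix: 100 x 100
sum(dat$S != 0)  # number of non-zero entries in the true outlier matrix
dat$rank         # rank of the underlying low-rank part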

To use our program, call abessrpca with either a given number of outliers support.size or an integer interval of candidate sizes. In the latter case, a support size is chosen adaptively by an information criterion (e.g. GIC). abessrpca returns an abessrpca object.

res.srpca <- abessrpca(dat$x, dat$rank, support.size = s)
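
If the number of outliers is unknown, a vector of candidate sizes can be supplied instead and the final size is selected by the information criterion mentioned above; the candidate grid below is an arbitrary choice of ours:

res.adapt <- abessrpca(dat$x, dat$rank, support.size = seq(100, 300, by = 20))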

To extract the estimated coefficients from a fitted abessrpca object, we use the coef function.

coef(res.srpca)

More on the result

To check the performance of the program, we use the true positive rate (TPR) and the false positive rate (FPR) of outlier detection as criteria.

test.model <- function(pred, real)
{
  # proportion of true outliers that are detected (TPR)
  tpr <- sum(pred != 0 & real != 0) / sum(real != 0)
  # proportion of non-outlier entries wrongly flagged (FPR)
  fpr <- sum(pred != 0 & real == 0) / sum(real == 0)
  return(c(tpr = tpr, fpr = fpr))
}
test.model(res.srpca$S[[1]], dat$S)
##         tpr         fpr 
## 0.835000000 0.003367347

We can also use different random seeds to test more situations:

M <- 30
res.test <- lapply(1:M, function(i)
{
  dat <- generate.matrix(n, p, r, s, seed = i)
  res.srpca <- abessrpca(dat$x, dat$rank, support.size = s)
  return(test.model(res.srpca$S[[1]], dat$S))
})
rowMeans(simplify2array(res.test))  
##         tpr         fpr 
## 0.839000000 0.003285714

In all of these situations, abessrpca performs well.