Generate simulated data under the generalized linear model and the Cox proportional hazards model.

generate.data(
  n,
  p,
  support.size = NULL,
  rho = 0,
  family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial",
    "gamma", "ordinal"),
  beta = NULL,
  cortype = 1,
  snr = 10,
  sigma = NULL,
  weibull.shape = 1,
  uniform.max = 1,
  y.dim = 3,
  class.num = 3,
  seed = 1
)

Arguments

n

The number of observations.

p

The number of predictors of interest.

support.size

The number of nonzero coefficients in the underlying regression model. Can be omitted if beta is supplied.

rho

A parameter used to characterize the pairwise correlation in predictors. Default is 0.

family

The distribution of the simulated response. "gaussian" for a univariate quantitative response, "binomial" for a binary classification response, "poisson" for a count response, "cox" for a right-censored response, "mgaussian" for a multivariate quantitative response, "multinomial" for a multi-class classification response, "gamma" for a positive continuous response, and "ordinal" for an ordinal response.

beta

The coefficient values in the underlying regression model. If it is supplied, support.size can be omitted.

cortype

The correlation structure. cortype = 1 denotes the independence structure, where the \((i,j)\) entry of the covariance matrix equals \(I(i = j)\). cortype = 2 denotes the exponential structure, where the \((i,j)\) entry of the covariance matrix equals \(rho^{|i-j|}\). cortype = 3 denotes the constant structure, where the off-diagonal entries of the covariance matrix are \(rho\) and the diagonal entries are 1.

snr

A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined as the variance of \(x\beta\) divided by the variance of a Gaussian noise: \(\frac{Var(x\beta)}{\sigma^2}\). The Gaussian noise \(\epsilon\) has mean 0 and variance \(\sigma^2\). The noise is added to the linear predictor \(\eta\) = \(x\beta\). Default is snr = 10. Note that this argument's effect is overridden if sigma is supplied with a non-null value.

sigma

The variance of the Gaussian noise. The default, sigma = NULL, means it is determined by snr.

weibull.shape

The shape parameter of the Weibull distribution. It works only when family = "cox". Default: weibull.shape = 1.

uniform.max

A parameter controlling the censoring rate. A large value implies a low censoring rate; a small value implies a high censoring rate. It works only when family = "cox". Default is uniform.max = 1.

y.dim

The dimension of the response. It works only when family = "mgaussian". Default: y.dim = 3.

class.num

The number of classes. It works only when family = "multinomial". Default: class.num = 3.

seed

The random seed. Default: seed = 1.
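The three correlation structures selected by cortype can be sketched in base R as follows. This is only an illustration: make_sigma is a hypothetical helper, not a function of the package, and it assumes that the independence structure corresponds to the identity covariance matrix.

```r
# Hypothetical helper (not part of the package): builds the p x p covariance
# matrix that each cortype value describes.
make_sigma <- function(p, rho = 0, cortype = 1) {
  idx <- seq_len(p)
  if (cortype == 1) {
    diag(p)                                   # independence: identity matrix
  } else if (cortype == 2) {
    rho^abs(outer(idx, idx, "-"))             # exponential: rho^|i - j|
  } else if (cortype == 3) {
    Sigma <- matrix(rho, nrow = p, ncol = p)  # constant: rho off the diagonal
    diag(Sigma) <- 1                          # unit variances on the diagonal
    Sigma
  } else {
    stop("cortype must be 1, 2, or 3")
  }
}

# Exponential structure for 4 predictors with rho = 0.5
make_sigma(4, rho = 0.5, cortype = 2)
```

With cortype = 2 the correlation decays with the distance between predictor indices, whereas cortype = 3 gives every pair of predictors the same correlation.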

Value

A list object comprising:

x

Design matrix of predictors.

y

Response variable.

beta

The coefficients used in the underlying regression model.

Details

For family = "gaussian", the data model is $$Y = X \beta + \epsilon.$$ The underlying regression coefficient \(\beta\) is drawn from the uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)

For family = "binomial", the data model is $$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$ The underlying regression coefficient \(\beta\) is drawn from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)

For family = "poisson", the response is modeled to have a Poisson distribution: $$Y \sim Poisson(\exp(X \beta + \epsilon)).$$ The underlying regression coefficient \(\beta\) is drawn from the uniform distribution on [2m, 10m], where \(m = \sqrt{2\log(p)/n}/3.\)

For family = "gamma", the response is modeled to have a gamma distribution: $$Y = Gamma(X \beta + \epsilon + 10, shape),$$ where \(shape\) is the shape parameter of the gamma distribution. The underlying regression coefficient \(\beta\) is drawn from the uniform distribution on [2m, 100m], where \(m = \sqrt{2\log(p)/n}.\)

For family = "ordinal", the data is modeled to have an ordinal distribution.

For family = "cox", the model for the failure time \(T\) is $$T = (-\log(U) / \exp(X \beta))^{1/weibull.shape},$$ where \(U\) is a uniform random variable on [0, 1]. The censoring time \(C\) is generated from the uniform distribution on \([0, uniform.max]\). The censoring status is then defined as \(\delta = I(T \le C)\) and the observed time as \(R = \min\{T, C\}\). The underlying regression coefficient \(\beta\) is drawn from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}\).
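A minimal base R sketch of this censoring scheme is given below. It is not the package's internal code. It assumes the standard inverse-transform form for Weibull failure times, in which \(-\log(U)\) is divided by \(\exp(X \beta)\) before the power \(1/weibull.shape\) is taken. The coefficients are chosen by hand purely for illustration.

```r
set.seed(1)
n <- 500
p <- 5
weibull.shape <- 1   # mirrors the argument of the same name
uniform.max <- 1     # mirrors the argument of the same name

x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0, 0, 0)          # illustrative coefficients (assumption)
eta <- drop(x %*% beta)            # linear predictor X beta

# Failure time by inverse transform (assumed form, see the lead-in)
u <- runif(n)
failure_time <- (-log(u) / exp(eta))^(1 / weibull.shape)

# Censoring time from Uniform(0, uniform.max)
censor_time <- runif(n, min = 0, max = uniform.max)

status <- as.integer(failure_time <= censor_time)  # 1 = failure observed
observed_time <- pmin(failure_time, censor_time)   # R = min(T, C)

mean(status == 0)   # empirical censoring rate
```

Re-running the sketch with a larger uniform.max makes censoring times longer on average, so fewer failure times are censored.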

For family = "mgaussian", the data model is $$Y = X \beta + E.$$ The non-zero values of the regression matrix \(\beta\) are sampled from the uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)

For family = "multinomial", the data model is $$Prob(Y = 1) = \exp(X \beta + E)/(1 + \exp(X \beta + E)).$$ The non-zero values of the regression coefficient \(\beta\) are sampled from the uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)

In the above models, \(\epsilon \sim N(0, \sigma^2 )\) and \(E \sim MVN(0, \sigma^2 \times I_{q \times q})\), where \(\sigma^2\) is determined by snr (unless sigma is supplied) and \(q\) is y.dim.
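The way snr fixes the noise variance can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's internal code: Var(x\beta) is estimated here from the sample, whereas the package may compute it differently, and the coefficients are chosen by hand.

```r
set.seed(1)
n <- 10000
p <- 10
snr <- 10

x <- matrix(rnorm(n * p), n, p)
beta <- c(3, -2, 1.5, rep(0, p - 3))   # illustrative coefficients (assumption)
eta <- drop(x %*% beta)                # linear predictor x beta

# Solve snr = Var(x beta) / sigma^2 for the noise variance
sigma2 <- var(eta) / snr

epsilon <- rnorm(n, mean = 0, sd = sqrt(sigma2))
y <- eta + epsilon                     # gaussian family: Y = X beta + epsilon

var(eta) / var(epsilon)                # close to snr, up to sampling error
```

A larger snr shrinks the noise variance relative to the signal, which makes the nonzero coefficients easier to recover.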

Author

Jin Zhu

Examples


# Generate simulated data
n <- 200
p <- 20
support.size <- 5
dataset <- generate.data(n, p, support.size)
str(dataset)
#> List of 3
#>  $ x   : num [1:200, 1:20] -0.626 0.184 -0.836 1.595 0.33 ...
#>   ..- attr(*, "dimnames")=List of 2
#>   .. ..$ : NULL
#>   .. ..$ : chr [1:20] "x1" "x2" "x3" "x4" ...
#>  $ y   : num [1:200, 1] -170.92 68.42 2.05 -7.15 96.92 ...
#>  $ beta: num [1:20] 0 0 0 0 0 ...