Generate simulated data under a generalized linear model or a Cox proportional hazards model.
generate.data(
n,
p,
support.size = NULL,
rho = 0,
family = c("gaussian", "binomial", "poisson", "cox", "mgaussian", "multinomial",
"gamma", "ordinal"),
beta = NULL,
cortype = 1,
snr = 10,
sigma = NULL,
weibull.shape = 1,
uniform.max = 1,
y.dim = 3,
class.num = 3,
seed = 1
)
The number of observations.
The number of predictors of interest.
The number of nonzero coefficients in the underlying regression
model. Can be omitted if beta is supplied.
A parameter characterizing the pairwise correlation between
predictors. Default is 0.
The distribution of the simulated response: "gaussian" for a
univariate quantitative response, "binomial" for a binary classification response,
"poisson" for a count response, "cox" for a right-censored response,
"mgaussian" for a multivariate quantitative response,
"multinomial" for a multi-class classification response,
"gamma" for a gamma-distributed response, and
"ordinal" for an ordinal response.
The coefficient values in the underlying regression model.
If it is supplied, support.size is ignored.
The correlation structure.
cortype = 1 denotes the independence structure,
where the covariance matrix has \((i,j)\) entry equal to \(I(i = j)\).
cortype = 2 denotes the exponential structure,
where the covariance matrix has \((i,j)\) entry equal to \(rho^{|i-j|}\).
cortype = 3 denotes the constant structure,
where the off-diagonal entries of the covariance
matrix are \(rho\) and the diagonal entries are 1.
A numerical value controlling the signal-to-noise ratio (SNR). The SNR is defined
as the variance of \(x\beta\) divided
by the variance of a Gaussian noise: \(\frac{Var(x\beta)}{\sigma^2}\).
The Gaussian noise \(\epsilon\) has mean 0 and variance \(\sigma^2\).
The noise is added to the linear predictor \(\eta = x\beta\). Default is snr = 10.
Note that this argument has no effect if sigma
is supplied with a non-null value.
The variance of the Gaussian noise. Default sigma = NULL
implies it is determined by snr.
The shape parameter of the Weibull distribution.
It is used only when family = "cox".
Default: weibull.shape = 1.
A parameter controlling the censoring rate.
A large value implies a low censoring rate;
a small value implies a high censoring rate.
It is used only when family = "cox".
Default: uniform.max = 1.
The dimension of the response. It is used only when family = "mgaussian".
Default: y.dim = 3.
The number of classes. It is used only when family = "multinomial".
Default: class.num = 3.
Random seed. Default: seed = 1.
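The three correlation structures selected by cortype can be sketched in base R as follows. This is only an illustration of the covariance matrices described above; the helper name make_sigma is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of the package): build the p x p covariance
# matrix for each correlation structure.
make_sigma <- function(p, rho = 0, cortype = 1) {
  if (cortype == 1) {
    # Independence: identity matrix, entry (i, j) is I(i == j).
    diag(p)
  } else if (cortype == 2) {
    # Exponential: entry (i, j) is rho^|i - j|.
    rho^abs(outer(seq_len(p), seq_len(p), "-"))
  } else {
    # Constant: rho off the diagonal, 1 on the diagonal.
    Sigma <- matrix(rho, p, p)
    diag(Sigma) <- 1
    Sigma
  }
}

make_sigma(3, rho = 0.5, cortype = 2)
```

Note that the independence structure does not depend on rho, and the exponential structure with rho = 0 reduces to the identity matrix as well.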
A list object comprising:
Design matrix of predictors.
Response variable.
The coefficients used in the underlying regression model.
For family = "gaussian", the data model is
$$Y = X \beta + \epsilon.$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the
uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)
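As a rough base-R sketch of the Gaussian model, assuming independent standard normal predictors, a randomly chosen support, and an illustrative noise standard deviation (none of these choices is claimed to match the package internals):

```r
set.seed(1)
n <- 200
p <- 20
support.size <- 5

# Scale m of the coefficient magnitudes, as in the formula above.
m <- 5 * sqrt(2 * log(p) / n)

# Sparse coefficient vector: nonzero entries drawn from Uniform[m, 100m].
# The random choice of the support is an assumption made for illustration.
beta <- rep(0, p)
support <- sample(p, support.size)
beta[support] <- runif(support.size, m, 100 * m)

# Independent predictors and Gaussian noise (sigma is an assumed value here).
x <- matrix(rnorm(n * p), n, p)
sigma <- 1
y <- x %*% beta + rnorm(n, sd = sigma)

str(list(x = x, y = y, beta = beta))
```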
For family = "binomial", the data model is
$$Prob(Y = 1) = \exp(X \beta + \epsilon)/(1 + \exp(X \beta + \epsilon)).$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the
uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)
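The binomial model can be sketched in base R along the same lines. The support (the first three predictors) and the noise standard deviation below are assumptions chosen only for illustration:

```r
set.seed(1)
n <- 200
p <- 20

# Coefficient scale and an illustrative sparse coefficient vector.
m <- 5 * sqrt(2 * log(p) / n)
beta <- rep(0, p)
beta[1:3] <- runif(3, 2 * m, 10 * m)

x <- matrix(rnorm(n * p), n, p)

# Noisy linear predictor; sd = 0.5 is an assumed value.
eta <- drop(x %*% beta) + rnorm(n, sd = 0.5)

# Logistic link: exp(eta) / (1 + exp(eta)), written in a numerically stable form.
prob <- 1 / (1 + exp(-eta))

# Binary response drawn with success probability prob.
y <- rbinom(n, size = 1, prob = prob)
table(y)
```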
For family = "poisson", the data is modeled to have
a Poisson distribution:
$$Y \sim Poisson(\exp(X \beta + \epsilon)).$$
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the
uniform distribution on [2m, 10m], where \(m = \sqrt{2\log(p)/n}/3.\)
For family = "gamma", the data is modeled to have
a gamma distribution:
$$Y = Gamma(X \beta + \epsilon + 10, shape),$$
where \(shape\) is the shape parameter of the gamma distribution.
The underlying regression coefficient \(\beta\) has
uniform distribution [2m, 100m] and \(m = \sqrt{2\log(p)/n}.\)
For family = "ordinal", the data is modeled to have
an ordinal distribution.
For family = "cox", the model for failure time \(T\) is
$$T = (-\log(U) / \exp(X \beta))^{1/weibull.shape},$$
where \(U\) is a uniform random variable on [0, 1].
The censoring time \(C\) is generated from the
uniform distribution on \([0, uniform.max]\);
the censoring status is then defined as
\(\delta = I(T \le C)\) and the observed time as \(R = \min\{T, C\}\).
The nonzero entries of the underlying regression coefficient \(\beta\) are drawn from the
uniform distribution on [2m, 10m],
where \(m = 5 \sqrt{2\log(p)/n}\).
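The survival-time construction can be sketched in base R. The sketch assumes the failure time is \(-\log(U)\) divided by \(\exp(X\beta)\), raised to the power 1/weibull.shape, which is the usual way to draw Weibull-type times under proportional hazards; the coefficient vector is an arbitrary illustrative choice.

```r
set.seed(1)
n <- 100
p <- 5
weibull.shape <- 1
uniform.max <- 1

x <- matrix(rnorm(n * p), n, p)
beta <- c(1, -1, 0, 0, 0)  # illustrative coefficients, not the package default

# Failure time T from a uniform draw U and the linear predictor.
u <- runif(n)
time <- (-log(u) / exp(drop(x %*% beta)))^(1 / weibull.shape)

# Censoring time C ~ Uniform[0, uniform.max].
censor <- runif(n, 0, uniform.max)

# Status delta = I(T <= C): 1 means the failure was observed, 0 means censored.
status <- as.integer(time <= censor)

# Observed time R = min(T, C).
observed <- pmin(time, censor)

# Empirical censoring rate; increasing uniform.max lowers it.
mean(status == 0)
```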
For family = "mgaussian", the data model is
$$Y = X \beta + E.$$
The nonzero values of the regression matrix \(\beta\) are sampled from the
uniform distribution on [m, 100m], where \(m = 5 \sqrt{2\log(p)/n}.\)
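For family = "multinomial", the data model is
$$Prob(Y = k) = \exp(X \beta_k + E_k) / \sum_{l} \exp(X \beta_l + E_l),$$
where \(\beta_k\) and \(E_k\) denote the \(k\)-th columns of \(\beta\) and \(E\), and \(k\) ranges over the class.num classes.
The nonzero values of the regression coefficient \(\beta\) are sampled from the
uniform distribution on [2m, 10m], where \(m = 5 \sqrt{2\log(p)/n}.\)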
In the above models, \(\epsilon \sim N(0, \sigma^2 )\) and \(E \sim MVN(0, \sigma^2 \times I_{q \times q})\),
where \(\sigma^2\) is determined by snr and \(q\) is y.dim.
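The relation between snr and \(\sigma^2\) can be illustrated with a short base-R sketch. It assumes the sample variance of the linear predictor is used for \(Var(x\beta)\); the coefficient vector is an arbitrary illustrative choice.

```r
set.seed(1)
n <- 500
p <- 10
snr <- 10

x <- matrix(rnorm(n * p), n, p)
beta <- c(rep(2, 3), rep(0, p - 3))  # illustrative coefficients

# Linear predictor eta = x beta.
eta <- drop(x %*% beta)

# Solve snr = Var(x beta) / sigma^2 for the noise standard deviation.
sigma <- sqrt(var(eta) / snr)

# Noisy Gaussian response.
y <- eta + rnorm(n, sd = sigma)

# The ratio recovers snr by construction.
var(eta) / sigma^2
```

A larger snr therefore yields a smaller noise variance for the same coefficients and predictors.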
# Generate simulated data
n <- 200
p <- 20
support.size <- 5
dataset <- generate.data(n, p, support.size)
str(dataset)
#> List of 3
#> $ x : num [1:200, 1:20] -0.626 0.184 -0.836 1.595 0.33 ...
#> ..- attr(*, "dimnames")=List of 2
#> .. ..$ : NULL
#> .. ..$ : chr [1:20] "x1" "x2" "x3" "x4" ...
#> $ y : num [1:200, 1] -170.92 68.42 2.05 -7.15 96.92 ...
#> $ beta: num [1:20] 0 0 0 0 0 ...