Fábio N. Demarqui
Department of Statistics
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
fndemarqui@est.ufmg.br
Abstract
In this paper we propose a novel R package, calledrsurv, developed for general survival data simulation purposes.The package is built under a new approach to simulate survival data thatdepends deeply on the use of dplyr verbs. The proposed packageallows simulations of survival data from a wide range of regressionmodels, including accelerated failure time (AFT), proportional hazards(PH), proportional odds (PO), accelerated hazard (AH), Yang and Prentice(YP), and extended hazard (EH) models. The package rsurv alsostands out by its ability to generate survival data from an unlimitednumber of baseline distributions provided that an implementation of thequantile function of the chosen baseline distribution is available inR. Another nice feature of the package rsurv lies inthe fact that linear predictors are specified using Rformulas, facilitating the inclusion of categorical variables,interaction terms and offset variables. The functions implemented in thepackage rsurv can also be employed to simulate survival data withmore complex structures, such as survival data with different types ofcensoring mechanisms, survival data with cure fraction, survival datawith random effects (frailties), multivarite survival data, andcompeting risks survival data.
Keywords censoringrandom data generationsurvival regression models
1 Introduction
Survival analysis is a branch of statistics concerned with analyzing thetime taken from an event of interest to occur. The main characteristicdistinguishing survival analysis from other statistical techniques isthe existence of incomplete observations, known as censoredobservations. The presence of censoring requires specialized methods forthe analysis and simulation of survival data.
There are different types of censoring. We say the survival data isright-censored when the time until the event of interest is greater thanthe observed time. Similarly, when the observed time is greater than thetime until the event of interest, we have left-censored survival data.Finally, there are situations where the time to the event of interest isonly known to lie in an interval . In this case, we say thesurvival data is interval-censored. It is also important to know themechanism type that generates censoring when modeling and simulatingsurvival data. If the survival and censoring times are independent, wehave what is called independent censoring. In contrast, when thesurvival and censoring times are not independent, we have a dependentcensoring mechanism. Under dependent censoring, a joint distribution forthe survival and censoring times needs to be specified, posingadditional challenges in the modeling and simulation process.
The large number of existing survival models today, combined with thedifferent types of censoring, makes the simulation of survival data achallenging task. Unfortunately, by the time of writing, there are onlya few R (R Core Team, 2023) packages available from the Comprehensive RArchive Network (CRAN) aimed solely at simulating survival data, namelythe survsim (Moriña andNavarro, 2017), simsurv (Brilleman etal., 2020) and thePermAlgo (Sylvestre etal., 2022) packages.
The package survsim permits the generation of survival times fromaccelerated failure time models, as well as survival data with multipleevents and recurrent events. However, only a restricted set of baselinedistributions that includes the Weibull, log-normal, and log-logisticdistributions is available in this package. The package simsurv,on the other hand, allows the simulation of survival times fromexponential, Weibull, and Gompertz baseline distributions. It furtherpermits simulating from user-defined hazard functions, and fromtwo-component mixture distributions. In the package simsurv,covariates are introduced under a proportional hazards assumption, andtime-dependent covariates can also be included by interacting covariateswith linear time or some user-defined transformation of time. Finally,the package PermAlgo uses a permutational algorithm for thegeneration of survival data conditional on an user-specified list ofpossibly time-dependent covariates. Existing routines, implemented inother packages available on CRAN to simulating survival data under morerestrictive settings include the mlt (TorstenHothorn, 2020), prodlim(Gerds, 2023), and the SimHaz (Xiongetal., 2015) packages.
A common feature shared by the packages discussed above it that they areonly suitable to simulating right-censored survival data under anindependent censoring mechanism. In addition, the data generationfunctions implemented in those packages return a data.frame as output.Those facts, unfortunately, forces the simulated data set to fit in somespecific characteristics.
In this paper we introduce the R package rsurv(Demarqui, 2024a), available from CRAN athttp://CRAN.R-project.org/package=rsruv. The package rsurvis aimed to simulating survival data from some of the main parametricsurvival regression models available in the literature, under a variateof censoring scenarios. The main advantages of the R packagersurv over other existing packages are:
- 1)
The explicit use of R formulas for the specification of the linearpredictor, facilitating the inclusion of interaction terms and offsetvariables.
- 2)
Its versatility to simulating from virtually any baseline survivaldistribution, provided that it possesses a quantile functionimplemented in R.
- 3)
The implementation of a wide range of regression models, includingaccelerated failure time (AFT), proportional hazards (PH),proportional odds (PO), accelerated hazard (AH), Yang and Prentice(YP), and extended hazard (EH) models.
- 4)
The possibility to generate left, right and interval-censored survivaldata, under the assumption of both independent and dependent censoringmechanisms.
The main data generation functions available in the package rsurvare implemented to return a vector of failure times as output. This,combined with the use of the dplyr verbs, brings much moreflexibility in the survival data simulation process. As it shall bedemonstrated in this paper, simulating survival data with more complexstructures such as clustered survival data, multivariate survival data,or survival data with a cure fraction, can be easily performed usingthose core functions. The only drawback of the package rsurvregards its inability to simulate survival data in the presence oftime-dependent covariates. This fact, however, seems to be a faircompromise considering the wide range baseline distributions andregression models that can be combined to simulate survival data usingthe package rsurv.
The paper is organized as follows. In Section 2, we presentthe main algorithm implemented in the package rsurv, along withthe core functions used for survival data generation. In Section3, we illustrate how the package rsurv can be usedto simulate survival data with more complex structures, includingsurvival data with cure fraction, survival data with random effects(frailties), multivarite survival data, and competing risks survivaldata. Finally, in Section 4 we draw some concludingremarks and discuss some possible extensions for the proposed package.
2 Simulation from parametric survival regressionmodels
Let be the survival functionassociated with a given parametric survival regression model, where is a vector of parameters and isa vector of covariates. Since, it follows thatsamples from a regression model with survival function can be generated consideringthe following equation:
(1) |
where .
Equation (1) forms the basis for any simulation procedureaimed for simulating from standard parametric survival regressionmodels. It is important to note that, for data generation purposes, theinverse of does not need tohave a closed-form expression provided that it can be evaluatednumerically.
Any regression model can be thought of as a mechanism by which thedistribution associated with a hom*ogeneous population is perturbed insome fashion by the information contained in a set of covariates. Insurvival analysis, such distribution is commonly referred to as thebaseline distribution. Thus, if corresponds to a baseline survival function depending on a vector ofparameters , then it is reasonable to expect thatwhen .
Under appropriate constraints on its regression coefficients, it ispossible to show that the YP model includes the PH and PO models asparticular cases. Similarly, the AFT, AH, and PH models arise as specialcases of the EH model. Therefore, to simulate from all those models itis sufficient to know how to simulate from the YP and EH models. In thenext sections, we present the basics of the YP and EH models needed forsimulation purposes.
2.1 YP model and its particularcases
The YP model corresponds to a suitable choice to model survival datawith crossing survival curves. This model may also be appropriated insituations where PH and PO assumptions are not satisfied.
Following Demarqui andMayrink (2021) and references therein, the survivalfunction of the Yang and Prentice model is given by:
(2) |
whereis the baseline odds function, and are vectors of regressioncoefficients.
The hazard function of the YP model can be expressed as:
Let and two vectors of covariates.Becase
and
and areoften called short and log term regression coefficients, respectively.
It can be show that the survival curves cross each other when, for some .Moreover, the YP model includes the PH and PO models as particular caseswhen and, respectively. For this reason, theinverse of the survival function given in (2) allows thesimulation from the PH, PO and YP models.
2.2 EH model and its particularcases
The EH model, originally proposed by Etezadi-Amoli andCiampi (1987)…
According to Tseng andShu (2011), the regression coefficients can be regarded as acceleration factors sincethey identify whether the acceleration or deceleration of the observedfailure time related to the baseline failure time. In contrast, theregression coefficients characterize the relativehazard after transforming the time scale of the baseline hazard into anobserved time scale.
In this paper we consider a different parametrization for the EH, inline with the parametrization adopted in the R packagesurvstan (Demarqui, 2024b). Under that parametrization, the hazardand survival functions of the EH model can be expressed as:
and
(3) | |||||
It is easy to see from Equation (3) that the EH modelsincludes the AFT, PH and AH models as particular cases when,, and, respectively. Therefore, the inverseof the survival function given in (3) allows the simulationfrom the AFT, AH, PH and EH models.
2.3 Survival data generation procedure
The discussion presented in the previous sections suggests that thesurvival function should beexpressed as some function of and, satisfyingwhen . The following theorem formalize thisidea, and establishes how random samples from the YP and EH regressionmodels (and its particular cases AFT, AH, PH and PO) can be draw.
Theorem 1 Let ,where and, and consider the survivalfunctions given in Equations (2) and (3). Then:
- i)
There exist functions and , satisfying when , such that:
(4) - ii)
Samples of can be generated by:
(5) where .
The proof of the theorm is given in the appendix.
In practice, the validity of Equation (5) for simulatingsurvival data from the regression models presented in the previoussection is ensured by two mild general assumptions:
- i)
There is available an R function that returns the inverse of , that is, , for .
- ii)
The linear predictors and do not include a intercept term.
It is important to notice that assumption i) does not require theexistence of a closed-form expression for, as long as its inverse can becomputed numerically. Assumption ii), on the other hand, is necessary toavoid identifiability issues since no further assumptions/constraintsare made about the baseline survival function.
As one can see, the result given in Equation (5) is quitegeneral, and can be applied to simulate data from a wide range ofsurvival models. The only restriction for its use regards the fact thatneither covariates nor their effects are allowed to vary with time.
The internal function qsurv, reproduced below, plays a distinctrole in the algorithms implemented in the package rsurv.Specifically, this function allows the use of any baseline survivaldistribution for which there exists the implementation of its quantilefunction.
qsurv<-function(p,baseline,...){
qfunc<-get(paste("q",baseline,sep=""),mode="function")
x<-qfunc(p,lower.tail=FALSE,...)
return(x)
}
The following example shows the usefulness of the qsurv functionwhen working with customized functions or functions imported from otherpackages. There, the function qmydist mimics the functionqexp available in the package stats, whereas the functionqgengamma.orig from the package flexsurv returns thequantile function of the generalized gamma distribution, which includesthe exponential distribution as a particular case when all parametersare equal to 1.
library(flexsurv)
qmydist<-function(p,lambda,...){
x<-qexp(p,rate=lambda,...)
return(x)
}
set.seed(1234567890)
u<-runif(5)
x1<-qexp(u,rate=1,lower.tail=FALSE)
x2<-rsurv:::qsurv(u,baseline="exp",rate=1)
x3<-rsurv:::qsurv(u,baseline="mydist",lambda=1)
x4<-rsurv:::qsurv(u,baseline="gengamma.orig",shape=1,scale=1,k=1)
cbind(x1,x2,x3,x4)
## x1 x2 x3 x4## [1,] 0.09339179 0.09339179 0.09339179 0.09339179## [2,] 0.95202641 0.95202641 0.95202641 0.95202641## [3,] 0.17411789 0.17411789 0.17411789 0.17411789## [4,] 0.29132206 0.29132206 0.29132206 0.29132206## [5,] 0.34595405 0.34595405 0.34595405 0.34595405
The regression models implemented in the package rsurv share thesame parametrizations of those implement in the R packagesurvstan (Demarqui, 2024b). The main functions available in thepackage rsurv are: raftreg(), rahtreg(),rapheg(), raporeg(), rehreg() and rypreg().These functions permit the simulation of survival times from the AFT,AH, PH, PO, EH and YP models, respectively.
All r*reg() functions share the same basic structure.Specifically, the user must provide a numeric vector u ofquantiles used in (5), a formula containing informationregarding the structure of the linear predictors, a vector betaof regression coefficients (and additionally a vector phi for theEH and YP models), which should be compatible with the model matrixinduced by the formula argument, a dist/baseline argumentused to specify the chosen baseline distribution, along with anyadditional arguments passed to the internal function qsurvthrough the … argument.
A distinct feature of the functions r*reg() is that they return avector of generated survival times rather than a full data setcontaining additional information regarding covariates and a failureindicator variable. This brings more flexibility for the user tosimulate survival data with different characteristics and levels ofcomplexity, as we shall demonstrate throughout the examples presented inthe paper.
As a first example, suppose we want to simulate a sample of type Iright-censored survival data, assuming that the failure times aregenerated from an accelerated failure time model with loglogisticbaseline distribution. In addition, assume that we wish to consider twoexploratory variables, say age and sex, and we want to include aninteraction effect between them. Such a task can be easily accomplishedby using the function raftreg along with the functionqllogis available in the package flexsurv, as illustratedbelow:
library(flexsurv)
library(rsurv)
library(survstan)
set.seed(1234567890)
n<-1000
tau<-10#maximumfollowuptime
simdata<-data.frame(
age=rnorm(n),
sex=sample(c("f","m"),size=n,replace=TRUE)
)|>
mutate(
t=raftreg(runif(n),~age*sex,beta=c(1,2,-0.5),
dist="llogis",shape=1.5,scale=1),
)|>
rowwise()|>
mutate(
time=min(t,tau),
status=as.numeric(time==t)
)
glimpse(simdata)
## Rows: 1,000## Columns: 5## Rowwise:## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.~## $ sex <chr> "m", "f", "f", "m", "m", "m", "m", "f", "f", "m", "f", "m", "m"~## $ t <dbl> 15.2363453, 1.5259533, 2.1783746, 2.4354995, 58.7932958, 16.714~## $ time <dbl> 10.0000000, 1.5259533, 2.1783746, 2.4354995, 10.0000000, 10.000~## $ status <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ~
fit<-aftreg(
Surv(time,status)~age*sex,
data=simdata,dist="loglogistic"
)
estimates(fit)
## age sexm age:sexm alpha gamma## 0.9494630 2.0094422 -0.4812641 1.4946497 1.0226847
In our next example we demonstrate how to generate survival data withcrossing survival curves using the YP model. In this example, we assumethat the survival times are randomly right-censored, with the censoringtimes following an exponential distribution.
library(GGally)
set.seed(1234567890)
n<-1000
simdata<-data.frame(
trt=sample(c("chemo","chemo+rad"),size=n,replace=TRUE)
)|>
mutate(
t=rypreg(runif(n),~trt,beta=2,phi=-1.5,
dist="weibull",shape=1.5,scale=1),
c=rexp(n,rate=1)
)|>
rowwise()|>
mutate(
time=min(t,c),
status=as.numeric(time==t)
)|>select(-c(t,c))
glimpse(simdata)
## Rows: 1,000## Columns: 3## Rowwise:## $ trt <chr> "chemo", "chemo", "chemo+rad", "chemo+rad", "chemo+rad", "chemo~## $ time <dbl> 0.65743632, 1.14638933, 0.10715893, 0.09876511, 1.44704010, 0.1~## $ status <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, ~
fit<-ypreg(
Surv(time,status)~trt,
data=simdata,dist="weibull"
)
estimates(fit)
## short-trtchemo+rad long-trtchemo+rad alpha gamma## 1.920795 -1.446502 1.404515 1.043123
km<-survfit(Surv(time,status)~trt,data=simdata)
ggsurv(km)+
theme(legend.position="bottom")
The two examples presented above clearly demonstrate the flexibilityprovided by the functions r*reg() in the simulation of survivaldata with different types of censoring. In the next section we look atthe simulation of survival data with more sophisticated structures.
3 Simulation of survival data with more complexstructures
This section is dedicated to show how the functions discussed in theprevious section can be used to simulate survival data with more complexstructures. It is worth noting that the examples presented here onlycover a subset of the simulation possibilities.
The package rsurv contains some specialized functions that areuseful for simulating clustered survival data, survival data with curefraction, and interval-censored survival data. Other types of survivaldata, such as multivariate survival data, survival data with competingrisks, and survival data with dependent censoring can be generated, forinstance, combining the functions implemented in the package rsurvwith functions available in the R package copula(Jun Yan, 2007; Hofertetal., 2023).
3.1 Cure rate models
Rodrigues etal. (2009) presented an unified formulation for cure ratemodels, through the use of probability-generating functions, thatincludes several cure rate models available in the literature. In theirframework, cure rate models are formulated based on a two-stage processthat possesses an appealing biological motivation in terms ofincidence-latency of disease.
Suppose we are interested in modeling the time to relapse to cancer andlet denote the number of carcinogenic cells (often calledclonogens) left active after a patient has gone through initialtreatment. Then, for an individual with clonogens left active, let be the non-observable random time (i.e., thepromotion time) for the -th clonogen to produce a detectable cancermass, .
The time to relapse to cancer, which is in fact the observable quantity,is defined as:
(6) |
where denote the orded values of .
Then, under the assumptions that is independent of, and,, it follows that the population survival function can beexpressed as:
(7) | |||||
Theorem 2 (Rodrigues etal. (2009)) The survival functionassociated with a r.v. corresponding to a promotion time(two-stage) cure rate model is given by:
(8) |
where and is theprobability-generating function of .
The next corollary of Theorems 1 and 2 provides a general expression tosimulate survival data with a cure fraction.
Corollary 1 Given a cure fraction probability and a survival function satisfying (4), then time to relapse to cancer can begenerated as follows:
(9) |
where .
The proof of the corollary in show in the Appendix.
It is important to note from Equation (9) that, given, all the time generation procedures describedpreviously can be straightforwardly applied in the generation ofsurvival times in the presence of a cure fraction, provided that theinverse of the probability generating function, , isknown.
distribution | ||
---|---|---|
Bernoulli | ||
Poisson | ||
Negative Binomial | ||
Bell |
Here, and is the Lambert W function.
Common distributions assumed for the incidence sub-model (i.e.,the distribution of the latent r.v. ) are the Bernoulli, Poisson,negative binomial and the Bell distributions. Covariates are usuallyincluded in the incidence sub-model. If is a vector of covariates, and is thecorresponding vector of regression coefficients, than can be related to the linear predictor through appropriate choices of linkfunctions.
The inverse of the probability generating functions presented in Table1 are implemented in the function inv_pgf().Similarly to the r*reg() functions, the functioninv_pgf() requires a formula argument specifying thestructure of the linear predictor, along with a kappa argumentpassing the vector of regression coefficients, that should be compatiblewith the model matrix induced by the formula argument, and anincidence argument determining the incidence sub-model, and itscorresponding link function. Specifically, the functionrsurv::bernoulli() allows the specification of logit, probit,cloglog, and cauchit link functions, whereas the log, identity andsquare root links can be used with the functionsrsurv::poisson(), rsurv::bell() andrsurv::negbin(). When the negative binomial distribution isassumed for the incidence sub-model, the function inv_pgf()further requires an additional argument zeta, representing theextra negative-binomial parameter.
The next example illustrates how survival data with cure fraction can beeasily generated using the package rsurv. In that example weconsider a regression structure only for the incidence sub-model. Morespecifcally, we simulate survival data from a mixture cure rate modelwith a probit link function. We further assume that the survival timesare randomly right-censored.
library(rsurv)
library(GGally)
#fixingtheseedfortherng:
set.seed(1234567890)
kappa<-c(0.5,1.5,-1.1)
n<-1000
#generatingthesetofexplanatoryvariables:
simdata<-data.frame(
trt=sample(c("A","B"),size=n,replace=TRUE),
age=rnorm(n)
)
#generatingthedataset:
v<-inv_pgf(
~trt+age,
incidence=bernoulli("probit"),
kappa=kappa,
data=simdata
)
simdata<-simdata|>
mutate(
t=qexp(v,rate=1,lower.tail=FALSE),
c=rexp(n,rate=1)
)|>
rowwise()|>
mutate(
time=min(t,c),
status=as.numeric(time==t)
)|>
select(-c(t,c))
glimpse(simdata)
## Rows: 1,000## Columns: 4## Rowwise:## $ trt <chr> "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", "A", "A", "A"~## $ age <dbl> 0.2193297, 0.9952571, -0.4572095, 0.4864704, 1.9512081, 0.49597~## $ time <dbl> 0.17789062, 0.17142535, 0.50301782, 0.64774117, 0.40046281, 0.4~## $ status <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, ~
km<-survfit(Surv(time,status)~trt,data=simdata)
ggsurv(km)+
ylim(0,1)
3.2 Fraily models
Frailty models are often employed in survival analysis to accommodateunknown and/or unobserved risk factors through the inclusion of randomeffects, known as frailties (a term coined by Vaupeletal. (1979) torepresent individual’s heterogeneity). These type of models have becomeincreasingly popular in multivariate survival analysis since they allowone to take into account the association among individual survival timeswithin subgroups (or clusters) of subjects.
Frailty term | Expected value | Variance |
---|---|---|
0 | ||
1 | ||
- | - |
The parameter is known as the stability parameter of the PS distribution.
The package rsurv permits the inclusion of frailty terms throughthe linear predictors, i.e., and, where is thefrailty term. This convention adopted by the package rsruv allowsthe inclusion of frailties into the linear predictors using thestats::offset() function.
The function rfrailty() implemented in the package rsurvfacilitates the process of simulating survival data with frailties. Thisfunction permits random generation of frailties from the most commonlyfrailty distributions used in practice, namely the gamma distribution,the gaussian distribution, and the positive stable (ps) distribution,which are briefly described in Table 2. Since thegamma and positive stable distributions have support on , bydefault, the function rfrailty() returns samples from positiverandom frailties in the log-scale, that is, .
Our next example shows how the functions r*reg() andrfrailty() can be combined to generate survival data from sharedfrailty models. For this type of survival frailty models, elementsbelonging to the same cluster share the same random effect (frailty). Togenerate frailties using the function rfrailty() we need to passa vector of length containing information regarding the clusterconfiguration to the cluster argument, the assumed distributionto the frailty argument, and finally, the parameter of thefrailty distribution (sigma or alpha, depending on thechoice of passed to frailty argument).
library(rsurv)
library(frailtyEM)
set.seed(1234567890)
n<-1000#samplesize
L<-100#numberofclusters
simdata<-data.frame(
id=rep(1:L,each=n/L),
age=rnorm(n),
sex=sample(c("f","m"),size=n,replace=TRUE)
)|>
mutate(
frailty=rfrailty(cluster=id,frailty="gamma",sigma=0.5),
t=rphreg(u=runif(n),~age*sex+offset(frailty),
beta=c(1,2,-0.5),dist="exp",rate=1),
c=runif(n,0,5)
)|>
rowwise()|>
mutate(
time=min(t,c),
status=as.numeric(t<c)
)|>
select(-c(t,c))
glimpse(simdata)
## Rows: 1,000## Columns: 6## Rowwise:## $ id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,~## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1~## $ sex <chr> "m", "f", "f", "m", "m", "m", "m", "f", "f", "m", "f", "m", "m~## $ frailty <dbl> -0.15917714, -0.15917714, -0.15917714, -0.15917714, -0.1591771~## $ time <dbl> 0.066662943, 0.370877454, 3.527734703, 0.109155541, 0.11269257~## $ status <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,~
em<-emfrail(Surv(time,status)~age*sex+cluster(id),
distribution=emfrail_dist(dist="gamma"),
data=simdata)
summary(em)
## Call:## emfrail(formula = Surv(time, status) ~ age * sex + cluster(id),## data = simdata, distribution = emfrail_dist(dist = "gamma"))#### Regression coefficients:## coef exp(coef) se(coef) adj. se z p## age 0.9711 2.6409 0.0675 0.0682 14.2469 0## sexm 2.1159 8.2973 0.0949 0.0976 21.6777 0## age:sexm -0.4746 0.6221 0.0811 0.0813 -5.8405 0## Estimated distribution: gamma / left truncation: FALSE#### Fit summary:## Commenges-Andersen test for heterogeneity: p-val 1.25e-19## no-frailty Log-likelihood: -4679.899## Log-likelihood: -4636.705## LRT: 1/2 * pchisq(86.4), p-val 7.39e-21#### Frailty summary:## estimate lower 95% upper 95%## Var[Z] 0.233 0.152 0.350## Kendall’s tau 0.105 0.071 0.149## Median concordance 0.102 0.068 0.145## E[logZ] -0.121 -0.185 -0.078## Var[logZ] 0.263 0.164 0.418## theta 4.284 2.859 6.570## Confidence intervals based on the likelihood function
3.3 Interval-censored survivaldata
Consider a nonnegative random variable representing the time untilthe occorrence of an event of interest. According to Sun (2006), anobservation on is interval-censored when its exact value isunknow, and only an interval of the form can be observed, sothat
(10) |
where .
It is easy to see from (10) that right and left-censoredobservations on occur when and ,respectively. Naturally, when , can be observed exactly.
Two important types of interval-censored survival data that often arisein practice are the so-called type I (also known as current statussurvival data in the literature) and type II interval-censored survivaldata. Type I interval-censored survival data occurs when all observedintervals result in either left-censored () or right-censored() observations. Type II interval-censored survival data,on the other hand, emerges when we have at least one finite intervalsatisfying .
The generation of type I and type II interval-censored survival data isfacilitated by the function rinterval available in the packagersurv. For the random generation of type II interval-censoredsurvival data, the algorithm described in Kiani andArasan (2012), whichemulates a situation in which patients are expected to attendpre-scheduled visits to the physician with attendance probability ,is implemented. The function rinterval requires an argumenttime passing a sample of failure times, an argument tauthat can be either a vector of censoring times (for type Iinterval-censored survival data) or a time-grid of scheduled visits (fortype II interval-censored survival data), and an argument typeindicating the type of interval-censored survival data. Finally, theprob argument, only evaluated when type = II, can beregarded as the attendance probability of a scheduled visit.
The code presented below examplifies how to simulate type Iinterval-censored survival data using the function functionrinterval(). There, it is assumed that the exact failure timesare generated from a PH model with an exponential baseline distribution,whereas the censoring times are generated from a Weibull distribution.
library(rsurv)
library(icenReg)
set.seed(1234567890)
n<-300
covariates<-data.frame(
age=rnorm(n),
sex=sample(c("f","m"),size=n,replace=TRUE)
)
#typeIintervalcensoredsurvivaldata:
simdata1<-covariates|>
mutate(
time=rphreg(u=runif(n),~age+sex,beta=c(1,0.5),dist="exp",rate=1),
tau=rweibull(n,scale=2,shape=1.5),
rinterval(time,tau,type="I")
)
glimpse(simdata1)
## Rows: 300## Columns: 6## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.3~## $ sex <chr> "m", "f", "m", "m", "m", "f", "m", "f", "f", "m", "m", "f", "f",~## $ time <dbl> 0.05387373, 0.05326879, 0.16546309, 2.17932784, 0.04415598, 0.01~## $ tau <dbl> 0.8276881, 1.4807894, 3.2631929, 3.3893988, 2.7803610, 0.4764430~## $ left <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000~## $ right <dbl> 0.8276881, 1.4807894, 3.2631929, 3.3893988, 2.7803610, 0.4764430~
fit1<-ic_par(
cbind(left,right)~age+sex,data=simdata1,
model="ph",dist="exponential"
)
summary(fit1)
#### Model: Cox PH## Dependency structure assumed: Independence## Baseline: exponential## Call: ic_par(formula = cbind(left, right) ~ age + sex, data = simdata1,## model = "ph", dist = "exponential")#### Estimate Exp(Est) Std.Error z-value p## log_scale -0.3877 0.6786 0.1038 -3.736 1.871e-04## age 0.9186 2.5060 0.1241 7.402 1.341e-13## sexm 0.2245 1.2520 0.1965 1.142 2.534e-01#### final llk = -94.91423## Iterations = 9
The next example illustrates how type II interval-censored survival datacan be generated using the function rinterval(). In that example,we assume that patients are expected to return to the physician atcertain pre-specified times, with attendancy probability equals to 0.7.
#typeIIintervalcensoredsurvivaldata:
simdata2<-covariates|>
mutate(
time=raftreg(u=runif(n),~age+sex,beta=c(1,0.5),dist="exp",rate=1),
rinterval(time,tau=seq(0,5,by=1),type="II",prob=0.7)
)
glimpse(simdata2)
## Rows: 300## Columns: 5## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.3~## $ sex <chr> "m", "f", "m", "m", "m", "f", "m", "f", "f", "m", "m", "f", "f",~## $ time <dbl> 2.583140967, 2.152683999, 0.350258378, 0.045283358, 20.802515190~## $ left <dbl> 2, 1, 0, 0, 5, 2, 0, 1, 0, 1, 0, 0, 1, 2, 2, 0, 0, 1, 3, 5, 2, 3~## $ right <dbl> 3, Inf, 1, 1, Inf, 3, 1, 2, 1, 3, 1, 1, 2, 4, 5, 2, 1, 2, Inf, I~
fit2<-ic_par(
cbind(left,right)~age+sex,data=simdata2,
model="aft",dist="exponential"
)
summary(fit2)
#### Model: Accelerated Failure Time## Dependency structure assumed: Independence## Baseline: exponential## Call: ic_par(formula = cbind(left, right) ~ age + sex, data = simdata2,## model = "aft", dist = "exponential")#### Estimate Exp(Est) Std.Error z-value p## log_scale 0.3638 1.439 0.0716 5.081 3.747e-07## age 1.0450 2.842 0.0908 11.500 0.000e+00## sexm 0.4921 1.636 0.1480 3.326 8.825e-04#### final llk = -266.4422## Iterations = 7
3.4 Multivariate survival data, dependent censoring andcompetingrisks
Copulas provide a way to construct multivariate distributions based uponindependent marginal distributions (Hofertetal., 2018). A multivariatesurvival function can be constructed by the specification of independentmarginal survival models and an appropriated copula function.
Let a d-variated copula. Then, itfollows that the joint survival function of can be expressed as:
(11) |
To generate a random sample of it is sufficient to generate a random sample from andthen take.
library(copula)
library(rsurv)
set.seed(1234567890)
n<-1000#samplesize
tau<-0.5#Kendallstaucorrelation
theta<-iTau(copula=claytonCopula(),tau=tau)
clayton<-claytonCopula(param=theta,dim=2)
u<-rCopula(clayton,n=n)
#simulatingthefailuretimes:
simdata<-data.frame(
age=rnorm(n),
sex=sample(c("f","m"),size=n,replace=TRUE)
)|>
mutate(
t1=rphreg(u=u[,1],~age+sex,beta=c(1,1.2),dist="exp",rate=2),
t2=rporeg(u=u[,2],~age+sex,beta=c(0.8,1.1),dist="exp",rate=1),
)
glimpse(simdata)
## Rows: 1,000## Columns: 4## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867, -0~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f", "~## $ t1 <dbl> 0.016749142, 0.167458832, 0.245770610, 0.695707186, 0.061162858, 0~## $ t2 <dbl> 0.084588793, 0.583402263, 0.175592119, 0.288876906, 0.298716691, 0~
#checkingoutthecorrelation:
simdata|>
select(t1,t2)|>
cor(method="kendall")
## t1 t2## t1 1.0000000 0.5306627## t2 0.5306627 1.0000000
#adding(right)censoring:
simdata1<-simdata|>
mutate(
c=runif(n,0,5)#randomcensoring
)|>
rowwise()|>
mutate(
t1=min(t1,c),
t2=min(t2,c),
status1=as.numeric(t1<c),
status2=as.numeric(t2<c),
)
glimpse(simdata1)
## Rows: 1,000## Columns: 7## Rowwise:## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f~## $ t1 <dbl> 0.016749142, 0.167458832, 0.245770610, 0.695707186, 0.06116285~## $ t2 <dbl> 0.084588793, 0.583402263, 0.175592119, 0.288876906, 0.29871669~## $ c <dbl> 2.26609325, 4.57257662, 3.63517003, 2.02494996, 2.66336179, 2.~## $ status1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,~## $ status2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,~
#checkingoutthecorrelation:
simdata1|>
select(t1,t2)|>
cor(method="kendall")
## t1 t2## t1 1.0000000 0.5121882## t2 0.5121882 1.0000000
In the next example we show how to simulate survival data with dependentcensoring. In that case, the observable survival time is given by, where , , and represents thetime to failure, the dependent censoring time, and the censoring timedue to random censoring.
simdata2<-simdata|>
mutate(
a=runif(n,0,5)#randomcensoring
)|>
rowwise()|>
mutate(
y=min(t1,t2,a),
status1=as.numeric(y==t1),
status2=as.numeric(y==t2),
)|>
select(-c(t1,t2))
glimpse(simdata2)
## Rows: 1,000## Columns: 6## Rowwise:## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f~## $ a <dbl> 3.1886887, 3.6066255, 3.0834830, 0.8921447, 0.8049795, 0.96584~## $ y <dbl> 0.016749142, 0.167458832, 0.175592119, 0.288876906, 0.06116285~## $ status1 <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,~## $ status2 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,~
Notice that the example presented above can be viewed as a particularcase of competing risks survival data. In the next example, we considera scenario where elements are subjected to three competing causes offailure.
n<-1000
tau<-0.5
theta<-iTau(copula=gumbelCopula(),tau=tau)
gumbel<-gumbelCopula(param=theta,dim=3)
u<-rCopula(gumbel,n=n)
simdata<-data.frame(
age=rnorm(n),
sex=sample(c("f","m"),size=n,replace=TRUE)
)|>
mutate(
t1=rphreg(u=u[,1],~age+sex,beta=c(1,1.2),dist="lnorm",meanlog=0,sdlog=1),
t2=rphreg(u=u[,2],~age+sex,beta=c(0.8,1.1),dist="exp",rate=1),
t3=rphreg(u=u[,3],~age+sex,beta=c(0.7,1.0),dist="exp",rate=1),
a=runif(n,0,5)#randomcensoring
)|>
rowwise()|>
mutate(
y=min(t1,t2,t3,a),
status1=as.numeric(y==t1),
status2=as.numeric(y==t2),
status3=as.numeric(y==t3),
)|>
select(-c(t1,t2,t3))
glimpse(simdata)
## Rows: 1,000## Columns: 7## Rowwise:## $ age <dbl> -0.53451004, -1.48095715, -0.76892438, 0.43319025, 0.76357712,~## $ sex <chr> "m", "m", "m", "m", "m", "m", "m", "f", "f", "f", "f", "m", "f~## $ a <dbl> 2.17617891, 1.49961178, 4.65348449, 1.04816011, 1.12082428, 0.~## $ y <dbl> 0.18034466, 0.51001897, 0.07337358, 0.12327447, 0.13908667, 0.~## $ status1 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~## $ status2 <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,~## $ status3 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1,~
4 Conclusions
This paper has presented the new R package rsurv. Theproposed package is designed for general survival data simulation. Itcombines a wide range of survival regression models, which includes AFT,PH, PO, AH, YP and EH models, with an unlimited number of baselinesurvival distributions. The package rsurv is suitable togenerating right, left and interval-censored survival data underindependent and dependent censoring mechanisms. The functions providedin the package rsurv also allows the generation of survival datawith cure fraction, survival data with random effects (frailties), andcompeting risks survival data. Another nice characteristic of thepackage rsurv regards the fact that linear predictors arespecified through R formulas. This, together with the use of dplyrverbs, makes the survival data simulation process much more flexible andintuitive.
Future research involves the development of new functions for simulatingfrom regression models with time-dependent covariates, and time-varyingcovariate effects. Further possible extensions include theimplementation of methods to simulate survival data from recurrentevetns, and simulate data from truncaded survival data.
Appendix
Proof of Theorem 1
We start the proof of Theorem 1 by the YP model given in Equation(2). For this step of the proof, define and .
- i)
As does not depend on , we have that . Using the fact that , it follows that
- ii)
We have that
Therefore, since , we have that
Now we present the proof of Theorem 1 for the EH model given in Equation(3).
- i)
We have that . It follows directly from Equation (3) that
- ii)
Note that
Thus,
Proof of Corollary 1
It follows immediately from Theorem 2 that:
The rest of the proof follows from (9) and the directapplication of Theorem 1.
References
- Brilleman etal. (2020)Brilleman, S.L., R.Wolfe, M.Moreno-Betancur, and M.J. Crowther (2020).Simulating survival data using the simsurv R package.Journal of Statistical Software97(3), 1–27.
- Demarqui (2024a)Demarqui, F. (2024a).rsurv: Random Generation of Survival Data.R package version 0.0.1.
- Demarqui (2024b)Demarqui, F. (2024b).survstan: Fitting Survival Regression Models via ’Stan’.R package version 0.0.7.1, https://fndemarqui.github.io/survstan/.
- Demarqui andMayrink (2021)Demarqui, F.N. and V.D. Mayrink (2021).Yang and Prentice model with piecewise exponential baselinedistribution for modeling lifetime data with crossing survival curves.Brazilian Journal of Probability and Statistics35(1), 172 – 186.
- Etezadi-Amoli andCiampi (1987)Etezadi-Amoli, J. and A.Ciampi (1987).Extended hazard regression for censored survival data withcovariates: A spline approximation for the baseline hazard function.Biometrics43(1), 181–192.
- Gerds (2023)Gerds, T.A. (2023).prodlim: Product-Limit Estimation for Censored Event HistoryAnalysis.R package version 2023.08.28.
- Hofertetal. (2018)Hofert, M., I.Kojadinovic, M.Maechler, and J.Yan (2018).Elements of Copula Modeling with R.Springer Use R! Series.
- Hofertetal. (2023)Hofert, M., I.Kojadinovic, M.Maechler, and J.Yan (2023).copula: Multivariate Dependence with Copulas.R package version 1.1-3.
- Jun Yan (2007)Jun Yan (2007).Enjoy the joy of copulas: With a package copula.Journal of Statistical Software21(4), 1–21.
- Kiani andArasan (2012)Kiani, K. and J.Arasan (2012).Simulation of interval censored data in medical and biologicalstudies.International Journal of Modern Physics: ConferenceSeries09, 112–118.
- Moriña andNavarro (2017)Moriña, D. and A.Navarro (2017).Competing risks simulation with the survsim R package.Communications in Statistics - Simulation andComputation46(7), 5712–5722.
- R Core Team (2023)R Core Team (2023).R: A Language and Environment for Statistical Computing.Vienna, Austria: R Foundation for Statistical Computing.
- Rodrigues etal. (2009)Rodrigues, J., V.G. Cancho, M.deCastro, and F.Louzada-Neto (2009).On the unification of the long-term survival models.Statistics and Probability Letters79, 753–759.
- Sun (2006)Sun, J. (2006).The Statistical Analysis of Interval-Censored Failure TimeData.Springer.
- Sylvestre etal. (2022)Sylvestre, M.-P., T.Evans, T.MacKenzie, and M.Abrahamowicz (2022).PermAlgo: Permutational algorith to generate event timesconditional on a covariate matrix including time-dependent covariates.R package version 1.2.
- TorstenHothorn (2020)Torsten Hothorn (2020).Most likely transformations: The mlt package.Journal of Statistical Software92(1), 1–68.
- Tseng andShu (2011)Tseng, Y.-K. and K.-N. Shu (2011).Efficient estimation for a semiparametric extended hazards model.Communications in Statistics - Simulation andComputation40(2), 258–273.
- Vaupeletal. (1979)Vaupel, J.W., K.G. Manton, and E.Stallard (1979).The impact of heterogeneity in individual frailty on the dynamics ofmortality.Demography16(3), 439–454.
- Xiongetal. (2015)Xiong, D., T.Pokaprakarn, H.Udagawa, and N.Rabbee (2015).SimHaz: Simulated Survival and Hazard Analysis forTime-Dependent Exposure.R package version 0.1.