Simulation of Survival Data With the Package rsurv (2024)

Fábio N. Demarqui
Department of Statistics
Universidade Federal de Minas Gerais
Belo Horizonte, Brazil
fndemarqui@est.ufmg.br

Abstract

In this paper we propose a novel R package, calledrsurv, developed for general survival data simulation purposes.The package is built under a new approach to simulate survival data thatdepends deeply on the use of dplyr verbs. The proposed packageallows simulations of survival data from a wide range of regressionmodels, including accelerated failure time (AFT), proportional hazards(PH), proportional odds (PO), accelerated hazard (AH), Yang and Prentice(YP), and extended hazard (EH) models. The package rsurv alsostands out by its ability to generate survival data from an unlimitednumber of baseline distributions provided that an implementation of thequantile function of the chosen baseline distribution is available inR. Another nice feature of the package rsurv lies inthe fact that linear predictors are specified using Rformulas, facilitating the inclusion of categorical variables,interaction terms and offset variables. The functions implemented in thepackage rsurv can also be employed to simulate survival data withmore complex structures, such as survival data with different types ofcensoring mechanisms, survival data with cure fraction, survival datawith random effects (frailties), multivarite survival data, andcompeting risks survival data.

Keywords censoring\cdotrandom data generation\cdotsurvival regression models

1 Introduction

Survival analysis is a branch of statistics concerned with analyzing thetime taken from an event of interest to occur. The main characteristicdistinguishing survival analysis from other statistical techniques isthe existence of incomplete observations, known as censoredobservations. The presence of censoring requires specialized methods forthe analysis and simulation of survival data.

There are different types of censoring. We say the survival data isright-censored when the time until the event of interest is greater thanthe observed time. Similarly, when the observed time is greater than thetime until the event of interest, we have left-censored survival data.Finally, there are situations where the time to the event of interest isonly known to lie in an interval [L,R]𝐿𝑅[L,R]. In this case, we say thesurvival data is interval-censored. It is also important to know themechanism type that generates censoring when modeling and simulatingsurvival data. If the survival and censoring times are independent, wehave what is called independent censoring. In contrast, when thesurvival and censoring times are not independent, we have a dependentcensoring mechanism. Under dependent censoring, a joint distribution forthe survival and censoring times needs to be specified, posingadditional challenges in the modeling and simulation process.

The large number of existing survival models today, combined with thedifferent types of censoring, makes the simulation of survival data achallenging task. Unfortunately, by the time of writing, there are onlya few R (R Core Team, 2023) packages available from the Comprehensive RArchive Network (CRAN) aimed solely at simulating survival data, namelythe survsim (Moriña andNavarro, 2017), simsurv (Brilleman etal., 2020) and thePermAlgo (Sylvestre etal., 2022) packages.

The package survsim permits the generation of survival times fromaccelerated failure time models, as well as survival data with multipleevents and recurrent events. However, only a restricted set of baselinedistributions that includes the Weibull, log-normal, and log-logisticdistributions is available in this package. The package simsurv,on the other hand, allows the simulation of survival times fromexponential, Weibull, and Gompertz baseline distributions. It furtherpermits simulating from user-defined hazard functions, and fromtwo-component mixture distributions. In the package simsurv,covariates are introduced under a proportional hazards assumption, andtime-dependent covariates can also be included by interacting covariateswith linear time or some user-defined transformation of time. Finally,the package PermAlgo uses a permutational algorithm for thegeneration of survival data conditional on an user-specified list ofpossibly time-dependent covariates. Existing routines, implemented inother packages available on CRAN to simulating survival data under morerestrictive settings include the mlt (TorstenHothorn, 2020), prodlim(Gerds, 2023), and the SimHaz (Xiongetal., 2015) packages.

A common feature shared by the packages discussed above it that they areonly suitable to simulating right-censored survival data under anindependent censoring mechanism. In addition, the data generationfunctions implemented in those packages return a data.frame as output.Those facts, unfortunately, forces the simulated data set to fit in somespecific characteristics.

In this paper we introduce the R package rsurv(Demarqui, 2024a), available from CRAN athttp://CRAN.R-project.org/package=rsruv. The package rsurvis aimed to simulating survival data from some of the main parametricsurvival regression models available in the literature, under a variateof censoring scenarios. The main advantages of the R packagersurv over other existing packages are:

  1. 1)

    The explicit use of R formulas for the specification of the linearpredictor, facilitating the inclusion of interaction terms and offsetvariables.

  2. 2)

    Its versatility to simulating from virtually any baseline survivaldistribution, provided that it possesses a quantile functionimplemented in R.

  3. 3)

    The implementation of a wide range of regression models, includingaccelerated failure time (AFT), proportional hazards (PH),proportional odds (PO), accelerated hazard (AH), Yang and Prentice(YP), and extended hazard (EH) models.

  4. 4)

    The possibility to generate left, right and interval-censored survivaldata, under the assumption of both independent and dependent censoringmechanisms.

The main data generation functions available in the package rsurvare implemented to return a vector of failure times as output. This,combined with the use of the dplyr verbs, brings much moreflexibility in the survival data simulation process. As it shall bedemonstrated in this paper, simulating survival data with more complexstructures such as clustered survival data, multivariate survival data,or survival data with a cure fraction, can be easily performed usingthose core functions. The only drawback of the package rsurvregards its inability to simulate survival data in the presence oftime-dependent covariates. This fact, however, seems to be a faircompromise considering the wide range baseline distributions andregression models that can be combined to simulate survival data usingthe package rsurv.

The paper is organized as follows. In Section 2, we presentthe main algorithm implemented in the package rsurv, along withthe core functions used for survival data generation. In Section3, we illustrate how the package rsurv can be usedto simulate survival data with more complex structures, includingsurvival data with cure fraction, survival data with random effects(frailties), multivarite survival data, and competing risks survivaldata. Finally, in Section 4 we draw some concludingremarks and discuss some possible extensions for the proposed package.

2 Simulation from parametric survival regressionmodels

Let S(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱S(t|\boldsymbol{\Theta},\mathbf{x}) be the survival functionassociated with a given parametric survival regression model, where𝚯𝚯\boldsymbol{\Theta} is a vector of parameters and 𝐱𝐱\mathbf{x} isa 1×p1𝑝1\times p vector of covariates. SinceS(t|𝚯,𝐱)U(0,1)similar-to𝑆conditional𝑡𝚯𝐱𝑈01S(t|\boldsymbol{\Theta},\mathbf{x})\sim U(0,1), it follows thatsamples from a regression model with survival functionS(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱S(t|\boldsymbol{\Theta},\mathbf{x}) can be generated consideringthe following equation:

T=S1(U|𝚯,𝐱),𝑇superscript𝑆1conditional𝑈𝚯𝐱\displaystyle T=S^{-1}(U|\boldsymbol{\Theta},\mathbf{x}),(1)

where UU(0,1)similar-to𝑈𝑈01U\sim U(0,1).

Equation (1) forms the basis for any simulation procedureaimed for simulating from standard parametric survival regressionmodels. It is important to note that, for data generation purposes, theinverse of S(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱S(t|\boldsymbol{\Theta},\mathbf{x}) does not need tohave a closed-form expression provided that it can be evaluatednumerically.

Any regression model can be thought of as a mechanism by which thedistribution associated with a hom*ogeneous population is perturbed insome fashion by the information contained in a set of covariates. Insurvival analysis, such distribution is commonly referred to as thebaseline distribution. Thus, if S0(|𝜽)S_{0}(\cdot|\boldsymbol{\theta})corresponds to a baseline survival function depending on a vector ofparameters 𝜽𝜽\boldsymbol{\theta}, then it is reasonable to expect thatS(t|𝚯,𝐱)=S0(t|𝜽)𝑆conditional𝑡𝚯𝐱subscript𝑆0conditional𝑡𝜽S(t|\boldsymbol{\Theta},\mathbf{x})=S_{0}(t|\boldsymbol{\theta})when 𝐱=𝟎𝐱0\mathbf{x}=\mathbf{0}.

Under appropriate constraints on its regression coefficients, it ispossible to show that the YP model includes the PH and PO models asparticular cases. Similarly, the AFT, AH, and PH models arise as specialcases of the EH model. Therefore, to simulate from all those models itis sufficient to know how to simulate from the YP and EH models. In thenext sections, we present the basics of the YP and EH models needed forsimulation purposes.

2.1 YP model and its particularcases

The YP model corresponds to a suitable choice to model survival datawith crossing survival curves. This model may also be appropriated insituations where PH and PO assumptions are not satisfied.

Following Demarqui andMayrink (2021) and references therein, the survivalfunction of the Yang and Prentice model is given by:

S(t|𝚯,𝐱)=[1+R0(t|𝜽)e𝐱(𝜷ϕ)]e𝐱ϕ,𝑆conditional𝑡𝚯𝐱superscriptdelimited-[]1subscript𝑅0conditional𝑡𝜽superscript𝑒𝐱𝜷bold-italic-ϕsuperscript𝑒𝐱bold-italic-ϕS(t|\boldsymbol{\Theta},\mathbf{x})=\left[1+R_{0}(t|\boldsymbol{\theta})e^{\mathbf{x}(\boldsymbol{\beta}-\boldsymbol{\phi})}\right]^{-e^{\mathbf{x}\boldsymbol{\phi}}},(2)

whereR0(t|𝜽)=1S0(t|𝜽)S0(t|𝜽)subscript𝑅0conditional𝑡𝜽1subscript𝑆0conditional𝑡𝜽subscript𝑆0conditional𝑡𝜽\displaystyle R_{0}(t|\boldsymbol{\theta})=\frac{1-S_{0}(t|\boldsymbol{\theta})}{S_{0}(t|\boldsymbol{\theta})}is the baseline odds function, 𝜷𝜷\boldsymbol{\beta} andϕbold-italic-ϕ\boldsymbol{\phi} are p×1𝑝1p\times 1 vectors of regressioncoefficients.

The hazard function of the YP model can be expressed as:

h(t|𝚯,𝐱)=e𝐱(𝜷+ϕ)e𝐱𝜷F0(t|𝜽)+e𝐱ϕS0(t|𝜽)h0(t|𝜽).conditional𝑡𝚯𝐱superscript𝑒𝐱𝜷bold-italic-ϕsuperscript𝑒𝐱𝜷subscript𝐹0conditional𝑡𝜽superscript𝑒𝐱bold-italic-ϕsubscript𝑆0conditional𝑡𝜽subscript0conditional𝑡𝜽h(t|\boldsymbol{\Theta},\mathbf{x})=\frac{e^{\mathbf{x}(\boldsymbol{\beta}+\boldsymbol{\phi})}}{e^{\mathbf{x}\boldsymbol{\beta}}F_{0}(t|\boldsymbol{\theta})+e^{\mathbf{x}\boldsymbol{\phi}}S_{0}(t|\boldsymbol{\theta})}h_{0}(t|\boldsymbol{\theta}).

Let 𝐱1subscript𝐱1\mathbf{x}_{1} and 𝐱2subscript𝐱2\mathbf{x}_{2} two vectors of covariates.Becase

limt0h(t|𝚯,𝐱2)h(t|𝚯,𝐱1)=exp{(𝐱2𝐱1)𝜷}subscript𝑡0conditional𝑡𝚯subscript𝐱2conditional𝑡𝚯subscript𝐱1subscript𝐱2subscript𝐱1𝜷\displaystyle\lim_{t\rightarrow 0}\frac{h(t|\boldsymbol{\Theta},\mathbf{x}_{2})}{h(t|\boldsymbol{\Theta},\mathbf{x}_{1})}=\exp\{(\mathbf{x}_{2}-\mathbf{x}_{1})\boldsymbol{\beta}\}

and

limth(t|𝚯,𝐱2)h(t|𝚯,𝐱1)=exp{(𝐱2𝐱1)ϕ},subscript𝑡conditional𝑡𝚯subscript𝐱2conditional𝑡𝚯subscript𝐱1subscript𝐱2subscript𝐱1bold-italic-ϕ\displaystyle\lim_{t\rightarrow\infty}\frac{h(t|\boldsymbol{\Theta},\mathbf{x}_{2})}{h(t|\boldsymbol{\Theta},\mathbf{x}_{1})}=\exp\{(\mathbf{x}_{2}-\mathbf{x}_{1})\boldsymbol{\phi}\},

𝜷𝜷\boldsymbol{\beta} and ϕbold-italic-ϕ\boldsymbol{\phi} areoften called short and log term regression coefficients, respectively.

It can be show that the survival curves cross each other whenβj×ϕj<0subscript𝛽𝑗subscriptitalic-ϕ𝑗0\beta_{j}\times\phi_{j}<0, for some j=1,,p𝑗1𝑝j=1,\cdots,p.Moreover, the YP model includes the PH and PO models as particular caseswhen 𝜷=ϕ𝜷bold-italic-ϕ\boldsymbol{\beta}=\boldsymbol{\phi} andϕ=𝟎bold-italic-ϕ0\boldsymbol{\phi}=\mathbf{0}, respectively. For this reason, theinverse of the survival function given in (2) allows thesimulation from the PH, PO and YP models.

2.2 EH model and its particularcases

The EH model, originally proposed by Etezadi-Amoli andCiampi (1987)

h(t|𝚯,𝐱)=h0(te𝐱𝜷|𝜽)e𝐱ϕ.conditional𝑡𝚯𝐱subscript0conditional𝑡superscript𝑒𝐱𝜷𝜽superscript𝑒𝐱bold-italic-ϕh(t|\boldsymbol{\Theta},\mathbf{x})=h_{0}\left(te^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta}\right)e^{\mathbf{x}\boldsymbol{\phi}}.

According to Tseng andShu (2011), the regression coefficients𝜷𝜷\boldsymbol{\beta} can be regarded as acceleration factors sincethey identify whether the acceleration or deceleration of the observedfailure time related to the baseline failure time. In contrast, theregression coefficients ϕbold-italic-ϕ\boldsymbol{\phi} characterize the relativehazard after transforming the time scale of the baseline hazard into anobserved time scale.

In this paper we consider a different parametrization for the EH, inline with the parametrization adopted in the R packagesurvstan (Demarqui, 2024b). Under that parametrization, the hazardand survival functions of the EH model can be expressed as:

h(t|𝚯,𝐱)=h0(t/e𝐱𝜷|𝜽)e𝐱ϕconditional𝑡𝚯𝐱subscript0conditional𝑡superscript𝑒𝐱𝜷𝜽superscript𝑒𝐱bold-italic-ϕh(t|\boldsymbol{\Theta},\mathbf{x})=h_{0}\left(t/e^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta}\right)e^{\mathbf{x}\boldsymbol{\phi}}

and

S(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱\displaystyle S(t|\boldsymbol{\Theta},\mathbf{x})=\displaystyle=exp{H0(t/e𝐱𝜷|𝜽)e𝐱(𝜷+ϕ)}\displaystyle\exp\left\{-H_{0}\left(t/e^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta})e^{\mathbf{x}(\boldsymbol{\beta}+\boldsymbol{\phi}}\right)\right\}(3)
=\displaystyle=S0(t/e𝐱𝜷|𝜽)e𝐱(𝜷+ϕ).subscript𝑆0superscriptconditional𝑡superscript𝑒𝐱𝜷𝜽superscript𝑒𝐱𝜷bold-italic-ϕ\displaystyle S_{0}\left(t/e^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta}\right)^{e^{\mathbf{x}(\boldsymbol{\beta}+\boldsymbol{\phi})}}.

It is easy to see from Equation (3) that the EH modelsincludes the AFT, PH and AH models as particular cases whenϕ=𝜷bold-italic-ϕ𝜷\boldsymbol{\phi}=-\boldsymbol{\beta},𝜷=𝟎𝜷0\boldsymbol{\beta}=\mathbf{0}, andϕ=𝟎bold-italic-ϕ0\boldsymbol{\phi}=\mathbf{0}, respectively. Therefore, the inverseof the survival function given in (3) allows the simulationfrom the AFT, AH, PH and EH models.

2.3 Survival data generation procedure

The discussion presented in the previous sections suggests that thesurvival function S(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱S(t|\boldsymbol{\Theta},\mathbf{x}) should beexpressed as some function of S0(t|𝜽)subscript𝑆0conditional𝑡𝜽S_{0}(t|\boldsymbol{\theta}) and𝐱𝐱\mathbf{x}, satisfyingS(t|𝚯,𝐱)=S0(t|𝜽)𝑆conditional𝑡𝚯𝐱subscript𝑆0conditional𝑡𝜽S(t|\boldsymbol{\Theta},\mathbf{x})=S_{0}(t|\boldsymbol{\theta})when 𝐱=𝟎𝐱0\mathbf{x}=\mathbf{0}. The following theorem formalize thisidea, and establishes how random samples from the YP and EH regressionmodels (and its particular cases AFT, AH, PH and PO) can be draw.

Theorem 1 Let 𝜼=(η1,η2)𝜼subscript𝜂1subscript𝜂2\boldsymbol{\eta}=(\eta_{1},\eta_{2}),where η1=𝐱𝜷subscript𝜂1𝐱𝜷\eta_{1}=\mathbf{x}\boldsymbol{\beta} andη2=𝐱ϕsubscript𝜂2𝐱bold-italic-ϕ\eta_{2}=\mathbf{x}\boldsymbol{\phi}, and consider the survivalfunctions given in Equations (2) and (3). Then:

  • i)

    There exist functions g(s)=g(s;𝜼)𝑔𝑠𝑔𝑠𝜼g(s)=g(s;\boldsymbol{\eta}) and a(t)=a(t;𝜼)𝑎𝑡𝑎𝑡𝜼a(t)=a(t;\boldsymbol{\eta}), satisfying S(t|𝚯,𝐱)=S0(t|𝜽)𝑆conditional𝑡𝚯𝐱subscript𝑆0conditional𝑡𝜽S(t|\boldsymbol{\Theta},\mathbf{x})=S_{0}(t|\boldsymbol{\theta}) when 𝐱=𝟎𝐱0\mathbf{x}=\mathbf{0}, such that:

    S(t|𝚯,𝐱)=g(S0(a(t)|𝜽)).𝑆conditional𝑡𝚯𝐱𝑔subscript𝑆0conditional𝑎𝑡𝜽S(t|\boldsymbol{\Theta},\mathbf{x})=g\left(S_{0}(a(t)|\boldsymbol{\theta}\right)).(4)
  • ii)

    Samples of TS(t|𝚯,𝐱)similar-to𝑇𝑆conditional𝑡𝚯𝐱T\sim S(t|\boldsymbol{\Theta},\mathbf{x}) can be generated by:

    T=a1(S01(g1(U)|𝜽)),𝑇superscript𝑎1superscriptsubscript𝑆01conditionalsuperscript𝑔1𝑈𝜽T=a^{-1}\left(S_{0}^{-1}\left(g^{-1}(U)|\boldsymbol{\theta}\right)\right),(5)

    where UU(0,1)similar-to𝑈𝑈01U\sim U(0,1).

The proof of the theorm is given in the appendix.

In practice, the validity of Equation (5) for simulatingsurvival data from the regression models presented in the previoussection is ensured by two mild general assumptions:

  • i)

    There is available an R function that returns the inverse of S0(|𝜽)S_{0}(\cdot|\boldsymbol{\theta}), that is, t=S01(u|𝜽)𝑡superscriptsubscript𝑆01conditional𝑢𝜽t=S_{0}^{-1}(u|\boldsymbol{\theta}), for u(0,1)𝑢01u\in(0,1).

  • ii)

    The linear predictors η1=𝐱𝜷subscript𝜂1𝐱𝜷\eta_{1}=\mathbf{x}\boldsymbol{\beta} and η2=𝐱ϕsubscript𝜂2𝐱bold-italic-ϕ\eta_{2}=\mathbf{x}\boldsymbol{\phi} do not include a intercept term.

It is important to notice that assumption i) does not require theexistence of a closed-form expression forS0(|𝜽)S_{0}(\cdot|\boldsymbol{\theta}), as long as its inverse can becomputed numerically. Assumption ii), on the other hand, is necessary toavoid identifiability issues since no further assumptions/constraintsare made about the baseline survival functionS0(|𝜽)S_{0}(\cdot|\boldsymbol{\theta}).

As one can see, the result given in Equation (5) is quitegeneral, and can be applied to simulate data from a wide range ofsurvival models. The only restriction for its use regards the fact thatneither covariates nor their effects are allowed to vary with time.

The internal function qsurv, reproduced below, plays a distinctrole in the algorithms implemented in the package rsurv.Specifically, this function allows the use of any baseline survivaldistribution for which there exists the implementation of its quantilefunction.

qsurv<-function(p,baseline,...){

qfunc<-get(paste("q",baseline,sep=""),mode="function")

x<-qfunc(p,lower.tail=FALSE,...)

return(x)

}

The following example shows the usefulness of the qsurv functionwhen working with customized functions or functions imported from otherpackages. There, the function qmydist mimics the functionqexp available in the package stats, whereas the functionqgengamma.orig from the package flexsurv returns thequantile function of the generalized gamma distribution, which includesthe exponential distribution as a particular case when all parametersare equal to 1.

library(flexsurv)

qmydist<-function(p,lambda,...){

x<-qexp(p,rate=lambda,...)

return(x)

}

set.seed(1234567890)

u<-runif(5)

x1<-qexp(u,rate=1,lower.tail=FALSE)

x2<-rsurv:::qsurv(u,baseline="exp",rate=1)

x3<-rsurv:::qsurv(u,baseline="mydist",lambda=1)

x4<-rsurv:::qsurv(u,baseline="gengamma.orig",shape=1,scale=1,k=1)

cbind(x1,x2,x3,x4)

## x1 x2 x3 x4## [1,] 0.09339179 0.09339179 0.09339179 0.09339179## [2,] 0.95202641 0.95202641 0.95202641 0.95202641## [3,] 0.17411789 0.17411789 0.17411789 0.17411789## [4,] 0.29132206 0.29132206 0.29132206 0.29132206## [5,] 0.34595405 0.34595405 0.34595405 0.34595405

The regression models implemented in the package rsurv share thesame parametrizations of those implement in the R packagesurvstan (Demarqui, 2024b). The main functions available in thepackage rsurv are: raftreg(), rahtreg(),rapheg(), raporeg(), rehreg() and rypreg().These functions permit the simulation of survival times from the AFT,AH, PH, PO, EH and YP models, respectively.

All r*reg() functions share the same basic structure.Specifically, the user must provide a numeric vector u ofquantiles used in (5), a formula containing informationregarding the structure of the linear predictors, a vector betaof regression coefficients (and additionally a vector phi for theEH and YP models), which should be compatible with the model matrixinduced by the formula argument, a dist/baseline argumentused to specify the chosen baseline distribution, along with anyadditional arguments passed to the internal function qsurvthrough the … argument.

A distinct feature of the functions r*reg() is that they return avector of generated survival times rather than a full data setcontaining additional information regarding covariates and a failureindicator variable. This brings more flexibility for the user tosimulate survival data with different characteristics and levels ofcomplexity, as we shall demonstrate throughout the examples presented inthe paper.

As a first example, suppose we want to simulate a sample of type Iright-censored survival data, assuming that the failure times aregenerated from an accelerated failure time model with loglogisticbaseline distribution. In addition, assume that we wish to consider twoexploratory variables, say age and sex, and we want to include aninteraction effect between them. Such a task can be easily accomplishedby using the function raftreg along with the functionqllogis available in the package flexsurv, as illustratedbelow:

library(flexsurv)

library(rsurv)

library(survstan)

set.seed(1234567890)

n<-1000

tau<-10#maximumfollowuptime

simdata<-data.frame(

age=rnorm(n),

sex=sample(c("f","m"),size=n,replace=TRUE)

)|>

mutate(

t=raftreg(runif(n),~age*sex,beta=c(1,2,-0.5),

dist="llogis",shape=1.5,scale=1),

)|>

rowwise()|>

mutate(

time=min(t,tau),

status=as.numeric(time==t)

)

glimpse(simdata)

## Rows: 1,000## Columns: 5## Rowwise:## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.~## $ sex <chr> "m", "f", "f", "m", "m", "m", "m", "f", "f", "m", "f", "m", "m"~## $ t <dbl> 15.2363453, 1.5259533, 2.1783746, 2.4354995, 58.7932958, 16.714~## $ time <dbl> 10.0000000, 1.5259533, 2.1783746, 2.4354995, 10.0000000, 10.000~## $ status <dbl> 0, 1, 1, 1, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, ~

fit<-aftreg(

Surv(time,status)~age*sex,

data=simdata,dist="loglogistic"

)

estimates(fit)

## age sexm age:sexm alpha gamma## 0.9494630 2.0094422 -0.4812641 1.4946497 1.0226847

In our next example we demonstrate how to generate survival data withcrossing survival curves using the YP model. In this example, we assumethat the survival times are randomly right-censored, with the censoringtimes following an exponential distribution.

library(GGally)

set.seed(1234567890)

n<-1000

simdata<-data.frame(

trt=sample(c("chemo","chemo+rad"),size=n,replace=TRUE)

)|>

mutate(

t=rypreg(runif(n),~trt,beta=2,phi=-1.5,

dist="weibull",shape=1.5,scale=1),

c=rexp(n,rate=1)

)|>

rowwise()|>

mutate(

time=min(t,c),

status=as.numeric(time==t)

)|>select(-c(t,c))

glimpse(simdata)

## Rows: 1,000## Columns: 3## Rowwise:## $ trt <chr> "chemo", "chemo", "chemo+rad", "chemo+rad", "chemo+rad", "chemo~## $ time <dbl> 0.65743632, 1.14638933, 0.10715893, 0.09876511, 1.44704010, 0.1~## $ status <dbl> 1, 0, 1, 1, 1, 0, 1, 0, 1, 1, 0, 1, 0, 0, 1, 1, 0, 1, 0, 1, 0, ~

fit<-ypreg(

Surv(time,status)~trt,

data=simdata,dist="weibull"

)

estimates(fit)

## short-trtchemo+rad long-trtchemo+rad alpha gamma## 1.920795 -1.446502 1.404515 1.043123

km<-survfit(Surv(time,status)~trt,data=simdata)

ggsurv(km)+

theme(legend.position="bottom")

Simulation of Survival Data With the Package rsurv (1)

The two examples presented above clearly demonstrate the flexibilityprovided by the functions r*reg() in the simulation of survivaldata with different types of censoring. In the next section we look atthe simulation of survival data with more sophisticated structures.

3 Simulation of survival data with more complexstructures

This section is dedicated to show how the functions discussed in theprevious section can be used to simulate survival data with more complexstructures. It is worth noting that the examples presented here onlycover a subset of the simulation possibilities.

The package rsurv contains some specialized functions that areuseful for simulating clustered survival data, survival data with curefraction, and interval-censored survival data. Other types of survivaldata, such as multivariate survival data, survival data with competingrisks, and survival data with dependent censoring can be generated, forinstance, combining the functions implemented in the package rsurvwith functions available in the R package copula(Jun Yan, 2007; Hofertetal., 2023).

3.1 Cure rate models

Rodrigues etal. (2009) presented an unified formulation for cure ratemodels, through the use of probability-generating functions, thatincludes several cure rate models available in the literature. In theirframework, cure rate models are formulated based on a two-stage processthat possesses an appealing biological motivation in terms ofincidence-latency of disease.

Suppose we are interested in modeling the time to relapse to cancer andlet N𝑁N denote the number of carcinogenic cells (often calledclonogens) left active after a patient has gone through initialtreatment. Then, for an individual with N𝑁N clonogens left active, letξjsubscript𝜉𝑗\xi_{j} be the non-observable random time (i.e., thepromotion time) for the j𝑗j-th clonogen to produce a detectable cancermass, j=0,1,,N𝑗01𝑁j=0,1,...,N.

The time to relapse to cancer, which is in fact the observable quantity,is defined as:

T𝑇\displaystyle T=\displaystyle=I{N=0}+ξ(1)I{N1},𝐼𝑁0subscript𝜉1𝐼𝑁1\displaystyle\infty I\{N=0\}+\xi_{(1)}I\{N\geq 1\},(6)

where ξ(1)<ξ(2),<<ξ(N)\xi_{(1)}<\xi_{(2)},<\cdots<\xi_{(N)}denote the orded values of ξ1,ξ2,,ξNsubscript𝜉1subscript𝜉2subscript𝜉𝑁\xi_{1},\xi_{2},\cdots,\xi_{N}.

Then, under the assumptions that N𝑁N is independent ofξj,j0subscript𝜉𝑗for-all𝑗0\xi_{j},~{}\forall~{}j\geq 0, andξji.i.d.Sξ(|𝚯,𝐱)\xi_{j}\overset{\text{i.i.d.}}{\sim}S_{\xi}(\cdot|\boldsymbol{\Theta},\mathbf{x}),j0𝑗0j\geq 0, it follows that the population survival function can beexpressed as:

Spop(t)subscript𝑆𝑝𝑜𝑝𝑡\displaystyle S_{pop}(t)=\displaystyle=P(N=0)+P(ξ1>t,,ξN>t;N>0)𝑃𝑁0𝑃formulae-sequencesubscript𝜉1𝑡formulae-sequencesubscript𝜉𝑁𝑡𝑁0\displaystyle P(N=0)+P(\xi_{1}>t,...,\xi_{N}>t;N>0)(7)
=\displaystyle=P(N=0)+k1P(N=k)Sξ(t|𝚯,𝐱)k.𝑃𝑁0subscript𝑘1𝑃𝑁𝑘subscript𝑆𝜉superscriptconditional𝑡𝚯𝐱𝑘\displaystyle P(N=0)+\sum_{k\geq 1}P(N=k)S_{\xi}(t|\boldsymbol{\Theta},\mathbf{x})^{k}.

Theorem 2 (Rodrigues etal. (2009)) The survival functionassociated with a r.v. T𝑇T corresponding to a promotion time(two-stage) cure rate model is given by:

Spop(t)=k=0pk{Sξ(t|𝚯,𝐱)}k=Ap(Sξ(t|𝚯,𝐱)),subscript𝑆𝑝𝑜𝑝𝑡superscriptsubscript𝑘0subscript𝑝𝑘superscriptsubscript𝑆𝜉conditional𝑡𝚯𝐱𝑘subscript𝐴𝑝subscript𝑆𝜉conditional𝑡𝚯𝐱\displaystyle S_{pop}(t)=\sum_{k=0}^{\infty}p_{k}\{S_{\xi}(t|\boldsymbol{\Theta},\mathbf{x})\}^{k}=A_{p}\left(S_{\xi}(t|\boldsymbol{\Theta},\mathbf{x})\right),(8)

where pk=P(N=k)subscript𝑝𝑘𝑃𝑁𝑘p_{k}=P(N=k) and Ap()subscript𝐴𝑝A_{p}(\cdot) is theprobability-generating function of N𝑁N.

The next corollary of Theorems 1 and 2 provides a general expression tosimulate survival data with a cure fraction.

Corollary 1 Given a cure fraction probability π=P(N=0)𝜋𝑃𝑁0\pi=P(N=0)and a survival function Sξ(t|𝚯,𝐱)subscript𝑆𝜉conditional𝑡𝚯𝐱S_{\xi}(t|\boldsymbol{\Theta},\mathbf{x})satisfying (4), then time to relapse to cancer T𝑇T can begenerated as follows:

T={,ifU<πa1(g1(Ap1(U))),ifUπ,𝑇casesif𝑈𝜋superscript𝑎1superscript𝑔1superscriptsubscript𝐴𝑝1𝑈if𝑈𝜋T=\left.\begin{cases}\infty,&\text{if }U<\pi\\a^{-1}\left(g^{-1}\left(A_{p}^{-1}(U)\right)\right),&\text{if }U\geq\pi\end{cases},\right.(9)

where UU(0,1)similar-to𝑈𝑈01U\sim U(0,1).

The proof of the corollary in show in the Appendix.

It is important to note from Equation (9) that, givenV=Ap1(U)𝑉superscriptsubscript𝐴𝑝1𝑈V=A_{p}^{-1}(U), all the time generation procedures describedpreviously can be straightforwardly applied in the generation ofsurvival times in the presence of a cure fraction, provided that theinverse of the probability generating function, Ap1(u)superscriptsubscript𝐴𝑝1𝑢A_{p}^{-1}(u), isknown.

distributionAp(s)subscript𝐴𝑝𝑠A_{p}(s)Ap1(u)superscriptsubscript𝐴𝑝1𝑢A_{p}^{-1}(u)
Bernoulli(1μ)+μs1𝜇𝜇𝑠(1-\mu)+\mu s(u1+μ)/μ𝑢1𝜇𝜇(u-1+\mu)/\mu
Poissonexp{μ(1s)}𝜇1𝑠\exp\{-\mu(1-s)\}[log(μ)+μ]/μdelimited-[]𝜇𝜇𝜇[\log(\mu)+\mu]/\mu
Negative Binomial1(uζ1)/ζμ1superscript𝑢𝜁1𝜁𝜇1-(u^{-\zeta}-1)/\zeta\mu[1+ζμ]1/ζsuperscriptdelimited-[]1𝜁𝜇1𝜁\left[1+\zeta\mu\right]^{-1/\zeta}
Bellexp{esθeθ}superscript𝑒𝑠𝜃superscript𝑒𝜃\exp\left\{e^{s\theta}-e^{\theta}\right\}log(log(u)+eθ)𝑢superscript𝑒𝜃\log(\log(u)+e^{\theta})

Here, E(N)=μ𝐸𝑁𝜇E(N)=\mu and θ=W0(μ)𝜃subscript𝑊0𝜇\theta=W_{0}(\mu) is the Lambert W function.

Common distributions assumed for the incidence sub-model (i.e.,the distribution of the latent r.v. N𝑁N) are the Bernoulli, Poisson,negative binomial and the Bell distributions. Covariates are usuallyincluded in the incidence sub-model. If 𝐳𝐳\mathbf{z} is a1×q1𝑞1\times q vector of covariates, and 𝜿𝜿\boldsymbol{\kappa} is thecorresponding q×1𝑞1q\times 1 vector of regression coefficients, thanπ=P(N=0)𝜋𝑃𝑁0\pi=P(N=0) can be related to the linear predictor𝐳𝜿𝐳𝜿\mathbf{z}\boldsymbol{\kappa} through appropriate choices of linkfunctions.

The inverse of the probability generating functions presented in Table1 are implemented in the function inv_pgf().Similarly to the r*reg() functions, the functioninv_pgf() requires a formula argument specifying thestructure of the linear predictor, along with a kappa argumentpassing the vector of regression coefficients, that should be compatiblewith the model matrix induced by the formula argument, and anincidence argument determining the incidence sub-model, and itscorresponding link function. Specifically, the functionrsurv::bernoulli() allows the specification of logit, probit,cloglog, and cauchit link functions, whereas the log, identity andsquare root links can be used with the functionsrsurv::poisson(), rsurv::bell() andrsurv::negbin(). When the negative binomial distribution isassumed for the incidence sub-model, the function inv_pgf()further requires an additional argument zeta, representing theextra negative-binomial parameter.

The next example illustrates how survival data with cure fraction can beeasily generated using the package rsurv. In that example weconsider a regression structure only for the incidence sub-model. Morespecifcally, we simulate survival data from a mixture cure rate modelwith a probit link function. We further assume that the survival timesare randomly right-censored.

library(rsurv)

library(GGally)

#fixingtheseedfortherng:

set.seed(1234567890)

kappa<-c(0.5,1.5,-1.1)

n<-1000

#generatingthesetofexplanatoryvariables:

simdata<-data.frame(

trt=sample(c("A","B"),size=n,replace=TRUE),

age=rnorm(n)

)

#generatingthedataset:

v<-inv_pgf(

~trt+age,

incidence=bernoulli("probit"),

kappa=kappa,

data=simdata

)

simdata<-simdata|>

mutate(

t=qexp(v,rate=1,lower.tail=FALSE),

c=rexp(n,rate=1)

)|>

rowwise()|>

mutate(

time=min(t,c),

status=as.numeric(time==t)

)|>

select(-c(t,c))

glimpse(simdata)

## Rows: 1,000## Columns: 4## Rowwise:## $ trt <chr> "A", "A", "B", "B", "B", "A", "B", "A", "B", "A", "A", "A", "A"~## $ age <dbl> 0.2193297, 0.9952571, -0.4572095, 0.4864704, 1.9512081, 0.49597~## $ time <dbl> 0.17789062, 0.17142535, 0.50301782, 0.64774117, 0.40046281, 0.4~## $ status <dbl> 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1, 1, ~

km<-survfit(Surv(time,status)~trt,data=simdata)

ggsurv(km)+

ylim(0,1)

Simulation of Survival Data With the Package rsurv (2)

3.2 Fraily models

Frailty models are often employed in survival analysis to accommodateunknown and/or unobserved risk factors through the inclusion of randomeffects, known as frailties (a term coined by Vaupeletal. (1979) torepresent individual’s heterogeneity). These type of models have becomeincreasingly popular in multivariate survival analysis since they allowone to take into account the association among individual survival timeswithin subgroups (or clusters) of subjects.

Frailty termExpected valueVariance
wN(0,σ2)similar-to𝑤N0superscript𝜎2w\sim\mbox{N}(0,\sigma^{2})0σ2superscript𝜎2\sigma^{2}
zGamma(1/σ2,1/σ2)similar-to𝑧Gamma1superscript𝜎21superscript𝜎2z\sim\mbox{Gamma}(1/\sigma^{2},1/\sigma^{2})1σ2superscript𝜎2\sigma^{2}
zPS(α)similar-to𝑧PS𝛼z\sim\mbox{PS}(\alpha)--

The parameter 0<α20𝛼20<\alpha\leq 2 is known as the stability parameter of the PS distribution.

The package rsurv permits the inclusion of frailty terms throughthe linear predictors, i.e.,η1=𝐱𝜷+wsubscript𝜂1𝐱𝜷𝑤\eta_{1}=\mathbf{x}\boldsymbol{\beta}+w andη2=𝐱ϕ+wsubscript𝜂2𝐱bold-italic-ϕ𝑤\eta_{2}=\mathbf{x}\boldsymbol{\phi}+w, where w𝑤w is thefrailty term. This convention adopted by the package rsruv allowsthe inclusion of frailties into the linear predictors using thestats::offset() function.

The function rfrailty() implemented in the package rsurvfacilitates the process of simulating survival data with frailties. Thisfunction permits random generation of frailties from the most commonlyfrailty distributions used in practice, namely the gamma distribution,the gaussian distribution, and the positive stable (ps) distribution,which are briefly described in Table 2. Since thegamma and positive stable distributions have support on +superscript\Re^{+}, bydefault, the function rfrailty() returns samples from positiverandom frailties in the log-scale, that is, w=log(z)𝑤𝑧w=\log(z).

Our next example shows how the functions r*reg() andrfrailty() can be combined to generate survival data from sharedfrailty models. For this type of survival frailty models, elementsbelonging to the same cluster share the same random effect (frailty). Togenerate frailties using the function rfrailty() we need to passa vector of length n𝑛n containing information regarding the clusterconfiguration to the cluster argument, the assumed distributionto the frailty argument, and finally, the parameter of thefrailty distribution (sigma or alpha, depending on thechoice of passed to frailty argument).

library(rsurv)

library(frailtyEM)

set.seed(1234567890)

n<-1000#samplesize

L<-100#numberofclusters

simdata<-data.frame(

id=rep(1:L,each=n/L),

age=rnorm(n),

sex=sample(c("f","m"),size=n,replace=TRUE)

)|>

mutate(

frailty=rfrailty(cluster=id,frailty="gamma",sigma=0.5),

t=rphreg(u=runif(n),~age*sex+offset(frailty),

beta=c(1,2,-0.5),dist="exp",rate=1),

c=runif(n,0,5)

)|>

rowwise()|>

mutate(

time=min(t,c),

status=as.numeric(t<c)

)|>

select(-c(t,c))

glimpse(simdata)

## Rows: 1,000## Columns: 6## Rowwise:## $ id <int> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 2, 2, 3,~## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1~## $ sex <chr> "m", "f", "f", "m", "m", "m", "m", "f", "f", "m", "f", "m", "m~## $ frailty <dbl> -0.15917714, -0.15917714, -0.15917714, -0.15917714, -0.1591771~## $ time <dbl> 0.066662943, 0.370877454, 3.527734703, 0.109155541, 0.11269257~## $ status <dbl> 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1,~

em<-emfrail(Surv(time,status)~age*sex+cluster(id),

distribution=emfrail_dist(dist="gamma"),

data=simdata)

summary(em)

## Call:## emfrail(formula = Surv(time, status) ~ age * sex + cluster(id),## data = simdata, distribution = emfrail_dist(dist = "gamma"))#### Regression coefficients:## coef exp(coef) se(coef) adj. se z p## age 0.9711 2.6409 0.0675 0.0682 14.2469 0## sexm 2.1159 8.2973 0.0949 0.0976 21.6777 0## age:sexm -0.4746 0.6221 0.0811 0.0813 -5.8405 0## Estimated distribution: gamma / left truncation: FALSE#### Fit summary:## Commenges-Andersen test for heterogeneity: p-val 1.25e-19## no-frailty Log-likelihood: -4679.899## Log-likelihood: -4636.705## LRT: 1/2 * pchisq(86.4), p-val 7.39e-21#### Frailty summary:## estimate lower 95% upper 95%## Var[Z] 0.233 0.152 0.350## Kendall’s tau 0.105 0.071 0.149## Median concordance 0.102 0.068 0.145## E[logZ] -0.121 -0.185 -0.078## Var[logZ] 0.263 0.164 0.418## theta 4.284 2.859 6.570## Confidence intervals based on the likelihood function

3.3 Interval-censored survivaldata

Consider a nonnegative random variable T𝑇T representing the time untilthe occorrence of an event of interest. According to Sun (2006), anobservation on T𝑇T is interval-censored when its exact value isunknow, and only an interval of the form (L,R]𝐿𝑅(L,R] can be observed, sothat

T(L,R],𝑇𝐿𝑅T\in(L,R],(10)

where LR𝐿𝑅L\leq R.

It is easy to see from (10) that right and left-censoredobservations on T𝑇T occur when R=𝑅R=\infty and L=0𝐿0L=0,respectively. Naturally, when L=R𝐿𝑅L=R, T𝑇T can be observed exactly.

Two important types of interval-censored survival data that often arisein practice are the so-called type I (also known as current statussurvival data in the literature) and type II interval-censored survivaldata. Type I interval-censored survival data occurs when all observedintervals result in either left-censored (L=0𝐿0L=0) or right-censored(R=𝑅R=\infty) observations. Type II interval-censored survival data,on the other hand, emerges when we have at least one finite intervalsatisfying L>0𝐿0L>0.

The generation of type I and type II interval-censored survival data isfacilitated by the function rinterval available in the packagersurv. For the random generation of type II interval-censoredsurvival data, the algorithm described in Kiani andArasan (2012), whichemulates a situation in which patients are expected to attendpre-scheduled visits to the physician with attendance probability p𝑝p,is implemented. The function rinterval requires an argumenttime passing a sample of failure times, an argument tauthat can be either a vector of censoring times (for type Iinterval-censored survival data) or a time-grid of scheduled visits (fortype II interval-censored survival data), and an argument typeindicating the type of interval-censored survival data. Finally, theprob argument, only evaluated when type = II, can beregarded as the attendance probability of a scheduled visit.

The code presented below examplifies how to simulate type Iinterval-censored survival data using the function functionrinterval(). There, it is assumed that the exact failure timesare generated from a PH model with an exponential baseline distribution,whereas the censoring times are generated from a Weibull distribution.

library(rsurv)

library(icenReg)

set.seed(1234567890)

n<-300

covariates<-data.frame(

age=rnorm(n),

sex=sample(c("f","m"),size=n,replace=TRUE)

)

#typeIintervalcensoredsurvivaldata:

simdata1<-covariates|>

mutate(

time=rphreg(u=runif(n),~age+sex,beta=c(1,0.5),dist="exp",rate=1),

tau=rweibull(n,scale=2,shape=1.5),

rinterval(time,tau,type="I")

)

glimpse(simdata1)

## Rows: 300## Columns: 6## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.3~## $ sex <chr> "m", "f", "m", "m", "m", "f", "m", "f", "f", "m", "m", "f", "f",~## $ time <dbl> 0.05387373, 0.05326879, 0.16546309, 2.17932784, 0.04415598, 0.01~## $ tau <dbl> 0.8276881, 1.4807894, 3.2631929, 3.3893988, 2.7803610, 0.4764430~## $ left <dbl> 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000, 0.0000000~## $ right <dbl> 0.8276881, 1.4807894, 3.2631929, 3.3893988, 2.7803610, 0.4764430~

fit1<-ic_par(

cbind(left,right)~age+sex,data=simdata1,

model="ph",dist="exponential"

)

summary(fit1)

#### Model: Cox PH## Dependency structure assumed: Independence## Baseline: exponential## Call: ic_par(formula = cbind(left, right) ~ age + sex, data = simdata1,## model = "ph", dist = "exponential")#### Estimate Exp(Est) Std.Error z-value p## log_scale -0.3877 0.6786 0.1038 -3.736 1.871e-04## age 0.9186 2.5060 0.1241 7.402 1.341e-13## sexm 0.2245 1.2520 0.1965 1.142 2.534e-01#### final llk = -94.91423## Iterations = 9

The next example illustrates how type II interval-censored survival datacan be generated using the function rinterval(). In that example,we assume that patients are expected to return to the physician atcertain pre-specified times, with attendancy probability equals to 0.7.

#typeIIintervalcensoredsurvivaldata:

simdata2<-covariates|>

mutate(

time=raftreg(u=runif(n),~age+sex,beta=c(1,0.5),dist="exp",rate=1),

rinterval(time,tau=seq(0,5,by=1),type="II",prob=0.7)

)

glimpse(simdata2)

## Rows: 300## Columns: 5## $ age <dbl> 1.34592454, 0.99527131, 0.54622688, -1.91272392, 1.92128431, 1.3~## $ sex <chr> "m", "f", "m", "m", "m", "f", "m", "f", "f", "m", "m", "f", "f",~## $ time <dbl> 2.583140967, 2.152683999, 0.350258378, 0.045283358, 20.802515190~## $ left <dbl> 2, 1, 0, 0, 5, 2, 0, 1, 0, 1, 0, 0, 1, 2, 2, 0, 0, 1, 3, 5, 2, 3~## $ right <dbl> 3, Inf, 1, 1, Inf, 3, 1, 2, 1, 3, 1, 1, 2, 4, 5, 2, 1, 2, Inf, I~

fit2<-ic_par(

cbind(left,right)~age+sex,data=simdata2,

model="aft",dist="exponential"

)

summary(fit2)

#### Model: Accelerated Failure Time## Dependency structure assumed: Independence## Baseline: exponential## Call: ic_par(formula = cbind(left, right) ~ age + sex, data = simdata2,## model = "aft", dist = "exponential")#### Estimate Exp(Est) Std.Error z-value p## log_scale 0.3638 1.439 0.0716 5.081 3.747e-07## age 1.0450 2.842 0.0908 11.500 0.000e+00## sexm 0.4921 1.636 0.1480 3.326 8.825e-04#### final llk = -266.4422## Iterations = 7

3.4 Multivariate survival data, dependent censoring andcompetingrisks

Copulas provide a way to construct multivariate distributions based uponindependent marginal distributions (Hofertetal., 2018). A multivariatesurvival function can be constructed by the specification of independentmarginal survival models and an appropriated copula function.

Let Cθ(u1,,ud)subscript𝐶𝜃subscript𝑢1subscript𝑢𝑑C_{\theta}(u_{1},\cdots,u_{d}) a d-variated copula. Then, itfollows that the joint survival function of (T1,,Td)subscript𝑇1subscript𝑇𝑑(T_{1},\cdots,T_{d})can be expressed as:

S(t1,,td)=Cθ(S1(t1|𝚯1,𝐱1),,Sd(td|𝚯d,𝐱d)).𝑆subscript𝑡1subscript𝑡𝑑subscript𝐶𝜃subscript𝑆1conditionalsubscript𝑡1subscript𝚯1subscript𝐱1subscript𝑆𝑑conditionalsubscript𝑡𝑑subscript𝚯𝑑subscript𝐱𝑑S(t_{1},\cdots,t_{d})=C_{\theta}\left(S_{1}(t_{1}|\boldsymbol{\Theta}_{1},\mathbf{x}_{1}),\cdots,S_{d}(t_{d}|\boldsymbol{\Theta}_{d},\mathbf{x}_{d})\right).(11)

To generate a random sample of 𝐓=(T1,,Td)𝐓subscript𝑇1subscript𝑇𝑑\mathbf{T}=(T_{1},\cdots,T_{d})it is sufficient to generate a random sample from(U1,,Ud)Cθ(u1,,ud)similar-tosubscript𝑈1subscript𝑈𝑑subscript𝐶𝜃subscript𝑢1subscript𝑢𝑑(U_{1},\cdots,U_{d})\sim C_{\theta}(u_{1},\cdots,u_{d}) andthen takeTi=Si1(Ui|𝚯i,𝐱i)subscript𝑇𝑖superscriptsubscript𝑆𝑖1conditionalsubscript𝑈𝑖subscript𝚯𝑖subscript𝐱𝑖T_{i}=S_{i}^{-1}(U_{i}|\boldsymbol{\Theta}_{i},\mathbf{x}_{i}).

library(copula)

library(rsurv)

set.seed(1234567890)

n<-1000#samplesize

tau<-0.5#Kendallstaucorrelation

theta<-iTau(copula=claytonCopula(),tau=tau)

clayton<-claytonCopula(param=theta,dim=2)

u<-rCopula(clayton,n=n)

#simulatingthefailuretimes:

simdata<-data.frame(

age=rnorm(n),

sex=sample(c("f","m"),size=n,replace=TRUE)

)|>

mutate(

t1=rphreg(u=u[,1],~age+sex,beta=c(1,1.2),dist="exp",rate=2),

t2=rporeg(u=u[,2],~age+sex,beta=c(0.8,1.1),dist="exp",rate=1),

)

glimpse(simdata)

## Rows: 1,000## Columns: 4## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867, -0~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f", "~## $ t1 <dbl> 0.016749142, 0.167458832, 0.245770610, 0.695707186, 0.061162858, 0~## $ t2 <dbl> 0.084588793, 0.583402263, 0.175592119, 0.288876906, 0.298716691, 0~

#checkingoutthecorrelation:

simdata|>

select(t1,t2)|>

cor(method="kendall")

## t1 t2## t1 1.0000000 0.5306627## t2 0.5306627 1.0000000

#adding(right)censoring:

simdata1<-simdata|>

mutate(

c=runif(n,0,5)#randomcensoring

)|>

rowwise()|>

mutate(

t1=min(t1,c),

t2=min(t2,c),

status1=as.numeric(t1<c),

status2=as.numeric(t2<c),

)

glimpse(simdata1)

## Rows: 1,000## Columns: 7## Rowwise:## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f~## $ t1 <dbl> 0.016749142, 0.167458832, 0.245770610, 0.695707186, 0.06116285~## $ t2 <dbl> 0.084588793, 0.583402263, 0.175592119, 0.288876906, 0.29871669~## $ c <dbl> 2.26609325, 4.57257662, 3.63517003, 2.02494996, 2.66336179, 2.~## $ status1 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0,~## $ status2 <dbl> 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 1, 1, 1, 1, 1, 0,~

#checkingoutthecorrelation:

simdata1|>

select(t1,t2)|>

cor(method="kendall")

## t1 t2## t1 1.0000000 0.5121882## t2 0.5121882 1.0000000

In the next example we show how to simulate survival data with dependentcensoring. In that case, the observable survival time is given byY=min{T,CA}𝑌𝑇𝐶𝐴Y=\min\{T,C\,A\}, where T𝑇T, C𝐶C, and A𝐴A represents thetime to failure, the dependent censoring time, and the censoring timedue to random censoring.

simdata2<-simdata|>

mutate(

a=runif(n,0,5)#randomcensoring

)|>

rowwise()|>

mutate(

y=min(t1,t2,a),

status1=as.numeric(y==t1),

status2=as.numeric(y==t2),

)|>

select(-c(t1,t2))

glimpse(simdata2)

## Rows: 1,000## Columns: 6## Rowwise:## $ age <dbl> -0.17469076, -0.15529194, -1.03781319, -1.56364665, 1.03961867~## $ sex <chr> "m", "m", "f", "f", "f", "m", "m", "m", "m", "f", "f", "f", "f~## $ a <dbl> 3.1886887, 3.6066255, 3.0834830, 0.8921447, 0.8049795, 0.96584~## $ y <dbl> 0.016749142, 0.167458832, 0.175592119, 0.288876906, 0.06116285~## $ status1 <dbl> 1, 1, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 1, 1, 1, 1, 1, 0, 1,~## $ status2 <dbl> 0, 0, 1, 1, 0, 0, 0, 0, 1, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0,~

Notice that the example presented above can be viewed as a particularcase of competing risks survival data. In the next example, we considera scenario where elements are subjected to three competing causes offailure.

n<-1000

tau<-0.5

theta<-iTau(copula=gumbelCopula(),tau=tau)

gumbel<-gumbelCopula(param=theta,dim=3)

u<-rCopula(gumbel,n=n)

simdata<-data.frame(

age=rnorm(n),

sex=sample(c("f","m"),size=n,replace=TRUE)

)|>

mutate(

t1=rphreg(u=u[,1],~age+sex,beta=c(1,1.2),dist="lnorm",meanlog=0,sdlog=1),

t2=rphreg(u=u[,2],~age+sex,beta=c(0.8,1.1),dist="exp",rate=1),

t3=rphreg(u=u[,3],~age+sex,beta=c(0.7,1.0),dist="exp",rate=1),

a=runif(n,0,5)#randomcensoring

)|>

rowwise()|>

mutate(

y=min(t1,t2,t3,a),

status1=as.numeric(y==t1),

status2=as.numeric(y==t2),

status3=as.numeric(y==t3),

)|>

select(-c(t1,t2,t3))

glimpse(simdata)

## Rows: 1,000## Columns: 7## Rowwise:## $ age <dbl> -0.53451004, -1.48095715, -0.76892438, 0.43319025, 0.76357712,~## $ sex <chr> "m", "m", "m", "m", "m", "m", "m", "f", "f", "f", "f", "m", "f~## $ a <dbl> 2.17617891, 1.49961178, 4.65348449, 1.04816011, 1.12082428, 0.~## $ y <dbl> 0.18034466, 0.51001897, 0.07337358, 0.12327447, 0.13908667, 0.~## $ status1 <dbl> 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0,~## $ status2 <dbl> 0, 0, 1, 1, 0, 0, 1, 0, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0, 1, 1, 0,~## $ status3 <dbl> 1, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 1, 1, 0, 0, 0, 1, 1, 0, 0, 1,~

4 Conclusions

This paper has presented the new R package rsurv. Theproposed package is designed for general survival data simulation. Itcombines a wide range of survival regression models, which includes AFT,PH, PO, AH, YP and EH models, with an unlimited number of baselinesurvival distributions. The package rsurv is suitable togenerating right, left and interval-censored survival data underindependent and dependent censoring mechanisms. The functions providedin the package rsurv also allows the generation of survival datawith cure fraction, survival data with random effects (frailties), andcompeting risks survival data. Another nice characteristic of thepackage rsurv regards the fact that linear predictors arespecified through R formulas. This, together with the use of dplyrverbs, makes the survival data simulation process much more flexible andintuitive.

Future research involves the development of new functions for simulatingfrom regression models with time-dependent covariates, and time-varyingcovariate effects. Further possible extensions include theimplementation of methods to simulate survival data from recurrentevetns, and simulate data from truncaded survival data.

Appendix

Proof of Theorem 1

We start the proof of Theorem 1 by the YP model given in Equation(2). For this step of the proof, defineκ1=eη1subscript𝜅1superscript𝑒subscript𝜂1\kappa_{1}=e^{\eta_{1}} and κ2=eη2subscript𝜅2superscript𝑒subscript𝜂2\kappa_{2}=e^{\eta_{2}}.

  • i)

    As S0(t|𝜽)subscript𝑆0conditional𝑡𝜽S_{0}(t|\boldsymbol{\theta}) does not depend on 𝜼𝜼\boldsymbol{\eta}, we have that a(t,𝜼)=t𝑎𝑡𝜼𝑡a(t,\boldsymbol{\eta})=t. Using the fact that R0(t|𝜽)=11+S0(t|𝜽)=11+ssubscript𝑅0conditional𝑡𝜽11subscript𝑆0conditional𝑡𝜽11𝑠R_{0}(t|\boldsymbol{\theta})=\displaystyle\frac{1}{1+S_{0}(t|\boldsymbol{\theta})}=\frac{1}{1+s}, it follows that

    S(t|𝚯,𝐱)=[1+R0(t|𝜽)κ1κ2]κ2=[1+κ1κ211+s]κ2=g(s,𝜼).𝑆conditional𝑡𝚯𝐱superscriptdelimited-[]1subscript𝑅0conditional𝑡𝜽subscript𝜅1subscript𝜅2subscript𝜅2superscriptdelimited-[]1subscript𝜅1subscript𝜅211𝑠subscript𝜅2𝑔𝑠𝜼S(t|\boldsymbol{\Theta},\mathbf{x})=\left[1+R_{0}(t|\boldsymbol{\theta})\frac{\kappa_{1}}{\kappa_{2}}\right]^{-\kappa_{2}}=\left[1+\frac{\kappa_{1}}{\kappa_{2}}\frac{1}{1+s}\right]^{-\kappa_{2}}=g(s,\boldsymbol{\eta}).
  • ii)

    We have that

    u=S(t|𝚯,𝐱)=[1+κ1κ211+s]κ2𝑢𝑆conditional𝑡𝚯𝐱superscriptdelimited-[]1subscript𝜅1subscript𝜅211𝑠subscript𝜅2\displaystyle u=S(t|\boldsymbol{\Theta},\mathbf{x})=\left[1+\frac{\kappa_{1}}{\kappa_{2}}\frac{1}{1+s}\right]^{-\kappa_{2}}\displaystyle\Leftrightarrowv=[u1/κ21]κ2κ1=11+s𝑣delimited-[]superscript𝑢1subscript𝜅21subscript𝜅2subscript𝜅111𝑠\displaystyle v=\left[u^{-1/\kappa_{2}}-1\right]\frac{\kappa_{2}}{\kappa_{1}}=\frac{1}{1+s}
    \displaystyle\Leftrightarrows=11+v=g1(u,𝜼).𝑠11𝑣superscript𝑔1𝑢𝜼\displaystyle s=\frac{1}{1+v}=g^{-1}(u,\boldsymbol{\eta}).

    Therefore, since a(t,𝜼)=t𝑎𝑡𝜼𝑡a(t,\boldsymbol{\eta})=t, we have that

    t=S01(11+v)=S01(g1(u)).𝑡superscriptsubscript𝑆0111𝑣superscriptsubscript𝑆01superscript𝑔1𝑢t=S_{0}^{-1}\left(\frac{1}{1+v}\right)=S_{0}^{-1}\left(g^{-1}(u)\right).

Now we present the proof of Theorem 1 for the EH model given in Equation(3).

  • i)

    We have that a(t,𝜼)=t/eη1𝑎𝑡𝜼𝑡superscript𝑒subscript𝜂1a(t,\boldsymbol{\eta})=t/e^{\eta_{1}}. It follows directly from Equation (3) that

    S(t|𝚯,𝐱)𝑆conditional𝑡𝚯𝐱\displaystyle S(t|\boldsymbol{\Theta},\mathbf{x})=\displaystyle=exp{H0(t/e𝐱𝜷|𝜽)e𝐱(𝜷+ϕ)}\displaystyle\exp\left\{-H_{0}\left(t/e^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta})e^{\mathbf{x}(\boldsymbol{\beta}+\boldsymbol{\phi}}\right)\right\}
    =\displaystyle=S0(t/e𝐱𝜷|𝜽)e𝐱(𝜷+ϕ)subscript𝑆0superscriptconditional𝑡superscript𝑒𝐱𝜷𝜽superscript𝑒𝐱𝜷bold-italic-ϕ\displaystyle S_{0}\left(t/e^{\mathbf{x}\boldsymbol{\beta}}|\boldsymbol{\theta}\right)^{e^{\mathbf{x}(\boldsymbol{\beta}+\boldsymbol{\phi})}}
    =\displaystyle=seη1+η2=g(s,𝜼).superscript𝑠superscript𝑒subscript𝜂1subscript𝜂2𝑔𝑠𝜼\displaystyle s^{e^{\eta_{1}+\eta_{2}}}=g(s,\boldsymbol{\eta}).
  • ii)

    Note that

    u=S(t|𝚯,𝐱)𝑢𝑆conditional𝑡𝚯𝐱\displaystyle u=S(t|\boldsymbol{\Theta},\mathbf{x})=\displaystyle=seη1+η2superscript𝑠superscript𝑒subscript𝜂1subscript𝜂2\displaystyle s^{e^{\eta_{1}+\eta_{2}}}
    \displaystyle\Leftrightarrows=ueη1+η2=g1(u,𝜼).𝑠superscript𝑢superscript𝑒subscript𝜂1subscript𝜂2superscript𝑔1𝑢𝜼\displaystyle s=u^{-e^{\eta_{1}+\eta_{2}}}=g^{-1}(u,\boldsymbol{\eta}).

    Thus,

    a(t)=S01(s)𝑎𝑡superscriptsubscript𝑆01𝑠\displaystyle a(t)=S_{0}^{-1}(s)=\displaystyle=S01(g1(u))t=a1(S01(g1(u))).superscriptsubscript𝑆01superscript𝑔1𝑢𝑡superscript𝑎1superscriptsubscript𝑆01superscript𝑔1𝑢\displaystyle S_{0}^{-1}\left(g^{-1}(u)\right)\Leftrightarrow t=a^{-1}\left(S_{0}^{-1}\left(g^{-1}(u)\right)\right).

Proof of Corollary 1

It follows immediately from Theorem 2 that:

Sξ(t|𝚯,𝐱)=Ap1(Spop(t)).subscript𝑆𝜉conditional𝑡𝚯𝐱superscriptsubscript𝐴𝑝1subscript𝑆𝑝𝑜𝑝𝑡\displaystyle S_{\xi}(t|\boldsymbol{\Theta},\mathbf{x})=A_{p}^{-1}\left(S_{pop}(t)\right).

The rest of the proof follows from (9) and the directapplication of Theorem 1.

References

  • Brilleman etal. (2020)Brilleman, S.L., R.Wolfe, M.Moreno-Betancur, and M.J. Crowther (2020).Simulating survival data using the simsurv R package.Journal of Statistical Software97(3), 1–27.
  • Demarqui (2024a)Demarqui, F. (2024a).rsurv: Random Generation of Survival Data.R package version 0.0.1.
  • Demarqui (2024b)Demarqui, F. (2024b).survstan: Fitting Survival Regression Models via ’Stan’.R package version 0.0.7.1, https://fndemarqui.github.io/survstan/.
  • Demarqui andMayrink (2021)Demarqui, F.N. and V.D. Mayrink (2021).Yang and Prentice model with piecewise exponential baselinedistribution for modeling lifetime data with crossing survival curves.Brazilian Journal of Probability and Statistics35(1), 172 – 186.
  • Etezadi-Amoli andCiampi (1987)Etezadi-Amoli, J. and A.Ciampi (1987).Extended hazard regression for censored survival data withcovariates: A spline approximation for the baseline hazard function.Biometrics43(1), 181–192.
  • Gerds (2023)Gerds, T.A. (2023).prodlim: Product-Limit Estimation for Censored Event HistoryAnalysis.R package version 2023.08.28.
  • Hofertetal. (2018)Hofert, M., I.Kojadinovic, M.Maechler, and J.Yan (2018).Elements of Copula Modeling with R.Springer Use R! Series.
  • Hofertetal. (2023)Hofert, M., I.Kojadinovic, M.Maechler, and J.Yan (2023).copula: Multivariate Dependence with Copulas.R package version 1.1-3.
  • Jun Yan (2007)Jun Yan (2007).Enjoy the joy of copulas: With a package copula.Journal of Statistical Software21(4), 1–21.
  • Kiani andArasan (2012)Kiani, K. and J.Arasan (2012).Simulation of interval censored data in medical and biologicalstudies.International Journal of Modern Physics: ConferenceSeries09, 112–118.
  • Moriña andNavarro (2017)Moriña, D. and A.Navarro (2017).Competing risks simulation with the survsim R package.Communications in Statistics - Simulation andComputation46(7), 5712–5722.
  • R Core Team (2023)R Core Team (2023).R: A Language and Environment for Statistical Computing.Vienna, Austria: R Foundation for Statistical Computing.
  • Rodrigues etal. (2009)Rodrigues, J., V.G. Cancho, M.deCastro, and F.Louzada-Neto (2009).On the unification of the long-term survival models.Statistics and Probability Letters79, 753–759.
  • Sun (2006)Sun, J. (2006).The Statistical Analysis of Interval-Censored Failure TimeData.Springer.
  • Sylvestre etal. (2022)Sylvestre, M.-P., T.Evans, T.MacKenzie, and M.Abrahamowicz (2022).PermAlgo: Permutational algorith to generate event timesconditional on a covariate matrix including time-dependent covariates.R package version 1.2.
  • TorstenHothorn (2020)Torsten Hothorn (2020).Most likely transformations: The mlt package.Journal of Statistical Software92(1), 1–68.
  • Tseng andShu (2011)Tseng, Y.-K. and K.-N. Shu (2011).Efficient estimation for a semiparametric extended hazards model.Communications in Statistics - Simulation andComputation40(2), 258–273.
  • Vaupeletal. (1979)Vaupel, J.W., K.G. Manton, and E.Stallard (1979).The impact of heterogeneity in individual frailty on the dynamics ofmortality.Demography16(3), 439–454.
  • Xiongetal. (2015)Xiong, D., T.Pokaprakarn, H.Udagawa, and N.Rabbee (2015).SimHaz: Simulated Survival and Hazard Analysis forTime-Dependent Exposure.R package version 0.1.
Simulation of Survival Data With the Package rsurv (2024)

FAQs

How to simulate survival time? ›

for an exponential baseline hazard, the survival time t is simulated as: t=−log(U)λexp(βTX) for a Weibull baseline hazard, the survival time t is simulated as: t=(−log(U)λexp(βTX))1/γ

How to do data simulation? ›

For data simulation, statistical modeling techniques such as regression analysis, time series analysis, and Bayesian modeling can be employed. Fitting statistical models to existing data and then utilizing these models to produce simulated data that closely mimics the original dataset are examples of these approaches.

Is it possible to simulate a world? ›

Bostrom's argument rests on the premise that given sufficiently advanced technology, it is possible to represent the populated surface of the Earth without recourse to digital physics; that the qualia experienced by a simulated consciousness are comparable or equivalent to those of a naturally occurring human ...

Can survival time be 0? ›

In survival analysis, time zero (t=0) usually represents the start of a study or the point at which individuals enter the study. Including individuals with t=0 means acknowledging that they were at risk of the event the moment they entered the study, even if the event or censoring happened immediately.

How can I run simulation faster? ›

Below are some tips to consider to further increase the run speed of simulation.
  1. Speed Analysis. ...
  2. Visual Logic Route-in Before Selecting. ...
  3. No Routing Discipline Conflicts. ...
  4. High Volume. ...
  5. Results Synchronization Interval. ...
  6. Simulation Exit Objects. ...
  7. Minimum Wait Time and Shelf Life. ...
  8. Only Use Match on Small Queues.

What is the Weibull survival formula? ›

The median survival time is S(t)=e−λt=0.5 S ( t ) = e − λ t = 0.5 , or tmed=log(2)/λ t m e d = log ⁡ . The Weibull distribution is more appropriate for modeling lifetimes, however. The Weibull hazard function is h(t)=αλ(λt)α−1=αλαtα−1 h ( t ) = α λ ( λ t ) α − 1 = α λ α t α − 1 .

What is used to simulate what happens in the real world? ›

Simulations are usually computer-based, using a software-generated model to provide support for the decisions of managers and engineers as well as for training purposes. Simulation techniques aid understanding and experimentation, as the models are both visual and interactive.

What is time to event survival time? ›

Time-to-event data consist of pairs of observations for each individual: (i) a length of time during which no event was observed, and (ii) an indicator of whether the end of that time period corresponds to an event or just the end of observation.

References

Top Articles
Is Buying a Car a Good Investment? - Experian
What's a Good Profit Margin for Your Small Business?
Spasa Parish
Rentals for rent in Maastricht
159R Bus Schedule Pdf
Sallisaw Bin Store
Black Adam Showtimes Near Maya Cinemas Delano
Www.myschedule.kp.org
Ascension St. Vincent's Lung Institute - Riverside
Understanding British Money: What's a Quid? A Shilling?
Xenia Canary Dragon Age Origins
Momokun Leaked Controversy - Champion Magazine - Online Magazine
‘An affront to the memories of British sailors’: the lies that sank Hollywood’s sub thriller U-571
Tyreek Hill admits some regrets but calls for officer who restrained him to be fired | CNN
Haverhill, MA Obituaries | Driscoll Funeral Home and Cremation Service
Rogers Breece Obituaries
Ems Isd Skyward Family Access
Sauce 423405
Elektrische Arbeit W (Kilowattstunden kWh Strompreis Berechnen Berechnung)
Omni Id Portal Waconia
Kellifans.com
Banned in NYC: Airbnb One Year Later
Four-Legged Friday: Meet Tuscaloosa's Adoptable All-Stars Cub & Pickle
Model Center Jasmin
Ice Dodo Unblocked 76
Is Slatt Offensive
Labcorp Locations Near Me
Storm Prediction Center Convective Outlook
Experience the Convenience of Po Box 790010 St Louis Mo
Fungal Symbiote Terraria
modelo julia - PLAYBOARD
Abby's Caribbean Cafe
Joanna Gaines Reveals Who Bought the 'Fixer Upper' Lake House and Her Favorite Features of the Milestone Project
Tri-State Dog Racing Results
Trade Chart Dave Richard
Lincoln Financial Field Section 110
Free Stuff Craigslist Roanoke Va
Stellaris Resolution
Wi Dept Of Regulation & Licensing
Pick N Pull Near Me [Locator Map + Guide + FAQ]
Horseheads Schooltool
Crystal Westbrooks Nipple
Ice Hockey Dboard
Über 60 Prozent Rabatt auf E-Bikes: Aldi reduziert sämtliche Pedelecs stark im Preis - nur noch für kurze Zeit
Wie blocke ich einen Bot aus Boardman/USA - sellerforum.de
Craigslist Pets Inland Empire
Infinity Pool Showtimes Near Maya Cinemas Bakersfield
Hooda Math—Games, Features, and Benefits — Mashup Math
How To Use Price Chopper Points At Quiktrip
Maria Butina Bikini
Busted Newspaper Zapata Tx
Latest Posts
Article information

Author: Delena Feil

Last Updated:

Views: 6001

Rating: 4.4 / 5 (65 voted)

Reviews: 80% of readers found this page helpful

Author information

Name: Delena Feil

Birthday: 1998-08-29

Address: 747 Lubowitz Run, Sidmouth, HI 90646-5543

Phone: +99513241752844

Job: Design Supervisor

Hobby: Digital arts, Lacemaking, Air sports, Running, Scouting, Shooting, Puzzles

Introduction: My name is Delena Feil, I am a clean, splendid, calm, fancy, jolly, bright, faithful person who loves writing and wants to share my knowledge and understanding with you.