• Ingen resultater fundet

The dynamic discrete time model is where we use the log-likelihood terms, litt), as shown in Equation (1.21) where h is the inverse logit function, h(x) = exp(x)/(1 + exp(x)). This model is suited for situations where the events are observed in intervals and the covariates change at discrete times. However, this is not the case for the hard disk data set. The hard disk data is not reported on monthly precision but on hourly precision. I print the first 10 rows here to illustrate this:

R> hd_dat[1:10, c("serial_number", "model", "tstart", "tstop", "smart_12")]

serial_number model tstart tstop smart_12 505 5XW004AJ ST31500541AS 30.001 40.010 0 506 5XW004AJ ST31500541AS 40.010 43.172 24 507 5XW004AJ ST31500541AS 43.172 56.917 25 508 5XW004Q0 ST31500541AS 40.618 50.962 0 509 5XW004Q0 ST31500541AS 50.962 53.729 54 510 5XW004Q0 ST31500541AS 53.729 54.122 56 511 5XW004Q0 ST31500541AS 54.122 54.424 57 512 5XW004Q0 ST31500541AS 54.424 54.457 58 513 5XW004Q0 ST31500541AS 54.457 54.690 59 514 5XW004Q0 ST31500541AS 54.690 57.193 61

I will explain how the ddhazard implementation discretizes continuous event times in the following paragraphs. I redefinexij as the covariate vector for individualiin the period (ti,j−1, tij].

Next, I redefine the discrete time risk set,Rt, as

Rt={(i, j) : ti,j−1 ≤t−1< ti,j∧Dij = 0} (1.33)

0 10 20 30 40 50 60

−10−9−8−7

Time

Param. HMS5C4040ALE640

0 10 20 30 40 50 60

−11−10−9−8−7

Time

Param. HMS5C4040BLE640

0 10 20 30 40 50 60

−9−8−7−6−5−4−3

Time

Param. HDS5C3030ALA630

0 10 20 30 40 50 60

−8.5−7.5−6.5−5.5

Time

Param. HDS5C4040ALE630

0 10 20 30 40 50 60

−8.0−7.0−6.0−5.0

Time

Param. HDS722020ALA330

0 10 20 30 40 50 60

−8−7−6−5−4

Time

Param. HDS723030ALA640

0 10 20 30 40 50 60

−5−4−3−2−1

Time

Param. ST31500341AS

0 10 20 30 40 50 60

−5.5−5.0−4.5

Time

Param. ST31500541AS

0 10 20 30 40 50 60

−7.0−6.0−5.0

Time

Param. ST4000DM000

Figure 1.8: Predicted parameters for some of the factor levels with the second order random walk model. The blue lines are parameters with the second order random walk model with a fixed effect for the power cycle count and the black lines are parameters with the first order random walk model where all parameters are time-varying. Grey transparent bars indicate the number of individuals at risk for the specific hard disk version. Heights are only comparable within each frame.

Further, I redefine

yijt= 1{Ti∈(max(ti,j−1,t−1),min(tij,t)]t−1<tij≤t} (1.34) lijtt) =yijtlogh(x>ijαt) + (1−yijt) log

1−h(x>ijαt)

(1.35) yijtis a generalization of Equation (1.12) that indicates whether individualiexperiences an event with the jth covariate vector in interval t. The following example will illustrate the impact of discrete time risk sets in Equation (1.33). Suppose we look at intervald−1 and d(the last two intervals) in a model with time-varying covariates. Further, let both the event times and the point at which we observe new covariates happen at continuous points in time.

0 10 20 30 40 50 60 70

0.00.20.4

SMART 12 attribute

Linear predictor term

Figure 1.9: Fixed effects estimates for the SMART 12 attribute on the linear predictor scale.

Figure 1.10 illustrates such a situation. Each horizontal line represents an individual. A cross represents when the covariate values jump for the individual, and a filled circle represents an event that has happened for the individual. Lines that end with an open circle are right censored.

The vertical dashed lines in the figure represent the time interval borders. The first vertical line from the left is the start of interval d−1, the second vertical line is when interval d−1 ends, and intervaldstarts, and the third vertical line is when intervaldends. I will use observation a in Figure 1.10 to illustrate the risk set in Equation (1.33). The covariate vector used in interval d−1 isxa1 asta0 < d−2< ta1. By similar arguments, the covariate vector in interval disxa2. Because we use the risk set in Equation (1.33), we use covariates from 1 for individuals a, c, d, and f for the entire period of intervald−1, even though the covariates change at 2. Furthermore, g is not in either interval, as we only know that it survives parts of interval d−1. Lastly, we never include b as we do not know its covariate vector at the start of intervald.

1.3.1 Continuous Time Model

The continuous time model implemented inddhazard is the model shown in Equation (1.11) in the introduction. The assumptions of the model are that

• the instantaneous hazards are given by exp(xi(t)>α(t)).

• parameters jump at the end of time intervals, i.e., α(t) =αdte where dte gives the ceiling of t. This is illustrated in Figure 1.10 where the parameters jump at the vertical lines.

• the individuals’ covariates jump, i.e., xi(t) = xij where j = {k : ti,k−1 < t ≤ ti,k}. In Figure 1.10, the covariates jump at the crosses.

The instantaneous hazard jumps when either the individual’s covariates jump or the parame-ters jump. Thus, an individual’s event time is piecewise exponentially distributed given the state vectors. The log-likelihood of individualihaving an event at time ti is

log (L(ti0, . . . ,αd)) =xi(ti)>α(ti)− Z ti

0

exp

xi(u)>α(u)

du (1.36)

Interval d−1 Interval d

1 2

1 2 3 4

1 2

1 2 3 4 5

1 2 3

1 2

1 2 3 4

xa1, (ta0,ta1] xa2, (ta1,ta2] xa3, (ta2,ta3]

a b c d

e f g

Figure 1.10: Illustration of a data set with 7 individuals with time-varying covariates. Each horizontal line represents an individual. Each number indicates a start time and stop time in the initial data. A cross indicates that new covariates are observed while a filled circle indicates that the individual has an event. An open circle indicates that the individual is right censored.

Vertical dashed lines are time interval borders. The symbols for the covariate vectors and stop times are shown for observation a.

whereL(·) denotes the likelihood. Because of our assumptions, the complete data log-likelihood in Equation (1.20) simplifies to

L(Q,Q00) =− 1

2(α0−µ0)>Q−100−µ0)

− 1 2

d

X

t=1

t−Fαt−1)>RQ−1R>t−Fαt−1)

− 1

2log|Q0| −d

2log|Q|

+

d

X

t=1

X

(i,j)∈Rt

lijtt) +. . .

(1.37)

where

lijtt) =yijtx>ijαt−exp

x>ijαt

(min{t, tij} −max{t−1, ti,j−1}) (1.38) Thelijt terms are a simplification of Equation (1.36), where I use the assumption that both the covariates, xi(t), and parameters, α(t), are piecewise constant. Further, Rt is the continuous time risk set given by

Rt={(i, j) : ti,j−1 < t∧tij ≥t−1∧Dij = 0} (1.39)

The two conditions inRt are that the observation must start before the interval ends (ti,j−1< t), and end after the interval starts (tij ≥t−1). I will use observation a in Figure 1.10 as an example.

Observation a has two covariate vectors in interval d−1. The first is xa1 as ta0 < d−1 and ta1 > d−2. Similar arguments apply for the covariate vectorxa2.

Example with the continuous model

As mentioned in Section 1.3, the start and stop times in the hard disk failure data set are in fractions of months on a hourly precision. Thus, I can use the continuous model. I fit the model with the EKF with more iterations in the correction step and get the continuous model by setting model = "exponential":

R> system.time(ddfit_cont <- ddhazard(formula = frm, data = hd_dat, by = 1, + max_T = 60, model = "exponential", id = hd_dat$serial_number, Q_0 = diag(

+ 1, 23), Q = diag(.1, 23), control = ddhazard_control(NR_eps = .0001, eps = + .001, LR = .5, method = "EKF")))

user system elapsed 117.6 12.2 26.0

Figure 1.11 shows the first estimated factor levels’ parameters. The results are comparable to what we have seen previously (e.g., see Figure 1.3 where I also used the EKF with more iterations in the correction step but with the discrete time model).