
Increasing the Amount of Available Data by Removing Variables

Chapter 7

Estimation of Missing Values in the COST Data Set

So far, we have seen the effect of including model uncertainty. Both the model uncertainty and the parameter uncertainty should decrease as we see more data. To increase the amount of available data we use three different approaches. First, we remove variables that are clearly not important explanatory variables; then we use BNs and a semi-parametric approach to estimate the missing values of the remaining variables.
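Why removing a variable increases the amount of data can be illustrated with a small sketch. It assumes a complete-case analysis, i.e. that a subject is used only if every included variable is observed; the records below are invented for illustration and are not COST data.

```python
# Toy records; None marks a missing value. Invented data, not from the COST set.
records = [
    {"age": 71, "sex": 1, "hemo": None, "alco": 0},
    {"age": 64, "sex": 0, "hemo": 8.1, "alco": None},
    {"age": 80, "sex": 1, "hemo": 7.4, "alco": 1},
    {"age": 58, "sex": 0, "hemo": None, "alco": 1},
]

def complete_cases(rows, variables):
    """Count rows that have an observed value for every listed variable."""
    return sum(all(row[v] is not None for v in variables) for row in rows)

all_vars = ["age", "sex", "hemo", "alco"]
without_hemo = [v for v in all_vars if v != "hemo"]

n_all = complete_cases(records, all_vars)               # only one fully observed row
n_without_hemo = complete_cases(records, without_hemo)  # rows missing only hemo now count
print(n_all, n_without_hemo)
```

Dropping a variable with many missing values can only keep or raise the number of usable subjects, which is the effect exploited in this chapter.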

7.1 Increasing the Amount of Available Data by Removing Variables

Variable       age  sex    hyp  ihd   apo  odd   alco
Missing cases  0    0      62   76    50   41    163

Variable       dm   smoke  af   hemo  cla  temp  sss
Missing cases  50   174    11   188   113  133   7

Table 7.1: Number of missing values for each variable in the COST data set.

one at a time, we adopt the principle of stepwise selection and remove the variable with the lowest PPP, provided that the PPP is very low, i.e. in the range 0–5%, and preferably that the variable has many missing values. Since we combine BMA with principles of stepwise selection, we refer to this technique as stepwise BMA.

Both alco and hemo have PPP = 0, but since hemo has a larger number of missing values and no time-dependent term, we decide to remove this risk factor first.
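The removal rule can be sketched in a few lines. This is a simplification under stated assumptions: the function name and the 5% threshold argument are illustrative, ties in PPP are broken by the number of missing values only, and the judgment calls made in the text (for example the presence of a time-dependent term) are not modelled. The missing-value counts are those of Table 7.1.

```python
# PPP = 0 for both variables as stated in the text; missing counts from Table 7.1.
candidates = {
    "alco": {"ppp": 0.0, "missing": 163},
    "hemo": {"ppp": 0.0, "missing": 188},
}

def next_variable_to_remove(stats, ppp_threshold=5.0):
    """Pick the variable with the lowest PPP at or below the threshold.

    Ties in PPP are broken in favour of the variable with more missing values.
    Returns None when no variable has a sufficiently low PPP.
    """
    eligible = {name: s for name, s in stats.items() if s["ppp"] <= ppp_threshold}
    if not eligible:
        return None
    return min(eligible, key=lambda name: (eligible[name]["ppp"], -eligible[name]["missing"]))

choice = next_variable_to_remove(candidates)
print(choice)
```

With these inputs the sketch selects hemo, in line with the decision described above.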

The results of removing hemo, which increases the size of the data set to 641 subjects, are presented in Tables 7.2 and 7.3.

Method          age     sex     hyp       ihd       apo    odd
p-value (step)  <0.001  <0.001  0.65 (1)  0.29 (2)  <0.01  0.02
PPPO            100     96.9    0.5       13.0      64.5   39.8

Method          alco      dm    smoke     af        cla    temp
p-value (step)  0.13 (5)  0.02  0.21 (4)  0.06 (8)  <0.01  0.13 (7)
PPPO            0         39.0  1.1       15.0      70.0   6.8

Method          sss     alco*t    smoke*t   sss*t
p-value (step)  <0.001  0.21 (6)  0.18 (3)  <0.001
PPPO            100     2.4       0         100

Table 7.2: p-values and PPPs from stepwise BMA with hemo removed.

Method age sex hyp ihd apo odd alco dm

HRS 1.05 1.37 1.35 1.29 1.32

HRB 1.05 1.35 1.36

HRO 1.05 1.35 1.00 1.03 1.22 1.12 1.00 1.12

smoke af cla temp sss alco*t smoke*t sss*t

HRS 1.39 0.95 1.0020

HRB 0.68 0.95 1.0019

HRO 1.00 1.04 1.30 1.01 0.95 1.0002 1.0000 1.0019

Table 7.3: HRs from stepwise BMA with hemo removed. Max. PMP: 0.09. Total PMP for Top10: 0.47. 62 models included in Occam’s window. HR for xxx∗t is per 100 unit increment.


Now, since alco still has PPP = 0, we decide to remove alco as the next variable. The results of removing alco (and alco∗t), which increases the size of the data set to 655 subjects, are presented in Tables 7.4 and 7.5.

Method          age     sex     hyp       ihd       apo
p-value (step)  <0.001  <0.001  0.62 (2)  0.46 (3)  <0.01
PPPO            100     97.1    0.5       6.3       72.3

Method          odd   dm     smoke     af        cla
p-value (step)  0.02  <0.01  0.27 (4)  0.11 (5)  <0.01
PPPO            43.3  59.1   0.8       7.9       69.3

Method          temp      sss     smoke*t   sss*t
p-value (step)  0.10 (6)  <0.001  0.89 (1)  <0.001
PPPO            11.7      100     3.4       100

Table 7.4: p-values and PPPs from stepwise BMA with alco removed.

Method age sex hyp ihd apo odd dm

HRS 1.05 1.36 1.36 1.29 1.35

HRB 1.05 1.34 1.36 1.37

HRO 1.05 1.35 1.00 1.01 1.25 1.13 1.21

smoke af cla temp sss smoke*t sss*t

HRS 1.39 0.95 1.0019

HRB 0.69 0.95 1.0019

HRO 1.00 1.02 1.30 1.02 0.95 1.00 1.0019

Table 7.5: HRs from stepwise BMA with alco removed. Max. PMP: 0.10. Total PMP for Top10: 0.55. 54 models included in Occam’s window. HR for xxx∗t is per 100 unit increment.

Next, hyp is actually the risk factor with the lowest PPP, but since the PPP for smoke is just marginally higher, and we have 174 missing values for smoke versus 62 for hyp, we decide to remove smoke. This also has the positive side effect of removing smoke∗t. The results of removing smoke (and smoke∗t), which increases the size of the data set to 725 subjects, are presented in Tables 7.6 and 7.7.

This time we remove hyp, since hyp and ihd have a comparable number of missing values. Furthermore, we expect ihd to be removed next, since BMA has shown quite consistent results so far. Removing a handful of subjects will probably not alter the results for ihd markedly. The results of removing hyp, which increases the size of the data set to 730 subjects, are presented in Tables 7.8 and 7.9.

Method          age     sex     hyp       ihd       apo    odd
p-value (step)  <0.001  <0.001  0.46 (2)  0.69 (1)  <0.01  0.02
PPPO            100     100     0.6       0.7       78.8   49.0

Method          dm     af    cla    temp  sss     sss*t
p-value (step)  <0.01  0.05  <0.01  0.03  <0.001  <0.001
PPPO            72.2   16.7  64.7   31.7  100     100

Table 7.6: p-values and PPPs from stepwise BMA with smoke removed.

Method age sex hyp ihd apo odd

HRS 1.05 1.42 1.34 1.27

HRB 1.05 1.36 1.35

HRO 1.05 1.39 1.00 1.00 1.27 1.14

dm af cla temp sss sss*t

HRS 1.36 1.25 1.37 1.21 0.95 1.0019

HRB 1.39 0.69 0.95 1.0019

HRO 1.26 1.04 1.25 1.06 0.95 1.0019

Table 7.7: HRs from stepwise BMA with smoke removed. Max. PMP: 0.11. Total PMP for Top10: 0.55. 46 models included in Occam’s window. HR for sss∗t is per 100 unit increment.

Method          age     sex     ihd       apo    odd   dm
p-value (step)  <0.001  <0.001  0.77 (1)  <0.01  0.02  <0.01
PPPO            100     100     0.6       75.1   52.5  73.5

Method          af        cla   temp  sss     sss*t
p-value (step)  0.05 (2)  0.01  0.03  <0.001  <0.001
PPPO            15.7      59.9  36.4  100     100

Table 7.8: p-values and PPPs from stepwise BMA with hyp removed.

As expected, ihd still has a very low PPP, and the results of removing ihd, which increases the size of the data set to 742 subjects, are presented in Tables 7.10 and 7.11.

Now all PPPs are (significantly) different from zero, and we decide not to remove any more variables. We notice that throughout the selection process, the PPPs have changed considerably, while the p-values are more or less the same. In Tables 7.12 and 7.13 we summarize the changes in p-values and PPPs for each of the remaining variables. In stepwise selection, only the final p-values for dm and temp are noticeably different from their “starting” values, and all we have learned is that dm and temp are also significant explanatory risk factors.

Method age sex ihd apo odd dm
HRS 1.05 1.40 1.36 1.27 1.36
HRB 1.05 1.37 1.34 1.39
HRO 1.05 1.40 1.00 1.25 1.15 1.27

Method af cla temp sss sss*t
HRS 1.35 1.20 0.95 1.0019
HRB 0.70 0.95 1.0019
HRO 1.03 1.22 1.07 0.95 1.0019

Table 7.9: HRs from stepwise BMA with hyp removed. Max. PMP: 0.09. Total PMP for Top10: 0.52. 49 models included in Occam’s window. HR for sss∗t is per 100 unit increment.

Method          age     sex     apo    odd   dm
p-value (step)  <0.001  <0.001  <0.01  0.02  <0.01
PPPO            100     100     73.5   56.9  78.5

Method          af    cla   temp  sss     sss*t
p-value (step)  0.03  0.02  0.03  <0.001  <0.001
PPPO            26.5  51.9  33.4  100     100

Table 7.10: p-values and PPPs from stepwise BMA with ihd removed.

Method  age   sex   apo   odd   dm
HRS     1.05  1.41  1.33  1.28  1.37
σHR,S   ∼0    0.12  0.13  0.13  0.15
HRB     1.05  1.40  1.35  1.34  1.38
HRO     1.05  1.40  1.24  1.17  1.30
σHR,O   ∼0    0.12  0.19  0.18  0.22

Method  af    cla   temp  sss   sss*t
HRS     1.27  1.34  1.20  0.95  1.0019
σHR,S   0.14  0.16  0.10  ∼0    ∼0
HRB     -     -     -     0.95  1.0019
HRO     1.07  1.18  1.06  0.95  1.0019
σHR,O   0.13  0.22  0.11  ∼0    ∼0

Table 7.11: HRs from stepwise BMA with ihd removed. Max. PMP: 0.09. Total PMP for Top10: 0.49. 49 models included in Occam’s window. HR for sss∗t is per 100 unit increment.

Step            age    sex    apo   odd   dm    af    cla   temp  sss    sss*t
All variables   0.001  0.01   0.01  0.02  0.12  0.01  0.04  0.13  0.001  0.001
hemo removed    0.001  0.001  0.01  0.02  0.02  0.06  0.01  0.13  0.001  0.001
alco removed    0.001  0.001  0.01  0.02  0.01  0.11  0.01  0.10  0.001  0.001
smoke removed   0.001  0.001  0.01  0.02  0.01  0.05  0.01  0.03  0.001  0.001
hyp removed     0.001  0.001  0.01  0.02  0.01  0.05  0.01  0.03  0.001  0.001
ihd removed     0.001  0.001  0.01  0.02  0.01  0.03  0.02  0.03  0.001  0.001

Table 7.12: Change in p-values using stepwise BMA.

Step            age  sex   apo   odd   dm    af    cla   temp  sss  sss*t
All variables   100  96.9  77.9  75.1  28.9  3.6   39.3  9.3   100  100
hemo removed    100  96.9  64.5  39.8  39.0  15.0  70.0  6.8   100  100
alco removed    100  97.1  72.3  43.3  59.1  7.9   69.3  11.7  100  100
smoke removed   100  100   78.8  49.0  72.2  16.7  64.7  31.7  100  100
hyp removed     100  100   75.1  52.5  73.5  15.7  59.9  36.4  100  100
ihd removed     100  100   73.5  56.9  78.5  26.5  51.9  33.4  100  100

Table 7.13: Change in PPPs using stepwise BMA.

On the other hand, using BMA we constantly update the evidence of an effect for each variable, reflecting the changes in the data set as well as the variable set, and thus reflecting the parameter as well as the model uncertainty. Inspecting the values, we learn that the data show very strong evidence for an effect of age, sex, and sss. Although the PPPs for sex have increased a little throughout the selection process, we were, and remain, confident of an effect of these variables.

Although the PPP for apo varied, the changes were within a 15% interval around the borderline between weak and positive confidence for an effect, and the extra data have not added significantly to our knowledge about an effect of apo.

On the other hand, we started out with positive evidence for an effect of odd, but the removal of hemo changed the “relative strength” of odd and cla. The extra data induced positive evidence against an effect of odd, while cla moved from positive evidence against an effect to weak and almost positive evidence for an effect. This evened out in the end, however, and both odd and cla ended up with PPPs indicating (very) weak evidence for an effect. For dm, there was positive evidence against an effect when we used all variables, but the extra evidence and fewer variables have induced positive evidence for an effect. Especially the removal of hemo, alco, and smoke gave extra data that increased the evidence for an effect of dm.

The PPPs for af and temp were both very low when we included all variables and had little data available, but this changed throughout the selection process: the extra data increased the evidence for an effect of both variables, although they ended up with PPPs around 30, still indicating positive evidence against an effect. Finally, the PPP for sss∗t started and remained at 100.

If we look at the model uncertainty in terms of the number of models included in Occam’s window, the maximum PMP, and the Top10 PMP, the maximum PMP has been fairly stable around 10%, and the Top10 PMP increased from 0.47 to 0.55 when we removed hemo, but ended up at 0.49, i.e. about half of the posterior probability mass was assigned to 10 models at any stage. On the other hand, the number of models included in Occam’s window decreased markedly from 62 to 49, i.e. fewer models were within reasonable range of the best model in terms of PMP. Remembering that BMA assumes that the data are generated by a single model within the model domain, it will assign full PMP to this model when unlimited data are available, and fewer and fewer models will eventually be included in Occam’s window.
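The three summaries used above can be computed from unnormalised posterior model weights as sketched below. The sketch assumes the common form of Occam’s window in which a model is kept if its weight is at least 1/C of the best model’s weight; the ratio C = 20, the function name, and the example weights are assumptions for illustration and are not taken from the analysis above.

```python
def occam_summary(weights, window_ratio=20.0, top=10):
    """Summarise model uncertainty from unnormalised posterior model weights.

    Models whose weight is below best / window_ratio fall outside Occam's
    window; the PMPs are renormalised over the models that remain.
    """
    best = max(weights)
    kept = sorted((w for w in weights if w >= best / window_ratio), reverse=True)
    total = sum(kept)
    pmps = [w / total for w in kept]
    return {
        "n_models": len(pmps),        # number of models in Occam's window
        "max_pmp": pmps[0],           # PMP of the best model
        "top_pmp": sum(pmps[:top]),   # total PMP of the `top` best models
    }

# Invented weights: the last model is more than 20 times worse than the best.
summary = occam_summary([8.0, 4.0, 2.0, 1.0, 0.5, 0.1], top=2)
print(summary)
```

As the posterior mass concentrates on fewer models, `n_models` shrinks and `max_pmp` grows, which is the behaviour described for increasing amounts of data.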

The important point is that none of these aspects of the parameter and model uncertainty are discovered in regular stepwise selection using p-values and significance levels. Finally, we calculate the standard deviation of the HRs

\[ \sigma(\mathrm{HR}_j) = \sigma(\exp(\beta_j)) = \sqrt{V(\exp(\beta_j))} \tag{7.1} \]

using the first-order Taylor expansion to approximate the variance of a function

\[ V[f(x)] \approx \left( \left. \frac{\partial f(x)}{\partial x} \right|_{x = E(x)} \right)^{2} V(x) \tag{7.2} \]

where $f(x) = \exp(x)$ with $x = \beta_j$, and we use (3.31) to calculate $V(x) = V(\beta_j)$ in BMA.
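Since the derivative of exp(x) is exp(x), (7.2) reduces to V(exp(βj)) ≈ exp(2E(βj)) V(βj), so that σ(HRj) ≈ exp(E(βj)) σ(βj). A minimal sketch of this calculation follows; the function name and the numerical inputs are illustrative assumptions, not values from the analysis.

```python
import math

def hr_sd(beta, var_beta):
    """Approximate standard deviation of HR = exp(beta) using (7.2).

    The derivative of exp(x) evaluated at x = beta is exp(beta), so
    V(exp(beta)) is approximated by exp(beta)**2 * V(beta).
    """
    return math.exp(beta) * math.sqrt(var_beta)

# Illustrative inputs only (not taken from the tables).
beta_hat = math.log(1.30)   # coefficient corresponding to HR = 1.30
var_beta = 0.17 ** 2        # assumed variance of the coefficient
sigma_hr = hr_sd(beta_hat, var_beta)
print(round(sigma_hr, 3))
```

The approximation is accurate when the variance of βj is small; for large variances the exponential is too curved for a linear expansion to track well.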

The results are presented in Table 7.11, and for all variables except age, sex, sss, and af, the estimated standard deviations using stepwise selection are smaller than those using BMA. As explained in Section 3.3.1.4, the regression coefficient variance in BMA includes the model uncertainty. By ignoring the model uncertainty, stepwise selection underestimates the total uncertainty, leading to overconfident parameter estimates.
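How model uncertainty enters the coefficient variance can be sketched as follows. The sketch assumes that the BMA variance has the usual mixture form, i.e. the PMP-weighted within-model variances plus the spread of the model-specific estimates around the BMA mean; whether this matches (3.31) term by term is an assumption, and the two-model example is invented.

```python
def bma_mean_and_variance(pmps, betas, variances):
    """Posterior mean and variance of a coefficient under BMA.

    pmps      -- posterior model probabilities (summing to 1)
    betas     -- coefficient estimate in each model (0 if the variable is excluded)
    variances -- coefficient variance in each model (0 if the variable is excluded)
    """
    mean = sum(p * b for p, b in zip(pmps, betas))
    second_moment = sum(p * (v + b * b) for p, b, v in zip(pmps, betas, variances))
    return mean, second_moment - mean ** 2

# Invented example: the variable is in the first model only.
pmps = [0.6, 0.4]
betas = [0.30, 0.0]
variances = [0.01, 0.0]
bma_mean, bma_var = bma_mean_and_variance(pmps, betas, variances)
print(bma_mean, bma_var)
```

In this example the variance within the single model that contains the variable is 0.01, whereas the BMA variance is larger because the two models disagree about the coefficient. A single selected model reports only the within-model part.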

7.2 Using Bayesian Networks to Estimate