4.3 Statistical methods

4.3.3 Multiple Poisson regression

Multiple (multivariable) Poisson regression is an extension of the simple (univariable) Poisson regression introduced above. "Multiple" refers to the fact that the expected number of events is modelled as a function of several variables, thus adjusting the incidence rates for differences in those variables. The estimated parameters are related to the exposure, as in the previous section, but also to all the confounders included in the study.

All variables in the present study are categorised, which makes it possible to do the analysis based on aggregated data. This means that the data set is grouped so that all persons with the same characteristics and belonging to the same exposure group are turned into one single line in the data set. This line also holds the number of person-years and the number of deaths related to it. The aggregation of data makes the computation much faster [48].

Each such line represents a certain stratum within a certain exposure group. A stratum is defined by a specific combination of characteristics. For example, one stratum could be men aged 50-55, living alone with a high income in year 2010. This subgroup of the population will exist within each exposure group. In the aggregated data set it is determined how many people from this stratum died within each exposure group. This is done for all possible combinations of confounders and calendar years. Poisson regression then estimates the number of deaths related to each line by estimating a range of parameters and using the risk time associated with each line.
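The aggregation described above can be sketched in a few lines of Python. This is only an illustration under assumed names: the record layout, the category labels and the function `aggregate` are hypothetical and are not taken from the SAS code used in the study.

```python
from collections import defaultdict

# Hypothetical person-year records: one row per person per calendar year.
# Fields: exposure group, age category, gender, living alone, income group,
# calendar year, person-years at risk in that year, death indicator (0/1).
records = [
    (1, "50-54", "M", True, 5, 2010, 1.0, 0),
    (1, "50-54", "M", True, 5, 2010, 0.5, 1),
    (4, "50-54", "M", True, 5, 2010, 1.0, 0),
    (4, "50-54", "M", True, 5, 2010, 1.0, 0),
    (4, "60-64", "F", False, 2, 2010, 0.25, 1),
]


def aggregate(rows):
    """Collapse individual records into one line per stratum and exposure group."""
    totals = defaultdict(lambda: [0.0, 0])  # key -> [person-years, deaths]
    for *key, person_years, died in rows:
        cell = totals[tuple(key)]
        cell[0] += person_years
        cell[1] += died
    return {key: (py, deaths) for key, (py, deaths) in totals.items()}


aggregated = aggregate(records)
for key, (person_years, deaths) in aggregated.items():
    print(key, person_years, deaths)
```

Five individual records collapse into three aggregated lines here, each carrying the summed person-years and deaths that the regression uses.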

For the statistical analysis one main model is used. This model is based on the study design described earlier and looks as follows:

$$
\ln E[d_{ijklmn} \mid X_{ijklmn}] = \ln RT_{ijklmn} + \alpha + \text{exposure}_i + \text{age}_j + \text{gender}_k + \text{cohabitation}_l + \text{income}_m + \text{calendar year}_n \tag{4.9}
$$

where d_ijklmn is the number of deaths, RT_ijklmn is the number of person-years and X_ijklmn is the vector of variables for the ith exposure group, the jth age category, the kth gender, the lth cohabitation status, the mth income group and the nth calendar year. α is the intercept, which is related to the incidence rate when every variable is at its reference category. All the other elements in equation 4.9 are parameters related to the variables. For each variable the number of parameters estimated is the number of categories minus 1, since one arbitrarily chosen category will be the reference.
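To make the role of the offset ln RT concrete, the following minimal sketch evaluates equation 4.9 for a single aggregated line. The function name `expected_deaths` and all parameter values are hypothetical and chosen only for illustration; they are not estimates from the study.

```python
import math


def expected_deaths(person_years, intercept, effects):
    """Expected deaths for one aggregated line under the log-linear model.

    ln E[d] = ln RT + alpha + (sum of the parameters of the line's categories),
    which is equivalent to E[d] = RT * exp(alpha + sum(effects)).
    """
    return person_years * math.exp(intercept + sum(effects))


# Hypothetical values, not taken from the study.
alpha = math.log(0.01)  # reference incidence rate: 0.01 deaths per person-year
effects = [
    math.log(1.2),  # parameter of the line's exposure group
    math.log(2.0),  # parameter of the line's age category
]                   # all other variables are at their reference category (0)

reference_line = expected_deaths(1000.0, alpha, [])
adjusted_line = expected_deaths(1000.0, alpha, effects)
print(reference_line, adjusted_line)
```

With 1,000 person-years, the reference line gives about 10 expected deaths and the adjusted line about 24, since the parameters act multiplicatively on the rate.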

In total, this model has 32 unknown parameters that need to be estimated. The estimation is done in SAS 9.4 using the procedure proc genmod. An example of how proc genmod is used can be found in Appendix A. The procedure uses maximum likelihood estimation to estimate all the unknown parameters, and 95% confidence intervals are estimated along with them. Furthermore, it reports the p-value of a likelihood ratio test comparing the model with and without each parameter [49]. The p-value is the probability of observing a likelihood ratio statistic at least as extreme as the one obtained, given that the null hypothesis is true, where the null hypothesis is that the simpler model fits the data as well as the larger one.
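The count of 32 can be checked with a short calculation. The sketch below assumes the category counts used elsewhere in this document: five exposure groups, 13 age categories and five income groups are stated explicitly, while two genders, two cohabitation states and ten calendar years (a ten-year study period) are read from the text as assumptions.

```python
# Number of categories per variable (see the assumptions in the lead-in).
categories = {
    "exposure": 5,
    "age": 13,
    "gender": 2,
    "cohabitation": 2,
    "income": 5,
    "calendar_year": 10,
}


def count_parameters(levels):
    """Intercept plus (categories - 1) parameters per categorical variable."""
    return 1 + sum(k - 1 for k in levels.values())


n_parameters = count_parameters(categories)
print(n_parameters)  # 1 + 4 + 12 + 1 + 1 + 4 + 9 = 32
```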

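As mentioned earlier, the incidence rate ratios (IRR) are often of great interest. An example of how the IRR between exposure groups 1 and 4 is calculated is shown below. Since all other terms of equation 4.9 cancel in the ratio of the two incidence rates, it reduces to

$$
\mathrm{IRR}_{1\,\text{vs.}\,4} = \frac{\mathrm{IR}_1}{\mathrm{IR}_4} = \exp(\text{exposure}_1 - \text{exposure}_4)
$$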

where exposure₁ and exposure₄ are the maximum likelihood parameter estimates related to exposure groups 1 and 4 respectively.
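As a minimal numerical sketch, an IRR can be obtained from two parameter estimates by exponentiating their difference. The estimates below are hypothetical and the function name `incidence_rate_ratio` is introduced here only for illustration; a reference group has parameter 0 by construction.

```python
import math


def incidence_rate_ratio(param_a, param_b):
    """IRR of group a versus group b from their log-scale parameter estimates."""
    return math.exp(param_a - param_b)


# Hypothetical estimates, not results from the study.
exposure_1 = 0.08  # exposure group 1
exposure_4 = 0.0   # exposure group 4, assumed here to be the reference group

irr = incidence_rate_ratio(exposure_1, exposure_4)
print(round(irr, 3))
```

Swapping the two groups gives the reciprocal ratio, so an IRR above 1 in one direction corresponds to an IRR below 1 in the other.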

Sensitivity analysis

In the sensitivity analysis, interactions and effect modifiers will be added to the analysis. This means that an extra term will be added to equation 4.9. One interaction examined is an interaction between age category and gender, which would lead to the following term being added:

$$
\text{age} \times \text{gender}_{jk} \tag{4.14}
$$

which is a parameter that would be estimated for each possible combination of age categories and genders. However, one category of each variable would be chosen as the reference. If 13 age categories exist and two genders, then 12 extra parameters would be estimated. Nothing would be changed in equation 4.9; the individual effects of age and gender would still be there.

For the models including effect modifiers, a part added to equation 4.9 could look as follows:

$$
\text{exposure} \times \text{age}_{ij} \tag{4.15}
$$

which is also an estimated parameter associated with both exposure group and age category.

Assuming five exposure groups and 13 age categories, this effect modifier would lead to an estimation of 48 extra parameters.
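Both counts of extra parameters follow from the same rule: with one reference category per variable, an interaction between two categorical variables adds the product of their non-reference category counts. A small sketch, with a hypothetical helper name:

```python
def interaction_parameters(levels_a, levels_b):
    """Extra parameters for an interaction between two categorical variables,
    with one reference category chosen for each variable."""
    return (levels_a - 1) * (levels_b - 1)


age_by_gender = interaction_parameters(13, 2)   # 13 age categories, 2 genders
exposure_by_age = interaction_parameters(5, 13)  # 5 exposure groups, 13 age categories
print(age_by_gender, exposure_by_age)  # 12 48
```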

Chapter 5

Analysis and results

5.1 Data preprocessing

In order to work with the data and eventually carry out the statistical analysis of the association between magnesium in drinking water and mortality, extensive preprocessing had to be done to the various data sets (see Chapter 3 for details on the data sets). Figure 5.1 illustrates the process. The six squares at the top of the figure represent the raw data sets. Squares further down the tree represent processed data sets that are central to the project. The first one contains the estimates of magnesium concentrations. This data set required considerable consideration and analysis of its own and is therefore represented in the figure. The last data set at the bottom of the tree is the final data set that was used in the statistical analysis. The interval of years written below the title of each data set indicates the time period for which data is available in that data set. All steps marked as circles in the figure are actions performed on the data sets, and they will be elaborated upon shortly. The n in the figure refers to the number of observations in a data set, and thus it can be seen how the number of observations is reduced (or increased) during the preprocessing. The red box is a breakdown of the observations lost when matching the register data to the magnesium estimates through the addresses.

First, the right side of the figure will be explained, beginning with the magnesium data sets. The data set containing concentrations of magnesium measured from water samples had 62,941 observations. Each sample was linked to a waterworks, and an attempt was then made to link each waterworks to a water supply area (WSA). This was possible for all but 128 samples, which were excluded. The samples were then grouped by the year in which they were taken, which reduced the data set to 56,131 observations. For years in which more than one sample was taken, the average of the concentrations was kept. In total, 35% of the reduction was due to samples taken on the exact same day as another sample. This data set was then used in the step called KNN, where estimates of concentrations were made. A description of this step will be given later in the chapter. The estimates were made for all waterworks with at least one sample and for all years from 1980 to 2017, resulting in a total of 148,504 estimates.
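The yearly averaging step can be sketched as follows. The sketch assumes that samples are grouped per waterworks and per year (the text only mentions the year explicitly); the sample values, identifiers and the function `yearly_means` are hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical samples: (waterworks id, sampling year, magnesium in mg/l).
samples = [
    ("W1", 2001, 10.0),
    ("W1", 2001, 14.0),
    ("W1", 2002, 12.0),
    ("W2", 2001, 5.0),
]


def yearly_means(rows):
    """Keep one value per waterworks and year: the mean of that year's samples."""
    grouped = defaultdict(list)
    for waterworks, year, concentration in rows:
        grouped[(waterworks, year)].append(concentration)
    return {key: mean(values) for key, values in grouped.items()}


yearly = yearly_means(samples)
print(yearly)
```

Four samples are reduced to three observations here, in the same way that the grouping reduced the number of observations in the study.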

The data set called waterworks abstraction dates far back in time, and therefore not all registrations were relevant for this study. The number of relevant registrations was further reduced by waterworks without magnesium estimates or with negative registrations. These were all excluded in the step called data cleaning. The data set was left with 103,734 relevant registrations of abstraction, which should be compared to the 148,504 magnesium estimates. This means that not all waterworks had a registered abstraction for all years. This could be because some of the waterworks have not been active during the whole period. Some waterworks were connected to more than one WSA, and for those it was impossible to know how much water they delivered to each area. Therefore the abstraction was simply divided by the number of WSAs that the waterworks was supplying, thus assuming that an equal amount was delivered to all areas. This is possibly not always the case, but with no further information available it was the most reasonable assumption. For more information about the connection between waterworks and WSAs, see Table 5.1 and Table 5.2. This correction of abstraction levels was also part of the data cleaning step.
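The equal-split assumption amounts to a single division. A minimal sketch, with hypothetical identifiers and volumes:

```python
def split_abstraction(abstraction, wsa_ids):
    """Divide a waterworks' yearly abstraction equally among the WSAs it supplies."""
    share = abstraction / len(wsa_ids)
    return {wsa: share for wsa in wsa_ids}


# Hypothetical waterworks supplying three WSAs with 300 units of water in one year.
shares = split_abstraction(300.0, ["WSA_A", "WSA_B", "WSA_C"])
print(shares)
```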

The magnesium estimates for each waterworks and the abstraction data were then merged and grouped by WSA. The estimates of the waterworks were thus transformed into estimates for each area. For WSAs connected to more than one waterworks, this was done by weighting the estimates for a given year by the abstractions registered for that year. For example, if an area had 90% of its water delivered from one waterworks and 10% from another, then the estimated concentrations for the two waterworks would be used to calculate a weighted estimate for the WSA, weighting the estimates 90% and 10% respectively. In case no registrations were made for one of the waterworks, it was assumed that it did not deliver any water that year. However, for years where none of the waterworks connected to a WSA had any registrations of abstraction, a simple average of the relevant magnesium estimates was used. This was done under the assumption that the area must have had water delivered and that some registration error had most likely occurred. This was the case for all estimates for 2017, since the 2017 abstraction data was not available. It was also the case for many estimates before 1990, when registration seems to have been sparse. The merging of the magnesium estimates and the abstraction registrations resulted in a data set with estimates for all water supply areas based on the waterworks connected to them. In total, 97,556 estimates were calculated and ready to be used in the further analysis. The number of estimates is reduced because the original waterworks estimates were grouped by WSA.
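The weighting rule, including the fallback to a simple average, can be sketched for one WSA and one year as below. The function `wsa_estimate` and the numbers are hypothetical; the sketch treats a missing abstraction registration as zero delivered water, as described above.

```python
def wsa_estimate(estimates, abstractions):
    """Abstraction-weighted magnesium estimate for one WSA in one year.

    estimates:    {waterworks id: estimated concentration in mg/l}
    abstractions: {waterworks id: registered abstraction}; a missing entry
                  is treated as no water delivered that year.
    """
    weights = {w: abstractions.get(w, 0.0) for w in estimates}
    total = sum(weights.values())
    if total == 0:
        # No registrations at all: fall back to a simple average.
        return sum(estimates.values()) / len(estimates)
    return sum(estimates[w] * weights[w] for w in estimates) / total


concentrations = {"W1": 10.0, "W2": 20.0}  # hypothetical estimates in mg/l

weighted = wsa_estimate(concentrations, {"W1": 90.0, "W2": 10.0})  # 90% / 10%
one_missing = wsa_estimate(concentrations, {"W1": 50.0})           # W2 unregistered
none_registered = wsa_estimate(concentrations, {})                 # simple average
print(weighted, one_missing, none_registered)  # 11.0 10.0 15.0
```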

Figure 5.1: Overview of the data management process. Squares symbolise data sets and circles symbolise actions.

As can be seen in Figure 5.1, the data set with estimates of magnesium concentrations had to be linked to four other data sets. These include family income, population data (CRS) and cause of death, all originating from Danish registers, accompanied by a data set containing the geographical location of all current addresses. It was then determined in which WSA the coordinates of each address were located. This was done in the step called Connected to WSA, and in total fewer than 100,000 addresses could not be matched to any WSA. This was simply because their coordinates were not inside any of the water supply areas. Kirstine Wodschow used a geographical information system (QGIS version 2.18.14) to do the geographical matching [2].

Before merging the register data with the magnesium data, all individuals aged less than 28 and individuals with no address were removed from the data set. Only very few observations did not have an address. They were excluded in such a way that all individuals who had at least one year with no address were completely removed from the data. This was done because it would not be possible to calculate their exposure correctly. The reason for keeping persons aged 28 and 29 is to be able to calculate the two-year average exposure.

When merging all these data sets into one final data set, many observations from the register data did not have a match in the address data set and therefore had no connection to any WSA. This was the case for more than half a million observations, and it was a particular issue for observations before 2007. In 2007 many of the municipality codes changed due to the structural reform. As mentioned earlier, some addresses were not linked to any WSA, which resulted in around 350,000 observations not being linked to a WSA. Some observations were linked to a WSA with no concentration. This was the case for people living in the few areas with no measurements and therefore no estimates. More than half a million observations were excluded because the individuals were supplied by their own well; even though the concentration might be similar to that of the surrounding area, it was deemed too uncertain.

For all individuals left in the data set, the exposure was calculated as an average over the past two years. A slightly simpler average was used for people dying during 2005, where only the concentrations from 2004 and 2005 were used, making it less than a two-year average. This was done so that observations from 2005 could be used, thus keeping the study period ten years long. People who had left Denmark and moved back were only included from two years after they had re-entered, so that an appropriate exposure could be calculated.

After the calculations, observations of people aged 28 or 29, observations from 2004 and observations where calculations could not be completed (due to re-entering the country) were removed from the data set. Furthermore, observations where no family income was present were excluded. This left a data set of around 34.3 million observations corresponding to a study population of 4,143,662 unique individuals. This yields an average time in the study of 8.3 years.

The only remaining step needed to make the data set ready for analysis was categorising the data. The exposure was divided into five categories of equal size, yielding the following exposure groups:

Group 1: Exposed to 6.65 mg/l or less.

Group 2: Exposed to more than 6.65 and up to 10.3 mg/l.

Group 3: Exposed to more than 10.3 and up to 14.6 mg/l.

Group 4: Exposed to more than 14.6 and up to 21.9 mg/l.

Group 5: Exposed to more than 21.9 mg/l with the maximum being 53.6 mg/l.
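Assigning an exposure group from these cut points can be sketched with a binary search. The function name `exposure_group` is hypothetical; the cut points are the ones listed above, with each upper limit belonging to the lower group ("up to").

```python
from bisect import bisect_left

# Upper limits (mg/l) of exposure groups 1 to 4, as listed above.
CUT_POINTS = [6.65, 10.3, 14.6, 21.9]


def exposure_group(concentration):
    """Return the exposure group (1-5) for a magnesium concentration in mg/l.

    bisect_left places a value equal to a cut point in the lower group,
    which matches the "up to" wording of the group definitions.
    """
    return bisect_left(CUT_POINTS, concentration) + 1


for value in (6.65, 6.66, 14.6, 21.9, 53.6):
    print(value, exposure_group(value))
```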

The family income was likewise divided into five categories of equal size. This was done taking into account the inflation of income and the difference between retired and non-retired individuals.

The age was divided into 13 categories with five-year intervals and the last category being 90+.
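A possible mapping from age to age category is sketched below. It assumes that the first category starts at age 30, which is inferred from the exclusion of persons aged 28 and 29 and is not stated explicitly; twelve five-year intervals from 30 to 89 plus the open category 90+ then give the 13 categories. The function name and the labels are hypothetical.

```python
def age_category(age):
    """Map an age in whole years to a five-year category label.

    Assumption: categories start at 30 (30-34, 35-39, ..., 85-89) and end with 90+.
    """
    if age < 30:
        raise ValueError("ages below 30 are outside the study population")
    if age >= 90:
        return "90+"
    lower = 30 + 5 * ((age - 30) // 5)
    return f"{lower}-{lower + 4}"


labels = {age_category(a) for a in range(30, 110)}
print(len(labels))  # 13 distinct categories under the stated assumption
```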

The attribute specifying whether the individual is living alone or not was based on the attribute Familytype from the CRS. If this attribute was 5, the individual was marked as living alone; otherwise the individual was marked as not living alone.