
Estimation of magnesium levels

In document The Technical University of Denmark (Sider 34-39)

In order to use the information about magnesium levels in different areas, it is necessary to have an estimate of the level in each year. In some areas many years may pass between measurements, and the concentration in the intervening years can be estimated in many ways. However the estimation is done, long periods without measurements introduce uncertainty in the estimated concentrations used for the analysis. To assess these problems and how many areas were affected by them, Figure 5.4 was created.

Figure 5.4: Map of problematic WSAs.

In the figure, areas that are supplied by more than one waterworks are marked in yellow, areas with no data since 2000 are marked in red, and areas with very high variation are marked in green. An area supplied by more than one waterworks is problematic because it can be difficult to know which waterworks each household gets its water from. Concentrations in the areas with no recent data are still estimated, but these estimates carry a high uncertainty.

The areas with high variation, defined as a standard deviation of more than 4 mg/l, are also problematic since the reason for the variation is unknown. Almost all of them also have more than one waterworks connected, and the variation could simply stem from that fact. However, it could also be due to a sudden change in concentration, which is difficult to handle.
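The high-variation criterion can be illustrated with a minimal sketch. It assumes the sample standard deviation is meant (the text does not say whether the sample or population form was used), and the function name, the dictionary layout and the example values are hypothetical.

```python
from statistics import stdev


def high_variation_areas(measurements, threshold=4.0):
    """Return the ids of areas whose measurements vary more than `threshold` mg/l.

    `measurements` maps an area id to a list of magnesium concentrations (mg/l).
    Assumption: the sample standard deviation is used.
    """
    flagged = []
    for area, values in measurements.items():
        # A standard deviation needs at least two measurements.
        if len(values) >= 2 and stdev(values) > threshold:
            flagged.append(area)
    return flagged


# Hypothetical example data, not taken from the Jupiter database.
example = {"A": [10.0, 11.0, 12.0], "B": [5.0, 20.0, 12.0]}
flagged = high_variation_areas(example)
```

Here area "A" has a standard deviation of 1 mg/l and is not flagged, while area "B" exceeds the 4 mg/l threshold.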

All of this made it clear that concentrations had to be estimated for each waterworks and then later aggregated to WSA level. In order to find the best way to do the estimations, three methods were considered. Their performance is described in the following.

5.3.1 Linear Interpolation

Simple linear interpolation has the issue of treating the measurement made in one year as the exact value for that whole year. This is a concern since a measurement made on a specific date could be unusually high or low compared to measurements made before and after. Assuming that such a measurement was the true value for the entire year seemed like too strong an assumption. Instead, it was decided to estimate the true concentration based on several measurements. Therefore, linear interpolation was not used for the final analysis.

5.3.2 Geographical interpolation

A second method, geographical interpolation, was attempted. This method uses the neighbouring areas to determine the concentration of magnesium. Since magnesium is found in the aquifer and is related to geography, the concentrations of neighbouring areas would be expected to be similar. To test this, leave-one-out cross validation with the Euclidean distance metric, Gaussian kernel weighting and 20 neighbours was used. This method gave a negative mean absolute error of -4.719 mg/l, calculated using equation 4.1. Moreover, 15,796 data points could not be estimated because they belonged to areas where none of the 20 closest neighbours had any measurements. This number seems large, but it is probably because only very few areas had any measurements in the early years. Other parameter settings for the geographical interpolation were also tried, but none of the experiments came close to the precision of the KNN method (results in the next section), and therefore only this one configuration is reported.
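The leave-one-out test of the geographical interpolation can be sketched roughly as follows. This is an illustration under stated assumptions, not the implementation used in the analysis: the function names, the `(x, y, value)` tuple layout and the kernel `bandwidth` are hypothetical, and the score is taken to be the mean absolute error with its sign flipped.

```python
import math


def gaussian_kernel_estimate(target, areas, k=20, bandwidth=10.0):
    """Estimate the concentration at `target` = (x, y) from its k nearest areas.

    `areas` is a list of (x, y, value) tuples; value is None when the area has
    no measurement. Returns None when none of the k nearest areas has data.
    """
    nearest = sorted(areas, key=lambda a: math.dist(target, a[:2]))[:k]
    measured = [a for a in nearest if a[2] is not None]
    if not measured:
        return None
    # Gaussian kernel: closer areas receive larger weights.
    weights = [
        math.exp(-math.dist(target, a[:2]) ** 2 / (2 * bandwidth ** 2))
        for a in measured
    ]
    total = sum(weights)
    if total == 0:  # all weights underflowed to zero
        return None
    return sum(w * a[2] for w, a in zip(weights, measured)) / total


def loo_negative_mae(areas, k=20, bandwidth=10.0):
    """Leave each measured area out in turn and return the negative MAE."""
    errors = []
    for i, (x, y, value) in enumerate(areas):
        if value is None:
            continue
        others = areas[:i] + areas[i + 1:]
        estimate = gaussian_kernel_estimate((x, y), others, k, bandwidth)
        if estimate is not None:  # points that cannot be estimated are skipped
            errors.append(abs(estimate - value))
    return -sum(errors) / len(errors) if errors else None
```

Points for which the estimate is `None` correspond to the data points that could not be estimated because no neighbour had a measurement.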

5.3.3 The KNN method

The third method uses the K-nearest-neighbour algorithm. In order to determine how well this method performs, a cross validation was carried out. The validation was made in two steps. The first step determined the best weighting scheme and the optimal number of neighbours (K). Only the waterworks with more than eight measurements were used, so that up to 6 neighbours could be tested at all waterworks. It was an 8-fold cross validation with a test set of 10% and a training set of 90%: for each waterworks, 10% of the data points were removed and placed in a test set, and the rest of the data was used to estimate the data points in the test set. This was done 8 times for each waterworks, each time placing a different 10% in the test set.
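The first validation step can be sketched as below. Several details are assumptions rather than facts from the text: distances are taken to be differences in time between measurements at the same waterworks, the only schemes compared are uniform and inverse distance weighting, a training point at exactly the same time as the test point receives all the weight, and all function names are hypothetical.

```python
import random


def knn_estimate(t, train, k, scheme):
    """Estimate the value at time t from (time, value) pairs in `train`."""
    nearest = sorted(train, key=lambda p: abs(p[0] - t))[:k]
    if scheme == "uniform":
        return sum(v for _, v in nearest) / len(nearest)
    # Inverse distance weighting; assumed convention for zero distances:
    # exact matches in time take all the weight.
    exact = [v for tt, v in nearest if tt == t]
    if exact:
        return sum(exact) / len(exact)
    weights = [1 / abs(tt - t) for tt, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def repeated_holdout_score(series, k, scheme, rounds=8, test_share=0.1, seed=0):
    """Negative MAE over `rounds` splits, each holding out a different 10%."""
    order = list(range(len(series)))
    random.Random(seed).shuffle(order)
    size = max(1, int(len(series) * test_share))
    errors = []
    for r in range(rounds):
        test_idx = set(order[r * size:(r + 1) * size])
        train = [p for i, p in enumerate(series) if i not in test_idx]
        for i in test_idx:
            t, value = series[i]
            errors.append(abs(knn_estimate(t, train, k, scheme) - value))
    return -sum(errors) / len(errors)


def best_configuration(all_series, ks=range(1, 7), schemes=("uniform", "inverse")):
    """Average the score over waterworks for every (scheme, K) combination."""
    scores = {}
    for scheme in schemes:
        for k in ks:
            per_waterworks = [repeated_holdout_score(s, k, scheme) for s in all_series]
            scores[(scheme, k)] = sum(per_waterworks) / len(per_waterworks)
    return max(scores, key=scores.get), scores
```

Each entry of `scores` corresponds to one cell of a heatmap over weighting schemes and K; since the scores are negative errors, the best combination is the one with the highest value.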

The results can be seen in Figure 5.5, where darker green colours indicate better test scores.

The test scores used for the heatmap are the mean of the negative mean absolute error.

Figure 5.5: Heatmap illustrating the average negative mean absolute error for different weighting schemes used on K from 1 to 6.

From the figure it is clear that many combinations have similar errors and might be equally good to use. However, the combination with the lowest error is inverse distance weighting with 4 neighbours. Since this combination performs slightly better than the others, it is the one used. In order to examine what happens when concentrations at all waterworks are estimated (including the ones with 8 or fewer measurements), step 2 of the cross validation was carried out.

In this step a leave-one-out cross validation was made, in which each data point was estimated from the others. The only waterworks excluded from this cross validation were the ones with only one measurement. For waterworks with fewer than five measurements, the maximum possible number of neighbours was used. Hence, if a waterworks had three measurements, only two neighbours were used to estimate the third data point.

Using this approach, the negative mean absolute error was -1.22 mg/l, which is slightly worse than the error from the preliminary cross validation. However, this is to be expected, since concentrations at waterworks with very few samples were also estimated here.
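The second validation step can be sketched as follows. As before, this is an illustration under assumptions: distances are differences in time between measurements at the same waterworks, an exact match in time takes all the weight, and the function names and example values are hypothetical.

```python
def idw_estimate(t, train, k=4):
    """Inverse-distance-weighted estimate at time t from (time, value) pairs.

    Slicing with [:k] caps the number of neighbours at the number of
    available points, so small waterworks use as many as they have.
    """
    nearest = sorted(train, key=lambda p: abs(p[0] - t))[:k]
    # Assumed convention: points at distance zero take all the weight.
    exact = [v for tt, v in nearest if tt == t]
    if exact:
        return sum(exact) / len(exact)
    weights = [1 / abs(tt - t) for tt, _ in nearest]
    return sum(w * v for w, (_, v) in zip(weights, nearest)) / sum(weights)


def leave_one_out_errors(series, k=4):
    """Negative absolute error for every point, estimated from the others."""
    if len(series) < 2:  # waterworks with a single measurement are excluded
        return []
    errors = []
    for i, (t, value) in enumerate(series):
        others = series[:i] + series[i + 1:]
        errors.append(-abs(idw_estimate(t, others, k) - value))
    return errors


# Hypothetical waterworks with three measurements: two neighbours are used.
errors = leave_one_out_errors([(2000, 10.0), (2002, 14.0), (2004, 18.0)])
negative_mae = sum(errors) / len(errors)
```

In this example the middle point is recovered exactly from its two equally distant neighbours, while the two end points are not, because both of their neighbours lie on the same side.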

Figure 5.6 shows a histogram of the cross validation errors. Almost all of them are very small, but a few are very large. One concentration is estimated to be more than 70 mg/l from the actual measurement.

Figure 5.6: Histogram of negative absolute errors from leave-one-out cross validation using the inverse distance weighting and 4 neighbours where possible.

Note that the y-axis of the figure is on a log scale, which makes the few large negative values visible.

All large errors were double-checked manually in the Jupiter database, and no apparent reason for the differences could be found.

This method was applied to produce the final data set of estimates. It was applied to all years, so years with samples were also estimated using the four nearest neighbours. However, the sample taken in such a year was of course also used, and with a distance of zero it was the closest data point and was therefore given the highest weight.

5.3.4 The data set of estimations

The method used for the final data set is the KNN method. As described, it was applied to all waterworks with at least one measurement of magnesium. Also as described earlier, all waterworks deliver water to at least one water supply area. Since it is only possible to determine in which WSA an address is located, the estimates for the waterworks needed to be transformed into estimates of the concentration in each area. This was done as described in the data preprocessing section.

This aggregation led to the final data set of estimates and in Figure 5.7 one example of how the original measurements are turned into a final weighted estimate is shown.

Figure 5.7: Area 5010 close to Copenhagen is connected to three waterworks. The estimates (in pink) are made from all the original measurements, shown as green, brown and purple dots.

In Figure 5.7 the pink line is the estimate of concentrations in this specific area close to Copenhagen. These estimates are the ones used in the further analysis. They are based on all the other dots, which are coloured according to the waterworks at which the measurements were taken.

In order to assess how all these estimates look geographically and over time, several maps were created. However, their similarity is striking, and only one is shown here, in Figure 5.8. The rest can be found in appendix B. This map is based on estimates from 2015, and only the areas marked in grey have no estimated concentration. As seen on the map, areas with very low concentrations exist primarily in Jutland.

Figure 5.8: Map showing the estimated levels of magnesium in all WSAs (with data). Year 2015. Areas with no data (and hence no estimate) are marked in grey. The lighter grey is the ocean. The white is areas where no WSAs exist.
