
10.2 Clustering in Oracle Data Mining


Clustering will be performed using Oracle Data Mining (ODM) clustering algorithms. ODM provides two algorithms for this purpose:

The enhanced k-Means algorithm, an improved version of the traditional k-Means algorithm [14]

Orthogonal partitioning clustering (the O-Cluster algorithm), an Oracle-proprietary algorithm [15]

Both algorithms identify naturally occurring groupings within the input data set. The enhanced k-Means algorithm creates hierarchical clusters and groups the input traffic patterns into a user-specified number of clusters. The O-Cluster algorithm selects the most representative clusters without the user prespecifying the number of clusters. Both algorithms provide detailed information about the generated clusters: the placement of all traffic patterns in the clustering model's hierarchical tree, cluster rules that capture the main characteristics of the data assigned to each cluster, and cluster centroid values. The generated clusters can subsequently be used to score new input patterns on their cluster membership. They are also used to generate a Bayesian probability model that is applied during scoring to assign data points to clusters. Clustering can also be used for incident detection by building a clustering model, applying it, and then finding items that do not fit in any cluster.

Both algorithms were initially tested on a sample data set. The O-Cluster algorithm did not return any usable results, because it placed all patterns in the same cluster, an outcome that can be dismissed on inspection of the 10-minute moving-average travel times in Figure 9.6. In addition, the lack of adequate documentation on the O-Cluster algorithm made it nearly impossible to tune the numerous parameter settings which presumably are required for this algorithm to run properly. Neither Google searches nor browsing the Oracle Technology Network discussion forums [16] shed more light on how to apply this algorithm in practice. For this reason, the enhanced k-Means algorithm was selected for further examination. Its start-up phase was also difficult, owing to the lack of adequate documentation explaining the theoretical applicability of the numerous parameter settings and the interpretation of the output. Once these obstacles were overcome, the algorithm was used to gain insight into the behaviour of traffic patterns.


10.2.1 Enhanced k-Means algorithm

The enhanced k-Means algorithm is a distance-based clustering algorithm that partitions the data into a predetermined number of clusters, provided that there are enough distinct patterns in the data set. The algorithm relies on a distance metric to measure the similarity between data points, which is either the Euclidean, Cosine, or Fast Cosine distance (refer to [17] for an informal explanation of the applicability of the Cosine and Fast Cosine distance metrics); the Cosine and Fast Cosine metrics have not been tested in this work. Data points are assigned to the nearest cluster according to the distance metric used.
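To make the two documented metrics concrete, a minimal Python sketch is given below (the Fast Cosine metric is an Oracle-internal optimisation of the cosine distance and is not reproduced here):

```python
import numpy as np

def euclidean_distance(a, b):
    """Straight-line distance between two traffic patterns."""
    return np.linalg.norm(a - b)

def cosine_distance(a, b):
    """One minus the cosine similarity: compares the shape of two
    patterns irrespective of their overall magnitude."""
    return 1.0 - np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
```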

The algorithm builds models in a hierarchical, top-down manner and is reminiscent of algorithms for divisive clustering [18]. It begins by placing all traffic patterns in a single cluster, which constitutes the first level of the hierarchy. Each level of the hierarchy represents a particular grouping of the data into disjoint clusters of traffic patterns. The algorithm then chooses the traffic pattern whose average dissimilarity from all the other patterns in the data set is largest; this pattern becomes the first member of the second cluster. At each successive step, the pattern in the first cluster whose average distance from the remaining patterns in the first cluster, minus its average distance from those in the second cluster, is largest is transferred to the second cluster. This continues until the corresponding difference in averages becomes negative: when the transfer of patterns stops, there are no longer any patterns in the first cluster that are, on average, closer to those in the second cluster. The result is a split of the original cluster into two child clusters, one containing the patterns transferred to the second cluster and the other containing those remaining in the first cluster. These two clusters represent the second level of the hierarchy.

After the children of a parent node have converged, the traditional k-Means algorithm is run on all leaves until convergence, that is, until either the change in error between two consecutive iterations is less than the minimum error tolerance or the number of iterations exceeds the maximum number of iterations (both are parameter settings that must be specified before running the algorithm; refer to Section 10.2.2). Each successive level is produced by applying this splitting procedure to one of the clusters at the previous level. There are two strategies for choosing which child cluster to split in order to grow the tree until the desired number of clusters has been reached; the strategy only applies from the second level of the hierarchy onward. The child clusters can be split based on either largest variance or largest size. Variance is computed as the sum of squared distances from all patterns belonging to the child cluster to the cluster centroid; size is computed as the number of patterns belonging to each child cluster. Recursive splitting continues until all child clusters either become single patterns or the specified number of clusters has been reached. The centroids of the inner clusters in the hierarchy are updated as the construction of the tree evolves, and the whole tree is returned.
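The transfer step of the splitting procedure described above can be summarised in a short sketch. This is purely an illustration using the Euclidean metric, not Oracle's (undocumented) internal implementation; the subsequent k-Means refinement of the leaves is omitted here:

```python
import numpy as np

def split_cluster(X):
    """One divisive split: seed a second cluster with the most dissimilar
    pattern, then keep transferring patterns that are, on average, closer
    to the second cluster than to the rest of the first cluster."""
    n = len(X)
    # Pairwise Euclidean distances between all patterns.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)

    first = list(range(n))   # indices still in the original cluster
    second = []              # indices moved to the new cluster

    # Seed: the pattern with the largest average dissimilarity to the rest.
    seed = int(np.argmax(dist.sum(axis=1) / (n - 1)))
    first.remove(seed)
    second.append(seed)

    while len(first) > 1:
        # Average distance to the rest of the first cluster minus average
        # distance to the second cluster, for every remaining pattern.
        gains = [dist[i, [j for j in first if j != i]].mean()
                 - dist[i, second].mean() for i in first]
        best = int(np.argmax(gains))
        if gains[best] <= 0:   # every pattern is now closest to its own cluster
            break
        second.append(first.pop(best))
    return first, second
```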

Moreover, the algorithm returns, for each cluster, its place in the clustering model's hierarchical tree, a cluster prototype (consisting of the mean and variance), and histograms (one for each feature). The clusters discovered by the algorithm are then used to create rules that capture the main characteristics of the data assigned to each cluster. The rules represent the bounding boxes that envelop the data in the discovered clusters: the antecedent of each rule describes the cluster's bounding box, while the consequent encodes the cluster ID for the cluster described by the rule. Furthermore, the algorithm provides probabilistic scoring and assignment of data to clusters. For a given record, the clustering model returns the probability of the record belonging to a given cluster, P(cluster|record). This probability is similar to what is obtained from a classification model; from this perspective, clustering is treated as a classification problem where first the class labels (clusters) are identified and then the records are classified into clusters from a predefined set of clusters [19]. Patterns can be affiliated to several clusters with the same probability, as long as the total probability does not exceed 1. This might occur when a natural grouping is split into a number of constituents, and it will occur whenever an attempt is made to cluster an incident pattern. Missing values are not handled automatically: if missing values are present in the data set and are not treated before the clustering algorithm is run, the algorithm will handle the input data incorrectly, with the result that the data will be grouped erroneously.
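To illustrate the shape of the rule and scoring output: the bounding-box antecedent of a rule is simply the per-feature range of a cluster's patterns, and scoring yields one probability per cluster. The probability computation below is a hypothetical stand-in (inverse-distance weighting); ODM actually derives P(cluster|record) from the Bayesian probability model mentioned earlier:

```python
import numpy as np

def cluster_rule(members):
    """Rule antecedent for one cluster: the bounding box (per-feature
    minimum and maximum) enveloping its patterns. The consequent would
    simply encode the cluster ID."""
    return members.min(axis=0), members.max(axis=0)

def membership_probabilities(record, centroids):
    """Probabilities over all clusters, summing to 1. Inverse-distance
    weighting is used purely for illustration; it is not Oracle's
    Bayesian scoring model."""
    d = np.linalg.norm(centroids - record, axis=1)
    w = 1.0 / np.maximum(d, 1e-12)   # guard against a zero distance
    return w / w.sum()
```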

This algorithm is not susceptible to initialization issues; that is, its results are reproducible for identical parameter settings. This is useful in a real-life implementation, since the results need not be stored but can be regenerated at any time. However, the algorithm also suffers from a number of deficiencies, such as the vagueness of its termination criteria with respect to choosing the right number of clusters in the model. It is then up to the user to decide which model (if any) actually represents a natural clustering, in the sense that patterns within each of its groups are sufficiently more similar to each other than to patterns assigned to different groups at that level.

10.2.2 The applied settings

The algorithm has the following settings (for a full list of available settings refer to [21]):

Number of clusters. This setting specifies the number of clusters in the model. The value must be between 2 and the number of distinct cases in the data set.

Distance function. This setting specifies how the algorithm calculates the distance. The distance function can be Euclidean, Cosine, or Fast Cosine. Clustering will be performed with the Euclidean distance function.

Split criterion. This setting specifies how the algorithm splits a node. The split criterion can be either variance or size. Clustering will be performed with size as the split criterion; the reasons for this are explained in Section 10.3.2.

Minimum error tolerance. This setting specifies the minimum percentage change in error between iterations for the clusters to be considered converged (the convergence check is sketched after this list). The minimum error tolerance must be a non-negative number less than 1. Preliminary runs of the algorithm showed that this setting had no impact on the outcome at all. For this reason, clustering will be performed with the minimum error tolerance set to 0.001 (the default value).

Maximum iterations. This setting specifies the maximum number of iterations for the k-Means algorithm. The value must be between 2 and 30. Clustering will be performed with the number of iterations set to 30.
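The interplay between the last two settings determines when the k-Means refinement stops. A minimal sketch of the convergence loop follows; the parameter names mirror the settings above but are not Oracle's internal identifiers:

```python
import numpy as np

def run_kmeans(X, centroids, max_iterations=30, min_error_tolerance=0.001):
    """Iterate until the percentage change in total within-cluster error
    drops below the tolerance or the iteration cap is reached.
    `centroids` is the initial centroid array, updated in place."""
    prev_error = None
    labels = np.zeros(len(X), dtype=int)
    for _ in range(max_iterations):
        # Assign every pattern to its nearest centroid (Euclidean distance).
        d = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=-1)
        labels = d.argmin(axis=1)
        error = float((d[np.arange(len(X)), labels] ** 2).sum())
        # Converged: relative change in error below the tolerance.
        if prev_error is not None and \
                abs(prev_error - error) / max(prev_error, 1e-12) < min_error_tolerance:
            break
        prev_error = error
        # Recompute each centroid as the mean of its assigned patterns.
        for k in range(len(centroids)):
            if np.any(labels == k):
                centroids[k] = X[labels == k].mean(axis=0)
    return labels, centroids
```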

A pattern is affiliated to a cluster only if the probability of affiliation is larger than 50 %. Patterns for which the probability of affiliation is less than 50 % are not affiliated to any cluster, in order to avoid affiliating a pattern to several clusters with the same probability. In most cases this occurs when the number of clusters in the model exceeds the number of natural groupings in the data set; there are then several "most likely" affiliations, each with a probability of less than 50 %. Aside from this, incident patterns would normally be affiliated with a probability of less than 50 %, in that they do not belong to any cluster per se. This approach has been chosen in order to keep track of the affiliations as the number of clusters in the model is increased, which otherwise would not be feasible. The probabilities are assigned provisionally, only for evaluating the validity of the emerged clusters and for determining the optimal number of clusters.
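The 50 % rule can be expressed compactly. A sketch, assuming `probabilities` is a mapping from cluster IDs to the scored values of P(cluster|record):

```python
def affiliate(probabilities, threshold=0.5):
    """Return the most likely cluster only when its probability clears the
    50 % threshold; otherwise the pattern (e.g. an incident pattern, or a
    pattern torn between several clusters) stays unaffiliated."""
    cluster, p = max(probabilities.items(), key=lambda kv: kv[1])
    return cluster if p > threshold else None
```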

10.2.3 Concluding remarks

The main drawback of using ODM is the lack of adequate documentation on how the clustering algorithms work in practice. Oracle provides only limited informative material, which for the most part does not go beyond a very basic introduction to the algorithms, making the initial start cumbersome. The lack of elaborate documentation prevented the use of the O-Cluster algorithm, as the algorithm produced clusters that were deemed incorrect. For this reason, it was decided to investigate the enhanced k-Means algorithm further.
