
and maximizing $\varphi(\mathbf{d})$ gives the direction $\mathbf{d}_1$ with the maximum variation between group means relative to the within-group variation. Here $\mathbf{d}_1$ is the eigenvector corresponding to the largest eigenvalue of $\mathbf{A}$ with respect to $\mathbf{W}$. The first canonical discriminant function is $\mathbf{d}_1^T\mathbf{x}$, and further discriminant functions can be found by maximizing $\varphi(\mathbf{d})$ under the constraint that $\mathbf{d}_i^T\mathbf{x}$ is uncorrelated with the previous variables. This process of computing linear combinations can be continued until the number of canonical discriminant functions equals

$$\min(k-1,\, p).$$
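In practice the canonical directions can be obtained from the generalized eigenvalue problem $\mathbf{A}\mathbf{d} = \lambda\mathbf{W}\mathbf{d}$. The sketch below illustrates this numerically; it is not the PROC CANDISC implementation used in this thesis, and the names `A`, `W` and `n_classes` are assumptions standing in for the between-group and within-group sums-of-squares matrices defined earlier.

```python
import numpy as np
from scipy.linalg import eigh

def canonical_directions(A, W, n_classes):
    """Directions maximizing between-group relative to within-group variation.

    A : (p, p) between-group sums-of-squares matrix (assumed symmetric)
    W : (p, p) within-group sums-of-squares matrix  (assumed positive definite)
    Returns at most min(n_classes - 1, p) directions, largest eigenvalue first.
    """
    p = A.shape[0]
    # Generalized symmetric eigenproblem A d = lambda W d;
    # eigh returns the eigenvalues in ascending order.
    eigvals, eigvecs = eigh(A, W)
    order = np.argsort(eigvals)[::-1]      # largest eigenvalue first
    n_funcs = min(n_classes - 1, p)        # number of canonical functions
    return eigvecs[:, order[:n_funcs]]     # columns d_1, d_2, ...
```

The canonical discriminant scores of an observation are then simply the projections $\mathbf{d}_i^T\mathbf{x}$ onto the returned columns.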

CDA is robust to mild deviations from the normality that is implicitly assumed. However, nonlinearity or clustering of the observations can degrade its performance.

In this project it is assumed that the observations are mutually independent and that the distribution within each class is multivariate normal. The CDA computations have been performed with PROC CANDISC from the SAS package [72].

3.6 Classification

Classification is the process of sorting observations, e.g. pixels, into a finite number of individual categories based on a set of criteria. If a pixel satisfies one of these criteria, it is assigned the corresponding class label. There are basically two types of classification algorithms: the supervised and the unsupervised classifier.

The supervised classifier Multiple Discriminant Analysis (MDA) and the unsupervised classifier Cluster Analysis (CA) are described in Sections 3.6.1 and 3.6.2 below. In this thesis these two types of classifiers are applied in the analyses of the multivariate restored EMISAR data and their performances are compared.

3.6.1 Multiple discriminant analysis

In Discriminant Analysis (DA) the classes are determined beforehand in terms of training samples or areas. The objective is to determine the combination of variables that best discriminates between two or more classes. In the case of more than two classes the classification is often referred to as Multiple Discriminant Analysis (MDA). DA is used for several purposes, e.g. to investigate differences among classes or to discard variables that are little related to group distinctions.

Choice

In this context MDA is performed on the multivariate restored EMISAR data using training areas. The training areas are here taken to represent different physical properties of the test sites, such as variations in biomass, vegetation characteristics and soil moisture content. For an introduction to DA refer to Anderson (2003) [3]. A Danish introduction to DA and related topics can be found in Conradsen (1984) [17].

3.6.1.1 Introduction

The problem of classification is to classify the observation $\mathbf{x}$, a vector of $p$ measured or derived features characterizing the observation.

The conditional density function

$$f_i(\mathbf{x}) = p(\mathbf{x} \mid \pi_i), \quad i = 0, 1, \ldots, k,$$

specifies the probability that the observation $\mathbf{x}$ belongs to the class variable $\pi_i$. In situations where prior experience or belief about the class variable $\pi_i$ is available, the a priori probability of class $\pi_i$ is denoted

$$p(\pi_i) = p_i, \quad i = 0, 1, \ldots, k.$$


Through Bayes's theorem it is now possible to express the a posteriori probability

$$p(\pi_i \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid \pi_i)\, p(\pi_i)}{\sum_{j=1}^{k} p(\mathbf{x} \mid \pi_j)\, p(\pi_j)} = \frac{p_i f_i(\mathbf{x})}{h(\mathbf{x})},$$

which denotes the probability that the class is $\pi_i$ given the observation $\mathbf{x}$. The denominator $h(\mathbf{x})$ serves as a normalization factor that ensures that the a posteriori probabilities sum to unity. The importance of Bayes's theorem is that it makes it possible to take optimal decisions based on knowledge of the conditional density function $f_i(\mathbf{x})$ and the prior probability $p_i$.
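As a concrete illustration, the sketch below evaluates these posteriors for multivariate normal class-conditional densities. The function and argument names (`class_means`, `class_covs`, `priors`) are illustrative assumptions, not part of the thesis.

```python
import numpy as np
from scipy.stats import multivariate_normal

def posteriors(x, class_means, class_covs, priors):
    """a posteriori probabilities p(pi_i | x) via Bayes's theorem.

    The class-conditional densities f_i(x) are here assumed multivariate normal.
    """
    # Numerators p_i * f_i(x) for every class.
    numer = np.array([
        p * multivariate_normal.pdf(x, mean=m, cov=c)
        for m, c, p in zip(class_means, class_covs, priors)
    ])
    # Dividing by h(x) makes the posteriors sum to unity.
    return numer / numer.sum()
```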

In many applications there may be serious consequences if a wrong classification is made, e.g. classifying an image of a tumour as normal. We therefore define a loss function $L(j, i)$, $i \neq j$, as shown in Table 3.1. Here it appears that no loss is incurred when the right choice is made, whereas a penalty is associated with wrong decisions.

Using the loss function in Table 3.1, the expected loss $E_x\{L(j, i)\}$ for choosing class $\pi_i$ is given by

$$E_x\{L(j, i)\} = L(1, i)\, p(\pi_1 \mid \mathbf{x}) + L(2, i)\, p(\pi_2 \mid \mathbf{x}) + \ldots + L(k, i)\, p(\pi_k \mid \mathbf{x}) = -\frac{1}{h(\mathbf{x})}\, S_i,$$

where the discriminant score $S_i$ is

$$S_i = -L(1, i)\, p_1 f_1(\mathbf{x}) - L(2, i)\, p_2 f_2(\mathbf{x}) - \ldots - L(k, i)\, p_k f_k(\mathbf{x}).$$

The Bayes solution of the classification problem is to choose the class $\pi_i$ that minimizes the expected loss $E_x\{L(j, i)\}$, which is equivalent to maximizing $S_i$. In other words, we choose $\pi_\nu$ if

$$S_\nu \geq S_i, \quad \forall i.$$
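Continuing the hypothetical sketch above, the decision rule amounts to choosing the index that maximizes $S_i$. The loss-matrix argument below is an assumption standing in for Table 3.1, which is not reproduced here.

```python
import numpy as np
from scipy.stats import multivariate_normal

def bayes_decision(x, class_means, class_covs, priors, loss):
    """Choose the class nu that maximizes the discriminant score S_nu.

    loss[j, i] is L(j, i), the loss of choosing class i when j is the true
    class; its diagonal is zero, so correct decisions incur no loss.
    """
    # Unnormalized posteriors p_j * f_j(x) for every class j.
    pf = np.array([
        p * multivariate_normal.pdf(x, mean=m, cov=c)
        for m, c, p in zip(class_means, class_covs, priors)
    ])
    # S_i = -sum_j L(j, i) p_j f_j(x); maximizing S_i minimizes the expected loss.
    scores = -loss.T @ pf
    return int(np.argmax(scores))
```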

If all losses $L(j, i)$, $i \neq j$, are equal, then the discriminant score $S_i$ can be simplified to

$$S_i' = p_i f_i(\mathbf{x}),$$

which corresponds to choosing the class with the maximum a posteriori probability.

Supported by the central limit theorem and due to its analytical simplicity, the multivariate normal distribution is a common choice for expressing the class-conditional density function of the feature vector $\mathbf{x}$. When equal losses and known priors are considered, we can therefore apply the logarithm to the discriminant score $S_i'$, which leads to the quadratic discriminant function of the form

$$S_i = -\frac{1}{2}\log(\det\boldsymbol{\Sigma}_i) - \frac{1}{2}(\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i) + \log p_i,$$

where the common factor $(2\pi)^{-p/2}$ is neglected. In the case of equal covariance matrices ($\boldsymbol{\Sigma}_i = \boldsymbol{\Sigma}$) the discriminant function is linear. The decision boundaries for classification are given by the regions where the discriminant functions are equal.
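A minimal sketch of this quadratic discriminant function is given below, with the class means, covariance matrices and priors as assumed inputs; in the thesis the corresponding computation is carried out by PROC DISCRIM, not by this code.

```python
import numpy as np

def quadratic_score(x, mu_i, sigma_i, p_i):
    """Quadratic discriminant score S_i for a single class.

    mu_i : class mean vector, sigma_i : class covariance matrix, p_i : prior.
    The constant common to all classes is omitted, as in the text.
    """
    diff = x - mu_i
    sign, logdet = np.linalg.slogdet(sigma_i)            # log(det Sigma_i), numerically stable
    mahalanobis = diff @ np.linalg.solve(sigma_i, diff)  # (x - mu_i)^T Sigma_i^{-1} (x - mu_i)
    return -0.5 * logdet - 0.5 * mahalanobis + np.log(p_i)

def classify(x, means, covs, priors):
    """Assign x to the class with the largest quadratic discriminant score."""
    scores = [quadratic_score(x, m, c, p) for m, c, p in zip(means, covs, priors)]
    return int(np.argmax(scores))
```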

In many applications a rejection threshold is introduced in the classification procedure. Hereby it is possible to exclude observations that are too uncertain, i.e. too far from the known classes. In image analysis this is useful in situations where one extrapolates from the training areas to the rest of the image. This is conveniently done using the Mahalanobis distance

$$D_i = (\mathbf{x}-\boldsymbol{\mu}_i)^T\boldsymbol{\Sigma}_i^{-1}(\mathbf{x}-\boldsymbol{\mu}_i),$$

which is the distance from the observation $\mathbf{x}$ to the class mean $\boldsymbol{\mu}_i$ with respect to $\boldsymbol{\Sigma}_i^{-1}$. Surfaces of constant probability in (3.11) are hyper-ellipsoids on which $D_i$ is constant.

In this thesis the individual within-group covariance matrices $\boldsymbol{\Sigma}_i$ are estimated from training data. It is here assumed that the observations are mutually independent and that the distribution within each class is multivariate normal. The discriminant function to be used is quadratic, with equal prior probabilities and losses. Furthermore, the rejection threshold is set to the commonly used p-value of 0.05. The supervised classification is carried out using PROC DISCRIM from the SAS package [72].
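Under the normality assumption, $D_i$ is (approximately, since the parameters are estimated) chi-square distributed with $p$ degrees of freedom when $\mathbf{x}$ belongs to class $\pi_i$, so a p-value threshold of 0.05 translates into a cut-off on the Mahalanobis distance. A hedged sketch of such a rejection test (not the PROC DISCRIM internals) could look as follows:

```python
import numpy as np
from scipy.stats import chi2

def reject(x, mu_i, sigma_i, p_value=0.05):
    """Reject x as too far from class i at the given p-value threshold.

    Under the normality assumption the Mahalanobis distance D_i is
    approximately chi-square distributed with p degrees of freedom,
    p being the number of features.
    """
    diff = x - mu_i
    d_i = diff @ np.linalg.solve(sigma_i, diff)      # Mahalanobis distance D_i
    threshold = chi2.ppf(1.0 - p_value, df=x.size)   # e.g. 95% quantile for p = 0.05
    return d_i > threshold                           # True => observation is rejected
```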

3.6.2 Cluster analysis

It is usually expected that observations that are far apart belong to different classes or clusters, while observations that are close to each other are assigned to the same cluster. In Cluster Analysis (CA) the clusters are not predetermined and the aim is therefore to determine the best way of grouping data.

The clustering method to be focused upon in this thesis is MacQueen's non-hierarchical clustering algorithm k-means [47]. The k-means algorithm is chosen because we wish to study the natural grouping of the multivariate restored EMISAR data using different numbers of clusters.

In Baggesen, Nielsen and Larsen (2001) initial exploratory analysis of multivariate image data is carried out using a SMAF transformation and a fuzzy k-means algorithm [5]. For a thorough treatment of CA refer to Anderberg (1973) [1].

3.6.2.1 Introduction

MacQueen's k-means algorithm, which is a nearest centroid sorting method, can be described as follows: a number of clusters $k$ is chosen in advance and a set of points called cluster seeds is selected. In order to form temporary clusters, each observation is assigned to the nearest seed. Next, new seed points are generated by replacing the old seed points with the mean vectors (centroids) of the temporary clusters. This procedure is repeated until no further regrouping of the data occurs.

If a given seed point is chosen as the mean vector $\bar{\mathbf{x}}_j$ of a cluster $j$, then the sum of squared deviations between the seed point and the $N_j$ data units in the cluster will be a unique minimum. Naturally, the algorithm therefore seeks to partition the data points into $k$ disjoint subsets in such a way that the total within-group sum of squares

$$E = \sum_{j=1}^{k}\sum_{i=1}^{N_j} (\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)^T(\mathbf{x}_{ij} - \bar{\mathbf{x}}_j)$$

is minimized. The mean vector $\bar{\mathbf{x}}_j$ is

$$\bar{\mathbf{x}}_j = \frac{1}{N_j}\sum_{i=1}^{N_j}\mathbf{x}_{ij},$$

and we note that it is the Euclidean distances between the centroid of cluster $j$ and its $N_j$ data units that are used.
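A compact sketch of this nearest-centroid sorting procedure is given below. It is not the PROC FASTCLUS implementation used in the thesis, and the seed selection here is simply random.

```python
import numpy as np

def k_means(X, k, max_iter=100, rng=None):
    """MacQueen-style k-means: assign to the nearest seed, replace seeds by centroids.

    X : (N, p) array of observations.  Returns (centroids, labels).
    """
    rng = np.random.default_rng(rng)
    # Initial cluster seeds: k observations drawn at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(max_iter):
        # Assign every observation to the nearest seed (Euclidean distance).
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Replace the seeds by the centroids of the temporary clusters.
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):   # no further regrouping
            break
        centroids = new_centroids
    return centroids, labels
```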

It can be shown that the total number of different ways a set of $N$ data units can be partitioned into $k$ clusters is a finite number given by

$$S = \frac{1}{k!}\sum_{i=0}^{k} (-1)^{k-i}\binom{k}{i}\, i^{N},$$

which is a Stirling number of the second kind. The algorithm is therefore guaranteed to reach the global minimum if the $S$ partitions are generated [1]. We see that $S$ becomes very large for large data sets, and algorithms such as simulated annealing and gradient descent methods are consequently often used in the process of searching for the global minimum.
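To get a feel for how quickly $S$ grows, the formula can be evaluated directly with exact integer arithmetic; the small computation below is illustrative and not from the thesis.

```python
from math import comb, factorial

def stirling2(N, k):
    """Number of ways to partition N data units into k non-empty clusters."""
    return sum((-1) ** (k - i) * comb(k, i) * i ** N for i in range(k + 1)) // factorial(k)

# Even modest data sets make exhaustive enumeration hopeless:
print(stirling2(10, 3))    # 9330 partitions
print(stirling2(100, 5))   # about 6.6e67 partitions
```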

The software used in our application for disjoint clustering of the restored EMISAR data is PROC FASTCLUS developed by SAS [72]. This procedure is chosen because of its ability to find good clusters in very large data sets with only a few passes over the data. The program uses the nearest centroid sorting method introduced by Anderberg (1973) and described above. Here the analysis is based on Euclidean distances and each observation is assigned to one and only one cluster. The FASTCLUS procedure combines an effective method for finding initial clusters with a standard iterative algorithm for minimizing the sum of squared distances from the cluster means.

Chapter 4

Markov random fields

Many of the tasks in computer vision can be regarded as optimization problems with respect to a function depending on one or more variables. This function, which is often referred to as the cost or objective function of the problem, serves as a heuristic guide in the search for the optimal solution.

Unfortunately these optimization problems are not easy to solve to optimality. Firstly, an exhaustive search is not feasible due to the often very large search space. Secondly, the objective function cannot be expressed in a closed form. However, the optimal solution can be located by resorting to numerical optimization techniques.

The optimization problem in relation to image restoration can be formulated as the minimization of an energy function. In this case the energy function corresponds to the cost or objective function of the problem. The energy function is typically non-convex, having a large number of local minima. Normally these local minima are due to artifacts and noise in the image and are therefore viewed as sub-optimal solutions. The ultimate goal, however, is to retrieve the global energy minimum, which corresponds to the optimal solution.

4.1 Introduction

In this chapter two different optimization schemes based upon Markov Random Fields (MRFs) are described. The optimization schemes presented are applied in Chapter 5 to synthetic SAR data and EMISAR data using various a priori models.

The first technique is the local optimizer Iterated Conditional Modes (ICM) suggested by Besag (1986) [7]. It is a quick deterministic algorithm which, being based on a down-hill strategy, is unfortunately likely to get trapped in a sub-optimal solution. The second optimization scheme is Simulated Annealing (SA). SA is a stochastic method proposed independently by Kirkpatrick (1983) and Cerny (1985) for global optimization [43], [13]. SA also relies on a down-hill strategy, but instead of performing gradient descent, successful applications of SA rely on random search methods such as the Metropolis algorithm or the Gibbs sampler [50], [29]. In contrast to ICM, SA is therefore able to escape local minima and thereby finally reach the global minimum.
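The key ingredient that lets SA escape local minima is the stochastic acceptance of energy-increasing moves. A minimal sketch of the Metropolis acceptance rule at temperature $T$ is given below; the visiting scheme and cooling schedule are treated in Section 4.4, and the function name and arguments here are illustrative assumptions.

```python
import numpy as np

def metropolis_accept(delta_e, temperature, rng=np.random.default_rng()):
    """Accept or reject a proposed state change with energy difference delta_e.

    Down-hill moves (delta_e <= 0) are always accepted; up-hill moves are
    accepted with probability exp(-delta_e / T), which allows the sampler
    to escape local minima.  As T -> 0 the rule becomes purely greedy (ICM-like).
    """
    if delta_e <= 0:
        return True
    return rng.random() < np.exp(-delta_e / temperature)
```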

In Section 4.2 the theory of Markov random fields and Gibbs random fields is reviewed, the definition of cliques is given, and Bayes's rule is outlined. Section 4.3 contains a description of Iterated Conditional Modes (ICM), and in Section 4.4 the concepts of Simulated Annealing (SA) are outlined.