**3. The Structure from Motion problem**

**3.14 Camera calibration**

Camera calibration refers to the estimation of the intrinsic and/or extrinsic parameters of the camera. The intrinsic parameters correspond to the physical camera parameters (focal length, principal point, skew), while the extrinsic ones describe the position and orientation of the camera in the world coordinate system. In practice, these parameters can be estimated by knowing the correspondence between a set of 3D points and their projections in the images, and solving a system of linear equations based on the camera model.

Many calibration techniques exist. A common approach for most of them is to use a calibration object with known geometry, and then to compute the calibration parameters from a set of images of this calibration object. For example, in [31] the calibration object is a cube with circular patterns on its sides, while in [32, 33] the calibration object is a planar pattern of squares.

One of the problems with the perspective camera model is that it only approximates the image formation process of a real camera. In general this model is good enough to describe most cameras. However, the optics of real cameras can be very complex, and in many cases the light rays have to pass through multiple lenses in order to form the image. In this situation the light rays coming from a certain point of the scene do not project to the prescribed geometric position [2]. While in the ideal case the light rays pass through the optical center in straight lines, in reality the optics introduce a nonlinear distortion into the optical path [31]. When high accuracy is required, the camera model has to account for these distortions induced by the camera optics and correct them.

The most commonly used corrections are for radial distortions and decentering distortions [34]. The radial distortion is the displacement of the image points radially in the image plane, relative to the center of the image. The decentering distortion appears when the centers of curvature of the lenses are not collinear, and has both a radial and a tangential component [35].

Thus, a point in the image plane undergoes further transformations.

The following notations are made: [u₀ v₀]ᵀ is the principal point, [u_d v_d]ᵀ are the distorted coordinates, and [u_c v_c]ᵀ are the corrected coordinates. Let ũ = u_d − u₀, ṽ = v_d − v₀ and r² = ũ² + ṽ².

The radial distortion is:

δu_r = ũ (k₁r² + k₂r⁴ + k₃r⁶ + …)
δv_r = ṽ (k₁r² + k₂r⁴ + k₃r⁶ + …)

The tangential distortion is:

δu_t = 2p₁ũṽ + p₂(r² + 2ũ²)
δv_t = p₁(r² + 2ṽ²) + 2p₂ũṽ

k₁, k₂, … are radial distortion coefficients and p₁, p₂ are tangential distortion parameters.

Then:

[u_c v_c]ᵀ = [u_d + δu_r + δu_t,  v_d + δv_r + δv_t]ᵀ

In practice, only the first two or three radial distortion coefficients and the first two tangential distortion coefficients are estimated.
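As a concrete sketch, the correction above can be applied per point. The following is a minimal illustration of the radial-plus-tangential model with two coefficients of each type; the function name and the convention that the displacement is added to the distorted coordinates are assumptions for this example, not taken from the text.

```python
def correct_distortion(u_d, v_d, u0, v0, k=(0.0, 0.0), p=(0.0, 0.0)):
    """Correct a distorted image point (u_d, v_d).

    (u0, v0) is the principal point, k = (k1, k2) the radial
    coefficients, p = (p1, p2) the tangential (decentering) ones.
    """
    x, y = u_d - u0, v_d - v0            # coordinates relative to the principal point
    r2 = x * x + y * y                   # squared radius
    radial = k[0] * r2 + k[1] * r2 * r2  # k1*r^2 + k2*r^4
    du_r, dv_r = x * radial, y * radial  # radial displacement
    du_t = 2 * p[0] * x * y + p[1] * (r2 + 2 * x * x)  # tangential displacement
    dv_t = p[0] * (r2 + 2 * y * y) + 2 * p[1] * x * y
    return u_d + du_r + du_t, v_d + dv_r + dv_t
```

With all coefficients zero the point is returned unchanged, and a point located exactly at the principal point is never displaced, since r = 0 there.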

### The Structure from Motion problem 49

Most of the calibration applications take into account these distortions and evaluate the distortions coefficients together with the other camera parameters.

The calibration procedure using planar calibration patterns became very popular because it is easy to use and accurate at the same time, and it is supported by many software libraries. Probably the most widely used are the Matlab Camera Calibration Toolbox [70] and the Intel OpenCV library [71]. Both of them use a planar checkerboard pattern. While the Matlab Toolbox seems a little less appealing because it needs a lot of user interaction in the corner extraction process, the OpenCV calibration method is completely automatic, once the user has provided a set of images of the calibration pattern.

A large amount of work has been dedicated to the calibration of endoscopic cameras [36-43]. Characteristic of the rod lenses of endoscopic cameras is their wide field of view. This in general introduces significant radial distortions, which have to be considered in all applications based on images taken with endoscopic cameras.

**Chapter 4**

**Features detection and tracking**

It was shown in the previous chapters that finding correspondences between different images is the key point in many computer vision applications, and recovering the 3D structure of the scene and the camera motion is only one of them. Other applications are camera calibration, image registration, object recognition, robot navigation, indexing and database retrieval, to name only a few. It is clear then that the results produced by these applications cannot be better than the matching methods themselves. The process of relating images of the same scene is known as feature detection and matching. Feature detection refers to finding interest regions in images, while feature matching refers to the process of relating these regions across different images. When the images represent the frames of a video sequence, the matching process is referred to as feature tracking.

As feature detection is the first step in so many computer vision applications, a very large number of feature detection algorithms have been developed over time. Relating all the points in the images can be a very difficult, or even impossible, and computationally expensive task. As a result, in practice only a relatively small number of correspondences are detected between different images. Thus the reconstruction is rather a sparse set of 3D points. In some cases this is not sufficient to reconstruct full surface models. For specific applications a dense model can be obtained by 3D interpolation. There are also methods to obtain dense reconstructions starting from a small set of correspondences (see for example [ch3-2]).

**4.1 ** **Definition of a feature **

There is no exact definition of a feature. Rather, this notion is general and related to a certain application. A feature can then represent any kind of information that can be extracted from an image and that is relevant for solving the given task. In general, features can encapsulate the result of a general neighborhood operation applied to an image, or can represent certain structures present in the image, like points, edges, or connected regions.

A feature has to be:

• Localized

• Meaningful

• Detectable

An important property of features is their repeatability rate, which refers to the percentage of correspondences found in different images. In order to be “good”, a feature has to be distinctive enough to be detected in more than one image. Another property of a feature is its computational complexity. This is important, for example, in real-time applications, where there is a time constraint and, as a consequence, the features have to be easy to extract. In some cases, the computational complexity of features can limit the detection process to only certain regions of the images.

Features should be reasonably tolerant to a certain level of noise in image. As the illumination conditions can change in different images, they should be invariant to these changes. Ideally the features have to be also invariant to scaling, rotation and perspective distortions.

Feature detection is a low-level image processing operation. When feature detection is based on a neighborhood operation, a local decision is made at every image point, whether or not a certain type of feature is present there.

Additional robustness may be gained in some applications if two or more different types of features are extracted from the images.


**4.2 ** **Types of features **

Features extracted from images can be:

• **Corners / interest points**: In the literature, the notions of corner and interest point are often used interchangeably. Traditionally, a corner is defined as the intersection of two edges, or a point with two different edge directions in its neighborhood. An interest point, on the other hand, can be any point with a well-defined position in the image that can be robustly detected. Early algorithms performed corner detection by finding edges in a first step and then looking for points on the edges with rapid changes in direction. Modern algorithms no longer rely on edge detection; for example, corners can be detected by looking for high levels of curvature in the image gradients. Points other than traditional corners can be detected by these kinds of methods. They are interest points, even if by tradition they are still named corners.

• **Edges**: An edge is defined as a boundary between two regions in the image. In practice, edges are detected as sets of points with high gradient magnitude.

• **Blobs** (also called regions of interest): The notion of blob refers to a meaningful region in the image, and sometimes blobs correspond to objects in the image. Usually blobs are associated with a certain point (e.g. the center of mass or the local maximum of an operator response).

• **Ridges**: Ridges are used to describe elongated objects, and they normally occur along the center of such objects. Ridges can be considered a generalization of the medial axis. In general it is more difficult to extract these features than corners, lines or blobs.

Once a feature has been detected, an image patch (neighborhood) around the feature can be used to define some attributes, or feature descriptors, forming a so-called feature vector. The feature detection step itself can provide some of these attributes, e.g. the magnitude of the gradient. The descriptor has to be distinctive and robust to noise and to geometric and photometric deformations. The matching is often based on a distance between the feature vectors, e.g. the Mahalanobis or Euclidean distance. The dimension of the descriptor has a direct impact on the time this task takes.

When talking about feature detectors and their applicability to different problems, one should consider the possible transformations of the images that can occur in a sequence of images. The transformations can be either geometric or photometric. The geometric transformations can be: rotation, similarity (rotation and uniform scale), and affine (rotation and non-uniform scale, e.g. the perspective effect). Photometric transformations refer to the changes in illumination conditions that can occur from one image to another.

Selecting a feature detector depends not only on the content of the images but also on the way the images are captured. For example, in a sequence of images captured with a video camera, it is not very probable that the image changes too much from one frame to another. In this case the scale and perspective effects may not need to be considered. On the other hand, if the images are taken separately from very different points of view, then the shape of the features can change significantly, and affine transformations should be taken into account. Affine-invariant feature detectors are in general complex and may require high computational costs. Sometimes, selecting features that are only rotation and scale invariant offers a good compromise.

**4.3 ** **Comparing image regions **

In some cases, the feature matching process requires comparing image regions. This is typically done using two measures: dissimilarity based on the sum of squared differences (SSD) and similarity based on normalized cross-correlation (NCC) [2].

For a region W in image I and corresponding region T(W) in image J, the dissimilarity is defined as:

D = ∫∫_W [J(T(x, y)) − I(x, y)]² w(x, y) dx dy  (4.1)

where w(x, y) is a weighting function, typically constant and equal to 1, or a Gaussian function to emphasize the central area of the region.
The similarity is:

NCC = ∫∫_W [J(T(x, y)) − J̄][I(x, y) − Ī] dx dy / √( ∫∫_W [J(T(x, y)) − J̄]² dx dy · ∫∫_W [I(x, y) − Ī]² dx dy )

where Ī and J̄ are the mean intensities in the regions. Subtracting the mean intensity at every location of the image regions makes this measure invariant to intensity changes.
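In discrete form, with the weighting w ≡ 1 and the region already warped, the two measures reduce to a few NumPy lines. This is a sketch; the function names are ours:

```python
import numpy as np

def ssd(patch_i, patch_j):
    """Discrete sum-of-squared-differences dissimilarity (cf. equation 4.1)."""
    return float(np.sum((patch_j - patch_i) ** 2))

def ncc(patch_i, patch_j):
    """Normalized cross-correlation between two regions of the same size."""
    a = patch_i - patch_i.mean()  # subtracting the mean gives intensity invariance
    b = patch_j - patch_j.mean()
    return float(np.sum(a * b) / np.sqrt(np.sum(a * a) * np.sum(b * b)))
```

Identical patches give SSD 0 and NCC 1; adding a constant intensity offset to one patch leaves NCC unchanged, which is exactly the invariance described above.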

**4.4 Harris corner detector**

One of the most widely used interest point detectors is the Harris corner detector [45]. The basic idea is simple: if we consider a small window centered on the point, then shifting the window in any direction should result in large changes in intensity. In contrast, there is no change in the case of a smooth region and, if the point is located on an edge, there is no change along the edge direction.

For a shift [u, v] of the window, the change in intensity can be expressed by the dissimilarity as:

E(u, v) = Σ_{x,y} w(x, y) [I(x + u, y + v) − I(x, y)]²  (4.4)

where w is a weighting function, normally chosen as w = 1, or a Gaussian function to emphasize the center of the window.

*Figure 4.1 Two principal signal changes in the vicinity of a point. *


When the shift is small, equation (4.4) can be approximated by:

E(u, v) ≈ [u v] M [u v]ᵀ

where M (2×2) is called the second-moment matrix (or auto-correlation matrix) and is computed from the image derivatives:

M = Σ_{x,y} w(x, y) [ I_x²  I_xI_y ; I_xI_y  I_y² ]

The point is considered a feature or not depending on the eigenvalues of the matrix M. The eigenvalues correspond to the two principal signal changes in the neighborhood of the point. If E(u, v) = 1 is the ellipse centered on the point, then the eigenvectors give the orientation of the ellipse, and the eigenvalues determine the lengths of the ellipse axes (proportional to λ^{−1/2}), as shown in Figure 4.1. If both eigenvalues are very small, then there is no feature at the considered point. If one of them is very small and the other one is a large positive value, then an edge was found. Two large, distinct positive values correspond to a corner.

In practice it is often desirable to avoid the complexity associated with the computation of the eigenvalues of the matrix M. The corner response function is then defined as:

R = det(M) − k (trace(M))² = λ₁λ₂ − k (λ₁ + λ₂)²

The response function depends only on the eigenvalues of the matrix M. The corner response function is large and positive for a corner, large and negative for an edge, and close to zero for smooth regions.

The corners can be detected by thresholding the response function (R > threshold) and finding its local maxima.

The Harris corner detector is invariant to rotation (the shape of the ellipse depends only on the eigenvalues), and it is also invariant to changes in intensity, since it is based on derivatives. However, this detector is not invariant to image scale.
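The whole pipeline (gradients, windowed second-moment matrix, response R = det(M) − k·trace(M)²) fits in a short NumPy sketch. The 3×3 box window and k = 0.04 are common but assumed choices for this illustration:

```python
import numpy as np

def box3(a):
    """Average over a 3x3 window (edge-padded) - the weighting w of the window."""
    p = np.pad(a, 1, mode="edge")
    h, w = a.shape
    return sum(p[i:i + h, j:j + w] for i in range(3) for j in range(3)) / 9.0

def harris_response(img, k=0.04):
    """Corner response R = det(M) - k * trace(M)^2 at every pixel."""
    iy, ix = np.gradient(img.astype(float))  # image derivatives
    ixx, iyy, ixy = box3(ix * ix), box3(iy * iy), box3(ix * iy)
    det = ixx * iyy - ixy * ixy              # lambda1 * lambda2
    trace = ixx + iyy                        # lambda1 + lambda2
    return det - k * trace * trace
```

On a synthetic image containing one bright square, R is positive at the square's corners, negative along its edges, and zero in flat areas, matching the classification by eigenvalues given above.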


**4.5 ** **The KLT tracker **

The KLT (Kanade-Lucas-Tomasi) tracker [57, 58, 64] is one of the most popular tracking algorithms. For a sequence of images, the algorithm first detects a set of interest points, each of which is then tracked in the following frames.

One of the principles this method is based on is to find features that are optimal for tracking. The detection stage is identical to the Harris corner detector (based on the eigenvalues of the second-moment matrix), but a different criterion is used to select features. The authors suggest that features selected with min(λ₁, λ₂) > threshold are the ones that can be tracked well. The algorithm is able to determine an affine transformation that maps the neighborhood of a feature point in one image to the corresponding one in the next image by minimizing the dissimilarity between these image regions (see equation 4.1). When the distance between frames is small, the transformation is a simple translation.

When a feature is lost (i.e. it cannot be tracked anymore), the algorithm is able to find a new one, in order to keep the number of tracked features constant. To keep the algorithm fast, there is a limit on the maximum allowed displacement between frames.

**4.6 ** **Scale invariant feature detectors **

Sometimes, a desirable property for a feature detector is to be invariant to scale changes in images. The basic idea in finding a scale invariant feature point is to define a function on a variable size region around the feature point (e.g. the average intensity) and to find the local maximum of this function. The region size found in this way is invariant to scale changes. Then the feature is described together with its characteristic scale.

The concept of automatic scale selection was proposed by [52]. The scale space of an image is obtained by convolving the image with Gaussian kernels at different scales.

L(x, y, σ) = G(x, y, σ) * I(x, y)  (4.8)

where I is the image, G is the Gaussian kernel, and “*” denotes convolution.

In [53] a scale invariant feature detector is introduced, namely Harris-Laplace.

The feature points are detected with the Harris detector, and then corresponding scales are determined for each feature point using the Laplacian of Gaussian:

σ² |L_xx(x, y, σ) + L_yy(x, y, σ)|

The characteristic scale is selected by searching for a local maximum of the Laplacian over scales. The authors showed that the characteristic scale is relatively independent of the image scale and that the ratio of the characteristic scales for two images is equal to the scale factor between the images.
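A small SciPy sketch of this scale selection (the function name is ours; `gaussian_laplace` computes L_xx + L_yy at scale σ):

```python
import numpy as np
from scipy.ndimage import gaussian_laplace

def characteristic_scale(img, x, y, scales):
    """Scale maximizing the normalized Laplacian sigma^2 * |Lxx + Lyy| at (x, y)."""
    responses = [s ** 2 * abs(gaussian_laplace(img.astype(float), s)[y, x])
                 for s in scales]
    return scales[int(np.argmax(responses))]
```

For a Gaussian blob of width σ_b, the response at the blob center peaks at σ ≈ σ_b, so the selected scale tracks the size of the structure, consistent with the behavior described above.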

In [49] a similar feature detector is presented, but this time the features are located using the determinant of the Hessian matrix.

H(x, y, σ) = [ L_xx(x, y, σ)  L_xy(x, y, σ) ; L_xy(x, y, σ)  L_yy(x, y, σ) ],  det(H) = L_xx L_yy − L_xy²

Due to the second derivatives in the Hessian matrix, this detector gives strong responses for blobs and ridges. The shape of an elliptical region is determined with the second-moment matrix of the intensity gradient, and the scale with the Laplacian. This detector is known as the Hessian-Laplace feature detector.

*Figure 4.2 Comparing one dimensional Laplacian and Difference of Gaussians *
In [54] the feature points and their associated scales are found by looking for local maxima of the difference of Gaussians in scale and space:

D(x, y, σ) = (G(x, y, kσ) − G(x, y, σ)) * I(x, y) = L(x, y, kσ) − L(x, y, σ)

The difference of Gaussians approximates the Laplacian (see Figure 4.2), and can be computed more efficiently.

The loss of accuracy introduced by this approximation is compensated by the gained speed of the detection process. The detector is called SIFT (Scale Invariant Feature Transform), and it comes together with a very efficient feature descriptor. The combination of the SIFT feature detector and descriptor is widely used in many computer vision problems.
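How good the approximation is can be checked numerically: the DoG of an image at scales (σ, kσ) stays very close, up to the factor (k − 1), to the scale-normalized Laplacian σ²∇²G applied to the same image. A short SciPy sketch under those assumptions:

```python
import numpy as np
from scipy.ndimage import gaussian_filter, gaussian_laplace

rng = np.random.default_rng(0)
img = rng.random((64, 64))
sigma, k = 2.0, 1.6  # k = 1.6 is the scale ratio commonly used by SIFT

# D(x, y, sigma) = L(x, y, k*sigma) - L(x, y, sigma)
dog = gaussian_filter(img, k * sigma) - gaussian_filter(img, sigma)
# (k - 1) * sigma^2 * (Lxx + Lyy)
log = (k - 1) * sigma ** 2 * gaussian_laplace(img, sigma)

# the two band-pass responses are almost identical up to a small residual
corr = np.corrcoef(dog.ravel(), log.ravel())[0, 1]
```

The correlation between the two responses is close to 1, which is the numerical counterpart of the one-dimensional comparison in Figure 4.2.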

In [51] it is shown that the features detected with the Harris-Laplace detector are more repeatable than those found by other detectors.

Another scale and rotation invariant interest point detector and descriptor was recently introduced in [46]. The detector is called SURF (Speeded Up Robust Features), and it is based on a measure of the Hessian matrix and a distribution-based descriptor. As the name suggests, the detector is focused on speed. According to the experiments performed by the authors, the performance of the detector/descriptor is very similar to, and even outperforms, many of the previously proposed detectors/descriptors with respect to repeatability, distinctiveness, and robustness.

**4.7 ** **Affine invariant region detectors **

The affine-invariant feature detectors deal with view-point changes in different images. Several affine-invariant region detectors have been proposed in literature.

In [51, 49], two affine invariant region detectors are constructed on top of the Harris-Laplace and Hessian-Laplace detectors. The shape of the affine region is determined with the second-moment matrix and then normalized to a circular one.

In [44] two affine invariant region detectors are presented. The first one is geometry-based. Some anchor points are detected in the images with the Harris corner detector [45], and edges close to the anchor points are extracted with the Canny edge detector [47]. Two points are moved along the edges until they reach a position where some photometric measures of the parallelogram region spanned by them together with the anchor point go through an extremum (see Figure 4.3).

In the second method the anchor points are local intensity extrema. An intensity function along the rays emanating from this point is evaluated (see Figure 4.4).

Local extrema of this function give the points of the region border.

*Figure 4.3 Edge based region detector*

The function evaluated along the rays is:

f(t) = |I(t) − I₀| / max( (1/t) ∫₀ᵗ |I(s) − I₀| ds, d )

where I₀ is the intensity at the extremum, t is a parameter along the ray, I(t) is the intensity at position t, and d is a small number to avoid division by zero.
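Along one sampled ray this function reduces to a running mean; the function name and the sampling convention below are ours:

```python
import numpy as np

def ray_function(intensities, i0, d=1e-6):
    """f(t) = |I(t) - I0| / max(mean over [0, t] of |I(s) - I0|, d).

    `intensities` are the samples I(t) along the ray, starting at the
    extremum; `i0` is the intensity at the extremum itself.
    """
    dev = np.abs(np.asarray(intensities, float) - i0)
    running_mean = np.cumsum(dev) / np.arange(1, len(dev) + 1)
    return dev / np.maximum(running_mean, d)
```

The function spikes where the intensity suddenly departs from its average behavior along the ray, which is where the region border is placed.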

The regions detected with this method can have arbitrary shape, but they are approximated with ellipses. If f is the characteristic function of a region (1 inside, 0 outside), the geometric moments corresponding to a region R are:

m_pq = ∫∫_R x^p y^q f(x, y) dx dy

The geometric moments up to the second order are computed for the regions. Then the ellipse that approximates a region has the same geometric moments as the original region.
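As a sketch, the equivalent ellipse of a binary region can be computed from the central moments up to second order. The function name is ours, and the factor 2 (which converts the eigen-variances of the covariance into semi-axis lengths, exact for a uniformly filled ellipse) is an assumed convention:

```python
import numpy as np

def equivalent_ellipse(mask):
    """Ellipse with the same moments (up to order 2) as the binary region."""
    ys, xs = np.nonzero(mask)
    cx, cy = xs.mean(), ys.mean()   # first-order moments: the centroid
    mu20 = ((xs - cx) ** 2).mean()  # central second-order moments
    mu02 = ((ys - cy) ** 2).mean()
    mu11 = ((xs - cx) * (ys - cy)).mean()
    cov = np.array([[mu20, mu11], [mu11, mu02]])
    evals, evecs = np.linalg.eigh(cov)  # ascending eigenvalues
    axes = 2.0 * np.sqrt(evals)         # semi-axes of a uniform ellipse
    return (cx, cy), axes, evecs
```

For an axis-aligned 30×10 rectangle the recovered centroid is the rectangle's center and the axis ratio is close to 3.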

Another affine invariant region detector is the salient region detector proposed in [48], which maximizes the entropy within elliptical regions centered on a point.

*Figure 4.4 Intensity based region detector*

A different approach was used to develop the Maximally Stable Extremal Region (MSER) detector [50], which performed very well in comparative studies [49]. The detected regions are connected components having the property that all the pixels inside the region are either brighter or darker than the pixels outside the boundary. The regions are detected by optimizing the threshold selection process. A maximally stable region corresponds to a