Bundle Adjustment - Methods for Structure from Motion

14.7.2 Parametrization Ambiguity

There is also an ambiguity in the parametrization of the deforming structure, but this does not represent an indeterminacy in the structure of the solution. These ambiguities are found by rearranging (14.2)

Here ∗ denotes an arbitrary value, and the form of the top row is implied by the coefficient of M_{µj} being 1. These parametrization ambiguities account for r(r + 1) degrees of freedom.

It is noted that in PCA these parametrization ambiguities are addressed by maintaining the canonical form of the eigenvectors derived from the eigenvalue decomposition.

14.7.3 Implications

The ambiguities of (14.13), as mentioned, have the effect of reducing the observability of the system. The translation ambiguity implies that it is impossible to detect whether the camera or the 3D structure, or a mixture of both, moved. The scaling ambiguity indicates that a deformation of the object consisting solely of a scaling cannot be distinguished from a movement of the camera, i.e. a close small object cannot be distinguished from a far large one. As for the parametrization ambiguities, they are just ambiguities of the notation and as such have no physical interpretation. The total number of degrees of freedom is seen to be r² + 5r + 7, whereof 4r + 7 are related to observability, as summed up in Table 14.1.
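The counting above can be checked with a few lines of Python. This is only a bookkeeping sketch: the function name `ambiguity_dof` and the dictionary keys are illustrative and do not come from the text; the individual counts are those stated in Table 14.1.

```python
def ambiguity_dof(r):
    """Degrees of freedom of the ambiguities for r modes (hypothetical helper)."""
    similarity = 7                 # Euclidean similarity transform
    translation = 3 * r            # translation ambiguity
    scaling = r                    # scaling ambiguity
    observability = similarity + translation + scaling   # 4r + 7
    parametrization = r * (r + 1)                        # r^2 + r
    return {
        "observability": observability,
        "parametrization": parametrization,
        "total": observability + parametrization,        # r^2 + 5r + 7
    }

dof = ambiguity_dof(3)
print(dof)
```

For r = 3 modes this gives 19 degrees of freedom related to observability and 31 in total.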

14.8 Bundle Adjustment

Ambiguity                          dof
Euclidean Similarity Transform     7
Translation Ambiguity              3r
Scaling Ambiguity                  r
Observability                      4r + 7
Parametrization Ambiguity          r² + r
Total                              r² + 5r + 7

Table 14.1: Degrees of freedom (dof) as a function of the number of modes, r.

Similarly to traditional bundle adjustment [185, 211], we propose minimizing the reprojection errors in the images, given the observation model (14.1). This yields a "gold standard solution" [89]. Due to the nature of the problem, a non-linear numerical approach is needed to perform the minimization, for which we use the algorithm of Levenberg and Marquardt [123, 129].

It is assumed that the cameras are calibrated, but the same framework and approach would work in the uncalibrated case as well. Each camera is parameterized with a rotation matrix and the coordinates of the camera center. The structure is parameterized as in (14.2).
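As an illustration of the quantity being minimized, the following sketch computes reprojection residuals for one frame. It rests on assumptions that are not stated in this section: the structure is taken to be the mean shape plus a weighted sum of modes, the camera is a calibrated pinhole camera working in normalized image coordinates, and all function and variable names are hypothetical.

```python
import numpy as np

def structure(M_mu, modes, beta_i):
    """Assumed linear model: mean shape (3, n) plus weighted modes (r, 3, n)."""
    return M_mu + np.tensordot(beta_i, modes, axes=1)

def project(R, c, X):
    """Calibrated pinhole projection with rotation R and camera center c."""
    Xc = R @ (X - c[:, None])      # points in the camera frame, shape (3, n)
    return Xc[:2] / Xc[2]          # normalized image coordinates, shape (2, n)

def reprojection_residuals(R, c, M_mu, modes, beta_i, observed):
    """Stacked differences between projected and observed image points."""
    return (project(R, c, structure(M_mu, modes, beta_i)) - observed).ravel()

rng = np.random.default_rng(0)
M_mu = rng.normal(size=(3, 5))             # mean shape, 5 points
modes = rng.normal(size=(2, 3, 5))         # r = 2 deformation modes
beta = np.array([0.3, -0.1])               # mode weights for this frame
R, c = np.eye(3), np.array([0.0, 0.0, -10.0])
observed = project(R, c, structure(M_mu, modes, beta))   # noise-free data
res = reprojection_residuals(R, c, M_mu, modes, beta, observed)
```

A Levenberg-Marquardt solver would be handed such residuals stacked over all frames, with the rotations, camera centers, modes and weights as unknowns.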

14.8.1 Regularization

The scaling and translation ambiguities imply that the different time instances of the object cannot be aligned properly, making the results hard to interpret. A natural way to restrict these ambiguities is to impose a cost for two consecutive object instances to differ, i.e.

where δ is a small constant, e.g. δ = 10⁻⁴. This prior states that if there is an ambiguity in how the 3D structure should move relative to the world coordinate system and in scale, then it should stay stationary. This is the regularization used in [6].
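A penalty of this kind could be sketched as follows. The precise form of the cost is an assumption made here for illustration, namely δ times the sum of squared differences between consecutive structure instances; the function name and the array layout are likewise hypothetical.

```python
import numpy as np

def temporal_prior_cost(structures, delta=1e-4):
    """Assumed penalty on change between consecutive instances.

    structures has shape (T, 3, n): T time instances of n 3D points.
    """
    diffs = np.diff(structures, axis=0)    # S_i - S_{i-1}, shape (T-1, 3, n)
    return delta * np.sum(diffs ** 2)

stationary = np.zeros((3, 3, 2))
moving = stationary.copy()
moving[1] += 1.0                           # middle frame shifted by one unit
cost_stationary = temporal_prior_cost(stationary)
cost_moving = temporal_prior_cost(moving)
```

A stationary structure incurs no cost, so among solutions that fit the images equally well the optimizer is nudged towards the one where the structure moves least.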

As mentioned in Section 14.6, shrinkage of the structure could also be applied in order to address possible over-parametrization. This is done by adding the following cost

$$\rho \sum_{i,k} \beta_{ik}^2 + \sum_{k} \left( \|M_k\|_F - 1 \right)^2 , \qquad (14.16)$$

where ||·||_F denotes the Frobenius norm, and ρ is a small constant dependent on the image noise and the size of the β_ik. This shrinkage imposes a small penalty on more complex models, whereby the optimization will 'try' to express as much as possible by camera motion.
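The shrinkage cost (14.16) is simple to evaluate numerically. The sketch below assumes that the weights β_ik are stored as a (frames × modes) array and that each mode M_k is a matrix; the function name and the example values are illustrative only.

```python
import numpy as np

def shrinkage_cost(beta, modes, rho):
    """rho * sum of beta_ik^2 plus sum over modes of (||M_k||_F - 1)^2."""
    coeff_term = rho * np.sum(beta ** 2)
    # np.linalg.norm of a 2D array defaults to the Frobenius norm.
    norm_term = sum((np.linalg.norm(M_k) - 1.0) ** 2 for M_k in modes)
    return coeff_term + norm_term

beta = np.array([[1.0, 2.0], [3.0, 4.0]])   # 2 frames, 2 modes
M1 = np.zeros((3, 4)); M1[0, 0] = 1.0        # Frobenius norm 1
M2 = np.zeros((3, 4)); M2[0, 0] = 2.0        # Frobenius norm 2
cost = shrinkage_cost(beta, [M1, M2], rho=0.01)
```

The first term shrinks the mode weights, while the second keeps each mode close to unit norm so that the weights cannot be made arbitrarily small by simply rescaling the modes.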

14.8.2 Convergence

An unsolved issue is the convergence of the non-linear optimization. When using simulated data, we are not always able to reach the global minimum. Our hypothesis is that this is due either to local minima in the objective function or, more likely, to the objective function being extremely flat or ill-conditioned near the optimum. This is especially a problem if the model contains more modes than the data, i.e. r is too large. Solving this problem is the primary focus of our further research on this topic, since it is needed to rigorously validate the approach, as mentioned e.g. in Section 14.6.

A likely solution to this convergence problem is a more intelligent fixing of the gauge freedoms, as described in [43, 44, 132]. This is likely to make the objective function better conditioned. Apart from this, using the methods of [43, 44, 132] for gauge fixing should also give a much lower variance on the estimated structure and motion.
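The link between unfixed gauge freedoms and conditioning can be illustrated on a toy problem. The sketch below is not the gauge fixing of [43, 44, 132]; it only shows, under the assumption of a linear residual with a redundant parameter, that the Gauss-Newton normal matrix becomes singular, and how a generic Levenberg-Marquardt damped step still yields a usable update. The helper `lm_step` and the toy data are hypothetical.

```python
import numpy as np

def lm_step(J, residual, lam):
    """One damped step: solve (J^T J + lam * I) step = -J^T residual."""
    n = J.shape[1]
    normal = J.T @ J + lam * np.eye(n)
    return np.linalg.solve(normal, -J.T @ residual)

# Two parameters with identical effect on the residual: a toy 'gauge freedom'.
J = np.array([[1.0, 1.0], [2.0, 2.0], [3.0, 3.0]])
b = np.array([1.0, 2.0, 3.0])
x = np.zeros(2)
residual = J @ x - b

rank = np.linalg.matrix_rank(J.T @ J)      # rank-deficient normal matrix
step = lm_step(J, residual, lam=1e-3)
new_residual = J @ (x + step) - b
```

The damping makes the system solvable, but the flat direction remains in the objective; removing it by fixing the gauge is what would improve the conditioning itself.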

It should be noted that the convergence problem has not been so severe as to prevent good results from being achieved. The numerical optimization is just not good enough for us to make a fully rigorous evaluation, since it is e.g. impossible to conclude whether a worse fit is due to a change in the data or to a shortcoming of the numerical optimization.

14.9 Experimental Results

As part of validating the proposed framework, we ran it on real data, specifically 16 frames of a skeleton doll, as illustrated in Figures 14.2, 14.3, 14.4 and 14.5. This experiment was made in conjunction with [6], and as such used the accompanying model selection approach discussed in Section 14.6. The result of the reconstruction is shown in Figure 14.5, where the estimated structure for all the frames is depicted. It is seen that the resulting structures correspond well with the motion performed by the skeleton doll. To give the reader a feeling for the modes, the three chosen modes as well as the mean M_µ are illustrated in Figure 14.4. The Root Mean Square (RMS) errors between the measured and reprojected points are 3.0 and 1.9 pixels after the affine factorization and after the bundle adjustment, respectively. Considering that the features were tracked by hand in images of size 960 × 1280 pixels, the resulting errors are indeed plausible.
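For reference, an RMS reprojection error of this kind can be computed as sketched below. The convention is an assumption: the error is taken here as the square root of the mean squared Euclidean distance per point, whereas a per-coordinate convention would differ by a constant factor. The function name and the sample points are illustrative.

```python
import numpy as np

def rms_reprojection_error(measured, reprojected):
    """RMS of point-wise Euclidean distances; inputs have shape (n, 2)."""
    squared_dist = np.sum((measured - reprojected) ** 2, axis=1)
    return np.sqrt(np.mean(squared_dist))

measured = np.array([[0.0, 0.0], [0.0, 0.0]])
reprojected = np.array([[3.0, 4.0], [0.0, 0.0]])
rms = rms_reprojection_error(measured, reprojected)
```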

As a more rigorous test, we simulated images of a shaking house, see Figure 14.7. This enabled us to use a known ground truth for evaluation. Among other things, we evaluated the approach by increasing the noise and comparing the estimated fit to the ground truth. The estimated fit was evaluated by calculating the Procrustes distance, cf. [57]³, between the estimated structure and the ground truth. This was done for each time instance, and the resulting distances were averaged.

A result is seen in Figure 14.6, where the number of modes was held fixed at the correct number of two varying modes. It is noted that the proposed scheme degrades gracefully as the data gets worse, i.e. as the noise level increases. However, it should also be noted that with no noise the fit is not perfect, i.e. the distance is not 0 up to machine precision. This has to do with the convergence problems discussed in Section 14.8. The problem became even worse when an automatic model selection scheme was used, in that a wrong choice of modes made the convergence properties worse.

³ The Procrustes distance between two shapes is the mean squared distance between them after they have been aligned with a Euclidean similarity transform and normalized. This is a standard measure within shape statistics.
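A minimal sketch of such a Procrustes distance is given below. It assumes one common variant: both shapes are centered, scaled to unit Frobenius norm, and aligned with the optimal proper rotation found by a singular value decomposition. The exact normalization used in [57] may differ, and the function name is hypothetical.

```python
import numpy as np

def procrustes_distance(A, B):
    """Mean squared point distance after similarity alignment of B to A.

    A and B have shape (n, 3) with corresponding rows.
    """
    A0 = A - A.mean(axis=0)                    # remove translation
    B0 = B - B.mean(axis=0)
    A0 = A0 / np.linalg.norm(A0)               # remove scale
    B0 = B0 / np.linalg.norm(B0)
    U, _, Vt = np.linalg.svd(B0.T @ A0)        # optimal rotation via SVD
    d = np.sign(np.linalg.det(U @ Vt))         # exclude reflections
    R = U @ np.diag([1.0, 1.0, d]) @ Vt
    return np.mean(np.sum((B0 @ R - A0) ** 2, axis=1))

rng = np.random.default_rng(1)
A = rng.normal(size=(10, 3))
theta = 0.7
Rz = np.array([[np.cos(theta), -np.sin(theta), 0.0],
               [np.sin(theta),  np.cos(theta), 0.0],
               [0.0, 0.0, 1.0]])
B = 2.5 * A @ Rz.T + np.array([1.0, -2.0, 3.0])   # similarity-transformed copy
C = rng.normal(size=(10, 3))                      # unrelated shape
dist_same = procrustes_distance(A, B)
dist_diff = procrustes_distance(A, C)
```

In the evaluation described above, such a distance would be computed per time instance between the estimated structure and the ground truth, and then averaged.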

Figure 14.2: A frame of the skeleton sequence, with arrows denoting the way it moves.

Figure 14.3: The features of the skeleton sequence and the lines connecting them, for comprehensible 3D illustrations.


14.10 Conclusion and Discussion

A conclusion of this as yet unfinished work is that a seemingly good theoretical treatment of the problem has been developed. This includes identification of the added ambiguities and some seemingly fruitful ideas on the model selection problem. However, the transition to practical implementation is not as straightforward as we had expected, specifically with regard to the convergence of the non-linear optimization.

As such, the next step in our investigation of the deformable structure from motion problem is to examine better numerical methods. Without better convergence, it is extremely hard, if not impossible, to fully validate the algorithm and to develop a fully integrated system including model selection. The reason is that without a better converging non-linear optimization, it is unknown whether a poor estimate is due to a suboptimal approach or just to poor convergence in the given instance.

When this issue has been resolved, we are also planning to further investigate approaches for model selection.

However, the preliminary experimental results are seen to give a reasonable solution to the problem, yielding a crude experimental validation of the overall approach.

Acknowledgement

We would like to thank Poul Thyregod for invaluable help and discussion in regard to our variance estimation. We would also like to thank Bjarne Ersbøll for discussions on the model selection issue.

Figure 14.4: The modes of the skeleton sequence. (Upper Left) The mean, M_µ. The dotted lines in the following three panels denote the deformation from the mean captured by a given mode. (Upper Right) Mode 1. (Lower Left) Mode 2. (Lower Right) Mode 3.


Figure 14.5: The reconstructed 3D structure of the skeleton doll captured in the skeleton sequence, shown in 16 panels, Frame 1 through Frame 16. See Figure 14.3 for an interpretation of the dots and lines.

Figure 14.6: Reconstruction results for increasing noise using the house sequence, see Figure 14.7. The added noise is Gaussian, where the ratio between the image size and the standard deviation is denoted as a percentage along the abscissa. The abscissa ranges from 0 to 0.02.

Figure 14.7: Two sample instances of the house structure used for simulation.

Chapter 15

Structure Estimation and Surface Triangulation of Deformable Objects

by: Charlotte Svensson, Henrik Aanæs and Fredrik Kahl

Abstract

A system is developed that, from an image sequence of a deformable object, automatically extracts features and tracks them through the sequence, estimates the non-rigid 3D structure and finally computes a surface triangulation. The camera motion is also recovered. The object is assumed to deform according to a linear model, while the motion of the camera can be arbitrary. No domain-specific prior of the object is required.

For the structure estimation a two-step approach is used, where we first obtain an initial estimate of the structure and motion, and then obtain an optimal solution via a non-linear optimization scheme. The triangulation is optimized to yield a non-rigid faceted surface that well approximates the true 3D surface.

15.1 Introduction

The estimation of structure and motion from image sequences is one of the most studied problems within computer vision. However, almost all the efforts in this area have dealt with rigid objects. Since the world is not a rigid place, it is important to have a system working for deforming objects as well. A common approach to the non-rigid problem is to use a prior model of the object, for example when human body or facial motion is studied [186, 182, 120].

We do not use a prior model, but employ the Principal Component Analysis (PCA) framework, whereby the object is assumed to deform according to a linear model. This type of model is fairly general and has proven to be very effective for expressing many types of deforming objects, e.g. [45]. In the works of [25, 209] such a linear model was used and the structure was estimated via a factorization algorithm. We extend this approach by applying a modified bundle adjustment algorithm to minimize the ML-error.

However, the main novelty compared to previous work is the improved surface modeling. We use the optimized structure to compute a non-rigid surface triangulation, using an approach similar to that of Morris & Kanade [137].
