
The common approach to speaker change detection is that of a comparison between subsequent segments of a fixed length [54][111][105]. This approach, however, is saddled with an inherent problem, namely the ability to detect short speaker turns, requiring short analysis windows, versus the ability to avoid false positives, requiring long analysis windows, see section 1.1.3. The literature proposes several solutions to this dilemma.

• Algorithms that successively build larger segments rather than comparing equal-sized segments [42].

• Methods for running several analysis window lengths in parallel and then combining the results [7].

• Iterative methods, where the algorithm is run recursively with successively longer analysis windows on the successively smaller set of possible change-points in order to reject false alarms. This class of methods has been termed False Alarm Compensation, FAC, and is employed by [54], where the false alarm compensation step was run only once.
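As a point of reference, the fixed-length comparison underlying these approaches can be sketched as below; features is assumed to be a sequence of MFCC frames and distance_fn any of the metrics described in section 3.2 (the names are illustrative, not the project's code).

```python
import numpy as np

def sliding_metric(features, win_len, hop, distance_fn):
    """Score the dissimilarity between two adjacent fixed-length windows
    at every hop along the feature stream (the basic comparison approach)."""
    positions, scores = [], []
    for start in range(0, len(features) - 2 * win_len + 1, hop):
        left = features[start:start + win_len]
        right = features[start + win_len:start + 2 * win_len]
        positions.append(start + win_len)        # boundary frame under test
        scores.append(distance_fn(left, right))  # e.g. a KL-based distance
    return np.asarray(positions), np.asarray(scores)
```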

This project employs a variant of the FAC method employed in Jørgensen et al. [54], the difference being an investigation into the benefit of a multistep procedure, here termed Recursive FAC, RFAC, see sections 2.4.1 and 5.2.

This approach has a number of free parameters: a buffer variable equal to T_i, see section 2.3, used as a buffer zone around a hypothesized change-point; a threshold gain parameter, α_FAC; and finally T_max, the maximum amount of data used, out of the available data, on each side of a hypothesized change-point.

α_FAC is a gain applied to the moving-average threshold calculated on the metric, M_FAC, similar to the variable α_cd, see section 2.3. A hypothesized change-point is rejected if its metric value falls below α_FAC times this moving-average threshold. Only two thresholds exist: the change-detection threshold, which is the moving average of the metric scaled by the gain α_cd, and the same moving average scaled by the gain α_FAC.

The threshold gain parameter, α_FAC, is trained empirically through a search of the free-parameter space for the maximal F-measure, see section 3.3.
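A minimal sketch of these two thresholds, assuming the change-detection metric and the FAC metric are available as one-dimensional arrays sampled at the frame positions of interest (the function and variable names are illustrative, not taken from the project):

```python
import numpy as np

def moving_average(metric, window):
    """Centred moving average of a metric, used as the adaptive baseline."""
    kernel = np.ones(window) / window
    return np.convolve(metric, kernel, mode="same")

def detect_change_points(metric, alpha_cd, window):
    """Change detection: flag local peaks of the metric that exceed
    alpha_cd times its moving average."""
    threshold = alpha_cd * moving_average(metric, window)
    return [i for i in range(1, len(metric) - 1)
            if metric[i] > threshold[i]
            and metric[i] >= metric[i - 1] and metric[i] > metric[i + 1]]

def fac_reject(fac_metric, candidates, alpha_fac, window):
    """Standard FAC: reject candidates whose FAC metric falls below
    alpha_fac times the corresponding moving-average threshold."""
    threshold = alpha_fac * moving_average(fac_metric, window)
    return [cp for cp in candidates if fac_metric[cp] > threshold[cp]]
```

In this picture, training α_FAC amounts to sweeping a grid of gain values and keeping the one that gives the highest F-measure on the annotated training data.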

In Jørgensen et al. [55] it is argued that, in order to maintain correlation between the FAC threshold and the change-point threshold, T_max (the maximum analysis window) and the length of the moving average must be identical across algorithms.

T_max has the additional purpose of maintaining speaker-turn purity, i.e. if a speaker change-point is missed during the first iteration, the resulting speaker-turn data would contain multiple speakers. T_max counteracts this issue by using the data closest to the hypothesized change-point in question, thus using as much data from the true speaker as possible, and completely purifying the data if the true speaker turn is longer than T_max.
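The role of T_max and the buffer can be illustrated with a small helper that selects the frame ranges used on each side of a hypothesized change-point; the neighbouring candidates act as hard limits, so data never crosses another potential turn. This is a sketch under assumed boundary handling, with t_buffer standing in for the T_i buffer of section 2.3, not the project's implementation.

```python
def fac_windows(cp, prev_cp, next_cp, t_max, t_buffer, n_frames):
    """Frame ranges used on each side of the hypothesized change-point cp.

    At most t_max frames are taken per side, a buffer of t_buffer frames is
    left out around cp itself, and the windows never extend past the
    neighbouring candidates or the signal edges. All values are frame counts.
    """
    left_end = max(cp - t_buffer, 0)
    left_start = max(prev_cp, left_end - t_max, 0)
    right_start = min(cp + t_buffer, n_frames)
    right_end = min(next_cp, right_start + t_max, n_frames)
    return (left_start, left_end), (right_start, right_end)
```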

2.4.1 Hybrid method

This section slightly modifies the basic FAC method described in section 2.4 by proposing a robust novel approach in which metrics are combined to compensate for their dissimilar weaknesses. A novel approach is proposed because a literature search failed to yield any existing solutions to the problems described in this section.

In section 4.1.3.1, it is shown that combining different metrics could potentially increase performance, the reason being that they might complement each other on several key points. As mentioned in section 2.4, the idea of a false alarm compensation paradigm is to incrementally increase the amount of data available to the metric in order to progressively refine its estimate at a potential change-point, using the earlier iterations to determine how much data is actually available. During these iterations, potential change-points that are found to be false alarms are deleted, freeing up data for neighbouring potential change-points.
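The recursive variant can be sketched as follows, reusing the fac_windows helper from section 2.4; threshold_at stands in for whichever threshold applies (the scaled moving average for standard FAC, or the constant noise floor introduced below for the combined-metric variant) and, like the other names, is illustrative.

```python
def recursive_fac(metric_fn, frames, candidates, threshold_at,
                  t_max, t_buffer, max_passes):
    """RFAC sketch: re-test every surviving candidate with as much data as is
    currently available; rejecting a candidate frees its data for the
    neighbours on the next pass. Stops when a pass rejects nothing."""
    survivors = sorted(candidates)
    for _ in range(max_passes):
        kept = []
        for i, cp in enumerate(survivors):
            prev_cp = survivors[i - 1] if i > 0 else 0
            next_cp = survivors[i + 1] if i + 1 < len(survivors) else len(frames)
            (l0, l1), (r0, r1) = fac_windows(cp, prev_cp, next_cp,
                                             t_max, t_buffer, len(frames))
            if metric_fn(frames[l0:l1], frames[r0:r1]) > threshold_at(cp):
                kept.append(cp)
        if kept == survivors:   # nothing rejected, so no extra data is freed
            break
        survivors = kept
    return survivors
```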

The metrics, see section 3.2, vary in computational complexity and performance, with the computationally heavy metrics generally outperforming the light metrics, see section 4.1.3. Using a light metric to quickly detect a large number of potential change-points, and then subsequently applying a heavy metric on the small data set to filter out the false alarms, therefore seems logical.

This combined-metric approach, however, has a serious issue at its core. In the standard FAC paradigm, the FAC threshold is merely a scaled version of the threshold applied in the change-detection step, see section 2.4. As the same metric is usually applied to both change detection and FAC, the only difference is in the amount of data the metric is working with. In other words, the standard FAC paradigm assumes that the FAC metric takes roughly the same range of values as the change-detection metric, irrespective of the amount of data used, and that the peaks are merely scaled versions depending on the amount of data.

When combining metrics, both of these assumptions fail and must be handled.

The novel approach proposed here involves discarding the notion of using the moving-average of the change-detection step in the FAC step altogether.

Ideally, the FAC metric is only evaluated at potential change-points; this, however, poses a problem, as a reference is required in order to determine a threshold against which the FAC metric can be compared. Since one of the strengths of the combined-metric approach is that the metrics treat the data differently, i.e. accentuate different aspects, using a threshold based on a scaled version of the change-detection metric is invalid.

What is proposed instead is a heuristic for estimating a noise floor of the FAC metric through information gathered from the change-detection step. The basic assumption is that if the change-detection metric has a low threshold, thereby proposing a large number of potential change-points, then a position far from any potential change-point is very unlikely to actually be a change-point.

The Combined Metric FAC, CMFAC, is therefore given the list of potential change-points along with a list of not-change-points. It is then tasked with identifying which of the potential change-points are different enough from the not-change-points. These not-change-points are chosen as close as possible to the centre between the proposed change-points.

This novel CMFAC method applies its particular metric (RuLSIF is used in section 5.2) to each of the not-change-points, using as much data as possible up to T_max, in exactly the same fashion as the standard FAC method, see section 2.4. This yields a list of values that probably correspond to the noise floor of the CMFAC metric. The list might, however, still contain change-points or other rare events; this is handled by taking the median of the list, thereby acquiring a single value very likely to be representative of the noise floor.

This single value is then used as a constant threshold against which the gain, α_FAC, is trained, see section 3.3. The hope is that the disadvantage of trading the moving-average threshold of standard FAC for the constant median threshold of the CMFAC approach will be outweighed by the benefit of a more expensive and more thorough metric, one that does not have to correspond to the metric used in the change-detection step.
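The noise-floor heuristic and the resulting constant threshold can be sketched as below, again reusing fac_windows; metric_fn stands in for the expensive CMFAC metric (RuLSIF in section 5.2, not implemented here), and treating the signal edges as additional boundaries when placing the not-change-points is an assumption of this sketch.

```python
import numpy as np

def cmfac_noise_floor(metric_fn, frames, candidates, t_max, t_buffer):
    """Evaluate the expensive metric at not-change-points placed midway
    between consecutive candidates and return the median of the values."""
    bounds = [0] + sorted(candidates) + [len(frames)]
    values = []
    for lo, hi in zip(bounds[:-1], bounds[1:]):
        mid = (lo + hi) // 2                     # not-change-point
        (l0, l1), (r0, r1) = fac_windows(mid, lo, hi, t_max, t_buffer, len(frames))
        values.append(metric_fn(frames[l0:l1], frames[r0:r1]))
    return np.median(values)

def cmfac(metric_fn, frames, candidates, alpha_fac, t_max, t_buffer):
    """Keep candidates whose CMFAC metric exceeds alpha_fac times the floor."""
    floor = cmfac_noise_floor(metric_fn, frames, candidates, t_max, t_buffer)
    ordered, kept = sorted(candidates), []
    for i, cp in enumerate(ordered):
        prev_cp = ordered[i - 1] if i > 0 else 0
        next_cp = ordered[i + 1] if i + 1 < len(ordered) else len(frames)
        (l0, l1), (r0, r1) = fac_windows(cp, prev_cp, next_cp,
                                         t_max, t_buffer, len(frames))
        if metric_fn(frames[l0:l1], frames[r0:r1]) > alpha_fac * floor:
            kept.append(cp)
    return kept
```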

This median approach is chosen because it enables the algorithm to be applied even in the case of news editing, where only a few speaker changes occur. The CMFAC method is applied in section 5.2, with KL and RuLSIF used as the change-detection and CMFAC metrics, respectively. There the performance is compared to standard FAC using KL with KL and RuLSIF with RuLSIF.
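Tying the pieces together, a hypothetical two-stage pipeline in the spirit of section 5.2 could be wired up as below; kl_distance and rulsif_metric are placeholder callables for the actual metrics, and all parameter names are assumptions of this sketch.

```python
def two_stage_scd(frames, kl_distance, rulsif_metric, win_len, hop,
                  alpha_cd, alpha_fac, ma_window, t_max, t_buffer):
    """Cheap metric proposes candidates, expensive metric filters them
    through CMFAC (reusing the sketches from the previous sections)."""
    positions, scores = sliding_metric(frames, win_len, hop, kl_distance)
    picked = scores > alpha_cd * moving_average(scores, ma_window)
    candidates = [int(p) for p in positions[picked]]   # candidate frame indices
    return cmfac(rulsif_metric, frames, candidates, alpha_fac, t_max, t_buffer)
```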

Methodology

This chapter will delineate the theoretical groundwork. To begin with, the gap between the concept of metrics and the actual sound is bridged by a short introduction using a spectrogram and a graphical representation of the MFCCs.

Once this is in place, the independent approaches to SCD are described in detail; these include Vector Quantization, Gaussian-based approaches and Relative Density Ratio Estimation.

This chapter will then describe the parameter optimisation technique used to train the change-detection and FAC thresholds. These methods are applied in section 4.1.2. Finally, this chapter will conclude with some miscellaneous basic required theory.

3.1 Metric introduction

This section will try to bridge the gap from sound pressure, up the abstraction levels, to the SCD F-measure score obtained using the various methods.

The upper graph seen in figure 3.1 shows a spectrogram of some speech data, i.e. with frequency on the y-axis, time on the x-axis and magnitude encoded in the color scale, going from blue (low) to red (high). The speech is recognisably producing patterns, but obviously distinguishing between speakers is difficult at best.

Using the feature extraction methods described in section 2.2, the MFCCs are obtained and shown in the middle graph, which to a large degree resembles noise but most assuredly contains SCD cues. Slightly confusingly, the frequency scale in the middle graph is an inverted version of that of the upper graph; this is a remnant of an alignment process that was more straightforward this way, and given that the MFCCs are standardized, its relevance is negligible.

The bottom graph displays the metrics of the various methods applied to the MFCCs, and here the speaker changes are obviously very detectable. The horizontal dotted green lines represent actual changes from one speaker to another; the red ones denote changes between files containing the same speaker, which are obviously invalid and should not be detected. The metrics are designed to peak at speaker changes and are independent of the green and violet circles.

These green and violet circles represent the change-detection and FAC algorithms, respectively, applied to the metrics, with a green circle simply signifying that the algorithm thinks it has encountered a speaker change, not that the detection was successful.

The violet circles represent potential speaker change-points, formerly green circles, that the FAC algorithm has flagged as false alarms. Note that a potential change-point must be within 1 second of the actual change to count towards the F-measure score seen in the legend. The F-measure is the harmonic mean of the hits-to-tries ratio and the hits-to-possible-hits ratio.
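Explicitly, writing P for the hits-to-tries ratio (precision) and R for the hits-to-possible-hits ratio (recall), the score reported in the legend is F = 2PR / (P + R).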