
9.7 Voiced/Unvoiced Analysis

The voiced/unvoiced analysis is introduced as a step towards eventually streamlining the number of frames needed to achieve optimal SID performance.

All results have shown that a relatively large number of test data frames are misclassified, i.e. frames that cannot be identified as belonging to the correct speaker are included. The information contained within these frames is thus not speaker-dependent and can therefore be viewed as noise in the SID system.

The analysis is commenced by classifying all the training and test frames as voiced or unvoiced. This is done using the autocorrelation with center clipping method of Section 3.4.1, where a frame is labelled voiced if the autocorrelation function has a peak above 30% of the maximum value found at τ = 0. Any frames not meeting this requirement are classified as unvoiced. Comparing the classification of frames from a test sentence as belonging to different speakers with the same sentence divided into voiced/unvoiced frames may reveal whether a correlation exists between the voicing of a frame and its content of speaker-specific information. For visualization of this comparison, the 13∆PLPCC feature set is used, as it yielded the highest correct frame classification results in the series of tests conducted in Section 9.5. In Figure 9.1, the classification of 800 frames (8 s) of test material using the 13∆PLPCC feature set is shown for all classifiers for Speaker 1, a woman (FAML). The top row of classified frames shows the results of k-NN classification, the second row the MoG model classification, the third row the NN classification, and the bottom row the sequence of voiced/unvoiced decisions for the test sequence. An analogous analysis for a male speaker, Speaker 4 (MASM), is shown in Figure 9.2. The value 0 denotes the unvoiced label, 1 denotes both Speaker 1 and the voiced label, and the numbers 2-6 each correspond to a speaker in the reference set as listed in Chapter 8.
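Before turning to the figures, note that the voicing decision itself is simple to prototype. Below is a minimal sketch of the decision rule just described, assuming a NumPy array holding one frame of samples; the clip fraction and the minimum lag that excludes the region around τ = 0 are illustrative assumptions, not values taken from Section 3.4.1.

```python
import numpy as np

def is_voiced(frame, clip_fraction=0.3, acf_threshold=0.3, min_lag=20):
    """Return True if the frame's center-clipped autocorrelation has a
    peak above acf_threshold times its value at lag tau = 0."""
    frame = np.asarray(frame, dtype=float)
    peak = np.max(np.abs(frame))
    if peak == 0.0:                      # silent frame: treat as unvoiced
        return False
    # Center clipping: zero samples inside the clip band, shift the rest.
    clip = clip_fraction * peak
    clipped = np.where(np.abs(frame) > clip,
                       frame - np.sign(frame) * clip, 0.0)
    # Autocorrelation for non-negative lags only.
    acf = np.correlate(clipped, clipped, mode="full")[len(frame) - 1:]
    if acf[0] <= 0.0:                    # everything was clipped away
        return False
    # Largest peak away from lag 0 (min_lag bounds the pitch search).
    return np.max(acf[min_lag:]) > acf_threshold * acf[0]
```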

Figure 9.1: Classification results for Sp1, 13PLPCC + 13∆PLPCC (rows: k-NN, MoG, NN, V/UV; x-axis: frame index 1-800)

Although it is difficult to draw conclusions from Figures 9.1 and 9.2, a few observations can be made that are relevant for both speakers. Firstly, there is no clear division in the classified frames from any of the classifiers according to the voiced/unvoiced decisions.

However, it can be seen that an incorrectly assigned frame in the speaker identification results is often associated with an unvoiced frame. Wrong classifications are made for voiced frames too, but the difference is that unvoiced frames are almost systematically assigned to the wrong speaker, while misclassification occurs more randomly for the voiced frames. In short, a voiced frame is not certain to be classified correctly, while an unvoiced frame has a high probability of being classified incorrectly.

These results are for one feature set only and depend on the parameters of each classifier, and so cannot be seen as conclusive.

Figure 9.2: Classification results for Sp4, 13PLPCC + 13∆PLPCC (rows: k-NN, MoG, NN, V/UV; x-axis: frame index 1-800)

In order to shed more light on the classification compared with the voiced/unvoiced sequence, the consensus between the results of the two best performing classifiers, the MoG and the NN, is analyzed with respect to the voicing of the frames. The results are shown as correct classifications, so that only two options are permitted: "Correct" and "Incorrect". The voiced and unvoiced labels are drawn in the same colours as the correct and incorrect labels, respectively; this is done only to permit all the sequences to be shown at once, not because voiced frames are considered in any way "correct" and unvoiced ones "incorrect". These results are shown in Figure 9.3 for Speaker 1 and in Figure 9.4 for Speaker 4. The top row in both figures shows the correctly classified frames for the MoG classifier, the second row for the NN classifier, the third row the consensus between these two classifiers and the fourth row the voiced/unvoiced classifications.
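The consensus row is straightforward to compute from the per-frame decisions. The sketch below is a hedged illustration, assuming `mog_pred` and `nn_pred` are arrays of predicted speaker labels (1-6) for the test frames of one speaker; these names are hypothetical.

```python
import numpy as np

def consensus_rows(mog_pred, nn_pred, true_id):
    """Per-frame Correct/Incorrect sequences for the MoG row, the NN
    row and their consensus (both correct on the same frame)."""
    mog_ok = np.asarray(mog_pred) == true_id   # MoG row of the figure
    nn_ok = np.asarray(nn_pred) == true_id     # NN row of the figure
    return mog_ok, nn_ok, mog_ok & nn_ok       # consensus: both correct
```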

Although it remains problematic to observe conclusive trends, there seems to be evidence in Figures 9.3 and 9.4 that while classification tends to be difficult for unvoiced frames, the frames immediately after these are more frequently correctly identified. This may be connected to the theory that a considerable amount of speech information is contained in the acoustic transients of a speech signal [63]. The transients are areas of rapid change in the spectral envelope of a speech signal, and the rich information that they carry may well be speaker-dependent. As it is difficult to verify this from the 800-frame sequences shown above, a few trials are implemented to test whether the theory holds.

Each of the four feature sets is divided into five subsets, as listed below; a sketch of this partitioning follows the list.

1. Voiced (V): contains all the frames classified as voiced

2. Unvoiced (UV): contains all the frames classified as unvoiced

Figure 9.3: Correct classification results for Sp1, 13PLPCC + 13∆PLPCC, including consensus between MoG and NN classifiers (rows: MoG, NN, NN+MoG, V/UV; x-axis: frame index 1-800)

Figure 9.4: Correct classification results for Sp4, 13PLPCC + 13∆PLPCC, including consensus between MoG and NN classifiers (rows: MoG, NN, NN+MoG, V/UV; x-axis: frame index 1-800)

3. Unvoiced-Voiced (UVV): contains only the voiced frames that are preceded by an unvoiced frame

4. Voiced-Unvoiced (VUV): contains only the unvoiced frames that are preceded by a voiced frame

5. ALL: contains all frames not sorted according to voicing labels
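The partitioning can be expressed compactly over the per-frame voicing labels. Below is a minimal sketch, assuming `voiced` is a boolean array with one voicing decision per frame, aligned with the rows of a feature matrix `X` (frames × dimensions); both names are hypothetical.

```python
import numpy as np

def split_by_voicing(X, voiced):
    """Return the five V/UV-based subsets of the feature matrix X."""
    voiced = np.asarray(voiced, dtype=bool)
    idx = np.arange(len(voiced))
    subsets = {
        "ALL": idx,                                  # every frame, unsorted
        "V":   idx[voiced],                          # voiced frames only
        "UV":  idx[~voiced],                         # unvoiced frames only
        # voiced frames preceded by an unvoiced frame
        "UVV": idx[1:][voiced[1:] & ~voiced[:-1]],
        # unvoiced frames preceded by a voiced frame
        "VUV": idx[1:][~voiced[1:] & voiced[:-1]],
    }
    return {name: X[i] for name, i in subsets.items()}
```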

The temporal changes in the speech signal may not always be represented by the transition between a voiced and an unvoiced segment, but this analysis will still provide clues as to how heavily the identification depends on the voiced/unvoiced state of the frames and the order in which these occur. An initial experiment was conducted with the k-NN classifier, which proved incapable of identifying all 6 speakers based on anything but the mixed sequence of frames. All the available material, up to t_train = 10 s, is used in this analysis, and so the limited number of frames in the UVV and VUV sets may cause a decline in identification rate. Despite this, it was observed that for all four feature sets, the percentage of correctly classified frames was highest for the VUV and UVV sets. This can be seen in Figure 9.5.
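For reference, the per-subset correct frame rate plotted in Figure 9.5 can be reproduced in outline with an off-the-shelf k-NN implementation. The sketch below uses scikit-learn as a stand-in; the value of k and the distance metric are assumptions, not the thesis configuration.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def correct_frame_rate(train_X, train_y, test_X, test_y, k=5):
    """Fraction of test frames assigned to the correct speaker,
    evaluated once per subset (ALL, V, UV, UVV, VUV)."""
    knn = KNeighborsClassifier(n_neighbors=k)
    knn.fit(train_X, train_y)            # train_y holds speaker labels 1-6
    return float(np.mean(knn.predict(test_X) == test_y))
```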

Figure 9.5: k-NN results for the voiced/unvoiced analysis (correct frame rate in % for the ALL, V, UV, UVV and VUV subsets of the 12∆MFCC, 12∆LPCC, 12∆wLPCC and 13∆PLP feature sets)

Figure 9.5 shows the same tendency for all feature sets: the lowest rate of correctly classified frames is obtained for the unvoiced frames, while the highest rate is obtained for either the unvoiced-voiced or the voiced-unvoiced feature set. Given the PCA analysis of voiced and unvoiced frames using the 12∆MFCC feature set in Section 3.10, it is not surprising that these subsets do not provide good features for speaker identification. As none other than the complete set of frames yielded a successful identification of all 6 reference speakers, these results merely point to the fact that the areas of transition between voiced and unvoiced frames contain information that is vital for speaker identification, and that classification based on the voiced or unvoiced frames alone performs more poorly than when there is a combination of the two (in the ALL data set).

Classification using the MoG models could not be implemented with the reduced feature sets divided along the lines of the voicing decisions. The division produced very sparse data of very high dimensionality (D = 24 for MFCC and LPCC, and D = 26 for PLPCC), and so the ability of the MoG classifier to model the distributions was greatly reduced. In the few trials that were implemented, the results displayed a high level of instability and always showed an overwhelming bias towards just one speaker. As no additional data is available for the reference speakers, the voiced/unvoiced analysis for the MoG classifier was not pursued further.

The final series of tests is conducted with the NN classifier. Here, the training data sets were limited to just 9 s of speech for each speaker and the test data to 2.5 s. These are the upper bounds set by the smaller feature sets, i.e. the UVV and VUV sets. Using the same amount of data for each feature set provides a platform for a fair comparison of performance results. The first five feature subsets to be implemented are those pertaining to the 12∆MFCC feature vectors, the original reference feature set. The results measured for this series of tests are listed in Table 9.11.

Performance measure   ALL   V     UV    UVV   VUV
ID rate               5     4     5     6     6
Correct frames        43%   41%   35%   50%   49%

Table 9.11: NN results for the voiced/unvoiced analysis using 12∆MFCC

The most significant difference shown in Table 9.11 is in the rate of correctly identified speakers. The correct frame rate increases for the VUV and UVV feature sets, but this alone, as was seen in Section 9.5, is not of vital importance; the fact that it leads to the correct identification of all six speakers in the reference set carries far greater weight. It suggests not only that more frames are correctly classified, but also that these correctly classified frames are evenly distributed among all 6 speakers.
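As an illustration of how the two measures in Table 9.11 relate, the sketch below computes both from per-frame predictions, under the assumption that a speaker counts as identified when the plurality of his or her test frames carries the correct label; the thesis decision rule may differ.

```python
import numpy as np

def nn_performance(pred_by_speaker):
    """pred_by_speaker: dict mapping a true speaker id (1-6) to the
    array of per-frame predicted ids for that speaker's test frames."""
    identified, correct, total = 0, 0, 0
    for true_id, pred in pred_by_speaker.items():
        pred = np.asarray(pred)
        correct += int((pred == true_id).sum())
        total += pred.size
        winner = np.bincount(pred).argmax()   # plurality over frames
        identified += int(winner == true_id)
    return identified, correct / total        # (ID rate, correct frames)
```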

The next step in the search for a way to optimize the classification process is the implementation of the voiced/unvoiced subsets in conjunction with gender separation.

Following the implementation of gender separation based on F0 estimates with the k-NN classifier, the NN is implemented with the five subsets of the original 12∆MFCC feature set, and the results obtained are listed in Table 9.12. As there are only 3 speakers in each group, an ID rate of 3 represents 100% correctly identified speakers.
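The gender pre-grouping itself simply splits the speakers on an average F0 estimate before any classification takes place. A minimal sketch is given below; the 165 Hz boundary is an illustrative assumption, not a value taken from the thesis.

```python
import numpy as np

def gender_groups(f0_by_speaker, boundary_hz=165.0):
    """f0_by_speaker: dict speaker id -> array of per-frame F0
    estimates (voiced frames only). Returns (male_ids, female_ids)."""
    male, female = [], []
    for spk, f0 in f0_by_speaker.items():
        # Speakers with a mean F0 above the boundary go to the female group.
        (female if np.mean(f0) > boundary_hz else male).append(spk)
    return male, female
```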

From Table 9.12 it is observed that the previously obtained results are confirmed, both for gender separation and for the V/UV analysis. Gender and voicing separation combined display improved performance compared to simply implementing the V/UV subsets for all six speakers, showing once more that gender separation increases the rate of correct frame classification. The ID rate does not change much, though, and the only two subsets that result in 100% identification for both the combined set and the male and female subgroups are, as seen in Tables 9.11 and 9.12, the UVV and VUV sets. It is interesting to note that while most of the results in Table 9.12 are similar for male and female speakers, a discrepancy exists for the "voiced" and "unvoiced" frames: the male speakers are recognized at a higher rate for the unvoiced frames, while the opposite holds true for the female speakers. The UVV and VUV subsets yield a more substantial increase in the correct frame classification rate than in the case of the 6 mixed speakers. For both male and female speakers, using these subsets results in a 10% increase in correct frame classification rate compared to the unsorted feature set for each gender group.

Performance measure   Gender group   MIX   V     UV    UVV   VUV
ID rate               Male           2     2     3     3     3
Correct frames        Male           54%   44%   55%   64%   66%
ID rate               Female         3     2     2     3     3
Correct frames        Female         56%   58%   43%   66%   65%

Table 9.12: NN results for the voiced/unvoiced analysis using gender grouped 12∆MFCC

The division of a feature set into 5 V/UV-labelled subsets was also implemented for the 12∆LPCC, 12∆wLPCC and 13∆PLPCC feature sets, but in each case only one speaker could be identified, with the classification showing extreme bias towards that one correctly identified speaker. The results obtained for the 12∆MFCC feature set could thus not be reproduced using other feature sets with the NN classifier.
