9.5 SID System Performance Using All Frames

In Chapters 5, 6 and 7, the results for each classifier using the 12MFCC+12∆MFCC feature set were obtained. Here, additional feature sets are implemented and each classifier's performance is measured. As the classifiers cannot all handle an equal amount of data, the training and test data are set to the values listed in Table 9.3 for all the trials that yielded the results listed in this section.

| Classifier | Training Data | Test Data |
|---|---|---|
| MoG | ALL | 8s |
| k-NN | 10s | 8s |
| NN | 50s | 8s |

Table 9.3: Training and test data lengths for each classifier

Each amount of data listed in Table 9.3 is per speaker, and "ALL" indicates that all available training data is used. The length of the training data for each reference speaker is provided in Table 8.1. The training data for the k-NN and NN classifiers are restricted for two reasons: an equal representation of each speaker is required, and for some of the feature sets, including more training data led to a memory requirement that could not be met. All of the test data is limited to 8s because the lower bound for the test material of the reference speakers is above this value. Using the same length in all cases provides a fair basis for comparison and allows the frame-by-frame analysis that is started in Figures 5.9, 5.10, 6.3, 6.4, 7.4 and 7.5, where the performance of the different classifiers is visualized as the classification of each test frame from each reference speaker.

All classification tasks are based on the principle of consensus, so that not only the rate of correct identification of speakers from a whole test sequence is obtained, but the percentage of correctly classified frames in each case is also recorded. This measurement reveals details as to the SID system's ability to recognize a speaker from specific frames.
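The consensus principle can be illustrated with a minimal sketch. It assumes that consensus means a simple majority vote over the frame-level decisions of one test sequence; the function names and the example frame counts are hypothetical and are not taken from the experiments.

```python
from collections import Counter


def consensus_identify(frame_labels):
    """Return the speaker label assigned to the most frames of a test sequence."""
    counts = Counter(frame_labels)
    # most_common(1) returns [(label, count)]; a tie falls to the label listed first
    return counts.most_common(1)[0][0]


def correct_frame_rate(frame_labels, true_speaker):
    """Percentage of frames that were assigned to the true speaker."""
    correct = sum(1 for label in frame_labels if label == true_speaker)
    return 100.0 * correct / len(frame_labels)


# Hypothetical frame-level decisions for one 8s test sequence (800 frames)
frames = ["A"] * 360 + ["B"] * 250 + ["C"] * 190
identified = consensus_identify(frames)   # "A": the largest share of frames
rate = correct_frame_rate(frames, "A")    # 45.0
```

In this example the speaker is identified correctly even though fewer than half of the frames are assigned to the correct speaker, which is why both measures are recorded.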

The results for each test conducted for the different feature sets and the three classifiers are shown in Tables 9.4, 9.5, 9.6, 9.7 and 9.8. The abbreviation "wLPCC" stands for warped LPCC coefficients.

The results recorded as "ID" show the total number of speakers that were identified by using consensus over all test frames. As there are 6 speakers in the set, a 100% correct identification rate is noted as 6. A complete failure to identify any of the speakers is signified by 0, and all values in between indicate how many reference speakers out of 6 are correctly identified. "Frames" measures the correctly classified frame rate as a percentage. This is calculated from the number of frames that are assigned to the correct speaker out of the 8s (800 frames) of test speech. For the F₀ estimates, "Frames" are actually entire sentences. Although the correct frame rate in itself is not sufficient to determine the performance of the SID system, it is interesting in that it shows which feature sets contain frames that are more easily classified as belonging to the correct speaker and are therefore richer in speaker-specific information. This knowledge introduces a measure of reliability for each system setup combining a feature set and a classifier.

The distribution of correctly classified frames is also a useful performance measure. This is what was used in the comparisons of the preliminary trials with the three classifiers, where the confusion matrices were analyzed. It was revealed that although the MoG classifier had a higher correct frame classification rate than the k-NN classifier, the distribution of the correctly classified frames was so uneven that the rate of identification of speakers was the same for both classifiers.

As there is no simple way to represent this distribution, however, it is not included as a performance measure in Tables 9.4-9.8. A good distribution of correctly classified frames is nevertheless reflected in the speaker identification rate. When all 6 speakers are correctly identified, the confusion matrix contains a large majority of the classified test frames on its diagonal.
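The relation between the two measures and the confusion matrix can be sketched as follows. The sketch assumes that a speaker counts as identified when the diagonal entry is the strict maximum of its row; the function name `summarize` and both example matrices are hypothetical and only serve to show how an uneven distribution affects the "ID" value.

```python
def summarize(confusion):
    """Return (speakers identified, correct frame rate in %).

    confusion[i][j] is the number of test frames from speaker i
    that were assigned to speaker j.
    """
    total = sum(sum(row) for row in confusion)
    correct = sum(confusion[i][i] for i in range(len(confusion)))
    frame_rate = 100.0 * correct / total
    # A speaker is identified when its diagonal entry beats every other entry in the row
    identified = sum(
        1
        for i, row in enumerate(confusion)
        if all(row[i] > v for j, v in enumerate(row) if j != i)
    )
    return identified, frame_rate


# Hypothetical 3-speaker matrices with 100 test frames per speaker
even = [[40, 30, 30], [30, 40, 30], [30, 30, 40]]
uneven = [[90, 5, 5], [45, 40, 15], [50, 30, 20]]

even_summary = summarize(even)      # (3, 40.0)
uneven_summary = summarize(uneven)  # (1, 50.0)
```

The uneven matrix has the higher correct frame rate, yet fewer speakers are identified, because most of its correct frames belong to a single speaker.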

| Classifier | Measure | 8MFCC | 8∆MFCC | 10MFCC | 10∆MFCC | 12MFCC | 12∆MFCC | 12∆∆MFCC |
|---|---|---|---|---|---|---|---|---|
| MoG | ID | 4 | 4 | 4 | 5 | 5 | 5 | 5 |
| MoG | Frames | 41% | 42% | 45% | 47% | 46% | 48% | 43% |
| k-NN | ID | 6 | 6 | 5 | 6 | 5 | 6 | 5 |
| k-NN | Frames | 37% | 39% | 40% | 42% | 40% | 43% | 41% |
| NN | ID | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| NN | Frames | 52% | 54% | 53% | 56% | 55% | 60% | 61% |

Table 9.4: The performance of different classifiers for MFCC feature sets

| Classifier | Measure | 8LPCC | 8∆LPCC | 10LPCC | 10∆LPCC | 12LPCC | 12∆LPCC | 12∆∆LPCC |
|---|---|---|---|---|---|---|---|---|
| MoG | ID | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| MoG | Frames | 45% | 50% | 50% | 54% | 57% | 62% | 68% |
| k-NN | ID | 6 | 6 | 5 | 5 | 6 | 6 | 6 |
| k-NN | Frames | 32% | 33% | 34% | 35% | 38% | 38% | 38% |
| NN | ID | 5 | 5 | 5 | 5 | 5 | 6 | 5 |
| NN | Frames | 43% | 46% | 48% | 49% | 52% | 54% | 59% |

Table 9.5: The performance of different classifiers for LPCC feature sets

| Classifier | Measure | 8wLPCC | 8∆wLPCC | 10wLPCC | 10∆wLPCC | 12wLPCC | 12∆wLPCC |
|---|---|---|---|---|---|---|---|
| MoG | ID | 6 | 6 | 6 | 6 | 6 | 6 |
| MoG | Frames | 37% | 41% | 37% | 43% | 40% | 46% |
| k-NN | ID | 4 | 5 | 4 | 5 | 6 | 6 |
| k-NN | Frames | 28% | 29% | 31% | 31% | 34% | 34% |
| NN | ID | 6 | 5 | 6 | 6 | 6 | 6 |
| NN | Frames | 40% | 46% | 43% | 46% | 48% | 43% |

Table 9.6: The performance of different classifiers for warped LPCC feature sets

| Classifier | Measure | 9PLPCC | 9∆PLPCC | 11PLPCC | 11∆PLPCC | 13PLPCC | 13∆PLPCC | 13∆∆PLPCC |
|---|---|---|---|---|---|---|---|---|
| MoG | ID | 6 | 6 | 6 | 6 | 6 | 6 | 6 |
| MoG | Frames | 55% | 58% | 55% | 59% | 59% | 63% | 71% |
| k-NN | ID | 6 | 5 | 6 | 6 | 6 | 6 | 6 |
| k-NN | Frames | 41% | 40% | 41% | 43% | 45% | 45% | 45% |
| NN | ID | 6 | 6 | 5 | 6 | 6 | 6 | 5 |
| NN | Frames | 54% | 56% | 55% | 56% | 60% | 61% | 68% |

Table 9.7: The performance of different classifiers for PLPCC feature sets

| Classifier | Measure | 8 LPC residual | 10 LPC residual | 12 LPC residual | YIN F₀ | RC F₀ | F₀ Trajectory |
|---|---|---|---|---|---|---|---|
| MoG | ID | 1 | 1 | 1 | 1 | 1 | 1 |
| MoG | Frames | 18% | 17% | 17% | 0% | 0% | 6% |
| k-NN | ID | 2 | 2 | 0 | 2 | 4 | 2 |
| k-NN | Frames | 18% | 18% | 17% | 42% | 67% | 34% |

Table 9.8: The performance of different classifiers for source based feature sets

The source-based features, the system performance of which is listed in Table 9.8, prove to be unsuitable for speaker identification. The F₀ estimates for the RC method are the best source features for speaker classification when using the k-NN classifier, as for this set 4 reference speakers are identified and a large percentage of the test sentences are classified correctly. Referring to Figure 3.7, no evidence can be found as to why this is the case for the real cepstrum and not for the YIN estimates, as the relative differences within the two sets are not large. As there are only 2 test sentences, though, a single correct classification can make a big difference in the total results. The extremely small number of points in the source-based feature sets made it impossible for the MoG classifier to estimate a density function with any precision. The LPC residual leads to poor performance in all cases. The F₀ trajectories of the real cepstrum method lead to results for both classifiers that confirm that these features are not rich in speaker-specific information, as was already observed in Figure 3.10.

From Table 9.4, the NN is seen to be the only classifier that can successfully identify all 6 speakers based on all the MFCC feature sets. The low frame classification rate of the MoG classifier may be due to the overlap in feature space of the MFCC coefficients that is observed in the PCA analysis of Figure 3.17. Combined with a restricted number of data points, this overlap makes it difficult for the MoG classifier to estimate speaker-specific density functions.

Although the highest frame classification rate is obtained for the 12∆∆MFCC feature set implemented with the NN classifier, the NN is capable of identifying all 6 speakers even for the 8MFCC feature set, as is the k-NN classifier. Using MFCC as a feature is thus best done in a SID system setup using the NN.

From Tables 9.5 and 9.6 it is observed that warping the LPCC coefficients leads to a decrease in correctly classified frames for all classifiers. The rate of correct speaker identification, however, does not deviate much between the two types of features. These results show that for this SID task, no improvement in performance is gained from the warping of the LPC autoregressive coefficients to the Bark scale. All the LPCC feature sets result in optimal speaker identification rates of 100% for the MoG classifier, while the NN classifier requires the information contained within the 12∆LPCC feature set to be able to identify all 6 speakers. The much lower dimensional 8LPCC feature set is sufficient for good classification of speakers using the MoG classifier, while the k-NN classifier, as for the MFCC set, can identify all 6 speakers for a few of the LPCC feature sets.

Of all the feature sets, the PLPCC lead to the best performance of the SID system. For almost all the combinations shown in Table 9.7, the speaker identification rate is 100% and the correct frame classification rate is higher than that for the other feature sets. The preprocessing of the PLPCC coefficients, which approximates the auditory frequency analysis in the ear and places weight on the perceptually significant parts of speech, thus leads to an improvement in the speaker identification system performance.

For almost all the PLPCC feature sets, 100% correct speaker identification is obtained for all 3 classifiers. The preprocessing does require more computational time, however, so if computation time is of vital importance, the MFCC or LPCC feature sets should be used instead.

The reason that the NN classification of the 12∆∆LPCC and 13∆∆PLPCC feature sets is not 100% correct is that the amount of training data used for the tests involving the second derivatives was limited to 30s instead of 50s, as the NN otherwise experienced memory storage difficulties.

To summarize the results obtained in Tables 9.4-9.7, the best performance for each classifier is listed in Table 9.9. The performance is based on which feature set yields 100% correct speaker identification with the highest level of reliability, i.e. the largest number of correctly classified frames. If several feature sets result in the same performance, the feature set of lowest dimension is chosen. The feature set or sets that generally lead to reliable performance for each classifier are also listed.

| Classifier | Optimal Feature Set | Good Feature Set(s) |
|---|---|---|
| MoG | 13∆∆PLPCC | LPCC, wLPCC, PLPCC |
| k-NN | 13PLPCC | LPCC, PLPCC |
| NN | 13∆PLPCC | MFCC |

Table 9.9: The optimal feature sets for different classifiers
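The selection rule behind Table 9.9 can be written out as a short sketch. The ID and Frames values below are copied from the NN rows of Tables 9.4 and 9.7; the dimensions are an assumption (N static coefficients, 2N with first derivatives, 3N with second derivatives), and the function name `optimal_feature_set` is hypothetical.

```python
# (feature set, assumed dimension, speakers identified, correct frame rate in %)
# ID and Frames come from the NN rows of Tables 9.4 and 9.7.
# The dimensions are an assumption: N, 2N with deltas, 3N with delta-deltas.
NN_RESULTS = [
    ("12MFCC", 12, 6, 55),
    ("12∆MFCC", 24, 6, 60),
    ("12∆∆MFCC", 36, 6, 61),
    ("13PLPCC", 13, 6, 60),
    ("13∆PLPCC", 26, 6, 61),
    ("13∆∆PLPCC", 39, 5, 68),
]


def optimal_feature_set(results, n_speakers=6):
    """Pick the set with 100% identification, highest frame rate, lowest dimension."""
    candidates = [r for r in results if r[2] == n_speakers]
    if not candidates:
        return None
    # Sort key: highest frame rate first, then lowest dimension as the tie-breaker
    best = min(candidates, key=lambda r: (-r[3], r[1]))
    return best[0]


best_for_nn = optimal_feature_set(NN_RESULTS)  # the 13∆PLPCC entry
```

With these rows, the 13∆∆PLPCC set is excluded because only 5 speakers are identified, and the tie at 61% between 12∆∆MFCC and 13∆PLPCC is resolved in favour of the lower assumed dimension, which agrees with the NN entry in Table 9.9.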

Although Table 9.9 shows that the optimal performance for all classifiers is achieved with the 13PLPCC feature set and its temporal derivatives, the NN classifier is most reliable when used to classify speakers using any of the MFCC feature sets, despite the slightly lower correct frame rate when compared to the PLPCC. For all 4 cepstral coefficient feature sets, the inclusion of the temporal derivatives of each feature set usually leads to a better speaker identification rate and a higher correct frame classification percentage. More speaker-specific information is thus available when the temporal variations of the speech signal are analyzed. This can, for example, be seen in Table 9.4, where using the 12∆MFCC feature set instead of the 12MFCC set with the NN classifier raises the correct frame classification rate from 55% to 60%. The temporal derivatives are thus relevant for the speaker identification task.
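How the temporal derivatives are computed is not restated in this section. As an illustration only, the sketch below uses the common regression-based delta formula, d_t = Σ n (c_{t+n} − c_{t−n}) / (2 Σ n²), and appends the result to the static coefficients. The window length, the edge handling and the function names are assumptions and may differ from the implementation used for the experiments.

```python
def delta(frames, window=2):
    """Regression-based temporal derivative of a sequence of feature vectors.

    frames: list of equal-length lists, one cepstral vector per frame.
    Edge frames are handled by repeating the first and last frame (assumption).
    """
    n_frames = len(frames)
    denom = 2 * sum(n * n for n in range(1, window + 1))
    out = []
    for t in range(n_frames):
        vec = []
        for k in range(len(frames[t])):
            acc = 0.0
            for n in range(1, window + 1):
                ahead = frames[min(t + n, n_frames - 1)][k]
                behind = frames[max(t - n, 0)][k]
                acc += n * (ahead - behind)
            vec.append(acc / denom)
        out.append(vec)
    return out


def append_deltas(frames, window=2):
    """Concatenate each static vector with its delta vector (N -> 2N dimensions)."""
    return [f + d for f, d in zip(frames, delta(frames, window))]


# Hypothetical one-dimensional ramp: the interior slope is 1 per frame
ramp = [[float(t)] for t in range(6)]
ramp_delta = delta(ramp)
```

For the ramp, the interior frames receive a delta of 1.0, while the first and last frames are damped by the edge handling.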

In order to limit the number of feature sets used in further trials, four feature sets that result in a 100% correct speaker identification rate and high correct frame classification rates are selected for additional testing: the 12∆MFCC, the 12∆LPCC, the 12∆wLPCC and the 13∆PLPCC feature sets. Apart from the MFCC features, all of these sets resulted in a 100% correct identification rate for all classifiers. As the NN classifier is more stable than the MoG classifier and more efficient than the k-NN classifier, it is chosen to implement the various types of tests that are presented in Sections 9.6 and 9.7.

Source document: IMM, Technical University of Denmark (pages 110-114)