
One of the most basic elements in the training of a music genre classification system is the data set. Ideally, the data set should cover the whole (user-defined) "music universe", reflect the true proportions between the number of songs in each genre and carry "ground-truth" labels. This is never the case, but it should still be the goal. Music data are fairly easy to obtain in comparison with many other kinds of data. However, there are legal issues concerning music copyright when sharing music data sets. Hence, only a limited number of data sets are made publicly available, which makes it difficult to compare the algorithms of different researchers. One solution to this has been to share only meta-data such as


features. This was done in e.g. the 2004 ISMIR genre classification contest [44][6]. In the MIREX 2005 contests [53], the algorithms of several different researchers were compared on a common data set by having the researchers submit their actual implementations to an evaluation committee. Recently, as the area of MIR has matured, repositories such as the MTG-database [14] have become available.

The ground-truth of song labels is another concern. As discussed in chapter 2 and e.g. [4], a universal ground-truth does not exist. Even getting reliable labels for the data is often a serious practical problem that researchers have to consider. In [68] and [115], the All Music Guide [45] was used to estimate the ground-truth similarities between artists and songs. The All Music Guide has one of the most extensive collections of evaluations of music in many different genres.

In the following, two different data sets are described. These were the most heavily used data sets in the current dissertation project, and human evaluations were made of both. In (Paper C), we used a data set from the "Free Download section" at Amazon.com [46], but we did not investigate human performance on this data set and it is therefore not described in the following.

6.2.1 Data set A

The data set A consists of 100 songs which are evenly distributed among the five genres Classical music, Jazz, Pop, Rock and Techno. In relation to (Paper B), the first co-author and I labelled the songs. They were chosen to be (somewhat) characteristic of their specific genre, and the songs within a given genre were therefore quite similar. Additionally, the genres were chosen to be as different as possible. Hence, this data set was created to give only little variability in the human genre classifications. All songs were ripped from personal CDs with a sampling frequency of 22050 Hz.

Human Evaluation

A classification experiment with human subjects was made to evaluate the data set A. The 22 test subjects (mostly younger people between 25 and 35 years old from the signal processing department without any specific knowledge of music) were asked to log in (at different times) to a website which was built for the purpose. Each subject was then asked to classify 100 different sound samples of length 740 ms and 30 sound samples of length 10 s in two different experiment rounds.

The choice of genre was restricted to the five possible (forced-choice) and no prior information was given apart from the genre names. Both the 740 ms and 10 s samples were taken randomly from the test set in (Paper B), which consisted of 25 songs. The experiment with 740 ms samples had to be completed before proceeding to the 10 s experiment. This was done to avoid too much correlation between the answers in the two experiments due to recognition of the songs.

The subjects could listen to the sound samples repeatedly before deciding, if desired.

It was found that the individual human classification test accuracy in the 10 s experiment was 98 % with 95 %-confidence interval limits at 97 % and 99 % under the assumption of a binomially distributed number of errors. This is in agreement with the desired property of the data set: that it should have only a small amount of variability in the human classifications and have reliable labels. For the 740 ms experiment, the accuracy was 92 % with a 95 %-confidence interval between 91 % and 93 %.
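To make the interval construction concrete, the following is a minimal sketch in Python, assuming the standard normal approximation to the binomial (Wald interval; the exact construction used for the reported intervals is not stated) and with the number of classifications inferred from the setup above (22 subjects times 30 and 100 samples, respectively):

    import math

    def binomial_ci(p_hat, n, z=1.96):
        # 95 % normal-approximation (Wald) confidence interval
        # for a binomial proportion p_hat estimated from n trials.
        half = z * math.sqrt(p_hat * (1.0 - p_hat) / n)
        return p_hat - half, p_hat + half

    # 10 s experiment: 22 subjects x 30 samples = 660 classifications
    print(binomial_ci(0.98, 660))    # approx. (0.97, 0.99)

    # 740 ms experiment: 22 subjects x 100 samples = 2200 classifications
    print(binomial_ci(0.92, 2200))   # approx. (0.91, 0.93)

Both intervals agree with the reported limits, which suggests the normal approximation is adequate at these sample sizes.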

Note that the "individual" human accuracy was found by considering all of the classifications of the songs as coming from a single classifier and comparing these classifications with our "ground-truth" labelling (made by the first co-author and me). Since the human subjects were not involved in the labelling of the data set, it is interesting to compare their consensus labelling of the data set with our "ground-truth" labelling. This is a measure of the validity of our "ground-truth". All of the songs are considered to be properly labelled since they were each given 20 "votes" for a genre label. To find the consensus labelling of a song, we simply use majority voting, i.e. choosing the genre which has the most votes.
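A minimal sketch of this majority-voting step (the vote list shown is hypothetical, and the text does not specify how ties are broken):

    from collections import Counter

    def consensus_label(votes):
        # Majority vote over the genre labels given to one song.
        # `votes` is a list of genre strings, one per human classification.
        # Ties are broken by first occurrence in the vote list.
        return Counter(votes).most_common(1)[0][0]

    # Hypothetical votes on a single song:
    print(consensus_label(["Jazz", "Jazz", "Classical", "Jazz"]))  # -> Jazz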

Comparing this consensus labelling of the data set with our "ground-truth" gave 100 % classification accuracy. In other words, the consensus decision among the human subjects was identical to our "ground-truth". This confirms our belief that this is indeed a simple data set and that our "ground-truth" labelling is valid.

6.2.2 Data set B

The data set B contains 1210 songs in 11 genres with 110 songs in each, i.e. the songs are evenly distributed. The 11 genres are Alternative, Country, Easy Listening, Electronica, Jazz, Latin, Pop&Dance, Rap&HipHop, R&B Soul, Reggae and Rock. The songs were originally in the MP3 format (MPEG-1 Layer 3 encoding) with a bit-rate of at least 128 kbit/s, but were converted to mono PCM format with a sampling frequency of 22050 Hz. A preliminary experiment indicated that the decompression from the MP3 format does not have a significant influence on the classification test error when the bit-rate is as large as here.
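As an illustration of this preprocessing step, here is a minimal sketch assuming the Python libraries librosa and soundfile; the actual conversion tool used for data set B is not stated:

    import librosa
    import soundfile as sf

    def mp3_to_mono_pcm(mp3_path, wav_path, sr=22050):
        # Decode the MP3, downmix to mono and resample to 22050 Hz.
        y, _ = librosa.load(mp3_path, sr=sr, mono=True)
        # Write the result as 16-bit PCM.
        sf.write(wav_path, y, sr, subtype="PCM_16")

    mp3_to_mono_pcm("song.mp3", "song.wav")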


The labels came from a reliable external source, but only the labels were given, and a human evaluation of the genre confusion is therefore desirable.

Human evaluation

Data set B was evaluated in a website-based setup similar to that for data set A, with a comparable group of 25 persons. They were now asked to classify 33 music samples of 30 s length into the 11 genres. The samples were taken randomly from a subset of 220 songs from data set B.

The individual human classification test accuracy was estimated to be 57 % with a 95 %-confidence interval from 54 % to 61 % under the assumption of a binomially distributed number of errors. This accuracy was found by considering each "vote" for a genre label on a song as the outcome of a single "individual human" classifier. The corresponding individual human confusion matrix is shown in figure 6.6, where it is compared to the performance of our best performing system (MAR features with the GLM classifier).

As discussed in relation to the human evaluation in subsection 6.2.1, the human consensus labelling can be used to evaluate the "ground-truth" labelling. A procedure to find this human consensus labelling of songs is also given in subsection 6.2.1. The same procedure is used here, i.e. majority voting among the "votes" on a song is used to find the human consensus genre for the song. Since the number of evaluations (825) is quite small compared to the number of songs (220), only a few votes were given to each song. It was (heuristically) decided that each song should be given at least 3 votes to be included in the comparison, and 172 of the 220 songs fulfill this criterion. The human consensus classification test accuracy was found to be 68 % when compared to the "ground-truth" labelling. Ideally, this accuracy should have been 100 %.
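A minimal sketch of this computation, with hypothetical data structures (a mapping from song id to its list of votes, and a mapping from song id to its "ground-truth" genre), reproducing the at-least-3-votes criterion:

    from collections import Counter

    def consensus_accuracy(votes_by_song, truth, min_votes=3):
        # votes_by_song: song id -> list of genre votes
        # truth:         song id -> "ground-truth" genre
        kept = correct = 0
        for song, votes in votes_by_song.items():
            if len(votes) < min_votes:
                continue  # too few votes for a trustworthy consensus
            kept += 1
            consensus = Counter(votes).most_common(1)[0][0]
            correct += int(consensus == truth[song])
        return kept, correct / kept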

The quite large discrepancy (32 %) is thought mainly to originate in a lack of knowledge about music genres among the human test subjects. An indicator of this is that a few particular subjects with a background in music had very high test accuracy. The corresponding human consensus confusion matrix is illustrated in figure 6.1. It clearly illustrates that the human subjects do not agree with our "ground-truth" on especially Alternative and Easy Listening, which are probably not as easily defined as e.g. Country. It should also be noted that there are other sources of uncertainty. In particular, the number of votes for each song was not very large: 94 % of the considered songs had between 3 and 6 votes and 33 % had only 3 votes. This is arguably a large source of uncertainty on the consensus labelling.


                Alt   Cou   Eas   Elec  Jazz  Latin P&D   R&H   RB&S  Reg   Rock
Alternative     0.21  0     0.07  0.14  0     0     0.28  0     0     0     0.28
Country         0.06  0.68  0.06  0     0     0     0.06  0     0.06  0     0.06
Easy-Listening  0.12  0     0.37  0.12  0.12  0     0.12  0.06  0     0     0.06
Electronica     0     0     0     0.78  0     0     0.21  0     0     0     0
Jazz            0.05  0     0.05  0.05  0.76  0.05  0     0     0     0     0
Latin           0.06  0     0.12  0     0     0.56  0.18  0     0.06  0     0
Pop&Dance       0     0     0.06  0.12  0     0     0.81  0     0     0     0
Rap&HipHop      0     0     0     0.06  0     0     0     0.87  0.06  0     0
RB&Soul         0     0     0.06  0     0     0     0.13  0     0.80  0     0
Reggae          0     0     0     0.05  0     0     0     0.05  0     0.88  0
Rock            0.06  0     0.13  0     0     0.06  0.06  0     0     0     0.66

Figure 6.1: The human consensus confusion matrix from the human evaluation of data set B. The human consensus on a song is found by majority voting among all the human evaluations of the song. The rows are the "ground-truth" labels and the columns are the human consensus genres. It is seen that e.g. the music with "ground-truth" label Alternative is mostly classified into Pop&Dance or Rock in our human evaluation.
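For completeness, a minimal sketch of how such a row-normalized confusion matrix can be computed from paired "ground-truth" and consensus labels (all names are hypothetical):

    import numpy as np

    def consensus_confusion(truth_labels, consensus_labels, genres):
        # Rows: "ground-truth" genres; columns: human consensus genres.
        idx = {g: i for i, g in enumerate(genres)}
        C = np.zeros((len(genres), len(genres)))
        for t, c in zip(truth_labels, consensus_labels):
            C[idx[t], idx[c]] += 1
        # Normalize each row so that it sums to 1 (assumes every
        # genre occurs at least once among the ground-truth labels).
        return C / C.sum(axis=1, keepdims=True)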