
The Beat Spectrum (BS) features, proposed in [32], are another approach to beat estimation from the whole song.

The idea is to start from a short-time feature representation and create a distance matrix whose elements are the distances between the short-time features of each pair of frames. The distance measure is here the cosine measure, i.e., the angle between two feature vectors. To create the beat spectrum, the elements on the diagonals of the distance matrix are summed. Hence, the beat spectrum can be mathematically formulated as

$$
B(l) = \sum_{i \in I} D(\mathbf{x}_i, \mathbf{x}_{i+l}) = \sum_{i \in I} \frac{\mathbf{x}_i \cdot \mathbf{x}_{i+l}}{|\mathbf{x}_i|\,|\mathbf{x}_{i+l}|}
$$

where $D$ is the distance measure and $I$ is the index set such that $i+l$ lies within the size of the distance matrix. The peaks of this beat spectrum correspond to beats in the music. In our implementation in relation to (Paper C), a Fourier transformation has been used to estimate the periodicity, and the content in 6 bands was used as the BS features.
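To make the computation concrete, a minimal NumPy sketch of the beat spectrum is given below. It assumes the short-time features of one song are given as the rows of a matrix; the function name and the normalization safeguard are our own, not from [32] or (Paper C).

```python
import numpy as np

def beat_spectrum(features):
    """Beat spectrum B(l) from a (num_frames, dim) matrix whose rows are
    the short-time feature vectors x_i of one song."""
    # Normalize rows so that dot products become the cosine measure.
    norms = np.linalg.norm(features, axis=1, keepdims=True)
    unit = features / np.maximum(norms, 1e-12)
    S = unit @ unit.T                    # S[i, j] = cosine(x_i, x_j)
    n = S.shape[0]
    # Sum the elements on each diagonal: B(l) = sum_{i in I} S[i, i+l].
    return np.array([np.trace(S, offset=l) for l in range(n)])

# A peak of B at lag l indicates a periodicity of l frames; as in
# (Paper C), a Fourier transform of B can then be used to estimate
# the periodicity of the beats.
```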

Chapter 5

Classifiers and Postprocessing

Given a music representation in the form of feature vectors, it is important to be able to find the patterns in feature space that belong to the different music genres. In music genre classification systems, this is the task of a classifier. The first four sections each describe a traditional classifier which has been used in the current dissertation. In section 5.5, a novel Co-occurrence model (Paper D) is proposed for music genre classification. In the last section, postprocessing techniques are considered with special focus on methods to combine a time-sequence of classifier outputs from a song into a single genre decision.

The music features that have been examined in chapter 3 share some common traits. Notably, they all transform the raw audio signal into a sequence of (multivariate) feature vectors with an implicit "clustering assumption". This assumption means that two feature vectors which are close (in some "simple" metric) in this feature space should represent much of the same musical content.

This assumption is vital to all areas of music information retrieval. For instance, music recommendation systems [15] use distance measures to decide which songs sound "similar" and base their recommendations on that, which is only meaningful under this assumption.

In music genre classification systems, a classifier uses the "clustering assumption" to transform the feature vectors into estimates of their corresponding music genre. The classifiers which have been investigated in this project are all statistical classifiers. These kinds of classifier models use a so-called training set of genre-labelled songs to statistically infer the parameters of the model.

After this training procedure, the performance of the classifier model is found by predicting genre labels on a test set of songs. The performance of the classifier, and of the music genre classification system as a whole, is often measured by comparing the known labels of the test set songs against these predicted labels.

In some music genre classification systems, such as [77] and [106], the classifier uses the whole time-series of feature vectors in a song to return an estimate of the genre for the new song. Our proposed co-occurrence model (Paper D), which is described in section 5.5, is also capable of using the whole time-series of feature vectors to reach a genre estimate for the song. However, in many systems the classifier predicts a genre, or the probability of a genre, for each feature vector $\mathbf{z}_n$ in the song. This is the case for the four traditional classifiers discussed in the first four sections of this chapter. In the postprocessing step, these predictions from every feature vector in the song are combined into a single genre label for the song.

In our research, we have made several assumptions about the nature of the problem as discussed in section 2.3. With these assumptions, the combined classifier and postprocessing parts can be formalized as

$$
\hat{C} = g(\mathbf{z}_1, \ldots, \mathbf{z}_N) \tag{5.1}
$$

where $\hat{C}$ is the estimated genre for a song that is represented by the sequence of $N$ feature vectors $\mathbf{z}_1$ to $\mathbf{z}_N$. The classifier and postprocessing are then contained in the function $g$.
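As an illustration, two simple choices of $g$ are sketched below: the sum rule, which averages the frame-level posteriors $P(C|\mathbf{z}_n)$, and majority voting over the frame-level decisions. These are common combination schemes of the kind discussed in the last section of this chapter; the function names and the matrix layout are our own assumptions.

```python
import numpy as np

def g_sum_rule(frame_posteriors):
    """Average P(C|z_n) over the N frames of a song, pick the argmax genre.

    frame_posteriors: (N, num_genres) matrix of classifier outputs."""
    return int(np.argmax(frame_posteriors.mean(axis=0)))

def g_majority_vote(frame_posteriors):
    """Let every frame vote for its most probable genre."""
    votes = np.argmax(frame_posteriors, axis=1)
    return int(np.bincount(votes).argmax())
```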

It should be mentioned that there exists an abundance of different classification and general pattern recognition methods in the machine learning and statistics literature. Good overviews can be found in e.g. [27], [8] and [75]. For instance, the above formulation with a training and test set lies within the realm of supervised learning, where the genre taxonomy is restricted in advance. The unsupervised learning approach is not limited by such restrictions in taxonomy, but looks for patterns in feature space with special emphasis on the similarity measure [97]. This has the advantage that new, emerging genres might be discovered; it avoids the problems of defining genre, as discussed in chapter 2, and it avoids the problems of getting reliable labels for the data. However, it relies strongly on the similarity measure. In [102], Hidden Markov Models [92] are used for unsupervised music genre classification, and [95] used Self-Organizing Maps (SOMs) and a Growing Hierarchical SOM variant. In (Paper F), we used Independent Component Analysis (ICA) to cluster music feature space. Common similarity measures are the simple Euclidean or cosine distance [33] or the Kullback-Leibler divergence [77]. Several well-known (hierarchical) clustering methods are discussed in e.g. [27], of which the K-means method is probably the most common.

There also exist numerous supervised classification schemes. An important non-parametric method is the K-nearest neighbor (KNN) method, which simply assigns a class to a new feature vector by voting among the K nearest neighbors (a minimal sketch is given below). It has been used in music genre classification in e.g. [110] and [67]. Another important class of classifiers are the non-probabilistic Support Vector Machines (SVMs) [20], which have been used with great success in the recent MIREX 2005 contest in music genre classification [53], where SVMs were used in more than half of the contributions. Other examples are [117] and [83]. Artificial Neural Network classifiers could, like the SVMs, be put in the class of discriminative non-probabilistic classifiers, although they can be modified to model probability.
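The sketch of the KNN rule below assumes Euclidean distance and integer-encoded genre labels; the value k = 5 is only a placeholder.

```python
import numpy as np

def knn_predict(train_X, train_y, z, k=5):
    """Assign a genre to the new feature vector z by voting among the
    k nearest neighbours in the training set (Euclidean distance)."""
    dists = np.linalg.norm(train_X - z, axis=1)   # distance to each sample
    nearest = np.argsort(dists)[:k]               # indices of the k closest
    return int(np.bincount(train_y[nearest]).argmax())
```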

In the current dissertation, mostly probabilistic classifiers have been considered. These can be described in a unified Graphical Model framework [57] [96]. In relation to classification, they can be split into generative and discriminative models. The generative models model $p(\mathbf{z}|C)$, which is the probability of a feature vector $\mathbf{z}$ given the class label $C$. Predicting the class label of a new feature vector $\tilde{\mathbf{z}}$ then requires the use of Bayes' rule to estimate $P(C|\tilde{\mathbf{z}})$. A disadvantage of the generative models is that they model each class individually without considering the other classes, and hence the class-overlap may be larger than necessary. An advantage is that they directly give an estimate of the reliability of a genre prediction. For instance, if $p(\tilde{\mathbf{z}}|C=c)$ is very low for each $c$, the vector might be considered an outlier (see the small sketch below). Notable examples of generative classification models are the Gaussian Classifier and Gaussian Mixture Model, which are described in the following sections, as well as the (Hidden) Markov Model.
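As a small illustration of this outlier check, assume the per-class densities $p(\tilde{\mathbf{z}}|C=c)$ have already been evaluated (e.g., with the Gaussian classifier of section 5.1); the threshold value is purely hypothetical.

```python
import numpy as np

def looks_like_outlier(class_densities, tau=1e-9):
    """Flag a vector as an outlier if p(z|C=c) is very low for every genre c.

    class_densities: p(z|C=c) evaluated for each c; tau is an
    illustrative threshold, not a value from the dissertation."""
    return bool(np.all(np.asarray(class_densities) < tau))
```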

The discriminative models, in contrast to the generative ones, model the desired quantity $p(C|\mathbf{z})$ directly and are therefore not (directly) capable of detecting outliers. However, they are often more powerful since they generally require a smaller number of parameters than the generative models for a similarly flexible decision boundary, and hence are less prone to overfitting. Examples of discriminative models include regression models such as Linear Regression and the Generalized Linear Model, which are examined in the following sections. Another example is Gaussian Process classifiers, which are closely related to SVMs.

It should also be mentioned that two different regimes exist in statistics: the frequentist and Bayesian approaches [75] [11]. In short, it can be said that the frequentist approach to the concept of probability builds solely on observations of events from well-defined random experiments. The relative frequency of occurrence of an event is then a measure of the probability of that event.

The Bayesian approach, in contrast, is willing to assign probabilities according to the belief in a proposition. Bayesian inference then uses Bayes' theorem to update the belief in the light of new evidence.

The formalization in equation 5.1 is very restricted since it simply assigns a single genre label to a given song. All the previously mentioned supervised probabilistic classifiers instead give (using Bayes' theorem for the generative models) the probability $P(C|\tilde{\mathbf{z}})$ of a genre given the feature vector. The postprocessing methods are capable of combining these probabilities into a single decision for the whole song. However, this is not necessarily the best method in a music genre classification system. As discussed in chapter 2, humans do not agree on genre, and therefore it seems more natural to use the quantity $P(C=c|\tilde{s})$ directly as a measure of the degree to which the song with index $\tilde{s}$ belongs to genre $c$, instead of using a single genre decision. Any probabilistic classifier could estimate this quantity if either the feature vector $\tilde{\mathbf{z}}$ represents the whole song or the classifier includes temporal integration, such as a Hidden Markov Model or the proposed co-occurrence models in section 5.5. Although this might seem very natural, it should be noted that this would also require the labels of the training set to take the form of a distribution $P(C|\tilde{s})$.

Another restriction in equation 5.1 lies in the output of a genre taxonomy without hierarchical structure, although such structure is natural in most real-world music genre classification systems. For instance, [12] uses four different levels in the genre taxonomy and makes a supervised classification on each level. As mentioned before, there also exist many unsupervised hierarchical methods.

The classifiers which have been used in the current dissertation project are described in the following. First, two generative probabilistic models are described: the Gaussian Classifier (GC) and the Gaussian Mixture Model (GMM). Then the discriminative Linear Regression classifier and Generalized Linear Model are discussed. Afterwards, the generative co-occurrence models which we proposed in (Paper D) are investigated and explained.

5.1 Gaussian Classifier

The Gaussian classifier uses the Gaussian probability density function as a model for the distribution of feature vectors in each genre. This probabilistic model has been used for music genre classification in e.g. [70] due to its simplicity.

The probability density function for a feature vector $\mathbf{z}_n$ in the genre with index $c$ is

$$
p(\mathbf{z}_n|C=c) = \frac{1}{\sqrt{(2\pi)^d\,|\Sigma_c|}} \exp\!\left( -\frac{1}{2} (\mathbf{z}_n - \boldsymbol{\mu}_c)^T \Sigma_c^{-1} (\mathbf{z}_n - \boldsymbol{\mu}_c) \right) \tag{5.2}
$$

where $\boldsymbol{\mu}_c$ and $\Sigma_c$ are the mean and covariance matrix, respectively, that belong to genre $c$, and $d$ is the dimensionality of the feature space. The strong assumption here is independence between the individual observations in the data set $(c_{s(j)}, \mathbf{z}_j)$, where $j$ runs over all feature vectors in the data set and $c_{s(j)}$ is the associated genre label. The function $s(j)$ gives the song index for the feature vector with index $j$. Although there is much time structure in music and many features use overlapping windows, this assumption is often used (quite successfully) in practice. The assumption is used to formulate the log-likelihood as

$$
\mathcal{L} = \log p(c_{s(1)}, \ldots, c_{s(M)}, \mathbf{z}_1, \ldots, \mathbf{z}_M) = \log \prod_{j=1}^{M} p(c_{s(j)}, \mathbf{z}_j) \tag{5.3}
$$
$$
= \sum_{j=1}^{M} \log P(C = c_{s(j)}) + \sum_{j=1}^{M} \log p(\mathbf{z}_j | C = c_{s(j)}) \tag{5.4}
$$

where $M$ is the total number of feature vectors in the data set. The traditional maximum likelihood principle states that the model parameters can be estimated by maximizing $\mathcal{L}$. This gives the well-known estimates of the mean and variance of each class. The estimate of $P(C)$ simply becomes the normalized count of occurrences in each class. After the inference of the parameters $\boldsymbol{\mu}_c$ and $\Sigma_c$ of the model, as well as $P(C)$, predictions of $P(C|\mathbf{z}_n)$ can be made for new feature vectors in the test set. According to Bayes' rule, the predicted probability for each genre $c$ then becomes

$$
P(C=c|\mathbf{z}_n) = \frac{P(C=c)\, p(\mathbf{z}_n|C=c)}{\sum_{j=1}^{N_c} P(C=j)\, p(\mathbf{z}_n|C=j)} \tag{5.5}
$$

where $N_c$ denotes the number of genres. This quantity is the output of our Gaussian classifier model for each feature vector $\mathbf{z}_n$, to be used in the postprocessing part.

The Gaussian classifier can often be used with good results when the dimensionality of the feature space is reasonably small. However, bad results are obtained in high-dimensional spaces since the estimation of the covariance matrix becomes very unreliable. "Small" and "high" dimensional spaces are not fixed quantities, but depend on the size of the training set; they were of the order of 10-30 and 80-100, respectively, in the author's experiments. This problem is a classical example of the "curse of dimensionality" in machine learning. It should be mentioned that several solutions to this problem have been proposed.

One common solution is to put restrictions on the covariance matrix such that, e.g., it only has nonzero elements on the diagonal.
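In the fitting sketch above, this restriction would amount to replacing the full covariance estimate with the per-dimension variances, e.g.:

```python
# Diagonal restriction inside fit_gaussian_classifier (hypothetical):
# d variance parameters per genre instead of d(d+1)/2 covariances.
Sigma = np.diag(Xc.var(axis=0) + 1e-6)
```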