
Department of Computer Science and Software Engineering


Support Vector Machine in the Task of Voice Activity Detection

Evgenia Chernenko (University of Joensuu, Finland),
Tomi Kinnunen (Institute for Infocomm Research, Singapore),
Marko Tuononen (University of Joensuu, Finland),
Pasi Fränti (University of Joensuu, Finland),
Haizhou Li (Institute for Infocomm Research, Singapore)

We define voice activity detection (VAD) as a binary classification problem and solve it using the support vector machine (SVM). Challenges in the SVM-based approach include the selection of representative training segments, the selection and normalization of features, and the post-processing of the frame-level decisions. We propose to construct the SVM-VAD system using MFCC features because they capture the most relevant information of speech, and because they are widely used in speech and speaker recognition, which makes the proposed method easy to integrate with existing applications. Practical usability is our driving motivation: the proposed SVM-VAD should be easily adaptable to new conditions.
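The abstract mentions post-processing of the frame-level decisions. One common and simple scheme, sketched here in NumPy purely as an illustration (the paper does not specify the authors' exact post-processing), is a sliding-window majority vote that flips isolated frame decisions to agree with their neighbourhood:

```python
import numpy as np

def smooth_decisions(labels, win=5):
    """Majority-vote smoothing of frame-level 0/1 decisions over a sliding
    window of odd length `win`; edge frames are repeated so the output has
    the same length as the input."""
    half = win // 2
    padded = np.pad(np.asarray(labels), half, mode="edge")
    # The median of a 0/1 window is its majority vote.
    return np.array([int(np.median(padded[i:i + win]))
                     for i in range(len(labels))])

# Example: an isolated speech frame and an isolated non-speech frame
# are both flipped to agree with their neighbourhood.
print(smooth_decisions([0, 0, 1, 0, 0, 1, 1, 1, 0, 1, 1]).tolist())
# → [0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1]
```

The window length controls the trade-off between removing spurious frame errors and smearing genuinely short speech segments.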

Voice activity detection (VAD) aims at classifying a given sound frame as speech or non-speech. It is needed as a front-end component in voice-based applications such as speech recognition, speech enhancement, variable frame-rate speech coding, and speaker recognition. Furthermore, VAD is an important tool for a forensic analyst for locating the speech-only parts in large audio collections, which can consist of tens of hours of data [1]. A large number of methods have been proposed. Simple methods compare the frame energy, zero-crossing rate, a periodicity measure, or the spectral entropy against a detection threshold to make the speech/non-speech decision. More advanced models include statistical hypothesis testing [2], the long-term spectral divergence measure [3, 4], the amplitude probability distribution [5], and low-variance spectrum estimation [6]. The common property of these methods is that they include estimation of the background noise level and/or noise suppression as a part of the process. The methods usually have a large number of control parameters, which are more or less tuned to a specific application. As an example, it was reported in [1] that the accuracy of the long-term spectral divergence VAD [3] depends heavily on the selection of the method's seven control parameters.
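The simple threshold-based detectors mentioned above fit in a few lines; the following is a minimal energy-based sketch in NumPy (the frame length and the -30 dB threshold are assumed values chosen for the illustration, not taken from any of the cited methods):

```python
import numpy as np

def energy_vad(signal, frame_len=200, threshold_db=-30.0):
    """Label each frame as speech (1) or non-speech (0) by comparing its
    log-energy, measured relative to the loudest frame, to a threshold."""
    n_frames = len(signal) // frame_len
    frames = np.reshape(signal[:n_frames * frame_len].astype(float),
                        (n_frames, frame_len))
    energy = np.sum(frames ** 2, axis=1) + 1e-12   # avoid log(0)
    log_e = 10.0 * np.log10(energy)
    log_e -= log_e.max()                           # 0 dB = loudest frame
    return (log_e > threshold_db).astype(int)
```

With 8 kHz material, a 200-sample frame corresponds to 25 ms. Detectors of this kind degrade quickly in low-SNR conditions, which is part of the motivation for the trained classifier studied in this work.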

The idea of this work is to extract the standard mel-frequency cepstral coefficients (MFCC) with delta and double-delta coefficients, and to train a binary classifier on training files with speech/non-speech annotations. The VAD then labels each frame of a test utterance using the trained classifier. We use the support vector machine (SVM) as the classifier, since it has shown excellent performance in other classification tasks, e.g. speaker verification [7]. An advantage of this supervised learning approach is that it can be easily adapted to new operating conditions by providing representative training examples for the new condition. In this way, parameter optimization is absorbed into the training algorithm of the SVM, whereas optimizing the parameters of conventional VADs is more difficult.
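The delta coefficients mentioned above are conventionally obtained by a linear regression over neighbouring cepstral frames, d_t = sum_{n=1..N} n (c_{t+n} - c_{t-n}) / (2 sum_{n=1..N} n^2); applying the same operation to the deltas yields the double-deltas. A sketch in NumPy (the window half-width N=2 is a common choice, not a value stated in the paper):

```python
import numpy as np

def deltas(cepstra, N=2):
    """Delta coefficients of a (frames x coefficients) MFCC matrix by the
    standard regression formula, padding the sequence with edge frames."""
    T = cepstra.shape[0]
    padded = np.pad(cepstra, ((N, N), (0, 0)), mode="edge")
    denom = 2 * sum(n * n for n in range(1, N + 1))
    d = np.zeros_like(cepstra, dtype=float)
    for n in range(1, N + 1):
        d += n * (padded[N + n:N + n + T] - padded[N - n:N - n + T])
    return d / denom
```

Stacking `cepstra`, `deltas(cepstra)` and `deltas(deltas(cepstra))` column-wise gives per-frame feature vectors of the kind fed to the classifier.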

In our experiments, we use three datasets with a varying degree of difficulty. The first dataset is a subset of the NIST 2005 speaker recognition evaluation corpus, consisting of conversational telephone-quality speech with a sampling rate of 8 kHz. We selected 15 files for our purposes, all from different speakers and each 5 minutes long. The second dataset consists of timetable system dialogues recorded at an 8 kHz sampling rate. The material consists of human speech commands, which are mainly very short, and of synthesized speech that provides rather long explanations about bus schedules. Finally, the third dataset consists of one long continuous recording from the lounge of our laboratory, sampled at 44.1 kHz. The goal of this material was to simulate wiretapping material collected by detectives.

We compare the proposed method with existing ones based on energy levels, long-term spectral information, and Gaussian mixture modeling, and provide comparative results on the three described datasets. The method works excellently when a small false speech acceptance rate is desired, which is the case, for example, in text-independent speaker verification. The main advantage of the SVM-based VAD is that it behaves consistently across different corpora, whereas the other methods were more sensitive to the change of dataset and to variations of their parameters. Our main conclusion is that, according to our experiments, the SVM is easier to adapt to new datasets than conventional methods, as long as a short annotated training sample from the recording environment is available.

Bibliography
  1. M. Tuononen, R. González Hautamäki, P. Fränti, "Applicability and Performance Evaluation of Voice Activity Detection", submitted to IEEE Trans. on Information Forensics and Security.
  2. J.-H. Chang, N.S. Kim, S.K. Mitra, "Voice Activity Detection Based on Multiple Statistical Models", IEEE Trans. Signal Processing, 54(6), June 2006, pp. 1965-1976.
  3. J. Ramirez, J.C. Segura, C. Benitez, A. de la Torre, A. Rubio, "Efficient voice activity detection algorithms using long-term speech information", Speech Communication, 42, 2004, pp. 271-287.
  4. J. Ramirez, P. Yelamos, J.M. Gorriz, J.C. Segura, "SVM-based speech endpoint detection using contextual speech features", Electronics Letters, 42(7), 2006.
  5. S.G. Tanyer and H. Özer, "Voice Activity Detection in Nonstationary Noise", IEEE Trans. Speech and Audio Processing, 8(4), July 2000.
  6. A. Davis, S. Nordholm, R. Togneri, "Statistical Voice Activity Detection Using Low-Variance Spectrum Estimation and an Adaptive Threshold", IEEE Trans. Audio, Speech, and Language Processing, 14(2), March 2006.
  7. W.M. Campbell, J.P. Campbell, D.A. Reynolds, E. Singer, P.A. Torres-Carrasquillo, "Support vector machines for speaker and language recognition", Computer Speech and Language 20(2-3), pp. 210-229, April 2006.