
Improving Speaker Verification by Periodicity Based Voice Activity Detection

Ville Hautamaki (University of Joensuu, Finland),
Marko Tuononen (University of Joensuu, Finland),
Tuija Niemi-Laitinen (National Bureau of Investigation, Finland),
Pasi Franti (University of Joensuu, Finland)

Voice Activity Detection (VAD) aims at classifying a given sound frame as speech or non-speech. It is often used as a front-end component in voice-based applications such as automatic speech recognition, speech enhancement, and voice biometrics. A common property of these applications is that only human sounds (typically only speech) are of interest to the system, and speech should therefore be separated from the background. Although VAD is widely needed and reasonably well-defined, existing solutions do not satisfy all user requirements. The methods either must be trained for the particular application and conditions, or they can be highly unreliable when the conditions change. Working solutions, however, are frequently requested by practitioners.

Traditionally, VAD research has been driven by telecommunications and voice-coding applications, in which the VAD has to operate with as small a delay as possible and all speech frames should be detected. Typically, these voice activity detectors work by modeling the noise signal statistics. An initial noise estimate is usually obtained from the beginning of the signal and is updated as the VAD makes non-speech decisions.
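
As a rough illustration of this scheme (a minimal sketch, not any standardized telecom VAD; all parameter names and values below are assumptions), the noise energy estimate can be initialized from the first frames and refreshed only on frames classified as non-speech:

import numpy as np

def noise_tracking_vad(frames, init_frames=10, margin_db=6.0, alpha=0.95):
    # frames: array of shape (num_frames, samples_per_frame)
    energies_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    noise_db = np.mean(energies_db[:init_frames])    # initial noise estimate
    decisions = []
    for e in energies_db:
        is_speech = e > noise_db + margin_db         # frame-wise decision
        decisions.append(is_speech)
        if not is_speech:                            # update noise statistics
            noise_db = alpha * noise_db + (1.0 - alpha) * e
    return decisions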

Even though realtime VAD for telecommunications is perhaps the most common application, other applications also exist, and their design criteria usually differ from those of telecom VADs. In speech segmentation applications, realtime operation is less important, since the goal is to process the input speech file (usually hours or even days long) in a background process, and the output is then used for retrieval or diarization tasks. Segmentation of forensic wiretapping recordings is an especially difficult task: no close-talking microphone is used, and the noise does not possess the properties assumed by telecom VADs (long-term stationarity).

In this study, our primary goal is to use VAD as a preprocessor for realtime speaker verification. It is clear that including non-speech frames in the modeling process would bias the resulting model, especially if the number of non-speech frames is significant. It is also known that not all speech frames have equal discriminative power: voiced phonemes are more discriminative than unvoiced phonemes, and it can therefore be beneficial to drop unvoiced frames to improve recognition accuracy. This is in contrast to telecom or speech recognition applications, where all speech should be accurately detected.
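
A minimal sketch of this preprocessing step (an illustration under assumed names, not the paper's implementation) simply drops the feature vectors of frames the VAD rejects before they reach the speaker modeling and scoring stages:

import numpy as np

def select_speech_frames(features, vad_decisions):
    # features: (num_frames, dim) feature matrix, e.g. cepstral vectors
    # vad_decisions: per-frame speech/non-speech booleans from the VAD
    mask = np.asarray(vad_decisions, dtype=bool)
    return features[mask]          # only speech frames are modeled/scored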

In realtime speaker verification, we need to start processing speech frames with as small a delay as possible. Each newly recorded speech frame is pushed to the signal processing subsystem, and further to the scoring subsystem, while the subject keeps speaking into the microphone. Not only is this a useful feature in standard applications, it is essential on low-power mobile phones, where non-realtime processing is not even usable. Good results in speaker verification can be achieved with a simple energy-based VAD, but that method needs to analyze the whole utterance before it can start to make speech/non-speech decisions.
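
The following sketch shows why such an energy-based VAD is inherently a two-pass method (the 30 dB margin is an assumed value, not a figure from the paper): the decision threshold depends on the maximum energy of the whole utterance, so no decision can be emitted until recording has finished.

import numpy as np

def two_pass_energy_vad(frames, drop_db=30.0):
    # First pass: frame energies over the complete utterance.
    energies_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    threshold_db = energies_db.max() - drop_db       # needs the full recording
    # Second pass: classify each frame against the global threshold.
    return energies_db > threshold_db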

Our proposed system is based on detecting the periodicity of a given frame. In contrast to telecom voice activity detectors, we do not model the noise; instead, we base our decisions on a feature that is known to be short-term stationary. The performance of the proposed method is compared against two existing methods: a realtime method based on long-term spectral divergence (LTSD) and a simple energy-based method, which needs two passes over the data. A summary of the speaker verification results with the corresponding VAD thresholds is shown in Table 1. The best parameters found by tuning on the NIST 2001 corpus were used for the NIST 2006 experiments. We obtained significantly higher error rates for NIST 2006 than for NIST 2001, as expected. The periodicity-based method clearly outperforms the other realtime method (LTSD), and performs comparably with the energy-based method on the NIST 2001 and 2006 speaker recognition evaluation corpora. The method is also tested for segmenting surveillance recordings, voice dialogs, and forensic recordings.

Table 1: Summary of speaker verification results (% EER).
              NIST 2001           NIST 2006
              model 512           model 64            model 512
              EER     Thr         EER     Thr         EER
No VAD        14      -           16      -           44
LTSD          12      40          14      45          36
Energy        9       36          10      35          17
Periodicity   8       0.61        10      0.66        17
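
As an illustration of the periodicity feature itself, here is a minimal sketch based on the normalized autocorrelation peak; the actual measure and parameter values used in the paper may differ, and the 0.61 threshold below is simply the NIST 2001 value from Table 1.

import numpy as np

def periodicity(frame, fs=8000, fmin=60.0, fmax=400.0):
    # Height of the largest normalized autocorrelation peak within a
    # plausible pitch-lag range; close to 1 for voiced (periodic) frames.
    frame = frame - np.mean(frame)
    ac = np.correlate(frame, frame, mode="full")[len(frame) - 1:]
    if ac[0] <= 0.0:
        return 0.0
    ac = ac / ac[0]                                  # normalize: ac[0] == 1
    lo, hi = int(fs / fmax), int(fs / fmin)          # lags for 400 Hz .. 60 Hz
    return float(np.max(ac[lo:hi]))

def periodicity_vad(frames, threshold=0.61):
    # Frames whose periodicity exceeds the threshold are labeled speech.
    return [periodicity(f) > threshold for f in frames]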