Interspeech 2009 Technical Programme

Mon-Ses1-K:
Sadaoki Furui - Selected topics from 40 years of research on speech and speaker recognition

Time:Monday 11:00 Place:Main Hall Type:Keynote
Chair:Isabel Trancoso

11:00Selected topics from 40 years of research on speech and speaker recognition

Sadaoki Furui (Tokyo Institute of Technology)

This talk summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, and HMM/GMM; text-prompted speaker recognition; speech recognition by dynamic features; Japanese LVCSR; spontaneous speech corpus construction and analysis; spontaneous speech recognition; automatic speech summarization; WFST-based decoder development and its applications; and unsupervised model adaptation methods.

Mon-Ses2-O1:
ASR: Features for Noise Robustness

Time:Monday 13:30 Place:Main Hall Type:Oral
Chair:Hynek Hermansky

13:30Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction

Chanwoo Kim (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)

This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
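The two ingredients named in the abstract can be illustrated with a minimal sketch. This is not the authors' PNCC implementation: the exponent value, the flooring constant and the function names are assumptions chosen for illustration, and the power-bias subtraction step that consumes the ratio is not shown.

```python
import math

def log_nonlinearity(power, floor=1e-10):
    """Log compression as in conventional MFCC processing (floor is an assumed guard)."""
    return math.log(max(power, floor))

def power_law_nonlinearity(power, exponent=0.1):
    """Power-law compression; the exponent 0.1 is an illustrative assumption."""
    return power ** exponent

def am_gm_ratio(powers, floor=1e-10):
    """Ratio of the arithmetic mean to the geometric mean of a power sequence."""
    clipped = [max(p, floor) for p in powers]
    arithmetic = sum(clipped) / len(clipped)
    geometric = math.exp(sum(math.log(p) for p in clipped) / len(clipped))
    return arithmetic / geometric

# Hypothetical power envelopes: a steady background and a strongly modulated signal.
steady = [1.0] * 8
bursty = [0.01, 4.0] * 4

print(am_gm_ratio(steady))  # close to 1 for constant power
print(am_gm_ratio(bursty))  # much larger when power fluctuates
```

The ratio is at least 1 and grows as the power envelope becomes more strongly modulated, which is why it can serve as a rough indicator of how much of a channel is steady background. Note also that the power law stays bounded as the power approaches zero, whereas the logarithm diverges.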

13:50Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition

Yu-Hsiang Bosco Chiu (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
Richard M. Stern (Carnegie Mellon University)

This paper presents a strategy to learn physiologically motivated components in a feature computation module discriminatively, directly from data, in a manner that is inspired by the presence of efferent processes in the human auditory system. In our model a set of logistic functions, which represent the rate-level nonlinearities found in most mammalian hearing systems, is included as part of the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with the CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-nonlinearity results in better performance in the presence of background noise than traditional procedures which separate the feature extraction and model training into two distinct parts.

14:10Temporal Modulation Processing of Speech Signals for Noise Robust ASR

Hong You (UCLA Electrical Engineering Dept.)
Abeer Alwan (UCLA Electrical Engineering Dept.)

We analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view, and propose a frequency adaptive modulation processing algorithm and apply it to a noise robust ASR task. Although previous psychoacoustic studies have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis on modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than by band-passed ones. We then propose a frequency adaptive modulation processing algorithm for a noise robust ASR task. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust frontends, including RASTA and ETSI AFE. Results show that the frequency adaptive modulation processing is promising.

14:30Progressive Memory-Based Parametric Non-Linear Feature Equalization

Luz García (Department of TSTC, University of Granada, Spain)
Roberto Gemello (LOQUENDO, Torino, ITALY)
Franco Mana (LOQUENDO, Torino, ITALY)
Jose Carlos Segura (Department of TSTC, University of Granada, Spain)

This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations have been outlined: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to the limitations encountered. The effects of the proposed evolutions are evaluated on three speech corpora, namely the WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The obtained results show that the encountered limitations can be overcome by the newly introduced techniques.

14:50Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment

Osamu Ichikawa (Tokyo Research Laboratory, IBM Research)
Takashi Fukuda (Tokyo Research Laboratory, IBM Research)
Ryuki Tachibana (Tokyo Research Laboratory, IBM Research)
Masafumi Nishimura (Tokyo Research Laboratory, IBM Research)

Since the MFCC are calculated from logarithmic spectra, the delta and delta-delta are considered as difference operations in a logarithmic domain. In a reverberant environment, speech signals have trailing reverberations, whose power is plotted as a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, where the reverberant tail decays rapidly. In an experiment using an evaluation framework (CENSREC-4), significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
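The argument about exponential decay can be checked numerically with a small sketch. It uses a plain first difference rather than the regression-based delta of a real front end, and the decay rate and frame count are hypothetical values, so it only illustrates the reasoning and is not the proposed feature.

```python
import math

def exponential_decay(initial_power, decay_per_frame, n_frames):
    """Power of a reverberant tail modeled as p[t] = p0 * exp(-a * t)."""
    return [initial_power * math.exp(-decay_per_frame * t) for t in range(n_frames)]

def first_difference(values):
    """Simplified delta: difference between consecutive frames."""
    return [b - a for a, b in zip(values, values[1:])]

# Hypothetical tail: unit initial power, decay rate 0.2 per frame, 30 frames.
tail = exponential_decay(1.0, 0.2, 30)

log_delta = first_difference([math.log(p) for p in tail])
linear_delta = first_difference(tail)

# The log-domain delta stays at the same magnitude for the whole tail,
# while the linear-domain delta shrinks together with the power itself.
print(log_delta[0], log_delta[-1])
print(linear_delta[0], linear_delta[-1])
```

In the logarithmic domain the decay becomes a straight line, so its difference is a constant offset that persists as long as the tail lasts. In the linear domain the difference is proportional to the remaining power and therefore fades quickly.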

15:10Local Projections and Support Vector Based Feature Selection in Speech Recognition

Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Luis Buera (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to apply reliability weighting techniques. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples which have influence on the error rate in mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Some experimental results are obtained with this method compared to baseline systems using the Aurora 2 database.

Mon-Ses2-O2:
Production: Articulatory modelling

Time:Monday 13:30 Place:East Wing 1 Type:Oral
Chair: Rob Van Son

13:30Feedforward Control of A 3D Physiological Articulatory Model for Vowel Production

Qiang Fang (Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences)
Akikazu Nishikido (IIPL, School of Information Science, Japan Advanced Institute of Science and Technology)
Jianwu Dang (IIPL, School of Information Science, Japan Advanced Institute of Science and Technology)
Aijun Li (Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences)

A 3D physiological articulatory model has been developed to account for the biomechanical properties of speech organs in speech production. To control the model for investigating the mechanism of speech production, a feedforward control strategy is necessary to generate proper muscle activations according to desired articulatory targets. In this paper, we elaborated a feedforward control module for the 3D physiological articulatory model. In the feedforward control process, an input articulatory target, specified by articulatory parameters, is transformed to an intrinsic representation of articulation; then, a muscle activation pattern is estimated by a proposed mapping function. The results showed that the proposed feedforward control strategy is able to control the proposed 3D physiological articulatory model with high accuracy both acoustically and articulatorily.

13:50Articulatory Modeling Based on Semi-polar Coordinates and Guided PCA Technique

Jun Cai (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Yves Laprie (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Julie Busset (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Fabrice Hirsch (Institut de Phonétique de Strasbourg, 2, rue Descartes, 67084 Strasbourg, France)

Research on 2-dimensional static articulatory modeling has been performed by using the semi-polar system and guided PCA of lateral X-ray images of the vocal tract. The density of the grid lines in the semi-polar system has been increased to obtain better descriptive precision. New parameters have been introduced to describe the movements of the tongue apex. An extra feature, the tongue root, has been extracted as one of the elementary factors in order to improve the precision of the tongue model. New methods still remain to be developed for describing the movements of the tongue apex.

14:10Sequencing of Articulatory Gestures using Cost Optimization

Juraj Simko (University College Dublin)
Fred Cummins (University College Dublin)

Within the framework of articulatory phonology (AP), gestures function as primitives, and their ordering in time is provided by a gestural score. Determining how they should be sequenced in time has been something of a challenge. We modify the task dynamic implementation of AP, by defining tasks to be the desired positions of physically embodied end effectors. This allows us to investigate the optimal sequencing of gestures based on a parametric cost function. Costs evaluated include precision of articulation, articulatory effort, and gesture duration. We find that a simple optimization using these costs results in stable gestural sequences that reproduce several known coarticulatory effects.

14:30From experiments to articulatory motion—a three dimensional talking head model

Xiao Bo Lu (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)
C. William Thorpe (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)
Kylie Foster (Department of Food and Health, the University of Massey, Auckland, New Zealand)
Peter Hunter (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)

The goal of this study is to develop a customised computer model that can accurately represent the motions of vocal articulators during vowels and consonants. Models of the articulators were constructed as Finite element (FE) meshes based on digitised high-resolution MRI (Magnetic Resonance Imaging) scans obtained during rest breathing. Articulatory kinematics during speaking were obtained by EMA (Electromagnetic Articulography) and video of the face. The movement information thus acquired was applied to the FE model to provide jaw motion, modeled as a rigid body, and tongue, cheek and lip movement modeled with a free-form deformation technique. The motion of the epiglottis has also been considered in the model.

14:50Towards Robust Glottal Source Modeling

Javier Pérez (TALP Research Center, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain)
Antonio Bonafonte (TALP Research Center, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain)

We present here a new method for the simultaneous estimation of the derivative glottal waveform and the vocal tract filter. The algorithm is pitch-synchronous and uses overlapping frames of several glottal cycles to increase the robustness and quality of the estimation. Two parametric models for the glottal waveform are used: the KLGLOTT88 during the convex optimization iteration, and the LF model for the final parametrization. We use a synthetic corpus using real data published in several studies to evaluate the performance. A second corpus has been specially recorded for this work, consisting of isolated vowels uttered with different voice qualities. The algorithm has been found to perform well with most of the voice qualities present in the synthetic data-set in terms of glottal waveform matching. The performance is also good with the real vowel data-set in terms of resynthesis quality.

15:10Sliding Vocal-tract Model and its Application for Vowel Production

Takayuki Arai (Sophia University)

In a previous study, Arai implemented a sliding vocal-tract model based on Fant’s three-tube model and demonstrated its usefulness for education in acoustics and speech science. The sliding vocal-tract model consists of a long outer cylinder and a short inner cylinder, which simulates tongue constriction in the vocal tract. This model can produce different vowels by sliding the inner cylinder and changing the degree of constriction. In this study, we investigated the model’s coverage of vowels on the vowel space and explored its application for vowel production in the speech and hearing sciences.

Mon-Ses2-O3:
Systems for LVCSR and Rich Transcription

Time:Monday 13:30 Place:East Wing 2 Type:Oral
Chair: Thomas Schaaf

13:30Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition

Haihua Xu (Shanghai Jiaotong University, China)
Daniel Povey (Microsoft Research, Redmond, WA, USA)
Jie Zhu (Shanghai Jiaotong University, China)
Guanyong Wu (Shanghai Jiaotong University, China)

In this paper we show how methods for approximating phone error, as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN), which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.
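The difference between MAP decoding and decoding that minimizes expected word errors can be sketched on an N-best list. This is neither the lattice-based MHPE method nor a confusion network; it is a generic minimum Bayes risk rescoring over a hypothetical list of hypotheses with made-up posteriors, and all names are illustrative.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = previous[j - 1] + (ref_word != hyp_word)
            current.append(min(previous[j] + 1, current[j - 1] + 1, substitution))
        previous = current
    return previous[-1]

def map_decode(nbest):
    """Pick the hypothesis with the highest posterior (minimizes sentence error)."""
    return max(nbest, key=lambda item: item[1])[0]

def mbr_decode(nbest):
    """Pick the hypothesis with the lowest expected word-level edit distance."""
    def expected_errors(candidate):
        return sum(p * edit_distance(other, candidate) for other, p in nbest)
    return min((hyp for hyp, _ in nbest), key=expected_errors)

# Hypothetical N-best list: (word sequence, posterior probability).
nbest = [
    (["wreck", "a", "nice", "beach"], 0.40),
    (["recognize", "speech"], 0.35),
    (["recognize", "peach"], 0.25),
]

print(map_decode(nbest))  # the single most probable sentence
print(mbr_decode(nbest))  # the sentence closest, on average, to the alternatives
```

Here the most probable sentence shares no words with the other two hypotheses, which together hold 60% of the probability mass, so the expected word error is lower for one of the two similar hypotheses.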

13:50Posterior-based Out-of-Vocabulary Word Detection in Telephone Speech

Stefan Kombrink (Brno University of Technology, Czech Republic)
Lukas Burget (Brno University of Technology, Czech Republic)
Pavel Matejka (Brno University of Technology, Czech Republic)
Martin Karafiat (Brno University of Technology, Czech Republic)
Hynek Hermansky (Johns Hopkins University, Baltimore (USA))

In this paper we present an out-of-vocabulary word detector suitable for English conversational and read speech. We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition system and an additional phone recognizer, which allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes. Reported results are on CallHome English and Wall Street Journal data.

14:10Automatic Transcription System for Meetings of the Japanese National Congress

Yuya Akita (Kyoto University)
Masato Mimura (Kyoto University)
Tatsuya Kawahara (Kyoto University)

This paper presents an automatic speech recognition (ASR) system for assisting meeting record creation of the National Congress of Japan. The system is designed to cope with spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we have proposed statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings, and achieved a word accuracy of 84%. It is also suggested that ASR-based transcripts with this accuracy level are usable for editing meeting records.

14:30Cross-language Bootstrapping for Unsupervised Acoustic Model Training: Rapid Development of a Polish Speech Recognition System

Jonas Lööf (RWTH Aachen University)
Christian Gollan (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

This paper describes the rapid development of a Polish language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish, through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untranscribed audio data. The system was trained and evaluated on recordings from the European Parliament, and included several state-of-the-art speech recognition techniques. Confidence based speaker adaptive training using feature space transform adaptation, as well as vocal tract length normalization and maximum likelihood linear regression, was used to refine the acoustic model. Through the combination of the different techniques, good recognition performance was achieved.

14:50Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese

Alberto Abad (INESC-ID Lisboa)
Isabel Trancoso (IST / INESC-ID Lisboa, Portugal)
Nelson Neto (Federal University of Pará, Belém, Brazil)
M. Céu Viana (Center of Linguistics of the University of Lisbon, Portugal)

This paper reports on recent work in the context of the activities of the PoSTPort project aimed at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. Concretely, in this paper we have focused on porting to Brazilian Portuguese. The impact of some of the main sources of variability has been assessed, and solutions have been proposed at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese Broadcast News system allowed a drastic performance improvement from 56.6% WER (obtained with the European Portuguese system) to 25.5%.
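To put the two WER figures quoted in the abstract in proportion, the absolute and relative reductions can be worked out directly. The helper name is illustrative; only the two WER values come from the abstract.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate removed by the improved system."""
    return (baseline - improved) / baseline

baseline_wer = 56.6  # European Portuguese system, as quoted in the abstract
ported_wer = 25.5    # ported Brazilian Portuguese system, as quoted in the abstract

absolute = baseline_wer - ported_wer
relative = relative_reduction(baseline_wer, ported_wer)
print(f"absolute: {absolute:.1f} points, relative: {100 * relative:.1f}%")
# prints "absolute: 31.1 points, relative: 54.9%"
```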

15:10Modeling Northern and Southern Varieties of Dutch for STT

Julien Despres (Vecsys Research)
Petr Fousek (CNRS-LIMSI)
Jean-Luc Gauvain (CNRS-LIMSI)
Sandrine Gay (Vecsys Research)
Yvan Josse (Vecsys Research)
Lori Lamel (CNRS-LIMSI)
Abdel Messaoudi (CNRS-LIMSI and Vecsys Research)

This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques that are used in our systems for other languages were found to be effective for the Dutch language; however, it was also found to be important to have acoustic and language models, and statistical pronunciation generation rules, adapted to each variety. This was in particular true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.

Mon-Ses2-O4:
Speech Analysis and Processing I

Time:Monday 13:30 Place:East Wing 3 Type:Oral
Chair:Ben Milner

13:30Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis

Thomas Ewender (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Sarah Hoffmann (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)

We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow the frames of a speech signal to be classified into five classes: voiced, unvoiced, mixed, irregularly glottalized and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.

13:50AM-FM Estimation for Speech Based on a Time-Varying Sinusoidal Model

Yannis Pantazis (University of Crete)
Olivier Rosec (Orange Labs)
Yannis Stylianou (University of Crete)

In this paper we present a method based on a time-varying sinusoidal model for a robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is modeled as a sinusoidal model with time-varying amplitudes. Specifically, the model makes use of a first order time polynomial with complex coefficients for capturing instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated by using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested, which was validated on synthetic AM-FM signals and tested on reconstruction of voiced speech signals, where the signal-to-error reconstruction ratio (SERR) was used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.

14:10Voice Source Waveform Analysis and Synthesis using Principal Component Analysis and Gaussian Mixture Modelling

Jon Gudnason (Imperial College London)
Mark Thomas (Imperial College London)
Patrick Naylor (Imperial College London)
Daniel Ellis (Columbia University)

The paper presents a voice source waveform modeling technique based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the 'prototype' voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited for both low-bitrate and high-quality analysis/synthesis schemes.

14:30Model-Based Estimation Of Instantaneous Pitch In Noisy Speech

Jung Ook Hong (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.

14:50Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

Thomas Drugman (Faculté Polytechnique de Mons)
Baris Bozkurt (Izmir Institute of Technology)
Thierry Dutoit (Faculté Polytechnique de Mons)

Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However, there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform to the speech production model, in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.

15:10Approximate Intrinsic Fourier Analysis of Speech

Frank Tompkins (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold for the case of many training examples--the setting of both greatest practical interest and computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.

Mon-Ses2-S1:
Special Session: INTERSPEECH 2009 Emotion Challenge

Time:Monday 13:30 Place:East Wing 4 Type:Special
Chair:Bjoern Schuller & Anton Batliner

#0Emotion Classification in Children’s Speech Using Fusion of Acoustic and Linguistic Features

Tim Polzehl (TU-Berlin, Deutsche Telekom Laboratories)
Shiva Sundaram (TU-Berlin, Deutsche Telekom Laboratories)
Hamed Ketabdar (TU-Berlin, Deutsche Telekom Laboratories)
Michael Wagner (National Centre for Biometric Studies)
Florian Metze (interACT)

This paper describes a system to detect angry vs. non-angry utterances of children who are engaged in dialog with an Aibo robot dog. The system was submitted to the Interspeech 2009 Emotion Challenge evaluation. The speech data consist of short utterances of the children’s speech, and the proposed system is designed to detect anger in each given chunk. Frame-based cepstral features, prosodic and acoustic features as well as glottal excitation features are extracted automatically, reduced in dimensionality and classified by means of an artificial neural network and a support vector machine. An automatic speech recognizer transcribes the words in an utterance and yields a separate classification based on the degree of emotional salience of the words. Late fusion is applied to make a final decision on anger vs. non-anger of the utterance. Preliminary results show 75.9% unweighted average recall on the training data and 67.6% on the test set.
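Unweighted average recall, the measure quoted here and in several other abstracts of this session, is the mean of the per-class recalls, so each class counts equally regardless of how many examples it has. A minimal sketch follows; the labels and counts are hypothetical and do not come from the challenge data.

```python
from collections import defaultdict

def unweighted_average_recall(reference, predicted):
    """Mean of per-class recalls over the classes present in the reference labels."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for ref, pred in zip(reference, predicted):
        total[ref] += 1
        correct[ref] += int(ref == pred)
    return sum(correct[c] / total[c] for c in total) / len(total)

# Hypothetical imbalanced set: 2 angry and 8 non-angry utterances,
# with a classifier that always answers "non-angry".
reference = ["angry"] * 2 + ["non-angry"] * 8
predicted = ["non-angry"] * 10

accuracy = sum(r == p for r, p in zip(reference, predicted)) / len(reference)
uar = unweighted_average_recall(reference, predicted)
print(accuracy, uar)  # prints "0.8 0.5"
```

A classifier that ignores the minority class still reaches 80% accuracy in this example but only 50% unweighted average recall, which is the chance level for two classes.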

#0Acoustic Emotion Recognition using Dynamic Bayesian Networks and Multi-Space Distributions

Roberto Barra-Chicote (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Fernando Fernandez (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Syaheerah Lutfi (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Juan Manuel Lucas-Cuesta (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Javier Macias-Guarasa (Department of Electronics. University of Alcala. Spain)
Juan Manuel Montero (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)

In this paper we describe the acoustic emotion recognition system built at the Speech Technology Group of the Universidad Politecnica de Madrid (Spain) to participate in the INTERSPEECH 2009 Emotion Challenge. Our proposal is based on the use of a Dynamic Bayesian Network (DBN) to deal with the temporal modelling of the emotional speech information. The selected features (MFCC, F0, Energy and their variants) are modelled as different streams, and the F0 related ones are integrated under a Multi Space Distribution (MSD) framework, to properly model their dual nature (voiced/unvoiced). Experimental evaluation on the challenge test set shows an unweighted recall of 67.06% and 38.24% for the 2- and 5-class tasks, respectively. In the 2-class case, we achieve results similar to the baseline with 8.5 times fewer features. In the 5-class case, we achieve a statistically significant 6.5% relative improvement.

#0Brno University of Technology System for Interspeech 2009 Emotion Challenge

Marcel Kockmann (Brno University of Technology, Czech Republic)
Lukas Burget (Brno University of Technology, Czech Republic)
Jan Cernocky (Brno University of Technology, Czech Republic)

This paper describes the Brno University of Technology (BUT) system for the Interspeech 2009 Emotion Challenge. Our submitted system for the Open Performance Sub-Challenge uses acoustic frame based features as a front-end and Gaussian Mixture Models as a back-end. Different feature types and modeling approaches successfully applied in speaker and language recognition are investigated, and we achieve a 16% and 9% relative improvement over the best dynamic and static baseline system on the 5-class task, respectively.

#0Cepstral and Long-Term Features for Emotion Recognition

Pierre Dumouchel (Ecole de technologie superieure)
Najim Dehak (Ecole de technologie superieure)
Yazid Attabi (Ecole de technologie superieure)
Reda Dehak (Laboratoire de recherche et de developpement de l'EPITA)
Narjes Boufaden (Centre de recherche informatique de Montreal)

In this paper, we describe systems that were developed for the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We participate in both two-class and five-class emotion detection. For the two-class problem, the best performance is obtained by logistic regression fusion of three systems. These systems use short- and long-term speech features. This fusion achieved an absolute improvement of 2.6% on the unweighted recall value compared with [6]. For the five-class problem, we submitted two individual systems: cepstral GMM vs. long-term GMM-UBM. The best result comes from a cepstral GMM and produced an absolute improvement of 3.5% compared to [6].

#0Exploring the benefits of discretization of acoustic features for speech emotion recognition

Thurid Vogt (Multimedia Concepts and Applications, University of Augsburg, Germany)
Elisabeth André (Multimedia Concepts and Applications, University of Augsburg, Germany)

We present a contribution to the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We evaluate the feature extraction and classifier of EmoVoice, our framework for real-time emotion recognition from voice, on the challenge database and achieve competitive results. Furthermore, we explore the benefits of discretizing numeric acoustic features and find it beneficial in a multi-class task.

#0Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge

Iker Luengo (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Eva Navas (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Inmaculada Hernáez (Department of Electronics and Telecommunication, University of the Basque Country, Spain)

This paper describes the system presented at the Interspeech 2009 Emotion Challenge. It relies on both spectral and prosodic features in order to automatically detect the emotional state of the speaker. As both kinds of features have very different characteristics, they are treated separately, creating two sub-classifiers, one using the spectral features and the other using the prosodic ones. The results of these two classifiers are then combined with a fusion system based on Support Vector Machines.

#0GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge

Santiago Planet (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Ignasi Iriondo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Joan-Claudi Socoró (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Carlos Monzo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Jordi Adell (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)

This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge [1]. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the Classifier Sub-Challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations.

#0Improving Automatic Emotion Recognition from Speech Signals

Elif Bozkurt (Koc University, Istanbul, Turkey)
Engin Erzin (Koc University, Istanbul, Turkey)
Cigdem Eroglu Erdem (Bahcesehir University, Istanbul, Turkey)
Tanju Erdem (Ozyegin University, Istanbul, Turkey)

We present a speech-signal-driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. We investigate prosody-related, spectral and HMM-based features for the evaluation of emotion recognition with Gaussian mixture model (GMM) based classifiers. Spectral features consist of mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF) features and their derivatives, whereas prosody-related features consist of mean normalized values of pitch, first derivative of pitch and intensity. Unsupervised training of HMM structures is employed to define prosody-related temporal features for the emotion recognition problem. We also investigate data fusion of different features and decision fusion of different classifiers, which are not well studied within the emotion recognition framework.

#0Emotion Recognition Using a Hierarchical Binary Decision Tree Approach

Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Emily Mower (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Carlos Busso (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)

Emotion state tracking is an important aspect of human-computer and human-robot interaction. It is important to design task-specific emotion recognition systems for real-world applications. In this work, we propose a hierarchical structure loosely motivated by Appraisal Theory for emotion recognition. The levels in the hierarchical structure are carefully designed to place the easier classification task at the top level and delay the decision between highly ambiguous classes to the end. The proposed structure maps an input utterance into one of the five emotion classes through subsequent layers of binary classifications. We obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall percentage on the evaluation data set improves by 3.3% absolute (8.8% relative) over the baseline model.
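
The unweighted recall reported by the challenge papers above is the mean of the per-class recalls, so a rare emotion class weighs as much as a frequent one. The following is a minimal sketch of that measure, assuming plain Python lists of labels; the function name and the toy labels are illustrative assumptions and are not challenge data.

```python
from collections import Counter

def unweighted_average_recall(y_true, y_pred):
    """Return (mean of per-class recalls, dict of per-class recalls)."""
    totals = Counter(y_true)  # number of reference items per class
    hits = Counter(t for t, p in zip(y_true, y_pred) if t == p)  # correct items per class
    recalls = {c: hits[c] / n for c, n in totals.items()}
    return sum(recalls.values()) / len(recalls), recalls

# Toy labels invented for illustration: 8 "neutral" items, 2 "anger" items.
y_true = ["neutral"] * 8 + ["anger"] * 2
y_pred = ["neutral"] * 8 + ["neutral", "anger"]

uar, per_class = unweighted_average_recall(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

On this toy data the plain accuracy is higher than the unweighted recall, because the missed "anger" item costs half of that class's recall but only a tenth of the overall accuracy.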

13:30The INTERSPEECH 2009 Emotion Challenge

Bjoern Schuller (Technische Universitaet Muenchen)
Stefan Steidl (Friedrich-Alexander University Erlangen-Nuremberg)
Anton Batliner (Friedrich-Alexander University Erlangen-Nuremberg)

The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test-conditions exist to compare performances under exactly the same conditions. Instead, a multiplicity of evaluation strategies employed – such as cross-validation or percentage splits without proper instance definition – prevents exact reproducibility. This INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus serves as basis with clearly defined test and training partitions incorporating speaker independence as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.

Mon-Ses2-P1:
Speech perception I

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair:Paul Boersma

#1Relative importance of formant and whole-spectral cues for vowel perception

Masashi Ito (Graduate School of Engineering, Tohoku University, Japan)
Keiji Ohara (Research Institute of Electrical Communication, Tohoku University, Japan)
Akinori Ito (Graduate School of Engineering, Tohoku University, Japan)
Masafumi Yano (Research Institute of Electrical Communication, Tohoku University, Japan)

Three psycho-acoustical experiments were carried out to investigate the relative importance of formant frequency and whole spectral shape as cues for vowel perception. Four types of vowel-like signals were presented to eight listeners. The mean responses for stimuli including both formant and amplitude-ratio features were quite similar to those for the stimuli including only the formant peak feature. Nonetheless, reasonable vowel changes were observed in responses for stimuli including only the amplitude-ratio feature. The perceived vowel changes were also observed even for stimuli including neither of these features. The results suggested that perceptual cues were involved in various parts of the vowel spectrum.

#2Influences of vowel duration on speaker-size estimation and discrimination

Chihiro Takeshima (Kyoto City University of Arts)
Minoru Tsuzaki (Kyoto City University of Arts)
Toshio Irino (Faculty of Systems Engineering, Wakayama University)

Several studies have shown that the auditory system has a mechanism to extract speaker-size information, using sufficiently long sounds. This paper investigated the influence of vowel duration on the processing for size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the speaker size for isolated vowels. The results showed that listeners' size perception was highly correlated with the vocal-tract length in all the tested durations (from 16 ms to 256 ms). In a size discrimination experiment, listeners were presented with two vowels and were asked which vowel was perceived to be spoken by a smaller speaker. The results showed that the just-noticeable differences in speaker size rose considerably for the 16-ms duration. These observations suggest that the auditory system can extract size information even for 16-ms vowels although the precision of size extraction would deteriorate when the duration becomes less than 32 ms.

#3High Front Vowels in Czech: a Contrast in Quantity or Quality?

Václav Jonáš Podlipský (Department of English and American Studies, Palacký University in Olomouc, Czech Republic)
Radek Skarnitzl (Institute of Phonetics, Faculty of Arts, Charles University in Prague, Czech Republic)
Jan Volín (Institute of Phonetics, Faculty of Arts, Charles University in Prague, Czech Republic)

We investigate the perception and production of Czech /I/ and /i:/, a contrast traditionally described as quantitative. First, we show that the spectral difference between the vowels is for many Czechs as strong a cue as (or even stronger than) duration. Second, we test the hypothesis that this shift towards vowel quality as a perceptual cue for this contrast resulted in weakening of the durational differentiation in production. Our measurements confirm this: members of the /I/-/i:/ pair differed in duration much less than those of other short-long pairs. We interpret these findings in terms of Lindblom’s H&H theory.

#4Effect of contralateral noise on energetic and informational masking on speech-in-speech intelligibility

Marjorie Dole (Laboratoire Dynamique du Langage UMR5596)
Michel Hoen (Stem Cell and Brain Research Institute U846)
Fanny Meunier (Laboratoire Dynamique du Langage UMR5596)

This experiment tested the advantage of binaural presentation of an interfering noise in a task involving identification of monaurally-presented words. These words were embedded in three types of noise: a stationary noise, a speech-modulated noise and a speech-babble noise, in order to assess energetic and informational masking contributions to binaural unmasking. Our results showed important informational masking in the monaural condition, principally due to lexical and phonetic competition. We also found a binaural unmasking effect, which was more important when speech was used as interferer, suggesting that this suppressive effect was more efficient in the case of high-level informational (lexical and phonetic) competition.

#5Using location cues to track speaker changes from mobile, binaural microphones (this work was funded by the EU Cognitive Systems STReP project POP, Perception On Purpose)

Heidi Christensen (University of Sheffield)
Jon Barker (University of Sheffield)

This paper presents initial developments towards computational hearing models that move beyond stationary microphone assumptions. We present a particle filtering based system for using localisation cues to track speaker changes in meeting recordings. Recordings are made using in-ear binaural microphones worn by a listener whose head is constantly moving. Tracking speaker changes requires simultaneously inferring the perceiver's head orientation, as any change in relative spatial angle to a source can be caused by either the source moving or the microphones moving. In real applications, such as robotics, there may be access to external estimates of the perceiver's position. We investigate the effect of simulating varying degrees of measurement noise in an external perceiver position estimate. We show that only limited self-position knowledge is needed to greatly improve the reliability with which we can decode the acoustic localisation cues in the meeting scenario.

#6A perceptual investigation of speech transcription errors involving frequent near-homophones in French and American English

Ioana Vasilescu (LIMSI-CNRS, France)
Martine Adda-Decker (LIMSI-CNRS, France)
Lori Lamel (LIMSI-CNRS, France)
Pierre Hallé (LPP-CNRS)

This article compares the errors made by automatic speech recognizers to those made by humans for near-homophones in American English and French. This exploratory study focuses on the impact of limited word context and the potential resulting ambiguities for automatic speech recognition (ASR) systems and human listeners. Perceptual experiments using 7-gram chunks centered on incorrect or correct words output by an ASR system show that humans make significantly more transcription errors on the first type of stimuli, thus highlighting the local ambiguity. The long-term aim of this study is to improve the modeling of such ambiguous items in order to reduce ASR errors.

#7The role of glottal pulse rate and vocal tract length in the perception of speaker identity

Etienne Gaudrain (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Su Li (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Vin Shen Ban (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Roy D Patterson (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)

In natural speech, for a given speaker, vocal tract length (VTL) is effectively fixed whereas glottal pulse rate (GPR) is varied to indicate prosodic distinctions. This suggests that VTL will be a more reliable cue for identifying a speaker than GPR. It also suggests that listeners will accept larger changes in GPR before perceiving speaker change. We measured the effect of GPR and VTL on the perception of a speaker difference, and found that listeners hear different speakers given a VTL difference of 25%, but they require a GPR difference of 45%.

#8Development of voicing categorization in deaf children with cochlear implant

Victoria Medina (Laboratoire Psychologie de la Perception, Université Paris Descartes, CNRS)
Willy Serniclaes (Laboratoire Psychologie de la Perception, Université Paris Descartes, CNRS)

Cochlear implants (CI) improve hearing, but communication abilities still depend on several factors. The present study assesses the development of voicing categorization in deaf children with cochlear implants, examining both categorical perception (CP) and boundary precision (BP) performances. We compared 22 implanted children to 55 normal-hearing children using different age factors. The results showed that the development of voicing perception in CI children is fairly similar to that in normal-hearing controls with the same auditory experience and irrespective of differences in the age of implantation (two vs. three years of age).

#9Processing Liaison-Initial Words in Native and Non-Native French: Evidence from Eye Movements

Annie Tremblay (University of Illinois at Urbana-Champaign)

French listeners have no difficulty recognizing liaison-initial words. This is in part because acoustic/phonetic information distinguishes liaison consonants from (non-resyllabified) word onsets in the speech signal. Using eye tracking, this study investigates whether native speakers of English, a language that does not have a phonological resyllabification process like liaison, can develop target-like segmentation procedures for recognizing liaison-initial words in French, and if so, how such procedures develop with increasing proficiency.

#10Estimating the Potential of Signal and Interlocutor-Track Information for Language Modeling

Nigel Ward (University of Texas at El Paso)
Benjamin Walker (University of Texas at El Paso)

Although today most language models treat language purely as word sequences, there is recurring interest in tapping new sources of information, such as disfluencies, prosody, the interlocutor's dialog act, and the interlocutor's recent words. In order to estimate the potential value of such sources of information, we extend Shannon's guessing-game method for estimating entropy to work for spoken dialog. Four teams of two subjects each predicted the next word in a dialog using various amounts of context: one word, two words, all the words spoken so far or the full dialog audio so far. The entropy benefit in the full-audio condition over the full text condition was substantial, .64 bits per word, greater than the .54 bit benefit of full text context over trigrams. This suggests that language models may be improved by use of the prosody of the speaker and context from the interlocutor.
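
To make the guessing-game idea concrete, the sketch below computes Shannon's classic entropy bounds from the number of guesses needed per token. It covers only the original text-based formulation, not the spoken-dialog extension described in the abstract. The function name and the guess ranks are illustrative assumptions, and the lower bound is meaningful only when the guess-rank frequencies are non-increasing, as for an ideal guesser.

```python
import math
from collections import Counter

def guessing_game_bounds(guess_ranks):
    """Shannon-style entropy bounds in bits per token.

    guess_ranks[k] is the number of guesses needed for token k
    (1 means the first guess was correct).
    """
    n = len(guess_ranks)
    counts = Counter(guess_ranks)
    max_rank = max(counts)
    # q[i - 1] is the relative frequency of needing exactly i guesses.
    q = [counts.get(i, 0) / n for i in range(1, max_rank + 1)]
    # Upper bound: entropy of the guess-rank distribution.
    upper = -sum(p * math.log2(p) for p in q if p > 0)
    # Lower bound: sum over i of i * (q_i - q_{i+1}) * log2(i), with q past the end = 0.
    q_next = q[1:] + [0.0]
    lower = sum(i * (q[i - 1] - q_next[i - 1]) * math.log2(i)
                for i in range(1, max_rank + 1))
    return lower, upper

# Invented guess ranks for eight tokens (not data from the paper).
ranks = [1, 1, 1, 1, 2, 2, 3, 1]
lower, upper = guessing_game_bounds(ranks)
```

Comparing such bounds across context conditions (for example, text only versus full audio) is what yields a per-word entropy benefit of the kind reported in the abstract.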

Mon-Ses2-P2:
Accent and Language Recognition

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair: William Campbell

#1Factor Analysis and SVM for Language Recognition

Florian Verdet (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France and Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)
Driss Matrouf (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean-François Bonastre (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean Hennebert (Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks. In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well-performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers. Experiments are conducted using NIST LRE 2005's primary condition. The relative equal error rate reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.

#2Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition

Sabato Marco Siniscalchi (NTNU)
Jeremy Reed (Georgia Institute of Technology)
Torbjørn Svendsen (NTNU)
Chin-Hui Lee (Georgia Institute of Technology)

We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using the vector space modeling approaches to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is implemented with a set of vector space language classifiers. Although the present study is just in its preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage a continuing exploration of the proposed framework.

#3On the use of Phonological Features for Automatic Accent Analysis

Abhijeet Sangwan (Center for Robust Speech Systems)
John Hansen (Center for Robust Speech Systems)

In this paper, we present an automatic accent analysis system that is based on phonological features (PFs). The proposed system exploits the knowledge of articulation embedded in phonology by rapidly building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word on a scale of native-like (−1) to non-native-like (+1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The work developed in this paper is easily assimilated into language learning systems, and has impact in the areas of speaker recognition and ASR (automatic speech recognition).

#4Language Recognition Using Language Factors

Fabio Castaldo (Politecnico di Torino)
Sandro Cumani (Politecnico di Torino)
Pietro Laface (Politecnico di Torino)
Daniele Colibro (Loquendo)

Language recognition systems based on acoustic models reach state-of-the-art performance using discriminative training techniques. In speaker recognition, eigenvoice modeling of the speaker and the use of speaker factors as input features to SVMs have recently been demonstrated to give good results compared to the standard GMM-SVM approach, which combines GMM supervectors and SVMs. In this paper we propose, in analogy to the eigenvoice modeling approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language factors are low-dimension vectors, training and evaluating SVMs with different kernels and with large numbers of training examples becomes an easy task. This approach is demonstrated on the 14 languages of the NIST 2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique.

#5Automatic Accent Detection: Effect of Base Units and Boundary Information

Je Hun Jeon (The University of Texas at Dallas)
Yang Liu (The University of Texas at Dallas)

Automatic prominence or pitch accent detection is important as it can perform automatic prosodic annotation of speech corpora, as well as provide additional features in other tasks such as keyword detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind of boundary information is available. We compare word, syllable, and vowel-based units when their boundaries are provided. We also automatically estimate syllable boundaries using energy contours when phone-level alignment is available. In addition, we utilize a sliding window with fixed length under the condition of unknown boundaries. Our experiments show that when boundary information is available, using a longer base unit achieves better performance. In the case of no boundary information, using a moving window with a fixed size achieves similar performance to using syllable information on word-level evaluation, suggesting that accent detection can be performed without relying on a speech recognizer to generate boundaries.

#6Age Verification Using a Hybrid Speech Processing Approach

Ron M Hecht (PuddingMedia)
Omer Hezroni (PuddingMedia)
Amit Manna (PuddingMedia)
Ruth Aloni-Lavi (PuddingMedia)
Gil Dobry (PuddingMedia)
Amir Alfandary (Nice systems)
Yaniv Zigel (Bio-medical Engineering Dept., Ben-Gurion University)

The human speech production system is a multi-level system. On the upper level, it starts with information that one wants to transmit. It ends on the lower level with the materialization of the information into a speech signal. Most of the recent work conducted in age estimation is focused on the lower acoustic level. In this research the upper lexical level information is utilized for age-group verification and it is shown that one's vocabulary reflects one's age. Several age-group verification systems that are based on automatic transcripts are proposed. In addition, a hybrid approach is introduced that combines the word-based system and an acoustic-based system. Experiments were conducted on a four-age-group verification task using the Fisher corpora, where an average equal error rate (EER) of 28.7% was achieved using the lexical-based approach and 28.0% using an acoustic approach. By merging these two approaches the verification error was reduced to 24.1%.

#7Information Bottleneck Based Age Verification

Ron M Hecht (PuddingMedia, Kfar-Saba, Israel)
Omer Hezroni (PuddingMedia, Kfar-Saba, Israel)
Amit Manna (PuddingMedia, Kfar-Saba, Israel)
Gil Dobry (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Yaniv Zigel (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck (AIB) approach is used to tackle one of the most fundamental drawbacks of word N-gram models: their abundant amount of irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters; this provides a mechanism to transform any sequence of words to a sequence of word-cluster labels. Consequently, word N-gram models are converted to word-cluster N-gram models which are more compact. Age verification experiments were conducted on the Fisher corpora. Their goal was to verify the age-group of the speaker of an unknown speech segment. In these experiments an N-gram model was compressed to a fifth of its original size without reducing the verification performance. In addition, a verification accuracy improvement is demonstrated by discarding irrelevant information.
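
The word-to-cluster transformation described above can be illustrated with a small sketch. It assumes the clustering is already given; the hand-made `word_to_cluster` table below is a hypothetical stand-in for clusters that the paper derives with AIB, and the helper names and toy utterances are likewise illustrative assumptions.

```python
from collections import Counter

def to_cluster_labels(words, word_to_cluster, unknown="<unk>"):
    """Replace each word by its cluster label; unseen words map to `unknown`."""
    return [word_to_cluster.get(w, unknown) for w in words]

def ngram_counts(tokens, n):
    """Count all contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

# Hypothetical clustering, invented for illustration (not produced by AIB).
word_to_cluster = {
    "cool": "C1", "awesome": "C1",
    "splendid": "C2", "lovely": "C2",
    "that": "C3", "is": "C4",
}
utterances = [
    ["that", "is", "cool"],
    ["that", "is", "awesome"],
    ["that", "is", "splendid"],
    ["that", "is", "lovely"],
]

word_bigrams = Counter()
cluster_bigrams = Counter()
for utt in utterances:
    word_bigrams.update(ngram_counts(utt, 2))
    cluster_bigrams.update(ngram_counts(to_cluster_labels(utt, word_to_cluster), 2))
```

Merging words that behave alike leaves fewer distinct bigrams to store, which is the sense in which the cluster-level model is more compact.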

#8Discriminative N-gram Selection for Dialect Recognition

Fred Richardson (MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)
Pedro Torres-Carrasquillo (MIT Lincoln Laboratory)

Dialect recognition is a challenging and multifaceted problem. Distinguishing between dialects can rely upon many tiers of interpretation of speech data, e.g., prosodic, phonetic, spectral, and word. High-accuracy automatic methods for dialect recognition typically use either phonetic or spectral characteristics of the input. A challenge with spectral systems, such as those based on shifted-delta cepstral coefficients, is that they achieve good performance but do not provide insight into distinctive dialect features. In this work, a novel method based upon discriminative training and phone N-grams is proposed. This approach achieves excellent classification performance, fuses well with other systems, and has interpretable dialect characteristics in the phonetic tier. The method is demonstrated on data from the LDC and prior NIST language recognition evaluations. The method is also combined with spectral methods to demonstrate state-of-the-art performance in dialect recognition.

#9Data-driven Phonetic Comparison and Conversion between South African, British and American English Pronunciations

Linsen Loots (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)
Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)

We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, the accuracy is determined with which decision tree based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.

#10Target-Aware Language Models for Spoken Language Recognition

Rong Tong (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Eng Siong Chng (Nanyang Technological University, Singapore)

This paper studies a way of constructing multiple phone tokenizers for language recognition. In this approach, each phone tokenizer for a target language will share a common set of acoustic models, while each will have a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique, as the TALM applies the LM information in the front-end while the PPRLM approach uses an LM in the system back-end. Furthermore, the TALM exploits discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training TALM is also studied in this paper.

#11Language Identification for Speech-to-Speech Translation

Daniel Chung Yong Lim (Language Technologies Institute, Carnegie Mellon University)
Ian Lane (Language Technologies Institute, Carnegie Mellon University)

This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system’s real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The performances of these three approaches were evaluated to identify the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.

#12Using Prosody and Phonotactics in Arabic Dialect Identification

Fadi Biadsy (Columbia University)
Julia Hirschberg (Columbia University)

While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker’s dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects: Gulf, Iraqi, Levantine, and Egyptian, for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2-minute utterances.

Mon-Ses2-P3:
ASR: Acoustic Model Training and Combination

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair:Jeff Bilmes

#1Refactoring Acoustic Models using Variational Expectation-Maximization

Pierre Dognin (IBM T.J. Watson Research Center (USA))
John Hershey (IBM T.J. Watson Research Center (USA))
Vaibhava Goel (IBM T.J. Watson Research Center (USA))
Peder Olsen (IBM T.J. Watson Research Center (USA))

In probabilistic modeling, it is often useful to change the structure, or refactor, a model, so that it has a different number of components, different parameter sharing, or other constraints. For example, we may wish to find a Gaussian mixture model (GMM) with fewer components that best approximates a reference model. Maximizing the likelihood of the refactored model under the reference model is equivalent to minimizing their KL divergence. For GMMs, this optimization is not analytically tractable. However, a lower bound to the likelihood can be maximized using a variational expectation-maximization algorithm. Automatic speech recognition provides a good framework to test the validity of such methods, because we can train reference models of any given size for comparison with refactored models. We show that we can efficiently reduce model size by 50%, with the same recognition performance as the corresponding model trained from data.
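The quantity being minimized can be illustrated with a small numerical sketch. The code below is not the variational EM algorithm of the paper; it only estimates the KL divergence between a reference GMM and a smaller candidate by Monte Carlo sampling, which is one simple way to compare refactored models. The one-dimensional mixtures, the function names, and the sample count are all assumptions made for illustration.

```python
import math
import random

def gmm_logpdf(x, weights, means, variances):
    """Log-density of a 1-D Gaussian mixture at x (log-sum-exp for stability)."""
    terms = [
        math.log(w) - 0.5 * math.log(2 * math.pi * v) - (x - m) ** 2 / (2 * v)
        for w, m, v in zip(weights, means, variances)
    ]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))

def gmm_sample(rng, weights, means, variances):
    """Draw one sample: pick a component, then sample from its Gaussian."""
    k = rng.choices(range(len(weights)), weights=weights)[0]
    return rng.gauss(means[k], math.sqrt(variances[k]))

def mc_kl(ref, small, n=20000, seed=0):
    """Monte Carlo estimate of KL(ref || small) using samples from ref."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        x = gmm_sample(rng, *ref)
        total += gmm_logpdf(x, *ref) - gmm_logpdf(x, *small)
    return total / n

# Hypothetical models: a 2-component reference and a 1-component candidate
# whose mean and variance match the reference's overall moments.
reference = ([0.5, 0.5], [-2.0, 2.0], [1.0, 1.0])
refactored = ([1.0], [0.0], [5.0])

kl_estimate = mc_kl(reference, refactored)
kl_self = mc_kl(reference, reference)
```

Because the expected log-likelihood of the candidate under the reference equals the negative reference entropy minus this KL divergence, a smaller estimate means a better approximation of the reference model.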

#2Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition

Georg Heigold (RWTH Aachen University)
David Rybach (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven to be successful, but it includes many heuristics, approximations, and parameters to be tuned. This tuning involves much engineering and makes it difficult to reproduce and compare experiments. In contrast to the conventional training, convex optimization techniques provide a sound approach to estimate all model parameters from scratch. Such a direct approach will hopefully dispense with additional heuristics, e.g. scaling of posteriors. This paper addresses the question of how well this concept using log-linear models carries over to practice. Experimental results are reported for a digit string recognition task, which allows for the investigation of this issue without approximations.

#3Investigations on discriminative training in large scale acoustic model estimation

Janne Pylkkönen (Adaptive Informatics Research Centre, Helsinki University of Technology)

In this paper two common discriminative training criteria, maximum mutual information (MMI) and minimum phone error (MPE), are investigated. Two main issues are addressed: sensitivity to different lattice segmentations and the contribution of the parameter estimation method. It is noted that MMI and MPE may benefit from different lattice segmentation strategies. The use of discriminative criterion values as the measure of model goodness is shown to be problematic as the recognition results do not correlate well with these measures. Moreover, the parameter estimation method clearly affects the recognition performance of the model irrespective of the value of the discriminative criterion. Also the dependence on the recognition task is demonstrated by example with two Finnish large vocabulary dictation tasks used in the experiments.

#4Margin-Space Integration of MPE Loss via Differencing of MMI Functionals for Generalized Error-Weighted Discriminative Training

Erik McDermott (NTT Corporation)
Shinji Watanabe (NTT Corporation)
Atsushi Nakamura (NTT Corporation)

Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using Maximum Mutual Information (MMI)), this article subsumes and extends margin-based MPE and MMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of MMI functionals; lattice-based training using the new criterion can then be carried out using differences of MMI gradients. Preliminary experimental results comparing the new framework with margin-based MMI, MCE and MPE on the Corpus of Spontaneous Japanese and the MIT OpenCourseWare/MIT-World corpus are presented.

#5Compacting Discriminative Feature Space Transforms for Embedded Devices

Etienne Marcheret (IBM)
Jia-Yu Chen (UIUC)
Petr Fousek (IBM)
Peder Olsen (IBM)
Vaibhava Goel (IBM)

Discriminative training of the feature space using the minimum phone error objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost of memory. In this paper we present techniques that maintain fMPE performance while reducing the required memory by approximately 94%. This is achieved by designing a quantization methodology which minimizes the error between the true fMPE computation and that produced with the quantized parameters. Also illustrated is a Viterbi search over the allocation of quantization levels, providing a framework for optimal non-uniform allocation of quantization levels over the dimensions of the fMPE feature vector. This provides an additional 8% relative reduction in required memory with no loss in recognition accuracy.

#6A Discriminative Back-off Acoustic Model for Automatic Speech Recognition

Hung-An Chang (MIT Computer Science and Artificial Intelligence Laboratory)
James R. Glass (MIT Computer Science and Artificial Intelligence Laboratory)

In this paper we propose a back-off discriminative acoustic model for Automatic Speech Recognition (ASR). We use a set of broad phonetic classes to divide the classification problem originating from context-dependent modeling into a set of sub-problems. By appropriately combining the scores from classifiers designed for the sub-problems, we can guarantee that the back-off acoustic score for different context-dependent units will be different. The back-off model can be combined with discriminative training algorithms to further improve the performance. Experimental results on a large vocabulary lecture transcription task show that the proposed back-off discriminative acoustic model achieves more than a 2.0% absolute word error rate reduction compared to a clustering-based acoustic model.

#7Efficient Generation and Use of MLP Features for Arabic Speech Recognition

Junho Park (University of Cambridge)
Frank Diehl (University of Cambridge)
Mark Gales (University of Cambridge)
Marcus Tomalin (University of Cambridge)
Phil Woodland (University of Cambridge)

Features derived from Multi-Layer Perceptrons (MLPs) raise the challenge of how to build such complex MLPs efficiently with a huge amount of training data. This paper discusses various methods to reduce the training effort for the incorporation of MLP features into an ASR system: parallel network design and training; methods for combining the outputs of those parallel networks; and a rapid retraining procedure for discriminatively trained MLP-feature based acoustic models. The use of parallel network combination gave significant improvements in word error rate over the standard MLP configuration in a single unadapted decoding stage. However, the gains shrank after sophisticated adaptation steps, although the combination method was efficient in terms of training cost.

#8A Study of Bootstrapping with Multiple Acoustic Features for Improved Automatic Speech Recognition

Xiaodong Cui (IBM T. J. Watson Research Center)
Jian Xue (IBM T. J. Watson Research Center)
Bing Xiang (IBM T. J. Watson Research Center)
Bowen Zhou (IBM T. J. Watson Research Center)

This paper investigates a scheme of bootstrapping with multiple acoustic features (MFCC, PLP and LPCC) to improve the overall performance of automatic speech recognition. In this scheme, a Gaussian mixture distribution is estimated for each type of feature resampled in each HMM state by single-pass re-training on a shared decision tree. The acoustic models thus obtained from the multiple features are combined by likelihood averaging during decoding. Experiments on large vocabulary spontaneous speech recognition show that the combined system outperforms the best of the acoustic models from individual features. It also achieves comparable performance to Recognizer Output Voting Error Reduction (ROVER) with computational advantages.
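As a rough illustration of likelihood averaging at decoding time, the sketch below combines the per-stream log-likelihoods of one HMM state for one frame. It assumes the average is an arithmetic mean taken in the probability domain with equal weights; the abstract does not specify this, and the function name is hypothetical.

```python
import math

def average_likelihood(log_likes):
    """Return log of the mean likelihood, given per-stream log-likelihoods.

    Computed with the log-sum-exp trick so that very negative
    log-likelihoods do not underflow to zero.
    """
    top = max(log_likes)
    mean_scaled = sum(math.exp(l - top) for l in log_likes) / len(log_likes)
    return top + math.log(mean_scaled)

# Hypothetical state scores from MFCC-, PLP- and LPCC-based models.
stream_scores = [math.log(0.2), math.log(0.4), math.log(0.6)]
combined = average_likelihood(stream_scores)
```

Since all streams share one decision tree, the state inventories line up and the combination can be applied state by state inside a single decoding pass, which is where the computational advantage over ROVER-style hypothesis voting would come from.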

#9Analysis of Low-Resource Acoustic Model Self-Training

Scott Novotney (BBN Technologies)
Richard Schwartz (BBN Technologies)

Previous work on self-training of acoustic models using unlabeled data reported significant reductions in WER assuming a large phonetic dictionary was available. We now assume only those words from ten hours of speech are initially available. Subsequently, we are given a large vocabulary and quantify the value of repeating self-training with this larger dictionary. This experiment is used to analyze the effects of self-training on categories of words. We report the following findings: (i) Although the small 5k vocabulary raises WER by 2% absolute, self-training is as effective as with a large 75k vocabulary. (ii) Adding all 75k words to the decoding vocabulary after self-training reduces the WER degradation to only 0.8% absolute. (iii) Self-training most benefits, by a wide margin, those words that occur in the unlabeled audio but were not transcribed.

#10Log-linear Model Combination with Word-dependent Scaling Factors

Björn Hoffmeister (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Liang Ruoying (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Hermann Ney (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)

Log-linear model combination is the standard approach in LVCSR to combine several knowledge sources, usually an acoustic and a language model. Instead of using a single scaling factor per knowledge source, we make the scaling factor word- and pronunciation-dependent. In this work, we combine three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task. The achieved error rate reduction of 2% relative is small but consistent for two test sets. An analysis of the results shows that the major contribution comes from the improved interdependency of language and acoustic model.
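The difference between a single scaling factor per knowledge source and a word-dependent one can be sketched as follows. The dictionary layout, the fallback to the global factor, and all numeric values are assumptions for illustration, not the authors' implementation; pronunciation dependence is omitted for brevity.

```python
def combined_score(word, log_probs, global_scales, word_scales):
    """Log-linear combination: sum over sources of scale * log-probability.

    word_scales maps (source, word) to a scale and overrides the
    per-source global scale when an entry exists.
    """
    score = 0.0
    for source, log_prob in log_probs.items():
        scale = word_scales.get((source, word), global_scales[source])
        score += scale * log_prob
    return score

# Hypothetical scores of one word hypothesis under two knowledge sources.
log_probs = {"am": -10.0, "lm": -2.0}
global_scales = {"am": 1.0, "lm": 12.0}
word_scales = {("lm", "the"): 10.0}  # word-specific LM scale for "the"

score_the = combined_score("the", log_probs, global_scales, word_scales)
score_cat = combined_score("cat", log_probs, global_scales, word_scales)
```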

Mon-Ses2-P4:
Spoken dialogue systems

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair:Dilek Hakkani-Tur

#1Enabling A User To Specify An Item At Any Time During System Enumeration

Kyoko Matsuyama (Kyoto University)
Kazunori Komatani (Kyoto University)
Tetsuya Ogata (Kyoto University)
Hiroshi G. Okuno (Kyoto University)

In conversational dialogue systems, users prefer to speak at any time and to use natural expressions. We have developed an Independent Component Analysis (ICA) based semi-blind source separation method, which allows users to barge-in over system utterances at any time. We created a novel method from timing information derived from barge-in utterances to identify one item that a user indicates during system enumeration. First, we determine the timing distribution of user utterances containing referential expressions and then approximate it using a gamma distribution. Second, we represent both the utterance timing and automatic speech recognition (ASR) results as probabilities of the desired selection from the system's enumeration. We then integrate these two probabilities to identify the item having the maximum likelihood of selection. Experimental results using 400 utterances indicated that our method outperformed two baseline methods (one using ASR results only and one using utterance timing only) in identification accuracy.
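The integration step can be sketched in a few lines. The code assumes that the two probabilities are combined by a simple product and that the delay between the start of an enumerated item and the user's barge-in follows a gamma distribution; the shape and scale values, item names, and function names are made up for illustration and are not taken from the paper.

```python
import math

def gamma_pdf(x, shape, scale):
    """Gamma density; zero for non-positive delays."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)) / (
        math.gamma(shape) * scale ** shape
    )

def select_item(item_starts, utterance_time, asr_probs, shape=2.0, scale=0.5):
    """Combine timing and ASR evidence; return (best item, posterior)."""
    scores = {}
    for item, start in item_starts.items():
        timing = gamma_pdf(utterance_time - start, shape, scale)
        scores[item] = timing * asr_probs.get(item, 0.0)
    total = sum(scores.values())
    if total == 0.0:
        return None, scores
    posterior = {item: s / total for item, s in scores.items()}
    return max(posterior, key=posterior.get), posterior

# Hypothetical enumeration: items start at 0 s, 2 s and 4 s; the user
# barges in at 2.8 s and the recognizer cannot separate A from B.
item_starts = {"A": 0.0, "B": 2.0, "C": 4.0}
asr_probs = {"A": 0.4, "B": 0.4, "C": 0.2}
best, posterior = select_item(item_starts, 2.8, asr_probs)
```

In this toy case the ASR evidence alone is tied between A and B, and the timing term breaks the tie in favour of the item that had just been read out.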

#2System Request Detection in Human Conversation Based on Multi-Resolution Gabor Wavelet Features

Tomoyuki Yamagata (Kobe University)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)

For a hands-free speech interface, it is important to detect commands in spontaneous utterances. Usual voice activity detection systems can only distinguish speech frames from non-speech frames, but they cannot discriminate whether the detected speech section is a command for a system or not. In this paper, in order to analyze the difference between system requests and spontaneous utterances, we focus on fluctuations in a long period, such as prosodic articulation, and fluctuations in a short period, such as phoneme articulation. The use of multi-resolution analysis using Gabor wavelet on a Log-scale Mel-frequency Filter-bank clarifies the different characteristics of system commands and spontaneous utterances. Experiments using our robot dialog corpus show that the accuracy of the proposed method is 92.6% in F-measure, while the conventional power and prosody-based method is just 66.7%.

#3Using Graphical Models for Mixed-Initiative Dialog Management Systems with Realtime Policies

Stefan Schwärzler (Technische Universität München, Germany)
Stefan Maier (Technische Universität München, Germany)
Joachim Schenk (Technische Universität München, Germany)
Frank Wallhoff (Technische Universität München, Germany)
Gerhard Rigoll (Technische Universität München, Germany)

In this paper, we present a novel approach for dialog modeling, which extends the idea underlying partially observable Markov decision processes (POMDPs), i.e. it allows for calculating the dialog policy in real-time and thereby increases the system flexibility. The use of statistical dialog models is particularly advantageous to react adequately to common errors of speech recognition systems. Comparing our results to the reference system (POMDP), we achieve a relative reduction of 31.6% of the average dialog length. Furthermore, the proposed system shows a relative enhancement of 64.4% of the sensitivity rate in the error recognition capabilities using the same specificity rate in both systems. The achieved results are based on the Air Travelling Information System with 21650 user utterances in 1585 natural spoken dialogs.

#4Conversation Robot Participating in and Activating a Group Communication

Shinya Fujie (Waseda University)
Yoichi Matsuyama (Waseda University)
Hikaru Taniyama (Waseda University)
Tetsunori Kobayashi (Waseda University)

As a new type of application of the conversation system, a robot activating other parties' communications has been developed. The robot participates in a quiz game with other participants and tries to activate the game. The functions installed in the robot are as follows: (1) The robot can participate in a group communication using its basic group conversation function. (2) The robot can perform the game according to the rules of the game. (3) The robot can activate communication using its proper actions depending on the game situations and the participants' situations. We conducted a real field experiment: the prototype system performed a quiz game with elderly people in an adult day-care center. The robot successfully entertained the people with its one hour demonstration.

#5Recent Advances in WFST-based Dialog System

Chiori Hori (National Institute of Information and Communications Technology (NICT))
Kiyonori Ohtake (National Institute of Information and Communications Technology (NICT))
Teruhisa Misu (National Institute of Information and Communications Technology (NICT))
Hideki Kashioka (National Institute of Information and Communications Technology (NICT))
Satoshi Nakamura (National Institute of Information and Communications Technology (NICT))

We proposed a dialog system using a weighted finite-state transducer (WFST) in which user concept tags and system action tags are the input and output of the transducer, respectively. To test the potential of the WFST-based dialog management (DM) platform using statistical DM models, we constructed a dialog system using a human-to-human spoken dialog corpus for hotel reservation, which is annotated with Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus and then composed together and optimized. We evaluated the detection accuracy of the system's next actions. In this paper, we focus on how WFST optimization operations contribute to the performance of the system. In addition, we have constructed a full WFST-based dialog system by composing SLU, scenario and sentence generation (SG) WFSTs. We show an example of a hotel reservation dialog with the fully composed system and discuss future work.

#6A Statistical Dialog Manager for the LUNA Project

David Griol (Universidad Carlos III de Madrid)
Giuseppe Riccardi (University of Trento)
Emilio Sanchis (Universitat Politecnica de Valencia)

In this paper, we present an approach for the development of a statistical dialog manager, in which the system response is selected by means of a classification process which considers all the previous history of the dialog to select the next system response. In particular, we use decision trees for its implementation. The statistical model is automatically learned from training data which are labeled in terms of different SLU features. This methodology has been applied to develop a dialog manager within the framework of the European LUNA project, whose main goal is the creation of a robust natural spoken language understanding system. We present an evaluation of this approach for both human-machine and human-human conversations acquired in this project. We demonstrate that a statistical dialog manager developed with the proposed technique and learned from a corpus of human-machine dialogs can successfully infer the task-related topics present in spontaneous human-human dialogs.

#7A Policy-Switching Learning Approach for Adaptive Spoken Dialogue Agents

Heriberto Cuayáhuitl (Autonomous University of Tlaxcala)
Juventino Montiel-Hernández (Autonomous University of Tlaxcala)

The reinforcement learning paradigm has been adopted for inferring optimized and adaptive spoken dialogue agents. Such agents are typically learnt and tested without combining competing agents that may yield better performance at some points in the conversation. This paper presents an approach that learns dialogue behaviour from competing agents, switching from one policy to another competing one, on a previously proposed hierarchical learning framework. This policy-switching approach was investigated using a simulated flight booking dialogue system based on different types of information request. Experimental results reported that the induced agent using the proposed policy-switching approach yielded 8.2% fewer system actions than three baselines with a fixed type of information request. This result suggests that the proposed approach is useful for learning adaptive and scalable spoken dialogue agents.

#8Strategies for Accelerating the Design of Dialogue Applications using Heuristic Information from the Backend Database

Luis Fernando D'Haro (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ricardo Cordoba (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Javier Macias-Guarasa (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)

Current commercial and academic platforms for developing spoken dialogue applications lack acceleration strategies that use heuristic information from the contents or structure of the backend database to speed up the definition of the dialogue flow. In this paper we describe our attempts to take advantage of these information sources using the following strategies: the quick creation of classes and attributes to define the data model structure, the semi-automatic generation and debugging of database access functions, the automatic proposal of the slots that should preferably be requested using mixed-initiative forms or the slots that are better requested one by one using directed forms, and the generation of automatic state proposals to specify the transition network that defines the dialogue flow. Subjective and objective evaluations confirm the advantages of using the proposed strategies to simplify the design, and the high acceptance of the platform and its acceleration strategies.

#9Feature-based Summary Space for Stochastic Dialogue Modeling with Hierarchical Semantic Frames

Florian Pinault (LIA - UAPV)
Fabrice Lefèvre (LIA - UAPV)
Renato De Mori (LIA - UAPV)

In a spoken dialogue system, the dialogue manager needs to make decisions in a highly noisy environment. This work addresses this issue by proposing a framework to interface efficient probabilistic modeling both for the spoken language understanding module and for the dialogue management module. Hierarchical semantic frames are inferred and composed to build a thorough representation of the semantics of the user's utterance. This representation is then mapped into a feature-based summary space in which the set of dialogue states used by the dialogue manager, based on the POMDP paradigm, is defined. This allows the dialogue course to be planned taking into account the uncertainty on the dialogue state, and tractability is ensured by the use of an intermediate summary space. A preliminary implementation of such a system is presented on the MEDIA domain. The task is touristic information and hotel reservation, and the availability of WoZ data makes it possible to consider a model-based approach to the POMDP dialogue manager.

#10Language Modeling and Dialog Management for Address Recognition

Rajesh Balchandran (IBM - T J Watson Research Center)
Rachevsky Leonid (IBM - T J Watson Research Center)
Larry Sansone (IBM - T J Watson Research Center)

This paper describes a language modeling and dialog management system for efficient and robust recognition of several arbitrarily ordered and inter-related components from very large datasets, such as complete addresses specified in a single sentence with address components in their natural sequence. A new two-pass speech recognition technique based on using multiple language models with embedded grammars is presented. Tests with this technique on a complete address recognition task yielded good results, and memory and CPU requirements are sufficiently low to make this technique viable for embedded environments. Additionally, a goal-oriented algorithm for dialog-based error recovery and disambiguation, which does not require manual identification of all possible dialog situations, is also presented. The combined system yields very high task completion accuracy for only a few additional turns of interaction.

#11A framework for rapid development of conversational natural language call routing systems for call centers

Ea-Ee Jan (IBM)
Hong-Kwang Kuo (IBM)
Osamuyimen Stewart (IBM)
David Lubensky (IBM)

A framework for rapid development of conversational natural language call routing systems is proposed. The framework cuts costs by using only scantily prepared business requirements to automatically create an initial prototype. Aside from clear targets (terminal routing classes), vague targets which are variations of users’ incomplete (semantically overlapping) sentences are enumerated. The vague targets can be derived from the confusion set of the semantic tokens of the clear targets. Also automatically generated for a vague target is a disambiguation dialogue module, which consists of a prompt and grammar to guide the user from a vague target to one of its associated clear targets. In the final analysis, our framework is able to reduce the human labor associated with developing an initial natural language call routing system from a few weeks to just a few hours. The experimental results from a deployed pilot system support the feasibility of our proposed approach.

#12The MonAMI Reminder: a spoken dialogue system for face-to-face interaction

Jonas Beskow (KTH Speech Music & Hearing)
Jens Edlund (KTH Speech Music & Hearing)
Björn Granström (KTH Speech Music & Hearing)
Joakim Gustafson (KTH Speech Music & Hearing)
Gabriel Skantze (KTH Speech Music & Hearing)
Helena Tobiasson (KTH Human-Computer Interaction Group)

We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.

#13Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems

Julia Seebode (Research training group prometei, Berlin Institute of Technology, Germany)
Stefan Schaffer (Research training group prometei, Berlin Institute of Technology, Germany)
Ina Wechsung (Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
Florian Metze (School of Computer Science, Carnegie Mellon University, Pittsburgh, USA)

Finding suitable evaluation methods is an indispensable task during the development of new user interfaces, as no standardized approach has so far been established, especially for multimodal interfaces. In the current study, we used several data sources (direct and indirect measurements) to evaluate a multimodal version of an information system, tested on trained and untrained users. We investigated the extent to which the different types of data showed concordance concerning the perceived quality of the system, in order to derive clues as to the suitability of the respective evaluation methods. The aim was to examine whether widely used methods not originally developed for multimodal interfaces are appropriate under these conditions, and to derive new evaluation paradigms.

#14Talking Heads for Interacting with Spoken Dialog Smart-Home Systems

Christine Kühnel (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Benjamin Weiss (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)

In this paper the relation between the quality of a talking head as an output component of a spoken dialog system and the quality of the system itself is investigated. Results show that the quality of the talking head indeed has an important impact on system quality. The quality of the talking head itself is found to be influenced by visual and speech quality and the synchronization of voice and lip movement.

#15Speech Generation from Hand Gestures Based on Space Mapping

Aki Kunikoshi (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

Individuals with speaking disabilities often use a TTS synthesizer for speech communication. Since users always have to type sound symbols and the synthesizer reads them out in a monotonous style, the use of current synthesizers usually renders real-time operation and lively communication difficult. In this paper, we develop a special glove that, when worn, generates speech sounds from hand gesture transitions. For development, GMM-based voice conversion techniques are applied to estimate a mapping function between a space of hand gestures and another space of speech sounds. In this paper, as an initial trial, a mapping between hand gestures and Japanese vowel sounds is estimated so that topological features of the selected gestures in a feature space and those of the five Japanese vowels in a cepstrum space are equalized. Experiments show that the special glove can generate good Japanese vowel transitions with voluntary control of duration and articulation.

Mon-Ses3-O1:
Automatic Speech Recognition: Language Models I

Time:Monday 16:00 Place:Main Hall Type:Oral
Chair:Steve Renals

16:00Back-Off Language Model Compression

Boulos Harb (Google, Inc.)
Ciprian Chelba (Google, Inc.)
Jeffrey Dean (Google, Inc.)
Sanjay Ghemawat (Google, Inc.)

With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at 18X slower than uncompressed.
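The 1-byte quantization of log-probabilities and back-off weights mentioned above can be sketched with a uniform 256-level quantizer. This is only one possible scheme, chosen here for simplicity; the abstract does not say which quantizer was used, and the function names and sample values are hypothetical.

```python
def quantize(values, levels=256):
    """Map floats to one-byte codes on a uniform grid between min and max."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / (levels - 1) or 1.0  # avoid zero step if all equal
    codes = bytes(round((v - lo) / step) for v in values)
    return codes, lo, step

def dequantize(codes, lo, step):
    """Recover approximate floats from the one-byte codes."""
    return [lo + c * step for c in codes]

# Hypothetical n-gram log-probabilities.
log_probs = [-0.31, -1.75, -2.04, -4.9, -7.2, -9.88]
codes, lo, step = quantize(log_probs)
restored = dequantize(codes, lo, step)
max_error = max(abs(a - b) for a, b in zip(log_probs, restored))
```

Storing the codes as a flat byte array is what makes the model representable as arrays of integers, which in turn is what allows block compression to be applied.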

16:20Improving Broadcast News Transcription with a Precision Grammar and Discriminative Reranking

Tobias Kaufmann (ETH Zurich)
Thomas Ewender (ETH Zurich)
Beat Pfister (ETH Zurich)

We propose a new approach of integrating a precision grammar into speech recognition. The approach is based on a novel robust parsing technique and discriminative reranking. By reranking 100-best output of the LIMSI German broadcast news transcription system we achieved a significant reduction of the word error rate by 9.6% relative. To our knowledge, this is the first significant improvement for a real-world broad-domain speech recognition task due to a precision grammar.

16:40Use of Contexts in Language Model Interpolation and Adaptation

Xunying Liu (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)
Phil Woodland (Cambridge University Engineering Department)

Language models (LMs) are often constructed by building component models on multiple text sources to be interpolated using global, context free weights. By re-adjusting these weights, LMs may be adapted to a target domain of a particular genre, epoch or other higher level attributes. Other factors that determine the "usefulness" of sources on a context dependent basis, such as modeling resolution, generalization, topics and styles, are poorly modeled. To overcome this problem, this paper investigates a context dependent form of LM interpolation and adaptation. In previous research, it was used primarily for LM adaptation. In this paper, a range of schemes to combine context dependent weights obtained from training and test data to improve LM adaptation are proposed. Consistent perplexity and error rate gains of 6% relative were obtained on a state-of-the-art broadcast recognition task.
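The contrast between global and context dependent interpolation weights can be shown with a toy example. The component models, the weight tables, the probability floor, and the fallback to global weights for unseen histories are all illustrative assumptions rather than the scheme evaluated in the paper.

```python
def make_lm(table, floor=1e-6):
    """Toy component LM: table lookup with a small floor for unseen events."""
    def prob(word, history):
        return table.get((history, word), floor)
    return prob

def interpolate(word, history, components, global_weights, context_weights):
    """P(word | history) as a weighted sum of component LM probabilities.

    Weights specific to the history are used when available; otherwise
    the global, context free weights apply.
    """
    weights = context_weights.get(history, global_weights)
    return sum(lam * lm(word, history) for lam, lm in zip(weights, components))

# Two hypothetical sources with different preferences after "the".
news = make_lm({(("the",), "market"): 0.3, (("the",), "game"): 0.1})
sports = make_lm({(("the",), "market"): 0.05, (("the",), "game"): 0.4})

components = [news, sports]
global_weights = [0.5, 0.5]
context_weights = {("the",): [0.8, 0.2]}  # this history trusts the news source

p_context = interpolate("market", ("the",), components, global_weights, context_weights)
p_global = interpolate("market", ("the",), components, global_weights, {})
```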

17:00Exploiting Chinese Character Models to Improve Speech Recognition Performance

J. L. Hieronymus (NASA Ames Research Center)
X. Liu (Cambridge University Engineering Department)
M. J. F. Gales (Cambridge University Engineering Department)
P.C. Woodland (Cambridge University Engineering Department)

The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules which govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first level approximation. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first level recognition unit with multiple pronunciations per character. The CU-HTK Mandarin word based system was used to recognize words which were then converted to character sequences. The character-only error rates of one-best recognition were slightly worse than word based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1-0.2% absolute over the word based standard system.

17:20Constraint selection for topic-based MDI adaptation of language models

Gwénolé Lecorvé (IRISA/INSA, France)
Guillaume Gravier (IRISA/CNRS, France)
Pascale Sébillot (IRISA/INSA, France)

This paper presents an unsupervised topic-based language model adaptation method which specializes the standard minimum discrimination information (MDI) approach by identifying and combining topic-specific features. By acquiring a topic terminology from a thematically coherent corpus, language model adaptation is restrained to the sole probability re-estimation of n-grams ending with some topic-specific words, keeping other probabilities untouched. Experiments are carried out on a large set of spoken documents about various topics. Results show significant perplexity and recognition improvements which outperform results of classical adaptation techniques.

17:40Nonstationary Latent Dirichlet Allocation for Speech Recognition

Chuang-Hua Chueh (National Cheng Kung University)
Jen-Tzung Chien (National Cheng Kung University)

Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents. Words in a document are generated by the same topic distribution. However, in real-world documents, the usage of words in different paragraphs is varied and accompanied with different writing styles. This study extends the LDA and copes with the variations of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which is used to detect the stylistic segments in a document. Each segment corresponds to a particular style in composition of a document. This NLDA can exploit the topic information between documents as well as the word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure. A language model adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvement of NLDA over LDA in terms of perplexity and word error rate.

Mon-Ses3-O2:
Phoneme-level Perception

Time:Monday 16:00 Place:East Wing 1 Type:Oral
Chair:Rolf Carlson

16:00Categorical perception of speech without stimulus repetition

Jack Rogers (MRC Cognition and Brain Sciences Unit, Cambridge, UK)
Matthew Davis (MRC Cognition and Brain Sciences Unit, Cambridge, UK)

We explored the perception of phonetic continua generated with an automated auditory morphing technique in three perceptual experiments. The use of large sets of stimuli allowed an assessment of the impact of single vs. paired presentation without the massed stimulus repetition typical of categorical perception experiments. A third experiment shows that such massed repetition alters the degree of categorical and sub-categorical discrimination possible in speech perception. Implications for accounts of speech perception are discussed.

16:20Non-automaticity of use of orthographic knowledge in phoneme evaluation

Anne Cutler (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia)

Two phoneme goodness rating experiments addressed the role of orthographic knowledge in the evaluation of speech sounds. Ratings for the best tokens of /s/ were higher in words spelled with S (e.g., bless) than in words where /s/ was spelled with C (e.g., voice). This difference did not appear for analogous nonwords for which every lexical neighbour had either S or C spelling (pless, floice). Models of phonemic processing incorporating obligatory influence of lexical information in phonemic processing cannot explain this dissociation; the data are consistent with models in which phonemic decisions are not subject to necessary top-down lexical influence.

16:40Learning and generalization of novel contrastive cues

Meghan Sumner (Stanford University, Department of Linguistics)

This paper examines the learning of a novel phonetic contrast. Specifically, we examine how a contrast is learned – do speakers learn a specific property about a particular word, or do they internalize a pattern that can be applied to words of a particular type in subsequent processing? In two experiments, participants listened to foreign-accented English and were taught to make stop release contrastive. Following training, participants take either a minimal pair decision task or a cross-modal form priming task, both of which include trained words, words that were untrained but include a trained rime, and novel, untrained words. The results of both experiments suggest that listeners use both strategies in learning – they generalize to words with similar rimes, but are unable to extend this knowledge to novel words.

17:00Vowel Category Perception Affected by Microdurational Variations

Einar Meister (Institute of Cybernetics, Tallinn University of Technology, Estonia)
Stefan Werner (Department of General Linguistics and Language Technology, University of Joensuu, Finland)

Vowel quality perception in quantity languages is considered to be unrelated to vowel duration since duration is used to realize quantity oppositions. To test the role of microdurational variations in vowel category perception in Estonian, listening experiments with synthetic stimuli were carried out, involving five vowel pairs along the close-open axis. The results show that in the case of high-mid vowel pairs vowel openness correlates positively with stimulus duration; in mid-low vowel pairs no such correlation was found. The discrepancy in the results is explained by the hypothesis that in the case of shorter perceptual distances (high-mid area of vowel space) intrinsic duration plays the role of a secondary feature to enhance perceptual contrast between vowels, whereas in the case of mid-low oppositions perceptual distance is large enough to guarantee the necessary perceptual contrast by spectral features alone and vowel intrinsic duration as an additional cue is not needed.

17:20Perceptual grouping of alternating word pairs: Effect of pitch difference and presentation rate

Nandini Iyer (Air Force Research Laboratory)
Douglas Brungart (Air Force Research Laboratory)
Brian Simpson (Air Force Research Laboratory)

When listeners hear sequences of tones that slowly alternate between a low frequency and a slightly higher frequency, they report hearing a single stream of alternating tones. However, when the alternation rate and/or the frequency difference increases, they report hearing two distinct streams: a slowly pulsing high and low frequency stream. This experiment used repeating sequences of spondees to investigate whether a similar streaming phenomenon might occur for speech stimuli. The F0 difference between every other word was varied from 0 to 18 semitones. Each word was either 100 or 125 ms in duration. The inter-onset intervals (IOIs) of the individual words were varied from 100 to 300 ms. As expected, F0 difference was a strong cue for sequential segregation. Moreover, the number of 'two' stream judgments was greater at smaller IOIs, suggesting that factors that influence the obligatory streaming of tonal signals are also important in the segregation of speech signals.

17:40Comparing methods to find a best exemplar in a multidimensional space

Titia Benders (Institute of Phonetic Sciences, University of Amsterdam)
Paul Boersma (Institute of Phonetic Sciences, University of Amsterdam)

We present a simple algorithm for running a listening experiment aimed at finding the best exemplar in a multidimensional space. For simulated humanlike listeners, who have perception thresholds and some decision noise on their responses, the algorithm on average ends up twelve times closer than Iverson and Evans’ goodness interpolation algorithm.

Mon-Ses3-O3:
Statistical Parametric Synthesis I

Time:Monday 16:00 Place:East Wing 2 Type:Oral
Chair:Keiichi Tokuda

16:00Autoregressive HMMs for speech synthesis

Matt Shannon (Cambridge University Engineering Department, U.K.)
William Byrne (Cambridge University Engineering Department, U.K.)

We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.

16:20Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Cheng-Cheng Wang (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Zhen-Hua Ling (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Li-Rong Dai (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)

This paper proposes an asynchronous model structure for fundamental frequency (F0) and spectrum modeling in HMM-based parametric speech synthesis to improve the performance of F0 prediction. F0 and spectrum features are considered to be synchronous in the conventional system. Considering that the production of these two features is decided by the movement of different speech organs, an explicitly asynchronous model structure is introduced. At the training stage, F0 models are trained asynchronously with spectrum models. At the synthesis stage, the two features are generated separately. The objective and subjective evaluation results show the proposed method can effectively improve the accuracy of F0 prediction.

16:40A Minimum V/U Error Approach to F0 Generation in HMM-based TTS

Yao Qian (Microsoft Research Asia, Beijing, China)
Frank Soong (Microsoft Research Asia, Beijing, China)
Miaomiao Wang (Microsoft Research Asia, Beijing, China)
Zhizheng Wu (Microsoft Research Asia, Beijing, China)

HMM-based TTS can produce a highly intelligible voice of decent quality. However, the HMM model degrades when the feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and the corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. Prior knowledge of v/u is imposed in each Mandarin phone and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment in generation. Objectively, the new approach is shown to improve v/u prediction performance, specifically on voiced to unvoiced swapping errors, which are reduced from 3.7% (baseline) to 2.0% (new approach). The improvement is also subjectively confirmed by an AB preference test score, 72% (new approach) versus 22% (baseline).
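The switching-point search described above can be sketched as follows, assuming that a per-frame voiced posterior is available for a segment and that exactly one voiced/unvoiced switch is allowed within it. The function name, the scoring by accumulated log posteriors and the example values are assumptions made for illustration; the abstract does not give the exact formulation.

```python
import math
from typing import Sequence


def best_switch_point(p_voiced: Sequence[float], voiced_first: bool = True) -> int:
    """Return k so that frames [0, k) take the first label and [k, n) the second.

    The score of a split is the accumulated log posterior of the assigned
    labels; voiced_first=True models a VU segment, False a UV segment.
    """
    eps = 1e-12  # guards against log(0)
    log_v = [math.log(max(p, eps)) for p in p_voiced]
    log_u = [math.log(max(1.0 - p, eps)) for p in p_voiced]
    first, second = (log_v, log_u) if voiced_first else (log_u, log_v)

    # Start with k = 0 (all frames take the second label), then move the
    # boundary one frame at a time and update the score incrementally.
    score = sum(second)
    best_k, best_score = 0, score
    for k in range(1, len(p_voiced) + 1):
        score += first[k - 1] - second[k - 1]
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Invented posteriors for a VU segment with one unreliable frame (index 2).
vu_posteriors = [0.9, 0.8, 0.4, 0.7, 0.2, 0.1]
k_vu = best_switch_point(vu_posteriors, voiced_first=True)

# Invented posteriors for a UV segment.
uv_posteriors = [0.1, 0.2, 0.9, 0.8]
k_uv = best_switch_point(uv_posteriors, voiced_first=False)
```

Because only one switch is permitted per segment, the isolated low-posterior frame in the first example stays voiced instead of producing a spurious unvoiced frame inside the voiced region.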

17:00Voiced/Unvoiced Decision Algorithm for HMM-based Speech Synthesis

Shiyin Kang (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Zhiwei Shuang (IBM China Research Lab, Beijing, China)
Quansheng Duan (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Yong Qin (IBM China Research Lab, Beijing, China)
Lianhong Cai (Department of Computer Science and Technology, Tsinghua University, Beijing, China)

This paper introduces a novel method to improve the U/V decision method in HMM-based speech synthesis. In the conventional method, the U/V decision of each state is independently made, and a state in the middle of a vowel may be decided as unvoiced. In this paper, we propose to utilize the constraints of natural speech to improve the U/V decision inside a unit, such as syllable or phone. We use a GMM-based U/V change time model to select the best U/V change time in one unit, and refine the U/V decision of all states in that unit based on the selected change time. The result of a perceptual evaluation demonstrates that the proposed method can significantly improve the naturalness of the synthetic speech.

17:20Local minimum generation error criterion for hybrid HMM speech synthesis

Xavi Gonzalvo (Phonetic Arts Ltd.)
Alexander Gutkin (Yahoo! Europe)
Joan Claudi Socoro (Universitat Ramon Llull)
Ignasi Iriondo (Universitat Ramon Llull)
Paul Taylor (Phonetic Arts Ltd.)

This paper presents an HMM-driven hybrid speech synthesis approach in which unit selection concatenative synthesis is used to improve the quality of the statistical system using a Local Minimum Generation Error (LMGE) during the synthesis stage. The idea behind this approach is to combine the robustness due to HMMs with the naturalness of concatenated units. Unlike the conventional hybrid approaches to speech synthesis that use concatenative synthesis as a backbone, the proposed system employs stable regions of natural units to improve the statistically generated parameters. We show that this approach improves the generation of vocal tract parameters, smoothes the bad joints and increases the overall quality.

17:40Thousands of Voices for HMM-based Speech Synthesis

Junichi Yamagishi (University of Edinburgh)
Bela Usabaev (Universität Tübingen)
Simon King (University of Edinburgh)
Oliver Watts (University of Edinburgh)
John Dines (Idiap Research Institute)
Jilei Tian (Nokia)
Rile Hu (Nokia)
Keiichiro Oura (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
Reima Karhila (Helsinki University of Technology)
Mikko Kurimo (Helsinki University of Technology)

Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.

Mon-Ses3-O4:
Systems for Spoken Language Translation

Time:Monday 16:00 Place:East Wing 3 Type:Oral
Chair: Hermann Ney

16:00Efficient Combination of Confidence Measures for Machine Translation

Sylvain Raybaud (LORIA)
David Langlois (LORIA)
Kamel Smaili (LORIA)

We present in this paper a twofold contribution to Machine Translation. First, we present a method to automatically build training and testing corpora for confidence measures containing realistic errors. Errors introduced into reference translations simulate classical machine translation errors (word deletion and word substitution), and are supervised by WordNet. Second, we use SVM to combine original and classical confidence measures at both word and sentence level. We show that the obtained combination outperforms our best single word-level confidence measure by 14% (absolute), and that sentence-level combination of confidence measures produces meaningful scores.

16:20Incremental Dialog Clustering for Speech-to-Speech Translation

David Stallard (BBN Technologies)
Stavros Tsakalidis (BBN Technologies)
Shirin Saleem (BBN Technologies)

Application domains for language processing systems, especially speech-to-speech translation and dialog systems, often contain sub-domains and/or task-types for which different outputs may be appropriate given the same input. We present a document-clustering approach to sub-domain classification, which uses a recently-developed algorithm based on von Mises Fisher distributions. We give preliminary perplexity reduction and MT performance results for a speech-to-speech translation system using this model.

16:40Iterative Sentence-Pair Extraction from Quasi-Parallel Corpora for Machine Translation

Ruhi Sarikaya (IBM T.J. Watson Research Center)
Sameer Maskey (IBM T.J. Watson Research Center)
Rong Zhang (IBM T.J. Watson Research Center)
Ea-Ee Jan (IBM T.J. Watson Research Center)
Dagen Wang (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Salim Roukos (IBM T.J. Watson Research Center)

This paper addresses parallel data extraction from the quasi-parallel corpora generated in a crowd-sourcing project where ordinary people watch TV shows and movies and transcribe/translate what they hear, creating document pools in different languages. Since they do not have guidelines for naming and performing translations, it is often not clear which documents are the translations of the same show/movie and which sentences are the translations of each other in a given document pair. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence pair alignment of the paired documents, and iii) context extrapolation to boost the sentence pair coverage. Human evaluation of the extracted data shows that 95% of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data .....

17:00Tree Kernel-Based Phrase Reordering with Structured Syntactic Knowledge

Min Zhang (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

Structured syntactic knowledge is important for phrase reordering. In this paper, we propose using convolution tree kernel over parse tree to model the structured syntactic knowledge for phrase reordering in the context of BTG-based statistical machine translation. Our study reveals that the structured syntactic features are very effective for phrase reordering and those features can be well captured by the tree kernel. We further combine the structured features and other commonly-used linear features into a composite kernel. Experimental results on the NIST MT-2005 Chinese-English translation tasks show that our proposed method statistically significantly outperforms the baseline methods.

17:00RTTS: Towards Enterprise-level Real-Time Speech Transcription and Translation Services

Juan M. Huerta (IBM T J Watson Research Center)
Cheng Wu (IBM T J Watson Research Center)
Andrej Sakrajda (IBM T J Watson Research Center)
Sasha Caskey (IBM T J Watson Research Center)
Ea-Ee Jan (IBM T J Watson Research Center)
Alexander Faisman (IBM T J Watson Research Center)
Shai Ben-David (IBM)
Wen Liu (IBM)
Uyi Stewart (IBM)
Michael Frissora (IBM)
David Lubensky (IBM)
Antonio Lee (IBM)

In this paper we describe the RTTS system for enterprise-level real time speech recognition and translation. RTTS follows a Web Service-based approach which allows the encapsulation of ASR and MT Technology components thus hiding the configuration and tuning complexities and details from the client applications while exposing a uniform interface. In this way, RTTS is capable of easily supporting a wide variety of client applications. The clients we have implemented include a VoIP-based real time speech-to-speech translation system, a chat and Instant Messaging translation System, a Transcription Server, among others.

17:20Using Syntax in Large-Scale Audio Document Translation

Jing Zheng (SRI International)
Necip Fazil Ayan (SRI International)
Wen Wang (SRI International)
David Burkett (UC Berkeley)

Recently, the use of syntax has very effectively improved machine translation (MT) quality in many text MT tasks. However, using syntax in speech MT poses additional challenges because of disfluencies and other spoken language phenomena, and of errors introduced by automatic speech recognition (ASR). In this paper, we investigate the effect of using syntax in a large-scale audio document translation task targeting broadcast news and broadcast conversations. We do so by comparing the performance of three synchronous context-free grammar based translation approaches: 1) hierarchical phrase-based translation, 2) syntax-augmented MT, and 3) string-to-dependency MT. The results show a positive effect of explicitly using syntax when translating broadcast news, but no benefit when translating broadcast conversations. The results indicate that improving the robustness of syntactic systems against conversational language style is important to their success and requires future effort.

17:40Context-driven bilingual movie subtitle alignment

Andreas Tsiartas (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Prasanta Ghosh (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Panayiotis Georgiou (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Shrikanth Narayanan (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)

Movie subtitle alignment is a potentially useful approach for automatically deriving parallel bilingual/multilingual spoken language data for automatic speech translation. In this paper, we consider the movie subtitle alignment task. We propose a distance metric between utterances of different languages based on lexical features derived from bilingual dictionaries. We use the dynamic time warping algorithm to obtain the best alignment. The best F-score of ~0.713 is obtained using the proposed approach.
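A minimal sketch of the alignment step described above: a dictionary-based distance between utterances, followed by standard dynamic time warping over the resulting distance matrix. The distance definition (one minus the fraction of source words with a dictionary translation in the target utterance), the toy dictionary and the example subtitles are assumptions for illustration and are not the metric proposed in the paper.

```python
from typing import Dict, List, Sequence, Set, Tuple


def lexical_distance(src: Sequence[str], tgt: Sequence[str],
                     dictionary: Dict[str, Set[str]]) -> float:
    """1 - fraction of source words with a dictionary translation in tgt."""
    if not src:
        return 1.0
    tgt_set = set(tgt)
    hits = sum(1 for w in src if dictionary.get(w, set()) & tgt_set)
    return 1.0 - hits / len(src)


def dtw_align(dist: Sequence[Sequence[float]]) -> Tuple[float, List[Tuple[int, int]]]:
    """Standard DTW: returns the total cost and the aligned (i, j) index pairs."""
    n, m = len(dist), len(dist[0])
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost[i][j] = dist[i - 1][j - 1] + min(
                cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end, always moving to the cheapest predecessor.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        _, i, j = min((cost[i - 1][j - 1], i - 1, j - 1),
                      (cost[i - 1][j], i - 1, j),
                      (cost[i][j - 1], i, j - 1))
    path.reverse()
    return cost[n][m], path


# Toy English/French subtitles and a toy bilingual dictionary (all invented).
english = [["hello"], ["how", "are", "you"], ["goodbye"]]
french = [["bonjour"], ["comment", "allez", "vous"], ["au", "revoir"]]
toy_dict = {"hello": {"bonjour"}, "how": {"comment"},
            "you": {"vous"}, "goodbye": {"revoir"}}

dist = [[lexical_distance(e, f, toy_dict) for f in french] for e in english]
total_cost, alignment = dtw_align(dist)
```

For the toy data the alignment is the diagonal, with a small residual cost because one English word has no dictionary entry.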

Mon-Ses3-S1:
Special Session: Silent Speech Interfaces

Time:Monday 16:00 Place:East Wing 4 Type:Special
Chair:Bruce Denby & Tanja Schultz

#0Visuo-Phonetic Decoding using Multi-Stream and Context-Dependent Models for an Ultrasound-based Silent Speech Interface

Thomas Hueber (ESPCI/Telecom ParisTech)
Elie-Laurent Benaroya (ESPCI ParisTech)
Gérard Chollet (LTCI/CNRS Telecom ParisTech)
Bruce Denby (UPMC Paris VI - ESPCI ParisTech)
Gérard Dreyfus (Laboratoire d'Electronique - ESPCI ParisTech)
Maureen Stone (University of Maryland Dental School)

Recent improvements are presented for phonetic decoding of continuous-speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMM). Results are compared to a baseline system using context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without the help of linguistic constraints introduced via a 2.5k-word decoding dictionary.

#0Disordered Speech Recognition Using Acoustic and sEMG Signals

Yunbin Deng (BAE Systems, Inc, Advanced Information Technologies)
Rupal Patel (Communication Analysis & Design Lab, Northeastern University)
James T. Heaton (Center for Laryngeal Surgery & Voice Rehabilitation, Mass. General Hospital)
Glen Colby (BAE Systems, Inc, Advanced Information Technologies)
L. Donald Gilmore (Delsys, Inc.)
Joao Cabrera (BAE Systems, Inc, Advanced Information Technologies)
Serge H. Roy (Delsys, Inc.)
Carlo J. De Luca (Delsys, Inc.)
Geoffrey S. Meltzner (BAE Systems, Inc, Advanced Information Technologies)

Parallel isolated word corpora were collected from healthy speakers and individuals with speech impairment due to stroke or cerebral palsy. Surface electromyographic (sEMG) signals were collected for both vocalized and mouthed speech production modes. Pioneering work on disordered speech recognition using the acoustic signal, the sEMG signals, and their fusion is reported. Results indicate that speaker-dependent isolated-word recognition from the sEMG signals of articulator muscle groups during vocalized disordered-speech production was highly effective. However, word recognition accuracy for mouthed speech was much lower, likely related to the fact that some disordered speakers had considerable difficulty producing consistent mouthed speech. Further development of the sEMG-based speech recognition systems is needed to increase usability and robustness.

#0Multimodal HMM-based NAM-to-speech conversion

Viet-Anh TRAN (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Gérard BAILLY (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Hélène LOEVENBRUCK (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Tomoki TODA (NAIST (NAra Institute of Science and Technology), Japan)

Although the segmental intelligibility of speech converted from silent speech using the direct signal-to-signal mapping proposed by Toda et al. is quite acceptable, listeners sometimes have difficulty in chunking the speech continuum into meaningful words due to incomplete phonetic cues provided by output signals. This paper studies another approach, which combines HMM-based statistical speech recognition and synthesis techniques, as well as training on aligned corpora, to convert silent speech to audible voice.

#0Technologies for Processing Body-Conducted Speech Detected with a Non-Audible Murmur Microphone

Tomoki Toda (Nara Institute of Science and Technology)
Keigo Nakamura (Nara Institute of Science and Technology)
Takayuki Nagai (Nara Institute of Science and Technology)
Tomomi Kaino (Nara Institute of Science and Technology)
Yoshitaka Nakajima (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)

In this paper, we review our recent research on technologies for processing body-conducted speech detected with a Non-Audible Murmur (NAM) microphone. The NAM microphone enables us to detect various types of body-conducted speech, such as extremely soft whisper and normal speech. Moreover, it is robust against external noise due to its noise-proof structure. To make speech communication more universal by effectively using these properties of the NAM microphone, we have so far developed two main technologies: one is body-conducted speech conversion for human-to-human speech communication; and the other is body-conducted speech recognition for man-machine speech communication. This paper gives an overview of these technologies and presents our new attempts to investigate the effectiveness of body-conducted speech recognition.

#0Impact of Different Speaking Modes on EMG-based Speech Recognition

Michael Wand (Cognitive Systems Lab, University of Karlsruhe, Germany)
Szu-Chen Stan Jou (ATC, ICL, Industrial Technology Research Institute, Taiwan)
Arthur R. Toth (Cognitive Systems Lab, University of Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, University of Karlsruhe, Germany)

We present our recent results on speech recognition by surface electromyography (EMG), which captures the electric potentials that are generated by the human articulatory muscles. This technique can be used to enable Silent Speech Interfaces, since EMG signals are generated even when people only articulate speech without producing any sound. Preliminary experiments have shown that the EMG signals created by audible and silent speech are quite distinct. In this paper we first compare various methods of initializing a silent speech EMG recognizer, showing that the performance of the recognizer substantially varies across different speakers. Based on this, we analyze EMG signals from audible and silent speech, present first results on how discrepancies between these speaking modes affect EMG recognizers, and suggest areas for future work.

#0Artificial speech synthesizer control by brain-computer interface

Jonathan S. Brumberg (Boston University; Neural Signals, Inc.)
Philip R. Kennedy (Neural Signals, Inc.)
Frank H. Guenther (Boston University; Harvard University; MIT)

We developed and tested a brain-computer interface for control of an artificial speech synthesizer by an individual with near complete paralysis. This neural prosthesis for speech restoration is currently capable of predicting vowel formant frequencies based on neural activity recorded from an intracortical microelectrode implanted in the left hemisphere speech motor cortex. Using instantaneous auditory feedback (< 50 ms) of predicted formant frequencies, the study participant has been able to correctly perform a vowel production task at a maximum rate of 80-90% correct.

#0Synthesizing Speech from Electromyography using Voice Transformation Techniques

Arthur R. Toth (University of Karlsruhe)
Michael Wand (University of Karlsruhe)
Tanja Schultz (University of Karlsruhe)

Surface electromyography (EMG) can be used to record the activation potentials of articulatory muscles while a person speaks. It could enable silent speech interfaces, as EMG signals are generated even when people pantomime speech noiselessly. Having effective silent speech interfaces would enable a number of compelling applications, allowing people to communicate in areas where they would not want to be overheard or could not be heard. In order to use EMG signals in speech interfaces, however, there must be a relatively accurate method to map the signals to speech. Most previous attempts to use EMG signals for speech interfaces appear to focus on Automatic Speech Recognition (ASR) based on features derived from EMG signals. We explore the alternative idea of using Voice Transformation (VT) techniques to synthesize speech from EMG signals. We report the results of our preliminary studies, noting the difficulties we encountered and suggesting future work.

16:00Characterizing Silent and Pseudo-Silent Speech using Radar-like Sensors

John Holzrichter (Hertz Foundation)

Radar-like sensors enable the measuring of speech articulator conditions, especially their shape changes and contact events both during silent and normal speech. Such information can be used to associate articulator conditions with digital “codes” for use in communications, machine control, speech masking or canceling, and other applications.

Mon-Ses3-P3:
Automatic Speech Recognition: Adaptation I

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair:Stephen Cox

#0On the Development of Matched and Mismatched Italian Children’s Speech Recognition Systems

Piero Cosi (ISTC-CNR (Istituto di Scienze e Tecnologie della Cognizione - Consiglio Nazionale delle Ricerche))

While at least read speech corpora are available for Italian children’s speech research, there exist many languages in which this is not the case. Learning statistical mappings between the adult and child acoustic space using existing adult/children corpora may provide a future direction for generating children’s models for such data deficient languages. In this work the recent advances in the development of the SONIC Italian children’s speech recognition system will be described. Specifically, the complete training and test set of the FBK (ex ITC-irst) Italian Children’s Speech Corpus (ChildIt) was considered. Using the University of Colorado SONIC LVCSR system, we demonstrate a phonetic recognition error rate of 12.0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, as well as unsupervised Structural MAP Linear Regression (SMAPLR).

#0Speaker Adaptation Based on Two-Step Active Learning

Koichi Shinoda (Tokyo Institute of Technology)
Hiroko Murakami (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

We propose a two-step active learning method for supervised speaker adaptation. In the first step, the initial adaptation data is collected to obtain a phone error distribution. In the second step, those sentences whose phone distributions are close to the error distribution are selected, and their utterances are collected as the additional adaptation data. We evaluated the method using a Japanese speech database and maximum likelihood linear regression (MLLR) as the speaker adaptation algorithm. We confirmed that our method yielded a significant improvement over a method using randomly chosen sentences for adaptation.
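
The second step above can be illustrated with a minimal sketch. The abstract does not say how closeness between distributions is measured, so the KL divergence, the additive smoothing, and all function names here are assumptions made for illustration only.

```python
import math
from collections import Counter

def phone_distribution(phones, inventory, eps=1e-6):
    """Relative phone frequencies with additive smoothing (eps is an assumption)."""
    counts = Counter(phones)
    total = len(phones) + eps * len(inventory)
    return {p: (counts[p] + eps) / total for p in inventory}

def kl_divergence(p, q):
    """KL(p || q) over a shared phone inventory."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

def select_sentences(candidates, error_dist, inventory, n):
    """Return the n candidate sentences closest to the phone error distribution.

    candidates is a list of (sentence_id, phone_sequence) pairs.
    Closeness is measured with KL divergence, which is an assumption.
    """
    scored = []
    for text, phones in candidates:
        dist = phone_distribution(phones, inventory)
        scored.append((kl_divergence(error_dist, dist), text))
    scored.sort()
    return [text for _, text in scored[:n]]
```

Here error_dist would come from the first-step recognition errors, and the selected sentences would then be recorded as additional adaptation data.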

#0Using VTLN matrices for Rapid and Computationally-Efficient Speaker Adaptation with Robustness to First-Pass Transcription Errors

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)
Achintya Kumar Sarkar (Indian Institute of Technology Kanpur)

In this paper we combine the rapid adaptation capability of conventional VTLN with the computational efficiency of transform-based adaptation such as CMLLR. Unlike transform-based adaptation methods, conventional VTLN requires very little adaptation data. However, conventional VTLN is computationally expensive since it requires generation of warped features. We have recently shown that VTLN can be efficiently implemented as a linear transformation with computational complexity similar to CMLLR. In this framework, VTLN provides a significant improvement in performance over transform-based adaptation when little adaptation data is available. We also show that the use of MLLT along with VTLN gives performance that is better than MLLR and comparable to SAT with MLLT even for large adaptation data. Further, we show that in mismatched conditions, VTLN provides significant improvement over transform-based adaptation. We compare the performance of different methods on WSJ, RM and TIDIGITS tasks.

#0Acoustic Class Specific VTLN-Warping using Regression Class Trees

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper we study the use of different frequency warp-factors for different acoustic classes. This is motivated by the fact that not all acoustic classes exhibit similar spectral variation as a result of physiological differences in the vocal tract, and therefore the use of a single frequency-warp for the entire utterance may not be appropriate. We have recently proposed a VTLN method that implements VTLN-warping through a linear transformation of the conventional MFCC features and efficiently estimates the warp-factor using the same sufficient statistics that are used in CMLLR adaptation. In this paper, we show that within this efficient VTLN framework, and using the idea of a regression class tree, it is possible to obtain separate frequency-warping for different acoustic classes. On the WSJ database we report the recognition performance of the proposed method for data-driven and phonetic-knowledge-based regression class trees.

#0Bilinear Transformation Space-based Maximum Likelihood Linear Regression

Hwa Jeon Song (School of Electrical Engineering, Pusan National University)
Yongwon Jeong (School of Electrical Engineering, Pusan National University)
Hyung Soon Kim (School of Electrical Engineering, Pusan National University)

This paper proposes two types of bilinear transformation space-based speaker adaptation frameworks. In the training session, transformation matrices for speakers are decomposed by a singular value decomposition-based algorithm into the style factor for speakers’ characteristics and an orthonormal basis of eigenvectors that controls the dimensionality of the canonical model. In the adaptation session, the style factor of a new speaker is estimated, depending on which of the proposed frameworks is used. At the same time, the dimensionality of the canonical model can be reduced by the orthonormal basis from training. Moreover, both maximum likelihood linear regression (MLLR) and eigenspace-based MLLR are identified as special cases of our proposed methods. Experimental results show that the proposed methods are much more effective and versatile than other methods.

#0Speaking Style Adaptation for Spontaneous Speech Recognition Using Multiple-Regression HMM

Yusuke Ijima (Tokyo Institute of Technology)
Takeshi Matsubara (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper describes a rapid model adaptation technique for spontaneous speech recognition. The proposed technique utilizes a multiple-regression hidden Markov model (MRHMM) and is based on a style estimation technique of speech. In the MRHMM, the mean vector of probability density function (pdf) is given by a function of a low-dimensional vector, called style vector, which corresponds to the intensity of expressivity of speaking style variation. The value of the style vector is estimated for every utterance of the input speech and the model adaptation is conducted by calculating new mean vectors of the pdf using the estimated style vector. The performance evaluation results using “Corpus of spontaneous Japanese (CSJ)” are shown under a condition in which the amount of model training and adaptation data is very small.

#0Improving the robustness by multiple sets of HMMs

Hans-Guenter Hirsch (Niederrhein University of Applied Sciences)
Andreas Kitzig (Niederrhein University of Applied Sciences)

The highest recognition performance is still achieved when training a recognition system with speech data that have been recorded in the acoustic scenario where the system will be applied. We investigated the approach of using several sets of HMMs. These sets have been trained on data that were recorded in different typical noise situations. One HMM set is individually selected at each speech input by comparing the pause segment at the beginning of the utterance with the pause models of all sets. We observed a considerable reduction of the error rates when applying this approach in comparison to two well known techniques for improving the robustness. Furthermore, we developed a technique to additionally adapt certain parameters of the selected HMMs to the specific noise condition. This leads to a further improvement of the recognition rates.
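
The per-utterance selection described above can be sketched as follows. This is a simplified illustration, not the authors' implementation: each pause model is assumed here to be a single diagonal-covariance Gaussian, and the function names are hypothetical.

```python
import math

def diag_gauss_loglik(frame, mean, var):
    """Log-density of one feature vector under a diagonal Gaussian."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (x - m) ** 2 / v)
        for x, m, v in zip(frame, mean, var)
    )

def select_hmm_set(pause_frames, pause_models):
    """Pick the HMM set whose pause model best explains the leading pause.

    pause_frames: feature vectors from the pause at the start of the utterance.
    pause_models: maps a set name to a (mean, var) pair (assumed single Gaussian).
    """
    def avg_loglik(model):
        mean, var = model
        total = sum(diag_gauss_loglik(f, mean, var) for f in pause_frames)
        return total / len(pause_frames)

    return max(pause_models, key=lambda name: avg_loglik(pause_models[name]))
```

Recognition of the rest of the utterance would then use the HMM set returned by select_hmm_set.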

#0On the Use of Pitch Normalization for Improving Children's Speech Recognition

Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)

In this work, we have studied the effect of pitch variations across speech signals in the context of automatic speech recognition. Our initial study on vowel data indicates that, on account of insufficient smoothing of pitch harmonics by the filterbank, particularly for high-pitch signals, the variances of mel frequency cepstral coefficient (MFCC) features increase significantly with increasing pitch of the speech signals. Further, to reduce the variance of MFCC features due to varying pitch among speakers, a maximum likelihood based explicit pitch normalization method has been explored. On a connected digit recognition task, pitch normalization yields a relative improvement of 15% over the baseline for children's speech (higher pitch) on models trained on adults' speech (lower pitch).

#0Speaker normalization for template based speech recognition

Sébastien Demange (Katholieke Universiteit Leuven ESAT/PSI)
Dirk Van Compernolle (Katholieke Universiteit Leuven ESAT/PSI)

Vocal Tract Length Normalization (VTLN) has been shown to be an efficient speaker normalization tool for HMM based systems. In this paper we show that it is equally efficient for a template based recognition system. Template based systems, while promising, have as potential drawback that templates maintain all non phonetic details apart from the essential phonemic properties; i.e. they retain information on speaker and acoustic recording circumstances. This may lead to a very inefficient usage of the database. We show that after VTLN significantly more speakers - also from opposite gender - contribute templates to the matching sequence compared to the non-normalized case. In experiments on the Wall Street Journal database this leads to a relative word error rate reduction of 10%.

#0Combination of Acoustic and Lexical Speaker Adaptation for Disordered Speech Recognition

Oscar Saz (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)
Antonio Miguel (University of Zaragoza)

This paper presents an approach to lexical adaptation in Automatic Speech Recognition (ASR) of the disordered speech from a group of young impaired speakers. The outcome of an Acoustic Phonetic Decoder (APD) is used to learn new lexical variants of the 57-word vocabulary and add them to a lexicon personalized to each user. The possibilities of combining this lexical adaptation with acoustic adaptation achieved through traditional Maximum A Posteriori (MAP) approaches are further explored, and the results show the importance of matching the lexicon in the ASR decoding phase to the lexicon used for the acoustic adaptation.

#3Tree-based Estimation of Speaker Characteristics for Speech Recognition

Mats Blomberg (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)
Daniel Elenius (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)

A hierarchical tree is designed to reduce the computationally heavy demands of joint multi-dimensional estimation of speaker characteristic properties in speech recognition. The leaf model sets are created by transforming a conventionally trained set. Non-leaf sets are formed by merging the models of their child nodes. One- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) reduce the computational load to a fraction compared to that of an exhaustive search. In recognition experiments on children's connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models.

#5A Study on the Influence of Covariance Adaptation on Jacobian Compensation in Vocal Tract Length Normalization

Rama Sanand Doddipatla (Indian Institute of Technology Kanpur)
Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper, we first show that accounting for the Jacobian in VTLN degrades the performance in mismatched train and test speaker conditions. VTLN is implemented using our recently proposed approach of linear transformation (LT) of conventional MFCC, i.e., a feature transformation. In this case, the Jacobian is simply the determinant of the LT. Feature transformation is equivalent to the means and covariances of the model being transformed by the inverse transformation while leaving the data unchanged. Using a set of adaptation experiments, we analyze the reasons for the degradation during Jacobian compensation and conclude that applying the same VTLN transformation on both means and variances does not fully match the data when there is a mismatch in the speaker conditions. We propose to use covariance adaptation on top of VTLN to account for the covariance mismatch between the train and the test speakers and show that accounting for the Jacobian after covariance adaptation improves the performance.
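
The equivalence stated above can be written out for a single Gaussian. The symbols are assumptions introduced for illustration: A is the VTLN linear transformation, x an unwarped feature vector, and (\mu, \Sigma) the mean and covariance of a model component.

```latex
% Feature-space view: transform the data and compensate with the Jacobian |det A|
p(x) = |\det A| \; \mathcal{N}(Ax;\, \mu,\, \Sigma)

% Model-space view: leave the data unchanged and apply the inverse transform to the model
p(x) = \mathcal{N}\!\left(x;\, A^{-1}\mu,\, A^{-1}\Sigma A^{-\top}\right)

% The two agree because Ax - \mu = A\,(x - A^{-1}\mu) and
% \det\!\left(A^{-1}\Sigma A^{-\top}\right) = \det\Sigma \,/\, (\det A)^{2}
```

Dropping the factor |\det A| in the first form therefore corresponds to a model-space transform that is not properly normalized, which is why the Jacobian enters the comparison between warp factors.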

Mon-Ses3-P2:
Prosody, Text Analysis, and Multilingual Models

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair:Andrew Breen

#1Polyglot Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

Within a polyglot text-to-speech synthesis system, the generation of an adequate prosody for mixed-lingual texts, sentences, or even words, requires a polyglot prosody model that is able to seamlessly switch between languages and that applies the same voice for all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about 82% of German and French mixed-lingual test sentences cannot be distinguished from natural polyglot prosody.

#2Weighted Neural Network Ensemble Models for Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement compared to the best duration model and a 24% improvement compared to the best F0 model. The neural network ensemble model also outperforms another recently presented ensemble model based on gradient tree boosting.

#3Cross-language F0 Modeling for Under-resourced Tonal Languages: A Case Study on Thai-Mandarin

Vataya Boonpiam (National Electronics and Computer Technology Center)
Anocha Rugchatjaroen (National Electronics and Computer Technology Center)
Chai Wutiwiwatchai (National Electronics and Computer Technology Center)

This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking in many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of an under-resourced language with corpora from a resource-rich language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, over 7% relative reduction of RMSE and significant improvement of MOS are obtained.

#4Prosodic issues in synthesising Thadou, a Tibeto-Burman tone language

Dafydd Gibbon (Universität Bielefeld, Bielefeld, Germany)
Pramod K. S. Pandey (Jawaharlal Nehru University, New Delhi, India)
D. Mary Kim Haokip (Assam University, Silchar, India)
Jolanta Bachan (Adam Mickiewicz University, Poznań, Poland)

The objective of the present analysis is to present linguistic constraints on the phonetic realisation of lexical tone which are relevant for the choice of speech synthesis development strategy for a specific type of tone language, in this case Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in more well-known tone languages such as Mandarin, and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a ‘microvoice’ for rule-based tone generation are developed.

#5Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS

Chen-Yu Chiang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Sin-Horng Chen (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Yih-Ru Wang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)

Motivated by the success of the unsupervised joint prosody labeling and modeling (UJPLM) method for Mandarin speech in modeling syllable pitch contours in our previous study, this paper proposes the advanced UJPLM (A-UJPLM) method, based on UJPLM, to jointly label prosodic tags and model syllable pitch contour, duration and energy level. Experimental results on the Sinica Treebank corpus showed that most prosodic tags labeled were linguistically meaningful and that the model parameters estimated were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well: most predicted prosodic features matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method.

#6Optimization of T-Tilt F0 Modeling

Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC))
Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC))
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC))
Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC))
Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC))

This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but suffers in text-to-F0 prediction. To optimize the model, the T-Tilt event is restricted to span the whole syllable unit, which significantly reduces the number of parameters. F0 interpolation and smoothing processes often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvement for both F0 analysis and prediction.

#7A Multi-Level Context-Dependent Prosodic Model Applied to Duration Modeling

Nicolas OBIN (IRCAM)
Xavier RODET (IRCAM)
Anne LACHERET-DUJOUR (Modyco labs)

We present in this article a multi-level prosodic model based on the estimation of prosodic parameters on a set of well defined linguistic units. Different linguistic units are used to represent different scales of prosodic variations (local and global forms) and thus to estimate the linguistic factors that can explain the variations of prosodic parameters independently at each level. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora - laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error.

#8Sentiment classification in English from sentence-level annotations of emotions regarding models of affect

Alexandre Trilla (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)
Francesc Alías (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)

This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. This system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for decision making between positive and negative categories. To do so, the Semeval 2007 dataset, composed of emotionally annotated sentences, is used for training purposes after being mapped into a model of affect. The resulting scheme stands as a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.

#9Identification of Contrast and Its Emphatic Realization in HMM Based Speech Synthesis

Leonardo Badino (University of Edinburgh, Edinburgh, U.K.)
Sebastian Andersson (University of Edinburgh, Edinburgh, U.K.)
Junichi Yamagishi (University of Edinburgh, Edinburgh, U.K.)
Robert Clark (University of Edinburgh, Edinburgh, U.K.)

The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM based speech synthesis system in an attempt to properly control prosodic prominence (including emphasis). Results from a large scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.

#10How to Improve TTS Systems for Emotional Expressivity

Antonio Rui Ferreira Rebordao (The University of Tokyo)
Mostafa Al Masum Shaikh (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)

Several experiments have been carried out that revealed weaknesses of current Text-To-Speech (TTS) systems in their emotional expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered, as a pre-processing stage, the use of intelligent text processing to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. This technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing assists the TTS system in generating synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis.

#11State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis

Yi-Jian Wu (Microsoft)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

A phone mapping-based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we propose a state mapping based method for cross-lingual speaker adaptation. In this method, we first establish the state mapping between two voice models in the source and target languages using Kullback-Leibler divergence (KLD). Based on the established mapping information, we introduce two approaches to cross-lingual speaker adaptation: a data mapping approach and a transform mapping approach. In the experiments, the state mapping based method outperformed the phone mapping based method. In addition, the data mapping approach achieved better speaker similarity, and the transform mapping approach achieved better speech quality after adaptation.
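
A minimal sketch of KLD-based state mapping follows. The abstract does not give the details, so several things here are assumptions: each state is represented by a single diagonal-covariance Gaussian, the divergence is symmetrized, each target-language state is mapped to its nearest source-language state, and all names are hypothetical.

```python
import math

def kld_diag_gauss(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * sum(
        math.log(vq / vp) + (vp + (mp - mq) ** 2) / vq - 1.0
        for mp, vp, mq, vq in zip(mean_p, var_p, mean_q, var_q)
    )

def symmetric_kld(p, q):
    """Symmetrized divergence; p and q are (mean, var) pairs."""
    return kld_diag_gauss(*p, *q) + kld_diag_gauss(*q, *p)

def map_states(source_states, target_states):
    """Map each target-language state to the source-language state with minimum KLD.

    Both arguments map a state name to a (mean, var) pair.
    """
    return {
        t_name: min(
            source_states,
            key=lambda s: symmetric_kld(source_states[s], t_gauss),
        )
        for t_name, t_gauss in target_states.items()
    }
```

The resulting dictionary could then drive either of the two approaches named in the abstract, by relabeling adaptation data or by sharing adaptation transforms across mapped states.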

#12Real Voice and TTS Accent Effects on Intelligibility and Comprehension for Indian Speakers of English as a Second Language

Frederick V. Weber (Earth Institute, Columbia University)
Kalika Bali (Microsoft Research, India)

We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors.

#13Improving Consistence of Phonetic Transcription for Text-to-Speech

Pablo Daniel Agüero (FI-UNMDP)
Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain)
Juan Carlos Tulli (FI-UNMDP)

Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches have been proposed in the literature to perform appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose the use of an automatic algorithm that uses transformation-based error-driven learning to match the phonetic transcription with the speaker's dialect and style. Different transcriptions based on word, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription using an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.

Mon-Ses3-P1:
Human Speech Production I

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair:Shrikanth Narayanan

#1Probabilistic effects on French [t] duration

Francisco Torreira (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
Mirjam Ernestus (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)

The present study shows that [t] consonants are affected by probabilistic factors in a syllable-timed language such as French, and in spontaneous as well as in journalistic speech. Study 1 showed a word bigram frequency effect in spontaneous French, but its exact nature depended on the corpus on which the probabilistic measures were based. Study 2 investigated journalistic speech and showed an effect of the joint frequency of the test word and its following word. We discuss the possibility that these probabilistic effects are due to the speaker's planning of upcoming words, and to the speaker's adaptation to the listener's needs.

#2On the production of sandhi phenomena in French: psycholinguistic and acoustic data

Odile Bagou (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)
Violaine Michel (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)
Marina Laganaro (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)

This study addresses two complementary questions about the production of sandhi phenomena in French. First, we investigated whether the encoding of sandhi phenomena involves a processing cost compared to non-resyllabified sequences. The elicited sequences were then used to address our second question, namely how critical V1CV2 sequences are phonetically realized across different boundary conditions. Results on production latencies suggested that the encoding of liaison enchaînée involves an additional processing cost compared to enchaînement and non-resyllabified sequences. Moreover, acoustic analyses indicated durational differences across the three boundary conditions. Implications for both psycholinguistic and phonological models are discussed.

#3Extreme reductions: Contraction of disyllables into monosyllables in Taiwan Mandarin

Chierh Cheng (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)

This study investigates a severe form of segmental reduction known as contraction. In Taiwan Mandarin, a disyllabic word or phrase is often contracted into a monosyllabic unit in conversational speech, just as “do not” is often contracted into “don’t” in English. A systematic experiment was conducted to explore the underlying mechanism of such contraction. Preliminary results show evidence that contraction is not a categorical shift but a gradient undershoot of the articulatory target as a result of time pressure. Moreover, contraction seems to occur only beyond a certain duration threshold. These findings may further our understanding of the relation between duration and segmental reduction.

#4Annotation and Features of Non-native Mandarin Tone Quality

Mitchell Peabody (MIT)
Stephanie Seneff (MIT)

Native speakers of non-tonal languages, such as American English, frequently have difficulty accurately producing the tones of Mandarin Chinese. This paper describes a corpus of Mandarin Chinese spoken by non-native speakers and annotated for tone quality using a simple Good-Bad system. We examine inter-rater correlation of the annotations and highlight the differences in feature distribution between native, good non-native, and bad non-native tone productions. We find that the features of tones judged by a simple majority to be bad are significantly different from the features of tones judged to be good and of tones produced by native speakers.

#5On-line Formant Shifting as a Function of F0

Kateřina Chládková (Amsterdam Center for Language and Communication, University of Amsterdam, The Netherlands)
Paul Boersma (Amsterdam Center for Language and Communication, University of Amsterdam, The Netherlands)
Václav Jonáš Podlipský (Department of English and American Studies, Palacký University Olomouc, Czech Republic)

We investigate whether there is a within-speaker effect of a higher F0 on the values of the first and the second formant. When asked to speak at a high F0, speakers turn out to raise their formants as well. In the F1 dimension this effect is greater for women than for men. We conclude that while a general formant raising effect might be due to the physiology of a high F0 (i.e. raised larynx and shorter vocal tract), a plausible explanation for the gender-dependent size of the effect on F1 values can only be found in the undersampling hypothesis.

#6Production Boundary between Fricative and Affricate in Japanese and Korean Speakers

Kimiko Yamakawa (National Institute of Informatics)
Shigeaki Amano (NTT Communications Science Laboratories)
Shuichi Itahashi (National Institute of Informatics)

A fricative [s] and an affricate [ts] pronounced by both native Japanese and Korean speakers were analyzed to clarify the effect of the mother language on speech production. It was revealed that Japanese speakers have a clear individual production boundary between [s] and [ts], and that this boundary corresponds to the production boundary of all Japanese speakers. In contrast, although Korean speakers tend to have a clear individual production boundary, the boundary does not correspond to that of Japanese speakers. These facts suggest that Korean speakers tend to have a stable [s]-[ts] production boundary but that it differs from that of Japanese speakers.

#7Aerodynamics of Fricative Production in European Portuguese

Cátia M. R. Pinho (IEETA, Universidade de Aveiro, Portugal)
Luis M. T. Jesus (IEETA and ESSUA, Universidade de Aveiro, Portugal)
Anna Barney (ISVR, University of Southampton, UK)

The characteristics of steady state fricative production, and those of the phone preceding and following the fricative, were investigated. Aerodynamic and electroglottographic (EGG) recordings of four normal adult speakers (two females and two males), producing a speech corpus of 9 isolated words with the European Portuguese (EP) voiced fricatives /v, z, Z/ in initial, medial and final word position, and the same 9 words embedded in 42 different real EP carrier sentences, were analysed. Multimodal data allowed the characterisation of fricatives in terms of their voicing mechanisms, based on the amplitude of oral flow, F1 excitation and fundamental frequency (F0).

#8Contextual effects on protrusion and lip opening for /i,y/

Anne Bonneau (LORIA/CNRS)
Julie Busset (LORIA/ UMR 7503)
Brigitte Wrobel-Dautcourt (LORIA/UMR7503)

This study investigates the effect of “adverse” contexts, especially that of the consonant /S/, on labial parameters for French /i,y/. Five parameters were analysed: the height, width and area of lip opening, the distance between the corners of the mouth, as well as lip protrusion. Ten speakers uttered a corpus made up of isolated vowels, syllables and logatoms. A special procedure has been designed to evaluate lip opening contours. Results showed that the carry-over effect of the consonant /S/ can impede the opposition between /i/ and /y/ in the protrusion dimension, depending upon speakers.

#9Speech Rate Effects on European Portuguese Nasal Vowels

Catarina Oliveira (University of Aveiro)
Paula Martins (Health School, University of Aveiro)
António Teixeira (DETI/IEETA, University of Aveiro)

This paper presents new temporal information regarding the production of European Portuguese (EP) nasal vowels, based on new EMMA data. The influence of speech rate on the duration of velum gestures and their coordination with consonantal and glottal gestures were analyzed. As information on the relative speed of articulators is scarce, the stiffness parameter for the nasal gestures was also calculated and analyzed. Results show clear effects of speech rate on the temporal characteristics of EP nasal vowels. Increased speech rate reduces the duration of velum gestures and increases stiffness and inter-gestural overlap.

#10Relation of formants and subglottal resonances in Hungarian vowels

Tamás Gábor Csapó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary)
Zsuzsanna Bárkányi (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)
Tekla Etelka Gráczi (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)
Tamás Bőhm (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary; Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)
Steven M. Lulich (Speech Communication Group, MIT, Cambridge, MA 02139)

The relation between vowel formants and subglottal resonances (SGRs) has previously been explored in English, German, and Korean. Results from these studies indicate that vowel classes are categorically separated by SGRs. We extended this work to Hungarian vowels, which have not been related to SGRs before. The Hungarian vowel system contains paired long and short vowels as well as a series of front rounded vowels, similar to German but more complex than English and Korean. Results indicate that SGRs separate vowel classes in Hungarian as in English, German, and Korean, and uncover additional patterns of vowel formants relative to the third subglottal resonance (Sg3). These results have implications for understanding phonological distinctive features, and applications in automatic speech technologies.

Mon-Ses3-P4:
Applications in learning and other areas

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair: Nestor Becerra Yoma

#1Designing spoken tutorial dialogue with children to elicit predictable but educationally valuable responses

Gregory Aist (Carnegie Mellon University)
Jack Mostow (Carnegie Mellon University)

How can we construct spoken dialogue interactions with children that are educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user’s utterances are the external evidence of task performance or learning in the domain, and (b) the target utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables which may change from exercise to exercise). The key approach is to teach the human learner a parameterized process that maps input to response. We describe how the discovery of this design principle came out of analyzing the processes of automated tutoring for reading and pronunciation and designing dialogues to address vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions, and discuss how it could extend to non-language tutoring tasks.

#2Optimizing non-native speech recognition for CALL applications

Joost van Doremalen (Centre for Language and Speech Technology, Radboud University Nijmegen)
Helmer Strik (Centre for Language and Speech Technology, Radboud University Nijmegen)
Catia Cucchiarini (Centre for Language and Speech Technology, Radboud University Nijmegen)

We are developing a Computer Assisted Language Learning (CALL) system that uses Automatic Speech Recognition (ASR) to give feedback on grammar and pronunciation. However, good quality unconstrained non-native ASR is not yet feasible. Therefore, we use an approach in which we try to elicit constrained responses. The task in the current experiments is to select utterances from a list of responses. The results of our experiments show that significant improvements can be obtained by optimizing the language model and acoustic models. In this way we could reduce the utterance error rate from 29-26% to 10-8%.

#3Evaluation of English Intonation based on Combination of Multiple Evaluation Scores

Akinori Ito (Graduate School of Engineering, Tohoku University)
Tomoaki Konno (Graduate School of Engineering, Tohoku University)
Masashi Ito (Graduate School of Engineering, Tohoku University)
Shozo Makino (Graduate School of Engineering, Tohoku University)

In this paper, we propose a novel method for evaluating the intonation of an English utterance spoken by a learner, for use in intonation learning with a CALL system. The proposed method is based on an intonation evaluation method proposed by Suzuki et al., which uses “word importance factors” calculated from word clusters given by a decision tree. We extended Suzuki’s method so that multiple decision trees are used and the resulting intonation scores are combined using multiple regression. In an experiment, we obtained a correlation coefficient comparable to the correlation between human raters.

#4A LANGUAGE-INDEPENDENT FEATURE SET FOR THE AUTOMATIC EVALUATION OF PROSODY

Andreas Maier (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Florian Hönig (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Viktor Zeissler (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Anton Batliner (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Erik Körner (Universität Erlangen-Nürnberg, Japanologie)
Nobuyuki Yamanaka (Universität Erlangen-Nürnberg, Japanologie)
Peter Ackermann (Universität Erlangen-Nürnberg, Japanologie)
Elmar Nöth (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)

In second language learning, the correct use of prosody plays a vital role. Therefore, an automatic method to evaluate the naturalness of the prosody of a speaker is desirable. We present a novel method to model prosody independently of the text and thus independently of the language as well. For this purpose, the voiced and unvoiced speech segments are extracted and a 187-dimensional feature vector is computed for each voiced segment. This approach is compared to word-based prosodic features on a German text passage. Both are compared with the perceptual evaluation by two native speakers of German. The word-based feature set yielded correlations of up to 0.92 while the text-independent feature set yielded 0.88. This is in the same range as the inter-rater correlation of 0.88.
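
The correlations quoted above compare automatic scores with human ratings. As a minimal sketch of how such a figure can be computed, the following uses a plain Pearson correlation; the abstract does not say which correlation measure was used, and all names and numbers here are hypothetical illustrations, not data from the paper.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical per-speaker scores (made up for illustration only).
rater_a = [1.0, 2.0, 3.0, 4.0, 5.0]
rater_b = [2.0, 2.0, 4.0, 4.0, 5.0]
automatic = [1.5, 2.5, 2.5, 4.5, 4.0]

# Compare the automatic score with the mean human rating, and the raters
# with each other (the inter-rater correlation serves as a reference point).
human_mean = [(a + b) / 2 for a, b in zip(rater_a, rater_b)]
print("automatic vs. human:", round(pearson(automatic, human_mean), 3))
print("inter-rater:", round(pearson(rater_a, rater_b), 3))
```

An automatic score whose correlation with the human ratings approaches the inter-rater value agrees with the raters about as well as they agree with each other.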

#5Adapting the Acoustic Model of a Speech Recognizer for Varied Proficiency Non-Native Spontaneous Speech Using Read Speech with Language-Specific Pronunciation Difficulty

Klaus Zechner (Educational Testing Service)
Derrick Higgins (Educational Testing Service)
Rene Lawless (Educational Testing Service)
Yoko Futagi (Educational Testing Service)
Sarah Ohls (Educational Testing Service)
George Ivanov (Educational Testing Service)

This paper presents a novel approach to acoustic model adaptation of a recognizer for non-native spontaneous speech for candidates’ responses in a test of spoken English. Instead of transcribing spontaneous speech data, a read speech corpus is created where non-native speakers of English read English sentences of different degrees of pronunciation difficulty with respect to their native language. As a selection criterion we develop a novel score, the “phonetic challenge score”, consisting of a measure for native language-specific difficulties described in the second-language acquisition literature and also of a statistical measure based on the cross-entropy between phoneme sequences of the native language and English. The results of using the read speech for AM adaptation of a recognizer for spontaneous non-native speech show a significant reduction of word error rate for two of four language groups of the spontaneous speech test set as well as for the entire test set.

#6Analysis and Utilization of MLLR Speaker Adaptation Technique for Learners' Pronunciation Evaluation

Dean Luo (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Yutaka Yamauchi (Tokyo International University)
Hirose Keikichi (The University of Tokyo)

In this paper, we investigate the effects and problems of MLLR speaker adaptation when applied to pronunciation evaluation. Automatic scoring and error detection experiments are conducted on two publicly available databases of Japanese learners’ English pronunciation. As we expected, over-adaptation causes misjudgment of pronunciation accuracy. Based on the analyses, two novel methods, Forced-aligned GOP score and Regularized-MLLR adaptation, are proposed to counter the adverse effects of MLLR adaptation. Experimental results show that the proposed methods can better utilize MLLR adaptation and avoid over-adaptation.

#7Control of human generating force by use of acoustic information – Study on Onomatopoeic utterances for controlling small lifting-force

Miki Iimura (School of Engineering, Tokyo Denki University)
Taichi Sato (School of Engineering, Tokyo Denki University)
Kihachiro Tanaka (Faculty of Engineering, Saitama University)

We have conducted basic experiments on applying acoustic information to engineering problems. We asked subjects to perform lifting actions while listening to sounds and measured the resulting lifting-force. As the sounds presented to the subjects, we used human onomatopoeic utterances intended to make their lifting-force small. In particular, we focused on the “emotion” or “nuance” contained in human utterances, which is a unique characteristic evoked by the utterance’s acoustical features. We found that the emotion or nuance can control the lifting-force effectively. We also clarified the acoustical features that are responsible for effective control of the lifting-force exerted by humans.

#8Mi-DJ: a multi-source intelligent DJ service

Ching-Hsien Lee (researcher)
Hsu-Chih Wu (researcher)

In this paper, a multi-source intelligent DJ (Mi-DJ) service is introduced. It is an audio program platform that integrates different media types, including audio and text content. It acts like a DJ who plays a personalized audio program whenever and wherever the user needs it. The audio program is automatically generated from several audio clips, each derived from either existing audio files or text information such as e-mail, calendar entries, news or user-preferred articles. Our unique program generation technology makes the user feel as if listening to a well-organized program rather than several separate audio files. The program can be organized dynamically, which enables a context-aware service based on location, the user's schedule, or other user preferences. With appropriate data management, text processing and speech synthesis technologies, Mi-DJ can be applied to many application scenarios, for example language learning and tour guidance.

#9Human Voice or Prompt Generation? Can they Co-exist in an Application?

Géza Németh (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Csaba Zainkó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Mátyás Bartalis (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Gábor Olaszy (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Géza Kiss (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)

This paper describes an R&D project regarding procedures for the automatic maintenance of the interactive voice response (IVR) system of a mobile telecom operator. The original plan was to create a generic voice prompt generation system for the customer service department. The challenge was to create a solution that is hard to distinguish from the human speaker (i.e. passing a sort of Turing test) so that its output can be freely mixed with original human recordings. As a first step, the domain of the solution had to be narrowed down to the price list of available mobile phones and services. This list is updated weekly, so the final operational system generates about 3 hours of speech each weekend. It operates under human supervision but without intervention in the speech generation process. It was tested both by academic procedures and by company customers, and was accepted as fulfilling the original requirements.

#10Automatic vs. human question answering over multimedia meeting recordings

Quoc Anh Le (University of Namur)
Andrei Popescu-Belis (Idiap Research Institute)

Information access in meeting recordings can be assisted by meeting browsers, or can be fully automated following a question-answering (QA) approach. An information access task is defined, aiming at discriminating true vs. false parallel statements about facts in meetings. An automatic QA algorithm is applied to this task, using passage retrieval over a meeting transcript. The algorithm scores 59% accuracy for passage retrieval, while random guessing is below 1%, but only scores 60% on combined retrieval and question discrimination, for which humans reach 70%-80% and the baseline is 50%. The algorithm clearly outperforms humans for speed, at less than 1 second per question, vs. 1.5-2 minutes per question for humans. The degradation on ASR compared to manual transcripts still yields lower but acceptable scores, especially for passage identification. Automatic QA thus appears to be a promising enhancement to meeting browsers used by humans, as an assistant for relevant passage identification.

Tue-Ses0-K:
Tom Griffiths - Connecting human and machine learning via probabilistic models of cognition

Time:Tuesday 08:30 Place:Main Hall Type:Keynote
Chair:Steve Renals

08:30Connecting human and machine learning via probabilistic models of cognition

Tom Griffiths (UC Berkeley)

Human performance defines the standard that machine learning systems aspire to in many areas, including learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the human mind works. However, in order for ideas to flow easily from cognitive science to computer science and vice versa, we need a common framework for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a framework, including results that illustrate how novel ideas from statistics can inform cognitive science. Specifically, I will talk about how probabilistic models can be used to identify the assumptions of learners, learn at different levels of abstraction, and link the inductive biases of individuals to cultural universals.

Tue-Ses1-O1:
ASR: Discriminative Training

Time:Tuesday 10:00 Place:Main Hall Type:Oral
Chair: Erik McDermott

10:00On the Semi-Supervised Learning of Multi-Layered Perceptrons

Jonathan Malkin (University of Washington)
Amarnag Subramanya (University of Washington)
Jeff Bilmes (University of Washington)

We present a novel approach for training a multi-layered perceptron (MLP) in a semi-supervised fashion. Our objective function, when optimized, balances training set accuracy with fidelity to a graph-based manifold over all points. Additionally, the objective favors smoothness via an entropy regularizer over classifier outputs as well as straightforward L2 regularization. Our approach also scales well enough to enable large-scale training. The results demonstrate significant improvement on several phone classification tasks over baseline MLPs.

10:20Generalized Discriminative Feature Transformation for Speech Recognition

Roger Hsiao (InterACT, Language Technologies Institute, Carnegie Mellon University)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University)

We propose a new algorithm called Generalized Discriminative Feature Transformation (GDFT) for acoustic models in speech recognition. GDFT is based on Lagrange relaxation on a transformed optimization problem. We show that the existing discriminative feature transformation methods like feature space MMI/MPE (fMMI/MPE), region dependent linear transformation (RDLT), and a non-discriminative feature transformation, constrained maximum likelihood linear regression (CMLLR) are special cases of GDFT. We evaluate the performance of GDFT for Iraqi large vocabulary continuous speech recognition (LVCSR).

10:40A Fast Online Algorithm for Large Margin Training of Continuous Density Hidden Markov Models

Chih-Chieh Cheng (University of California, San Diego)
Fei Sha (University of Southern California)
Lawrence Saul (University of California, San Diego)

We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches--both online and batch--that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training.
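
The margin criterion described above can be illustrated with a small sketch. It shows only a hinge-style penalty whose required margin grows with the Hamming distance between label sequences; the function names, the `scale` parameter and the toy scores are assumptions made for illustration, and the paper's actual update rule is not given in the abstract.

```python
def hamming_distance(ref, hyp):
    """Count positions where two frame-aligned label sequences differ."""
    if len(ref) != len(hyp):
        raise ValueError("sequences must have the same number of frames")
    return sum(r != h for r, h in zip(ref, hyp))


def margin_violation(score_correct, score_incorrect, ref, hyp, scale=1.0):
    """Hinge penalty for a large-margin criterion.

    The penalty is zero once the correct transcription's log-likelihood
    exceeds the incorrect one's by at least scale * Hamming distance.
    """
    required_margin = scale * hamming_distance(ref, hyp)
    return max(0.0, required_margin - (score_correct - score_incorrect))


# Toy example with hypothetical frame labels and log-likelihoods.
reference = ["a", "a", "b", "b"]
competitor = ["a", "b", "b", "c"]  # differs from the reference in two frames

# The correct path wins by 1.5, but a margin of 2.0 is required.
print(margin_violation(-10.0, -11.5, reference, competitor))
# The correct path wins by 3.0, so the margin is satisfied.
print(margin_violation(-10.0, -13.0, reference, competitor))
```

An online learner of this kind would adjust the model parameters only for utterances with a non-zero penalty, and competitors that differ more from the reference must be pushed further away.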

11:00Maximum Mutual Information Estimation via Second Order Cone Programming for Large Vocabulary Continuous Speech Recognition

Dalei Wu (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Baojie Li (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Hui Jiang (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)

In this paper, we have successfully extended our previous work on convex optimization methods to MMIE-based discriminative training for large vocabulary continuous speech recognition. Specifically, we have re-formulated MMIE training as a second order cone programming (SOCP) problem using some convex relaxation techniques that we have previously proposed. Moreover, the entire SOCP formulation has been developed for word graphs instead of N-best lists to handle large vocabulary tasks. The proposed method has been evaluated on the standard WSJ-5k task and experimental results show that the proposed SOCP method significantly outperforms the conventional EBW method in terms of recognition accuracy as well as convergence behavior. Our experiments also show that the proposed SOCP method is efficient enough to handle some relatively large HMM sets normally used in large vocabulary tasks.

11:20Hidden Conditional Random Field with Distribution Constraints for Phone Classification

Dong Yu (Microsoft Research)
Li Deng (Microsoft Research)
Alex Acero (Microsoft Research)

We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with distribution constraints (DCs). We point out that the DCs are the same as the traditional MCs for binary features but are able to regularize the probability distribution of continuous-valued features better than the MCs. We show that under the DCs the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting optimization problem by converting it to the traditional log-linear form in a higher-dimensional feature space using cubic splines. We demonstrate that a 20.8% classification error rate can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this task, including the HCRF-MC model, the discriminatively trained HMMs, and the large-margin HMMs using the same features.

11:40Deterministic Annealing Based Training Algorithm for Bayesian Speech Recognition

Sayaka Shiota (Nagoya Institute of Technology)
Kei Hashimoto (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

This paper proposes a deterministic annealing based training algorithm for Bayesian speech recognition. The Bayesian method is a statistical technique for estimating reliable predictive distributions by marginalizing model parameters. However, the local maxima problem in the Bayesian method is more serious than in the ML-based approach, because the Bayesian method treats not only state sequences but also model parameters as latent variables. The deterministic annealing EM (DAEM) algorithm has been proposed to mitigate the local maxima problem in the EM algorithm, and its effectiveness has been reported in HMM-based speech recognition using the ML criterion. In this paper, the DAEM algorithm is applied to Bayesian speech recognition to relax the local maxima problem. Speech recognition experiments show that the proposed method achieves higher performance than the conventional methods.

Tue-Ses1-O2:
Language acquisition

Time:Tuesday 10:00 Place:East Wing 1 Type:Oral
Chair:Maria Uther

10:00Connecting Rhythm and Prominence in Automatic ESL Pronunciation Scoring

Emily Nava (University of Southern California)
Joseph Tepperman (University of Southern California)
Louis Goldstein (University of Southern California)
Maria Luisa Zubizarreta (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Past studies have shown that a native Spanish speaker's use of phrasal prominence is a good indicator of her level of English prosody acquisition. Because of the cross-linguistic differences in the organization of phrasal prominence and durational contrasts, we hypothesize that those speakers with English-like prominence in their L2 speech are also expected to have acquired English-like rhythm. Statistics from a corpus of native and nonnative English confirm that speakers with an English-like phrasal prominence are also the ones who use English-like rhythm. Additionally, two methods of automatic score generation based on vowel duration times demonstrate a correlation of at least 0.6 between these automatic scores and subjective scores for phrasal prominence. These findings suggest that simple vowel duration measures obtained from standard automatic speech recognition methods can be salient cues for estimating subjective scores of prosodic acquisition, and of pronunciation in general.

10:20Evaluating parameters for mapping adult vowels to imitative babbling

Ilana Heintz (The Ohio State University)
Mary Beckman (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)
Lucie Ménard (Université du Québec à Montréal)

We design a neural network model of first language acquisition to explore the relationship between child and adult speech sounds. The model learns simple vowel categories using a produce-and-perceive babbling algorithm in addition to listening to ambient speech. The model is similar to that of Westermann & Miranda (2004), but adds a dynamic aspect in that it adapts in both the articulatory and acoustic domains to changes in the child's speech patterns. The training data is designed to replicate infant speech sounds and articulatory configurations. By exploring a range of articulatory and acoustic dimensions, we see how the child might learn to draw correspondences between his or her own speech and that of a caretaker, whose productions are quite different from the child's. We also design an imitation evaluation paradigm that gives insight into the strengths and weaknesses of the model.

10:40Intonation of Japanese sentences spoken by English speakers

Chiharu Tsurutani (Griffith University, Australia)

This study investigated the intonation of Japanese sentences spoken by Australian English speakers and the influence of their first language (L1) prosody on their intonation of Japanese sentences. Second language (L2) intonation is a complicated product of L1 transfer at two levels of the prosodic hierarchy: the word level and the phrase level. L2 speech is hypothesized to retain the characteristics of L1, and to gain marked features of the target language only during the late stage of acquisition. Investigation of this hypothesis involved acoustic measurement of L2 speakers’ intonation contours, and comparison of these contours with those of native speakers.

11:00KLAIR: a Virtual Infant for Spoken Language Acquisition Research

Mark Huckvale (University College London)
Ian Howard (University of Cambridge)
Sascha Fagel (Berlin Institute of Technology)

Recent research into the acquisition of spoken language has stressed the importance of learning through embodied linguistic interaction with caregivers rather than through passive observation. However, the necessity of interaction makes experimental work on the simulation of infant speech acquisition difficult because of the technical complexity of building real-time embodied systems. In this paper we present KLAIR: a software toolkit for building simulations of spoken language acquisition through interactions with a virtual infant. The main part of KLAIR is a sensori-motor server that supplies a client machine learning application with a virtual infant on screen that can see, hear and speak. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.

11:20An Articulatory Analysis of Phonological Transfer Using Real-Time MRI

Joseph Tepperman (University of Southern California)
Erik Bresch (University of Southern California)
Yoon-Chul Kim (University of Southern California)
Sungbok Lee (University of Southern California)
Louis Goldstein (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Phonological transfer is the influence of a first language on phonological variations made when speaking a second language. With automatic pronunciation assessment applications in mind, this study intends to uncover evidence of phonological transfer in terms of articulation. Real-time MRI videos from three German speakers of English and three native English speakers are compared to uncover the influence of German consonants on close English consonants not found in German. Results show that nonnative speakers demonstrate the effects of L1 transfer through the absence of articulatory contrasts seen in native speakers, while still maintaining minimal articulatory contrasts that are necessary for automatic detection of pronunciation errors, encouraging the further use of articulatory models for speech error characterization and detection.

11:40Do Multiple Caregivers Speed up Language Acquisition?

Louis ten Bosch (Radboud University Nijmegen)
Okko Rasanen (Helsinki University of Technology)
Joris Driesen (Catholic University of Leuven)
Guillaume Aimetti (University of Sheffield)
Toomas Altosaar (Helsinki University of Technology)
Lou Boves (Radboud University Nijmegen)
Athena Corns (Radboud University Nijmegen)

In this paper we compare three different implementations of language learning to investigate the issue of speaker-dependent initial representations and subsequent generalization. These implementations are used in a comprehensive model of language acquisition under development in the FP6 FET project ACORNS. All algorithms are embedded in a cognitively and ecologically plausible framework, and perform the task of detecting word-like units without any lexical, phonetic, or phonological information. The results show that the computational approaches differ in how well they deal with unseen speakers, and in how generalization depends on the variation observed during training.

Tue-Ses1-O3:
ASR: Lexical and Prosodic Models

Time:Tuesday 10:00 Place:East Wing 2 Type:Oral
Chair:Eric Fosler-Lussier

10:00Grapheme to phoneme conversion using an SMT system

Antoine Laurent (Laboratoire Informatique Université du Maine (LIUM))
Paul Deléglise (Laboratoire Informatique Université du Maine (LIUM))
Sylvain Meignier (Laboratoire Informatique Université du Maine (LIUM))

This paper presents an automatic grapheme to phoneme conversion system that uses statistical machine translation techniques provided by the Moses Toolkit. The generated word pronunciations are employed in the dictionary of an automatic speech recognition system and evaluated using the ESTER 2 French broadcast news corpus. Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method supplemented by a rule-based tool for phonetic transcriptions of words unavailable in the dictionary. Moses gives better results than G2P, and has performance comparable to the dictionary look-up strategy.

10:20Lexical and Phonetic Modeling for Arabic Automatic Speech Recognition

Long Nguyen (BBN Technologies)
Tim Ng (BBN Technologies)
Kham Nguyen (Northeastern University)
Rabih Zbib (Massachusetts Institute of Technology)
John Makhoul (BBN Technologies)

In this paper, we describe the use of either words or morphemes as lexical modeling units and the use of either graphemes or phonemes as phonetic modeling units for Arabic automatic speech recognition (ASR). We designed four Arabic ASR systems: two word-based systems and two morpheme-based systems. Experimental results using these four systems show that they have comparable state-of-the-art performance individually, but the more sophisticated morpheme-based system tends to be the best. However, they seem to complement each other quite well within the ROVER system combination framework to produce substantially-improved combined results.

10:40Assessing Context and Learning for isiZulu Tone Recognition

Gina-Anne Levow (University of Chicago)

Prosody plays an integral role in spoken language understanding. In isiZulu, a Nguni family language with lexical tone, prosodic information determines word meaning. We assess the impact of models of tone and coarticulation for tone recognition. We demonstrate the importance of modeling prosodic context to improve tone recognition. We employ this less commonly studied language to assess models of tone developed for English and Mandarin, finding common threads in coarticulatory modeling. We also demonstrate the effectiveness of semi-supervised and unsupervised tone recognition techniques for this less-resourced language, with weakly supervised approaches rivaling supervised techniques.

11:00A Sequential Minimization Algorithm for Finite-State Pronunciation Lexicon Models

Dobrisek Simon (Faculty of Electrical Engineering, Ljubljana University, Slovenia)
Vesnicer Bostjan (Faculty of Electrical Engineering, Ljubljana University, Slovenia)
Mihelic France (Faculty of Electrical Engineering, Ljubljana University, Slovenia)

The paper first presents a large-vocabulary automatic speech-recognition system that is being developed for the Slovenian language. The concept of a single-pass token-passing algorithm for fast speech decoding that can be used with the designed multi-level system structure is discussed. From the algorithmic point of view, the main component of the system is a finite-state pronunciation lexicon model. This component has a crucial impact on the overall performance of the system, and we developed a sequential minimization algorithm that very efficiently reduces the size and algorithmic complexity of the lexicon model. The presented experiments show that the sequential minimization algorithm considerably outperforms (by up to 60%) the conventional algorithms that were developed for the static global optimization of finite-state transducers.

11:20A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling

Kornel Laskowski (Carnegie Mellon University)
Mattias Heldner (KTH)
Jens Edlund (KTH)

Prosody plays a central role in conversation, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by a factor of five. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.

11:40Vocabulary Expansion through Automatic Abbreviation Generation for Chinese Voice Search

Dong Yang (Department of Computer Science, Tokyo Institute of Technology)
Yi-cheng Pan (Department of Computer Science, Tokyo Institute of Technology)
Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology)

Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. In this paper, we propose a new method for automatically generating abbreviations for Chinese named entities, and we perform vocabulary expansion for voice search using the output of the abbreviation model. In our abbreviation modeling, we convert the abbreviation generation problem into a tagging problem and use the conditional random field (CRF) as the tagging tool. In the vocabulary expansion, considering the multiple abbreviation problem and the limited coverage of the top-1 abbreviation candidate, we add the top-10 candidates to the vocabulary. In our experiments, for the abbreviation modeling, we achieved a top-10 coverage of 88.3% with the proposed method; for the voice search, we improved the voice search accuracy from 16.9% to 79.2% by incorporating the top-10 abbreviation candidates into the vocabulary.
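
The tagging formulation and the top-k coverage measure can be illustrated with a minimal sketch. It assumes a per-character keep/skip tag set ("K"/"S") and hand-written candidate lists; the tag names, the helper functions and the toy data are hypothetical and are not taken from the paper, and the CRF tagger itself is not reproduced here.

```python
from typing import Dict, List, Sequence


def apply_tags(entity: str, tags: Sequence[str]) -> str:
    """Build an abbreviation by keeping the characters tagged "K" (keep)."""
    if len(entity) != len(tags):
        raise ValueError("one tag per character is required")
    return "".join(ch for ch, tag in zip(entity, tags) if tag == "K")


def top_k_coverage(candidates: Dict[str, List[str]],
                   references: Dict[str, str], k: int) -> float:
    """Fraction of entities whose reference abbreviation is in the top-k list."""
    hits = sum(1 for entity, ref in references.items()
               if ref in candidates.get(entity, [])[:k])
    return hits / len(references)


# Toy example (hypothetical data): tagging 北京大学 as K S K S yields 北大.
abbr = apply_tags("北京大学", ["K", "S", "K", "S"])

# Ranked candidate lists as a tagger might output them (made up for illustration).
ranked = {"北京大学": ["北京大", abbr], "清华大学": ["清大"]}
gold = {"北京大学": "北大", "清华大学": "清华"}

print(abbr, top_k_coverage(ranked, gold, k=1), top_k_coverage(ranked, gold, k=2))
```

Adding several ranked candidates per entity, rather than only the first, is what raises coverage in this toy case, which mirrors the motivation given for using the top-10 list.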

Tue-Ses1-O4:
Unit-Selection Synthesis

Time:Tuesday 10:00 Place:East Wing 3 Type:Oral
Chair:Alan Black

10:00Perceptual Cost Function for Cross-fading Based Concatenation

Qi Miao (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Alexander Kain (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Jan P. H. van Santen (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)

In earlier research, we applied a linear weighted cross-fading function to ensure smooth concatenation. However, this can cause unnaturally shaped spectral trajectories. We propose context-sensitive cross-fading. To train this system, a perceptually validated cost function is needed, which is the focus of this paper. A corpus was designed to generate a variety of formant trajectory shapes. A perceptual experiment was performed and a multiple linear regression model was applied to predict perceptual quality ratings from various distances between cross-faded and natural trajectories. Results show that perceptual quality could be predicted well from the proposed distance measures.
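
As a rough illustration of predicting quality ratings from several distance measures with multiple linear regression, the following sketch fits an ordinary least-squares model with NumPy. The number of distance measures, the coefficient values and the data are synthetic placeholders chosen for this example; they are assumptions and do not come from the paper.

```python
import numpy as np


def fit_linear_regression(distances: np.ndarray, ratings: np.ndarray) -> np.ndarray:
    """Least-squares fit of ratings ~ intercept + distances @ weights.

    Returns the coefficient vector [intercept, w_1, ..., w_d].
    """
    design = np.column_stack([np.ones(len(distances)), distances])
    coeffs, *_ = np.linalg.lstsq(design, ratings, rcond=None)
    return coeffs


def predict(coeffs: np.ndarray, distances: np.ndarray) -> np.ndarray:
    """Predicted ratings for new distance vectors."""
    return coeffs[0] + distances @ coeffs[1:]


# Synthetic placeholder data: 3 hypothetical distance measures, 50 stimuli.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(50, 3))
true_coeffs = np.array([5.0, -2.0, -1.0, -0.5])  # made-up values
y = true_coeffs[0] + X @ true_coeffs[1:]          # noise-free ratings

coeffs = fit_linear_regression(X, y)
print(np.round(coeffs, 3))
```

With real listening-test data the fit would not be exact, and the quality of the prediction would have to be checked on held-out stimuli.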

10:20Exploring Automatic Similarity Measures for Unit Selection Tuning

Daniel Tihelka (University of West Bohemia)
Jan Romportl (SpeechTech s.r.o)

The paper focuses on the current handling of target features in the unit selection approach, which basically requires huge corpora. Possible solutions based on measuring (dis)similarity among prosodic patterns are outlined. As a starting point for this research, several intuitively chosen measures of acoustic signal (dis)similarity are presented and correlated with perceived similarity obtained from a large-scale listening test.

10:40Towards Intonation Control in Unit Selection Speech Synthesis

Cedric Boidin (Orange Labs)
Olivier Boeffard (IRISA / University of Rennes 1)
Thierry Moudenc (Orange Labs)
Geraldine Damnati (Orange Labs)

We propose to control intonation in unit selection speech synthesis with a mixed CART-HMM intonation model. The Finite State Machine (FSM) formulation is suited to incorporate the intonation model in the unit selection framework because it allows for combination of models with different unit types and handling competing intonative variants. Subjective experiments have been carried out to compare segmental and joint-prosodic-and-segmental unit selection.

11:00A Novel Approach to Cost Weighting in Unit Selection TTS

Jerome Bellegarda (Apple Inc.)

Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. For a particular set of criteria, the relative weighting of the resulting costs crucially affects final candidate ranking. Their influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to all concatenations indiscriminately. This paper proposes an alternative approach, based on a data-driven framework separately optimized for each concatenation. The cost distribution in every information stream is dynamically leveraged to locally shift weight towards those characteristics that prove most discriminative at this point. An illustrative case study underscores the potential benefits of this solution.

11:20Maximum Likelihood Unit Selection for Corpus-based Speech Synthesis

Abubeker Gamboa Rosales (University of Guanajuato)
Hamurabi Gamboa Rosales (Dresden University of Technology)
Ruediger Hoffmann (Dresden University of Technology)

Unit selection attempts to find the best combination of speech unit sequences in an inventory so that the perceptual differences between expected (natural) and synthesized signals are as low as possible. However, mismatches and distortions are still possible in concatenative speech synthesis and they are normally perceptible in the synthesized waveform. Therefore, unit selection strategies and parameter tuning are still important issues in the improvement of speech synthesis. We present a novel concept to increase the efficiency of the exhaustive speech unit search within the inventory via a unit selection model. This model bases its operation on a mapping analysis of the concatenation sub-costs, a Bayes optimal classification (BOC), and a maximum likelihood selection (MLS). The principal advantage of the proposed unit selection method is that it does not require exhaustive training to set up weighting coefficients for target and concatenation sub-costs.

11:40A Close Look into the Probabilistic Concatenation Model for Corpus-based Speech Synthesis

Shinsuke Sakai (NICT)
Ranniery Maia (NICT)
Hisashi Kawai (NICT)
Satoshi Nakamura (NICT)

We have proposed a novel probabilistic approach to concatenation modeling for corpus-based speech synthesis, where the goodness of concatenation for a unit is modeled using a conditional Gaussian probability density whose mean is defined as a linear transform of the feature vector from the previous unit, and have shown its effectiveness through a subjective listening test. In this paper, we further investigate the characteristics of the proposed method by an objective evaluation and by observing the sequence of concatenation scores across an utterance. We also present the mathematical relationships of the proposed method with other approaches and show that it has flexible modeling power, with other approaches to concatenation scoring as special cases.
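
A conditional Gaussian whose mean is a linear transform of the previous unit's feature vector can be scored as sketched below. The names `A`, `b` and `cov`, the use of an additive offset `b`, and the use of the log-density as the concatenation score are assumptions made for this illustration; the abstract does not specify the parameterization.

```python
import numpy as np


def concat_log_score(x_prev, x_next, A, b, cov):
    """Return log N(x_next; A @ x_prev + b, cov) as a concatenation score."""
    x_prev = np.asarray(x_prev, dtype=float)
    x_next = np.asarray(x_next, dtype=float)
    mean = A @ x_prev + b                      # linear transform of previous features
    diff = x_next - mean
    d = diff.shape[0]
    sign, logdet = np.linalg.slogdet(cov)
    if sign <= 0:
        raise ValueError("covariance must be positive definite")
    maha = diff @ np.linalg.solve(cov, diff)   # squared Mahalanobis distance
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + maha)


# Hypothetical 2-dimensional features with an identity transform.
A = np.eye(2)
b = np.zeros(2)
cov = np.eye(2)
perfect = concat_log_score([1.0, 2.0], [1.0, 2.0], A, b, cov)
shifted = concat_log_score([1.0, 2.0], [2.0, 2.0], A, b, cov)
print(perfect, shifted)
```

With `A` set to the identity and `b` to zero, the score reduces, up to a constant, to a negative squared Mahalanobis distance between adjacent feature vectors. Whether this is among the special cases the authors have in mind is an assumption.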

Tue-Ses1-S1:
Special Session: Advanced Voice Function Assessment

Time:Tuesday 10:00 Place:East Wing 4 Type:Special
Chair:Anna Barney & Mette Pedersen

10:00Acoustic and High-Speed Digital Imaging Based Analysis of Pathological Voice Contributes to Better Understanding and Differential Diagnosis of Neurological Dysphonias and of Mimicking Phonatory Disorders

Krzysztof Izdebski (Pacific Voice and Speech Foundation & Department of Otolaryngology: Head & Neck Surgery, Stanford Voice & Swallowing Center, Stanford University School of Medicine)
Yuling Yan (Department of Bioengineering, Santa Clara University & Department of Otolaryngology, Stanford University School of Medicine)
Melda Kunduk (Department of Communication Sciences and Disorders, Louisiana State University)

Using Nyquist-plots definitions and HSDI-based analyses of the acoustic and visual database of similarly sounding disordered neurologically driven pathological phonations, we categorized these signals and provided an in-depth explanation of how these sounds differ, and how these sounds are generated at the glottic level. Combined evaluations based on modern technology strengthened our knowledge and improved objective guidelines on how to approach clinical diagnosis “by ear”, significantly aiding the process of differential diagnosis of complex pathological voice qualities in non-laboratory settings. Index Terms: HSDI, Nyquist-plots, voice quality, tremor overpressure, vocal arrests, neurologic dysphonias, functional dysphonias, mimicking disorders

10:20Normalized Modulation Spectral Features for Cross-Database Voice Pathology Detection

Maria Markaki (Computer Science Department, University of Crete)
Yannis Stylianou (Computer Science Department, University of Crete)

In this paper, we employ normalized modulation spectral analysis for voice pathology detection. Such normalization is important when there is a mismatch between training and testing conditions, in other words, when the detection system is employed in real (testing) conditions. Modulation spectra usually produce a high-dimensional space. For classification purposes, the size of the original space is reduced using Higher Order Singular Value Decomposition (SVD). Further, we select the most relevant features based on the mutual information between subjective voice quality and computed features, which leads to a modulation spectra representation that is adapted to the classification task. For voice pathology detection, the adaptive modulation spectra are combined with an SVM classifier. To simulate real testing conditions, we used two independently recorded databases, one for training and the other for testing. We address the difference in signal characteristics between training and testing data through subband normalization of modulation spectral features. Simulations show that feature normalization enables the cross-database detection of pathological voices even when training and test data are different.

10:40Speech sample salience analysis for speech cycle detection

Christophe Mertens (Laboratory of Images, Signals and Telecommunication Devices, CP 165/51, Faculté des Sciences Appliquées. Université Libre de Bruxelles)
Francis Grenez (Laboratory of Images, Signals and Telecommunication Devices, CP 165/51, Faculté des Sciences Appliquées. Université Libre de Bruxelles)
Jean Schoentgen (National Fund for Scientific Research, Belgium)

The presentation proposes a method for the measurement of cycle lengths in voiced speech. The background is the study of acoustic cues of slow (vocal tremor) and fast (vocal jitter) perturbations of the vocal frequency. Here, these acoustic cues are obtained by means of a temporal method that detects speech cycles via the so-called salience of the speech signal samples. The method does not require that the signal be locally periodic or that the average period length be known a priori. Several implementations are considered and discussed. Salience analysis is compared with the auto-correlation method for cycle detection implemented in Praat.

11:00The Use of Telephone Speech Recordings for Assessment and Monitoring of Cognitive Function in Elderly People

Viliam Rapcan (Trinity Centre for Bioengineering, Trinity College Dublin, Ireland)
Shona D'Arcy (Trinity Centre for Bioengineering, Trinity College Dublin, Ireland)
Nils Penard (Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)
Ian H. Robertson (Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)
Richard B. Reilly (Trinity Centre for Bioengineering & Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)

Cognitive assessment in the clinic is a time-consuming and expensive task. Speech may be employed as a means of monitoring cognitive function in elderly people. Extraction of speech characteristics from speech recorded remotely over a telephone was investigated and compared to speech characteristics extracted from recordings made in a controlled environment. Results demonstrate that speech characteristics can be reliably extracted from telephone-quality speech (with an overall accuracy of 93.2%) with only minor changes to the feature extraction algorithm. With further development of a fully automated IVR system, an early screening system for cognitive decline may be easily realized.

11:20Optimized Feature set to Assess Acoustic Perturbations in Dysarthric Speech

Sunil Nagaraja (Department of Electrical and Computer Engineering, University of New Brunswick, Canada)
Eduardo Castillo Guerra (Department of Electrical and Computer Engineering, University of New Brunswick, Canada)

This paper focuses on the optimization of features derived to characterize the acoustic perturbations encountered in a group of neurological disorders known as Dysarthria. The work derives a set of orthogonal features that enable acoustic analyses of dysarthric speech from eight different Dysarthria types. The feature set is composed of combinations of objective measurements obtained with digital signal processing algorithms and perceptual judgments of the most reliably perceived acoustic perturbations. The effectiveness of the features in providing relevant information about the disorders is evaluated with different classifiers, yielding a classification rate of up to 93.7%.

11:40A Microphone-Independent Visualization Technique for Speech Disorders

Andreas Maier (Universität Erlangen-Nürnberg, Abteilung für Phoniatrie und Pädaudiologie)
Stefan Wenhardt (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Tino Haderlein (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Maria Schuster (Universität Erlangen-Nürnberg, Abteilung für Phoniatrie und Pädaudiologie)
Elmar Nöth (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)

In this paper we introduce a novel method for the visualization of speech disorders. We demonstrate the method with disordered speech and a control group. However, both groups were recorded using two different microphones. The projection of the patient data using a single microphone yields significant correlations between the coordinates on the map and certain criteria of the disorder which were perceptually rated. However, projection of data from multiple microphones reduces this correlation. Usually, the acoustical mismatch between the microphones is greater than the mismatch between the speakers, i.e., not the disorders but the microphones form clusters in the visualization. Based on an extension of the Sammon mapping, we are able to create a map which projects the same speakers onto the same position even if multiple microphones are used. Furthermore, our method also restores the correlation between the map coordinates and the perceptual assessment.
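
For orientation, the stress criterion minimized by the standard Sammon mapping can be sketched as follows. This shows only the classical criterion; the microphone-independent extension described in the abstract is not reproduced, and the function names and toy points are hypothetical.

```python
import numpy as np


def pairwise_distances(points: np.ndarray) -> np.ndarray:
    """Euclidean distance matrix between all rows of `points`."""
    diff = points[:, None, :] - points[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def sammon_stress(high_dim: np.ndarray, low_dim: np.ndarray,
                  eps: float = 1e-12) -> float:
    """Classical Sammon stress between original points and their map positions.

    E = (1 / sum d*_ij) * sum (d*_ij - d_ij)^2 / d*_ij, taken over pairs i < j,
    where d* are distances in the original space and d are distances on the map.
    """
    d_star = pairwise_distances(high_dim)
    d_map = pairwise_distances(low_dim)
    iu = np.triu_indices(len(high_dim), k=1)   # each pair counted once
    d_star, d_map = d_star[iu], d_map[iu]
    return float(((d_star - d_map) ** 2 / (d_star + eps)).sum() / d_star.sum())


# Toy points lying in the plane z = 0, so dropping z preserves all distances.
pts = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.0, 2.0, 0.0], [3.0, 1.0, 0.0]])
flat = pts[:, :2]
collapsed = np.zeros((4, 2))                   # every point mapped to the origin
print(sammon_stress(pts, flat), sammon_stress(pts, collapsed))
```

A map that preserves all pairwise distances has zero stress, while collapsing every point onto one position gives a stress of one, so lower values indicate a more faithful projection.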

12:00Evaluation of the Effect of the GSM Full Rate codec on the Automatic Detection of Laryngeal Pathologies Based on Cepstral Analysis

Ruben Fraile (Universidad Politecnica de Madrid)
Carmelo Sanchez (Universidad Politecnica de Madrid)
Juan I. Godino-Llorente (Universidad Politecnica de Madrid)
Nicolas Saenz-Lechon (Universidad Politecnica de Madrid)
Victor Osma-Ruiz (Universidad Politecnica de Madrid)
Juana M. Gutierrez (Universidad Politecnica de Madrid)

Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for the non-invasive detection of laryngeal pathologies. Bearing in mind the extension of these automatic methods to remote diagnosis scenarios, this paper analyzes the performance of a pathology detector based on Mel Frequency Cepstral Coefficients when the speech signal has undergone the distortion of a speech codec such as the GSM FR codec, which is used in one of today’s most widespread communications networks. It is shown that the overall performance of the automatic detection of pathologies is degraded by less than 5%, and that such degradation is not due to the codec itself, but to the bandwidth limitation needed at its input. These results indicate that the GSM system can be more adequate to implement remote voice assessment than the analogue telephone channel.

12:20Cepstral analysis of vocal dysperiodicities in disordered connected speech

Ali Alpan (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles, Brussels, Belgium)
Jean Schoentgen (National Fund for Scientific Research, Belgium)
Youri Maryn (Department of Otorhinolaryngology and Head & Neck Surgery, Department of Speech-Language Pathology and Audiology, Sint-Jan General Hospital, Bruges, Belgium)
Francis Grenez (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles, Brussels, Belgium)
Peter Murphy (Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland)

Several studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum is an indicator of hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier Transform of the log-magnitude spectrum. In the present study, a number of spectral analysis processing steps are implemented, including period-synchronous and period-asynchronous analysis, as well as harmonic-synchronous and harmonic-asynchronous spectral band-limitation prior to computing the cepstrum. The analysis is applied to connected speech signals. The correlation between amplitude R1 and perceptual ratings is examined for a corpus comprising 28 normophonic and 223 dysphonic speakers. One observes that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a popular cepstral cue which is the cepstral peak prominence (CPP).
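
The definition of the cepstrum given above, the inverse Fourier transform of the log-magnitude spectrum, can be illustrated on a synthetic signal. The sketch below is a plain frame-based computation and does not include the period-synchronous analysis or band-limitation steps studied in the paper; the Hann window, the F0 search range, the sampling rate and the impulse-train test signal are assumptions chosen for this example.

```python
import numpy as np


def real_cepstrum(frame: np.ndarray, eps: float = 1e-12) -> np.ndarray:
    """Inverse Fourier transform of the log-magnitude spectrum of a windowed frame."""
    windowed = frame * np.hanning(len(frame))
    log_mag = np.log(np.abs(np.fft.rfft(windowed)) + eps)  # eps avoids log(0)
    return np.fft.irfft(log_mag, n=len(frame))


def first_rahmonic(cepstrum: np.ndarray, fs: float,
                   f0_min: float = 120.0, f0_max: float = 400.0):
    """Return (quefrency in samples, amplitude) of the largest peak in the F0 range."""
    lo = int(fs / f0_max)
    hi = int(fs / f0_min)
    idx = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    return idx, float(cepstrum[idx])


# Synthetic test signal: impulse train with a period of 80 samples (200 Hz at 16 kHz).
fs = 16000
period = 80
frame = np.zeros(4096)
frame[::period] = 1.0

q, r1 = first_rahmonic(real_cepstrum(frame), fs)
print(q, fs / q)
```

For this perfectly periodic signal the peak falls at the quefrency of one cycle. In disordered voices the peak would be expected to be lower and less sharp, which is the property that R1-based measures exploit.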

12:40Standard information from patients: the usefulness of self-evaluation measured with the French version of the VHI

Lise Crevier-Buchman (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France / Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)
Stephanie Borel (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France / Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)
Stephane Hans (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France)
Madeleine Menard (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France)
Jacqueline Vaissiere (Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)

The Voice Handicap Index (VHI) is a scale designed to measure voice disability in daily life. Two groups of patients were evaluated. One group consisted of patients with glottic carcinoma treated by cordectomy type I & II (13 patients), type III (5 patients) or type V (5 patients). Evaluation was done preoperatively and postoperatively for 12 months. The other group consisted of patients with unilateral vocal fold paralysis treated by thyroplasty (17 patients). Evaluation was done before and 3 months postoperatively. Total VHI and the emotional and physical subscales improved significantly for type I & II cordectomy and for thyroplasty. The VHI can provide an insight into a patient’s handicap.

13:00Intelligibility Assessment in Children with Cleft Lip and Palate in Italian and German

Marcello Scipioni (Politecnico di Milano, Polo Regionale di Como, Italy)
Matteo Gerosa (FBK - Fondazione Bruno Kessler, Trento, Italy)
Diego Giuliani (FBK - Fondazione Bruno Kessler, Trento, Italy)
Elmar Nöth (Chair of Pattern Recognition, Friedrich-Alexander-University Erlangen-Nuremberg, Germany)
Andreas Maier (Chair of Pattern Recognition, Friedrich-Alexander-University Erlangen-Nuremberg, Germany)

Current research has shown that speech intelligibility in children with cleft lip and palate (CLP) can be estimated automatically using speech recognition methods. On German CLP data, high and significant correlations between human ratings and the recognition accuracy of a speech recognition system have already been reported. In this paper we investigate whether the approach is also suitable for other languages. Therefore, we compare the correlations obtained on German data with the correlations on Italian data. A high and significant correlation (r=0.76; p < 0.01) was identified on the Italian data. These results do not differ significantly from the results on German data (p > 0.05).

13:20Universidade de Aveiro’s Voice Evaluation Protocol

Luis M. T. Jesus (IEETA and ESSUA, Universidade de Aveiro, Portugal)
Anna Barney (ISVR, University of Southampton, UK)
Ricardo Santos (Hospital Privado da Trofa, Portugal)
Janine Caetano (Agrupamento de Escolas Serra da Gardunha, Fundão, Portugal)
Juliana Jorge (RAIZ, Esmoriz, Portugal)
Pedro Sá Couto (Departamento de Matemática da Universidade de Aveiro, Portugal)

This paper presents Universidade de Aveiro’s Voice Evaluation Protocol for European Portuguese (EP), and a preliminary inter-rater reliability study. Ten patients with vocal pathology were assessed by two Speech and Language Therapists (SLTs). Protocol parameters such as overall severity, roughness, breathiness, change of loudness (CAPE-V), grade, breathiness and strain (GRBAS), glottal attack, respiratory support, respiratory-phonatory-articulatory coordination, digital laryngeal manipulation, voice quality after manipulation, muscular tension and diagnosis presented high reliability and were highly correlated (good inter-rater agreement and high value of correlation). Values for the overall severity and grade were similar to those reported in the literature.

Tue-Ses1-P1:
Human Speech Production II

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Martin Cooke

#1Simple Physical Models of the Vocal Tract for Education in Speech Science

Takayuki Arai (Sophia University)

In the speech-related field, physical models of the vocal tract are effective tools for education in acoustics. Arai’s cylinder-type models are based on Chiba and Kajiyama’s measurement of vocal-tract shapes. The models quickly and effectively demonstrate vowel production. In this study, we developed physical models with simplified shapes as educational tools to illustrate how vocal-tract shape accounts for differences among vowels. As a result, the five Japanese vowels were produced by tube-connected models, in which several uniform tubes with different cross-sectional areas and lengths are connected as in Fant’s and Arai’s three-tube models.

#2Auto-meshing Algorithm for Acoustic Analysis of Vocal Tract

Kyohei Hayashi (Future University Hakodate)
Nobuhiro Miki (Future University Hakodate)

We propose a new auto-meshing algorithm for acoustic analysis of the vocal tract using the Finite Element Method (FEM). In our algorithm, the domain of the three-dimensional figure of the vocal tract is decomposed into two domains, a surface domain and an inner domain, in order to employ the overlapping domain decomposition method. The meshing of surface blocks can be realized with smooth surfaces using NURBS interpolation. We show an example of the meshes for the vocal tract figure of the Japanese vowel /a/, and a trial result of the FEM simulation.

#3Voice production model employing an interactive boundary-layer analysis of glottal flow

Tokihiko Kaburagi (Department of Acoustic Design, Faculty of Design, Kyushu University)
Katsunori Daimo (Graduate School of Design, Kyushu University)
Shogo Nakamura (School of Design, Kyushu University)

A voice production model has been studied by considering essential aerodynamic and acoustic phenomena in phonation. Acoustic voice sources are produced by the volume flow through the glottis. A precise flow analysis is therefore performed based on the boundary-layer approximation and the viscous-inviscid interaction between the boundary layer and core flow. This flow analysis can supply information on the separation point of the glottal flow and the thickness of the boundary layer, and yield an effective prediction of the flow behavior. When the flow analysis is combined with a mechanical model of the vocal fold, the resulting acoustic wave travels through the vocal tract and a pressure change develops in the vicinity of the glottis. This change can affect the glottal flow and the motion of the folds, causing source-filter interaction. Preliminary simulations were conducted by changing the relationship between the fundamental and formant frequencies, and the results are reported.

#4Characteristics of Two-Dimensional Finite Difference Techniques for Vocal Tract Analysis and Voice Synthesis

Matt Speed (Audio Lab, Department of Electronics, University of York)
Damian Murphy (Audio Lab, Department of Electronics, University of York)
David Howard (Audio Lab, Department of Electronics, University of York)

Both digital waveguide and finite difference techniques are numerical methods that have been demonstrated as appropriate for acoustic modelling applications. Whilst the application of the digital waveguide mesh to vocal tract modelling has been the subject of previous work, the application of comparable finite difference techniques is as yet untested. This study explores the characteristics of such a finite-difference approach to two-dimensional vocal tract modelling. Initial results suggest that finite difference techniques alone are not ideal, due to the limitation of non-dynamic behaviour and poor representation of admittance discontinuities in the approximation of three dimensional geometries. They do however introduce robust boundary formulations, and have a valid and useful application in modelling non-vital static volumes, particularly the nasal tract.

#5Adaptation of a predictive model of tongue shapes

Chao Qin (EECS, School of Engineering, University of California, Merced)
Miguel Carreira-Perpiñán (EECS, School of Engineering, University of California, Merced)

It is possible to recover the full midsagittal contour of the tongue with submillimetric accuracy from the location of just 3-4 landmarks on it. This involves fitting a predictive mapping from the landmarks to the contour using a training set consisting of contours extracted from ultrasound recordings. However, extracting sufficient contours is a slow and costly process. Here, we consider adapting a predictive mapping obtained for one condition (such as a given recording session, recording modality, speaker or speaking style) to a new condition, given only a few new contours and no correspondences. We propose an extremely fast method based on estimating a 2D-wise linear alignment mapping, and show it recovers very accurate predictive models from about 10 new contours.

#6Using sensor orientation information for computational head stabilisation in 3D Electromagnetic Articulography (EMA)

Christian Kroos (MARCS Auditory Laboratories, University of Western Sydney, Australia)

We propose a new simple algorithm to make use of the sensor orientation information in 3D Electromagnetic Articulography (EMA) for computational head stabilisation. The algorithm also provides a well-defined procedure in the case where only two sensors are available for head motion tracking and allows for the combining of position coordinates and orientation angles for head stabilisation with an equal weighting of each kind of information. An evaluation showed that the method using the orientation angles produced the most reliable results.

#7Collision Threshold Pressure Before and After Vocal Loading

Laura Enflo (Dept. of Speech, Music and Hearing, School of Computer Science & Communication, KTH, Sweden)
Johan Sundberg (Dept. of Speech, Music and Hearing, School of Computer Science & Communication, KTH, Sweden)
Friedemann Pabst (Hospital Dresden Friedrichstadt, Dresden, Germany)

The phonation threshold pressure (PTP) has been found to increase during vocal fatigue. In the present study we compare PTP and collision threshold pressure (CTP) before and after vocal loading in singer and non-singer voices. Seven subjects repeated the vowel sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min. Before and after this loading the subjects’ voices were recorded while they produced a diminuendo repeating the syllable /pa/. Oral pressure during the /p/ occlusion was used as a measure of subglottal pressure. Both CTP and PTP increased significantly after the vocal loading.

#8Gender differences in the realization of vowel-initial glottalization

Elke Philburn (University of Manchester, Department of Linguistics and English Language)

The aim of the study was to investigate gender-dependent differences in the realization of German glottalized vowel onsets. Laryngographic data of semi-spontaneous speech were collected from four male and four female speakers of Standard German. Measurements of relative vocal fold contact duration were carried out including glottalized vowel onsets as well as non-glottalized controls. The results show that female subjects realized the glottalized vowel onsets with greater maximum vocal fold contact duration than male subjects and that the glottalized vowel onsets produced by females were more clearly distinguished from the non-glottalized controls.

#9Stability and composition of functional synergies for speech movements in children and adults

Hayo Terband (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Frits van Brenk (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK)
Pascal van Lieshout (Department of Speech-Language Pathology, Oral Dynamics Lab; Department of Psychology; Institute of Biomaterials and Biomedical Engineering, University of Toronto, and Toronto Rehabilitation Institute, Toronto, Canada)
Lian Nijland (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Ben Maassen (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands ; Department of Neurolinguistics, University of Groningen, Groningen, the Netherlands)

The consistency and composition of functional synergies for speech movements were investigated in 7-year-old children and adults in a reiterated speech task using electromagnetic articulography (EMA). Results showed higher variability in children for tongue tip and jaw, but not for lower lip movement trajectories. Furthermore, the relative contribution of the lower lip to the oral closure was smaller in children than in adults, whereas no such difference was found for the tongue tip. These results support and extend findings of non-linearity in speech motor development and illustrate the importance of a multi-measures approach in studying speech motor development.

#10An analysis of speech rate strategies in aging

Frits van Brenk (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK; Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Hayo Terband (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Pascal van Lieshout (Department of Speech-Language Pathology, Oral Dynamics Lab; Department of Psychology; Institute of Biomaterials and Biomedical Engineering, University of Toronto, and Toronto Rehabilitation Institute, Toronto, Canada)
Anja Lowit (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK)
Ben Maassen (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands; Department of Neurolinguistics, University of Groningen, Groningen, the Netherlands)

Effects of age and speech rate on movement cycle duration were assessed using electromagnetic articulography. In a repetitive task syllables were articulated at eight rates, obtained by metronome and self-pacing. Results indicate that increased speech rate is associated with increasing movement cycle duration stability, while decreased rate leads to a decrease in uniformity of cycle duration, supporting the view that alterations in speech rate are associated with different motor control strategies involving durational manipulations. The relative contribution of closing movement durations increases with decreasing speech rate, and is a more dominant strategy for elderly speakers.

#11Variability and stability in collaborative dialogues: turn-taking and filled pauses

Štefan Beňuš (Constantine the Philosopher University, Nitra, Slovakia and Slovak Academy of Sciences, Bratislava, Slovakia)

Filled pauses have important and varied functions in turn-taking behavior, and better understanding of their relationship opens new ways for improving the quality and naturalness of dialogue systems. We use a corpus of collaborative task oriented dialogues to provide new insights into the relationship between filled pauses and turn-taking based on temporal and acoustic features. We then explore which of these patterns are stable and robust across speakers, which are prone to entrainment based on conversational partner, and which are variable and noisy. Our findings suggest that intensity is the least stable feature followed by pitch-related features, and temporal features relating filled pauses to chunking and turn-taking are the most stable.

#12Speaking in the presence of a competing talker

Youyi Lu (University of Sheffield)
Martin Cooke (Ikerbasque and University of the Basque Country)

How do speakers cope with a competing talker? This study investigated the possibility that speakers are able to retime their contributions to take advantage of temporal fluctuations in the background, reducing any adverse effects for an interlocutor. Speech was produced in quiet, competing talker, modulated noise and stationary backgrounds, with and without a communicative task. An analysis of the timing of contributions relative to the background indicated a significantly reduced chance of overlapping for the modulated noise backgrounds relative to quiet, with competing speech resulting in the least overlap. Strong evidence for an active overlap avoidance strategy is presented.

Tue-Ses1-P3:
Speech and Audio Segmentation and Classification

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:S. Umesh

#1Wavelet-based Speaker Change Detection in Single Channel Speech Data

Michael Wiesenegger (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)

Speaker segmentation is the task of finding speaker turns in an audio stream. We propose a metric-based algorithm based on Discrete Wavelet Transform (DWT) features. Principal component analysis (PCA) or linear discriminant analysis (LDA) is further used to reduce the dimensionality of the feature space and remove redundant information. In the experiments our methods -- DWT-PCA and DWT-LDA -- are compared to the DISTBIC algorithm using clean and noisy data of the TIMIT database. Under conditions with strong noise in particular, i.e. -10dB SNR, our DWT-PCA approach is very robust: the false alarm rate (FAR) drops by ~2% and the missed detection rate (MDR) stays about the same compared to clean speech, whereas the DISTBIC method fails -- its FAR and MDR are ~0% and ~100%, respectively. For clean speech DWT-PCA shows an improvement of ~30% (relative) for both the FAR and MDR in comparison to the DISTBIC algorithm. DWT-LDA performs slightly worse than DWT-PCA.

#2An Adaptive Threshold Computation for Unsupervised Speaker Segmentation

Laura Docio-Fernandez (University of Vigo)
Paula Lopez-Otero (University of Vigo)
Carmen Garcia-Mateo (University of Vigo)

Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we compare the performance of two speaker segmentation systems: the first one is inspired by a typical state-of-the-art speaker segmentation system, and the other is an improved version of the former. We show that the proposed system performs better as it does not over-segment the data. This system includes an algorithm that randomly discards some of the change points with a probability depending on its performance at any moment. Thus, the system merges adjacent segments when they are spoken by the same speaker with a high probability; anytime a change is discarded the discard probability will rise, as the system made a mistake; the opposite will occur when the two adjacent segments belong to different speakers, as there will not be a mistake in this case. We show the improvements of the new system through comparative experiments on the TC-STAR Spanish database.

#3A data-driven approach for estimating the time-frequency binary mask

Gibak Kim (Department of Electrical Engineering, University of Texas at Dallas)
Philipos Loizou (Department of Electrical Engineering, University of Texas at Dallas)

The ideal binary mask, often used in robust speech recognition applications, requires an estimate of the local SNR in each time-frequency (T-F) unit. A data-driven approach is proposed for estimating the instantaneous SNR of each T-F unit. By assuming that the a priori SNR and a posteriori SNR are uniformly distributed within a small region, the instantaneous SNR is estimated by minimizing the localized Bayes risk. The binary mask estimator derived by the proposed approach is evaluated in terms of hit and false alarm rates. Compared to the binary mask estimator that uses the decision-directed approach to compute the SNR, the proposed data-driven approach yielded substantial improvements (up to 40%) in classification performance, when assessed in terms of a sensitivity metric which is based on the difference between the hit and false alarm rates.
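
The quantities named in this abstract can be illustrated with a minimal sketch. It does not reproduce the authors' data-driven estimator; it only shows how an ideal binary mask follows from the local SNR and how a hit minus false-alarm score could be computed. The function names, the 0 dB threshold and the toy power values are assumptions made for illustration.

```python
import math


def ideal_binary_mask(speech_power, noise_power, threshold_db=0.0):
    """Label each T-F unit 1 if its local SNR exceeds threshold_db, else 0.

    Both inputs are lists of rows (time frames) of strictly positive
    per-band powers. The 0 dB default threshold is an assumption.
    """
    mask = []
    for s_row, n_row in zip(speech_power, noise_power):
        row = []
        for s, n in zip(s_row, n_row):
            snr_db = 10.0 * math.log10(s / n)
            row.append(1 if snr_db > threshold_db else 0)
        mask.append(row)
    return mask


def hit_minus_false_alarm(estimated, ideal):
    """HIT rate (ideal 1s kept) minus FA rate (ideal 0s wrongly kept).

    Assumes the ideal mask contains at least one 1 and at least one 0.
    """
    hits = false_alarms = ones = zeros = 0
    for e_row, i_row in zip(estimated, ideal):
        for e, i in zip(e_row, i_row):
            if i == 1:
                ones += 1
                hits += e
            else:
                zeros += 1
                false_alarms += e
    return hits / ones - false_alarms / zeros


# Toy 2x2 example (made-up powers, not data from the paper).
speech = [[4.0, 1.0], [1.0, 9.0]]
noise = [[1.0, 2.0], [4.0, 1.0]]
ideal = ideal_binary_mask(speech, noise)
score = hit_minus_false_alarm([[1, 1], [0, 1]], ideal)
```

A score of 1 would mean every target-dominated unit is retained and every noise-dominated unit is rejected; a score near 0 means the estimated mask is no better than chance.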

#4A Semi-supervised Version of Heteroscedastic Linear Discriminant Analysis

Zhou Haolang (CLSP, ECE, Johns Hopkins University)
Karakos Damianos (CLSP, COE, ECE, Johns Hopkins University)
Andreou Andreas (CLSP, ECE, Johns Hopkins University)

Heteroscedastic Linear Discriminant Analysis (HLDA) was introduced as an extension of Linear Discriminant Analysis to the case where the class-conditional distributions have unequal covariances. The HLDA transform is computed such that the likelihood of the training (labeled) data is maximized, under the constraint that the projected distributions are orthogonal to a nuisance space that does not offer any discrimination. In this paper we consider the case of semi-supervised learning, where a large amount of unlabeled data is also available. We derive update equations for the parameters of the projected distributions, which are estimated jointly with the HLDA transform, and we empirically compare it with the case where no unlabeled data are available. Experimental results with synthetic data and real data from a vowel recognition task show that, in most cases, semi-supervised HLDA results in improved performance over HLDA.

#5Self-learning Vector Quantization for Pattern Discovery from Speech

Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Unto Kalervo Laine (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Toomas Altosaar (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)

A novel and computationally straightforward clustering algorithm was developed for vector quantization (VQ) of speech signals for a task of unsupervised pattern discovery (PD) from speech. The algorithm works in purely incremental mode, is computationally very lightweight, and achieves classification quality comparable to that of the well-known k-means algorithm in the PD task. In addition to presenting the algorithm, general findings regarding the relationship between the amounts of training material, convergence of the clustering algorithm, and the ultimate quality of VQ codebooks are discussed.

#6Monaural Segregation of Voiced Speech using Discriminative Random Fields

Rohit Prabhavalkar (The Ohio State University)
Zhaozhang Jin (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)

Techniques for separating speech from background noise and other sources of interference have important applications for robust speech recognition and speech enhancement. Many traditional computational auditory scene analysis (CASA) based approaches decompose the input mixture into a time-frequency (T-F) representation, and attempt to identify the T-F units where the target energy dominates that of the interference. This is accomplished using a two-stage process of segmentation and grouping. In this pilot study, we explore the use of Discriminative Random Fields (DRFs) for the task of monaural speech segregation. We find that the use of DRFs allows us to effectively combine multiple auditory features into the system, while simultaneously integrating the two CASA stages into one. Our preliminary results suggest that CASA based approaches may benefit from the DRF framework.

#7Advancements in Whisper-Island Detection within Normally Phonated Audio Streams

Chi Zhang (Research Assistant, PhD Student)
John Hansen (Professor, Chair of E.E. Department)

In this study, several improvements are proposed for whisper-island detection within normally phonated audio streams. Based on our previous study, an improved feature, which is more sensitive to vocal effort change points between whisper and neutral speech, is developed and utilized in vocal effort change point (VECP) detection and vocal effort classification. Evaluation is based on the proposed multi-error score (MES), where the improved feature showed better performance in VECP detection with the lowest MES of 19.08. Furthermore, more accurate whisper-island detection was obtained using the improved algorithm. Finally, the experimental detection rate of 95.33% reflects better whisper-island detection performance for the improved algorithm versus that of the original baseline algorithm.

#8Joint Segmentation and Classification of Dialog Acts using Conditional Random Fields

Matthias Zimmermann (xbrain.ch)

This paper investigates the use of conditional random fields for joint segmentation and classification of dialog acts (DAs), exploiting both word and prosodic features that are directly available from a speech recognizer. To validate the approach, experiments are conducted with two different sets of dialog act types under both reference and speech-to-text conditions. Although the proposed framework is conceptually simpler than previous attempts at segmentation and classification of DAs, it outperforms all previous systems for a task based on the ICSI (MRDA) meeting corpus.

#9Exploring Complex Vowels as Phrase Break Correlates in a Corpus of English Speech with ProPOSEL, a Prosody and POS English Lexicon

Claire Brierley (University of Bolton)
Eric Atwell (University of Leeds)

Real-world knowledge of syntax is seen as integral to the machine learning task of phrase break prediction but there is a deficiency of a priori knowledge of prosody in both rule-based and data-driven classifiers. Speech recognition has established that pauses affect vowel duration in preceding words. Based on the observation that complex vowels occur at rhythmic junctures in poetry, we run significance tests on a sample of contemporary British English and find a statistically significant correlation between complex vowels in canonical dictionary pronunciations of words in a text, and phrase breaks. The experiment depends on automatic text annotation via ProPOSEL, a prosody and part-of-speech English lexicon. Index Terms: prosody; real-world knowledge for machine learning; phrase break prediction; text-to-speech synthesis.

#10Automatic Topic Detection of Recorded Voice Messages

Caroline Clemens (Deutsche Telekom Laboratories, Berlin, Germany)
Stefan Feldes (T-Systems, Darmstadt, Germany)
Karlheinz Schuhmacher (Deutsche Telekom Laboratories, Berlin, Germany)
Joachim Stegmann (Deutsche Telekom Laboratories, Berlin, Germany)

We present an approach to automatic classification of spontaneously spoken voice messages. During overload periods at call-centers customers are offered a call-back at a later time. A speech dialog asks them to describe their concern on a voice box. The identified topics correspond to the supported service categories, which in turn determine the agent group the customer message is routed to. Our multistage classification process includes speech-to-text, stemming, keyword spotting, and categorization. Classifier training and evaluation have been performed with real-life data. Results show promising performance. The pilot will be launched in a field test.

#11Identification and Automatic Detection of Parasitic Speech Sounds

Jindrich Matousek (Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Czech Republic)
Radek Skarnitzl (Institute of Phonetics, Faculty of Arts & Philosophy, Charles University in Prague, Czech Republic)
Pavel Machac (Institute of Phonetics, Faculty of Arts & Philosophy, Charles University in Prague, Czech Republic)
Jan Trmal (Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Czech Republic)

This paper presents initial experiments with the identification and automatic detection of parasitic sounds in speech signals. The main goal of this study is to identify such sounds in the source recordings for unit-selection-based speech synthesis systems and thus to avoid their unintended usage in synthesised speech. The first part of the paper describes the phonetic analysis and identification of parasitic phenomena in recordings of two Czech speakers. In the second part, experiments with the automatic detection of parasitic sounds using HMM-based and BVM classifiers are presented. The results are encouraging, especially those for glottalization phenomena.

#12Phonetic alignment for speech synthesis in under-resourced languages

Daniel Van Niekerk (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa AND School of Electrical, Electronic and Computer Engineering, North-West University, Potchefstroom, South Africa)
Etienne Barnard (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)

The rapid development of concatenative speech synthesis systems in resource scarce languages requires an efficient and accurate solution with regard to automated phonetic alignment. However, in this context corpora are often minimally designed due to a lack of resources and expertise necessary for large scale development. Under these circumstances many techniques toward accurate segmentation are not feasible and it is unclear which approaches should be followed. In this paper we investigate this problem by evaluating alignment approaches and demonstrating how these approaches can be applied to limit manual interaction while achieving acceptable alignment accuracy with minimal ideal resources.

#13Improving Initial Boundary Estimation for HMM-based Automatic Phonetic Segmentation

Udochukwu Kalu Ogbureke (School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)
Julie Carson-Berndsen (School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)

This paper presents an approach to boundary estimation for automatic segmentation of speech given a phone (sound) sequence. The technique presented represents an extension to existing approaches to Hidden Markov Model based automatic segmentation which modifies the topology of the model to control for duration. An HMM system trained with this modified topology places 77.10%, 86.72% and 91.15% of the boundaries, on the TIMIT speech test corpus annotations, within 10, 15 and 20 ms respectively as compared with manual annotations. This represents an improvement over the baseline results of 70.99%, 83.50% and 89.18% for initial boundary estimation.
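
The evaluation measure quoted in this abstract, the share of automatic boundaries falling within a tolerance of the manual ones, can be sketched as follows. This is an illustrative assumption about how such a figure might be computed, not the authors' scoring script: the function name is hypothetical, boundaries are assumed to be already paired one-to-one, and "within" is taken to include the tolerance itself.

```python
def boundary_accuracy(predicted_ms, reference_ms, tolerances_ms=(10, 15, 20)):
    """Percentage of boundaries whose absolute error is at most each tolerance.

    predicted_ms and reference_ms are equal-length sequences of boundary
    times in milliseconds, paired by position (an assumption).
    """
    if len(predicted_ms) != len(reference_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    if not predicted_ms:
        raise ValueError("at least one boundary is required")
    errors = [abs(p - r) for p, r in zip(predicted_ms, reference_ms)]
    return {
        tol: 100.0 * sum(1 for e in errors if e <= tol) / len(errors)
        for tol in tolerances_ms
    }


# Toy example with made-up boundary times (errors of 5, 12, 18 and 25 ms).
scores = boundary_accuracy([105, 212, 318, 425], [100, 200, 300, 400])
```

Reporting several tolerances at once, as the abstract does, shows whether a method mainly reduces gross errors or also tightens boundaries that were already close.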

Tue-Ses1-P4:
Speaker Recognition and Diarisation

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Sadaoki Furui

#1Importance of Nasality Measures for Speaker Recognition Data Selection and Performance Prediction

Howard Lei (International Computer Science Institute)
Eduardo Lopez-Gonzalo (Dep. of Signals, Systems and Radiocomm., Universidad Politecnica Madrid, Spain)

We improve upon measures relating feature vector distributions to speaker recognition (SR) performances for SR performance prediction and arbitrary data selection. In particular, we examine the means and variances of 11 features pertaining to nasality (resulting in 22 measures), computing them on feature vectors of phones to determine which measures give good SR performance prediction of phones. We found that the combination of nasality measures gives a 0.917 correlation with the Equal Error Rates (EERs) of phones on SRE08, exceeding the correlation of our previous best measure (mutual information) by 12.7%. When implemented in our data-selection scheme (which does not require an SR system to be run), the nasality measures allow us to select data with combined EER better than data selected via running an SR system in certain cases, at a fortieth of the computational costs. The nasality measures require a tenth of the computational costs compared to our previous best measure.

#2Exploration of Vocal Excitation Modulation Features for Speaker Recognition

Ning Wang (Department of Electronic Engineering, The Chinese University of Hong Kong)
P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)

To derive spectro-temporal vocal source features complementary to the conventional spectral-based vocal tract features in improving the performance and reliability of a speaker recognition system, the excitation-related modulation properties are studied. Through a multi-band demodulation method, source-related amplitude and phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of designed experiments on artificially generated inputs, and then by simulations on a speech corpus. It is observed via the designed experiments that the proposed features are capable of capturing the vocal differences in terms of F0 variation, pitch epoch shape, and relevant excitation details between epochs. In the simulations, by combination with the standard spectral features, both the amplitude and the phase-related features are shown to evidently reduce the identification error rate and equal error rate in the speaker recognition system.

#3Speaker Identification for Whispered Speech Using Modified Temporal Patterns and MFCCs

Xing Fan (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)
John H.L. Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)

Whisper is used by talkers intentionally in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences between neutral and whispered speech in the spectral structure. Therefore, the performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system for whispered speech. The m-TRAPs are introduced based on an explanation for the whisper/neutral mismatch degradation of an MFCC-based system. A phoneme-by-phoneme score weighting method is used to fuse the scores from each subband. Text-independent closed-set speaker ID was conducted, and the experiments show that m-TRAPs are especially efficient for whisper with low SNR. When combining the scores from both MFCC and TRAP GMMs, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCC baseline system. This result confirms a viable approach to improving speaker ID performance under neutral/whisper mismatch conditions.

#4Speaker Diarization for Meeting Room Audio

Hanwu Sun (Institute for Infocomm Research)
Tin Lay Nwe (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

This paper describes a speaker diarization system in 2007 NIST Rich Transcription (RT07) Meeting Recognition Evaluation for the task of Multiple Distant Microphone (MDM) in meeting room scenarios. The system includes three major modules: data preparation, initial speaker clustering and cluster purification/merging. The data preparation consists of the raw data Wiener filtering and beamforming, Time Difference of Arrival estimate and speech activity detection. Based on the initial processed data, two-stage histogram quantization has been used to perform the initial speaker clustering. A modified purification strategy via high-order GMM clustering method is proposed. BIC criterion is applied for cluster merging. The system achieves a competitive overall DER of 8.31% for RT07 MDM speaker diarization task.

#5Improving Speaker Segmentation via Speaker Identification and Text Segmentation

Runxin Li (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; Fakultät für Informatik, Universität Karlsruhe (TH), Germany)
Qin Jin (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)

Speaker segmentation is an essential part of a speaker diarization system. Common segmentation systems usually miss speaker change points when speakers switch fast. These errors seriously confuse the following speaker clustering step and result in high overall speaker diarization error rates. In this paper two methods are proposed to deal with this problem: the first approach uses speaker identification techniques to boost speaker segmentation, and the second applies text segmentation methods to improve the performance of speaker segmentation. Experiments on Quaero speaker diarization evaluation data show that our methods achieve up to 45% relative reduction in the speaker diarization error and 64% relative increase in the speaker change detection recall rate over the baseline system. Moreover, both approaches can be considered post-processing steps over the baseline segmentation; therefore, they can be applied in any speaker diarization system.

#6Overall performance metrics for multi-condition Speaker Recognition Evaluations

David van Leeuwen (TNO Human Factors)

In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials of a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet, Cllr, as well as the DET curve, from which EER and minCdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex and microphone type and speaking style.

#7Speaker Identification using Warped MVDR Cepstral Features

Matthias Wölfel (ZKM|Center for Art and Media, Germany)
Qian Yang (Universität Karlsruhe (TH), Germany)
Jin Qin (Carnegie Mellon University, USA)
Tanja Schultz (Universität Karlsruhe (TH), Germany)

It is common practice to use similar or even the same feature extraction methods for automatic speech recognition and speaker recognition. While the front-end for the former is required to preserve phoneme discrimination and to compensate for speaker differences to some extent, the front-end for the latter has to preserve the unique characteristics of individual speakers. It seems, therefore, contradictory to use the same feature extraction methods for both tasks. Starting out from the common practice, we propose to use warped minimum variance distortionless response (MVDR) cepstral coefficients, which have already been demonstrated to perform better in automatic speech recognition, in particular under adverse conditions. Replacing the widely used mel-frequency cepstral coefficients by warped MVDR cepstral coefficients improves the speaker identification accuracy by up to 24% relative. We found that the optimal choice of the model order within the warped MVDR framework differs between speech recognition and speaker recognition, confirming our intuition that the two different tasks indeed require different feature extraction strategies.

#8Entropy Based Overlapped Speech Detection as a Pre-Processing Stage for Speaker Diarization

Oshry Ben-Harush (Ben-Gurion University of the Negev)
Itshak Lapidot (Sami Shamoon College of Engineering)
Hugo Guterman (Ben-Gurion University of the Negev)

One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform under singular conditions or require high computational complexity in both time and frequency domains. In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied on two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.
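
As a rough illustration of a frame-based, time-domain entropy feature of the kind this abstract mentions, the sketch below computes the Shannon entropy of a histogram of sample amplitudes for each frame. The abstract does not specify the entropy estimator, so the histogram approach, the bin count, the amplitude range, and the function names are all assumptions; the GMM classification stage is not shown.

```python
import math


def frame_entropy(frame, num_bins=8, lo=-1.0, hi=1.0):
    """Shannon entropy (bits) of the amplitude histogram of one frame.

    Samples are assumed to lie in [lo, hi]; values outside are clipped
    into the first or last bin.
    """
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for x in frame:
        idx = int((x - lo) / width)
        idx = min(max(idx, 0), num_bins - 1)
        counts[idx] += 1
    total = len(frame)
    return -sum((c / total) * math.log2(c / total) for c in counts if c)


def framewise_entropy(signal, frame_len, hop):
    """Entropy of each full frame of length frame_len, advancing by hop samples."""
    return [
        frame_entropy(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# Toy signal: a constant frame followed by a frame spread over two bins.
features = framewise_entropy([0.5, 0.5, -0.9, 0.9], frame_len=2, hop=2)
```

A single scalar per frame like this could then be modelled by separate distributions for single-speaker and overlapped frames, which is one way to avoid a hand-set threshold.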

#9Speech Style and Speaker Recognition: a Case Study

Marco Grimaldi (School of Computer Science and Informatics, UCD, Dublin; FBK, via Sommarive 18, I-38100 Povo (Trento))
Fred Cummins (School of Computer Science and Informatics, UCD, Dublin)

This work presents an experimental evaluation of the effect of different speech styles on the task of speaker recognition. We make use of willfully altered voice extracted from the CHAINS corpus and methodically assess the effect of its use in a reference speaker identification and verification system. We contrast normal readings of text with two varieties of imitative styles and with the familiar, non-imitative, variant of fast speech. Furthermore, we test the applicability of a novel speech parameterization that has been suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients - pykfec.

#10The Majority Wins: a Method for Combining Speaker Diarization Systems

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting schema. The voting schema selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings the voting method improves our system on all evaluation conditions. For the single distant microphone condition, the DER performance is improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition the improvement is 3.6%.

#11Two-Wire Nuisance Attribute Projection

Yosef Solewicz (Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel)
Hagai Aronowitz (IBM Haifa Research Labs, Haifa 31905, Israel)

This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the interspeaker subspace. For this purpose, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks.
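
For orientation, the regular NAP operation that this abstract takes as its baseline removes the component of a supervector lying in a nuisance subspace, i.e. it applies P = I - U U^T for a matrix U with orthonormal columns. The sketch below shows only that standard projection, not the two-wire formulations derived in the paper; the function name and the toy vectors are assumptions, and estimating the nuisance directions from data is not shown.

```python
def nap_project(supervector, nuisance_directions):
    """Apply P = I - U U^T to a supervector.

    nuisance_directions is a list of mutually orthonormal vectors (the
    columns of U). Orthonormality is assumed and not checked here.
    """
    projected = list(supervector)
    for u in nuisance_directions:
        # Coefficient of the supervector along this nuisance direction.
        coeff = sum(ui * xi for ui, xi in zip(u, supervector))
        projected = [pi - coeff * ui for pi, ui in zip(projected, u)]
    return projected


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


# Toy 3-dimensional example with one unit-norm nuisance direction.
u = [0.6, 0.8, 0.0]
cleaned = nap_project([3.0, 4.0, 5.0], [u])
residual = dot(cleaned, u)  # close to zero after projection
```

After projection the supervector has no remaining component along any nuisance direction, so scoring becomes insensitive to variation along those directions.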

Tue-Ses1-P2:
Speech perception II

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Odette Scharenborg

#1THE EFFECT OF R-RESONANCE INFORMATION ON INTELLIGIBILITY

Antje Heinrich (Department of Linguistics, University of Cambridge, UK)
Sarah Hawkins (Department of Linguistics, University of Cambridge, UK)

We investigated the importance of phonetic information in preceding syllables for the intelligibility of minimally paired words containing /r/ or /l/. Target words were cross-spliced either into a different token of the same sentence (match) or into a sentence that was identical but originally uttered with the paired word (mismatch). Young and older adults heard the sentences in various background babbles. Matched phonetic information in syllables earlier in the sentence and in the syllable immediately preceding the target segment facilitated intelligibility for r- but not l-words. Despite hearing loss, older adults used this phonetic information as much as young listeners.

#2Perception of Temporal Cues at the Discourse Boundary

Hsin-Yi Lin (Ph.D Student of Graduate Institute of Linguistics, National Taiwan University)
Janice Fon (Assistant Professor of Linguistics, National Taiwan University)

This study investigates the role of temporal cues in perception at discourse boundaries. Target cues were penult lengthening, final lengthening, and pause duration. Results showed that different cues are weighted differently for different purposes: final lengthening is more important for subjects to detect boundaries, while pause duration is more responsible for cuing the sizes of boundaries.

#3Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models

Zhanyu Ma (Sound and Image Processing Lab, KTH - Royal Institute of Technology, Sweden)
Arne Leijon (Sound and Image Processing Lab, KTH - Royal Institute of Technology, Sweden)

With A-V recordings, ten normal hearing people took recognition tests at different signal-to-noise ratios (SNR). The A-V recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) for the recognition. As expected, all the models agree qualitatively with the results that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the A-V integration result, while the POSTL model underestimates it. Our automatic speech recognizers integrated the audio and visual stream efficiently. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.

#4Effects of tempo in radio commercials on young and elderly listeners

Hanny den Ouden (Utrecht University)
Hugo Quene (Utrecht University)

The aim of the present study is to investigate the effects of tempo manipulations in radio commercials on listeners’ evaluation, cognition and persuasion. Questionnaire scores from 131 young and 130 elderly listeners show effects of tempo manipulation on listeners’ subjective evaluation, but not on their cognitive scores. Tempo effects on persuasion scores are modulated by the listeners’ general disposition towards radio and radio commercials. In sum, it seems that not age but listeners’ general disposition is of importance in evaluating tempo manipulation of radio commercials.

#5Self-voice recognition in 4 to 5-year-old children

Sofia Strömbergsson (Department of Speech, Music and Hearing, KTH, Stockholm, Sweden)

Children’s ability to recognize their own recorded voice as their own was explored in a group of 4 to 5-year-old children. The task for the children was to identify which one of four voice samples represented their own voice. The results reveal that children perform well above chance level, and that a time span of 1-2 weeks between the recording and the identification does not affect the children’s performance. F0 similarity between the participant’s recordings and the reference recordings correlated with a higher error-rate. Implications for the use of recordings in speech and language therapy are discussed.

#6Are real tongue movements easier to speech read than synthesized?

Olov Engwall (Centre for Speech Technology, CSC, KTH, Stockholm, Sweden)
Preben Wik (Centre for Speech Technology, CSC, KTH, Stockholm, Sweden)

Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

#7Eliciting a hierarchical structure of human consonant perception task errors using Formal Concept Analysis

Carmen Peláez-Moreno (University Carlos III Madrid, Spain)
Ana Isabel García-Moral (University Carlos III Madrid, Spain)
Francisco José Valverde-Albacete (University Carlos III Madrid, Spain)

In this paper we have used Formal Concept Analysis to elicit a hierarchical structure of human consonant perception task errors. We have used the Native Listeners experiments provided for the Consonant Challenge session of Interspeech 2008 to analyse perception errors committed in relation to the place of articulation of the consonants being evaluated for one quiet and six noisy acoustic conditions.

#8Acoustic and Perceptual Effects of Vocal training in Amateur Male Singing

Takeshi Saitou (National Institute of Advanced Industrial Science and Technology (AIST))
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

This paper reports our investigation of the acoustical effects of vocal training for amateur singers and of the contribution of those effects to perceived vocal quality. Recording singing voices before and after vocal training and then analyzing changes in acoustic parameters with a focus on features unique to singing voices, we found that two different F0 fluctuations (vibrato and overshoot) and singing formant were improved by the training. The results of psychoacoustic experiments showed that perceived voice quality was influenced more by the changes of F0 characteristics than by the changes of spectral characteristics and that acoustic features unique to singing voices contribute to perceived voice quality in the following order: vibrato, singing formant, overshoot, and preparation.

Tue-Ses2-O1:
Automotive and Mobile applications

Time:Tuesday 13:30 Place:Main Hall Type:Oral
Chair:Kate Knill

13:30Fast Speech Recognition for Voice Destination Entry in a Car Navigation System

Hoon Chung (ETRI)
Jeon Gue Park (ETRI)
Hyeon Bae Jeon (ETRI)
Yun Keun Lee (ETRI)

In this paper, we introduce a multi-stage decoding algorithm optimized to recognize a very large number of entry names on a resource-limited embedded device. The multi-stage decoding algorithm is composed of a two-stage HMM-based coarse search and a detailed search. The two-stage HMM-based coarse search generates a small set of candidates that are assumed to contain a correct hypothesis with high probability, and the detailed search re-ranks the candidates by rescoring them with sophisticated acoustic models. We conduct experiments with one million point-of-interest (POI) names on an in-car navigation device with a fixed-point processor running at 620 MHz. The experimental results show that the multi-stage decoding algorithm runs at about 2.23 times real time on the device without serious degradation of recognition performance.

13:50Improving Perceived Accuracy for In-Car Media Search

Yun-Cheng Ju (Microsoft Research)
Michael Seltzer (Microsoft Research)
Ivan Tashev (Microsoft Research)

Speech recognition technology is prone to mistakes, but this is not the only source of errors that cause speech recognition systems to fail; sometimes the user simply does not utter the command correctly. Usually, user mistakes are not considered when a system is designed and evaluated. This creates a gap between the claimed accuracy of the system and the actual accuracy perceived by the users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of voice command to accommodate user mistakes while retaining a high percentage of the performance for queries with correct syntax. As a result, failures caused by user mistakes were reduced by an absolute 70% at the cost of a drop in accuracy of only 0.28%.

14:10Laying the Foundation for In-car Alcohol Detection by Speech

Florian Schiel (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München)
Christian Heinrich (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München)

The fact that an increasing number of functions in the automobile are and will be controlled by the speech of the driver raises the question of whether this speech input may be used to detect a possible alcoholic intoxication of the driver. To that end, a large part of the new Alcohol Language Corpus (ALC) edited by the Bavarian Archive of Speech Signals (BAS) will be used for a broad statistical investigation of possible feature candidates for classification. In this contribution we present the motivation and the design of the ALC corpus as well as first results from fundamental frequency and rhythm analysis. Our analysis, which compares sober and alcoholized speech of the same individuals, suggests that there are in fact promising features that can automatically be derived from the speech signal during the speech recognition process and will indicate intoxication for most speakers.

14:30A Voice Search Approach to Replying to SMS Messages in Automobiles

Yun-Cheng Ju (Microsoft Research)
Tim Paek (Microsoft Research)

Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates.
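
The retrieval step described above can be illustrated with a minimal sketch, assuming a plain term-frequency vector space and cosine similarity. The abstract does not say which term weighting the authors use, and the templates, function names and the example utterance below are hypothetical.

```python
import math
from collections import Counter

def tf_vector(text):
    """Bag-of-words term-frequency vector for a lowercased, whitespace-split text."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_templates(recognized, templates, k=5):
    """Return the k templates most similar to the (possibly errorful) ASR output."""
    query = tf_vector(recognized)
    ranked = sorted(templates, key=lambda t: cosine(query, tf_vector(t)), reverse=True)
    return ranked[:k]

# Hypothetical reply templates, invented for illustration only.
TEMPLATES = [
    "i am on my way",
    "call you later",
    "running late be there soon",
    "yes",
    "no thanks",
]

if __name__ == "__main__":
    # A misrecognized utterance can still retrieve the intended template.
    print(top_templates("i am on the way", TEMPLATES, k=2))
```

Returning a short ranked list rather than a single match mirrors the abstract's evaluation over the top five reply candidates.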

14:50Language Modeling for What-with-Where on GOOG-411

Charl van Heerden (Meraka Institute)
Johan Schalkwyk (Google Inc.)
Brian Strope (Google Inc.)

This paper describes the language modeling architectures and recognition experiments that enabled support of 'what-with-where' queries on GOOG-411. First we compare accuracy trade-offs between a single national business LM for business queries and using many small models adapted for particular cities. Experimental evaluations show that both approaches lead to comparable overall accuracy. Differences in the distributions of errors also lead to improvements from a simple combination. We then optimize variants of the national business LM in the context of combined business and location queries from the web, and finally evaluate these models on a recognition test from the recently fielded 'what-with-where' system.

15:10Very Large Vocabulary Voice Dictation for Mobile Devices

Jan Nouza (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)
Petr Cerva (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)
Jindrich Zdansky (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)

This paper deals with optimization techniques that can make very large vocabulary voice dictation applications deployable on recent mobile devices. We focus in particular on optimization of signal parameterization (frame rate, FFT calculation, fixed-point representation) and on efficient pruning techniques employed at the state and Gaussian mixture level. We demonstrate the applicability of the proposed techniques on the practical design of an embedded 255K-word discrete dictation program developed for Czech. Its real performance is comparable to a client-server version of the fluent dictation program implemented on the same mobile device.

Tue-Ses2-O2:
Prosody: production I

Time:Tuesday 13:30 Place:East Wing 1 Type:Oral
Chair:Fred Cummins

13:30Did you say a BLUE banana? The prosody of contrast and abnormality in Bulgarian and Dutch

Diana V. Dimitrova (University of Groningen)
Gisela Redeker (University of Groningen)
John C.J. Hoeks (University of Groningen)

In a production experiment on Bulgarian that was based on a previous study on Dutch [1], we investigated the role of prosody when linguistic and extra-linguistic information coincide or contradict. Speakers described abnormally colored fruits in conditions where contrastive focus and discourse relations were varied. We found that the coincidence of contrast and abnormality enhances accentuation in Bulgarian as it did in Dutch. Surprisingly, when both factors are in conflict, the prosodic prominence of abnormality often overruled focus accentuation in both Bulgarian and Dutch, though the languages also show marked differences.

13:50A Quantitative Study of F0 Peak Alignment and Sentence Modality

Hansjörg Mixdorff (BHT University of Applied Sciences, Berlin, Germany)
Hartmut Pfitzinger (University of Kiel, Germany)

The current study examines the relationship between prosodic accent labels assigned in the Kiel Corpus of Spontaneous Speech IV, Isačenko’s intoneme classes of the underlying accents and the associated parameters of the Fujisaki model. Among other findings, there is a close connection between early peaks and information intonemes, as well as late peaks and non-terminal intonemes. The majority of tokens within both intoneme classes, however, are associated with medial peaks. Precise analysis of alignment shows that accent command offset times for information intonemes are significantly earlier than for non-terminal intonemes. This suggests that the anchoring of the relevant tonal transition could be more important for separating different intonational categories than that of the F0 peak.

14:10Closely related languages, different ways of realizing focus

Szu-wei Chen (Graduate Institute of Linguistics, National Chung Cheng University, Taiwan)
Bei Wang (Institute of Chinese Minority Languages, Minzu University of China, China)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)

We investigated how focus was prosodically realized in Taiwanese, Taiwan Mandarin and Beijing Mandarin by monolingual and bilingual speakers. Acoustic analyses showed that all speakers raised pitch and intensity of focused words, but only Beijing Mandarin speakers lowered pitch and intensity of post-focus words. Cross-group differences in duration were mixed. When listening to stimuli from their own language groups, subjects from Beijing had over 80% focus recognition rate, while those from Taiwan had less than 70% recognition rate. This difference is mainly due to presence/absence of post-focus compression. These findings have implications for prosodic typology, language contact and bilingualism.

14:30Cross-variety Rhythm Typology in Portuguese

Plínio Barbosa (Speech Prosody Studies Group/Dep. of Linguistics/Inst.Est. Ling., Univ. of Campinas, Brazil)
Maria do Céu Viana (Center of Linguistics of the University of Lisbon, Portugal)
Isabel Trancoso (INESC-ID, Lisbon, Portugal)

This paper proposes a measure of speech rhythm based on the inference of the coupling strength between the syllable oscillator and the stress group oscillator of an underlying coupled oscillators model. This coupling is inferred from the linear regression between the stress group duration and the number of syllables within the group, as well as from the multiple linear regression between the same parameters and an estimate of phrase stress prominence. This technique is applied to compare the rhythmic differences between European and Brazilian Portuguese in two speaking styles and three speakers per variety. Compared with a syllable-sized normalised PVI, the findings suggest that the coupling strength better captures the perceptual effects of the speakers' renditions. Furthermore, it shows that stress group duration is much better predicted by adding phrase stress prominence to the regression.
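
As a rough illustration of the two quantities compared here, the sketch below fits the simple regression of stress group duration on syllable count and computes a normalised PVI over successive durations. How the coupling strength is derived from the regression coefficients is not given in the abstract, so the code stops at the fitted intercept and slope; the function names and the sample durations are assumptions.

```python
def linear_fit(x, y):
    """Ordinary least-squares fit y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

def npvi(durations):
    """Normalised Pairwise Variability Index over successive durations."""
    terms = [abs(a - b) / ((a + b) / 2) for a, b in zip(durations, durations[1:])]
    return 100 * sum(terms) / len(terms)

if __name__ == "__main__":
    # Hypothetical data: syllables per stress group and group duration in ms.
    syllables = [1, 2, 3, 4]
    group_ms = [310, 480, 720, 890]
    print(linear_fit(syllables, group_ms))
    # Hypothetical syllable durations in ms.
    print(npvi([120, 180, 90, 200]))
```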

14:50Pitch adaptation in different age groups: boundary tones versus global pitch

Marie Nilsenova (Tilburg University)
Marc Swerts (Tilburg University)
Veronique Houtepen (Tilburg University)
Heleen Dittrich (Tilburg University)

Linguistic adaptation is a process by which interlocutors adjust their production to their environment. In the context of human-computer interaction, past research showed that adult speakers adapt to computer speech in various manners but less is known about younger age groups. We report the results of three priming experiments in which children in different age groups interacted with a prerecorded computer voice. The goal of the experiments was to determine to what extent children copy the pitch properties of the interlocutor. Based on the dialogue model of Pickering & Garrod, we predicted that children would be more likely to adapt to pitch primes that were meaningful in the context (high or low boundary tone) compared to primes with no apparent functionality (global pitch manipulation). This prediction was confirmed by our data. Moreover, we observed a decreasing trend in adaptation in the older age groups compared to the younger ones.

15:10Backchannel-Inviting Cues in Task-Oriented Dialogue

Agustín Gravano (Department of Computer Science, Columbia University, New York, NY, USA)
Julia Hirschberg (Department of Computer Science, Columbia University, New York, NY, USA)

We examine backchannel-inviting cues --- distinct prosodic, acoustic and lexical events in the speaker's speech that tend to precede a short response produced by the interlocutor to convey continued attention --- in the Columbia Games Corpus, a large corpus of task-oriented dialogues. We show that the likelihood of occurrence of a backchannel increases quadratically with the number of cues conjointly displayed by the speaker. Our results are important for improving the coordination of conversational turns in interactive voice-response systems, so that systems can produce backchannels in appropriate places, and so that they can elicit backchannels from users in expected places.

Tue-Ses2-O3:
ASR: Spoken Language Understanding

Time:Tuesday 13:30 Place:East Wing 2 Type:Oral
Chair:Lin-shan Lee

13:30What's in an Ontology for Spoken Language Understanding

Silvia Quarteroni (University of Trento)
Giuseppe Riccardi (University of Trento)
Marco Dinarelli (University of Trento)

Current Spoken Language Understanding systems rely either on hand-written semantic grammars or on flat attribute-value sequence labeling. In both approaches, concepts and their relations (when modeled at all) are domain-specific, thus making it difficult to expand or port the domain model. To address this issue, we introduce: 1) a domain model based on an ontology where concepts are classified into either predicative or argumentative; 2) the modeling of relations between such concept classes in terms of classical relations as defined in lexical semantics. We study and analyze our approach on a corpus of customer care data, where we evaluate the coverage and relevance of the ontology for the interpretation of speech utterances (clean and noisy).

13:50A Fundamental Study of Shouted Speech for Acoustic-Based Security System

Hiroaki NANJO (Faculty of Science and Technology, Ryukoku University, Japan)
Hiroki MIKAMI (Faculty of Science and Technology, Ryukoku University, Japan)
Hiroshi KAWANO (Graduate School of Science and Engineering, Ritsumeikan University, Japan)
Takanobu NISHIURA (Graduate School of Science and Engineering, Ritsumeikan University, Japan)

A speech processing system for ensuring safety and security, namely an acoustic-based security system, is addressed. Focusing on indoor security such as school security, we study an advanced acoustic-based system which can discriminate emergency shouts from other speech events based on the understanding of speech events. In this paper, we describe fundamental results on shouted speech.

14:10Evaluating the Potential Utility of ASR N-Best Lists for Incremental Spoken Dialogue Systems

Timo Baumann (University of Potsdam)
Okko Buß (University of Potsdam)
Michaela Atterer (University of Potsdam)
David Schlangen (University of Potsdam)

The potential of using ASR n-best lists for dialogue systems has often been recognised (if less often realised): it is often the case that even when the top-ranked hypothesis is erroneous, a better one can be found at a lower rank. In this paper, we describe metrics for evaluating whether the same potential carries over to incremental dialogue systems, where ASR output is consumed and reacted upon while speech is still ongoing. We show that even small N can provide an advantage for semantic processing, at the cost of some computational overhead.

14:30Improving the Recognition of Names by Document-Level Clustering

Bin Zhang (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)
Wei Wu (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)
Jeremy G. Kahn (Department of Linguistics, University of Washington, Seattle, WA 98195, USA)
Mari Ostendorf (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)

Named entities are of great importance in spoken document processing, but speech recognizers often get them wrong because they are infrequent. A name correction method based on document-level name clustering is proposed in this paper, consisting of three components: named entity detection, name clustering, and name hypothesis selection. We compare the performance of this method to oracle conditions and show that the oracle gain is a 23% reduction in name character error for Mandarin and the automatic approach achieves about 20% of that.

14:50Robust dependency parsing for Spoken Language Understanding of spontaneous speech

FREDERIC BECHET (Universite d'Avignon)
ALEXIS NASR (LIF - CNRS / Universite Aix-Marseille)

We describe in this paper a syntactic parser for spontaneous speech geared towards the identification of verbal subcategorization frames. The parser proceeds in two stages. The first stage is based on generic syntactic resources for French. The second stage is a reranker which is specially trained for a given application. The parser is evaluated on the French MEDIA spoken dialogue corpus.

15:10Semantic Role Labeling with Discriminative Feature Selection for Spoken Language Understanding

Chao-Hong Liu (National Cheng Kung University, Tainan, TAIWAN)
Chung-Hsien Wu (National Cheng Kung University, Tainan, TAIWAN)

In the task of Spoken Language Understanding (SLU), Intent Classification techniques have been applied to different domains of Spoken Dialog Systems (SDS). Recently it was shown that intent classification performance can be improved with Semantic Role (SR) information. However, using SR information for SDS encounters two difficulties: 1) the state-of-the-art Automatic Speech Recognition (ASR) systems provide less than 80% recognition rate, 2) speech always exhibits ungrammatical expressions. This study presents an approach to Semantic Role Labeling (SRL) with discriminative feature selection to improve the performance of SDS. Bernoulli event features on word and part-of-speech sequences are introduced for better representation of the ASR recognized text. SRL and SLU experiments conducted using CoNLL-2005 SRL corpus and ATIS spoken corpus show that the proposed feature selection method with Bernoulli event features can improve intent classification by 3.4% and the performance of SRL.

Tue-Ses2-O4:
Speaker Diarisation

Time:Tuesday 13:30 Place:East Wing 3 Type:Oral
Chair:Douglas Reynolds

13:30A STUDY OF NEW APPROACHES TO SPEAKER DIARIZATION

Douglas Reynolds (MIT Lincoln Laboratory)
Patrick Kenny (CRIM)
Fabio Castaldo (Politecnico di Torino)

This paper reports on work carried out at the 2008 JHU Summer Workshop examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using summed-channel telephone data from the 2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker factors and classic segmentation and clustering, and a new hybrid system combining the baseline system with a new cosine-distance speaker factor clustering. Results are presented using the Diarization Error Rate as well as by the EER when using diarization outputs for a speaker detection task. The best configurations of the diarization system produced DERs of 3.5-4.6% and we demonstrate a weak correlation of EER and DER.

13:50REDEFINING THE BAYESIAN INFORMATION CRITERION FOR SPEAKER DIARISATION

Themos Stafylakis (Institute for Language and Speech Processing, National Technical University of Athens)
Vassilis Katsouros (Institute for Language and Speech Processing)
George Carayannis (Institute for Language and Speech Processing, National Technical University of Athens)

A novel approach to the Bayesian Information Criterion (BIC) is introduced. The new criterion redefines the penalty terms of the BIC, such that each parameter is penalized with the effective sample size it is trained with. Contrary to Local-BIC, the proposed criterion scores overall clustering hypotheses and therefore is not restricted to hierarchical clustering algorithms. Contrary to Global-BIC, it provides a local dissimilarity measure that depends only on the statistics of the examined clusters and not on the overall sample size. We tested our criterion with two benchmark tests and found significant improvement in performance in the speaker diarisation task.
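
For context, the conventional local delta-BIC that the proposed criterion is contrasted with can be sketched as follows. This is the standard baseline, not the redefined criterion of the paper, whose penalty terms are not spelled out in the abstract. The sketch assumes single 1-D Gaussian segment models with non-zero variance; the function names and the tuning weight `lam` are illustrative.

```python
import math

def variance(xs):
    """Maximum-likelihood (biased) variance of a 1-D sample."""
    m = sum(xs) / len(xs)
    return sum((x - m) ** 2 for x in xs) / len(xs)

def delta_bic(seg_a, seg_b, lam=1.0):
    """Conventional local delta-BIC for 1-D Gaussian segment models.

    Positive values favour modelling the two segments separately
    (different speakers); negative values favour merging them.
    Both segments must have non-zero variance.
    """
    n_a, n_b = len(seg_a), len(seg_b)
    n = n_a + n_b
    gain = 0.5 * (n * math.log(variance(seg_a + seg_b))
                  - n_a * math.log(variance(seg_a))
                  - n_b * math.log(variance(seg_b)))
    # A second Gaussian adds 2 parameters in 1-D (mean and variance),
    # and the penalty uses the pooled sample size n.
    penalty = lam * 0.5 * 2 * math.log(n)
    return gain - penalty

if __name__ == "__main__":
    # Hypothetical 1-D features for two well-separated segments.
    a = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
    b = [5.0, 5.1, 4.9, 5.05, 4.95] * 4
    print(delta_bic(a, b))  # positive: keep the segments apart
    print(delta_bic(a, a))  # negative: merge
```

Note that the penalty here depends on the pooled sample size of the two segments, which is the term the abstract says is redefined per parameter.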

14:10Speaker Diarization Using Divide-and-Conquer

Shih-Sian Cheng (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
Chun-Han Tseng (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Hsin-Min Wang (Institute of Information Science, Academia Sinica, Taipei, Taiwan)

Speaker diarization systems consist of two core components: speaker segmentation and speaker clustering. The current state-of-the-art speaker diarization systems usually apply hierarchical agglomerative clustering (HAC) for speaker clustering after segmentation. However, HAC's quadratic computational complexity with respect to the number of data samples inevitably limits its application in large-scale data sets. In this paper, we propose a divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams, performs diarization on them separately, and then combines the diarization results obtained from them using HAC. The experiment results show that the proposed framework is faster than the conventional segmentation and clustering-based approach while achieving comparable diarization accuracy. Moreover, the proposed framework obtains a higher speedup over the conventional approach on a larger test data set.
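
The recursive structure described above can be sketched as follows. This is a toy illustration only: segments are reduced to hypothetical 1-D feature values, clusters are compared by centroid distance with a fixed stopping threshold, and `leaf_size` is an assumed parameter. None of these choices come from the paper, and the toy does not reproduce its speedup.

```python
def centroid(cluster):
    """Mean of the 1-D feature values in a cluster."""
    return sum(cluster) / len(cluster)

def hac(clusters, threshold):
    """Agglomerative clustering: repeatedly merge the two closest clusters
    (by centroid distance) until no pair is closer than `threshold`."""
    clusters = [list(c) for c in clusters]
    while len(clusters) > 1:
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = abs(centroid(clusters[i]) - centroid(clusters[j]))
                if best is None or d < best[0]:
                    best = (d, i, j)
        d, i, j = best
        if d > threshold:
            break
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

def dac_diarize(segments, threshold, leaf_size=4):
    """Split the stream in two, diarize each half, then combine with HAC."""
    if len(segments) <= leaf_size:
        return hac([[s] for s in segments], threshold)
    mid = len(segments) // 2
    left = dac_diarize(segments[:mid], threshold, leaf_size)
    right = dac_diarize(segments[mid:], threshold, leaf_size)
    # The final HAC runs over a few sub-stream clusters, not all raw segments.
    return hac(left + right, threshold)

if __name__ == "__main__":
    # Hypothetical stream with two speakers around 0.0 and 5.0.
    stream = [0.0, 0.1, 5.0, 5.1, 0.2, 4.9, 0.05, 5.05, 0.15, 4.95]
    print(dac_diarize(stream, threshold=1.0))
```

The combining step only sees the clusters returned by each half, which is where the saving over running quadratic HAC on every segment at once would come from.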

14:30KL Realignment for Speaker Diarization with Multiple Feature Streams

Deepu Vijayasenan (Idiap Research Institute, 1920 Martigny, CH)
Fabio Valente (Idiap Research Institute, 1920 Martigny, CH)
Herve Bourlard (Idiap Research Institute, 1920 Martigny, CH)

This paper investigates the use of Kullback-Leibler (KL) divergence based realignment with application to speaker diarization. KL divergence based realignment operates directly on the speaker posterior distribution estimates and is compared with traditional realignment performed using an HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than Gaussian mixture models in the case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. They reveal that with conventional MFCC features the two approaches have the same performance, while the KL based system outperforms the HMM/GMM re-alignment when multiple feature streams (MFCC and TDOA) are combined. Furthermore we discuss the possible extension to other feature sets.
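
A minimal sketch of the underlying idea, assuming each frame and each speaker is represented by a discrete posterior distribution over the same set of components: every frame is relabelled with the speaker whose distribution is closest in KL divergence. The abstract does not describe the actual realignment procedure (for example any minimum-duration constraint), so this frame-wise assignment and all names in it are assumptions.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions given as equal-length lists."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)

def realign(frame_posteriors, speaker_models):
    """Assign each frame to the speaker whose distribution is closest in KL."""
    labels = []
    for p in frame_posteriors:
        scores = [kl_divergence(p, q) for q in speaker_models]
        labels.append(scores.index(min(scores)))
    return labels

if __name__ == "__main__":
    # Hypothetical posteriors over three components.
    speakers = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
    frames = [[0.7, 0.2, 0.1], [0.05, 0.15, 0.8], [0.9, 0.05, 0.05]]
    print(realign(frames, speakers))
```

Because the comparison is made between posterior distributions, streams with very different raw statistics (such as MFCC and TDOA) end up on a common scale before they are combined.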

14:50Speech Overlap Detection in a Two-Pass Speaker Diarization System

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations.

15:10Improved Speaker Diarization of Meeting Speech with Recurrent Selection of Representative Speech Segments and Participant Interaction Pattern Modeling

Kyu Han (University of Southern California)
Shrikanth Narayanan (University of Southern California)

In this work we describe two distinct novel improvements to our speaker diarization system, previously proposed for analysis of meeting speech. The first approach focuses on recurrent selection of representative speech segments for speaker clustering while the other is based on participant interaction pattern modeling. The former selects speech segments with high relevance to speaker clustering, especially from a robust cluster modeling perspective, and keeps updating them throughout clustering procedures. The latter statistically models conversation patterns between meeting participants and applies it as a priori information when refining diarization results. Experimental results reveal that the two proposed approaches provide performance enhancement by 29.82% (relative) in terms of diarization error rate in tests on 13 meeting excerpts from various meeting speech corpora.

Tue-Ses2-P4:
Robust Automatic Speech Recognition I

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster

#1Optimization of Dereverberation Parameters based on Likelihood of Speech Recognizer

Randy Gomez (Kyoto University)
Tatsuya Kawahara (Kyoto University)

Speech recognition under reverberant conditions is a difficult task. Most dereverberation techniques used to address this problem enhance the reverberant waveform independently of the speech recognizer. In this paper, we improve the conventional Spectral Subtraction-based (SS) dereverberation technique. In our proposed approach, the dereverberation parameters are optimized to improve the likelihood of the acoustic model. The system is capable of adaptively fine-tuning these parameters jointly with acoustic model training. Additional optimization is also implemented during decoding of the test utterances. We have evaluated the method using real reverberant data, and experimental results show that it significantly improves the recognition performance over the conventional approach.
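
For readers unfamiliar with spectral subtraction, the generic per-bin rule can be sketched as below, with an over-subtraction factor `alpha` and a spectral floor `beta`. These are the kind of parameters such a scheme exposes for tuning; which parameters the paper optimizes and how it estimates the late-reverberation power are not stated in the abstract, so both names and all values here are assumptions.

```python
def spectral_subtract(observed_power, interference_power, alpha=1.0, beta=0.01):
    """Per-bin power spectral subtraction with a spectral floor.

    observed_power:     power spectrum of the reverberant frame
    interference_power: estimated power of the component to remove
                        (for dereverberation, the late reverberation)
    alpha:              over-subtraction factor
    beta:               floor, as a fraction of the observed power
    """
    return [max(y - alpha * n, beta * y)
            for y, n in zip(observed_power, interference_power)]

if __name__ == "__main__":
    # Hypothetical two-bin frame: the second bin is clamped to the floor.
    print(spectral_subtract([10.0, 2.0], [4.0, 3.0], alpha=1.0, beta=0.1))
```

In a likelihood-driven setup such as the one described, candidate values of the subtraction parameters would be scored by the recognizer's acoustic model rather than by a waveform-level criterion.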

#2Application of noise robust MDT speech recognition on the SPEECON and SpeechDat-Car databases

Jort Florent Gemmeke (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands)
Yujun Wang (ESAT Department, Katholieke Universiteit Leuven, Belgium)
Maarten Van Segbroeck (ESAT Department, Katholieke Universiteit Leuven, Belgium)
Bert Cranen (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands)
Hugo Van hamme (ESAT Department, Katholieke Universiteit Leuven, Belgium)

We show that the recognition accuracy of an MDT recognizer which performs well on artificially noisified data deteriorates rapidly under realistic noisy conditions (using multiple microphone recordings from the SPEECON/SpeechDat-Car databases) and is outperformed by a commercially available recognizer which was trained using a multi-condition paradigm. Analysis of the recognition results indicates that the recording channels with the lowest SNRs, where the MDT recognizer fails most, are also the channels which suffer most from room reverberation. Despite the channel compensation measures we took, it appears difficult to maintain the restorative power of MDT in such non-additive noise conditions.

#3Model based feature enhancement for automatic speech recognition in reverberant environments

Alexander Krueger (University of Paderborn)
Reinhold Haeb-Umbach (University of Paderborn)

In this paper we present a new feature space dereverberation technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the log-mel spectral domain on the non-reverberant speech features and the room impulse response. The obtained observation model is used for a model based speech enhancement based on Kalman filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our currently best configuration, which includes uncertainty decoding, the number of recognition errors is approximately halved compared to the recognition of unprocessed speech.

#4A study of mutual front-end processing method based on statistical model for noise robust speech recognition

Masakiyo Fujimoto (NTT Communication Science Laboratories, NTT Corporation)
Kentaro Ishizuka (NTT Communication Science Laboratories, NTT Corporation)
Tomohiro Nakatani (NTT Communication Science Laboratories, NTT Corporation)

This paper addresses robust front-end processing for automatic speech recognition (ASR) in noise. Accurate recognition of corrupted speech requires noise robust front-end processing, e.g., voice activity detection (VAD) and noise suppression (NS). Typically, VAD and NS are combined as one-way processing, and are developed independently. However, VAD and NS should not be assumed to be independent techniques, because sharing each other's information is important for the improvement of front-end processing. Thus, we investigate mutual front-end processing by integrating VAD and NS, which can beneficially share each other's information. In an evaluation on a concatenated speech corpus, the CENSREC-1-C database, the proposed method improves the performance of both VAD and ASR compared with the conventional method.

#5Integrating Codebook and Utterance Information in Cepstral Statistics Normalization Techniques for Robust Speech Recognition

Guan-min He (National Chi Nan University)
Jeih-weih Hung (National Chi Nan University)

Cepstral statistics normalization techniques have been shown to be very successful at improving the noise robustness of speech features. This paper proposes a hybrid-based scheme to achieve a more accurate estimate of the statistical information of features in these techniques. By properly integrating codebook and utterance knowledge, the resulting hybrid-based approach significantly outperforms conventional utterance-based, segment-based and codebook-based approaches in noisy environments. Furthermore, the high-performance CS-HEQ can be implemented with a short delay and can thus be applied in real-time online systems.
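As a point of reference for the conventional utterance-based approach mentioned above, the following is a minimal sketch of per-utterance cepstral mean and variance normalization. It is not the hybrid scheme proposed in the paper; the function name `utterance_cmvn` and the list-of-frames input layout are assumptions made for illustration.

```python
import math


def utterance_cmvn(frames):
    """Normalize each cepstral dimension to zero mean and unit variance.

    `frames` is a list of frames, each a list of cepstral coefficients
    (hypothetical layout). Statistics are estimated from this utterance only.
    """
    n = len(frames)
    if n == 0:
        return []
    dims = len(frames[0])

    # Per-dimension mean over the utterance.
    means = [sum(f[d] for f in frames) / n for d in range(dims)]

    # Per-dimension standard deviation; fall back to 1.0 for constant dimensions
    # so that the division below is always defined.
    stds = []
    for d in range(dims):
        var = sum((f[d] - means[d]) ** 2 for f in frames) / n
        stds.append(math.sqrt(var) if var > 0 else 1.0)

    return [[(f[d] - means[d]) / stds[d] for d in range(dims)] for f in frames]
```

Because the statistics come from a single utterance, short utterances give noisy estimates, which is the kind of limitation that codebook-based or hybrid estimates aim to address.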

#6Reduced Complexity Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments

Hynek Boril (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A)
John H.L. Hansen (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A)

Speech signal corruption by background noise, microphone channel variations, and speech production adjustments introduced by speakers in an effort to communicate efficiently over noise (Lombard effect) severely impact automatic speech recognition (ASR) performance. Recently, a set of unsupervised techniques reducing ASR sensitivity to these sources of distortion has been presented. In this study, a scheme utilizing a set of speech-in-noise Gaussian mixture models and a neutral/LE classifier is shown to substantially decrease the computational load of the compensations (from 14 to 2–4 ASR decoding passes) while preserving the performance. In addition, an extended codebook capturing multiple environmental noises is introduced and shown to improve ASR in changing environments. The evaluation is conducted on samples from the Czech Lombard Speech Database (CLSD'05) presented in different levels of background car noise and Aurora 2 noises.

#7Unsupervised Training Scheme with Non-Stereo Data for Empirical Feature Vector Compensation

Luis Buera (I3A, University of Zaragoza)
Antonio Miguel (I3A, University of Zaragoza)
Alfonso Ortega (I3A, University of Zaragoza)
Eduardo Lleida (I3A, University of Zaragoza)
Richard Stern (Carnegie Mellon University)

In this paper, a novel training scheme based on unsupervised and non-stereo data is presented for Multi-Environment Model-based LInear Normalization (MEMLIN) and MEMLIN with cross-probability model based on GMMs (MEMLIN-CPM). Both are data-driven feature vector normalization techniques which have proved very effective in dynamic noisy acoustic environments. However, techniques of this kind usually require stereo data in a previous training phase, which could be an important limitation in real situations. To compensate for this drawback, we present an approach based on the ML criterion and Vector Taylor Series (VTS). Experiments have been carried out with Spanish SpeechDat Car, reaching consistent improvements: 48.7% and 61.9% when the novel training process is applied over MEMLIN and MEMLIN-CPM, respectively.

#8Incremental Adaptation with VTS and Joint Adaptively Trained Systems

Federico Flego (Cambridge University)
Mark Gales (Cambridge University)

Recently, adaptive training schemes using model based compensation approaches such as VTS and JUD have been proposed. Adaptive training allows the use of multi-environment training data whilst training a neutral, "clean" acoustic model. This paper describes and assesses the advantages of using incremental, rather than batch, mode adaptation with these adaptively trained systems. Incremental adaptation reduces the latency during recognition, and has the possibility of reducing the error rate for slowly varying noise. The work is evaluated on a large scale multi-environment training configuration targeted at in-car speech recognition. Results on in-car collected test data indicate that incremental adaptation is an attractive option when using these adaptively trained systems.

#9Target Speech GMM-based Spectral Compensation for Noise Robust Speech Recognition

Takahiro Shinozaki (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

To improve speech recognition performance in adverse conditions, a noise compensation method is proposed that applies a transformation in the spectral domain whose parameters are optimized based on the likelihood of a speech GMM modeled in the feature domain. The idea is that additive and convolutional noises have mathematically simple expressions in the spectral domain, while speech characteristics are better modeled in a feature domain such as MFCC. The proposed method works as a feature extraction front-end that is independent of the decoding engine, and is able to compensate for non-stationary additive and convolutional noises with a short time delay. It includes spectral subtraction as a special case when no parameter optimization is performed. Experiments were performed using the AURORA-2J database. It has been shown that significantly higher recognition performance is obtained by the proposed method than by spectral subtraction.

#10Noise-Robust Feature Extraction Based on Forward Masking

Sheng-Chiuan Chiou (Department of Computer Science and Engineering, National Sun Yat-sen University)
Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University)

Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding stronger masker. In this paper, we postulate the mechanism of forward masking to be synaptic adaptation and temporal integration, and incorporate them in the feature extraction process of an automatic speech recognition system to improve noise-robustness. The synaptic adaptation is implemented by a highpass filter, and the temporal integration is implemented by a bandpass filter. We apply both filters in the domain of log mel-spectrum. On the Aurora 3 tasks, we evaluate three modified mel-frequency cepstral coefficients: synaptic adaptation only, temporal integration only, and both synaptic adaptation and temporal integration. Experiments show that the overall improvement is 16.1%, 21.8%, and 26.2% respectively in the three cases over the baseline.

Tue-Ses2-P1:
Speech Analysis and Processing II

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Aladdin Ariyaeeinia

#2Spectral and Temporal Modulation Features for Phonetic Recognition

Stephen Zahorian (Binghamton University)
Hongbing Hu (Binghamton University)
Zhengqing Chen (Binghamton University)
Jiang Wu (Binghamton University)

Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information. Several variations of the DCT/DCS features were evaluated with phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). Best results obtained with a combined feature set are 73.8% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more noise robust.

#3Use of Harmonic Phase Information for Polarity Detection in Speech Signals

Ibon Saratxaga (University of the Basque Country)
Daniel Erro (University of the Basque Country)
Inmaculada Hernáez (University of the Basque Country)
Iñaki Sainz (University of the Basque Country)
Eva Navas (University of the Basque Country)

Phase information resulting from the harmonic analysis of speech can be very successfully used to determine the polarity of a voiced speech segment. In this paper we present two algorithms which calculate the signal polarity from this information. One is based on the effect of the glottal signal on the phase of the first harmonics and the other on the relative phase shifts between the harmonics. The detection rates of these two algorithms are compared against other established algorithms.

#4Finite Mixture Spectrogram Modeling for Multipitch Tracking Using A Factorial Hidden Markov Model

Michael Wohlmayr (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria)

In this paper, we present a simple and efficient feature modeling approach for tracking the pitch of two speakers speaking simultaneously. We model the spectrogram features using Gaussian Mixture Models (GMMs) in combination with the Minimum Description Length (MDL) model selection criterion. This makes it possible to automatically determine the number of Gaussian components depending on the available data for a specific pitch pair. A factorial hidden Markov model (FHMM) is applied for tracking. We compare our approach to two methods based on correlogram features. Those methods use either an HMM or an FHMM for tracking. Experimental results on the Mocha-TIMIT database show that our proposed approach significantly outperforms the correlogram-based methods for speech utterances mixed at 0 dB. The superior performance even holds when adding white Gaussian noise to the mixed speech utterances during pitch tracking.

#5Group-Delay-Deviation Based Spectral Analysis of Speech

Anthony Stark (Griffith University)
Kuldip Paliwal (Griffith University)

In this paper, we investigate a new method for extracting useful information from the group delay spectrum of speech. The group delay spectrum is often poorly behaved and noisy. In the literature, various methods have been proposed to address this problem. However, to make the group delay a more tractable function, these methods have typically relied upon some modification of the underlying speech signal. The method proposed in this paper does not require such modifications. To accomplish this, we investigate a new function derived from the group delay spectrum, namely the group delay deviation. We use it for both narrowband analysis and wideband analysis of speech and show that this function exhibits meaningful formant and pitch information.

#6Speaker Dependent Mapping for Low Bit Rate Coding of Throat Microphone Speech

Anand Joseph Xavier Medabalimi (International Institute of Information Technology, Hyderabad, India)
Yegnanarayana Bayya (International Institute of Information Technology, Hyderabad, India)
Sanjeev Gupta (Center for Artificial Intelligence and Robotics, Bangalore, India)
Kesheorey R M (Center for Artificial Intelligence and Robotics, Bangalore, India)

Throat microphones (TM), which are robust to background noise, can be used in environments with high levels of background noise. However, speech collected using a TM is perceptually less natural. The objective of this paper is to map the spectral features (represented in the form of cepstral features) of TM and close speaking microphone (CSM) speech to improve the former's perceptual quality, and to represent it in an efficient manner for coding. The spectral mapping of TM and CSM speech is done using a multilayer feed-forward neural network, which is trained from features derived from TM and CSM speech. The sequence of estimated CSM spectral features is quantized and coded as a sequence of codebook indices using vector quantization. The sequence of codebook indices, the pitch contour and the energy contour derived from the TM signal are used to store/transmit the TM speech information efficiently. At the receiver, the all-pole system corresponding to the estimated CSM spectral vectors is excited by a synthetic residual to generate the speech signal.

#7Analysis of Lombard Speech using Excitation Source Information

Bapineedu Gummadi (International Institute of Information Technology, Hyderabad, India)
Avinash Boppay (International Institute of Information Technology, Hyderabad, India)
Suryakanth. V. Gangashetty (International Institute of Information Technology, Hyderabad, India)
Yegnanarayana Bayya (International Institute of Information Technology, Hyderabad, India)

This paper examines the Lombard effect on the excitation features in speech production. These features correspond mostly to the acoustic features at subsegmental (< pitch period) level. The instantaneous fundamental frequency F0 (i.e., pitch), the strength of excitation at the instants of significant excitation and a loudness measure reflecting the sharpness of the impulse-like excitation around epochs are used to represent the excitation features at the subsegmental level. The Lombard effect influences the pitch and the loudness. The extent of the Lombard effect on speech depends on the nature and level (or intensity) of the external feedback that causes the Lombard effect.

#8A Comparison of Linear and Nonlinear Dimensionality Reduction Methods Applied to Synthetic Speech

Andrew Errity (School of Computing, Dublin City University)
John McKenna (School of Computing, Dublin City University)

In this study a number of linear and nonlinear dimensionality reduction methods are applied to high dimensional representations of synthetic speech to produce corresponding low dimensional embeddings. Several important characteristics of the synthetic speech, such as formant frequencies and f0, are known and controllable prior to dimensionality reduction. The degree to which these characteristics are retained after dimensionality reduction is examined in visualisation and classification experiments. Results of these experiments indicate that each method is capable of discovering meaningful low dimensional representations of synthetic speech and that the nonlinear methods may outperform linear methods in some cases.

#9ZZT-domain Immiscibility of the Opening and Closing Phases of the LF GFM under Frame Length Variations

Christian Fischer Pedersen (Dept. of Electronic Systems, Aalborg University, Denmark)
Ove Andersen (Dept. of Electronic Systems, Aalborg University, Denmark)
Paul Dalsgaard (Dept. of Electronic Systems, Aalborg University, Denmark)

Current research has proposed a non-parametric speech waveform representation (rep) based on zeros of the z-transform (ZZT)[1][2]. Empirically, the ZZT rep has successfully been applied in discriminating the glottal and vocal tract components in pitch-synchronously windowed speech by using the unit circle (UC) as discriminant[1][2]. Further, similarity between ZZT reps of windowed speech, glottal flow waveforms, and waveforms of glottal flow opening and closing phases has been demonstrated[1][3]. Therefore, the underlying cause of the separation on either side of the UC can be analyzed via the ZZT reps of the opening and closing phase waveforms; the waveforms are generated by the LF glottal flow model (GFM)[1]. The present paper demonstrates this cause and effect analytically and thereby supplements the previous empirical works. Moreover, this paper demonstrates that immiscibility is variant under changes in frame lengths; lengths that maximize or minimize immiscibility are presented.

#10Dimension Reducing of LSF parameters Based on Radial Basis Function Neural Network

Hongjun Sun (+86-010-62632269)
Jianhua Tao (+86-010-62632269)
Huibin Jia (+86-010-62632269)

In this paper, we investigate a novel method for transforming line spectral frequency (LSF) parameters to lower dimensional coefficients. A radial basis function neural network (RBF NN) based transforming model is used to fit LSF vectors. In the training process, two criteria, mean squared error and weighted mean squared error, are used to measure the distance between the original vector and the approximate vector. In addition, features of LSF parameters are taken into account to supervise the training process. As a result, LSF vectors are represented by the coefficient vectors of the transforming model. The experimental results reveal that a 24-order LSF vector can be transformed to a 15-dimension coefficient vector with an average spectral distortion of approximately 1 dB. Subjective evaluation shows that the transform method in this paper does not lead to a significant decrease in voice quality.

#11Characterizing Speaker Variability Using Spectral Envelopes of Vowel Sounds

Harish Arsikere (Indian Institute of Technology - Kanpur)
Rama Sanand Doddipatla (Indian Institute of Technology - Kanpur)
Srinivasan Umesh (Indian Institute of Technology - Kanpur)

In this paper, we present a study to understand the relation between spectra of speakers enunciating the same sound and to investigate the issue of uniform versus non-uniform scaling. There is a lot of interest in understanding this relation as speaker variability is a major source of concern in many applications including Automatic Speech Recognition (ASR). Using dynamic programming, we find mapping relations between smoothed spectral envelopes of speakers enunciating the same sound and show that these relations are not linear but have a consistent non-uniform behavior. This non-uniform behavior is also shown to vary across vowels. Through a series of experiments, we show that using the observed non-uniform relation provides better vowel normalization than just a simple linear scaling relation. All results in this paper are based on vowel data from TIMIT, Hillenbrand et al. and North Texas databases.

#12Analysis of band structures for speaker-specific information in FM feature extraction

Tharmarajah Thiruvaran (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))
Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))
Julien Epps (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))

Frequency modulation (FM) features are typically extracted using a filterbank, usually based on an auditory frequency scale; however, there is psychophysical evidence to suggest that this scale may not be optimal for extracting speaker-specific information. In this paper, speaker-specific information in FM features is analyzed as a function of the filterbank structure at feature, model and classification stages. Scatter matrix based separation measures at the feature level and Kullback-Leibler distance based measures at the model level are used to analyze the discriminative contributions of the different bands. Then a series of speaker recognition experiments are performed to study how each band of the FM feature contributes to speaker recognition. Then a new filterbank structure is proposed that attempts to maximize the speaker-specific information in the FM feature for telephone data. Finally, the distribution of speaker-specific information is analyzed for wideband speech.

#13Artificial Nasalization of Speech Sounds Based on Pole-Zero Models of Spectral Relations between Mouth and Nose Signals

Karl Schnell (Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany)
Arild Lacroix (Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany)

In this contribution, a method for nasalization of speech sounds is proposed based on model-based spectral relations between mouth and nose signals. For that purpose, the mouth and nose signals of speech utterances are recorded simultaneously. The spectral relations of the mouth and nose signals are modeled by pole-zero models. A filtering of non-nasalized speech signals by these pole-zero models yields approximately nasal signals, which can be utilized to nasalize the speech signals. The artificial nasalization can be exploited to modify speech units of a non-nasalized or weakly nasalized representation which should be nasalized due to coarticulation or for the production of foreign words.

#14Error metrics for impaired auditory nerve responses of different phoneme groups

Andrew Hines (Trinity College Dublin)
Naomi Harte (Trinity College Dublin)

An auditory nerve model allows faster investigation of new signal processing algorithms for hearing aids. This paper presents a study of the degradation of auditory nerve (AN) responses at a phonetic level for a range of sensorineural hearing losses and flat audiograms. The AN model of Zilany & Bruce was used to compute responses to a diverse set of phoneme rich sentences from the TIMIT database. The characteristics of both the average discharge rate and spike timing of the responses are discussed. The experiments demonstrate that a mean absolute error metric provides a useful measure of average discharge rates but a more complex measure is required to capture spike timing response errors.

#15Feature Extraction for Detecting Stop Consonants in Continuous Speech

Chi-Yueh Lin (National Tsing Hua University, Hsinchu, Taiwan)
Hsiao-Chuan Wang (National Tsing Hua University, Hsinchu, Taiwan)

A stop consonant is a highly non-stationary signal, distinct from other phonetic classes in possessing some particular acoustic characteristics. How to model its prominent acoustic landmark and detect it effectively in continuous speech have been challenging tasks for years. In this paper, an approach using the two-dimensional discrete cosine transform (2D-DCT) to encode its burst portion in the spectro-temporal domain is suggested. An emerging machine learning approach, random forest, uses the derived features to locate stop bursts in continuous speech. A series of experimental results demonstrate that our suggested approach has promising performance.

Tue-Ses2-P3:
ASR: Decoding and Confidence Measures

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Kai Yu

#1Incremental composition of static decoding graphs

Miroslav Novak (IBM T.J. Watson Research Center)

A fast, scalable and memory-efficient method for static decoding graph construction is presented. As an alternative to the traditional transducer-based approach, it is based on incremental composition. Memory efficiency is achieved by combining composition, determinization and minimization into a single step, thus eliminating large intermediate graphs. We have previously reported the use of incremental composition limited to grammars and left cross-word context. Here, this approach is extended to n-gram models with explicit epsilon arcs and right cross-word context.

#2Evaluation of Phone Lattice Based Speech Decoding

Jacques Duchateau (Katholieke Universiteit Leuven)
Kris Demuynck (Katholieke Universiteit Leuven)
Hugo Van hamme (Katholieke Universiteit Leuven)

Previously, we proposed a flexible two-layered speech recogniser architecture, called FLaVoR. In the first layer an unconstrained, task independent phone recogniser generates a phone lattice. Only in the second layer are the task specific lexicon and language model applied to decode the phone lattice and produce a word level recognition result. In this paper, we present a further evaluation of the FLaVoR architecture. The performance of a classical single-layered architecture and the FLaVoR architecture are compared on two recognition tasks, using the same acoustic, lexical and language models. On the large vocabulary Wall Street Journal 5k and 20k benchmark tasks, the two-layered architecture resulted in slightly but not significantly better word error rates. On a reading error detection task for a reading tutor for children, the FLaVoR architecture clearly outperformed the single-layered architecture.

#3A Fully Data Parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit

Jike Chong (University of California, Berkeley)
Ekaterina Gonina (University of California, Berkeley)
Youngmin Yi (University of California, Berkeley)
Kurt Keutzer (University of California, Berkeley)

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data-parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation has a compute-intensive phase for observation probability computation that allows dynamic elimination of redundant computation while maintaining close-to-peak execution efficiency. We demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs.

#4Combined low level and high level features for Out-Of-Vocabulary Word detection

Benjamin LECOUTEUX (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France)
Georges LINARES (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France)
Benoit FAVRE (ICSI, 1947 Center St, Suite 600, Berkeley, CA 94704, USA)

This paper addresses the issue of Out-Of-Vocabulary (OOV) word detection in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We propose a method inspired by confidence measures that consists in analyzing the recognition system outputs in order to automatically detect errors due to OOV words. This method combines various features based on acoustics, linguistics, the decoding graph and semantics. We evaluate each feature separately and estimate their complementarity. Experiments are conducted on a large French broadcast news corpus from the ESTER evaluation campaign. Results show good performance in real conditions: the method obtains an OOV word detection rate of 43%-90% with 2.5%-17.5% of false detection.

#5Bayes Risk Approximations Using Time Overlap with an Application to System Combination

Björn Hoffmeister (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Hermann Ney (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)

The computation of the Minimum Bayes Risk (MBR) decoding rule for word lattices needs approximations. We investigate a class of approximations where the Levenshtein alignment is approximated under the condition that competing lattice arcs overlap in time. The approximations have their origins in MBR decoding and in discriminative training. We develop modified versions and propose a new, conceptually extremely simple confusion network algorithm. The MBR decoding rule is extended to cope with several lattices, which enables us to apply all the investigated approximations to system combination. All approximations are tested on a Mandarin and on an English LVCSR task for a single system and for system combination. The new methods are competitive in error rate and show some advantages over the standard approaches to MBR decoding.

#6Unsupervised Estimation of the Language Model Scaling Factor

Christopher M. White (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Ariya Rastrow (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Sanjeev Khudanpur (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Frederick Jelinek (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)

This paper addresses the adjustment of the language model (LM) scaling factor of an automatic speech recognition (ASR) system for a new domain using only un-transcribed speech. The main idea is to replace the (unavailable) reference transcript with an automatic transcript generated by an independent ASR system, and adjust parameters using this sloppy reference. It is shown that despite its fairly high error rate (ca. 35%), choosing the scaling factor to minimize disagreement with the erroneous transcripts is still an effective recipe for model selection. This effectiveness is demonstrated by adjusting an ASR system trained on Broadcast News to transcribe the MIT Lectures corpus. An ASR system for telephone speech produces the sloppy reference, and optimizing towards it yields a nearly optimal LM scaling factor for the MIT Lectures corpus.
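The selection recipe described in this abstract can be sketched roughly as follows: decode the new-domain speech at several candidate LM scaling factors and keep the one whose output disagrees least with the sloppy reference. This is a minimal illustration only; `pick_lm_scale`, the `decode` callable, and the use of word-level edit distance as the disagreement measure are assumptions, not details taken from the paper.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def pick_lm_scale(scales, decode, sloppy_refs):
    """Return the scale whose decodes disagree least with the sloppy references.

    `decode(utt_index, scale)` is a hypothetical callable standing in for the
    ASR system being tuned; it returns a list of words for one utterance.
    `sloppy_refs` holds the automatic transcripts from the independent system.
    """
    total_words = max(sum(len(ref) for ref in sloppy_refs), 1)

    def disagreement(scale):
        errors = sum(edit_distance(ref, decode(i, scale))
                     for i, ref in enumerate(sloppy_refs))
        return errors / total_words

    return min(scales, key=disagreement)
```

No manual transcripts of the new domain are needed at any point, which is the appeal of the approach; the quality of the choice depends on how well disagreement with the sloppy reference tracks the true error rate.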

#7Simultaneous Estimation of Confidence and Error Cause in Speech Recognition Using Discriminative Model

Atsunori Ogawa (NTT Corporation)
Atsushi Nakamura (NTT Corporation)

Since recognition errors are unavoidable in speech recognition, confidence scoring, which accurately estimates the reliability of recognition results, is a critical function for speech recognition engines. In addition to achieving accurate confidence estimation, if we are to develop speech recognition systems that will be widely used by the public, speech recognition engines must be able to report the causes of errors properly, namely they must offer a reason for any failure to recognize input utterances. This paper proposes a method that simultaneously estimates both confidences and causes of errors in speech recognition results by using discriminative models. We evaluated the proposed method in an initial speech recognition experiment, and confirmed its promising performance with respect to confidence and error cause estimation.

#8A Generalized Composition Algorithm for Weighted Finite-State Transducers

Cyril Allauzen (Google)
Michael Riley (Google)
Johan Schalkwyk (Google)

This paper describes a weighted finite-state transducer composition algorithm that generalizes the notion of the composition filter and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering allows us to compose together large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and Google Search by Voice that demonstrate a 5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with such small overhead.

#9Word Confidence using Duration Models

Stefano Scanzio (Politecnico di Torino)
Pietro Laface (Politecnico di Torino)
Daniele Colibro (Loquendo S.p.A.)
Roberto Gemello (Loquendo S.p.A.)

In this paper, we propose a word confidence measure based on phone durations depending on large contexts. The measure is based on the expected duration of each recognized phone in a word. In the proposed approach, the duration of each phone is in principle context-dependent, and the measure is a function of the distance between the observed and expected phone duration distributions within a word. Our experiments show that, since the “duration confidence” does not make use of any acoustic information, its Equal Error Rate (EER) in terms of False Accept and False Rejection rates is not as good as the one obtained by using the more informed acoustic confidence measure. However, when the two measures are combined by simple linear interpolation, the system EER improves by 6% to 10% relative on an isolated word recognition task in several languages.
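
The linear interpolation mentioned in this abstract can be illustrated with a minimal Python sketch. The function names, the default weight and the acceptance threshold are assumptions made for illustration; the abstract does not state how the weight was chosen or how the scores were scaled.

```python
def combine_confidence(acoustic: float, duration: float, weight: float = 0.5) -> float:
    """Linearly interpolate two word confidence scores.

    `weight` is the share given to the acoustic score; the remaining
    share goes to the duration score. Both scores are assumed to lie
    on the same scale (hypothetically [0, 1]).
    """
    if not 0.0 <= weight <= 1.0:
        raise ValueError("weight must lie in [0, 1]")
    return weight * acoustic + (1.0 - weight) * duration


def accept_word(acoustic: float, duration: float,
                weight: float = 0.5, threshold: float = 0.5) -> bool:
    """Accept a recognized word when the combined score reaches the threshold."""
    return combine_confidence(acoustic, duration, weight) >= threshold
```

In practice the weight and the threshold would be tuned on held-out data, for example by sweeping both and picking the operating point where false accepts and false rejections balance.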

#10A Comparison of Audio-free Speech Recognition Error Prediction Methods

Preethi Jyothi (Ohio State University)
Eric Fosler-Lussier (Ohio State University)

Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches to predicting confusable words: examining recognition errors on the training set to learn phone confusions, and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition, and we can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors from the unseen test data where no acoustic information related to the test data is utilized.

#11Automatic Out-of-Language Detection based on Confidence Measures derived from LVCSR Word and Phone Lattices

Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)

Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, which is the target language, with medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). Results achieved indicate that employment of specific temporal context (integrated in the word or phone level) significantly increases the detection accuracies. Furthermore, we show that combination of several CMs can also improve the efficiency of detection.

#12Automatic Estimation of Decoding Parameters Using Large-Margin Iterative Linear Programming

Brian Mak (The Hong Kong University of Science and Technology)
Tom Ko (The Hong Kong University of Science and Technology)

The decoding parameters in automatic speech recognition --- grammar factor and word insertion penalty --- are usually determined by performing a grid search on a development set. Recently, we cast their estimation as a convex optimization problem, and proposed a solution using an iterative linear programming algorithm. However, the solution depends on how well the development data set matches the test set. In this paper, we further investigate an improvement on the generalization property of the solution by using large margin training within the iterative linear programming framework. Empirical evaluation on the WSJ0 5K speech recognition tasks shows that the recognition performance of the decoding parameters found by the improved algorithm using only a subset of the acoustic model training data is even better than that of the decoding parameters found by grid search on the development data, and is close to the performance of those found by grid search on the test set.

Tue-Ses2-P2:
Speech processing with audio or audiovisual input

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Bob Damper

#1Application of Differential Microphone Array for IS-127 EVRC Rate Determination Algorithm

Henry Widjaja (Institut Teknologi Telkom)
Suryoadhi Wibowo (Institut Teknologi Telkom)

A differential microphone array is known to have low sensitivity to distant sound sources. Such characteristics may be advantageous in voice activity detection, where it can be assumed that the target speaker is close and background noise sources are distant. In this paper we develop a simple modification to the EVRC rate determination algorithm (EVRC RDA) to exploit the noise-canceling property of a differential microphone array to improve its performance in highly dynamic noise environments. Comprehensive computer simulations show that the modified algorithm outperforms the original EVRC RDA in all tested noise conditions.

#2Estimating the position and orientation of an acoustic source with a microphone array network

Alberto Yoshihiro Nakano (Toyohashi University of Technology)
Seiichi Nakagawa (Toyohashi University of Technology)
Kazumasa Yamamoto (Toyohashi University of Technology)

We propose a method that finds the position and orientation of an acoustic source in an enclosed environment. For each of eight T-shaped arrays forming a microphone array network, the time delay of arrival (TDOA) of signals from microphone pairs, a source position candidate, and energy related features are estimated. These form the input for artificial neural networks (ANNs), the purpose of which is to provide indirectly a more precise position of the source and, additionally, to estimate the source's orientation using various combinations of the estimated parameters. The best combination of parameters (TDOAs and microphone positions) yields a 21.8% reduction in the mean average position error compared to baselines, and a correct orientation ratio higher than 99.0%. The position estimation baselines include two estimation methods: a TDOA-based method that finds the source position geometrically, and the SRP-PHAT that finds the most likely source position by spatial exploration.

#3Singing voice detection in polyphonic music using predominant pitch

Vishweshwara Rao (Electrical Engineering Department, Indian Institute of Technology Bombay)
Ramakrishnan Srinivasakannan (Electrical Engineering Department, Indian Institute of Technology Bombay)
Preeti Rao (Electrical Engineering Department, Indian Institute of Technology Bombay)

This paper demonstrates the superiority of energy-based features derived from the knowledge of predominant-pitch, for singing voice detection in polyphonic music over commonly used spectral features. However, such energy-based features tend to misclassify loud, pitched instruments. To provide robustness to such accompaniment we exploit the relative instability of the pitch contour of the singing voice by attenuating harmonic spectral content belonging to stable-pitch instruments, using sinusoidal modeling. The obtained feature shows high classification accuracy when applied to north Indian classical music data and is also found suitable for automatic detection of vocal-instrumental boundaries required for smoothing the frame-level classifier decisions.

#4Word stress assessment for computer aided language learning

Juan Pablo Arias (Universidad de Chile)
Nestor Becerra Yoma (Universidad de Chile)
Hiram Vivanco (Universidad de Chile)

In this paper an automatic word stress assessment system is proposed based on a top-to-bottom scheme. The method presented is text and language independent. The utterance pronounced by the student is directly compared with a reference one. The trend similarity of the F0 and energy contours is compared frame-by-frame by using DTW alignment. The stress assessment evaluation system gives an EER equal to 21.5%, which in turn is similar to the error observed in phonetic quality evaluation schemes. These results suggest that the proposed system can be employed in real applications and is applicable to any language.
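
As background for the DTW alignment mentioned in this abstract, the following is a minimal Python sketch of classic dynamic time warping between two one-dimensional contours (for example F0 or energy values per frame). The absolute-difference local cost and the function name are assumptions; the abstract does not specify the local distance or the trend-similarity measure actually used.

```python
def dtw_distance(reference, test):
    """Accumulated DTW cost between two 1-D contours.

    Uses the standard step pattern (insertion, deletion, match) and an
    absolute-difference local cost, which is an illustrative choice.
    """
    n, m = len(reference), len(test)
    inf = float("inf")
    # cost[i][j] is the best accumulated cost aligning the first i
    # reference frames with the first j test frames.
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(reference[i - 1] - test[j - 1])
            cost[i][j] = local + min(cost[i - 1][j],      # skip a reference frame
                                     cost[i][j - 1],      # skip a test frame
                                     cost[i - 1][j - 1])  # align both frames
    return cost[n][m]
```

A contour that is merely stretched in time, such as `[1, 2, 2, 3]` against `[1, 2, 3]`, aligns with zero cost, which is why DTW suits comparing a student's utterance with a reference spoken at a different rate.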

#5A non-intrusive signal-based model for speech quality evaluation using automatic classification of background noises

Adrien Leman (France Telecom R&D)
Julien Faure (France Telecom R&D)
Etienne Parizet (INSA de Lyon)

This paper describes an original method for speech quality evaluation in the presence of different types of background noises for a range of communications (mobile, VoIP, RTC). The model is obtained from subjective experiments described in [1]. These experiments show that background noise can be more or less tolerated by listeners, depending on the sources of noise that can be identified. Using a classification method, the background noises can be classified into four groups. For each one of the four groups, a relation between loudness of the noise and speech quality is proposed.

#6Acoustic Event Detection for Spotting "Hot Spots" in Podcasts

Kouhei Sumi (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
Jun Ogata (National Institute of Advanced Industrial Science and Technology)
Masataka Goto (National Institute of Advanced Industrial Science and Technology)

This paper presents a method to detect acoustic events that can be used to find “hot spots” in podcast programs. We focus on meaningful non-verbal audible reactions which suggest hot spots, such as laughter and reactive tokens. In order to detect such short events and segment the counterpart utterances, we need accurate audio segmentation and classification, dealing with various recording environments and background music. Thus, we propose a method for automatically estimating and switching penalty weights for the BIC-based segmentation depending on background environments. Experimental results show significant improvement in detection accuracy by our method compared to when using a constant penalty weight.
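
To show where a penalty weight enters BIC-based segmentation, here is a minimal Python sketch of the usual ΔBIC change-point test, reduced to a one-dimensional feature with a single Gaussian per segment. The function name, the variance floor and the one-dimensional simplification are assumptions for illustration; the paper's method of estimating and switching the weight is not reproduced.

```python
import math
from statistics import pvariance


def delta_bic(frames, split, penalty_weight=1.0, floor=1e-8):
    """Delta-BIC for a candidate change point at index `split`.

    A positive value favours modelling the two sides with separate
    Gaussians, i.e. a segment boundary. `penalty_weight` scales the
    model-complexity penalty.
    """
    left, right = frames[:split], frames[split:]
    if len(left) < 2 or len(right) < 2:
        raise ValueError("each side needs at least two frames")

    def log_var(segment):
        # Maximum-likelihood variance, floored to avoid log(0).
        return math.log(max(pvariance(segment), floor))

    n = len(frames)
    gain = 0.5 * (n * log_var(frames)
                  - len(left) * log_var(left)
                  - len(right) * log_var(right))
    # A 1-D Gaussian has two free parameters (mean and variance).
    penalty = penalty_weight * 0.5 * 2 * math.log(n)
    return gain - penalty
```

A larger `penalty_weight` suppresses boundaries and a smaller one produces more of them, which is the knob the abstract proposes to adapt to the background environment.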

#7Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion

Taras Butko (Technical University of Catalonia)
Cristian Canton-Ferrer (Technical University of Catalonia)
Carlos Segura (Technical University of Catalonia)
Xavi Giro (Technical University of Catalonia)
Climent Nadeu (Technical University of Catalonia)
Javier Hernando (Technical University of Catalonia)
Josep-Ramon Casas (Technical University of Catalonia)

The detection of the acoustic events (AEs) that are naturally produced in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, the detection of AEs from only audio information shows a large number of errors, which are mostly due to temporal overlapping of sounds. In this paper, a system to detect and recognize AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the structure of the HMM-GMM based system considers each class separately and uses a one-against-all strategy for training. Experimental AED results with a new and rather spontaneous dataset are presented which show the advantage of the proposed approach.

#8Detecting Audio Events for Semantic Video Search

Miguel Bugalho (INESC-ID Lisboa / IST)
José Portêlo (INESC-ID Lisboa)
Isabel Trancoso (INESC-ID Lisboa / IST)
Thomas Pellegrini (INESC-ID Lisboa)
Alberto Abad (INESC-ID Lisboa)

This paper describes our work on audio event detection, one of our tasks in the European project VIDIVIDEO. Preliminary experiments with a small corpus of sound effects have shown the potential of this type of corpus for training purposes. This paper describes our experiments with SVM classifiers, and different features, using a 290-hour corpus of sound effects, which allowed us to build detectors for almost 50 semantic concepts. Although the performance of these detectors on the development set is quite good (achieving an average F-measure of 0.87), preliminary experiments on documentaries and films showed that the task is much harder in real-life videos, which so often include overlapping audio events.

#9Factor Analysis for Audio-based Video Genre Classification

Mickael Rouvier (LIA)
Driss Matrouf (LIA)
Georges Linarès (LIA)

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Latent Factor Analysis proposed a model decomposition into useful and useless components. This method was successfully applied to speaker and language recognition tasks. In this paper, we study the use of Latent Factor Analysis for video genre classification by using only the audio channel. We propose a classification method based on short-term cepstral features and GMM or SVM classifiers, which are combined with Factor Analysis. Experiments are conducted on a corpus composed of 5 types of video (music, commercials, cartoons, movies and news). The relative classification error reduction obtained by using the best factor analysis configuration with respect to the baseline system (GMM-UBM) is about 56%, corresponding to a correct identification rate of about 90%.

#10Robust Audio-based Classification of Video Genre

Mickael Rouvier (LIA)
Georges Linarès (LIA)
Driss Matrouf (LIA)

Video genre classification is a challenging task in a global context of fast growing video collections available on the Internet. In this paper, we present a new method for video genre identification by audio content analysis. Our approach relies on the combination of low and high level audio features. We investigate the discriminative capacity of features related to acoustic instability, speaker interactivity, speech quality and acoustic space characterization. The genre identification is performed on these features by using an SVM classifier. Experiments are conducted on a corpus composed of cartoons, movies, news, commercials and music, on which we obtain an identification rate of 91%.

#11Fusing Audio and Video Information for Online Speaker Diarization

Joerg Schmalenstroeer (Department of Communications Engineering, University of Paderborn, Germany)
Martin Kelling (Department of Communications Engineering, University of Paderborn, Germany)
Volker Leutnant (Department of Communications Engineering, University of Paderborn, Germany)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn, Germany)

In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real-time to obtain the diarization information “who speaks when and where” with low latency to be used in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is to first glean information about the speaker's location and identity from the audio and visual data streams separately and then to fuse these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization.

#12Multimodal Speaker Verification Using Ancillary Known Speaker Characteristics Such as Gender or Age

Girija Chetty (University of Canberra)
Michael Wagner (University of Canberra)

Multimodal speaker verification based on easy-to-obtain biometric traits such as face and voice is rapidly gaining acceptance as the preferred technology for many applications. In many such practical applications, other characteristics of the speaker such as gender or age are known and may be exploited for enhanced verification accuracy. In this paper we present a parallel approach determining gender as an ancillary speaker characteristic, which is incorporated in the decision of a face-voice speaker verification system. Preliminary experiments with the DaFEx multimodal audio-video database show that fusing the results of gender recognition and identity verification improves the performance of multimodal speaker verification. Index Terms: multimodal, face-voice, speaker verification, speaker characterisation

#13Discovering Keywords from Cross-Modal Input: Ecological vs. Engineering Methods for Enhancing Acoustic Repetitions

Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
Okko Rasanen (Helsinki University of Technology)
Unto Laine (Helsinki University of Technology)

This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved using a novel dynamic time warping technique and the focus of this paper is on recent investigations conducted to enhance the identification of repeating portions of speech. This ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility in an attempt to build more robust speech recognition systems. Results show that an ad-hoc computationally engineered solution can aid the discovery of repeating acoustic patterns. However, we show that this improvement can be simulated in a more ecologically valid way.

Tue-Ses3-S1:
Panel: Speech & Intelligence

Time:Tuesday 16:00 Place:Main Hall Type:Special
Chair:Roger Moore

16:00Speech and Intelligence Panel Session

In line with the theme of this year’s INTERSPEECH conference, this special semi-plenary Panel Session will be run as a guided discussion, drawing on issues raised by the panel members and solicited in advance from the attendees. An international panel of distinguished experts will engage with the topic of ‘speech and intelligence’ and address open questions such as the importance of a link between spoken language and other aspects of human cognition. It is expected that this special event will be both informative and entertaining, and will involve opportunities for audience participation. Panel chair: Roger Moore (UK); Panel members to include: Janet Baker (USA), Anton Batliner (Germany), Lou Boves (Netherlands), Nick Campbell (Eire), Hiroya Fujisaki (Japan), Bjorn Granstrom (Sweden), Tom Griffiths (USA), Sarah Hawkins (UK), Dirk Heylan (Netherlands), Mark Huckvale (UK) & Nobuaki Minematsu (Japan). If you have a particular question or topic that you would like the panel to discuss, then please send your suggestion(s) to r.k.moore@dcs.shef.ac.uk.

Tue-Ses3-O3:
Speaker verification & identification I

Time:Tuesday 16:00 Place:East Wing 2 Type:Oral
Chair: Patrick Kenny

16:00Investigation into variants of Joint Factor Analysis for speaker recognition

Lukas Burget (Brno University of Technology)
Pavel Matejka (Brno University of Technology)
Valiantsina Hubeika (Brno University of Technology)
Jan Cernocky (Brno University of Technology)

In this paper, we investigate Joint Factor Analysis (JFA) used for speaker recognition. First, we performed a systematic comparison of full JFA with its simplified variants and confirmed the superior performance of full JFA with both eigenchannels and eigenvoices. We investigated the sensitivity of JFA to the number of eigenvoices, both for the full and the simplified variants. We studied the importance of normalization and found that gender-dependent zt-norm was crucial. The results are reported on NIST 2006 and 2008 SRE evaluation data.

16:20Improved GMM-based Speaker Verification Using SVM-Driven Impostor Dataset Selection

Mitchell McLaren (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Robbie Vogt (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Brendan Baker (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Sridha Sridharan (SAIVT Research Laboratory, QUT, Brisbane, Australia)

The problem of impostor dataset selection for GMM-based speaker verification is addressed through the recently proposed data-driven background dataset refinement technique. The SVM-based refinement technique selects from a candidate impostor dataset those examples that are most frequently selected as support vectors when training a set of SVMs on a development corpus. This study demonstrates the versatility of dataset refinement in the task of selecting suitable impostor datasets for use in GMM-based speaker verification. The use of refined Z- and T-norm datasets provided performance gains of 15% in EER in the NIST 2006 SRE over the use of heuristically selected datasets. The refined datasets were shown to generalise well to the unseen data of the NIST 2008 SRE.

16:40Adaptive Individual Background Model for Speaker Verification

Yossi Bar-Yosef (Tel-Aviv University, Tel-Aviv 69978, Israel)
Yuval Bistritz (Tel-Aviv University, Tel-Aviv 69978, Israel)

Most techniques for speaker verification today use Gaussian Mixture Models (GMMs) and make the decision by comparing the likelihood of the speaker model to the likelihood of a universal background model (UBM). The paper proposes to replace the UBM by an individual background model (IBM) that is generated for each speaker. The IBM is created using the K-nearest cohort models and the UBM by a simple new adaptation algorithm. The new GMM-IBM speaker verification system can also be combined with various score normalization techniques that have been proposed to increase the robustness of the GMM-UBM system. Comparative experiments were conducted on the NIST-2004-SRE database with a plain system setting (without score normalization) and also with the combination of adaptive test normalization (ATnorm). Results indicated that the proposed GMM-IBM system outperforms a comparable GMM-UBM system.
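
The likelihood comparison described at the start of this abstract can be sketched as a log-likelihood ratio between a speaker GMM and a background GMM. The following minimal Python sketch uses one-dimensional features and hypothetical function names; it does not implement the paper's IBM adaptation algorithm, but the background model is an ordinary argument, so either a UBM or a per-speaker background model could be passed in.

```python
import math


def gmm_log_likelihood(frames, weights, means, variances):
    """Average per-frame log-likelihood of 1-D frames under a GMM."""
    total = 0.0
    for x in frames:
        density = sum(
            w * math.exp(-0.5 * (x - m) ** 2 / v) / math.sqrt(2.0 * math.pi * v)
            for w, m, v in zip(weights, means, variances)
        )
        # Note: frames far from every component could underflow to 0 here.
        total += math.log(density)
    return total / len(frames)


def llr_score(frames, speaker_gmm, background_gmm):
    """Log-likelihood ratio; each GMM is a (weights, means, variances) tuple."""
    return (gmm_log_likelihood(frames, *speaker_gmm)
            - gmm_log_likelihood(frames, *background_gmm))
```

A claimed identity would be accepted when the score exceeds a threshold, with score normalization optionally applied to the ratio beforehand.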

17:00Optimization of Discriminative Kernels In SVM Speaker Verification

Shi-xiong Zhang (The Hong Kong Polytechnic University)
Man-wai Mak (The Hong Kong Polytechnic University)

In SVM speaker verification, the kernel needs to map variable-length observation sequences to fixed-size supervectors that capture the dynamic characteristics of speech utterances and allow speakers to be easily distinguished. Most kernels in SVM speaker verification are obtained by assuming a specific form for the similarity function of supervectors. This paper relaxes this assumption to derive a new general kernel. The kernel function is general in that it is a linear combination of any kernels belonging to the reproducing kernel Hilbert space. The combination weights are obtained by optimizing the ability of a discriminant function to separate a target speaker from impostors using either regression analysis or SVM training. The idea was applied to both low- and high-level speaker verification. In both cases, results show that the proposed kernels outperform the state-of-the-art sequence kernels.
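
The idea of building a kernel as a linear combination of base kernels can be illustrated with a minimal Python sketch. The two base kernels, the function names and the restriction to non-negative weights (which keeps the combination a valid kernel) are assumptions for illustration; the weight optimization by regression analysis or SVM training described in the abstract is not shown.

```python
import math


def linear_kernel(x, y):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(x, y))


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian (RBF) kernel with width parameter `gamma`."""
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def combined_kernel(x, y, kernels, weights):
    """Weighted sum of base kernels evaluated on supervectors x and y."""
    if len(kernels) != len(weights):
        raise ValueError("need one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("weights must be non-negative in this sketch")
    return sum(w * k(x, y) for k, w in zip(kernels, weights))
```

For example, `combined_kernel(u, v, [linear_kernel, rbf_kernel], [0.5, 0.5])` averages the two similarities for supervectors `u` and `v`.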

17:20UBM-Based Sequence Kernel for Speaker Recognition

Zhenchun Lei (School of Computer and Information Engineering, Jiangxi Normal University, China)

This paper proposes a probabilistic sequence kernel based on the universal background model, which is widely used in speaker recognition. The Gaussian components are used to construct the speaker reference space, and utterances of different lengths are mapped into fixed-size vectors after normalization with the correlation matrix. Finally, a linear support vector machine is used for speaker recognition. A transition probabilistic sequence kernel is also proposed by adopting the transition information between neighboring frames. The experiments on NIST 2001 show that the performance is comparable with that of the traditional UBM-MAP model. If we fuse the models, the performance improves by 16.8% and 19.1% respectively compared with the UBM-MAP model.

17:40GMM Kernel by Taylor Series for Speaker Verification

Xu Minqiang (Department of Electronic Science and Technology, USTC, Hefei, Anhui, China;Department of Electrical and Computer Engineering, UIUC, USA)
Zhou Xi (Department of Electrical and Computer Engineering, UIUC, USA)
Dai Beiqian (Department of Electronic Science and Technology, USTC, Hefei, Anhui, China)
Huang Thomas S. (Department of Electrical and Computer Engineering, UIUC, USA)

Currently, the combination of Gaussian Mixture Models with Support Vector Machines produces state-of-the-art performance on the text-independent speaker verification task. Many kernels have been reported for combining GMM and SVM. In this paper, we propose a novel kernel that represents the GMM distribution by Taylor expansion, and this representation is regarded as the input of the SVM. The utterance-specific GMM is represented as a combination of orders of the Taylor series expanded at the means of the Gaussian components. Here we extract the distribution information around the means of the Gaussian components in the GMM, as we can naturally assume that each mean position indicates a feature cluster in the feature space. The kernel then computes the ensemble distance between orders of the Taylor series. Results of our new kernel on the NIST speaker recognition evaluation (SRE) 2006 core task show relative improvements of up to 7.1% and 11.7% in EER for male and female speakers, respectively, compared to a K-L divergence based SVM system.

Tue-Ses3-O4:
Text Processing for Spoken Language Generation

Time:Tuesday 16:00 Place:East Wing 3 Type:Oral
Chair:Bernd Möbius

16:00Automatic Syllabification for Danish Text-to-Speech Systems

Jeppe Beck (Microsoft Language Development Center)
Daniela Braga (Microsoft Language Development Center)
João Nogueira (Faculty of Sciences of University of Lisbon)
Miguel Dias (Microsoft Language Development Center)
Luis Coelho (Instituto Politécnico do Porto)

In this paper, a rule-based automatic syllabifier for Danish is described using the Maximal Onset Principle. Prior success rates of rule-based methods applied to Portuguese and Catalan syllabification modules formed the basis of this work. The system was implemented and tested using a very small set of rules. The results gave word accuracy rates of 96.9% and 98.7%, contrary to our initial expectations, since Danish is a language with a complex syllabic structure and is thus difficult to handle with rules. Comparison with a data-driven syllabification system using artificial neural networks showed a higher accuracy rate for the former system.
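
The Maximal Onset Principle named in this abstract assigns as many intervocalic consonants as possible to the onset of the following syllable, provided the result is a permitted onset. A minimal Python sketch follows; the vowel set and onset inventory are toy assumptions, not Danish phonotactics, and the paper's actual rule set is not reproduced. Treating each run of adjacent vowels as a single nucleus is a further simplification.

```python
# Toy inventories, assumed for illustration only.
VOWELS = set("aeiouy")
LEGAL_ONSETS = {"", "b", "d", "k", "l", "m", "n", "p", "r", "s", "t",
                "pl", "sk", "st", "tr", "str"}


def syllabify(word, legal_onsets=LEGAL_ONSETS, vowels=VOWELS):
    """Split `word` into syllables using the Maximal Onset Principle."""
    # Locate nuclei as maximal runs of vowels: (start, end) index pairs.
    nuclei, i = [], 0
    while i < len(word):
        if word[i] in vowels:
            j = i
            while j < len(word) and word[j] in vowels:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]

    # For each consonant cluster between two nuclei, give the longest
    # legal suffix to the next syllable; the rest stays as a coda.
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[prev_end:next_start]
        split = len(cluster)
        for k in range(len(cluster) + 1):
            if cluster[k:] in legal_onsets:
                split = k
                break
        boundaries.append(prev_end + split)

    pieces, start = [], 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces
```

With the toy inventory, the cluster `str` is a permitted onset and moves wholly to the second syllable, whereas `nt` is not, so only `t` moves.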

16:20Hybrid Approach to Grapheme to Phoneme Conversion for Korean

Jinsik Lee (Pohang University of Science and Technology)
Byeongchang Kim (Catholic University of Daegu)
Gary Geunbae Lee (Pohang University of Science and Technology)

In the grapheme to phoneme conversion problem for Korean, two main approaches have been discussed: knowledge-based and data-driven methods. However, both camps have limitations: the knowledge-based hand-written rules cannot handle some of the pronunciation changes due to the lack of capability of linguistic analyzers and many exceptions; data-driven methods always suffer from data sparseness. To overcome the shortcomings of both camps, this paper presents a novel combination method which effectively integrates two components: (1) a rule-based converting system based on linguistically motivated hand-written rules and (2) a statistical converting system using a Maximum Entropy model. The experimental results clearly show the effectiveness of our proposed method.

16:40Robust LTS rules with the Combilex speech technology lexicon

Korin Richmond (CSTR, Informatics, Edinburgh University)
Robert Clark (CSTR, Informatics, Edinburgh University)
Sue Fitt (CSTR, Informatics, Edinburgh University)

Combilex is a high quality pronunciation lexicon aimed at speech technology applications that has recently been released by CSTR. Combilex benefits from several advanced features. This paper evaluates one of these: the explicit alignment of phones to graphemes in a word. This alignment can help to rapidly develop robust and accurate letter-to-sound (LTS) rules, without needing to rely on automatic alignment methods. To evaluate this, we used Festival's LTS module, comparing its standard automatic alignment with Combilex's explicit alignment. Our results show that using Combilex's alignment improves LTS accuracy: 86.50% words correct as opposed to 84.49%, with our most general form of lexicon. In addition, building LTS models is greatly accelerated, as the need to list allowed alignments is removed. Finally, loose comparison with other studies indicates Combilex is a superior quality lexicon in terms of consistency and size.

17:00Letter-to-phoneme conversion by inference of rewriting rules

Vincent Claveau (IRISA - CNRS)

Phonetization is a crucial step for oral document processing. In this paper, a new letter-to-phoneme conversion approach is proposed; it is automatic, simple, portable and efficient. It relies on a machine learning technique initially developed for transliteration and translation; the system infers rewriting rules from examples of words with their phonetic representations. This approach is evaluated in the framework of the Pronalsyl Pascal challenge, which includes several datasets in different languages. The obtained results equal or outperform those of the best known systems. Moreover, thanks to the simplicity of our technique, the inference time of our approach is much lower than those of the best performing state-of-the-art systems.

17:20Online Discriminative Training for Grapheme-to-Phoneme Conversion

Sittichai Jiampojamarn (Department of Computing Science, University of Alberta)
Grzegorz Kondrak (Department of Computing Science, University of Alberta)

We present an online discriminative training approach to grapheme-to-phoneme (g2p) conversion. We employ a many-to-many alignment between graphemes and phonemes, which overcomes the limitations of widely used one-to-one alignments. The discriminative structure-prediction model incorporates input segmentation, phoneme prediction, and sequence modeling in a unified dynamic programming framework. The learning model is able to capture both local context features in inputs and non-local dependency features in sequence outputs. Experimental results show that our system surpasses the state-of-the-art on several data sets.

17:40Using Same-Language Machine Translation to Create Alternative Target Sequences for Text-To-Speech Synthesis

Peter Cahill (University College Dublin)
Jinhua Du (Dublin City University)
Andy Way (Dublin City University)
Julie Carson-Berndsen (University College Dublin)

Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately but it still must attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as the source and target language. Rather than the synthesiser only saying the input text as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime.

Tue-Ses3-S2:
Special Session: Measuring the Rhythm of Speech

Time:Tuesday 16:00 Place:East Wing 4 Type:Special
Chair: Daniel Hirst & Greg Kochanski

#0Investigating Changes in the Rhythm of Maori over Time

Margaret Maclagan (University of Canterbury, New Zealand)
Catherine Watson (University of Auckland, New Zealand)
Jeanette King (University of Canterbury, New Zealand)
Ray Harlow (University of Waikato, New Zealand)
Laura Thompson (University of Auckland, New Zealand)
Peter Keegan (University of Auckland, New Zealand)

Present-day Maori elders comment that the mita (which includes rhythm) of the Maori language has changed over time. This paper presents the first results in a study of the change of Maori rhythm. PVI analyses did not capture this change. Perceptual experiments, using extracts of speech low-pass filtered to 400 Hz, demonstrated that Maori and English speech could be distinguished. Listeners who spoke Maori were more accurate than those who spoke only English. The English and Maori speech of groups of different speakers born at different times was perceived differently, indicating that the rhythm of Maori has indeed changed over time.

#0The Dynamic Dimension of the Global Speech-Rhythm Attributes

Jan Volín (Institute of Phonetics, Charles University in Prague)
Petr Pollák (Faculty of Electrical Engineering, Czech Technical University in Prague)

Recent years have revealed that certain global attributes of speech rhythm can be quite successfully captured with respect to consonantal and vocalic intervals in spoken texts. One of the many problems of this approach lies in complex syllabic structures. Unless we make an a-priori phonological decision, sonorous consonants may contribute to either vocalic or consonantal part of the speech signal in post-initial and pre-final positions of syllabic onsets and codas. A procedure is offered to avoid phonological dilemmas together with tedious manual work. The method is tested on continuous Czech and English texts read out by several professionals.

#0Vowel duration in pre-geminate contexts in Polish

Zofia Malisz (Adam Mickiewicz University, Poznan)

The study presents Polish experimental data on the variability of vowel duration in the context of following singleton and geminate consonants. The aim of the study is to explain the low vocalic variability values obtained from "rhythm metrics" based analyses of speech rhythm. It also aims at contributing to the discussion about current dynamical models of speech rhythm that contain assumptions of the relative temporal stability of the vowel-to-vowel sequence. The results suggest that vowels in Polish co-vary with following consonant length in a roughly proportionate manner. An interpretation of the effect is offered where a fortition process overrides the possibility of temporal compensation.

#0Effects of Mora-timing in English Rhythm Control by Japanese Learners

Shizuka Nakamura (Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan)
Hiroaki Kato (National Institute of Information and Communications Technology / Advanced Telecommunications Research Institute International, Japan)
Yoshinori Sagisaka (Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan)

In our previous studies on an objective evaluation of English rhythm control by Japanese learners, we noticed that the accustomed mora-timing of Japanese learners might unfavorably affect English speech of stress-timing. In this paper, we analyzed durational differences between Japanese learners and native speakers in the corresponding speech units such as stressed/unstressed syllable, strong/weak vowel, syllable in content/function word, and closed/open syllable from a perspective of the contrast of stressed/unstressed syllables. It was confirmed that these durational differences caused by mora-timing strongly affected subjective evaluation by native teachers, through correlation analyses of these differences and subjective evaluation scores.

16:00The rhythm of text and the rhythm of utterances: from metrics to models.

Daniel Hirst (CNRS, Aix-Marseille Université, Aix-en-Provence, France)

The typological classification of languages as stress-timed, syllable-timed and mora-timed did not stand up to empirical investigation, which found little or no evidence for the different types of isochrony which had been assumed to be the basis for the classification. In recent years, there has been a renewal of interest with the development of empirical metrics for measuring rhythm. In this paper it is shown that some of these metrics are more sensitive to the rhythm of the text than to the rhythm of the utterance itself. While a number of recent proposals have been made for improving these metrics, it is proposed that what is needed is more detailed studies of large corpora in order to develop more sophisticated models of the way in which prosodic structure is realised in different languages. New data on British English is presented using the Aix-Marsec corpus.

16:20No Time to Lose? Time Shrinking Effects Enhance the Impression of Rhythmic "Isochrony" and Fast Speech Rate

Petra Wagner (Universität Bielefeld)
Andreas Windmann (Universität Bielefeld)

Time Shrinking denotes the psycho-acoustic shrinking effect of a short interval on one or several subsequent longer intervals. Its effectiveness in the domain of speech perception has so far not been examined. Two perception experiments clearly suggest the influence of relative duration patterns triggering time shrinking on the perception of tempo and rhythmical isochrony or rather "evenness". A comparison between the experimental data and duration patterns across various languages suggests a strong influence of time shrinking on the impression of isochrony in speech and perceptual speech rate. Our results thus emphasize the necessity of taking into account relative timing within rhythmical domains such as feet, phrases or narrow rhythm units as a complementary perspective to popular global rhythm variability metrics.

16:40Measuring speech rhythm variation in a model-based framework

Plínio Barbosa (Speech Prosody Studies Group/Dep. of Linguistics/Inst.Est. Ling., Univ. of Campinas, Brazil)

A coupled-oscillators-model-based method for measuring speech rhythm is presented. This model explains cross-linguistic differences in rhythm as deriving from varying degrees of coupling strength between a syllable oscillator and a phrase stress oscillator. The method was applied to three texts read aloud in French, in Brazilian and European Portuguese by seven speakers. The results reproduce the early findings on rhythm typology for these languages/varieties with the following advantages: it successfully accounts for speech rate variation, related to the syllabic oscillator frequency in the model; it takes only syllable-sized units into account, not splitting syllables into vowels and consonants; the consequences of phrase stress magnitude on stress group duration are directly considered; both universal and language-specific aspects of speech rhythm are captured by the model.

17:00Rhythm measures with language-independent segmentation

Anastassia Loukina (Phonetics laboratory, University of Oxford, United Kingdom)
Greg Kochanski (Phonetics laboratory, University of Oxford, United Kingdom)
Chilin Shih (EALC/Linguistics, University of Illinois, Urbana-Champaign, USA)
Elinor Keane (Phonetics laboratory, University of Oxford, United Kingdom)
Ian Watson (Phonetics laboratory, University of Oxford, United Kingdom)

We compare 15 measures of speech rhythm based on an automatic segmentation of speech into vowel-like and consonant-like regions. This allows us to apply identical segmentation criteria to all languages and compute rhythm measures over a large corpus. It may also approximate more closely the segmentation available to pre-lexical infants, who have been claimed to discriminate between languages. We find that within-language variation is large and comparable to the language-to-language differences we observed. We evaluate the success of different measures in separating languages and show that the efficiency of measures depends on the languages included in the corpus. Rhythm appears to be described by two dimensions and different published rhythm measures capture different aspects of it.
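
As an illustration of what interval-based rhythm measures look like, the following minimal sketch computes three widely used ones (%V, a delta measure, and the normalised Pairwise Variability Index) from lists of interval durations. The abstract does not name its 15 measures, so the choice of these three is an assumption, as are the function names, the use of a population standard deviation, and the example durations.

```python
from statistics import mean, pstdev


def percent_v(vocalic, consonantal):
    """%V: share of total duration taken up by vowel-like intervals."""
    total = sum(vocalic) + sum(consonantal)
    return 100.0 * sum(vocalic) / total


def delta(intervals):
    """Delta measure: standard deviation of interval durations.

    Population standard deviation is an assumption made for this sketch.
    """
    return pstdev(intervals)


def npvi(intervals):
    """Normalised Pairwise Variability Index over successive intervals."""
    pairs = zip(intervals, intervals[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * mean(terms)


# Hypothetical durations in seconds, for illustration only.
vowels = [0.10, 0.20, 0.10, 0.20]
consonants = [0.05, 0.15, 0.10, 0.10]
print(percent_v(vowels, consonants), delta(consonants), npvi(vowels))
```

Because each function only needs a list of durations, the same code applies whether the intervals come from manual labelling or from an automatic vowel-like/consonant-like segmentation.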

Tue-Ses3-P4:
Topics in Spoken Language Processing

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair: Chiori Hori

#1Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech

Jonathan Wintrode (US Department of Defense)
Scott Kulp (Rutgers University)

We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature-weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using one recognizer tuned to run at various speeds. We use SVM classifiers to perform topic identification on the output. We observe classifiers incorporating confidence information to be significantly more robust to errors than those treating output as unweighted text.
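
The abstract does not give the modified TF-IDF formula, so the following is only a sketch of one plausible way to fold recognizer confidence into term weighting: each hypothesized word contributes its confidence score, rather than a full count of 1, to the term frequency. The function names, the smoothed IDF, and the toy data are assumptions for illustration.

```python
import math
from collections import defaultdict


def expected_counts(hypothesis):
    """Sum per-word confidences so each token contributes a fractional count."""
    counts = defaultdict(float)
    for word, confidence in hypothesis:
        counts[word] += confidence
    return dict(counts)


def tfidf(doc_counts, all_docs):
    """Weight each term by its soft count times a smoothed inverse document frequency."""
    n_docs = len(all_docs)
    weights = {}
    for word, tf in doc_counts.items():
        df = sum(1 for d in all_docs if word in d)
        weights[word] = tf * math.log((1 + n_docs) / (1 + df))
    return weights


# Hypothetical (word, confidence) recognizer output for two conversations.
conversations = [
    [("pets", 0.9), ("pets", 0.6), ("the", 0.5)],
    [("taxes", 0.7), ("the", 0.6)],
]
docs = [expected_counts(c) for c in conversations]
print(tfidf(docs[0], docs))
```

With this weighting, a word recognized with low confidence influences the topic classifier less than one recognized with high confidence, which is the general intuition behind treating recognizer output as weighted rather than plain text.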

#2Localization of Speech Recognition in Spoken Dialog Systems: How Machine Translation Can Make Our Lives Easier

David Suendermann (SpeechCycle, Inc.)
Jackson Liscombe (SpeechCycle, Inc.)
Krishna Dayanidhi (SpeechCycle, Inc.)
Roberto Pieraccini (SpeechCycle, Inc.)

The localization of speech recognition for large-scale spoken dialog systems can be a tremendous manual exercise. Usually though, a vast number of transcribed and annotated utterances exists for the source language. In this paper, we propose to use such data and translate it into the target language using machine translation. The translated utterances and their associated (original) annotations are then used to train statistical grammars for all contexts of the target system. As an example, we localize an English spoken dialog system for Internet troubleshooting to Spanish by translating more than 4 million source utterances without any human intervention. In an application of the localized system to more than 10,000 utterances collected on a similar Spanish Internet troubleshooting system, we show that the overall accuracy was only 5.7% worse than that of the English source system.

#3Algorithms for Speech Indexing in Microsoft Recite

Kunal Mukerjee (Microsoft)
Shankar Regunathan (Microsoft)
Jeffrey Cole (Microsoft)

Microsoft Recite is a mobile application to store and retrieve spoken notes. Recite stores and matches n-grams of pattern class identifiers that are designed to be language neutral and handle a large number of out of vocabulary phrases. The query algorithm expects noise and fragmented matches and compensates for them with a heuristic ranking scheme. This contribution describes a class of indexing algorithms for Recite that allows for high retrieval accuracy while meeting the constraints of low computational complexity and memory footprint of embedded platforms. The results demonstrate that a particular indexing scheme within this class can be selected to optimize the trade-off between retrieval accuracy and insertion/query complexity.

#4Parallelized Viterbi Processor for 5,000-Word Large-Vocabulary Real-Time Continuous Speech Recognition FPGA System

Tsuyoshi Fujinaga (Kobe University)
Kazuo Miura (Kobe University)
Hiroki Noguchi (Kobe University)
Hiroshi Kawaguchi (Kobe University)
Masahiko Yoshimoto (Kobe University)

We propose a novel Viterbi processor for large-vocabulary real-time continuous speech recognition. This processor is built from multiple Viterbi cores. Since each core can compute independently, these cores reduce the cycle times very efficiently. To verify the effect of utilizing multiple cores, we implement a dual-core Viterbi processor in an FPGA and achieve a 49% cycle-time reduction compared to a single-core processor. Our proposed dual-core Viterbi processor achieves 5,000-word real-time continuous speech recognition at 65.175 MHz. In addition, it is easy to scale up the number of cores, which makes larger vocabularies achievable.

#5SpLaSH (Spoken Language Search Hawk): integrating time-aligned with text-aligned annotations

Sara Romano (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)
Elvio Cecere (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)
Francesco Cutugno (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)

In this work we present SpLaSH (Spoken Language Search Hawk), a toolkit used to perform complex queries on spoken language corpora. SpLaSH provides tools for integrating time-aligned annotations (TMA), by means of annotation graphs, with text-aligned ones (TXA), by means of generic XML files. SpLaSH imposes a very limited number of constraints on the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing three types of queries: simple query on TXA or TMA structures, sequence query on TMA structure and cross query on both TXA and TMA integrated structures.

#6PodCastle: Collaborative Training of Acoustic Models on the Basis of Wisdom of Crowds for Podcast Transcription

Jun Ogata (National Institute of Advanced Industrial Science and Technology (AIST))
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

This paper presents acoustic-model-training techniques for improving automatic transcription of podcasts. A typical approach for acoustic modeling is to create a task-specific corpus including hundreds of hours of speech data and their accurate transcriptions. This approach, however, is impractical in the podcast-transcription task because manual generation of the transcriptions of the large amounts of speech covering all the various types of podcast contents would be too costly and time consuming. To solve this problem, we introduce collaborative training of acoustic models on the basis of wisdom of crowds, i.e., the transcriptions of podcast-speech data are generated by anonymous users on our web service PodCastle. We then describe a podcast-dependent acoustic modeling system by using RSS metadata to deal with the differences of acoustic conditions in podcasts. From our experimental results, the effectiveness of the proposed acoustic model training was confirmed.

#7A WFST-based Log-linear Framework for Speaking-style Transformation

Graham Neubig (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)

When attempting to make transcripts from automatic speech recognition results, disfluency deletion, transformation of colloquial expressions, and insertion of dropped words must be performed to ensure that the final product is clean transcript-style text. This paper introduces a system for the automatic transformation of the spoken word to transcript-style language that enables not only deletion of disfluencies, but also substitutions of colloquial expressions and insertion of dropped words. A number of potentially useful features are combined in a log-linear probabilistic framework, and the utility of each is examined. The system is implemented using weighted finite state transducers (WFSTs) to allow for easy combination of features and integration with other WFST-based systems. On evaluation, the best system achieved a 5.37% word error rate, a 5.49% absolute gain over a rule-based baseline and a 1.54% absolute gain over a simple noisy-channel model.

#8ClusterRank: A Graph Based Method for Meeting Summarization

Nikhil Garg (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Benoit Favre (International Computer Science Institute, Berkeley, USA)
Korbinian Riedhammer (International Computer Science Institute, Berkeley, USA)
Dilek Hakkani-Tür (International Computer Science Institute, Berkeley, USA)

This paper presents an unsupervised, graph based approach for extractive summarization of meetings. Graph based methods such as TextRank have been used for sentence extraction from news articles. These methods model text as a graph with sentences as nodes and edges based on word overlap. A sentence node is then ranked according to its similarity with other nodes. The spontaneous speech in meetings leads to incomplete, ill-formed sentences with high redundancy and calls for additional measures to extract relevant sentences. We propose an extension of the TextRank algorithm that clusters the meeting utterances and uses these clusters to construct the graph. We evaluate this method on the AMI meeting corpus and show a significant improvement over TextRank and other baseline methods.
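
The baseline idea described above (sentences as nodes, word-overlap edges, ranking by similarity to other nodes) can be sketched as follows. This is a simplified TextRank-style ranker, not the ClusterRank extension proposed in the paper; the similarity uses distinct words per sentence, and the function names, damping factor, iteration count, and example utterances are assumptions.

```python
import math


def overlap(a, b):
    """Word-overlap similarity, normalised by the log sizes of both sentences.

    Sentences with fewer than two distinct words are given zero similarity.
    """
    wa, wb = set(a.split()), set(b.split())
    if len(wa) < 2 or len(wb) < 2:
        return 0.0
    return len(wa & wb) / (math.log(len(wa)) + math.log(len(wb)))


def textrank(sentences, damping=0.85, iterations=50):
    """Rank sentences by a weighted PageRank over the word-overlap graph."""
    n = len(sentences)
    w = [[overlap(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]  # total edge weight leaving each node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] / out[j] * scores[j] for j in range(n) if out[j] > 0)
            for i in range(n)
        ]
    return scores


# Hypothetical meeting utterances, for illustration only.
utterances = [
    "we need a new remote design",
    "the remote design should be simple",
    "lunch is at noon",
]
print(textrank(utterances))
```

An utterance that shares no words with the rest receives only the base score, which hints at why fragmentary, redundant meeting speech may need additional structure such as clustering before ranking.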

#9Leveraging Sentence Weights in a Concept-based Optimization Framework for Extractive Meeting Summarization

Shasha Xie (International Computer Science Institute, Berkeley, CA)
Benoit Favre (International Computer Science Institute, Berkeley, CA)
Dilek Hakkani-Tür (International Computer Science Institute, Berkeley, CA)
Yang Liu (The University of Texas at Dallas, Richardson, TX)

We adopt an unsupervised concept-based global optimization framework for extractive meeting summarization, where a subset of sentences is selected to cover as many important concepts as possible. We propose to leverage sentence importance weights in this model. Three ways are introduced to combine the sentence weights within the concept-based optimization framework: selecting sentences for concept extraction, pruning unlikely candidate summary sentences, and using joint optimization of sentence and concept weights. Our experimental results on the ICSI meeting corpus show that our proposed methods can significantly improve the performance for both human transcripts and ASR output compared to the baseline of the concept-based approach, and this unsupervised approach achieves results comparable with those from supervised learning approaches presented in previous work.
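
To make the selection objective concrete, here is a toy sketch that picks the subset of sentences covering the greatest total concept weight within a word budget, counting each concept only once. It uses exhaustive search rather than the global optimization solver a real system would need, treats single words as concepts, and does not include the sentence weights the paper adds; all names and data are hypothetical.

```python
from itertools import combinations


def best_summary(sentences, concept_weights, budget):
    """Pick the sentence subset that maximises covered concept weight within a word budget.

    Exhaustive search: exact, but only feasible for toy inputs.
    """
    best_score, best_subset = 0.0, ()
    indices = range(len(sentences))
    for size in range(1, len(sentences) + 1):
        for subset in combinations(indices, size):
            length = sum(len(sentences[i].split()) for i in subset)
            if length > budget:
                continue
            # Each concept counts once, however many selected sentences contain it.
            covered = set()
            for i in subset:
                covered.update(w for w in sentences[i].split() if w in concept_weights)
            score = sum(concept_weights[c] for c in covered)
            if score > best_score:
                best_score, best_subset = score, subset
    return best_subset, best_score


# Hypothetical sentences and concept weights, for illustration only.
sentences = ["budget for the remote", "remote needs buttons", "lunch at noon"]
weights = {"budget": 3.0, "remote": 2.0, "buttons": 1.0}
print(best_summary(sentences, weights, budget=7))
```

Because a concept is credited only once, a second sentence that repeats already-covered concepts adds nothing to the score, so redundancy is discouraged without an explicit penalty term.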

#10Hybrids of Supervised and Unsupervised Models for Extractive Speech Summarization

Shih-Hsiang Lin (National Taiwan Normal University)
Yueng-Tien Lo (National Taiwan Normal University)
Yao-Ming Yeh (National Taiwan Normal University)
Berlin Chen (National Taiwan Normal University)

Speech summarization, distilling important information and removing redundant and incorrect information from spoken documents, has become an active area of intensive research in the recent past. In this paper, we consider hybrids of supervised and unsupervised models for extractive speech summarization. Moreover, we investigate the use of the unsupervised summarizer to improve the performance of the supervised summarizer when manual labels are not available for training the latter. A novel training data selection and relabeling approach designed to leverage the inter-document and/or the inter-sentence similarity information is explored as well. Encouraging results were initially demonstrated.

#11Automatic Detection of Audio Advertisements

Dan Melamed (AT&T Labs-Research)
Yeon-Jun Kim (AT&T Labs-Research)

Quality control analysts in customer service call centers often search for keywords in call transcripts. Their searches can return an overwhelming number of false positives when the search terms also appear in advertisements that customers hear while they are on hold. This paper presents new methods for detecting advertisements in audio data, so that they can be filtered out. In order to be usable in real-world applications, our methods are designed to minimize human intervention after deployment. Even so, they are much more accurate than a baseline HMM method.

#12Named Entity Network based on Wikipedia

Sameer Maskey (IBM Research)
Wisam Dakka (Google)

Named Entities (NEs) play an important role in many natural language and speech processing tasks. A resource that identifies relations between NEs could potentially be very useful. We present such an automatically generated knowledge resource from Wikipedia, Named Entity Network (NE-NET), that provides a list of related Named Entities (NEs) and the degree of relation for any given NE. Unlike some manually built knowledge resources, NE-NET has a wide coverage consisting of 1.5 million NEs represented as nodes of a graph with 6.5 million arcs relating them. NE-NET also provides the ranks of the related NEs using a simple ranking function that we propose. In this paper, we present NE-NET and our experiments showing how NE-NET can be used to improve the retrieval of spoken (Broadcast News) and text documents.

Tue-Ses3-P2:
ASR: Acoustic Modelling

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair:Simon King

#1Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition

Jing Huang (IBM Research)
Karthik Visweswariah (IBM Research)

In this paper we investigate discriminative training of models and feature space for a multi-stream HMM-based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work which has considered discriminative training of parameters in each stream independent of the other. In experiments on a 20-speaker one-hour speaker independent test set, we obtain 22% relative gain on AVSR performance over A/V models whose parameters are trained separately, and 50% relative gain on AVSR over the baseline maximum-likelihood models. On a noisy (mismatched to training) test set, we obtain 21% relative gain over A/V models whose parameters are trained separately. This represents 30% relative improvement over the maximum-likelihood baseline.

#2Cued Speech Recognition for Augmentative Communication in Normal-hearing and Hearing-impaired Subjects

Panikos Heracleous (GIPSA-lab, Speech and Cognition Department)
Denis Beautemps (GIPSA-lab, Speech and Cognition Department)
Noureddine Aboutabit (GIPSA-lab, Speech and Cognition Department)

Speech is the most natural means of communication for humans. However, in situations where audio speech is not available or cannot be perceived because of disabilities or adverse environmental conditions, people may resort to alternative methods such as augmented speech, i.e. audio speech supplemented or replaced by other modalities, such as audiovisual speech, or Cued Speech. Cued Speech is a visual communication mode, which uses lipreading and handshapes placed in different positions to make spoken language wholly understandable to deaf individuals. The current study reports the authors' activities and progress in Cued Speech recognition for French. Previously, the authors have reported experimental results for vowel and consonant recognition in Cued Speech for French in the case of a normal-hearing subject. The study has been extended by also employing a deaf cuer, and both cuer-dependent and multi-cuer experiments based on hidden Markov models (HMM) have been conducted.

#3On Acquiring Speech Production Knowledge from Articulatory Measurements

Daniel Neiberg (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)
Gopal Ananthakrishnan (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)
Mats Blomberg (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)

The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

#4Measuring the gap between HMM-based ASR and TTS

John Dines (Idiap Research Institute)
Junichi Yamagishi (University of Edinburgh)
Simon King (University of Edinburgh)

The EMIME project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation. The hidden Markov model is being used as the underlying technology in both automatic speech recognition and text-to-speech synthesis; thus, the investigation of unified statistical models has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted with English ASR and TTS, measuring performance with respect to phone set and lexicon, feature extraction and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance may demand diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of unified models.

#5Speech recognition with speech synthesis models by marginalising over decision tree leaves

John Dines (Idiap Research Institute)
Lakshmi Saheer (Idiap Research Institute)
Hui Liang (Idiap Research Institute)

There has been increasing interest in the use of unsupervised adaptation for the personalisation of text-to-speech (TTS), particularly in the context of speech-to-speech translation. This requires that we are able to generate adaptation transforms from the output of an automatic speech recognition (ASR) system. An approach that utilises unified ASR and TTS models would seem to offer an ideal mechanism for the application of unsupervised adaptation to TTS since transforms could be shared between ASR and TTS. Such unified models should use a common set of parameters. A major barrier to such parameter sharing is the use of differing contexts in ASR and TTS. In this paper we propose a simple approach that generates ASR models from a trained set of TTS models by marginalising over the TTS contexts that are not used by ASR. We present preliminary results of our proposed method on a large vocabulary speech recognition task and provide insights into future directions of this work.

#6Detailed description of triphone model using SSS-free algorithm

Motoyuki Suzuki (Institute of Technology and Science, The University of Tokushima)
Daisuke Honma (Graduate School of Engineering, Tohoku University)
Akinori Ito (Graduate School of Engineering, Tohoku University)
Shozo Makino (Graduate School of Engineering, Tohoku University)

The triphone model is frequently used as an acoustic model. It is effective for modeling phonetic variations caused by coarticulation. However, it is known that acoustic features of phonemes are also affected by other factors such as speaking style and speaking speed. In this paper, a new acoustic model is proposed. All training data which have the same phoneme context are automatically clustered into several clusters based on acoustic similarity, and a “sub-triphone” is trained using the training data corresponding to each cluster. In experiments, the sub-triphone model achieved about 5% higher phoneme accuracy than the triphone model.

#7Decision Tree Acoustic Models for ASR

Jitendra Ajmera (Toshiba)
Masami Akamine (Toshiba)

This paper presents a summary of our research progress using decision-tree acoustic models (DTAM) for large vocabulary speech recognition. Various configurations of training DTAMs are proposed and evaluated on the Wall Street Journal (WSJ) task. A number of different acoustic and categorical features have been used for this purpose. Various ways of realizing a forest instead of a single tree have been presented and shown to improve recognition accuracy. Although the performance is not shown to be better than Gaussian mixture models (GMMs), several advantages of DTAMs have been highlighted and exploited. These include compactness, computational simplicity and the ability to handle unordered information.

#8Compression Techniques Applied to Multiple Speech Recognition Systems

Catherine Breslin (Toshiba Research Europe Ltd)
Matt Stuttle (Toshiba Research Europe Ltd)
Kate Knill (Toshiba Research Europe Ltd)

Speech recognition systems typically contain many Gaussian distributions, and hence a large number of parameters. This makes them both slow to decode speech, and large to store. Techniques have been proposed to decrease the number of parameters. One approach is to share parameters between multiple Gaussians, thus reducing the total number of parameters and allowing for shared likelihood calculation. Gaussian tying and subspace clustering are two related techniques which take this approach to system compression. These techniques can decrease the number of parameters with no noticeable drop in performance for single systems. However, multiple acoustic models are often used in real speech recognition systems. This paper considers the application of Gaussian tying and subspace compression to multiple systems. Results show that two speech recognition systems can be modelled using the same number of Gaussians as just one system, with little effect on individual system performance.

#9Graphical Models for Discrete Hidden Markov Models in Speech Recognition

Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Luis Buera (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

Emission probability distributions in speech recognition have been traditionally associated with continuous random variables. The most successful models have been the mixtures of Gaussians in the states of the hidden Markov models to generate/capture observations. In this work we show how graphical models can be used to extract the joint information of more than two features. This is possible if we previously quantize the speech features to a small number of levels and model them as discrete random variables. This paper presents a method to estimate a graphical model with a constrained number of dependencies, which is a subset of the directed acyclic graph based model framework, Bayesian networks. Some experimental results are obtained with this method compared to baseline systems of full and diagonal covariance matrices.

#10Factor Analyzed HMM Topology for Speech Recognition

Chuan-Wei Ting (National Cheng Kung University)
Jen-Tzung Chien (National Cheng Kung University)

This paper presents a new factor analyzed (FA) similarity measure between two Gaussian mixture models (GMMs). An adaptive hidden Markov model (HMM) topology is built to compensate for the pronunciation variations in speech recognition. Our idea aims to evaluate whether the variation of an HMM state from new speech data is significant or not and judge if a new state should be generated in the models. Due to the effectiveness of FA data analysis, we measure the GMM similarity by estimating the common factors and specific factors embedded in the HMM means and variances. Similar Gaussian densities are represented by the common factors. Specific factors express the residual of the similarity measure. We perform a composite hypothesis test due to common factors as well as specific factors. An adaptive HMM topology is accordingly established from continuous collection of training utterances. Experiments show that the proposed FA measure outperforms other measures with a comparable number of parameters.

#11Tied-State Multi-path HMnet Model using Three-Domain Successive State Splitting

Soo-Young Suk (Speech Processing Group, Information Technology Research Institute, AIST)
Hiroaki Kojima (Speech Processing Group, Information Technology Research Institute, AIST)

In this paper, we address the improvement of an acoustic model using the multi-path Hidden Markov network (HMnet) model for automatically creating non-uniform tied-state, context-dependent hidden Markov model topologies. Recent research has achieved multi-path model topologies in order to improve the recognition performance in gender-independent, spontaneous-speaking applications. However, the multi-path acoustic model size may increase and require more training samples depending on the increased number of paths. To solve this problem, we used a tied-state multi-path topology by which we can create a three-domain successive state splitting method to which environmental splitting is added. This method can obtain a suitable model topology with small mixture components. Experiments demonstrated that the proposed multi-path HMnet model performs better than single-path models for the same number of states.

#12Acoustic Modeling Using Exponential Families

Vaibhava Goel (IBM)
Peder Olsen (IBM)

We present a framework to utilize general exponential families for acoustic modeling. Maximum Likelihood (ML) parameter estimation is carried out using sampling based estimates of the partition function and expected feature vector. Markov Chain Monte Carlo procedures are used to draw samples from general exponential densities. We apply our ML estimation framework to two new exponential families to demonstrate the modeling flexibility afforded by this framework.
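
As a reading aid, the standard maximum-likelihood gradient for an exponential family shows where the partition function and the expected feature vector enter. The notation below (feature map \phi, parameters \theta, N training frames, S Markov chain Monte Carlo samples \tilde{x}_s) is assumed here for illustration and is not taken from the paper.

```latex
p(x;\theta) = \frac{\exp\{\theta^{\top}\phi(x)\}}{Z(\theta)},
\qquad
Z(\theta) = \int \exp\{\theta^{\top}\phi(x)\}\,dx

\nabla_{\theta}\,\frac{1}{N}\sum_{i=1}^{N}\log p(x_i;\theta)
  = \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \mathbb{E}_{p(\cdot;\theta)}\!\left[\phi(x)\right]
  \approx \frac{1}{N}\sum_{i=1}^{N}\phi(x_i) - \frac{1}{S}\sum_{s=1}^{S}\phi(\tilde{x}_s),
\qquad \tilde{x}_s \sim p(\cdot;\theta)
```

The second term is the expected feature vector under the current model. When it has no closed form, it can be approximated by averaging \phi over samples drawn from the model, which is the role sampling plays in the abstract.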

Tue-Ses3-P1:
Single- and Multichannel Speech Enhancement

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster

#1Watermark Recovery From Speech Using Inverse Filtering And Sign Correlation

Robert Morris (SPAWAR Systems Center Pacific)
Ralph Johnson (SPAWAR Systems Center Pacific)
Vladimir Goncharoff (University of Illinois at Chicago)
Joseph DiVita (SPAWAR Systems Center Pacific)

This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals. The watermark, a sequence of DTMF tones, was added to speech without knowledge of its time-varying characteristics. Watermark recovery began by implementing a synchronized zero-phase inverse filtering operation to decorrelate the speech during its voiced segments. The final step was to apply the sign correlation technique, which resulted in performance advantages over linear correlation detection. Our simulations include the effects of finite word length in the correlator.
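
The difference between sign correlation and linear correlation can be illustrated with a minimal sketch. It assumes one common form of sign correlation (the polarity-coincidence form, which averages the products of sample signs). The abstract does not give the paper's exact detector, and the function names and the test tone are hypothetical.

```python
import math

def sign(v):
    """Polarity of a sample; zero is treated as positive."""
    return 1.0 if v >= 0.0 else -1.0

def linear_correlation(x, template):
    """Zero-lag normalized linear correlation between x and template."""
    num = sum(a * b for a, b in zip(x, template))
    den = math.sqrt(sum(a * a for a in x) * sum(b * b for b in template))
    return num / den if den > 0.0 else 0.0

def sign_correlation(x, template):
    """Average product of sample polarities (assumed polarity-coincidence form)."""
    n = min(len(x), len(template))
    return sum(sign(a) * sign(b) for a, b in zip(x, template)) / n

# Hypothetical test tone with a single large opposite-polarity outlier.
tone = [math.sin(2.0 * math.pi * 0.05 * n) for n in range(200)]
received = list(tone)
received[5] -= 500.0

lin = linear_correlation(received, tone)
sgn = sign_correlation(received, tone)
```

In this toy case the single outlier drives the linear correlation below zero, while the sign correlation stays near one because only one of 200 polarities flips. This is only one illustration of why a sign-based detector can be less sensitive to large-amplitude residue.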

#2Weighted Linear Prediction for Speech Analysis in Noisy Conditions

Jouni Pohjalainen (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)
Heikki Kallasjoki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Kalle Palomäki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Mikko Kurimo (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Paavo Alku (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)

Following earlier work, we modify linear predictive (LP) speech analysis by including temporal weighting of the squared prediction error in the model optimization. In order to focus this so-called weighted LP model on the least noisy signal regions in the presence of stationary additive noise, we use short-time signal energy as the weighting function. We compare the noisy spectrum analysis performance of weighted LP and its recently proposed variant, the latter guaranteed to produce stable synthesis models. As a practical test case, we use automatic speech recognition to verify that the weighted LP methods improve upon the conventional FFT and LP methods by making spectrum estimates less prone to corruption by additive noise.
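
The weighting idea can be sketched as a weighted least-squares problem. In the sketch below, the squared prediction error at each sample is multiplied by the energy of the preceding m samples before the normal equations are solved. The summation range, the window length m, and all function names are assumptions for illustration, and the stabilized variant mentioned in the abstract is not shown.

```python
def short_time_energy(x, n, m):
    """Energy of the m samples preceding index n (samples before the start are skipped)."""
    return sum(x[n - i] ** 2 for i in range(1, m + 1) if n - i >= 0)

def solve(a, b):
    """Solve the linear system a * y = b by Gaussian elimination with partial pivoting."""
    size = len(b)
    aug = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    y = [0.0] * size
    for r in range(size - 1, -1, -1):
        tail = sum(aug[r][c] * y[c] for c in range(r + 1, size))
        y[r] = (aug[r][size] - tail) / aug[r][r]
    return y

def weighted_lp(x, order, m):
    """Predictor coefficients minimizing sum_n w[n] * (x[n] - sum_k a[k] * x[n-k])**2."""
    r_mat = [[0.0] * order for _ in range(order)]
    r_vec = [0.0] * order
    for n in range(order, len(x)):
        w = short_time_energy(x, n, m)  # weight: energy of the recent past
        for i in range(1, order + 1):
            r_vec[i - 1] += w * x[n] * x[n - i]
            for k in range(1, order + 1):
                r_mat[i - 1][k - 1] += w * x[n - i] * x[n - k]
    return solve(r_mat, r_vec)

# Noise-free second-order recursion: the true coefficients are 1.5 and -0.8.
x = [1.0, 1.5]
for _ in range(58):
    x.append(1.5 * x[-1] - 0.8 * x[-2])
coeffs = weighted_lp(x, order=2, m=8)
```

On this noise-free signal the weighting does not change the solution, so the sketch recovers the generating coefficients. With additive noise, the energy weighting shifts the fit toward high-energy regions, where the local signal-to-noise ratio tends to be higher.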

#3Log-Spectral Magnitude MMSE Estimators under Super-Gaussian Densities

Richard Christian Hendriks (Delft University of Technology)
Richard Heusdens (Delft University of Technology)
Jesper Jensen (Oticon A/S)

Despite the fact that histograms of speech DFT coefficients are super-Gaussian, not much attention has been paid to developing estimators under these super-Gaussian distributions in combination with perceptually meaningful distortion measures. In this paper we present log-spectral magnitude MMSE estimators under super-Gaussian densities, resulting in an estimator that is perceptually more meaningful and in line with measured histograms of speech DFT coefficients. Compared to state-of-the-art reference methods, the presented estimator leads to an improvement of the segmental SNR in the order of 0.5 dB up to 1 dB. Moreover, listening tests show that the presented estimator leads to a significant improvement over state-of-the-art methods.
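
For readers unfamiliar with the metric, segmental SNR is commonly computed as the average of frame-wise SNR values in dB, usually clamped to a fixed range. The sketch below follows that common definition. The frame length and the clamping limits of -10 dB and 35 dB are assumed conventions, not values taken from the paper.

```python
import math

def segmental_snr(clean, enhanced, frame_len=160, lo=-10.0, hi=35.0):
    """Mean of per-frame SNRs in dB, each clamped to [lo, hi]."""
    values = []
    for start in range(0, len(clean) - frame_len + 1, frame_len):
        s = clean[start:start + frame_len]
        e = enhanced[start:start + frame_len]
        signal_energy = sum(v * v for v in s)
        error_energy = sum((a - b) ** 2 for a, b in zip(s, e))
        if signal_energy == 0.0:
            continue  # skip silent frames, where the ratio is undefined
        if error_energy == 0.0:
            snr = hi  # perfect reconstruction: use the upper limit
        else:
            snr = 10.0 * math.log10(signal_energy / error_energy)
        values.append(min(max(snr, lo), hi))
    if not values:
        raise ValueError("no usable frames")
    return sum(values) / len(values)

# Toy check: a 10 % amplitude error gives an error energy 20 dB below the signal.
clean = [math.sin(0.1 * n) for n in range(480)]
enhanced = [0.9 * v for v in clean]
result = segmental_snr(clean, enhanced)
```

Because each frame contributes equally regardless of its energy, improvements in low-energy frames affect the score as much as improvements in loud frames.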

#4Speech enhancement in a 2-dimensional area based on power spectrum estimation of multiple areas with investigation of existence of active sources

Yusuke Hioka (NTT Cyber Space Laboratories, NTT Corporation)
Kenichi Furuya (NTT Cyber Space Laboratories, NTT Corporation)
Yoichi Haneda (NTT Cyber Space Laboratories, NTT Corporation)
Akitoshi Kataoka (Faculty of Science and Technology, Ryukoku University)

A microphone array that emphasizes sound sources located in a particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and noise sounds using multiple fixed beamformers. However, that method requires the areas where the noise sources are located to be restricted. We describe the principle behind this limitation, then propose a procedure that first checks whether a sound source is likely to exist in the target area and in the other areas, in order to reduce the number of unknown power spectra to be estimated.

#5Modulation Domain Spectral Subtraction for Speech Enhancement

Kuldip Paliwal (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Belinda Schwerin (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Kamil Wojcicki (Signal Processing Laboratory, Griffith University, Queensland, Australia)

In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using subjective listening tests and objective speech quality evaluation we show that the proposed method results in improved speech quality. Furthermore, applying spectral subtraction in the modulation domain does not introduce the musical noise artifacts that are typically present after acoustic domain spectral subtraction. The proposed method also achieves better background noise reduction than the MMSE method.

#6Variational Loopy Belief Propagation for Multi-talker Speech Recognition

Steven Rennie (IBM)
John Hershey (IBM)
Peder Olsen (IBM)

We address single-channel speech separation and recognition by combining loopy belief propagation and variational inference methods. Inference is done in a graphical model consisting of an HMM for each speaker combined with the max interaction model of source combination. We present a new variational inference algorithm that exploits the structure of the max model to compute an arbitrarily tight bound on the probability of the mixed data. The variational parameters are chosen so that the algorithm scales linearly in the size of the language and acoustic models, and quadratically in the number of sources. The algorithm scores 30.7% on the SSC task [Cooke:09], which is the best published result by a method that scales linearly with speaker model complexity to date. The algorithm achieves average recognition error rates of 27%, 35%, and 51% on small datasets of SSC-derived speech mixtures containing two, three, and four sources, respectively, using a single audio channel.

#7Enhancement of Binaural Speech Using Codebook Constrained Iterative Binaural Wiener Filter

Nadir Cazi (Indian Institute of Science, Bangalore)
Thippur Sreenivas (Indian Institute of Science, Bangalore)

A clean speech VQ codebook has been shown to be effective in providing intraframe constraints and hence better convergence of the iterative Wiener filtering scheme for single channel speech enhancement. Here we present an extension of the single channel CCIWF scheme to binaural speech input by incorporating a speech distortion weighted multi-channel Wiener filter. The new algorithm shows considerable improvement over single channel CCIWF in each channel, in a diffuse noise field environment, in terms of a posteriori SNR and a speech intelligibility measure. Next, considering a moving speech source, good tracking performance is seen, up to a certain resolution.

#8A Semi-blind Source Separation Method with A Less Amount of Computation Suitable for Tiny DSP Modules

Kazunobu Kondo (Yamaha Corporation)
Makoto Yamada (Yamaha Corporation)
Hideki Kenmochi (Yamaha Corporation)

In this paper, we propose a method of implementing FDICA on tiny DSP modules. Firstly, we show a semi-blind separation matrix initialization step that consists of an estimation method using covariance fitting for a known source and an unknown source. It contributes to faster convergence and less computation. Secondly, a learning band selection step is shown that uses the determinant of the covariance matrix as a criterion for selection; this achieves a significant reduction in computation with practical separation performance. Finally, the effectiveness of the proposed method is evaluated via source separation simulations in anechoic and reverberant rooms, and a procedure and a resource presumption for the integrated method, which we call tinyICA, are also shown.

#9Model-based Speech Separation: Identifying Transcription using Orthogonality

Siu Wa Lee (The Chinese University of Hong Kong)
Frank K. Soong (Microsoft Research Asia)
Tan Lee (The Chinese University of Hong Kong)

Spectral envelopes and harmonics are the building elements of a speech signal. By estimating these elements, individual speech sources in a mixture observation can be reconstructed and hence separated. Transcription gives the spoken content. More importantly, it describes the expected sequence of spectral envelopes, provided that models of the different speech sounds are available. Our recently proposed single-microphone speech separation algorithm exploits this to derive the spectral envelope trajectories of individual sources and remove interference accordingly. This paper investigates the relationship between the correctness of transcription hypotheses and the orthogonality of associated source estimates. An orthogonality measure is introduced to quantify the correlation between spectrograms. Experiments verify that underlying true transcriptions lead to a salient orthogonality distribution, which is distinguishable from that of counterfeit transcriptions.

#10Enhanced Minimum Statistics Technique Incorporating Soft Decision For Noise Suppression

Yun-Sik Park (Inha University)
Ji-Hyun Song (Inha University)
Jae-Hun Choi (Inha University)
Joon-Hyuk Chang (Inha University)

In this paper, we propose a novel approach to noise power estimation for robust noise suppression in noisy environments. An investigation of state-of-the-art techniques for noise power estimation shows that the previously known methods are mostly accurate either during speech absence or during speech presence, but none of them works well in both situations. Our approach combines minimum statistics (MS) and soft decision (SD) techniques based on the probability of speech absence. The performance of the proposed approach is evaluated by a quantitative comparison method and a subjective test under various noise environments, and is found to yield better results than conventional MS and SD-based schemes.

#11Effect of Noise Reduction on Reaction Time to Speech in Noise

Mark Huckvale (UCL)
Jayne Leak (UCL)

In moderate levels of noise, listeners report that noise reduction (NR) processing can improve the perceived quality of a speech signal as measured on a typical MOS rating scale. Most quantitative experiments of intelligibility, however, show that NR reduces the intelligibility of noisy speech signals, and so should be expected to increase the cognitive effort required to process utterances. To study cognitive effort we look at how NR affects reaction times to speech in noise, using material that is still highly intelligible. We show that adding noise increases reaction times and that NR does not restore reaction times back to the quiet condition. The implication is that NR does not make speech "easier" to process, at least as far as this task is concerned.

#12Joint Noise Reduction and Dereverberation of Speech Using Hybrid TF-GSC and Adaptive MMSE Estimator

Behdad Dashtbozorg (Yazd University)
Hamid Reza Abutalebi (Yazd University)

This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. The method extends to dereverberation a hybrid noise reduction method that combines a Generalized Sidelobe Canceller (GSC) with a single-channel noise reduction stage. In this research, we employ the Transfer Function GSC (TF-GSC), which is more suitable for dereverberation. The single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator. We also modify the AMMSE estimator for the dereverberation application. Experimental results demonstrate the superiority of the proposed method in dereverberation of speech signals in noisy environments.

#13A Study on Multiple Sound Source Localization with a Distributed Microphone System

Kook Cho (Ritsumeikan University)
Takanobu Nishiura (Ritsumeikan University)
Yoichi Yamashita (Ritsumeikan University)

This paper describes a novel method for multiple sound source localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient between multiple channel pairs. After the estimation of the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation. Subsequently, the second sound source is searched again. To evaluate the effectiveness of the proposed method, experiments of multiple sound source localization were carried out in an actual office room. The result shows that multiple sound source localization accuracy is about 99.7%.

#14Robust Minimal Variance Distortionless Speech Power Spectra Enhancement

Tao Yu (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA)
John H. L. Hansen (CRSS: Center for Robust Speech System, University of Texas at Dallas, Texas, USA)

In this study, we propose a novel minimal variance distortionless speech power spectral enhancement algorithm, which is robust to real-world implementation issues. Our proposed method is implemented in the power spectral domain, where stochastic noise can be modeled by an exponential distribution, whose non-Gaussianity is explored using an order statistics filter. Both theoretical and experimental results show the effectiveness of our proposed method over traditional ones.

#15Speech Enhancement Minimizing Generalized Euclidean Distortion Using Supergaussian Priors

Amit Das (University of Colorado, Boulder and University of Texas, Dallas)
John H. L. Hansen (University of Texas, Dallas)

We introduce short time spectral estimators which minimize the weighted Euclidean distortion (WED) between the clean and estimated speech spectral components when clean speech is degraded by additive noise. The traditional minimum mean square error (MMSE) estimator does not take perceptual measures sufficiently into account during enhancement of noisy speech. However, the new estimators discussed in this paper provide greater flexibility to improve speech quality. We explore the cases when clean speech spectral magnitude and discrete Fourier transform (DFT) coefficients are modeled by super-Gaussian priors like Chi and bilateral Gamma distributions respectively. We also present the joint maximum a posteriori (MAP) estimators of the Chi distributed spectral magnitude and uniform phase. Performance evaluations over two noise types and three SNR levels demonstrate improved results of the proposed estimators.

#16STFT-Based Speech Enhancement by Reconstructing the Harmonics

Iman Haji Abolhassani (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)
Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, Canada)
Douglas O'Shaughnessy (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)

A novel Short Time Fourier Transform (STFT) based speech enhancement method is introduced. This method enhances the magnitude spectrum of a noisy speech segment. The new idea in this method is to reconstruct the harmonics at the multiples of the fundamental frequency (F0) rather than trying to improve them. The harmonics are produced, in the magnitude spectrum, using knowledge of the window function used for the STFT. These harmonics are then scaled and placed at multiples of F0. Experimental results show the effectiveness of this enhancement method in various noisy conditions and at various SNRs.

#17Joint Speech Enhancement and Speaker Identification Using Monte Carlo Methods

Ciira wa Maina (Drexel University)
John MacLaren Walsh (Drexel University)

We present an approach to speaker identification using noisy speech observations where the speech enhancement and speaker identification tasks are performed jointly. This is motivated by the belief that human beings perform these tasks jointly and that optimality may be sacrificed if sequential processing is used. We employ a Bayesian approach where the speech features are modeled using a mixture of Gaussians prior. A Gibbs sampler is used to estimate the speech source and the identity of the speaker. Preliminary experimental results are presented comparing our approach to a maximum likelihood approach and demonstrating the ability of our method to both enhance speech and identify speakers.

Tue-Ses3-P3:
Assistive Speech Technology

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair:Elmar Noeth

#1Personalizing synthetic voices for people with progressive speech disorders: judging voice similarity

Sarah Creer (University of Sheffield)
Stuart Cunningham (University of Sheffield)
Phil Green (University of Sheffield)
Kaniz Fatema (University of Kent)

In building personalized synthetic voices for people with speech disorders, the output should capture the individual's vocal identity. This paper reports a listener judgment experiment on the similarity of Hidden Markov Model based synthetic voices, built using varying amounts of adaptation data, to two non-impaired speakers. We conclude that around 100 sentences of data are needed to build a voice that retains the characteristics of the target speaker, but using more data improves the voice. Experiments using Multi-Layer Perceptrons (MLPs) are conducted to find which acoustic features contribute to the similarity judgments. Results show that mel-cepstral distortion and fraction of voicing agreement contribute most to replicating the similarity judgment but the combination of all features is required for accurate prediction. Ongoing work applies the findings to voice building for people with impaired speech.

#2Electrolaryngeal Speech Enhancement Based on Statistical Voice Conversion

Keigo Nakamura (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology)

This paper proposes a speaking-aid system for laryngectomees using GMM-based voice conversion that converts electrolaryngeal speech (EL speech) to normal speech. Because valid F0 information cannot be obtained from the EL speech, we have so far converted the EL speech to whisper. This paper conducts the EL speech conversion to normal speech using F0 contours estimated from the spectral information of the EL speech. The converted normal speech is experimentally evaluated to demonstrate its preference. Moreover, this paper experimentally investigates the output speech of our aid systems, that is, whisper or normal speech.

#3Age Recognition for Spoken Dialogue Systems: Do We Need It?

Maria Wolters (CSTR, University of Edinburgh)
Ravichander Vipperla (CSTR, University of Edinburgh)
Steve Renals (CSTR, University of Edinburgh)

When deciding whether to adapt relevant aspects of the system to the particular needs of older users, spoken dialogue systems often rely on automatic detection of chronological age. In this paper, we show that vocal ageing as measured by acoustic features is an unreliable indicator of the need for adaptation. Simple lexical features greatly improve the prediction of both relevant aspects of cognition and interaction style. Lexical features also boost age group prediction. We suggest that adaptation should be based on observed behaviour, not on chronological age, unless it is not feasible to build classifiers for relevant adaptation decisions.

#4Speech-based and Multimodal Media Center for Different User Groups

Markku Turunen (University of Tampere)
Jaakko Hakulinen (University of Tampere)
Aleksi Melto (University of Tampere)
Juho Hella (University of Tampere)
Juha-Pekka Rajaniemi (University of Tampere)
Erno Mäkinen (University of Tampere)
Jussi Rantala (University of Tampere)
Tomi Heimonen (University of Tampere)
Tuuli Laivo (University of Tampere)
Hannu Soronen (Tampere University of Technology)
Mervi Hansen (Tampere University of Technology)
Pellervo Valkama (University of Tampere)
Toni Miettinen (University of Tampere)
Roope Raisamo (University of Tampere)

We present a multimodal media center interface based on speech input, gestures, and haptic feedback. For special user groups, including visually and physically impaired users, the application features a zoomable context + focus GUI in tight combination with speech output and full speech-based control. These features have been developed in cooperation with representatives of the user groups. Evaluations of the system with regular users have been conducted and results from a study where subjective evaluations were collected show that the performance and user experience of speech input were very good, similar to results from a ten month public pilot use.

#5Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting

Samer Al Moubayed (KTH Centre for Speech Technology, Stockholm, Sweden)
Jonas Beskow (KTH Centre for Speech Technology, Stockholm, Sweden)
Anne-Marie Öster (KTH Centre for Speech Technology, Stockholm, Sweden)
Giampiero Salvi (KTH Centre for Speech Technology, Stockholm, Sweden)
Björn Granström (KTH Centre for Speech Technology, Stockholm, Sweden)
Nic Van Son (Viataal, Nijmegen, The Netherlands)
Ellen Ormel (Viataal, Nijmegen, The Netherlands)

In this paper we present recent results on the development of the SynFace lip synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large scale hearing impaired user studies carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold in Noise when using SynFace, and on measuring the effort scaling when using SynFace by hearing impaired people. Preliminary analysis of the results does not show significant gain in SRT or in effort scaling. But looking at inter-subject variability, it is clear that many subjects benefit from SynFace especially with speech with stereo babble noise.

#6Real-Time Correction of Closed-Captions

Patrick Cardinal (Centre de Recherche Informatique de Montréal)
Gilles Boulianne (Centre de Recherche Informatique de Montréal)

Live closed-captions for deaf and hard of hearing audiences are currently produced by stenographers, or by voice writers using speech recognition. Both techniques can produce captions with errors. We are currently developing a correction module that allows a user to intercept the real-time caption stream and correct it before it is broadcast. We report results of preliminary experiments on correction rate and actual user performance using a prototype correction module connected to the output of a speech recognition captioning system.

#7Universal Access: Speech Recognition for Talkers with Spastic Dysarthria

Harsh Vardhan Sharma (Beckman Institute for Advanced Science and Technology, Urbana, USA)
Mark Hasegawa-Johnson (Beckman Institute for Advanced Science and Technology, Urbana, USA)

This paper describes the results of our experiments in small and medium vocabulary dysarthric speech recognition, using the database being recorded by our group under the Universal Access initiative. We develop and test speaker-dependent, word- and phone-level speech recognizers utilizing the hidden Markov Model architecture; the models are trained exclusively on dysarthric speech produced by individuals diagnosed with cerebral palsy. The experiments indicate that (a) different system configurations (being word vs. phone based, number of states per HMM, number of Gaussian components per state specific observation probability density etc.) give useful performance (in terms of recognition accuracy) for different speakers and different task-vocabularies, and (b) for very low intelligibility subjects, speech recognition outperforms human listeners on recognizing dysarthric speech.

#8Exploring Speech Therapy Games with Children on the Autism Spectrum

Mohammed E Hoque (Massachusetts Institute of Technology)
Joseph K Lane (Massachusetts Institute of Technology)
Rana el Kaliouby (Massachusetts Institute of Technology)
Matthew Goodwin (Massachusetts Institute of Technology)
Rosalind Picard (Massachusetts Institute of Technology)

Individuals on the autism spectrum often have difficulties producing intelligible speech with either high or low speech rate, and atypical pitch and/or amplitude affect. In this study, we present a novel intervention towards customizing speech enabled games to help them produce intelligible speech. In this approach, we clinically and computationally identify the areas of speech production difficulties of our participants. We provide an interactive and customized interface for the participants to meaningfully manipulate the prosodic aspects of their speech. Over the course of 12 months, we have conducted several pilots to set up the experimental design, and developed a suite of games and audio processing algorithms for prosodic analysis of speech. Preliminary results indicate that our intervention is engaging and effective for our participants.

#9Analyzing GMMs to characterize resonance anomalies in speakers suffering from apnoea

José Luis Blanco (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Rubén Fernández (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
David Pardo (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Álvaro Sigüenza (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Luis A. Hernández (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
José Alcázar (Respiratory Department. Hospital Torrecardenas, Almeria, Spain)

Past research on the speech of apnoea patients has revealed that resonance anomalies are among the most distinguishing traits for these speakers. This paper presents an approach to characterize these peculiarities using GMMs and distance measures between distributions. We report the findings obtained with two analytical procedures, working with a purpose-designed speech database of both healthy and apnoea-suffering patients. First, we validate the database to guarantee that the models trained are able to describe the acoustic space in a way that may reveal differences between groups. Then we study abnormal nasalization in apnoea patients by considering vowels in nasal and non-nasal phonetic contexts. Our results confirm that there are differences between the groups, and that statistical modelling techniques can be used to describe this factor. Results further suggest that it would be possible to design an automatic classifier using such discriminative information.

#10On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

Thomas Drugman (Faculté Polytechnique de Mons)
Thomas Dubuisson (Faculté Polytechnique de Mons)
Thierry Dutoit (Faculté Polytechnique de Mons)

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevance of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. We discuss which characteristics are particularly informative or complementary for detecting voice pathologies.
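
A mutual information-based relevance check can be sketched for discrete data as follows. The sketch assumes the features have already been quantized to a few levels. The estimator is the plain plug-in estimate from counts, which is not necessarily what the paper uses, and the toy labels and features are hypothetical.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X;Y) in bits from paired discrete samples."""
    n = len(xs)
    joint = Counter(zip(xs, ys))
    px = Counter(xs)
    py = Counter(ys)
    return sum(
        (c / n) * math.log2(c * n / (px[x] * py[y]))
        for (x, y), c in joint.items()
    )

# Hypothetical toy data: class label (0 = normal, 1 = pathological) and two features.
labels = [0, 0, 1, 1]
feature_a = [0, 0, 1, 1]   # perfectly informative about the label
feature_b = [0, 1, 0, 1]   # independent of the label

relevance_a = mutual_information(feature_a, labels)
relevance_b = mutual_information(feature_b, labels)
redundancy_ab = mutual_information(feature_a, feature_b)
```

The same function gives both quantities discussed in the abstract: relevance is the mutual information between a feature and the class label, and redundancy is the mutual information between two features. Neither requires training a classifier.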

#11A System for Detecting Miscues in Dyslexic Read Speech

Morten Højfeldt Rasmussen (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Zheng-Hua Tan (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Børge Lindberg (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Søren Holdt Jensen (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)

While miscue detection in general is a well-explored research field, little attention has so far been paid to miscue detection in dyslexic read speech. This domain differs substantially from the domains that are commonly researched; for example, dyslexic read speech includes frequent regressions and long pauses between words. A system detecting miscues in dyslexic read speech is presented. It includes an ASR component employing a forced-alignment-like grammar adjusted for dyslexic input and uses the GOP score and phone duration to accept or reject the read words. Experimental results show that the system detects miscues at a false alarm rate of 5.3% and a miscue detection rate of 40.1%. These results are worse than those of current state-of-the-art reading tutors, perhaps indicating that dyslexic read speech is a challenge to handle.
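
The two reported figures can be related to per-word decisions with a small sketch. It assumes the usual definitions: the detection rate is the fraction of true miscues that the system rejects, and the false alarm rate is the fraction of correctly read words that the system rejects. The abstract does not spell these definitions out, and the example data are hypothetical.

```python
def miscue_rates(is_miscue, flagged):
    """Return (detection_rate, false_alarm_rate) from per-word booleans."""
    miscues = sum(1 for m in is_miscue if m)
    correct = len(is_miscue) - miscues
    if miscues == 0 or correct == 0:
        raise ValueError("need at least one miscue and one correctly read word")
    hits = sum(1 for m, f in zip(is_miscue, flagged) if m and f)
    false_alarms = sum(1 for m, f in zip(is_miscue, flagged) if not m and f)
    return hits / miscues, false_alarms / correct

# Hypothetical six-word reading: two real miscues, one of them detected,
# and one correctly read word wrongly rejected.
is_miscue = [True, False, False, True, False, False]
flagged = [True, False, True, False, False, False]
detection_rate, false_alarm_rate = miscue_rates(is_miscue, flagged)
```

Here the detection rate is 1/2 and the false alarm rate is 1/4. Tightening the acceptance threshold on a score such as GOP would typically raise both rates together, which is why the two numbers are reported as a pair.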

Wed-Ses0-K:
Deb Roy - New Horizons in the Study of Language Development

Time:Wednesday 08:30 Place:Main Hall Type:Keynote
Chair:Roger Moore

08:30New Horizons in the Study of Language Development

Deb Roy (MIT Media Lab)

Emerging forms of ecologically-valid longitudinal recordings of human behavior and social interaction promise fresh perspectives on age-old questions of child development. In a pilot effort, 240,000 hours of audio and video recordings of one child’s life at home are being analyzed with a focus on language development. To study a corpus of this scale and richness, current methods of developmental sciences are insufficient. New data analysis algorithms and methods for interpretation and computational modeling are under development. Preliminary speech analysis reveals surprising levels of linguistic “finetuning” by caregivers that may provide crucial support for word learning. Ongoing analysis of various other aspects of the corpus aims to model detailed aspects of the child’s language development as a function of learning mechanisms combined with everyday experience. Plans to collect similar corpora from more children based on a streamlined recording system are underway.

Wed-Ses1-O1:
Speaker verification & identification II

Time:Wednesday 10:00 Place:Main Hall Type:Oral
Chair:Jean-Francois Bonastre

10:00Does Session Variability Compensation in Speaker Recognition Model Intrinsic Variation Under Mismatched Conditions?

Elizabeth Shriberg (SRI International)
Sachin Kajarekar (SRI International)
Nicolas Scheffer (SRI International)

Intersession variability (ISV) compensation in speaker recognition is well studied with respect to extrinsic variation, but little is known about its ability to model intrinsic variation. We find that ISV compensation is remarkably successful on a corpus of intrinsic variation that is highly controlled for channel (a dominant component of ISV). The results are particularly surprising because the ISV training data come from a different corpus than do speaker train and test data. We further find that relative improvements are (1) inversely related to uncompensated performance, (2) reduced more by vocal effort train/test mismatch than by speaking style mismatch, and (3) reduced additionally for mismatches in both style and level. Results demonstrate that intersession variability compensation does model intrinsic variation, and suggest that mismatched data may be more useful than previously expected for modeling certain types of within-speaker variability in speech.

10:20Variability Compensated Support Vector Machines Applied to Speaker Verification

Zahi Karam (DSPG, Research Laboratory of Electronics at MIT & MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)

Speaker verification using SVMs has proven successful, specifically using the GSV Kernel with NAP. Also, the recent popularity and success of JFA has led to promising attempts to use speaker factors directly as SVM features. NAP projection and the use of speaker factors are methods of handling variability: NAP by removing nuisance variability, and speaker factors by forcing the discrimination to be performed based on inter-speaker variability. These successes have led us to propose a new method we call VCSVM to handle both inter- and intra-speaker variability directly in the SVM optimization. VCSVM adds a regularized penalty to the optimization that biases the normal to the hyperplane to be orthogonal to the nuisance subspace or alternatively the complement of the inter-speaker variability subspace. The bias attempts to emphasize inter-speaker variability while ignoring intra-speaker variability. This paper presents the VCSVM theory and promising results on nuisance compensation.

10:40Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification

Najim Dehak (CRIM-ETS)
Réda Dehak (LRDE-EPITA)
Patrick Kenny (CRIM)
Niko Brummer (Agnitio)
Pierre Ouellet (CRIM)
Pierre Dumouchel (CRIM-ETS)

This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as a feature extractor. In this modeling, the JFA is used to define a new low-dimensional space named the total variability factor space, instead of both channel and speaker variability spaces for the classical JFA. The main contribution in this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first system is Support Vector Machines based, and the second one uses this kernel directly as a decision score. This last scoring method makes the process faster and less computationally complex than other classical methods. We tested several intersession compensation methods in total factors, and we found that the combination of Linear Discriminant Analysis and Within Class Covariance Normalization achieved the best performance.
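
The idea of using the cosine kernel directly as a decision score can be sketched as follows. This is a minimal illustration, assuming that the target and test utterances have already been reduced to total variability factor vectors; the function names and the threshold value are hypothetical, and no intersession compensation is applied.

```python
import math

def cosine_score(w_target, w_test):
    """Cosine similarity between two total-factor vectors of equal length."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)

def accept(w_target, w_test, threshold=0.5):
    """Accept the claimed identity if the score reaches the threshold.

    The threshold here is a placeholder; in practice it would be tuned
    on development data.
    """
    return cosine_score(w_target, w_test) >= threshold
```

Because the score depends only on the angle between the two vectors, no enrollment-time model training is needed, which is consistent with the speed advantage the abstract describes.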

11:00Within-Session Variability Modelling for Factor Analysis Speaker Verification

Robbie Vogt (Speech Research Lab, QUT)
Jason Pelecanos (IBM T.J. Watson Research Center)
Nicolas Scheffer (SRI International)
Sachin Kajarekar (SRI International)
Sridha Sridharan (Speech Research Lab, QUT)

This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over the current state of the art.

11:20Speaker Recognition by Gaussian Information Bottleneck

Ron M Hecht (Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel)
Elad Noor (The Weizmann Institute of Science, Rehovot, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information theoretic framework - the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the relevant information for speaker identification. This paper focuses on a continuous version of the IB method known as the Gaussian Information Bottleneck (GIB). This version assumes that both the source and target variables are high dimensional multivariate Gaussian variables. The GIB was applied in our work to the Super Vector (SV) dimension reduction conundrum. Experiments were conducted on the male part of the NIST SRE 2005 corpora. The GIB representation was compared to other dimension reduction techniques and to a baseline system. In our experiments, the GIB outperformed the baseline system, achieving a 6.1% Equal Error Rate (EER) compared to the 15.1% EER of the baseline system.

11:40Variational Dynamic Kernels for Speaker Verification

Chris Longworth (Cambridge University Engineering Department)
Rogier van Dalen (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)

An important aspect of SVM-based speaker verification is the choice of dynamic kernel. Recently there has been interest in the use of kernels based on the Kullback-Leibler divergence between GMMs. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. All GMMs must contain the same number of components and must be adapted from a single background model. For many tasks this will not be optimal. In this paper, dynamic kernels are proposed based on alternative, variational approximations to the KL divergence. Unlike the matched-pair bound, these do not restrict the forms of GMM that may be used. Additionally, using a more accurate approximation of the divergence may lead to performance gains. Preliminary results using these kernels are presented on the NIST 2002 SRE dataset.
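
To make the idea concrete, the sketch below implements one well-known variational approximation to the KL divergence between two GMMs. It is an illustration only: the abstract does not say which approximations the authors use, and the one-dimensional Gaussians and the `(weight, mean, variance)` tuple layout are assumptions chosen to keep the example short.

```python
import math

def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL divergence between 1-D Gaussians N(m1, v1) and N(m2, v2),
    where v1 and v2 are variances."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)

def variational_kl(f, g):
    """Variational approximation to KL(f || g) for two 1-D GMMs.

    f, g: lists of (weight, mean, variance) tuples. The two mixtures may
    have different numbers of components and need not share a background model.
    """
    total = 0.0
    for w_a, m_a, v_a in f:
        # Similarity of component a to the components of its own mixture ...
        num = sum(w * math.exp(-kl_gauss(m_a, v_a, m, v)) for w, m, v in f)
        # ... and to the components of the other mixture.
        den = sum(w * math.exp(-kl_gauss(m_a, v_a, m, v)) for w, m, v in g)
        total += w_a * math.log(num / den)
    return total

# Two mixtures with different component counts, which a matched-pair bound
# cannot handle directly.
f = [(0.5, 0.0, 1.0), (0.5, 3.0, 1.0)]
g = [(1.0, 1.5, 4.0)]
d_fg = variational_kl(f, g)
```

For single-component mixtures the approximation reduces to the exact Gaussian KL divergence, and it is zero when both mixtures are identical.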

Wed-Ses1-O2:
Emotion and Expression I

Time:Wednesday 10:00 Place:East Wing 1 Type:Oral
Chair:Ailbhe Ni Chasaide

10:00Emotion dimensions and formant position

Martijn Bastiaan Goudbeek (University of Tilburg, the Netherlands / Swiss Center for Affective Sciences, Geneva, Switzerland)
Jean Philippe Goldman (Language Technology Laboratory, University of Geneva, Switzerland)
Klaus Scherer (Swiss Center for Affective Sciences, Switzerland)

The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional speech. The frequencies of the first and second formant of the vowels /i/, /u/, and /a/ were measured and shown to be significantly affected by emotion dimension. High arousal resulted in a higher mean F1 in all vowels, whereas positive valence resulted in higher mean values for F2. The dimension potency/control showed a pattern of effects that was consistent with a larger vocalic triangle for emotions high in potency/control. The results are interpreted in the context of Scherer's component process model.

10:20Identifying Uncertain Words within an Utterance via Prosodic Features

Heather Pon-Barry (Harvard University)
Stuart Shieber (Harvard University)

We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word-level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate 'target word' segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.

10:40Evaluating Evaluators: A Case Study in Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling

Emily Mower (University of Southern California)
Maja J Mataric (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Emotion perception is a complex process, often measured using stimuli presentation experiments that query evaluators for their perceptual ratings of emotional cues. These evaluations contain variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model emotion perception at the individual level. However, the reported perception of users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by creating averaged evaluator models. We demonstrate that this averaging improves classification performance compared to models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations.

11:00Responding to User Emotional State by Adding Emotional Coloring to Utterances

Jaime Acosta (University of Texas at El Paso)
Nigel Ward (University of Texas at El Paso)

When people speak to each other, they share a rich set of nonverbal behaviors such as varying prosody in voice. These behaviors, sometimes interpreted as demonstrations of emotions, call for appropriate responses, but today’s spoken dialog systems lack the ability to do so. We collected a corpus of persuasive dialogs, specifically conversations about graduate school between a staff member and students, and had judges label all utterances with triples indicating the perceived emotions, using the three dimensions: activation, evaluation, and power. We found immediate response patterns, in which the staff member colored her utterances in response to the emotion shown by the student in the immediately previous utterance, and built a predictive model suitable for use in a dialog system to persuasively discuss graduate school with students.

11:20Analysis of Laugh Signals for Detecting in Continuous Speech

Sudheer Kumar K (International Institute of Information Technology, Hyderabad, India)
Sri Harish Reddy M (International Institute of Information Technology, Hyderabad, India)
Sri Rama Murty K (Indian Institute of Technology Madras, Chennai, India)
Yegnanarayana B (International Institute of Information Technology, Hyderabad, India)

Laughter is a nonverbal vocalization that occurs often in speech communication. Since laughter is produced by the speech production mechanism, spectral analysis methods are used mostly for the study of laughter acoustics. In this paper the significance of excitation features for discriminating laughter and speech is discussed. New features describing the excitation characteristics are used to analyze the laugh signals. The features are based on instantaneous pitch and strength of excitation at epochs. An algorithm is developed based on these features to detect laughter regions in continuous speech. The results are illustrated by detecting laughter regions in a TV broadcast program.

11:40Data-driven Clustering in Emotional Space for Affect Recognition Using Discriminatively Trained LSTM Networks

Martin Woellmer (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Ellen Douglas-Cowie (Queen's University Belfast)
Roddy Cowie (Queen's University Belfast)

In today's affective databases, speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to rough classes in order to reduce the multiplicity of emotion dependent system responses. Since these classes often do not optimally reflect emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. To this end, we consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural net that explicitly learns clusters in emotional space and additionally models context information outperforms both Support Vector Machines and a Regression-LSTM net.

Wed-Ses1-O3:
Automatic Speech Recognition: Adaptation II

Time:Wednesday 10:00 Place:East Wing 2 Type:Oral
Chair:Satoshi Nakamura

10:00On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy

Santiago Omar Caballero Morales (University of East Anglia, School of Computing Sciences)
Stephen Cox (University of East Anglia, School of Computing Sciences)

In previous work, we described how learning the pattern of recognition errors made by an individual using a certain ASR system leads to increased recognition accuracy compared with a standard MLLR adaptation approach. This was the case for low-intelligibility speakers with dysarthric speech, but no improvement was observed for normal speakers. In this paper, we describe an alternative method for obtaining the training data for confusion-matrix estimation for normal speakers which is more effective than our previous technique. We also address the issue of data sparsity in estimation of confusion-matrices by using non-negative matrix factorization (NMF) to discover structure within them. The confusion-matrix estimates made using these techniques are integrated into the ASR process using a technique termed "metamodels", and the results presented here show statistically significant gains in word recognition accuracy when applied to normal speech.

10:20A Study on Soft Margin Estimation of Linear Regression Parameters for Speaker Adaptation

Shigeki Matsuda (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Yu Tsao (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Jinyu Li (Speech Component Group, Microsoft Corporation)
Satoshi Nakamura (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Chin-Hui Lee (School of Electrical and Computer Engineering, Georgia Institute of Technology)

We formulate a framework for soft margin estimation-based linear regression (SMELR) and apply it to supervised speaker adaptation. Enhanced separation capability and increased discriminative ability are two key properties in margin-based discriminative training. For the adaptation process to be able to flexibly utilize any amount of data, we also propose a novel interpolation scheme to linearly combine the speaker independent (SI) and speaker adaptive SMELR (SMELR/SA) models. The two proposed SMELR algorithms were evaluated on a Japanese large vocabulary continuous speech recognition task. Both the SMELR and interpolated SI+SMELR/SA techniques showed improved speech adaptation performance in comparison with the well-known maximum likelihood linear regression (MLLR) method. We also found that the interpolation framework works even more effectively than SMELR when the amount of adaptation data is relatively small.

10:40Exploring the Role of Spectral Smoothing in context of Children's Speech Recognition

Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)

This work is motivated by our earlier study, which shows that explicit pitch normalization improves children's speech recognition performance on models trained on adults' speech by reducing pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in the context of children's speech recognition. The spectral smoothing has been effected in the feature domain by two approaches, viz., modification of the bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, both approaches give significant improvement in children's speech recognition performance, with 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children's speech recognition on models trained on adults' speech.

11:00Unsupervised Lattice-based Acoustic Model Adaptation for Speaker-Dependent Conversational Telephone Speech Transcription

Kit Thambiratnam (Microsoft Research)
Frank Seide (Microsoft Research)

This paper examines the application of lattice adaptation techniques to speaker-dependent models for the purpose of conversational telephone speech transcription. Given sufficient training data per speaker, it is feasible to build adapted speaker-dependent models using lattice MLLR and lattice MAP. Experiments on iterative and cascaded adaptation are presented. Additionally, various strategies for thresholding frame posteriors are investigated, and it is shown that accumulating statistics from the local best-confidence path is sufficient to achieve optimal adaptation. Overall, an iterative cascaded lattice system was able to reduce WER by 7.0% abs., which was a 0.8% abs. gain over transcript-based adaptation. Lattice adaptation reduced the unsupervised/supervised adaptation gap from 2.5% to 1.7%.

11:20Rapid Unsupervised Adaptation Using Frame Independent Output Probabilities of Gender and Context Independent Phoneme Models

Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories)
Atsunori OGAWA (NTT Communication Science Laboratories)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories)

Business is demanding higher recognition accuracy with no increase in computation time compared to previously adopted baseline speech recognition systems. Accuracy can be improved by adding a gender dependent acoustic model and unsupervised adaptation based on CMLLR. CMLLR-based batch-type unsupervised adaptation estimates a single global transformation matrix by utilizing prior unsupervised labeling, which unfortunately increases the computation time. Our proposed technique reduces prior gender selection and labeling time by using frame independent output probabilities of only gender dependent speech GMM and monophone HMM in a dual-gender acoustic model. The proposed technique further raises accuracy by employing a power term after adaptation. Simulations using spontaneous speech show that the proposed technique reduces computation time by 17.9% and the relative error in correct rate by 13.7% compared to the baseline without prior gender selection and unsupervised adaptation.

11:40Bark-shift based nonlinear speaker normalization using the second subglottal resonance

Shizhen Wang (University of California, Los Angeles)
Yi-Hui Lee (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)

In this paper, we propose a Bark-scale shift based piecewise nonlinear warping function for speaker normalization, and a joint frequency discontinuity and energy attenuation detection algorithm to estimate the second subglottal resonance (Sg2). We then apply Sg2 for rapid speaker normalization. Experimental results on children's speech recognition show that the proposed nonlinear warping function is more effective for speaker normalization than linear frequency warping. Compared to maximum likelihood based grid search methods, Sg2 normalization is more efficient and achieves comparable or better performance, especially for limited normalization data.
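
The basic notion of warping frequencies by a shift on the Bark scale can be sketched as below. This is only a uniform shift, not the piecewise function proposed in the paper, and it does not estimate Sg2. The Hz-to-Bark conversion used here (Traunmüller's formula) and the function names are assumptions; the abstract does not state which conversion the authors use.

```python
def hz_to_bark(f_hz):
    """Hz to Bark using Traunmueller's approximation (assumed here)."""
    return 26.81 * f_hz / (1960.0 + f_hz) - 0.53

def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)

def bark_shift_warp(f_hz, shift_bark):
    """Warp a frequency by adding a constant offset on the Bark scale."""
    return bark_to_hz(hz_to_bark(f_hz) + shift_bark)

# A constant Bark shift moves high frequencies further in Hz than low ones,
# so the warping is nonlinear on the Hz axis.
low_delta = bark_shift_warp(500.0, 0.5) - 500.0
high_delta = bark_shift_warp(3000.0, 0.5) - 3000.0
```

In a speaker normalization setting, the shift would be chosen per speaker, for example from an estimated resonance, rather than fixed as in this toy usage.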

Wed-Ses1-O4:
Voice Transformation I

Time:Wednesday 10:00 Place:East Wing 3 Type:Oral
Chair:Yannis Stylianou

10:00Many-to-many eigenvoice conversion with reference voice

Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology)

We propose many-to-many voice conversion (VC) techniques to convert an arbitrary source voice into an arbitrary target voice. We have previously proposed one-to-many eigenvoice conversion (EVC) and many-to-one EVC. In EVC, an eigenvoice GMM (EV-GMM) is trained in advance using multiple parallel data sets of a reference speaker and many pre-stored speakers. The EV-GMM is flexibly adapted to an arbitrary speaker using a small amount of data. In this paper, we realize many-to-many VC by sequentially performing many-to-one EVC and one-to-many EVC through the reference speaker using the same EV-GMM. Experimental results demonstrate the effectiveness of the proposed method.

10:20Alleviating the One-to-Many Mapping Problem in Voice Conversion with Context-Dependent Modeling

Elizabeth Godoy (Orange Labs)
Olivier Rosec (Orange Labs)
Thierry Chonavel (Telecom Bretagne)

This paper addresses the "one-to-many" mapping problem in Voice Conversion (VC) by exploring source-to-target mappings in GMM-based spectral transformation. Specifically, we examine differences using source-only versus joint source/target information in the classification stage of transformation, effectively illustrating a "one-to-many effect" in the traditional acoustically-based GMM. We propose combating this effect by using phonetic information in the GMM learning and classification. We then show the success of our proposed context-dependent modeling with transformation results using an objective error criterion. Finally, we discuss implications of our work in adapting current approaches to VC.

10:40Efficient Modeling of Temporal Structure of Speech For Applications in Voice Transformation

Binh Phu Nguyen (School of Information Science, Japan Advanced Institute of Science and Technology)
Akagi Masato (School of Information Science, Japan Advanced Institute of Science and Technology)

The aim of voice transformation is to change the style of given utterances. Most voice transformation methods process speech signals in a time-frequency domain. In the time domain, when processing spectral information, conventional methods do not consider relations between neighboring frames. If unexpected modifications happen, there are discontinuities between frames, which leads to degradation of the speech quality. This paper proposes a new modeling of the temporal structure of speech to ensure the smoothness of the transformed speech and so improve the speech quality in voice transformation. We propose an improvement of the temporal decomposition (TD) technique to model the temporal structure of speech. The TD is used to ensure the smoothness of the transformed speech. We investigate the TD in two applications, concatenative speech synthesis and spectral voice conversion. Experimental results confirm the effectiveness of TD in terms of improving the quality of the transformed speech.

11:00Cross-Language Voice Conversion Based on Eigenvoices

Malorie Charlier (Faculté Polytechnique de Mons)
Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Alexis Moinet (Faculté Polytechnique de Mons)
Thierry Dutoit (Faculté Polytechnique de Mons)

This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers who speak different languages. In general, parallel data consisting of utterance pairs of those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC because the EVC framework can develop the conversion model without using parallel data. The results of subjective evaluations demonstrate that the proposed method yields significant performance improvements compared with a conventional cross-language VC method based on frame selection.

11:20Voice Conversion using K-Histograms and Frame Selection

Alejandro José Uriz (FI-UNMDP)
Pablo Daniel Agüero (FI-UNMDP)
Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain)
Juan Carlos Tulli (FI-UNMDP)

The goal of voice conversion systems is to modify the voice of a source speaker to be perceived as if it had been uttered by another specific speaker. Many approaches found in the literature are based on statistical models and introduce oversmoothing in the target features. Our proposal is a new model that combines several techniques used in unit selection for text-to-speech and a non-Gaussian transformation mathematical model. Subjective results support the proposed approach.

11:40Online Model Adaptation for Voice Conversion using Model-based Speech Synthesis Technique

Dalei Wu (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Baojie Li (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Hui Jiang (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Qianjie Fu (House Ear Institute, 2100 West Third Street, Los Angeles, CA 90057, USA)

In this paper, we present a novel voice conversion method using model-based speech synthesis that can be used for some applications where prior knowledge or training data is not available from the source speaker. In the proposed method, training data from a target speaker is used to build a GMM-based speech model and voice conversion is then performed for each utterance from the source speaker according to the pre-trained target speaker model. To reduce the mismatch between source and target speakers, online model adaptation is proposed to improve model selection accuracy, based on maximum likelihood linear regression (MLLR). Objective and subjective evaluations suggest that the proposed methods are quite effective in generating acceptable voice quality for voice conversion even without training data from source speakers.

Wed-Ses1-S1:
Special Session: Lessons and Challenges Deploying Voice Search

Time:Wednesday 10:00 Place:East Wing 4 Type:Special
Chair:Mike Cohen & Mike Phillips

10:00Role of Natural Language Understanding in Voice Local Search

Junlan Feng (AT&T Labs Research)
Srinivas Bangalore (AT&T Labs Research)
Mazin Gilbert (AT&T Labs Research)

Speak4it is a voice-enabled local search system currently available for iPhone devices. The natural language understanding (NLU) component is one of the key technology modules in this system. The role of NLU in voice-enabled local search is twofold: (a) parse the automatic speech recognition (ASR) output (1-best and word lattices) into meaningful segments that contribute to high-precision local search, and (b) understand the user’s intent. This paper is concerned with the first task of NLU. In previous work, we presented a scalable approach to parsing, which is built upon a text indexing and search framework and can also parse ASR lattices. In this paper, we propose an algorithm to improve the baseline by extracting the “subjects” of the query. Experimental results indicate that lattice-based query parsing outperforms ASR 1-best based parsing by 2.1% absolute and that extracting subjects in the query improves the robustness of search.

10:20Recognition and Correction of Voice Web Search Queries

Keith Vertanen (University of Cambridge)
Per Ola Kristensson (University of Cambridge)

In this work we investigate how to recognize and correct voice web search queries. We describe our corpus of web search queries and show how it was used to improve the accuracy of recognition. We show that using a search-specific vocabulary with automatically generated pronunciations is superior to using a vocabulary limited to a fixed pronunciation dictionary. We conducted a formative user study to investigate recognition and correction aspects of voice search in a mobile context. In the user study, we found that despite a word error rate of 48%, users were able to speak and correct search queries in about 18 seconds. Users did this while walking around using a mobile touch-screen device.

10:40Voice Search and Everything Else – What Users Are Saying to the Vlingo Top Level Voice UI

Chao Wang (Vlingo)

No abstract available.

11:00Searching Google by Voice

Johan Schalkwyk (Google)

No abstract available.

11:20Multiple-hypotheses searches from deeply parsed requests to multiple-evidences scoring: the DeepQA challenge

Roberto Sicconi (IBM)

No abstract available.

11:40Research Areas in Voice Search: Lessons from Microsoft Deployments

Geoffrey Zweig (Microsoft)

No abstract available.

Wed-Ses1-P1:
Phonetics, Phonology, cross-language comparisons, pathology

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:Valerie Hazan

#1Fast Transcription of Unstructured Audio Recordings

Brandon Roy (MIT Media Laboratory)
Deb Roy (MIT Media Laboratory)

We introduce a new method for human-machine collaborative speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing algorithms are used to robustly detect speech in audio recordings and split speech into short, easy-to-transcribe segments. Sequences of speech segments are loaded into a transcription interface that enables a human transcriber to simply listen and type, obviating the need for manually finding and segmenting speech or explicitly controlling audio playback. As a result, playback stays synchronized to the transcriber's speed of transcription. In evaluations using naturalistic audio recordings made in everyday home situations, the new method is up to 6 times faster than other popular transcription tools while preserving transcription quality.

#2Finding Allophones: an Evaluation on Consonants in the TIMIT Corpus

Timothy Kempton (University of Sheffield)
Roger Moore (University of Sheffield)

Phonemic analysis, the process of identifying the contrastive sounds in a language, involves finding allophones, the phonetic variants of those contrastive sounds. An algorithm for finding allophones (developed by Peperkamp et al.) is evaluated on consonants in the TIMIT acoustic phonetic transcripts. A novel phonetic filter based on the active articulator is introduced and has a higher recall than previous filters. The combined retrieval performance, measured by area under the ROC curve, is 83%. The system implemented can process any language transcribed in IPA and is currently being used to assist the phonemic analysis of unwritten languages.
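
For readers unfamiliar with the retrieval measure quoted above, the following is a minimal sketch of area under the ROC curve, computed through its rank interpretation: the probability that a randomly chosen positive item is scored above a randomly chosen negative one, with ties counted as one half. The function name, scores and labels are hypothetical; the abstract does not describe how the authors computed the figure.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison.

    scores: higher means "more likely a true allophone pair" (assumed).
    labels: truthy for positive items, falsy for negative items.
    """
    positives = [s for s, y in zip(scores, labels) if y]
    negatives = [s for s, y in zip(scores, labels) if not y]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative item")

    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0   # positive correctly ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(positives) * len(negatives))


if __name__ == "__main__":
    # Hypothetical scores for four candidate phone pairs.
    print(roc_auc([0.9, 0.2, 0.8, 0.1], [1, 1, 0, 0]))
```

The pairwise form is quadratic in the number of items, which is fine for illustration; a sort-based version would be preferable for large candidate sets.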

#3Automatic formant extraction for sociolinguistic analysis of large corpora

Keelan Evanini (University of Pennsylvania)
Stephen Isard (University of Pennsylvania)
Mark Liberman (University of Pennsylvania)

In this paper, we propose a method of formant prediction from pole and bandwidth data, and apply this method to automatically extract F1 and F2 values from a corpus of regional dialect variation in North America that contains 134,000 manual formant measurements. These predicted formants are shown to increase performance over the default formant values from a popular speech analysis package. Finally, we demonstrate that sociolinguistic analysis based on vowel formant data can be conducted reliably using the automatically predicted values, and we argue that sociolinguists should begin to use this methodology in order to be able to analyze larger amounts of data efficiently.
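
The abstract does not say how the pole and bandwidth data are obtained. As background only, and not as the authors' prediction method, the sketch below shows the standard relation used in LPC analysis to convert a complex pole of the all-pole filter into a candidate formant frequency and bandwidth. The function name and the example values are assumptions made for illustration.

```python
import cmath
import math


def pole_to_formant(pole: complex, sample_rate: float):
    """Convert one complex LPC pole to (frequency_hz, bandwidth_hz).

    frequency = angle(pole) * fs / (2 * pi)
    bandwidth = -ln(|pole|) * fs / pi
    Poles closer to the unit circle give narrower bandwidths.
    """
    frequency_hz = cmath.phase(pole) * sample_rate / (2.0 * math.pi)
    bandwidth_hz = -math.log(abs(pole)) * sample_rate / math.pi
    return frequency_hz, bandwidth_hz


if __name__ == "__main__":
    # Hypothetical pole: radius 0.95, angle matching 500 Hz at a 10 kHz sampling rate.
    fs = 10000.0
    pole = 0.95 * cmath.exp(1j * 2.0 * math.pi * 500.0 / fs)
    print(pole_to_formant(pole, fs))
```

A formant tracker typically produces several such candidates per frame; choosing which candidate corresponds to F1 or F2 is the step a prediction method has to solve.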

#4Investigating phonetic information reduction and lexical confusability

William Hartmann (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)

In the presence of pronunciation variation and the masking effects of additive noise, we investigate the role of phonetic information reduction and lexical confusability on ASR performance. Contrary to previous work (Briscoe, 1989), we show that place of articulation as a representation for unstressed segments performs at least as well as manner of articulation in the presence of additive noise. Methods of phonetic reduction introduce lexical confusability, which negatively impacts performance. By limiting this confusability, recognizers that employ high levels of phonetic reduction (40.1%) can perform as well as a baseline system in the presence of nonstationary noise.

#5Improving phone recognition performance via phonetically-motivated units

Hyejin Hong (Department of Linguistics, Seoul National University, Seoul, Korea)
Minhwa Chung (Department of Linguistics, Seoul National University, Seoul, Korea)

This paper examines how phonetically-motivated units affect the performance of phone recognition systems. Focusing on the realization of /h/, which is one of the most error-prone phones in Korean phone recognition, three different phone sets are designed by considering optional phonetic constraints which show complementary distributions. Experimental results show that one of the proposed sets, the h-deletion set, improves phone recognition performance compared to the baseline phone recognizer. It is noteworthy that this set needs no additional phonetic unit, which means that no additional HMMs need to be modeled; accordingly, it has an advantage in terms of model size. In addition, it achieves competitive performance compared to the baseline system in terms of word recognition. Thus, this phonetically-motivated approach to improving phone recognition performance is expected to be useful in embedded solutions, which require a fast and lightweight recognition process.

#6An Evaluation of Formant Tracking method on an Arabic labeled Database

Imen Jemaa (Unite de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Oussama Rekhis (Unite de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Kais Ouni (Unite de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Yves Laprie (Equipe Parole, LORIA Nancy, France)

In this paper we present a labeled Arabic database of the first three formant tracks. This database is used to evaluate a new automatic formant tracking algorithm based on Fourier ridge detection. In this method we have introduced a continuity constraint based on the computation of the center of gravity for a set of formant frequency candidates. This connects each frame of speech to its neighbours and thus improves the robustness of the tracking. The formant trajectories obtained from the proposed algorithm are compared to those manually labeled in the database and to those given by the LPC-based Praat tool.

#7Comparison of Manual and Automated Estimates of Subglottal Resonances

Wolfgang Wokurek (IMS Uni Stuttgart)
Andreas Madsack (IMS Uni Stuttgart)

This study compares manual measurements of the first two subglottal resonances to the results of an automated measurement procedure for the same quantities. We also briefly sketch the sensor prototype that is used for the measurements. The subglottal resonances are presented in the space spanned by the vowels' first two formants. A three-axis acceleration sensor is gently pressed against the neck of the speaker. In front of the ligamentum conicum, located near the lower end of the larynx, pressure signals may be recorded that follow the subglottal pressure changes at least up to 2 kHz bandwidth. The recordings of the subglottal pressure signals are made simultaneously with recordings of the electroglottogram and the acoustic speech sound with 12 male and 12 female speakers.

#8Using durational cues in a computational model of spoken-word recognition

Odette Scharenborg (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)

Evidence that listeners use durational cues to help resolve temporarily ambiguous speech input has accumulated over the past few years. In this paper, we investigate whether durational cues are also beneficial for word recognition in a computational model of spoken-word recognition. Two sets of simulations were carried out using the acoustic signal as input. The simulations showed that the computational model, like humans, benefits from durational cues during word recognition and uses them to disambiguate the speech signal. These results thus provide support for the theory that durational cues play a role in spoken-word recognition. Index Terms: duration, spoken-word recognition, computational modelling

#9Second language discrimination vowel contrasts by adults speakers with a five vowel system

Bianca Sisinni (CRIL (Centro di Ricerca Interdisciplinare sul Linguaggio) - Salento University (Lecce - Italy))
Mirko Grimaldi (CRIL (Centro di Ricerca Interdisciplinare sul Linguaggio) - Salento University (Lecce - Italy))

This study tests the ability of a group of Salento Italian undergraduate students, who have been exposed to the second language (L2) in a scholastic context, to perceive British English L2 vowel phonemes. The aim is to verify whether the Perceptual Assimilation Model (PAM) can be applied to them. To test their ability to perceive L2 phonemes, subjects completed an identification test and an oddity discrimination test. The results indicated that the L2 discrimination processes are in line with those predicted by the PAM, supporting the idea that students with a formal L2 background are still naïve listeners to the L2.

#10Three-way Laryngeal Categorization of Japanese, French, English and Chinese Plosives by Korean Speakers

Tomohiko Ooigawa (Phonetics Laboratory, Sophia University, Tokyo, Japan)
Shigeko Shinohara (Phonetics Laboratory, Sophia University, Tokyo, Japan)

Korean has a three-way laryngeal contrast in oral stops. This paper reports perception patterns of plosives of Japanese, French, English and Chinese by Korean speakers. In Korean loanwords, laryngeal contrasts of Japanese, French, and English plosives show distinct patterns. To test whether perception explains the loanword patterns, we selected languages with different acoustic properties and carried out perception tests. Our results reveal discrepancies between the phonological adaptation and the acoustic perception patterns.

#11The effect of F0 peak-delay on the L1 / L2 perception of English lexical stress

Shinichi Tokuma (Chuo University)
Yi Xu (University College London)

This study investigated the perceptual effect of F0 peak-delay on L1 / L2 perception of English lexical stress. A bisyllabic English non-word /nInI/ whose F0 was set to reach its peak in the second syllable was embedded in a frame sentence and used as the stimulus of the perceptual experiment. Native English and Japanese speakers were asked to determine lexical stress locations in the experiment. The results showed that in the perception of English lexical stress, delayed F0 peaks which were aligned with the second syllable of the stimulus words perceptually affected Japanese and English groups in the same manner: both groups perceived the delayed F0 peaks as a cue to lexical stress in the first syllable when the peaks were aligned with, or before, the end of /n/ in the second syllable. A supplementary experiment conducted on Japanese speakers confirmed the location of the categorical boundary. These findings are supported by the data provided by previous studies.

#12Lexical tone production by Cantonese speakers with Parkinson’s disease

Joan K-Y Ma (Dresden University of Technology)

This study investigated lexical tone production in Cantonese speakers with Parkinson’s disease (PD) and the effect of intonation on the production of lexical tone. Speech data were collected from five Cantonese PD speakers. Speech materials consisted of targets contrasting in tones, embedded in different sentence contexts (initial, medial and final) and intonations (statements and questions). Analysis of the normalized F0 patterns showed that PD speakers contrasted the six lexical tones in a manner similar to control speakers across positions and intonations, except at the final position of questions. Significantly lower F0 values were found at the 75% and 100% time points of the final syllable of questions for the PD speakers than for the control speakers, indicating that intonation has a smaller influence on the F0 patterns of lexical tones for PD speakers than for control speakers.

#13Acoustic cues of palatalisation in plosive + lateral onset clusters

Daniela Müller (CLLE-ERSS, Université de Toulouse 2 - Le Mirail, Toulouse, France & Romanisches Seminar, Ruprecht-Karls-Universität Heidelberg, Heidelberg, Germany)
Sidney Martin Mota (Escola Oficial d'Idiomes de Tarragona, Tarragona, Spain)

Palatalisation of /l/ in obstruent + lateral onset clusters in the absence of a following palatal sound has received a considerable amount of attention from historical linguistics. The phonetics of its development, however, remains less well investigated. This paper aims at studying the acoustic cues that could have led plosive + lateral onset clusters to develop palatalisation. It is found that onset clusters with velar plosives favour palatalisation more than labial + lateral clusters, and that a high degree of darkness diminishes the likelihood of palatalisation taking place.

Wed-Ses1-P3:
Statistical Parametric Synthesis II

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:Simon King

#1A Bayesian Approach to Hidden Semi-Markov Model Based S