Interspeech 2009 Technical Programme

Mon-Ses1-K:
Sadaoki Furui - Selected topics from 40 years of research on speech and speaker recognition

Time:Monday 11:00 Place:Main Hall Type:Keynote
Chair: Isabel Trancoso

11:00 Selected topics from 40 years of research on speech and speaker recognition

Sadaoki Furui (Tokyo Institute of Technology)

This talk summarizes my 40 years of research on speech and speaker recognition, focusing on selected topics that I have investigated at NTT Laboratories, Bell Laboratories and Tokyo Institute of Technology with my colleagues and students. These topics include: the importance of spectral dynamics in speech perception; speaker recognition methods using statistical features, cepstral features, and HMM/GMM; text-prompted speaker recognition; speech recognition by dynamic features; Japanese LVCSR; spontaneous speech corpus construction and analysis; spontaneous speech recognition; automatic speech summarization; WFST-based decoder development and its applications; and unsupervised model adaptation methods.

Mon-Ses2-O1:
ASR: Features for Noise Robustness

Time:Monday 13:30 Place:Main Hall Type:Oral
Chair: Hynek Hermansky

13:30 Feature Extraction for Robust Speech Recognition Using a Power-Law Nonlinearity and Power-Bias Subtraction

Chanwoo Kim (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)

This paper presents a new feature extraction algorithm called Power-Normalized Cepstral Coefficients (PNCC) that is based on auditory processing. Major new features of PNCC processing include the use of a power-law nonlinearity that replaces the traditional log nonlinearity used for MFCC coefficients, and a novel algorithm that suppresses background excitation by estimating SNR based on the ratio of the arithmetic to geometric mean power, and subtracts the inferred background power. Experimental results demonstrate that the PNCC processing provides substantial improvements in recognition accuracy compared to MFCC and PLP processing for various types of additive noise. The computational cost of PNCC is only slightly greater than that of conventional MFCC processing.
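
As a rough illustration of two ingredients named in this abstract, the sketch below contrasts a log nonlinearity with a power-law nonlinearity and computes the ratio of arithmetic to geometric mean power. It is not the authors' implementation: the function names, the `floor` value and the exponent of 0.1 are assumptions chosen only for illustration.

```python
import math

def log_compress(power, floor=1e-10):
    """Log nonlinearity of the kind used for MFCCs (floor is an assumed guard)."""
    return math.log(max(power, floor))

def power_law_compress(power, exponent=0.1):
    """Power-law nonlinearity; the exponent value is an assumption."""
    return power ** exponent

def am_gm_ratio(powers):
    """Ratio of the arithmetic mean to the geometric mean of positive powers."""
    arithmetic = sum(powers) / len(powers)
    geometric = math.exp(sum(math.log(p) for p in powers) / len(powers))
    return arithmetic / geometric

steady = [2.0, 2.0, 2.0, 2.0]   # flat power envelope
bursty = [0.1, 4.0, 0.1, 4.0]   # strongly fluctuating power envelope

print(am_gm_ratio(steady), am_gm_ratio(bursty))
print(log_compress(1e-6), power_law_compress(1e-6))
```

The ratio is close to 1 for a flat power envelope and grows as the power fluctuates more, which is what makes it usable as a cue. For very small powers the log output becomes a large negative number, whereas the power-law output stays bounded near zero.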

13:50 Towards Fusion of Feature Extraction and Acoustic Model Training: A Top Down Process for Robust Speech Recognition

Yu-Hsiang Bosco Chiu (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
Richard M. Stern (Carnegie Mellon University)

This paper presents a strategy to learn physiologically motivated components in a feature computation module discriminatively, directly from data, in a manner that is inspired by the presence of efferent processes in the human auditory system. In our model, a set of logistic functions that represent the rate-level nonlinearities found in most mammalian auditory systems is included as part of the feature extraction process. The parameters of these rate-level functions are estimated to maximize the a posteriori probability of the correct class in the training data. The estimated feature computation is observed to be robust against environmental noise. Experiments conducted with the CMU Sphinx-III on the DARPA Resource Management task show that the discriminatively estimated rate-nonlinearity results in better performance in the presence of background noise than traditional procedures which separate the feature extraction and model training into two distinct parts.

14:10 Temporal Modulation Processing of Speech Signals for Noise Robust ASR

Hong You (UCLA Electrical Engineering Dept.)
Abeer Alwan (UCLA Electrical Engineering Dept.)

We analyze the temporal modulation characteristics of speech and noise from a speech/non-speech discrimination point of view, and propose a frequency adaptive modulation processing algorithm and apply it to a noise robust ASR task. Although previous psychoacoustic studies have shown that low temporal modulation components are important for speech intelligibility, there is no reported analysis on modulation components from the point of view of speech/noise discrimination. Our data-driven analysis of modulation components of speech and noise reveals that speech and noise are more accurately classified by low-passed modulation frequencies than band-passed ones. We then propose a frequency adaptive modulation processing algorithm for a noise robust ASR task. Speech recognition experiments are performed to compare the proposed algorithm with other noise robust frontends, including RASTA and ETSI AFE. Results show that the frequency adaptive modulation processing is promising.

14:30 PROGRESSIVE MEMORY-BASED PARAMETRIC NON-LINEAR FEATURE EQUALIZATION

Luz García (Department of TSTC, University of Granada, Spain)
Roberto Gemello (LOQUENDO, Torino, ITALY)
Franco Mana (LOQUENDO, Torino, ITALY)
Jose Carlos Segura (Department of TSTC, University of Granada, Spain)

This paper analyzes the benefits and drawbacks of PEQ (Parametric Non-linear Equalization), a feature normalization technique based on the parametric equalization of the MFCC parameters to match a reference probability distribution. Two limitations have been outlined: the distortion intrinsic to the normalization process and the lack of accuracy in estimating normalization statistics on short sentences. Two evolutions of PEQ are presented as solutions to the limitations encountered. The effects of the proposed evolutions are evaluated on three speech corpora, namely WSJ0, AURORA-3 and HIWIRE cockpit databases, with different mismatch conditions given by convolutional and/or additive noise and non-native speakers. The obtained results show that the encountered limitations can be overcome by the newly introduced techniques.

14:50 Dynamic Features in the Linear Domain for Robust Automatic Speech Recognition in a Reverberant Environment

Osamu Ichikawa (Tokyo Research Laboratory, IBM Research)
Takashi Fukuda (Tokyo Research Laboratory, IBM Research)
Ryuki Tachibana (Tokyo Research Laboratory, IBM Research)
Masafumi Nishimura (Tokyo Research Laboratory, IBM Research)

Since MFCCs are calculated from logarithmic spectra, the delta and delta-delta features are difference operations in the logarithmic domain. In a reverberant environment, speech signals have trailing reverberations whose power follows a long-term exponential decay. This means the logarithmic delta value tends to remain large for a long time. This paper proposes a delta feature calculated in the linear domain, where the delta decays rapidly in reverberant environments. In an experiment using the CENSREC-4 evaluation framework, significant improvements were found in reverberant situations by simply replacing the MFCC dynamic features with the proposed dynamic features.
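
The reasoning in this abstract can be checked on a toy reverberant tail. The sketch below is a simplified illustration, not the authors' feature: it uses a plain first-order difference rather than a regression-based delta, and `decay_per_frame` is a hypothetical decay rate.

```python
import math

def delta(values):
    """First-order difference between neighbouring frames (simplified delta)."""
    return [b - a for a, b in zip(values, values[1:])]

decay_per_frame = 0.5  # hypothetical decay rate of the reverberant tail
linear_power = [math.exp(-decay_per_frame * t) for t in range(10)]
log_power = [math.log(p) for p in linear_power]

log_delta = delta(log_power)        # the same value, -decay_per_frame, in every frame
linear_delta = delta(linear_power)  # shrinks towards zero along the tail

print(log_delta[0], log_delta[-1])
print(linear_delta[0], linear_delta[-1])
```

For an exponentially decaying power, the log-domain difference keeps the same magnitude throughout the tail, while the linear-domain difference fades with the power itself.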

15:10 Local Projections and Support Vector Based Feature Selection in Speech Recognition

Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Luis Buera (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

In this paper we study a method to provide noise robustness in mismatch conditions for speech recognition using local frequency projections and feature selection. Local time-frequency filtering patterns have been used previously to provide noise robust features and a simpler feature set to apply reliability weighting techniques. The proposed method combines two techniques to select the feature set: first, a reliability metric based on information theory and, second, a support vector set to reduce the errors. The support vector set provides the most representative examples that influence the error rate in mismatch conditions, so that only the features which incorporate implicit robustness to mismatch are selected. Experimental results obtained with this method on the Aurora 2 database are compared with baseline systems.

Mon-Ses2-O2:
Production: Articulatory modelling

Time:Monday 13:30 Place:East Wing 1 Type:Oral
Chair: Rob Van Son

13:30 Feedforward Control of A 3D Physiological Articulatory Model for Vowel Production

Qiang Fang (Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences)
Akikazu Nishikido (IIPL, School of Information Science, Japan Advanced Institute of Science and Technology)
Jianwu Dang (IIPL, School of Information Science, Japan Advanced Institute of Science and Technology)
Aijun Li (Phonetics Lab., Institute of Linguistics, Chinese Academy of Social Sciences)

A 3D physiological articulatory model has been developed to account for the biomechanical properties of speech organs in speech production. To control the model for investigating the mechanism of speech production, a feedforward control strategy is necessary to generate proper muscle activations according to desired articulatory targets. In this paper, we elaborated a feedforward control module for the 3D physiological articulatory model. In the feedforward control process, an input articulatory target, specified by articulatory parameters, is transformed to an intrinsic representation of articulation; then, a muscle activation pattern is estimated by a proposed mapping function. The results showed that the proposed feedforward control strategy is able to control the 3D physiological articulatory model with high accuracy both acoustically and articulatorily.

13:50 Articulatory Modeling Based on Semi-polar Coordinates and Guided PCA Technique

Jun Cai (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Yves Laprie (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Julie Busset (Groupe Parole, LORIA-CNRS & INRIA, BP 239, 54600 Vandoeuvre-lès-Nancy, France)
Fabrice Hirsch (Institut de Phonétique de Strasbourg, 2, rue Descartes, 67084 Strasbourg, France)

Research on 2-dimensional static articulatory modeling has been performed by using the semi-polar system and guided PCA of lateral X-ray images of the vocal tract. The density of the grid lines in the semi-polar system has been increased to obtain better descriptive precision. New parameters have been introduced to describe the movements of the tongue apex. An extra feature, the tongue root, has been extracted as one of the elementary factors in order to improve the precision of the tongue model. New methods still remain to be developed for describing the movements of the tongue apex.

14:10 Sequencing of Articulatory Gestures using Cost Optimization

Juraj Simko (University College Dublin)
Fred Cummins (University College Dublin)

Within the framework of articulatory phonology (AP), gestures function as primitives, and their ordering in time is provided by a gestural score. Determining how they should be sequenced in time has been something of a challenge. We modify the task dynamic implementation of AP, by defining tasks to be the desired positions of physically embodied end effectors. This allows us to investigate the optimal sequencing of gestures based on a parametric cost function. Costs evaluated include precision of articulation, articulatory effort, and gesture duration. We find that a simple optimization using these costs results in stable gestural sequences that reproduce several known coarticulatory effects.

14:30 From experiments to articulatory motion - a three dimensional talking head model

Xiao Bo Lu (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)
C. William Thorpe (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)
Kylie Foster (Department of Food and Health, the University of Massey, Auckland, New Zealand)
Peter Hunter (Bioengineering Institute, the University of Auckland, Auckland, New Zealand)

The goal of this study is to develop a customised computer model that can accurately represent the motions of vocal articulators during vowels and consonants. Models of the articulators were constructed as Finite element (FE) meshes based on digitised high-resolution MRI (Magnetic Resonance Imaging) scans obtained during rest breathing. Articulatory kinematics during speaking were obtained by EMA (Electromagnetic Articulography) and video of the face. The movement information thus acquired was applied to the FE model to provide jaw motion, modeled as a rigid body, and tongue, cheek and lip movement modeled with a free-form deformation technique. The motion of the epiglottis has also been considered in the model.

14:50 Towards Robust Glottal Source Modeling

Javier Pérez (TALP Research Center, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain)
Antonio Bonafonte (TALP Research Center, Universitat Politècnica de Catalunya (UPC), Barcelona, Spain)

We present here a new method for the simultaneous estimation of the derivative glottal waveform and the vocal tract filter. The algorithm is pitch-synchronous and uses overlapping frames of several glottal cycles to increase the robustness and quality of the estimation. Two parametric models for the glottal waveform are used: the KLGLOTT88 during the convex optimization iteration, and the LF model for the final parametrization. We evaluate performance on a synthetic corpus built from real data published in several studies. A second corpus has been specially recorded for this work, consisting of isolated vowels uttered with different voice qualities. The algorithm has been found to perform well with most of the voice qualities present in the synthetic data-set in terms of glottal waveform matching. The performance is also good with the real vowel data-set in terms of resynthesis quality.

15:10 Sliding Vocal-tract Model and its Application for Vowel Production

Takayuki Arai (Sophia University)

In a previous study, Arai implemented a sliding vocal-tract model based on Fant’s three-tube model and demonstrated its usefulness for education in acoustics and speech science. The sliding vocal-tract model consists of a long outer cylinder and a short inner cylinder, which simulates tongue constriction in the vocal tract. This model can produce different vowels by sliding the inner cylinder and changing the degree of constriction. In this study, we investigated the model’s coverage of vowels on the vowel space and explored its application for vowel production in the speech and hearing sciences.

Mon-Ses2-O3:
Systems for LVCSR and Rich Transcription

Time:Monday 13:30 Place:East Wing 2 Type:Oral
Chair: Thomas Schaaf

13:30 Minimum Hypothesis Phone Error as a Decoding Method for Speech Recognition

Haihua Xu (Shanghai Jiaotong University, China)
Daniel Povey (Microsoft Research, Redmond, WA, USA)
Jie Zhu (Shanghai Jiaotong University, China)
Guanyong Wu (Shanghai Jiaotong University, China)

In this paper we show how methods for approximating phone error, as normally used for Minimum Phone Error (MPE) discriminative training, can be used instead as a decoding criterion for lattice rescoring. This is an alternative to Confusion Networks (CN), which are commonly used in speech recognition. The standard (Maximum A Posteriori) decoding approach is a Minimum Bayes Risk estimate with respect to the Sentence Error Rate (SER); however, we are typically more interested in the Word Error Rate (WER). Methods such as CN and our proposed Minimum Hypothesis Phone Error (MHPE) aim to get closer to minimizing the expected WER. Based on preliminary experiments we find that our approach gives more improvement than CN, and is conceptually simpler.
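
The difference between MAP decoding and a Minimum Bayes Risk decision under an edit-distance loss can be shown on a toy N-best list. The sketch below is not the MHPE method: it swaps in word-level Levenshtein distance over a three-entry list in place of the paper's phone-error approximation on lattices, and the hypotheses and posteriors are invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two token sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]

def map_decode(nbest):
    """Pick the hypothesis with the highest posterior (minimizes sentence error)."""
    return max(nbest, key=lambda item: item[1])[0]

def mbr_decode(nbest):
    """Pick the hypothesis with the lowest expected word edit distance."""
    def expected_loss(candidate):
        return sum(p * edit_distance(other.split(), candidate.split())
                   for other, p in nbest)
    return min((hyp for hyp, _ in nbest), key=expected_loss)

# Hypothetical N-best list of (word string, posterior probability) pairs.
nbest = [("call tom now", 0.40), ("hold the line", 0.35), ("hold the lime", 0.25)]

print(map_decode(nbest))
print(mbr_decode(nbest))
```

Here the single most probable sentence differs in every word from the other two hypotheses, which together hold more probability mass and nearly agree with each other, so the two criteria pick different outputs.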

13:50 Posterior-based Out-of-Vocabulary Word Detection in Telephone Speech

Stefan Kombrink (Brno University of Technology, Czech Republic)
Lukas Burget (Brno University of Technology, Czech Republic)
Pavel Matejka (Brno University of Technology, Czech Republic)
Martin Karafiat (Brno University of Technology, Czech Republic)
Hynek Hermansky (Johns Hopkins University, Baltimore (USA))

In this paper we present an out-of-vocabulary word detector suitable for English conversational and read speech. We use an approach based on phone posteriors created by a Large Vocabulary Continuous Speech Recognition system and an additional phone recognizer, which allows detection of OOV and misrecognized words. In addition, the recognized word output can be transcribed in more detail using several classes. Results are reported on CallHome English and Wall Street Journal data.

14:10 Automatic Transcription System for Meetings of the Japanese National Congress

Yuya Akita (Kyoto University)
Masato Mimura (Kyoto University)
Tatsuya Kawahara (Kyoto University)

This paper presents an automatic speech recognition (ASR) system for assisting meeting record creation of the National Congress of Japan. The system is designed to cope with spontaneous characteristics of meeting speech, as well as a variety of topics and speakers. For the acoustic model, minimum phone error (MPE) training is applied with several normalization techniques. For the language model, we propose statistical style transformation to generate spoken-style N-grams and their statistics. We also introduce statistical modeling of pronunciation variation in spontaneous speech. The ASR system was evaluated on real congressional meetings, and achieved word accuracy of 84%. It is also suggested that ASR-based transcripts at this accuracy level are usable for editing meeting records.

14:30 Cross-language Bootstrapping for Unsupervised Acoustic Model Training: Rapid Development of a Polish Speech Recognition System

Jonas Lööf (RWTH Aachen University)
Christian Gollan (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

This paper describes the rapid development of a Polish language speech recognition system. The system development was performed without access to any transcribed acoustic training data. This was achieved through the combined use of cross-language bootstrapping and confidence based unsupervised acoustic model training. A Spanish acoustic model was ported to Polish through the use of a manually constructed phoneme mapping. This initial model was refined through iterative recognition and retraining of the untranscribed audio data. The system was trained and evaluated on recordings from the European Parliament, and included several state-of-the-art speech recognition techniques. Confidence based speaker adaptive training using feature space transform adaptation, as well as vocal tract length normalization and maximum likelihood linear regression, was used to refine the acoustic model. Through the combination of the different techniques, good recognition performance was achieved.

14:50 Porting an European Portuguese Broadcast News Recognition System to Brazilian Portuguese

Alberto Abad (INESC-ID Lisboa)
Isabel Trancoso (IST / INESC-ID Lisboa, Portugal)
Nelson Neto (Federal University of Pará, Belém, Brazil)
M. Céu Viana (Center of Linguistics of the University of Lisbon, Portugal)

This paper reports on recent work in the PoSTPort project, which aims at porting a Broadcast News recognition system originally developed for European Portuguese to other varieties. Here we focus on porting to Brazilian Portuguese. The impact of some of the main sources of variability is assessed, and solutions are proposed at the lexical, acoustic and syntactic levels. The ported Brazilian Portuguese Broadcast News system reduced WER drastically, from 56.6% (obtained with the European Portuguese system) to 25.5%.
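
To put the two WER figures in this abstract on a common scale, the short sketch below computes the absolute and relative reductions. Only the two percentages come from the abstract; the helper name `relative_reduction` is an assumption.

```python
def relative_reduction(baseline, improved):
    """Fraction of the baseline error rate that was removed."""
    return (baseline - improved) / baseline

baseline_wer = 56.6  # European Portuguese system on Brazilian Portuguese data
ported_wer = 25.5    # ported Brazilian Portuguese system

absolute = baseline_wer - ported_wer
relative = relative_reduction(baseline_wer, ported_wer)
print(f"absolute: {absolute:.1f} points, relative: {100 * relative:.1f}%")
```

The drop of about 31 points corresponds to removing roughly 55% of the original errors.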

15:10 Modeling Northern and Southern Varieties of Dutch for STT

Julien Despres (Vecsys Research)
Petr Fousek (CNRS-LIMSI)
Jean-Luc Gauvain (CNRS-LIMSI)
Sandrine Gay (Vecsys Research)
Yvan Josse (Vecsys Research)
Lori Lamel (CNRS-LIMSI)
Abdel Messaoudi (CNRS-LIMSI and Vecsys Research)

This paper describes how the Northern (NL) and Southern (VL) varieties of Dutch are modeled in the joint Limsi-Vecsys Research speech-to-text transcription systems for broadcast news (BN) and conversational telephone speech (CTS). Using the Spoken Dutch Corpus resources (CGN), systems were developed and evaluated in the 2008 N-Best benchmark. Modeling techniques that are used in our systems for other languages were found to be effective for the Dutch language; however, it was also found to be important to have acoustic and language models, and statistical pronunciation generation rules, adapted to each variety. This was in particular true for the MLP features, which were only effective when trained separately for Dutch and Flemish. The joint submissions obtained the lowest WERs in the benchmark by a significant margin.

Mon-Ses2-O4:
Speech Analysis and Processing I

Time:Monday 13:30 Place:East Wing 3 Type:Oral
Chair: Ben Milner

13:30 Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis

Thomas Ewender (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Sarah Hoffmann (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)

We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow the frames of a speech signal to be classified into five classes: voiced, unvoiced, mixed, irregularly glottalized and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.

13:50 AM-FM ESTIMATION FOR SPEECH BASED ON A TIME-VARYING SINUSOIDAL MODEL

Yannis Pantazis (University of Crete)
Olivier Rosec (Orange Labs)
Yannis Stylianou (University of Crete)

In this paper we present a method based on a time-varying sinusoidal model for a robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is modeled as a sinusoidal model with time-varying amplitudes. Specifically, the model makes use of a first order time polynomial with complex coefficients for capturing instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated by using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested. It was validated on synthetic AM-FM signals and tested on reconstruction of voiced speech signals, where the signal-to-error reconstruction ratio (SERR) was used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.
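
The abstract does not spell out how SERR is computed. The sketch below assumes one common definition, 20 log10 of the ratio between the standard deviation of the signal and that of the reconstruction error; the helper names and the toy signals are hypothetical.

```python
import math

def std(values):
    """Population standard deviation."""
    mean = sum(values) / len(values)
    return math.sqrt(sum((v - mean) ** 2 for v in values) / len(values))

def serr_db(signal, reconstruction):
    """Signal-to-error reconstruction ratio in dB (assumed definition)."""
    error = [s - r for s, r in zip(signal, reconstruction)]
    return 20.0 * math.log10(std(signal) / std(error))

n = 200
signal = [math.sin(2 * math.pi * 5 * k / n) for k in range(n)]
# Toy reconstruction whose error has 1/100 of the signal amplitude.
reconstruction = [s - 0.01 * math.sin(2 * math.pi * 13 * k / n)
                  for k, s in enumerate(signal)]

print(serr_db(signal, reconstruction))
```

Under this definition, an error 100 times smaller in amplitude than the signal gives 40 dB, so the figure of over 30 dB quoted in the abstract would correspond to an error amplitude below roughly 3% of the signal.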

14:10 Voice Source Waveform Analysis and Synthesis using Principal Component Analysis and Gaussian Mixture Modelling

Jon Gudnason (Imperial College London)
Mark Thomas (Imperial College London)
Patrick Naylor (Imperial College London)
Daniel Ellis (Columbia University)

The paper presents a voice source waveform modeling technique based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the 'prototype' voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited for both low bitrate and high quality analysis/synthesis schemes.

14:30 Model-Based Estimation Of Instantaneous Pitch In Noisy Speech

Jung Ook Hong (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.

14:50 Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

Thomas Drugman (Faculté Polytechnique de Mons)
Baris Bozkurt (Izmir Institute of Technology)
Thierry Dutoit (Faculté Polytechnique de Mons)

Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of complex cepstrum for source-tract deconvolution has been discussed in various articles. However, there exists no study which proposes a glottal flow estimation methodology based on cepstrum and reports effective results. In this paper, we show that complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics that conform to the speech production model, in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.

15:10 Approximate Intrinsic Fourier Analysis of Speech

Frank Tompkins (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold for the case of many training examples, the setting of both greatest practical interest and computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.

Mon-Ses2-S1:
Special Session: INTERSPEECH 2009 Emotion Challenge

Time:Monday 13:30 Place:East Wing 4 Type:Special
Chair: Bjoern Schuller & Anton Batliner

#0Emotion Classification in Children’s Speech Using Fusion of Acoustic and Linguistic Features

Tim Polzehl (TU-Berlin, Deutsche Telekom Laboratories)
Shiva Sundaram (TU-Berlin, Deutsche Telekom Laboratories)
Hamed Ketabdar (TU-Berlin, Deutsche Telekom Laboratories)
Michael Wagner (National Centre for Biometric Studies)
Florian Metze (interACT)

This paper describes a system to detect angry vs. non-angry utterances of children who are engaged in dialog with an Aibo robot dog. The system was submitted to the Interspeech 2009 Emotion Challenge evaluation. The speech data consist of short utterances of the children’s speech, and the proposed system is designed to detect anger in each given chunk. Frame-based cepstral features, prosodic and acoustic features as well as glottal excitation features are extracted automatically, reduced in dimensionality and classified by means of an artificial neural network and a support vector machine. An automatic speech recognizer transcribes the words in an utterance and yields a separate classification based on the degree of emotional salience of the words. Late fusion is applied to make a final decision on anger vs. non-anger of the utterance. Preliminary results show 75.9% unweighted average recall on the training data and 67.6% on the test set.
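
Several abstracts in this session report unweighted average recall. The sketch below shows how that measure is computed as the mean of the per-class recalls, and how it differs from plain accuracy when classes are imbalanced. The labels and predictions are invented for illustration and are not challenge data.

```python
from collections import defaultdict

def unweighted_average_recall(reference, predicted):
    """Mean of per-class recalls, each class counting equally."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for ref, pred in zip(reference, predicted):
        totals[ref] += 1
        if ref == pred:
            hits[ref] += 1
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

def accuracy(reference, predicted):
    """Fraction of items labelled correctly, dominated by the larger class."""
    return sum(r == p for r, p in zip(reference, predicted)) / len(reference)

# Hypothetical imbalanced two-class example: 2 "anger" items, 6 "other" items.
reference = ["anger", "anger"] + ["other"] * 6
predicted = ["anger", "other"] + ["other"] * 6

print(unweighted_average_recall(reference, predicted))
print(accuracy(reference, predicted))
```

In this toy case one of the two "anger" items is missed, so the per-class recalls are 0.5 and 1.0 and their mean is 0.75, while accuracy is 0.875 because the majority class dominates.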

#0Acoustic Emotion Recognition using Dynamic Bayesian Networks and Multi-Space Distributions

Roberto Barra-Chicote (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Fernando Fernandez (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Syaheerah Lutfi (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Juan Manuel Lucas-Cuesta (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Javier Macias-Guarasa (Department of Electronics. University of Alcala. Spain)
Juan Manuel Montero (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain)

In this paper we describe the acoustic emotion recognition system built at the Speech Technology Group of the Universidad Politecnica de Madrid (Spain) to participate in the INTERSPEECH 2009 Emotion Challenge. Our proposal is based on the use of a Dynamic Bayesian Network (DBN) to deal with the temporal modelling of the emotional speech information. The selected features (MFCC, F0, Energy and their variants) are modelled as different streams, and the F0 related ones are integrated under a Multi Space Distribution (MSD) framework, to properly model their dual nature (voiced/unvoiced). Experimental evaluation on the challenge test set shows unweighted recall of 67.06% and 38.24% for the 2-class and 5-class tasks, respectively. In the 2-class case, we achieve results similar to the baseline with 8.5 times fewer features. In the 5-class case, we achieve a statistically significant 6.5% relative improvement.

#0Brno University of Technology System for Interspeech 2009 Emotion Challenge

Marcel Kockmann (Brno University of Technology, Czech Republic)
Lukas Burget (Brno University of Technology, Czech Republic)
Jan Cernocky (Brno University of Technology, Czech Republic)

This paper describes the Brno University of Technology (BUT) system for the Interspeech 2009 Emotion Challenge. Our submitted system for the Open Performance Sub-Challenge uses acoustic frame based features as a front-end and Gaussian Mixture Models as a back-end. Different feature types and modeling approaches successfully applied in speaker and language recognition are investigated, and we achieve 16% and 9% relative improvements over the best dynamic and static baseline systems on the 5-class task, respectively.

#0Cepstral and Long-Term Features for Emotion Recognition

Pierre Dumouchel (Ecole de technologie superieure)
Najim Dehak (Ecole de technologie superieure)
Yazid Attabi (Ecole de technologie superieure)
Reda Dehak (Laboratoire de recherche et de developpement de l'EPITA)
Narjes Boufaden (Centre de recherche informatique de Montreal)

In this paper, we describe systems that were developed for the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We participate in both two-class and five-class emotion detection. For the two-class problem, the best performance is obtained by logistic regression fusion of three systems. These systems use short- and long-term speech features. This fusion achieved an absolute improvement of 2.6% on the unweighted recall value compared with [6]. For the five-class problem, we submitted two individual systems: cepstral GMM vs. long-term GMM-UBM. The best result comes from a cepstral GMM and produced an absolute improvement of 3.5% compared to [6].

#0Exploring the benefits of discretization of acoustic features for speech emotion recognition

Thurid Vogt (Multimedia Concepts and Applications, University of Augsburg, Germany)
Elisabeth André (Multimedia Concepts and Applications, University of Augsburg, Germany)

We present a contribution to the Open Performance Sub-Challenge of the INTERSPEECH 2009 Emotion Challenge. We evaluate the feature extraction and classifier of EmoVoice, our framework for real-time emotion recognition from voice, on the challenge database and achieve competitive results. Furthermore, we explore the benefits of discretizing numeric acoustic features and find it beneficial in a multi-class task.

#0Combining spectral and prosodic information for emotion recognition in the Interspeech 2009 Emotion Challenge

Iker Luengo (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Eva Navas (Department of Electronics and Telecommunication, University of the Basque Country, Spain)
Inmaculada Hernáez (Department of Electronics and Telecommunication, University of the Basque Country, Spain)

This paper describes the system presented at the Interspeech 2009 Emotion Challenge. It relies on both spectral and prosodic features in order to automatically detect the emotional state of the speaker. As the two kinds of features have very different characteristics, they are treated separately, creating two sub-classifiers, one using the spectral features and the other using the prosodic ones. The results of these two classifiers are then combined with a fusion system based on Support Vector Machines.

#0GTM-URL Contribution to the INTERSPEECH 2009 Emotion Challenge

Santiago Planet (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Ignasi Iriondo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Joan-Claudi Socoró (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Carlos Monzo (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)
Jordi Adell (GTM – Grup de Recerca en Tecnologies Mèdia, La Salle – Universitat Ramon Llull, Spain)

This paper describes our participation in the INTERSPEECH 2009 Emotion Challenge [1]. Starting from our previous experience in the use of automatic classification for the validation of an expressive corpus, we have tackled the difficult task of emotion recognition from speech with real-life data. Our main contribution to this work is related to the Classifier Sub-Challenge, for which we tested several classification strategies. On the whole, the results were slightly worse than or similar to the baseline, but we found some configurations that could be considered in future implementations.

#0Improving Automatic Emotion Recognition from Speech Signals

Elif Bozkurt (Koc University, Istanbul, Turkey)
Engin Erzin (Koc University, Istanbul, Turkey)
Cigdem Eroglu Erdem (Bahcesehir University, Istanbul, Turkey)
Tanju Erdem (Ozyegin University, Istanbul, Turkey)

We present a speech-signal-driven emotion recognition system. Our system is trained and tested with the INTERSPEECH 2009 Emotion Challenge corpus, which includes spontaneous and emotionally rich recordings. We investigate prosody-related, spectral and HMM-based features for the evaluation of emotion recognition with Gaussian mixture model (GMM) based classifiers. Spectral features consist of mel-scale cepstral coefficients (MFCC), line spectral frequency (LSF) features and their derivatives, whereas prosody-related features consist of mean normalized values of pitch, first derivative of pitch and intensity. Unsupervised training of HMM structures is employed to define prosody-related temporal features for the emotion recognition problem. We also investigate data fusion of different features and decision fusion of different classifiers, which are not well studied in the emotion recognition framework.

#0Emotion Recognition Using a Hierarchical Binary Decision Tree Approach

Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Emily Mower (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Carlos Busso (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)

Emotion state tracking is an important aspect of human-computer and human-robot interaction. It is important to design task-specific emotion recognition systems for real-world applications. In this work, we propose a hierarchical structure loosely motivated by Appraisal Theory for emotion recognition. The levels in the hierarchical structure are carefully designed to place the easier classification task at the top level and delay the decision between highly ambiguous classes to the end. The proposed structure maps an input utterance into one of the five emotion classes through subsequent layers of binary classifications. We obtain a balanced recall on each of the individual emotion classes using this hierarchical structure. The performance measure of the average unweighted recall percentage on the evaluation data set improves by 3.3% absolute (8.8% relative) over the baseline model.
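
As an illustration of the metric quoted in several of the challenge abstracts above, unweighted average recall is the mean of the per-class recalls, so each class counts equally regardless of how many instances it has. The following is a minimal sketch only, not the challenge's scoring tool; the function name and the toy labels are hypothetical.

```python
from collections import defaultdict

def unweighted_average_recall(y_true, y_pred):
    """Mean of per-class recalls; the class set is taken from y_true."""
    total = defaultdict(int)    # instances per reference class
    correct = defaultdict(int)  # correctly recognised instances per class
    for t, p in zip(y_true, y_pred):
        total[t] += 1
        if t == p:
            correct[t] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)

# Toy, made-up labels: four instances of class "N", two of class "E".
y_true = ["N", "N", "N", "N", "E", "E"]
y_pred = ["N", "N", "N", "E", "E", "N"]
uar = unweighted_average_recall(y_true, y_pred)
```

In this toy case the per-class recalls are 3/4 and 1/2, so the unweighted average differs from plain accuracy (4/6), which is why the measure suits unbalanced class distributions.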

13:30The INTERSPEECH 2009 Emotion Challenge

Bjoern Schuller (Technische Universitaet Muenchen)
Stefan Steidl (Friedrich-Alexander University Erlangen-Nuremberg)
Anton Batliner (Friedrich-Alexander University Erlangen-Nuremberg)

The last decade has seen a substantial body of literature on the recognition of emotion from speech. However, in comparison to related speech processing tasks such as Automatic Speech and Speaker Recognition, practically no standardised corpora and test-conditions exist to compare performances under exactly the same conditions. Instead, the multiplicity of evaluation strategies employed, such as cross-validation or percentage splits without proper instance definition, prevents exact reproducibility. This INTERSPEECH 2009 Emotion Challenge aims at bridging such gaps between excellent research on human emotion recognition from speech and low compatibility of results. The FAU Aibo Emotion Corpus serves as basis with clearly defined test and training partitions incorporating speaker independence as needed in most real-life settings. This paper introduces the challenge, the corpus, the features, and benchmark results of two popular approaches towards emotion recognition from speech.

Mon-Ses2-P1:
Speech perception I

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair:Paul Boersma

#1Relative importance of formant and whole-spectral cues for vowel perception

Masashi Ito (Graduate School of Engineering, Tohoku University, Japan)
Keiji Ohara (Research Institute of Electrical Communication, Tohoku University, Japan)
Akinori Ito (Graduate School of Engineering, Tohoku University, Japan)
Masafumi Yano (Research Institute of Electrical Communication, Tohoku University, Japan)

Three psycho-acoustical experiments were carried out to investigate the relative importance of formant frequency and whole spectral shape as cues for vowel perception. Four types of vowel-like signals were presented to eight listeners. The mean responses for stimuli including both formant and amplitude-ratio features were quite similar to those for the stimuli including only the formant peak feature. Nonetheless, reasonable vowel changes were observed in responses for stimuli including only the amplitude-ratio feature. Perceived vowel changes were also observed even for stimuli including neither of these features. The results suggested that perceptual cues were involved in various parts of the vowel spectrum.

#2Influences of vowel duration on speaker-size estimation and discrimination

Chihiro Takeshima (Kyoto City University of Arts)
Minoru Tsuzaki (Kyoto City University of Arts)
Toshio Irino (Faculty of Systems Engineering, Wakayama University)

Several studies have shown that the auditory system has a mechanism to extract speaker-size information, using sufficiently long sounds. This paper investigated the influence of vowel duration on the processing for size extraction using short vowels. In a size estimation experiment, listeners subjectively estimated the speaker size for isolated vowels. The results showed that listeners' size perception was highly correlated with the vocal-tract length in all the tested durations (from 16 ms to 256 ms). In a size discrimination experiment, listeners were presented with two vowels and were asked which vowel was perceived to be spoken by a smaller speaker. The results showed that the just-noticeable differences in speaker size rose considerably for the 16-ms duration. These observations suggest that the auditory system can extract size information even for 16-ms vowels, although the precision of size extraction would deteriorate when the duration becomes less than 32 ms.

#3High Front Vowels in Czech: a Contrast in Quantity or Quality?

Václav Jonáš Podlipský (Department of English and American Studies, Palacký University in Olomouc, Czech Republic)
Radek Skarnitzl (Institute of Phonetics, Faculty of Arts, Charles University in Prague, Czech Republic)
Jan Volín (Institute of Phonetics, Faculty of Arts, Charles University in Prague, Czech Republic)

We investigate the perception and production of Czech /I/ and /i:/, a contrast traditionally described as quantitative. First, we show that the spectral difference between the vowels is for many Czechs as strong a cue as (or even stronger than) duration. Second, we test the hypothesis that this shift towards vowel quality as a perceptual cue for this contrast resulted in weakening of the durational differentiation in production. Our measurements confirm this: members of the /I/-/i:/ pair differed in duration much less than those of other short-long pairs. We interpret these findings in terms of Lindblom’s H&H theory.

#4Effect of contralateral noise on energetic and informational masking on speech-in-speech intelligibility

Marjorie Dole (Laboratoire Dynamique du Langage UMR5596)
Michel Hoen (Stem Cell and Brain Research Institute U846)
Fanny Meunier (Laboratoire Dynamique du Langage UMR5596)

This experiment tested the advantage of binaural presentation of an interfering noise in a task involving identification of monaurally-presented words. These words were embedded in three types of noise: a stationary noise, a speech-modulated noise and a speech-babble noise, in order to assess energetic and informational masking contributions to binaural unmasking. Our results showed important informational masking in the monaural condition, principally due to lexical and phonetic competition. We also found a binaural unmasking effect, which was more important when speech was used as interferer, suggesting that this suppressive effect was more efficient in the case of high-level informational (lexical and phonetic) competition.

#5Using location cues to track speaker changes from mobile, binaural microphones [This work was funded by the EU Cognitive Systems STReP project POP (Perception On Purpose)]

Heidi Christensen (University of Sheffield)
Jon Barker (University of Sheffield)

This paper presents initial developments towards computational hearing models that move beyond stationary microphone assumptions. We present a particle filtering based system for using localisation cues to track speaker changes in meeting recordings. Recordings are made using in-ear binaural microphones worn by a listener whose head is constantly moving. Tracking speaker changes requires simultaneously inferring the perceiver's head orientation, as any change in relative spatial angle to a source can be caused by either the source moving or the microphones moving. In real applications, such as robotics, there may be access to external estimates of the perceiver's position. We investigate the effect of simulating varying degrees of measurement noise in an external perceiver position estimate. We show that only limited self-position knowledge is needed to greatly improve the reliability with which we can decode the acoustic localisation cues in the meeting scenario.

#6A perceptual investigation of speech transcription errors involving frequent near-homophones in French and American English

Ioana Vasilescu (LIMSI-CNRS, France)
Martine Adda-Decker (LIMSI-CNRS, France)
Lori Lamel (LIMSI-CNRS, France)
Pierre Hallé (LPP-CNRS)

This article compares the errors made by automatic speech recognizers to those made by humans for near-homophones in American English and French. This exploratory study focuses on the impact of limited word context and the potential resulting ambiguities for automatic speech recognition (ASR) systems and human listeners. Perceptual experiments using 7-gram chunks centered on incorrect or correct words output by an ASR system show that humans make significantly more transcription errors on the first type of stimuli, thus highlighting the local ambiguity. The long-term aim of this study is to improve the modeling of such ambiguous items in order to reduce ASR errors.

#7The role of glottal pulse rate and vocal tract length in the perception of speaker identity

Etienne Gaudrain (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Su Li (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Vin Shen Ban (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)
Roy D Patterson (Centre for the Neural Basis of Hearing, Department of Physiology, Development and Neuroscience, University of Cambridge, United Kingdom)

In natural speech, for a given speaker, vocal tract length (VTL) is effectively fixed whereas glottal pulse rate (GPR) is varied to indicate prosodic distinctions. This suggests that VTL will be a more reliable cue for identifying a speaker than GPR. It also suggests that listeners will accept larger changes in GPR before perceiving speaker change. We measured the effect of GPR and VTL on the perception of a speaker difference, and found that listeners hear different speakers given a VTL difference of 25%, but they require a GPR difference of 45%.

#8Development of voicing categorization in deaf children with cochlear implant

Victoria Medina (Laboratoire Psychologie de la Perception, Université Paris Descartes, CNRS)
Willy Serniclaes (Laboratoire Psychologie de la Perception, Université Paris Descartes, CNRS)

Cochlear implants (CI) improve hearing, but communication abilities still depend on several factors. The present study assesses the development of voicing categorization in deaf children with cochlear implants, examining both categorical perception (CP) and boundary precision (BP) performances. We compared 22 implanted children to 55 normal-hearing children using different age factors. The results showed that the development of voicing perception in CI children is fairly similar to that in normal-hearing controls with the same auditory experience, irrespective of differences in the age of implantation (two vs. three years of age).

#9Processing Liaison-Initial Words in Native and Non-Native French: Evidence from Eye Movements

Annie Tremblay (University of Illinois at Urbana-Champaign)

French listeners have no difficulty recognizing liaison-initial words. This is in part because acoustic/phonetic information distinguishes liaison consonants from (non-resyllabified) word onsets in the speech signal. Using eye tracking, this study investigates whether native speakers of English, a language that does not have a phonological resyllabification process like liaison, can develop target-like segmentation procedures for recognizing liaison-initial words in French, and if so, how such procedures develop with increasing proficiency.

#10Estimating the Potential of Signal and Interlocutor-Track Information for Language Modeling

Nigel Ward (University of Texas at El Paso)
Benjamin Walker (University of Texas at El Paso)

Although today most language models treat language purely as word sequences, there is recurring interest in tapping new sources of information, such as disfluencies, prosody, the interlocutor's dialog act, and the interlocutor's recent words. In order to estimate the potential value of such sources of information, we extend Shannon's guessing-game method for estimating entropy to work for spoken dialog. Four teams of two subjects each predicted the next word in a dialog using various amounts of context: one word, two words, all the words spoken so far or the full dialog audio so far. The entropy benefit in the full-audio condition over the full text condition was substantial, .64 bits per word, greater than the .54 bit benefit of full text context over trigrams. This suggests that language models may be improved by use of the prosody of the speaker and context from the interlocutor.

Mon-Ses2-P2:
Accent and Language Recognition

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair: William Campbell

#1Factor Analysis and SVM for Language Recognition

Florian Verdet (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France and Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)
Driss Matrouf (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean-François Bonastre (Université d'Avignon et des Pays du Vaucluse, Laboratoire Informatique d'Avignon, Avignon, France)
Jean Hennebert (Département d'Informatique, Université de Fribourg, Fribourg, Switzerland)

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Factor Analysis (FA) proposed a model decomposition into useful and useless components. This method has successfully been applied to speaker recognition tasks. In this paper, we study the use of FA for language recognition. We propose a classification method based on SDC features and Gaussian Mixture Models (GMM). We present well-performing systems using Factor Analysis and FA-based Support Vector Machine (SVM) classifiers. Experiments are conducted using NIST LRE 2005’s primary condition. The relative equal error rate reduction obtained by the best factor analysis configuration with respect to the baseline GMM-UBM system is over 60%, corresponding to an EER of 6.59%.

#2Exploring Universal Attribute Characterization of Spoken Languages for Spoken Language Recognition

Sabato Marco Siniscalchi (NTNU)
Jeremy Reed (Georgia Institute of Technology)
Torbjørn Svendsen (NTNU)
Chin-Hui Lee (Georgia Institute of Technology)

We propose a novel universal acoustic characterization approach to spoken language identification (LID), in which any spoken language is described with a common set of fundamental units defined "universally." Specifically, manner and place of articulation form this unit inventory and are used to build a set of universal attribute models with data-driven techniques. Using the vector space modeling approach to LID, a spoken utterance is first decoded into a sequence of attributes. Then, a feature vector consisting of co-occurrence statistics of attribute units is created, and the final LID decision is implemented with a set of vector space language classifiers. Although the present study is just in its preliminary stage, promising results comparable to acoustically rich phone-based LID systems have already been obtained on the NIST 2003 LID task. The results provide clear insight for further performance improvements and encourage a continuing exploration of the proposed framework.

#3On the use of Phonological Features for Automatic Accent Analysis

Abhijeet Sangwan (Center for Robust Speech Systems)
John Hansen (Center for Robust Speech Systems)

In this paper, we present an automatic accent analysis system that is based on phonological features (PFs). The proposed system exploits the knowledge of articulation embedded in phonology by rapidly building Markov models (MMs) of PFs extracted from accented speech. The Markov models capture information in the PF space along two dimensions of articulation: PF state-transitions and state-durations. Furthermore, by utilizing MMs of native and non-native accents, a new statistical measure of “accentedness” is developed which rates the articulation of a word on a scale of native-like (−1) to non-native-like (+1). The proposed methodology is then used to perform an automatic cross-sectional study of accented English spoken by native speakers of Mandarin Chinese (N-MC). The work developed in this paper is easily assimilated into language learning systems, and has impact in the areas of speaker recognition and ASR (automatic speech recognition).

#4Language Recognition Using Language Factors

Fabio Castaldo (Politecnico di Torino)
Sandro Cumani (Politecnico di Torino)
Pietro Laface (Politecnico di Torino)
Daniele Colibro (Loquendo)

Language recognition systems based on acoustic models reach state-of-the-art performance using discriminative training techniques. In speaker recognition, eigenvoice modeling of the speaker and the use of speaker factors as input features to SVMs have recently been demonstrated to give good results compared to the standard GMM-SVM approach, which combines GMM supervectors and SVMs. In this paper we propose, in analogy to the eigenvoice modeling approach, to estimate an eigen-language space, and to use the language factors as input features to SVM classifiers. Since language factors are low-dimension vectors, training and evaluating SVMs with different kernels and with large numbers of training examples becomes an easy task. This approach is demonstrated on the 14 languages of the NIST 2007 language recognition task, and shows performance improvements with respect to the standard GMM-SVM technique.

#5Automatic Accent Detection: Effect of Base Units and Boundary Information

Je Hun Jeon (The University of Texas at Dallas)
Yang Liu (The University of Texas at Dallas)

Automatic prominence or pitch accent detection is important as it can perform automatic prosodic annotation of speech corpora, as well as provide additional features in other tasks such as keyword detection. In this paper, we evaluate how accent detection performance changes according to different base units and what kind of boundary information is available. We compare word, syllable, and vowel-based units when their boundaries are provided. We also automatically estimate syllable boundaries using energy contours when phone-level alignment is available. In addition, we utilize a sliding window with fixed length under the condition of unknown boundaries. Our experiments show that when boundary information is available, using a longer base unit achieves better performance. In the case of no boundary information, using a moving window with a fixed size achieves similar performance to using syllable information on word-level evaluation, suggesting that accent detection can be performed without relying on a speech recognizer to generate boundaries.

#6Age Verification Using a Hybrid Speech Processing Approach

Ron M Hecht (PuddingMedia)
Omer Hezroni (PuddingMedia)
Amit Manna (PuddingMedia)
Ruth Aloni-Lavi (PuddingMedia)
Gil Dobry (PuddingMedia)
Amir Alfandary (Nice systems)
Yaniv Zigel (Bio-medical Engineering Dept., Ben-Gurion University)

The human speech production system is a multi-level system. On the upper level, it starts with information that one wants to transmit. It ends on the lower level with the materialization of the information into a speech signal. Most of the recent work conducted in age estimation is focused on the lower acoustic level. In this research the upper lexical level information is utilized for age-group verification and it is shown that one's vocabulary reflects one's age. Several age-group verification systems that are based on automatic transcripts are proposed. In addition, a hybrid approach is introduced that combines the word-based system and an acoustic-based system. Experiments were conducted on a four-age-group verification task using the Fisher corpora, where an average equal error rate (EER) of 28.7% was achieved using the lexical-based approach and 28.0% using an acoustic approach. By merging these two approaches the verification error was reduced to 24.1%.
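
For readers unfamiliar with the equal error rate reported in this and neighbouring abstracts, it is the operating point at which the false acceptance rate equals the false rejection rate. The following is a minimal sketch under stated assumptions: it sweeps thresholds over the observed scores and returns the average of the two rates where they are closest. The function name and the example scores are hypothetical and unrelated to the systems described here.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping thresholds over all observed scores."""
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        # A trial is accepted when its score is at or above the threshold.
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            # Keep the threshold where the two error rates are closest.
            best_gap, eer = gap, (far + frr) / 2
    return eer

# Made-up scores: higher means "more likely a target trial".
eer = equal_error_rate([0.9, 0.8, 0.4], [0.5, 0.3, 0.1])
```

With finite score sets the two rates rarely cross exactly, so practical toolkits typically interpolate between neighbouring thresholds; the simple sweep here is chosen for readability.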

#7Information Bottleneck Based Age Verification

Ron M Hecht (PuddingMedia, Kfar-Saba, Israel)
Omer Hezroni (PuddingMedia, Kfar-Saba, Israel)
Amit Manna (PuddingMedia, Kfar-Saba, Israel)
Gil Dobry (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Yaniv Zigel (Bio-medical Engineering Department, Ben-Gurion University, Beer-Sheva, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

Word N-gram models can be used for word-based age-group verification. In this paper the agglomerative information bottleneck (AIB) approach is used to tackle one of the most fundamental drawbacks of word N-gram models: their abundant amount of irrelevant information. It is demonstrated that irrelevant information can be omitted by joining words to form word-clusters; this provides a mechanism to transform any sequence of words to a sequence of word-cluster labels. Consequently, word N-gram models are converted to word-cluster N-gram models which are more compact. Age verification experiments were conducted on the Fisher corpora. Their goal was to verify the age-group of the speaker of an unknown speech segment. In these experiments an N-gram model was compressed to a fifth of its original size without reducing the verification performance. In addition, a verification accuracy improvement is demonstrated by disposing of irrelevant information.

#8Discriminative N-gram Selection for Dialect Recognition

Fred Richardson (MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)
Pedro Torres-Carrasquillo (MIT Lincoln Laboratory)

Dialect recognition is a challenging and multifaceted problem. Distinguishing between dialects can rely upon many tiers of interpretation of speech data, e.g., prosodic, phonetic, spectral, and word. High-accuracy automatic methods for dialect recognition typically use either phonetic or spectral characteristics of the input. A challenge with spectral systems, such as those based on shifted-delta cepstral coefficients, is that they achieve good performance but do not provide insight into distinctive dialect features. In this work, a novel method based upon discriminative training and phone N-grams is proposed. This approach achieves excellent classification performance, fuses well with other systems, and has interpretable dialect characteristics in the phonetic tier. The method is demonstrated on data from the LDC and prior NIST language recognition evaluations. The method is also combined with spectral methods to demonstrate state-of-the-art performance in dialect recognition.

#9Data-driven Phonetic Comparison and Conversion between South African, British and American English Pronunciations

Linsen Loots (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)
Thomas Niesler (Department of Electrical and Electronic Engineering, Stellenbosch University, South Africa)

We analyse pronunciations in American, British and South African English pronunciation dictionaries. Three analyses are performed. First, the accuracy is determined with which decision tree based grapheme-to-phoneme (G2P) conversion can be applied to each accent. It is found that there is little difference between the accents in this regard. Secondly, pronunciations are compared by performing pairwise alignments between the accents. Here we find that South African English pronunciation most closely matches British English. Finally, we apply decision trees to the conversion of pronunciations from one accent to another. We find that pronunciations of unknown words can be more accurately determined from a known pronunciation in a different accent than by means of G2P methods. This has important implications for the development of pronunciation dictionaries in less-resourced varieties of English, and hence also for the development of ASR systems.

#10Target-Aware Language Models for Spoken Language Recognition

Rong Tong (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)
Eng Siong Chng (Nanyang Technological University, Singapore)

This paper studies a way of constructing multiple phone tokenizers for language recognition. In this approach, each phone tokenizer for a target language will share a common set of acoustic models, while each will have a unique phone-based language model (LM) trained for a specific target language. The target-aware language models (TALM) are constructed to capture the discriminative ability of individual phones for the desired target languages. The parallel phone tokenizers thus formed are shown to achieve better performance than the original phone recognizer. The proposed TALM is very different from the LM in the traditional PPRLM technique, as the TALM applies the LM information in the front-end while the PPRLM approach uses an LM in the system back-end; furthermore, the TALM exploits discriminative phone occurrence statistics, which are different from the traditional n-gram statistics in the PPRLM approach. A novel way of training TALM is also studied in this paper.

#11Language Identification for Speech-to-Speech Translation

Daniel Chung Yong Lim (Language Technologies Institute, Carnegie Mellon University)
Ian Lane (Language Technologies Institute, Carnegie Mellon University)

This paper investigates the use of language identification (LID) in real-time speech-to-speech translation systems. We propose a framework that incorporates LID capability into a speech-to-speech translation system while minimizing the impact on the system’s real-time performance. We compared two phone-based LID approaches, namely PRLM and PPRLM, to a proposed extended approach based on Conditional Random Field classifiers. The performances of these three approaches were evaluated to identify the input language in the CMU English-Iraqi TransTAC system, and the proposed approach obtained significantly higher classification accuracies on two of the three test sets evaluated.

#12Using Prosody and Phonotactics in Arabic Dialect Identification

Fadi Biadsy (Columbia University)
Julia Hirschberg (Columbia University)

While Modern Standard Arabic is the formal spoken and written language of the Arab world, dialects are the major communication mode for everyday life; identifying a speaker’s dialect is thus critical to speech processing tasks such as automatic speech recognition, as well as speaker identification. We examine the role of prosodic features (intonation and rhythm) across four Arabic dialects: Gulf, Iraqi, Levantine, and Egyptian, for the purpose of automatic dialect identification. We show that prosodic features can significantly improve identification, over a purely phonotactic-based approach, with an identification accuracy of 86.33% for 2m utterances.

Mon-Ses2-P3:
ASR: Acoustic Model Training and Combination

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair: Jeff Bilmes

#1Refactoring Acoustic Models using Variational Expectation-Maximization

Pierre Dognin (IBM T.J. Watson Research Center (USA))
John Hershey (IBM T.J. Watson Research Center (USA))
Vaibhava Goel (IBM T.J. Watson Research Center (USA))
Peder Olsen (IBM T.J. Watson Research Center (USA))

In probabilistic modeling, it is often useful to change the structure, or refactor, a model, so that it has a different number of components, different parameter sharing, or other constraints. For example, we may wish to find a Gaussian mixture model (GMM) with fewer components that best approximates a reference model. Maximizing the likelihood of the refactored model under the reference model is equivalent to minimizing their KL divergence. For GMMs, this optimization is not analytically tractable. However, a lower bound to the likelihood can be maximized using a variational expectation-maximization algorithm. Automatic speech recognition provides a good framework to test the validity of such methods, because we can train reference models of any given size for comparison with refactored models. We show that we can efficiently reduce model size by 50%, with the same recognition performance as the corresponding model trained from data.
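
The equivalence stated above, and the kind of lower bound it leads to, can be written out as follows. This is a sketch under assumed notation, not the authors' own derivation: $f = \sum_a \pi_a f_a$ denotes the reference GMM, $g = \sum_b \omega_b g_b$ the refactored GMM, and $\phi_{b|a}$ are hypothetical variational weights with $\phi_{b|a} \ge 0$ and $\sum_b \phi_{b|a} = 1$.

```latex
% Expected log-likelihood of g under f, and its relation to the KL divergence
\mathcal{L}(f \,\|\, g) = \int f(x) \log g(x)\, dx
  = -\,\mathrm{KL}(f \,\|\, g) - H(f),
\qquad H(f) = -\int f(x) \log f(x)\, dx .

% H(f) does not depend on g, so maximizing L(f||g) over g minimizes KL(f||g).

% Jensen's inequality applied to each reference component gives a lower bound
\mathcal{L}(f \,\|\, g)
  = \sum_a \pi_a \int f_a(x) \log \sum_b \omega_b\, g_b(x)\, dx
  \;\ge\; \sum_a \pi_a \sum_b \phi_{b|a}
     \left[ \log \frac{\omega_b}{\phi_{b|a}} + \int f_a(x) \log g_b(x)\, dx \right].
```

The integral $\int f_a \log g_b$ has a closed form for Gaussians, which is what makes alternating updates of $\phi$ and of the parameters of $g$ feasible in an EM-style loop.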

#2Investigations on Convex Optimization Using Log-Linear HMMs for Digit String Recognition

Georg Heigold (RWTH Aachen University)
David Rybach (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

Discriminative methods are an important technique to refine the acoustic model in speech recognition. Conventional discriminative training is initialized with some baseline model and the parameters are re-estimated in a separate step. This approach has proven to be successful, but it includes many heuristics, approximations, and parameters to be tuned. This tuning involves much engineering and makes it difficult to reproduce and compare experiments. In contrast to the conventional training, convex optimization techniques provide a sound approach to estimate all model parameters from scratch. Such a direct approach would hopefully dispense with additional heuristics, e.g. scaling of posteriors. This paper addresses the question of how well this concept using log-linear models carries over to practice. Experimental results are reported for a digit string recognition task, which allows for the investigation of this issue without approximations.

#3Investigations on discriminative training in large scale acoustic model estimation

Janne Pylkkönen (Adaptive Informatics Research Centre, Helsinki University of Technology)

In this paper two common discriminative training criteria, maximum mutual information (MMI) and minimum phone error (MPE), are investigated. Two main issues are addressed: sensitivity to different lattice segmentations and the contribution of the parameter estimation method. It is noted that MMI and MPE may benefit from different lattice segmentation strategies. The use of discriminative criterion values as the measure of model goodness is shown to be problematic as the recognition results do not correlate well with these measures. Moreover, the parameter estimation method clearly affects the recognition performance of the model irrespective of the value of the discriminative criterion. Also the dependence on the recognition task is demonstrated by example with two Finnish large vocabulary dictation tasks used in the experiments.

#4Margin-Space Integration of MPE Loss via Differencing of MMI Functionals for Generalized Error-Weighted Discriminative Training

Erik McDermott (NTT Corporation)
Shinji Watanabe (NTT Corporation)
Atsushi Nakamura (NTT Corporation)

Using the central observation that margin-based weighted classification error (modeled using Minimum Phone Error (MPE)) corresponds to the derivative with respect to the margin term of margin-based hinge loss (modeled using Maximum Mutual Information (MMI)), this article subsumes and extends margin-based MPE and MMI within a broader framework in which the objective function is an integral of MPE loss over a range of margin values. Applying the Fundamental Theorem of Calculus, this integral is easily evaluated using finite differences of MMI functionals; lattice-based training using the new criterion can then be carried out using differences of MMI gradients. Preliminary experimental results comparing the new framework with margin-based MMI, MCE and MPE on the Corpus of Spontaneous Japanese and the MIT OpenCourseWare/MIT-World corpus are presented.
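
The step from the integral to finite differences can be made explicit. The notation here is assumed for illustration only: $\mathcal{F}_{\mathrm{MMI}}(\sigma)$ is the margin-based MMI functional at margin $\sigma$, $\mathcal{L}_{\mathrm{MPE}}(\sigma)$ the margin-based MPE loss, $\Lambda$ the model parameters, and the derivative relation is taken to hold up to the sign and scaling conventions of the paper.

```latex
% Central observation (assumed convention): MPE loss is the margin derivative of MMI
\frac{\partial}{\partial \sigma}\, \mathcal{F}_{\mathrm{MMI}}(\sigma; \Lambda)
  = \mathcal{L}_{\mathrm{MPE}}(\sigma; \Lambda)

% Fundamental Theorem of Calculus over a margin interval [\sigma_1, \sigma_2]
\int_{\sigma_1}^{\sigma_2} \mathcal{L}_{\mathrm{MPE}}(\sigma; \Lambda)\, d\sigma
  = \mathcal{F}_{\mathrm{MMI}}(\sigma_2; \Lambda) - \mathcal{F}_{\mathrm{MMI}}(\sigma_1; \Lambda)

% Hence the training gradient is a difference of two MMI gradients
\nabla_{\Lambda} \int_{\sigma_1}^{\sigma_2} \mathcal{L}_{\mathrm{MPE}}(\sigma; \Lambda)\, d\sigma
  = \nabla_{\Lambda} \mathcal{F}_{\mathrm{MMI}}(\sigma_2; \Lambda)
  - \nabla_{\Lambda} \mathcal{F}_{\mathrm{MMI}}(\sigma_1; \Lambda)
```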

#5Compacting Discriminative Feature Space Transforms for Embedded Devices

Etienne Marcheret (IBM)
Jia-Yu Chen (UIUC)
Petr Fousek (IBM)
Peder Olsen (IBM)
Vaibhava Goel (IBM)

Discriminative training of the feature space using the minimum phone error objective function has been shown to yield remarkable accuracy improvements. These gains, however, come at a high cost of memory. In this paper we present techniques that maintain fMPE performance while reducing the required memory by approximately 94%. This is achieved by designing a quantization methodology which minimizes the error between the true fMPE computation and that produced with the quantized parameters. Also illustrated is a Viterbi search over the allocation of quantization levels, providing a framework for optimal non-uniform allocation of quantization levels over the dimensions of the fMPE feature vector. This provides an additional 8% relative reduction in required memory with no loss in recognition accuracy.

#6A Discriminative Back-off Acoustic Model for Automatic Speech Recognition

Hung-An Chang (MIT Computer Science and Artificial Intelligence Laboratory)
James R. Glass (MIT Computer Science and Artificial Intelligence Laboratory)

In this paper we propose a back-off discriminative acoustic model for Automatic Speech Recognition (ASR). We use a set of broad phonetic classes to divide the classification problem originating from context-dependent modeling into a set of sub-problems. By appropriately combining the scores from classifiers designed for the sub-problems, we can guarantee that the back-off acoustic score for different context-dependent units will be different. The back-off model can be combined with discriminative training algorithms to further improve the performance. Experimental results on a large vocabulary lecture transcription task show that the proposed back-off discriminative acoustic model has more than a 2.0% absolute word error rate reduction compared to a clustering-based acoustic model.

#7Efficient Generation and Use of MLP Features for Arabic Speech Recognition

Junho Park (University of Cambridge)
Frank Diehl (University of Cambridge)
Mark Gales (University of Cambridge)
Marcus Tomalin (University of Cambridge)
Phil Woodland (University of Cambridge)

Features derived from Multi-Layer Perceptrons (MLPs) raise the challenge of how to build such a complex MLP efficiently with a huge amount of training data. This paper discusses various methods to reduce the training effort for the incorporation of MLP features into an ASR system: parallel network design and training; methods for combining the outputs of those parallel networks; and a rapid retraining procedure for discriminatively trained MLP-feature based acoustic models. The use of parallel network combination gave significant improvements in word error rate over the standard MLP configuration in a single unadapted decoding stage. However, the gains shrank after sophisticated adaptation steps, although the combination method was efficient in terms of training cost.

#8A Study of Bootstrapping with Multiple Acoustic Features for Improved Automatic Speech Recognition

Xiaodong Cui (IBM T. J. Watson Research Center)
Jian Xue (IBM T. J. Watson Research Center)
Bing Xiang (IBM T. J. Watson Research Center)
Bowen Zhou (IBM T. J. Watson Research Center)

This paper investigates a scheme of bootstrapping with multiple acoustic features (MFCC, PLP and LPCC) to improve the overall performance of automatic speech recognition. In this scheme, a Gaussian mixture distribution is estimated for each type of feature resampled in each HMM state by single-pass re-training on a shared decision tree. The acoustic models thus obtained from the multiple features are combined by likelihood averaging during decoding. Experiments on large vocabulary spontaneous speech recognition show that the combined model outperforms the best of the acoustic models built from individual features. It also achieves performance comparable to Recognizer Output Voting Error Reduction (ROVER), with computational advantages.
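
As a rough illustration of combining per-feature models by likelihood averaging, the sketch below scores one HMM state with one diagonal-covariance GMM per feature stream and averages the resulting likelihoods in the log domain. The arithmetic mean, the stream names, and all function names are assumptions made for this example; the abstract does not specify the exact form of the averaging.

```python
import math


def log_gauss_diag(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at vector x."""
    return sum(
        -0.5 * (math.log(2.0 * math.pi * v) + (xi - m) ** 2 / v)
        for xi, m, v in zip(x, mean, var)
    )


def logsumexp(values):
    """Numerically stable log(sum(exp(v) for v in values))."""
    top = max(values)
    return top + math.log(sum(math.exp(v - top) for v in values))


def gmm_loglik(x, gmm):
    """Log-likelihood of x under a GMM given as [(weight, mean, var), ...]."""
    return logsumexp(
        [math.log(w) + log_gauss_diag(x, mu, var) for w, mu, var in gmm]
    )


def averaged_state_loglik(frames, gmms):
    """Log of the arithmetic mean of per-stream likelihoods for one state.

    frames: {stream_name: feature_vector}, gmms: {stream_name: gmm}.
    """
    per_stream = [gmm_loglik(frames[name], gmms[name]) for name in gmms]
    return logsumexp(per_stream) - math.log(len(per_stream))
```

Because all streams share one decision tree, a single state index selects the matching GMM in every stream, so the combination needs only one decoding pass.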

#9Analysis of Low-Resource Acoustic Model Self-Training

Scott Novotney (BBN Technologies)
Richard Schwartz (BBN Technologies)

Previous work on self-training of acoustic models using unlabeled data reported significant reductions in WER assuming a large phonetic dictionary was available. We now assume that only those words from ten hours of speech are initially available. Subsequently, we are given a large vocabulary and quantify the value of repeating self-training with this larger dictionary. This experiment is used to analyze the effects of self-training on categories of words. We report the following findings: (i) Although the small 5k vocabulary raises WER by 2% absolute, self-training is as effective as with a large 75k vocabulary. (ii) Adding all 75k words to the decoding vocabulary after self-training reduces the WER degradation to only 0.8% absolute. (iii) By a wide margin, self-training most benefits those words that occur in the unlabeled audio but not in the transcribed data.

#10Log-linear Model Combination with Word-dependent Scaling Factors

Björn Hoffmeister (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Liang Ruoying (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Hermann Ney (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)

Log-linear model combination is the standard approach in LVCSR to combine several knowledge sources, usually an acoustic and a language model. Instead of using a single scaling factor per knowledge source, we make the scaling factor word- and pronunciation-dependent. In this work, we combine three acoustic models, a pronunciation model, and a language model for a Mandarin BN/BC task. The achieved error rate reduction of 2% relative is small but consistent for two test sets. An analysis of the results shows that the major contribution comes from the improved interdependency of language and acoustic model.
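
A minimal sketch of the idea, assuming a hypothesis is a list of words with one log-score per knowledge source: the usual global scaling factor for a source is replaced by a word-specific one whenever such a factor exists. The data layout and names are hypothetical, and the pronunciation dependence described in the abstract is left out for brevity.

```python
def hypothesis_score(words, global_scale, word_scale):
    """Log-linear score of a hypothesis with word-dependent scaling factors.

    words:        [(word, {source: log_prob, ...}), ...]
    global_scale: {source: scale} used when no word-specific factor exists
    word_scale:   {(source, word): scale} word-dependent overrides
    """
    total = 0.0
    for word, log_probs in words:
        for source, log_prob in log_probs.items():
            scale = word_scale.get((source, word), global_scale[source])
            total += scale * log_prob
    return total
```

With an empty `word_scale` this reduces to the standard combination with a single scaling factor per knowledge source.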

Mon-Ses2-P4:
Spoken dialogue systems

Time:Monday 13:30 Place:Hewison Hall Type:Poster
Chair:Dilek Hakkani-Tur

#1Enabling A User To Specify An Item At Any Time During System Enumeration

Kyoko Matsuyama (Kyoto University)
Kazunori Komatani (Kyoto University)
Tetsuya Ogata (Kyoto University)
Hiroshi G. Okuno (Kyoto University)

In conversational dialogue systems, users prefer to speak at any time and to use natural expressions. We have developed an Independent Component Analysis (ICA) based semi-blind source separation method, which allows users to barge-in over system utterances at any time. We created a novel method that uses timing information derived from barge-in utterances to identify the one item that a user indicates during system enumeration. First, we determine the timing distribution of user utterances containing referential expressions and then approximate it using a gamma distribution. Second, we represent both the utterance timing and automatic speech recognition (ASR) results as probabilities of the desired selection from the system's enumeration. We then integrate these two probabilities to identify the item having the maximum likelihood of selection. Experimental results using 400 utterances indicated that our method outperformed two baseline methods (one using ASR results only and one using utterance timing only) in identification accuracy.
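
The two-step integration described above could look roughly like the following sketch. It assumes that the delay between the start of an enumerated item and the user's barge-in follows a gamma distribution, and that the timing and ASR probabilities are combined by a simple product; the shape and scale values, the product rule, and all names are illustrative assumptions rather than the authors' settings.

```python
import math


def gamma_pdf(x, shape, scale):
    """Density of a gamma distribution; zero for non-positive x."""
    if x <= 0:
        return 0.0
    return (x ** (shape - 1) * math.exp(-x / scale)
            / (math.gamma(shape) * scale ** shape))


def normalise(scores):
    """Scale non-negative scores so they sum to one (uniform if all zero)."""
    total = sum(scores.values())
    if total == 0:
        return {key: 1.0 / len(scores) for key in scores}
    return {key: value / total for key, value in scores.items()}


def select_item(item_starts, barge_in_time, asr_probs, shape=2.0, scale=0.5):
    """Pick the enumerated item with the highest combined probability.

    item_starts: {item: time at which the system started saying it}
    asr_probs:   {item: probability of the item according to ASR}
    """
    timing = normalise({
        item: gamma_pdf(barge_in_time - start, shape, scale)
        for item, start in item_starts.items()
    })
    combined = {
        item: timing[item] * asr_probs.get(item, 0.0) for item in item_starts
    }
    return max(combined, key=combined.get), combined
```

When the ASR result is uninformative, the timing term decides; when ASR is confident, it can override a weak timing cue.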

#2System Request Detection in Human Conversation Based on Multi-Resolution Gabor Wavelet Features

Tomoyuki Yamagata (Kobe University)
Tetsuya Takiguchi (Kobe University)
Yasuo Ariki (Kobe University)

For a hands-free speech interface, it is important to detect commands in spontaneous utterances. Usual voice activity detection systems can only distinguish speech frames from non-speech frames, but they cannot discriminate whether the detected speech section is a command for a system or not. In this paper, in order to analyze the difference between system requests and spontaneous utterances, we focus on fluctuations in a long period, such as prosodic articulation, and fluctuations in a short period, such as phoneme articulation. The use of multi-resolution analysis using Gabor wavelet on a Log-scale Mel-frequency Filter-bank clarifies the different characteristics of system commands and spontaneous utterances. Experiments using our robot dialog corpus show that the accuracy of the proposed method is 92.6% in F-measure, while the conventional power and prosody-based method is just 66.7%.

#3Using Graphical Models for Mixed-Initiative Dialog Management Systems with Realtime Policies

Stefan Schwärzler (Technische Universität München, Germany)
Stefan Maier (Technische Universität München, Germany)
Joachim Schenk (Technische Universität München, Germany)
Frank Wallhoff (Technische Universität München, Germany)
Gerhard Rigoll (Technische Universität München, Germany)

In this paper, we present a novel approach for dialog modeling, which extends the idea underlying the partially observable Markov Decision Processes (POMDPs), i.e. it allows for calculating the dialog policy in real-time and thereby increases the system flexibility. The use of statistical dialog models is particularly advantageous to react adequately to common errors of speech recognition systems. Comparing our results to the reference system (POMDP), we achieve a relative reduction of 31.6% of the average dialog length. Furthermore, the proposed system shows a relative enhancement of 64.4% of the sensitivity rate in the error recognition capabilities using the same specificity rate in both systems. The achieved results are based on the Air Travelling Information System with 21650 user utterances in 1585 natural spoken dialogs.

#4Conversation Robot Participating in and Activating a Group Communication

Shinya Fujie (Waseda University)
Yoichi Matsuyama (Waseda University)
Hikaru Taniyama (Waseda University)
Tetsunori Kobayashi (Waseda University)

As a new type of application of the conversation system, a robot activating other parties' communications has been developed. The robot participates in a quiz game with other participants and tries to activate the game. The functions installed in the robot are as follows: (1) The robot can participate in a group communication using its basic group conversation function. (2) The robot can perform the game according to the rules of the game. (3) The robot can activate communication using its proper actions depending on the game situations and the participants' situations. We conducted a real field experiment: the prototype system performed a quiz game with elderly people in an adult day-care center. The robot successfully entertained the people with its one hour demonstration.

#5Recent Advances in WFST-based Dialog System

Chiori Hori (National Institute of Information and Communications Technology (NICT))
Kiyonori Ohtake (National Institute of Information and Communications Technology (NICT))
Teruhisa Misu (National Institute of Information and Communications Technology (NICT))
Hideki Kashioka (National Institute of Information and Communications Technology (NICT))
Satoshi Nakamura (National Institute of Information and Communications Technology (NICT))

We proposed a dialog system using a weighted finite-state transducer (WFST) in which user concept and system action tags are the input and output of the transducer, respectively. To test the potential of the WFST-based dialog management (DM) platform using statistical DM models, we constructed a dialog system using a human-to-human spoken dialog corpus for hotel reservation, which is annotated with Interchange Format (IF). A scenario WFST and a spoken language understanding (SLU) WFST were obtained from the corpus and then composed together and optimized. We evaluated the detection accuracy of the system's next actions. In this paper, we focus on how WFST optimization operations contribute to the performance of the system. In addition, we have constructed a full WFST-based dialog system by composing SLU, scenario and sentence generation (SG) WFSTs. We show an example of a hotel reservation dialog with the fully composed system and discuss future work.

#6A Statistical Dialog Manager for the LUNA Project

David Griol (Universidad Carlos III de Madrid)
Giuseppe Riccardi (University of Trento)
Emilio Sanchis (Universitat Politecnica de Valencia)

In this paper, we present an approach for the development of a statistical dialog manager, in which the system response is selected by means of a classification process which considers all the previous history of the dialog to select the next system response. In particular, we use decision trees for its implementation. The statistical model is automatically learned from training data which are labeled in terms of different SLU features. This methodology has been applied to develop a dialog manager within the framework of the European LUNA project, whose main goal is the creation of a robust natural spoken language understanding system. We present an evaluation of this approach for both human-machine and human-human conversations acquired in this project. We demonstrate that a statistical dialog manager developed with the proposed technique and learned from a corpus of human-machine dialogs can successfully infer the task-related topics present in spontaneous human-human dialogs.

#7A Policy-Switching Learning Approach for Adaptive Spoken Dialogue Agents

Heriberto Cuayáhuitl (Autonomous University of Tlaxcala)
Juventino Montiel-Hernández (Autonomous University of Tlaxcala)

The reinforcement learning paradigm has been adopted for inferring optimized and adaptive spoken dialogue agents. Such agents are typically learnt and tested without combining competing agents that may yield better performance at some points in the conversation. This paper presents an approach that learns dialogue behaviour from competing agents (switching from one policy to another competing one) on a previously proposed hierarchical learning framework. This policy-switching approach was investigated using a simulated flight booking dialogue system based on different types of information request. Experimental results reported that the induced agent using the proposed policy-switching approach yielded 8.2% fewer system actions than three baselines with a fixed type of information request. This result suggests that the proposed approach is useful for learning adaptive and scalable spoken dialogue agents.

#8Strategies for Accelerating the Design of Dialogue Applications using Heuristic Information from the Backend Database

Luis Fernando D'Haro (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ricardo Cordoba (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Ruben San-Segundo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Javier Macias-Guarasa (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)
Jose Manuel Pardo (Speech Technology Group. Universidad Politecnica de Madrid. Spain.)

Current commercial and academic platforms for developing spoken dialogue applications lack acceleration strategies based on using heuristic information from the contents or structure of the backend database in order to speed up the definition of the dialogue flow. In this paper we describe our attempts to take advantage of these information sources using the following strategies: the quick creation of classes and attributes to define the data model structure, the semi-automatic generation and debugging of database access functions, the automatic proposal of the slots that should preferably be requested using mixed-initiative forms or the slots that are better requested one by one using directed forms, and the generation of automatic state proposals to specify the transition network that defines the dialogue flow. Subjective and objective evaluations confirm the advantages of using the proposed strategies to simplify the design, and the high acceptance of the platform and its acceleration strategies.

#9Feature-based Summary Space for Stochastic Dialogue Modeling with Hierarchical Semantic Frames

Florian Pinault (LIA - UAPV)
Fabrice Lefèvre (LIA - UAPV)
Renato De Mori (LIA - UAPV)

In a spoken dialogue system, the dialogue manager needs to make decisions in a highly noisy environment. This work addresses this issue by proposing a framework to interface efficient probabilistic modeling both for the spoken language understanding module and for the dialogue management module. Hierarchical semantic frames are inferred and composed to build a thorough representation of the semantics of the user's utterance. This representation is then mapped into a feature-based summary space in which the set of dialogue states used by the dialogue manager is defined, based on the POMDP paradigm. This allows the dialogue course to be planned while taking into account the uncertainty on the dialogue state, and tractability is ensured by use of an intermediate summary space. A preliminary implementation of such a system is presented on the MEDIA domain. The task is touristic information and hotel reservation, and the availability of WoZ data makes it possible to consider a model-based approach to the POMDP dialogue manager.

#10Language Modeling and Dialog Management for Address Recognition

Rajesh Balchandran (IBM - T J Watson Research Center)
Rachevsky Leonid (IBM - T J Watson Research Center)
Larry Sansone (IBM - T J Watson Research Center)

This paper describes a language modeling and dialog management system for efficient and robust recognition of several arbitrarily ordered and inter-related components from very large datasets, such as complete addresses specified in a single sentence with address components in their natural sequence. A new two-pass speech recognition technique based on using multiple language models with embedded grammars is presented. Tests with this technique on a complete address recognition task yielded good results, and memory and CPU requirements are sufficiently low to make this technique viable for embedded environments. Additionally, a goal-oriented algorithm for dialog-based error recovery and disambiguation, which does not require manual identification of all possible dialog situations, is also presented. The combined system yields very high task completion accuracy, for only a few additional turns of interaction.

#11A framework for rapid development of conversational natural language call routing systems for call centers

Ea-Ee Jan (IBM)
Hong-Kwang Kuo (IBM)
Osamuyimen Stewart (IBM)
David Lubensky (IBM)

A framework for rapid development of conversational natural language call routing systems is proposed. The framework cuts costs by using only scantily prepared business requirements to automatically create an initial prototype. Aside from clear targets (terminal routing classes), vague targets, which are variations of users’ incomplete (semantically overlapping) sentences, are enumerated. The vague targets can be derived from the confusion set of the semantic tokens of the clear targets. Also automatically generated for a vague target is a disambiguation dialogue module, which consists of a prompt and grammar to guide the user from a vague target to one of its associated clear targets. In the final analysis, our framework is able to reduce the human labor associated with developing an initial natural language call routing system from a few weeks to just a few hours. The experimental results from a deployed pilot system support the feasibility of our proposed approach.

#12The MonAMI Reminder: a spoken dialogue system for face-to-face interaction

Jonas Beskow (KTH Speech Music & Hearing)
Jens Edlund (KTH Speech Music & Hearing)
Björn Granström (KTH Speech Music & Hearing)
Joakim Gustafson (KTH Speech Music & Hearing)
Gabriel Skantze (KTH Speech Music & Hearing)
Helena Tobiasson (KTH Human-Computer Interaction Group)

We describe the MonAMI Reminder, a multimodal spoken dialogue system which can assist elderly and disabled people in organising and initiating their daily activities. Based on deep interviews with potential users, we have designed a calendar and reminder application which uses an innovative mix of an embodied conversational agent, digital pen and paper, and the web to meet the needs of those users as well as the current constraints of speech technology. We also explore the use of head pose tracking for interaction and attention control in human-computer face-to-face interaction.

#13Influence of Training on Direct and Indirect Measures for the Evaluation of Multimodal Systems

Julia Seebode (Research training group prometei, Berlin Institute of Technology, Germany)
Stefan Schaffer (Research training group prometei, Berlin Institute of Technology, Germany)
Ina Wechsung (Deutsche Telekom Laboratories, Berlin Institute of Technology, Germany)
Florian Metze (School of Computer Science, Carnegie Mellon University, Pittsburgh, USA)

Finding suitable evaluation methods is an indispensable task during the development of new user interfaces, as no standardized approach has so far been established, especially for multimodal interfaces. In the current study, we used several data sources (direct and indirect measurements) to evaluate a multimodal version of an information system, tested on trained and untrained users. We investigated the extent to which the different types of data showed concordance concerning the perceived quality of the system, in order to derive clues as to the suitability of the respective evaluation methods. The aim was to examine whether widely used methods not originally developed for multimodal interfaces are appropriate under these conditions, and to derive new evaluation paradigms.

#14Talking Heads for Interacting with Spoken Dialog Smart-Home Systems

Christine Kühnel (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Benjamin Weiss (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin)

In this paper the relation between the quality of a talking head as an output component of a spoken dialog system and the quality of the system itself is investigated. Results show that the quality of the talking head indeed has an important impact on system quality. The quality of the talking head itself is found to be influenced by visual and speech quality and the synchronization of voice and lip movement.

#15Speech Generation from Hand Gestures Based on Space Mapping

Aki Kunikoshi (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

Individuals with speaking disabilities often use a TTS synthesizer for speech communication. Since users always have to type sound symbols and the synthesizer reads them out in a monotonous style, the use of current synthesizers usually makes real-time operation and lively communication difficult. In this paper, we develop a special glove that, when worn, generates speech sounds from hand gesture transitions. For development, GMM-based voice conversion techniques are applied to estimate a mapping function between a space of hand gestures and another space of speech sounds. As an initial trial, a mapping between hand gestures and Japanese vowel sounds is estimated so that topological features of the selected gestures in a feature space and those of the five Japanese vowels in a cepstrum space are equalized. Experiments show that the special glove can generate good Japanese vowel transitions with voluntary control of duration and articulation.

Mon-Ses3-O1:
Automatic Speech Recognition: Language Models I

Time:Monday 16:00 Place:Main Hall Type:Oral
Chair:Steve Renals

16:00Back-Off Language Model Compression

Boulos Harb (Google, Inc.)
Ciprian Chelba (Google, Inc.)
Jeffrey Dean (Google, Inc.)
Sanjay Ghemawat (Google, Inc.)

With the availability of large amounts of training data relevant to speech recognition scenarios, scalability becomes a very productive way to improve language model performance. We present a technique that represents a back-off n-gram language model using arrays of integer values and thus renders it amenable to effective block compression. We propose a few such compression algorithms and evaluate the resulting language model along two dimensions: memory footprint, and speed reduction relative to the uncompressed one. We experimented with a model that uses a 32-bit word vocabulary (at most 4B words) and log-probabilities/back-off-weights quantized to 1 byte, respectively. The best compression algorithm achieves 2.6 bytes/n-gram at 18X slower than uncompressed.
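
To illustrate why an integer-array representation is amenable to block compression, the sketch below compresses a sorted block of integers (for example, word identifiers within one n-gram context) by delta-encoding followed by variable-length byte encoding. This is a generic textbook scheme chosen for illustration, not one of the algorithms proposed in the paper, and all names are hypothetical.

```python
def encode_varint(n):
    """Encode a non-negative integer using 7 bits per byte, low bits first."""
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # high bit set: more bytes follow
        else:
            out.append(byte)
            return bytes(out)


def compress_block(values):
    """Delta-encode a non-decreasing block of integers, then varint each gap."""
    out = bytearray()
    previous = 0
    for value in values:
        out += encode_varint(value - previous)
        previous = value
    return bytes(out)


def decompress_block(data):
    """Invert compress_block by decoding gaps and accumulating them."""
    values = []
    current, shift, previous = 0, 0, 0
    for byte in data:
        current |= (byte & 0x7F) << shift
        if byte & 0x80:
            shift += 7
        else:
            previous += current
            values.append(previous)
            current, shift = 0, 0
    return values
```

A lookup must decode a block sequentially before it can search it, which is the kind of cost behind the memory versus speed trade-off reported in the abstract.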

16:20Improving Broadcast News Transcription with a Precision Grammar and Discriminative Reranking

Tobias Kaufmann (ETH Zurich)
Thomas Ewender (ETH Zurich)
Beat Pfister (ETH Zurich)

We propose a new approach of integrating a precision grammar into speech recognition. The approach is based on a novel robust parsing technique and discriminative reranking. By reranking 100-best output of the LIMSI German broadcast news transcription system we achieved a significant reduction of the word error rate by 9.6% relative. To our knowledge, this is the first significant improvement for a real-world broad-domain speech recognition task due to a precision grammar.

16:40Use of Contexts in Language Model Interpolation and Adaptation

Xunying Liu (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)
Phil Woodland (Cambridge University Engineering Department)

Language models (LMs) are often constructed by building component models on multiple text sources to be interpolated using global, context free weights. By re-adjusting these weights, LMs may be adapted to a target domain of a particular genre, epoch or other higher level attributes. Other factors that determine the "usefulness" of sources on a context dependent basis, such as modeling resolution, generalization, topics and styles, are poorly modeled. To overcome this problem, this paper investigates a context dependent form of LM interpolation and adaptation. In previous research, it was used primarily for LM adaptation. In this paper, a range of schemes to combine context dependent weights obtained from training and test data to improve LM adaptation are proposed. Consistent perplexity and error rate gains of 6% relative were obtained on a state-of-the-art broadcast recognition task.
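
The difference between global and context dependent interpolation can be sketched as follows: each component LM contributes a probability for the word given its history, and the interpolation weights are looked up per history, falling back to the global weights when no context-specific set is available. The function names and the fallback rule are assumptions for illustration, not the schemes evaluated in the paper.

```python
def interpolate(word, history, components, context_weights, global_weights):
    """Interpolated probability P(word | history) over component LMs.

    components:      list of callables model(word, history) -> probability
    context_weights: {history: [weight per component]} context dependent sets
    global_weights:  [weight per component] used for unseen histories
    """
    weights = context_weights.get(history, global_weights)
    return sum(
        weight * model(word, history)
        for weight, model in zip(weights, components)
    )
```

If every weight set sums to one and each component is a proper distribution, the interpolated model is also a proper distribution for every history.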

17:00Exploiting Chinese Character Models to Improve Speech Recognition Performance

J. L. Hieronymus (NASA Ames Research Center)
X. Liu (Cambridge University Engineering Department)
M. J. F. Gales (Cambridge University Engineering Department)
P.C. Woodland (Cambridge University Engineering Department)

The Chinese language is based on characters which are syllabic in nature. Since languages have syllabotactic rules which govern the construction of syllables and their allowed sequences, Chinese character sequence models can be used as a first level approximation. N-gram character sequence models were trained on 4.3 billion characters. Characters are used as a first level recognition unit with multiple pronunciations per character. The CU-HTK Mandarin word based system was used to recognize words which were then converted to character sequences. The character alone error rates of one best recognition were slightly worse than word based character recognition. However, combining the two systems using log-linear combination gives better results than either system separately. An equally weighted combination gave consistent CER gains of 0.1-0.2% absolute over the word based standard system.

17:20Constraint selection for topic-based MDI adaptation of language models

Gwénolé Lecorvé (IRISA/INSA, France)
Guillaume Gravier (IRISA/CNRS, France)
Pascale Sébillot (IRISA/INSA, France)

This paper presents an unsupervised topic-based language model adaptation method which specializes the standard minimum information discrimination approach by identifying and combining topic-specific features. By acquiring a topic terminology from a thematically coherent corpus, language model adaptation is restrained to the sole probability re-estimation of n-grams ending with some topic-specific words, keeping other probabilities untouched. Experiments are carried out on a large set of spoken documents about various topics. Results show significant perplexity and recognition improvements which outperform results of classical adaptation techniques.

17:40Nonstationary Latent Dirichlet Allocation for Speech Recognition

Chuang-Hua Chueh (National Cheng Kung University)
Jen-Tzung Chien (National Cheng Kung University)

Latent Dirichlet allocation (LDA) has been successful for document modeling. LDA extracts the latent topics across documents. Words in a document are generated by the same topic distribution. However, in real-world documents, the usage of words in different paragraphs is varied and accompanied with different writing styles. This study extends the LDA and copes with the variations of topic information within a document. We build the nonstationary LDA (NLDA) by incorporating a Markov chain which is used to detect the stylistic segments in a document. Each segment corresponds to a particular style in composition of a document. This NLDA can exploit the topic information between documents as well as the word variations within a document. We accordingly establish a Viterbi-based variational Bayesian procedure. A language model adaptation scheme using NLDA is developed for speech recognition. Experimental results show improvement of NLDA over LDA in terms of perplexity and word error rate.

Mon-Ses3-O2:
Phoneme-level Perception

Time:Monday 16:00 Place:East Wing 1 Type:Oral
Chair:Rolf Carlson

16:00Categorical perception of speech without stimulus repetition

Jack Rogers (MRC Cognition and Brain Sciences Unit, Cambridge, UK)
Matthew Davis (MRC Cognition and Brain Sciences Unit, Cambridge, UK)

We explored the perception of phonetic continua generated with an automated auditory morphing technique in three perceptual experiments. The use of large sets of stimuli allowed an assessment of the impact of single vs. paired presentation without the massed stimulus repetition typical of categorical perception experiments. A third experiment shows that such massed repetition alters the degree of categorical and sub-categorical discrimination possible in speech perception. Implications for accounts of speech perception are discussed.

16:20Non-automaticity of use of orthographic knowledge in phoneme evaluation

Anne Cutler (Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia)

Two phoneme goodness rating experiments addressed the role of orthographic knowledge in the evaluation of speech sounds. Ratings for the best tokens of /s/ were higher in words spelled with S (e.g., bless) than in words where /s/ was spelled with C (e.g., voice). This difference did not appear for analogous nonwords for which every lexical neighbour had either S or C spelling (pless, floice). Models of phonemic processing incorporating obligatory influence of lexical information in phonemic processing cannot explain this dissociation; the data are consistent with models in which phonemic decisions are not subject to necessary top-down lexical influence.

16:40Learning and generalization of novel contrastive cues

Meghan Sumner (Stanford University, Department of Linguistics)

This paper examines the learning of a novel phonetic contrast. Specifically, we examine how a contrast is learned – do speakers learn a specific property about a particular word, or do they internalize a pattern that can be applied to words of a particular type in subsequent processing? In two experiments, participants listened to foreign-accented English and were taught to make stop release contrastive. Following training, participants take either a minimal pair decision task or a cross-modal form priming task, both of which include trained words, words that were untrained but include a trained rime, and novel, untrained words. The results of both experiments suggest that listeners use both strategies in learning – they generalize to words with similar rimes, but are unable to extend this knowledge to novel words.

17:00Vowel Category Perception Affected by Microdurational Variations

Einar Meister (Institute of Cybernetics, Tallinn University of Technology, Estonia)
Stefan Werner (Department of General Linguistics and Language Technology, University of Joensuu, Finland)

Vowel quality perception in quantity languages is considered to be unrelated to vowel duration since duration is used to realize quantity oppositions. To test the role of microdurational variations in vowel category perception in Estonian, listening experiments with synthetic stimuli were carried out, involving five vowel pairs along the close-open axis. The results show that in the case of high-mid vowel pairs vowel openness correlates positively with stimulus duration; in mid-low vowel pairs no such correlation was found. The discrepancy in the results is explained by the hypothesis that in case of shorter perceptual distances (high-mid area of vowel space) intrinsic duration plays the role of a secondary feature to enhance perceptual contrast between vowels, whereas in case of mid-low oppositions perceptual distance is large enough to guarantee the necessary perceptual contrast by spectral features alone and vowel intrinsic duration as an additional cue is not needed.

17:20Perceptual grouping of alternating word pairs: Effect of pitch difference and presentation rate

Nandini Iyer (Air Force Research Laboratory)
Douglas Brungart (Air Force Research Laboratory)
Brian Simpson (Air Force Research Laboratory)

When listeners hear sequences of tones that slowly alternate between a low frequency and a slightly higher frequency, they report hearing a single stream of alternating tones. However, when the alternation rate and/or the frequency difference increases, they report hearing two distinct streams: a slowly pulsing high and low frequency stream. This experiment used repeating sequences of spondees to investigate whether a similar streaming phenomenon might occur for speech stimuli. The F0 difference between every other word was varied from 0 to 18 semitones. Each word was either 100 or 125 ms in duration. The inter-onset intervals (IOIs) of the individual words were varied from 100 to 300 ms. As expected, the F0 difference was a strong cue for sequential segregation. Moreover, the number of 'two' stream judgments was greater at smaller IOIs, suggesting that factors that influence the obligatory streaming of tonal signals are also important in the segregation of speech signals.

17:40Comparing methods to find a best exemplar in a multidimensional space

Titia Benders (Institute of Phonetic Sciences, University of Amsterdam)
Paul Boersma (Institute of Phonetic Sciences, University of Amsterdam)

We present a simple algorithm for running a listening experiment aimed at finding the best exemplar in a multidimensional space. For simulated humanlike listeners, who have perception thresholds and some decision noise on their responses, the algorithm on average ends up twelve times closer than Iverson and Evans’ goodness interpolation algorithm.

Mon-Ses3-O3:
Statistical Parametric Synthesis I

Time:Monday 16:00 Place:East Wing 2 Type:Oral
Chair:Keiichi Tokuda

16:00Autoregressive HMMs for speech synthesis

Matt Shannon (Cambridge University Engineering Department, U.K.)
William Byrne (Cambridge University Engineering Department, U.K.)

We propose the autoregressive HMM for speech synthesis. We show that the autoregressive HMM supports efficient EM parameter estimation and that we can use established effective synthesis techniques such as synthesis considering global variance with minimal modification. The autoregressive HMM uses the same model for parameter estimation and synthesis in a consistent way, in contrast to the standard HMM synthesis framework, and supports easy and efficient parameter estimation, in contrast to the trajectory HMM. We find that the autoregressive HMM gives performance comparable to the standard HMM synthesis framework on a Blizzard Challenge-style naturalness evaluation.

16:20Asynchronous F0 and Spectrum Modeling for HMM-based Speech Synthesis

Cheng-Cheng Wang (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Zhen-Hua Ling (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)
Li-Rong Dai (USTC iFlytek Speech Lab, University of Science and Technology of China, Hefei, China)

This paper proposes an asynchronous model structure for fundamental frequency (F0) and spectrum modeling in HMM-based parametric speech synthesis to improve the performance of F0 prediction. F0 and spectrum features are considered to be synchronous in the conventional system. Considering that the production of these two features is decided by the movement of different speech organs, an explicitly asynchronous model structure is introduced. At the training stage, F0 models are trained asynchronously with spectrum models. At the synthesis stage, the two features are generated separately. The objective and subjective evaluation results show the proposed method can effectively improve the accuracy of F0 prediction.

16:40A Minimum V/U Error Approach to F0 Generation in HMM-based TTS

Yao Qian (Microsoft Research Asia, Beijing, China)
Frank Soong (Microsoft Research Asia, Beijing, China)
Miaomiao Wang (Microsoft Research Asia, Beijing, China)
Zhizheng Wu (Microsoft Research Asia, Beijing, China)

HMM-based TTS can produce a highly intelligible voice of decent quality. However, the HMM model degrades when feature vectors used in training are noisy. Among all noisy features, pitch tracking errors and corresponding flawed voiced/unvoiced (v/u) decisions are identified as two key factors in voice quality problems. In this paper, we propose a minimum v/u error approach to F0 generation. Prior knowledge of v/u is imposed in each Mandarin phone and accumulated v/u posterior probabilities are used to search for the optimal v/u switching point in each VU or UV segment in generation. Objectively the new approach is shown to improve v/u prediction performance, specifically on voiced to unvoiced swapping errors. They are reduced from 3.7% (baseline) down to 2.0% (new approach). The improvement is also subjectively confirmed by an AB preference test score, 72% (new approach) versus 22% (baseline).
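
The search for a single switching point inside a UV or VU segment can be sketched as follows. This is only one plausible reading of "accumulated v/u posterior probabilities": the log-posterior objective, the function name `best_switch_point`, and the example posteriors are assumptions, not the authors' formulation.

```python
import math


def best_switch_point(voiced_posteriors, unvoiced_first=True):
    """Return the frame index at which a segment switches v/u state.

    voiced_posteriors: per-frame P(voiced) for one UV (or VU) segment.
    For a UV segment, frames [0, k) are labelled unvoiced and frames
    [k, n) voiced; k maximises the accumulated log posterior of the labels.
    For a VU segment (unvoiced_first=False) the roles are swapped.
    """
    eps = 1e-12  # guards against log(0)
    n = len(voiced_posteriors)
    log_v = [math.log(max(p, eps)) for p in voiced_posteriors]
    log_u = [math.log(max(1.0 - p, eps)) for p in voiced_posteriors]
    first, second = (log_u, log_v) if unvoiced_first else (log_v, log_u)

    best_k, best_score = 0, float("-inf")
    for k in range(n + 1):
        # Score of labelling frames before k as `first` and the rest as `second`.
        score = sum(first[:k]) + sum(second[k:])
        if score > best_score:
            best_k, best_score = k, score
    return best_k


# Made-up voiced posteriors for a UV segment and a VU segment.
uv_switch = best_switch_point([0.1, 0.2, 0.4, 0.9, 0.8, 0.95])
vu_switch = best_switch_point([0.9, 0.8, 0.3, 0.1], unvoiced_first=False)
```

Constraining each segment to exactly one switch is what prevents isolated v/u flips in the middle of a segment, at the cost of an exhaustive search over candidate positions.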

17:00Voiced/Unvoiced Decision Algorithm for HMM-based Speech Synthesis

Shiyin Kang (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Zhiwei Shuang (IBM China Research Lab, Beijing, China)
Quansheng Duan (Department of Computer Science and Technology, Tsinghua University, Beijing, China)
Yong Qin (IBM China Research Lab, Beijing, China)
Lianhong Cai (Department of Computer Science and Technology, Tsinghua University, Beijing, China)

This paper introduces a novel method to improve the U/V decision method in HMM-based speech synthesis. In the conventional method, the U/V decision of each state is independently made, and a state in the middle of a vowel may be decided as unvoiced. In this paper, we propose to utilize the constraints of natural speech to improve the U/V decision inside a unit, such as syllable or phone. We use a GMM-based U/V change time model to select the best U/V change time in one unit, and refine the U/V decision of all states in that unit based on the selected change time. The result of a perceptual evaluation demonstrates that the proposed method can significantly improve the naturalness of the synthetic speech.

17:20Local minimum generation error criterion for hybrid HMM speech synthesis

Xavi Gonzalvo (Phonetic Arts Ltd.)
Alexander Gutkin (Yahoo! Europe)
Joan Claudi Socoro (Universitat Ramon Llull)
Ignasi Iriondo (Universitat Ramon Llull)
Paul Taylor (Phonetic Arts Ltd.)

This paper presents an HMM-driven hybrid speech synthesis approach in which unit selection concatenative synthesis is used to improve the quality of the statistical system using a Local Minimum Generation Error (LMGE) during the synthesis stage. The idea behind this approach is to combine the robustness due to HMMs with the naturalness of concatenated units. Unlike the conventional hybrid approaches to speech synthesis that use concatenative synthesis as a backbone, the proposed system employs stable regions of natural units to improve the statistically generated parameters. We show that this approach improves the generation of vocal tract parameters, smoothes the bad joints and increases the overall quality.

17:40Thousands of Voices for HMM-based Speech Synthesis

Junichi Yamagishi (University of Edinburgh)
Bela Usabaev (Universität Tübingen)
Simon King (University of Edinburgh)
Oliver Watts (University of Edinburgh)
John Dines (Idiap Research Institute)
Jilei Tian (Nokia)
Rile Hu (Nokia)
Keiichiro Oura (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)
Reima Karhila (Helsinki University of Technology)
Mikko Kurimo (Helsinki University of Technology)

Our recent experiments with HMM-based speech synthesis systems have demonstrated that speaker-adaptive HMM-based speech synthesis (which uses an ‘average voice model’ plus model adaptation) is robust to non-ideal speech data that are recorded under various conditions and with varying microphones, that are not perfectly clean, and/or that lack phonetic balance. This enables us to consider building high-quality voices on ‘non-TTS’ corpora such as ASR corpora. Since ASR corpora generally include a large number of speakers, this leads to the possibility of producing an enormous number of voices automatically. In this paper we show thousands of voices for HMM-based speech synthesis that we have made from several popular ASR corpora such as the Wall Street Journal databases (WSJ0/WSJ1/WSJCAM0), Resource Management, Globalphone and Speecon. We report some perceptual evaluation results and outline the outstanding issues.

Mon-Ses3-O4:
Systems for Spoken Language Translation

Time:Monday 16:00 Place:East Wing 3 Type:Oral
Chair:Hermann Ney

16:00Efficient Combination of Confidence Measures for Machine Translation

Sylvain Raybaud (LORIA)
David Langlois (LORIA)
Kamel Smaili (LORIA)

We present in this paper a twofold contribution to Machine Translation. First, we present a method to automatically build training and testing corpora for confidence measures containing realistic errors. Errors introduced into reference translations simulate classical machine translation errors (word deletion and word substitution), and are supervised by WordNet. Second, we use SVM to combine original and classical confidence measures at both word and sentence level. We show that the obtained combination outperforms our best single word-level confidence measure by 14% (absolute), and that sentence-level combination of confidence measures produces meaningful scores.

16:20Incremental Dialog Clustering for Speech-to-Speech Translation

David Stallard (BBN Technologies)
Stavros Tsakalidis (BBN Technologies)
Shirin Saleem (BBN Technologies)

Application domains for language processing systems, especially speech-to-speech translation and dialog systems, often contain sub-domains and/or task-types for which different outputs may be appropriate given the same input. We present a document-clustering approach to sub-domain classification, which uses a recently-developed algorithm based on von Mises-Fisher distributions. We give preliminary perplexity reduction and MT performance results for a speech-to-speech translation system using this model.

16:40Iterative Sentence-Pair Extraction from Quasi-Parallel Corpora for Machine Translation

Ruhi Sarikaya (IBM T.J. Watson Research Center)
Sameer Maskey (IBM T.J. Watson Research Center)
Rong Zhang (IBM T.J. Watson Research Center)
Ea-Ee Jan (IBM T.J. Watson Research Center)
Dagen Wang (IBM T.J. Watson Research Center)
Bhuvana Ramabhadran (IBM T.J. Watson Research Center)
Salim Roukos (IBM T.J. Watson Research Center)

This paper addresses parallel data extraction from the quasi-parallel corpora generated in a crowd-sourcing project where ordinary people watch TV shows and movies and transcribe/translate what they hear, creating document pools in different languages. Since they do not have guidelines for naming and performing translations, it is often not clear which documents are the translations of the same show/movie and which sentences are the translations of each other in a given document pair. We introduce a method for automatically pairing documents in two languages and extracting parallel sentences from the paired documents. The method consists of three steps: i) document pairing, ii) sentence pair alignment of the paired documents, and iii) context extrapolation to boost the sentence pair coverage. Human evaluation of the extracted data shows that 95% of the extracted sentences carry useful information for translation. Experimental results also show that using the extracted data .....

17:00Tree Kernel-Based Phrase Reordering with Structured Syntactic Knowledge

Min Zhang (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

Structured syntactic knowledge is important for phrase reordering. In this paper, we propose using convolution tree kernel over parse tree to model the structured syntactic knowledge for phrase reordering in the context of BTG-based statistical machine translation. Our study reveals that the structured syntactic features are very effective for phrase reordering and those features can be well captured by the tree kernel. We further combine the structured features and other commonly-used linear features into a composite kernel. Experimental results on the NIST MT-2005 Chinese-English translation tasks show that our proposed method statistically significantly outperforms the baseline methods.

17:00RTTS: Towards Enterprise-level Real-Time Speech Transcription and Translation Services

Juan M. Huerta (IBM T J Watson Research Center)
Cheng Wu (IBM T J Watson Research Center)
Andrej Sakrajda (IBM T J Watson Research Center)
Sasha Caskey (IBM T J Watson Research Center)
Ea-Ee Jan (IBM T J Watson Research Center)
Alexander Faisman (IBM T J Watson Research Center)
Shai Ben-David (IBM)
Wen Liu (IBM)
Uyi Stewart (IBM)
Michael Frissora (IBM)
David Lubensky (IBM)
Antonio Lee (IBM)

In this paper we describe the RTTS system for enterprise-level real time speech recognition and translation. RTTS follows a Web Service-based approach which allows the encapsulation of ASR and MT Technology components, thus hiding the configuration and tuning complexities and details from the client applications while exposing a uniform interface. In this way, RTTS is capable of easily supporting a wide variety of client applications. The clients we have implemented include a VoIP-based real time speech-to-speech translation system, a chat and Instant Messaging translation System, and a Transcription Server, among others.

17:20Using Syntax in Large-Scale Audio Document Translation

Jing Zheng (SRI International)
Necip Fazil Ayan (SRI International)
Wen Wang (SRI International)
David Burkett (UC Berkeley)

Recently, the use of syntax has very effectively improved machine translation (MT) quality in many text MT tasks. However, using syntax in speech MT poses additional challenges because of disfluencies and other spoken language phenomena, and of errors introduced by automatic speech recognition (ASR). In this paper, we investigate the effect of using syntax in a large-scale audio document translation task targeting broadcast news and broadcast conversations. We do so by comparing the performance of three synchronous context-free grammar based translation approaches: 1) hierarchical phrase-based translation, 2) syntax-augmented MT, and 3) string-to-dependency MT. The results show a positive effect of explicitly using syntax when translating broadcast news, but no benefit when translating broadcast conversations. The results indicate that improving the robustness of syntactic systems against conversational language style is important to their success and requires future effort.

17:40Context-driven bilingual movie subtitle alignment

Andreas Tsiartas (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Prasanta Ghosh (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Panayiotis Georgiou (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)
Shrikanth Narayanan (Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089)

Movie subtitle alignment is a potentially useful approach for automatically deriving parallel bilingual/multilingual spoken language data for automatic speech translation. In this paper, we consider the movie subtitle alignment task. We propose a distance metric between utterances of different languages based on lexical features derived from bilingual dictionaries. We use the dynamic time warping algorithm to obtain the best alignment. The best F-score of ~0.713 is obtained using the proposed approach.
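
A minimal sketch of dictionary-driven alignment with dynamic time warping follows. The distance used here (one minus the fraction of source words with a dictionary translation in the target utterance) is a hypothetical stand-in for the paper's metric, and the function names and toy subtitles are likewise assumptions.

```python
def lexical_distance(utt_a, utt_b, dictionary):
    """1 minus the fraction of words in utt_a with a dictionary translation in utt_b."""
    words_a, words_b = utt_a.lower().split(), set(utt_b.lower().split())
    if not words_a:
        return 1.0
    hits = sum(1 for w in words_a if dictionary.get(w, set()) & words_b)
    return 1.0 - hits / len(words_a)


def dtw_align(subs_a, subs_b, dictionary):
    """Align two subtitle sequences with dynamic time warping; returns index pairs."""
    n, m = len(subs_a), len(subs_b)
    inf = float("inf")
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = lexical_distance(subs_a[i - 1], subs_b[j - 1], dictionary)
            cost[i][j] = d + min(cost[i - 1][j - 1], cost[i - 1][j], cost[i][j - 1])
    # Backtrack from the end to recover the warping path.
    path, i, j = [], n, m
    while i > 0 and j > 0:
        path.append((i - 1, j - 1))
        steps = [(cost[i - 1][j - 1], i - 1, j - 1),
                 (cost[i - 1][j], i - 1, j),
                 (cost[i][j - 1], i, j - 1)]
        _, i, j = min(steps, key=lambda s: s[0])
    return path[::-1]


# Toy English/French subtitles and a tiny bilingual dictionary (illustrative only).
english = ["hello", "how are you", "good night"]
french = ["bonjour", "comment vas tu", "bonne nuit"]
en_fr = {"hello": {"bonjour"}, "how": {"comment"}, "you": {"tu"},
         "good": {"bonne"}, "night": {"nuit"}}
path = dtw_align(english, french, en_fr)
```

Standard DTW permits one-to-many matches, so a subtitle split across two lines in one language can still map to a single line in the other.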

Mon-Ses3-S1:
Special Session: Silent Speech Interfaces

Time:Monday 16:00 Place:East Wing 4 Type:Special
Chair:Bruce Denby & Tanja Schultz

#0Visuo-Phonetic Decoding using Multi-Stream and Context-Dependent Models for an Ultrasound-based Silent Speech Interface

Thomas Hueber (ESPCI/Telecom ParisTech)
Elie-Laurent Benaroya (ESPCI ParisTech)
Gérard Chollet (LTCI/CNRS Telecom ParisTech)
Bruce Denby (UPMC Paris VI - ESPCI ParisTech)
Gérard Dreyfus (Laboratoire d'Electronique - ESPCI ParisTech)
Maureen Stone (University of Maryland Dental School)

Recent improvements are presented for phonetic decoding of continuous-speech from ultrasound and optical observations of the tongue and lips in a silent speech interface application. In a new approach to this critical step, the visual streams are modeled by context-dependent multi-stream Hidden Markov Models (CD-MSHMM). Results are compared to a baseline system using context-independent modeling and a visual feature fusion strategy, with both systems evaluated on a one-hour, phonetically balanced English speech database. Tongue and lip images are coded using PCA-based feature extraction techniques. The uttered speech signal, also recorded, is used to initialize the training of the visual HMMs. Visual phonetic decoding performance is evaluated successively with and without the help of linguistic constraints introduced via a 2.5k-word decoding dictionary.

#0Disordered Speech Recognition Using Acoustic and sEMG Signals

Yunbin Deng (BAE Systems, Inc, Advanced Information Technologies)
Rupal Patel (Communication Analysis & Design Lab, Northeastern University)
James T. Heaton (Center for Laryngeal Surgery & Voice Rehabilitation, Mass. General Hospital)
Glen Colby (BAE Systems, Inc, Advanced Information Technologies)
L. Donald Gilmore (Delsys, Inc.)
Joao Cabrera (BAE Systems, Inc, Advanced Information Technologies)
Serge H. Roy (Delsys, Inc.)
Carlo J. De Luca (Delsys, Inc.)
Geoffrey S. Meltzner (BAE Systems, Inc, Advanced Information Technologies)

Parallel isolated word corpora were collected from healthy speakers and individuals with speech impairment due to stroke or cerebral palsy. Surface electromyographic (sEMG) signals were collected for both vocalized and mouthed speech production modes. Pioneering work on disordered speech recognition using the acoustic signal, the sEMG signals, and their fusion is reported. Results indicate that speaker-dependent isolated-word recognition from the sEMG signals of articulator muscle groups during vocalized disordered-speech production was highly effective. However, word recognition accuracy for mouthed speech was much lower, likely related to the fact that some disordered speakers had considerable difficulty producing consistent mouthed speech. Further development of the sEMG-based speech recognition systems is needed to increase usability and robustness.

#0Multimodal HMM-based NAM-to-speech conversion

Viet-Anh TRAN (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Gérard BAILLY (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Hélène LOEVENBRUCK (GIPSA-Lab, Département Parole & Cognition, UMR n°5216 CNRS/INPG/UJF/U. Stendhal, France)
Tomoki TODA (NAIST (NAra Institute of Science and Technology), Japan)

Although the segmental intelligibility of speech converted from silent speech using the direct signal-to-signal mapping proposed by Toda et al. is quite acceptable, listeners sometimes have difficulty in chunking the speech continuum into meaningful words due to incomplete phonetic cues provided by output signals. This paper studies another approach that combines HMM-based statistical speech recognition and synthesis techniques, as well as training on aligned corpora, to convert silent speech to audible voice.

#0Technologies for Processing Body-Conducted Speech Detected with a Non-Audible Murmur Microphone

Tomoki Toda (Nara Institute of Science and Technology)
Keigo Nakamura (Nara Institute of Science and Technology)
Takayuki Nagai (Nara Institute of Science and Technology)
Tomomi Kaino (Nara Institute of Science and Technology)
Yoshitaka Nakajima (Nara Institute of Science and Technology)
Kiyohiro Shikano (Nara Institute of Science and Technology)

In this paper, we review our recent research on technologies for processing body-conducted speech detected with a Non-Audible Murmur (NAM) microphone. The NAM microphone enables us to detect various types of body-conducted speech such as extremely soft whisper, normal speech, and so on. Moreover, it is robust against external noise due to its noise-proof structure. To make speech communication more universal by effectively using these properties of the NAM microphone, we have so far developed two main technologies: one is body-conducted speech conversion for human-to-human speech communication; and the other is body-conducted speech recognition for man-machine speech communication. This paper gives an overview of these technologies and presents our new attempts to investigate the effectiveness of body-conducted speech recognition.

#0Impact of Different Speaking Modes on EMG-based Speech Recognition

Michael Wand (Cognitive Systems Lab, University of Karlsruhe, Germany)
Szu-Chen Stan Jou (ATC, ICL, Industrial Technology Research Institute, Taiwan)
Arthur R. Toth (Cognitive Systems Lab, University of Karlsruhe, Germany)
Tanja Schultz (Cognitive Systems Lab, University of Karlsruhe, Germany)

We present our recent results on speech recognition by surface electromyography (EMG), which captures the electric potentials that are generated by the human articulatory muscles. This technique can be used to enable Silent Speech Interfaces, since EMG signals are generated even when people only articulate speech without producing any sound. Preliminary experiments have shown that the EMG signals created by audible and silent speech are quite distinct. In this paper we first compare various methods of initializing a silent speech EMG recognizer, showing that the performance of the recognizer substantially varies across different speakers. Based on this, we analyze EMG signals from audible and silent speech, present first results on how discrepancies between these speaking modes affect EMG recognizers, and suggest areas for future work.

#0Artificial speech synthesizer control by brain-computer interface

Jonathan S. Brumberg (Boston University; Neural Signals, Inc.)
Philip R. Kennedy (Neural Signals, Inc.)
Frank H. Guenther (Boston University; Harvard University; MIT)

We developed and tested a brain-computer interface for control of an artificial speech synthesizer by an individual with near complete paralysis. This neural prosthesis for speech restoration is currently capable of predicting vowel formant frequencies based on neural activity recorded from an intracortical microelectrode implanted in the left hemisphere speech motor cortex. Using instantaneous auditory feedback (< 50 ms) of predicted formant frequencies, the study participant has been able to correctly perform a vowel production task at a maximum rate of 80-90% correct.

#0Synthesizing Speech from Electromyography using Voice Transformation Techniques

Arthur R. Toth (University of Karlsruhe)
Michael Wand (University of Karlsruhe)
Tanja Schultz (University of Karlsruhe)

Surface electromyography (EMG) can be used to record the activation potentials of articulatory muscles while a person speaks. It could enable silent speech interfaces, as EMG signals are generated even when people pantomime speech noiselessly. Having effective silent speech interfaces would enable a number of compelling applications, allowing people to communicate in areas where they would not want to be overheard or could not be heard. In order to use EMG signals in speech interfaces, however, there must be a relatively accurate method to map the signals to speech. Most previous attempts to use EMG signals for speech interfaces appear to focus on Automatic Speech Recognition (ASR) based on features derived from EMG signals. We explore the alternative idea of using Voice Transformation (VT) techniques to synthesize speech from EMG signals. We report the results of our preliminary studies, noting the difficulties we encountered and suggesting future work.

16:00Characterizing Silent and Pseudo-Silent Speech using Radar-like Sensors

John Holzrichter (Hertz Foundation)

Radar-like sensors enable the measuring of speech articulator conditions, especially their shape changes and contact events both during silent and normal speech. Such information can be used to associate articulator conditions with digital “codes” for use in communications, machine control, speech masking or canceling, and other applications.

Mon-Ses3-P3:
Automatic Speech Recognition: Adaptation I

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair:Stephen Cox

#0On the Development of Matched and Mismatched Italian Children’s Speech Recognition Systems

Piero Cosi (ISTC-CNR (Istituto di Scienze e Tecnologie della Cognizione - Consiglio Nazionale delle Ricerche))

While at least read speech corpora are available for Italian children’s speech research, there exist many languages in which this is not the case. Learning statistical mappings between the adult and child acoustic space using existing adult/children corpora may provide a future direction for generating children’s models for such data deficient languages. In this work the recent advances in the development of the SONIC Italian children’s speech recognition system will be described. Specifically, the complete training and test set of the FBK (ex ITC-irst) Italian Children’s Speech Corpus (ChildIt) was considered. Using the University of Colorado SONIC LVCSR system, we demonstrate a phonetic recognition error rate of 12.0% for a system which incorporates Vocal Tract Length Normalization (VTLN), Speaker-Adaptive Trained phonetic models, as well as unsupervised Structural MAP Linear Regression (SMAPLR).

#0Speaker Adaptation Based on Two-Step Active Learning

Koichi Shinoda (Tokyo Institute of Technology)
Hiroko Murakami (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

We propose a two-step active learning method for supervised speaker adaptation. In the first step, the initial adaptation data is collected to obtain a phone error distribution. In the second step, those sentences whose phone distributions are close to the error distribution are selected, and their utterances are collected as the additional adaptation data. We evaluated the method using a Japanese speech database and maximum likelihood linear regression (MLLR) as the speaker adaptation algorithm. We confirmed that our method achieved a significant improvement over a method using randomly chosen sentences for adaptation.
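
The second step described above can be illustrated with a minimal sketch. It assumes that "close" is measured by the Kullback-Leibler divergence between smoothed phone frequency distributions; the abstract does not name the distance measure, and all function names and the toy data are hypothetical.

```python
import math
from collections import Counter

def phone_distribution(phones, inventory, smoothing=1e-3):
    """Smoothed relative frequency of each phone in the inventory."""
    counts = Counter(phones)
    total = len(phones) + smoothing * len(inventory)
    return {p: (counts[p] + smoothing) / total for p in inventory}

def kl_divergence(p, q):
    """KL(p || q) for two distributions over the same phone inventory."""
    return sum(p[k] * math.log(p[k] / q[k]) for k in p)

def select_sentences(error_phones, candidates, inventory, n_select):
    """Rank candidate sentences by closeness to the phone error distribution.

    error_phones: phones misrecognized on the initial adaptation data (step 1).
    candidates:   mapping sentence id -> phone sequence of that sentence.
    The divergence is an assumed closeness measure, not taken from the paper.
    """
    error_dist = phone_distribution(error_phones, inventory)
    scored = []
    for sentence_id, phones in candidates.items():
        dist = phone_distribution(phones, inventory)
        scored.append((kl_divergence(error_dist, dist), sentence_id))
    scored.sort()
    return [sentence_id for _, sentence_id in scored[:n_select]]

# Toy usage with a four-phone inventory (illustrative data only).
inventory = ["a", "k", "s", "t"]
error_phones = ["s", "s", "t", "s"]
candidates = {
    "sent1": ["a", "k", "a", "k"],
    "sent2": ["s", "t", "s", "a"],
    "sent3": ["s", "s", "t", "s"],
}
selected = select_sentences(error_phones, candidates, inventory, n_select=2)
```

In this toy case the two sentences richest in the error-prone phones are selected for the additional adaptation recordings.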

#0Using VTLN matrices for Rapid and Computationally-Efficient Speaker Adaptation with Robustness to First-Pass Transcription Errors

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)
Achintya Kumar Sarkar (Indian Institute of Technology Kanpur)

In this paper we combine the rapid adaptation capability of conventional VTLN with the computational efficiency of transform-based adaptation such as CMLLR. Conventional VTLN requires very little adaptation data, unlike transform-based adaptation methods. However, conventional VTLN is computationally expensive since it requires generation of warped features. We have recently shown that VTLN can be efficiently implemented as a linear transformation with computational complexity similar to CMLLR. In this framework, VTLN provides a significant improvement in performance over transform-based adaptation when little adaptation data is available. We also show that the use of MLLT along with VTLN gives performance that is better than MLLR and comparable to SAT with MLLT even for large adaptation data. Further, we show that in mismatched conditions, VTLN provides significant improvement over transform-based adaptation. We compare the performance of different methods on WSJ, RM and TIDIGITS tasks.

#0Acoustic Class Specific VTLN-Warping using Regression Class Trees

Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper we study the use of different frequency warp-factors for different acoustic classes. This is motivated by the fact that not all acoustic classes exhibit similar spectral variation as a result of physiological differences in the vocal tract, and therefore the use of a single frequency warp for the entire utterance may not be appropriate. We have recently proposed a VTLN method that implements VTLN-warping through a linear transformation of the conventional MFCC features and efficiently estimates the warp-factor using the same sufficient statistics that are used in CMLLR adaptation. In this paper, we show that, within this efficient VTLN framework and using the idea of a regression class tree, it is possible to obtain separate frequency warping for different acoustic classes. On the WSJ database we report the recognition performance of the proposed method for data-driven and phonetic-knowledge-based regression class trees.

#0Bilinear Transformation Space-based Maximum Likelihood Linear Regression

Hwa Jeon Song (School of Electrical Engineering, Pusan National University)
Yongwon Jeong (School of Electrical Engineering, Pusan National University)
Hyung Soon Kim (School of Electrical Engineering, Pusan National University)

This paper proposes two types of bilinear transformation space-based speaker adaptation frameworks. In the training session, transformation matrices for speakers are decomposed by a singular value decomposition-based algorithm into a style factor for speakers’ characteristics and an orthonormal basis of eigenvectors that controls the dimensionality of the canonical model. In the adaptation session, the style factor of a new speaker is estimated, depending on which of the proposed frameworks is used. At the same time, the dimensionality of the canonical model can be reduced by the orthonormal basis from training. Moreover, both maximum likelihood linear regression (MLLR) and eigenspace-based MLLR are identified as special cases of our proposed methods. Experimental results show that the proposed methods are much more effective and versatile than other methods.

#0Speaking Style Adaptation for Spontaneous Speech Recognition Using Multiple-Regression HMM

Yusuke Ijima (Tokyo Institute of Technology)
Takeshi Matsubara (Tokyo Institute of Technology)
Takashi Nose (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper describes a rapid model adaptation technique for spontaneous speech recognition. The proposed technique utilizes a multiple-regression hidden Markov model (MRHMM) and is based on a style estimation technique of speech. In the MRHMM, the mean vector of probability density function (pdf) is given by a function of a low-dimensional vector, called style vector, which corresponds to the intensity of expressivity of speaking style variation. The value of the style vector is estimated for every utterance of the input speech and the model adaptation is conducted by calculating new mean vectors of the pdf using the estimated style vector. The performance evaluation results using “Corpus of spontaneous Japanese (CSJ)” are shown under a condition in which the amount of model training and adaptation data is very small.

#0Improving the robustness by multiple sets of HMMs

Hans-Guenter Hirsch (Niederrhein University of Applied Sciences)
Andreas Kitzig (Niederrhein University of Applied Sciences)

The highest recognition performance is still achieved when training a recognition system with speech data that have been recorded in the acoustic scenario where the system will be applied. We investigated the approach of using several sets of HMMs. These sets have been trained on data that were recorded in different typical noise situations. One HMM set is individually selected at each speech input by comparing the pause segment at the beginning of the utterance with the pause models of all sets. We observed a considerable reduction of the error rates when applying this approach in comparison to two well known techniques for improving the robustness. Furthermore, we developed a technique to additionally adapt certain parameters of the selected HMMs to the specific noise condition. This leads to a further improvement of the recognition rates.
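
The per-utterance model selection described above could look roughly like the following sketch. It assumes each HMM set's pause model is reduced to a single diagonal-covariance Gaussian and that the comparison is an average log-likelihood over the leading pause frames; both are simplifying assumptions, and the names `select_hmm_set` and `pause_models` are hypothetical.

```python
import numpy as np

def diag_gaussian_loglik(frames, mean, var):
    """Average per-frame log-likelihood under a diagonal-covariance Gaussian."""
    frames = np.atleast_2d(frames)
    diff = frames - mean
    per_frame = -0.5 * (
        np.sum(np.log(2.0 * np.pi * var)) + np.sum(diff * diff / var, axis=1)
    )
    return float(np.mean(per_frame))

def select_hmm_set(pause_frames, pause_models):
    """Pick the HMM set whose pause model best fits the utterance-initial pause.

    pause_frames: array of shape (n_frames, dim) taken from the leading pause.
    pause_models: mapping set name -> {"mean": ..., "var": ...} (assumed form).
    """
    scores = {
        name: diag_gaussian_loglik(pause_frames, model["mean"], model["var"])
        for name, model in pause_models.items()
    }
    return max(scores, key=scores.get)

# Toy usage: two noise conditions with different pause statistics.
pause_models = {
    "car": {"mean": np.array([5.0, 5.0]), "var": np.array([1.0, 1.0])},
    "babble": {"mean": np.array([0.0, 0.0]), "var": np.array([1.0, 1.0])},
}
leading_pause = np.array([[4.8, 5.1], [5.2, 4.9]])
chosen = select_hmm_set(leading_pause, pause_models)
```

A real system would score the pause segment against full pause HMMs rather than a single Gaussian; the selection logic stays the same.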

#0On the Use of Pitch Normalization for Improving Children's Speech Recognition

Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)

In this work, we have studied the effect of pitch variations across speech signals in the context of automatic speech recognition. Our initial study on vowel data indicates that, on account of insufficient smoothing of pitch harmonics by the filterbank, particularly for high-pitch signals, the variances of the mel frequency cepstral coefficient (MFCC) features increase significantly with increasing pitch of the speech signals. Further, to reduce the variance of the MFCC features due to varying pitch among speakers, a maximum likelihood based explicit pitch normalization method has been explored. On a connected digit recognition task, pitch normalization yields a relative improvement of 15% over the baseline for children's speech (higher pitch) on models trained on adults' speech (lower pitch).

#0Speaker normalization for template based speech recognition

Sébastien Demange (Katholieke Universiteit Leuven ESAT/PSI)
Dirk Van Compernolle (Katholieke Universiteit Leuven ESAT/PSI)

Vocal Tract Length Normalization (VTLN) has been shown to be an efficient speaker normalization tool for HMM based systems. In this paper we show that it is equally efficient for a template based recognition system. Template based systems, while promising, have as potential drawback that templates maintain all non phonetic details apart from the essential phonemic properties; i.e. they retain information on speaker and acoustic recording circumstances. This may lead to a very inefficient usage of the database. We show that after VTLN significantly more speakers - also from opposite gender - contribute templates to the matching sequence compared to the non-normalized case. In experiments on the Wall Street Journal database this leads to a relative word error rate reduction of 10%.

#0Combination of Acoustic and Lexical Speaker Adaptation for Disordered Speech Recognition

Oscar Saz (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)
Antonio Miguel (University of Zaragoza)

This paper presents an approach to providing lexical adaptation in Automatic Speech Recognition (ASR) of the disordered speech from a group of young impaired speakers. The outcome of an Acoustic Phonetic Decoder (APD) is used to learn new lexical variants of the 57-word vocabulary and add them to a lexicon personalized to each user. The possibilities of combining this lexical adaptation with acoustic adaptation achieved through traditional Maximum A Posteriori (MAP) approaches are further explored, and the results show the importance of matching the lexicon in the ASR decoding phase to the lexicon used for the acoustic adaptation.

#3Tree-based Estimation of Speaker Characteristics for Speech Recognition

Mats Blomberg (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)
Daniel Elenius (Dept. of Speech, Music and Hearing, KTH/CSC, Stockholm, Sweden)

A hierarchical tree is designed to reduce the computationally heavy demands of joint multi-dimensional estimation of speaker characteristic properties in speech recognition. The leaf model sets are created by transforming a conventionally trained set. Non-leaf sets are formed by merging the models of their child nodes. One- (VTLN) and four-dimensional speaker profile vectors (VTLN, two spectral slope parameters and model variance scaling) reduce the computational load to a fraction compared to that of an exhaustive search. In recognition experiments on children's connected digits using adult and male models, the one-dimensional tree search performed as well as the exhaustive search. Further reduction was achieved with four dimensions. The best recognition results are 0.93% and 10.2% WER in TIDIGITS and PF-Star-Sw, respectively, using adult models.

#5A Study on the Influence of Covariance Adaptation on Jacobian Compensation in Vocal Tract Length Normalization

Rama Sanand Doddipatla (Indian Institute of Technology Kanpur)
Shakti Prasad Rath (Indian Institute of Technology Kanpur)
Srinivasan Umesh (Indian Institute of Technology Kanpur)

In this paper, we first show that accounting for the Jacobian in VTLN degrades the performance in mismatched train and test speaker conditions. VTLN is implemented using our recently proposed approach of linear transformation (LT) of conventional MFCC, i.e., a feature transformation. In this case, the Jacobian is simply the determinant of the LT. Feature transformation is equivalent to the means and covariances of the model being transformed by the inverse transformation while leaving the data unchanged. Using a set of adaptation experiments, we analyze the reasons for the degradation during Jacobian compensation and conclude that applying the same VTLN transformation on both means and variances does not fully match the data when there is a mismatch in the speaker conditions. We propose to use covariance adaptation on top of VTLN to account for the covariance mismatch between the train and the test speakers and show that accounting for the Jacobian after covariance adaptation improves the performance.
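
The equivalence stated above between a feature transformation with Jacobian compensation and an inverse transformation of the model can be checked numerically. The sketch below assumes a single full-covariance Gaussian and an arbitrary invertible matrix `A` standing in for the VTLN linear transformation; the matrix values and function names are illustrative only.

```python
import numpy as np

def gaussian_logpdf(x, mean, cov):
    """Log density of a multivariate Gaussian at the vector x."""
    d = x.shape[0]
    diff = x - mean
    _, logdet = np.linalg.slogdet(cov)
    quad = diff @ np.linalg.solve(cov, diff)
    return -0.5 * (d * np.log(2.0 * np.pi) + logdet + quad)

def feature_space_loglik(x, A, mean, cov, use_jacobian=True):
    """Score the transformed feature A x, optionally adding log|det A|."""
    ll = gaussian_logpdf(A @ x, mean, cov)
    if use_jacobian:
        _, log_abs_det = np.linalg.slogdet(A)
        ll += log_abs_det
    return ll

def model_space_loglik(x, A, mean, cov):
    """Score the unchanged feature x against the inversely transformed model."""
    A_inv = np.linalg.inv(A)
    return gaussian_logpdf(x, A_inv @ mean, A_inv @ cov @ A_inv.T)

# Toy values (assumed, not from the paper).
A = np.array([[1.1, 0.05], [0.0, 0.9]])
mean = np.array([0.3, -0.2])
cov = np.array([[1.0, 0.2], [0.2, 0.5]])
x = np.array([0.4, 0.1])

with_jacobian = feature_space_loglik(x, A, mean, cov, use_jacobian=True)
without_jacobian = feature_space_loglik(x, A, mean, cov, use_jacobian=False)
model_side = model_space_loglik(x, A, mean, cov)
```

Here `with_jacobian` matches `model_side` up to floating-point error, while `without_jacobian` differs from both by log|det A|, which is the term whose inclusion the abstract examines.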

Mon-Ses3-P2:
Prosody, Text Analysis, and Multilingual Models

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair:Andrew Breen

#1Polyglot Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

Within a polyglot text-to-speech synthesis system, the generation of an adequate prosody for mixed-lingual texts, sentences, or even words, requires a polyglot prosody model that is able to seamlessly switch between languages and that applies the same voice for all languages. This paper presents the first polyglot prosody model that fulfills these requirements and that is constructed from independent monolingual prosody models. A perceptual evaluation showed that the synthetic polyglot prosody of about 82% of German and French mixed-lingual test sentences cannot be distinguished from natural polyglot prosody.

#2Weighted Neural Network Ensemble Models for Speech Prosody Control

Harald Romsdorfer (Speech Processing Group, ETH Zurich, Switzerland)

In text-to-speech synthesis systems, the quality of the predicted prosody contours influences the quality and naturalness of synthetic speech. This paper presents a new statistical model for prosody control that combines an ensemble learning technique using neural networks as base learners with feature relevance determination. This weighted neural network ensemble model was applied to both phone duration modeling and fundamental frequency modeling. A comparison with state-of-the-art prosody models based on classification and regression trees (CART), multivariate adaptive regression splines (MARS), or artificial neural networks (ANN) shows a 12% improvement compared to the best duration model and a 24% improvement compared to the best F0 model. The neural network ensemble model also outperforms another recently presented ensemble model based on gradient tree boosting.

#3Cross-language F0 Modeling for Under-resourced Tonal Languages: A Case Study on Thai-Mandarin

Vataya Boonpiam (National Electronics and Computer Technology Center)
Anocha Rugchatjaroen (National Electronics and Computer Technology Center)
Chai Wutiwiwatchai (National Electronics and Computer Technology Center)

This paper proposes a novel method for F0 modeling in under-resourced tonal languages. Conventional statistical models require large amounts of training data, which are lacking in many languages. In tonal languages, different syllabic tones are represented by different F0 shapes, some of which are similar across languages. With cross-language F0 contour mapping, we can augment the F0 model of an under-resourced language with corpora from a resource-rich language. A case study on Thai HMM-based F0 modeling with a Mandarin corpus is explored. Compared to baseline systems without cross-language resources, over 7% relative reduction of RMSE and a significant improvement of MOS are obtained.

#4Prosodic issues in synthesising Thadou, a Tibeto-Burman tone language

Dafydd Gibbon (Universität Bielefeld, Bielefeld, Germany)
Pramod K. S. Pandey (Jawaharlal Nehru University, New Delhi, India)
D. Mary Kim Haokip (Assam University, Silchar, India)
Jolanta Bachan (Adam Mickiewicz University, Poznań, Poland)

The objective of the present analysis is to present linguistic constraints on the phonetic realisation of lexical tone which are relevant for the choice of speech synthesis development strategy for a specific type of tone language, in this case Thadou (Tibeto-Burman), which has lexical and morphosyntactic tone as well as phonetic tone displacement. The last two constraint types differ from those in more well-known tone languages such as Mandarin, and present problems for mainstream corpus-based speech synthesis techniques. Linguistic and phonetic models and a ‘microvoice’ for rule-based tone generation are developed.

#5Advanced Unsupervised Joint Prosody Labeling and Modeling for Mandarin Speech and Its Application to Prosody Generation for TTS

Chen-Yu Chiang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Sin-Horng Chen (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Yih-Ru Wang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)

Motivated by the success of the unsupervised joint prosody labeling and modeling (UJPLM) method for Mandarin speech on modeling of syllable pitch contour in our previous study, in this paper, the advanced UJPLM (A-UJPLM) method is proposed based on UJPLM to jointly label prosodic tags and model syllable pitch contour, duration and energy level. Experimental results on the Sinica Treebank corpus showed that most prosodic tags labeled were linguistically meaningful and that the model parameters estimated were interpretable and generally agreed with previous studies. By virtue of the functions given by the model parameters, an application of A-UJPLM to prosody generation for Mandarin TTS is proposed. Experimental results showed that the proposed method performed well. Most predicted prosodic features matched their original counterparts well. This also reconfirmed the effectiveness of the A-UJPLM method.

#6Optimization of T-Tilt F0 Modeling

Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC))
Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC))
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC))
Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC))
Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC))

This paper investigates the improvement of T-Tilt modeling, a modified Tilt model specifically designed for F0 modeling in tonal languages. The model has proved to work well for F0 analysis but performs poorly in text-to-F0 prediction. To optimize it, the T-Tilt event is restricted to span the whole syllable unit, which helps reduce the number of parameters significantly. F0 interpolation and smoothing processes often performed in preprocessing are avoided to prevent modeling errors. F0 shape pre-classification and parameter clustering are introduced for better modeling. Evaluation results using the optimized model show significant improvement for both F0 analysis and prediction.

#7A Multi-Level Context-Dependent Prosodic Model Applied to Duration Modeling

Nicolas OBIN (IRCAM)
Xavier RODET (IRCAM)
Anne LACHERET-DUJOUR (Modyco labs)

We present in this article a multi-level prosodic model based on the estimation of prosodic parameters on a set of well defined linguistic units. Different linguistic units are used to represent different scales of prosodic variations (local and global forms) and thus to estimate the linguistic factors that can explain the variations of prosodic parameters independently on each level. This model is applied to the modeling of syllable-based durational parameters on two read speech corpora - laboratory and acted speech. Compared to a syllable-based baseline model, the proposed approach improves performance in terms of the temporal organization of the predicted durations (correlation score) and reduces the model's complexity, while showing comparable performance in terms of relative prediction error.

#8Sentiment classification in English from sentence-level annotations of emotions regarding models of affect

Alexandre Trilla (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)
Francesc Alías (GTM - Grup de Recerca en Tecnologies Mèdia LA SALLE - UNIVERSITAT RAMON LLULL)

This paper presents a text classifier for automatically tagging the sentiment of input text according to the emotion that is being conveyed. This system has a pipelined framework composed of Natural Language Processing modules for feature extraction and a hard binary classifier for decision making between positive and negative categories. To do so, the Semeval 2007 dataset, composed of emotionally annotated sentences, is used for training purposes after being mapped into a model of affect. The resulting scheme stands as a first step towards a complete emotion classifier for a future automatic expressive text-to-speech synthesizer.

#9Identification of Contrast and Its Emphatic Realization in HMM Based Speech Synthesis

Leonardo Badino (University of Edinburgh, Edinburgh, U.K.)
Sebastian Andersson (University of Edinburgh, Edinburgh, U.K.)
Junichi Yamagishi (University of Edinburgh, Edinburgh, U.K.)
Robert Clark (University of Edinburgh, Edinburgh, U.K.)

The work presented in this paper proposes to identify contrast in the form of contrastive word pairs and prosodically signal it with emphatic accents in a Text-to-Speech (TTS) application using a Hidden-Markov-Model (HMM) based speech synthesis system. We first describe a novel method to automatically detect contrastive word pairs using textual features only and report its performance on a corpus of spontaneous conversations in English. Subsequently we describe the set of features selected to train an HMM based speech synthesis system in an attempt to properly control prosodic prominence (including emphasis). Results from a large scale perceptual test show that in the majority of cases listeners judge emphatic contrastive word pairs to be as acceptable as their non-emphatic counterparts, while emphasis on non-contrastive pairs is almost never acceptable.

#10How to Improve TTS Systems for Emotional Expressivity

Antonio Rui Ferreira Rebordao (The University of Tokyo)
Mostafa Al Masum Shaikh (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)

Several experiments have been carried out that revealed weaknesses of current Text-To-Speech (TTS) systems in their emotional expressivity. Although some TTS systems allow XML-based representations of prosodic and/or phonetic variables, few publications have considered, as a pre-processing stage, the use of intelligent text processing to detect affective information that can be used to tailor the parameters needed for emotional expressivity. This paper describes a technique for automatic prosodic parameterization based on affective clues. This technique recognizes the affective information conveyed in a text and, according to its emotional connotation, assigns appropriate pitch accents and other prosodic parameters by XML-tagging. This pre-processing helps the TTS system to generate synthesized speech that contains emotional clues. The experimental results are encouraging and suggest the possibility of suitable emotional expressivity in speech synthesis.

#11State mapping based method for cross-lingual speaker adaptation in HMM-based speech synthesis

Yi-Jian Wu (Microsoft)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

A phone mapping-based method has previously been introduced for cross-lingual speaker adaptation in HMM-based speech synthesis. In this paper, we go on to propose a state mapping based method for cross-lingual speaker adaptation. In this method, we first establish the state mapping between two voice models in source and target languages using Kullback-Leibler divergence (KLD). Based on the established mapping information, we introduce two approaches to conduct cross-lingual speaker adaptation: data mapping and transform mapping. In the experiments, the state mapping based method outperformed the phone mapping based method. In addition, the data mapping approach achieved better speaker similarity, and the transform mapping approach achieved better speech quality after adaptation.
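
A minimal sketch of a KLD-based state mapping is given below. It assumes each HMM state is summarized by one diagonal-covariance Gaussian, uses a symmetrized divergence, and maps every state of one voice model to its nearest state in the other; these choices, the mapping direction, and all names are assumptions for illustration rather than details taken from the paper.

```python
import numpy as np

def kld_diag(mean_p, var_p, mean_q, var_q):
    """KL(p || q) between two diagonal-covariance Gaussians."""
    return 0.5 * float(
        np.sum(np.log(var_q / var_p) + (var_p + (mean_p - mean_q) ** 2) / var_q - 1.0)
    )

def symmetric_kld(p, q):
    """Symmetrized divergence; p and q are (mean, var) tuples."""
    return kld_diag(p[0], p[1], q[0], q[1]) + kld_diag(q[0], q[1], p[0], p[1])

def build_state_mapping(states_a, states_b):
    """Map each state in states_a to the minimum-divergence state in states_b."""
    mapping = {}
    for name_a, gauss_a in states_a.items():
        mapping[name_a] = min(
            states_b, key=lambda name_b: symmetric_kld(gauss_a, states_b[name_b])
        )
    return mapping

# Toy usage with two states per language (illustrative values only).
states_a = {
    "a1": (np.array([0.0, 0.0]), np.array([1.0, 1.0])),
    "a2": (np.array([3.0, 3.0]), np.array([1.0, 1.0])),
}
states_b = {
    "b1": (np.array([2.9, 3.2]), np.array([1.1, 0.9])),
    "b2": (np.array([0.1, -0.1]), np.array([1.0, 1.2])),
}
mapping = build_state_mapping(states_a, states_b)
```

Once such a mapping exists, either adaptation data or adaptation transforms could be carried across languages along it, which corresponds to the two approaches named in the abstract.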

#12Real Voice and TTS Accent Effects on Intelligibility and Comprehension for Indian Speakers of English as a Second Language

Frederick V. Weber (Earth Institute, Columbia University)
Kalika Bali (Microsoft Research, India)

We investigate the effect of accent on comprehension of English for speakers of English as a second language in southern India. Subjects were exposed to real and TTS voices with US and several Indian accents, and were tested for intelligibility and comprehension. Performance trends indicate a measurable advantage for familiar accents, and are broken down by various demographic factors.

#13Improving Consistence of Phonetic Transcription for Text-to-Speech

Pablo Daniel Agüero (FI-UNMDP)
Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain)
Juan Carlos Tulli (FI-UNMDP)

Grapheme-to-phoneme conversion is an important step in speech segmentation and synthesis. Many approaches are proposed in the literature to perform appropriate transcriptions: CART, FST, HMM, etc. In this paper we propose the use of an automatic algorithm that uses the transformation-based error-driven learning to match the phonetic transcription with the speaker's dialect and style. Different transcriptions based on word, part-of-speech tags, weak forms and phonotactic rules are validated. The experimental results show an improvement in the transcription using an objective measure. The articulation MOS score is also improved, as most of the changes in phonetic transcription affect coarticulation effects.

Mon-Ses3-P1:
Human Speech Production I

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair: Shrikanth Narayanan

#1Probabilistic effects on French [t] duration

Francisco Torreira (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
Mirjam Ernestus (Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)

The present study shows that [t] consonants are affected by probabilistic factors in a syllable-timed language such as French, and in spontaneous as well as in journalistic speech. Study 1 showed a word bigram frequency effect in spontaneous French, but its exact nature depended on the corpus on which the probabilistic measures were based. Study 2 investigated journalistic speech and showed an effect of the joint frequency of the test word and its following word. We discuss the possibility that these probabilistic effects are due to the speaker's planning of upcoming words, and to the speaker's adaptation to the listener's needs.

#2On the production of sandhi phenomena in French: psycholinguistic and acoustic data

Odile Bagou (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)
Violaine Michel (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)
Marina Laganaro (Groupe NeuroPsychoLinguistique, FLSH, University of Neuchâtel, Switzerland)

This study addresses two complementary questions about the production of sandhi phenomena in French. First, we investigated whether the encoding of sandhi phenomena involves a processing cost compared to non-resyllabified sequences. The elicited sequences were then used to address our second question, namely how critical V1CV2 sequences are phonetically realized across different boundary conditions. Results on production latencies suggested that the encoding of liaison enchaînée involves an additional processing cost compared to enchaînement and non-resyllabified sequences. Moreover, acoustic analyses indicated durational differences across the three boundary conditions. Implications for both psycholinguistic and phonological models are discussed.

#3Extreme reductions: Contraction of disyllables into monosyllables in Taiwan Mandarin

Chierh Cheng (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)

This study investigates a severe form of segmental reduction known as contraction. In Taiwan Mandarin, a disyllabic word or phrase is often contracted into a monosyllabic unit in conversational speech, just as “do not” is often contracted into “don’t” in English. A systematic experiment was conducted to explore the underlying mechanism of such contraction. Preliminary results show evidence that contraction is not a categorical shift but a gradient undershoot of the articulatory target as a result of time pressure. Moreover, contraction seems to occur only beyond a certain duration threshold. These findings may further our understanding of the relation between duration and segmental reduction.

#4Annotation and Features of Non-native Mandarin Tone Quality

Mitchell Peabody (MIT)
Stephanie Seneff (MIT)

Native speakers of non-tonal languages, such as American English, frequently have difficulty accurately producing the tones of Mandarin Chinese. This paper describes a corpus of Mandarin Chinese spoken by non-native speakers and annotated for tone quality using a simple Good-Bad system. We examine inter-rater correlation of the annotations and highlight the differences in feature distribution between native, good non-native, and bad non-native tone productions. We find that the features of tones judged by a simple majority to be bad are significantly different from features from tones judged to be good, and tones produced by native speakers.

#5On-line Formant Shifting as a Function of F0

Kateřina Chládková (Amsterdam Center for Language and Communication, University of Amsterdam, The Netherlands)
Paul Boersma (Amsterdam Center for Language and Communication, University of Amsterdam, The Netherlands)
Václav Jonáš Podlipský (Department of English and American Studies, Palacký University Olomouc, Czech Republic)

We investigate whether there is a within-speaker effect of a higher F0 on the values of the first and the second formant. When asked to speak at a high F0, speakers turn out to raise their formants as well. In the F1 dimension this effect is greater for women than for men. We conclude that while a general formant raising effect might be due to the physiology of a high F0 (i.e. raised larynx and shorter vocal tract), a plausible explanation for the gender-dependent size of the effect on F1 values can only be found in the undersampling hypothesis.

#6Production Boundary between Fricative and Affricate in Japanese and Korean Speakers

Kimiko Yamakawa (National Institute of Informatics)
Shigeaki Amano (NTT Communications Science Laboratories)
Shuichi Itahashi (National Institute of Informatics)

A fricative [s] and an affricate [ts] pronounced by both native Japanese and Korean speakers were analyzed to clarify the effect of the mother language on speech production. It was revealed that Japanese speakers have a clear individual production boundary between [s] and [ts], and that this boundary corresponds to the production boundary of all Japanese speakers. In contrast, although Korean speakers tend to have a clear individual production boundary, the boundary does not correspond to that of Japanese speakers. These facts suggest that Korean speakers tend to have a stable [s]-[ts] production boundary but that it differs from that of Japanese speakers.

#7Aerodynamics of Fricative Production in European Portuguese

Cátia M. R. Pinho (IEETA, Universidade de Aveiro, Portugal)
Luis M. T. Jesus (IEETA and ESSUA, Universidade de Aveiro, Portugal)
Anna Barney (ISVR, University of Southampton, UK)

The characteristics of steady state fricative production, and those of the phone preceding and following the fricative, were investigated. Aerodynamic and electroglottographic (EGG) recordings of four normal adult speakers (two females and two males), producing a speech corpus of 9 isolated words with the European Portuguese (EP) voiced fricatives /v, z, Z/ in initial, medial and final word position, and the same 9 words embedded in 42 different real EP carrier sentences, were analysed. Multimodal data allowed the characterisation of fricatives in terms of their voicing mechanisms, based on the amplitude of oral flow, F1 excitation and fundamental frequency (F0).

#8Contextual effects on protrusion and lip opening for /i,y/

Anne Bonneau (LORIA/CNRS)
Julie Busset (LORIA/ UMR 7503)
Brigitte Wrobel-Dautcourt (LORIA/UMR7503)

This study investigates the effect of “adverse” contexts, especially that of the consonant /S/, on labial parameters for French /i,y/. Five parameters were analysed: the height, width and area of lip opening, the distance between the corners of the mouth, as well as lip protrusion. Ten speakers uttered a corpus made up of isolated vowels, syllables and logatoms. A special procedure has been designed to evaluate lip opening contours. Results showed that the carry-over effect of the consonant /S/ can impede the opposition between /i/ and /y/ in the protrusion dimension, depending upon speakers.

#9Speech Rate Effects on European Portuguese Nasal Vowels

Catarina Oliveira (University of Aveiro)
Paula Martins (Health School, University of Aveiro)
António Teixeira (DETI/IEETA, University of Aveiro)

This paper presents new temporal information regarding the production of European Portuguese (EP) nasal vowels, based on new EMMA data. The influence of speech rate on the duration of velum gestures and on their coordination with consonantal and glottal gestures was analyzed. As information on the relative speed of articulators is scarce, the stiffness parameter for the nasal gestures was also calculated and analyzed. Results show clear effects of speech rate on the temporal characteristics of EP nasal vowels: a faster speech rate reduces the duration of velum gestures and increases both stiffness and inter-gestural overlap.

#10Relation of formants and subglottal resonances in Hungarian vowels

Tamás Gábor Csapó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary)
Zsuzsanna Bárkányi (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)
Tekla Etelka Gráczi (Research Institute for Linguistics, Hungarian Academy of Sciences, Budapest, Hungary)
Tamás Bőhm (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics, Budapest, Hungary; Institute for Psychology, Hungarian Academy of Sciences, Budapest, Hungary)
Steven M. Lulich (Speech Communication Group, MIT, Cambridge, MA 02139)

The relation between vowel formants and subglottal resonances (SGRs) has previously been explored in English, German, and Korean. Results from these studies indicate that vowel classes are categorically separated by SGRs. We extended this work to Hungarian vowels, which have not been related to SGRs before. The Hungarian vowel system contains paired long and short vowels as well as a series of front rounded vowels, similar to German but more complex than English and Korean. Results indicate that SGRs separate vowel classes in Hungarian as in English, German, and Korean, and uncover additional patterns of vowel formants relative to the third subglottal resonance (Sg3). These results have implications for understanding phonological distinctive features, and applications in automatic speech technologies.

Mon-Ses3-P4:
Applications in learning and other areas

Time:Monday 16:00 Place:Hewison Hall Type:Poster
Chair: Nestor Becerra Yoma

#1Designing spoken tutorial dialogue with children to elicit predictable but educationally valuable responses

Gregory Aist (Carnegie Mellon University)
Jack Mostow (Carnegie Mellon University)

How can we construct spoken dialogue interactions with children that are educationally effective and technically feasible? To address this challenge, we propose a design principle that constructs short dialogues in which (a) the user's utterances are the external evidence of task performance or learning in the domain, and (b) the target utterances can be expressed as a well-defined set, in some cases even as a finite language (up to a small set of variables which may change from exercise to exercise). The key approach is to teach the human learner a parameterized process that maps input to response. We describe how the discovery of this design principle came out of analyzing the processes of automated tutoring for reading and pronunciation and designing dialogues to address vocabulary and comprehension, show how it also accurately describes the design of several other language tutoring interactions, and discuss how it could extend to non-language tutoring tasks.

#2Optimizing non-native speech recognition for CALL applications

Joost van Doremalen (Centre for Language and Speech Technology, Radboud University Nijmegen)
Helmer Strik (Centre for Language and Speech Technology, Radboud University Nijmegen)
Catia Cucchiarini (Centre for Language and Speech Technology, Radboud University Nijmegen)

We are developing a Computer Assisted Language Learning (CALL) system that uses Automatic Speech Recognition (ASR) to give feedback on grammar and pronunciation. However, good quality unconstrained non-native ASR is not yet feasible. Therefore, we use an approach in which we try to elicit constrained responses. The task in the current experiments is to select utterances from a list of responses. The results of our experiments show that significant improvements can be obtained by optimizing the language model and acoustic models. In this way we could reduce the utterance error rate from 29-26% to 10-8%.

#3Evaluation of English Intonation based on Combination of Multiple Evaluation Scores

Akinori Ito (Graduate School of Engineering, Tohoku University)
Tomoaki Konno (Graduate School of Engineering, Tohoku University)
Masashi Ito (Graduate School of Engineering, Tohoku University)
Shozo Makino (Graduate School of Engineering, Tohoku University)

In this paper, we propose a novel method for evaluating the intonation of an English utterance spoken by a learner, for use in intonation learning with a CALL system. The proposed method is based on an intonation evaluation method proposed by Suzuki et al., which uses “word importance factors” calculated from word clusters given by a decision tree. We extended Suzuki’s method so that multiple decision trees are used and the resulting intonation scores are combined using multiple regression. In an experiment, we obtained a correlation coefficient comparable to the correlation between human raters.
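
The combination step described above can be illustrated with a small sketch. It is not the authors' implementation: the function names, the toy scores and the use of plain least squares via the normal equations are all assumptions made for illustration. It fits an intercept and one weight per decision-tree score so that the weighted sum approximates a human rating.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_score_combiner(tree_scores, human_scores):
    """Least-squares weights [intercept, w1, w2, ...] for combining per-tree scores."""
    rows = [[1.0] + list(scores) for scores in tree_scores]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * y for r, y in zip(rows, human_scores)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def combine_scores(weights, scores):
    """Apply the fitted weights to the scores of one utterance."""
    return weights[0] + sum(w * s for w, s in zip(weights[1:], scores))


# Hypothetical data: two decision-tree scores per utterance, and human
# ratings that happen to follow 1 + 2*a + 3*b exactly.
tree_scores = [[0.2, 0.4], [0.5, 0.1], [0.9, 0.8], [0.3, 0.7]]
human_scores = [1.0 + 2.0 * a + 3.0 * b for a, b in tree_scores]
weights = fit_score_combiner(tree_scores, human_scores)
```

With real data the fit would not be exact, and the combined score would be compared with human ratings through a correlation coefficient, as in the abstract.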

#4A LANGUAGE-INDEPENDENT FEATURE SET FOR THE AUTOMATIC EVALUATION OF PROSODY

Andreas Maier (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Florian Hönig (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Viktor Zeissler (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Anton Batliner (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Erik Körner (Universität Erlangen-Nürnberg, Japanologie)
Nobuyuki Yamanaka (Universität Erlangen-Nürnberg, Japanologie)
Peter Ackermann (Universität Erlangen-Nürnberg, Japanologie)
Elmar Nöth (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)

In second language learning, the correct use of prosody plays a vital role. Therefore, an automatic method to evaluate the naturalness of the prosody of a speaker is desirable. We present a novel method to model prosody independently of the text and thus independently of the language as well. For this purpose, the voiced and unvoiced speech segments are extracted and a 187-dimensional feature vector is computed for each voiced segment. This approach is compared to word-based prosodic features on a German text passage. Both are compared with the perceptual evaluation of two native speakers of German. The word-based feature set yielded correlations of up to 0.92, while the text-independent feature set yielded 0.88. This is in the same range as the inter-rater correlation of 0.88.

#5Adapting the Acoustic Model of a Speech Recognizer for Varied Proficiency Non-Native Spontaneous Speech Using Read Speech with Language-Specific Pronunciation Difficulty

Klaus Zechner (Educational Testing Service)
Derrick Higgins (Educational Testing Service)
Rene Lawless (Educational Testing Service)
Yoko Futagi (Educational Testing Service)
Sarah Ohls (Educational Testing Service)
George Ivanov (Educational Testing Service)

This paper presents a novel approach to acoustic model adaptation of a recognizer for non-native spontaneous speech for candidates’ responses in a test of spoken English. Instead of transcribing spontaneous speech data, a read speech corpus is created where non-native speakers of English read English sentences of different degrees of pronunciation difficulty with respect to their native language. As a selection criterion we develop a novel score, the “phonetic challenge score”, consisting of a measure for native language-specific difficulties described in the second-language acquisition literature and also of a statistical measure based on the cross-entropy between phoneme sequences of the native language and English. The results of using the read speech for AM adaptation of a recognizer for spontaneous non-native speech show a significant reduction of word error rate for two of four language groups of the spontaneous speech test set as well as for the entire test set.

#6Analysis and Utilization of MLLR Speaker Adaptation Technique for Learners' Pronunciation Evaluation

Dean Luo (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Yutaka Yamauchi (Tokyo International University)
Hirose Keikichi (The University of Tokyo)

In this paper, we investigate the effects and problems of MLLR speaker adaptation when applied to pronunciation evaluation. Automatic scoring and error detection experiments are conducted on two publicly available databases of Japanese learners’ English pronunciation. As we expected, over-adaptation causes misjudgment of pronunciation accuracy. Based on the analyses, two novel methods, Forced-aligned GOP score and Regularized-MLLR adaptation, are proposed to counter the adverse effects of MLLR adaptation. Experimental results show that the proposed methods can better utilize MLLR adaptation and avoid over-adaptation.

#7Control of human generating force by use of acoustic information – Study on Onomatopoeic utterances for controlling small lifting-force

Miki Iimura (School of Engineering, Tokyo Denki University)
Taichi Sato (School of Engineering, Tokyo Denki University)
Kihachiro Tanaka (Faculty of Engineering, Saitama University)

We have conducted basic experiments on applying acoustic information to engineering problems. We asked the subjects to execute lifting actions while listening to sounds and measured the resultant lifting-force. As the sounds presented to the subjects, we used human onomatopoeic utterances intended to make their lifting-force small. In particular, we focused on the “emotion” or “nuance” contained in human utterances, which is a unique characteristic evoked by the utterance’s acoustical features. We found that the emotion or nuance can control the lifting-force effectively. We also clarified the acoustical features that are responsible for effective control of the lifting-force exerted by humans.

#8Mi-DJ: a multi-source intelligent DJ service

Ching-Hsien Lee (researcher)
Hsu-Chih Wu (researcher)

In this paper, a Multi-source intelligent DJ (Mi-DJ) service is introduced. It is an audio program platform that integrates different media types, including audio and text content. It acts like a DJ who plays a personalized audio program to users whenever and wherever they need it. The audio program is automatically generated and comprises several audio clips, all of which come from either existing audio files or text information, such as e-mail, calendar entries, news or user-preferred articles. Our program generation technology makes users feel as if they are listening to a well-organized program rather than to several separate audio files. The program can be organized dynamically, which enables a context-aware service based on location, the user's schedule, or other user preferences. With appropriate data management, text processing and speech synthesis technologies, Mi-DJ can be applied to many application scenarios, for example language learning and tour guidance.

#9Human Voice or Prompt Generation? Can they Co-exist in an Application?

Géza Németh (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Csaba Zainkó (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Mátyás Bartalis (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Gábor Olaszy (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)
Géza Kiss (Department of Telecommunications and Media Informatics, Budapest University of Technology and Economics)

This paper describes an R&D project regarding procedures for the automatic maintenance of the interactive voice response (IVR) system of a mobile telecom operator. The original plan was to create a generic voice prompt generation system for the customer service department. The challenge was to create a solution that is hard to distinguish from the human speaker (i.e. passing a sort of Turing test) so that its output can be freely mixed with original human recordings. As a first step, the domain of the solution had to be narrowed down to the price list of available mobile phones and services. This is updated weekly, so the final operational system generates about 3 hours of speech every weekend. It operates under human supervision but without intervention in the speech generation process. It was tested both by academic procedures and by company customers and was accepted as fulfilling the original requirements.

#10Automatic vs. human question answering over multimedia meeting recordings

Quoc Anh Le (University of Namur)
Andrei Popescu-Belis (Idiap Research Institute)

Information access in meeting recordings can be assisted by meeting browsers, or can be fully automated following a question-answering (QA) approach. An information access task is defined, aiming at discriminating true vs. false parallel statements about facts in meetings. An automatic QA algorithm is applied to this task, using passage retrieval over a meeting transcript. The algorithm scores 59% accuracy for passage retrieval, while random guessing is below 1%, but only scores 60% on combined retrieval and question discrimination, for which humans reach 70%-80% and the baseline is 50%. The algorithm clearly outperforms humans for speed, at less than 1 second per question, vs. 1.5-2 minutes per question for humans. The degradation on ASR compared to manual transcripts still yields lower but acceptable scores, especially for passage identification. Automatic QA thus appears to be a promising enhancement to meeting browsers used by humans, as an assistant for relevant passage identification.

Tue-Ses0-K:
Tom Griffiths - Connecting human and machine learning via probabilistic models of cognition

Time:Tuesday 08:30 Place:Main Hall Type:Keynote
Chair:Steve Renals

08:30Connecting human and machine learning via probabilistic models of cognition

Tom Griffiths (UC Berkeley)

Human performance defines the standard that machine learning systems aspire to in many areas, including learning language. This suggests that studying human cognition may be a good way to develop better learning algorithms, as well as providing basic insights into how the human mind works. However, in order for ideas to flow easily from cognitive science to computer science and vice versa, we need a common framework for describing human and machine learning. I will summarize recent work exploring the hypothesis that probabilistic models of cognition, which view learning as a form of statistical inference, provide such a framework, including results that illustrate how novel ideas from statistics can inform cognitive science. Specifically, I will talk about how probabilistic models can be used to identify the assumptions of learners, learn at different levels of abstraction, and link the inductive biases of individuals to cultural universals.

Tue-Ses1-O1:
ASR: Discriminative Training

Time:Tuesday 10:00 Place:Main Hall Type:Oral
Chair: Erik McDermott

10:00On the Semi-Supervised Learning of Multi-Layered Perceptrons

Jonathan Malkin (University of Washington)
Amarnag Subramanya (University of Washington)
Jeff Bilmes (University of Washington)

We present a novel approach for training a multi-layered perceptron (MLP) in a semi-supervised fashion. Our objective function, when optimized, balances training set accuracy with fidelity to a graph-based manifold over all points. Additionally, the objective favors smoothness via an entropy regularizer over classifier outputs as well as straightforward L2 regularization. Our approach also scales well enough to enable large-scale training. The results demonstrate significant improvement on several phone classification tasks over baseline MLPs.

10:20Generalized Discriminative Feature Transformation for Speech Recognition

Roger Hsiao (InterACT, Language Technologies Institute, Carnegie Mellon University)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University)

We propose a new algorithm called Generalized Discriminative Feature Transformation (GDFT) for acoustic models in speech recognition. GDFT is based on Lagrange relaxation on a transformed optimization problem. We show that the existing discriminative feature transformation methods like feature space MMI/MPE (fMMI/MPE), region dependent linear transformation (RDLT), and a non-discriminative feature transformation, constrained maximum likelihood linear regression (CMLLR) are special cases of GDFT. We evaluate the performance of GDFT for Iraqi large vocabulary continuous speech recognition (LVCSR).

10:40A Fast Online Algorithm for Large Margin Training of Continuous Density Hidden Markov Models

Chih-Chieh Cheng (University of California, San Diego)
Fei Sha (University of Southern California)
Lawrence Saul (University of California, San Diego)

We propose an online learning algorithm for large margin training of continuous density hidden Markov models. The online algorithm updates the model parameters incrementally after the decoding of each training utterance. For large margin training, the algorithm attempts to separate the log-likelihoods of correct and incorrect transcriptions by an amount proportional to their Hamming distance. We evaluate this approach to hidden Markov modeling on the TIMIT speech database. We find that the algorithm yields significantly lower phone error rates than other approaches--both online and batch--that do not attempt to enforce a large margin. We also find that the algorithm converges much more quickly than analogous batch optimizations for large margin training.
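
The margin criterion in this abstract can be sketched as a hinge-style loss. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function names and the `scale` parameter are hypothetical, the transcriptions are assumed to be equal-length frame label sequences, and the online parameter update itself is not shown.

```python
def hamming_distance(reference, hypothesis):
    """Number of positions where two equal-length label sequences differ."""
    if len(reference) != len(hypothesis):
        raise ValueError("sequences must have the same length")
    return sum(r != h for r, h in zip(reference, hypothesis))


def margin_violation(loglik_correct, loglik_incorrect, reference, hypothesis, scale=1.0):
    """Hinge loss: positive when the correct transcription does not beat the
    incorrect one by at least scale * Hamming distance in log-likelihood."""
    required_margin = scale * hamming_distance(reference, hypothesis)
    return max(0.0, required_margin - (loglik_correct - loglik_incorrect))


# Hypothetical frame-level labels for one utterance.
reference = ["a", "a", "b", "b"]
hypothesis = ["a", "b", "b", "c"]  # differs from the reference in 2 frames

violated = margin_violation(-10.0, -11.5, reference, hypothesis)   # gap 1.5 < 2
satisfied = margin_violation(-10.0, -13.0, reference, hypothesis)  # gap 3.0 >= 2
```

An online learner in this spirit would adjust the model parameters after each utterance only when the violation is positive.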

11:00Maximum Mutual Information Estimation via Second Order Cone Programming for Large Vocabulary Continuous Speech Recognition

Dalei Wu (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Baojie Li (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Hui Jiang (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)

In this paper, we have successfully extended our previous work of convex optimization methods to MMIE-based discriminative training for large vocabulary continuous speech recognition. Specifically, we have re-formulated the MMIE training into a second order cone programming (SOCP) program using some convex relaxation techniques that we have previously proposed. Moreover, the entire SOCP formulation has been developed for word graphs instead of N-best lists to handle large vocabulary tasks. The proposed method has been evaluated in the standard WSJ-5k task and experimental results show that the proposed SOCP method significantly outperforms the conventional EBW method in terms of recognition accuracy as well as convergence behavior. Our experiments also show that the proposed SOCP method is efficient enough to handle some relatively large HMM sets normally used in large vocabulary tasks.

11:20Hidden Conditional Random Field with Distribution Constraints for Phone Classification

Dong Yu (Microsoft Research)
Li Deng (Microsoft Research)
Alex Acero (Microsoft Research)

We advance the recently proposed hidden conditional random field (HCRF) model by replacing the moment constraints (MCs) with the distribution constraints (DCs). We point out that the DCs are the same as the traditional MCs for the binary features but are able to better regularize the probability distribution of the continuous-valued features than the MCs. We show that under the DCs the HCRF model is no longer log-linear but embeds the model parameters in non-linear functions. We provide an effective solution to the resulting optimization problem by converting it to the traditional log-linear form at a higher-dimensional space of features exploiting cubic spline. We demonstrate that a 20.8% classification error rate can be achieved on the TIMIT phone classification task using the HCRF-DC model. This result is superior to any published single-system result on this task including the HCRF-MC model, the discriminatively trained HMMs, and the large-margin HMMs using the same features.

11:40Deterministic Annealing Based Training Algorithm for Bayesian Speech Recognition

Sayaka Shiota (Nagoya Institute of Technology)
Kei Hashimoto (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

This paper proposes a deterministic annealing based training algorithm for Bayesian speech recognition. The Bayesian method is a statistical technique for estimating reliable predictive distributions by marginalizing model parameters. However, the local maxima problem in the Bayesian method is more serious than in the ML-based approach, because the Bayesian method treats not only state sequences but also model parameters as latent variables. The deterministic annealing EM (DAEM) algorithm has been proposed to mitigate the local maxima problem in the EM algorithm, and its effectiveness has been reported in HMM-based speech recognition using the ML criterion. In this paper, the DAEM algorithm is applied to Bayesian speech recognition to relax the local maxima problem. Speech recognition experiments show that the proposed method achieved higher performance than the conventional methods.
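
The core idea of deterministic annealing in EM can be sketched as a tempered E-step: the joint likelihoods are raised to a power beta before normalization, and beta is increased gradually towards 1. The sketch below is a generic illustration under that assumption, with hypothetical names and numbers; it does not reproduce the Bayesian variant proposed in the paper.

```python
import math


def tempered_posteriors(log_joint, beta):
    """Posterior over hidden states with each joint likelihood raised to beta.

    beta = 1 gives the ordinary EM posterior; beta close to 0 flattens the
    posterior towards uniform, which smooths the objective early in training.
    """
    scaled = [beta * value for value in log_joint]
    peak = max(scaled)  # subtract the maximum for numerical stability
    unnormalized = [math.exp(value - peak) for value in scaled]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


# Hypothetical log joint likelihoods of one observation under two states.
log_joint = [math.log(0.8), math.log(0.2)]

flat = tempered_posteriors(log_joint, 0.0)      # uniform
halfway = tempered_posteriors(log_joint, 0.5)   # softened
standard = tempered_posteriors(log_joint, 1.0)  # ordinary posterior
```

An annealing schedule would run EM iterations at each beta value while stepping beta from a small value up to 1.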

Tue-Ses1-O2:
Language acquisition

Time:Tuesday 10:00 Place:East Wing 1 Type:Oral
Chair:Maria Uther

10:00Connecting Rhythm and Prominence in Automatic ESL Pronunciation Scoring

Emily Nava (University of Southern California)
Joseph Tepperman (University of Southern California)
Louis Goldstein (University of Southern California)
Maria Luisa Zubizarreta (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Past studies have shown that a native Spanish speaker's use of phrasal prominence is a good indicator of her level of English prosody acquisition. Because of the cross-linguistic differences in the organization of phrasal prominence and durational contrasts, we hypothesize that those speakers with English-like prominence in their L2 speech are also expected to have acquired English-like rhythm. Statistics from a corpus of native and nonnative English confirm that speakers with an English-like phrasal prominence are also the ones who use English-like rhythm. Additionally, two methods of automatic score generation based on vowel duration times demonstrate a correlation of at least 0.6 between these automatic scores and subjective scores for phrasal prominence. These findings suggest that simple vowel duration measures obtained from standard automatic speech recognition methods can be salient cues for estimating subjective scores of prosodic acquisition, and of pronunciation in general.

10:20Evaluating parameters for mapping adult vowels to imitative babbling

Ilana Heintz (The Ohio State University)
Mary Beckman (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)
Lucie Ménard (Université du Québec à Montréal)

We design a neural network model of first language acquisition to explore the relationship between child and adult speech sounds. The model learns simple vowel categories using a produce-and-perceive babbling algorithm in addition to listening to ambient speech. The model is similar to that of Westermann & Miranda (2004), but adds a dynamic aspect in that it adapts in both the articulatory and acoustic domains to changes in the child's speech patterns. The training data is designed to replicate infant speech sounds and articulatory configurations. By exploring a range of articulatory and acoustic dimensions, we see how the child might learn to draw correspondences between his or her own speech and that of a caretaker, whose productions are quite different from the child's. We also design an imitation evaluation paradigm that gives insight into the strengths and weaknesses of the model.

10:40Intonation of Japanese sentences spoken by English speakers

Chiharu Tsurutani (Griffith University, Australia)

This study investigated the intonation of Japanese sentences spoken by Australian English speakers and the influence of their first language (L1) prosody on their intonation of Japanese sentences. Second language (L2) intonation is a complicated product of L1 transfer at two levels of the prosodic hierarchy: the word level and the phrase level. L2 speech is hypothesized to retain the characteristics of L1, and to gain marked features of the target language only during the late stage of acquisition. Investigation of this hypothesis involved acoustic measurement of L2 speakers’ intonation contours, and comparison of these contours with those of native speakers.

11:00KLAIR: a Virtual Infant for Spoken Language Acquisition Research

Mark Huckvale (University College London)
Ian Howard (University of Cambridge)
Sascha Fagel (Berlin Institute of Technology)

Recent research into the acquisition of spoken language has stressed the importance of learning through embodied linguistic interaction with caregivers rather than through passive observation. However, the necessity of interaction makes experimental work on the simulation of infant speech acquisition difficult because of the technical complexity of building real-time embodied systems. In this paper we present KLAIR: a software toolkit for building simulations of spoken language acquisition through interactions with a virtual infant. The main part of KLAIR is a sensori-motor server that supplies a client machine learning application with a virtual infant on screen that can see, hear and speak. By encapsulating the real-time complexities of audio and video processing within a server that will run on a modern PC, we hope that KLAIR will encourage and facilitate more experimental research into spoken language acquisition through interaction.

11:20An Articulatory Analysis of Phonological Transfer Using Real-Time MRI

Joseph Tepperman (University of Southern California)
Erik Bresch (University of Southern California)
Yoon-Chul Kim (University of Southern California)
Sungbok Lee (University of Southern California)
Louis Goldstein (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Phonological transfer is the influence of a first language on phonological variations made when speaking a second language. With automatic pronunciation assessment applications in mind, this study intends to uncover evidence of phonological transfer in terms of articulation. Real-time MRI videos from three German speakers of English and three native English speakers are compared to uncover the influence of German consonants on close English consonants not found in German. Results show that nonnative speakers demonstrate the effects of L1 transfer through the absence of articulatory contrasts seen in native speakers, while still maintaining minimal articulatory contrasts that are necessary for automatic detection of pronunciation errors, encouraging the further use of articulatory models for speech error characterization and detection.

11:40Do Multiple Caregivers Speed up Language Acquisition?

Louis ten Bosch (Radboud University Nijmegen)
Okko Rasanen (Helsinki University of Technology)
Joris Driesen (Catholic University of Leuven)
Guillaume Aimetti (University of Sheffield)
Toomas Altosaar (Helsinki University of Technology)
Lou Boves (Radboud University Nijmegen)
Athena Corns (Radboud University Nijmegen)

In this paper we compare three different implementations of language learning to investigate the issue of speaker-dependent initial representations and subsequent generalization. These implementations are used in a comprehensive model of language acquisition under development in the FP6 FET project ACORNS. All algorithms are embedded in a cognitively and ecologically plausible framework, and perform the task of detecting word-like units without any lexical, phonetic, or phonological information. The results show that the computational approaches differ with respect to the extent they deal with unseen speakers, and how generalization depends on the variation observed during training.

Tue-Ses1-O3:
ASR: Lexical and Prosodic Models

Time:Tuesday 10:00 Place:East Wing 2 Type:Oral
Chair:Eric Fosler-Lussier

10:00Grapheme to phoneme conversion using an SMT system

Antoine Laurent (Laboratoire Informatique Université du Maine (LIUM))
Paul Deléglise (Laboratoire Informatique Université du Maine (LIUM))
Sylvain Meignier (Laboratoire Informatique Université du Maine (LIUM))

This paper presents an automatic grapheme to phoneme conversion system that uses statistical machine translation techniques provided by the Moses Toolkit. The generated word pronunciations are employed in the dictionary of an automatic speech recognition system and evaluated using the ESTER 2 French broadcast news corpus. Grapheme to phoneme conversion based on Moses is compared to two other methods: G2P, and a dictionary look-up method supplemented by a rule-based tool for phonetic transcriptions of words unavailable in the dictionary. Moses gives better results than G2P, and its performance is comparable to that of the dictionary look-up strategy.

10:20Lexical and Phonetic Modeling for Arabic Automatic Speech Recognition

Long Nguyen (BBN Technologies)
Tim Ng (BBN Technologies)
Kham Nguyen (Northeastern University)
Rabih Zbib (Massachusetts Institute of Technology)
John Makhoul (BBN Technologies)

In this paper, we describe the use of either words or morphemes as lexical modeling units and the use of either graphemes or phonemes as phonetic modeling units for Arabic automatic speech recognition (ASR). We designed four Arabic ASR systems: two word-based systems and two morpheme-based systems. Experimental results using these four systems show that they have comparable state-of-the-art performance individually, but the more sophisticated morpheme-based system tends to be the best. However, they seem to complement each other quite well within the ROVER system combination framework to produce substantially-improved combined results.

10:40Assessing Context and Learning for isiZulu Tone Recognition

Gina-Anne Levow (University of Chicago)

Prosody plays an integral role in spoken language understanding. In isiZulu, a Nguni family language with lexical tone, prosodic information determines word meaning. We assess the impact of models of tone and coarticulation for tone recognition. We demonstrate the importance of modeling prosodic context to improve tone recognition. We employ this less commonly studied language to assess models of tone developed for English and Mandarin, finding common threads in coarticulatory modeling. We also demonstrate the effectiveness of semi-supervised and unsupervised tone recognition techniques for this less-resourced language, with weakly supervised approaches rivaling supervised techniques.

11:00A Sequential Minimization Algorithm for Finite-State Pronunciation Lexicon Models

Dobrisek Simon (Faculty of Electrical Engineering, Ljubljana University, Slovenia)
Vesnicer Bostjan (Faculty of Electrical Engineering, Ljubljana University, Slovenia)
Mihelic France (Faculty of Electrical Engineering, Ljubljana University, Slovenia)

The paper first presents a large-vocabulary automatic speech-recognition system that is being developed for the Slovenian language. The concept of a single-pass token-passing algorithm for fast speech decoding that can be used with the designed multi-level system structure is discussed. From the algorithmic point of view, the main component of the system is a finite-state pronunciation lexicon model. This component has a crucial impact on the overall performance of the system, and we developed a sequential minimization algorithm that very efficiently reduces the size and algorithmic complexity of the lexicon model. The presented experiments show that the sequential minimization algorithm considerably outperforms (by up to 60%) the conventional algorithms that were developed for the static global optimization of finite-state transducers.

11:20A General-Purpose 32 ms Prosodic Vector for Hidden Markov Modeling

Kornel Laskowski (Carnegie Mellon University)
Mattias Heldner (KTH)
Jens Edlund (KTH)

Prosody plays a central role in conversation, making it important for speech technologies to model. Unfortunately, the application of standard modeling techniques to the acoustics of prosody has been hindered by difficulties in modeling intonation. In this work, we explore the suitability of the recently introduced fundamental frequency variation (FFV) spectrum as a candidate general representation of tone. Experiments on 4 tasks demonstrate that FFV features are complementary to other acoustic measures of prosody and that hidden Markov models offer a suitable modeling paradigm. Proposed improvements yield a 35% relative decrease in error on unseen data and simultaneously reduce time complexity by a factor of five. The resulting representation is sufficiently mature for general deployment in a broad range of automatic speech processing applications.

11:40Vocabulary Expansion through Automatic Abbreviation Generation for Chinese Voice Search

Dong Yang (Department of Computer Science, Tokyo Institute of Technology)
Yi-cheng Pan (Department of Computer Science, Tokyo Institute of Technology)
Sadaoki Furui (Department of Computer Science, Tokyo Institute of Technology)

Long named entities are often abbreviated in spoken Chinese, and this usually leads to out-of-vocabulary (OOV) problems in speech recognition applications. In this paper, we propose a new method for automatically generating abbreviations for Chinese named entities, and we perform vocabulary expansion for voice search using the output of the abbreviation model. In our abbreviation modeling, we convert the abbreviation generation problem into a tagging problem and use the conditional random field (CRF) as the tagging tool. In the vocabulary expansion, considering the multiple abbreviation problem and the limited coverage of the top-1 abbreviation candidate, we add the top-10 candidates to the vocabulary. In our experiments, for the abbreviation modeling, we achieved a top-10 coverage of 88.3% with the proposed method; for the voice search, we improved the voice search accuracy from 16.9% to 79.2% by incorporating the top-10 abbreviation candidates into the vocabulary.

Tue-Ses1-O4:
Unit-Selection Synthesis

Time:Tuesday 10:00 Place:East Wing 3 Type:Oral
Chair:Alan Black

10:00Perceptual Cost Function for Cross-fading Based Concatenation

Qi Miao (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Alexander Kain (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)
Jan P. H. van Santen (Center for Spoken Language Understanding (CSLU), Division of Biomedical Computer Science (BMCS), Oregon Health & Science University (OHSU), Oregon, USA 97006)

In earlier research, we applied a linear weighted cross-fading function to ensure smooth concatenation. However, this can cause unnaturally shaped spectral trajectories. We propose context-sensitive cross-fading. To train this system, a perceptually validated cost function is needed, which is the focus of this paper. A corpus was designed to generate a variety of formant trajectory shapes. A perceptual experiment was performed and a multiple linear regression model was applied to predict perceptual quality ratings from various distances between cross-faded and natural trajectories. Results show that perceptual quality could be predicted well from the proposed distance measures.
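
The regression step described above can be illustrated with a minimal sketch. It assumes two hypothetical distance measures per stimulus and uses synthetic, noise-free ratings generated from known coefficients. The function names, the number of measures, and the data are illustrative assumptions and are not taken from the paper.

```python
import numpy as np

def fit_quality_model(distances, ratings):
    """Least-squares fit of ratings ~ b0 + b1*d1 + ... + bk*dk."""
    distances = np.asarray(distances, dtype=float)
    design = np.column_stack([np.ones(len(distances)), distances])
    coef, *_ = np.linalg.lstsq(design, np.asarray(ratings, dtype=float), rcond=None)
    return coef  # [intercept, weight per distance measure]

def predict_quality(coef, distances):
    """Predicted rating for each row of distance measures."""
    return coef[0] + np.asarray(distances, dtype=float) @ coef[1:]

# Synthetic data (assumption): ratings fall as either distance grows.
rng = np.random.default_rng(0)
distances = rng.uniform(0.0, 1.0, size=(20, 2))
ratings = 5.0 - 2.0 * distances[:, 0] - 1.0 * distances[:, 1]

coef = fit_quality_model(distances, ratings)
predicted = predict_quality(coef, distances)
```

With real listening-test data the fit would not be exact, and the quality of prediction would be judged by the correlation between predicted and observed ratings.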

10:20Exploring Automatic Similarity Measures for Unit Selection Tuning

Daniel Tihelka (University of West Bohemia)
Jan Romportl (SpeechTech s.r.o)

The paper focuses on the current handling of target features in the unit selection approach, which essentially requires huge corpora. Possible solutions based on measuring (dis)similarity among prosodic patterns are outlined. As a starting point for this research, several intuitively chosen measures of acoustic signal (dis)similarity are presented and correlated with perceived similarity obtained from a large-scale listening test.

10:40Towards Intonation Control in Unit Selection Speech Synthesis

Cedric Boidin (Orange Labs)
Olivier Boeffard (IRISA / University of Rennes 1)
Thierry Moudenc (Orange Labs)
Geraldine Damnati (Orange Labs)

We propose to control intonation in unit selection speech synthesis with a mixed CART-HMM intonation model. The Finite State Machine (FSM) formulation is suited to incorporate the intonation model in the unit selection framework because it allows for combination of models with different unit types and handling competing intonative variants. Subjective experiments have been carried out to compare segmental and joint-prosodic-and-segmental unit selection.

11:00A Novel Approach to Cost Weighting in Unit Selection TTS

Jerome Bellegarda (Apple Inc.)

Unit selection text-to-speech synthesis relies on multiple cost criteria, each encapsulating a different aspect of acoustic and prosodic context at any given concatenation point. For a particular set of criteria, the relative weighting of the resulting costs crucially affects final candidate ranking. Their influence is typically determined in an empirical manner (e.g., based on a limited amount of synthesized data), yielding global weights that are thus applied to all concatenations indiscriminately. This paper proposes an alternative approach, based on a data-driven framework separately optimized for each concatenation. The cost distribution in every information stream is dynamically leveraged to locally shift weight towards those characteristics that prove most discriminative at this point. An illustrative case study underscores the potential benefits of this solution.

11:20Maximum Likelihood Unit Selection for Corpus-based Speech Synthesis

Abubeker Gamboa Rosales (University of Guanajuato)
Hamurabi Gamboa Rosales (Dresden University of Technology)
Ruediger Hoffmann (Dresden University of Technology)

Unit selection attempts to find the best combination of speech unit sequences in an inventory so that the perceptual differences between expected (natural) and synthesized signals are as low as possible. However, mismatches and distortions are still possible in concatenative speech synthesis and they are normally perceptible in the synthesized waveform. Therefore, unit selection strategies and parameter tuning are still important issues in the improvement of speech synthesis. We present a novel concept to increase the efficiency of the exhaustive speech unit search within the inventory via a unit selection model. This model bases its operation on a mapping analysis of the concatenation sub-costs, a Bayes optimal classification (BOC), and a maximum likelihood selection (MLS). The principal advantage of the proposed unit selection method is that it does not require exhaustive training to set up weighted coefficients for target and concatenation sub-costs.

11:40A Close Look into the Probabilistic Concatenation Model for Corpus-based Speech Synthesis

Shinsuke Sakai (NICT)
Ranniery Maia (NICT)
Hisashi Kawai (NICT)
Satoshi Nakamura (NICT)

We have proposed a novel probabilistic approach to concatenation modeling for corpus-based speech synthesis, where the goodness of concatenation for a unit is modeled using a conditional Gaussian probability density whose mean is defined as a linear transform of the feature vector from the previous unit, and have shown its effectiveness through a subjective listening test. In this paper, we further investigate the characteristics of the proposed method by an objective evaluation and by observing the sequence of concatenation scores across an utterance. We also present the mathematical relationships of the proposed method with other approaches and show that it has flexible modeling power, with other approaches to concatenation scoring as special cases.

Tue-Ses1-S1:
Special Session: Advanced Voice Function Assessment

Time:Tuesday 10:00 Place:East Wing 4 Type:Special
Chair:Anna Barney & Mette Pedersen

10:00Acoustic and High-Speed Digital Imaging Based Analysis of Pathological Voice Contributes to Better Understanding and Differential Diagnosis of Neurological Dysphonias and of Mimicking Phonatory Disorders

Krzysztof Izdebski (Pacific Voice and Speech Foundation & Department of Otolaryngology: Head & Neck Surgery, Stanford Voice & Swallowing Center, Stanford University School of Medicine)
Yuling Yan (Department of Bioengineering, Santa Clara University & Department of Otolaryngology, Stanford University School of Medicine)
Melda Kunduk (Department of Communication Sciences and Disorders, Louisiana State University)

Using Nyquist-plots definitions and HSDI-based analyses of the acoustic and visual database of similarly sounding disordered neurologically driven pathological phonations, we categorized these signals and provided an in-depth explanation of how these sounds differ, and how these sounds are generated at the glottic level. Combined evaluations based on modern technology strengthened our knowledge and improved objective guidelines on how to approach clinical diagnosis “by ear”, significantly aiding the process of differential diagnosis of complex pathological voice qualities in non-laboratory settings. Index Terms: HSDI, Nyquist-plots, voice quality, tremor overpressure, vocal arrests, neurologic dysphonias, functional dysphonias, mimicking disorders

10:20Normalized Modulation Spectral Features for Cross-Database Voice Pathology Detection

Maria Markaki (Computer Science Department, University of Crete)
Yannis Stylianou (Computer Science Department, University of Crete)

In this paper, we employ normalized modulation spectral analysis for voice pathology detection. Such normalization is important when there is a mismatch between training and testing conditions, in other words, when the detection system is employed in real (testing) conditions. Modulation spectra usually produce a high-dimensionality space. For classification purposes, the size of the original space is reduced using Higher Order Singular Value Decomposition (SVD). Further, we select the most relevant features based on the mutual information between subjective voice quality and computed features, which leads to a modulation spectra representation adapted to the classification task. For voice pathology detection, the adaptive modulation spectra are combined with an SVM classifier. To simulate real testing conditions, we used two independently recorded databases: one for training and the other for testing. We address the difference in signal characteristics between training and testing data through subband normalization of modulation spectral features. Simulations show that feature normalization enables the cross-database detection of pathological voices even when training and test data are different.

10:40Speech sample salience analysis for speech cycle detection

Christophe Mertens (Laboratory of Images, Signals and Telecommunication Devices, CP 165/51, Faculté des Sciences Appliquées. Université Libre de Bruxelles)
Francis Grenez (Laboratory of Images, Signals and Telecommunication Devices, CP 165/51, Faculté des Sciences Appliquées. Université Libre de Bruxelles)
Jean Schoentgen (National Fund for Scientific Research, Belgium)

The presentation proposes a method for the measurement of cycle lengths in voiced speech. The background is the study of acoustic cues of slow (vocal tremor) and fast (vocal jitter) perturbations of the vocal frequency. Here, these acoustic cues are obtained by means of a temporal method that detects speech cycles via the so-called salience of the speech signal samples. The method does not require that the signal be locally periodic or that the average period length be known a priori. Several implementations are considered and discussed. Salience analysis is compared with the auto-correlation method for cycle detection implemented in Praat.

11:00The Use of Telephone Speech Recordings for Assessment and Monitoring of Cognitive Function in Elderly People

Viliam Rapcan (Trinity Centre for Bioengineering, Trinity College Dublin, Ireland)
Shona D'Arcy (Trinity Centre for Bioengineering, Trinity College Dublin, Ireland)
Nils Penard (Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)
Ian H. Robertson (Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)
Richard B. Reilly (Trinity Centre for Bioengineering & Trinity College Institute of Neuroscience, Trinity College Dublin, Ireland)

Cognitive assessment in the clinic is a time-consuming and expensive task. Speech may be employed as a means of monitoring cognitive function in elderly people. Extraction of speech characteristics from speech recorded remotely over a telephone was investigated and compared to speech characteristics extracted from recordings made in a controlled environment. Results demonstrate that speech characteristics can, with minor changes to the feature extraction algorithm, be reliably extracted from telephone-quality speech (with an overall accuracy of 93.2%). With further development of a fully automated IVR system, an early screening system for cognitive decline may be easily realized.

11:20Optimized Feature set to Assess Acoustic Perturbations in Dysarthric Speech

Sunil Nagaraja (Department of Electrical and Computer Engineering, University of New Brunswick, Canada)
Eduardo Castillo Guerra (Department of Electrical and Computer Engineering, University of New Brunswick, Canada)

This paper focuses on the optimization of features derived to characterize the acoustic perturbations encountered in a group of neurological disorders known as Dysarthria. The work derives a set of orthogonal features that enable acoustic analyses of dysarthric speech from eight different Dysarthria types. The feature set is composed of combinations of objective measurements obtained with digital signal processing algorithms and perceptual judgments of the most reliably perceived acoustic perturbations. The effectiveness of the features in providing relevant information about the disorders is evaluated with different classifiers, yielding a classification rate of up to 93.7%.

11:40A Microphone-Independent Visualization Technique for Speech Disorders

Andreas Maier (Universität Erlangen-Nürnberg, Abteilung für Phoniatrie und Pädaudiologie)
Stefan Wenhardt (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Tino Haderlein (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)
Maria Schuster (Universität Erlangen-Nürnberg, Abteilung für Phoniatrie und Pädaudiologie)
Elmar Nöth (Universität Erlangen-Nürnberg, Lehrstuhl für Mustererkennung)

In this paper we introduce a novel method for the visualization of speech disorders. We demonstrate the method with disordered speech and a control group. However, both groups were recorded using two different microphones. The projection of the patient data using a single microphone yields significant correlations between the coordinates on the map and certain criteria of the disorder which were perceptually rated. However, projection of data from multiple microphones reduces this correlation. Usually, the acoustical mismatch between the microphones is greater than the mismatch between the speakers, i.e., not the disorders but the microphones form clusters in the visualization. Based on an extension of the Sammon mapping, we are able to create a map which projects the same speakers onto the same position even if multiple microphones are used. Furthermore, our method also restores the correlation between the map coordinates and the perceptual assessment.
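
For readers unfamiliar with the Sammon mapping mentioned above, the following minimal sketch computes the standard Sammon stress, the quantity that a Sammon mapping minimizes when projecting high-dimensional points onto a low-dimensional map. It does not reproduce the authors' microphone-independent extension. The function names and the toy points are illustrative assumptions.

```python
import numpy as np

def pairwise_distances(points):
    """Euclidean distances between all distinct pairs (upper triangle only)."""
    points = np.asarray(points, dtype=float)
    diff = points[:, None, :] - points[None, :, :]
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    rows, cols = np.triu_indices(len(points), k=1)
    return dist[rows, cols]

def sammon_stress(high_dim, low_dim):
    """Standard Sammon stress between original points and their projection.

    Assumes all original points are distinct, so no original distance is zero.
    """
    d_orig = pairwise_distances(high_dim)
    d_map = pairwise_distances(low_dim)
    return float(np.sum((d_orig - d_map) ** 2 / d_orig) / np.sum(d_orig))

# Toy example (assumption): three points that already lie in a plane.
original = np.array([[0.0, 0.0, 0.0], [3.0, 0.0, 0.0], [0.0, 4.0, 0.0]])
perfect_map = original[:, :2]        # preserves every pairwise distance
stretched_map = 2.0 * perfect_map    # doubles every pairwise distance

stress_perfect = sammon_stress(original, perfect_map)
stress_stretched = sammon_stress(original, stretched_map)
```

A stress of zero means that the map preserves all pairwise distances. Because each squared error is divided by the original distance, errors on small distances are penalized more heavily, which favors preserving local neighborhoods.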

12:00Evaluation of the Effect of the GSM Full Rate codec on the Automatic Detection of Laryngeal Pathologies Based on Cepstral Analysis

Ruben Fraile (Universidad Politecnica de Madrid)
Carmelo Sanchez (Universidad Politecnica de Madrid)
Juan I. Godino-Llorente (Universidad Politecnica de Madrid)
Nicolas Saenz-Lechon (Universidad Politecnica de Madrid)
Victor Osma-Ruiz (Universidad Politecnica de Madrid)
Juana M. Gutierrez (Universidad Politecnica de Madrid)

Advances in speech signal analysis during the last decade have allowed the development of automatic algorithms for the non-invasive detection of laryngeal pathologies. Bearing in mind the extension of these automatic methods to remote diagnosis scenarios, this paper analyzes the performance of a pathology detector based on Mel Frequency Cepstral Coefficients when the speech signal has undergone the distortion of a speech codec such as the GSM FR codec, which is used in one of today's most widespread communications networks. It is shown that the overall performance of the automatic detection of pathologies is degraded by less than 5%, and that such degradation is not due to the codec itself, but to the bandwidth limitation needed at its input. These results indicate that the GSM system may be better suited to remote voice assessment than the analogue telephone channel.

12:20Cepstral analysis of vocal dysperiodicities in disordered connected speech

Ali Alpan (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles, Brussels, Belgium)
Jean Schoentgen (National Fund for Scientific Research, Belgium)
Youri Maryn (Department of Otorhinolaryngology and Head & Neck Surgery, Department of Speech-Language Pathology and Audiology, Sint-Jan General Hospital, Bruges, Belgium)
Francis Grenez (Laboratory of Images, Signals & Telecommunication Devices, Université Libre de Bruxelles, Brussels, Belgium)
Peter Murphy (Department of Electronic and Computer Engineering, University of Limerick, Limerick, Ireland)

Several studies have shown that the amplitude of the first rahmonic peak (R1) in the cepstrum is an indicator of hoarse voice quality. The cepstrum is obtained by taking the inverse Fourier Transform of the log-magnitude spectrum. In the present study, a number of spectral analysis processing steps are implemented, including period-synchronous and period-asynchronous analysis, as well as harmonic-synchronous and harmonic-asynchronous spectral band-limitation prior to computing the cepstrum. The analysis is applied to connected speech signals. The correlation between amplitude R1 and perceptual ratings is examined for a corpus comprising 28 normophonic and 223 dysphonic speakers. One observes that the correlation between R1 and perceptual ratings increases when the spectrum is band-limited prior to computing the cepstrum. In addition, comparisons are made with a popular cepstral cue which is the cepstral peak prominence (CPP).
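
The cepstrum definition given above (inverse Fourier transform of the log-magnitude spectrum) can be sketched as follows. This is a minimal illustration on a synthetic impulse train. It does not include the band-limitation or period-synchronous analysis steps studied in the paper. The function names, the small constant `eps`, the sampling rate, and the f0 search range are assumptions made for the example.

```python
import numpy as np

def real_cepstrum(signal, eps=1e-10):
    """Inverse FFT of the log-magnitude spectrum.

    eps is an assumed small constant that avoids taking log(0).
    """
    log_magnitude = np.log(np.abs(np.fft.fft(signal)) + eps)
    return np.fft.ifft(log_magnitude).real

def first_rahmonic(signal, fs, f0_min=60.0, f0_max=400.0):
    """Locate the largest cepstral peak within an assumed f0 search range.

    Returns the peak quefrency in samples and its amplitude.
    """
    cepstrum = real_cepstrum(signal)
    lo = int(fs / f0_max)   # shortest period considered, in samples
    hi = int(fs / f0_min)   # longest period considered, in samples
    quefrency = lo + int(np.argmax(cepstrum[lo:hi + 1]))
    return quefrency, float(cepstrum[quefrency])

# Synthetic voiced-like signal (assumption): 100 Hz impulse train at 16 kHz.
fs = 16000
period = 160
signal = np.zeros(period * 25)
signal[::period] = 1.0

quefrency, r1_amplitude = first_rahmonic(signal, fs)
f0_estimate = fs / quefrency
```

For a strictly periodic signal the first rahmonic appears at the quefrency equal to the period. In dysphonic speech the peak is typically lower and broader, which is why its amplitude can serve as a cue for hoarseness.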

12:40Standard information from patients: the usefulness of self-evaluation measured with the French version of the VHI

Lise Crevier-Buchman (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France / Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)
Stephanie Borel (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France / Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)
Stephane Hans (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France)
Madeleine Menard (Department of Otolaryngology, Head & Neck Surgery, Hôpital Européen Georges Pompidou, Université Paris Descartes, Paris, France)
Jacqueline Vaissiere (Lab. Phonétique et Phonologie, UMR 7018 CNRS-Paris3/Sorbonne Nouvelle, Paris, France)

The Voice Handicap Index (VHI) is a scale designed to measure voice disability in daily life. Two groups of patients were evaluated. One group comprised patients with glottic carcinoma treated by cordectomy: type I & II (13 patients), type III (5 patients), and type V (5 patients). Evaluation was done pre- and postoperatively for 12 months. The other group comprised patients with unilateral vocal fold paralysis treated by thyroplasty (17 patients). Evaluation was done before and 3 months after surgery. Total VHI and the emotional and physical subscales improved significantly for type I & II cordectomy and for thyroplasty. The VHI can provide an insight into a patient’s handicap.

13:00Intelligibility Assessment in Children with Cleft Lip and Palate in Italian and German

Marcello Scipioni (Politecnico di Milano, Polo Regionale di Como, Italy)
Matteo Gerosa (FBK - Fondazione Bruno Kessler, Trento, Italy)
Diego Giuliani (FBK - Fondazione Bruno Kessler, Trento, Italy)
Elmar Nöth (Chair of Pattern Recognition, Friedrich-Alexander-University Erlangen-Nuremberg, Germany)
Andreas Maier (Chair of Pattern Recognition, Friedrich-Alexander-University Erlangen-Nuremberg, Germany)

Current research has shown that the speech intelligibility of children with cleft lip and palate (CLP) can be estimated automatically using speech recognition methods. On German CLP data, high and significant correlations between human ratings and the recognition accuracy of a speech recognition system have already been reported. In this paper we investigate whether the approach is also suitable for other languages. To this end, we compare the correlations obtained on German data with the correlations on Italian data. A high and significant correlation (r=0.76; p < 0.01) was identified on the Italian data. These results do not differ significantly from the results on German data (p > 0.05).

13:20Universidade de Aveiro’s Voice Evaluation Protocol

Luis M. T. Jesus (IEETA and ESSUA, Universidade de Aveiro, Portugal)
Anna Barney (ISVR, University of Southampton, UK)
Ricardo Santos (Hospital Privado da Trofa, Portugal)
Janine Caetano (Agrupamento de Escolas Serra da Gardunha, Fundão, Portugal)
Juliana Jorge (RAIZ, Esmoriz, Portugal)
Pedro Sá Couto (Departamento de Matemática da Universidade de Aveiro, Portugal)

This paper presents Universidade de Aveiro’s Voice Evaluation Protocol for European Portuguese (EP), and a preliminary inter-rater reliability study. Ten patients with vocal pathology were assessed by two Speech and Language Therapists (SLTs). Protocol parameters such as overall severity, roughness, breathiness, change of loudness (CAPE-V), grade, breathiness and strain (GRBAS), glottal attack, respiratory support, respiratory-phonatory-articulatory coordination, digital laryngeal manipulation, voice quality after manipulation, muscular tension and diagnosis presented high reliability and were highly correlated (good inter-rater agreement and high value of correlation). Values for the overall severity and grade were similar to those reported in the literature.

Tue-Ses1-P1:
Human Speech Production II

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Martin Cooke

#1Simple Physical Models of the Vocal Tract for Education in Speech Science

Takayuki Arai (Sophia University)

In speech-related fields, physical models of the vocal tract are effective tools for education in acoustics. Arai’s cylinder-type models are based on Chiba and Kajiyama’s measurement of vocal-tract shapes. The models quickly and effectively demonstrate vowel production. In this study, we developed physical models with simplified shapes as educational tools to illustrate how vocal-tract shape accounts for differences among vowels. As a result, the five Japanese vowels were produced by tube-connected models, in which several uniform tubes with different cross-sectional areas and lengths are connected, as in Fant’s and Arai’s three-tube models.

#2Auto-meshing Algorithm for Acoustic Analysis of Vocal Tract

Kyohei Hayashi (Future University Hakodate)
Nobuhiro Miki (Future University Hakodate)

We propose a new auto-meshing algorithm for acoustic analysis of the vocal tract using the Finite Element Method (FEM). In our algorithm, the domain of the three-dimensional figure of the vocal tract is decomposed into two domains, a surface domain and an inner domain, in order to employ the overlapping domain decomposition method. The meshing of surface blocks can be realized with smooth surfaces using NURBS interpolation. We show an example of the meshes for the vocal tract figure of the Japanese vowel /a/, and the trial result of the FEM simulation.

#3Voice production model employing an interactive boundary-layer analysis of glottal flow

Tokihiko Kaburagi (Department of Acoustic Design, Faculty of Design, Kyushu University)
Katsunori Daimo (Graduate School of Design, Kyushu University)
Shogo Nakamura (School of Design, Kyushu University)

A voice production model has been studied by considering essential aerodynamic and acoustic phenomena in phonation. Acoustic voice sources are produced by the volume flow through the glottis. A precise flow analysis is therefore performed based on the boundary-layer approximation and the viscous-inviscid interaction between the boundary layer and core flow. This flow analysis can supply information on the separation point of the glottal flow and the thickness of the boundary layer, and yield an effective prediction of the flow behavior. When the flow analysis is combined with a mechanical model of the vocal fold, the resulting acoustic wave travels through the vocal tract and a pressure change develops in the vicinity of the glottis. This change can affect the glottal flow and the motion of the folds, causing source-filter interaction. Preliminary simulations were conducted by changing the relationship between the fundamental and formant frequencies, and the results are reported.

#4Characteristics of Two-Dimensional Finite Difference Techniques for Vocal Tract Analysis and Voice Synthesis

Matt Speed (Audio Lab, Department of Electronics, University of York)
Damian Murphy (Audio Lab, Department of Electronics, University of York)
David Howard (Audio Lab, Department of Electronics, University of York)

Both digital waveguide and finite difference techniques are numerical methods that have been demonstrated as appropriate for acoustic modelling applications. Whilst the application of the digital waveguide mesh to vocal tract modelling has been the subject of previous work, the application of comparable finite difference techniques is as yet untested. This study explores the characteristics of such a finite-difference approach to two-dimensional vocal tract modelling. Initial results suggest that finite difference techniques alone are not ideal, due to the limitation of non-dynamic behaviour and poor representation of admittance discontinuities in the approximation of three dimensional geometries. They do however introduce robust boundary formulations, and have a valid and useful application in modelling non-vital static volumes, particularly the nasal tract.

#5Adaptation of a predictive model of tongue shapes

Chao Qin (EECS, School of Engineering, University of California, Merced)
Miguel Carreira-Perpiñán (EECS, School of Engineering, University of California, Merced)

It is possible to recover the full midsagittal contour of the tongue with submillimetric accuracy from the location of just 3-4 landmarks on it. This involves fitting a predictive mapping from the landmarks to the contour using a training set consisting of contours extracted from ultrasound recordings. However, extracting sufficient contours is a slow and costly process. Here, we consider adapting a predictive mapping obtained for one condition (such as a given recording session, recording modality, speaker or speaking style) to a new condition, given only a few new contours and no correspondences. We propose an extremely fast method based on estimating a 2D-wise linear alignment mapping, and show it recovers very accurate predictive models from about 10 new contours.

#6Using sensor orientation information for computational head stabilisation in 3D Electromagnetic Articulography (EMA)

Christian Kroos (MARCS Auditory Laboratories, University of Western Sydney, Australia)

We propose a new simple algorithm to make use of the sensor orientation information in 3D Electromagnetic Articulography (EMA) for computational head stabilisation. The algorithm also provides a well-defined procedure in the case where only two sensors are available for head motion tracking and allows for the combining of position coordinates and orientation angles for head stabilisation with an equal weighting of each kind of information. An evaluation showed that the method using the orientation angles produced the most reliable results.

#7Collision Threshold Pressure Before and After Vocal Loading

Laura Enflo (Dept. of Speech, Music and Hearing, School of Computer Science & Communication, KTH, Sweden)
Johan Sundberg (Dept. of Speech, Music and Hearing, School of Computer Science & Communication, KTH, Sweden)
Friedemann Pabst (Hospital Dresden Friedrichstadt, Dresden, Germany)

The phonation threshold pressure (PTP) has been found to increase during vocal fatigue. In the present study we compare PTP and collision threshold pressure (CTP) before and after vocal loading in singer and non-singer voices. Seven subjects repeated the vowel sequence /a,e,i,o,u/ at an SPL of at least 80 dB @ 0.3 m for 20 min. Before and after this loading the subjects’ voices were recorded while they produced a diminuendo repeating the syllable /pa/. Oral pressure during the /p/ occlusion was used as a measure of subglottal pressure. Both CTP and PTP increased significantly after the vocal loading.

#8Gender differences in the realization of vowel-initial glottalization

Elke Philburn (University of Manchester, Department of Linguistics and English Language)

The aim of the study was to investigate gender-dependent differences in the realization of German glottalized vowel onsets. Laryngographic data of semi-spontaneous speech were collected from four male and four female speakers of Standard German. Measurements of relative vocal fold contact duration were carried out including glottalized vowel onsets as well as non-glottalized controls. The results show that female subjects realized the glottalized vowel onsets with greater maximum vocal fold contact duration than male subjects and that the glottalized vowel onsets produced by females were more clearly distinguished from the non-glottalized controls.

#9Stability and composition of functional synergies for speech movements in children and adults

Hayo Terband (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Frits van Brenk (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK)
Pascal van Lieshout (Department of Speech-Language Pathology, Oral Dynamics Lab; Department of Psychology; Institute of Biomaterials and Biomedical Engineering, University of Toronto, and Toronto Rehabilitation Institute, Toronto, Canada)
Lian Nijland (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Ben Maassen (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands ; Department of Neurolinguistics, University of Groningen, Groningen, the Netherlands)

The consistency and composition of functional synergies for speech movements were investigated in 7-year-old children and adults in a reiterated speech task using electromagnetic articulography (EMA). Results showed higher variability in children for tongue tip and jaw, but not for lower lip movement trajectories. Furthermore, the relative contribution of the lower lip to oral closure was smaller in children than in adults, whereas no such difference was found for the tongue tip. These results support and extend findings of non-linearity in speech motor development and illustrate the importance of a multi-measures approach in studying speech motor development.

#10An analysis of speech rate strategies in aging

Frits van Brenk (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK; Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Hayo Terband (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands)
Pascal van Lieshout (Department of Speech-Language Pathology, Oral Dynamics Lab; Department of Psychology; Institute of Biomaterials and Biomedical Engineering, University of Toronto, and Toronto Rehabilitation Institute, Toronto, Canada)
Anja Lowit (Department of Speech and Language Therapy, University of Strathclyde, Glasgow, UK)
Ben Maassen (Medical Psychology/Pediatric Neurology Centre/ENT, Radboud University Nijmegen Medical Centre, Nijmegen, the Netherlands; Department of Neurolinguistics, University of Groningen, Groningen, the Netherlands)

Effects of age and speech rate on movement cycle duration were assessed using electromagnetic articulography. In a repetitive task syllables were articulated at eight rates, obtained by metronome and self-pacing. Results indicate that increased speech rate is associated with increasing movement cycle duration stability, while decreased rate leads to a decrease in uniformity of cycle duration, supporting the view that alterations in speech rate are associated with different motor control strategies involving durational manipulations. The relative contribution of closing movement durations increases with decreasing speech rate, and is a more dominant strategy for elderly speakers.

#11Variability and stability in collaborative dialogues: turn-taking and filled pauses

Štefan Beňuš (Constantine the Philosopher University, Nitra, Slovakia and Slovak Academy of Sciences, Bratislava, Slovakia)

Filled pauses have important and varied functions in turn-taking behavior, and a better understanding of their relationship opens new ways for improving the quality and naturalness of dialogue systems. We use a corpus of collaborative task-oriented dialogues to provide new insights into the relationship between filled pauses and turn-taking based on temporal and acoustic features. We then explore which of these patterns are stable and robust across speakers, which are prone to entrainment based on conversational partner, and which are variable and noisy. Our findings suggest that intensity is the least stable feature followed by pitch-related features, and temporal features relating filled pauses to chunking and turn-taking are the most stable.

#12Speaking in the presence of a competing talker

Youyi Lu (University of Sheffield)
Martin Cooke (Ikerbasque and University of the Basque Country)

How do speakers cope with a competing talker? This study investigated the possibility that speakers are able to retime their contributions to take advantage of temporal fluctuations in the background, reducing any adverse effects for an interlocutor. Speech was produced in quiet, competing talker, modulated noise and stationary backgrounds, with and without a communicative task. An analysis of the timing of contributions relative to the background indicated a significantly reduced chance of overlapping for the modulated noise backgrounds relative to quiet, with competing speech resulting in the least overlap. Strong evidence for an active overlap avoidance strategy is presented.

Tue-Ses1-P3:
Speech and Audio Segmentation and Classification

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:S. Umesh

#1Wavelet-based Speaker Change Detection in Single Channel Speech Data

Michael Wiesenegger (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Austria)

Speaker segmentation is the task of finding speaker turns in an audio stream. We propose a metric-based algorithm based on Discrete Wavelet Transform (DWT) features. Principal component analysis (PCA) or linear discriminant analysis (LDA) is further used to reduce the dimensionality of the feature space and remove redundant information. In the experiments our methods -- DWT-PCA and DWT-LDA -- are compared to the DISTBIC algorithm using clean and noisy data of the TIMIT database. Under conditions with strong noise in particular, i.e. -10dB SNR, our DWT-PCA approach is very robust: the false alarm rate (FAR) drops by ~2% and the missed detection rate (MDR) stays about the same compared to clean speech, whereas the DISTBIC method fails -- its FAR and MDR are ~0% and ~100%, respectively. For clean speech DWT-PCA shows an improvement of ~30% (relative) for both the FAR and MDR in comparison to the DISTBIC algorithm. DWT-LDA performs slightly worse than DWT-PCA.

#2An Adaptive Threshold Computation for Unsupervised Speaker Segmentation

Laura Docio-Fernandez (University of Vigo)
Paula Lopez-Otero (University of Vigo)
Carmen Garcia-Mateo (University of Vigo)

Reliable speaker segmentation is critical in many applications in the speech processing domain. In this paper, we compare the performance of two speaker segmentation systems: the first one is inspired by a typical state-of-the-art speaker segmentation system, and the other is an improved version of the former system. We show that the proposed system performs better as it does not over-segment the data. This system includes an algorithm that randomly discards some of the change points with a probability depending on its performance at any moment. Thus, the system merges adjacent segments when they are spoken by the same speaker with a high probability; anytime a change is discarded the discard probability will rise, as the system made a mistake; the opposite will occur when the two adjacent segments belong to different speakers, as there will not be a mistake in this case. We show the improvements of the new system through comparative experiments on the TC-STAR Spanish database.

#3A data-driven approach for estimating the time-frequency binary mask

Gibak Kim (Department of Electrical Engineering, University of Texas at Dallas)
Philipos Loizou (Department of Electrical Engineering, University of Texas at Dallas)

The ideal binary mask, often used in robust speech recognition applications, requires an estimate of the local SNR in each time-frequency (T-F) unit. A data-driven approach is proposed for estimating the instantaneous SNR of each T-F unit. By assuming that the a priori SNR and a posteriori SNR are uniformly distributed within a small region, the instantaneous SNR is estimated by minimizing the localized Bayes risk. The binary mask estimator derived by the proposed approach is evaluated in terms of hit and false alarm rates. Compared to the binary mask estimator that uses the decision-directed approach to compute the SNR, the proposed data-driven approach yielded substantial improvements (up to 40%) in classification performance, when assessed in terms of a sensitivity metric which is based on the difference between the hit and false alarm rates.
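
As a rough illustration of the quantities mentioned in this abstract, the sketch below thresholds a local SNR map into a binary mask and scores an estimated mask by the difference between hit and false alarm rates. It is not the authors' estimator: the 0 dB threshold, the function names and the toy SNR values are all assumptions made for the example.

```python
def binary_mask(snr_db, threshold_db=0.0):
    """Return a 0/1 mask with 1 where the local SNR (dB) exceeds the threshold.

    The 0 dB default is an assumed value, not taken from the abstract.
    """
    return [[1 if s > threshold_db else 0 for s in row] for row in snr_db]


def hit_minus_fa(estimated, ideal):
    """Hit rate minus false alarm rate of an estimated mask against an ideal one."""
    hits = false_alarms = ones = zeros = 0
    for est_row, ideal_row in zip(estimated, ideal):
        for e, i in zip(est_row, ideal_row):
            if i == 1:
                ones += 1
                hits += e          # target-dominated unit correctly kept
            else:
                zeros += 1
                false_alarms += e  # noise-dominated unit wrongly kept
    hit_rate = hits / ones if ones else 0.0
    fa_rate = false_alarms / zeros if zeros else 0.0
    return hit_rate - fa_rate


# Toy T-F grids (rows = frequency channels, columns = frames), values in dB.
true_snr = [[5.0, -3.0, 2.0], [-8.0, 1.0, -1.0]]
est_snr = [[4.0, 1.0, 3.0], [-6.0, -2.0, -4.0]]
ideal = binary_mask(true_snr)
estimated = binary_mask(est_snr)
score = hit_minus_fa(estimated, ideal)
```

Here the estimated mask keeps two of the three target-dominated units and one of the three noise-dominated units, so the score is 2/3 - 1/3.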

#4A Semi-supervised Version of Heteroscedastic Linear Discriminant Analysis

Zhou Haolang (CLSP, ECE, Johns Hopkins University)
Karakos Damianos (CLSP, COE, ECE, Johns Hopkins University)
Andreou Andreas (CLSP, ECE, Johns Hopkins University)

Heteroscedastic Linear Discriminant Analysis (HLDA) was introduced as an extension of Linear Discriminant Analysis to the case where the class-conditional distributions have unequal covariances. The HLDA transform is computed such that the likelihood of the training (labeled) data is maximized, under the constraint that the projected distributions are orthogonal to a nuisance space that does not offer any discrimination. In this paper we consider the case of semi-supervised learning, where a large amount of unlabeled data is also available. We derive update equations for the parameters of the projected distributions, which are estimated jointly with the HLDA transform, and we empirically compare it with the case where no unlabeled data are available. Experimental results with synthetic data and real data from a vowel recognition task show that, in most cases, semi-supervised HLDA results in improved performance over HLDA.

#5Self-learning Vector Quantization for Pattern Discovery from Speech

Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Unto Kalervo Laine (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Toomas Altosaar (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)

A novel and computationally straightforward clustering algorithm was developed for vector quantization (VQ) of speech signals for a task of unsupervised pattern discovery (PD) from speech. The algorithm works in purely incremental mode, is computationally extremely feasible, and achieves comparable classification quality with the well-known k-means algorithm in the PD task. In addition to presenting the algorithm, general findings regarding the relationship between the amounts of training material, convergence of the clustering algorithm, and the ultimate quality of VQ codebooks are discussed.

#6Monaural Segregation of Voiced Speech using Discriminative Random Fields

Rohit Prabhavalkar (The Ohio State University)
Zhaozhang Jin (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)

Techniques for separating speech from background noise and other sources of interference have important applications for robust speech recognition and speech enhancement. Many traditional computational auditory scene analysis (CASA) based approaches decompose the input mixture into a time-frequency (T-F) representation, and attempt to identify the T-F units where the target energy dominates that of the interference. This is accomplished using a two-stage process of segmentation and grouping. In this pilot study, we explore the use of Discriminative Random Fields (DRFs) for the task of monaural speech segregation. We find that the use of DRFs allows us to effectively combine multiple auditory features into the system, while simultaneously integrating the two CASA stages into one. Our preliminary results suggest that CASA based approaches may benefit from the DRF framework.

#7Advancements in Whisper-Island Detection within Normally Phonated Audio Streams

Chi Zhang (Research Assistant, PhD Student)
John Hansen (Professor, Chair of E.E. Department)

In this study, several improvements are proposed for whisper-island detection within normally phonated audio streams. Based on our previous study, an improved feature, which is more sensitive to vocal effort change points between whisper and neutral speech, is developed and utilized in vocal effort change point (VECP) detection and vocal effort classification. Evaluation is based on the proposed multi-error score (MES), where the improved feature showed better performance in VECP detection with the lowest MES of 19.08. Furthermore, a more accurate whisper-island detection was obtained using the improved algorithm. Finally, the experimental detection rate of 95.33% reflects better whisper-island detection performance for the improved algorithm versus that of the original baseline algorithm.

#8Joint Segmentation and Classification of Dialog Acts using Conditional Random Fields

Matthias Zimmermann (xbrain.ch)

This paper investigates the use of conditional random fields for joint segmentation and classification of dialog acts (DAs), exploiting both word and prosodic features that are directly available from a speech recognizer. To validate the approach, experiments are conducted with two different sets of dialog act types under both reference and speech-to-text conditions. Although the proposed framework is conceptually simpler than previous attempts at segmentation and classification of DAs, it outperforms all previous systems for a task based on the ICSI (MRDA) meeting corpus.

#9Exploring Complex Vowels as Phrase Break Correlates in a Corpus of English Speech with ProPOSEL, a Prosody and POS English Lexicon

Claire Brierley (University of Bolton)
Eric Atwell (University of Leeds)

Real-world knowledge of syntax is seen as integral to the machine learning task of phrase break prediction but there is a deficiency of a priori knowledge of prosody in both rule-based and data-driven classifiers. Speech recognition has established that pauses affect vowel duration in preceding words. Based on the observation that complex vowels occur at rhythmic junctures in poetry, we run significance tests on a sample of contemporary British English and find a statistically significant correlation between complex vowels in canonical dictionary pronunciations of words in a text, and phrase breaks. The experiment depends on automatic text annotation via ProPOSEL, a prosody and part-of-speech English lexicon. Index Terms: prosody; real-world knowledge for machine learning; phrase break prediction; text-to-speech synthesis.

#10Automatic Topic Detection of Recorded Voice Messages

Caroline Clemens (Deutsche Telekom Laboratories, Berlin, Germany)
Stefan Feldes (T-Systems, Darmstadt, Germany)
Karlheinz Schuhmacher (Deutsche Telekom Laboratories, Berlin, Germany)
Joachim Stegmann (Deutsche Telekom Laboratories, Berlin, Germany)

We present an approach to automatic classification of spontaneously spoken voice messages. During overload periods at call-centers customers are offered a call-back at a later time. A speech dialog asks them to describe their concern on a voice box. The identified topics correspond to the supported service categories, which in turn determine the agent group the customer message is routed to. Our multistage classification process includes speech-to-text, stemming, keyword spotting, and categorization. Classifier training and evaluation have been performed with real-life data. Results show promising performance. The pilot will be launched in a field test.

#11Identification and Automatic Detection of Parasitic Speech Sounds

Jindrich Matousek (Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Czech Republic)
Radek Skarnitzl (Institute of Phonetics, Faculty of Arts & Philosophy, Charles University in Prague, Czech Republic)
Pavel Machac (Institute of Phonetics, Faculty of Arts & Philosophy, Charles University in Prague, Czech Republic)
Jan Trmal (Department of Cybernetics, Faculty of Applied Sciences, University of West Bohemia, Czech Republic)

This paper presents initial experiments with the identification and automatic detection of parasitic sounds in speech signals. The main goal of this study is to identify such sounds in the source recordings for unit-selection-based speech synthesis systems and thus to avoid their unintended usage in synthesised speech. The first part of the paper describes the phonetic analysis and identification of parasitic phenomena in recordings of two Czech speakers. In the second part, experiments with the automatic detection of parasitic sounds using HMM-based and BVM classifiers are presented. The results are encouraging, especially those for glottalization phenomena.

#12Phonetic alignment for speech synthesis in under-resourced languages

Daniel Van Niekerk (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa AND School of Electrical, Electronic and Computer Engineering, North-West University, Potchefstroom, South Africa)
Etienne Barnard (Human Language Technologies Research Group, Meraka Institute, CSIR, Pretoria, South Africa)

The rapid development of concatenative speech synthesis systems in resource scarce languages requires an efficient and accurate solution with regard to automated phonetic alignment. However, in this context corpora are often minimally designed due to a lack of resources and expertise necessary for large scale development. Under these circumstances many techniques toward accurate segmentation are not feasible and it is unclear which approaches should be followed. In this paper we investigate this problem by evaluating alignment approaches and demonstrating how these approaches can be applied to limit manual interaction while achieving acceptable alignment accuracy with minimal ideal resources.

#13Improving Initial Boundary Estimation for HMM-based Automatic Phonetic Segmentation

Udochukwu Kalu Ogbureke (School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)
Julie Carson-Berndsen (School of Computer Science and Informatics, University College Dublin, Belfield, Dublin 4, Ireland)

This paper presents an approach to boundary estimation for automatic segmentation of speech given a phone (sound) sequence. The technique presented represents an extension to existing approaches to Hidden Markov Model based automatic segmentation which modifies the topology of the model to control for duration. An HMM system trained with this modified topology places 77.10%, 86.72% and 91.15% of the boundaries, on the TIMIT speech test corpus annotations, within 10, 15 and 20 ms respectively as compared with manual annotations. This represents an improvement over the baseline result of 70.99%, 83.50% and 89.18% for initial boundary estimation.
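
The "percentage of boundaries within a tolerance" figure quoted in this abstract can be computed as in the following minimal sketch. It assumes the automatic and manual boundaries are already paired one-to-one and given in milliseconds, and that a deviation exactly equal to the tolerance counts as a match; the function name and the toy boundary times are hypothetical.

```python
def within_tolerance(auto_ms, manual_ms, tolerances_ms=(10, 15, 20)):
    """Percentage of automatic boundaries within each tolerance of the manual ones.

    Assumes both lists are the same length and aligned boundary by boundary.
    """
    if len(auto_ms) != len(manual_ms):
        raise ValueError("boundary lists must be paired one-to-one")
    errors = [abs(a - m) for a, m in zip(auto_ms, manual_ms)]
    return {
        tol: 100.0 * sum(e <= tol for e in errors) / len(errors)
        for tol in tolerances_ms
    }


# Toy example: absolute deviations are 4, 12, 18 and 30 ms.
manual = [100, 250, 400, 620]
auto = [104, 262, 418, 650]
result = within_tolerance(auto, manual)
```

With these toy values one, two and three of the four boundaries fall within 10, 15 and 20 ms respectively.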

Tue-Ses1-P4:
Speaker Recognition and Diarisation

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Sadaoki Furui

#1Importance of Nasality Measures for Speaker Recognition Data Selection and Performance Prediction

Howard Lei (International Computer Science Institute)
Eduardo Lopez-Gonzalo (Dep. of Signals, Systems and Radiocomm., Universidad Politecnica Madrid, Spain)

We improve upon measures relating feature vector distributions to speaker recognition (SR) performances for SR performance prediction and arbitrary data selection. In particular, we examine the means and variances of 11 features pertaining to nasality (resulting in 22 measures), computing them on feature vectors of phones to determine which measures give good SR performance prediction of phones. We have found that the combination of nasality measures gives a 0.917 correlation with the Equal Error Rates (EERs) of phones on SRE08, exceeding the correlation of our previous best measure (mutual information) by 12.7%. When implemented in our data-selection scheme (which does not require an SR system to be run), the nasality measures allow us to select data with combined EER better than data selected via running an SR system in certain cases, at a fortieth of the computational cost. The nasality measures require a tenth of the computational cost compared to our previous best measure.

#2Exploration of Vocal Excitation Modulation Features for Speaker Recognition

Ning Wang (Department of Electronic Engineering, The Chinese University of Hong Kong)
P. C. Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)

To derive spectro-temporal vocal source features complementary to the conventional spectral-based vocal tract features in improving the performance and reliability of a speaker recognition system, the excitation related modulation properties are studied. Through multi-band demodulation method, source-related amplitude and phase quantities are parameterized into feature vectors. Evaluation of the proposed features is carried out first through a set of designed experiments on artificially generated inputs, and then by simulations on speech corpus. It is observed via the designed experiments that the proposed features are capable of capturing the vocal differences in terms of F0 variation, pitch epoch shape, and relevant excitation details between epochs. In the simulations, by combination with the standard spectral features, both the amplitude and the phase-related features are shown to evidently reduce the identification error rate and equal error rate in the speaker recognition system.

#3Speaker Identification for Whispered Speech Using Modified Temporal Patterns and MFCCs

Xing Fan (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)
John H.L. Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas 75083, USA)

Whisper is used by talkers intentionally in certain circumstances to protect personal privacy. Due to the absence of periodic excitation in the production of whisper, there are considerable differences between neutral and whispered speech in the spectral structure. Therefore, the performance of speaker ID systems trained with high-energy voiced phonemes degrades significantly when tested with whisper. This study considers a combination of modified temporal patterns (m-TRAPs) and MFCCs to improve the performance of a neutral-trained system for whispered speech. The m-TRAPs are introduced based on an explanation for the whisper/neutral mismatch degradation of MFCC-based systems. A phoneme-by-phoneme score weighting method is used to fuse the score from each subband. Text-independent closed-set speaker ID was conducted, and experiments show that m-TRAPs are especially efficient for whisper with low SNR. When combining the scores from both MFCCs and TRAPs GMMs, an absolute 26.3% improvement in accuracy is obtained compared with a traditional MFCCs baseline system. This result confirms a viable approach to improving speaker ID performance under neutral/whisper mismatch conditions.

#4Speaker Diarization for Meeting Room Audio

Hanwu Sun (Institute for Infocomm Research)
Tin Lay Nwe (Institute for Infocomm Research)
Bin Ma (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

This paper describes a speaker diarization system for the 2007 NIST Rich Transcription (RT07) Meeting Recognition Evaluation for the task of Multiple Distant Microphone (MDM) in meeting room scenarios. The system includes three major modules: data preparation, initial speaker clustering and cluster purification/merging. The data preparation consists of Wiener filtering and beamforming of the raw data, Time Difference of Arrival estimation and speech activity detection. Based on the initial processed data, two-stage histogram quantization has been used to perform the initial speaker clustering. A modified purification strategy via a high-order GMM clustering method is proposed. The BIC criterion is applied for cluster merging. The system achieves a competitive overall DER of 8.31% for the RT07 MDM speaker diarization task.

#5Improving Speaker Segmentation via Speaker Identification and Text Segmentation

Runxin Li (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)
Tanja Schultz (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA; Fakultät für Informatik, Universität Karlsruhe (TH), Germany)
Qin Jin (InterACT, Language Technologies Institute, Carnegie Mellon University, Pittsburgh, USA)

Speaker segmentation is an essential part of a speaker diarization system. Common segmentation systems usually miss speaker change points when speakers switch fast. These errors seriously confuse the following speaker clustering step and result in high overall speaker diarization error rates. In this paper two methods are proposed to deal with this problem: the first approach uses speaker identification techniques to boost speaker segmentation, and the second applies text segmentation methods to improve the performance of speaker segmentation. Experiments on Quaero speaker diarization evaluation data show that our methods achieve up to 45% relative reduction in the speaker diarization error and 64% relative increase in the speaker change detection recall rate over the baseline system. Moreover, both approaches can be considered as post-processing steps over the baseline segmentation; therefore, they can be applied in any speaker diarization system.

#6Overall performance metrics for multi-condition Speaker Recognition Evaluations

David van Leeuwen (TNO Human Factors)

In this paper we propose a framework for measuring the overall performance of an automatic speaker recognition system using a set of trials of a heterogeneous evaluation such as NIST SRE-2008, which combines several acoustic conditions in one evaluation. We do this by weighting trials of different conditions according to their relative proportion, and we derive expressions for the basic speaker recognition performance measures Cdet, Cllr, as well as the DET curve, from which EER and minCdet can be computed. Examples of pooling of conditions are shown on SRE-2008 data, including speaker sex and microphone type and speaking style.
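
One possible reading of condition-weighted pooling is sketched below: miss and false alarm rates are computed per condition and then combined with chosen condition weights, so that a condition with many trials does not dominate the pooled figure. This is an illustration only, not the paper's derivation. The function names, the cost parameters used as defaults, and the toy trials are assumptions, and every condition is assumed to contain at least one target and one non-target trial.

```python
def weighted_error_rates(trials, weights, threshold):
    """Pool miss / false alarm rates over conditions using per-condition weights.

    trials  : iterable of (condition, is_target, score)
    weights : dict mapping condition -> relative weight (assumed to be given)
    """
    counts = {}  # condition -> [n_target, n_miss, n_nontarget, n_false_alarm]
    for cond, is_target, score in trials:
        c = counts.setdefault(cond, [0, 0, 0, 0])
        if is_target:
            c[0] += 1
            c[1] += score < threshold    # target rejected: a miss
        else:
            c[2] += 1
            c[3] += score >= threshold   # non-target accepted: a false alarm
    total = sum(weights[c] for c in counts)
    p_miss = sum(weights[c] * n[1] / n[0] for c, n in counts.items()) / total
    p_fa = sum(weights[c] * n[3] / n[2] for c, n in counts.items()) / total
    return p_miss, p_fa


def detection_cost(p_miss, p_fa, c_miss=10.0, c_fa=1.0, p_target=0.01):
    """Detection cost at a fixed threshold; the default parameters are illustrative."""
    return c_miss * p_target * p_miss + c_fa * (1.0 - p_target) * p_fa


trials = [
    ("tel", True, 2.0), ("tel", True, -1.0),
    ("tel", False, -2.0), ("tel", False, 0.5),
    ("mic", True, 1.0), ("mic", False, -3.0),
]
p_miss, p_fa = weighted_error_rates(trials, {"tel": 0.5, "mic": 0.5}, threshold=0.0)
cost = detection_cost(p_miss, p_fa)
```

With equal weights, the "mic" condition (one target, one non-target trial) counts as much as the larger "tel" condition, which is the point of weighting by condition rather than by raw trial count.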

#7Speaker Identification using Warped MVDR Cepstral Features

Matthias Wölfel (ZKM|Center for Art and Media, Germany)
Qian Yang (Universität Karlsruhe (TH), Germany)
Qin Jin (Carnegie Mellon University, USA)
Tanja Schultz (Universität Karlsruhe (TH), Germany)

It is common practice to use similar or even the same feature extraction methods for automatic speech recognition and speaker recognition. While the front-end for the former is required to preserve phoneme discrimination and to compensate for speaker differences to some extent, the front-end for the latter has to preserve the unique characteristics of individual speakers. It seems, therefore, contradictory to use the same feature extraction methods for both tasks. Starting out from the common practice we propose to use warped minimum variance distortionless response (MVDR) cepstral coefficients, which have already been demonstrated to perform better for automatic speech recognition, in particular under adverse conditions. Replacing the widely used mel-frequency cepstral coefficients by warped MVDR cepstral coefficients improves the speaker identification accuracy by up to 24% relative. We found that the optimal choice of the model order within the warped MVDR framework differs between speech recognition and speaker recognition, confirming our intuition that the two different tasks indeed require different feature extraction strategies.

#8Entropy Based Overlapped Speech Detection as a Pre-Processing Stage for Speaker Diarization

Oshry Ben-Harush (Ben-Gurion University of the Negev)
Itshak Lapidot (Sami Shamoon College of Engineering)
Hugo Guterman (Ben-Gurion University of the Negev)

One inherent deficiency of most diarization systems is their inability to handle co-channel or overlapped speech. Most of the suggested algorithms perform under singular conditions and require high computational complexity in both time and frequency domains. In this study, frame-based entropy analysis of the audio data in the time domain serves as a single feature for an overlapped speech detection algorithm. Identification of overlapped speech segments is performed using Gaussian Mixture Modeling (GMM) along with well-known classification algorithms applied on two-speaker conversations. By employing this methodology, the proposed method eliminates the need for setting a hard threshold for each conversation or database. The LDC CALLHOME American English corpus is used for evaluation of the suggested algorithm. The proposed method successfully detects 63.2% of the frames labeled as overlapped speech by the manual segmentation, while keeping a 5.4% false-alarm rate.
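
A frame-based entropy feature in the time domain could be computed along the lines of the sketch below: each frame's sample amplitudes are histogrammed and the Shannon entropy of the histogram is taken. The abstract does not specify how the entropy is estimated, so the histogram approach, the 16 bins, the [-1, 1] amplitude range and the function names are assumptions. The GMM classification stage is not shown.

```python
import math


def frame_entropy(frame, n_bins=16, lo=-1.0, hi=1.0):
    """Shannon entropy (bits) of the amplitude histogram of one frame.

    The bin count and amplitude range are assumed values for illustration.
    """
    counts = [0] * n_bins
    width = (hi - lo) / n_bins
    for x in frame:
        idx = int((x - lo) / width)
        idx = min(max(idx, 0), n_bins - 1)  # clamp out-of-range samples
        counts[idx] += 1
    n = len(frame)
    return -sum((c / n) * math.log2(c / n) for c in counts if c)


def entropy_track(signal, frame_len, hop):
    """Entropy of each frame of `signal`, using a sliding window."""
    return [
        frame_entropy(signal[start:start + frame_len])
        for start in range(0, len(signal) - frame_len + 1, hop)
    ]


# A constant frame has zero entropy; a frame split evenly over two bins has 1 bit.
signal = [0.5] * 8 + [-0.5] * 4 + [0.5] * 4
track = entropy_track(signal, frame_len=8, hop=8)
```

The resulting per-frame values would then serve as the single input feature to whatever classifier is trained to separate overlapped from single-speaker frames.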

#9Speech Style and Speaker Recognition: a Case Study

Marco Grimaldi (School of Computer Science and Informatics, UCD, Dublin; FBK, via Sommarive 18, I-38100 Povo (Trento))
Fred Cummins (School of Computer Science and Informatics, UCD, Dublin)

This work presents an experimental evaluation of the effect of different speech styles on the task of speaker recognition. We make use of willfully altered voice extracted from the CHAINS corpus and methodically assess the effect of its use in a reference speaker identification and verification system. We contrast normal readings of text with two varieties of imitative styles and with the familiar, non-imitative, variant of fast speech. Furthermore, we test the applicability of a novel speech parameterization that has been suggested as a promising technique in the task of speaker identification: the pyknogram frequency estimate coefficients - pykfec.

#10The Majority Wins: a Method for Combining Speaker Diarization Systems

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present a method for combining multiple diarization systems into one single system by applying a majority voting scheme. The voting scheme selects the best segmentation purely on the basis of the output of each system. On our development set of NIST Rich Transcription evaluation meetings the voting method improves our system on all evaluation conditions. For the single distant microphone condition, the DER performance is improved by 7.8% (relative) compared to the best input system. For the multiple distant microphone condition the improvement is 3.6%.
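
The abstract does not describe the voting rule itself, so the following is only a hypothetical sketch of how a segmentation might be chosen from system outputs alone: each system's frame-level labelling is compared with every other system's using the Rand index (which ignores the arbitrary naming of speakers), and the system that agrees most with the rest is selected. The function names and the use of the Rand index are assumptions.

```python
from itertools import combinations


def rand_index(a, b):
    """Fraction of frame pairs on which two labelings agree (same vs. different speaker)."""
    pairs = list(combinations(range(len(a)), 2))
    agree = sum((a[i] == a[j]) == (b[i] == b[j]) for i, j in pairs)
    return agree / len(pairs)


def select_by_agreement(systems):
    """Index of the system whose output agrees most, on average, with the others."""
    scores = []
    for k, sys_k in enumerate(systems):
        others = [rand_index(sys_k, s) for m, s in enumerate(systems) if m != k]
        scores.append(sum(others) / len(others))
    return max(range(len(systems)), key=scores.__getitem__)


# Frame-level speaker labels from three hypothetical systems.
s0 = [0, 0, 0, 1, 1, 1]
s1 = ["a", "a", "a", "b", "b", "b"]  # same partition as s0, different label names
s2 = [0, 1, 0, 1, 0, 1]
best = select_by_agreement([s0, s1, s2])
```

The pairwise comparison is quadratic in the number of frames, which is acceptable for a toy example but would need a contingency-table formulation for real recordings.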

#11Two-Wire Nuisance Attribute Projection

Yosef Solewicz (Department of Computer Science, Bar-Ilan University, Ramat-Gan, Israel)
Hagai Aronowitz (IBM Haifa Research Labs, Haifa 31905, Israel)

This paper addresses the task of nuisance reduction in two-wire speaker recognition applications. Besides channel mismatch, two-wire conversations are contaminated by extraneous speakers which represent an additional source of noise in the supervector domain. It is shown that two-wire nuisance manifests itself as undesirable directions in the interspeaker subspace. For this purpose, we derive two alternative Nuisance Attribute Projection (NAP) formulations tailored for two-wire sessions. The first formulation generalizes the NAP framework based on a model of two-wire conversations. The second formulation explicitly models the four- vs. two-wire supervector variability. Preliminary experiments show that two-wire NAP significantly outperforms regular NAP in varied two-wire tasks.
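
For orientation, regular NAP removes from each supervector its components along a set of nuisance directions, i.e. it applies the projection P = I - U U^T with orthonormal columns U. The sketch below shows only this generic projection step in plain Python. How the nuisance directions are estimated for two-wire sessions is the subject of the paper and is not shown; the function name and the toy vectors are assumptions.

```python
def nap_project(x, directions):
    """Apply P = I - U U^T to supervector x.

    `directions` is a list of nuisance directions, assumed to be orthonormal.
    """
    y = list(x)
    for u in directions:
        coeff = sum(ui * xi for ui, xi in zip(u, x))    # component of x along u
        y = [yi - coeff * ui for yi, ui in zip(y, u)]   # subtract that component
    return y


# Toy 3-dimensional "supervector" with nuisance directions along the first axes.
x = [3.0, 4.0, 5.0]
one_dir = nap_project(x, [[1.0, 0.0, 0.0]])
two_dirs = nap_project(x, [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]])
```

After projection the vector has no component left along any nuisance direction, while the remaining coordinates are untouched.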

Tue-Ses1-P2:
Speech perception II

Time:Tuesday 10:00 Place:Hewison Hall Type:Poster
Chair:Odette Scharenborg

#1The Effect of R-Resonance Information on Intelligibility

Antje Heinrich (Department of Linguistics, University of Cambridge, UK)
Sarah Hawkins (Department of Linguistics, University of Cambridge, UK)

We investigated the importance of phonetic information in preceding syllables for the intelligibility of minimally paired words containing /r/ or /l/. Target words were cross-spliced either into a different token of the same sentence (match) or into a sentence that was identical but originally uttered with the paired word (mismatch). Young and older adults heard the sentences in various background babbles. Matched phonetic information in syllables earlier in the sentence and in the syllable immediately preceding the target segment facilitated intelligibility for r- but not l-words. Despite hearing loss, older adults used this phonetic information as much as young listeners.

#2Perception of Temporal Cues at the Discourse Boundary

Hsin-Yi Lin (Ph.D Student of Graduate Institute of Linguistics, National Taiwan University)
Janice Fon (Assistant Professor of Linguistics, National Taiwan University)

This study investigates the role of temporal cues in the perception at discourse boundaries. Target cues were penult lengthening, final lengthening, and pause duration. Results showed that different cues are weighted differently for different purposes, where final lengthening is more important for subjects to detect boundaries, while pause duration is more responsible in cuing the sizes of boundaries.

#3Human Audio-Visual Consonant Recognition Analyzed with Three Bimodal Integration Models

Zhanyu Ma (Sound and Image Processing Lab, KTH - Royal Institute of Technology, Sweden)
Arne Leijon (Sound and Image Processing Lab, KTH - Royal Institute of Technology, Sweden)

With A-V recordings, ten normal hearing people took recognition tests at different signal-to-noise ratios (SNR). The A-V recognition results are predicted by the fuzzy logical model of perception (FLMP) and the post-labelling integration model (POSTL). We also applied hidden Markov models (HMMs) and multi-stream HMMs (MSHMMs) for the recognition. As expected, all the models agree qualitatively with the results that the benefit gained from the visual signal is larger at lower acoustic SNRs. However, the FLMP severely overestimates the A-V integration result, while the POSTL model underestimates it. Our automatic speech recognizers integrated the audio and visual stream efficiently. The visual automatic speech recognizer could be adjusted to correspond to human visual performance. The MSHMMs combine the audio and visual streams efficiently, but the audio automatic speech recognizer must be further improved to allow precise quantitative comparisons with human audio-visual performance.
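
As a rough illustration of the kind of prediction being compared, the FLMP is commonly stated as multiplying the per-class support from each modality and renormalizing. The sketch below assumes that standard form; the function name and the example support values are hypothetical and are not taken from the paper.

```python
def flmp_combine(audio_support, visual_support):
    """Combine per-class support from two modalities (assumed FLMP form).

    Each argument is a list of non-negative support values, one per
    response class. The result is the normalized product.
    """
    if len(audio_support) != len(visual_support):
        raise ValueError("both modalities must cover the same classes")
    products = [a * v for a, v in zip(audio_support, visual_support)]
    total = sum(products)
    if total == 0:
        raise ValueError("no class has support in both modalities")
    return [p / total for p in products]


# Hypothetical support values for three consonant classes.
audio_visual = flmp_combine([0.6, 0.3, 0.1], [0.5, 0.25, 0.25])
```

Because the combination is multiplicative, a class that is well supported by both modalities gains probability mass relative to either modality alone.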

#4Effects of tempo in radio commercials on young and elderly listeners

Hanny den Ouden (Utrecht University)
Hugo Quene (Utrecht University)

The aim of the present study is to investigate the effects of tempo manipulations in radio commercials on listeners’ evaluation, cognition and persuasion. Questionnaire scores from 131 young and 130 elderly listeners show effects of tempo manipulation on listeners’ subjective evaluation, but not on their cognitive scores. Tempo effects on persuasion scores are modulated by the listeners’ general disposition towards radio and radio commercials. In sum, it seems that listeners’ general disposition, rather than age, is what matters in evaluating tempo manipulation of radio commercials.

#5Self-voice recognition in 4 to 5-year-old children

Sofia Strömbergsson (Department of Speech, Music and Hearing, KTH, Stockholm, Sweden)

Children’s ability to recognize their own recorded voice as their own was explored in a group of 4 to 5-year-old children. The task for the children was to identify which one of four voice samples represented their own voice. The results reveal that children perform well above chance level, and that a time span of 1-2 weeks between the recording and the identification does not affect the children’s performance. F0 similarity between the participant’s recordings and the reference recordings correlated with a higher error-rate. Implications for the use of recordings in speech and language therapy are discussed.

#6Are real tongue movements easier to speech read than synthesized?

Olov Engwall (Centre for Speech Technology, CSC, KTH, Stockholm, Sweden)
Preben Wik (Centre for Speech Technology, CSC, KTH, Stockholm, Sweden)

Speech perception studies with augmented reality displays in talking heads have shown that tongue reading abilities are weak initially, but that subjects become able to extract some information from intra-oral visualizations after a short training session. In this study, we investigate how the nature of the tongue movements influences the results, by comparing synthetic rule-based and actual, measured movements. The subjects were significantly better at perceiving sentences accompanied by real movements, indicating that the current coarticulation model developed for facial movements is not optimal for the tongue.

#7Eliciting a hierarchical structure of human consonant perception task errors using Formal Concept Analysis

Carmen Peláez-Moreno (University Carlos III Madrid, Spain)
Ana Isabel García-Moral (University Carlos III Madrid, Spain)
Francisco José Valverde-Albacete (University Carlos III Madrid, Spain)

In this paper we use Formal Concept Analysis to elicit a hierarchical structure of human consonant perception task errors. We use the Native Listeners experiments provided for the Consonant Challenge session of Interspeech 2008 to analyse perception errors committed in relation to the place of articulation of the consonants being evaluated, for one quiet and six noisy acoustic conditions.

#8Acoustic and Perceptual Effects of Vocal training in Amateur Male Singing

Takeshi Saitou (National Institute of Advanced Industrial Science and Technology (AIST))
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

This paper reports our investigation of the acoustical effects of vocal training for amateur singers and of the contribution of those effects to perceived vocal quality. Recording singing voices before and after vocal training and then analyzing changes in acoustic parameters with a focus on features unique to singing voices, we found that two different F0 fluctuations (vibrato and overshoot) and singing formant were improved by the training. The results of psychoacoustic experiments showed that perceived voice quality was influenced more by the changes of F0 characteristics than by the changes of spectral characteristics and that acoustic features unique to singing voices contribute to perceived voice quality in the following order: vibrato, singing formant, overshoot, and preparation.

Tue-Ses2-O1:
Automotive and Mobile applications

Time:Tuesday 13:30 Place:Main Hall Type:Oral
Chair:Kate Knill

13:30Fast Speech Recognition for Voice Destination Entry in a Car Navigation System

Hoon Chung (ETRI)
Jeon Gue Park (ETRI)
Hyeon Bae Jeon (ETRI)
Yun Keun Lee (ETRI)

In this paper, we introduce a multi-stage decoding algorithm optimized to recognize a very large number of entry names on a resource-limited embedded device. The multi-stage decoding algorithm is composed of a two-stage HMM-based coarse search and a detailed search. The two-stage HMM-based coarse search generates a small set of candidates that are assumed to contain a correct hypothesis with high probability, and the detailed search re-ranks the candidates by rescoring them with sophisticated acoustic models. We conduct experiments with one million point-of-interest (POI) names on an in-car navigation device with a fixed-point processor running at 620 MHz. The experimental results show that the multi-stage decoding algorithm runs at about 2.23 times real-time on the device without serious degradation of recognition performance.

13:50Improving Perceived Accuracy for In-Car Media Search

Yun-Cheng Ju (Microsoft Research)
Michael Seltzer (Microsoft Research)
Ivan Tashev (Microsoft Research)

Speech recognition technology is prone to mistakes, but this is not the only source of errors that cause speech recognition systems to fail; sometimes the user simply does not utter the command correctly. Usually, user mistakes are not considered when a system is designed and evaluated. This creates a gap between the claimed accuracy of the system and the actual accuracy perceived by the users. We address this issue quantitatively in our in-car infotainment media search task and propose expanding the capability of voice command to accommodate user mistakes while retaining a high percentage of the performance for queries with correct syntax. As a result, failures caused by user mistakes were reduced by an absolute 70% at the cost of a drop in accuracy of only 0.28%.

14:10Laying the Foundation for In-car Alcohol Detection by Speech

Florian Schiel (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München)
Christian Heinrich (Bavarian Archive for Speech Signals, Ludwig-Maximilians-Universität München)

The fact that an increasing number of functions in the automobile are and will be controlled by the driver's speech raises the question of whether this speech input may be used to detect possible alcohol intoxication of the driver. To that end, a large part of the new Alcohol Language Corpus (ALC) edited by the Bavarian Archive of Speech Signals (BAS) will be used for a broad statistical investigation of possible feature candidates for classification. In this contribution we present the motivation and the design of the ALC corpus as well as first results from fundamental frequency and rhythm analysis. Our analysis, which compares sober and intoxicated speech of the same individuals, suggests that there are in fact promising features that can automatically be derived from the speech signal during the speech recognition process and will indicate intoxication for most speakers.

14:30A Voice Search Approach to Replying to SMS Messages in Automobiles

Yun-Cheng Ju (Microsoft Research)
Tim Paek (Microsoft Research)

Automotive infotainment systems now provide drivers the ability to hear incoming Short Message Service (SMS) text messages using text-to-speech. However, the question of how best to allow users to respond to these messages using speech recognition remains unsettled. In this paper, we propose a robust voice search approach to replying to SMS messages based on template matching. The templates are empirically derived from a large SMS corpus and matches are accurately retrieved using a vector space model. In evaluating SMS replies within the acoustically challenging environment of automobiles, the voice search approach consistently outperformed using just the recognition results of a statistical language model or a probabilistic context-free grammar. For SMS replies covered by our templates, the approach achieved as high as 89.7% task completion when evaluating the top five reply candidates.
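
To make the retrieval step concrete, the sketch below ranks reply templates against a recognized utterance using a plain term-frequency vector space model with cosine similarity. This is an assumption for illustration: the abstract does not specify the term weighting, and the function names and example templates are hypothetical.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine similarity between two texts using raw term counts."""
    counts_a = Counter(text_a.lower().split())
    counts_b = Counter(text_b.lower().split())
    dot = sum(counts_a[term] * counts_b[term] for term in counts_a)
    norm_a = math.sqrt(sum(c * c for c in counts_a.values()))
    norm_b = math.sqrt(sum(c * c for c in counts_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_templates(recognized_text, templates, top_n=5):
    """Return the top_n (template, score) pairs, best match first."""
    scored = [(t, cosine_similarity(recognized_text, t)) for t in templates]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]


# Hypothetical reply templates and a noisy recognition result.
templates = ["i am running late", "see you soon", "call you later"]
candidates = rank_templates("running a bit late", templates, top_n=2)
```

Returning several candidates rather than a single best match mirrors the top-five evaluation mentioned in the abstract, where the driver would pick from a short list.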

14:50Language Modeling for What-with-Where on GOOG-411

Charl van Heerden (Meraka Institute)
Johan Schalkwyk (Google Inc.)
Brian Strope (Google Inc.)

This paper describes the language modeling architectures and recognition experiments that enabled support of 'what-with-where' queries on GOOG-411. First we compare accuracy trade-offs between a single national business LM for business queries and using many small models adapted for particular cities. Experimental evaluations show that both approaches lead to comparable overall accuracy. Differences in the distributions of errors also lead to improvements from a simple combination. We then optimize variants of the national business LM in the context of combined business and location queries from the web, and finally evaluate these models on a recognition test from the recently fielded 'what-with-where' system.

15:10Very Large Vocabulary Voice Dictation for Mobile Devices

Jan Nouza (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)
Petr Cerva (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)
Jindrich Zdansky (SpeechLab, Institute of Information Technology and Electronics, Technical University of Liberec, 461 17 Liberec, Czech Republic)

This paper deals with optimization techniques that can make very large vocabulary voice dictation applications deployable on recent mobile devices. We focus in particular on optimization of signal parameterization (frame rate, FFT calculation, fixed-point representation) and on efficient pruning techniques employed at the state and Gaussian mixture level. We demonstrate the applicability of the proposed techniques in the practical design of an embedded 255K-word discrete dictation program developed for Czech. Its real performance is comparable to a client-server version of the fluent dictation program implemented on the same mobile device.

Tue-Ses2-O2:
Prosody: production I

Time:Tuesday 13:30 Place:East Wing 1 Type:Oral
Chair:Fred Cummins

13:30Did you say a BLUE banana? The prosody of contrast and abnormality in Bulgarian and Dutch

Diana V. Dimitrova (University of Groningen)
Gisela Redeker (University of Groningen)
John C.J. Hoeks (University of Groningen)

In a production experiment on Bulgarian that was based on a previous study on Dutch [1], we investigated the role of prosody when linguistic and extra-linguistic information coincide or contradict. Speakers described abnormally colored fruits in conditions where contrastive focus and discourse relations were varied. We found that the coincidence of contrast and abnormality enhances accentuation in Bulgarian as it did in Dutch. Surprisingly, when both factors are in conflict, the prosodic prominence of abnormality often overruled focus accentuation in both Bulgarian and Dutch, though the languages also show marked differences.

13:50A Quantitative Study of F0 Peak Alignment and Sentence Modality

Hansjörg Mixdorff (BHT University of Applied Sciences, Berlin, Germany)
Hartmut Pfitzinger (University of Kiel, Germany)

The current study examines the relationship between prosodic accent labels assigned in the Kiel Corpus of Spontaneous Speech IV, Isačenko’s intoneme classes of the underlying accents and the associated parameters of the Fujisaki model. Among other findings, there is a close connection between early peaks and information intonemes, as well as late peaks and non-terminal intonemes. The majority of tokens within both intoneme classes, however, are associated with medial peaks. Precise analysis of alignment shows that accent command offset times for information intonemes are significantly earlier than for non-terminal intonemes. This suggests that the anchoring of the relevant tonal transition could be more important for separating different intonational categories than that of the F0 peak.

14:10Closely related languages, different ways of realizing focus

Szu-wei Chen (Graduate Institute of Linguistics, National Chung Cheng University, Taiwan)
Bei Wang (Institute of Chinese Minority Languages, Minzu University of China, China)
Yi Xu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)

We investigated how focus was prosodically realized in Taiwanese, Taiwan Mandarin and Beijing Mandarin by monolingual and bilingual speakers. Acoustic analyses showed that all speakers raised pitch and intensity of focused words, but only Beijing Mandarin speakers lowered pitch and intensity of post-focus words. Cross-group differences in duration were mixed. When listening to stimuli from their own language groups, subjects from Beijing had over 80% focus recognition rate, while those from Taiwan had less than 70% recognition rate. This difference is mainly due to presence/absence of post-focus compression. These findings have implications for prosodic typology, language contact and bilingualism.

14:30Cross-variety Rhythm Typology in Portuguese

Plínio Barbosa (Speech Prosody Studies Group/Dep. of Linguistics/Inst.Est. Ling., Univ. of Campinas, Brazil)
Maria do Céu Viana (Center of Linguistics of the University of Lisbon, Portugal)
Isabel Trancoso (INESC-ID, Lisbon, Portugal)

This paper proposes a measure of speech rhythm based on inferring the coupling strength between the syllable oscillator and the stress group oscillator of an underlying coupled-oscillators model. This coupling is inferred from the linear regression between the stress group duration and the number of syllables within the group, as well as from the multiple linear regression between the same parameters and an estimate of phrase stress prominence. The technique is applied to compare the rhythmic differences between European and Brazilian Portuguese in two speaking styles and three speakers per variety. Compared with a syllable-sized normalised PVI, the findings suggest that the coupling strength better captures the perceptual effects of the speakers' renditions. Furthermore, stress group duration is much better predicted when phrase stress prominence is added to the regression.
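
The two measurements being compared can be sketched as follows: an ordinary least-squares fit of stress group duration against syllable count, and the normalised PVI in its usual form (mean absolute difference of successive durations divided by their mean, times 100). How the regression coefficients map onto coupling strength is part of the authors' model and is not reproduced here; the function names and the sample durations are hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ intercept + slope * xs."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx  # requires at least two distinct x values
    intercept = mean_y - slope * mean_x
    return slope, intercept


def normalised_pvi(durations):
    """Normalised Pairwise Variability Index over successive durations."""
    pairs = list(zip(durations[:-1], durations[1:]))
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * sum(terms) / len(terms)


# Hypothetical stress groups: syllable counts and durations in ms.
syllables_per_group = [1, 2, 3, 4]
group_durations_ms = [150.0, 250.0, 350.0, 450.0]
slope, intercept = fit_line(syllables_per_group, group_durations_ms)
```

In this toy data each added syllable lengthens the group by a constant amount, so the fit is exact; real speech would show scatter around the line.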

14:50Pitch adaptation in different age groups: boundary tones versus global pitch

Marie Nilsenova (Tilburg University)
Marc Swerts (Tilburg University)
Veronique Houtepen (Tilburg University)
Heleen Dittrich (Tilburg University)

Linguistic adaptation is a process by which interlocutors adjust their production to their environment. In the context of human-computer interaction, past research showed that adult speakers adapt to computer speech in various manners but less is known about younger age groups. We report the results of three priming experiments in which children in different age groups interacted with a prerecorded computer voice. The goal of the experiments was to determine to what extent children copy the pitch properties of the interlocutor. Based on the dialogue model of Pickering & Garrod, we predicted that children would be more likely to adapt to pitch primes that were meaningful in the context (high or low boundary tone) compared to primes with no apparent functionality (global pitch manipulation). This prediction was confirmed by our data. Moreover, we observed a decreasing trend in adaptation in the older age groups compared to the younger ones.

15:10Backchannel-Inviting Cues in Task-Oriented Dialogue

Agustín Gravano (Department of Computer Science, Columbia University, New York, NY, USA)
Julia Hirschberg (Department of Computer Science, Columbia University, New York, NY, USA)

We examine backchannel-inviting cues --- distinct prosodic, acoustic and lexical events in the speaker's speech that tend to precede a short response produced by the interlocutor to convey continued attention --- in the Columbia Games Corpus, a large corpus of task-oriented dialogues. We show that the likelihood of occurrence of a backchannel increases quadratically with the number of cues conjointly displayed by the speaker. Our results are important for improving the coordination of conversational turns in interactive voice-response systems, so that systems can produce backchannels in appropriate places, and so that they can elicit backchannels from users in expected places.

Tue-Ses2-O3:
ASR: Spoken Language Understanding

Time:Tuesday 13:30 Place:East Wing 2 Type:Oral
Chair:Lin-shan Lee

13:30What's in an Ontology for Spoken Language Understanding

Silvia Quarteroni (University of Trento)
Giuseppe Riccardi (University of Trento)
Marco Dinarelli (University of Trento)

Current Spoken Language Understanding systems rely either on hand-written semantic grammars or on flat attribute-value sequence labeling. In both approaches, concepts and their relations (when modeled at all) are domain-specific, thus making it difficult to expand or port the domain model. To address this issue, we introduce: 1) a domain model based on an ontology where concepts are classified into either predicative or argumentative; 2) the modeling of relations between such concept classes in terms of classical relations as defined in lexical semantics. We study and analyze our approach on a corpus of customer care data, where we evaluate the coverage and relevance of the ontology for the interpretation of speech utterances (clean and noisy).

13:50A Fundamental Study of Shouted Speech for Acoustic-Based Security System

Hiroaki NANJO (Faculty of Science and Technology, Ryukoku University, Japan)
Hiroki MIKAMI (Faculty of Science and Technology, Ryukoku University, Japan)
Hiroshi KAWANO (Graduate School of Science and Engineering, Ritsumeikan University, Japan)
Takanobu NISHIURA (Graduate School of Science and Engineering, Ritsumeikan University, Japan)

This paper addresses a speech processing system for ensuring safety and security, namely an acoustic-based security system. Focusing on indoor security such as school security, we study an advanced acoustic-based system that can discriminate emergency shouts from other speech events based on an understanding of speech events. In this paper, we describe fundamental results on shouted speech.

14:10Evaluating the Potential Utility of ASR N-Best Lists for Incremental Spoken Dialogue Systems

Timo Baumann (University of Potsdam)
Okko Buß (University of Potsdam)
Michaela Atterer (University of Potsdam)
David Schlangen (University of Potsdam)

The potential of using ASR n-best lists for dialogue systems has often been recognised (if less often realised): it is often the case that even when the top-ranked hypothesis is erroneous, a better one can be found at a lower rank. In this paper, we describe metrics for evaluating whether the same potential carries over to incremental dialogue systems, where ASR output is consumed and reacted upon while speech is still ongoing. We show that even small N can provide an advantage for semantic processing, at the cost of some computational overhead.

14:30Improving the Recognition of Names by Document-Level Clustering

Bin Zhang (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)
Wei Wu (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)
Jeremy G. Kahn (Department of Linguistics, University of Washington, Seattle, WA 98195, USA)
Mari Ostendorf (Department of Electrical Engineering, University of Washington, Seattle, WA 98195, USA)

Named entities are of great importance in spoken document processing, but speech recognizers often get them wrong because they are infrequent. A name correction method based on document-level name clustering is proposed in this paper, consisting of three components: named entity detection, name clustering, and name hypothesis selection. We compare the performance of this method to oracle conditions and show that the oracle gain is a 23% reduction in name character error for Mandarin and the automatic approach achieves about 20% of that.

14:50Robust dependency parsing for Spoken Language Understanding of spontaneous speech

FREDERIC BECHET (Universite d'Avignon)
ALEXIS NASR (LIF - CNRS / Universite Aix-Marseille)

We describe in this paper a syntactic parser for spontaneous speech geared towards the identification of verbal subcategorization frames. The parser proceeds in two stages. The first stage is based on generic syntactic resources for French. The second stage is a reranker which is specially trained for a given application. The parser is evaluated on the French MEDIA spoken dialogue corpus.

15:10Semantic Role Labeling with Discriminative Feature Selection for Spoken Language Understanding

Chao-Hong Liu (National Cheng Kung University, Tainan, TAIWAN)
Chung-Hsien Wu (National Cheng Kung University, Tainan, TAIWAN)

In the task of Spoken Language Understanding (SLU), Intent Classification techniques have been applied to different domains of Spoken Dialog Systems (SDS). Recently it was shown that intent classification performance can be improved with Semantic Role (SR) information. However, using SR information for SDS encounters two difficulties: 1) state-of-the-art Automatic Speech Recognition (ASR) systems provide less than 80% recognition rate, and 2) speech always exhibits ungrammatical expressions. This study presents an approach to Semantic Role Labeling (SRL) with discriminative feature selection to improve the performance of SDS. Bernoulli event features on word and part-of-speech sequences are introduced for better representation of the ASR recognized text. SRL and SLU experiments conducted using the CoNLL-2005 SRL corpus and the ATIS spoken corpus show that the proposed feature selection method with Bernoulli event features can improve intent classification by 3.4% and also improve the performance of SRL.

Tue-Ses2-O4:
Speaker Diarisation

Time:Tuesday 13:30 Place:East Wing 3 Type:Oral
Chair:Douglas Reynolds

13:30A STUDY OF NEW APPROACHES TO SPEAKER DIARIZATION

Douglas Reynolds (MIT Lincoln Laboratory)
Patrick Kenny (CRIM)
Fabio Castaldo (Politecnico di Torino)

This paper reports on work carried out at the 2008 JHU Summer Workshop examining new approaches to speaker diarization. Four different systems were developed and experiments were conducted using summed-channel telephone data from the 2008 NIST SRE. The systems are a baseline agglomerative clustering system, a new Variational Bayes system using eigenvoice speaker models, a streaming system using a mix of low dimensional speaker factors and classic segmentation and clustering, and a new hybrid system combining the baseline system with a new cosine-distance speaker factor clustering. Results are presented using the Diarization Error Rate as well as by the EER when using diarization outputs for a speaker detection task. The best configurations of the diarization system produced DERs of 3.5-4.6% and we demonstrate a weak correlation of EER and DER.

13:50REDEFINING THE BAYESIAN INFORMATION CRITERION FOR SPEAKER DIARISATION

Themos Stafylakis (Institute for Language and Speech Processing, National Technical University of Athens)
Vassilis Katsouros (Institute for Language and Speech Processing)
George Carayannis (Institute for Language and Speech Processing, National Technical University of Athens)

A novel approach to the Bayesian Information Criterion (BIC) is introduced. The new criterion redefines the penalty terms of the BIC, such that each parameter is penalized with the effective sample size it is trained with. Contrary to Local-BIC, the proposed criterion scores overall clustering hypotheses and therefore is not restricted to hierarchical clustering algorithms. Contrary to Global-BIC, it provides a local dissimilarity measure that depends only on the statistics of the examined clusters and not on the overall sample size. We tested our criterion with two benchmark tests and found significant improvement in performance in the speaker diarisation task.
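
For background, the conventional local BIC test that this work departs from can be sketched as below. This is not the proposed criterion: it is the familiar Delta-BIC comparison of one Gaussian against two, reduced to one-dimensional features so that it needs only the standard library. The function names, the sample data, and the `penalty_weight` parameter are illustrative assumptions.

```python
import math


def ml_variance(samples):
    """Maximum-likelihood variance (divides by N, not N - 1)."""
    mean = sum(samples) / len(samples)
    return sum((x - mean) ** 2 for x in samples) / len(samples)


def delta_bic(cluster_a, cluster_b, penalty_weight=1.0):
    """Conventional local Delta-BIC for two 1-D clusters.

    Positive values favour keeping the clusters separate; negative
    values favour merging. Each cluster must have non-zero variance.
    """
    n_a, n_b = len(cluster_a), len(cluster_b)
    n = n_a + n_b
    merged = list(cluster_a) + list(cluster_b)
    likelihood_gain = 0.5 * (
        n * math.log(ml_variance(merged))
        - n_a * math.log(ml_variance(cluster_a))
        - n_b * math.log(ml_variance(cluster_b))
    )
    extra_params = 2  # one additional mean and variance in the 1-D case
    penalty = 0.5 * penalty_weight * extra_params * math.log(n)
    return likelihood_gain - penalty


# Hypothetical 1-D features from two well-separated speakers.
speaker_a = [0.0, 0.1, -0.1, 0.05, -0.05] * 4
speaker_b = [x + 5.0 for x in speaker_a]
score = delta_bic(speaker_a, speaker_b)
```

Note that the penalty here depends on the combined size of the two clusters being compared; the abstract's point is that the choice of sample size in this term is what distinguishes the local, global, and proposed variants.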

14:10Speaker Diarization Using Divide-and-Conquer

Shih-Sian Cheng (Institute of Information Science, Academia Sinica, Taipei, Taiwan)
Chun-Han Tseng (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University, Kaohsiung, Taiwan)
Hsin-Min Wang (Institute of Information Science, Academia Sinica, Taipei, Taiwan)

Speaker diarization systems consist of two core components: speaker segmentation and speaker clustering. The current state-of-the-art speaker diarization systems usually apply hierarchical agglomerative clustering (HAC) for speaker clustering after segmentation. However, HAC's quadratic computational complexity with respect to the number of data samples inevitably limits its application in large-scale data sets. In this paper, we propose a divide-and-conquer (DAC) framework for speaker diarization. It recursively partitions the input speech stream into two sub-streams, performs diarization on them separately, and then combines the diarization results obtained from them using HAC. The experiment results show that the proposed framework is faster than the conventional segmentation and clustering-based approach while achieving comparable diarization accuracy. Moreover, the proposed framework obtains a higher speedup over the conventional approach on a larger test data set.

14:30KL Realignment for Speaker Diarization with Multiple Feature Streams

Deepu Vijayasenan (Idiap Research Institute, 1920 Martigny, CH)
Fabio Valente (Idiap Research Institute, 1920 Martigny, CH)
Herve Bourlard (Idiap Research Institute, 1920 Martigny, CH)

This paper investigates the use of Kullback-Leibler (KL) divergence based realignment with application to speaker diarization. KL divergence based realignment operates directly on the speaker posterior distribution estimates and is compared with traditional realignment performed using an HMM/GMM system. We hypothesize that using posterior estimates to re-align speaker boundaries is more robust than Gaussian mixture models in the case of multiple feature streams with different statistical properties. Experiments are run on the NIST RT06 data. They reveal that with conventional MFCC features the two approaches have the same performance, while the KL based system outperforms the HMM/GMM re-alignment when multiple feature streams (MFCC and TDOA) are combined. Furthermore, we discuss the possible extension to other feature sets.
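
A minimal sketch of the underlying idea, under stated assumptions: each frame is represented by a discrete posterior distribution, each speaker by a prototype distribution, and a frame is reassigned to the speaker whose prototype is closest in KL divergence. The abstract does not give the exact procedure (for example, the direction of the divergence or any temporal smoothing), so those choices, along with all names and values below, are hypothetical.

```python
import math


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions of equal length."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def realign(frame_posteriors, speaker_prototypes):
    """Assign each frame to the index of the nearest speaker prototype."""
    assignments = []
    for frame in frame_posteriors:
        best = min(
            range(len(speaker_prototypes)),
            key=lambda s: kl_divergence(frame, speaker_prototypes[s]),
        )
        assignments.append(best)
    return assignments


# Hypothetical prototypes for two speakers and two frames to realign.
prototypes = [[0.8, 0.1, 0.1], [0.1, 0.1, 0.8]]
frames = [[0.7, 0.2, 0.1], [0.2, 0.1, 0.7]]
labels = realign(frames, prototypes)
```

Because the comparison happens in posterior space, posteriors derived from different feature streams could in principle be combined before this step without modelling each stream's raw statistics.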

14:50Speech Overlap Detection in a Two-Pass Speaker Diarization System

Marijn Huijbregts (University of Twente)
David van Leeuwen (TNO Human Factors)
Franciska de Jong (University of Twente)

In this paper we present the two-pass speaker diarization system that we developed for the NIST RT09s evaluation. In the first pass of our system a model for speech overlap detection is generated automatically. This model is used in two ways to reduce the diarization errors due to overlapping speech. First, it is used in a second diarization pass to remove overlapping speech from the data while training the speaker models. Second, it is used to find speech overlap for the final segmentation so that overlapping speech segments can be generated. The experiments show that our overlap detection method improves the performance of all three of our system configurations.

15:10Improved Speaker Diarization of Meeting Speech with Recurrent Selection of Representative Speech Segments and Participant Interaction Pattern Modeling

Kyu Han (University of Southern California)
Shrikanth Narayanan (University of Southern California)

In this work we describe two distinct novel improvements to our speaker diarization system, previously proposed for analysis of meeting speech. The first approach focuses on recurrent selection of representative speech segments for speaker clustering while the other is based on participant interaction pattern modeling. The former selects speech segments with high relevance to speaker clustering, especially from a robust cluster modeling perspective, and keeps updating them throughout clustering procedures. The latter statistically models conversation patterns between meeting participants and applies it as a priori information when refining diarization results. Experimental results reveal that the two proposed approaches provide performance enhancement by 29.82% (relative) in terms of diarization error rate in tests on 13 meeting excerpts from various meeting speech corpora.

Tue-Ses2-P4:
Robust Automatic Speech Recognition I

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster

#1Optimization of Dereverberation Parameters based on Likelihood of Speech Recognizer

Randy Gomez (Kyoto University)
Tatsuya Kawahara (Kyoto University)

Speech recognition under reverberant conditions is a difficult task. Most dereverberation techniques used to address this problem enhance the reverberant waveform independently of the speech recognizer. In this paper, we improve the conventional Spectral Subtraction-based (SS) dereverberation technique. In our proposed approach, the dereverberation parameters are optimized to improve the likelihood of the acoustic model. The system is capable of adaptively fine-tuning these parameters jointly with acoustic model training. Additional optimization is also implemented during decoding of the test utterances. We evaluated the approach using real reverberant data, and experimental results show that the proposed method significantly improves recognition performance over the conventional approach.
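
The generic spectral subtraction step that such parameters control can be sketched as below: an interference power estimate is scaled and subtracted from the observed power in each frequency bin, with a floor to avoid negative values. The parameter names `alpha` and `floor` are hypothetical stand-ins for the dereverberation parameters; how the paper estimates the late-reverberation power and tunes the parameters against acoustic model likelihood is not shown.

```python
def spectral_subtract(observed_power, interference_power, alpha=1.0, floor=0.01):
    """Subtract scaled interference power per frequency bin.

    alpha scales the interference estimate; floor keeps a fraction of
    the observed power so that no bin becomes negative.
    """
    if len(observed_power) != len(interference_power):
        raise ValueError("spectra must have the same number of bins")
    enhanced = []
    for observed, interference in zip(observed_power, interference_power):
        enhanced.append(max(observed - alpha * interference, floor * observed))
    return enhanced


# Hypothetical two-bin power spectra.
cleaned = spectral_subtract([4.0, 1.0], [1.0, 2.0], alpha=1.0, floor=0.1)
```

In a likelihood-driven setup, values such as `alpha` would be chosen by scoring the enhanced features with the recognizer rather than by a signal-level criterion.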

#2Application of noise robust MDT speech recognition on the SPEECON and SpeechDat-Car databases

Jort Florent Gemmeke (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands)
Yujun Wang (ESAT Department, Katholieke Universiteit Leuven, Belgium)
Maarten Van Segbroeck (ESAT Department, Katholieke Universiteit Leuven, Belgium)
Bert Cranen (Dept. of Linguistics, Radboud University, Nijmegen, The Netherlands)
Hugo Van hamme (ESAT Department, Katholieke Universiteit Leuven, Belgium)

We show that the recognition accuracy of an MDT recognizer that performs well on artificially noisified data deteriorates rapidly under realistic noisy conditions (using multiple microphone recordings from the SPEECON/SpeechDat-Car databases) and that it is outperformed by a commercially available recognizer trained using a multi-condition paradigm. Analysis of the recognition results indicates that the recording channels with the lowest SNRs, where the MDT recognizer fails most, are also the channels that suffer most from room reverberation. Despite the channel compensation measures we took, it appears difficult to maintain the restorative power of MDT in such non-additive noise conditions.

#3Model based feature enhancement for automatic speech recognition in reverberant environments

Alexander Krueger (University of Paderborn)
Reinhold Haeb-Umbach (University of Paderborn)

In this paper we present a new feature space dereverberation technique for automatic speech recognition. We derive an expression for the dependence of the reverberant speech features in the log-mel spectral domain on the non-reverberant speech features and the room impulse response. The obtained observation model is used for a model based speech enhancement based on Kalman filtering. The performance of the proposed enhancement technique is studied on the AURORA5 database. In our currently best configuration, which includes uncertainty decoding, the number of recognition errors is approximately halved compared to the recognition of unprocessed speech.

#4A study of mutual front-end processing method based on statistical model for noise robust speech recognition

Masakiyo Fujimoto (NTT Communication Science Laboratories, NTT Corporation)
Kentaro Ishizuka (NTT Communication Science Laboratories, NTT Corporation)
Tomohiro Nakatani (NTT Communication Science Laboratories, NTT Corporation)

This paper addresses robust front-end processing for automatic speech recognition (ASR) in noise. Accurate recognition of corrupted speech requires noise robust front-end processing, e.g., voice activity detection (VAD) and noise suppression (NS). Typically, VAD and NS are combined as one-way processing, and are developed independently. However, VAD and NS should not be assumed to be independent techniques, because sharing each other's information is important for the improvement of front-end processing. Thus, we investigate mutual front-end processing that integrates VAD and NS so that they can beneficially share each other's information. In an evaluation on a concatenated speech corpus, the CENSREC-1-C database, the proposed method improves the performance of both VAD and ASR compared with the conventional method.

#5Integrating Codebook and Utterance Information in Cepstral Statistics Normalization Techniques for Robust Speech Recognition

Guan-min He (National Chi Nan University)
Jeih-weih Hung (National Chi Nan University)

Cepstral statistics normalization techniques have been shown to be very successful at improving the noise robustness of speech features. This paper proposes a hybrid-based scheme to achieve a more accurate estimate of the statistical information of features in these techniques. By properly integrating codebook and utterance knowledge, the resulting hybrid-based approach significantly outperforms conventional utterance-based, segment-based and codebook-based approaches in noisy environments. Furthermore, the high-performance CS-HEQ can be implemented with a short delay and can thus be applied in real-time online systems.

#6Reduced Complexity Equalization of Lombard Effect for Speech Recognition in Noisy Adverse Environments

Hynek Boril (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A)
John H.L. Hansen (Center for Robust Speech Systems, Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, U.S.A)

Speech signal corruption by background noise, microphone channel variations, and speech production adjustments introduced by speakers in an effort to communicate efficiently over noise (Lombard effect) severely impact automatic speech recognition (ASR) performance. Recently, a set of unsupervised techniques reducing ASR sensitivity to these sources of distortion has been presented. In this study, a scheme utilizing a set of speech-in-noise Gaussian mixture models and a neutral/LE classifier is shown to substantially decrease the computational load of the compensations (from 14 to 2–4 ASR decoding passes) while preserving the performance. In addition, an extended codebook capturing multiple environmental noises is introduced and shown to improve ASR in changing environments. The evaluation is conducted on samples from the Czech Lombard Speech Database (CLSD'05) presented in different levels of background car noise and Aurora 2 noises.

#7Unsupervised Training Scheme with Non-Stereo Data for Empirical Feature Vector Compensation

Luis Buera (I3A, University of Zaragoza)
Antonio Miguel (I3A, University of Zaragoza)
Alfonso Ortega (I3A, University of Zaragoza)
Eduardo Lleida (I3A, University of Zaragoza)
Richard Stern (Carnegie Mellon University)

In this paper, a novel training scheme based on unsupervised and non-stereo data is presented for Multi-Environment Model-based LInear Normalization (MEMLIN) and MEMLIN with cross-probability model based on GMMs (MEMLIN-CPM). Both are data-driven feature vector normalization techniques which have proved very effective in dynamic noisy acoustic environments. However, techniques of this kind usually require stereo data in a previous training phase, which could be an important limitation in real situations. To compensate for this drawback, we present an approach based on the ML criterion and Vector Taylor Series (VTS). Experiments have been carried out with Spanish SpeechDat Car, reaching consistent improvements: 48.7% and 61.9% when the novel training process is applied over MEMLIN and MEMLIN-CPM, respectively.

#8Incremental Adaptation with VTS and Joint Adaptively Trained Systems

Federico Flego (Cambridge University)
Mark Gales (Cambridge University)

Recently, adaptive training schemes using model based compensation approaches such as VTS and JUD have been proposed. Adaptive training allows multi-environment training data to be used whilst training a neutral, "clean" acoustic model. This paper describes and assesses the advantages of using incremental, rather than batch, mode adaptation with these adaptively trained systems. Incremental adaptation reduces the latency during recognition, and has the possibility of reducing the error rate for slowly varying noise. The work is evaluated on a large scale multi-environment training configuration targeted at in-car speech recognition. Results on in-car collected test data indicate that incremental adaptation is an attractive option when using these adaptively trained systems.

#9Target Speech GMM-based Spectral Compensation for Noise Robust Speech Recognition

Takahiro Shinozaki (Tokyo Institute of Technology)
Sadaoki Furui (Tokyo Institute of Technology)

To improve speech recognition performance in adverse conditions, a noise compensation method is proposed that applies a transformation in the spectral domain whose parameters are optimized based on the likelihood of a speech GMM modeled in the feature domain. The idea is that additive and convolutional noises have mathematically simple expressions in the spectral domain, while speech characteristics are better modeled in a feature domain such as MFCC. The proposed method works as a feature extraction front-end that is independent of the decoding engine, and can compensate for non-stationary additive and convolutional noises with a short time delay. It includes spectral subtraction as a special case when no parameter optimization is performed. Experiments were performed using the AURORA-2J database. The proposed method yields significantly higher recognition performance than spectral subtraction.

#10Noise-Robust Feature Extraction Based on Forward Masking

Sheng-Chiuan Chiou (Department of Computer Science and Engineering, National Sun Yat-sen University)
Chia-Ping Chen (Department of Computer Science and Engineering, National Sun Yat-sen University)

Forward masking is a phenomenon of human auditory perception in which a weaker sound is masked by a preceding stronger masker. In this paper, we postulate the mechanism of forward masking to be synaptic adaptation and temporal integration, and incorporate them in the feature extraction process of an automatic speech recognition system to improve noise-robustness. The synaptic adaptation is implemented by a highpass filter, and the temporal integration is implemented by a bandpass filter. We apply both filters in the domain of log mel-spectrum. On the Aurora 3 tasks, we evaluate three modified mel-frequency cepstral coefficients: synaptic adaptation only, temporal integration only, and both synaptic adaptation and temporal integration. Experiments show that the overall improvement is 16.1%, 21.8%, and 26.2% respectively in the three cases over the baseline.

Tue-Ses2-P1:
Speech Analysis and Processing II

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Aladdin Ariyaeeinia

#2Spectral and Temporal Modulation Features for Phonetic Recognition

Stephen Zahorian (Binghamton University)
Hongbing Hu (Binghamton University)
Zhengqing Chen (Binghamton University)
Jiang Wu (Binghamton University)

Recently, the modulation spectrum has been proposed and found to be a useful source of speech information. The modulation spectrum represents longer term variations in the spectrum and thus implicitly requires features extracted from much longer speech segments compared to MFCCs and their delta terms. In this paper, a Discrete Cosine Transform (DCT) analysis of the log magnitude spectrum combined with a Discrete Cosine Series (DCS) expansion of DCT coefficients over time is proposed as a method for capturing both the spectral and modulation information. Several variations of the DCT/DCS features were evaluated with phonetic recognition experiments using TIMIT and its telephone version (NTIMIT). Best results obtained with a combined feature set are 73.8% for TIMIT and 62.5% for NTIMIT. The modulation features are shown to be far more important than the spectral features for automatic speech recognition and far more noise robust.
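
The two-stage analysis named in this abstract (a cosine transform across frequency, followed by a cosine expansion of each coefficient over time) can be sketched roughly as below. This is only an illustration under stated assumptions, not the authors' implementation: the function names `cosine_coeffs` and `dct_dcs_features`, the unnormalized DCT-II basis, and the block and coefficient sizes are all hypothetical, and any frequency or time warping the authors may apply is omitted.

```python
import math


def cosine_coeffs(values, n_coeffs):
    """First n_coeffs DCT-II coefficients of a sequence (unnormalized)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi * k * (i + 0.5) / n) for i, v in enumerate(values))
        for k in range(n_coeffs)
    ]


def dct_dcs_features(log_spectra, n_dct, n_dcs):
    """Encode a block of frames as n_dct * n_dcs spectral-temporal features.

    log_spectra: list of frames, each a list of log-magnitude spectrum values
    (hypothetical input layout chosen for this sketch).
    """
    # Step 1: cosine transform across frequency, one coefficient vector per frame.
    per_frame = [cosine_coeffs(frame, n_dct) for frame in log_spectra]
    # Step 2: cosine expansion across time of each coefficient's trajectory.
    features = []
    for k in range(n_dct):
        trajectory = [frame_coeffs[k] for frame_coeffs in per_frame]
        features.extend(cosine_coeffs(trajectory, n_dcs))
    return features
```

In this sketch a longer block of frames gives the time expansion access to slower spectral variations, which is the sense in which such features capture modulation information.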

#3Use of Harmonic Phase Information for Polarity Detection in Speech Signals

Ibon Saratxaga (University of the Basque Country)
Daniel Erro (University of the Basque Country)
Inmaculada Hernáez (University of the Basque Country)
Iñaki Sainz (University of the Basque Country)
Eva Navas (University of the Basque Country)

Phase information resulting from harmonic analysis of speech can be used very successfully to determine the polarity of a voiced speech segment. In this paper we present two algorithms which calculate the signal polarity from this information. One is based on the effect of the glottal signal on the phase of the first harmonics and the other on the relative phase shifts between the harmonics. The detection rates of these two algorithms are compared against other established algorithms.

#4Finite Mixture Spectrogram Modeling for Multipitch Tracking Using A Factorial Hidden Markov Model

Michael Wohlmayr (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria)
Franz Pernkopf (Signal Processing and Speech Communication Laboratory, Graz University of Technology, Graz, Austria)

In this paper, we present a simple and efficient feature modeling approach for tracking the pitch of two speakers speaking simultaneously. We model the spectrogram features using Gaussian Mixture Models (GMMs) in combination with the Minimum Description Length (MDL) model selection criterion. This makes it possible to determine the number of Gaussian components automatically, depending on the available data for a specific pitch pair. A factorial hidden Markov model (FHMM) is applied for tracking. We compare our approach to two methods based on correlogram features. Those methods use either an HMM or an FHMM for tracking. Experimental results on the Mocha-TIMIT database show that our proposed approach significantly outperforms the correlogram-based methods for speech utterances mixed at 0 dB. The superior performance even holds when adding white Gaussian noise to the mixed speech utterances during pitch tracking.

#5Group-Delay-Deviation Based Spectral Analysis of Speech

Anthony Stark (Griffith University)
Kuldip Paliwal (Griffith University)

In this paper, we investigate a new method for extracting useful information from the group delay spectrum of speech. The group delay spectrum is often poorly behaved and noisy. In the literature, various methods have been proposed to address this problem. However, to make the group delay a more tractable function, these methods have typically relied upon some modification of the underlying speech signal. The method proposed in this paper does not require such modifications. To accomplish this, we investigate a new function derived from the group delay spectrum, namely the group delay deviation. We use it for both narrowband analysis and wideband analysis of speech and show that this function exhibits meaningful formant and pitch information.

#6Speaker Dependent Mapping for Low Bit Rate Coding of Throat Microphone Speech

Anand Joseph Xavier Medabalimi (International Institute of Information Technology, Hyderabad, India)
Yegnanarayana Bayya (International Institute of Information Technology, Hyderabad, India)
Sanjeev Gupta (Center for Artificial Intelligence and Robotics, Bangalore, India)
Kesheorey R M (Center for Artificial Intelligence and Robotics, Bangalore, India)

Throat microphones (TM), which are robust to background noise, can be used in environments with high levels of background noise. However, speech collected using a TM is perceptually less natural. The objective of this paper is to map the spectral features (represented in the form of cepstral features) of TM speech to those of close-speaking microphone (CSM) speech to improve the former's perceptual quality, and to represent it in an efficient manner for coding. The spectral mapping of TM and CSM speech is done using a multilayer feed-forward neural network, which is trained on features derived from TM and CSM speech. The sequence of estimated CSM spectral features is quantized and coded as a sequence of codebook indices using vector quantization. The sequence of codebook indices, the pitch contour and the energy contour derived from the TM signal are used to store/transmit the TM speech information efficiently. At the receiver, the all-pole system corresponding to the estimated CSM spectral vectors is excited by a synthetic residual to generate the speech signal.

#7Analysis of Lombard Speech using Excitation Source Information

Bapineedu Gummadi (International Institute of Information Technology, Hyderabad, India)
Avinash Boppay (International Institute of Information Technology, Hyderabad, India)
Suryakanth. V. Gangashetty (International Institute of Information Technology, Hyderabad, India)
Yegnanarayana Bayya (International Institute of Information Technology, Hyderabad, India)

This paper examines the Lombard effect on the excitation features in speech production. These features correspond mostly to the acoustic features at the subsegmental (< pitch period) level. The instantaneous fundamental frequency F0 (i.e., pitch), the strength of excitation at the instants of significant excitation and a loudness measure reflecting the sharpness of the impulse-like excitation around epochs are used to represent the excitation features at the subsegmental level. The Lombard effect influences the pitch and the loudness. The extent of the Lombard effect on speech depends on the nature and level (or intensity) of the external feedback that causes the Lombard effect.

#8A Comparison of Linear and Nonlinear Dimensionality Reduction Methods Applied to Synthetic Speech

Andrew Errity (School of Computing, Dublin City University)
John McKenna (School of Computing, Dublin City University)

In this study a number of linear and nonlinear dimensionality reduction methods are applied to high dimensional representations of synthetic speech to produce corresponding low dimensional embeddings. Several important characteristics of the synthetic speech, such as formant frequencies and f0, are known and controllable prior to dimensionality reduction. The degree to which these characteristics are retained after dimensionality reduction is examined in visualisation and classification experiments. Results of these experiments indicate that each method is capable of discovering meaningful low dimensional representations of synthetic speech and that the nonlinear methods may outperform linear methods in some cases.

#9ZZT-domain Immiscibility of the Opening and Closing Phases of the LF GFM under Frame Length Variations

Christian Fischer Pedersen (Dept. of Electronic Systems, Aalborg University, Denmark)
Ove Andersen (Dept. of Electronic Systems, Aalborg University, Denmark)
Paul Dalsgaard (Dept. of Electronic Systems, Aalborg University, Denmark)

Current research has proposed a non-parametric speech waveform representation (rep) based on zeros of the z-transform (ZZT)[1][2]. Empirically, the ZZT rep has successfully been applied in discriminating the glottal and vocal tract components in pitch-synchronously windowed speech by using the unit circle (UC) as discriminant[1][2]. Further, similarity between ZZT reps of windowed speech, glottal flow waveforms, and waveforms of glottal flow opening and closing phases has been demonstrated[1][3]. Therefore, the underlying cause of the separation on either side of the UC can be analyzed via the ZZT reps of the opening and closing phase waveforms; the waveforms are generated by the LF glottal flow model (GFM)[1]. The present paper demonstrates this cause and effect analytically and thereby supplements the previous empirical work. Moreover, this paper demonstrates that immiscibility is variant under changes in frame lengths; lengths that maximize or minimize immiscibility are presented.

#10Dimension Reducing of LSF parameters Based on Radial Basis Function Neural Network

Hongjun Sun (+86-010-62632269)
Jianhua Tao (+86-010-62632269)
Huibin Jia (+86-010-62632269)

In this paper, we investigate a novel method for transforming line spectral frequency (LSF) parameters to lower dimensional coefficients. A transforming model based on a radial basis function neural network (RBF NN) is used to fit LSF vectors. In the training process, two criteria, mean squared error and weighted mean squared error, are used to measure the distance between the original vector and the approximate vector. In addition, features of LSF parameters are taken into account to supervise the training process. As a result, LSF vectors are represented by the coefficient vectors of the transforming model. The experimental results reveal that a 24-order LSF vector can be transformed to a 15-dimension coefficient vector with an average spectral distortion of approximately 1 dB. Subjective evaluation shows that the transform method in this paper does not lead to a significant decrease in voice quality.

#11Characterizing Speaker Variability Using Spectral Envelopes of Vowel Sounds

Harish Arsikere (Indian Institute of Technology - Kanpur)
Rama Sanand Doddipatla (Indian Institute of Technology - Kanpur)
Srinivasan Umesh (Indian Institute of Technology - Kanpur)

In this paper, we present a study to understand the relation between spectra of speakers enunciating the same sound and to investigate the issue of uniform versus non-uniform scaling. There is a lot of interest in understanding this relation as speaker variability is a major source of concern in many applications including Automatic Speech Recognition (ASR). Using dynamic programming, we find mapping relations between smoothed spectral envelopes of speakers enunciating the same sound and show that these relations are not linear but have a consistent non-uniform behavior. This non-uniform behavior is also shown to vary across vowels. Through a series of experiments, we show that using the observed non-uniform relation provides better vowel normalization than just a simple linear scaling relation. All results in this paper are based on vowel data from TIMIT, Hillenbrand et al. and North Texas databases.

#12Analysis of band structures for speaker-specific information in FM feature extraction

Tharmarajah Thiruvaran (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))
Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))
Julien Epps (School of Electrical Engineering and Telecommunications, The University of New South Wales and National Information Communication Technology (NICTA))

Frequency modulation (FM) features are typically extracted using a filterbank, usually based on an auditory frequency scale; however, there is psychophysical evidence to suggest that this scale may not be optimal for extracting speaker-specific information. In this paper, speaker-specific information in FM features is analyzed as a function of the filterbank structure at the feature, model and classification stages. Scatter matrix based separation measures at the feature level and Kullback-Leibler distance based measures at the model level are used to analyze the discriminative contributions of the different bands. Then a series of speaker recognition experiments is performed to study how each band of the FM feature contributes to speaker recognition. A new filterbank structure is then proposed that attempts to maximize the speaker-specific information in the FM feature for telephone data. Finally, the distribution of speaker-specific information is analyzed for wideband speech.

#13Artificial Nasalization of Speech Sounds Based on Pole-Zero Models of Spectral Relations between Mouth and Nose Signals

Karl Schnell (Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany)
Arild Lacroix (Institute of Applied Physics, Goethe-University Frankfurt, Max-von-Laue-Str. 1, D-60438 Frankfurt am Main, Germany)

In this contribution, a method for nasalization of speech sounds is proposed based on model-based spectral relations between mouth and nose signals. For that purpose, the mouth and nose signals of speech utterances are recorded simultaneously. The spectral relations of the mouth and nose signals are modeled by pole-zero models. A filtering of non-nasalized speech signals by these pole-zero models yields approximately nasal signals, which can be utilized to nasalize the speech signals. The artificial nasalization can be exploited to modify speech units of a non-nasalized or weakly nasalized representation which should be nasalized due to coarticulation or for the production of foreign words.

#14Error metrics for impaired auditory nerve responses of different phoneme groups

Andrew Hines (Trinity College Dublin)
Naomi Harte (Trinity College Dublin)

An auditory nerve model allows faster investigation of new signal processing algorithms for hearing aids. This paper presents a study of the degradation of auditory nerve (AN) responses at a phonetic level for a range of sensorineural hearing losses and flat audiograms. The AN model of Zilany & Bruce was used to compute responses to a diverse set of phoneme rich sentences from the TIMIT database. The characteristics of both the average discharge rate and spike timing of the responses are discussed. The experiments demonstrate that a mean absolute error metric provides a useful measure of average discharge rates but a more complex measure is required to capture spike timing response errors.

#15Feature Extraction for Detecting Stop Consonants in Continuous Speech

Chi-Yueh Lin (National Tsing Hua University, Hsinchu, Taiwan)
Hsiao-Chuan Wang (National Tsing Hua University, Hsinchu, Taiwan)

A stop consonant is a highly non-stationary signal, distinct from other phonetic classes in possessing some particular acoustic characteristics. How to model its prominent acoustic landmark and detect it effectively in continuous speech have been challenging tasks for years. In this paper, an approach using the two-dimensional discrete cosine transform (2D-DCT) to encode its burst portion in the spectro-temporal domain is suggested. An emerging machine learning approach, random forest, uses the derived features to locate stop bursts in continuous speech. A series of experimental results demonstrates that our suggested approach has promising performance.

Tue-Ses2-P3:
ASR: Decoding and Confidence Measures

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Kai Yu

#1Incremental composition of static decoding graphs

Miroslav Novak (IBM T.J. Watson Research Center)

A fast, scalable and memory-efficient method for static decoding graph construction is presented. As an alternative to the traditional transducer-based approach, it is based on incremental composition. Memory efficiency is achieved by combining composition, determinization and minimization into a single step, thus eliminating large intermediate graphs. We have previously reported the use of incremental composition limited to grammars and left cross-word context. Here, this approach is extended to n-gram models with explicit epsilon arcs and right cross-word context.

#2Evaluation of Phone Lattice Based Speech Decoding

Jacques Duchateau (Katholieke Universiteit Leuven)
Kris Demuynck (Katholieke Universiteit Leuven)
Hugo Van hamme (Katholieke Universiteit Leuven)

Previously, we proposed a flexible two-layered speech recogniser architecture, called FLaVoR. In the first layer an unconstrained, task independent phone recogniser generates a phone lattice. Only in the second layer are the task specific lexicon and language model applied to decode the phone lattice and produce a word level recognition result. In this paper, we present a further evaluation of the FLaVoR architecture. The performance of a classical single-layered architecture and the FLaVoR architecture are compared on two recognition tasks, using the same acoustic, lexical and language models. On the large vocabulary Wall Street Journal 5k and 20k benchmark tasks, the two-layered architecture resulted in slightly but not significantly better word error rates. On a reading error detection task for a reading tutor for children, the FLaVoR architecture clearly outperformed the single-layered architecture.

#3A Fully Data Parallel WFST-based Large Vocabulary Continuous Speech Recognition on a Graphics Processing Unit

Jike Chong (University of California, Berkeley)
Ekaterina Gonina (University of California, Berkeley)
Youngmin Yi (University of California, Berkeley)
Kurt Keutzer (University of California, Berkeley)

Tremendous compute throughput is becoming available in personal desktop and laptop systems through the use of graphics processing units (GPUs). However, exploiting this resource requires re-architecting an application to fit a data-parallel programming model. The complex graph traversal routines in the inference process for large vocabulary continuous speech recognition (LVCSR) have been considered by many as unsuitable for extensive parallelization. We explore and demonstrate a fully data parallel implementation of a speech inference engine on NVIDIA's GTX280 GPU. Our implementation has a compute-intensive phase for observation probability computation that allows dynamic elimination of redundant computation while maintaining close-to-peak execution efficiency. We demonstrate the importance of exploring application-level trade-offs in the communication-intensive graph traversal phase to adapt the algorithm to data parallel execution on GPUs.

#4Combined low level and high level features for Out-Of-Vocabulary Word detection

Benjamin LECOUTEUX (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France)
Georges LINARES (Laboratoire Informatique d'Avignon (LIA) University of Avignon, France)
Benoit FAVRE (ICSI, 1947 Center St, Suite 600, Berkeley, CA 94704, USA)

This paper addresses the issue of Out-Of-Vocabulary (OOV) word detection in Large Vocabulary Continuous Speech Recognition (LVCSR) systems. We propose a method inspired by confidence measures that consists in analyzing the recognition system outputs in order to automatically detect errors due to OOV words. This method combines various features based on acoustic, linguistic, decoding-graph and semantic information. We evaluate each feature separately and estimate their complementarity. Experiments are conducted on a large French broadcast news corpus from the ESTER evaluation campaign. Results show good performance in real conditions: the method obtains an OOV word detection rate of 43%-90% with 2.5%-17.5% of false detection.

#5Bayes Risk Approximations Using Time Overlap with an Application to System Combination

Björn Hoffmeister (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Ralf Schlüter (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)
Hermann Ney (Chair of Computer Science 6, Computer Science Department, RWTH Aachen University)

The computation of the Minimum Bayes Risk (MBR) decoding rule for word lattices needs approximations. We investigate a class of approximations where the Levenshtein alignment is approximated under the condition that competing lattice arcs overlap in time. The approximations have their origins in MBR decoding and in discriminative training. We develop modified versions and propose a new, conceptually extremely simple confusion network algorithm. The MBR decoding rule is extended to cope with several lattices, which enables us to apply all the investigated approximations to system combination. All approximations are tested on a Mandarin and on an English LVCSR task for a single system and for system combination. The new methods are competitive in error rate and show some advantages over the standard approaches to MBR decoding.

#6Unsupervised Estimation of the Language Model Scaling Factor

Christopher M. White (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Ariya Rastrow (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Sanjeev Khudanpur (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Frederick Jelinek (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)

This paper addresses the adjustment of the language model (LM) scaling factor of an automatic speech recognition (ASR) system for a new domain using only un-transcribed speech. The main idea is to replace the (unavailable) reference transcript with an automatic transcript generated by an independent ASR system, and adjust parameters using this sloppy reference. It is shown that despite its fairly high error rate (ca. 35%), choosing the scaling factor to minimize disagreement with the erroneous transcripts is still an effective recipe for model selection. This effectiveness is demonstrated by adjusting an ASR system trained on Broadcast News to transcribe the MIT Lectures corpus. An ASR system for telephone speech produces the sloppy reference, and optimizing towards it yields a nearly optimal LM scaling factor for the MIT Lectures corpus.
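
The selection recipe described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the N-best list layout, the score fields and the function names `word_errors` and `pick_lm_scale` are assumptions introduced here, and full decoding is approximated by rescoring a fixed N-best list for each candidate scale.

```python
def word_errors(ref, hyp):
    """Levenshtein (edit) distance between two word lists."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def pick_lm_scale(nbest_lists, sloppy_refs, scales):
    """Return the LM scale whose 1-best output disagrees least with the sloppy reference.

    nbest_lists: one list per utterance of (words, acoustic_logprob, lm_logprob)
                 tuples (hypothetical layout for this sketch).
    sloppy_refs: one word list per utterance, produced by an independent ASR system.
    scales:      candidate LM scaling factors to try.
    """
    best_scale, best_errors = None, None
    for scale in scales:
        errors = 0
        for nbest, ref in zip(nbest_lists, sloppy_refs):
            # Rescore: acoustic score plus scaled LM score, keep the top hypothesis.
            words, _, _ = max(nbest, key=lambda h: h[1] + scale * h[2])
            errors += word_errors(ref, words)
        # Keep the first scale that reaches the lowest disagreement.
        if best_errors is None or errors < best_errors:
            best_scale, best_errors = scale, errors
    return best_scale, best_errors
```

No reference transcript appears anywhere in the sketch; the only supervision is the automatic transcript, which is the point of the approach.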

#7Simultaneous Estimation of Confidence and Error Cause in Speech Recognition Using Discriminative Model

Atsunori Ogawa (NTT Corporation)
Atsushi Nakamura (NTT Corporation)

Since recognition errors are unavoidable in speech recognition, confidence scoring, which accurately estimates the reliability of recognition results, is a critical function for speech recognition engines. In addition to achieving accurate confidence estimation, if we are to develop speech recognition systems that will be widely used by the public, speech recognition engines must be able to report the causes of errors properly, namely they must offer a reason for any failure to recognize input utterances. This paper proposes a method that simultaneously estimates both confidences and causes of errors in speech recognition results by using discriminative models. We evaluated the proposed method in an initial speech recognition experiment, and confirmed its promising performance with respect to confidence and error cause estimation.

#8A Generalized Composition Algorithm for Weighted Finite-State Transducers

Cyril Allauzen (Google)
Michael Riley (Google)
Johan Schalkwyk (Google)

This paper describes a weighted finite-state transducer composition algorithm that generalizes the notion of the composition filter and presents filters that remove useless epsilon paths and push forward labels and weights along epsilon paths. This filtering allows us to compose together large speech recognition context-dependent lexicons and language models much more efficiently in time and space than previously possible. We present experiments on Broadcast News and Google Search by Voice that demonstrate a 5% to 10% overhead for dynamic, runtime composition compared to a static, offline composition of the recognition transducer. To our knowledge, this is the first such system with such small overhead.

#9Word Confidence using Duration Models

Stefano Scanzio (Politecnico di Torino)
Pietro Laface (Politecnico di Torino)
Daniele Colibro (Loquendo S.p.A.)
Roberto Gemello (Loquendo S.p.A.)

In this paper, we propose a word confidence measure based on phone durations depending on large contexts. The measure is based on the expected duration of each recognized phone in a word. In the proposed approach, the duration of each phone is in principle context-dependent, and the measure is a function of the distance between the observed and expected phone duration distributions within a word. Our experiments show that, since the “duration confidence” does not make use of any acoustic information, its Equal Error Rate (EER) in terms of False Accept and False Rejection rates is not as good as the one obtained by using the more informed acoustic confidence measure. However, when the two measures are combined by a simple linear interpolation, the system EER improves by 6% to 10% relative on an isolated word recognition task in several languages.

#10A Comparison of Audio-free Speech Recognition Error Prediction Methods

Preethi Jyothi (Ohio State University)
Eric Fosler-Lussier (Ohio State University)

Predicting possible speech recognition errors can be invaluable for a number of Automatic Speech Recognition (ASR) applications. In this study, we extend a Weighted Finite State Transducer (WFST) framework for error prediction to facilitate a comparison between two approaches to predicting confusable words: examining recognition errors on the training set to learn phone confusions, and utilizing distances between the phonetic acoustic models for the prediction task. We also expand the framework to deal with continuous word recognition, and we can accurately predict 60% of the misrecognized sentences (with an average words-per-sentence count of 15) and a little over 70% of the total number of errors from the unseen test data where no acoustic information related to the test data is utilized.

#11Automatic Out-of-Language Detection based on Confidence Measures derived from LVCSR Word and Phone Lattices

Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)

Confidence Measures (CMs) estimated from Large Vocabulary Continuous Speech Recognition (LVCSR) outputs are commonly used metrics to detect incorrectly recognized words. In this paper, we propose to exploit CMs derived from frame-based word and phone posteriors to detect speech segments containing pronunciations from non-target (alien) languages. The LVCSR system used is built for English, which is the target language, with medium-size recognition vocabulary (5k words). The efficiency of detection is tested on a set comprising speech from three different languages (English, German, Czech). Results achieved indicate that employment of specific temporal context (integrated in the word or phone level) significantly increases the detection accuracies. Furthermore, we show that combination of several CMs can also improve the efficiency of detection.

#12Automatic Estimation of Decoding Parameters Using Large-Margin Iterative Linear Programming

Brian Mak (The Hong Kong University of Science and Technology)
Tom Ko (The Hong Kong University of Science and Technology)

The decoding parameters in automatic speech recognition (grammar factor and word insertion penalty) are usually determined by performing a grid search on a development set. Recently, we cast their estimation as a convex optimization problem, and proposed a solution using an iterative linear programming algorithm. However, the solution depends on how well the development data set matches the test set. In this paper, we further investigate an improvement on the generalization property of the solution by using large margin training within the iterative linear programming framework. Empirical evaluation on the WSJ0 5K speech recognition tasks shows that the recognition performance of the decoding parameters found by the improved algorithm using only a subset of the acoustic model training data is even better than that of the decoding parameters found by grid search on the development data, and is close to the performance of those found by grid search on the test set.
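
The grid-search baseline mentioned in this abstract can be illustrated with a minimal sketch. It assumes each development utterance comes with an N-best list whose hypotheses carry an acoustic log-probability, a language model log-probability, a word count and a known error count; the tuple layout and the names `rescore` and `grid_search` are hypothetical and are not taken from the paper.

```python
from itertools import product

def rescore(hyp, grammar_factor, insertion_penalty):
    # hyp = (acoustic_logprob, lm_logprob, n_words, n_errors); assumed layout
    acoustic, lm, n_words, _ = hyp
    return acoustic + grammar_factor * lm + insertion_penalty * n_words

def grid_search(nbest_lists, factors, penalties):
    """Return (errors, grammar_factor, insertion_penalty) with the fewest errors."""
    best = None
    for gf, wip in product(factors, penalties):
        # Count the errors of the top-scoring hypothesis of every utterance.
        errors = sum(
            max(nbest, key=lambda h: rescore(h, gf, wip))[3]
            for nbest in nbest_lists
        )
        if best is None or errors < best[0]:
            best = (errors, gf, wip)
    return best

# Toy development set: one utterance with two competing hypotheses.
dev = [[(-100.0, -10.0, 5, 0), (-95.0, -20.0, 6, 2)]]
result = grid_search(dev, factors=[0.0, 1.0], penalties=[0.0])
```

With a grammar factor of 0 the acoustically better but erroneous hypothesis wins, while a factor of 1 lets the language model score select the correct one, so the search settles on the latter setting.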

Tue-Ses2-P2:
Speech processing with audio or audiovisual input

Time:Tuesday 13:30 Place:Hewison Hall Type:Poster
Chair:Bob Damper

#1Application of Differential Microphone Array for IS-127 EVRC Rate Determination Algorithm

Henry Widjaja (Institut Teknologi Telkom)
Suryoadhi Wibowo (Institut Teknologi Telkom)

A differential microphone array is known to have low sensitivity to distant sound sources. Such characteristics may be advantageous in voice activity detection, where it can be assumed that the target speaker is close and background noise sources are distant. In this paper we develop a simple modification to the EVRC rate determination algorithm (EVRC RDA) to exploit the noise-canceling property of a differential microphone array to improve its performance in highly dynamic noise environments. Comprehensive computer simulations show that the modified algorithm outperforms the original EVRC RDA in all tested noise conditions.

#2Estimating the position and orientation of an acoustic source with a microphone array network

Alberto Yoshihiro Nakano (Toyohashi University of Technology)
Seiichi Nakagawa (Toyohashi University of Technology)
Kazumasa Yamamoto (Toyohashi University of Technology)

We propose a method that finds the position and orientation of an acoustic source in an enclosed environment. For each of eight T-shaped arrays forming a microphone array network, the time delay of arrival (TDOA) of signals from microphone pairs, a source position candidate, and energy related features are estimated. These form the input for artificial neural networks (ANNs), the purpose of which is to provide indirectly a more precise position of the source and, additionally, to estimate the source's orientation using various combinations of the estimated parameters. The best combination of parameters (TDOAs and microphone positions) yields a 21.8% reduction in the mean average position error compared to baselines, and a correct orientation ratio higher than 99.0%. The position estimation baselines include two estimation methods: a TDOA-based method that finds the source position geometrically, and the SRP-PHAT that finds the most likely source position by spatial exploration.

#3Singing voice detection in polyphonic music using predominant pitch

Vishweshwara Rao (Electrical Engineering Department, Indian Institute of Technology Bombay)
Ramakrishnan Srinivasakannan (Electrical Engineering Department, Indian Institute of Technology Bombay)
Preeti Rao (Electrical Engineering Department, Indian Institute of Technology Bombay)

This paper demonstrates the superiority of energy-based features derived from the knowledge of predominant-pitch, for singing voice detection in polyphonic music over commonly used spectral features. However, such energy-based features tend to misclassify loud, pitched instruments. To provide robustness to such accompaniment we exploit the relative instability of the pitch contour of the singing voice by attenuating harmonic spectral content belonging to stable-pitch instruments, using sinusoidal modeling. The obtained feature shows high classification accuracy when applied to north Indian classical music data and is also found suitable for automatic detection of vocal-instrumental boundaries required for smoothing the frame-level classifier decisions.

#4Word stress assessment for computer aided language learning

Juan Pablo Arias (Universidad de Chile)
Nestor Becerra Yoma (Universidad de Chile)
Hiram Vivanco (Universidad de Chile)

In this paper an automatic word stress assessment system is proposed based on a top-to-bottom scheme. The method presented is text and language independent. The utterance pronounced by the student is directly compared with a reference one. The trend similarity of the F0 and energy contours is compared frame-by-frame by using DTW alignment. The stress assessment evaluation system gives an EER equal to 21.5%, which in turn is similar to the error observed in phonetic quality evaluation schemes. These results suggest that the proposed system can be employed in real applications and is applicable to any language.
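
As a rough illustration of the DTW alignment step, the following sketch computes the classic dynamic time warping cost between two one-dimensional contours (for example, F0 values per frame). The absolute-difference local distance and the function name `dtw_distance` are assumptions made for illustration; the paper's own trend-similarity measure is not reproduced here.

```python
def dtw_distance(a, b):
    """Cumulative DTW cost between two 1-D sequences of numbers."""
    inf = float("inf")
    n, m = len(a), len(b)
    # cost[i][j] holds the best cumulative cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            local = abs(a[i - 1] - b[j - 1])  # assumed local distance
            cost[i][j] = local + min(
                cost[i - 1][j],      # a advances, b repeats its frame
                cost[i][j - 1],      # b advances, a repeats its frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]

# A student contour that lingers one frame longer still aligns at zero cost.
reference = [1.0, 2.0, 3.0]
student = [1.0, 2.0, 2.0, 3.0]
same_shape_cost = dtw_distance(reference, student)
```

Because the warping path may repeat frames, two contours with the same shape but different timing receive a low cost, which is what makes the comparison tolerant of speaking-rate differences.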

#5A non-intrusive signal-based model for speech quality evaluation using automatic classification of background noises

Adrien Leman (France Telecom R&D)
Julien Faure (France Telecom R&D)
Etienne Parizet (INSA de Lyon)

This paper describes an original method for speech quality evaluation in the presence of different types of background noises for a range of communications (mobile, VoIP, RTC). The model is obtained from subjective experiments described in [1]. These experiments show that background noise can be more or less tolerated by listeners, depending on the sources of noise that can be identified. Using a classification method, the background noises can be classified into four groups. For each one of the four groups, a relation between loudness of the noise and speech quality is proposed.

#6Acoustic Event Detection for Spotting "Hot Spots" in Podcasts

Kouhei Sumi (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)
Jun Ogata (National Institute of Advanced Industrial Science and Technology)
Masataka Goto (National Institute of Advanced Industrial Science and Technology)

This paper presents a method to detect acoustic events that can be used to find “hot spots” in podcast programs. We focus on meaningful non-verbal audible reactions which suggest hot spots, such as laughter and reactive tokens. In order to detect such short events and segment the counterpart utterances, we need accurate audio segmentation and classification, dealing with various recording environments and background music. Thus, we propose a method for automatically estimating and switching penalty weights for the BIC-based segmentation depending on background environments. Experimental results show significant improvement in detection accuracy by our method compared to when using a constant penalty weight.
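
For context, the role of the penalty weight in BIC-based segmentation can be seen in the commonly used change-point criterion below. This is the standard textbook formulation with full-covariance Gaussians, given as an assumption; the abstract does not state the exact variant used by the authors. Here $N$ frames of dimension $d$ with covariance $\Sigma$ are split at a candidate boundary into $N_1$ and $N_2$ frames with covariances $\Sigma_1$ and $\Sigma_2$, and $\lambda$ is the penalty weight.

```latex
\Delta\mathrm{BIC} = \frac{N}{2}\log|\Sigma|
  - \frac{N_1}{2}\log|\Sigma_1|
  - \frac{N_2}{2}\log|\Sigma_2|
  - \lambda \cdot \frac{1}{2}\left(d + \frac{d(d+1)}{2}\right)\log N
```

A boundary is hypothesized where $\Delta\mathrm{BIC} > 0$, so a larger $\lambda$ yields fewer boundaries and a smaller $\lambda$ yields more.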

#7Improving Detection of Acoustic Events Using Audiovisual Data and Feature Level Fusion

Taras Butko (Technical University of Catalonia)
Cristian Canton-Ferrer (Technical University of Catalonia)
Carlos Segura (Technical University of Catalonia)
Xavi Giro (Technical University of Catalonia)
Climent Nadeu (Technical University of Catalonia)
Javier Hernando (Technical University of Catalonia)
Josep-Ramon Casas (Technical University of Catalonia)

The detection of the acoustic events (AEs) that are naturally produced in a meeting room may help to describe the human and social activity that takes place in it. When applied to spontaneous recordings, the detection of AEs from only audio information shows a large number of errors, which are mostly due to temporal overlapping of sounds. In this paper, a system to detect and recognize AEs using both audio and video information is presented. A feature-level fusion strategy is used, and the structure of the HMM-GMM based system considers each class separately and uses a one-against-all strategy for training. Experimental AED results with a new and rather spontaneous dataset are presented which show the advantage of the proposed approach.

#8Detecting Audio Events for Semantic Video Search

Miguel Bugalho (INESC-ID Lisboa / IST)
José Portêlo (INESC-ID Lisboa)
Isabel Trancoso (INESC-ID Lisboa / IST)
Thomas Pellegrini (INESC-ID Lisboa)
Alberto Abad (INESC-ID Lisboa)

This paper describes our work on audio event detection, one of our tasks in the European project VIDIVIDEO. Preliminary experiments with a small corpus of sound effects have shown the potential of this type of corpus for training purposes. This paper describes our experiments with SVM classifiers, and different features, using a 290-hour corpus of sound effects, which allowed us to build detectors for almost 50 semantic concepts. Although the performance of these detectors on the development set is quite good (achieving an average F-measure of 0.87), preliminary experiments on documentaries and films showed that the task is much harder in real-life videos, which so often include overlapping audio events.

#9Factor Analysis for Audio-based Video Genre Classification

Mickael Rouvier (LIA)
Driss Matrouf (LIA)
Georges Linarès (LIA)

Statistical classifiers operate on features that generally include both useful and useless information. These two types of information are difficult to separate in the feature domain. Recently, a new paradigm based on Latent Factor Analysis proposed a model decomposition into useful and useless components. This method was successfully applied to speaker and language recognition tasks. In this paper, we study the use of Latent Factor Analysis for video genre classification by using only the audio channel. We propose a classification method based on short-term cepstral features and GMM or SVM classifiers, which are combined with Factor Analysis. Experiments are conducted on a corpus composed of 5 types of video (music, commercials, cartoons, movies and news). The relative classification error reduction obtained by using the best factor analysis configuration with respect to the baseline system (GMM-UBM) is about 56%, corresponding to a correct identification rate of about 90%.

#10Robust Audio-based Classification of Video Genre

Mickael Rouvier (LIA)
Georges Linarès (LIA)
Driss Matrouf (LIA)

Video genre classification is a challenging task in a global context of fast growing video collections available on the Internet. In this paper, we present a new method for video genre identification by audio content analysis. Our approach relies on the combination of low and high level audio features. We investigate the discriminative capacity of features related to acoustic instability, speaker interactivity, speech quality and acoustic space characterization. The genre identification is performed on these features by using an SVM classifier. Experiments are conducted on a corpus composed of cartoons, movies, news, commercials and music, on which we obtain an identification rate of 91%.

#11Fusing Audio and Video Information for Online Speaker Diarization

Joerg Schmalenstroeer (Department of Communications Engineering, University of Paderborn, Germany)
Martin Kelling (Department of Communications Engineering, University of Paderborn, Germany)
Volker Leutnant (Department of Communications Engineering, University of Paderborn, Germany)
Reinhold Haeb-Umbach (Department of Communications Engineering, University of Paderborn, Germany)

In this paper we present a system for identifying and localizing speakers using distant microphone arrays and a steerable pan-tilt-zoom camera. Audio and video streams are processed in real-time to obtain the diarization information “who speaks when and where” with low latency to be used in advanced video conferencing systems or user-adaptive interfaces. A key feature of the proposed system is to first glean information about the speaker's location and identity from the audio and visual data streams separately and then to fuse these data in a probabilistic framework employing the Viterbi algorithm. Here, visual evidence of a person is utilized through a priori state probabilities, while location and speaker change information are employed via time-variant transition probabilities. Experiments show that video information yields a substantial improvement compared to pure audio-based diarization.

#12Multimodal Speaker Verification Using Ancillary Known Speaker Characteristics Such as Gender or Age

Girija Chetty (University of Canberra)
Michael Wagner (University of Canberra)

Multimodal speaker verification based on easy-to-obtain biometric traits such as face and voice is rapidly gaining acceptance as the preferred technology for many applications. In many such practical applications, other characteristics of the speaker such as gender or age are known and may be exploited for enhanced verification accuracy. In this paper we present a parallel approach determining gender as an ancillary speaker characteristic, which is incorporated in the decision of a face-voice speaker verification system. Preliminary experiments with the DaFEx multimodal audio-video database show that fusing the results of gender recognition and identity verification improves the performance of multimodal speaker verification. Index Terms: multimodal, face-voice, speaker verification, speaker characterisation

#13Discovering Keywords from Cross-Modal Input: Ecological vs. Engineering Methods for Enhancing Acoustic Repetitions

Guillaume Aimetti (University of Sheffield)
Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University)
Okko Rasanen (Helsinki University of Technology)
Unto Laine (Helsinki University of Technology)

This paper introduces a computational model that automatically segments acoustic speech data and builds internal representations of keyword classes from cross-modal (acoustic and pseudo-visual) input. Acoustic segmentation is achieved using a novel dynamic time warping technique and the focus of this paper is on recent investigations conducted to enhance the identification of repeating portions of speech. This ongoing research is inspired by current cognitive views of early language acquisition and therefore strives for ecological plausibility in an attempt to build more robust speech recognition systems. Results show that an ad-hoc computationally engineered solution can aid the discovery of repeating acoustic patterns. However, we show that this improvement can be simulated in a more ecologically valid way.

Tue-Ses3-S1:
Panel: Speech & Intelligence

Time:Tuesday 16:00 Place:Main Hall Type:Special
Chair:Roger Moore

16:00Speech and Intelligence Panel Session

In line with the theme of this year’s INTERSPEECH conference, this special semi-plenary Panel Session will be run as a guided discussion, drawing on issues raised by the panel members and solicited in advance from the attendees. An international panel of distinguished experts will engage with the topic of ‘speech and intelligence’ and address open questions such as the importance of a link between spoken language and other aspects of human cognition. It is expected that this special event will be both informative and entertaining, and will involve opportunities for audience participation. Panel chair: Roger Moore (UK); Panel members to include: Janet Baker (USA), Anton Batliner (Germany), Lou Boves (Netherlands), Nick Campbell (Eire), Hiroya Fujisaki (Japan), Bjorn Granstrom (Sweden), Tom Griffiths (USA), Sarah Hawkins (UK), Dirk Heylan (Netherlands), Mark Huckvale (UK) & Nobuaki Minematsu (Japan). If you have a particular question or topic that you would like the panel to discuss, then please send your suggestion(s) to r.k.moore@dcs.shef.ac.uk.

Tue-Ses3-O3:
Speaker verification & identification I

Time:Tuesday 16:00 Place:East Wing 2 Type:Oral
Chair: Patrick Kenny

16:00Investigation into variants of Joint Factor Analysis for speaker recognition

Lukas Burget (Brno University of Technology)
Pavel Matejka (Brno University of Technology)
Valiantsina Hubeika (Brno University of Technology)
Jan Cernocky (Brno University of Technology)

In this paper, we investigate Joint Factor Analysis (JFA) for speaker recognition. First, we performed a systematic comparison of full JFA with its simplified variants and confirmed the superior performance of full JFA with both eigenchannels and eigenvoices. We investigated the sensitivity of JFA to the number of eigenvoices, both for the full model and for the simplified variants. We studied the importance of normalization and found that gender-dependent zt-norm was crucial. The results are reported on NIST 2006 and 2008 SRE evaluation data.

16:20Improved GMM-based Speaker Verification Using SVM-Driven Impostor Dataset Selection

Mitchell McLaren (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Robbie Vogt (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Brendan Baker (SAIVT Research Laboratory, QUT, Brisbane, Australia)
Sridha Sridharan (SAIVT Research Laboratory, QUT, Brisbane, Australia)

The problem of impostor dataset selection for GMM-based speaker verification is addressed through the recently proposed data-driven background dataset refinement technique. The SVM-based refinement technique selects from a candidate impostor dataset those examples that are most frequently selected as support vectors when training a set of SVMs on a development corpus. This study demonstrates the versatility of dataset refinement in the task of selecting suitable impostor datasets for use in GMM-based speaker verification. The use of refined Z- and T-norm datasets provided performance gains of 15% in EER in the NIST 2006 SRE over the use of heuristically selected datasets. The refined datasets were shown to generalise well to the unseen data of the NIST 2008 SRE.

16:40Adaptive Individual Background Model for Speaker Verification

Yossi Bar-Yosef (Tel-Aviv University, Tel-Aviv 69978, Israel)
Yuval Bistritz (Tel-Aviv University, Tel-Aviv 69978, Israel)

Most techniques for speaker verification today use Gaussian Mixture Models (GMMs) and make the decision by comparing the likelihood of the speaker model to the likelihood of a universal background model (UBM). The paper proposes to replace the UBM by an individual background model (IBM) that is generated for each speaker. The IBM is created using the K-nearest cohort models and the UBM by a simple new adaptation algorithm. The new GMM-IBM speaker verification system can also be combined with various score normalization techniques that have been proposed to increase the robustness of the GMM-UBM system. Comparative experiments were held on the NIST-2004-SRE database with a plain system setting (without score normalization) and also with the combination of adaptive test normalization (ATnorm). Results indicated that the proposed GMM-IBM system outperforms a comparable GMM-UBM system.
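
The likelihood comparison described at the start of this abstract is commonly implemented as an average per-frame log-likelihood ratio. The sketch below uses one-dimensional mixtures for brevity; the `(weight, mean, variance)` layout and the names `gmm_loglik` and `llr_score` are hypothetical. The background model passed in could be a UBM or, as the paper proposes, a speaker-specific background model; the adaptation algorithm that builds the latter is not shown.

```python
import math

def gmm_loglik(x, gmm):
    # gmm is a list of (weight, mean, variance) tuples for a 1-D mixture.
    density = sum(
        w * math.exp(-0.5 * (x - mu) ** 2 / var) / math.sqrt(2 * math.pi * var)
        for w, mu, var in gmm
    )
    return math.log(density)

def llr_score(frames, speaker_gmm, background_gmm):
    """Average per-frame log-likelihood ratio of speaker vs. background model."""
    total = sum(
        gmm_loglik(x, speaker_gmm) - gmm_loglik(x, background_gmm)
        for x in frames
    )
    return total / len(frames)

# Toy models: the claimed speaker sits at 0, the background is spread around it.
speaker = [(1.0, 0.0, 1.0)]
background = [(0.5, -2.0, 1.0), (0.5, 2.0, 1.0)]
accept_score = llr_score([0.0, 0.1], speaker, background)
reject_score = llr_score([2.0], speaker, background)
```

A claim is accepted when the score exceeds a threshold chosen on development data; frames close to the speaker model give a positive score, and frames better explained by the background give a negative one.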

17:00Optimization of Discriminative Kernels In SVM Speaker Verification

Shi-xiong Zhang (The Hong Kong Polytechnic University)
Man-wai Mak (The Hong Kong Polytechnic University)

In SVM speaker verification, the kernel needs to map variable-length observation sequences to fixed-size supervectors that capture the dynamic characteristics of speech utterances and allow speakers to be easily distinguished. Most kernels in SVM speaker verification are obtained by assuming a specific form for the similarity function of supervectors. This paper relaxes this assumption to derive a new general kernel. The kernel function is general in that it is a linear combination of any kernels belonging to the reproducing kernel Hilbert space. The combination weights are obtained by optimizing the ability of a discriminant function to separate a target speaker from impostors using either regression analysis or SVM training. The idea was applied to both low- and high-level speaker verification. In both cases, results show that the proposed kernels outperform the state-of-the-art sequence kernels.

17:20UBM-Based Sequence Kernel for Speaker Recognition

Zhenchun Lei (School of Computer and Information Engineering, Jiangxi Normal University, China)

This paper proposes a probabilistic sequence kernel based on the universal background model (UBM), which is widely used in speaker recognition. The Gaussian components are used to construct the speaker reference space, and utterances of different lengths are mapped into fixed-size vectors after normalization with the correlation matrix. Finally, a linear support vector machine is used for speaker recognition. A transition probabilistic sequence kernel is also proposed by adopting the transition information between neighboring frames. Experiments on NIST 2001 show that the performance is comparable to that of the traditional UBM-MAP model. When the models are fused, performance improves by 16.8% and 19.1%, respectively, compared with the UBM-MAP model.

17:40GMM Kernel by Taylor Series for Speaker Verification

Xu Minqiang (Department of Electronic Science and Technology, USTC, Hefei, Anhui, China;Department of Electrical and Computer Engineering, UIUC, USA)
Zhou Xi (Department of Electrical and Computer Engineering, UIUC, USA)
Dai Beiqian (Department of Electronic Science and Technology, USTC, Hefei, Anhui, China)
Huang Thomas S. (Department of Electrical and Computer Engineering, UIUC, USA)

Currently, the combination of Gaussian Mixture Models (GMMs) with Support Vector Machines (SVMs) achieves state-of-the-art performance on the text-independent speaker verification task. Many kernels have been reported for combining GMM and SVM. In this paper, we propose a novel kernel that represents the GMM distribution by Taylor expansion and uses this representation as the input to the SVM. The utterance-specific GMM is represented as a combination of orders of the Taylor series expanded at the means of the Gaussian components. We extract the distribution information around the means of the Gaussian components in the GMM, since we can naturally assume that each mean position indicates a feature cluster in the feature space. The kernel then computes the ensemble distance between orders of the Taylor series. Results with the new kernel on the NIST speaker recognition evaluation (SRE) 2006 core task show relative improvements of up to 7.1% and 11.7% in EER for male and female speakers, respectively, compared to a K-L divergence based SVM system.

Tue-Ses3-O4:
Text Processing for Spoken Language Generation

Time:Tuesday 16:00 Place:East Wing 3 Type:Oral
Chair:Bernd Möbius

16:00Automatic Syllabification for Danish Text-to-Speech Systems

Jeppe Beck (Microsoft Language Development Center)
Daniela Braga (Microsoft Language Development Center)
João Nogueira (Faculty of Sciences of University of Lisbon)
Miguel Dias (Microsoft Language Development Center)
Luis Coelho (Instituto Politécnico do Porto)

In this paper, a rule-based automatic syllabifier for Danish based on the Maximal Onset Principle is described. Prior success rates of rule-based methods applied to Portuguese and Catalan syllabification modules were the basis of this work. The system was implemented and tested using a very small set of rules. The results gave word accuracy rates of 96.9% and 98.7%, contrary to our initial expectations, Danish being a language with a complex syllabic structure and thus difficult to handle with rules. Comparison with a data-driven syllabification system using artificial neural networks showed a higher accuracy rate for the rule-based system.
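
The Maximal Onset Principle assigns as many intervocalic consonants as possible to the onset of the following syllable, provided the result is a permissible onset. A minimal sketch follows. It works on spelling rather than phones, treats a run of vowel letters as one nucleus, and uses a tiny made-up onset inventory; `VOWELS`, `LEGAL_ONSETS` and `syllabify` are hypothetical and do not reflect the rule set of the paper.

```python
VOWELS = set("aeiouyæøå")
# Toy onset inventory, for illustration only (the empty onset is always legal).
LEGAL_ONSETS = {"", "b", "d", "f", "g", "h", "k", "l", "m", "n", "p", "r",
                "s", "t", "v", "sk", "sp", "st", "kr", "tr", "str"}

def syllabify(word):
    """Split a word into syllables using the Maximal Onset Principle."""
    # Locate nuclei as maximal runs of vowel letters.
    nuclei, i = [], 0
    while i < len(word):
        if word[i] in VOWELS:
            j = i
            while j < len(word) and word[j] in VOWELS:
                j += 1
            nuclei.append((i, j))
            i = j
        else:
            i += 1
    if not nuclei:
        return [word]
    # For each consonant cluster between two nuclei, give the longest
    # legal-onset suffix to the following syllable.
    boundaries = []
    for (_, prev_end), (next_start, _) in zip(nuclei, nuclei[1:]):
        cluster = word[prev_end:next_start]
        for k in range(len(cluster) + 1):
            if cluster[k:] in LEGAL_ONSETS:
                boundaries.append(prev_end + k)
                break
    # Cut the word at the boundaries.
    pieces, start = [], 0
    for b in boundaries:
        pieces.append(word[start:b])
        start = b
    pieces.append(word[start:])
    return pieces

syllables = syllabify("anker")
```

In "anker" the cluster "nk" is not in the toy inventory, so only "k" moves to the second syllable, whereas in "mestre" the whole cluster "str" is a permitted onset and is kept together.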

16:20Hybrid Approach to Grapheme to Phoneme Conversion for Korean

Jinsik Lee (Pohang University of Science and Technology)
Byeongchang Kim (Catholic University of Daegu)
Gary Geunbae Lee (Pohang University of Science and Technology)

In the grapheme to phoneme conversion problem for Korean, two main approaches have been discussed: knowledge-based and data-driven methods. However, both camps have limitations: the knowledge-based hand-written rules cannot handle some of the pronunciation changes due to the lack of capability of linguistic analyzers and many exceptions; data-driven methods always suffer from data sparseness. To overcome the shortcomings of both camps, this paper presents a novel combination method which effectively integrates two components: (1) a rule-based converting system based on linguistically motivated hand-written rules and (2) a statistical converting system using a Maximum Entropy model. The experimental results clearly show the effectiveness of our proposed method.

16:40Robust LTS rules with the Combilex speech technology lexicon

Korin Richmond (CSTR, Informatics, Edinburgh University)
Robert Clark (CSTR, Informatics, Edinburgh University)
Sue Fitt (CSTR, Informatics, Edinburgh University)

Combilex is a high quality pronunciation lexicon aimed at speech technology applications that has recently been released by CSTR. Combilex benefits from several advanced features. This paper evaluates one of these: the explicit alignment of phones to graphemes in a word. This alignment can help to rapidly develop robust and accurate letter-to-sound (LTS) rules, without needing to rely on automatic alignment methods. To evaluate this, we used Festival's LTS module, comparing its standard automatic alignment with Combilex's explicit alignment. Our results show using Combilex's alignment improves LTS accuracy: 86.50% words correct as opposed to 84.49%, with our most general form of lexicon. In addition, building LTS models is greatly accelerated, as the need to list allowed alignments is removed. Finally, loose comparison with other studies indicates Combilex is a superior quality lexicon in terms of consistency and size.

17:00Letter-to-phoneme conversion by inference of rewriting rules

Vincent Claveau (IRISA - CNRS)

Phonetization is a crucial step for oral document processing. In this paper, a new letter-to-phoneme conversion approach is proposed; it is automatic, simple, portable and efficient. It relies on a machine learning technique initially developed for transliteration and translation; the system infers rewriting rules from examples of words with their phonetic representations. This approach is evaluated in the framework of the Pronalsyl Pascal challenge, which includes several datasets on different languages. The obtained results equal or outperform those of the best known systems. Moreover, thanks to the simplicity of our technique, the inference time of our approach is much lower than those of the best performing state-of-the-art systems.

17:20Online Discriminative Training for Grapheme-to-Phoneme Conversion

Sittichai Jiampojamarn (Department of Computing Science, University of Alberta)
Grzegorz Kondrak (Department of Computing Science, University of Alberta)

We present an online discriminative training approach to grapheme-to-phoneme (g2p) conversion. We employ a many-to-many alignment between graphemes and phonemes, which overcomes the limitations of widely used one-to-one alignments. The discriminative structure-prediction model incorporates input segmentation, phoneme prediction, and sequence modeling in a unified dynamic programming framework. The learning model is able to capture both local context features in inputs and non-local dependency features in sequence outputs. Experimental results show that our system surpasses the state-of-the-art on several data sets.

17:40Using Same-Language Machine Translation to Create Alternative Target Sequences for Text-To-Speech Synthesis

Peter Cahill (University College Dublin)
Jinhua Du (Dublin City University)
Andy Way (Dublin City University)
Julie Carson-Berndsen (University College Dublin)

Modern speech synthesis systems attempt to produce speech utterances from an open domain of words. In some situations, the synthesiser will not have the appropriate units to pronounce some words or phrases accurately but it still must attempt to pronounce them. This paper presents a hybrid machine translation and unit selection speech synthesis system. The machine translation system was trained with English as the source and target language. Rather than the synthesiser only saying the input text as would happen in conventional synthesis systems, the synthesiser may say an alternative utterance with the same meaning. This method allows the synthesiser to overcome the problem of insufficient units at runtime.

Tue-Ses3-S2:
Special Session: Measuring the Rhythm of Speech

Time:Tuesday 16:00 Place:East Wing 4 Type:Special
Chair: Daniel Hirst & Greg Kochanski

#0Investigating Changes in the Rhythm of Maori over Time

Margaret Maclagan (University of Canterbury, New Zealand)
Catherine Watson (University of Auckland, New Zealand)
Jeanette King (University of Canterbury, New Zealand)
Ray Harlow (University of Waikato, New Zealand)
Laura Thompson (University of Auckland, New Zealand)
Peter Keegan (University of Auckland, New Zealand)

Present-day Maori elders comment that the mita (which includes rhythm) of the Maori language has changed over time. This paper presents the first results in a study of the change of Maori rhythm. PVI analyses did not capture this change. Perceptual experiments, using extracts of speech low-pass filtered to 400 Hz, demonstrated that Maori and English speech could be distinguished. Listeners who spoke Maori were more accurate than those who spoke only English. The English and Maori speech of groups of different speakers born at different times was perceived differently, indicating that the rhythm of Maori has indeed changed over time.

#0The Dynamic Dimension of the Global Speech-Rhythm Attributes

Jan Volín (Institute of Phonetics, Charles University in Prague)
Petr Pollák (Faculty of Electrical Engineering, Czech Technical University in Prague)

Recent years have revealed that certain global attributes of speech rhythm can be quite successfully captured with respect to consonantal and vocalic intervals in spoken texts. One of the many problems of this approach lies in complex syllabic structures. Unless we make an a-priori phonological decision, sonorous consonants may contribute to either vocalic or consonantal part of the speech signal in post-initial and pre-final positions of syllabic onsets and codas. A procedure is offered to avoid phonological dilemmas together with tedious manual work. The method is tested on continuous Czech and English texts read out by several professionals.

#0Vowel duration in pre-geminate contexts in Polish

Zofia Malisz (Adam Mickiewicz University, Poznan)

The study presents Polish experimental data on the variability of vowel duration in the context of following singleton and geminate consonants. The aim of the study is to explain the low vocalic variability values obtained from "rhythm metrics" based analyses of speech rhythm. It also aims at contributing to the discussion about current dynamical models of speech rhythm that contain assumptions of the relative temporal stability of the vowel-to-vowel sequence. The results suggest that vowels in Polish co-vary with following consonant length in a roughly proportionate manner. An interpretation of the effect is offered where a fortition process overrides the possibility of temporal compensation. Index Terms: gemination, vowel duration, speech rhythm, Polish

#0Effects of Mora-timing in English Rhythm Control by Japanese Learners

Shizuka Nakamura (Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan)
Hiroaki Kato (National Institute of Information and Communications Technology / Advanced Telecommunications Research Institute International, Japan)
Yoshinori Sagisaka (Graduate School of Global Information and Telecommunication Studies, Waseda University, Japan)

In our previous studies on an objective evaluation of English rhythm control by Japanese learners, we noticed that the accustomed mora-timing of Japanese learners might unfavorably affect English speech of stress-timing. In this paper, we analyzed durational differences between Japanese learners and native speakers in the corresponding speech units such as stressed/unstressed syllable, strong/weak vowel, syllable in content/function word, and closed/open syllable from a perspective of the contrast of stressed/unstressed syllables. It was confirmed that these durational differences caused by mora-timing strongly affected subjective evaluation by native teachers, through correlation analyses of these differences and subjective evaluation scores.

16:00The rhythm of text and the rhythm of utterances: from metrics to models.

Daniel Hirst (CNRS, Aix-Marseille Université, Aix-en-Provence, France)

The typological classification of languages as stress-timed, syllable-timed and mora-timed did not stand up to empirical investigation, which found little or no evidence for the different types of isochrony which had been assumed to be the basis for the classification. In recent years, there has been a renewal of interest with the development of empirical metrics for measuring rhythm. In this paper it is shown that some of these metrics are more sensitive to the rhythm of the text than to the rhythm of the utterance itself. While a number of recent proposals have been made for improving these metrics, it is proposed that what is needed is more detailed studies of large corpora in order to develop more sophisticated models of the way in which prosodic structure is realised in different languages. New data on British English is presented using the Aix-Marsec corpus.

16:20No Time to Lose? Time Shrinking Effects Enhance the Impression of Rhythmic "Isochrony" and Fast Speech Rate

Petra Wagner (Universität Bielefeld)
Andreas Windmann (Universität Bielefeld)

Time Shrinking denotes the psycho-acoustic shrinking effect of a short interval on one or several subsequent longer intervals. Its effectiveness in the domain of speech perception has so far not been examined. Two perception experiments clearly suggest the influence of relative duration patterns triggering time shrinking on the perception of tempo and rhythmical isochrony or rather "evenness". A comparison between the experimental data and duration patterns across various languages suggests a strong influence of time shrinking on the impression of isochrony in speech and perceptual speech rate. Our results thus emphasize the necessity of taking into account relative timing within rhythmical domains such as feet, phrases or narrow rhythm units as a complementary perspective to popular global rhythm variability metrics.

16:40Measuring speech rhythm variation in a model-based framework

Plínio Barbosa (Speech Prosody Studies Group/Dep. of Linguistics/Inst.Est. Ling., Univ. of Campinas, Brazil)

A coupled-oscillators-model-based method for measuring speech rhythm is presented. This model explains cross-linguistic differences in rhythm as deriving from varying degrees of coupling strength between a syllable oscillator and a phrase stress oscillator. The method was applied to three texts read aloud in French, in Brazilian and European Portuguese by seven speakers. The results reproduce the early findings on rhythm typology for these languages/varieties with the following advantages: it successfully accounts for speech rate variation, related to the syllabic oscillator frequency in the model; it takes only syllable-sized units into account, not splitting syllables into vowels and consonants; the consequences of phrase stress magnitude on stress group duration are directly considered; both universal and language-specific aspects of speech rhythm are captured by the model.

17:00Rhythm measures with language-independent segmentation

Anastassia Loukina (Phonetics laboratory, University of Oxford, United Kingdom)
Greg Kochanski (Phonetics laboratory, University of Oxford, United Kingdom)
Chilin Shih (EALC/Linguistics, University of Illinois, Urbana-Champaign, USA)
Elinor Keane (Phonetics laboratory, University of Oxford, United Kingdom)
Ian Watson (Phonetics laboratory, University of Oxford, United Kingdom)

We compare 15 measures of speech rhythm based on an automatic segmentation of speech into vowel-like and consonant-like regions. This allows us to apply identical segmentation criteria to all languages and compute rhythm measures over a large corpus. It may also approximate more closely the segmentation available to pre-lexical infants, who have been claimed to discriminate between languages. We find that within-language variation is large and comparable to the language-to-language differences we observed. We evaluate the success of different measures in separating languages and show that the efficiency of measures depends on the languages included in the corpus. Rhythm appears to be described by two dimensions and different published rhythm measures capture different aspects of it.
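
As a rough illustration of the kind of measure compared here, the sketch below computes three widely used rhythm metrics (%V, a Delta measure, and the normalised PVI) from lists of vowel-like and consonant-like interval durations. The function names, the use of a population standard deviation, and the input format are assumptions made for this example; the abstract does not say which 15 measures were used or how they were implemented.

```python
from statistics import mean, pstdev


def percent_v(vowel_durs, cons_durs):
    """%V: percentage of total duration occupied by vowel-like intervals."""
    total = sum(vowel_durs) + sum(cons_durs)
    return 100.0 * sum(vowel_durs) / total


def delta(durs):
    """Delta measure: standard deviation of interval durations.

    Assumption: population standard deviation; some studies use the
    sample standard deviation instead.
    """
    return pstdev(durs)


def npvi(durs):
    """Normalised Pairwise Variability Index over successive intervals."""
    pairs = zip(durs[:-1], durs[1:])
    terms = [abs(a - b) / ((a + b) / 2.0) for a, b in pairs]
    return 100.0 * mean(terms)


if __name__ == "__main__":
    # Hypothetical durations in seconds, for illustration only.
    vowels = [0.08, 0.15, 0.06, 0.12]
    consonants = [0.07, 0.10, 0.05, 0.09]
    print(percent_v(vowels, consonants), delta(consonants), npvi(vowels))
```

Because all three functions take only interval durations, the same code applies unchanged to any segmentation, manual or automatic.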

Tue-Ses3-P4:
Topics in Spoken Language Processing

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair: Chiori Hori

#1Confidence-Based Techniques for Rapid and Robust Topic Identification of Conversational Telephone Speech

Jonathan Wintrode (US Department of Defense)
Scott Kulp (Rutgers University)

We investigate the impact of automatic speech recognition errors on the accuracy of topic identification in conversational telephone speech. We present a modified TF-IDF feature-weighting calculation that provides significant robustness under various recognition error conditions. For our experiments we take conversations from the Fisher corpus to produce 1-best and lattice outputs using one recognizer tuned to run at various speeds. We use SVM classifiers to perform topic identification on the output. We observe classifiers incorporating confidence information to be significantly more robust to errors than those treating output as unweighted text.
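
One way to fold recognizer confidence into TF-IDF is to replace hard token counts with summed confidences (expected counts). The sketch below shows that idea only; it is an assumption for illustration and not necessarily the modified weighting proposed in the paper, whose formula is not given in the abstract. The names `expected_counts` and `confidence_tfidf` and the `(word, confidence)` input format are hypothetical.

```python
import math
from collections import defaultdict


def expected_counts(doc):
    """Sum confidences per word instead of counting every token as 1.

    doc is a list of (word, confidence) pairs with confidence in [0, 1].
    """
    counts = defaultdict(float)
    for word, conf in doc:
        counts[word] += conf
    return dict(counts)


def confidence_tfidf(docs):
    """Return one {word: weight} vector per document."""
    counts = [expected_counts(d) for d in docs]
    n_docs = len(docs)
    # Document frequency is kept as a hard count in this sketch.
    doc_freq = defaultdict(int)
    for c in counts:
        for word in c:
            doc_freq[word] += 1
    vectors = []
    for c in counts:
        total = sum(c.values()) or 1.0  # avoid division by zero
        vectors.append({
            word: (value / total) * math.log(n_docs / doc_freq[word])
            for word, value in c.items()
        })
    return vectors


if __name__ == "__main__":
    # Hypothetical recognizer output for two conversations.
    docs = [
        [("bill", 0.9), ("the", 0.5)],
        [("the", 1.0), ("game", 0.8)],
    ]
    print(confidence_tfidf(docs))
```

With 1-best output and all confidences set to 1.0 this reduces to ordinary TF-IDF, which makes it easy to compare the weighted and unweighted variants on the same data.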

#2Localization of Speech Recognition in Spoken Dialog Systems: How Machine Translation Can Make Our Lives Easier

David Suendermann (SpeechCycle, Inc.)
Jackson Liscombe (SpeechCycle, Inc.)
Krishna Dayanidhi (SpeechCycle, Inc.)
Roberto Pieraccini (SpeechCycle, Inc.)

The localization of speech recognition for large-scale spoken dialog systems can be a tremendous manual exercise. Usually though, a vast number of transcribed and annotated utterances exists for the source language. In this paper, we propose to use such data and translate it into the target language using machine translation. The translated utterances and their associated (original) annotations are then used to train statistical grammars for all contexts of the target system. As an example, we localize an English spoken dialog system for Internet troubleshooting to Spanish by translating more than 4 million source utterances without any human intervention. In an application of the localized system to more than 10,000 utterances collected on a similar Spanish Internet troubleshooting system, we show that the overall accuracy was only 5.7% worse than that of the English source system.

#3Algorithms for Speech Indexing in Microsoft Recite

Kunal Mukerjee (Microsoft)
Shankar Regunathan (Microsoft)
Jeffrey Cole (Microsoft)

Microsoft Recite is a mobile application to store and retrieve spoken notes. Recite stores and matches n-grams of pattern class identifiers that are designed to be language neutral and handle a large number of out of vocabulary phrases. The query algorithm expects noise and fragmented matches and compensates for them with a heuristic ranking scheme. This contribution describes a class of indexing algorithms for Recite that allows for high retrieval accuracy while meeting the constraints of low computational complexity and memory footprint of embedded platforms. The results demonstrate that a particular indexing scheme within this class can be selected to optimize the trade-off between retrieval accuracy and insertion/query complexity.

#4Parallelized Viterbi Processor for 5,000-Word Large-Vocabulary Real-Time Continuous Speech Recognition FPGA System

Tsuyoshi Fujinaga (Kobe University)
Kazuo Miura (Kobe University)
Hiroki Noguchi (Kobe University)
Hiroshi Kawaguchi (Kobe University)
Masahiko Yoshimoto (Kobe University)

We propose a novel Viterbi processor for large-vocabulary real-time continuous speech recognition. This processor is built with multiple Viterbi cores. Since each core can compute independently, these cores reduce the cycle times very efficiently. To verify the effect of utilizing multiple cores, we implement a dual-core Viterbi processor in an FPGA and achieve a 49% cycle-time reduction compared to a single-core processor. Our proposed dual-core Viterbi processor achieves 5,000-word real-time continuous speech recognition at 65.175 MHz. In addition, it is easy to scale up the number of cores, which makes larger vocabularies achievable.

#5SpLaSH (Spoken Language Search Hawk): integrating time-aligned with text-aligned annotations

Sara Romano (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)
Elvio Cecere (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)
Francesco Cutugno (Natural Language Processing group Department of Physical Sciences, ‘Federico II’ University, Naples, Italy)

In this work we present SpLaSH (Spoken Language Search Hawk), a toolkit used to perform complex queries on spoken language corpora. In SpLaSH, tools for the integration of time aligned annotations (TMA), by means of annotation graphs, with text aligned ones (TXA), by means of generic XML files, are provided. SpLaSH imposes a very limited number of constraints to the data model design, allowing the integration of annotations developed separately within the same dataset and without any relative dependency. It also provides a GUI allowing three types of queries: simple query on TXA or TMA structures, sequence query on TMA structure and cross query on both TXA and TMA integrated structures.

#6PodCastle: Collaborative Training of Acoustic Models on the Basis of Wisdom of Crowds for Podcast Transcription

Jun Ogata (National Institute of Advanced Industrial Science and Technology (AIST))
Masataka Goto (National Institute of Advanced Industrial Science and Technology (AIST))

This paper presents acoustic-model-training techniques for improving automatic transcription of podcasts. A typical approach for acoustic modeling is to create a task-specific corpus including hundreds of hours of speech data and their accurate transcriptions. This approach, however, is impractical in the podcast-transcription task because manually transcribing the large amounts of speech covering all the various types of podcast content would be too costly and time consuming. To solve this problem, we introduce collaborative training of acoustic models on the basis of wisdom of crowds, i.e., the transcriptions of podcast-speech data are generated by anonymous users on our web service PodCastle. We then describe a podcast-dependent acoustic modeling system that uses RSS metadata to deal with the differences of acoustic conditions in podcasts. Our experimental results confirmed the effectiveness of the proposed acoustic model training.

#7A WFST-based Log-linear Framework for Speaking-style Transformation

Graham Neubig (Graduate School of Informatics, Kyoto University)
Shinsuke Mori (Graduate School of Informatics, Kyoto University)
Tatsuya Kawahara (Graduate School of Informatics, Kyoto University)

When attempting to make transcripts from automatic speech recognition results, disfluency deletion, transformation of colloquial expressions, and insertion of dropped words must be performed to ensure that the final product is clean transcript-style text. This paper introduces a system for the automatic transformation of the spoken word to transcript-style language that enables not only deletion of disfluencies, but also substitutions of colloquial expressions and insertion of dropped words. A number of potentially useful features are combined in a log-linear probabilistic framework, and the utility of each is examined. The system is implemented using weighted finite state transducers (WFSTs) to allow for easy combination of features and integration with other WFST-based systems. On evaluation, the best system achieved a 5.37% word error rate, a 5.49% absolute gain over a rule-based baseline and a 1.54% absolute gain over a simple noisy-channel model.

#8ClusterRank: A Graph Based Method for Meeting Summarization

Nikhil Garg (Ecole Polytechnique Fédérale de Lausanne, Switzerland)
Benoit Favre (International Computer Science Institute, Berkeley, USA)
Korbinian Riedhammer (International Computer Science Institute, Berkeley, USA)
Dilek Hakkani-Tür (International Computer Science Institute, Berkeley, USA)

This paper presents an unsupervised, graph based approach for extractive summarization of meetings. Graph based methods such as TextRank have been used for sentence extraction from news articles. These methods model text as a graph with sentences as nodes and edges based on word overlap. A sentence node is then ranked according to its similarity with other nodes. The spontaneous speech in meetings leads to incomplete, ill-formed sentences with high redundancy and calls for additional measures to extract relevant sentences. We propose an extension of the TextRank algorithm that clusters the meeting utterances and uses these clusters to construct the graph. We evaluate this method on the AMI meeting corpus and show a significant improvement over TextRank and other baseline methods.
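
The baseline described above (sentences as nodes, edges weighted by word overlap, nodes ranked by a PageRank-style iteration) can be sketched as follows. This is a minimal TextRank-style ranker written for illustration; it does not include the clustering step that the paper adds, and the whitespace tokenisation, damping factor and iteration count are assumptions.

```python
import math


def overlap(a, b):
    """Word-overlap similarity normalised by the log sentence lengths."""
    wa, wb = set(a.lower().split()), set(b.lower().split())
    if not wa or not wb:
        return 0.0
    denom = math.log(len(wa)) + math.log(len(wb))
    if denom == 0.0:  # both sentences contain a single distinct word
        return 0.0
    return len(wa & wb) / denom


def textrank(sentences, damping=0.85, iterations=50):
    """Return one score per sentence; higher means more central."""
    n = len(sentences)
    # Symmetric weight matrix with no self-loops.
    w = [[overlap(sentences[i], sentences[j]) if i != j else 0.0
          for j in range(n)] for i in range(n)]
    out = [sum(row) for row in w]
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [
            (1 - damping) + damping * sum(
                w[j][i] * scores[j] / out[j] for j in range(n) if out[j] > 0
            )
            for i in range(n)
        ]
    return scores


if __name__ == "__main__":
    # Hypothetical meeting utterances.
    utterances = [
        "the budget meeting is on monday",
        "we discuss the budget on monday",
        "lunch was nice",
    ]
    print(textrank(utterances))
```

An utterance that shares no words with any other receives only the constant term `1 - damping`, which is one reason fragmentary, redundant meeting speech is hard for the plain algorithm.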

#9Leveraging Sentence Weights in a Concept-based Optimization Framework for Extractive Meeting Summarization

Shasha Xie (International Computer Science Institute, Berkeley, CA)
Benoit Favre (International Computer Science Institute, Berkeley, CA)
Dilek Hakkani-Tür (International Computer Science Institute, Berkeley, CA)
Yang Liu (The University of Texas at Dallas, Richardson, TX)

We adopt an unsupervised concept-based global optimization framework for extractive meeting summarization, where a subset of sentences is selected to cover as many important concepts as possible. We propose to leverage sentence importance weights in this model. Three ways are introduced to combine the sentence weights within the concept-based optimization framework: selecting sentences for concept extraction, pruning unlikely candidate summary sentences, and using joint optimization of sentence and concept weights. Our experimental results on the ICSI meeting corpus show that our proposed methods can significantly improve the performance for both human transcripts and ASR output compared to the baseline of the concept-based approach, and this unsupervised approach achieves results comparable with those from supervised learning approaches presented in previous work.
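
The selection objective described above, choosing a subset of sentences that covers the most concept weight within a length budget, can be illustrated on a toy scale by exhaustive search. The sketch below is an assumption-laden illustration, not the authors' formulation: the abstract does not state the solver, and brute force is only feasible for a handful of sentences. The sentence-weight extensions are not shown, and the name `best_summary` and the data layout are hypothetical.

```python
from itertools import combinations


def best_summary(sentences, concept_weights, budget):
    """Pick the subset of sentences with maximal covered concept weight.

    sentences: list of (length, set_of_concepts) tuples.
    concept_weights: {concept: weight}; each concept counts once.
    budget: maximum total length of the selected sentences.
    Returns (score, tuple_of_selected_indices).
    """
    best = (0.0, ())
    indices = range(len(sentences))
    for k in range(1, len(sentences) + 1):
        for subset in combinations(indices, k):
            if sum(sentences[i][0] for i in subset) > budget:
                continue
            covered = set().union(*(sentences[i][1] for i in subset))
            score = sum(concept_weights.get(c, 0.0) for c in covered)
            if score > best[0]:
                best = (score, subset)
    return best


if __name__ == "__main__":
    # Hypothetical sentences: (length in words, concepts mentioned).
    sents = [
        (5, {"budget", "deadline"}),
        (4, {"budget"}),
        (3, {"hiring"}),
    ]
    weights = {"budget": 3.0, "deadline": 2.0, "hiring": 1.5}
    print(best_summary(sents, weights, budget=8))
```

Because a concept is credited only once however many selected sentences mention it, the objective discourages redundant sentences without needing an explicit redundancy penalty.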

#10Hybrids of Supervised and Unsupervised Models for Extractive Speech Summarization

Shih-Hsiang Lin (National Taiwan Normal University)
Yueng-Tien Lo (National Taiwan Normal University)
Yao-Ming Yeh (National Taiwan Normal University)
Berlin Chen (National Taiwan Normal University)

Speech summarization, distilling important information and removing redundant and incorrect information from spoken documents, has become an active area of intensive research in the recent past. In this paper, we consider hybrids of supervised and unsupervised models for extractive speech summarization. Moreover, we investigate the use of the unsupervised summarizer to improve the performance of the supervised summarizer when manual labels are not available for training the latter. A novel training data selection and relabeling approach designed to leverage the inter-document and/or the inter-sentence similarity information is explored as well. Encouraging results were initially demonstrated.

#11Automatic Detection of Audio Advertisements

Dan Melamed (AT&T Labs-Research)
Yeon-Jun Kim (AT&T Labs-Research)

Quality control analysts in customer service call centers often search for keywords in call transcripts. Their searches can return an overwhelming number of false positives when the search terms also appear in advertisements that customers hear while they are on hold. This paper presents new methods for detecting advertisements in audio data, so that they can be filtered out. In order to be usable in real-world applications, our methods are designed to minimize human intervention after deployment. Even so, they are much more accurate than a baseline HMM method.

#12Named Entity Network based on Wikipedia

Sameer Maskey (IBM Research)
Wisam Dakka (Google)

Named Entities (NEs) play an important role in many natural language and speech processing tasks. A resource that identifies relations between NEs could potentially be very useful. We present such an automatically generated knowledge resource from Wikipedia, the Named Entity Network (NE-NET), which provides a list of related NEs and the degree of relation for any given NE. Unlike some manually built knowledge resources, NE-NET has wide coverage, consisting of 1.5 million NEs represented as nodes of a graph with 6.5 million arcs relating them. NE-NET also provides the ranks of the related NEs using a simple ranking function that we propose. In this paper, we present NE-NET and our experiments showing how NE-NET can be used to improve the retrieval of spoken (Broadcast News) and text documents.

Tue-Ses3-P2:
ASR: Acoustic Modelling

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair:Simon King

#1Combined Discriminative Training for Multi-Stream HMM-based Audio-Visual Speech Recognition

Jing Huang (IBM Research)
Karthik Visweswariah (IBM Research)

In this paper we investigate discriminative training of models and feature space for a multi-stream HMM-based audio-visual speech recognizer (AVSR). Since the two streams are used together in decoding, we propose to train the parameters of the two streams jointly. This is in contrast to prior work which has considered discriminative training of parameters in each stream independent of the other. In experiments on a 20-speaker one-hour speaker independent test set, we obtain 22% relative gain on AVSR performance over A/V models whose parameters are trained separately, and 50% relative gain on AVSR over the baseline maximum-likelihood models. On a noisy (mismatched to training) test set, we obtain 21% relative gain over A/V models whose parameters are trained separately. This represents 30% relative improvement over the maximum-likelihood baseline.

#2Cued Speech Recognition for Augmentative Communication in Normal-hearing and Hearing-impaired Subjects

Panikos Heracleous (GIPSA-lab, Speech and Cognition Department)
Denis Beautemps (GIPSA-lab, Speech and Cognition Department)
Noureddine Aboutabit (GIPSA-lab, Speech and Cognition Department)

Speech is the most natural means of communication for humans. However, in situations where audio speech is not available or cannot be perceived because of disabilities or adverse environmental conditions, people may resort to alternative methods such as augmented speech, i.e. audio speech supplemented or replaced by other modalities, such as audiovisual speech, or Cued Speech. Cued Speech is a visual communication mode, which uses lipreading and handshapes placed in different positions to make spoken language wholly understandable to deaf individuals. The current study reports the authors' activities and progress in Cued Speech recognition for French. Previously, the authors have reported experimental results for vowel and consonant recognition in Cued Speech for French in the case of a normal-hearing subject. The study has been extended by also employing a deaf cuer, and both cuer-dependent and multi-cuer experiments based on hidden Markov models (HMM) have been conducted.

#3On Acquiring Speech Production Knowledge from Articulatory Measurements

Daniel Neiberg (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)
Gopal Ananthakrishnan (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)
Mats Blomberg (Department of Speech Music and Hearing (TMH), CSC, KTH, Stockholm, Sweden)

The paper proposes a general version of a coupled Hidden Markov/Bayesian Network model for performing phoneme recognition on acoustic-articulatory data. The model uses knowledge learned from the articulatory measurements, available for training, for phoneme recognition on the acoustic input. After training on the articulatory data, the model is able to predict 71.5% of the articulatory state sequences using the acoustic input. Using optimized parameters, the proposed method shows a slight improvement for two speakers over the baseline phoneme recognition system which does not use articulatory knowledge. However, the improvement is only statistically significant for one of the speakers. While there is an improvement in recognition accuracy for the vowels, diphthongs and to some extent the semi-vowels, there is a decrease in accuracy for the remaining phonemes.

#4Measuring the gap between HMM-based ASR and TTS

John Dines (Idiap Research Institute)
Junichi Yamagishi (University of Edinburgh)
Simon King (University of Edinburgh)

The EMIME project is conducting research in the development of technologies for mobile, personalised speech-to-speech translation. The hidden Markov model is being used as the underlying technology in both automatic speech recognition and text-to-speech synthesis; thus, the investigation of unified statistical models has become an implicit goal of our research. As one of the first steps towards this goal, we have been investigating commonalities and differences between HMM-based ASR and TTS. In this paper we present results and analysis of a series of experiments that have been conducted with English ASR and TTS, measuring performance with respect to phone set and lexicon, feature extraction and HMM topology. Our results show that, although the fundamental statistical model may be essentially the same, optimal ASR and TTS performance may demand diametrically opposed system designs. This represents a major challenge to be addressed in the investigation of unified models.

#5Speech recognition with speech synthesis models by marginalising over decision tree leaves

John Dines (Idiap Research Institute)
Lakshmi Saheer (Idiap Research Institute)
Hui Liang (Idiap Research Institute)

There has been increasing interest in the use of unsupervised adaptation for the personalisation of text-to-speech (TTS), particularly in the context of speech-to-speech translation. This requires that we are able to generate adaptation transforms from the output of an automatic speech recognition (ASR) system. An approach that utilises unified ASR and TTS models would seem to offer an ideal mechanism for the application of unsupervised adaptation to TTS since transforms could be shared between ASR and TTS. Such unified models should use a common set of parameters. A major barrier to such parameter sharing is the use of differing contexts in ASR and TTS. In this paper we propose a simple approach that generates ASR models from a trained set of TTS models by marginalising over the TTS contexts that are not used by ASR. We present preliminary results of our proposed method on a large vocabulary speech recognition task and provide insights into future directions of this work.

#6Detailed description of triphone model using SSS-free algorithm

Motoyuki Suzuki (Institute of Technology and Science, The University of Tokushima)
Daisuke Honma (Graduate School of Engineering, Tohoku University)
Akinori Ito (Graduate School of Engineering, Tohoku University)
Shozo Makino (Graduate School of Engineering, Tohoku University)

The triphone model is frequently used as an acoustic model. It is effective for modeling phonetic variations caused by coarticulation. However, it is known that acoustic features of phonemes are also affected by other factors such as speaking style and speaking speed. In this paper, a new acoustic model is proposed. All training data which have the same phoneme context are automatically clustered into several clusters based on acoustic similarity, and a “sub-triphone” is trained using the training data corresponding to each cluster. In experiments, the sub-triphone model achieved about 5% higher phoneme accuracy than the triphone model.

#7Decision Tree Acoustic Models for ASR

Jitendra Ajmera (Toshiba)
Masami Akamine (Toshiba)

This paper presents a summary of our research progress using decision-tree acoustic models (DTAM) for large vocabulary speech recognition. Various configurations of training DTAMs are proposed and evaluated on the Wall Street Journal (WSJ) task. A number of different acoustic and categorical features have been used for this purpose. Various ways of realizing a forest instead of a single tree have been presented and shown to improve recognition accuracy. Although the performance is not shown to be better than Gaussian mixture models (GMMs), several advantages of DTAMs have been highlighted and exploited. These include compactness, computational simplicity and ability to handle unordered information.

#8Compression Techniques Applied to Multiple Speech Recognition Systems

Catherine Breslin (Toshiba Research Europe Ltd)
Matt Stuttle (Toshiba Research Europe Ltd)
Kate Knill (Toshiba Research Europe Ltd)

Speech recognition systems typically contain many Gaussian distributions, and hence a large number of parameters. This makes them both slow to decode speech, and large to store. Techniques have been proposed to decrease the number of parameters. One approach is to share parameters between multiple Gaussians, thus reducing the total number of parameters and allowing for shared likelihood calculation. Gaussian tying and subspace clustering are two related techniques which take this approach to system compression. These techniques can decrease the number of parameters with no noticeable drop in performance for single systems. However, multiple acoustic models are often used in real speech recognition systems. This paper considers the application of Gaussian tying and subspace compression to multiple systems. Results show that two speech recognition systems can be modelled using the same number of Gaussians as just one system, with little effect on individual system performance.

#9Graphical Models for Discrete Hidden Markov Models in Speech Recognition

Antonio Miguel (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Luis Buera (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

Emission probability distributions in speech recognition have traditionally been associated with continuous random variables. The most successful models have been the mixtures of Gaussians in the states of the hidden Markov models to generate/capture observations. In this work we show how graphical models can be used to extract the joint information of more than two features. This is possible if we previously quantize the speech features to a small number of levels and model them as discrete random variables. This paper presents a method to estimate a graphical model with a constrained number of dependencies, which is a subset of the directed acyclic graph based model framework, Bayesian networks. Some experimental results are obtained with this method and compared to baseline systems with full and diagonal covariance matrices.

#10Factor Analyzed HMM Topology for Speech Recognition

Chuan-Wei Ting (National Cheng Kung University)
Jen-Tzung Chien (National Cheng Kung University)

This paper presents a new factor analyzed (FA) similarity measure between two Gaussian mixture models (GMMs). An adaptive hidden Markov model (HMM) topology is built to compensate for the pronunciation variations in speech recognition. Our idea aims to evaluate whether the variation of an HMM state from new speech data is significant or not and judge if a new state should be generated in the models. Due to the effectiveness of FA data analysis, we measure the GMM similarity by estimating the common factors and specific factors embedded in the HMM means and variances. Similar Gaussian densities are represented by the common factors. Specific factors express the residual of the similarity measure. We perform a composite hypothesis test due to common factors as well as specific factors. An adaptive HMM topology is accordingly established from continuous collection of training utterances. Experiments show that the proposed FA measure outperforms other measures with a comparable number of parameters.

#11Tied-State Multi-path HMnet Model using Three-Domain Successive State Splitting

Soo-Young Suk (Speech Processing Group, Information Technology Research Institute, AIST)
Hiroaki Kojima (Speech Processing Group, Information Technology Research Institute, AIST)

In this paper, we address the improvement of an acoustic model using the multi-path Hidden Markov network (HMnet) model for automatically creating non-uniform tied-state, context-dependent hidden Markov model topologies. Recent research has achieved multi-path model topologies in order to improve the recognition performance in gender-independent, spontaneous-speaking applications. However, the multi-path acoustic model size may increase and require more training samples depending on the increased number of paths. To solve this problem, we used a tied-state multi-path topology by which we can create a three-domain successive state splitting method to which environmental splitting is added. This method can obtain a suitable model topology with small mixture components. Experiments demonstrated that the proposed multi-path HMnet model performs better than single-path models for the same number of states.

#12Acoustic Modeling Using Exponential Families

Vaibhava Goel (IBM)
Peder Olsen (IBM)

We present a framework to utilize general exponential families for acoustic modeling. Maximum Likelihood (ML) parameter estimation is carried out using sampling based estimates of the partition function and expected feature vector. Markov Chain Monte Carlo procedures are used to draw samples from general exponential densities. We apply our ML estimation framework to two new exponential families to demonstrate the modeling flexibility afforded by this framework.

Tue-Ses3-P1:
Single- and Multichannel Speech Enhancement

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster

#1Watermark Recovery From Speech Using Inverse Filtering And Sign Correlation

Robert Morris (SPAWAR Systems Center Pacific)
Ralph Johnson (SPAWAR Systems Center Pacific)
Vladimir Goncharoff (University of Illinois at Chicago)
Joseph DiVita (SPAWAR Systems Center Pacific)

This paper presents an improved method for asynchronous embedding and recovery of sub-audible watermarks in speech signals. The watermark, a sequence of DTMF tones, was added to speech without knowledge of its time-varying characteristics. Watermark recovery began by implementing a synchronized zero-phase inverse filtering operation to decorrelate the speech during its voiced segments. The final step was to apply the sign correlation technique, which resulted in performance advantages over linear correlation detection. Our simulations include the effects of finite word length in the correlator.
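The contrast between sign correlation and linear correlation mentioned in the abstract above can be illustrated with a minimal sketch. The function names and the normalization are assumptions made for illustration; the abstract does not specify the authors' detector, the inverse filter, or the DTMF template.

```python
import math

def sign(value):
    # Hypothetical helper: maps a sample to +1 or -1 (zero counts as +1).
    return 1 if value >= 0 else -1

def sign_correlation(signal, template):
    """Average product of sample signs; lies in [-1, 1]."""
    if len(signal) != len(template):
        raise ValueError("sequences must have equal length")
    total = sum(sign(a) * sign(b) for a, b in zip(signal, template))
    return total / len(signal)

def linear_correlation(signal, template):
    """Normalized linear (amplitude) correlation; lies in [-1, 1]."""
    if len(signal) != len(template):
        raise ValueError("sequences must have equal length")
    num = sum(a * b for a, b in zip(signal, template))
    den = math.sqrt(sum(a * a for a in signal) * sum(b * b for b in template))
    return num / den if den else 0.0
```

Because only the signs of the samples enter the sum, a single large-amplitude outlier changes the sign correlation by at most 2/N, whereas it can dominate the linear correlation.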

#2Weighted Linear Prediction for Speech Analysis in Noisy Conditions

Jouni Pohjalainen (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)
Heikki Kallasjoki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Kalle Palomäki (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Mikko Kurimo (Adaptive Informatics Research Centre, Helsinki University of Technology, FI-02015 TKK, Finland)
Paavo Alku (Dept. Signal Processing and Acoustics, Helsinki University of Technology, FI-02015 TKK, Finland)

Following earlier work, we modify linear predictive (LP) speech analysis by including temporal weighting of the squared prediction error in the model optimization. In order to focus this so called weighted LP model on the least noisy signal regions in the presence of stationary additive noise, we use short-time signal energy as the weighting function. We compare the noisy spectrum analysis performance of weighted LP and its recently proposed variant, the latter guaranteed to produce stable synthesis models. As a practical test case, we use automatic speech recognition to verify that the weighted LP methods improve upon the conventional FFT and LP methods by making spectrum estimates less prone to corruption by additive noise.
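As a rough illustration of the weighted LP idea in the abstract above, the sketch below minimizes the weighted squared prediction error sum_n w[n] (x[n] - sum_k a[k] x[n-k])^2 by solving the weighted normal equations. It assumes that the weight w[n] is the energy of the `window` samples preceding n; the function names, the default window length, and the plain Gaussian-elimination solver are illustrative choices, not details taken from the paper. The stabilized variant mentioned in the abstract is not implemented.

```python
def short_time_energy(x, n, window):
    # Assumed weighting: energy of the `window` samples preceding index n.
    start = max(0, n - window)
    return sum(v * v for v in x[start:n])

def solve(matrix, rhs):
    # Gaussian elimination with partial pivoting (assumes a nonsingular system).
    size = len(rhs)
    aug = [row[:] + [b] for row, b in zip(matrix, rhs)]
    for col in range(size):
        pivot = max(range(col, size), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, size):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, size + 1):
                aug[r][c] -= factor * aug[col][c]
    sol = [0.0] * size
    for r in range(size - 1, -1, -1):
        acc = aug[r][size] - sum(aug[r][c] * sol[c] for c in range(r + 1, size))
        sol[r] = acc / aug[r][r]
    return sol

def weighted_lp(x, order, window=None):
    """Return predictor coefficients a[0..order-1] for x[n] ~ sum_k a[k-1] * x[n-k]."""
    if window is None:
        window = order
    corr = [[0.0] * order for _ in range(order)]
    cross = [0.0] * order
    for n in range(order, len(x)):
        w = short_time_energy(x, n, window)
        past = [x[n - k] for k in range(1, order + 1)]
        for i in range(order):
            cross[i] += w * x[n] * past[i]
            for j in range(order):
                corr[i][j] += w * past[i] * past[j]
    return solve(corr, cross)
```

With all weights set to 1 this reduces to the conventional covariance-method LP; the energy weighting emphasizes high-energy regions, which tend to have a higher local SNR under stationary additive noise.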

#3Log-Spectral Magnitude MMSE Estimators under Super-Gaussian Densities

Richard Christian Hendriks (Delft University of Technology)
Richard Heusdens (Delft University of Technology)
Jesper Jensen (Oticon A/S)

Despite the fact that histograms of speech DFT coefficients are super-Gaussian, not much attention has been paid to developing estimators under these super-Gaussian distributions in combination with perceptually meaningful distortion measures. In this paper we present log-spectral magnitude MMSE estimators under super-Gaussian densities, resulting in an estimator that is perceptually more meaningful and in line with measured histograms of speech DFT coefficients. Compared to state-of-the-art reference methods, the presented estimator improves the segmental SNR by roughly 0.5 dB to 1 dB. Moreover, listening tests show a significant improvement for the presented estimator over state-of-the-art methods.

#4Speech enhancement in a 2-dimensional area based on power spectrum estimation of multiple areas with investigation of existence of active sources

Yusuke Hioka (NTT Cyber Space Laboratories, NTT Corporation)
Kenichi Furuya (NTT Cyber Space Laboratories, NTT Corporation)
Yoichi Haneda (NTT Cyber Space Laboratories, NTT Corporation)
Akitoshi Kataoka (Faculty of Science and Technology, Ryukoku University)

A microphone array that emphasizes sound sources located in a particular 2-dimensional area is described. We previously developed a method that estimates the power spectra of target and noise sounds using multiple fixed beamformers. However, that method requires the areas where the noise sources are located to be restricted. We describe the principle behind this limitation and then propose a procedure that first investigates whether a sound source may exist in the target area and in other areas, in order to reduce the number of unknown power spectra to be estimated.

#5Modulation Domain Spectral Subtraction for Speech Enhancement

Kuldip Paliwal (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Belinda Schwerin (Signal Processing Laboratory, Griffith University, Queensland, Australia)
Kamil Wojcicki (Signal Processing Laboratory, Griffith University, Queensland, Australia)

In this paper we investigate the modulation domain as an alternative to the acoustic domain for speech enhancement. More specifically, we wish to determine how competitive the modulation domain is for spectral subtraction as compared to the acoustic domain. For this purpose, we extend the traditional analysis-modification-synthesis framework to include modulation domain processing. We then compensate the noisy modulation spectrum for additive noise distortion by applying the spectral subtraction algorithm in the modulation domain. Using subjective listening tests and objective speech quality evaluation we show that the proposed method results in improved speech quality. Furthermore, applying spectral subtraction in the modulation domain does not introduce the musical noise artifacts that are typically present after acoustic domain spectral subtraction. The proposed method also achieves better background noise reduction than the MMSE method.

#6Variational Loopy Belief Propagation for Multi-talker Speech Recognition

Steven Rennie (IBM)
John Hershey (IBM)
Peder Olsen (IBM)

We address single-channel speech separation and recognition by combining loopy belief propagation and variational inference methods. Inference is done in a graphical model consisting of an HMM for each speaker combined with the max interaction model of source combination. We present a new variational inference algorithm that exploits the structure of the max model to compute an arbitrarily tight bound on the probability of the mixed data. The variational parameters are chosen so that the algorithm scales linearly in the size of the language and acoustic models, and quadratically in the number of sources. The algorithm scores 30.7% on the SSC task [Cooke:09], which is the best published result by a method that scales linearly with speaker model complexity to date. The algorithm achieves average recognition error rates of 27%, 35%, and 51% on small datasets of SSC-derived speech mixtures containing two, three, and four sources, respectively, using a single audio channel.

#7Enhancement of Binaural Speech Using Codebook Constrained Iterative Binaural Wiener Filter

Nadir Cazi (Indian Institute of Science, Bangalore)
Thippur Sreenivas (Indian Institute of Science, Bangalore)

A clean speech VQ codebook has been shown to be effective in providing intraframe constraints and hence better convergence of the iterative Wiener filtering scheme for single channel speech enhancement. Here we present an extension of the single channel CCIWF scheme to binaural speech input by incorporating a speech distortion weighted multi-channel Wiener filter. The new algorithm shows considerable improvement over single channel CCIWF in each channel, in a diffuse noise field environment, in terms of a posteriori SNR and a speech intelligibility measure. Next, considering a moving speech source, good tracking performance is seen, up to a certain resolution.

#8A Semi-blind Source Separation Method with A Less Amount of Computation Suitable for Tiny DSP Modules

Kazunobu Kondo (Yamaha Corporation)
Makoto Yamada (Yamaha Corporation)
Hideki Kenmochi (Yamaha Corporation)

In this paper, we propose a method of implementing FDICA on tiny DSP modules. Firstly, we show a semi-blind separation matrix initialization step that consists of an estimation method using covariance fitting for a known source and an unknown source. It contributes to faster convergence and a smaller amount of computation. Secondly, a learning band selection step is shown that uses the determinant of the covariance matrix as a criterion for selection; this achieves a significant reduction in the amount of computation with practical separation performance. Finally, the effectiveness of the proposed method is evaluated via source separation simulations in anechoic and reverberant rooms, and a procedure and a resource presumption for the integrated method, which we call tinyICA, are also shown.

#9Model-based Speech Separation: Identifying Transcription using Orthogonality

Siu Wa Lee (The Chinese University of Hong Kong)
Frank K. Soong (Microsoft Research Asia)
Tan Lee (The Chinese University of Hong Kong)

Spectral envelopes and harmonics are the building elements of a speech signal. By estimating these elements, individual speech sources in a mixture observation can be reconstructed and hence separated. Transcription gives the spoken content. More importantly, it describes the expected sequence of spectral envelopes, provided that models of the different speech sounds are available. Our recently proposed single-microphone speech separation algorithm exploits this to derive the spectral envelope trajectories of individual sources and remove interference accordingly. This paper investigates the relationship between the correctness of transcription hypotheses and the orthogonality of associated source estimates. An orthogonality measure is introduced to quantify the correlation between spectrograms. Experiments verify that underlying true transcriptions lead to a salient orthogonality distribution, which is distinguishable from that of counterfeit transcriptions.

#10Enhanced Minimum Statistics Technique Incorporating Soft Decision For Noise Suppression

Yun-Sik Park (Inha University)
Ji-Hyun Song (Inha University)
Jae-Hun Choi (Inha University)
Joon-Hyuk Chang (Inha University)

In this paper, we propose a novel approach to noise power estimation for robust noise suppression in noisy environments. An investigation of state-of-the-art techniques for noise power estimation shows that the previously known methods are accurate mostly either during speech absence or during speech presence, but none of them works well in both situations. Our approach combines minimum statistics (MS) and soft decision (SD) techniques based on the probability of speech absence. The performance of the proposed approach is evaluated by a quantitative comparison method and a subjective test under various noise environments, and is found to yield better results than conventional MS and SD-based schemes.

#11Effect of Noise Reduction on Reaction Time to Speech in Noise

Mark Huckvale (UCL)
Jayne Leak (UCL)

In moderate levels of noise, listeners report that noise reduction (NR) processing can improve the perceived quality of a speech signal as measured on a typical MOS rating scale. Most quantitative experiments of intelligibility, however, show that NR reduces the intelligibility of noisy speech signals, and so should be expected to increase the cognitive effort required to process utterances. To study cognitive effort we look at how NR affects reaction times to speech in noise, using material that is still highly intelligible. We show that adding noise increases reaction times and that NR does not restore reaction times back to the quiet condition. The implication is that NR does not make speech "easier" to process, at least as far as this task is concerned.

#12Joint Noise Reduction and Dereverberation of Speech Using Hybrid TF-GSC and Adaptive MMSE Estimator

Behdad Dashtbozorg (Yazd University)
Hamid Reza Abutalebi (Yazd University)

This paper proposes a new multichannel hybrid method for dereverberation of speech signals in noisy environments. This method extends the use of a hybrid noise reduction method for dereverberation which is based on the combination of a Generalized Sidelobe Canceller (GSC) and a single-channel noise reduction stage. In this research, we employ the Transfer Function GSC (TF-GSC), which is more suitable for dereverberation. The single-channel stage is an Adaptive Minimum Mean-Square Error (AMMSE) spectral amplitude estimator. We also modify the AMMSE estimator for the dereverberation application. Experimental results demonstrate the superiority of the proposed method in dereverberation of speech signals in noisy environments.

#13A Study on Multiple Sound Source Localization with a Distributed Microphone System

Kook Cho (Ritsumeikan University)
Takanobu Nishiura (Ritsumeikan University)
Yoichi Yamashita (Ritsumeikan University)

This paper describes a novel method for multiple sound source localization and its performance evaluation in actual room environments. The proposed method localizes a sound source by finding the position that maximizes the accumulated correlation coefficient between multiple channel pairs. After the estimation of the first sound source, a typical pattern of the accumulated correlation for a single sound source is subtracted from the observed distribution of the accumulated correlation. Subsequently, the second sound source is searched again. To evaluate the effectiveness of the proposed method, experiments of multiple sound source localization were carried out in an actual office room. The result shows that multiple sound source localization accuracy is about 99.7%.

#14Robust Minimal Variance Distortionless Speech Power Spectra Enhancement

Tao Yu (CRSS: Center for Robust Speech Systems, University of Texas at Dallas, Texas, USA)
John H. L. Hansen (CRSS: Center for Robust Speech Systems, University of Texas at Dallas, Texas, USA)

In this study, we propose a novel minimal variance distortionless speech power spectral enhancement algorithm, which is robust to real-world implementation issues. Our proposed method is implemented in the power spectral domain, where stochastic noise can be modeled by an exponential distribution, whose non-Gaussianity is explored by an order statistics filter. Both theoretical and experimental results show the effectiveness of our proposed method over traditional ones.

#15Speech Enhancement Minimizing Generalized Euclidean Distortion Using Supergaussian Priors

Amit Das (University of Colorado, Boulder and University of Texas, Dallas)
John H. L. Hansen (University of Texas, Dallas)

We introduce short time spectral estimators which minimize the weighted Euclidean distortion (WED) between the clean and estimated speech spectral components when clean speech is degraded by additive noise. The traditional minimum mean square error (MMSE) estimator does not take perceptual measures sufficiently into account during enhancement of noisy speech. However, the new estimators discussed in this paper provide greater flexibility to improve speech quality. We explore the cases when clean speech spectral magnitude and discrete Fourier transform (DFT) coefficients are modeled by super-Gaussian priors such as Chi and bilateral Gamma distributions, respectively. We also present the joint maximum a posteriori (MAP) estimators of the Chi distributed spectral magnitude and uniform phase. Performance evaluations over two noise types and three SNR levels demonstrate improved results of the proposed estimators.

#16STFT-Based Speech Enhancement by Reconstructing the Harmonics

Iman Haji Abolhassani (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)
Sid-Ahmed Selouani (Université de Moncton, Campus de Shippagan, Canada)
Douglas O'Shaughnessy (INRS-Energie-Matériaux-Télécommunications, Montréal, Canada)

A novel Short Time Fourier Transform (STFT) based speech enhancement method is introduced. This method enhances the magnitude spectrum of a noisy speech segment. The new idea in this method is to reconstruct the harmonics at the multiples of the fundamental frequency (F0) rather than trying to improve them. The harmonics are produced, in the magnitude spectrum, using knowledge of the window function used for the STFT. These harmonics are then scaled and placed at multiples of F0. Experimental results demonstrate the effectiveness of this enhancement method in various noisy conditions and at various SNRs.

#17Joint Speech Enhancement and Speaker Identification Using Monte Carlo Methods

Ciira wa Maina (Drexel University)
John MacLaren Walsh (Drexel University)

We present an approach to speaker identification using noisy speech observations where the speech enhancement and speaker identification tasks are performed jointly. This is motivated by the belief that human beings perform these tasks jointly and that optimality may be sacrificed if sequential processing is used. We employ a Bayesian approach where the speech features are modeled using a mixture of Gaussians prior. A Gibbs sampler is used to estimate the speech source and the identity of the speaker. Preliminary experimental results are presented comparing our approach to a maximum likelihood approach and demonstrating the ability of our method to both enhance speech and identify speakers.

Tue-Ses3-P3:
Assistive Speech Technology

Time:Tuesday 16:00 Place:Hewison Hall Type:Poster
Chair:Elmar Noeth

#1Personalizing synthetic voices for people with progressive speech disorders: judging voice similarity

Sarah Creer (University of Sheffield)
Stuart Cunningham (University of Sheffield)
Phil Green (University of Sheffield)
Kaniz Fatema (University of Kent)

In building personalized synthetic voices for people with speech disorders, the output should capture the individual's vocal identity. This paper reports a listener judgment experiment on the similarity of Hidden Markov Model based synthetic voices using varying amounts of adaptation data to two non-impaired speakers. We conclude that around 100 sentences of data are needed to build a voice that retains the characteristics of the target speaker, but that using more data improves the voice. Experiments using Multi-Layer Perceptrons (MLPs) are conducted to find which acoustic features contribute to the similarity judgments. Results show that mel-cepstral distortion and fraction of voicing agreement contribute most to replicating the similarity judgment, but the combination of all features is required for accurate prediction. Ongoing work applies the findings to voice building for people with impaired speech.

#2Electrolaryngeal Speech Enhancement Based on Statistical Voice Conversion

Keigo Nakamura (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology)

This paper proposes a speaking-aid system for laryngectomees using GMM-based voice conversion that converts electrolaryngeal speech (EL speech) to normal speech. Because valid F0 information cannot be obtained from the EL speech, we have so far converted the EL speech to whisper. This paper performs EL speech conversion to normal speech using F0 contours estimated from the spectral information of the EL speech. The converted normal speech is experimentally evaluated to demonstrate that it is preferred. Moreover, this paper experimentally investigates the output speech of our aid systems, that is, whisper or normal speech.

#3Age Recognition for Spoken Dialogue Systems: Do We Need It?

Maria Wolters (CSTR, University of Edinburgh)
Ravichander Vipperla (CSTR, University of Edinburgh)
Steve Renals (CSTR, University of Edinburgh)

When deciding whether to adapt relevant aspects of the system to the particular needs of older users, spoken dialogue systems often rely on automatic detection of chronological age. In this paper, we show that vocal ageing as measured by acoustic features is an unreliable indicator of the need for adaptation. Simple lexical features greatly improve the prediction of both relevant aspects of cognition and interaction style. Lexical features also boost age group prediction. We suggest that adaptation should be based on observed behaviour, not on chronological age, unless it is not feasible to build classifiers for relevant adaptation decisions.

#4Speech-based and Multimodal Media Center for Different User Groups

Markku Turunen (University of Tampere)
Jaakko Hakulinen (University of Tampere)
Aleksi Melto (University of Tampere)
Juho Hella (University of Tampere)
Juha-Pekka Rajaniemi (University of Tampere)
Erno Mäkinen (University of Tampere)
Jussi Rantala (University of Tampere)
Tomi Heimonen (University of Tampere)
Tuuli Laivo (University of Tampere)
Hannu Soronen (Tampere University of Technology)
Mervi Hansen (Tampere University of Technology)
Pellervo Valkama (University of Tampere)
Toni Miettinen (University of Tampere)
Roope Raisamo (University of Tampere)

We present a multimodal media center interface based on speech input, gestures, and haptic feedback. For special user groups, including visually and physically impaired users, the application features a zoomable context + focus GUI in tight combination with speech output and full speech-based control. These features have been developed in cooperation with representatives of the user groups. Evaluations of the system with regular users have been conducted, and results from a study in which subjective evaluations were collected show that the performance and user experience of speech input were very good, similar to results from a ten-month public pilot use.

#5Virtual Speech Reading Support for Hard of Hearing in a Domestic Multi-Media Setting

Samer Al Moubayed (KTH Centre for Speech Technology, Stockholm, Sweden)
Jonas Beskow (KTH Centre for Speech Technology, Stockholm, Sweden)
Anne-Marie Öster (KTH Centre for Speech Technology, Stockholm, Sweden)
Giampiero Salvi (KTH Centre for Speech Technology, Stockholm, Sweden)
Björn Granström (KTH Centre for Speech Technology, Stockholm, Sweden)
Nic Van Son (Viataal, Nijmegen, The Netherlands)
Ellen Ormel (Viataal, Nijmegen, The Netherlands)

In this paper we present recent results on the development of the SynFace lip-synchronized talking head towards multilinguality, varying signal conditions and noise robustness in the Hearing at Home project. We then describe the large-scale user studies with hearing-impaired people carried out for three languages. The user tests focus on measuring the gain in Speech Reception Threshold (SRT) in noise when using SynFace, and on measuring the effort scaling when SynFace is used by hearing-impaired people. Preliminary analysis of the results does not show a significant gain in SRT or in effort scaling. However, looking at inter-subject variability, it is clear that many subjects benefit from SynFace, especially for speech in stereo babble noise.

#6Real-Time Correction of Closed-Captions

Patrick Cardinal (Centre de Recherche Informatique de Montréal)
Gilles Boulianne (Centre de Recherche Informatique de Montréal)

Live closed-captions for deaf and hard of hearing audiences are currently produced by stenographers, or by voice writers using speech recognition. Both techniques can produce captions with errors. We are currently developing a correction module that allows a user to intercept the real-time caption stream and correct it before it is broadcast. We report results of preliminary experiments on correction rate and actual user performance using a prototype correction module connected to the output of a speech recognition captioning system.

#7Universal Access: Speech Recognition for Talkers with Spastic Dysarthria

Harsh Vardhan Sharma (Beckman Institute for Advanced Science and Technology, Urbana, USA)
Mark Hasegawa-Johnson (Beckman Institute for Advanced Science and Technology, Urbana, USA)

This paper describes the results of our experiments in small and medium vocabulary dysarthric speech recognition, using the database being recorded by our group under the Universal Access initiative. We develop and test speaker-dependent, word- and phone-level speech recognizers utilizing the hidden Markov model architecture; the models are trained exclusively on dysarthric speech produced by individuals diagnosed with cerebral palsy. The experiments indicate that (a) different system configurations (word- vs. phone-based, number of states per HMM, number of Gaussian components per state-specific observation probability density, etc.) give useful performance (in terms of recognition accuracy) for different speakers and different task vocabularies, and (b) for very low intelligibility subjects, speech recognition outperforms human listeners on recognizing dysarthric speech.

#8Exploring Speech Therapy Games with Children on the Autism Spectrum

Mohammed E Hoque (Massachusetts Institute of Technology)
Joseph K Lane (Massachusetts Institute of Technology)
Rana el Kaliouby (Massachusetts Institute of Technology)
Matthew Goodwin (Massachusetts Institute of Technology)
Rosalind Picard (Massachusetts Institute of Technology)

Individuals on the autism spectrum often have difficulties producing intelligible speech with either high or low speech rate, and atypical pitch and/or amplitude affect. In this study, we present a novel intervention towards customizing speech enabled games to help them produce intelligible speech. In this approach, we clinically and computationally identify the areas of speech production difficulties of our participants. We provide an interactive and customized interface for the participants to meaningfully manipulate the prosodic aspects of their speech. Over the course of 12 months, we have conducted several pilots to set up the experimental design, and developed a suite of games and audio processing algorithms for prosodic analysis of speech. Preliminary results indicate that our intervention is engaging and effective for our participants.

#9Analyzing GMMs to characterize resonance anomalies in speakers suffering from apnoea

José Luis Blanco (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Rubén Fernández (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
David Pardo (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Álvaro Sigüenza (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
Luis A. Hernández (Signal, Systems & RadioCommunications Department. Universidad Politécnica de Madrid, Spain)
José Alcázar (Respiratory Department. Hospital Torrecardenas, Almeria, Spain)

Past research on the speech of apnoea patients has revealed that resonance anomalies are among the most distinguishing traits for these speakers. This paper presents an approach to characterize these peculiarities using GMMs and distance measures between distributions. We report the findings obtained with two analytical procedures, working with a purpose-designed speech database of both healthy and apnoea-suffering patients. First, we validate the database to guarantee that the models trained are able to describe the acoustic space in a way that may reveal differences between groups. Then we study abnormal nasalization in apnoea patients by considering vowels in nasal and non-nasal phonetic contexts. Our results confirm that there are differences between the groups, and that statistical modelling techniques can be used to describe this factor. Results further suggest that it would be possible to design an automatic classifier using such discriminative information.

#10On the Mutual Information between Source and Filter Contributions for Voice Pathology Detection

Thomas Drugman (Faculté Polytechnique de Mons)
Thomas Dubuisson (Faculté Polytechnique de Mons)
Thierry Dutoit (Faculté Polytechnique de Mons)

This paper addresses the problem of automatic detection of voice pathologies directly from the speech signal. For this, we investigate the use of the glottal source estimation as a means to detect voice disorders. Three sets of features are proposed, depending on whether they are related to the speech or the glottal signal, or to prosody. The relevancy of these features is assessed through mutual information-based measures. This allows an intuitive interpretation in terms of discrimination power and redundancy between the features, independently of any subsequent classifier. We discuss which characteristics are particularly informative or complementary for detecting voice pathologies.
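The mutual information-based assessment described in the abstract above can be sketched for discrete data as follows. This is a minimal count-based estimate, assuming features that have already been quantized to a few levels; the function name and the plug-in estimator are illustrative assumptions, not the authors' exact measures.

```python
import math
from collections import Counter

def mutual_information(xs, ys):
    """Plug-in estimate of I(X; Y) in bits from paired discrete samples."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have equal length")
    n = len(xs)
    count_x = Counter(xs)
    count_y = Counter(ys)
    count_xy = Counter(zip(xs, ys))
    mi = 0.0
    for (x, y), c in count_xy.items():
        # p(x,y) * log2( p(x,y) / (p(x) p(y)) ), written with raw counts
        mi += (c / n) * math.log2(c * n / (count_x[x] * count_y[y]))
    return mi
```

Applied to a feature and the class label (normal vs. pathological), the value indicates relevance; applied to two features, it indicates their redundancy.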

#11A System for Detecting Miscues in Dyslexic Read Speech

Morten Højfeldt Rasmussen (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Zheng-Hua Tan (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Børge Lindberg (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)
Søren Holdt Jensen (Multimedia Information and Signal Processing, Department of Electronic Systems, Aalborg University, Denmark)

While miscue detection in general is a well-explored research field, little attention has so far been paid to miscue detection in dyslexic read speech. This domain differs substantially from the domains that are commonly researched, as dyslexic read speech includes, for example, frequent regressions and long pauses between words. A system detecting miscues in dyslexic read speech is presented. It includes an ASR component employing a forced-alignment-like grammar adjusted for dyslexic input and uses the GOP score and phone duration to accept or reject the read words. Experimental results show that the system detects miscues at a false alarm rate of 5.3% and a miscue detection rate of 40.1%. These results are worse than those of current state-of-the-art reading tutors, perhaps indicating that dyslexic read speech is a challenge to handle.
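For readers unfamiliar with the two figures quoted in the abstract above, the sketch below shows one common way such rates are computed from per-word decisions. The definitions are assumptions: detection rate is taken as flagged miscues over all miscues, and false alarm rate as flagged correct words over all correct words. The abstract does not state the exact definitions used, and the function name is hypothetical.

```python
def miscue_rates(is_miscue, flagged):
    """Return (detection_rate, false_alarm_rate) from per-word booleans.

    is_miscue[i]: reference label, True if word i was misread.
    flagged[i]:   system decision, True if word i was rejected as a miscue.
    """
    if len(is_miscue) != len(flagged):
        raise ValueError("sequences must have equal length")
    miscues = sum(1 for m in is_miscue if m)
    correct = len(is_miscue) - miscues
    hits = sum(1 for m, f in zip(is_miscue, flagged) if m and f)
    false_alarms = sum(1 for m, f in zip(is_miscue, flagged) if f and not m)
    detection_rate = hits / miscues if miscues else 0.0
    false_alarm_rate = false_alarms / correct if correct else 0.0
    return detection_rate, false_alarm_rate
```

Raising the acceptance threshold on a score such as GOP typically trades a higher detection rate for a higher false alarm rate, which is why the two numbers are reported together.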

Wed-Ses0-K:
Deb Roy - New Horizons in the Study of Language Development

Time:Wednesday 08:30 Place:Main Hall Type:Keynote
Chair:Roger Moore

08:30New Horizons in the Study of Language Development

Deb Roy (MIT Media Lab)

Emerging forms of ecologically-valid longitudinal recordings of human behavior and social interaction promise fresh perspectives on age-old questions of child development. In a pilot effort, 240,000 hours of audio and video recordings of one child’s life at home are being analyzed with a focus on language development. To study a corpus of this scale and richness, current methods of developmental sciences are insufficient. New data analysis algorithms and methods for interpretation and computational modeling are under development. Preliminary speech analysis reveals surprising levels of linguistic “fine-tuning” by caregivers that may provide crucial support for word learning. Ongoing analysis of various other aspects of the corpus aims to model detailed aspects of the child’s language development as a function of learning mechanisms combined with everyday experience. Plans to collect similar corpora from more children based on a streamlined recording system are underway.

Wed-Ses1-O1:
Speaker verification & identification II

Time:Wednesday 10:00 Place:Main Hall Type:Oral
Chair:Jean-Francois Bonastre

10:00Does Session Variability Compensation in Speaker Recognition Model Intrinsic Variation Under Mismatched Conditions?

Elizabeth Shriberg (SRI International)
Sachin Kajarekar (SRI International)
Nicolas Scheffer (SRI International)

Intersession variability (ISV) compensation in speaker recognition is well studied with respect to extrinsic variation, but little is known about its ability to model intrinsic variation. We find that ISV compensation is remarkably successful on a corpus of intrinsic variation that is highly controlled for channel (a dominant component of ISV). The results are particularly surprising because the ISV training data come from a different corpus than do speaker train and test data. We further find that relative improvements are (1) inversely related to uncompensated performance, (2) reduced more by vocal effort train/test mismatch than by speaking style mismatch, and (3) reduced additionally for mismatches in both style and level. Results demonstrate that intersession variability compensation does model intrinsic variation, and suggest that mismatched data may be more useful than previously expected for modeling certain types of within-speaker variability in speech.

10:20Variability Compensated Support Vector Machines Applied to Speaker Verification

Zahi Karam (DSPG, Research Laboratory of Electronics at MIT & MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)

Speaker verification using SVMs has proven successful, specifically using the GSV Kernel with NAP. Also, the recent popularity and success of JFA has led to promising attempts to use speaker factors directly as SVM features. NAP projection and the use of speaker factors are methods of handling variability: NAP by removing nuisance variability, and speaker factors by forcing the discrimination to be performed based on inter-speaker variability. These successes have led us to propose a new method we call VCSVM to handle both inter- and intra-speaker variability directly in the SVM optimization. VCSVM adds a regularized penalty to the optimization that biases the normal to the hyperplane to be orthogonal to the nuisance subspace or alternatively the complement of the inter-speaker variability subspace. The bias attempts to emphasize inter-speaker variability while ignoring intra-speaker variability. This paper presents the VCSVM theory and promising results on nuisance compensation.

10:40Support Vector Machines versus Fast Scoring in the Low-Dimensional Total Variability Space for Speaker Verification

Najim Dehak (CRIM-ETS)
Réda Dehak (LRDE-EPITA)
Patrick Kenny (CRIM)
Niko Brummer (Agnitio)
Pierre Ouellet (CRIM)
Pierre Dumouchel (CRIM-ETS)

This paper presents a new speaker verification system architecture based on Joint Factor Analysis (JFA) as a feature extractor. In this modeling, JFA is used to define a new low-dimensional space named the total variability factor space, instead of the separate channel and speaker variability spaces of classical JFA. The main contribution of this approach is the use of the cosine kernel in the new total factor space to design two different systems: the first is based on Support Vector Machines, and the second uses this kernel directly as a decision score. This last scoring method makes the process faster and less computationally complex than other classical methods. We tested several intersession compensation methods on the total factors, and we found that the combination of Linear Discriminant Analysis and Within Class Covariance Normalization achieved the best performance.
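
The "fast scoring" idea described in this abstract, using the cosine kernel value itself as the decision score, can be sketched as follows. This is only a minimal illustration: the vector arguments stand in for total variability factors, and the function names and the threshold value are hypothetical, not taken from the paper.

```python
import math


def cosine_score(w_target, w_test):
    """Cosine kernel between two factor vectors, used directly as a score."""
    dot = sum(a * b for a, b in zip(w_target, w_test))
    norm_target = math.sqrt(sum(a * a for a in w_target))
    norm_test = math.sqrt(sum(b * b for b in w_test))
    return dot / (norm_target * norm_test)


def accept(w_target, w_test, threshold=0.5):
    """Accept the trial if the score reaches a threshold (assumed value)."""
    return cosine_score(w_target, w_test) >= threshold
```

Because the score depends only on the angle between the two vectors, no model training is needed at test time, which is consistent with the speed advantage the abstract reports.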

11:00Within-Session Variability Modelling for Factor Analysis Speaker Verification

Robbie Vogt (Speech Research Lab, QUT)
Jason Pelecanos (IBM T.J. Watson Research Center)
Nicolas Scheffer (SRI International)
Sachin Kajarekar (SRI International)
Sridha Sridharan (Speech Research Lab, QUT)

This work presents an extended Joint Factor Analysis model including explicit modelling of unwanted within-session variability. The goals of the proposed extended JFA model are to improve verification performance with short utterances by compensating for the effects of limited or imbalanced phonetic coverage, and to produce a flexible JFA model that is effective over a wide range of utterance lengths without adjusting model parameters such as retraining session subspaces. Experimental results on the 2006 NIST SRE corpus demonstrate the flexibility of the proposed model by providing competitive results over a wide range of utterance lengths without retraining and also yielding modest improvements in a number of conditions over the current state of the art.

11:20Speaker Recognition by Gaussian Information Bottleneck

Ron M Hecht (Department of Computer Science, Tel-Aviv University, Tel-Aviv, Israel)
Elad Noor (The Weizmann Institute of Science, Rehovot, Israel)
Naftali Tishby (School of Engineering and Computer Science, Hebrew University, Jerusalem, Israel)

This paper explores a novel approach for the extraction of relevant information in speaker recognition tasks. This approach uses a principled information theoretic framework, the Information Bottleneck method (IB). In our application, the method compresses the acoustic data while preserving mostly the relevant information for speaker identification. This paper focuses on a continuous version of the IB method known as the Gaussian Information Bottleneck (GIB). This version assumes that both the source and target variables are high dimensional multivariate Gaussian variables. The GIB was applied in our work to the Super Vector (SV) dimension reduction conundrum. Experiments were conducted on the male part of the NIST SRE 2005 corpora. The GIB representation was compared to other dimension reduction techniques and to a baseline system. In our experiments, the GIB outperformed the baseline system, achieving a 6.1% Equal Error Rate (EER) compared to the 15.1% EER of the baseline system.
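
For readers unfamiliar with the Equal Error Rate reported in this abstract, the sketch below approximates it from two lists of trial scores by sweeping a threshold over the observed values. The function name and the simple sweep are assumptions for illustration; evaluation toolkits typically interpolate the error curves instead.

```python
def equal_error_rate(target_scores, impostor_scores):
    """Approximate the EER by sweeping the threshold over all observed scores.

    A trial is accepted when its score is >= the threshold. The result is
    the mean of false-acceptance and false-rejection rates at the threshold
    where the two rates are closest.
    """
    best_gap, eer = None, None
    for thr in sorted(set(target_scores) | set(impostor_scores)):
        frr = sum(s < thr for s in target_scores) / len(target_scores)
        far = sum(s >= thr for s in impostor_scores) / len(impostor_scores)
        gap = abs(far - frr)
        if best_gap is None or gap < best_gap:
            best_gap, eer = gap, (far + frr) / 2
    return eer
```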

11:40Variational Dynamic Kernels for Speaker Verification

Chris Longworth (Cambridge University Engineering Department)
Rogier van Dalen (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)

An important aspect of SVM-based speaker verification is the choice of dynamic kernel. Recently there has been interest in the use of kernels based on the Kullback-Leibler divergence between GMMs. Since this has no closed-form solution, typically a matched-pair upper bound is used instead. This places significant restrictions on the forms of model structure that may be used. All GMMs must contain the same number of components and must be adapted from a single background model. For many tasks this will not be optimal. In this paper, dynamic kernels are proposed based on alternative, variational approximations to the KL divergence. Unlike the matched-pair bound, these do not restrict the forms of GMM that may be used. Additionally, using a more accurate approximation of the divergence may lead to performance gains. Preliminary results using these kernels are presented on the NIST 2002 SRE dataset.
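
To make the idea of a variational approximation to the KL divergence between GMMs concrete, here is a minimal sketch for one-dimensional mixtures. It implements one commonly used variational approximation built from the closed-form KL divergence between pairs of Gaussian components; whether this matches the exact form used in the paper is an assumption. Mixtures are represented as lists of (weight, mean, variance) tuples, a layout chosen for illustration.

```python
import math


def kl_gauss(m1, v1, m2, v2):
    """Closed-form KL( N(m1, v1) || N(m2, v2) ) for 1-D Gaussians."""
    return 0.5 * (math.log(v2 / v1) + (v1 + (m1 - m2) ** 2) / v2 - 1.0)


def kl_variational(f, g):
    """Variational approximation to KL(f || g) for 1-D GMMs.

    f, g: lists of (weight, mean, variance). The two mixtures may have
    different numbers of components.
    """
    total = 0.0
    for w_a, m_a, v_a in f:
        # Similarity of component a to the components of f, then of g.
        num = sum(w * math.exp(-kl_gauss(m_a, v_a, m, v)) for w, m, v in f)
        den = sum(w * math.exp(-kl_gauss(m_a, v_a, m, v)) for w, m, v in g)
        total += w_a * math.log(num / den)
    return total
```

Unlike a matched-pair bound, nothing here requires the two mixtures to share a component count or a common background model. For single-component mixtures the expression reduces to the exact Gaussian KL divergence.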

Wed-Ses1-O2:
Emotion and Expression I

Time:Wednesday 10:00 Place:East Wing 1 Type:Oral
Chair:Ailbhe Ni Chasaide

10:00Emotion dimensions and formant position

Martijn Bastiaan Goudbeek (University of Tilburg, the Netherlands / Swiss Center for Affective Sciences, Geneva, Switzerland)
Jean Philippe Goldman (Language Technology Laboratory, University of Geneva, Switzerland)
Klaus Scherer (Swiss Center for Affective Sciences, Switzerland)

The influence of emotion on articulatory precision was investigated in a newly established corpus of acted emotional speech. The frequencies of the first and second formant of the vowels /i/, /u/, and /a/ were measured and shown to be significantly affected by emotion dimension. High arousal resulted in a higher mean F1 in all vowels, whereas positive valence resulted in higher mean values for F2. The dimension potency/control showed a pattern of effects that was consistent with a larger vocalic triangle for emotions high in potency/control. The results are interpreted in the context of Scherer's component process model.

10:20Identifying Uncertain Words within an Utterance via Prosodic Features

Heather Pon-Barry (Harvard University)
Stuart Shieber (Harvard University)

We describe an experiment that investigates whether sub-utterance prosodic features can be used to detect uncertainty at the word-level. That is, given an utterance that is classified as uncertain, we want to determine which word or phrase the speaker is uncertain about. We have a corpus of utterances spoken under varying degrees of certainty. Using combinations of sub-utterance prosodic features we train models to predict the level of certainty of an utterance. On a set of utterances that were perceived to be uncertain, we compare the predictions of our models for two candidate 'target word' segmentations: (a) one with the actual word causing uncertainty as the proposed target word, and (b) one with a control word as the proposed target word. Our best model correctly identifies the word causing the uncertainty rather than the control word 91% of the time.

10:40Evaluating Evaluators: A Case Study in Understanding the Benefits and Pitfalls of Multi-Evaluator Modeling

Emily Mower (University of Southern California)
Maja J Mataric (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Emotion perception is a complex process, often measured using stimuli presentation experiments that query evaluators for their perceptual ratings of emotional cues. These evaluations contain variability both related and unrelated to the evaluated utterances. One approach to handling this variability is to model emotion perception at the individual level. However, the reported perception of users may not adequately capture the emotional acoustic properties of an utterance. This problem can be mitigated by creating averaged evaluator models. We demonstrate that this averaging improves classification performance compared to models created using individual-specific evaluations. We also demonstrate that the performance increases are related to the consistency with which evaluators label data. These results suggest that the acoustic properties of emotional speech are better captured using models formed from averaged evaluations rather than from individual-specific evaluations.

11:00Responding to User Emotional State by Adding Emotional Coloring to Utterances

Jaime Acosta (University of Texas at El Paso)
Nigel Ward (University of Texas at El Paso)

When people speak to each other, they share a rich set of nonverbal behaviors such as varying prosody in voice. These behaviors, sometimes interpreted as demonstrations of emotions, call for appropriate responses, but today’s spoken dialog systems lack the ability to do so. We collected a corpus of persuasive dialogs, specifically conversations about graduate school between a staff member and students, and had judges label all utterances with triples indicating the perceived emotions, using the three dimensions: activation, evaluation, and power. We found immediate response patterns, in which the staff member colored her utterances in response to the emotion shown by the student in the immediately previous utterance, and built a predictive model suitable for use in a dialog system to persuasively discuss graduate school with students.

11:20Analysis of Laugh Signals for Detecting in Continuous Speech

Sudheer Kumar K (International Institute of Information Technology, Hyderabad, India)
Sri Harish Reddy M (International Institute of Information Technology, Hyderabad, India)
Sri Rama Murty K (Indian Institute of Technology Madras, Chennai, India)
Yegnanarayana B (International Institute of Information Technology, Hyderabad, India)

Laughter is a nonverbal vocalization that occurs often in speech communication. Since laughter is produced by the speech production mechanism, spectral analysis methods are used mostly for the study of laughter acoustics. In this paper the significance of excitation features for discriminating laughter and speech is discussed. New features describing the excitation characteristics are used to analyze the laugh signals. The features are based on instantaneous pitch and strength of excitation at epochs. An algorithm is developed based on these features to detect laughter regions in continuous speech. The results are illustrated by detecting laughter regions in a TV broadcast program.

11:40Data-driven Clustering in Emotional Space for Affect Recognition Using Discriminatively Trained LSTM Networks

Martin Woellmer (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Ellen Douglas-Cowie (Queen's University Belfast)
Roddy Cowie (Queen's University Belfast)

In today's affective databases, speech turns are often labelled on a continuous scale for emotional dimensions such as valence or arousal to better express the diversity of human affect. However, applications like virtual agents usually map the detected emotional user state to rough classes in order to reduce the multiplicity of emotion dependent system responses. Since these classes often do not optimally reflect emotions that typically occur in a given application, this paper investigates data-driven clustering of emotional space to find class divisions that better match the training data and the area of application. We consider the Belfast Sensitive Artificial Listener database and TV talkshow data from the VAM corpus. We show that a discriminatively trained Long Short-Term Memory (LSTM) recurrent neural net that explicitly learns clusters in emotional space and additionally models context information outperforms both Support Vector Machines and a Regression-LSTM net.

Wed-Ses1-O3:
Automatic Speech Recognition: Adaptation II

Time:Wednesday 10:00 Place:East Wing 2 Type:Oral
Chair:Satoshi Nakamura

10:00On the Estimation and the Use of Confusion-Matrices for Improving ASR Accuracy

Santiago Omar Caballero Morales (University of East Anglia, School of Computing Sciences)
Stephen Cox (University of East Anglia, School of Computing Sciences)

In previous work, we described how learning the pattern of recognition errors made by an individual using a certain ASR system leads to increased recognition accuracy compared with a standard MLLR adaptation approach. This was the case for low-intelligibility speakers with dysarthric speech, but no improvement was observed for normal speakers. In this paper, we describe an alternative method for obtaining the training data for confusion-matrix estimation for normal speakers which is more effective than our previous technique. We also address the issue of data sparsity in estimation of confusion-matrices by using non-negative matrix factorization (NMF) to discover structure within them. The confusion-matrix estimates made using these techniques are integrated into the ASR process using a technique termed "metamodels", and the results presented here show statistically significant gains in word recognition accuracy when applied to normal speech.

10:20A Study on Soft Margin Estimation of Linear Regression Parameters for Speaker Adaptation

Shigeki Matsuda (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Yu Tsao (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Jinyu Li (Speech Component Group, Microsoft Corporation)
Satoshi Nakamura (Spoken Language Communication Group, National Institute of Information and Communication Technology)
Chin-Hui Lee (School of Electrical and Computer Engineering, Georgia Institute of Technology)

We formulate a framework for soft margin estimation-based linear regression (SMELR) and apply it to supervised speaker adaptation. Enhanced separation capability and increased discriminative ability are two key properties in margin-based discriminative training. For the adaptation process to be able to flexibly utilize any amount of data, we also propose a novel interpolation scheme to linearly combine the speaker independent (SI) and speaker adaptive SMELR (SMELR/SA) models. The two proposed SMELR algorithms were evaluated on a Japanese large vocabulary continuous speech recognition task. Both the SMELR and interpolated SI+SMELR/SA techniques showed improved speech adaptation performance in comparison with the well-known maximum likelihood linear regression (MLLR) method. We also found that the interpolation framework works even more effectively than SMELR when the amount of adaptation data is relatively small.

10:40Exploring the Role of Spectral Smoothing in context of Children's Speech Recognition

Shweta Ghai (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)
Rohit Sinha (Department of Electronics and Communication Engineering, Indian Institute of Technology Guwahati, Guwahati-781039, India.)

This work is motivated by our earlier study, which shows that, with explicit pitch normalization, children's speech recognition performance on models trained on adults' speech improves as a result of reduced pitch-dependent distortions in the spectral envelope. In this paper, we study the role of spectral smoothing in the context of children's speech recognition. Spectral smoothing is effected in the feature domain by two approaches, viz. modification of the bandwidth of the filters in the filterbank and cepstral truncation. In conjunction, both approaches give significant improvement in children's speech recognition performance, with 57% relative improvement over the baseline. Also, when combined with the widely used vocal tract length normalization (VTLN), these spectral smoothing approaches result in an additional 25% relative improvement over the VTLN performance for children's speech recognition on models trained on adults' speech.
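
Cepstral truncation, one of the two smoothing approaches named in this abstract, can be illustrated with a short sketch: take the DCT of a vector of log filterbank energies, zero the higher-order coefficients, and invert. The function names and the use of an orthonormal DCT-II are assumptions for illustration; the paper's actual front end and truncation order are not given in the abstract.

```python
import math


def dct(x):
    """Orthonormal DCT-II of a list of floats."""
    n = len(x)
    out = []
    for k in range(n):
        s = sum(x[i] * math.cos(math.pi * (i + 0.5) * k / n) for i in range(n))
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        out.append(scale * s)
    return out


def idct(c):
    """Inverse of the orthonormal DCT-II above."""
    n = len(c)
    out = []
    for i in range(n):
        s = c[0] * math.sqrt(1.0 / n)
        s += sum(
            c[k] * math.sqrt(2.0 / n) * math.cos(math.pi * (i + 0.5) * k / n)
            for k in range(1, n)
        )
        out.append(s)
    return out


def smooth_log_spectrum(log_energies, n_keep):
    """Keep the first n_keep cepstral coefficients and resynthesise."""
    cepstrum = dct(log_energies)
    truncated = cepstrum[:n_keep] + [0.0] * (len(cepstrum) - n_keep)
    return idct(truncated)
```

Keeping fewer coefficients discards fine spectral detail, which is where pitch harmonics of high-pitched voices tend to show up; keeping only the first coefficient flattens the envelope to its mean.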

11:00Unsupervised Lattice-based Acoustic Model Adaptation for Speaker-Dependent Conversational Telephone Speech Transcription

Kit Thambiratnam (Microsoft Research)
Frank Seide (Microsoft Research)

This paper examines the application of lattice adaptation techniques to speaker-dependent models for the purpose of conversational telephone speech transcription. Given sufficient training data per speaker, it is feasible to build adapted speaker-dependent models using lattice MLLR and lattice MAP. Experiments on iterative and cascaded adaptation are presented. Additionally, various strategies for thresholding frame posteriors are investigated, and it is shown that accumulating statistics from the local best-confidence path is sufficient to achieve optimal adaptation. Overall, an iterative cascaded lattice system was able to reduce WER by 7.0% abs., which was a 0.8% abs. gain over transcript-based adaptation. Lattice adaptation reduced the unsupervised/supervised adaptation gap from 2.5% to 1.7%.

11:20Rapid Unsupervised Adaptation Using Frame Independent Output Probabilities of Gender and Context Independent Phoneme Models

Satoshi KOBASHIKAWA (NTT Cyber Space Laboratories)
Atsunori OGAWA (NTT Communication Science Laboratories)
Yoshikazu YAMAGUCHI (NTT Cyber Space Laboratories)
Satoshi TAKAHASHI (NTT Cyber Space Laboratories)

Business is demanding higher recognition accuracy with no increase in computation time compared to previously adopted baseline speech recognition systems. Accuracy can be improved by adding a gender dependent acoustic model and unsupervised adaptation based on CMLLR. CMLLR-based batch-type unsupervised adaptation estimates a single global transformation matrix by utilizing prior unsupervised labeling, which unfortunately increases the computation time. Our proposed technique reduces prior gender selection and labeling time by using frame independent output probabilities of only gender dependent speech GMM and monophone HMM in a dual-gender acoustic model. The proposed technique further raises accuracy by employing a power term after adaptation. Simulations using spontaneous speech show that the proposed technique reduces computation time by 17.9% and the relative error in correct rate by 13.7% compared to the baseline without prior gender selection and unsupervised adaptation.

11:40Bark-shift based nonlinear speaker normalization using the second subglottal resonance

Shizhen Wang (University of California, Los Angeles)
Yi-Hui Lee (University of California, Los Angeles)
Abeer Alwan (University of California, Los Angeles)

In this paper, we propose a Bark-scale shift based piecewise nonlinear warping function for speaker normalization, and a joint frequency discontinuity and energy attenuation detection algorithm to estimate the second subglottal resonance (Sg2). We then apply Sg2 for rapid speaker normalization. Experimental results on children's speech recognition show that the proposed nonlinear warping function is more effective for speaker normalization than linear frequency warping. Compared to maximum likelihood based grid search methods, Sg2 normalization is more efficient and achieves comparable or better performance, especially for limited normalization data.
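
A shift on the Bark scale acts as a nonlinear warp on the Hertz axis, which is the basic idea behind the warping function in this abstract. The sketch below assumes Traunmüller's approximation for the Hz-to-Bark conversion and applies a single uniform shift; the paper's piecewise function and its Sg2-based estimate of the shift are not reproduced here, and the function names are hypothetical.

```python
def hz_to_bark(f):
    """Hz to Bark using Traunmueller's approximation (an assumed choice)."""
    return 26.81 * f / (1960.0 + f) - 0.53


def bark_to_hz(z):
    """Algebraic inverse of hz_to_bark."""
    return 1960.0 * (z + 0.53) / (26.28 - z)


def bark_shift_warp(f, shift):
    """Warp frequency f (Hz) by adding a constant shift in Bark."""
    return bark_to_hz(hz_to_bark(f) + shift)
```

Because the Bark scale is compressive at high frequencies, the same shift moves a 3000 Hz component by more Hertz than a 500 Hz component, in contrast to a warp that scales all frequencies by one factor.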

Wed-Ses1-O4:
Voice Transformation I

Time:Wednesday 10:00 Place:East Wing 3 Type:Oral
Chair:Yannis Stylianou

10:00Many-to-many eigenvoice conversion with reference voice

Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology)
Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology)

We propose many-to-many voice conversion (VC) techniques to convert an arbitrary source voice into an arbitrary target voice. We have previously proposed one-to-many eigenvoice conversion (EVC) and many-to-one EVC. In EVC, an eigenvoice GMM (EV-GMM) is trained in advance using multiple parallel data sets of a reference speaker and many pre-stored speakers. The EV-GMM is flexibly adapted to an arbitrary speaker using a small amount of data. In this paper, we realize many-to-many VC by sequentially performing many-to-one EVC and one-to-many EVC through the reference speaker using the same EV-GMM. Experimental results demonstrate the effectiveness of the proposed method.

10:20Alleviating the One-to-Many Mapping Problem in Voice Conversion with Context-Dependent Modeling

Elizabeth Godoy (Orange Labs)
Olivier Rosec (Orange Labs)
Thierry Chonavel (Telecom Bretagne)

This paper addresses the "one-to-many" mapping problem in Voice Conversion (VC) by exploring source-to-target mappings in GMM-based spectral transformation. Specifically, we examine differences using source-only versus joint source/target information in the classification stage of transformation, effectively illustrating a "one-to-many effect" in the traditional acoustically-based GMM. We propose combating this effect by using phonetic information in the GMM learning and classification. We then show the success of our proposed context-dependent modeling with transformation results using an objective error criterion. Finally, we discuss implications of our work in adapting current approaches to VC.

10:40Efficient Modeling of Temporal Structure of Speech For Applications in Voice Transformation

Binh Phu Nguyen (School of Information Science, Japan Advanced Institute of Science and Technology)
Akagi Masato (School of Information Science, Japan Advanced Institute of Science and Technology)

The aim of voice transformation is to change the style of given utterances. Most voice transformation methods process speech signals in a time-frequency domain. In the time domain, when processing spectral information, conventional methods do not consider relations between neighboring frames. If unexpected modifications happen, discontinuities arise between frames, which degrades speech quality. This paper proposes a new modeling of the temporal structure of speech to ensure the smoothness of the transformed speech and thereby improve speech quality in voice transformation. We propose an improvement of the temporal decomposition (TD) technique to model the temporal structure of speech. The TD is used to ensure the smoothness of the transformed speech. We investigate the TD in two applications, concatenative speech synthesis and spectral voice conversion. Experimental results confirm the effectiveness of TD in terms of improving the quality of the transformed speech.

11:00Cross-Language Voice Conversion Based on Eigenvoices

Malorie Charlier (Faculté Polytechnique de Mons)
Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology)
Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology)
Alexis Moinet (Faculté Polytechnique de Mons)
Thierry Dutoit (Faculté Polytechnique de Mons)

This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers who speak different languages. In general, parallel data consisting of utterance pairs of those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC because the EVC framework can develop the conversion model without using parallel data. The results of subjective evaluations demonstrate that the proposed method yields significant performance improvements compared with a conventional cross-language VC method based on frame selection.

11:20Voice Conversion using K-Histograms and Frame Selection

Alejandro José Uriz (FI-UNMDP)
Pablo Daniel Agüero (FI-UNMDP)
Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain)
Juan Carlos Tulli (FI-UNMDP)

The goal of voice conversion systems is to modify the voice of a source speaker to be perceived as if it had been uttered by another specific speaker. Many approaches found in the literature are based on statistical models and introduce oversmoothing in the target features. Our proposal is a new model that combines several techniques used in unit selection for text-to-speech and a non-Gaussian mathematical transformation model. Subjective results support the proposed approach.

11:40Online Model Adaptation for Voice Conversion using Model-based Speech Synthesis Technique

Dalei Wu (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Baojie Li (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Hui Jiang (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA)
Qianjie Fu (House Ear Institute, 2100 West Third Street, Los Angeles, CA 90057, USA)

In this paper, we present a novel voice conversion method using model-based speech synthesis that can be used for some applications where prior knowledge or training data is not available from the source speaker. In the proposed method, training data from a target speaker is used to build a GMM-based speech model and voice conversion is then performed for each utterance from the source speaker according to the pre-trained target speaker model. To reduce the mismatch between source and target speakers, online model adaptation is proposed to improve model selection accuracy, based on maximum likelihood linear regression (MLLR). Objective and subjective evaluations suggest that the proposed methods are quite effective in generating acceptable voice quality for voice conversion even without training data from source speakers.

Wed-Ses1-S1:
Special Session: Lessons and Challenges Deploying Voice Search

Time:Wednesday 10:00 Place:East Wing 4 Type:Special
Chair:Mike Cohen & Mike Phillips

10:00Role of Natural Language Understanding in Voice Local Search

Junlan Feng (AT&T Labs Research)
Srinivas Bangalore (AT&T Labs Research)
Mazin Gilbert (AT&T Labs Research)

Speak4it is a voice-enabled local search system currently available for iPhone devices. The natural language understanding (NLU) component is one of the key technology modules in this system. The role of NLU in voice-enabled local search is twofold: (a) parse the automatic speech recognition (ASR) output (1-best and word lattices) into meaningful segments that contribute to high-precision local search, and (b) understand the user’s intent. This paper is concerned with the first task of NLU. In previous work, we presented a scalable approach to parsing, which is built upon a text indexing and search framework and can also parse ASR lattices. In this paper, we propose an algorithm to improve the baseline by extracting the “subjects” of the query. Experimental results indicate that lattice-based query parsing outperforms ASR 1-best based parsing by 2.1% absolute and that extracting subjects in the query improves the robustness of search.

10:20Recognition and Correction of Voice Web Search Queries

Keith Vertanen (University of Cambridge)
Per Ola Kristensson (University of Cambridge)

In this work we investigate how to recognize and correct voice web search queries. We describe our corpus of web search queries and show how it was used to improve the accuracy of recognition. We show that using a search-specific vocabulary with automatically generated pronunciations is superior to using a vocabulary limited to a fixed pronunciation dictionary. We conducted a formative user study to investigate recognition and correction aspects of voice search in a mobile context. In the user study, we found that despite a word error rate of 48%, users were able to speak and correct search queries in about 18 seconds. Users did this while walking around using a mobile touch-screen device.
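
The word error rate quoted in this abstract is conventionally the word-level edit distance between reference and recognised text, divided by the number of reference words. The following is a minimal sketch of that standard computation; the function name and the whitespace tokenisation are assumptions, and the example query is invented for illustration.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # prev[j] holds the edit distance between the reference words seen so
    # far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)
```

For example, recognising "cheap flights boston" for the reference "cheap flights to boston" is one deletion over four reference words, a WER of 25%.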

10:40Voice Search and Everything Else – What Users Are Saying to the Vlingo Top Level Voice UI

Chao Wang (Vlingo)

No abstract available.

11:00Searching Google by Voice

Johan Schalkwyk (Google)

No abstract available.

11:20Multiple-hypotheses searches from deeply parsed requests to multiple-evidences scoring: the DeepQA challenge

Roberto Sicconi (IBM)

No abstract available.

11:40Research Areas in Voice Search: Lessons from Microsoft Deployments

Geoffrey Zweig (Microsoft)

No abstract available.

Wed-Ses1-P1:
Phonetics, Phonology, cross-language comparisons, pathology

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:Valerie Hazan

#1Fast Transcription of Unstructured Audio Recordings

Brandon Roy (MIT Media Laboratory)
Deb Roy (MIT Media Laboratory)

We introduce a new method for human-machine collaborative speech transcription that is significantly faster than existing transcription methods. In this approach, automatic audio processing algorithms are used to robustly detect speech in audio recordings and split speech into short, easy to transcribe segments. Sequences of speech segments are loaded into a transcription interface that enables a human transcriber to simply listen and type, obviating the need for manually finding and segmenting speech or explicitly controlling audio playback. As a result, playback stays synchronized to the transcriber's speed of transcription. In evaluations using naturalistic audio recordings made in everyday home situations, the new method is up to 6 times faster than other popular transcription tools while preserving transcription quality.

#2Finding Allophones: an Evaluation on Consonants in the TIMIT Corpus

Timothy Kempton (University of Sheffield)
Roger Moore (University of Sheffield)

Phonemic analysis, the process of identifying the contrastive sounds in a language, involves finding allophones: phonetic variants of those contrastive sounds. An algorithm for finding allophones (developed by Peperkamp et al.) is evaluated on consonants in the TIMIT acoustic phonetic transcripts. A novel phonetic filter based on the active articulator is introduced and has a higher recall than previous filters. The combined retrieval performance, measured by area under the ROC curve, is 83%. The system implemented can process any language transcribed in IPA and is currently being used to assist the phonemic analysis of unwritten languages.
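
As a reading aid for the evaluation measure named above, the following minimal sketch computes the area under the ROC curve from a list of scores and binary labels. It is not the authors' code: the function name, the scores and the labels are illustrative assumptions, and the rank-based formulation (the probability that a true allophone pair outscores a non-allophone pair, with ties counted as half) is the standard textbook one.

```python
def roc_auc(scores, labels):
    """Area under the ROC curve via pairwise comparison of positives and negatives."""
    positives = [s for s, is_pos in zip(scores, labels) if is_pos]
    negatives = [s for s, is_pos in zip(scores, labels) if not is_pos]
    if not positives or not negatives:
        raise ValueError("need at least one positive and one negative example")
    # A positive scored above a negative counts 1, a tie counts 0.5.
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in positives
        for n in negatives
    )
    return wins / (len(positives) * len(negatives))


# Hypothetical detector scores for six candidate phone pairs (1 = true allophone pair).
example_scores = [0.9, 0.8, 0.4, 0.3, 0.2, 0.1]
example_labels = [1, 0, 1, 0, 0, 0]
print(roc_auc(example_scores, example_labels))  # 7 of 8 pairs ranked correctly -> 0.875
```

A value of 0.5 corresponds to chance ranking and 1.0 to a perfect ranking, which gives a scale for reading the reported 83%.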

#3Automatic formant extraction for sociolinguistic analysis of large corpora

Keelan Evanini (University of Pennsylvania)
Stephen Isard (University of Pennsylvania)
Mark Liberman (University of Pennsylvania)

In this paper, we propose a method of formant prediction from pole and bandwidth data, and apply this method to automatically extract F1 and F2 values from a corpus of regional dialect variation in North America that contains 134,000 manual formant measurements. These predicted formants are shown to increase performance over the default formant values from a popular speech analysis package. Finally, we demonstrate that sociolinguistic analysis based on vowel formant data can be conducted reliably using the automatically predicted values, and we argue that sociolinguists should begin to use this methodology in order to be able to analyze larger amounts of data efficiently.

#4Investigating phonetic information reduction and lexical confusability

William Hartmann (The Ohio State University)
Eric Fosler-Lussier (The Ohio State University)

In the presence of pronunciation variation and the masking effects of additive noise, we investigate the role of phonetic information reduction and lexical confusability on ASR performance. Contrary to previous work (Briscoe, 1989), we show that place of articulation as a representation for unstressed segments performs at least as well as manner of articulation in the presence of additive noise. Methods of phonetic reduction introduce lexical confusability, which negatively impacts performance. By limiting this confusability, recognizers that employ high levels of phonetic reduction (40.1%) can perform as well as a baseline system in the presence of nonstationary noise.

#5Improving phone recognition performance via phonetically-motivated units

Hyejin Hong (Department of Linguistics, Seoul National University, Seoul, Korea)
Minhwa Chung (Department of Linguistics, Seoul National University, Seoul, Korea)

This paper examines how phonetically-motivated units affect the performance of phone recognition systems. Focusing on the realization of /h/, which is one of the most error-prone phones in Korean phone recognition, three different phone sets are designed by considering optional phonetic constraints which show complementary distributions. Experimental results show that one of the proposed sets, the h-deletion set, improves phone recognition performance compared to the baseline phone recognizer. It is noteworthy that this set needs no additional phonetic unit, which means that no additional HMMs need to be modeled; accordingly, it has an advantage in terms of model size. Moreover, it achieves performance competitive with the baseline system in terms of word recognition as well. Thus, this phonetically-motivated approach to improving phone recognition performance is expected to be useful in embedded solutions which require a fast and light recognition process.

#6An Evaluation of Formant Tracking method on an Arabic labeled Database

Imen Jemaa (Unité de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Oussama Rekhis (Unité de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Kais Ouni (Unité de Recherche Traitement du Signal, Traitement de l'Image et Reconnaissance de Formes)
Yves Laprie (Equipe Parole, LORIA Nancy, France)

In this paper we present a labeled Arabic database of the first three formant tracks. This database is used to evaluate a new automatic formant tracking algorithm based on Fourier ridge detection. In this method we have introduced a continuity constraint based on the computation of the center of gravity for a set of formant frequency candidates. This connects each frame of speech to its neighbours and thus improves the robustness of the tracking. The formant trajectories obtained from the proposed algorithm are compared to those manually labeled in the database and those given by the LPC-based Praat tool.

#7Comparison of Manual and Automated Estimates of Subglottal Resonances

Wolfgang Wokurek (IMS Uni Stuttgart)
Andreas Madsack (IMS Uni Stuttgart)

This study compares manual measurements of the first two subglottal resonances to the results of an automated measurement procedure for the same quantities. We also briefly sketch the sensor prototype that is used for the measurements. The subglottal resonances are presented in the space spanned by the vowels' first two formants. A three axis acceleration sensor is gently pressed at the neck of the speaker. In front of the ligamentum conicum, located near the lower end of the larynx, pressure signals may be recorded that follow the subglottal pressure changes at least up to 2 kHz bandwidth. The recordings of the subglottal pressure signals are made simultaneously with recordings of the electroglottogram and the acoustic speech sound with 12 male and 12 female speakers.

#8Using durational cues in a computational model of spoken-word recognition

Odette Scharenborg (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)

Evidence that listeners use durational cues to help resolve temporarily ambiguous speech input has accumulated over the past few years. In this paper, we investigate whether durational cues are also beneficial for word recognition in a computational model of spoken-word recognition. Two sets of simulations were carried out using the acoustic signal as input. The simulations showed that the computational model, like humans, benefits from durational cues during word recognition, and uses these to disambiguate the speech signal. These results thus provide support for the theory that durational cues play a role in spoken-word recognition.

#9Second language discrimination vowel contrasts by adults speakers with a five vowel system

Bianca Sisinni (CRIL (Centro di Ricerca Interdisciplinare sul Linguaggio) - Salento University (Lecce - Italy))
Mirko Grimaldi (CRIL (Centro di Ricerca Interdisciplinare sul Linguaggio) - Salento University (Lecce - Italy))

This study tests the ability of a group of Salento Italian undergraduate students that have been exposed to L2 in a scholastic context to perceive British English second language (L2) vowel phonemes. The aim is to verify if the Perceptual Assimilation Model could be applied to them. In order to test their ability to perceive L2 phonemes, subjects have executed an identification and an oddity discrimination test. The results indicated that the L2 discrimination processes are in line with those predicted by the PAM, supporting the idea that students with a formal L2 background are still naïve listeners to the L2.

#10Three-way Laryngeal Categorization of Japanese, French, English and Chinese Plosives by Korean Speakers

Tomohiko Ooigawa (Phonetics Laboratory, Sophia University, Tokyo, Japan)
Shigeko Shinohara (Phonetics Laboratory, Sophia University, Tokyo, Japan)

Korean has a three-way laryngeal contrast in oral stops. This paper reports perception patterns of plosives of Japanese, French, English and Chinese by Korean speakers. In Korean loanwords, laryngeal contrasts of Japanese, French, and English plosives show distinct patterns. To test whether perception explains the loanword patterns, we selected languages with different acoustic properties and carried out perception tests. Our results reveal discrepancies between the phonological adaptation and the acoustic perception patterns.

#11The effect of F0 peak-delay on the L1 / L2 perception of English lexical stress

Shinichi Tokuma (Chuo University)
Yi Xu (University College London)

This study investigated the perceptual effect of F0 peak-delay on L1 / L2 perception of English lexical stress. A bisyllabic English non-word /nInI/ whose F0 was set to reach its peak in the second syllable was embedded in a frame sentence and used as the stimulus of the perceptual experiment. Native English and Japanese speakers were asked to determine lexical stress locations in the experiment. The results showed that in the perception of English lexical stress, delayed F0 peaks which were aligned with the second syllable of the stimulus words perceptually affected Japanese and English groups in the same manner: both groups perceived the delayed F0 peaks as a cue to lexical stress in the first syllable when the peaks were aligned with, or before, the end of /n/ in the second syllable. A supplementary experiment conducted on Japanese speakers confirmed the location of the categorical boundary. These findings are supported by the data provided by previous studies.

#12Lexical tone production by Cantonese speakers with Parkinson’s disease

Joan K-Y Ma (Dresden University of Technology)

This study investigated lexical tone production in Cantonese speakers with Parkinson’s disease (PD) and the effect of intonation on the production of lexical tone. Speech data was collected from five Cantonese PD speakers. Speech materials consisted of targets contrasting in tones, embedded in different sentence contexts (initial, medial and final) and intonations (statements and questions). Analysis of the normalized F0 patterns showed that PD speakers contrasted the six lexical tones in a similar manner to control speakers across positions and intonations, except at the final position of questions. Significantly lower F0 values were found at the 75% and 100% time points of the final syllable of questions for the PD speakers than for the control speakers, indicating that intonation has a smaller influence on the F0 patterns of lexical tones for PD speakers than for control speakers.

#13Acoustic cues of palatalisation in plosive + lateral onset clusters

Daniela Müller (CLLE-ERSS, Université de Toulouse 2 - Le Mirail, Toulouse, France & Romanisches Seminar, Ruprecht-Karls-Universität Heidelberg, Heidelberg, Germany)
Sidney Martin Mota (Escola Oficial d'Idiomes de Tarragona, Tarragona, Spain)

Palatalisation of /l/ in obstruent + lateral onset clusters in the absence of a following palatal sound has received a considerable amount of attention from historical linguistics. The phonetics of its development, however, remains less well investigated. This paper aims at studying the acoustic cues that could have led plosive + lateral onset clusters to develop palatalisation. It is found that onset clusters with velar plosives favour palatalisation more than labial + lateral clusters, and that a high degree of darkness diminishes the likelihood that palatalisation takes place.

Wed-Ses1-P3:
Statistical Parametric Synthesis II

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:Simon King

#1A Bayesian Approach to Hidden Semi-Markov Model Based Speech Synthesis

Kei Hashimoto (Nagoya Institute of Technology)
Yoshihiko Nankaku (Nagoya Institute of Technology)
Keiichi Tokuda (Nagoya Institute of Technology)

This paper proposes a Bayesian approach to hidden semi-Markov model (HSMM) based speech synthesis. Recently, hidden Markov model (HMM) based speech synthesis based on the Bayesian approach was proposed. The Bayesian approach is a statistical technique for estimating reliable predictive distributions by treating model parameters as random variables. In the Bayesian approach, all processes for constructing the system are derived from one single predictive distribution which exactly represents the problem of speech synthesis. However, there is an inconsistency between training and synthesis: although the speech is synthesized from HMMs with explicit state duration probability distributions, HMMs are trained without them. In this paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Bayesian speech synthesis system. Experimental results show that the use of HSMM improves the naturalness of the synthesized speech.

#2Rich Context Modeling for High Quality HMM-Based TTS

Zhi-Jie Yan (Microsoft Research Asia)
Yao Qian (Microsoft Research Asia)
Frank K. Soong (Microsoft Research Asia)

This paper presents a rich context modeling approach to high quality HMM-based speech synthesis. We first analyze the over-smoothing problem in conventional decision tree tying-based HMM, and then propose to model the training speech tokens with rich context models. A special training procedure is adopted for reliable estimation of the rich context model parameters. In synthesis, a search algorithm following a context-based pre-selection is performed to determine the optimal rich context model sequence which generates natural and crisp output speech. Experimental results show that spectral envelopes synthesized by the rich context models have crisper formant structures and evolve with richer details than those obtained by the conventional models. The speech quality improvement is also perceived by listeners in a subjective preference test, in which 76% of the sentences synthesized using rich context modeling are preferred.

#3Tying covariance matrices to reduce the footprint of HMM-based speech synthesis systems

Keiichiro Oura (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)
Heiga Zen (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)
Yoshihiko Nankaku (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)
Akinobu Lee (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)
Keiichi Tokuda (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)

This paper proposes a technique for reducing the footprint of HMM-based speech synthesis systems by tying all covariance matrices. HMM-based speech synthesis systems usually have a smaller footprint than unit-selection synthesis systems because statistics rather than speech waveforms are stored. However, further reduction is essential to put them on embedded devices which have very small memory. Based on the empirical knowledge that covariance matrices have a smaller impact on the quality of synthesized speech than mean vectors, here we propose a technique for clustering mean vectors while tying all covariance matrices. Subjective listening test results show that the proposed technique can shrink the footprint of an HMM-based speech synthesis system while retaining the quality of synthesized speech.
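
To make the footprint argument concrete, the sketch below counts stored Gaussian parameters with and without a single tied covariance. It is only a back-of-the-envelope illustration under stated assumptions: one Gaussian per state, diagonal covariances, and a hypothetical model of 2000 states with 120-dimensional features. None of these figures come from the paper, and the mean-vector clustering the abstract also proposes is not modelled here.

```python
def parameter_count(num_states: int, dim: int, tie_covariances: bool) -> int:
    """Count mean and variance values stored for single-Gaussian, diagonal-covariance states."""
    means = num_states * dim
    # With tying, one shared variance vector replaces one vector per state.
    variances = dim if tie_covariances else num_states * dim
    return means + variances


# Hypothetical model size, chosen only for illustration.
untied = parameter_count(2000, 120, tie_covariances=False)  # 480000 values
tied = parameter_count(2000, 120, tie_covariances=True)     # 240120 values
print(untied, tied)
```

Under these assumptions, tying alone roughly halves the number of stored values, since the variance storage no longer grows with the number of states.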

#4The HMM Synthesis Algorithm of an Embedded Unified Speech Recognizer and Synthesizer

Guntram Strecha (Technische Universität Dresden, Germany)
Matthias Wolff (Technische Universität Dresden, Germany)
Frank Duckhorn (Technische Universität Dresden, Germany)
Sören Wittenberg (Technische Universität Dresden, Germany)
Constanze Tschöpe (Fraunhofer Institute for Non-Destructive Testing, Dresden, Germany)

In this paper we present an embedded unified speech recognizer and synthesizer using identical, speaker independent Hidden-Markov-Models. The system was prototypically realized on a signal processor extended by a field programmable gate array. In a first section we will give a brief overview of the system. The main part of the paper deals with a specially designed unit based HMM synthesis algorithm. In a last section we state the results of an informal listening evaluation of the speech synthesizer.

#5Syllable HMM based Mandarin TTS and Comparison with Concatenative TTS

Zhiwei Shuang (University of Science and Technology of China, IBM China Research Lab)
Shiyin Kang (Tsinghua University)
Qin Shi (IBM China Research Lab)
Yong Qin (IBM China Research Lab)
Lianhong Cai (Tsinghua University)

This paper introduces a Syllable HMM based Mandarin TTS system. 10-state left-to-right HMMs are used to model each syllable. We leverage the corpus and the front end of a concatenative TTS system to build the Syllable HMM based TTS system. Furthermore, we utilize the unique consonant/vowel structure of Mandarin syllables to improve the voiced/unvoiced decision of HMM states. Evaluation results show that the Syllable HMM based Mandarin TTS system with a model size of 5.3 MB can achieve an overall quality close to that of a concatenative TTS system with a data size of 1 GB.

#6Pulse Density Representation of Spectrum for Statistical Speech Processing

Yoshinori Shiga (National Institute of Information and Communications Technology (NICT), Japan)

This study investigates a new spectral representation that is suitable for statistical parametric speech synthesis. Statistical speech processing involves spectral averaging in the training process; however, averaging spectra in the domain of conventional speech parameters over-smooths the resulting means, which degrades the quality of the speech synthesised. In the proposed representation, high-energy parts of the spectrum, such as sections of dominant formants, are represented by a group of high-density pulses in the frequency domain. These pulses' locations (i.e., frequencies) are then parameterised. The representation is theoretically capable of averaging spectra with less over-smoothing effect. The experimental results provide the optimal values of factors necessary for the encoding and decoding of the proposed representation towards the future applications of speech synthesis.

#7Parameterization of Vocal Fry in HMM-Based Speech Synthesis

Hanna Silén (Department of Signal Processing, Tampere University of Technology, Finland)
Elina Helander (Department of Signal Processing, Tampere University of Technology, Finland)
Jani Nurminen (Nokia Devices R&D, Tampere, Finland)
Moncef Gabbouj (Department of Signal Processing, Tampere University of Technology, Finland)

HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry typically occurring at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved especially due to the better modeling of vocal fry.

#8A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis

Thomas Drugman (Faculté Polytechnique de Mons)
Geoffrey Wilfart (Acapela Group)
Thierry Dutoit (Faculté Polytechnique de Mons)

Speech generated by parametric synthesizers generally suffers from a typical buzziness. In order to alleviate this problem, a more suitable model of the excitation should be adopted. For this, we propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by a maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis, while the stochastic component is a high-pass filtered noise. The proposed residual model is integrated within an HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significant improvement for both male and female voices. The proposed model is also shown to be suitable for integration in commercial applications.

#9A decision tree-based clustering approach to state definition in an excitation modeling framework for HMM-based speech synthesis

Ranniery Maia (National Institute of Information and Communications Technology, Japan)
Tomoki Toda (Nara Institute of Science and Technology, Japan)
Keiichi Tokuda (Nagoya Institute of Technology, Japan)
Shinsuke Sakai (National Institute of Information and Communications Technology, Japan)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)

This paper presents a decision tree-based algorithm to cluster residual segments assuming an excitation model based on state-dependent filtering of pulse train and white noise. The decision tree construction principle is the same as the one applied to speech recognition. Here parent nodes are split using the residual maximum likelihood criterion. Once these excitation decision trees are constructed for residual signals segmented by full context models, using questions related to the full context of the training sentences, they can be utilized for excitation modeling in speech synthesis based on hidden Markov models (HMM). Experimental results have shown that the algorithm in question is very effective in terms of clustering residual signals given segmentation, pitch marks and full context questions, resulting in filters with good residual modeling properties.

#10An improved minimum generation error based model adaptation for HMM-based speech synthesis

Yi-Jian Wu (Microsoft)
Long Qin (Carnegie Mellon University)
Keiichi Tokuda (Nagoya Institute of Technology)

A minimum generation error (MGE) criterion has been proposed for model training in HMM-based speech synthesis. In this paper, we apply the MGE criterion to model adaptation for HMM-based speech synthesis, and introduce an MGE linear regression (MGELR) based model adaptation algorithm, where the regression matrices used to transform source models are optimized so as to minimize the generation errors of adaptation data. In addition, we incorporate the recent improvements of the MGE criterion into MGELR-based model adaptation, including state alignment under the MGE criterion and using a log spectral distortion (LSD) instead of Euclidean distance as the spectral distortion measure. In the experimental results, the adaptation performance was improved after incorporating these two techniques, and the formal listening tests showed that the quality and speaker similarity of synthesized speech after MGELR-based adaptation were significantly improved over the original MLLR-based adaptation.
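
For readers unfamiliar with the two distortion measures contrasted above, here is a minimal sketch of each. The abstract does not give the exact LSD definition used by the authors, so the code assumes a common textbook form (root-mean-square difference of two power spectra in dB); the function names and the three-bin spectra are illustrative only.

```python
import math


def log_spectral_distortion(power_a, power_b):
    """RMS difference in dB between two power spectra sampled at the same frequency bins."""
    if len(power_a) != len(power_b) or not power_a:
        raise ValueError("spectra must be non-empty and of equal length")
    squared_db = [(10.0 * math.log10(a / b)) ** 2 for a, b in zip(power_a, power_b)]
    return math.sqrt(sum(squared_db) / len(squared_db))


def euclidean_distance(vec_a, vec_b):
    """Plain Euclidean distance between two equal-length parameter vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(vec_a, vec_b)))


# Hypothetical three-bin power spectra that differ by 10 dB in the middle bin only.
reference = [1.0, 10.0, 100.0]
generated = [1.0, 1.0, 100.0]
print(log_spectral_distortion(reference, generated))  # sqrt(100 / 3), about 5.77 dB
print(euclidean_distance(reference, generated))       # 9.0
```

The log-domain measure weights a given ratio equally in every bin, whereas the Euclidean distance on linear values is dominated by high-energy bins.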

#11Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models

Matthew Gibson (Cambridge University)

Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to firstly estimate the transcription of the adaptation data. By defining a mapping between HMM-based synthesis models and ASR-style models, this paper introduces an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Further, this enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data.

#12Speaker adaptation using a parallel phone set pronunciation dictionary for Thai-English Bilingual TTS

Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC), Thailand)
Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand)
Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC), Thailand)
Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC), Thailand)
Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC), Thailand)

This paper develops a bilingual Thai-English TTS system from two monolingual HMM-based TTS systems. An English Nagoya HMM-based TTS system (HTS) provides correct pronunciations of English words but the voice is different from the voice in a Thai HTS system. We apply a CSMAPLR adaptation technique to make the English voice sound more similar to the Thai voice. To overcome a phone mapping problem that normally occurs with a pair of languages that have dissimilar phone sets, we utilize a cross-language pronunciation mapping through a parallel phone set pronunciation dictionary. The results from the subjective listening test show that English words synthesized by our proposed system are more intelligible (with 0.61 higher MOS) than those of the existing bilingual Thai-English TTS. Moreover, with the proposed adaptation method, the synthesized English words sound more similar to synthesized Thai words.

#13HMM-based Automatic Eye-blink Synthesis from Speech

Michal Dziemianko (Centre for Speech Technology Research, University of Edinburgh, UK)
Gregor Hofer (Centre for Speech Technology Research, University of Edinburgh, UK)
Hiroshi Shimodaira (Centre for Speech Technology Research, University of Edinburgh, UK)

In this paper we present a novel technique to automatically synthesize eye blinking from a speech signal. Animating the eyes of a talking head is important as they are a major focus of attention during interaction. The developed system predicts eye blinks from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". The evaluation of the recognition performance showed that eye blinks can be predicted from speech with an F-score value upwards of 52%, which is well above chance. Additionally, a perceptual evaluation was conducted, which confirmed that adding eye blinking significantly improves the perception of the character. Finally, it showed that the speech-synchronised synthesized blinks outperform random blinking in naturalness ratings.
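
The F-score quoted above combines precision and recall into one figure. The sketch below shows the standard computation from detection counts; the counts are hypothetical and were chosen only so that the result lands at 52%, and the function name and the `beta` parameter are illustrative rather than taken from the paper.

```python
def f_score(true_pos: int, false_pos: int, false_neg: int, beta: float = 1.0) -> float:
    """F-measure from detection counts; beta=1 gives the harmonic mean of precision and recall."""
    if true_pos == 0:
        return 0.0
    precision = true_pos / (true_pos + false_pos)
    recall = true_pos / (true_pos + false_neg)
    beta_sq = beta * beta
    return (1.0 + beta_sq) * precision * recall / (beta_sq * precision + recall)


# Hypothetical blink detections: 52 correct, 48 spurious, 48 missed.
print(round(f_score(52, 48, 48), 2))  # precision = recall = 0.52, so F = 0.52
```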

Wed-Ses1-P4:
Resources, annotation and evaluation

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:Michael Wagner

#1Resources for Speech Research: Present and Future Infrastructure Needs

Lou Boves (Department of Language and Speech, University of Nijmegen)
Rolf Carlson (Speech, Music and Hearing, KTH)
Erhard Hinrichs (Seminar für Sprachwissenschaft, Universität Tübingen)
David House (Speech, Music and Hearing, KTH)
Steven Krauwer (Utrecht institute of Linguistics UiL OTS, Utrecht University)
Lothar Lemnitzer (Seminar für Sprachwissenschaft, Universität Tübingen)
Martti Vainio (Department of Speech Sciences, University of Helsinki)
Peter Wittenburg (Max Planck Institute for Psycholinguistics)

This paper introduces the EU-FP7 project CLARIN, a joint effort of over 150 institutions in Europe, aimed at the creation of a sustainable language resources and technology infrastructure for the humanities and social sciences research community. The paper briefly introduces the vision behind the project and how it relates to speech research with a focus on the contributions that CLARIN can and will make to research in spoken language processing.

#2Speech recordings via the internet: An overview of the VOYS project in Scotland

Catherine Dickie (Speech Science Research Centre, Queen Margaret University, Edinburgh, UK)
Felix Schaeffler (Speech Science Research Centre, Queen Margaret University, Edinburgh, UK)
Christoph Draxler (Institute of Phonetics and Speech Processing, Ludwig-Maximilian University, Munich, Germany)
Klaus Jänsch (Institute of Phonetics and Speech Processing, Ludwig-Maximilian University, Munich, Germany)

The VOYS (Voices of Young Scots) project aims to establish a speech database of adolescent Scottish speakers. This database will serve speech recognition technology and sociophonetic research. 300 pupils will ultimately be recorded at secondary schools in 10 locations in Scotland. Recordings are performed via the Internet using two microphones (close-talk and desktop) in 22.05 kHz 16 bit linear stereo signal quality. VOYS is the first large-scale and cross-boundary speech data collection based on the WikiSpeech content management system for speech resources. In VOYS, schools receive a kit containing the microphones and A/D interface and they organise the recordings themselves. The recorded data is immediately uploaded to the server in Munich, relieving the schools of all data-handling tasks. This paper outlines the corpus specification, describes the technical issues, summarises the signal quality and gives a status report.
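
The recording format stated above implies a fixed raw data rate, which matters when recordings are uploaded over the Internet. The following sketch works out that rate for uncompressed linear PCM; the function name is illustrative, and the assumption that the audio is transferred without compression is ours, not something the abstract states.

```python
def pcm_bytes_per_second(sample_rate_hz: int, bits_per_sample: int, channels: int) -> int:
    """Raw data rate of uncompressed linear PCM audio, in bytes per second."""
    return sample_rate_hz * bits_per_sample * channels // 8


rate = pcm_bytes_per_second(22050, 16, 2)
print(rate)       # 88200 bytes per second
print(rate * 60)  # 5292000 bytes, roughly 5.3 MB per minute of stereo recording
```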

#3The Multi-Session Audio Research Project (MARP) Corpus: Goals, Design and Initial Findings

Aaron Lawson (RADC Inc.)
Allen Stauffer (RADC Inc.)
Edward Cupples (RADC Inc.)
Stanley Wenndt (Air Force Research Laboratory)
Wayne Bray (Oasis Systems)
John Grieco (Air Force Research Laboratory)

This paper describes the composition and goals of the Multi-session Audio Research Project (MARP) corpus and some initial experimental findings. The MARP corpus is a three-year longitudinal collection of 21 sessions and more than 60 participants. This study was undertaken to test the impact of various factors on speaker recognition, such as inter-session variability, intonation, aging, whispering and text dependency. Initial results demonstrate the impact of sentence intonation, whispering, text dependency and cross-session tests. These results highlight the sensitivity of speaker recognition to vocal, environmental and phonetic conditions that are commonly encountered but rarely explored or tested.

#4Structure and Annotation of Polish LVCSR Speech Database

Katarzyna Klessa (The Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland)
Grażyna Demenko (1. The Institute of Linguistics, Adam Mickiewicz University, Poznań, Poland, 2. Poznań Supercomputing and Networking Center, Polish Academy of Sciences, Poznań, Poland)

This paper reports on the problems occurring in the process of building LVCSR (Large Vocabulary Continuous Speech Recognition) corpora based on the internal evaluation of the Polish database JURISDIC. The initial assumptions are discussed together with technical matters concerning the database realization and annotation results. Providing rich database statistics was considered crucial especially regarding linguistic description both for database evaluation and for the implementation of linguistic factors in acoustic models for speech recognition. The assumed principles for database construction are: low redundancy, acoustic-phonetic variability adequate to dictation task, representativeness, balanced, heterogeneous structure enabling separate or combined modeling of phonetic-acoustic structures.

#5Balanced corpus of informal spoken Czech: compilation, design and findings

Martina Waclawičová (Institute of the Czech National Corpus, Charles University in Prague, Czech Republic)
Michal Křen (Institute of the Czech National Corpus, Charles University in Prague, Czech Republic)
Lucie Válková (Institute of the Czech National Corpus, Charles University in Prague, Czech Republic)

The paper presents ORAL2008, a new 1-million-word corpus of spoken Czech compiled within the framework of the Czech National Corpus project. ORAL2008 is designed as a representation of authentic spoken language used in informal situations, and it is balanced in the main sociolinguistic categories of speakers. The paper also concentrates on the data collection, its broad coverage and the transcription system that registers the variability of spoken Czech. Finally, possible findings based on the provided data are outlined.

#6JTrans: an open-source software for semi-automatic text-to-speech alignment

Christophe Cerisara (LORIA UMR 7503)
Odile Mella (LORIA UMR 7503)
Dominique Fohr (LORIA UMR 7503)

Aligning speech corpora with text transcriptions is an important requirement of many speech processing and data mining applications and of linguistic research. Despite recent progress in the field of speech recognition, many linguists still manually align spontaneous and noisy speech recordings to guarantee good alignment quality. This work proposes open-source Java software with an easy-to-use GUI that integrates dedicated semi-automatic speech alignment algorithms that can be dynamically controlled and guided by the user. The objective of this software is to facilitate and speed up the process of creating and aligning speech corpora.

#7Predicting the quality of multimodal systems based on judgments of single

Ina Wechsung (Deutsche Telekom Laboratories, TU Berlin, Germany)
Klaus-Peter Engelbrecht (Deutsche Telekom Laboratories, TU Berlin, Germany)
Anja B. Naumann (Deutsche Telekom Laboratories, TU Berlin, Germany)
Julia Seebode (Research training group prometei, TU Berlin, Germany)
Stefan Schaffer (Research training group prometei, TU Berlin, Germany)
Florian Metze (interACT center, Carnegie Mellon University, Pittsburgh, PA, USA.)
Sebastian Möller (Deutsche Telekom Laboratories, TU Berlin, Germany)

This paper investigates the relationship between user ratings of multimodal systems and ratings of the single modalities. Based on previous research showing precise predictions of ratings of multimodal systems from single modality ratings, it was hypothesized that the accuracy might have been caused by the participants' efforts to rate consistently. We address this issue with two new studies. In the first study, the multimodal system was presented before the single modality versions were known by the users. In the second study, the type of system was changed, and age effects were investigated. We apply linear regression and show that models get worse when the order is changed. In addition, models for younger users perform better than those for older users. We conclude that ratings can be impacted by the effort of users to judge consistently, as well as their ability to do so.

#8Auto-Checking Speech Transcriptions by Multiple Template Constrained Posterior

Lijuan Wang (Microsoft Research Asia, Beijing, China)
Shenghao Qin (Microsoft Business Division, Beijing, China)
Frank Soong (Microsoft Research Asia, Beijing, China)

Checking transcription errors in speech databases is an important but tedious task that traditionally requires intensive manual labor. In [9], Template Constrained Posterior (TCP) was proposed to automate the checking process by screening potentially erroneous sentences with a single context template. However, the single-template method is not robust and requires parameter optimization that still involves some manual work. In this work, we propose to use multiple templates, which is more robust and requires no development data for parameter optimization. By using the multiple hypothesis sifting capabilities of TCP, with templates ranging from well-defined full context to loosely defined context such as wild cards, the confidence for a focus unit can be measured at different expected accuracies. The joint verification by multiple TCPs improves the measured confidence of each unit in the transcription and is robust across different speech databases.

#9Subjective Experiments on Influence of Response Timing in Spoken Dialogues

Toshihiko Itoh (Graduate School of Information Science and Technology, Hokkaido University)
Norihide Kitaoka (Graduate School of Information Science, Nagoya University)
Ryota Nishimura (Department of Information and Computer Sciences, Toyohashi University of Technology)

To verify the validity of analysis results relating to dialogue rhythm from earlier studies, we produced spoken dialogues based on analysis results relating to response timing and the other spoken dialogues, and performed subjective experiments to investigate parameters such as the naturalness of the dialogue, the incongruity of the synthesized speech, and the ease of comprehension of the utterances. We used very short task-oriented four-turn dialogues using synthesized speech in Experiment 1, and approx. one-minute free-conversation dialogues in Experiment 2 using natural human speech and synthesized speech. As a result, we were able to show that a natural response timing exists for utterances, and that response timings that conform to the utterance contents are felt to be more natural, thus demonstrating the validity of the analysis results relating to dialogue rhythm.

#10Usability Study of VUI consistent with GUI Focusing on Age-Groups

Jun Okamoto (Information Technology Laboratory, Asahi Kasei Corporation)
Tomoyuki Kato (Information Technology Laboratory, Asahi Kasei Corporation)
Makoto Shozakai (Information Technology Laboratory, Asahi Kasei Corporation)

We studied the usability of a Voice User Interface (VUI) that is consistent with a Graphical User Interface (GUI), and focused on its dependency on user age group. Usability tests were iteratively conducted on 245 Japanese subjects in age groups from their 20s to their 60s using a prototype of an in-vehicle information application. We then calculated and analyzed statistics of the usability tests. We discuss the differences in usability with respect to age group and how to handle them. We propose that it is necessary to make voice guidance straightforward and to devise a VUI consistent with a GUI (VGUI) in order to let users understand the system structure. We also found that the default design of a VGUI should be as simple as possible so that elderly users, who may be slow to learn the new system structure, are able to learn it easily.

#11Annotating Communicative Function and Semantic Content in Dialogue Act for Construction of Consulting Dialogue Systems

Teruhisa Misu (NICT)
Kiyonori Ohtake (NICT)
Chiori Hori (NICT)
Hideki Kashioka (NICT)
Satoshi Nakamura (NICT)

Our goal in this study is to train a dialogue manager that can handle consulting dialogues through spontaneous interactions from a tagged dialogue corpus. We have collected 130 hours of consulting dialogues in the sightseeing guidance domain. This paper provides our taxonomy of dialogue act (DA) annotation that can describe two aspects of utterances. One is the communicative function (speech act), and the other is the semantic content of an utterance. We provide an overview of the Kyoto tour guide dialogue corpus and a preliminary analysis using the dialogue act tags.

#12Improved Speech Summarization with Multiple-Hypothesis Representations and Kullback-Leibler Divergence Measures

Shih-Hsiang Lin (National Taiwan Normal University)
Berlin Chen (National Taiwan Normal University)

Imperfect speech recognition often leads to degraded performance when leveraging existing text-based methods for speech summarization. To alleviate this problem, this paper investigates various ways to robustly represent the recognition hypotheses of spoken documents beyond the top scoring ones. Moreover, a new summarization method stemming from the Kullback-Leibler (KL) divergence measure and exploring both the sentence and document relevance information is proposed to work with such robust representations. Experiments on broadcast news speech summarization seem to demonstrate the utility of the presented approaches.
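
As a rough illustration of KL-divergence-based sentence ranking in general, the sketch below scores each sentence by the divergence between a document unigram model and a smoothed sentence unigram model, with lower divergence ranking higher. The function names, the additive smoothing constant, and the use of plain unigram models are assumptions made for this example; the abstract does not specify the authors' formulation or how multiple recognition hypotheses enter it.

```python
import math
from collections import Counter

def unigram(words, vocab, alpha=0.1):
    # Additively smoothed unigram distribution over a fixed vocabulary
    # (alpha is a hypothetical smoothing constant).
    counts = Counter(words)
    total = len(words) + alpha * len(vocab)
    return {w: (counts[w] + alpha) / total for w in vocab}

def kl_divergence(p, q):
    # KL(p || q) for two distributions defined over the same vocabulary.
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

def rank_sentences(sentences):
    # Return sentence indices ordered from lowest to highest divergence
    # between the document model and each sentence model.
    doc = [w for s in sentences for w in s]
    vocab = set(doc)
    p_doc = unigram(doc, vocab)
    scored = [(kl_divergence(p_doc, unigram(s, vocab)), i)
              for i, s in enumerate(sentences)]
    return [i for _, i in sorted(scored)]

# Toy document: the first sentence mirrors the document's word distribution.
toy = [["a", "b"], ["a"], ["b"]]
ranking = rank_sentences(toy)
```

In this toy case the first sentence has the same word distribution as the whole document, so it is ranked first.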

#13An Improved Speech Segmentation Quality Measure: the R-value

Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Unto Kalervo Laine (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Toomas Altosaar (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)

Phone segmentation in ASR is usually performed indirectly by Viterbi decoding of HMM output. Direct approaches also exist, e.g., blind speech segmentation algorithms. In either case, performance of automatic speech segmentation algorithms is often measured using automated evaluation algorithms and used to optimize a segmentation system’s performance. However, evaluation approaches reported in literature were found to be lacking. Also, we have determined that increases in phone boundary location detection rates are often due to increased over-segmentation levels and not to algorithmic improvements, i.e., by simply adding random boundaries a better hit-rate can be achieved when using current quality measures. Since established measures were found to be insensitive to this type of random boundary insertion, a new R-value quality measure is introduced that indicates how close a segmentation algorithm’s performance is to an ideal point of operation.
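
The over-segmentation problem described above can be made concrete with a small sketch. The abstract does not state the formula, so the definitions below follow the formulation commonly associated with the R-value and should be read as assumptions: hit rate (HR) and over-segmentation (OS) are in percent, and the function and parameter names are illustrative.

```python
import math

def r_value(n_hit, n_ref, n_found):
    # n_hit: reference boundaries that were correctly detected
    # n_ref: number of reference boundaries
    # n_found: number of boundaries produced by the algorithm
    hr = 100.0 * n_hit / n_ref                  # hit rate in percent
    os_ = 100.0 * (n_found / n_ref - 1.0)       # over-segmentation in percent
    # Distance from the ideal operating point (HR = 100, OS = 0).
    r1 = math.sqrt((100.0 - hr) ** 2 + os_ ** 2)
    # Signed distance term that penalises hits gained through over-segmentation.
    r2 = (-os_ + hr - 100.0) / math.sqrt(2.0)
    return 1.0 - (abs(r1) + abs(r2)) / 200.0

perfect = r_value(n_hit=100, n_ref=100, n_found=100)
oversegmented = r_value(n_hit=100, n_ref=100, n_found=200)
```

Under these assumptions, a segmentation that finds every reference boundary with no extras scores 1, while one that reaches the same hit rate by doubling the number of boundaries scores far lower, which is the behaviour that hit rate alone fails to capture.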

#14No Sooner Said Than Done? Testing Incrementality of Semantic Interpretations of Spontaneous Speech

Michaela Atterer (University of Potsdam)
Timo Baumann (University of Potsdam)
David Schlangen (University of Potsdam)

Ideally, a spoken dialogue system should react without much delay to a user's utterance. Such a system would already select an object, for instance, before the user has finished her utterance about moving this particular object to a particular place. A prerequisite for such a prompt reaction is that semantic representations are built up on the fly and passed on to other modules. Few approaches to incremental semantics construction exist, and, to our knowledge, none of those has been systematically tested on a spontaneous speech corpus. In this paper, we develop measures to test empirically on transcribed spontaneous speech to what extent we can create semantic interpretation on the fly with an incremental semantic chunker that builds a frame semantics.

Wed-Ses1-P2:
Prosody perception and language acquisition

Time:Wednesday 10:00 Place:Hewison Hall Type:Poster
Chair:David House

#1Perception of English Compound vs. Phrasal Stress: Natural vs. Synthetic Speech

Irene Vogel (University of Delaware)
Arild Hestvik (University of Delaware)
H. Timothy Bunnell (Nemours Biomedical Research)
Laura Spinu (University of Delaware)

The ability of listeners to distinguish between compound and phrasal stress in English was examined on the basis of a picture selection task. The responses to naturally and synthetically produced stimuli were compared. While greater overall accuracy was observed with the natural stimuli, the same pattern of greater accuracy with compound stress than with phrasal stress was observed with both types of stimuli.

#2New Method for Delexicalization and its Application to Prosodic Tagging for Text-to-Speech Synthesis

Martti Vainio (Department of Speech Sciences, University of Helsinki)
Antti Suni (Department of Speech Sciences, University of Helsinki)
Tuomo Raitio (Department of Signal Processing and Acoustics, Helsinki University of Technology)
Jani Nurminen (Nokia Devices R&D)
Juhani Järvikivi (Max Planck Institute for Psycholinguistics)
Paavo Alku (Department of Signal Processing and Acoustics, Helsinki University of Technology)

This paper describes a new flexible delexicalization method based on a glottal-excited parametric speech synthesis scheme. The system utilizes inverse filtered glottal flow and all-pole modelling of the vocal tract. The method provides a possibility to retain and manipulate all relevant prosodic features of any kind of speech. Most importantly, the features include voice quality, which has not been properly modeled in earlier delexicalization methods. The functionality of the new method was tested in a prosodic tagging experiment aimed at providing word prominence data for a text-to-speech synthesis system. The experiment confirmed the usefulness of the method and further corroborated earlier evidence that linguistic factors influence the perception of prosodic prominence.

#3Speech rate and pauses in non-native Finnish

Minnaleena Toivola (Department of General Linguistics, University of Helsinki, Finland)
Mietta Lennes (Department of General Linguistics and Department of Speech Sciences, University of Helsinki, Finland)
Eija Aho (Department of General Linguistics, University of Helsinki, Finland)

In this study, the temporal aspects of speech are compared in read-aloud Finnish produced by six native and 16 non-native speakers. It is shown that the speech and articulation rates as well as pause durations are different for native and non-native speakers. Moreover, differences exist between the groups of speakers representing four different non-native languages. Surprisingly, the native Finnish speakers tend to make longer pauses than the non-natives. The results are relevant when developing methods for assessing fluency or the strength of foreign accent.

#4Modelling similarity perception of intonation

Uwe Reichel (University of Munich)
Felicitas Kleber (University of Munich)
Raphael Winkelmann (University of Munich)

In this study, a perception experiment was carried out to examine the perceived similarity of intonation contours. Among other results, we found that the subjects are capable of producing consistent similarity judgements. On the basis of these data, we studied the influence of several physical distance measures on the human similarity judgements by grouping these measures into principal components and by comparing the weights of these components in a linear regression model predicting human perception. Non-correlation-based distance measures for f0 contours received the highest relative weight. Finally, we developed applicable linear regression and feed-forward neural network models predicting similarity perception of intonation on the basis of physical contour distances. The performance of the neural networks, measured in terms of mean absolute error, did not differ significantly from the human performance derived from judgement consistency.

#5Studying L2 Suprasegmental Features in Asian Englishes: A Position Paper

Helen Meng (The Chinese University of Hong Kong)
Chiu-yu Tseng (Academia Sinica)
Mariko Kondo (Waseda University)
Alissa Harrison (The Chinese University of Hong Kong)
Tanya Visceglia (Academia Sinica)

This position paper highlights the importance of suprasegmental training in second language (L2) acquisition. Suprasegmental features are manifested in terms of acoustic cues and convey important information about linguistic and information structures. Hence, L2 learners must harness appropriate suprasegmental productions for effective communication. However, this learning process is influenced by language transfer. We propose to design and collect a corpus to support systematic analysis of L2 suprasegmental features. We lay out a set of carefully selected textual environments that illustrate how suprasegmental features convey information including part-of-speech, syntax, focus, speech acts and semantics. We intend to use these textual environments for collecting speech data in a variety of Asian Englishes. Analyses of such corpora should lead to research findings that have important implications for language education and CALL applications.

#6Classification of disfluent phenomena as fluent communicative devices in specific prosodic contexts

Helena Moniz (FLUL/CLUL INESC-ID)
Isabel Trancoso (IST/INESC-ID)
Ana Mata (FLUL/CLUL)

This work explores prosodic cues of disfluent phenomena. In our previous work, we conducted a perceptual experiment regarding (dis)fluency ratings. Results suggested that some disfluencies may be considered felicitous by listeners, namely filled pauses and prolongations. In an attempt to discriminate which linguistic features are more salient in the classification of disfluencies as either fluent or disfluent phenomena, we used CART techniques on a corpus of 3.5 hours of spontaneous and prepared non-scripted speech. CART results pointed out 2 splits: break indices and contour shape. The first split indicates that events uttered at breaks 3 and 4 are considered felicitous. The second shows that these events must have flat or ascending contours to be considered as such; otherwise they are strongly penalized. Our preliminary results suggest that there are regular trends in the production of these events, namely, prosodic phrasing and contour shape.

#7Cross-Cultural Perception of Discourse Phenomena

Rolf Carlson (CTT, KTH)
Julia Hirschberg (Columbia University)

We discuss perception studies of two low-level indicators of discourse phenomena by Swedish, Japanese, and Chinese native speakers. Subjects were asked to identify upcoming prosodic boundaries and disfluencies in Swedish spontaneous speech. We hypothesize that speakers of prosodically unrelated languages should be less able to predict upcoming phrase boundaries but potentially better able to identify disfluencies, since indicators of disfluency are more likely to depend upon lexical as well as acoustic information. Surprisingly, however, we found that both phenomena were fairly well recognized by native and non-native speakers, although with some possible interference from word tones for the Chinese subjects.

#8Modelling Vocabulary Growth from Birth to Young Adulthood

Roger Moore (University of Sheffield)
Louis ten Bosch (Radboud University Nijmegen)

There has been considerable debate over the existence of the ‘vocabulary spurt’ phenomenon - an apparent acceleration in word learning that is commonly said to occur in children around the age of 18 months. This paper presents an investigation into modelling the phenomenon using data from almost 1800 children. The results indicate that the acquisition of a receptive/productive lexicon can be quite adequately modelled as a single growth function with an ecologically well founded and cognitively plausible interpretation. Hence it is concluded that there is little evidence for the vocabulary spurt phenomenon as a separable aspect of language acquisition.

#9Adaptive Non-negative Matrix Factorization in a Computational Model of Language Acquisition

Joris Driesen (Dept. ESAT, KULeuven, Leuven)
Louis ten Bosch (CLST, Radboud University, Nijmegen)
Hugo Van hamme (Dept. ESAT, KULeuven, Leuven)

During the early stages of language acquisition, young infants face the task of learning a basic vocabulary without the aid of prior linguistic knowledge. It is believed that long-term episodic memory plays an important role in this process. Experiments have shown that infants retain large amounts of very detailed episodic information about the speech they perceive (e.g. [1]). This weakly justifies the fact that some algorithms attempting to model the process of vocabulary acquisition computationally process large amounts of speech data in batch. Non-negative Matrix Factorization (NMF), a technique that is particularly successful in data mining but can also be applied to vocabulary acquisition (e.g. [2]), is such an algorithm. In this paper, we will integrate an adaptive variant of NMF into a computational framework for vocabulary acquisition, foregoing the need for long-term storage of speech inputs, and show its accuracy matches that of the batch algorithm.

#10Classifying clear and conversational speech based on acoustic features

Akiko Amano-Kusumoto (Oregon Health & Science University)
John-Paul Hosom (Oregon Health & Science University)
Izhak Shafran (Oregon Health & Science University)

This paper reports an investigation of features relevant for classifying two speaking styles, namely, conversational speaking style and clear (e.g. hyper-articulated) speaking style. Spectral and prosodic features were automatically extracted from speech and classified using decision tree classifiers and multilayer perceptrons to achieve accuracies of about 71% and 77% respectively. More interestingly, we found that out of the 56 features only about 9 features are needed to capture the most predictive power. While perceptual studies have shown that spectral cues are more useful than prosodic features for intelligibility [Kain2008], here we find prosodic features are more important for classification.

#11The Acoustic Characteristics of Russian Vowels in Children of 6 and 7 Years of Age

Elena Lyakso (Saint-Petersburg State University)
Olga Frolova (Saint-Petersburg State University)
Alex Grigoriev (Saint-Petersburg State University)

The purpose of this investigation is to examine the process by which the acoustic features of vowels in child speech approach the corresponding values in normal Russian adult speech. The formant structure, pitch and duration of the vowels were examined. The influence of word stress and palatal context on the formant structure of the vowels was taken into account. It was shown that word stress is formed by 6-7 years of age on the basis of the feature typical for the Russian language. The formant structure of the Russian vowels /u/ and /i/ is not formed by the age of 7 years. Native speakers recognize the meaning of 57-93% of words in the speech of 6- and 7-year-old children.

#12Japanese children’s acquisition of prosodic politeness expressions

Takaaki Shochi (Division of Cognitive Psychology, Kumamoto University, Japan)
Donna Erickson (Showa Music University, Kawasaki City, Japan)
Kaoru Sekiyama (Division of Cognitive Psychology, Kumamoto University, Japan)
Albert Rilliard (LIMSI-CNRS)
Véronique Aubergé (GIPSA Lab, Grenoble, France)

This paper presents a perception experiment to measure the ability of Japanese children in fourth and fifth grade of elementary school to recognize culturally encoded expressions of politeness and impoliteness in their native language. Audiovisual stimuli were presented to listeners, who rated the degree of politeness and a possible situation where such an expression could be used. Analysis of results focuses on the differences and the similarities between adult listeners and children, for each attitude and modality. Facial information seems to be retrieved earlier than audio information, and expressions of different degrees of Japanese politeness, including expressions of kyoshuku, are still not understood around 10 years of age.

#13Perceptual training of singleton and geminate stops in Japanese language by Korean learners

Mee Sonu (University of Waseda)
Keiichi Tajima (University of Hosei)
Hiroaki Kato (NICT/ATR)
Yoshinori Sagisaka (University of Waseda)

We aim to build up an effective perceptual training paradigm toward a computer-assisted language learning (CALL) system for second language. This study investigated the effectiveness of the perceptual training on Korean-speaking learners of Japanese in the distinction between geminate and singleton stops of Japanese. The training consisted of identification of geminate and singleton stops with feedback. We investigated whether training improves the learners’ identification of the geminate and singleton stops in Japanese. Moreover, we examined how perceptual training is affected by factors that influence speaking rate. Results were as follows. Participants who underwent perceptual training improved overall performance to a greater extent than untrained control participants. However, there was no significant difference between the group that was trained with three speaking rates and the group that was trained with normal rate only.

Wed-Ses2-O1:
Word-level perception

Time:Wednesday 13:30 Place:Main Hall Type:Oral
Chair:Jeesun Kim

13:30Semantic context effects in the recognition of acoustically unreduced and reduced words

Marco van de Ven (Max Planck Institute for Psycholinguistics, The Netherlands)
Benjamin V. Tucker (University of Alberta, Canada)
Mirjam Ernestus (Radboud University Nijmegen, The Netherlands)

Listeners require context to understand the casual pronunciation variants of words that are typical of spontaneous speech (Ernestus et al., 2002). The present study reports two auditory lexical decision experiments, investigating listeners' use of semantic contextual information in the comprehension of unreduced and reduced words. We found a strong semantic priming effect for low frequency unreduced words, whereas there was no such effect for reduced words. Word frequency was facilitatory for all words. These results show that semantic context is relevant especially for the comprehension of unreduced words, which is unexpected given the listener driven explanation of reduction in spontaneous speech.

13:50Context effects and the processing of ambiguous words: Further evidence from semantic incongruence

Michael C. W. Yip (The Hong Kong Institute of Education)

A cross-modal naming experiment was conducted to further verify the effects of context and other lexical information in the processing of Chinese homophones during spoken language comprehension. In this experiment, listeners named aloud a visual probe as fast as they could, at a pre-designated point upon hearing the sentence, which ended with a spoken Chinese homophone. Results further support that context has exerted an effect on the disambiguation of various homophonic meanings at an early stage, within the acoustic boundary of the word. This contextual effect was even stronger than the tonal information. Finally, the present results are in line with the context-dependency hypothesis that selection of the appropriate meaning of an ambiguous word depends on the simultaneous interaction among sentential, tonal and other lexical information during lexical access.

14:10The roles of reconstruction and lexical storage in the comprehension of regular pronunciation variants

Mirjam Ernestus (Radboud University Nijmegen & Max Planck Institute for Psycholinguistics)

This paper investigates how listeners process regular pronunciation variants, resulting from simple general reduction processes. Study 1 shows that when listeners are presented with new words, they store the pronunciation variants presented to them, whether these are unreduced or reduced. Listeners thus store information on word-specific pronunciation variation. Study 2 suggests that if participants are presented with regularly reduced pronunciations, they also reconstruct and store the corresponding unreduced pronunciations. These unreduced pronunciations apparently have special status. Together the results support hybrid models of speech processing, assuming roles for both exemplars and abstract representations.

14:30Lexical Embedding in Spoken Dutch

Odette Scharenborg (Centre for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Stefanie Okolowski (University of Trier, Germany)

A stretch of speech is often consistent with multiple words, e.g., the sequence /hæm/ is consistent with ‘ham’ but also with the first syllable of ‘hamster’, resulting in temporary ambiguity. However, to what degree does this lexical embedding occur? Analyses on two corpora of spoken Dutch showed that 11.9%-19.5% of polysyllabic word tokens have word-initial embedding, while 4.1%-7.5% of monosyllabic word tokens can appear word-initially embedded. This is much lower than suggested by an analysis of a large dictionary of Dutch. Speech processing thus appears to be simpler than one might expect on the basis of statistics on a dictionary.
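
To make the notion of word-initial embedding concrete, here is a minimal sketch that counts how many tokens in a corpus begin with another lexicon word. It is a simplification under stated assumptions: words are plain strings standing in for phonemic transcriptions, syllable structure is ignored, and the lexicon, token list, and function names are hypothetical rather than taken from the paper.

```python
def has_initial_embedding(word, lexicon):
    # True if some proper prefix of `word` is itself a lexicon entry,
    # e.g. 'ham' at the start of 'hamster'.
    return any(word[:i] in lexicon for i in range(1, len(word)))

def embedding_rate(tokens, lexicon):
    # Proportion of corpus tokens that carry a word-initial embedded word.
    if not tokens:
        return 0.0
    hits = sum(has_initial_embedding(t, lexicon) for t in tokens)
    return hits / len(tokens)

# Hypothetical toy lexicon and token stream.
lexicon = {"ham", "hamster", "kat", "kater"}
tokens = ["hamster", "kat", "kater", "ham"]
rate = embedding_rate(tokens, lexicon)
```

Counting over tokens rather than over dictionary entries is what separates a corpus-based estimate from a dictionary-based one: frequent short words with no embeddings pull the token rate down.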

14:50Real-time lexical competitions during speech-in-speech comprehension

Véronique Boulenger (Laboratoire Dynamique du Langage, CNRS UMR 5596, Lyon, France)
Michel Hoen (Stem Cell and Brain Research Institute, INSERM U846, Lyon, France)
François Pellegrino (Laboratoire Dynamique du Langage, CNRS UMR 5596, Lyon, France)
Fanny Meunier (Laboratoire Dynamique du Langage, CNRS UMR 5596, Lyon, France)

This study investigates speech comprehension in competing multi-talker babble. We examined the effects of number of simultaneous talkers and of frequency of words in the babble on lexical decision to target words. Results revealed better performance at a low talker number (n = 2). Importantly, frequency of words in the babble significantly affected performance: high frequency word babble interfered more strongly with word recognition than low frequency babble. This informational masking was particularly salient for the 2-talker babble. These findings suggest that investigating speech-in-speech comprehension may provide crucial information on lexical competition processes that occur in real-time during word recognition.

15:10Discovering consistent word confusions in noise

Martin Cooke (Ikerbasque and University of the Basque Country)

Listeners make mistakes when communicating under adverse conditions, with overall error rates reasonably well-predicted by existing speech intelligibility metrics. However, a detailed examination of confusions made by a majority of listeners is more likely to provide insights into processes of normal word recognition. The current study measured the rate at which robust misperceptions occurred for highly-confusable words embedded in noise. In a second experiment, confusions discovered in the first listening test were subjected to a range of manipulations designed to help identify their cause. These experiments reveal that while majority confusions are quite rare, they occur sufficiently often to make large-scale discovery worthwhile. Surprisingly few misperceptions were due solely to energetic masking by the noise, suggesting that speech and noise react in complex ways which are not well-described by traditional masking concepts.

Wed-Ses2-O2:
Applications in education and learning

Time:Wednesday 13:30 Place:East Wing 1 Type:Oral
Chair:Maxine Eskenazi

13:30A Large Greek-English Dictionary with Incorporated Speech and Language Processing Tools

Dimitrios Lyras (Speech & Language Processing Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, GR-26500)
George Kokkinakis (Speech & Language Processing Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, GR-26500)
Alexandros Lazaridis (Speech & Language Processing Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, GR-26500)
Kyriakos Sgarbas (Speech & Language Processing Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, GR-26500)
Nikos Fakotakis (Speech & Language Processing Group, Wire Communications Laboratory, Department of Electrical and Computer Engineering, University of Patras, Patras, Greece, GR-26500)

A large Greek-English Dictionary with 81,515 entries, 192,592 translations into English and 50,106 usage examples with their translation has been developed in combined printed and electronic (DVD) form. The electronic dictionary features unique facilities for searching the entire or any part of the Greek and English section, and has incorporated a series of speech and language processing tools which efficiently assist learners of Greek and English. This paper presents the human-machine interface of the dictionary and the most important tools, i.e. the TTS-synthesizers for Greek and English, the lemmatizers for Greek and English, the Grapheme-to-Phoneme converter for Greek and the syllabification system for Greek.

13:50Predicting Children's Reading Ability using Evaluator-Informed Features

Matthew Black (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Joseph Tepperman (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), University of Southern California, Los Angeles, CA, USA)

Automatic reading assessment software has the difficult task of trying to model human-based observations, which have both objective and subjective components. In this paper, we mimic the grading patterns of a "ground-truth" (average) evaluator in order to produce models that agree with many people's judgments. We examine one particular reading task, where children read a list of words aloud, and evaluators rate the children's overall reading ability on a scale from one to seven. We first extract various features correlated with the specific cues that evaluators said they used. We then compare various supervised learning methods that mapped the most relevant features to the ground-truth evaluator scores. Our final system predicted these scores with 0.91 correlation, higher than the average inter-evaluator agreement.
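
The agreement figure quoted above is a correlation between predicted and evaluator scores. As a minimal sketch of how such a figure can be computed, assuming it is a Pearson correlation (the abstract does not name the type), the following uses invented scores on the one-to-seven scale; the function name and the data are illustrative only and are not taken from the paper.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    # Sums of centred cross-products and centred squares.
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / math.sqrt(vx * vy)


# Hypothetical scores on the 1-7 scale (invented for illustration only).
evaluator_mean = [2.0, 3.5, 5.0, 6.5, 4.0]
predicted = [2.4, 3.1, 5.2, 6.0, 4.3]
r = pearson(evaluator_mean, predicted)
```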

14:10Automatic Intonation Classification for Speech Training Systems

Gyorgy Szaszak (Budapest University of Technology and Economics, Department of Telecommunications and Media Informatics, Budapest, Hungary)
David Sztaho (Department of Telecommunications and Media Informatics, Budapest, Hungary)
Klara Vicsi (Department of Telecommunications and Media Informatics, Budapest, Hungary)

A prosodic Hidden Markov model (HMM) based modality recognizer has been developed, which, after supra-segmental acoustic pre-processing, can perform clause and sentence boundary detection and modality (sentence type) recognition. This modality recognizer is adapted to carry out automatic evaluation of the intonation of the produced utterances in a speech training system for hearing-impaired persons or foreign language learners. The system is evaluated on utterances from normally-speaking persons and tested with speech-impaired (due to hearing problems) persons. To allow a deeper analysis, the automatic classification of the intonation is compared to subjective listening tests.

14:30Automated Pronunciation Scoring using Confidence Scoring and Landmark-based SVM

Su-Youn Yoon (University of Illinois at Urbana Champaign)
Mark Hasegawa-Johnson (University of Illinois at Urbana Champaign)
Richard Sproat (Oregon Health and Science University, Portland)

In this study, we present a pronunciation scoring method for second language learners of English (hereafter, L2 learners) that uses both confidence scoring and classifiers. Classifiers have an advantage over confidence scoring in that they can be specialized for the specific phonemes where L2 learners make frequent errors. Classifiers (landmark-based SVMs) were trained to distinguish L2 phonemes from their frequent substitution patterns. The method was evaluated on the specific English phonemes where L2 learners make frequent errors. The results suggest that automated pronunciation scoring can be improved consistently by combining the two methods.

14:50ASR based pronunciation evaluation with automatically generated competing vocabulary

Carlos Molina (Universidad de Chile)
Nestor Becerra Yoma (Universidad de Chile)
Jorge Wuth (Universidad de Chile)
Hiram Vivanco (Universidad de Chile)

In this paper the application of automatic speech recognition (ASR) technology in CAPT (Computer Aided Pronunciation Training) is addressed. A method to automatically generate the competitive lexicon, required by an ASR engine to compare the pronunciation of a target word with its correct and wrong phonetic realization, is presented. In order to enable the efficient deployment of CAPT applications, the generation of this competitive lexicon does not require any human assistance or a priori information of mother language dependent errors. The method presented here leads to averaged subjective-objective score correlation equal to 0.82 and 0.75 depending on the task.

15:10High Performance Automatic Mispronunciation Detection Method Based on Neural Network and TRAP Features

Hongyan Li (Chinese Academy of Sciences, Beijing, China)
Shijin Wang (Chinese Academy of Sciences, Beijing, China)
Jiaen Liang (Chinese Academy of Sciences, Beijing, China)
Shen Huang (Chinese Academy of Sciences, Beijing, China)
Bo Xu (Chinese Academy of Sciences, Beijing, China)

We propose a new approach that uses temporal information and a neural network to improve the performance of automatic mispronunciation detection. First, the alignment results between speech and phoneme sequences are obtained within the GMM-HMM framework. Then, the TRAPs are introduced to describe the pronunciation quality. We use an MLP to calculate the final posterior probability of each testing phoneme, and determine whether it is a mispronunciation or not. Moreover, we combine the TRAPs-MLP method with our existing methods to further improve the performance. Experiments show that the TRAPs-MLP method can give a significant relative improvement of 39.04% in EER reduction, compared with the baseline GMM-UBM method.

Wed-Ses2-O3:
ASR: new paradigms I

Time:Wednesday 13:30 Place:East Wing 2 Type:Oral
Chair:Geoffrey Zweig

13:30The Semi-Supervised Switchboard Transcription Project

Amarnag Subramanya (Dept. of Electrical Engg., University of Washington, Seattle)
Jeff Bilmes (Dept. of Electrical Engg., University of Washington, Seattle)

In previous work, we proposed a new graph-based semi-supervised learning (SSL) algorithm and showed that it outperforms other state-of-the-art SSL approaches for classifying documents and web-pages. Here we use a multi-threaded implementation in order to scale the algorithm to very large data sets. We treat the phonetically annotated portion of the Switchboard transcription project (STP) as labeled data and automatically annotate (at the phonetic level) the Switchboard I (SWB) training set and show that our proposed approach outperforms state-of-the-art SSL algorithms as well as a state-of-the-art strictly supervised classifier. As a result, we have STP-style annotations of the entire SWB-I training set which we refer to as semi-supervised STP (S3TP).

13:50Maximum Mutual Information Multi-phone Units in Direct Modeling

Geoffrey Zweig (Microsoft Research)
Patrick Nguyen (Microsoft Research)

This paper introduces a class of discriminative features for use in maximum entropy speech recognition models. The features we propose are acoustic detectors for discriminatively determined multi-phone units. The multi-phone units are found by computing the mutual information between the phonetic sub-sequences that occur in the training lexicon, and the word labels. This quantity is a function of an error model governing our ability to detect phone sequences accurately (an otherwise informative sequence which cannot be reliably detected is not so useful). We show how to compute this mutual information quantity under a class of error models efficiently, in one pass over the data, for all phonetic sub-sequences in the training data. After this computation, detectors are created for a subset of highly informative units. We then define two novel classes of features based on these units: associative and transductive. Incorporating these features in a maximum entropy based direct model for Voice-Search outperforms the baseline by 24% in sentence error rate.

14:10Profiling Large-Vocabulary Continuous Speech Recognition on Embedded Devices: A Hardware Resource Sensitivity Analysis

Kai Yu (Carnegie Mellon University)
Rob Rutenbar (Carnegie Mellon University)

When deployed in embedded systems, speech recognizers are necessarily reduced from large-vocabulary continuous speech recognizers (LVCSR) found on desktops or servers to fit the limited hardware. However, embedded hardware continues to evolve in capability; today’s smartphones are much more powerful than their recent ancestors. This begets a new question: which hardware features not currently found on today’s embedded platforms, but potentially add-ons to tomorrow’s devices, are most likely to help recognizer performance? Said differently – what is the sensitivity of the recognizer to fine-grain details of the embedded hardware resources? To answer this question rigorously and quantitatively, we offer results from a detailed study of LVCSR performance as a function of microarchitecture options on an embedded ARM11 and an enterprise-class Intel Core2Duo. We estimate speed and energy consumption, and show, feature by feature, how hardware resources impact recognizer performance.

14:30Continuous Speech Recognition Using Attention Shift Decoding with Soft Decision

Ozlem Kalinli (University of Southern California)
Shrikanth Narayanan (University of Southern California)

We present an attention shift decoding (ASD) method inspired by human speech recognition. In contrast to the traditional automatic speech recognition (ASR) systems, ASD decodes speech inconsecutively using reliability criteria; the gaps (unreliable speech regions) are decoded with the evidence of islands (reliable speech regions). On the BU Radio News Corpus, ASD provides significant improvement (2.9% absolute) over the baseline ASR results when it is used with oracle island-gap information. At the core of the ASD method is the automatic island-gap detection. Here, we propose a new feature set for automatic island-gap detection which achieves 83.7% accuracy. To cope with the imperfect nature of the island-gap classification, we also propose a new ASD algorithm using soft decision. The ASD with soft decision provides 0.4% absolute (2.2% relative) improvement over the baseline ASR results when it is used with automatically detected islands and gaps.

14:50Towards Using Hybrid Word and Fragment Units for Vocabulary Independent LVCSR Systems

Ariya Rastrow (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)
Abhinav Sethy (IBM T.J Watson Research Center, Yorktown Heights, NY, USA)
Bhuvana Ramabhadran (IBM T.J Watson Research Center, Yorktown Heights, NY, USA)
Frederick Jelinek (Human Language Technology Center of Excellence, and Center for Language and Speech Processing, Johns Hopkins University)

This paper presents the advantages of augmenting a word-based system with sub-word units as a step towards building open vocabulary speech recognition systems. We show that a hybrid system which combines words and data-driven, variable length sub-word units has a better phone accuracy than word-only systems. In addition, the hybrid system is better at detecting Out-Of-Vocabulary (OOV) terms and representing them phonetically. Results are presented on the RT-04 broadcast news and MIT Lecture data sets. An FSM-based approach to recover OOV words from the hybrid lattices is also presented. At an OOV rate of 2.5% on RT-04 we observed an 8% relative improvement in phone error rate (PER), 7.3% relative improvement in oracle PER and 7% relative improvement in WER after recovering the OOV terms. A significant reduction of 33% relative in PER is seen in the OOV regions.

15:10Unsupervised training of an HMM-based speech recognizer for topic classification

Herbert Gish (BBN Technologies)
Man-hung Siu (BBN Technologies)
Arthur Chan (BBN Technologies)
William Belfield (BBN Technologies)

HMM-based Speech-To-Text (STT) systems are widely deployed not only for dictation tasks but also as the first processing stage of many automatic speech applications such as spoken topic classification. However, the necessity of transcribed data for training the HMMs precludes its use in domains where transcribed speech is difficult to come by because of the specific domain, channel or language. In this work, we propose building HMM-based speech recognizers without transcribed data by formulating the HMM training as an optimization over both the parameter and transcription sequence space. We describe how this can be easily implemented using existing STT tools. We tested the effectiveness of our unsupervised training approach on the task of topic classification on the Switchboard corpus. The unsupervised HMM recognizer, initialized with a segmental tokenizer, outperformed both an HMM phoneme recognizer trained with 1 hour of transcribed data, and the Brno University of Technology (BUT) Hungarian phoneme recognizer. This approach can also be applied to other speech applications, including spoken term detection, language and speaker verification.

Wed-Ses2-O4:
Single-Channel Speech Enhancement

Time:Wednesday 13:30 Place:East Wing 3 Type:Oral
Chair:Bayya Yegnanarayana

13:30Constrained Probabilistic Subspace Maps applied to Speech Enhancement

Kaustubh Kalgaonkar (Georgia Institute of Technology)
Mark Clements (Georgia Institute of Technology)

This paper presents a probabilistic algorithm that extracts a mapping between two subspaces by representing each subspace as a collection of states. In most of the cases, the data is a time series with temporal constraints. This paper suggests a method to impose temporal constraints on the transition within the states of the subspace. This probabilistic model is successfully applied to the problem of speech enhancement and improves the performance of Wiener filter by providing robust estimates of SNR.
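
To illustrate why robust SNR estimates matter for the Wiener filter mentioned above, here is a minimal sketch of the textbook Wiener gain G = xi / (1 + xi), where xi is the a priori SNR on a linear power scale. It is not the probabilistic subspace mapping of the paper; the function names and the per-bin framing are assumptions made for illustration.

```python
def db_to_linear(snr_db):
    """Convert an SNR in dB to a linear power ratio."""
    return 10.0 ** (snr_db / 10.0)


def wiener_gain(snr_linear):
    """Textbook Wiener gain G = xi / (1 + xi) for a linear a priori SNR xi."""
    if snr_linear < 0:
        raise ValueError("a linear SNR cannot be negative")
    return snr_linear / (1.0 + snr_linear)


def enhance_frame(noisy_magnitudes, snr_db_per_bin):
    """Scale each spectral magnitude by the Wiener gain of its bin.

    Both arguments are hypothetical per-frequency-bin lists for one frame.
    """
    if len(noisy_magnitudes) != len(snr_db_per_bin):
        raise ValueError("one SNR estimate is needed per frequency bin")
    return [
        wiener_gain(db_to_linear(snr_db)) * magnitude
        for magnitude, snr_db in zip(noisy_magnitudes, snr_db_per_bin)
    ]
```

Because the gain depends only on the SNR estimate, an overestimated SNR leaves residual noise and an underestimated one attenuates speech, which is the sensitivity that better SNR estimates address.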

13:50Reconstructing Clean Speech from Noisy MFCC Vectors

Ben Milner (University of East Anglia)
Jonathan Darch (University of East Anglia)
Ibrahim Almajai (University of East Anglia)

The aim of this work is to reconstruct clean speech solely from a stream of noise-contaminated MFCC vectors, as may be encountered in distributed speech recognition systems. Speech reconstruction is performed using the ETSI Aurora back-end speech reconstruction standard which requires MFCC vectors, fundamental frequency and voicing information. In this work, fundamental frequency and voicing are obtained using maximum a posteriori prediction from input MFCC vectors, thereby allowing speech reconstruction solely from a stream of MFCC vectors. Two different methods to improve prediction accuracy in noisy conditions are then developed. Experimental results first establish that improved fundamental frequency and voicing prediction is obtained when noise compensation is applied. A series of human listening tests are then used to analyse the reconstructed speech quality, which determine the effectiveness of noise compensation in terms of mean opinion scores.

14:10An Evaluation of Objective Quality Measures for Speech Intelligibility Prediction

Cees H. Taal (Delft University of Technology)
Richard C. Hendriks (Delft University of Technology)
Richard Heusdens (Delft University of Technology)
Jesper Jensen (Oticon A/S)
Ulrik Kjems (Oticon A/S)

In this research various objective quality measures are evaluated in order to predict the intelligibility for a wide range of non-linearly processed speech signals and speech degraded by additive noise. The obtained results are compared with the prediction results of a more advanced perceptual-based model proposed by Dau et al. and an objective intelligibility measure, namely the coherence speech intelligibility index (cSII). These tests are performed in order to gain more knowledge about the link between speech quality and speech intelligibility, and may help us to exploit the extensive research on speech quality for speech intelligibility. It is shown that cSII does not necessarily show better performance compared to conventional objective (speech)-quality measures. In general, the DAU-model is the only method with reasonable results for all processing conditions.

14:30Performance Comparison of HMM and VQ Based Single Channel Speech Separation

Hossein Radfar (Department of Electrical and Computer Engineering, University of Toronto, Canada)
Geoffrey Chan (Department of Electrical and Computer Engineering, Queen's University, Kingston, Canada)
Richard Dansereau (Department of Systems and Computer Engineering, Carleton University, Ottawa, Canada)
Willy Wong (Department of Electrical and Computer Engineering, University of Toronto, Canada)

In this paper, single channel speech separation (SCSS) techniques based on hidden Markov models (HMM) and vector quantization (VQ) are described and compared in terms of (a) signal-to-noise ratio (SNR) between separated and original speech signals, (b) preference of listeners, and (c) computational complexity. The SNR results show that the HMM-based technique marginally outperforms the VQ-based technique by 0.85 dB in experiments conducted on mixtures of female-female, male-male, and male-female speakers. Subjective tests show that listeners prefer HMM over VQ for 86.70 % of test speech files. This improvement, however, is at the expense of a drastic increase in computational complexity when compared with the VQ-based technique.

14:50Stereo-input Speech Recognition using Sparseness-based Time-frequency Masking in a Reverberant Environment

Yosuke Izumi (Department of Information Physics and Computing, University of Tokyo)
Kenta Nishiki (Department of Information Physics and Computing, University of Tokyo)
Shinji Watanabe (NTT Communication Science Laboratories)
Takuya Nishimoto (Department of Information Physics and Computing, University of Tokyo)
Nobutaka Ono (Department of Information Physics and Computing, University of Tokyo)
Shigeki Sagayama (Department of Information Physics and Computing, University of Tokyo)

We present noise-robust speech recognition using a sparseness-based underdetermined blind source separation (BSS) technique. As a representative underdetermined BSS method, we utilized time-frequency masking (TFM) in this paper. Although TFM is able to separate target speech from interferences effectively, one should consider two problems. One is that masking does not work well in noisy or reverberant environments. Another is that masking itself might cause some distortion of the target speech. For the former, we apply our robust TFM and show its recognition performance. Next, investigating the distortion caused by TFM, we reveal the following facts through experiments: 1) a soft mask is better than a binary mask in terms of recognition performance and 2) cepstral mean normalization (CMN) reduces the distortion, especially that caused by a soft mask. Finally, we evaluate the recognition performance of our method in a noisy and reverberant real environment.
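
For readers unfamiliar with cepstral mean normalization, the sketch below shows the standard per-utterance form: the mean of each cepstral coefficient over all frames is subtracted from every frame. The function name and the list-of-lists frame layout are assumptions for illustration; the abstract does not specify which CMN variant the authors used.

```python
def cepstral_mean_normalize(frames):
    """Subtract the per-coefficient mean over all frames of one utterance.

    `frames` is a list of equal-length lists of cepstral coefficients.
    """
    if not frames:
        return []
    dim = len(frames[0])
    if any(len(frame) != dim for frame in frames):
        raise ValueError("all frames must have the same number of coefficients")
    count = len(frames)
    means = [sum(frame[d] for frame in frames) / count for d in range(dim)]
    return [[frame[d] - means[d] for d in range(dim)] for frame in frames]


# Any offset that is constant over the utterance is removed by CMN:
clean = [[1.0, 2.0], [3.0, 6.0]]
shifted = [[c0 + 0.5, c1 - 1.0] for c0, c1 in clean]
same = cepstral_mean_normalize(clean) == cepstral_mean_normalize(shifted)
```

The usage lines only demonstrate the general property that a constant cepstral offset cancels; they make no claim about how mask-induced distortion behaves in the authors' experiments.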

15:10Enhancing Audio Speech using Visual Speech Features

Ibrahim Almajai (University of East Anglia)
Ben Milner (University of East Anglia)

This work presents a novel approach to speech enhancement by exploiting the bimodality of speech and the correlation that exists between audio and visual speech features. For speech enhancement, a visually-derived Wiener filter is developed. This obtains clean speech statistics from visual features by modelling their joint density and making a maximum a posteriori estimate of clean audio from visual speech features. Noise statistics for the Wiener filter utilise an audio-visual voice activity detector which classifies input audio as speech or nonspeech, enabling a noise model to be updated. Analysis shows estimation of speech and noise statistics to be effective with human listening tests measuring the effectiveness of the resulting Wiener filter.

Wed-Ses2-S1:
Special Session: Active Listening & Synchrony

Time:Wednesday 13:30 Place:East Wing 4 Type:Special
Chair:Nick Campbell & Joakim Gustafson

13:30Understanding Speaker-Listener Interactions

Dirk Heylen (University of Twente)

We provide an eclectic generic framework to understand the back and forth interactions between participants in a conversation highlighting the complexity of the actions that listeners are engaged in. Communicative actions of one participant implicate the "other" in many ways. In this paper, we try to enumerate some essential relevant dimensions of this reciprocal dependence.

13:50Detecting changes in speech expressiveness in participants of a radio program

Plínio Barbosa (Speech Prosody Studies Group/Dep. of Linguistics/Inst.Est. Ling., Univ. of Campinas, Brazil)

A method for speech expressiveness change detection is presented which combines a dimensional analysis of speech expression, a Principal Component Analysis technique, as well as multiple regression analysis. From the three inferred rates of activation, valence, and involvement, two PCA-factors explain 97 % of the variance of the judges' evaluations of a corpus of radio show interaction. The multiple regression analysis predicted the values of the two listener-oriented, PCA-derived dimensions of promptness and empathy from the acoustic parameters automatically obtained from a set of 206 utterances produced by radio show's participants. Analysed chronologically, the utterances reveal expression change from automatic acoustic analysis.

14:10An Audio-Visual Approach to Measuring Discourse Synchrony in Multimodal Conversation Data

Nick Campbell (Trinity College Dublin)

This paper describes recent work on the automatic extraction of visual and audio parameters relating to the detection of synchrony in discourse, and to the modelling of active listening for advanced speech technology. It reports findings based on image processing that reliably identify the strong entrainment between members of a group conversation, and describes techniques for the extraction and analysis of such information.

14:30Towards Flexible Representations for Analysis of Accommodation of Temporal Features in Spontaneous Dialogue Speech

Spyros Kousidis (Digital Media Center, Dublin Institute of Technology, Ireland)
David Dorran (Audio Research Group, Dublin Institute of Technology, Ireland)
Ciaran McDonnell (Digital Media Center, Dublin Institute of Technology, Ireland)
Eugene Coyle (Audio Research Group, Dublin Institute of Technology)

Current advances in spoken interface design point to a shift towards more “human-like” interaction, as opposed to the traditional “push-to-talk” approach. However, human dialogue is characterized by synchrony and multi-modality, and these properties are not captured by traditional representation approaches, such as turn succession. This paper proposes an alternative representation schema for recorded (human) dialogues, which employs per frame averages of speaker turn distribution, in order to inform further analyses of temporal features (pauses and overlaps) in terms of inter-speaker accommodation. Preliminary results of such analyses are provided.

14:50Are we ‘in sync’: Turn-taking in collaborative dialogues

Štefan Beňuš (Constantine the Philosopher University, Nitra, Slovakia and Slovak Academy of Sciences, Bratislava, Slovakia)

We used a corpus of collaborative task oriented dialogues in American English to compare two units of rhythmic structure – pitch accents and syllables – within the coupled oscillator model of rhythmical entrainment in turn-taking proposed in Wilson & Wilson (2005). We found that pitch accents are a slightly better fit than syllables as the unit of rhythmical structure for the model, but we also observed weak support for the model in general. Some turn-taking types such as 'pause interruptions' and 'backchanneling' had more salient rhythmical characteristics than others.

15:10An Audio-Visual Attention System for Online Association Learning

Martin Heckmann (Honda Research Institute Europe GmbH)
Holger Brandl (Research Institute for Cognition and Robotics, University of Bielefeld)
Xavier Domont (University of Darmstadt, Institut für Automatisierungstechnik, FG Regelungstheorie)
Bram Bolder (Honda Research Institute Europe GmbH)
Frank Joublin (Honda Research Institute Europe GmbH)
Christian Goerick (Honda Research Institute Europe GmbH)

We present an audio-visual attention system for speech based interaction with a humanoid robot where a tutor can teach visual properties/locations (e.g. "left") and corresponding, arbitrary speech labels. The acoustic signal is segmented via the attention system and speech labels are learned from a few repetitions of the label by the tutor. The attention system integrates bottom-up stimulus driven saliency calculation (delay-and-sum beamforming, adaptive noise level estimation) and top-down modulation (spectral properties, segment length, movement and interaction status of the robot). We evaluate the performance of different aspects of the system based on a small dataset.

Wed-Ses2-P4:
LVCSR Systems and Spoken Term Detection

Time:Wednesday 13:30 Place:Hewison Hall Type:Poster
Chair:Simon King

#1Real-Time Live Broadcast News Subtitling System for Spanish

Alfonso Ortega (University of Zaragoza)
Jose Enrique Garcia (University of Zaragoza)
Antonio Miguel (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

Subtitling of live broadcast news is a very important application to meet the needs of deaf and hard of hearing people. However, live subtitling is a high-cost operation in terms of qualified human resources and, therefore, money if high precision is desired. Automatic Speech Recognition researchers can help to perform this task, saving both time and money, by developing systems that deliver subtitles fully synchronized with speech without human assistance. In this paper we present a real-time system for automatic subtitling of live broadcast news in Spanish based on the News Redaction Computer texts and an Automatic Speech Recognition engine to provide precise temporal alignment of speech to text scripts with negligible latency. The presented system has been working satisfactorily on the Aragonese Public Television since June 2008 without human assistance.

#2Development of the 2008 SRI Mandarin Speech-to-text System for Broadcast News and Conversations

Xin Lei (SRI International)
Wei Wu (Univ. of Washington)
Wen Wang (SRI International)
Arindam Mandal (SRI International)
Andreas Stolcke (SRI International)

We describe the recent progress in SRI’s Mandarin speech-to-text system developed for 2008 evaluation in the DARPA GALE program. A data-driven lexicon expansion technique and language model adaptation methods contribute to the improvement in recognition performance. Our system yields 8.3% character error rate on the GALE dev08 test set, and 7.5% after combining with RWTH systems. Compared to our 2007 evaluation system, a significant improvement of 13% relative has been achieved.

#3Multifactor Adaptation for Mandarin Broadcast News and Conversation Speech Recognition

Wen Wang (SRI International)
Arindam Mandal (SRI International)
Xin Lei (SRI International)
Andreas Stolcke (SRI International)
Jing Zheng (SRI International)

We explore the integration of multiple factors such as genre and speaker gender for acoustic model adaptation tasks to improve Mandarin ASR system performance on broadcast news and broadcast conversation audio. We investigate the use of multi-factor clustering of acoustic model training data and the application of MPE-MAP and fMPE-MAP acoustic model adaptations. We found that by effectively combining these adaptation approaches, we can achieve 5% relative improvement on the final recognition error rate from SRI's state-of-the-art Mandarin ASR system.

#4Development of the GALE 2008 Mandarin LVCSR System

Christian Plahl (RWTH Aachen University)
Björn Hoffmeister (RWTH Aachen University)
Georg Heigold (RWTH Aachen University)
Jonas Lööf (RWTH Aachen University)
Ralf Schlüter (RWTH Aachen University)
Hermann Ney (RWTH Aachen University)

This paper describes the current improvements of the RWTH Mandarin LVCSR system. We introduce vocal tract length normalization for the Gammatone features and present comparable results for Gammatone based feature extraction and classical feature extraction. In order to benefit from the huge amount of data of 1600h available in the GALE project we have trained the acoustic models up to 8M Gaussians. We present detailed character error rates for the different number of Gaussians. Different kinds of systems are developed and a two stage decoding framework is applied, which uses cross-adaptation and a subsequent lattice-based system combination. In addition to various acoustic front-ends, these systems use different kinds of neural network toneme posterior features. We present detailed recognition results of the development cycle and the different acoustic front-ends of the systems. Finally, we compare the ultimate evaluation system to our last year's system and can report a 10% relative improvement.

#5The RWTH Aachen University Open Source Speech Recognition System

David Rybach (RWTH Aachen University, Germany)
Christian Gollan (RWTH Aachen University, Germany)
Georg Heigold (RWTH Aachen University, Germany)
Björn Hoffmeister (RWTH Aachen University, Germany)
Jonas Lööf (RWTH Aachen University, Germany)
Ralf Schlüter (RWTH Aachen University, Germany)
Hermann Ney (RWTH Aachen University, Germany)

We announce the public availability of the RWTH Aachen University speech recognition toolkit. The toolkit includes state of the art speech recognition technology for acoustic model training and decoding. Speaker adaptation, speaker adaptive training, unsupervised training, a finite state automata library, and an efficient tree search decoder are notable components. Comprehensive documentation, example setups for training and recognition, and a tutorial are provided to support newcomers.

#6Online Detecting End Times of Spoken Utterances for Synchronization of Live Speech and its Transcripts

Jie Gao (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Qingwei Zhao (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)
Yonghong Yan (ThinkIT Speech Lab, Institute of Acoustics, Chinese Academy of Sciences)

In this paper, we present our initial efforts in the task of Automatically Synchronizing live spoken Utterances with their Transcripts (textual contents) (ASUT). We address the problem of online detection of the end time of a spoken utterance given its textual content, which is one of the key problems of the ASUT task. A frame-synchronous likelihood ratio test (FS-LRT) procedure is proposed and explored under the hidden Markov model (HMM) framework. The properties of FS-LRT are studied empirically. Experiments indicate that our proposed approach shows satisfying performance. In addition, the proposed procedure has been successfully applied in a subtitling system for live broadcast news.

#7Real-Time ASR from Meetings

Philip N. Garner (Idiap Research Institute, Martigny, Switzerland)
John Dines (Idiap Research Institute, Martigny, Switzerland)
Thomas Hain (Speech and Hearing Group, The University of Sheffield, UK)
Asmaa El Hannani (Speech and Hearing Group, The University of Sheffield, UK)
Martin Karafiat (Speech Processing Group, Brno University of Technology, Czech Republic)
Danil Korchagin (Idiap Research Institute, Martigny, Switzerland)
Mike Lincoln (Centre for Speech Technology Research, The University of Edinburgh, UK)
Vincent Wan (Speech and Hearing Group, The University of Sheffield, UK)
Le Zhang (Centre for Speech Technology Research, The University of Edinburgh, UK)

The AMI(DA) system is a meeting room speech recognition system that has been developed and evaluated in the context of the NIST Rich Transcription (RT) evaluations. Recently, the "Distant Access" requirements of the AMIDA project have necessitated that the system operate in real-time. Another more difficult requirement is that the system fit into a live meeting transcription scenario. We describe an infrastructure that has allowed the AMI(DA) system to evolve into one that fulfils these extra requirements. We emphasise the components that address the live and real-time aspects.

#8Improvements to the LIUM French ASR system based on CMU Sphinx: what helps to significantly reduce the word error rate?

Paul Deleglise (LIUM - University of Le Mans)
Yannick Esteve (LIUM - University of Le Mans)
Sylvain Meignier (LIUM - University of Le Mans)
Teva Merlin (LIUM - University of Le Mans)

This paper describes the new ASR system developed by the LIUM and analyzes the various origins of the significant drop of the word error rate observed in comparison to the previous LIUM ASR system. This study was made on the test data of the latest evaluation campaign of ASR systems on French broadcast news, called ESTER2 and organized in December 2008. For the same computation time, the new system yields a word error rate about 38% lower than what the previous system (which reached the second position during the ESTER1 evaluation campaign) did. This paper evaluates the gain provided by various changes to the system: implementation of new search and training algorithms, new training data, vocabulary size, etc. The LIUM ASR system was the best open-source ASR system of the ESTER2 campaign.

#9Merging Search Spaces for Subword Spoken Term Detection

Timo Mertens (Norwegian University of Science and Technology)
Daniel Schneider (Fraunhofer IAIS)
Joachim Köhler (Fraunhofer IAIS)

We describe how complementary search spaces, addressed by two different methods used in Spoken Term Detection (STD), can be merged for German subword STD. We propose fuzzy-search techniques on lattices to narrow the gap between subword and word retrieval. The first technique is based on an edit-distance, where no a priori knowledge about confusions is employed. Additionally, we propose a weighting method which explicitly models pronunciation variation on a subword level and thus improves robustness against false positives. Recall is improved by 6% absolute when retrieving on the merged search space rather than using an exact lattice search. By modeling subword pronunciation variation, we increase recall in a high-precision setting by 2% absolute compared to the edit-distance method.
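
The edit-distance idea mentioned in this abstract can be illustrated with a minimal sketch. It assumes unit costs for insertion, deletion and substitution, and it compares two flat subword sequences rather than searching a lattice; the function names and the `max_distance` parameter are hypothetical and not taken from the paper.

```python
def edit_distance(a, b):
    """Levenshtein distance between two sequences of subword units."""
    prev = list(range(len(b) + 1))          # distances from the empty prefix of a
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def fuzzy_match(query, candidate, max_distance=1):
    """Accept a candidate whose distance to the query is within the tolerance."""
    return edit_distance(query, candidate) <= max_distance


# Hypothetical subword sequences: one substituted unit is still accepted.
query = ["b", "E", "r", "l", "i:", "n"]
candidate = ["b", "E", "r", "l", "i:", "m"]
print(edit_distance(query, candidate), fuzzy_match(query, candidate))
```

A weighting scheme of the kind the abstract describes would replace the fixed substitution cost of 1 with confusion-dependent costs.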

#10A Posterior Probability-Based System Hybridisation and Combination for Spoken Term Detection

Javier Tejedor (HCTLab-UAM)
Dong Wang (The Centre For Speech Technology Research)
Simon King (The Centre For Speech Technology Research)
Joe Frankel (The Centre For Speech Technology Research)
Jose Colas (HCTLab-UAM)

Spoken term detection (STD) is a fundamental task for multimedia information retrieval. To improve the detection performance, we have presented a direct posterior-based confidence measure generated from a neural network. In this paper, we propose a detection-independent confidence estimation based on the direct posterior confidence measure, in which the decision making is totally separated from the term detection. Based on this idea, we first present a hybrid system which conducts the term detection and confidence estimation based on different sub-word units and then propose a combination method which merges detections from heterogeneous term detectors based on the direct posterior-based confidence. Experimental results demonstrated that the proposed methods improved system performance considerably for both English and Spanish.

#11Stochastic Pronunciation Modelling for Spoken Term Detection

Dong Wang (The Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (The Centre for Speech Technology Research, University of Edinburgh, UK)
Joe Frankel (The Centre for Speech Technology Research, University of Edinburgh, UK)

A major challenge faced by a spoken term detection (STD) system is the detection of out-of-vocabulary (OOV) terms. Although a subword-based STD system is able to detect OOV terms, performance reduction is always observed compared to in-vocabulary terms. Current approaches to STD do not acknowledge the particular properties of OOV terms, such as pronunciation uncertainty. In this paper, we use a stochastic pronunciation model to deal with the uncertain pronunciations of OOV terms. By considering all possible term pronunciations, predicted by a joint-multigram model, we observe a significant performance improvement.

#12Term-Dependent Confidence for Out-of-Vocabulary Term Detection

Dong Wang (The Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (The Centre for Speech Technology Research, University of Edinburgh, UK)
Joe Frankel (The Centre for Speech Technology Research, University of Edinburgh, UK)
Peter Bell (The Centre for Speech Technology Research, University of Edinburgh, UK)

Within a spoken term detection (STD) system, the decision maker plays an important role in retrieving reliable detections. Most of the state-of-the-art STD systems make decisions based on a confidence measure that is term-independent, which poses a serious problem for out-of-vocabulary (OOV) term detection. In this paper, we study a term-dependent confidence measure based on confidence normalisation and discriminative modelling, particularly focusing on its remarkable effectiveness for detecting OOV terms. Experimental results indicate that the term-dependent confidence provides a much larger improvement for OOV terms than for in-vocabulary terms.

#13A Comparison of Query-by-Example Methods for Spoken Term Detection

Wade Shen (MIT/Lincoln Laboratory)
Christopher White (MIT/Lincoln Laboratory)
Timothy Hazen (MIT/Lincoln Laboratory)

In this paper we examine an alternative interface for phonetic search, namely query-by-example, that avoids OOV issues associated with both standard word-based and phonetic search methods. We develop three methods that compare query lattices derived from example audio against a standard n-gram-based phonetic index and we analyze factors affecting the performance of these systems. We show that the best systems under this paradigm are able to achieve 77% precision when retrieving utterances from conversational telephone speech and returning 10 results from a single query (performance that is better than a similar dictionary-based approach), suggesting significant utility for search applications. We also show that these systems can be further improved using relevance feedback: by incorporating four additional queries, the precision of the best system can be improved by 13.7% relative.

#14Fast Keyword Detection Using Suffix Array

Kouichi Katsurada (Toyohashi University of Technology)
Shigeki Teshima (Toyohashi University of Technology)
Tsuneo Nitta (Toyohashi University of Technology)

In this paper, we propose a technique for detecting keywords quickly from a very large speech database without using a large memory space. To accelerate searches and save memory, we used a suffix array as the data structure and applied phoneme-based DP-matching. To avoid an exponential increase in the process time with the length of the keyword, a long keyword is divided into short sub-keywords. Moreover, an iterative lengthening search algorithm is used to rapidly output accurate search results. The experimental results show that it takes less than 100ms to detect the first set of search results from a 10,000-h virtual speech database.
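
As a rough illustration of why a suffix array suits this kind of search, the sketch below indexes a phoneme sequence and finds a keyword by binary search. It is a simplification under stated assumptions: it uses exact matching in place of the phoneme-based DP-matching named in the abstract, it builds the array naively, and the function names are hypothetical.

```python
def build_suffix_array(seq):
    """Start positions of all suffixes of seq, in lexicographic order (naive build)."""
    return sorted(range(len(seq)), key=lambda i: seq[i:])


def find_occurrences(seq, sa, pattern):
    """Binary-search the suffix array for all suffixes that begin with pattern."""
    m = len(pattern)
    lo, hi = 0, len(sa)
    while lo < hi:                                   # first suffix >= pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] < pattern:
            lo = mid + 1
        else:
            hi = mid
    start = lo
    hi = len(sa)
    while lo < hi:                                   # first suffix > pattern
        mid = (lo + hi) // 2
        if seq[sa[mid]:sa[mid] + m] <= pattern:
            lo = mid + 1
        else:
            hi = mid
    return sorted(sa[start:lo])


# Sequence and pattern must be the same type (tuples here) so slices compare cleanly.
phonemes = tuple("abracadabra")      # stand-in for a phoneme transcription
index = build_suffix_array(phonemes)
print(find_occurrences(phonemes, index, tuple("abra")))   # [0, 7]
```

Each lookup costs a logarithmic number of comparisons in the database length, which is the property that makes short sub-keyword queries cheap.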

Wed-Ses2-P1:
Emotion and Expression II

Time:Wednesday 13:30 Place:Hewison Hall Type:Poster
Chair:Louis ten Bosch

#1Perceiving Surprise on Cue Words: Prosody and Semantics Interact on Right and Really

Catherine Lai (University of Pennsylvania)

Cue words in dialogue have different interpretations depending on context and prosody. This paper presents a corpus study and perception experiment investigating when prosody causes 'right' and 'really' to be perceived as questioning or expressing surprise. Pitch range is found to be the best cue for surprise. This extends to the question rating for 'really' but not for 'right'. In fact, prosody appears to interact with semantics so ratings differ for these two types of cue word even when prosodic features are similar. So, different semantics appears to result in different surprise/question rating thresholds.

#2Emotion Recognition using Linear Transformations in Combination with Video

Rok Gajsek (LUKS, University of Ljubljana)
Vitomir Struc (LUKS, University of Ljubljana)
Simon Dobrisek (LUKS, University of Ljubljana)
France Mihelic (LUKS, University of Ljubljana)

The paper discusses the usage of linear transformations of Hidden Markov Models, normally employed for speaker and environment adaptation, as a way of extracting the emotional components from the speech. A constrained version of Maximum Likelihood Linear Regression (CMLLR) transformation is used as a feature for classification of normal or aroused emotional state. We present a procedure of incrementally building a set of speaker independent acoustic models that are used to estimate the CMLLR transformations for emotion classification. An audio-video database of spontaneous emotions (AvID) is briefly presented since it forms the basis for the evaluation of the proposed method. Emotion classification using the video part of the database is also described and the added value of combining the visual information with the audio features is shown.

#3Speaker Dependent Emotion Recognition Using Prosodic Supervectors

Ignacio Lopez-Moreno (Universidad Autonoma de Madrid)
Carlos Ortego-Resa (Universidad Autonoma de Madrid)
Daniel Ramos (Universidad Autonoma de Madrid)
Joaquin Gonzalez-Rodriguez (Universidad Autonoma de Madrid)

This work presents a novel approach for detection of emotions embedded in the speech signal. The proposed approach works at the prosodic level, and models the statistical distribution of the prosodic features with Gaussian Mixture Models (GMM) mean-adapted from a Universal Background Model (UBM). This allows the use of GMM-mean supervectors, which are classified by a Support Vector Machine (SVM). Our proposal is compared to a popular baseline, which classifies with an SVM a set of selected prosodic features from the whole speech signal. In order to measure the speaker inter-variability, which is a factor of degradation in this task, speaker dependent and speaker independent frameworks have been considered. Experiments have been carried out on the SUSAS subcorpus, including real and simulated emotions. Results show that in a speaker dependent framework our proposed approach achieves a relative improvement greater than 14% in Equal Error Rate (EER) with respect to the baseline approach. The relative improvement is greater than 17% when both approaches are combined together by fusion with respect to the baseline.

#4Physiologically-inspired Feature Extraction for Emotion Recognition

Yu Zhou (Institute of Acoustics, Chinese Academy of Sciences)
Yanqing Sun (Institute of Acoustics, Chinese Academy of Sciences)
Junfeng Li (School of Information Science, Japan Advanced Institute of Science and Technology)
Jianping Zhang (Institute of Acoustics, Chinese Academy of Sciences)
Yonghong Yan (Institute of Acoustics, Chinese Academy of Sciences)

In this paper, we propose a new feature extraction method for emotion recognition based on the knowledge of the emotion production mechanism in physiology. It was reported by physiacoustists that emotional speech is encoded differently from normal speech in terms of articulation organs and that emotion information in speech is concentrated in different frequencies caused by the different movements of organs [4]. To apply these findings to an emotion recognition system, we first quantify the distribution of speech emotion information along each frequency band by exploiting the Fisher’s F-Ratio and mutual information techniques, and then propose a non-uniform sub-band processing method which is able to extract and emphasize the emotion features in speech. These extracted features are finally applied to emotion recognition. Experimental results in speech emotion recognition show that the features extracted using our proposed non-uniform sub-band processing outperform the traditional (MFCC) features, and the average error reduction rate is 16.8% for speech emotion recognition.
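
To make the band-ranking step concrete, here is a minimal sketch of one common form of Fisher's F-ratio for a single frequency band: the variance of the per-class means divided by the average within-class variance. The exact definition used in the paper is not given in the abstract, so the use of population variances, the function name, and the toy energy values are all assumptions.

```python
from statistics import fmean, pvariance


def f_ratio(values_by_class):
    """F-ratio for one feature: between-class variance of the class means
    divided by the mean within-class variance."""
    class_means = [fmean(v) for v in values_by_class.values()]
    between = pvariance(class_means)
    within = fmean([pvariance(v) for v in values_by_class.values()])
    return between / within


# Hypothetical band energies per emotion class for two frequency bands.
band_a = {"neutral": [1.0, 2.0, 3.0], "angry": [5.0, 6.0, 7.0]}
band_b = {"neutral": [1.0, 2.0, 3.0], "angry": [2.0, 3.0, 4.0]}
print(f_ratio(band_a) > f_ratio(band_b))   # band_a separates the classes better
```

Bands with a higher ratio carry more class-discriminative information, which is the kind of evidence a non-uniform sub-band layout could be built on.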

#5Perceived Loudness and Voice Quality in Affect Cueing

Irena Yanushevskaya (Trinity College Dublin)
Christer Gobl (Trinity College Dublin)
Ailbhe Ní Chasaide (Trinity College Dublin)

The paper describes an auditory experiment aimed at testing whether the intrinsic loudness of a stimulus with a given voice quality influences the way in which it signals affect. Synthesised voice quality stimuli in which intrinsic loudness was systematically manipulated were presented to listeners to test the effect of this manipulation on the affective colouring of the stimuli. The results showed that even when devoid of intrinsic loudness variation, non-modal voice quality stimuli were capable of communicating affect. However, changing the loudness of a particular non-modal voice quality stimulus towards its intrinsic loudness resulted in the increase of affective ratings. Increased loudness importantly enhances (for the relevant stimuli) the perception of high activation states.

#6Modeling Mutual Influence of Interlocutor Emotion States in Dyadic Spoken Interactions

Chi-Chun Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Carlos Busso (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Sungbok Lee (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)
Shrikanth Narayanan (Signal Analysis and Interpretation Laboratory (SAIL), Electrical Engineering Department, University of Southern California, Los Angeles, CA 90089, USA)

In dyadic human interactions, mutual influence - a person's influence on the interacting partner's behaviors - is shown to be important and could be incorporated into the modeling framework in characterizing and automatically recognizing the participants' states. We propose a Dynamic Bayesian Network (DBN) to explicitly model the conditional dependency between two interacting partners' emotion states in a dialog using data from the IEMOCAP corpus of expressive dyadic spoken interactions. Also, we focus on automatically computing the Valence-Activation emotion attributes to obtain a continuous characterization of the participants' emotion flow. Our proposed DBN models the temporal dynamics of the emotion states as well as the mutual influence between speakers in a dialog. With speech based features, the proposed network improves classification accuracy by 3.67% absolute and 7.12% relative over the Gaussian Mixture Model (GMM) baseline on isolated turn-by-turn emotion classification.

#7A Detailed Study of Word-Position Effects on Emotion Expression in Speech

Jangwon Kim (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)

We investigate emotional effects on articulatory-acoustic speech characteristics with respect to word location within a sentence. We examined the hypothesis that emotional effect will vary based on word position by first examining articulatory features manually extracted from Electromagnetic articulography data. Initial articulatory data analyses indicated that the emotional effects on sentence medial words are significantly stronger than on initial words. To verify that observation further, we expanded our hypothesis testing to include both acoustic and articulatory data, and a consideration of an expanded set of words from different locations. Results suggest that emotional effects are generally more significant on sentence medial words than sentence initial and final words. This finding suggests that word location needs to be considered as a factor in emotional speech processing.

#8CMAC for Speech Emotion Profiling

Norhaslinda Kamaruddin (Center for Computational Intelligent, School of Computer Engineering, Nanyang Technological University, Blk N4 #2A-36, Nanyang Avenue, Singapore 639798)
Wahab Abdul (Center for Computational Intelligent, School of Computer Engineering, Nanyang Technological University, Blk N4 #2A-36, Nanyang Avenue, Singapore 639798)

Cultural and environmental differences have been one of the factors that cause failures in speech emotion analysis and processing. If this diversity could be lumped as noise artifacts in detecting emotion through speech, then we can extract pure emotion speech data from the raw emotional speech. In this paper we use the amplitude spectral subtraction (ASS) method to profile the emotion from raw emotional speech. In addition, the robustness of the cerebellar model arithmetic computer (CMAC) is used to ensure that all other noise effects can be suppressed. This profiling scheme is based on the affection space model that is well known to psychologists and also on the ASS method used for speech enhancement. Results from the speech emotion profiling show the potential of using such a technique to extract hidden features for detecting intra-cultural and inter-cultural variation and similarity that are missing from current approaches to speech emotion recognition.

#9On the relevance of high-level features for speaker independent emotion recognition of spontaneous speech

Marko Lugger (Chair for system theory and signal processing)
Bin Yang (Chair for system theory and signal processing)

In this paper we study the relevance of so called high-level speech features for the application of speaker independent emotion recognition. After we give a brief definition of high-level features, we discuss for which standard feature groups high-level features are conceivable. Two groups of high-level features are proposed within this paper: a feature set based on the separation of phonation and articulation called voice quality parameters and a second feature set deduced from music theory called harmony features. Harmony features give information about the frequency interval and chord content of the pitch data of a spoken utterance. Finally, we study the gain of classification rate by combining the proposed high-level features with the standard low-level features. We show that both high-level feature sets improve the speaker independent classification performance for spontaneous emotional speech.

#10Recognising Interest in Conversational Speech - Comparing Bag of Frames and Supra-segmental Features

Bjoern Schuller (Technische Universitaet Muenchen)

It is common knowledge that affective and emotion-related states are acoustically well modelled on a supra-segmental level. Nonetheless successes are reported for frame-level processing either by means of dynamic classification or multi-instance learning techniques. In this work a quantitative feature-type-wise comparison between frame-level and supra-segmental analysis is carried out for the recognition of interest in human conversational speech. To shed light on the respective differences the same classifier, namely Support-Vector-Machines, is used in both cases: once by clustering a 'bag of frames' of unknown sequence length employing Multi-Instance Learning techniques, and once by statistical functional application for the projection of the time series onto a static feature vector. The Audiovisual Interest Corpus of naturalistic interest serves as the database.
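
The supra-segmental route described above, applying statistical functionals to project a time series onto a static feature vector, can be sketched as follows. The particular functionals chosen (mean, standard deviation, minimum, maximum, range) and the function name are illustrative assumptions; the abstract does not list the functionals actually used.

```python
from statistics import fmean, pstdev


def functionals(contour):
    """Map a variable-length frame-level contour to a fixed-length vector."""
    lowest, highest = min(contour), max(contour)
    return [fmean(contour), pstdev(contour), lowest, highest, highest - lowest]


# Two hypothetical pitch contours of different length yield vectors of equal size.
short_turn = [110.0, 120.0, 130.0]
long_turn = [100.0, 105.0, 140.0, 150.0, 125.0, 118.0]
print(len(functionals(short_turn)) == len(functionals(long_turn)))   # True
```

The fixed output length is what allows a static classifier such as an SVM to be applied directly, whereas the bag-of-frames route keeps the individual frames and needs a multi-instance formulation.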

Wed-Ses2-P3:
Speech Synthesis Methods

Time:Wednesday 13:30 Place:Hewison Hall Type:Poster
Chair:Nobuaki Minematsu

#1Optimal Event Search Using a Structural Cost Function --- Improvement of Structure to Speech Conversion ---

Daisuke Saito (The University of Tokyo)
Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

This paper describes a new method for the framework of structure to speech conversion we previously proposed. Most of the speech synthesizers take a phoneme sequence as input and generate speech by converting each of the phonemes into its corresponding sound. However, infants usually acquire speech communication ability without phoneme sequences. Since their phonemic awareness is immature, they can hardly decompose an utterance into a phoneme sequence. Developmental psychology claims that infants acquire the holistic sound patterns of words (word Gestalt) from the utterances of their parents, and they reproduce them with their vocal tubes. This behavior is called vocal imitation. We defined the word Gestalt physically and proposed a method of extracting it from an utterance in the previous studies. We applied it to speech generation, which we call structure to speech conversion. This paper proposes and evaluates a method for improving our framework based on a structural cost function.

#2Deriving Vocal Tract Shapes From ElectroMagnetic Articulograph Data Via Geometric Adaptation and Matching

Ziad Al Bawab (Carnegie Mellon University)
Lorenzo Turicchia (Massachusetts Institute of Technology)
Richard M. Stern (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)

In this paper, we present our efforts towards deriving vocal tract shapes from ElectroMagnetic Articulograph data (EMA) via geometric adaptation and matching. We describe a novel approach for adapting Maeda's geometric model of the vocal tract to one speaker in the MOCHA database. We show how we can rely solely on the EMA data for adaptation. We present our search technique for the vocal tract shapes that best fit the given EMA data. We then describe our approach of synthesizing speech from these shapes. Results on Mel-cepstral distortion reflect improvement in synthesis over the approach we used before without adaptation.

#3Towards unsupervised articulatory resynthesis of German utterances using EMA data

Ingmar Steiner (Institute of Phonetics, Saarland University)
Korin Richmond (Centre for Speech Technology Research, University of Edinburgh)

As part of ongoing research towards integrating an articulatory synthesizer into a text-to-speech (TTS) framework, a corpus of German utterances recorded with electromagnetic articulography (EMA) is resynthesized to provide training data for statistical models. The resynthesis is based on a measure of similarity between the original and resynthesized EMA trajectories, weighted by articulatory relevance. Preliminary results are discussed and future work outlined.

#4The KlattGrid speech synthesizer

David Weenink (University of Amsterdam)

We present a new speech synthesizer class, named KlattGrid, for the Praat program. This synthesizer is based on the original descriptions of Klatt (1980, 1990). New aspects of a KlattGrid in comparison with other Klatt-type synthesizers are that a KlattGrid:
- is not frame-based but time-based. You specify parameters as a function of time with any precision you like.
- has no limitations on the number of oral formants, nasal formants, nasal antiformants, tracheal formants or tracheal antiformants that can be defined.
- has separate formants for the frication part.
- allows varying the form of the glottal flow function as a function of time.
- allows for any number of formants and bandwidths to be modified during the open phase of the glottis.
- uses no beforehand quantization of amplitude parameters.
- is fully integrated into the freely available speech analysis program Praat.

#5Development of a Kenyan English Text To Speech System: A Method of Developing a TTS for a previously undefined English Dialect

Mucemi Gakuru (Teknobyte Ltd)

This work provides a method that can be used to build an English TTS for a population who speak a dialect which is not defined and for which no resources exist, by showing how a Text to Speech System (TTS) was developed for the English dialect spoken in Kenya. To begin with, the existence of a unique English dialect which had not previously been defined was confirmed from the need by the English speaking Kenyan population to have a TTS in an accent different from the British accent. This dialect is referred to here and has also been branded as Kenyan English®. Given that building a TTS requires language features to be adequately defined, it was necessary to develop the essential features of the dialect, such as the phoneset and the lexicon, and then to verify their correctness. The paper shows how it was possible to come up with a systematic approach for defining these features through tracing the evolution of the dialect. It also discusses how the TTS was built and tested.

#6Feedback Loop for Prosody Prediction in Concatenative Speech Synthesis

Javier Latorre (Toshiba Corporate Research and Development Center)
Sergio Gracia (TelecomBCN, Universitat Politecnica Catalunya, Spain)
Masami Akamine (Toshiba Corporate Research and Development Center)

We propose a method for concatenative speech synthesis that yields a better match between the logF0 and duration predicted by the prosody module and the waveform generation back-end. The proposed method is based upon our previous multilevel parametric F0 model and Toshiba’s plural unit selection and fusion synthesizer. The method adds a feedback loop from the back-end into the prosody module so that the prosodic information of the selected units is used to re-estimate new prosody values. The feedback loop defines a frame-level prosody model which consists of the average value and variance of the duration and logF0 of the selected units. The log-likelihood defined by this model is added to the log-likelihood of the prosody model. From the maximization of this total log-likelihood, we obtain the prosody values that produce the optimum compromise between the distortion introduced by F0 discontinuities and the one created by the prosody-adjusting signal processing.
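
The effect of adding two log-likelihoods and maximizing the total can be seen in a one-dimensional toy case. The sketch assumes both the prosody model and the unit-derived feedback model are single Gaussians over one logF0 value, which is a strong simplification of the multilevel model in the abstract; under that assumption the maximizer is the precision-weighted mean. All names and numbers are hypothetical.

```python
import math


def gaussian_loglik(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2 * math.pi * var) + (x - mean) ** 2 / var)


def best_compromise(mean_a, var_a, mean_b, var_b):
    """Value of x maximizing gaussian_loglik(x, a) + gaussian_loglik(x, b)."""
    weight_a, weight_b = 1.0 / var_a, 1.0 / var_b   # precisions
    return (weight_a * mean_a + weight_b * mean_b) / (weight_a + weight_b)


# Prosody model predicts logF0 = 5.0 (variance 1.0); the selected units
# average 5.4 with a tighter variance of 0.25, so they pull the result harder.
print(best_compromise(5.0, 1.0, 5.4, 0.25))
```

The lower the variance of the selected units, the closer the re-estimated value moves to what the units can deliver without heavy signal modification.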

#7Assessing a Speaker for Fast Speech in Unit Selection Speech Synthesis

Donata Moers (Institut für Kommunikationswissenschaften, Abt. Sprache und Kommunikation, Rheinische Friedrich-Wilhelms-Universität Bonn, Bonn, Germany)
Petra Wagner (Fakultät für Linguistik und Literaturwissenschaft, Universität Bielefeld, Bielefeld, Germany)

This paper describes work in progress concerning the adequate modeling of fast speech in unit selection speech synthesis systems, mostly having in mind blind and visually impaired users. Initially, a survey of the main characteristics of fast speech will be given. Subsequently, strategies for fast speech production will be discussed. Certain requirements concerning the ability of a speaker of a fast speech unit selection inventory are drawn. The following section deals with a perception study where a selected speaker's ability to speak fast is investigated. To conclude, a preliminary perceptual analysis of the recordings for the speech synthesis corpus is presented.

#8Unit Selection based Speech Synthesis for Poor Channel Condition

Ling Cen (Institute for Infocomm Research)
Minghui Dong (Institute for Infocomm Research)
Paul Chan (Institute for Infocomm Research)
Haizhou Li (Institute for Infocomm Research)

Synthesized speech can be largely degraded in noise, resulting in compromised speech quality. In this paper, we propose a unit selection based speech synthesis system for better speech quality under poor channel conditions. First, the measurement of speech intelligibility is incorporated in the cost function as a searching criterion for unit selection. Next, the prosody of the selected units is modified according to the Lombard effect. Prosody modification includes increasing the amplitude of unvoiced phonemes and lengthening the speech duration. Finally, FIR equalization via convex optimization is applied to reduce signal distortion due to the channel effect. Listening tests in our experiments show that the quality level of synthetic speech can be improved under poor channel conditions with the help of our proposed synthesis system.

#9Vocalic sandwich, a unit designed for unit selection TTS

Didier Cadic (Orange Labs)
Cédric Boidin (Orange Labs)
Christophe d'Alessandro (LIMSI)

Unit selection text-to-speech systems currently produce very natural synthetic sentences by concatenating speech segments from a large database. Recently, increasing demand for designing high quality voices with less data has created a need for further optimization of the textual corpus recorded by the speaker. The optimization process of this corpus is traditionally guided by the coverage rate of well-known units such as triphones and words. Such units are however not dedicated to concatenative speech synthesis; they are of general use in speech technologies and linguistics. In this paper, we describe a new unit which takes account of the specific features of concatenative TTS: the "vocalic sandwich." Both an objective and a perceptual evaluation tend to show that vocalic sandwiches are appropriate units for corpus design.

#10Speech synthesis based on the plural unit selection and fusion method using FWF model

Ryo Morinaka (Toshiba Corporate R&D Center, Japan)
Masatsune Tamura (Toshiba Corporate R&D Center, Japan)
Masahiro Morita (Toshiba Corporate R&D Center, Japan)
Takehiko Kagoshima (Toshiba Corporate R&D Center, Japan)

For speech synthesizers, enhanced diversity and improved quality of synthesized speech are required. Speaker interpolation and voice conversion are the techniques that enhance diversity. The PUSF (plural unit selection and fusion) method generates synthesized waveforms using pitch-cycle waveforms. However, it is difficult to modify its spectral features while keeping naturalness of synthesized speech. In the present work, we investigated how best to represent speech waveforms. We introduce a method that decomposes a pitch waveform in a voiced portion into a periodic component and an aperiodic component. Moreover, we introduce the FWF (formant waveform) model to represent the periodic component. Because the FWF model represents the pitch waveform in accordance with formant parameters, it can control the formant parameters independently. We realized a method that can easily be applied to the diversity-enhancing techniques in the PUSF-based method.

#11Speech synthesis without a phone inventory

Matthew Peter Aylett (Centre for Speech Technology Research, University of Edinburgh/ Cereproc Ltd)
Simon King (Centre for Speech Technology Research, University of Edinburgh)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh)

In speech synthesis the unit inventory is decided using phonological and phonetic expertise. This process is resource intensive and potentially sub-optimal. In this paper we investigate how acoustic clustering, together with lexicon constraints, can be used to build a self-organised inventory. Six English speech synthesis systems were built using two frameworks, unit selection and parametric HTS, for three inventory conditions: 1) a traditional phone set, 2) a system using orthographic units, and 3) a self-organised inventory. A listening test showed a strong preference for the classic system, and for the orthographic system over the self-organised system. Results also varied by letter-to-sound complexity and database coverage. This suggests the self-organised approach failed to generalise pronunciation and also introduced noise above and beyond that caused by orthographic sound mismatch.

#12Context-dependent additive log F0 model for HMM-based speech synthesis

Heiga Zen (Toshiba Research Europe Limited, Cambridge Research Laboratory)
Norbert Braunschweiler (Toshiba Research Europe Limited, Cambridge Research Laboratory)

This paper proposes a context-dependent additive acoustic modelling technique and its application to logarithmic fundamental frequency (log F0) modelling for HMM-based speech synthesis. In the proposed technique, mean vectors of state-output distributions are composed as the weighted sum of decision tree-clustered context-dependent bias terms. Its model parameters and decision trees are estimated and built based on the maximum likelihood (ML) criterion. The proposed technique has the potential to capture the additive structure of log F0 contours. A preliminary experiment using a small database showed that the proposed technique yielded encouraging results.
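
The additive structure described here, a mean composed as a weighted sum of context-dependent bias terms, can be shown with a deliberately small sketch. Plain Python functions stand in for the clustered decision trees, and the two components, their context keys, weights and bias values are all hypothetical.

```python
def additive_mean(context, components):
    """Mean of a state-output distribution as a weighted sum of bias terms.

    components is a list of (weight, bias_lookup) pairs; each bias_lookup
    plays the role of one decision tree mapping a context to a bias.
    """
    return sum(weight * lookup(context) for weight, lookup in components)


def phrase_bias(context):
    # Hypothetical phrase-level component: a higher baseline at phrase start.
    return 5.0 if context["phrase_position"] == "initial" else 4.8


def accent_bias(context):
    # Hypothetical accent-level component: a small rise on accented syllables.
    return 0.2 if context["accented"] else 0.0


components = [(1.0, phrase_bias), (1.0, accent_bias)]
context = {"phrase_position": "initial", "accented": True}
print(additive_mean(context, components))
```

Because each component only sees its own factor, the same accent bias is reused across every phrase position instead of being re-learned for each combined context.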

Wed-Ses2-P2:
Expression, Emotion and Personality Recognition

Time:Wednesday 13:30 Place:Hewison Hall Type:Poster
Chair:John H.L. Hansen

#1Classifying Turn-Level Uncertainty Using Word-Level Prosody

Diane Litman (University of Pittsburgh)
Mihai Rotaru (Textkernel B.V.)
Greg Nicholas (Brown University)

Spoken dialogue researchers often use supervised machine learning to classify turn-level user affect from a set of turn-level features. The utility of sub-turn features has been less explored, due to the complications introduced by associating a variable number of sub-turn units with a single turn-level classification. We present and evaluate several voting methods for using word-level pitch and energy features to classify turn-level user uncertainty in spoken dialogue data. Our results show that when linguistic knowledge regarding prosody and word position is introduced into a word-level voting model, classification accuracy is significantly improved compared to the use of both turn-level and uninformed word-level models.
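
The voting idea above can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `vote_turn_label` and the weighting scheme are hypothetical. It assumes each word already carries a label from some word-level classifier, and that knowledge about word position could be expressed as per-word weights.

```python
from collections import defaultdict


def vote_turn_label(word_labels, weights=None):
    """Combine per-word labels into one turn-level label by weighted voting.

    word_labels: one label per word in the turn.
    weights: optional non-negative weight per word (hypothetical; e.g. larger
             for word positions assumed to be more informative). With no
             weights, every word gets an equal vote (uninformed voting).
    """
    if not word_labels:
        raise ValueError("a turn must contain at least one word")
    if weights is None:
        weights = [1.0] * len(word_labels)
    if len(weights) != len(word_labels):
        raise ValueError("need exactly one weight per word")

    totals = defaultdict(float)
    for label, weight in zip(word_labels, weights):
        totals[label] += weight

    # Iterate labels in sorted order so that ties resolve deterministically.
    return max(sorted(totals), key=lambda label: totals[label])
```

With equal weights the majority label wins; raising the weight of a single word can overturn the majority, which is the effect an informed voting model would exploit.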

#2Detecting Subjectivity in Multiparty Speech

Gabriel Murray (Department of Computer Science, University of British Columbia)
Giuseppe Carenini (Department of Computer Science, University of British Columbia)

In this research we aim to detect subjective sentences in spontaneous speech and label them for polarity. We introduce a novel technique wherein subjective patterns are learned from both labeled and unlabeled data, using n-grams with varying levels of lexical instantiation. Applying this technique to meeting speech, we gain significant improvement over state-of-the-art approaches and demonstrate the method's robustness to ASR errors. We also show that coupling the pattern-based approach with structural and lexical features of meetings yields additional improvement.

#3Pitch Contour Parameterisation based on Linear Stylisation for Emotion Recognition

Vidhyasaharan Sethu (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia)
Eliathamby Ambikairajah (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia)
Julien Epps (School of Electrical Engineering and Telecommunications, The University of New South Wales, Sydney, NSW 2052, Australia)

The pitch contour contains information that characterises the emotion being expressed by speech, and consequently features extracted from pitch form an integral part of many automatic emotion recognition systems. While pitch contours may have many small variations and hence are difficult to represent compactly, it may be possible to parameterise them by approximating the contour for each voiced segment by a straight line. This paper looks at such a parameterisation method in the context of emotion recognition. Listening tests were performed to subjectively determine if the linearly stylised contours were able to sufficiently capture information pertaining to emotions expressed in speech. Furthermore these parameters were used as features for an automatic 5-class emotion classification system. The use of the proposed parameters rather than pitch statistics resulted in a relative increase in accuracy of about 20%.
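
As a rough illustration of approximating each voiced segment by a straight line, the sketch below fits a least-squares line per segment. It is an assumption-laden example rather than the method from the paper: the function names and the choice of (slope, initial value, duration) as the parameter triple are hypothetical.

```python
def fit_line(times, f0):
    """Least-squares fit of f0 ~ slope * t + intercept for one voiced segment."""
    n = len(times)
    if n != len(f0) or n < 2:
        raise ValueError("need at least two (time, f0) samples of equal length")
    mean_t = sum(times) / n
    mean_f = sum(f0) / n
    var_t = sum((t - mean_t) ** 2 for t in times)
    if var_t == 0:
        raise ValueError("time stamps must not all be identical")
    cov = sum((t - mean_t) * (f - mean_f) for t, f in zip(times, f0))
    slope = cov / var_t
    intercept = mean_f - slope * mean_t
    return slope, intercept


def stylise(segments):
    """Reduce each voiced segment to (slope, initial value, duration).

    segments: list of (times, f0) pairs, one pair per voiced segment.
    The parameter triple is an illustrative assumption.
    """
    params = []
    for times, f0 in segments:
        slope, intercept = fit_line(times, f0)
        start_value = slope * times[0] + intercept
        duration = times[-1] - times[0]
        params.append((slope, start_value, duration))
    return params
```

Each segment then contributes three numbers regardless of its length, which is what makes such a representation compact.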

#4Feature-Based and Channel-Based Analyses of Intrinsic Variability in Speaker Verification

Martin Graciarena (SRI International)
Tobias Bocklet (University of Erlangen)
Elizabeth Shriberg (SRI International)
Andreas Stolcke (SRI International)
Sachin Kajarekar (SRI International)

We explore how intrinsic variations (those associated with the speaker rather than the recording environment) affect text-independent speaker verification performance. In a previous paper we introduced the SRI-FRTIV corpus and provided speaker verification results using a Gaussian mixture model (GMM) system on telephone-channel speech. In this paper we explore the use of other speaker verification systems on the telephone channel data and compare against the GMM baseline. We found the GMM system to be one of the more robust across all conditions. Systems relying on recognition hypotheses had a significant degradation in low vocal effort conditions. We also explore the use of the GMM system on several other channels. We found improved performance on table-top microphones compared to the telephone channel in furtive conditions and gradual degradations as a function of the distance from the microphone to the speaker.

#5Robust Angry Speech Detection Employing TEO-Based Discriminative Classifier Combination

Wooil Kim (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA)
John Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA)

This paper proposes an effective angry speech detection scheme employing feature extraction based on the Teager energy operator (TEO). A decorrelation process is applied to the TEO-based feature, and minimum classification error training is employed. Combination with conventional MFCC features is also employed to exploit their effectiveness in characterizing the spectral envelope of speech signals. The experimental results on the SUSAS corpus demonstrate that the proposed angry speech detection scheme is effective at increasing detection accuracy on the open-speaker and open-vocabulary task. Up to 7.78% of classification accuracy is obtained by combining the proposed methods, including decorrelation of the TEO-based feature, discriminative training, and classifier combination.

#6Improving Emotion Recognition using Class-Level Spectral Features

Dmitri Bitouk (University of Pennsylvania)
Ani Nenkova (University of Pennsylvania)
Ragini Verma (University of Pennsylvania)

Traditional approaches to automatic emotion recognition from speech typically make use of utterance level prosodic features. Still, a great deal of useful information about expressivity and emotion can be gained from spectral features or from measurements from specific regions of the utterance, such as the stressed vowels. Here we introduce a novel set of spectral features for emotion recognition: statistics of Mel-Frequency Spectral Coefficients computed over three phoneme classes. We investigate performance of our features in the task of speaker-independent emotion recognition using two datasets. Our results clearly indicate that indeed both the richer set of spectral features and the differentiation between phoneme type classes are beneficial for the task. Classification accuracies are consistently higher for our features compared to prosodic or utterance-level spectral features. Combination of our phoneme class features with prosodic features leads to even further improvement.

#7Arousal and Valence prediction in spontaneous emotional speech: felt versus perceived emotion

Khiet Truong (University of Twente)
David van Leeuwen (TNO Defence, Security, and Safety)
Mark Neerincx (TNO Defence, Security, and Safety)
Franciska de Jong (University of Twente)

In this paper, we describe emotion recognition experiments carried out for spontaneous affective speech with the aim to compare the added value of annotation of felt emotion versus annotation of perceived emotion. Using speech material available in the TNO-GAMING corpus (a corpus containing audiovisual recordings of people playing videogames), speech-based affect recognizers were developed that can predict Arousal and Valence scalar values. Two types of recognizers were developed in parallel: one trained with felt emotion annotations (generated by the gamers themselves) and one trained with perceived/observed emotion annotations (generated by a group of observers). The experiments showed that, in speech, with the methods and features currently used, observed emotions are easier to predict than felt emotions. The results suggest that recognition performance strongly depends on how and by whom the emotion annotations are carried out.

#8Dimension Reduction Approaches for SVM based Speaker Age Estimation

Gil Dobry (The Open University of Israel)
Ron Hecht (PuddingMedia)
Mireille Avigal (The Open University of Israel)
Yaniv Zigel (Ben-Gurion University)

This paper presents two novel dimension reduction approaches applied to Gaussian mixture model (GMM) supervectors to improve age estimation speed and accuracy. The GMM supervector embodies many speech characteristics that are irrelevant to age estimation and, like noise, are harmful to the system’s generalization ability. In addition, the support vector machine (SVM) evaluation computation grows with the vector’s dimension, especially when using complex kernels. The first approach presented is weighted-pairwise principal components analysis (WPPCA), which reduces the vector dimension by minimizing the redundant variability. The second approach is based on anchor models, using a novel anchor selection method. Experiments showed that dimension reduction makes the evaluation process 5 times faster and that, using the WPPCA approach, it is also 5% more accurate.

#9ANN based Decision Fusion for Speech Emotion Recognition

Lu Xu (State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory of Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Mingxing Xu (State Key Laboratory of Intelligent Technology and Systems, Tsinghua National Laboratory of Information Science and Technology, Department of Computer Science and Technology, Tsinghua University, Beijing 100084, China)
Dali Yang (Department of Computer Science and Technology, Beijing Information Science and Technology University, Beijing 100101, China)

As a hot research field, speech emotion recognition has attracted increasing attention from both academia and industry. In this paper, we propose a method that recognizes speech emotions using ANNs and fuses, at the decision level, two kinds of recognition results obtained from different features. Each emotional utterance is first recognized by several individual recognizers. The outputs of these recognizers are then fused using a voting strategy. Furthermore, the dimensionality of supervectors constructed from spectral features is reduced through PCA. Experimental results demonstrate that the proposed decision fusion is effective and the dimensionality reduction is feasible.

#10Processing affected speech within human machine interaction

Bogdan Vlasenko (Cognitive Systems, IESK, Otto-von-Guericke Universitaet)
Andreas Wendemuth (Cognitive Systems, IESK, Otto-von-Guericke Universitaet)

Spoken dialog systems (SDS) integrated into human-machine interaction interfaces are becoming a standard technology. Current state-of-the-art SDS are usually not able to provide the user with a natural way of communicating. Existing automated dialog systems do not dedicate enough attention to problems in the interaction related to affected user behavior. As a result, Automatic Speech Recognition (ASR) engines are not able to recognize affected speech and the dialog strategy does not make use of the user’s emotional state. This paper addresses some aspects of processing affected speech within natural human-machine interaction. First, we propose an ASR engine adapted to affected speech. Second, we describe our methods of emotion recognition from speech and present our emotion classification results in the Interspeech 2009 Emotion Challenge. Third, we test speech recognition models adapted to affected speech and introduce an approach to achieve emotion-adaptive dialog management in human-machine interaction.

#11Emotion Recognition from Speech using Extended Feature Selection and a Simple Classifier

Ali Hassan (University of Southampton)
Robert Damper (University of Southampton)

We describe extensive experiments on the recognition of emotion from speech using acoustic features only. Two databases of acted emotional speech (Berlin and DES) have been used in this work. The principal focus is on methods for selection of good features from a relatively large set of hand-crafted features, perhaps formed by fusing different feature sets used by different researchers. We show that the monotonic assumption underlying popular sequential selection algorithms does not hold, and use this finding to improve recognition accuracy. We show further that a very simple classifier (k-nearest neighbour) produces better results than any so far reported by other researchers on these databases, suggesting that previous work has failed to match the complexity of the classifier used to the complexity of the data. Finally, several potentially fruitful avenues for future work are outlined.

Wed-Ses3-O1:
Language Recognition

Time:Wednesday 16:00 Place:Main Hall Type:Oral
Chair:Honza Černocký

16:00A Human Benchmark for Language Recognition

Rosemary Orr (University College Utrecht)
David van Leeuwen (TNO Human Factors)

In this study, we explore a human benchmark in language recognition, for the purpose of comparing human performance to machine performance in the context of the NIST LRE 2007. Humans are categorised in terms of language proficiency, and performance is presented per proficiency. The main challenge in this work is the design of a test and application of a performance metric which allows a meaningful comparison of humans and machines. The main result of this work is that where subjects have lexical knowledge of a language, even at a low level, they perform as well as the state of the art in language recognition systems in 2007.

16:20Large Margin Estimation of Gaussian Mixture Model Parameters with Extended Baum-Welch for Spoken Language Recognition

Donglai Zhu (Institute for Infocomm Research, Singapore)
Bin Ma (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)

Discriminative training (DT) methods for acoustic models, such as SVM and MMI-trained GMM, have proved effective in spoken language recognition. In this paper we propose a DT method for GMM using large margin (LM) estimation. Unlike traditional MMI or MCE methods, LM estimation attempts to enhance the generalization ability of the GMM to deal with new data that is mismatched with the training data. We define the multi-class separation margin as a function of GMM likelihoods, and derive update formulae for the GMM parameters with the extended Baum-Welch algorithm. Results on the NIST language recognition evaluation (LRE) 2007 task show that LM estimation achieves better performance and faster convergence than MMI estimation.

16:40Linguistically-motivated automatic classification of regional French varieties

Cécile Woehrling (LIMSI-CNRS)
Philippe Boula de Mareüil (LIMSI-CNRS)
Martine Adda-Decker (LIMSI-CNRS)

The goal of this study is to automatically differentiate French varieties (standard French and French varieties spoken in the South of France, Alsace, Belgium and Switzerland) by applying a linguistically-motivated approach. We took advantage of automatic phoneme alignment to measure vowel formants, consonant (de)voicing, pronunciation variants as well as prosodic cues. These features were then used to identify French varieties by applying classification techniques. On large corpora of hundreds of speakers, over 80% correct identification scores were obtained. The confusions between varieties and the features used (by decision trees) are linguistically grounded.

17:00Discriminative Acoustic Language Recognition via Channel-Compensated GMM Statistics

Niko Brummer (AGNITIO)
Albert Strasheim (AGNITIO)
Valiantsina Hubeika (Brno University of Technology)
Pavel Matejka (Brno University of Technology)
Lukas Burget (Brno University of Technology)
Ondrej Glembek (Brno University of Technology)

We propose a novel design for acoustic feature-based automatic spoken language recognizers. Our design is inspired by recent advances in text-independent speaker recognition, where intra-class variability is modeled by factor analysis in Gaussian mixture model (GMM) space. We use approximations to GMM-likelihoods which allow variable-length data sequences to be represented as statistics of fixed size. Our experiments on NIST LRE'07 show that variability-compensation of these statistics can reduce error-rates by a factor of three. Finally, we show that further improvements are possible with discriminative logistic regression training.

17:20Language Score Calibration using Adapted Gaussian Back-end

Mohamed Faouzi BenZeghiba (LIMSI-CNRS)
Jean-luc Gauvain (LIMSI-CNRS)
Lori Lamel (LIMSI-CNRS)

The generative Gaussian back-end and discriminative logistic regression are the most widely used approaches for language score fusion and calibration. Combining these two approaches can significantly improve performance. This paper proposes the use of an adapted Gaussian back-end, where the mean of the language-dependent Gaussian is adapted from the mean of a language-specific background Gaussian via the maximum a posteriori estimation algorithm. Experiments are conducted using the LRE-07 evaluation data. Compared to the conventional Gaussian back-end approach for a closed set task, relative improvements in the C_avg of 50%, 17% and 4.2% are obtained on the 30s, 10s and 3s conditions, respectively. Besides this, the estimated scores are better calibrated. A combination with logistic regression results in a system with the best calibrated scores.
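
For readers unfamiliar with maximum a posteriori mean adaptation, the sketch below shows the standard relevance-factor form, in which the adapted mean interpolates between a background mean and the data. It is a generic illustration, not the configuration used in the paper: the function name `map_adapt_mean`, the `relevance` parameter, and its default value are assumptions.

```python
def map_adapt_mean(prior_mean, score_vectors, relevance=10.0):
    """MAP-adapt a Gaussian mean towards observed score vectors.

    prior_mean: mean of the background Gaussian (list of floats).
    score_vectors: score vectors observed for one language.
    relevance: hypothetical relevance factor; larger values keep the
               adapted mean closer to the prior.
    """
    if relevance <= 0:
        raise ValueError("relevance must be positive")
    dim = len(prior_mean)
    sums = [0.0] * dim
    for vec in score_vectors:
        if len(vec) != dim:
            raise ValueError("score vector has the wrong dimension")
        for d, value in enumerate(vec):
            sums[d] += value
    n = len(score_vectors)
    # Adapted mean = (relevance * prior + sum of data) / (relevance + count).
    return [(relevance * prior_mean[d] + sums[d]) / (relevance + n)
            for d in range(dim)]
```

With no data the function returns the prior mean unchanged, and as the number of vectors grows the result approaches the sample mean.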

17:40A Framework for Discriminative SVM/GMM Systems for Language Recognition

William Campbell (MIT Lincoln Laboratory)
Zahi Karam (MIT Lincoln Laboratory, DSPG Research Laboratory of Electronics at MIT)

Language recognition with support vector machines and shifted-delta cepstral features has been an excellent performer in NIST-sponsored language evaluation for many years. A novel improvement of this method has been the introduction of hybrid SVM/GMM systems. These systems use GMM supervectors as an SVM expansion for classification. In prior work, methods for scoring SVM/GMM systems have been introduced based upon either standard SVM scoring or GMM scoring with a pushed model. Although prior work showed experimentally that GMM scoring yielded better results, no framework was available to explain the connection between SVM scoring and GMM scoring. In this paper, we show that there are interesting connections between SVM scoring and GMM scoring. We provide a framework both theoretically and experimentally that connects the two scoring techniques. This connection should provide the basis for further research in SVM discriminative training for GMM models.

Wed-Ses3-O2:
Phonetics & Phonology

Time:Wednesday 16:00 Place:East Wing 1 Type:Oral
Chair:Denis Burnham

16:00Functional Data Analysis as a Tool for Analyzing Speech Dynamics: A Case Study on the French Word c'était

Michele Gubian (Centre for Language & Speech Technology, Radboud University, Nijmegen, NL)
Francisco Torreira (Centre for Language & Speech Technology, Radboud Universiteit Nijmegen & Max Planck Institute for Psycholinguistics)
Helmer Strik (Centre for Language & Speech Technology, Radboud University, Nijmegen, NL)
Lou Boves (Centre for Language & Speech Technology, Radboud University, Nijmegen, NL)

In this paper we introduce Functional Data Analysis (FDA) as a tool for analyzing dynamic transitions in speech signals. FDA makes it possible to perform statistical analyses of sets of mathematical functions in the same way as classical multivariate analysis treats scalar measurement data. We illustrate the use of FDA with a reduction phenomenon affecting the French word c'était /setE/ 'it was', which can be reduced to [stE] in conversational speech. FDA reveals that the dynamics of the transition from [s] to [t] in fully reduced cases may still be different from the dynamics of [s] - [t] transitions in underlying /st/ clusters such as in the word stage.

16:20Large-Scale Analysis of Formant Frequency Estimation Variability in Conversational Telephone Speech

Nancy Chen (MIT Lincoln Laboratory)
Wade Shen (MIT Lincoln Laboratory)
Joseph Campbell (MIT Lincoln Laboratory)
Reva Schwartz (United States Secret Service)

We quantify how the telephone channel and regional dialect influence formant estimates extracted from Wavesurfer in spontaneous conversational speech from over 3,600 native American English speakers. To the best of our knowledge, this is the largest scale study on this topic. We found that F1 estimates are higher in cellular channels than those in landline, while F2 in general shows an opposite trend. We also characterized vowel shift trends in the northern states of the U.S.A. and compared them with the Northern city chain shift (NCCS). Our analysis is useful in forensic applications where it is important to distinguish between speaker, dialect, and channel characteristics.

16:40Developing an automatic functional annotation system for British English Intonation

Saandia Vanessa Ali (Laboratoire Parole et Langage, CNRS & Aix-Marseille Université, France)
Daniel Hirst (Laboratoire Parole et Langage, CNRS & Aix-Marseille Université, France)

One of the fundamental aims of prosodic analysis is to provide a reliable means of extracting functional information (what prosody contributes to meaning) directly from prosodic form (i.e. the way in which prosody, in this case intonation, is phonetically manifested). This paper addresses the development of an automatic functional annotation system for British English. It is based on the study of a large corpus of British English and a procedure of analysis by synthesis, enabling the possibility of testing and enriching different models of English intonation on the one hand and working towards an automatic version of the annotation process on the other.

17:00Intrinsic vowel duration and the post-vocalic voicing effect: Some evidence from dialects of North American English

Joshua Tauberer (University of Pennsylvania)
Keelan Evanini (University of Pennsylvania)

We report the results of a comprehensive dialectal survey of three vowel duration phenomena in North American English: gross duration differences between dialects, the effect of post-vocalic consonant voicing, and intrinsic vowel duration. Duration data, from HMM-based forced alignment of phones in the Atlas of North American English corpus (Labov, Ash, and Boberg 2006), showed that 1) the post-vocalic voicing effect appears in every dialect region and all but one dialect, and 2) dialectal variation in first formant frequency appears to be independent of intrinsic vowel duration. This second result adds evidence that intrinsic vowel durations are targets stored in the grammar and do not result from physiological constraints.

17:20Investigating /l/ Variation in English through Forced Alignment

Jiahong Yuan (University of Pennsylvania)
Mark Liberman (University of Pennsylvania)

We present a new method for measuring the "darkness" of /l/, and use it to investigate the variation of English /l/ in a large speech corpus that is automatically aligned with phones predicted from an orthographic transcript. We found a correlation between the rime duration and /l/-darkness for syllable-final /l/, but no correlation between /l/ duration and darkness for syllable-initial /l/. The data showed a clear difference between clear and dark /l/ in English, and also showed that syllable-final /l/ was less dark preceding an unstressed vowel than preceding a consonant or a word boundary.

17:40Structural Analysis of Dialects, Sub-dialects and Sub-sub-dialects of Chinese

XueBin Ma (The University of Tokyo, Tokyo, Japan)
Akira Nemoto (Nankai University, TianJin, China)
Nobuaki Minematsu (The University of Tokyo, Tokyo, Japan)
Yu Qiao (The University of Tokyo, Tokyo, Japan)
Keikichi Hirose (The University of Tokyo, Tokyo, Japan)

In China, there are seven big dialect regions and most of them have many sub-dialects and sub-sub-dialects. Therefore, people from different dialect regions often cannot communicate orally. In this paper, using the finals of the dialectal utterances of a specific list of written characters, a dialect pronunciation structure is built for every speaker and these speakers are classified. Then, the results of classifying 16 Mandarin speakers show that they are linguistically classified with little influence of their age and gender. Finally, distances among sub-sub-dialects are similarly calculated and evaluated. All the results show high validity and accord with linguistic studies.

Wed-Ses3-O3:
Speech activity detection

Time:Wednesday 16:00 Place:East Wing 2 Type:Oral
Chair:Isabel Trancoso

16:00Voice Activity Detection Using Singular Value Decomposition-based Filter

Hwa Jeon Song (School of Electrical Engineering, Pusan National University)
Sung Min Ban (School of Electrical Engineering, Pusan National University)
Hyung Soon Kim (School of Electrical Engineering, Pusan National University)

This paper proposes a novel voice activity detector (VAD) based on singular value decomposition (SVD). The spectro-temporal characteristics of the background noise region can be easily analyzed by SVD. The proposed method naturally drops the hangover algorithm from the VAD. Moreover, it adaptively changes the decision threshold by employing the most dominant singular value of the observation matrix in the noise region. According to simulation results, the proposed VAD shows significantly better performance than the conventional statistical model-based method and is less sensitive to environmental changes. In addition, the proposed algorithm requires very low computational cost compared with other algorithms.

16:20Voice Activity Detection Using Partially Observable Markov Decision Process

Chiyoun Park (Samsung Advanced Institute of Technology)
Namhoon Kim (Samsung Advanced Institute of Technology)
Jeongmi Cho (Samsung Advanced Institute of Technology)

The partially observable Markov decision process (POMDP) has been generally used to model agent decision processes such as dialogue management. In this paper, the possibility of applying POMDP to a voice activity detector (VAD) is explored. The proposed system first formulates hypotheses about the current noise environment and speech activity. Then, it decides and observes the features that are expected to be the most salient in the estimated situation. The VAD decision is made based on the accumulated information. A comparative evaluation is presented to show that the proposed method outperforms other model-based algorithms regardless of noise type or signal-to-noise ratio.

16:40High-Accuracy, Low-Complexity Voice Activity Detection Based on A Posteriori SNR Weighted Energy

Zheng-Hua Tan (Aalborg University)
Børge Lindberg (Aalborg University)

This paper presents a voice activity detection (VAD) method using the measurement of a posteriori signal-to-noise ratio (SNR) weighted energy. The motivations are manifold: 1) the difference in frame-to-frame energy provides strong discrimination for speech signals, 2) speech segments, besides their characteristics, are also accounted for according to their reliability, e.g. as measured by SNR, 3) the a posteriori SNR for noise-only segments will theoretically equal 0 dB, which is ideal for VAD, and 4) both energy and a posteriori SNR are easy to estimate, resulting in low complexity. The method is experimentally shown to be superior to a number of referenced methods and standards.
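
The two quantities named in the motivations can be sketched as follows. This is only an illustration under stated assumptions: the a posteriori SNR is taken here as the ratio of frame energy to an estimated noise energy in dB, and the product used as the weighted measure is a hypothetical combination, not the formula from the paper.

```python
import math


def a_posteriori_snr_db(frame_energy, noise_energy):
    """A posteriori SNR in dB: observed frame energy over estimated noise energy."""
    if frame_energy <= 0 or noise_energy <= 0:
        raise ValueError("energies must be positive")
    return 10.0 * math.log10(frame_energy / noise_energy)


def snr_weighted_energy_difference(energies, noise_energy):
    """Weight each frame-to-frame energy difference by the a posteriori SNR.

    The product |E(t) - E(t-1)| * max(SNR_dB, 0) is an illustrative
    assumption; negative SNR values are clipped to zero so that frames
    at or below the noise level contribute nothing.
    """
    measures = []
    for previous, current in zip(energies, energies[1:]):
        snr = max(a_posteriori_snr_db(current, noise_energy), 0.0)
        measures.append(abs(current - previous) * snr)
    return measures
```

A frame whose energy equals the noise estimate has an a posteriori SNR of 0 dB and therefore a weighted measure of zero, which matches motivation 3 above.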

17:00Fusing Fast Algorithms to Achieve Efficient Speech Detection in FM Broadcasts

Stephane Pigeon (Royal Military Academy)
Patrick Verlinde (Royal Military Academy)

This paper describes a system aimed at detecting speech segments in FM broadcasts. To achieve high processing speeds, simple but fast algorithms are used. To output robust decisions, a combination of many different algorithms has been considered. The system has been fully operational in the context of Open Source Intelligence since 2007.

17:20Robust Speech Recognition Using VAD-measure-embedded Decoder

Tasuku Oonishi (Tokyo Institute of Technology)
Paul Dixon (Tokyo Institute of Technology)
Koji Iwano (Tokyo City University)
Sadaoki Furui (Tokyo Institute of Technology)

In a speech recognition system, a Voice Activity Detector (VAD) is a crucial component not only for maintaining accuracy but also for reducing computational consumption. Front-end approaches which drop non-speech frames typically attempt to detect speech frames by utilizing speech/non-speech classification information such as the zero crossing rate or statistical models. These approaches discard the speech/non-speech classification information after voice detection. This paper proposes an approach that uses the speech/non-speech information to adjust the score of the recognition hypotheses. Experimental results show that our approach can improve the accuracy significantly and reduce computational consumption when combined with the front-end method.

17:40Investigating Privacy-sensitive Features for Speech Detection in Multiparty Conversations

Sree Hari Krishnan Parthasarathi (Idiap Research Institute, Martigny, Switzerland and Ecole Polytechnique Federale de Lausanne, Switzerland.)
Mathew Magimai.-Doss (Idiap Research Institute, Martigny, Switzerland)
Herve Bourlard (Idiap Research Institute, Martigny, Switzerland and Ecole Polytechnique Federale de Lausanne, Switzerland.)
Daniel Gatica-Perez (Idiap Research Institute, Martigny, Switzerland and Ecole Polytechnique Federale de Lausanne, Switzerland.)

We investigate four different privacy-sensitive features, namely energy, zero crossing rate, spectral flatness, and kurtosis, for speech detection in multiparty conversations. We liken this scenario to a meeting room and define our datasets and annotations accordingly. The temporal context of these features is modeled. With no temporal context, energy is the best performing single feature. But by modeling temporal context, kurtosis emerges as the most effective feature. Also, we combine the features. Besides yielding a gain in performance, certain combinations of features also reveal that a shorter temporal context is sufficient. We then benchmark other privacy-sensitive features utilized in previous studies. Our experiments show that the performance of all the privacy-sensitive features modeled with context is close to that of state-of-the-art spectral-based features, without extracting and using any features that can be used to reconstruct the speech signal.
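
Three of the four features named above can be computed per frame with a few lines of code. The sketch below is a generic illustration with hypothetical function names, not the authors' feature extraction; spectral flatness and the modelling of temporal context are omitted for brevity, and kurtosis is computed in its non-excess (Pearson) form.

```python
def frame_energy(frame):
    """Mean squared amplitude of the frame."""
    return sum(x * x for x in frame) / len(frame)


def zero_crossing_rate(frame):
    """Fraction of adjacent sample pairs whose signs differ."""
    if len(frame) < 2:
        raise ValueError("need at least two samples")
    crossings = sum(1 for a, b in zip(frame, frame[1:])
                    if (a >= 0) != (b >= 0))
    return crossings / (len(frame) - 1)


def kurtosis(frame):
    """Fourth central moment divided by the squared variance (non-excess)."""
    n = len(frame)
    mean = sum(frame) / n
    m2 = sum((x - mean) ** 2 for x in frame) / n
    m4 = sum((x - mean) ** 4 for x in frame) / n
    if m2 == 0:
        raise ValueError("kurtosis is undefined for a constant frame")
    return m4 / (m2 * m2)
```

None of these scalars allows the waveform to be reconstructed, which is the sense in which such features are privacy-sensitive.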

Wed-Ses3-O4:
Multimodal speech (e.g. audiovisual speech, gesture)

Time:Wednesday 16:00 Place:East Wing 3 Type:Oral
Chair:Ji Ming

16:00Evaluation of External and Internal Articulator Dynamics for English Pronunciation Learning

Lan Wang (CAS/CUHK ShenZhen Institute of Advanced Integration Technologies, Chinese Academy of Sciences)
Hui Chen (CAS/CUHK ShenZhen Institute of Advanced Integration Technologies, Chinese Academy of Sciences)
JianJun Ouyang (CAS/CUHK ShenZhen Institute of Advanced Integration Technologies, Chinese Academy of Sciences)

In this paper we present a data-driven 3D talking head system using facial video and an X-ray film database for speech research. In order to construct a database recording the three-dimensional positions of articulators at phoneme level, the feature points of articulators were defined and labeled in facial and X-ray images for each English phoneme. Dynamic displacement based deformations were used in three modes to simulate the motions of both external and internal articulators. For continuous speech, the articulatory movements of each phoneme within an utterance were concatenated. A blending function was also employed to smooth the concatenation. In an audio-visual test, a set of minimal pairs was used as the stimuli to assess the degree of realism of the articulatory motions of the 3D talking head. In the experiments where the subjects are native speakers and professional English teachers, a word identification accuracy of 91.1% among 156 tests was obtained.

16:20Robust Audio-Visual Speech Synchrony Detection by Generalized Bimodal Linear Prediction

Kshitiz Kumar (Carnegie Mellon University)
Jiri Navratil (IBM Thomas J. Watson Research Center)
Etienne Marcheret (IBM Thomas J. Watson Research Center)
Vit Libal (None)
Gerasimos Potamianos (Institute of Informatics and Telecommunications)

We study the problem of detecting audio-visual synchrony in video segments containing a speaker in frontal head pose. The problem has a number of important applications, for example speech source localization, speech activity detection, speaker diarization, speech source separation, and biometric spoofing detection. In particular, we build on earlier work, extending our previously proposed time-evolution model of audio-visual features to include non-causal (future) feature information. This significantly improves robustness of the method to small time-alignment errors between the audio and visual streams, as demonstrated by our experiments. In addition, we compare the proposed model to two known approaches from the literature for audio-visual synchrony detection, namely mutual information and hypothesis testing, and we show that our method is superior to both.

16:40Acoustic-to-articulatory inversion using speech recognition and trajectory formation based on phoneme hidden Markov models

Atef Ben Youssef (GIPSA-lab (Dept Parole & Cognition / ICP), CNRS – Universités de Grenoble, France)
Pierre Badin (GIPSA-lab (Dept Parole & Cognition / ICP), CNRS – Universités de Grenoble, France)
Gérard Bailly (GIPSA-lab (Dept Parole & Cognition / ICP), CNRS – Universités de Grenoble, France)
Panikos Heracleous (GIPSA-lab (Dept Parole & Cognition / ICP), CNRS – Universités de Grenoble, France)

In order to recover the movements of usually hidden articulators such as the tongue or velum, we have developed a speech inversion method. HMMs are trained, in a multistream framework, from two synchronous streams: articulatory movements measured by EMA, and MFCC + energy from the speech signal. A speech recognition procedure based on the acoustic part of the HMMs delivers the chain of phonemes together with their durations, information that is subsequently used by a trajectory formation procedure based on the articulatory part of the HMMs to synthesise the articulatory movements. The RMS reconstruction error ranged between 1.1 and 2. mm.

17:00Speaker discriminability for visual speech modes

Jeesun Kim (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Chris Davis (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Christian Kroos (MARCS Auditory Laboratories, University of Western Sydney, Australia)
Harold Hill (School of Psychology, University of Wollongong, Australia)

Does speech mode affect recognizing people from their visual speech? We examined 3D motion data from 4 talkers saying 10 sentences (twice). Speech was in noise, in quiet or whispered. Principal Component Analyses (PCAs) were conducted and speaker classification was determined by Linear Discriminant Analysis (LDA). The first five PCs for the rigid motion and the first 10 PCs each for the non-rigid motion and the combined motion were input to a series of LDAs for all possible combinations of PCs that could be constructed using the retained PCs. The discriminant functions and classification coefficients were determined on the training data to predict the talker of the test data. Classification performance for both the in-noise and whispered speech modes was superior to that for the in-quiet mode. Superiority of classification was found even if only the first PC (jaw motion) was used, i.e., measures of jaw motion when speaking in noise or whispering hold promise for bimodal person recognition or verification.

17:20Audio-Visual prosody of social attitudes in Vietnamese: building and evaluating a tones balanced corpus

Dang Khoa Mac (Laboratory of Informatics of Grenoble (LIG), France)
Véronique Aubergé (Laboratory of Informatics of Grenoble (LIG), France)
Albert Rilliard (LIMSI-CNRS, Orsay, France)
Eric Castelli (International Research Center MICA, Vietnam)

This paper presents the building and a first evaluation of a tones balanced Audio-Visual corpus of social affect in the Vietnamese language. This under-resourced tonal language has specific glottalization and co-articulation phenomena, whose interactions with the prosody of attitudes are a very interesting issue. A well-controlled recording methodology was designed to build a large representative audio-visual corpus covering 16 attitudes from one speaker. A perception experiment was carried out to evaluate the speaker’s perceived performance and to study the role and integration of the audio, visual, and audio-visual information in the listener’s perception of the speaker’s attitudes. The results reveal characteristics of Vietnamese prosodic attitudes and allow us to investigate such social affect in the Vietnamese language.

17:40Direct, Modular and Hybrid Audio to Visual Speech Conversion methods – a Comparative Study

Gyorgy Takacs (PPKE ITK)

A systematic comparative study of audio to visual speech conversion methods is described in this paper. A direct conversion system is compared to conceptually different ASR based solutions. Hybrid versions of the different solutions are also presented. The methods are tested using the same speech material, audio preprocessing and facial motion visualization units. Only the conversion blocks are changed. Subjective opinion score evaluation tests show that the direct conversion is rated the most natural.

Wed-Ses3-S1:
Special Session: Machine Learning for Adaptivity in Spoken Dialogue Systems

Time:Wednesday 16:00 Place:East Wing 4 Type:Special
Chair:Oliver Lemon & Olivier Pietquin

16:00A User Modeling-based Performance Analysis of a Wizarded Uncertainty-Adaptive Dialogue System Corpus

Kate Forbes-Riley (Learning Research and Development Center (LRDC), University of Pittsburgh, USA)
Diane Litman (Learning Research and Development Center (LRDC), University of Pittsburgh, USA)

Motivated by prior spoken dialogue system research in user modeling, we analyze interactions between performance and user class in a dataset previously collected with two wizarded spoken dialogue tutoring systems that adapt to user uncertainty. We focus on user classes defined by expertise level and gender, and on both objective (learning) and subjective (user satisfaction) performance metrics. We find that lower expertise users learn best from one adaptive system but prefer the other, while higher expertise users learn more from one adaptive system but do not prefer either. Female users both learn best from and prefer the same adaptive system, while male users prefer one adaptive system but do not learn more from either. Our results yield an empirical basis for future investigations into whether adaptive system performance can improve by adapting to user uncertainty differently based on user class.

16:20Using Dialogue-Based Dynamic Language Models for Improving Speech Recognition

Juan Manuel Lucas-Cuesta (Speech Technology Group, Universidad Politécnica de Madrid)
Fernando Fernández-Martínez (Speech Technology Group, Universidad Politécnica de Madrid)
Javier Ferreiros (Speech Technology Group, Universidad Politécnica de Madrid)

We present a new approach to dynamically create and manage different language models (LMs) to be used in a spoken dialogue system. We apply an interpolation-based approach, using several measures obtained by the dialogue manager (DM) to decide which LMs the system will interpolate and also to estimate the interpolation weights. We propose to use not only semantic information (the concepts extracted from each recognized utterance), but also information obtained by the DM, that is, the objectives or goals the user wants to fulfill, and the proper classification of those concepts according to the inferred goals. The experiments we have carried out show improvements in word error rate when using the parsed concepts and the inferred goals from a speech utterance for rescoring the same utterance.
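
The interpolation step mentioned in this abstract can be illustrated with a minimal sketch of linear LM interpolation. Everything concrete here is an assumption made for illustration: the abstract does not specify the LM order or how the dialogue manager derives the weights, so the sketch uses unigram distributions stored as dictionaries and takes the weights as given (for example, hypothetical goal scores). The function name `interpolate_lms` is also hypothetical.

```python
def interpolate_lms(models, weights):
    """Linearly interpolate word distributions: P(w) = sum_i lambda_i * P_i(w).

    models  -- list of dicts mapping word -> probability (unigram LMs, an assumption)
    weights -- list of non-negative scores, one per model; normalized to sum to 1
    """
    if len(models) != len(weights):
        raise ValueError("need exactly one weight per model")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    lambdas = [w / total for w in weights]
    # The interpolated vocabulary is the union of the component vocabularies.
    vocab = set().union(*models)
    return {
        word: sum(lam * model.get(word, 0.0) for lam, model in zip(lambdas, models))
        for word in vocab
    }
```

For instance, with one LM per dialogue goal, the weights could be set from how strongly the dialogue manager believes each goal is active; since the component distributions each sum to one and the weights are normalized, the result is again a valid distribution.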

16:40Reinforcement Learning for Dialog Management using Least-Squares Policy Iteration and Fast Feature Selection

Lihong Li (Rutgers University)
Jason Williams (AT&T Labs - Research)
Suhrid Balakrishnan (AT&T Labs - Research)

Reinforcement learning (RL) is a promising technique for creating a dialog manager. RL accepts features of the current dialog state and seeks to find the best action given those features. Although it is often easy to posit a large set of potentially useful features, in practice, it is difficult to find the subset which is large enough to contain useful information yet compact enough to reliably learn a good policy. In this paper, we propose a method for RL optimization which automatically performs feature selection. The algorithm is based on least-squares policy iteration, a state-of-the-art RL algorithm which is highly sample-efficient and can learn from a static corpus or on-line. Experiments in dialog simulation show it is more stable than a baseline RL algorithm taken from a working dialog system.

17:00Hybridisation of Expertise and Reinforcement Learning in Dialogue Systems

Romain Laroche (Orange Labs & LIP6)
Ghislain Putois (Orange Labs)
Philippe Bretier (Orange Labs)
Bernadette Bouchon-Meunier (LIP6 & CNRS)

This paper addresses the problem of introducing learning capabilities in industrial handcrafted automata-based Spoken Dialogue Systems, in order to help developers cope with dialogue strategy design tasks. While classical reinforcement learning algorithms position their learning at the dialogue move level, the fundamental idea behind our approach is to learn at a finer internal decision level (which question, which words, which prosody, ...). These internal decisions are made on the basis of different (distinct or overlapping) knowledge. This paper proposes a novel reinforcement learning algorithm that can be used to make a data-driven optimisation of such handcrafted systems. An experiment shows that the convergence can be up to 20 times faster than with Q-Learning.

17:20Bayesian Learning of Confidence Measure Function for Generation of Utterances and Motions in Object Manipulation Dialogue Task

Komei Sugiura (National Institute of Information and Communications Technology)
Naoto Iwahashi (National Institute of Information and Communications Technology)
Hideki Kashioka (National Institute of Information and Communications Technology)
Satoshi Nakamura (National Institute of Information and Communications Technology)

This paper proposes a method that generates motions and utterances in an object manipulation dialogue task. The proposed method integrates belief modules for speech, vision, and motions into a probabilistic framework so that a user's utterances can be understood based on multimodal information. Responses to the utterances are optimized based on an integrated confidence measure function for the integrated belief modules. Bayesian logistic regression is used for the learning of the confidence measure function. The experimental results revealed that the proposed method reduced the failure rate from 12% down to 2.6% while the rejection rate was less than 24%.

17:40Predicting how it sounds: Re-ranking dialogue prompts based on TTS quality for adaptive Spoken Dialogue Systems

Cedric Boidin (Orange Labs)
Verena Rieser (University of Edinburgh)
Lonneke van der Plas (University of Geneva)
Oliver Lemon (University of Edinburgh)
Jonathan Chevelu (Orange Labs)

This paper presents a method for adaptively re-ranking paraphrases in a Spoken Dialogue System (SDS) according to their predicted Text To Speech (TTS) quality. We collect data under 4 different conditions and extract a rich feature set of 55 TTS runtime features. We build predictive models of user ratings using linear regression with latent variables. We then show that these models transfer to a more specific target domain on a separate test set. All our models significantly outperform a random baseline. Our best performing model reaches the same performance as reported by previous work, but it requires 75% less annotated training data. The TTS re-ranking model is part of an end-to-end statistical architecture for Spoken Dialogue Systems developed by the CLASSiC project.

Wed-Ses3-P3:
Robust Automatic Speech Recognition II

Time:Wednesday 16:00 Place:Hewison Hall Type:Poster
Chair:Peter Jancovic

#1Noisy Speech Recognition by using Output Combination of Discrete-Mixture HMMs and Continuous-Mixture HMMs

Tetsuo Kosaka (Graduate School of Science and Engineering, Yamagata University)
You Saito (Graduate School of Science and Engineering, Yamagata University)
Masaharu Kato (Graduate School of Science and Engineering, Yamagata University)

This paper presents an output combination approach for noise-robust speech recognition. The aim of this work is to improve recognition performance for adverse conditions which contain both stationary and non-stationary noise. In the proposed method, both discrete-mixture HMMs (DMHMMs) and continuous-mixture HMMs (CMHMMs) are used as acoustic models. In the DMHMM, subvector quantization is used instead of vector quantization and each state has multiple mixture components. Our previous work showed that the DMHMM system performed better in low SNR and/or non-stationary noise conditions. In contrast, the CMHMM system was better in the opposite conditions. Thus, we take a system combination approach of the two models to improve the performance in various kinds of noise conditions. The proposed method was evaluated on an LVCSR task with a 5K-word vocabulary. The results showed that the proposed method was effective in various kinds of noise conditions.

#2Adaptive Training with Noisy Constrained Maximum Likelihood Linear Regression for Noise Robust Speech Recognition

D. K. Kim (Department of Electronics and Computer Engineering, Chonnam National University)
M. J. F. Gales (Cambridge University Engineering Department)

Adaptive training is a widely used technique for building speech recognition systems on non-homogeneous training data. Recently there has been interest in applying these approaches to situations where there are significant levels of background noise. This work extends the most popular form of linear transform for adaptive training, constrained MLLR, to reflect additional uncertainty from noise-corrupted observations. This new form of transform, Noisy CMLLR (NCMLLR), uses a modified version of the generative model between clean speech and noisy observations, similar to factor analysis. Adaptive training using NCMLLR with both maximum likelihood and discriminative criteria is described. Experiments are conducted on noise-corrupted Resource Management and in-car recorded data. In preliminary experiments this new form achieves improvements in recognition performance over the standard approach in low signal-to-noise ratio conditions.

#3Performance Comparisons of the Integrated Parallel Model Combination Approaches with Front-End Noise Reduction

Guanghu Shen (Dept. of ICE, School of EECS, Yeungnam University)
Soo-Young Suk (Speech Processing Group, Information Technology Research Institute, AIST)
Hyun-Yeol Chung (School of EECS, Yeungnam University)

In this paper, to find the best noise robustness approach, we study approaches implemented at both ends (i.e. front-end and back-end) of a speech recognition system. To reduce the noise with lower speech distortion at the front-end, we investigate Two-stage Mel-warped Wiener Filtering (TMWF) in the integrated Parallel Model Combination (PMC) approach. Furthermore, the first stage of TMWF (i.e. One-stage Mel-warped Wiener Filtering (OMWF)), as well as the well-known Wiener Filtering (WF), is effective at reducing the noise, so we integrate PMC with those front-end noise reduction approaches. In terms of recognition performance, TMWF-PMC shows improved performance compared with the well-known WF-PMC, and OMWF-PMC also shows comparable performance in all noises.

#4Tuning Support Vector Machines for Robust Phoneme Classification with Acoustic Waveforms

Jibran Yousafzai (King's College London)
Zoran Cvetkovic (King's College London)
Peter Sollich (King's College London)

This work focuses on the robustness of phoneme classification to additive noise in the acoustic waveform domain using support vector machines (SVMs). We address the issue of designing kernels for acoustic waveforms which imitate the state-of-the-art representations such as PLP and MFCC and are tuned to the physical properties of speech. For comparison, classification results in the PLP representation domain with cepstral mean-and-variance normalization (CMVN) using standard kernels are also reported. It is shown that our custom-designed kernels achieve better classification performance at high noise. Finally, we combine the PLP and acoustic waveform representations to attain better classification than either of the individual representations over the entire range of noise levels tested, from quiet condition up to -18dB SNR.

#5An analytic derivation of a phase-sensitive observation model for noise robust speech recognition

Volker Leutnant (University of Paderborn, Germany)
Reinhold Haeb-Umbach (University of Paderborn, Germany)

In this paper we present an analytic derivation of the moments of the phase factor between clean speech and noise cepstral or log-mel-spectral feature vectors. The development shows, among other things, that the probability density of the phase factor is sub-Gaussian and that it is independent of the noise type and the signal-to-noise ratio, but dependent on the mel filter bank index. Further, we show how to compute the contribution of the phase factor to both the mean and the variance of the noisy speech observation likelihood, which relates the speech and noise feature vectors to those of noisy speech. The resulting phase-sensitive observation model is then used in model-based speech feature enhancement, leading to significant improvements in word accuracy on the AURORA2 database.

#6Variational Model Composition for Robust Speech Recognition with Time-Varying Background Noise

Wooil Kim (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA)
John Hansen (Center for Robust Speech Systems (CRSS), Erik Jonsson School of Engineering & Computer Science, University of Texas at Dallas, Richardson, Texas, USA)

This paper proposes a novel model composition method to improve speech recognition performance in time-varying background noise conditions. Variational noise models are generated by selectively applying perturbation factors to a basis model, resulting in a collection of various types of spectral patterns in the log-spectral domain. The basis noise model is obtained from the silence segments of the input speech. The proposed Variational Model Composition (VMC) method is employed to generate multiple environmental models for our previously proposed feature compensation method. Experimental results demonstrate that the proposed method substantially increases speech recognition performance in time-varying background noise conditions.

#7Comparison of Estimation Techniques in Joint Uncertainty Decoding for Noise Robust Speech Recognition

Haitian Xu (Toshiba Research Europe LTD, UK)
K.K. Chin (Toshiba Research Europe LTD, UK)

Model-based joint uncertainty decoding (JUD) has recently achieved promising results by integrating the front-end uncertainty into the back-end decoding by estimating JUD transforms in a mathematically consistent framework. There are different ways of estimating the JUD transforms resulting in different JUD methods. This paper gives an overview of the estimation techniques existing in the literature including data-driven parallel model combination, Taylor series based approximation and the recently proposed second order approximation. Application of a new technique based on the unscented transformation is also proposed for the JUD framework. The different techniques have been compared in terms of both recognition accuracy and computational cost on a database recorded in a real car environment. Experimental results indicate the unscented transformation is one of the best options for estimating JUD transforms as it maintains a good balance between accuracy and efficiency.

#8Replacing Uncertainty Decoding with Subband Re-estimation for Large Vocabulary Speech Recognition in Noise

Jianhua Lu (Queen's University Belfast)
Ming Ji (Queen's University Belfast)
Roger Woods (Queen's University Belfast)

In this paper, we propose a novel approach for parameterized model compensation for large-vocabulary speech recognition in noisy environments. The new compensation algorithm, termed CMLLR-SUBREST, combines model-based uncertainty decoding (UD) with subspace distribution clustering hidden Markov modeling (SDCHMM), so that the UD-type compensation can be realized by re-estimating the models based on a small amount of adaptation data. This avoids the estimation of the covariance biases, which is required in model-based UD and usually needs a numerical approach. The Aurora 4 corpus is used in the experiments. We have achieved a 16.9% relative WER (word error rate) reduction over our previous missing-feature (MF) based decoding and 16.1% over the combination of Constrained MLLR compensation and MF decoding. The number of model parameters is reduced by two orders of magnitude.

Wed-Ses3-P1:
Phonetics

Time:Wednesday 16:00 Place:Hewison Hall Type:Poster
Chair:Helmer Strik

#1How similar are clusters resulting from schwa deletion in French from identical underlying clusters?

Audrey Bürki (Laboratoire de Psycholinguistique Expérimentale, Université de Genève)
Cécile Fougeron (Laboratoire de Phonétique et Phonologie, UMR7018, CNRS-Paris3/Sorbonne Nouvelle, Paris)
Christophe Veaux (IRCAM, Analysis-Synthesis Team, Paris)
Ulrich Frauenfelder (Laboratoire de Psycholinguistique Expérimentale, Université de Genève)

Clusters resulting from the deletion of schwa in French are compared with identical underlying clusters in words and pseudowords. Both manual and automatic acoustical comparisons suggest that clusters resulting from schwa deletion in French are highly similar to identical underlying clusters. Furthermore, cluster duration is not longer for clusters resulting from schwa deletion than for identical underlying clusters. Clusters in pseudowords show a different acoustical and durational pattern from the two other clusters in words.

#2Word-final [t]-deletion: An analysis on the segmental and sub-segmental level

Barbara Schuppler (Center for Language and Speech Technology, Radboud University Nijmegen, The Netherlands)
Wim Van Dommelen (Department of Language and Communication Studies, NTNU Trondheim, Norway)
Jacques Koreman (Department of Language and Communication Studies, NTNU Trondheim, Norway)
Mirjam Ernestus (Center for Language and Speech Technology, Radboud University Nijmegen, The Netherlands and Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands)

This paper presents a study on the reduction of word-final [t]s in conversational standard Dutch. Based on a large number of tokens annotated on the segmental level, we show that the bigram frequency and the segmental context are the main predictors for the absence of [t]s. In a second study, we present an analysis of the detailed acoustic properties of word-final [t]s and we show that bigram frequency and context also play a role on the sub-segmental level. This paper extends research on the realization of /t/ in spontaneous speech and shows the importance of incorporating sub-segmental properties in models of speech.

#3Rarefaction gestures and Coarticulation in Mangetti Dune !Xung clicks

Amanda Miller (The University of British Columbia)
Abigail Scott (The University of British Columbia)
Bonny Sands (Northern Arizona University)
Sheena Shah (Georgetown University)

We provide high-speed ultrasound data on the four Mangetti Dune !Xung clicks. The posterior constriction is uvular for all four clicks - front uvular for [|] and [=] and back uvular for [!] and [||]. [!] and [||] both involve tongue center lowering and tongue root retraction as part of the rarefaction gestures. The rarefaction gestures in [|] and [=] involve tongue center lowering. Lingual cavity volume is largest for [!], followed by [||], [=] and [|]. A tongue tip recoil effect is found following [!], but the effect is smaller than that seen in IsiXhosa in earlier studies.

#4The Acoustics of Mangetti Dune !Xung clicks

Amanda Miller (University of British Columbia)
Sheena Shah (Georgetown University)

We document the acoustics of the four Mangetti Dune !Xung coronal clicks. We report the temporal measures of burst duration, relative burst amplitude and rise time, as well as the spectral value of center of gravity in the click bursts. COG correlates with lingual cavity volume. We show that there is inter-speaker variation in the acoustics of the palatal click, which we expect to correlate with a difference in the anterior constriction release dynamics. We show that burst duration, amplitude and rise time are correlated, similar to the correlation found between rise time and frication duration in affricates.
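
For readers unfamiliar with the spectral center of gravity (COG) measure mentioned in this abstract, a minimal sketch follows. It assumes the common power-weighted definition (mean frequency weighted by the power spectrum); the abstract does not state which weighting, windowing, or frequency range the authors used, and the function name is hypothetical.

```python
import cmath
import math


def center_of_gravity(samples, sample_rate):
    """Power-weighted mean frequency (Hz) of a short signal, via a naive DFT."""
    n = len(samples)
    weighted_sum = 0.0
    total_power = 0.0
    for k in range(n // 2 + 1):  # non-negative frequency bins only
        coeff = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(samples))
        power = abs(coeff) ** 2
        weighted_sum += (k * sample_rate / n) * power
        total_power += power
    # Assumption: report 0.0 for an all-zero signal rather than dividing by zero.
    return weighted_sum / total_power if total_power > 0 else 0.0
```

A burst whose energy is concentrated at higher frequencies yields a higher COG, which is why the measure is useful for comparing click types.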

#5Acoustic Characteristics of Ejectives in Amharic

Hussien Seid (International Institute of Information Technology Hyderabad, India)
Rajendran Suyambu (International Institute of Information Technology Hyderabad, India)
Yegnanarayana Bayya (International Institute of Information Technology Hyderabad, India)

In this paper a preliminary investigation of the acoustic characteristics of Amharic ejectives in comparison with their unvoiced conjugates is presented. The normalized error from linear prediction residual and a zero frequency resonator output are used to locate the instant of release of the oral closure and the instant of the start of voicing, respectively. Amharic ejectives are found to have longer closure duration and smaller VOT than their unvoiced conjugates. Cross-linguistic comparisons reveal that no ejectives of two languages behave acoustically in a similar manner despite similarity in their articulation.

#6Sentence-final particles in Hong Kong Cantonese: Are they tonal or intonational?

Wing Li Wu (Department of Speech, Hearing and Phonetic Sciences, University College London, UK)

Cantonese is rich in sentence-final particles (SFPs), morphemes serving to show various linguistic or attitudinal meanings. The acoustic manifestations of these SFPs are not yet clear. This paper presents detailed analyses of the fundamental frequency tracings, final F0, final velocity and duration of ten SFPs in Hong Kong Cantonese. The results show that most of these SFPs are very similar to the lexical tones in terms of the F0 measurements, but the durations are significantly different in half the cases. The notable differences may give some insight into the nature of this special class of words.

#7Same Tone, Different Category: Linguistic-Tonetic Variation in the Areal Tone Acoustics of Chuqu Wu

William James Steed (Australian National University)
Philip John Rose (Australian National University)

Acoustic and auditory data are presented for the citation tones of single speakers from nine sites (eight hitherto undescribed in English) from the little-studied Chuqu subgroup of Wu in East Central China: Lishui, Longquan, Qingyuan, Longyou, Jinyun, Qingtian, Yunhe, Jingning and Taishun. The data demonstrate a high degree of complexity, having no fewer than 22 linguistic-tonetically different tones. The nature of the complexity of these forms is discussed, especially with respect to whether the variation is continuous or categorical, and inferences are drawn on their historical development. Index terms: tone, linguistic-tonetics, Chinese, Wu, Chuqu, areal variation.

#8Why would aspiration lower the pitch of the following vowel? Observations from Leng-shui-jiang Chinese

Caicai ZHANG (The Hong Kong University of Science and Technology)

This paper is a preliminary report of the aspiration-conditioned tonal split in Leng-shui-jiang (LSJ hereafter) Chinese. So far no consensus has been reached concerning the intrinsic perturbation of aspiration on the F0 of the following vowel. Conflicting data come from both the same language and different languages. In order to shed light on this issue, F0 and Closing quotient (Qx hereafter) are calculated in syllables after aspirated and unaspirated obstruents from six speakers (three male, three female) of the LSJ dialect. The results show that F0 is significantly lower after the aspirated obstruents in two out of the three tone groups. The relatively lower Qx found in the syllables with aspirated initials is a possible explanation for the lower pitch.

#9Dialectal Characteristics of Osaka and Tokyo Japanese: Analyses of Phonologically Identical Words

Kanae Amino (Sophia University)
Takayuki Arai (Sophia University)

This study investigates the characteristics of the two major dialects of Japanese: Osaka and Tokyo dialects. We recorded the utterances of the speakers of both dialects, and analysed the differences that appear in the accentuation of the words at the phonetic-acoustic level. The Japanese words that are phonologically identical in both dialects were used as the analysis target. The results showed that the pitch patterns contained the dialect-dependent features of Osaka Japanese. Furthermore, these patterns could not be fully mimicked by speakers of Tokyo Japanese. These results show that there is a phonetics-phonology gap in the dialectal differences, and that we may exploit this gap for forensic purposes.

#10Categories and gradience in intonation: Evidence from linguistics and neurobiology

Brechtje Post (Research Centre for English and Applied Linguistics, University of Cambridge, Cambridge, UK)
Francis Nolan (Department of Linguistics, University of Cambridge, Cambridge, UK)
Emmanuel Stamatakis (Division of Anaesthesia, University of Cambridge, Cambridge, UK)
Toby Hudson (Research Centre for English and Applied Linguistics, University of Cambridge, Cambridge, UK)

Multiple cues interact to signal multiple functions in intonation simultaneously, which makes intonation notoriously complex to analyze. The Autosegmental-Metrical model for intonation analysis has proved to be an excellent vehicle for separating the components, but evidence for the phonetics/phonology dichotomy on which it hinges has proved elusive. Advocating a multidisciplinary approach, this paper outlines a new research project which combines traditional behavioural experiments with neuro-linguistic data to advance our understanding of the linguistic representation and neural correlates of intonation.

#11Exploring Vocalization of /l/ in English: an EPG and EMA study

Mitsuhiro Nakamura (Nihon University)

This study focuses on the spatiotemporal properties of lingual gestures for the vocalized /l/ (L-vocalization), relating them to those for the clear and dark variants of /l/. It is hypothesized that vocalization is not simply a code weakening of the tip/blade gesture. Using the EPG and EMA data collected from the Multichannel Articulatory (MOCHA) database, various measurements are performed. The spatial variations in terms of peak displacement showed that the speakers control the tip lowering and/or dorsum backing as the production strategy for the dark/vocalized allophony. The temporal variations in terms of ‘tip delay (Sproat & Fujimura 1993)’ revealed that the timing control of the tip and dorsum gesture constitutes a continuum for the three variants of /l/. The results bear on issues in models of lingual articulation, prosodic controls of articulatory gestures, and coarticulation and phonology.

#12The monophthongs and diphthongs of North-eastern Welsh: an acoustic study

Robert Mayr (Centre for Speech and Language Therapy, University of Wales Institute Cardiff, UK)
Hannah Davies (Centre for Speech and Language Therapy, University of Wales Institute Cardiff, UK)

Descriptive accounts of Welsh vowels indicate systematic differences between Northern and Southern varieties. Few studies have, however, attempted to verify these claims instrumentally, and little is known about regional variation in Welsh vowel systems. The present study aims to provide a first preliminary analysis of the acoustic properties of Welsh monophthongs and diphthongs, as produced by a male speaker from North-eastern Wales. The results indicate distinctive production of all the monophthong categories of Northern Welsh. Interesting patterns of spectral change were found for the diphthongs. Implications for theories of contrastivity in vowel systems are discussed.

#13Voicing Profile of Polish sonorants: [r] in obstruent clusters

Jagoda Sieczkowska (Universität Stuttgart)
Bernd Möbius (Universität Bonn, Universität Stuttgart)
Antje Schweitzer (Universität Stuttgart)
Michael Walsh (Universität Stuttgart)
Grzegorz Dogil (Universität Stuttgart)

This study aims to define and analyze the voicing profile of the Polish sonorant [r], showing the variability of its realizations depending on segmental and prosodic position. The voicing profile is defined as the “frame-by-frame voicing status of a speech sound in continuous speech” [19], [23]. Word-final devoicing of sonorants [8], [9], [14] is briefly reviewed and analyzed in light of the corpus-based investigation conducted here. We used automatic tools [10] to extract consonant features and F0 values and to obtain the voicing profile. The results show that the liquid [r] devoices word-finally and syllable-finally, particularly when preceded by a voiceless stop. Index Terms: sonorant, liquid, voicing, Polish, speech database.

Wed-Ses3-P4:
Prosody: Production II

Time:Wednesday 16:00 Place:Hewison Hall Type:Poster
Chair:Shinichi Tokuma

#1Perception and Production of Boundary Tones in Whispered Dutch

Willemijn Heeren (Leiden University Center for Linguistics, Leiden University, The Netherlands)
Vincent Van Heuven (Leiden University Center for Linguistics, Leiden University, The Netherlands)

The main cue to interrogativity in Dutch declarative questions is found in their final boundary tone. When whispering, a speaker does not produce the most important acoustic information conveying this: the fundamental frequency. In this paper listeners are shown to perceive the difference between whispered declarative questions and statements, though less clearly than in phonated speech. Moreover, possible acoustic correlates conveying whispered question intonation were investigated. The results show that the second formant may convey pitch in whispered speech, and also that first formant and intensity differences exist between high and low boundary tones in both phonated and whispered speech.

#2Pitch Accents and Information Status in a German Radio News Corpus

Katrin Schweitzer (University of Stuttgart)
Arndt Riester (University of Stuttgart)
Michael Walsh (University of Stuttgart)
Grzegorz Dogil (University of Stuttgart)

This paper presents a corpus analysis of prosodic realisations of information status categories in terms of pitch accent types. The annotations follow a recent annotation scheme for information status that is based on semantic criteria applied to written text. For each information status category, typical pitch accent realisations are identified. Moreover, the relevance of the strict semantic information status labelling scheme to the prosodic realisation is examined. It can be shown that the semantic criteria are reflected in prosody, i.e. the prosodic findings corroborate the theoretical assumptions made in the framework.

#3Analysis of Voice Fundamental Frequency Contours of Continuing and Terminating Prosodic Phrases in Four Swiss German Dialects

Adrian Leemann (Hirose and Minematsu Lab, The University of Tokyo, Japan and Universität Bern, Switzerland)
Keikichi Hirose (Hirose and Minematsu Lab, The University of Tokyo, Japan)
Hiroya Fujisaki (Professor Emeritus, The University of Tokyo, Japan)

In the present study, the F0 contours of continuing and terminating prosodic phrases of 4 Swiss German dialects are analyzed by means of the command-response model. In every model parameter, the two prosodic phrase types show significant differences: continuing prosodic phrases indicate higher phrase command magnitude and shorter durations. Locally, they demonstrate more distinct accent command amplitudes as well as durations. In addition, continuing prosodic phrases have later rises relative to segment onset than terminating prosodic phrases. In the same context, fine phonetic differences between the dialects are highlighted.
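For readers unfamiliar with the command-response (Fujisaki) model, the following minimal sketch shows how phrase commands and accent commands combine into an F0 contour. The constants alpha, beta and gamma, the base frequency and the example commands are illustrative assumptions, not values from the study.

```python
import math

def phrase_response(t, alpha=3.0):
    """Impulse response of the phrase control mechanism (zero for t < 0)."""
    return alpha * alpha * t * math.exp(-alpha * t) if t >= 0 else 0.0

def accent_response(t, beta=20.0, gamma=0.9):
    """Step response of the accent control mechanism, capped at gamma."""
    if t < 0:
        return 0.0
    return min(1.0 - (1.0 + beta * t) * math.exp(-beta * t), gamma)

def log_f0(t, fb, phrase_cmds, accent_cmds):
    """ln F0(t) = ln Fb + phrase components + accent components.

    phrase_cmds: list of (onset_time, magnitude)
    accent_cmds: list of (onset_time, offset_time, amplitude)
    """
    value = math.log(fb)
    for t0, ap in phrase_cmds:
        value += ap * phrase_response(t - t0)
    for t1, t2, aa in accent_cmds:
        value += aa * (accent_response(t - t1) - accent_response(t - t2))
    return value

# Hypothetical commands: one phrase command at 0 s, one accent from 0.3 s to 0.6 s.
phrase_cmds = [(0.0, 0.5)]
accent_cmds = [(0.3, 0.6, 0.4)]
# F0 in Hz sampled every 10 ms over 1.5 s, with an assumed base frequency of 100 Hz.
contour = [math.exp(log_f0(0.01 * i, 100.0, phrase_cmds, accent_cmds))
           for i in range(150)]
```

In this formulation, the phrase command magnitude and the accent command amplitude and duration are the kinds of parameters on which the two phrase types are compared.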

#4Intonational features for identifying regional accents of Italian

Michelina Savino (Dept. of Psychology, University of Bari, Italy)

The aim of this paper is to provide a preliminary account of some intonational features useful for identifying a large number of Italian accents, estimated as representative of Italian regional variation, by analysing a corpus of comparable speech materials consisting of Map Task dialogues. Analysis concentrates on the intonational characteristics of yes-no questions, which can be realised very differently across varieties, whereas statements are generally characterised by a (low) falling final movement. Results of this preliminary investigation indicate that intonational features useful for identifying Italian regional accents are the tune type (rising-falling vs falling-rising vs rising), and the nuclear peak alignment in rising-falling contours (mid vs late).

#5Analysis and Recognition of Accentual Patterns

Agnieszka Wagner (Institute of Linguistics, Adam Mickiewicz University in Poznan)

This study proposes a framework for the automatic analysis and recognition of pitch accent patterns. We first present the results of analyses aimed at identifying the acoustic cues that signal prominent syllables and the different pitch accent types distinguished at the surface-phonological level. The resulting representation provides a framework for the analysis of pitch accent patterns at the acoustic-phonetic level. The representation is compact (it consists of 13 acoustic features), has low redundancy (the features cannot be derived from one another) and has wide coverage (it encodes distinctions between perceptually different utterances). Next, we train statistical models to automatically determine pitch accent patterns of utterances using the acoustic-phonetic representation. The best models achieve high accuracy (above 80% on average) using small acoustic feature vectors.

#6Using Responsive Prosodic Variation to Acknowledge the User's Current State

Nigel Ward (University of Texas at El Paso)
Rafael Escalante-Ruiz (University of Texas at El Paso)

Spoken dialog systems today do not vary the prosody of their utterances, although prosody is known to have many useful expressive functions. In a corpus of memory quizzes, we identify eleven dimensions of prosodic variation, each with its own expressive function. We identified the situations in which each was used, and how to detect these situations from the dialog context and the prosody of the interlocutor's previous utterance. We implemented the resulting response rules and had 21 users interact with two versions of the system. Overall they preferred the version in which the prosodic forms of the acknowledgments were chosen to be suitable for each specific context. This suggests that simple adjustments to system prosody based on local context can have value to users.

#7Intonation segments and segmental intonation

Oliver Niebuhr (Laboratoire Parole & Langage, UMR 6057 CNRS, Université de Provence, Aix-en-Provence)

An acoustic analysis of a German dialogue corpus showed that the sound qualities and durations of fricatives, vocoids, and diphthongs at the ends of question and statement utterances varied systematically with the utterance-final intonation segments, which were high-rising in the questions and terminal-falling in the statements. In the high-rising intonations, the fricatives showed higher centre-of-gravity values and the vocoids and diphthongs had lower F1 and higher F2 values. The ways in which the variations relate to phenomena like sibilant/spectral pitch and intrinsic F0 suggest that they are meant to support the pitch course. Thus, they may be called segmental intonations.

#8The Phrase-Final Accent in Kammu: Effects of Tone, Focus and Engagement

David House (KTH, Stockholm, Sweden)
Anastasia Karlsson (Lund University, Sweden)
Jan-Olof Svantesson (Lund University, Sweden)
Damrong Tayanin (Lund University, Sweden)

The phrase-final accent can typically contain a multitude of simultaneous prosodic signals. In this study, aimed at separating the effects of lexical tone from phrase-final intonation, phrase-final accents of two dialects of Kammu were analyzed. Kammu, a Mon-Khmer language spoken primarily in northern Laos, has dialects with lexical tones and dialects with no lexical tones. Both dialects seem to engage the phrase-final accent to simultaneously convey focus, phrase finality, utterance finality, and speaker engagement. Both dialects also show clear evidence of truncation phenomena. These results have implications for our understanding of the interaction between tone, intonation and phrase-finality.

#9Tonal Alignment in Three Varieties of Hiberno-English

Raya Kalaldeh (Trinity College Dublin)
Amelie Dorn (Trinity College Dublin)
Ailbhe Ní Chasaide (Trinity College Dublin)

This pilot study investigates the tonal alignment of pre-nuclear (PN) and nuclear (N) accents in three Hiberno-English (HE) regional varieties: Dublin, Drogheda, and Donegal English. The peak alignment is investigated as a function of the number of unstressed syllables before PN and after N. Dublin and Drogheda English appear to have a fixed peak alignment in both nuclear and pre-nuclear conditions. Donegal English, however, shows a drift in peak alignment in nuclear and pre-nuclear conditions. Findings also show that the peak is located earlier in nuclear and later in pre-nuclear conditions across the three dialects.

#10Determining intonational boundaries from the acoustic signal

Lourdes Aguilar (Universitat Autònoma de Barcelona)
Antonio Bonafonte (Universitat Politècnica Catalunya)
Francisco Campillo (Universidad de Vigo)
David Escudero (Universidad de Valladolid)

This article has two aims: first, it reports the enrichment of a Catalan speech database for speech synthesis (Festcat) with information about prosodic boundaries, using the break index labels proposed in the ToBI system; second, it presents the experiments carried out to determine the acoustic markers that can differentiate among the break indices. Several experiments using different classification techniques were performed in order to compare the relative merit of different attributes in characterizing breaks. Results show that prosodic phrase breaks are correlated with the presence of a pause, lengthening of the pre-break syllable, and the F0 contour of the span between the stressed syllable and the following post-stressed syllables, if any, immediately preceding the break.

#11Compression and Truncation Revisited

Claudia K. Ohl (Inst. of Phonetics and Digital Speech Processing (IPDS), Univ. Kiel)
Hartmut R. Pfitzinger (Inst. of Phonetics and Digital Speech Processing (IPDS), Univ. Kiel)

This paper investigates the influence of varying segmental structures on the realizations of utterance-final rising and falling intonation contours. Following Grabe's study on adjustment strategies in German, i.e. truncation and compression, a similar experiment was carried out, using materials with decreasing stretches of voicing in questions, lists, and statements. However, the results presented in the present paper could not confirm the idea of such common adjustment strategies. Instead, considerable variation was found as to how the phrase-final intonation contours were adjusted to the respective amounts of voicing: the strategies varied strongly across different word groups.

#12Comparison of Fujisaki-Model Extractors and F0 Stylizers

Hartmut R. Pfitzinger (Inst. of Phonetics and Digital Speech Processing (IPDS), Univ. Kiel)
Hansjörg Mixdorff (Dept. of Computer Sciences and Media, TFH Berlin University of Applied Sciences)
Jan Schwarz (Inst. for Circuit and System Theory (LNS), Faculty of Engineering, Univ. Kiel)

This study compares four automatic methods for estimating Fujisaki-model parameters. Since interpolation and smoothing are necessary prerequisites for all approaches, their fitting accuracies are also compared with that of a novel stylization method. A hand-corrected set of results from one of the methods, which was created on linguistic grounds, served as a second benchmark. Although the four methods yield comparable results with respect to their total errors, they show different error distributions. The manually corrected version provided a poorer fit of the F0 contours than the automatic one.

#13Is tonal alignment interpretation independent of methodology?

Caterina Petrone (ZAS)
Mariapaola D'Imperio (LPL)

Tonal target detection is a very difficult task, especially in the presence of consonantal perturbations. Though different detection methods have been adopted in tonal alignment research, we still do not know which is the most reliable. In our paper, we found that such methodological choices have serious theoretical implications. Interpretation of the data strongly depends on whether tonal targets have been detected by a manual, a semi-automatic or an automatic procedure. Moreover, different segmental classes can affect target placement, especially in automatic detection. This suggests the importance of keeping segmental classes separate for the purpose of statistical analysis.

#14Modeling the Intonation of Topic Structure: Two Approaches

Margaret Zellers (Research Centre for English & Applied Linguistics, University of Cambridge, UK)
Brechtje Post (Research Centre for English & Applied Linguistics, University of Cambridge, UK)
Mariapaola D'Imperio (Laboratoire Parole & Langage, Université de Provence, UMR6057 CNRS, Aix-en-Provence, France)

Intonational variation is widely regarded as a source of information about the topic structure of spoken discourse. However, many factors other than topic can influence this variation. We compared two models of intonation in terms of their ability to account for these other sources of variation. In dealing with this variation, the models paint different pictures of the intonational correlates of topic.

Wed-Ses3-P2:
Speaker verification & identification III

Time:Wednesday 16:00 Place:Hewison Hall Type:Poster
Chair:Aladdin Ariyaeeinia

#1Mel, Linear, and Antimel Frequency Cepstral Coefficients in Broad Phonetic Regions for Telephone Speaker Recognition

Howard Lei (International Computer Science Institute)
Eduardo Lopez-Gonzalo (Dep. of Signals, Systems and Radiocomm., Universidad Politecnica Madrid, Spain)

We've examined the speaker discriminative power of mel-, antimel- and linear-frequency cepstral coefficients (MFCCs, a-MFCCs and LFCCs) in the nasal, vowel, and non-nasal consonant speech regions. Our inspiration came from the work of Lu and Dang in 2007, who showed that filterbank energies at some frequencies mainly outside the telephone bandwidth possess more speaker discriminative power due to physiological characteristics of speakers, and derived a set of cepstral coefficients that outperformed MFCCs in non-telephone speech. Using telephone speech, we've discovered that LFCCs gave 21.5% and 15.0% relative EER improvements over MFCCs in nasal and non-nasal consonant regions, agreeing with our filterbank energy f-ratio analysis. We've also found that using only the vowel region with MFCCs gives a 9.1% relative improvement over using all speech. Last, we've shown that a-MFCCs are valuable in combination, contributing to a system with 17.3% relative improvement over our baseline.

#2Fast GMM Computation for Speaker Verification Using Scalar Quantization and Discrete Densities

Guoli Ye (HKUST)
Brian Mak (HKUST)
Man Wai Mak (PolyU)

Most current state-of-the-art speaker verification (SV) systems use Gaussian mixture models (GMMs) to represent the universal background model (UBM) and the speaker models (SM). For an SV system that employs the log-likelihood ratio between SM and UBM to make the decision, its computational efficiency is largely determined by the GMM computation. This paper attempts to speed up GMM computation by converting a continuous-density GMM to a single discrete density or a mixture of discrete densities using scalar quantization. We investigated a spectrum of such discrete models: from high-density discrete models to discrete mixture models, and their combination, called high-density discrete-mixture models. For the NIST 2002 SV task, we obtained an overall speedup by a factor of 2 to 100 with little loss in EER performance.
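The general idea of trading density evaluation for table lookup can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' method: it assumes a diagonal-covariance GMM, uniform scalar quantization of each feature dimension, and log-densities tabulated at bin centres. All names (QuantizedGMM, n_bins, the toy parameters) are hypothetical.

```python
import math

def gauss_logpdf(x, mean, var):
    """Log-density of a univariate Gaussian."""
    return -0.5 * (math.log(2.0 * math.pi * var) + (x - mean) ** 2 / var)

def logsumexp(values):
    m = max(values)
    return m + math.log(sum(math.exp(v - m) for v in values))

class QuantizedGMM:
    """Diagonal-covariance GMM whose per-dimension log-densities are tabulated."""

    def __init__(self, weights, means, variances, lo, hi, n_bins):
        self.log_w = [math.log(w) for w in weights]
        self.lo, self.n_bins = lo, n_bins
        self.step = (hi - lo) / n_bins
        centres = [lo + (b + 0.5) * self.step for b in range(n_bins)]
        # table[k][d][b]: log-density of component k, dimension d, at bin centre b
        self.table = [[[gauss_logpdf(c, m, v) for c in centres]
                       for m, v in zip(mk, vk)]
                      for mk, vk in zip(means, variances)]

    def bin_index(self, x):
        # Uniform scalar quantizer; out-of-range values go to the edge bins.
        b = int((x - self.lo) / self.step)
        return min(max(b, 0), self.n_bins - 1)

    def loglik(self, frame):
        # At run time only table lookups and additions are needed per component.
        bins = [self.bin_index(x) for x in frame]
        per_comp = [lw + sum(tab_d[b] for tab_d, b in zip(tab_k, bins))
                    for lw, tab_k in zip(self.log_w, self.table)]
        return logsumexp(per_comp)

def exact_loglik(frame, weights, means, variances):
    per_comp = [math.log(w) + sum(gauss_logpdf(x, m, v)
                                  for x, m, v in zip(frame, mk, vk))
                for w, mk, vk in zip(weights, means, variances)]
    return logsumexp(per_comp)

# Toy two-component, two-dimensional model (made-up parameters).
weights = [0.4, 0.6]
means = [[0.0, 1.0], [2.0, -1.0]]
variances = [[1.0, 0.5], [0.8, 1.5]]
qgmm = QuantizedGMM(weights, means, variances, lo=-8.0, hi=8.0, n_bins=256)
frame = [0.3, 0.7]
approx = qgmm.loglik(frame)
exact = exact_loglik(frame, weights, means, variances)
```

With 256 bins per dimension the tabulated log-likelihood stays close to the exact one for this toy frame; the accuracy/speed trade-off in a real system would depend on the quantizer design, which the sketch does not attempt to reproduce.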

#3Text-Independent Speaker Identification Using Vocal Tract Length Normalization for Building Universal Background Model

Achintya Sarkar (Indian Institute of Technology - Kanpur)
Srinivasan Umesh (Indian Institute of Technology - Kanpur)
Shakti Prasad Rath (Indian Institute of Technology - Kanpur)

In this paper, we propose to use Vocal Tract Length Normalization (VTLN) to build the Universal Background Model (UBM) for a closed-set speaker identification system. Vocal Tract Length (VTL) differences among speakers are a major source of variability in the speech signal. Since the UBM is trained using data from many speakers, it statistically captures this inherent variation in the speech signal, which results in a "coarse" model in the acoustic space. This may cause the adapted speaker models obtained from the UBM to overlap significantly in the acoustic space. We hypothesize that the use of VTLN will help in compacting the UBM, and thus that the speaker-adapted models obtained from this compact model will have better speaker separability in the acoustic space. We perform experiments on the MIT, TIMIT and NIST-2004 SRE databases and show that using VTLN we can achieve lower identification error rates than the conventional GMM-UBM based method.

#4BUT system for NIST 2008 speaker recognition evaluation

Lukas Burget (Brno University of Technology)
Michal Fapso (Brno University of Technology)
Valiantsina Hubeika (Brno University of Technology)
Ondrej Glembek (Brno University of Technology)
Martin Karafiat (Brno University of Technology)
Marcel Kockmann (Brno University of Technology)
Pavel Matejka (Brno University of Technology)
Petr Schwarz (Brno University of Technology)
Jan Cernocky (Brno University of Technology)

This paper presents the BUT system submitted to the NIST 2008 SRE. It includes two subsystems based on Joint Factor Analysis (JFA) GMM/UBM and one based on SVM-GMM. The systems were developed on NIST SRE 2006 data, and the results are presented on NIST SRE 2008 evaluation data. We concentrate on the influence of side information in the calibration.

#5Selection of the Best Set of Shifted Delta Cepstral Features in Speaker Verification Using Mutual Information

Jose R. Calvo de Lara (CENATAV)
Gabriel Hernandez (CENATAV)
Rafael Fernandez (CENATAV)

Shifted delta cepstral (SDC) features, obtained by concatenating delta cepstral features across multiple speech frames, were recently reported to outperform delta cepstral features in language and speaker recognition systems. In this paper, the use of SDC features in a speaker verification experiment is reported. Mutual information between SDC features and the identity of a speaker is used to select the best set of SDC parameters. The experiment evaluates the robustness of the best SDC features to channel and handset mismatch in speaker verification. The result shows a relative EER reduction of up to 19% in a speaker verification experiment.
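As a reminder of how SDC vectors are commonly built, here is a minimal sketch of the usual N-d-P-k formulation: for frame t, block i is the difference c(t + iP + d) - c(t + iP - d), and k such blocks are concatenated. The parameter defaults, the edge handling (clamping indices to the first and last frame) and the toy input are assumptions for illustration; the parameter set selected in the paper is not reproduced here.

```python
def shifted_delta_cepstra(cepstra, d=1, p=3, k=7):
    """Compute SDC vectors from a list of frames of N cepstral coefficients.

    d: delta spread, p: shift between blocks, k: number of blocks.
    Each output vector has N * k elements.
    """
    n_frames = len(cepstra)

    def frame(idx):
        # Clamp so that edge frames reuse the first/last frame (an assumption).
        return cepstra[min(max(idx, 0), n_frames - 1)]

    sdc = []
    for t in range(n_frames):
        vector = []
        for i in range(k):
            ahead = frame(t + i * p + d)
            behind = frame(t + i * p - d)
            vector.extend(a - b for a, b in zip(ahead, behind))
        sdc.append(vector)
    return sdc

# Toy input: 20 frames of 2 coefficients that grow linearly with the frame index.
toy = [[float(t), 2.0 * t] for t in range(20)]
features = shifted_delta_cepstra(toy, d=1, p=3, k=3)
```

Selecting "the best set of SDC parameters" then amounts to choosing among such (N, d, P, k) configurations, here by a mutual-information criterion.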

#6Forensic speaker recognition using traditional features comparing automatic and human-in-the-loop formant tracking

Alberto de Castro (Universidad Autonoma de Madrid)
Daniel Ramos (Universidad Autonoma de Madrid)
Joaquin Gonzalez-Rodriguez (Universidad Autonoma de Madrid)

In this paper we compare forensic speaker recognition with traditional features using two different formant tracking strategies: one performed automatically and one semi-automatic performed by human experts. The main contribution of the work is the use of an automatic method for formant tracking, which allows a much faster recognition process and the use of a much higher amount of data for modelling background population, calibration, etc. This is especially important in likelihood-ratio-based forensic speaker recognition, where the variation of features among a population of speakers must be modelled in a statistically robust way. Experiments show that, although recognition using the human-in-the-loop approach is better than using the automatic scheme, the performance of the latter is also acceptable. Moreover, we present a novel feature selection method which allows the analysis of which feature of each formant has a greater contribution to the discriminating power of the whole recognition process, which can be used by the expert in order to decide which features in the available speech material are important.

#7Open-Set Speaker Identification under Mismatch Conditions

Surosh G Pillay (University of Hertfordshire, Hatfield, UK)
Aladdin Ariyaeeinia (University of Hertfordshire, Hatfield, UK)
Perasiriyan Sivakumaran (University of Hertfordshire, Hatfield, UK)
Mark Pawlewski (BT Labs, Ipswich, UK)

This paper presents investigations into the performance of open-set, text-independent speaker identification (OSTI-SI) under mismatched data conditions. The scope of the study includes attempts to reduce the adverse effects of such conditions through the introduction of a modified parallel model combination (PMC) method together with condition-adjusted T-Norm (CT-Norm) into the OSTI-SI framework. The experiments are conducted using examples of real world noise. Based on the outcomes, it is demonstrated that the above approach can lead to considerable improvements in the accuracy of open-set speaker identification operating under severely mismatched data conditions. The paper details the realisation of the modified PMC method and CT-Norm in the context of OSTI-SI, presents the experimental investigations and provides an analysis of the results.

#8MiniVectors: an Improved GMM-SVM Approach for Speaker Verification

Xavier Anguera Miro (Telefonica Research)

The accuracy levels achieved by state-of-the-art Speaker Verification systems are high enough for the technology to be used in real-life applications. Unfortunately, the transfer from the lab to the field is not as straightforward as it could be: the best performing systems can be computationally expensive to run and need large speaker model footprints. In this paper, we compare two speaker verification algorithms (GMM-SVM Supervectors and Kharroubi's GMM-SVM vectors) and propose an improvement of Kharroubi's system that: (a) achieves up to 17% relative performance improvement when compared to the Supervectors algorithm; (b) is 24% faster in run time and (c) makes use of speaker models that are 94% smaller than those needed by the Supervectors algorithm.

#9Robustness of Phase based Features for Speaker Recognition

Padmanabhan Rajan (IIT Madras)
Sree Hari Krishnan Parthasarathi (Idiap Research Institute)
Hema A. Murthy (IIT Madras)

This paper demonstrates the robustness of group-delay based features for speech processing. An analysis of group delay functions is presented which shows that these features retain formant structure even in noise. Furthermore, a speaker verification task performed on the NIST 2003 database shows lower error rates than the traditional MFCC features. We also discuss using feature diversity to dynamically choose the feature for every claimed speaker.

#10The MIT Lincoln Laboratory 2008 Speaker Recognition System

Douglas Sturim (MIT Lincoln Laboratory)
William Campbell (MIT Lincoln Laboratory)
Zahi Karam (MIT Lincoln Laboratory)
Douglas Reynolds (MIT Lincoln Laboratory)
Fred Richardson (MIT Lincoln Laboratory)

A primary emphasis in last year's NIST 2008 Speaker Recognition Evaluation (SRE) was to greatly expand the use of auxiliary microphones. This introduced additional channel variation, which has been a historical challenge to speaker verification systems. In this paper we present the MIT Lincoln Laboratory Speaker Recognition system applied to the task in the NIST 2008 SRE. Our approach during the evaluation was two-fold: 1) Utilize recent advances in variational nuisance modeling (latent factor analysis and nuisance attribute projection) to allow our spectral speaker verification systems to better compensate for the channel variation introduced, and 2) fuse systems targeting the different linguistic tiers of information, high and low. The performance of the system is presented when applied to a NIST 2008 SRE task. Post-evaluation analysis is conducted on the sub-task in which interview microphones are present.

#11Speaker Recognition on Lossy Compressed Speech using the Speex Codec

Allen Stauffer (RADC Inc.)
Aaron Lawson (RADC Inc.)

This paper examines the impact of lossy speech coding with Speex on GMM-UBM speaker recognition (SR). Audio from 120 speakers was compressed with Speex into twelve data sets, each with a different level of compression quality from 0 (most compressed) to 10 (least), plus uncompressed. Experiments looked at performance under matched and mismatched compression conditions, using models conditioned for the coded environment, and Speex coding applied to improving SR performance on other coders. Results show that Speex is effective for compression of data used in SR and that Speex coding can improve performance on data compressed by the GSM codec.

#12Text-Independent Speaker Verification Using Rank Threshold in Large Number of Speaker Models

Haruka Okamoto (Graduate School of Advanced Integration Sciences, Chiba University)
Amira Abdelwahab (Graduate School of Advanced Integration Sciences, Chiba University)
Masahumi Nishida (Faculty of Science and Engineering, Doshisha University)
Satoru Tsuge (Institute of Technology and Science, The University of Tokushima)
Yasuo Horiuchi (Graduate School of Advanced Integration Sciences, Chiba University)
Shingo Kuroiwa (Graduate School of Advanced Integration Sciences, Chiba University)

In this paper, we propose a novel speaker verification method that decides whether a claimant is accepted or rejected by the rank of the claimant among a large number of speaker models, instead of by score normalization such as T-norm. We also discuss the speed-up obtained by selecting a cohort subset for each speaker. This approach can significantly reduce computation, resulting in faster speaker verification decisions. We conducted text-independent speaker verification experiments using a large-scale Japanese speaker recognition evaluation corpus constructed by the National Research Institute of Police Science. In the experiments, the proposed method achieved an equal error rate of 2.2%, while T-norm obtained 2.7%.
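The rank-based decision described above can be illustrated with a small sketch. It assumes that each enrolled speaker model yields a score for the test utterance (higher is better) and that the claim is accepted when the claimed speaker's score ranks within a chosen threshold; the function names, the scores and the threshold values are hypothetical.

```python
def rank_of_claimant(scores, claimed_id):
    """Rank (1 = best) of the claimed speaker's score among all speaker models."""
    claimed = scores[claimed_id]
    better = sum(1 for spk, s in scores.items()
                 if spk != claimed_id and s > claimed)
    return 1 + better

def verify_by_rank(scores, claimed_id, rank_threshold=1):
    """Accept the identity claim if the claimed model ranks within the threshold."""
    return rank_of_claimant(scores, claimed_id) <= rank_threshold

# Made-up log-likelihood scores of one test utterance against four speaker models.
scores = {"spk_a": -41.2, "spk_b": -39.8, "spk_c": -45.0, "spk_d": -40.5}
```

For example, `verify_by_rank(scores, "spk_d", rank_threshold=2)` accepts the claim because only one other model scores higher. Restricting `scores` to a per-speaker cohort subset reduces the number of models that have to be scored.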

#13Age Role of factor analysis in Speaker Identification

Yun Lei (University of Texas at Dallas)
John Hansen (University of Texas at Dallas)

The speaker acoustic space described by the factor analysis model is assumed to capture the majority of speaker variation using fewer latent factors. In this study, the age factor, an important observable factor of a speaker’s voice, is analyzed and employed in the description of the speaker acoustic space using the factor analysis approach. An age-dependent acoustic space is obtained to represent the age variation in the speaker acoustic space, and the effect of the age-dependent acoustic space in the eigenvoice model is evaluated on the NIST SRE08 corpus. In addition, data pools with different age distributions are tested on the joint factor analysis model to measure the influence of age in the data pool.

#14Do Humans and speaker verification system use the same information to differentiate voices?

Juliette Kahn (Laboratoire Informatique d'Avignon, Avignon, France)
Solange Rossato (Laboratoire Informatique de Grenoble, Grenoble, France)

The aim of this paper is to analyze the pairwise comparisons of voices by a speaker verification system (ALIZE/Spk) and by humans. A database of familial groups of 24 speakers was created. A single sentence was chosen for the perception test. The same sentence was used as the test signal for ALIZE/Spk, trained on another part of the corpus. Results show that the voice proximities within a familial group were well recovered in the speaker representation by ALIZE and much less so in the representation from the perception test.

Thu-Ses0-K:
Mari Ostendorf - Transcribing Speech for Spoken Language Processing

Time:Thursday 08:30 Place:Main Hall Type:Keynote
Chair:Martin Russell

08:30Transcribing Speech for Spoken Language Processing

Mari Ostendorf (University of Washington)

As storage costs drop and bandwidth increases, there has been a rapid growth of spoken information available via the web or in online archives -- including radio and TV broadcasts, oral histories, legislative proceedings, call center recordings, etc. -- raising problems of document retrieval, information extraction, summarization and translation for spoken language. While there is a long tradition of research in these technologies for text, new challenges arise when moving from written to spoken language. In this talk, we look at differences between speech and text, and how we can leverage the information in the speech signal beyond the words to provide structural information in a rich, automatically generated transcript that better serves language processing applications. In particular, we look at three interrelated types of structure (segmentation, prominence and syntax), methods for automatic detection, the benefit of optimizing rich transcription for the target language processing task, and the impact of this structural information in tasks such as parsing, topic detection, information extraction and translation.

Thu-Ses1-O1:
Robust Automatic Speech Recognition III

Time:Thursday 10:00 Place:Main Hall Type:Oral
Chair:Phil Green

10:00Accounting for the Uncertainty of Speech Estimates in the Complex Domain for Minimum Mean Square Error Speech Enhancement

Ramón Fernandez Astudillo (Chair of Electronics and Medical Signal Processing, Berlin Institute of Technology, Germany)
Dorothea Kolossa (Chair of Electronics and Medical Signal Processing, Berlin Institute of Technology, Germany)
Reinhold Orglmeister (Chair of Electronics and Medical Signal Processing, Berlin Institute of Technology, Germany)

Uncertainty decoding and uncertainty propagation, or error propagation, techniques have emerged as a powerful tool to increase the accuracy of automatic speech recognition systems by employing an uncertain, or probabilistic, description of the speech features rather than the usual point estimate. In this paper we analyze the uncertainty generated in the complex Fourier domain when performing speech enhancement with the Wiener or Ephraim-Malah filters. We derive closed form solutions for the computation of the error of estimation and show that it provides a better insight into the origin of estimation uncertainty. We also show how the combination of such an error estimate with uncertainty propagation and uncertainty decoding or modified imputation yields superior recognition robustness when compared to conventional MMSE estimators with little increase in the computational cost.
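To make the notion of an "uncertain description" concrete, the sketch below gives the textbook Wiener case only. It assumes that the clean speech and noise Fourier coefficients are independent, zero-mean, circularly symmetric complex Gaussians with known variances; under that assumption the posterior of the clean coefficient is again complex Gaussian, with the Wiener estimate as its mean and a closed-form variance. The function and argument names are hypothetical, and the Ephraim-Malah case treated in the paper is not covered.

```python
def wiener_posterior(y, speech_var, noise_var):
    """Posterior mean and variance of the clean coefficient X given Y = X + D.

    y: noisy complex Fourier coefficient
    speech_var, noise_var: prior variances of speech and noise in this bin
    """
    gain = speech_var / (speech_var + noise_var)   # Wiener gain
    mean = gain * y                                # MMSE point estimate
    variance = gain * noise_var                    # residual estimation uncertainty
    return mean, variance

# Made-up bin: noisy coefficient 1 - 2j with a 3:1 speech-to-noise variance ratio.
mean, var = wiener_posterior(complex(1.0, -2.0), speech_var=3.0, noise_var=1.0)
```

An uncertainty-propagation front end would pass both `mean` and `var` on to the feature extraction stage instead of the point estimate alone.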

10:20Signal Separation for Robust Speech Recognition Based on Phase Difference Information Obtained in the Frequency Domain

Chanwoo Kim (Carnegie Mellon University)
Kshitiz Kumar (Carnegie Mellon University)
Bhiksha Raj (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)

In this paper, we present a new two-microphone approach that improves speech recognition accuracy when speech is masked by other speech. The algorithm improves on previous systems that have been successful in separating signals based on differences in arrival time of signal components from two microphones. The present algorithm differs from these efforts in that the signal selection takes place in the frequency domain. We observe that additional smoothing of the phase estimates over time and frequency is needed to support adequate speech recognition performance. We demonstrate that the algorithm described in this paper provides better recognition accuracy than time-domain-based signal separation algorithms, and at less than 10 percent of the computation cost.
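
The idea of selecting signal components in the frequency domain from inter-microphone phase differences can be illustrated with a minimal sketch. It assumes one frame of complex STFT bins per microphone and a hypothetical threshold `max_itd_s`; the function name, the threshold and the binary keep/discard rule are illustrative assumptions, and the smoothing over time and frequency that the abstract describes as necessary is deliberately left out.

```python
import cmath
import math


def itd_mask(left_bins, right_bins, freqs_hz, max_itd_s):
    """Binary mask for one frame: 1 where the estimated inter-microphone
    delay is within max_itd_s seconds, 0 elsewhere (illustrative rule)."""
    mask = []
    for xl, xr, f in zip(left_bins, right_bins, freqs_hz):
        if f <= 0.0:
            mask.append(1)  # DC carries no usable phase; keep it
            continue
        # Phase of the cross-spectrum, wrapped to (-pi, pi]
        phase_diff = cmath.phase(xl * xr.conjugate())
        # Convert the phase difference into a time delay for this bin
        itd = phase_diff / (2.0 * math.pi * f)
        mask.append(1 if abs(itd) <= max_itd_s else 0)
    return mask
```

Bins whose estimated delay matches the target direction are kept and the rest are discarded. Because the phase wraps, the delay estimate is only unambiguous at frequencies where the true delay stays below half a period.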

10:40Transforming Features to Compensate Speech Recogniser Models for Noise

Rogier van Dalen (Cambridge University Engineering Department)
Federico Flego (Cambridge University Engineering Department)
Mark Gales (Cambridge University Engineering Department)

To make speech recognisers robust to noise, either the features or the models can be compensated. Feature enhancement is often fast; model compensation is often more accurate, because it predicts the corrupted speech distribution. It is therefore able, for example, to take uncertainty about the clean speech into account. This paper re-analyses the recently-proposed predictive linear transformations for noise compensation as minimising the KL divergence between the predicted corrupted speech and the adapted models. New schemes are then introduced which apply observation-dependent transformations in the front-end to adapt the back-end distributions. One applies transforms in the exact same manner as the popular minimum mean square error (MMSE) feature enhancement scheme, and is as fast. The new method performs better on AURORA 2.

11:00Subband Temporal Modulation Spectrum Normalization for Automatic Speech Recognition in Reverberant Environments

Xugang Lu (National Institute of Information and Communications Technology, Japan)
Masashi Unoki (Japan Advanced Institute of Science and Technology, Japan)
Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)

Speech recognition in reverberant environments is still a challenging problem. In this paper, we first investigated the reverberant effect on subband temporal envelopes by using the modulation-transfer-function (MTF). Based on the investigation, we proposed an algorithm which normalizes the subband temporal modulation spectrum (TMS) to reduce the diffusion effect of the reverberation. During the normalization, the subband TMS of both the clean and the reverberant speech are normalized to a reference TMS calculated from a clean speech data set for each frequency subband. Based on the normalized subband TMS, the inverse Fourier transform was applied to restore the subband temporal envelopes. We tested our algorithm on reverberant speech recognition tasks. For comparison, the traditional Mel-frequency cepstral coefficient with relative spectral filtering was used. Experiments showed that the features extracted by the proposed method yielded an overall relative improvement of 80.64%.

11:20Robust In-Car Spelling Recognition - A Tandem BLSTM-HMM Approach

Martin Woellmer (Technische Universitaet Muenchen)
Florian Eyben (Technische Universitaet Muenchen)
Bjoern Schuller (Technische Universitaet Muenchen)
Yang Sun (Technische Universitaet Muenchen)
Tobias Moosmayr (BMW Group)
Nhu Nguyen-Thien (Continental Automotive GmbH)

As an intuitive hands-free input modality, automatic spelling recognition is especially useful for in-car human-machine interfaces. However, for today's speech recognition engines it is extremely challenging to cope with similar sounding spelling speech sequences in the presence of noises such as the driving noise inside a car. Thus, we propose a novel Tandem spelling recogniser, combining a Hidden Markov Model (HMM) with a discriminatively trained bidirectional Long Short-Term Memory (BLSTM) recurrent neural net. The BLSTM network captures long-range temporal dependencies to learn the properties of in-car noise, which makes the Tandem BLSTM-HMM robust with respect to speech signal disturbances at extremely low signal-to-noise ratios and mismatches between training and test noise conditions. Experiments considering various driving conditions reveal that our Tandem recogniser outperforms a conventional HMM by up to 33%.

11:40Applying Non-Negative Matrix Factorization on Time-Frequency Reassignment Spectra for Missing Data Mask Estimation

Maarten Van Segbroeck (K.U.Leuven, Department of Electrical Engineering (ESAT))
Hugo Van hamme (K.U.Leuven, Department of Electrical Engineering (ESAT))

The application of Missing Data Theory (MDT) has been shown to improve the robustness of automatic speech recognition (ASR) systems. A crucial part of an MDT-based recognizer is the computation of the reliability masks from noisy data. To estimate accurate masks in environments with unknown, non-stationary noise statistics, we need to rely on a strong model for the speech. In this paper, an unsupervised technique using non-negative matrix factorization (NMF) discovers phone-sized time-frequency patches into which speech can be decomposed. The input matrix for the NMF is constructed using a high resolution and reassigned time-frequency representation. This representation facilitates an accurate detection of the patches that are active in unseen noisy speech. After further denoising of the patch activations, speech and noise can be reconstructed, from which missing feature masks are estimated. Recognition experiments on the Aurora2 database demonstrate the effectiveness of this technique.

Thu-Ses1-O2:
Prosody: perception

Time:Thursday 10:00 Place:East Wing 1 Type:Oral
Chair:Yi Xu

10:00Experiments on Automatic Prosodic Labeling

Antje Schweitzer (Institute of Natural Language Processing, University of Stuttgart)
Bernd Moebius (Institute of Natural Language Processing, University of Stuttgart)

This paper presents results from experiments on automatic prosodic labeling. Using the WEKA machine learning software [Witten/Frank:2005], classifiers were trained to determine for each syllable in a speech database of a male speaker its pitch accent and its boundary tone. Pitch accents and boundaries are according to the GToBI(S) dialect, with slight modifications. Classification was based on 35 attributes involving PaIntE F0 parametrization [Moehler/Conkie:1998] and normalized phone durations, but also some phonological information as well as higher-linguistic information. Several classification algorithms yield results of approx. 78% accuracy on the word level for pitch accents, and approx. 88% accuracy on the word level for phrase boundaries, which compare very well to results of other studies. The classifiers generalize to similar data of a female speaker in that they perform as well as classifiers trained directly on the female data.

10:20German Boundary tones show Categorical Perception and a Perceptual Magnet Effect when presented in different contexts

Katrin Schneider (Institute of Natural Language Processing, University of Stuttgart, Germany)
Grzegorz Dogil (Institute of Natural Language Processing, University of Stuttgart, Germany)
Bernd Möbius (Institute of Natural Language Processing, University of Stuttgart, Germany and Institute of Communication Sciences, University of Bonn, Germany)

The experiment presented in this paper examines categorical perception as well as the perceptual magnet effect in German boundary tones, also taking context information into account. The test phrase is preceded by different context sentences that are assumed to affect the location of the category boundary in the stimulus continuum between the low and the high boundary tone. Results provide evidence for the existence of a low and a high boundary tone in German, corresponding to statement versus question interpretation, respectively. Furthermore, in contrast to previous findings, a prototype was found not only in the category of the low but also in the category of the high boundary tone, supporting the hypothesis that context might have been taken into account to solve a possible ambiguity between H% and a previously hypothesized non-low and non-terminal boundary tone.

10:40Eye Tracking for the Online Evaluation of Prosody in Speech Synthesis: Not So Fast!

Michael White (Department of Linguistics, The Ohio State University)
Rajakrishnan Rajkumar (Department of Linguistics, The Ohio State University)
Kiwako Ito (Department of Linguistics, The Ohio State University)
Shari R. Speer (Department of Linguistics, The Ohio State University)

This paper presents an eye-tracking experiment comparing the processing of different accent patterns in unit selection synthesis and human speech. The synthetic speech results failed to replicate the facilitative effect of contextually appropriate accent patterns found with human speech, while producing a more robust intonational garden-path effect with contextually inappropriate patterns, both of which could be due to processing delays seen with the synthetic speech. As the synthetic speech was of high quality, the results indicate that eye tracking holds promise as a highly sensitive and objective method for the online evaluation of prosody in speech synthesis.

11:00Prosodic Analysis of Foreign-Accented English

Hansjörg Mixdorff (BHT Berlin University of Applied Sciences, Germany)
John Ingram (University of Queensland, Australia)

This study compares utterances by Vietnamese learners of Australian English with those of native subjects. In a previous study the utterances had been rated for foreign accent and intelligibility. We aim to find measurable prosodic differences accounting for the perceptual results. Our outcomes indicate, inter alia, that unaccented syllables are longer relative to accented ones in the Vietnamese corpus than in the Australian English corpus. Furthermore, the correlations of syllabic durations in utterances of one and the same sentence are much higher for Australian English subjects than for Vietnamese learners of English. Vietnamese speakers use a larger range of f0 and produce more pitch-accents than Australian speakers.

11:20Perception of the evolution of prosody in the French broadcast news style

Philippe Boula de Mareüil (LIMSI-CNRS)
Albert Rilliard (LIMSI-CNRS)
Alexandre Allauzen (LIMSI-CNRS - Université Paris-Sud 11, Orsay)

This study makes use of advances in automatic speech processing to analyse French audiovisual archives and the perception of the evolution of journalistic style with regard to prosody. Three perceptual experiments were run, using prosody transplantation, delexicalisation and imitation. Results show that the fundamental frequency and duration correlates of prosody enable old-fashioned recordings to be distinguished from more recent ones. The higher the pitch and the more pitch movements occur on syllables which may be interpreted as word-initially stressed, the more speech samples are perceived as dating back to the 40s or the 50s.

11:40Prosodic effects on vowel production: evidence from formant structure

Yoonsook Mo (Department of Linguistics, University of Illinois at Urbana-Champaign)
Jennifer Cole (Department of Linguistics, University of Illinois at Urbana-Champaign)
Mark Hasegawa-Johnson (Department of Electrical and Computer Engineering, University of Illinois at Urbana-Champaign)

On the basis of the auditory perception of nearly 100 ordinary, untrained listeners, this paper reports on the effect of prosodic prominence on the formant patterns of vowels using speech data from the Buckeye corpus of spontaneous American English. Evaluating two hypotheses (Hyperarticulation vs. Sonority Expansion Hypothesis), the evidence reported here from spontaneous speech shows that prominent vowels have expanded sonority regardless of vowel height, and are hyperarticulated only when hyperarticulation does not interfere with sonority expansion, extending the model proposed in prior controlled “laboratory” studies to cover prosody in spontaneous speech.

Thu-Ses1-O3:
Segmentation and Classification

Time:Thursday 10:00 Place:East Wing 2 Type:Oral
Chair:Stephen Cox

10:00An adaptive BIC approach for robust audio stream segmentation

Janez Zibert (Primorska Institute of Natural Sciences and Technology, University of Primorska, Koper, Slovenia)
Andrej Brodnik (Primorska Institute of Natural Sciences and Technology, University of Primorska, Koper, Slovenia)
France Mihelic (Faculty of Electrical Engineering, University of Ljubljana, Ljubljana, Slovenia)

In this paper we focus on audio segmentation. We present a novel method for robust estimation of decision-thresholds for accurate detection of acoustic change points in continuous audio streams. In standard segmentation procedures the decision-thresholds are usually set in advance and need to be tuned on development data. In the presented approach we try to remove the need for pre-determined decision-thresholds and propose a method for estimating thresholds directly from the currently processed audio data. It employs change-detection methods from two well-established audio segmentation approaches based on the Bayesian Information Criterion. Following from that, we develop two audio segmentation procedures, which enable us to adaptively tune boundary-detection thresholds and to combine different audio representations in the segmentation process. The proposed segmentation procedures are tested on broadcast news audio data.
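
For orientation, the standard fixed-penalty BIC change test that such approaches build on can be sketched as follows. This is the conventional criterion, not the adaptive threshold estimation proposed in the paper, and it is simplified to one-dimensional features modelled by a single Gaussian. The names `delta_bic`, `penalty_weight` and `var_floor` are illustrative assumptions.

```python
import math


def _ml_variance(xs):
    """Maximum-likelihood (biased) variance of a list of floats."""
    mean = sum(xs) / len(xs)
    return sum((x - mean) ** 2 for x in xs) / len(xs)


def delta_bic(window, split, penalty_weight=1.0, var_floor=1e-8):
    """Delta-BIC for splitting `window` at index `split`.

    Positive values favour two Gaussians (a change point at `split`);
    negative values favour a single Gaussian (no change).
    """
    n = len(window)
    left, right = window[:split], window[split:]
    var_all = max(_ml_variance(window), var_floor)
    var_left = max(_ml_variance(left), var_floor)
    var_right = max(_ml_variance(right), var_floor)
    # Likelihood gain from modelling the two halves separately
    gain = 0.5 * (n * math.log(var_all)
                  - len(left) * math.log(var_left)
                  - len(right) * math.log(var_right))
    # Model-complexity penalty: one extra mean and one extra variance
    n_extra_params = 2
    penalty = 0.5 * penalty_weight * n_extra_params * math.log(n)
    return gain - penalty
```

In this fixed-penalty form, `penalty_weight` is the quantity that normally has to be tuned on development data, which is the dependence the abstract aims to remove.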

10:20Improving the robustness of phonetic segmentation to accent and style variation with a two-staged approach

Vaishali Patil (I.I.T. Bombay, India)
Shrikant Joshi (I.I.T. Bombay, India)
Preeti Rao (I.I.T. Bombay, India)

Correct and temporally accurate phonetic segmentation of speech utterances is important in applications ranging from transcription alignment to pronunciation error detection. Automatic speech recognizers used in these tasks provide insufficient temporal alignment accuracy, in addition to recognition performance that is sensitive to accent and style variations from the training data. A two-staged approach combining HMM broad-class recognition with acoustic-phonetic knowledge based refinement is evaluated for phonetic segmentation accuracy in the context of accent and style mismatches with training data. Index Terms: phonetic segmentation, pronunciation scoring

10:40Signature Cluster Model Selection for Incremental Gaussian Mixture Cluster Modeling in Agglomerative Hierarchical Speaker Clustering

Kyu Han (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Agglomerative hierarchical speaker clustering (AHSC) has been widely used for classifying speech data by speaker characteristics. Its bottom-up, one-way structure of merging the closest cluster pair at every recursion step, however, makes it difficult to recover from incorrect merging. Hence, making AHSC robust to incorrect merging is an important issue. In this paper we address this problem in the framework of AHSC based on incremental Gaussian mixture models, which we previously introduced for better representing variable cluster size. Specifically, to minimize contamination in cluster models by heterogeneous data, we select and keep updating a representative (or signature) model for each cluster during AHSC. Experiments on meeting speech excerpts (4 hours total) verify that the proposed approach improves average speaker clustering performance by approximately 20% (relative).

11:00Speaker Segmentation and Clustering for Simultaneously Presented Speech

Lingyun Gu (Carnegie Mellon University)
Richard Stern (Carnegie Mellon University)

This paper proposes a new scheme used to segment and cluster speech segments on an unsupervised basis in cases where multiple speakers are presented simultaneously at different SNRs. The new elements in our work are the development of a new feature for segmenting and clustering simultaneously-presented speech, the procedure for identifying a candidate set of possible speaker-change points, and the use of pair-wise cross-segment distance distributions to cluster segments by speaker. The proposed system is evaluated in terms of the F measure that is obtained. The system is compared to a baseline system that uses MFCC for acoustic features, the Bayesian Information Criterion (BIC) for detecting speaker-change points, and the Kullback-Leibler distance for clustering the segments. Experimental results indicate that the new system consistently provides better performance than the baseline system with very small computational cost.

11:20Trimmed KL Divergence between Gaussian Mixtures for Robust Unsupervised Acoustic Anomaly Detection

Nash Borges (Johns Hopkins University, Human Lang. Tech. Ctr. of Excellence)
Gerard G. L. Meyer (Johns Hopkins University, Human Lang. Tech. Ctr. of Excellence)

In previous work, we presented several implementations of acoustic anomaly detection by training a model on purely normal data and estimating the divergence between it and other input. Here, we reformulate the problem in an unsupervised framework and allow for anomalous contamination of the training data. We focus exclusively on methods employing Gaussian mixture models (GMMs) since they are often used in speech processing systems. After analyzing what caused the Kullback-Leibler (KL) divergence between GMMs to break down in the face of training contamination, we came up with a promising solution. By trimming one quarter of the most divergent Gaussians from the mixture model, we significantly outperformed the untrimmed approximation for contamination levels of 10% and above, reducing the equal error rate from 33.8% to 6.4% at 33% contamination. The performance of the trimmed KL divergence showed no significant dependence on the investigated contamination levels.

11:40How to Loose Confidence: Probabilistic Linear Machines for Multiclass Classification

Hui Lin (University of Washington)
Jeff Bilmes (University of Washington)
Koby Crammer (University of Pennsylvania)

In this paper, we propose a novel multiclass classifier that we call a probabilistic linear machine (PLM) which overcomes the low-entropy problem of exponential-based classifiers. Although PLMs are linear classifiers, we use a careful design of the parameters matched with weak requirements over the features to output a true probability distribution over labels given an input instance. We cast the discriminative learning problem as linear programming, which can scale up to large problems on the order of millions of training samples. Our experiments on phonetic classification show that PLM achieves high entropy while maintaining a comparable accuracy to other state-of-the-art classifiers.

Thu-Ses1-O4:
Evaluation & standardisation of SL technology and systems

Time:Thursday 10:00 Place:East Wing 3 Type:Oral
Chair:Sebastian Möller

10:00Quantifying Wideband Speech Codec Degradations via Impairment Factors: The New ITU-T P.834.1 Methodology and Its Application to the G.711.1 Codec

Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Nicolas Côté (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Atsuko Kurashima (NTT Service Integration Laboratories, Tokyo, Japan)
Noritsugu Egi (NTT Service Integration Laboratories, Tokyo, Japan)
Akira Takahashi (NTT Service Integration Laboratories, Tokyo, Japan)

Wideband speech codecs usually provide better perceptual speech quality than their narrowband counterparts, but they still degrade quality compared to an uncoded transmission path. In order to quantify these degradations, a new methodology is presented which derives a one-dimensional quality index on the basis of instrumental measurements. This index can be used to rank different wideband speech codecs according to their degradations and to calculate overall quality in conjunction with other degradations, like packet loss. We apply this methodology to derive respective indices for the new G.711.1 codec.

10:20SUXES - User Experience Evaluation Method for Spoken and Multimodal Interaction

Markku Turunen (University of Tampere)
Jaakko Hakulinen (University of Tampere)
Aleksi Melto (University of Tampere)
Tomi Heimonen (University of Tampere)
Tuuli Laivo (University of Tampere)
Juho Hella (University of Tampere)

Much work remains to be done with subjective evaluations of speech-based and multimodal systems. In particular, user experience is still hard to evaluate. SUXES is an evaluation method for collecting subjective metrics with user experiments. It captures both user expectations and user experiences, making it possible to analyze the state of the application and its interaction methods, and compare results. We present the SUXES method with examples of user experiments with different applications and modalities.

10:40Results of the N-Best 2008 Dutch Speech Recognition Evaluation

David van Leeuwen (TNO Human Factors)
Judith Kessens (TNO Human Factors)
Eric Sanders (Radboud University Nijmegen)
Henk van den Heuvel (Radboud University Nijmegen)

In this paper we report the results of a Dutch speech recognition system evaluation held in 2008. The evaluation contained material in two domains, Broadcast News (BN) and Conversational Telephone Speech (CTS), and in two main accent regions (Flemish and Dutch). In total 7 sites submitted recognition results to the evaluation, totaling 58 different submissions in the various conditions. Best performances ranged from 15.9% word error rate for BN, Flemish to 46.1% for CTS, Flemish. This evaluation is the first of its kind for the Dutch language.
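
Word error rate is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below shows that conventional definition only. The evaluation's actual scoring rules (text normalisation, compound handling, scoring tool) are not stated in the abstract, and the function name and the whitespace tokenisation are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """(substitutions + deletions + insertions) / number of reference words,
    computed with a word-level Levenshtein alignment."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words against an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

For example, the reference "de kat zit op de mat" against the hypothesis "de kat zat op mat" has one substitution and one deletion over six reference words.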

11:00SHoUT, the University of Twente Submission to the N-Best 2008 Speech Recognition Evaluation for Dutch

Marijn Huijbregts (University of Twente)
Roeland Ordelman (University of Twente)
Laurens van der Werff (University of Twente)
Franciska de Jong (University of Twente)

In this paper we present our primary submission to the first Dutch and Flemish large vocabulary continuous speech recognition benchmark, N-Best. We describe our system workflow, the models we created for the four evaluation tasks and how we approached the problem of compounding that is typical for a language such as Dutch. We present the evaluation results and our post-evaluation analysis.

11:20NIST 2008 Speaker Recognition Evaluation: Performance Across Telephone and Room Microphone Channels

Alvin F. Martin (National Institute of Standards and Technology)
Craig S. Greenberg (National Institute of Standards and Technology)

We describe the 2008 NIST Speaker Recognition Evaluation, including the speech data used, the test conditions included, the participants, and some of the performance results obtained. This evaluation was distinguished by including as part of the required test condition interview type speech as well as conversational telephone speech, and speech recorded over microphone channels as well as speech recorded over telephone lines. Notable was the relative consistency of best system performance obtained over the different speech types, including those involving different types in training and test. Some comparison with performance in prior evaluations is also discussed.

11:40The ESTER 2 evaluation campaign for the rich transcription of French radio broadcasts

Sylvain Galliano (DGA/CEP)
Guillaume Gravier (AFCP)
Laura Chaubard (DGA/CEP)

This paper reports on the final results of the ESTER 2 evaluation campaign held from 2007 to April 2009. The aim of this campaign was to evaluate automatic rich transcription systems for French radio broadcasts. The evaluation tasks were divided into three main categories: audio event detection and tracking (e.g., speech vs. music, speaker tracking), orthographic transcription, and information extraction. The paper describes the data provided for the campaign, the task definitions and evaluation protocols as well as the results.

Thu-Ses1-S1:
Special Session: New Approaches to Modeling Variability for Automatic Speech Recognition

Time:Thursday 10:00 Place:East Wing 4 Type:Special
Chair:Carol Espy-Wilson & Jennifer Cole

10:00A Noise-type and level-dependent MPO-based speech enhancement architecture

Vikramjit Mitra (University of Maryland, College Park)
Bengt Borgstrom (University of California, Los Angeles)
Carol Espy-Wilson (University of Maryland, College Park)
Abeer Alwan (University of California, Los Angeles)

In previous work, a speech enhancement algorithm based on phase opponency and a periodicity measure (MPO-APP) was developed for speech recognition. Axiomatic thresholds were used in the MPO-APP regardless of the signal-to-noise ratio (SNR) of the corrupted speech or any characterization of the noise. The current work developed an algorithm for adjusting the threshold in the MPO-APP based on the SNR and whether the speech signal is clean, corrupted by aperiodic noise or corrupted with noise with periodic components. In addition, variable frame rate (VFR) analysis has been incorporated so that dynamic regions in the speech signal are more heavily sampled than steady-state regions. The result is a 2-stage algorithm that gives superior performance to the previous MPO-APP, and to several other state-of-the-art speech enhancement algorithms.

10:20Complementarity of MFCC, PLP and Gabor features in the presence of speech-intrinsic variabilities

Bernd T. Meyer (University of Oldenburg)
Birger Kollmeier (University of Oldenburg)

In this study, the effect of speech-intrinsic variabilities such as speaking rate, effort and speaking style on automatic speech recognition (ASR) is investigated. We analyze the influence of such variabilities as well as extrinsic factors (i.e., additive noise) on the most common features in ASR (mel-frequency cepstral coefficients and perceptual linear prediction features) and spectro-temporal Gabor features. MFCCs performed best for clean speech, whereas Gabors were found to be the most robust feature in extrinsic variabilities. Intrinsic variations were found to have a strong impact on error rates. While performance with MFCCs and PLPs was degraded in much the same way, Gabor features exhibit a different sensitivity towards these variabilities and are, e.g., well-suited to recognize speech with varying pitch. The results suggest that spectro-temporal and classic features carry complementary information, which could be exploited in feature-stream experiments.

10:40Noise robustness of Tract Variables and their application to Speech Recognition

Vikramjit Mitra (Department of Electrical and Computer Engineering, University of Maryland, USA)
Hosung Nam (Haskins Laboratories, New Haven, USA)
Carol Espy-Wilson (Department of Electrical and Computer Engineering, University of Maryland, USA)
Elliot Saltzman (Haskins Laboratories, New Haven, USA)
Louis Goldstein (Haskins Laboratories, New Haven, USA)

This paper analyzes the noise robustness of vocal tract constriction variable estimation and also investigates the role of these variables in noise-robust speech recognition. We implement a simple direct inverse model using a feed-forward artificial neural network (ANN) to estimate vocal tract time functions (VTTF) from the acoustic speech signal parameterized as Mel-frequency cepstral coefficients (MFCC). The training corpus was obtained from the TAsk Dynamics Application model (TADA [1]), which generated the synthetic speech as well as the corresponding VTTFs. Eight different vocal tract (VT) constriction variables were considered in this study: five constriction degree variables (lip aperture [LA], tongue body [TBCD], tongue tip [TTCD], velum [VEL], and glottis [GLO]) and three constriction location variables (lip protrusion [LP], tongue tip [TTCL], tongue body [TBCL]).

11:00Articulatory Phonological Code for Word Classification

Xiaodan Zhuang (University of Illinois at Urbana-Champaign)
Hosung Nam (Haskins Laboratories, New Haven, U.S.A.)
Mark Hasegawa-Johnson (University of Illinois at Urbana-Champaign)
Louis Goldstein (Haskins Laboratories, New Haven, U.S.A.)
Elliot Saltzman (Haskins Laboratories, New Haven, U.S.A.)

We propose a framework that leverages articulatory phonology for speech recognition. "Gestural pattern vectors" (GPV) encode the instantaneous gestural activations that exist across all tract variables at each time. Given a speech observation, recognizing the sequence of GPV recovers the ensemble of gestural activations, i.e., the gestural score. For each word in the vocabulary, we use a task dynamic model of inter-articulator speech coordination to generate the "canonical" gestural score. Speech recognition is achieved by matching the ensemble of gestural activations. In particular, we estimate the likelihood of the recognized GPV sequence on word-dependent GPV sequence models trained using the canonical gestural scores. These likelihoods, weighted by confidence score of the recognized GPVs, are used in a Bayesian speech recognizer. Pilot gestural score recovery and word classification experiments are carried out using synthesized data from one speaker. The observation distribution of each GPV is modeled by an artificial neural network and Gaussian mixture tandem model. Bigram GPV sequence models are used to distinguish gestural scores of different words. Given the tract variable time functions, about 80% of the instantaneous gestural activation is correctly recovered. Word recognition accuracy is over 85% for a vocabulary of 139 words with no training observations. These results suggest that the proposed framework might be a viable alternative to the classic sequence-of-phones model.

11:20Robust Keyword Spotting with Rapidly Adapting Point Process Models

Aren Jansen (Dept of Computer Science, University of Chicago)
Partha Niyogi (Depts. of Computer Science and Statistics, University of Chicago)

In this paper, we investigate the noise robustness properties of frame-based and sparse point process-based models for spotting keywords in continuous speech. We introduce a new strategy to improve point process model (PPM) robustness by adapting low-level feature detector thresholds to preserve background firing rates in the presence of noise. We find that this unsupervised approach can significantly outperform fully supervised maximum likelihood linear regression (MLLR) adaptation of an equivalent keyword-filler HMM system in the presence of additive white and pink noise. Moreover, we find that the sparsity of PPMs introduces an inherent resilience to non-stationary babble noise not exhibited by the frame-based HMM system. Finally, we demonstrate that our approach requires less adaptation data than MLLR, permitting rapid online adaptation.

11:40Automatically Rating Pronunciation Through Articulatory Phonology

Joseph Tepperman (University of Southern California)
Louis Goldstein (University of Southern California)
Sungbok Lee (University of Southern California)
Shrikanth Narayanan (University of Southern California)

Articulatory Phonology's link between cognitive speech planning and the physical realizations of vocal tract constrictions has implications for speech acoustic and duration modeling that should be useful in assigning subjective ratings of pronunciation quality to nonnative speech. In this work, we compare traditional phoneme models used in automatic speech recognition to similar models for articulatory gestural pattern vectors, each with associated duration models. What we find is that, on the CDT corpus, gestural models outperform the phoneme-level baseline in terms of correlation with listener ratings, and in combination phoneme and gestural models outperform either one alone. This also validates previous findings with a similar (but not gesture-based) pseudo-articulatory representation.

Thu-Ses1-P1:
Speech Coding

Time:Thursday 10:00 Place:Hewison Hall Type:Poster
Chair:Børge Lindberg

#1Differential Vector Quantization of Feature Vectors for Distributed Speech Recognition

Jose Enrique Garcia (University of Zaragoza)
Alfonso Ortega (University of Zaragoza)
Antonio Miguel (University of Zaragoza)
Eduardo Lleida (University of Zaragoza)

Distributed speech recognition arose to overcome the computational limitations of mobile devices such as PDAs and mobile phones. Due to bandwidth restrictions, it is necessary to develop efficient techniques for transmitting acoustic features in Automatic Speech Recognition applications. This paper presents a technique for compressing acoustic feature vectors based on differential vector quantization, which combines vector quantization with differential encoding. Recognition experiments show that the proposed method outperforms the ETSI standard VQ system and the classical VQ scheme for different codebook lengths and situations. With the proposed scheme, bit rates as low as 2.1 kbps can be used without degrading the WER of the ASR system compared with a system without quantization.
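
As a rough illustration of combining vector quantization with differential encoding, the sketch below quantizes the difference between each feature frame and the previously reconstructed frame. The first-order predictor, the tiny 2-D codebook and all names are assumptions made for this example; the abstract does not specify the predictor or codebook design used in the paper.

```python
def nearest(codebook, vec):
    """Index of the codeword closest to vec (squared Euclidean distance)."""
    return min(range(len(codebook)),
               key=lambda i: sum((c - v) ** 2 for c, v in zip(codebook[i], vec)))

def dvq_encode(frames, codebook):
    """Quantize each frame's difference from the previous reconstructed frame."""
    prev = [0.0] * len(frames[0])
    indices = []
    for frame in frames:
        residual = [f - p for f, p in zip(frame, prev)]
        idx = nearest(codebook, residual)
        indices.append(idx)
        # Mirror the decoder state so quantization error does not accumulate.
        prev = [p + c for p, c in zip(prev, codebook[idx])]
    return indices

def dvq_decode(indices, codebook):
    """Rebuild the frames by accumulating the decoded residuals."""
    prev = [0.0] * len(codebook[0])
    frames = []
    for idx in indices:
        prev = [p + c for p, c in zip(prev, codebook[idx])]
        frames.append(prev)
    return frames

# Hypothetical residual codebook and feature frames (2-D for readability).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (-1.0, 0.0), (0.0, -1.0)]
FRAMES = [(1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]

indices = dvq_encode(FRAMES, CODEBOOK)
decoded = dvq_decode(indices, CODEBOOK)
```

Only the codeword indices would be transmitted; because consecutive feature frames tend to be similar, the residuals occupy a smaller range than the frames themselves, which is what makes a small codebook workable.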

#2Arithmetic Coding of Sub-band Residuals in FDLP Speech/Audio Codec

Petr Motlicek (Idiap Research Institute, Martigny, Switzerland)
Sriram Ganapathy (ECE Dept., Johns Hopkins University, Baltimore, USA)
Hynek Hermansky (ECE Dept., Johns Hopkins University, Baltimore, USA)

A speech/audio codec based on Frequency Domain Linear Prediction (FDLP) exploits auto-regressive modeling to approximate instantaneous energy in critical frequency sub-bands of relatively long input segments. The current version of the FDLP codec operating at 66 kbps has been shown to provide subjective listening quality comparable to that of state-of-the-art codecs at similar bit-rates, even without employing standard blocks such as entropy coding or simultaneous masking. This paper describes experimental work to increase the compression efficiency of the FDLP codec by employing entropy coding. Unlike conventional Huffman coding employed in current speech/audio coding systems, we describe an efficient way to exploit arithmetic coding to entropy compress quantized spectral magnitudes of the subband FDLP residuals. Such an approach provides 11% (~ 3 kbps) bit-rate reduction compared to the Huffman coding algorithm (~ 1 kbps).
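
One general reason arithmetic coding can beat Huffman coding is that Huffman codes spend a whole number of bits on every symbol, whereas arithmetic coding can approach the source entropy. The sketch below compares the average Huffman code length with the entropy for a toy, skewed three-symbol distribution. The distribution is an assumption for illustration and is not taken from the FDLP codec.

```python
import heapq
import math

def huffman_lengths(probs):
    """Huffman code length in bits for each symbol probability."""
    # Heap items: (subtree probability, unique tie-breaker, symbols in subtree).
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    counter = len(probs)
    while len(heap) > 1:
        p1, _, s1 = heapq.heappop(heap)
        p2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:
            lengths[s] += 1  # each merge adds one bit to the merged symbols
        heapq.heappush(heap, (p1 + p2, counter, s1 + s2))
        counter += 1
    return lengths

def entropy(probs):
    """Shannon entropy in bits per symbol."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Hypothetical skewed distribution of quantized values.
PROBS = [0.8, 0.1, 0.1]
lengths = huffman_lengths(PROBS)
avg_huffman = sum(p * n for p, n in zip(PROBS, lengths))
lower_bound = entropy(PROBS)
```

Here the Huffman code averages about 1.2 bits per symbol against an entropy of roughly 0.92 bits, and that gap is the room an arithmetic coder can exploit.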

#3Pitch Variation Estimation

Tom Bäckström (Fraunhofer Institute of Integrated Systems)
Stefan Bayer (Fraunhofer Institute of Integrated Systems)
Sascha Disch (Fraunhofer Institute of Integrated Systems)

A method for estimating the normalised pitch variation is described. While pitch tracking is a classical problem, in applications where the pitch magnitude is not required but only the change in pitch, all the main problems of pitch tracking can be avoided, such as octave jumps and intricate peak-finding heuristics. The presented approach is efficient, accurate and unbiased. It was developed for use in speech and audio coding for pitch variation compensation, but can also be used as additional information for pitch tracking.

#4Soft Decision-Based Acoustic Echo Suppression in a Frequency Domain

Yun-Sik Park (Inha University)
Ji-Hyun Song (Inha University)
Jae-Hun Choi (Inha University)
Joon-Hyuk Chang (Inha University)

In this paper, we propose a novel acoustic echo suppression (AES) technique based on soft decision in a frequency domain. The proposed approach provides an efficient and unified framework for such procedures as AES gain computation, AES gain modification using soft decision, and estimation of relevant parameters based on the same statistical model assumption of the near-end and far-end signal instead of the conventional strategies requiring the additional residual echo suppression (RES) step. Performances of the proposed AES algorithm are evaluated by objective tests under various environments and better results compared with the conventional AES method are obtained.

#5Fine-Granular Scalable MELP Coder Based on Embedded Vector Quantization

Mouloud DJAMAH (INRS-EMT)
Douglas O’Shaughnessy (INRS-EMT)

This paper presents an efficient codebook design for tree-structured vector quantization (TSVQ) which is embedded in nature. The federal standard MELP (mixed excitation linear prediction) speech coder is modified by replacing the original single stage vector quantizer for Fourier magnitudes with a TSVQ and the original multistage vector quantizer (MSVQ) for line spectral frequencies (LSF) with a multistage TSVQ (MTVQ). The modified coder is fine-granular bit-rate scalable with gradual change in quality for the synthetic speech when the number of bits available for LSF and Fourier magnitudes decoding is decremented bit-by-bit.

#6Joint Quantization Strategies for Low Bit-Rate Sinusoidal Coding

Emre Unver (CCSR, University of Surrey)
Stephane Villette (CCSR, University of Surrey)
Ahmet Kondoz (CCSR, University of Surrey)

Transparent speech quality has not been achieved at low bit rates, especially at 2.4 kbps and below, which is an area of interest for military and security applications. In this paper, strategies for low bit rate sinusoidal coding are discussed. Previous work in the literature on using metaframes and performing variable bit allocation according to the metaframe type is extended. An optimum metaframe size compromise between delay and quantization gains is found. A new method for voicing determination from the LPC shape is also presented. The proposed techniques have been applied to the SB-LPC vocoder to produce speech at 1.2/0.8 kbps, and compared to the original SB-LPC vocoder at 2.4/1.2 kbps as well as an established standard (MELP) at 2.4/1.2/0.6 kbps in a listening test. It has been found that the proposed techniques have been effective in reducing the bit-rate while not compromising the speech quality.

#7Steganographic Band Width Extension for the AMR Codec of Low-Bit-Rate Modes

Akira Nishimura (Tokyo University of Information Sciences)

This paper proposes a bandwidth extension (BWE) method for the AMR narrow-band speech codec using steganography, which is called steganographic BWE herein. The high-band information is embedded into the pitch delay data of the AMR codec using an extended quantization-based method that achieves increased embedding capacity and higher perceived sound quality than the previous steganographic method. The target bit-rate mode is below 7 kbps, the level below which the previous steganographic BWE method did not maintain adequate sound quality. The sound quality of the steganographic BWE speech signals decoded from the embedded bitstream is comparable to that of the wide-band speech signals of the AMR-WB codec at a bit rate of less than 6.7 kbps, with only a slight degradation in the quality relative to speech signals decoded from the same bitstream by the legacy AMR decoder.

#8Ultra low bit-rate speech coding based on unit-selection with joint spectral-residual quantization: No transmission of any residual information

V Ramasubramanian (Siemens Corporate Technology - India)
D Harish (Siemens Corporate Technology - India)

A recent trend in ultra low bit-rate speech coding is based on segment quantization by unit-selection principle using large continuous codebooks as unit database. We show that use of such large unit databases allows speech to be reconstructed at the decoder by using the best unit's residual itself (in the unit database), thereby obviating the need to transmit any side information about the residual of the input speech. For this, it becomes necessary to jointly quantize the spectral and residual information at the encoder during unit selection and we propose various composite measures for such a joint spectral-residual quantization within a unit-selection algorithm proposed earlier. We realize ultra low bit-rate speaker-dependent speech coding at an overall rate of 250 bits/sec using unit database sizes of 19 bits/unit (524288 phone-like units or about 6 hours of speech) with spectral distortions less than 2.5 dB that retains intelligibility, naturalness, prosody and speaker-identity.

#9On the Cost of Backward Compatibility for Communication Codecs

Konstantin Schmidt (Fraunhofer Institute for Integrated Circuits, Erlangen, Germany)
Markus Schnell (Fraunhofer Institute for Integrated Circuits, Erlangen, Germany)
Nikolaus Rettelbach (Fraunhofer Institute for Integrated Circuits, Erlangen, Germany)
Manfred Lutzky (Fraunhofer Institute for Integrated Circuits, Erlangen, Germany)
Jochen Issing (Fraunhofer Institute for Integrated Circuits, Erlangen, Germany)

Super wideband (SWB) communication is attracting more and more attention, as can be seen from the standardization activities on SWB extensions for well-established wideband codecs, e.g. G.722 or G.711.1. This paper presents a technical solution for extending the G.722 codec and compares the new technology to other standardized SWB codecs. In doing so, it takes a closer look at the concept of extending existing technologies with additional capabilities, in contrast to non-backwards-compatible solutions.

#10A Media-Specific FEC Based on Huffman Coding for Distributed Speech Recognition

Lee Young Han (Gwangju Institute of Science and Technology)
Kim Hong Kook (Gwangju Institute of Science and Technology)

In this paper, we propose a media-specific forward error correction (FEC) method based on Huffman coding for distributed speech recognition (DSR). In order to mitigate the performance degradation of DSR in noisy channel environments, the importance of each subvector for the DSR system is first explored. As a result, the first subvector information for the mel-frequency cepstral coefficients (MFCCs) is then added as an error protection code. At the same time, Huffman coding methods are applied to compressed MFCCs to prevent the bit-rate increase caused by using such protection codes. Different Huffman trees for MFCCs are designed according to the voicing class, subvector-wise, and their combinations. Recognition experiments on the Aurora 4 large vocabulary database under several noisy channel conditions show that the proposed FEC method achieves a relative average word error rate (WER) reduction of 9.03~17.81% compared with the standard DSR system using no FEC methods.

Thu-Ses1-P4:
Systems for Spoken Language Understanding

Time:Thursday 10:00 Place:Hewison Hall Type:Poster
Chair:Renato De Mori

#1Classification-Based Strategies for Combining Multiple 5-W Question Answering Systems

Sibel Yaman (ICSI)
Dilek Hakkani-Tur (ICSI)
Gokhan Tur (SRI International)
Ralph Grishman (Computer Science Department, New York University)
Mary Harper (Hopkins HLT Center of Excellence, University of Maryland)
Kathleen R. McKeown (Computer Science Department, Columbia University)
Adam Meyers (Computer Science Department, New York University)
Kartavya Sharma (Computer Science Department, Columbia University)

We describe and analyze inference strategies for combining outputs from multiple question answering systems each of which was developed independently. Specifically, we address the task of finding answers to the 5-Wh questions (who, what, when, where, and why) for each given sentence. The approach we take revolves around determining the best system using discriminative learning. In particular, we train support vector machines with a set of novel features that encode systems’ capabilities of returning as many correct answers as possible. We analyze two combination strategies: one combines multiple systems at the granularity of sentences, and the other at the granularity of individual fields. Our experimental results indicate that the proposed features and combination strategies were able to improve the overall performance by 22% to 36% relative to a random selection, 16% to 35% relative to a majority voting scheme, and 15% to 23% relative to the best individual system.

#2Combining Semantic and Syntactic Information Sources for 5-W Question Answering

Sibel Yaman (ICSI)
Dilek Hakkani-Tur (ICSI)
Gokhan Tur (SRI International)

This paper focuses on combining answers generated by a semantic parser that produces semantic role labels (SRLs) and those generated by a syntactic parser that produces function tags for answering 5-W questions, i.e., who, what, when, where, and why. We propose a probabilistic view in which a system's ability to correctly answer 5-W questions is measured with the probability of its answers being produced for the given word sequence. This is achieved by training statistical language models (LMs) that are used to predict whether the answers returned by semantic or by syntactic parsers are more likely. We evaluated our approach using the OntoNotes dataset. Our experimental results indicate that the proposed LM-based combination strategy was able to improve the performance of the best individual system in terms of F1 measure as well as error rate. Furthermore, the error rates for each question type were also significantly reduced.

#3Phrase and Word Level Strategies for Detecting Appositions in Speech

Benoit Favre (International Computer Science Institute)
Dilek Hakkani-Tur (International Computer Science Institute)

Appositions are grammatical constructs in which two noun phrases are placed side-by-side, one modifying the other. Detecting them in speech can help extract semantic information useful, for instance, for co-reference resolution and question answering. We compare and combine three approaches: word-level and phrase-level classifiers, and a syntactic parser trained to generate appositions. On reference parses, the phrase-level classifier outperforms the other approaches while on automatic parses and ASR output, the combination of the apposition-generating parser and the word-level classifier works best. An analysis of the system errors reveals that parsing accuracy and world knowledge are very important for this task.

#4Error correction of proportions in spoken opinion surveys

Nathalie Camelin (LIA - University of Avignon)
Renato De Mori (LIA - University of Avignon)
Frederic Bechet (LIA - University of Avignon)
Geraldine Damnati (France Telecom R&D)

The paper analyzes the types of errors encountered in automatic spoken surveys. These errors are different from the ones that appear when surveys are taken by humans because they are caused by the imprecision of an automatic system. Previous studies presented a strategy that consists in the robust detection of subjective opinions about a particular topic in a spoken message. If the same automatic system is used for estimating opinion proportions in different spoken surveys, then the error rate of the entire automatic process should not vary too much in different surveys for each type of opinions. Based on this conjecture, a linear error model is derived and used for error correction. Experimental results obtained with data of a real-world deployed system show significant error reductions obtained in the automatic estimation of proportions in spoken surveys. These reductions are particularly useful for a trend analysis of opinions over time.
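
To make the idea of a linear error model concrete, here is a minimal two-class sketch. It assumes that the detector's hit rate and false-alarm rate for an opinion are stable across surveys, so that the observed proportion is a linear function of the true proportion and can be inverted. The two-class formulation, the rates and the function name are assumptions for illustration; the abstract does not give the exact form of the authors' model.

```python
def correct_proportion(observed, hit_rate, false_alarm_rate):
    """Invert observed = hit_rate * true + false_alarm_rate * (1 - true)."""
    if hit_rate == false_alarm_rate:
        raise ValueError("uninformative detector: the model cannot be inverted")
    estimate = (observed - false_alarm_rate) / (hit_rate - false_alarm_rate)
    return min(1.0, max(0.0, estimate))  # clip to a valid proportion

# Hypothetical rates measured on an annotated development survey.
HIT_RATE = 0.8          # P(system says "positive" | truly positive)
FALSE_ALARM_RATE = 0.1  # P(system says "positive" | truly not positive)

# If 40% of messages are truly positive, the system reports about 38%.
observed = HIT_RATE * 0.4 + FALSE_ALARM_RATE * 0.6
corrected = correct_proportion(observed, HIT_RATE, FALSE_ALARM_RATE)
```

The correction recovers the 40% figure from the biased 38% estimate, provided the two rates really do carry over from one survey to the next.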

#5Transformation-based Learning for Semantic parsing

Filip Jurčíček (Engineering Department, Cambridge University)
Milica Gašić (Engineering Department, Cambridge University)
Simon Keizer (Engineering Department, Cambridge University)
François Mairesse (Engineering Department, Cambridge University)
Blaise Thomson (Engineering Department, Cambridge University)
Kai Yu (Engineering Department, Cambridge University)
Steve Young (Engineering Department, Cambridge University)

This paper presents a semantic parser that transforms an initial semantic hypothesis into the correct semantics by applying an ordered list of transformation rules. These rules are learnt automatically from a training corpus with no prior linguistic knowledge and no alignment between words and semantic concepts. The learning algorithm produces a compact set of rules which enables the parser to be very efficient while retaining high accuracy. We show that this parser is competitive with respect to the state-of-the-art semantic parsers on the ATIS and TownInfo tasks.

#6Large-Scale Polish SLU

Patrick Lehnen (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Stefan Hahn (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Hermann Ney (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Agnieszka Mykowiecka (Institute of Computer Science, Polish Academy of Sciences, Warsaw, Poland and Polish-Japanese Institute of Information Technology, Warsaw, Poland)

In this paper, we present state-of-the-art concept tagging results on a new corpus for Polish SLU. For this language, it is the first large-scale corpus (~200 different concepts) which has been semantically annotated and will be made publicly available. Conditional Random Fields have proven to lead to best results for string-to-string translation problems. Using this approach, we achieve a concept error rate of 22.6% on an evaluation corpus. To additionally extract attribute values, a combination of a statistical and a rule-based approach is used leading to a CER of 30.2%.

#7Optimizing CRFs for SLU Tasks in Various Languages Using Modified Training Criteria

Stefan Hahn (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Patrick Lehnen (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Georg Heigold (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)
Hermann Ney (Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, Germany)

In this paper, we present improvements of our state-of-the-art concept tagger based on conditional random fields. Statistical models have been optimized for three tasks of varying complexity in three languages (French, Italian, and Polish). Modified training criteria have been investigated, leading to small improvements. The respective corpora as well as parameter optimization results for all models are presented in detail. A comparison of the selected features between languages as well as a close look at the tuning of the regularization parameter is given. The experimental results show to what extent the optimizations of the single systems are portable between languages.

#8Learning Lexicons from Spoken Utterances Based on Statistical Model Selection

Ryo Taguchi (Advanced Telecommunications Research Institute International / Nagoya Institute of Technology)
Naoto Iwahashi (Advanced Telecommunications Research Institute International / National Institute of Information and Communications Technology)
Takashi Nose (Advanced Telecommunications Research Institute International / Tokyo Institute of Technology)
Kotaro Funakoshi (Honda Research Institute Japan Co., Ltd.)
Mikio Nakano (Honda Research Institute Japan Co., Ltd.)

This paper proposes a method for the unsupervised learning of lexicons from pairs of a spoken utterance and an object as its meaning without any a priori linguistic knowledge other than a phoneme acoustic model. In order to obtain a lexicon, a statistical model of the joint probability of a spoken utterance and an object is learned based on the minimum description length principle. This model consists of a list of word phoneme sequences and three statistical models: the phoneme acoustic model, a word-bigram model, and a word meaning model. Experimental results show that the method can acquire acoustically, grammatically and semantically appropriate words with about 85% phoneme accuracy.

#9Improving Speech Understanding Accuracy with Limited Training Data Using Multiple Language Models and Multiple Understanding Models

Masaki Katsumaru (Graduate School of Informatics, Kyoto University, Japan)
Mikio Nakano (Honda Research Institute Japan Co., Ltd., Japan)
Kazunori Komatani (Graduate School of Informatics, Kyoto University, Japan)
Kotaro Funakoshi (Honda Research Institute Japan Co., Ltd., Japan)
Tetsuya Ogata (Graduate School of Informatics, Kyoto University, Japan)
Hiroshi G. Okuno (Graduate School of Informatics, Kyoto University, Japan)

We aim to improve a speech understanding module with a small amount of training data. A speech understanding module uses a language model (LM) and a language understanding model (LUM). A lot of training data are needed to improve the models. Such data collection is, however, difficult in an actual process of development. We therefore design and develop a new framework that uses multiple LMs and LUMs to improve speech understanding accuracy under various amounts of training data. Even if the amount of available training data is small, each LM and each LUM can deal well with different types of utterances, and more utterances are understood by using multiple LMs and LUMs. As one implementation of the framework, we develop a method for selecting the most appropriate speech understanding result from several candidates. The selection is based on probabilities of correctness calculated by logistic regressions. We evaluate our framework with various amounts of training data.
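
The selection step can be sketched as follows: each LM/LUM combination yields one candidate result, a logistic regression estimates the probability that the candidate is correct, and the most probable candidate is chosen. The use of one regression per LM/LUM pair, the single feature, the weights and all names are hypothetical choices for this illustration, not details taken from the paper.

```python
import math

def correctness_probability(features, weights, bias):
    """Logistic regression estimate that an understanding result is correct."""
    z = bias + sum(w * x for w, x in zip(weights, features))
    return 1.0 / (1.0 + math.exp(-z))

def select_result(candidates, models):
    """Return the (result, probability) pair with the highest probability.

    candidates: list of (model_pair, result, features)
    models:     model_pair -> (weights, bias)
    """
    scored = [(result, correctness_probability(feats, *models[pair]))
              for pair, result, feats in candidates]
    return max(scored, key=lambda item: item[1])

# Hypothetical regressions and candidates; the single feature stands in for
# something like an ASR confidence score.
MODELS = {"lm1+lum1": ([2.0], -1.0), "lm2+lum2": ([1.5], -0.5)}
CANDIDATES = [("lm1+lum1", "ask_weather", [0.9]),
              ("lm2+lum2", "ask_time", [0.2])]

best_result, best_prob = select_result(CANDIDATES, MODELS)
```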

#10Low-Cost Call Type Classification for Contact Center Calls Using Partial Transcripts

Youngja Park (IBM T.J. Watson Research Center)
Wilfried Teiken (IBM T.J. Watson Research Center)
Stephen Gates (IBM T.J. Watson Research Center)

Call type classification for call center calls using automatically generated transcripts is limited mainly by the high cost and low accuracy of automatic speech transcription. To address these challenges, we examine whether using only partial conversations yields accuracy comparable to using the entire customer-agent conversations. We exploit two interesting characteristics of call center calls. First, contact center calls are highly scripted following prescribed steps, and the customer’s problem (i.e., the determinant of the call type) is typically stated at the beginning of a call. Second, agents often more clearly repeat or rephrase what customers said. Our experiments with 1,677 customer calls show that two partial transcripts comprising only the agents’ utterances and the first 40 speaker turns actually produce slightly higher classification accuracy than the entire conversations. Using partial conversations can significantly reduce the cost for speech transcription.

#11A New Quality Measure for Topic Segmentation of Text and Speech

Mehryar Mohri (Courant Institute of Mathematical Sciences)
Pedro Moreno (Google, Inc.)
Eugene Weinstein (Courant Institute of Mathematical Sciences)

The recent proliferation of large multimedia collections has gathered immense attention from the speech research community, because speech recognition enables the transcription and indexing of such collections. Topicality information can be used to improve transcription quality and enable content navigation. In this paper, we give a novel quality measure for topic segmentation algorithms that improves over previously used measures. Our measure takes into account not only the presence or absence of topic boundaries but also the content of the text or speech segments labeled as topic-coherent. Additionally, we demonstrate that topic segmentation quality of spoken language can be improved using speech recognition lattices. Using lattices, improvements over the baseline one-best topic model are observed when measured with the previously existing topic segmentation quality measure, as well as the new measure proposed in this paper (9.4% and 7.0% relative error reduction, respectively).

#12Concept Segmentation and Labeling for Conversational Speech

Marco Dinarelli (University of Trento)
Alessandro Moschitti (University of Trento)
Giuseppe Riccardi (University of Trento)

Spoken Language Understanding performs automatic concept labeling and segmentation of speech utterances. For this task, many approaches have been proposed based on both generative and discriminative learning models. While all these methods have shown remarkable accuracy on manual transcriptions of spoken utterances, robustness to noisy automatic transcription is still an open issue. In this paper we study algorithms for Spoken Language Understanding that combine complementary learning models: Stochastic Finite State Transducers produce a list of hypotheses, which are re-ranked using a discriminative algorithm based on kernel methods. Our experiments on two different spoken dialog corpora, MEDIA and LUNA, show that the combined generative-discriminative model matches state-of-the-art approaches such as Conditional Random Fields (CRF) on manual transcriptions, and that it is robust to noisy automatic transcriptions, outperforming the state-of-the-art in some cases.

Thu-Ses1-P3:
Automatic Speech Recognition: Language Models II

Time:Thursday 10:00 Place:Hewison Hall Type:Poster
Chair:Mari Ostendorf

#1Multiple Text Segmentation for Statistical Language Modeling

Sopheap Seng (LIG Laboratory)
Laurent Besacier (LIG Laboratory)
Brigitte Bigi (LIG Laboratory)
Eric Castelli (MICA International Research Center)

In this article we deal with the text segmentation problem in statistical language modeling for under-resourced languages with a writing system without word boundary delimiters. While the lack of text resources has a negative impact on the performance of language models, the errors introduced by automatic word segmentation make those data even less usable. To better exploit the text resources, we propose a method to estimate the N-gram language model from a training corpus in which each sentence is segmented in multiple ways instead of a unique segmentation. The multiple segmentations generate more N-grams from the training corpus and allow obtaining N-grams not found in the unique segmentation. We use this approach to train the language models for automatic speech recognition systems of the Khmer and Vietnamese languages, and the multiple segmentations lead to better performance than the unique segmentation approach.
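
The effect of multiple segmentations on N-gram collection can be seen in a small sketch. It enumerates every lexicon-consistent segmentation of an unsegmented string and pools the bigrams from all of them. The toy lexicon, the equal weighting of segmentations and the exhaustive enumeration are assumptions for illustration; the paper may restrict or weight the segmentations differently, and exhaustive enumeration can grow exponentially on long sentences.

```python
from collections import Counter

def segmentations(text, lexicon):
    """Yield every way of splitting text into words found in lexicon."""
    if not text:
        yield []
        return
    for end in range(1, len(text) + 1):
        word = text[:end]
        if word in lexicon:
            for rest in segmentations(text[end:], lexicon):
                yield [word] + rest

def bigram_counts(text, lexicon):
    """Pool bigram counts over all segmentations of text."""
    counts = Counter()
    for words in segmentations(text, lexicon):
        counts.update(zip(words, words[1:]))
    return counts

# Hypothetical lexicon over a toy alphabet; "ab" is both a word and a
# concatenation of the words "a" and "b".
LEXICON = {"a", "b", "ab", "c"}
all_segmentations = list(segmentations("abc", LEXICON))
pooled = bigram_counts("abc", LEXICON)
```

A single segmentation of "abc" would contribute either the bigram ("ab", "c") or the bigrams ("a", "b") and ("b", "c"); pooling both segmentations keeps all three.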

#2Measuring Tagging Performance of a Joint Language Model

Denis Filimonov (University of Maryland)
Mary Harper (HLTCOE Johns Hopkins University, University of Maryland)

Predicting syntactic information in a joint language model has been shown not only to improve the model at its main task of predicting words, but also to allow this information to be passed to other applications, such as an ASR system. This raises the question of just how accurate the syntactic information predicted by the LM is. In this paper, we present a joint language model designed not only to scale to large quantities of training data, but also to be able to utilize fine-grain syntactic information, as well as other features, such as morphology and prosody. We evaluate the accuracy of our model at predicting syntactic information on the POS tagging task against state-of-the-art POS taggers, and on perplexity against the n-gram model.

#3Improved Language Modelling Using Bag of Word Pairs

Langzhou Chen (Toshiba Research Europe Limited)
KK Chin (Toshiba Research Europe Limited)
Kate Knill (Toshiba Research Europe Limited)

The bag-of-words (BoW) method has been used widely in language modelling and information retrieval. In this paper, the concept of BoW is extended to Bag-of-Word Pairs (BoWP), which expresses the document as a group of word pairs. Using word pairs as a unit, the system can capture more complex semantic information than BoW. Under the LSA framework, the BoWP system is shown to improve both perplexity and word error rate (WER) compared to a BoW system.
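
The difference between the two representations can be sketched in a few lines. Here a word pair is taken to be an unordered pair of distinct word types occurring in the same document, counted once per document. That definition is an assumption for illustration, since the abstract does not say how pairs are formed or counted.

```python
from collections import Counter
from itertools import combinations

def bag_of_words(tokens):
    """Classic BoW: term frequencies, word order discarded."""
    return Counter(tokens)

def bag_of_word_pairs(tokens):
    """BoWP sketch: each unordered pair of distinct word types, once per document."""
    return Counter(combinations(sorted(set(tokens)), 2))

DOC = "stock market falls as stock prices drop".split()
bow = bag_of_words(DOC)
bowp = bag_of_word_pairs(DOC)
```

For a document with V distinct words this yields V(V-1)/2 pair features, so a practical system would need some way of pruning or selecting pairs.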

#4Morphological Analysis and Decomposition for Arabic Speech-to-Text Systems

Frank Diehl (University of Cambridge)
Mark Gales (University of Cambridge)
Marcus Tomalin (University of Cambridge)
Phil Woodland (University of Cambridge)

Language modelling for a morphologically complex language such as Arabic is a challenging task. Its agglutinative structure results in data sparsity problems and high out-of-vocabulary rates. In this work these problems are tackled by applying the MADA tools to the Arabic text. In addition to morphological decomposition, MADA performs context-dependent stem-normalisation. Thus, if word-level system combination, or scoring, is required this normalisation must be reversed. To address this, a novel context-sensitive method for morpheme-to-word conversion is introduced. The performance of the MADA decomposed system was evaluated on an Arabic broadcast transcription task. The MADA-based system out-performed the word-based system, with both the morphological decomposition and stem normalisation being found to be important.

#5Investigating the Use of Morphological Decomposition and Diacritization for Improving Arabic LVCSR

Amr Ibrahim El-Desoky (Chair of Computer Science 6, RWTH-Aachen University, Germany)
Christian Gollan (Chair of Computer Science 6, RWTH-Aachen University, Germany)
David Rybach (Chair of Computer Science 6, RWTH-Aachen University, Germany)
Ralf Schlueter (Chair of Computer Science 6, RWTH-Aachen University, Germany)
Hermann Ney (Chair of Computer Science 6, RWTH-Aachen University, Germany)

One of the challenges related to large vocabulary Arabic speech recognition is the rich morphology of the Arabic language, which leads to both high out-of-vocabulary (OOV) rates and high language model (LM) perplexities. Another challenge is the absence of the short vowels (diacritics) from Arabic written transcripts, which causes a large difference between spoken and written language and thus a weaker connection between the acoustic and language models. In this work, we try to address these two important challenges by introducing both morphological decomposition and diacritization in Arabic language modeling. Finally, we are able to obtain about 3.7% relative reduction in word error rate (WER) with respect to a comparable non-diacritized full-words system running on our test set.

#6Topic Dependent Language Model based on Topic Voting on Noun History

Welly Naptali (Toyohashi University of Technology)
Masatoshi Tsuchiya (Toyohashi University of Technology)
Seiichi Nakagawa (Toyohashi University of Technology)

Language models (LMs) are important in automatic speech recognition systems. In this paper, we propose a new approach to a topic dependent LM, where the topic is decided in an unsupervised manner. Latent Semantic Analysis (LSA) is employed to reveal hidden (latent) relations among nouns in the context words. To decide the topic of an event, a fixed size word history sequence (window) is observed, and voting is then carried out based on noun class occurrences weighted by a confidence measure. Experiments on the Wall Street Journal corpus and Mainichi Shimbun (Japanese newspaper) corpus show that our proposed method gives better perplexity than the comparative baselines, including a word-based/class-based n-gram LM, their interpolated LM, a cache-based LM, and the Latent Dirichlet Allocation (LDA)-based topic dependent LM.
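
The voting step can be sketched as below: nouns in a fixed-size history window each vote for the topic of their noun class, with votes weighted by a confidence value. The hand-written noun-to-topic table and confidence values are hypothetical stand-ins for what the paper derives with LSA, and the function name and window size are likewise assumptions.

```python
from collections import defaultdict

def vote_topic(history, noun_topic, confidence, window=4):
    """Pick the topic with the largest confidence-weighted vote in the window."""
    votes = defaultdict(float)
    for word in history[-window:]:
        if word in noun_topic:                      # only nouns may vote
            votes[noun_topic[word]] += confidence.get(word, 1.0)
    return max(votes, key=votes.get) if votes else None

# Hypothetical noun classes and confidence weights.
NOUN_TOPIC = {"bank": "finance", "rates": "finance", "river": "nature"}
CONFIDENCE = {"bank": 0.5, "rates": 0.6, "river": 1.0}

HISTORY = ["the", "bank", "raised", "rates", "river"]
topic = vote_topic(HISTORY, NOUN_TOPIC, CONFIDENCE)
```

The selected topic would then determine which topic-dependent LM predicts the next word; in this toy history, two weaker finance votes (0.5 + 0.6) outweigh one stronger nature vote (1.0).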

#7Investigation of Morph-based Speech Recognition Improvements across Speech Genres

Péter Mihajlik (Dept. of Telecommunications and Media Informatics, Budapest University of Technology & Economics, Hungary)
Balázs Tarján (AITIA International Inc., Budapest, Hungary)
Zoltán Tüske (Dept. of Telecommunications and Media Informatics, Budapest University of Technology & Economics, Hungary)
Tibor Fegyó (Dept. of Telecommunications and Media Informatics, Budapest University of Technology & Economics, Hungary)

The improvement achieved by changing the basis of speech recognition from words to morphs (various sub-word units) varies greatly across tasks and languages. We attempt to explore the source of this variability by investigating three LVCSR tasks corresponding to three speech genres of a highly agglutinative language. Novel, press conference and broadcast news transcription results are presented and compared to spontaneous speech recognition results in several experimental setups. A noticeable correlation is observed between an easily computable characteristic of various language speech recognition tasks and the relative improvements due to (statistical) morph-based approaches.

#8Effective use of pause information in language modelling for speech recognition

Kengo Ohta (Department of Information and Computer Sciences, Toyohashi University of Technology, Japan)
Masatoshi Tsuchiya (Information and Media Center, Toyohashi University of Technology, Japan)
Seiichi Nakagawa (Department of Information and Computer Sciences, Toyohashi University of Technology, Japan)

This paper addresses the mismatch between the speech processing units used by a speech recognizer and the sentences of corpora. A standard speech recognizer divides input speech into speech processing units based on its power information. On the other hand, the training corpora of language models are divided into sentences based on punctuation. There is an inevitable mismatch between speech processing units and sentences, and neither is optimal for a spontaneous speech recognition task. This paper presents two methods to address this problem. First, the words of the preceding units are used to predict the words of the succeeding units, in order to address the mismatch between speech processing units and optimal units. Second, we propose a method to build a language model that includes short pauses from a corpus with no short pauses, to address the mismatch between speech processing units and sentences. Their combination achieved a 4.5% relative improvement over the conventional method in the meeting speech recognition task.

#9A Parallel Training Algorithm for Hierarchical Pitman-Yor Process Language Models

Songfang Huang (CSTR, The University of Edinburgh)
Steve Renals (CSTR, The University of Edinburgh)

The Hierarchical Pitman-Yor Process Language Model (HPYLM) is a Bayesian language model based on a non-parametric prior, the Pitman-Yor Process. It has been demonstrated, both theoretically and practically, that the HPYLM can provide better smoothing for language modeling, compared with state-of-the-art approaches such as interpolated Kneser-Ney and modified Kneser-Ney smoothing. However, estimation of Bayesian language models is expensive in terms of both computation time and memory; the inference is approximate and requires a number of iterations to converge. In this paper, we present a parallel training algorithm for the HPYLM, which enables the approach to be applied in the context of automatic speech recognition, using large training corpora with large vocabularies. We demonstrate the effectiveness of the proposed algorithm by estimating language models from corpora for meeting transcription containing over 200 million words, and observe significant reductions in perplexity and word error rate.
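
For orientation, the predictive probability usually given for a hierarchical Pitman-Yor language model can be written as below. This is the standard textbook form and is not taken from the paper; the symbols are assumptions introduced here: $c_{uw}$ is the number of customers (tokens) of word $w$ in context $u$, $t_{uw}$ the number of tables serving $w$, $c_{u\cdot}$ and $t_{u\cdot}$ their sums over words, $d_{|u|}$ and $\theta_{|u|}$ the discount and strength parameters for contexts of length $|u|$, and $\pi(u)$ the context $u$ with its earliest word dropped.

```latex
P(w \mid u) \;=\;
  \frac{c_{uw} - d_{|u|}\, t_{uw}}{\theta_{|u|} + c_{u\cdot}}
  \;+\;
  \frac{\theta_{|u|} + d_{|u|}\, t_{u\cdot}}{\theta_{|u|} + c_{u\cdot}}\;
  P\bigl(w \mid \pi(u)\bigr)
```

The first term discounts observed counts and the second backs off to the shorter context, which is why the model resembles interpolated Kneser-Ney smoothing. The seating variables $t_{uw}$ have to be sampled, which accounts for the iterative, memory-hungry inference the abstract mentions.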

#10Probabilistic and Possibilistic Language Models Based on the World Wide Web

Stanislas Oger (LIA - University of Avignon)
Georges Linarès (LIA - University of Avignon)
Vladimir Popescu (LIA - University of Avignon)

Usually, language models are built either from a closed corpus, or by using World Wide Web retrieved documents, which are considered as a closed corpus themselves. In this paper we propose several other ways, more adapted to the nature of the Web, of using this resource for language modeling. We first start by improving an approach consisting in estimating n-gram probabilities from Web search engine statistics. Then, we propose a new way of considering the information extracted from the Web in a probabilistic framework. Then, we also propose to rely on Possibility Theory for effectively using this kind of information. We compare these two approaches on two automatic speech recognition tasks: (i) transcribing broadcast news data, and (ii) transcribing domain-specific data, concerning surgical operation film comments. We show that the two approaches are effective in different situations.
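
As a rough illustration of the baseline idea of estimating n-gram probabilities from search engine statistics, the sketch below takes the ratio of two hit counts. The `HITS` table, the function name and the `floor` value are hypothetical stand-ins for real search engine queries; the paper's improved estimator and its possibilistic variant are not described in the abstract and are not reproduced here.

```python
from typing import Dict, Sequence, Tuple

# Hypothetical hit counts standing in for search-engine statistics.
HITS: Dict[Tuple[str, ...], int] = {
    ("speech",): 1000,
    ("speech", "recognition"): 400,
    ("speech", "synthesis"): 100,
}


def web_ngram_prob(history: Sequence[str], word: str,
                   hits: Dict[Tuple[str, ...], int],
                   floor: float = 1e-9) -> float:
    """Relative-frequency estimate of P(word | history) from hit counts.

    Returns `floor` (an assumed small constant) when the history or the
    full n-gram was never seen, so the estimate is never exactly zero.
    """
    h = tuple(history)
    denom = hits.get(h, 0)
    if denom == 0:
        return floor
    return max(hits.get(h + (word,), 0) / denom, floor)


p = web_ngram_prob(["speech"], "recognition", HITS)
```

Raw ratios of this kind are not normalised across the vocabulary, which is one reason such counts need extra treatment before use in a recogniser.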

Thu-Ses1-P2:
Voice Transformation II

Time:Thursday 10:00 Place:Hewison Hall Type:Poster
Chair:Tomoki Toda

#1HMM adaptation and voice conversion for the synthesis of child speech: a comparison

Oliver Watts (Centre for Speech Technology Research, University of Edinburgh, UK)
Junichi Yamagishi (Centre for Speech Technology Research, University of Edinburgh, UK)
Simon King (Centre for Speech Technology Research, University of Edinburgh, UK)
Kay Berkling (Inline Internet Online Dienste GmbH, Germany)

This study compares two different methodologies for producing data-driven synthesis of child speech from existing systems that have been trained on the speech of adults. On one hand, an existing statistical parametric synthesiser is transformed using model adaptation techniques, informed by linguistic and prosodic knowledge, to the speaker characteristics of a child speaker. This is compared with the application of voice conversion techniques to convert the output of an existing waveform concatenation synthesiser with no explicit linguistic or prosodic knowledge. In a subjective evaluation of the similarity of synthetic speech to natural speech from the target speaker, the HMM-based systems evaluated are generally preferred, although this is at least in part due to the higher dimensional acoustic features supported by these techniques.

#2HMM-based Speaker Characteristics Emphasis Using Average Voice Model

Takashi Nose (Tokyo Institute of Technology)
Junichi Asada (Tokyo Institute of Technology)
Takao Kobayashi (Tokyo Institute of Technology)

This paper presents a technique for controlling and emphasizing speaker characteristics of synthetic speech. The key idea comes from the way professional impersonators imitate voices. In voice imitation, impersonators effectively exaggerate a target speaker's voice characteristics. To model and control the degree of speaker characteristics, we use a speech synthesis framework based on multiple-regression hidden semi-Markov model (MRHSMM). In MRHSMM, mean parameters are given by multiple regression of a low-dimensional control vector. The control vector represents how much the target speaker's model parameters differ from those of the average voice model. By changing the control vector in speech synthesis, we can control the degree of voice characteristics of the target speaker. Results of subjective experiments show that the speaker reproducibility of synthetic speech is improved by emphasizing speaker characteristics.

#3An Evaluation Methodology for Prosody Transformation Systems based on Chirp Signals

Damien LOLIVE (IRISA / University of Rennes 1 - ENSSAT)
Nelly BARBOT (IRISA / University of Rennes 1 - ENSSAT)
Olivier BOEFFARD (IRISA / University of Rennes 1 - ENSSAT)

Evaluation of prosody transformation systems is an important issue. First, the existing evaluation methodologies focus on parallel evaluation of systems and are not applicable to compare parallel and non-parallel systems. Secondly, these methodologies do not guarantee the independence from other features such as the segmental component. In particular, its influence cannot be neglected during evaluation and introduces a bias in the listening test. To answer these problems, we propose an evaluation methodology that depends only on the melody of the voice and that is applicable in a non-parallel context. Given a melodic contour, we propose to build an audio whistle from a chirp signal model. Experimental results show the efficiency of the proposed method concerning the discrimination of voices using only their melody information. An example of transformation function is also given and the results confirm the applicability of this methodology.
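
The idea of turning a melodic contour into an audio whistle can be sketched as a sinusoid whose instantaneous frequency follows the contour. The version below is a minimal illustration under stated assumptions, not the authors' chirp signal model: the frame period, sample rate, amplitude and linear interpolation between frames are all choices made here, and the contour is assumed to be continuous (no unvoiced gaps).

```python
import math
from typing import List, Sequence


def whistle_from_contour(f0_hz: Sequence[float],
                         frame_s: float = 0.01,
                         sample_rate: int = 16000,
                         amplitude: float = 0.5) -> List[float]:
    """Synthesize a sine 'whistle' whose frequency follows an F0 contour.

    f0_hz holds one F0 value per frame; the frequency is linearly
    interpolated between consecutive frames and the phase is accumulated
    sample by sample so the waveform has no phase jumps.
    """
    samples_per_frame = int(round(frame_s * sample_rate))
    out: List[float] = []
    phase = 0.0
    for i in range(len(f0_hz) - 1):
        for n in range(samples_per_frame):
            frac = n / samples_per_frame
            freq = f0_hz[i] + frac * (f0_hz[i + 1] - f0_hz[i])
            out.append(amplitude * math.sin(phase))
            phase += 2.0 * math.pi * freq / sample_rate
    return out


signal = whistle_from_contour([200.0, 300.0, 250.0])
```

Because only the melody is rendered, a listener hears no segmental information, which is the property the evaluation methodology relies on.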

#4Voice Morphing based on Interpolation of Vocal Tract Area Functions Using AR-HMM Analysis of Speech

Yoshiki Nambu (University of Tsukuba)
Masahiko Mikawa (University of Tsukuba)
Kazuyo Tanaka (University of Tsukuba)

This paper presents a new voice morphing method which focuses on the continuity of phonological identity over all interpolated and extrapolated regions. The main features of the method are 1) separation of the characteristics of vocal tract area resonances from those of vocal cord waves by using AR-HMM analysis of speech, 2) interpolation in a log vocal tract area function domain, and 3) independent morphing of the vocal tract resonances and vocal cord wave characteristics. Using a morphing system built on a statistical conversion method, the continuity of formants and the perceptual difference between a conventional method and the proposed method are confirmed.

#5A Novel Model-based Pitch Conversion Method for Mandarin Speech

Hsin-Te Hwang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Chen-Yu Chiang (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Po-Yi Sung (Dept. Communication Engineering, National Chiao Tung University, Taiwan)
Sin-Horng Chen (Dept. Communication Engineering, National Chiao Tung University, Taiwan)

In this paper, a novel model-based pitch conversion method for Mandarin is presented and compared with two other conventional conversion methods, i.e. the mean/variance transformation approach and the GMM-based mapping approach. The syllable pitch contour is first quantized by 3rd-order orthogonal expansion coefficients; then, the source and target speakers’ prosodic models are constructed, respectively. Two mapping methods based on the prosodic model are presented. Objective tests confirmed that one of the proposed methods is superior to the conventional methods. Some findings in informal listening tests and objective tests are worth further investigation.
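
The mean/variance transformation named above as a conventional baseline can be sketched as follows. This is a generic version, not the paper's implementation: working in the log-F0 domain, using population standard deviations, and the function and argument names are all assumptions made for illustration.

```python
import math
from statistics import mean, pstdev
from typing import List, Sequence


def mean_variance_convert(src_f0: Sequence[float],
                          src_train: Sequence[float],
                          tgt_train: Sequence[float]) -> List[float]:
    """Shift and scale source log-F0 to match the target's mean and std.

    src_train and tgt_train are voiced F0 values (Hz) from each speaker's
    training data; src_f0 is the contour to convert.
    """
    log_src = [math.log(f) for f in src_train]
    log_tgt = [math.log(f) for f in tgt_train]
    mu_s, sd_s = mean(log_src), pstdev(log_src)
    mu_t, sd_t = mean(log_tgt), pstdev(log_tgt)
    return [math.exp(mu_t + (sd_t / sd_s) * (math.log(f) - mu_s))
            for f in src_f0]


converted = mean_variance_convert([100.0, 120.0, 140.0],
                                  [100.0, 120.0, 140.0],
                                  [200.0, 220.0, 260.0])
```

A single global shift and scale of this kind cannot change the shape of a syllable's contour, which is the limitation that model-based methods aim to overcome.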

#6Observation of empirical cumulative distribution of vowel spectral distances and its application to vowel based voice conversion

Hideki Kawahara (Wakayama University)
Masanori Morise (Ritsumeikan University)
Toru Takahashi (Kyoto University)
Hideki Banno (Meijo University)
Ryuichi Nisimura (Wakayama University)
Toshio Irino (Wakayama University)

A simple and fast voice conversion method based only on vowel information is proposed. The proposed method relies on the empirical distribution of perceptual spectral distances between representative examples of each vowel segment extracted using the TANDEM-STRAIGHT spectral envelope estimation procedure. Mapping functions of vowel spectra are designed to preserve the vowel space structure defined by the observed empirical distribution while transforming the position and orientation of the structure in an abstract vowel spectral space. By introducing physiological constraints on vocal tract shapes and vocal tract length normalization, difficulties in careful frequency alignment between vowel template spectra of the source and the target speakers can be alleviated without significant degradation in converted speech. The proposed method is a frame-based instantaneous method and is suitable for real-time processing. Applications of the proposed method in cross-language voice conversion are also discussed.

#7Japanese Pitch Conversion for Voice Morphing Based on Differential Modeling

Ryuki Tachibana (Tokyo Research Lab., IBM Research)
Zhiwei Shuang (China Research Lab., IBM Research)
Masafumi Nishimura (Tokyo Research Lab., IBM Research)

In this paper, we convert the pitch contours predicted by a TTS system that models a source speaker to resemble the pitch contours of a target speaker. When the speaking styles of the speakers are very different, complex conversions such as adding or deleting pitch peaks may be required. Our method does the conversions by modeling the direct pitch features and differential pitch features at the same time based on linguistic features. The differential pitch features are calculated from matched pairs of source and target pitch values. We show experimental results in which the target speaker's characteristics are successfully modeled based on a very limited training corpus. The proposed pitch conversion method stretches the possibilities of TTS customization for various speaking styles.

#8A Novel Technique for Voice Conversion Based on Style and Content Decomposition with Bilinear Models

Victor Popa (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)
Jani Nurminen (Devices, Nokia, Tampere, Finland)
Moncef Gabbouj (Department of Signal Processing, Tampere University of Technology, Tampere, Finland)

This paper presents a novel technique for voice conversion by solving a two-factor task using bilinear models. The spectral content of the speech represented as line spectral frequencies is separated into so-called style and content parameterizations using a framework proposed in [1]. This formulation of the voice conversion problem in terms of style and content offers a flexible representation of factor interactions and facilitates the use of efficient training algorithms based on singular value decomposition and expectation maximization. Promising results in a comparison with the traditional Gaussian mixture model based method indicate increased robustness with small training sets.

#9Rule-Based Voice Quality Variation with Formant Synthesis

Felix Burkhardt (Deutsche Telekom Laboratories)

We describe an approach to simulate different phonation types, following John Laver’s terminology, by means of a hybrid (rule-based and unit-concatenating) formant synthesizer. Different voice qualities were generated by following hints from the literature and applying the revised KLGLOTT88 model. In a listener perception experiment, we show that the phonation types are distinguished by the listeners and lead to emotional impressions as predicted by the literature. The synthesis system and its source code, as well as audio samples, can be downloaded at http://emoSyn.syntheticspeech.de/.

Thu-Ses2-O1:
User Interactions in Spoken Dialog Systems

Time:Thursday 13:30 Place:Main Hall Type:Oral
Chair: Roberto Pieraccini

13:30Learning the Structure of Human-Computer and Human-Human Dialogs

David Griol (Universidad Carlos III de Madrid)
Giuseppe Riccardi (University of Trento)
Emilio Sanchis (Universitat Politecnica de Valencia)

We are interested in the problem of understanding human conversation structure in the context of human-machine and human-human interaction. We present a statistical methodology for detecting the structure of spoken dialogs based on a generative model learned using decision trees. To evaluate our approach we have used the LUNA corpora, collected from real users engaged in problem solving tasks. The results of the evaluation show that automatic segmentation of spoken dialogs is very effective with models built separately from human-machine dialogs or human-human dialogs, and that it is also possible to infer the task-related structure of human-human dialogs with a model learned using only human-machine dialogs.

13:50Pause and gap length in face-to-face interaction

Jens Edlund (KTH Speech Music & Hearing, Stockholm, Sweden)
Mattias Heldner (KTH Speech Music & Hearing, Stockholm, Sweden)
Julia Hirschberg (Columbia University, New York, USA)

It has long been noted that conversational partners tend to exhibit increasingly similar pitch, intensity, and timing behavior over the course of a conversation. However, the metrics developed to measure this similarity to date have generally failed to capture the dynamic temporal aspects of this process. In this paper, we propose new approaches to measuring interlocutor similarity in spoken dialogue. We define similarity in terms of convergence and synchrony and propose approaches to capture these, illustrating our techniques on gap and pause production in Swedish spontaneous dialogues.

14:10Modeling Other Talkers for Improved Dialog Act Recognition in Meetings

Kornel Laskowski (Carnegie Mellon University)
Elizabeth Shriberg (SRI International)

Automatic dialog act (DA) modeling has been shown to benefit meeting understanding, but current approaches to DA recognition tend to suffer from a common problem: they under-represent behaviors found at turn edges, during which the "floor" is negotiated among meeting participants. We propose a new approach that takes into account speech from other talkers, relying only on speech/non-speech information from all participants. We find (1) that modeling other participants improves DA detection, even in the absence of other information, (2) that only the single locally most talkative other participant matters, and (3) that 10 seconds provides a sufficiently large local context. Results further show significant performance improvements over a lexical-only system, particularly for the DAs of interest. We conclude that interaction-based modeling at turn edges can be achieved by relatively simple features and should be incorporated for improved meeting understanding.

14:30A Closer Look at Quality Judgments of Spoken Dialog Systems

Klaus-Peter Engelbrecht (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Felix Hartard (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Florian Gödde (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)
Sebastian Möller (Quality and Usability Lab, Deutsche Telekom Laboratories, TU Berlin, Germany)

User judgments of Spoken Dialog Systems provide evaluators of such systems with a valid measure of their overall quality. Models for the automatic prediction of user judgments have been built, following the introduction of PARADISE (Walker et al. 1997). Main applications are the comparison of systems, the analysis of parameters affecting quality, and the adoption of dialog management strategies. However, a common model which applies to different systems and users has not been found so far. With the aim of getting a closer insight into the quality-relevant characteristics of spoken interactions, an experiment was conducted where 25 users judged the same 5 dialogs. User judgments were collected after each dialog turn. The paper presents an analysis of the obtained results and some conclusions for future work.

14:50New Methods for the Analysis of Repeated Utterances

Geoffrey Zweig (Microsoft Research)

This paper proposes three novel and effective procedures for jointly analyzing repeated utterances. First, we propose repetition-driven system switching, where repetition triggers the use of an independent backup system for decoding. Second, we propose a cache language model for use with the second utterance. Finally, we propose a method with which the acoustics from multiple utterances - not necessarily exact repetitions of each other - can be combined into a composite that increases accuracy. The combination of all methods produces a relative increase in sentence accuracy of 65.7% for repeated voice-search queries.
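
The abstract does not specify the form of the cache language model, so the following is only a generic sketch of the idea: words hypothesised for the first utterance are cached, and their probabilities are boosted when decoding the repetition by interpolating a unigram cache with the base model. The function name, the interpolation weight and the example words are all hypothetical.

```python
from collections import Counter
from typing import Sequence


def cache_lm_prob(word: str,
                  first_hypothesis: Sequence[str],
                  base_prob: float,
                  cache_weight: float = 0.3) -> float:
    """Interpolate a base LM probability with a unigram cache.

    first_hypothesis is the word sequence recognised for the first
    utterance; base_prob is the base LM's probability for `word` in the
    current context. cache_weight is an assumed tuning parameter.
    """
    cache = Counter(first_hypothesis)
    total = sum(cache.values())
    p_cache = cache[word] / total if total else 0.0
    return (1.0 - cache_weight) * base_prob + cache_weight * p_cache


boosted = cache_lm_prob("pizza", ["pizza", "hut", "seattle"], base_prob=0.01)
```

With an empty cache the function simply scales the base probability, so in practice the weight would be applied only when a repetition has been detected.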

15:10The Effects of Different Voices for Speech-Based In-Vehicle Interfaces: Impact of Young and Old Voices on Driving Performance and Attitude

Ing-Marie Jonsson (Department of Computer and Information Science, Linköping University, Linköping, Sweden)
Nils Dahlbäck (Department of Computer and Information Science, Linköping University, Linköping, Sweden)

This paper investigates how matching the age of the driver with the age of the voice in a conversational in-vehicle information system affects attitudes and performance. Thirty-six participants from two age groups, 55-75 and 18-25, interacted with a conversational system with a young or an old voice in a driving simulator. Results show that all drivers would rather communicate with a young voice than an old voice in the car. This willingness to communicate had a detrimental effect on driving performance. It is hence important to carefully select voices, since voice properties can have enormous effects on driving safety. Clearly, one voice doesn’t fit all.

Thu-Ses2-O2:
Production: Articulation and acoustics

Time:Thursday 13:30 Place:East Wing 1 Type:Oral

13:30In search of Non-uniqueness in the Acoustic-to-Articulatory Mapping

Gopal Ananthakrishnan (Department of Speech Music and Hearing, CSC, KTH, Stockholm, Sweden)
Daniel Neiberg (Department of Speech Music and Hearing, CSC, KTH, Stockholm, Sweden)
Olov Engwall (Department of Speech Music and Hearing, CSC, KTH, Stockholm, Sweden)

This paper explores the possibility and extent of non-uniqueness in the acoustic-to-articulatory inversion of speech, from a statistical point of view. It proposes a technique to estimate the non-uniqueness, based on finding peaks in the conditional probability function of the articulatory space. The paper corroborates the existence of non-uniqueness in a statistical sense, especially in stop consonants, nasals and fricatives. The relationship between the importance of the articulator position and non-uniqueness at each instance is also explored.

13:50Estimation of articulatory gesture patterns from speech acoustics

Prasanta Ghosh (Department of Electrical Engineering, University of Southern California, LA, CA, 90089)
Shrikanth Narayanan (Department of Electrical Engineering, University of Southern California, LA, CA, 90089)
Pierre Divenyi (EBIRE, Martinez, CA 94553)
Louis Goldstein (Department of Linguistics, University of Southern California, LA, CA, 90089)
Elliot Saltzman (Haskins Laboratories, New Haven, CT 06511)

We investigated dynamic programming (DP) and state-model (SM) approaches for estimating gestural scores from speech acoustics. We performed a word-identification task using the gestural pattern vector sequences estimated by each approach. For a set of 75 randomly chosen words, we obtained the best word-identification accuracy (66.67%) using the DP approach. This result implies that considerable support for lexical access during speech perception might be provided by such a method of recovering gestural information from acoustics.

14:10Formant Trajectories for Acoustic-to-Articulatory Inversion

İ Yücel Özbek (METU)
Mark Hasegawa-Johnson (UIUC)
Mübeccel Demirekler (METU)

This work examines the utility of formant frequencies and their energies in acoustic-to-articulatory inversion. For this purpose, formant frequencies and formant spectral amplitudes are automatically estimated from audio, and are treated as observations for the purpose of estimating electromagnetic articulography (EMA) coil positions. A mixture Gaussian regression model with mel-frequency cepstral (MFCC) observations is modified by using formants and energies to either replace or augment the MFCC observation vector. The augmented observation results in 3.4% lower RMS error, and 2% higher correlation coefficient, than the baseline MFCC observation. Improvement is especially good for plosive consonants, possibly because formant tracking provides information about the acoustic resonances that would be otherwise unavailable during plosive closure and release.

14:30A robust variational method for the acoustic-to-articulatory problem

Blaise Potard (LORIA / Nancy Université)
Yves Laprie (LORIA / CNRS)

This paper presents a novel acoustic-to-articulatory method based on an articulatory synthesizer and variational calculus, without the need for an initial trajectory. Validation in ideal conditions is performed to show the potential of the method, and its performance is compared to that of codebook-based methods. We also investigate the precision of the articulatory trajectories found for various acoustic vector dimensions. Possible extensions are discussed.

14:50Comparison of Vowel Structures of Japanese and English in Articulatory and Auditory Spaces

Jianwu Dang (Japan Advanced Institute of Science and Technology, Japan)
Mark Tiede (Haskins Laboratories and MIT R.L.E., USA)
Jiahong Yuan (University of Pennsylvania, USA)

In this study, we investigated the vowel structures of Japanese and English in both articulatory space and auditory perceptual space using Laplacian eigenmaps, and examined relations between speech production and perception. Results showed that the vowel structure reflects the articulatory features for both languages. The degree of tongue-palate approximation is the most important feature for vowels, followed by the open ratio of the mouth to oral cavity. The topological relations of the vowel structures are consistent with both the articulatory and auditory perceptual spaces; in particular the lip-protruded vowel /UW/ of English was distinct from the unrounded Japanese /ɯ/. The rhotic vowel /ER/ was located apart from the surface constructed by the other vowels, where the same phenomena appeared in both spaces.

15:10The articulatory and acoustic impact of Scottish English /r/ on the preceding vowel-onset

Janine Lilienthal (Queen Margaret University, Edinburgh, UK)

This paper demonstrates the use of smoothing spline ANOVA and T tests to analyze whether the influence of syllable final consonants on the preceding vowel differs for articulation and acoustics. The onset of vowels either followed by phrase-final /r/ or by phrase-initial /r/ is compared for two Scottish English speakers. To measure articulatory differences of opposing vowel pairs, smoothing splines of midsagittal tongue shape recorded via ultrasound imaging are compared. For the acoustic data, differences of the first two formant frequencies at the onset are tested. The results confirm that there is no 1:1 mapping between articulation and acoustics.

Thu-Ses2-O3:
Features for Speech and Speaker Recognition

Time:Thursday 13:30 Place:East Wing 2 Type:Oral
Chair: Thomas Hain

13:30Static and Dynamic Modulation Spectrum for Speech Recognition

Sriram Ganapathy (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)
Samuel Thomas (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)
Hynek Hermansky (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)

We present a feature extraction technique based on static and dynamic modulation spectrum derived from long-term envelopes in sub-bands. Estimation of the sub-band temporal envelopes is done using Frequency Domain Linear Prediction (FDLP). These sub-band envelopes are compressed with a static (logarithmic) and dynamic (adaptive loops) compression. The compressed sub-band envelopes are transformed into modulation spectral components which are used as features for speech recognition. Experiments are performed on a phoneme recognition task using a hybrid HMM-ANN phoneme recognition system and an ASR task using the TANDEM speech recognition system. The proposed features provide relative improvements of 3.8% and 11.5% in phoneme recognition accuracies for TIMIT and conversational telephone speech (CTS) respectively. Further, these improvements are found to be consistent for ASR tasks on the OGI-Digits database (relative improvement of 13.5%).

13:502-D Processing of Speech for Multi-Pitch Analysis

Tianyu T. Wang (MIT Lincoln Laboratory)
Thomas F. Quatieri (MIT Lincoln Laboratory)

This paper introduces a two-dimensional (2-D) processing approach for the analysis of multi-pitch speech sounds. Our framework invokes the short-space 2-D Fourier transform magnitude of a narrowband spectrogram, mapping harmonically-related signal components to multiple concentrated entities in a new 2-D space. First, localized time-frequency regions of the spectrogram are analyzed to extract pitch candidates. These candidates are then combined across multiple regions for obtaining separate pitch estimates of each speech-signal component at a single point in time. We refer to this as multi-region analysis (MRA). By explicitly accounting for pitch dynamics within localized time segments, this separability is distinct from that which can be obtained using short-time autocorrelation methods typically employed in state-of-the-art multi-pitch tracking algorithms. We illustrate the feasibility of MRA for multi-pitch estimation on mixtures of synthetic and real speech.

14:10A Correlation-Maximization Denoising Filter Used as An Enhancement Frontend for Noise Robust Bird Call Classification

Wei Chu (Department of Electrical Engineering, University of California, Los Angeles)
Abeer Alwan (Department of Electrical Engineering, University of California, Los Angeles)

In this paper, we propose a Correlation-Maximization denoising filter which utilizes periodicity information to remove additive noise in bird calls. We also developed a statistically-based noise robust bird-call classification system which uses the denoising filter as a frontend. Enhanced bird calls, which are the output of the denoising filter, are used for feature extraction. Gaussian Mixture Models (GMM) and Hidden Markov Models (HMM) are used for classification. Experiments on a large noisy corpus containing bird calls from 5 species have shown that the Correlation-Maximization filter is more effective than the Wiener filter in reducing the classification error rate of bird calls which have a quasi-periodic structure. The resulting 4.1% classification error rate is better than that of both the system without a denoising frontend and a system with a Wiener filter denoising frontend.

14:30Preliminary Inversion Mapping Results with a New EMA Corpus

Korin Richmond (CSTR, Informatics, Edinburgh University)

In this paper, we apply our inversion mapping method, the trajectory mixture density network (TMDN), to a new corpus of articulatory data, recorded with a Carstens AG500 electromagnetic articulograph. This new data set, mngu0, is relatively large and phonetically rich, among other beneficial characteristics. We obtain good results, with a root mean square (RMS) error of only 0.99mm. This compares very well with our previous lowest result of 1.54mm RMS error for equivalent coils of the MOCHA fsew0 EMA data. We interpret this as showing the mngu0 data set is potentially more consistent than the fsew0 data set, and is very useful for research which calls for articulatory trajectory data. It also supports our view that the TMDN is very much suited to the inversion mapping problem.

14:50Time-Varying Autoregressive Tests for Multiscale Speech Analysis

Daniel Rudoy (Harvard University)
Thomas F. Quatieri (MIT Lincoln Laboratory)
Patrick J. Wolfe (Harvard University)

In this paper we develop hypothesis tests for speech waveform nonstationarity based on time-varying autoregressive models, and demonstrate their efficacy in speech analysis tasks at both segmental and sub-segmental scales. Key to the successful synthesis of these ideas is our employment of a generalized likelihood ratio testing framework tailored to autoregressive coefficient evolutions suitable for speech. After evaluating our framework on speech-like synthetic signals, we present preliminary results for two distinct analysis tasks using speech waveform data. At the segmental level, we develop an adaptive short-time segmentation scheme and evaluate it on whispered speech recordings, while at the sub-segmental level, we address the problem of detecting the glottal flow closed phase. Results show that our hypothesis testing framework can reliably detect changes in the vocal tract parameters across multiple scales, thereby underscoring its broad applicability to speech analysis.

15:10Audio keyword extraction by unsupervised word discovery

Armando Muscariello (IRISA, Metiss Research Group, France)
Guillaume Gravier (IRISA, Metiss Research Group, France)
Frédéric Bimbot (IRISA, Metiss Research Group, France)

In real audio data, frequently occurring patterns often convey relevant information on the overall content of the data. Extracting meaningful portions of the main content by identifying such key patterns can be exploited for providing audio summaries and speeding up access to relevant parts of the data. We refer to these patterns as audio motifs, in analogy with the nomenclature of the counterpart task in biology. We describe a framework for the unsupervised discovery of audio motifs in streams, in which no acoustic or linguistic models are used. We define the fundamental problem by decomposing the overall task into elementary subtasks; then we propose a solution that combines a one-pass strategy that exploits the local repetitiveness of motifs and a dynamic programming technique to detect repetitions in audio. Results of an experiment on a broadcast show are presented to illustrate the effectiveness of the technique in providing audio summaries of real data.
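
The abstract does not say which dynamic programming technique is used, so the sketch below shows plain dynamic time warping (DTW) as a generic stand-in: two segments that are repetitions of the same motif, possibly at different speaking rates, should align with a low cumulative cost. Scalar features and an absolute-difference local cost are simplifications assumed here; real systems would compare frame-level feature vectors.

```python
from typing import List, Sequence


def dtw_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Cumulative cost of the best monotonic alignment between a and b."""
    inf = float("inf")
    n, m = len(a), len(b)
    # d[i][j] is the best cost of aligning a[:i] with b[:j].
    d: List[List[float]] = [[inf] * (m + 1) for _ in range(n + 1)]
    d[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            d[i][j] = cost + min(d[i - 1][j],      # stretch b's frame
                                 d[i][j - 1],      # stretch a's frame
                                 d[i - 1][j - 1])  # step both
    return d[n][m]


same_motif = dtw_distance([1.0, 2.0, 3.0], [1.0, 2.0, 2.0, 3.0])
different = dtw_distance([1.0, 2.0, 3.0], [2.0, 3.0, 4.0])
```

A motif detector would threshold such a distance, typically after normalising by path length, to decide whether a stored segment recurs in the incoming stream.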

Thu-Ses2-O4:
Speech and multimodal resources & annotation

Time:Thursday 13:30 Place:East Wing 3 Type:Oral
Chair:Kristiina Jokinen

13:30ASR Corpus Design for Resource-Scarce Languages

Etienne Barnard (Meraka Institute)
Marelie Davel (Meraka Institute)
Charl Johannes van Heerden (Meraka Institute)

We investigate the number of speakers and the amount of data that is required for the development of useable speaker-independent speech-recognition systems in resource-scarce languages. Our experiments employ the Lwazi corpus, which contains speech in the eleven official languages of South Africa. We find that a surprisingly small number of speakers (fewer than 50) and around 10 to 20 hours of speech per language are sufficient for the purposes of acceptable phone-based recognition.

13:50Pronunciation Dictionary Development in Resource-Scarce Environments

Marelie Davel (Human Language Technologies Research Group, Meraka Institute, CSIR, South Africa)
Olga Martirosian (Human Language Technologies Research Group, Meraka Institute, CSIR, South Africa)

The deployment of speech technology systems in the developing world is often hampered by the lack of appropriate linguistic resources. A suitable pronunciation dictionary is one such resource that can be difficult to obtain for lesser-resourced languages. We design a process for the development of pronunciation dictionaries in resource-scarce environments, and apply this to the development of pronunciation dictionaries for ten of the official languages of South Africa. We define the semi-automated development and verification process in detail and discuss practicalities, outcomes and lessons learnt. We analyse the accuracy of the developed dictionaries and demonstrate how the distribution of rules generated from the dictionaries provides insight into the inherent predictability of the languages studied.

14:10XTrans: a speech annotation and transcription tool

Meghan Lammie Glenn (Linguistic Data Consortium)
Stephanie M. Strassel (Linguistic Data Consortium)
Haejoong Lee (Linguistic Data Consortium)

We present XTrans, a multi-platform, multilingual, multi-channel transcription application designed and developed by Linguistic Data Consortium. XTrans provides new and efficient solutions to many common challenges encountered during the manual transcription process of a wide variety of audio genres, such as supporting multiple audio channels in a meeting recording or right-to-left text directionality for languages like Arabic. To facilitate accurate transcription, XTrans incorporates a number of quality control functions, and provides a user-friendly mechanism for transcribing overlapping speech. This paper will describe the motivation to develop a new transcription tool, and will give an overview of XTrans functionality.

14:30How to Select a Good Training-data Subset for Transcription: Submodular Active Selection for Sequences

Hui Lin (University of Washington)
Jeff Bilmes (University of Washington)

Given a large un-transcribed corpus of speech utterances, we address the problem of how to select a good subset for word-level transcription under a given fixed transcription budget. We employ submodular active selection on a Fisher-kernel based graph over un-transcribed utterances. The selection is theoretically guaranteed to be near-optimal. Moreover, our approach is able to bootstrap without requiring any initial transcribed data, whereas traditional approaches rely heavily on the quality of an initial model trained on some labeled data. Our experiments on phone recognition show that our approach outperforms both average-case random selection and uncertainty sampling significantly.
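As a rough illustration of budgeted submodular selection, the sketch below greedily maximizes a facility-location objective over a pairwise similarity matrix. The objective, the function name `greedy_facility_location`, and the toy similarity matrix are assumptions made for illustration only; the abstract does not state which submodular function is used, and the Fisher-kernel graph is not reproduced here.

```python
import numpy as np

def greedy_facility_location(sim, budget):
    """Greedily pick up to `budget` items maximizing sum_i max_{j in S} sim[i, j].

    `sim` is a hypothetical non-negative (n x n) similarity matrix between utterances.
    """
    sim = np.asarray(sim, dtype=float)
    n = sim.shape[0]
    selected = []
    available = np.ones(n, dtype=bool)
    best_cover = np.zeros(n)  # best similarity of each item to the selected set so far
    for _ in range(min(budget, n)):
        # Objective value after adding each candidate column, minus the current value
        gains = np.maximum(sim, best_cover[:, None]).sum(axis=0) - best_cover.sum()
        gains[~available] = -np.inf  # never pick the same item twice
        j = int(np.argmax(gains))
        selected.append(j)
        available[j] = False
        best_cover = np.maximum(best_cover, sim[:, j])
    return selected

# Toy data: items 0-1 form one cluster, items 2-3 another.
toy_sim = np.array([
    [1.0, 0.9, 0.0, 0.0],
    [0.9, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.8],
    [0.0, 0.0, 0.8, 1.0],
])
picked = greedy_facility_location(toy_sim, budget=2)
```

With a budget of two, the greedy rule picks one item from each cluster, since a second item from an already covered cluster adds little to the objective. For monotone submodular objectives such as this one, greedy selection carries the classical (1 - 1/e) approximation guarantee.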

14:50Improving acceptability assessment for the labelling of affective speech corpora

Zoraida Callejas (University of Granada)
Ramón López-Cózar (University of Granada)

In this paper we study how to address the assessment of affective speech corpora. We propose the use of several coefficients and provide guidelines to obtain a more complete background about the quality of their annotation. This proposal has been evaluated employing a corpus of non-acted emotions gathered from spontaneous interactions of users with a spoken dialogue system. The results show that, due to the nature of non-acted emotional corpora, traditional interpretations would in most cases consider the annotation of these corpora unacceptable even with very high inter-annotator agreement. Our proposal provides a basis to argue their acceptability by supplying a more fine-grained vision of their quality.
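The effect described above, where high raw agreement still yields a low chance-corrected coefficient, can be reproduced with Cohen's kappa on a skewed label distribution. Using kappa here is an assumption for illustration; the abstract does not name the coefficients the authors propose, and the labels below are invented toy data.

```python
from collections import Counter

def agreement_and_kappa(labels_a, labels_b):
    """Return (observed agreement, Cohen's kappa) for two annotators."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement from the product of the annotators' marginal label frequencies
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return observed, float("nan")  # kappa is undefined when only one label is used
    return observed, (observed - expected) / (1.0 - expected)

# Hypothetical non-acted corpus: almost every utterance is "neutral".
annotator_a = ["neutral"] * 92 + ["neutral"] * 3 + ["angry"] * 3 + ["angry"] * 2
annotator_b = ["neutral"] * 92 + ["angry"] * 3 + ["neutral"] * 3 + ["angry"] * 2
observed, kappa = agreement_and_kappa(annotator_a, annotator_b)
```

In this toy case the annotators agree on 94 of 100 items, yet kappa is only about 0.37, because the dominant "neutral" class makes chance agreement very high.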

15:10The Broadcast Narrow Band Speech Corpus: A New Resource Type for Large Scale Language Recognition

Christopher Cieri (Linguistic Data Consortium, University of Pennsylvania)
Linda Brandschain (Linguistic Data Consortium, University of Pennsylvania)
Abby Neely (Linguistic Data Consortium, University of Pennsylvania)
David Graff (Linguistic Data Consortium, University of Pennsylvania)
Kevin Walker (Linguistic Data Consortium, University of Pennsylvania)
Chris Caruso (Linguistic Data Consortium, University of Pennsylvania)

This paper describes a new resource type, broadcast narrow band speech, for use in large scale language recognition research and technology development. After providing the rationale for this new resource type, the paper describes the collection, segmentation, auditing procedures and data formats used. Along the way, it addresses issues of defining language and dialect in found data and how ground truth is established for this corpus. Index Terms: multilingual speech corpora, language recognition, language identification, language detection, language, dialect, mutual intelligibility, broadcast news, conversational speech

Thu-Ses2-O5:
Speech Analysis and Processing III

Time:Thursday 13:30 Place:East Wing 4 Type:Oral
Chair:Yannis Stylianou

13:30Model-based automatic evaluation of L2 learner's English timing

Chatchawarn Hansakunbuntheung (GITI / Language and Speech Science Research Labs, Waseda University)
Kato Hiroaki (NICT/ATR Media Information Science Laboratories)
Sagisaka Yoshinori (GITI / Language and Speech Science Research Labs, Waseda University)

This paper proposes a method to automatically measure the timing characteristics of a second-language learner’s speech as a means to evaluate language proficiency in speech production. We used the durational differences from native speakers’ speech as an objective measure to evaluate the learner’s timing characteristics. To provide flexible evaluation without the need to collect additional native-English reference speech, we employed predicted segmental durations using a statistical duration model instead of measured raw durations of natives’ speech. The proposed evaluation method was tested using English speech data uttered by multiple Thai-native learners’ groups with different experiences of English study. An evaluation experiment shows that the proposed measure closely correlates to the subjects’ experiences of English study. These results support the effectiveness of the proposed model-based objective evaluation.

13:50A Bayesian Approach to Non-Intrusive Quality Assessment of Speech

Petko N. Petkov (KTH-Royal Institute of Technology, Stockholm, Sweden)
Iman S. Mossavat (TUE-Eindhoven University of Technology)
Bastiaan Kleijn (KTH-Royal Institute of Technology, Stockholm, Sweden)

A Bayesian approach to non-intrusive quality assessment of narrow-band speech is presented. The speech features used to assess quality are the sample mean and variance of bandpowers evaluated from the temporal envelope in the channels of an auditory filter-bank. Bayesian multivariate adaptive regression splines (BMARS) is used to map features into quality ratings. The proposed combination of features and regression method leads to a high performance quality assessment algorithm that learns efficiently from a small amount of training data and avoids overfitting. Use of the Bayesian approach also allows the derivation of credible intervals on the model predictions, which provide a quantitative measure of model confidence and can be used to identify the need for complementing the training databases.

14:10Precision of Phoneme Boundaries Derived using Hidden Markov Models

Ladan Baghai-Ravary (Phonetics Laboratory, Oxford University)
Greg Kochanski (Phonetics Laboratory, Oxford University)
John Coleman (Phonetics Laboratory, Oxford University)

Some phoneme boundaries correspond to abrupt changes in the acoustic signal. Others are ambiguous because the transition from one phoneme to the next is gradual. This paper compares the boundaries identified by different alignment systems, using different signal representations and HMM structures. The variability of the boundaries is analysed and the consistency between the boundaries from the various systems is analysed to identify which classes of phoneme boundary can be identified reliably by an automatic system, and which are ill-defined and ambiguous. Such techniques should improve the efficiency with which new alignment and HMM training algorithms can be developed.

14:30A Novel Method for Epoch Extraction from Speech Signals

Lakshmish Kaushik (INRS, Montreal, Canada)
Douglas O'Shaughnessy (INRS, Montreal, Canada)

This paper introduces a novel method of speech epoch extraction using a modified Wigner-Ville distribution. The Wigner-Ville distribution is an efficient speech representation tool with which minute speech variations can be tracked precisely. In this paper, we explore epoch detection/extraction using the accurate energy tracking, noise robustness, and efficient speech representation properties of the modified discrete Wigner-Ville distribution. The developed technique is tested on the Arctic database, using epoch information from the electro-glottograph as reference epochs. The developed algorithm is compared with available state-of-the-art methods in various noise conditions (babble, white, and vehicle) and at different levels of degradation. The proposed method outperforms the existing methods in the literature.

14:50LS Regularization of Group Delay Features for Speaker Recognition

Jia Min Karen Kua (The University of New South Wales)
Julien Epps (The University of New South Wales)
Eliathamby Ambikairajah (The University of New South Wales)
Eric Choi (National ICT Australia (NICTA))

Due to the increasing use of fusion in speaker recognition systems, features that are complementary to MFCCs offer opportunities to advance the state of the art. One promising feature is based on group delay; however, this can suffer large variability due to its numerical formulation. In this paper, we investigate reducing this variability in group delay features with least squares regularization. Evaluations on the NIST 2001 and 2008 SRE databases show a relative improvement of at least 6% and 18% EER respectively when the group delay-based system is fused with the MFCC-based system.

15:10Glottal Closure and Opening Instant Detection from Speech Signals

Thomas Drugman (Faculté Polytechnique de Mons)
Thierry Dutoit (Faculté Polytechnique de Mons)

This paper proposes a new procedure to detect Glottal Closure and Opening Instants (GCIs and GOIs) directly from speech waveforms. The procedure is divided into two successive steps. First a mean-based signal is computed, and intervals where speech events are expected to occur are extracted from it. Secondly, at each interval a precise position of the speech event is assigned by locating a discontinuity in the Linear Prediction residual. The proposed method is compared to the DYPSA algorithm on the CMU ARCTIC database. A significant improvement as well as a better noise robustness are reported. Besides, results of GOI identification accuracy are promising for the glottal source characterization.

Thu-Ses2-P2:
ASR: Acoustic Model Features

Time:Thursday 13:30 Place:Hewison Hall Type:Poster
Chair:Richard Stern

#1Investigation into bottle-neck features for meeting speech recognition

Frantisek Grezl (Brno University of Technology, Brno, Czech Republic)
Karafiat Martin (Brno University of Technology, Brno, Czech Republic)
Burget Lukas (Brno University of Technology, Brno, Czech Republic)

This work investigates the recently proposed Bottle-Neck features for ASR. The bottle-neck ANN structure is imported into the Split Context architecture, yielding a significant WER reduction. Further, a Universal Context architecture was developed, which simplifies the system by using only one universal ANN for all temporal splits. Significant WER reduction can be obtained by applying fMPE on top of our BN features as a technique for discriminative feature extraction, and further gain is obtained by retraining model parameters using the MPE criterion. The results are reported on meeting data from the RT07 evaluation.

#2Multi-Stream to Many-Stream: Using Spectro-Temporal Features for ASR

Sherry Y. Zhao (International Computer Science Institute)
Suman Ravuri (International Computer Science Institute)
Nelson Morgan (International Computer Science Institute)

We report progress in the use of multi-stream spectro-temporal features for both small and large vocabulary automatic speech recognition tasks. Features are divided into multiple streams for parallel processing and dynamic utilization in this approach. For small vocabulary speech recognition experiments, the incorporation of up to 28 dynamically-weighted spectro-temporal feature streams along with MFCCs yields roughly 21% improvement on the baseline in low noise conditions and 47% improvement in noise-added conditions, a greater improvement on the baseline than in our previous work. A four stream framework yields a 14% improvement over the baseline in the large vocabulary low noise recognition experiment. These results suggest that the division of spectro-temporal features into multiple streams may be an effective way to flexibly utilize an inherently large number of features for automatic speech recognition.

#3Tandem Representations of Spectral Envelope and Modulation Frequency Features for ASR

Samuel Thomas (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)
Sriram Ganapathy (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)
Hynek Hermansky (Department of Electrical and Computer Engineering, Johns Hopkins University, USA)

We present a feature extraction technique for automatic speech recognition that uses Tandem representation of short-term spectral envelope and modulation frequency features. These features, derived from sub-band temporal envelopes of speech estimated using frequency domain linear prediction, are combined at the phoneme posterior level. Tandem representations derived from these phoneme posteriors are used along with HMM based ASR systems for both small and large vocabulary continuous speech recognition (LVCSR) tasks. For a small vocabulary continuous digit task on the OGI Digits database, the proposed features reduce the word error rate (WER) by 13 % relative to other feature extraction techniques. We obtain a relative reduction of about 14 % in WER for an LVCSR task using the NIST RT05 evaluation data. For phoneme recognition tasks on the TIMIT database these features provide a relative improvement of 13 % compared to other techniques.

#4Entropy-Based Feature Analysis for Speech Recognition

Panji Setiawan (Siemens Enterprise Communications GmbH & Co. KG)
Harald Hoege (SVOX Deutschland GmbH)
Tim Fingscheidt (Technische Universitaet Braunschweig)

Based on the concept of entropy, a new approach to analyse the quality of features as used in speech recognition is proposed. We regard the relation between the hidden Markov model (HMM) states and the corresponding frame based feature vectors as a coding problem, where the states are sent through a noisy recognition channel and received as feature vectors. Using the relation between Shannon's conditional entropy and the error rate on state level, we estimate how much information is contained in the feature vectors to recognize the states. Thus, the conditional entropy is a measure for the quality of the features. Finally, we show how noise reduces the information contained in the features.
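To make the quality measure concrete, the following sketch computes the conditional entropy H(S | X) in bits from a table of joint counts between states and observations. It assumes the feature vectors have been quantized into a small number of discrete symbols, which is a simplification introduced here; the function name and the toy counts are hypothetical.

```python
import numpy as np

def conditional_entropy(joint_counts):
    """H(S | X) in bits; rows index states S, columns index discrete observations X."""
    p = np.asarray(joint_counts, dtype=float)
    p = p / p.sum()                        # joint distribution p(s, x)
    p_x = p.sum(axis=0, keepdims=True)     # marginal p(x)
    with np.errstate(divide="ignore", invalid="ignore"):
        p_s_given_x = np.where(p_x > 0, p / p_x, 0.0)
        log_term = np.where(p > 0, np.log2(p_s_given_x), 0.0)
    return float(-(p * log_term).sum())

# Observations that identify the state perfectly leave no uncertainty ...
h_clean = conditional_entropy([[5, 0], [0, 5]])
# ... while observations independent of the state leave a full bit for two equiprobable states.
h_noisy = conditional_entropy([[1, 1], [1, 1]])
```

A lower H(S | X) means the features carry more information about the state. One standard link between conditional entropy and error rate is Fano's inequality; whether the paper uses exactly that relation is not stated in the abstract.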

#5Hierarchical Processing of the Modulation Spectrum for GALE Mandarin LVCSR system

Fabio Valente (IDIAP)
Mathew Magimai.-Doss (IDIAP)
Christian Plahl (RWTH Aachen University)
Suman Ravuri (ICSI)

This paper aims at investigating the use of TANDEM features based on hierarchical processing of the modulation spectrum. The study is done in the framework of the GALE project for recognition of Mandarin Broadcast data. We describe the improvements obtained using the hierarchical processing and the addition of features like pitch and short-term critical band energy. Results are consistent with previous findings on a different LVCSR task, suggesting that the proposed technique is effective and robust across several conditions. Furthermore, we describe integration into the RWTH GALE LVCSR system trained on 1600 hours of Mandarin data and present progress across the GALE 2007 and GALE 2008 RWTH systems, resulting in approximately 20% CER reduction on several data sets.

#6Hill-Climbing Feature Selection for Multi-Stream ASR

David Gelbart (International Computer Science Institute, USA)
Nelson Morgan (International Computer Science Institute, USA)
Alexey Tsymbal (Siemens AG, Germany)

We performed automated feature selection for multi-stream (i.e., ensemble) automatic speech recognition, using a hill-climbing (HC) algorithm that changes one feature at a time if the change improves a performance score. For both clean and noisy data sets (using the OGI Numbers corpus), HC usually improved performance on held out data compared to the initial system it started with, even for noise types that were not seen during the HC process. Overall, we found that using Opitz’s scoring formula, which blends single-classifier word recognition accuracy and ensemble diversity, worked better than ensemble accuracy as a performance score for guiding HC in cases of extreme mismatch between the SNR of training and test sets. Our noisy version of the Numbers corpus, our multi-layer-perceptron-based Numbers ASR system, and our HC scripts are available online.
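The hill-climbing loop described above, which changes one feature at a time and keeps the change only if the score improves, can be sketched as follows. The `score` callable stands in for whatever performance measure guides the search (ensemble accuracy or a blended formula); the toy score below is purely hypothetical and is not the authors' scoring function.

```python
def hill_climb(initial_mask, score, max_passes=10):
    """Flip one feature at a time; keep a flip only if it raises the score."""
    mask = list(initial_mask)
    best = score(mask)
    for _ in range(max_passes):
        improved = False
        for i in range(len(mask)):
            mask[i] = not mask[i]          # tentatively toggle feature i
            candidate = score(mask)
            if candidate > best:
                best = candidate
                improved = True
            else:
                mask[i] = not mask[i]      # revert: no improvement
        if not improved:
            break                          # local optimum reached
    return mask, best

# Toy score: number of positions that match a hidden "ideal" feature subset.
ideal = [True, False, True, True]
toy_score = lambda m: sum(a == b for a, b in zip(m, ideal))
final_mask, final_score = hill_climb([False] * 4, toy_score)
```

Because only improving changes are accepted, the search can stop at a local optimum, so the result depends on the initial feature set.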

#7Robust F0 Estimation Based on Log-Time Scale Autocorrelation and Its Application to Mandarin Tone Recognition

Yusuke Kida (Corporate Research & Development Center, Toshiba Corporation, Japan)
Masaru Sakai (Corporate Research & Development Center, Toshiba Corporation, Japan)
Takashi Masuko (Corporate Research & Development Center, Toshiba Corporation, Japan)
Akinori Kawamura (Corporate Research & Development Center, Toshiba Corporation, Japan)

This paper proposes a novel F0 estimation method in which delta-logF0 is directly estimated based on the autocorrelation function (ACF) on a logarithmic time scale. Since peaks of ACFs of periodic signals have a specific pattern on the log-time scale and the period only affects the position of the pattern, delta-logF0 can be estimated directly from the shift of the peaks of the log-time scale ACF (LTACF) without F0 estimation. Then logF0 is estimated from the sum of LTACFs shifted based on delta-logF0. Experimental results show that the proposed method is more robust against noise than the baseline ACF-based method. It is also shown that the proposed method significantly improves the Mandarin tone recognition accuracy.

#8Invariant-integration method for robust feature extraction in speaker-independent speech recognition

Florian Müller (Institute for Signal Processing, University of Lübeck, Germany)
Alfred Mertins (Institute for Signal Processing, University of Lübeck, Germany)

The vocal tract length (VTL) is one of the variabilities that speaker-independent automatic speech recognition (ASR) systems encounter. Standard methods to compensate for the effects of different VTLs within the processing stages of ASR systems often require high computational effort. By using an appropriate spectro-temporal representation, a change in VTL can be approximately described by a translation in the subband-index space. We present a new type of feature that is based on the principles of invariant integration, and describe a corresponding feature selection method. ASR experiments show the increased robustness of the proposed features in comparison to standard MFCCs.

#9Discriminative Feature Transformation using Output Coding for Speech Recognition

Omid Dehzangi (School of Computer Engineering, Nanyang Technological University)
Ma Bin (Institute for Infocomm Research, Singapore)
Eng Siong Chng (School of Computer Engineering, Nanyang Technological University)
Haizhou Li (Institute for Infocomm Research, Singapore)

In this paper, we present a new mechanism to extract discriminative acoustic features for speech recognition using continuous output coding (COC) based feature transformation. Our proposed method first expands the short-time spectral features into a higher dimensional feature space to improve their discriminative capability. The expansion is performed by polynomial expansion. The high dimensional features are then projected into a lower dimensional space using a continuous output coding technique implemented by a set of linear SVMs. The resulting feature vectors are designed to encode the difference between phones. The generated features are shown to be more discriminative than MFCCs, and experimental results on both the TIMIT and NTIMIT corpora showed better phone recognition accuracy with the proposed features.

#10Discriminant Spectrotemporal Features for Phoneme Recognition

Nima Mesgarani (Johns Hopkins University)
Sivaram Garimella (Johns Hopkins University)
Sridhar Krishna Nemala (Johns Hopkins University)
Hynek Hermansky (Johns Hopkins University)

We propose discriminant methods for deriving two-dimensional spectrotemporal features for phoneme recognition that are estimated to maximize the separation between the representations of phoneme classes. The linearity of the filters results in their intuitive interpretation enabling us to investigate the working principles of the system and to improve its performance by locating the sources of error. Two methods for the estimation of filters are proposed: Regularized Least Square (RLS) and Modified Linear Discriminant Analysis (MLDA). Both methods reach a comparable improvement over the baseline condition demonstrating the advantage of the discriminant spectrotemporal filters.

#11Auditory Model Based Optimization of MFCCs Improves Automatic Speech Recognition Performance

Saikat Chatterjee (KTH - Royal Institute of Technology)
Christos Koniaris (KTH - Royal Institute of Technology)
W. Bastiaan Kleijn (KTH - Royal Institute of Technology)

Using a spectral auditory model along with perturbation based analysis, we develop a new framework to optimize a set of features such that it emulates the behavior of the human auditory system. The optimization is carried out in an off-line manner based on the conjecture that the local geometries of the feature domain and the perceptual auditory domain should be similar. Using this principle, we modify and optimize the static mel frequency cepstral coefficients (MFCCs) without considering any feedback from the speech recognition system. We show that improved recognition performance is obtained for any environmental condition, clean as well as noisy.

Thu-Ses2-P3:
ASR: Tonal Language, Cross-Lingual and Multilingual ASR

Time:Thursday 13:30 Place:Hewison Hall Type:Poster
Chair:Lori Lamel

#1Pronunciation-based ASR for names

Henk Van den Heuvel (CLST, Fac. of Arts, Radboud University Nijmegen, Netherlands)
Bert Réveil (ELIS, Ghent University, Belgium)
Jean-Pierre Martens (ELIS, Ghent University, Belgium)

To improve the ASR of proper names, a novel method based on the generation of pronunciation variants by means of phoneme-to-phoneme converters (P2Ps) is proposed. The aim is to convert baseline transcriptions into variants that maximally resemble actual name pronunciations that were found in a training corpus. The method has to operate in a cross-lingual setting with native Dutch persons speaking Dutch and foreign names, and foreign persons speaking Dutch names. The P2Ps are trained to act either on conventional G2P-transcriptions or on canonical transcriptions that were provided by a human expert. Including the variants produced by the P2Ps in the lexicon of the recognizer substantially improves the recognition accuracy for natives pronouncing foreign names, but not for the other investigated combinations.

#2How speaker tongue and name source language affect the automatic recognition of spoken names

Bert Réveil (DSSP, ELIS, Ghent University)
Jean-Pierre Martens (DSSP, ELIS, Ghent University)
Bart D'hoore (Nuance)

In this paper the automatic recognition of person names and geographical names uttered by native and non-native speakers is examined in an experimental set-up. The major aim was to raise our understanding of how well and under which circumstances previously proposed methods of multilingual pronunciation modeling and multilingual acoustic modeling contribute to a better name recognition in a cross-lingual context. To come to a meaningful interpretation of results we have categorized each language according to the amount of exposure a native speaker is expected to have had to this language. After having interpreted our results we have also tried to find an answer to the question of how much further improvement one might be able to attain with a more advanced pronunciation modeling technique which we plan to develop.

#3Online Generation of Acoustic Models for Multilingual Speech Recognition

Martin Raab (Harman Becker Automotive Systems)
Guillermo Aradilla (Harman Becker Automotive Systems)
Rainer Gruhn (Harman Becker Automotive Systems)
Elmar Nöth (University of Erlangen-Nuremberg)

Our goal is to provide a multilingual speech based Human Machine Interface for in-car infotainment and navigation systems. The multilinguality is for example needed for music player control via speech as artist and song names in the globalized music market come from many languages. Another frequent use case is the input of foreign navigation destinations via speech. In this paper we propose approximated projections between mixtures of Gaussians that allow the generation of the multilingual system from monolingual systems. This makes the creation of the multilingual systems on an embedded system possible with the benefit that training and maintenance effort remain unchanged compared to the provision of monolingual systems. We also sketch how this algorithm can help together with our previous work to have an efficient architecture for multilingual speech recognition on embedded devices.

#4Basic speech recognition for spoken dialogues

Charl van Heerden (Meraka Institute)
Etienne Barnard (Meraka Institute)
Marelie Davel (Meraka Institute)

Spoken dialogue systems (SDSs) have great potential for information access in the developing world. However, the realisation of that potential requires the solution of several challenging problems, including the development of sufficiently accurate speech recognisers for a diverse multitude of languages. We investigate the feasibility of developing small-vocabulary speaker-independent ASR systems designed for use in a telephone-based information system, using ten resource-scarce languages spoken in South Africa as a case study. We find that limited speech corpora (3 to 8 hours of data from around 200 speakers) are sufficient for the development of reasonably accurate recognisers: Error rates are in the range 2% to 12% for a ten-word task, where vocabulary words are excluded from training to simulate vocabulary-independent performance. This approach is substantially more accurate than cross-language transfer, and sufficient for the development of basic spoken dialogue systems.

#5Tonal Articulatory Feature for Mandarin and its Application to Conversational LVCSR

Qingqing Zhang (ThinkIT Speech Laboratory Institute of Acoustics Chinese Academy of Sciences)
Jielin Pan (ThinkIT Speech Laboratory Institute of Acoustics Chinese Academy of Sciences)
Yonghong Yan (ThinkIT Speech Laboratory Institute of Acoustics Chinese Academy of Sciences)

This paper presents our recent work on the development of a tonal Articulatory Feature (AF) for Mandarin and its application to conversational LVCSR. Motivated by the theory of Mandarin phonology, eight features for classifying the acoustic units and one feature for classifying the tone are investigated and constructed in the paper, and the AF-based tandem approach is used to improve speech recognition performance. With this Mandarin AF set, a significant relative reduction in Character Error Rate is obtained over the baseline system using the standard acoustic feature, and the comparison between the ASR systems based on AF classifiers with and without the tonal feature demonstrates that the system with the tonal feature achieves a further performance gain.

#6Effects of Language Mixing for Automatic Recognition of Cantonese-English Code-Mixing Utterances

Houwei Cao (Department of Electronic Engineering, The Chinese University of Hong Kong)
Pak-Chung Ching (Department of Electronic Engineering, The Chinese University of Hong Kong)
Tan Lee (Department of Electronic Engineering, The Chinese University of Hong Kong)

While automatic speech recognition of either Cantonese or English alone has achieved a great degree of success, recognition of Cantonese-English code-mixing speech is not as trivial. This paper attempts to analyze the effect of language mixing on recognition performance of code-mixing utterances. By examining the recognition results of Cantonese-English code-mixing speech, where Cantonese is the matrix language and English is the embedded language, we noticed that recognition accuracy of the embedded language plays a significant role in the overall performance. In particular, significant performance degradation is found in the matrix language if the embedded words cannot be recognized correctly. We also studied the error propagation effect of the embedded English. The results show that an error in embedded English words may propagate to two neighboring Cantonese syllables. Finally, analysis is carried out to determine the influencing factors for recognition performance in embedded English.

#7A One-Step Tone Recognition Approach Using MSD-HMM for Continuous Speech

Changliang Liu (ThinkIT Speech Lab. Institute of Acoustics, Chinese Academy of Science)
Fengpei Ge (ThinkIT Speech Lab. Institute of Acoustics, Chinese Academy of Science)
Fuping Pan (ThinkIT Speech Lab. Institute of Acoustics, Chinese Academy of Science)
Bin Dong (ThinkIT Speech Lab. Institute of Acoustics, Chinese Academy of Science)
Yonghong Yan (ThinkIT Speech Lab. Institute of Acoustics, Chinese Academy of Science)

There are two types of methods for tone recognition of continuous speech: one-step and two-step approaches. Two-step approaches need to identify the syllable boundaries first, while one-step approaches do not. Previous studies mostly focus on two-step approaches. In this paper, a one-step approach using Multi-space distribution HMM (MSD-HMM) is investigated. The F0, which only exists in voiced speech, is modeled by MSD-HMM. Then, a tonal syllable network is built based on the reference and a Viterbi search is carried out on it to find the best tone sequence. Two modifications to the conventional tri-phone HMM models are investigated: tone-based context expansion and syllable-based model units. The experimental results show that tone-based context information is more important for tone recognition and that syllable-based HMM models are much better than phone-based ones. The final tone correct rate is 88.8%, which is much higher than that of the state-of-the-art two-step approaches.

#8Stream-based Context-sensitive Phone Mapping for Cross-lingual Speech Recognition

Khe Chai Sim (Institute for Infocomm Research, Singapore)
Haizhou Li (Institute for Infocomm Research, Singapore)

Recently, a Probabilistic Phone Mapping model was proposed to facilitate cross-lingual automatic speech recognition using a foreign phonetic system. Under this framework, discrete HMMs are used to map a foreign phone sequence to a target phone sequence. Context-sensitive mapping is made possible by expanding the discrete observation symbols to include the contexts of the foreign phones in which they appear in the sequence. Unfortunately, modelling the context dependencies jointly results in a dramatic increase in model parameters as wider contexts are used. In this paper, the probability of a context-dependent symbol is decomposed into the product of probabilities of the symbol and its contexts. This can be modelled conveniently using a multiple-stream discrete HMM system where the contexts are treated as independent streams. Experimental results are reported on the TIMIT English phone recognition task using the Czech, Hungarian and Russian foreign phone recognisers.
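
The contrast between joint and stream-based context modelling described in this abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names, the phone inventory size of 40, and the example probabilities are assumptions chosen only to show why a joint symbol set grows exponentially with context width while independent streams grow linearly.

```python
import math


def joint_symbol_count(num_phones: int, context_width: int) -> int:
    """Distinct observation symbols when a phone and its context_width
    neighbouring phones are modelled jointly as one symbol."""
    return num_phones ** (1 + context_width)


def stream_symbol_count(num_phones: int, context_width: int) -> int:
    """Total symbols when the phone and each context position are
    modelled as separate, independent streams."""
    return num_phones * (1 + context_width)


def stream_log_prob(stream_probs):
    """Log-probability of a context-dependent symbol under the
    independence assumption: the log of the product of the
    per-stream probabilities."""
    return sum(math.log(p) for p in stream_probs)


# Hypothetical inventory of 40 foreign phones, one left and one right context.
print(joint_symbol_count(40, 2), stream_symbol_count(40, 2))
```

With these assumed numbers, the joint model needs one emission parameter per state for each of 64000 symbols, whereas the three-stream model needs 120.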

#9Human Translations Guided Language Discovery for ASR Systems

Sebastian Stüker (Institut für Anthropomatik, Universität Karlsruhe (TH))
Laurent Besacier (Laboratory of Informatics of Grenoble (LIG), University J. Fourier)
Alex Waibel (Institut für Anthropomatik, Universität Karlsruhe (TH))

The traditional approach of collecting and annotating the necessary training data is, due to economic constraints, not feasible for most of the 7,000 languages in the world. At the same time it is of vital interest to have natural language processing systems address practically all of them. Therefore, new, efficient ways of gathering the needed training material have to be found. In this paper we continue our experiments on exploiting the knowledge gained from human simultaneous translations that happen frequently in the real world, in order to discover word units in a new language. We evaluate our approach by measuring the performance of statistical machine translation systems trained on the word units discovered from an oracle phoneme sequence. We then improve it by combining it with a word discovery technique that works without supervision, solely on the unsegmented phoneme sequences.

Thu-Ses2-P1:
Speaker and speech variability, Paralinguistic and nonlinguistic cues

Time:Thursday 13:30 Place:Hewison Hall Type:Poster
Chair:Christer Gobl

#1A Novel Codebook Search Technique for Estimating the Open Quotient

Yen-Liang Shue (Department of Electrical Engineering, University of California, Los Angeles)
Jody Kreiman (Division of Head and Neck Surgery, UCLA School of Medicine)
Abeer Alwan (Department of Electrical Engineering, University of California, Los Angeles)

The open quotient (OQ), loosely defined as the proportion of time the glottis is open during phonation, is an important parameter in many source models. Accurate estimation of OQ from acoustic signals is a non-trivial process as it involves the separation of the source signal from the vocal-tract transfer function. Often this process is hampered by the lack of direct physiological data with which to calibrate algorithms. In this paper, an analysis-by-synthesis method using a codebook of harmonically-based Liljencrants-Fant (LF) source models in conjunction with a constrained optimizer was used to obtain estimates of OQ from four subjects. The estimates were compared with physiological measurements from high-speed imaging. Results showed relatively high correlations between the estimated and measured values for only two of the speakers, suggesting that existing source models may be unable to accurately represent some source signals.

#2Long Term Examination of Intra-Session and Inter-Session Speaker Variability

Aaron Lawson (RADC Inc.)
Allen Stauffer (RADC Inc.)
Brett Smolenski (RADC Inc.)
Benjamin Pokines (Oasis Systems)
Matthew Leonard (University of Texas at Dallas)
Edward Cupples (RADC Inc.)

Session variability in speaker recognition is a well-recognized phenomenon, but it is poorly understood, largely due to a dearth of robust longitudinal data. The current study uses a large, long-term speaker database to quantify both speaker variability changes within a conversation and the impact of speaker variability changes over the long term (3 years). Results demonstrate that 1) change in accuracy over the course of a conversation is statistically very robust and 2) the aging effect over three years is statistically negligible. Finally we demonstrate that voice change during the course of a conversation is, in large part, comparable across sessions.

#3Distorted visual information influences audiovisual perception of voicing

Ragnhild Eg (Department of Psychology, Norwegian University of Science and Technology (NTNU))
Dawn Behne (Department of Psychology, Norwegian University of Science and Technology (NTNU))

Research has shown that visual information becomes less reliable when images are severely distorted. Furthermore, while voicing is generally identified from acoustical cues, it may also provide perception with visual cues. The current study investigated the impact of video distortion on the audiovisual perception of voicing. Audiovisual stimuli were presented to 30 participants with the original video quality, or with reduced video resolution (75x60 pixels, 45x36 pixels). Results revealed that in addition to increased auditory reliance with video distortion, particularly for voiceless stimuli, perception of voiceless stimuli was more influenced by the visual modality than voiced stimuli.

#4Perceived naturalness of a synthesizer of disordered voices

Samia Fraj (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium.)
Francis Grenez (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium.)
Jean Schoentgen (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium. National Fund for Scientific Research, Belgium.)

The presentation describes a synthesizer of normal and disordered voice timbres and their perceptual evaluation with respect to naturalness. The simulator uses a shaping function model, which enables controlling the perturbations of the frequency and harmonic richness of the glottal area signal via the control of the instantaneous frequency and amplitude of two harmonic driving functions. Several types of perturbations are simulated. Perceptual experiments, which involve stimuli of synthetic and human vowels with normal values of perturbations, have been carried out. The first has been based on a binary synthetic/natural classification. The second has involved a discrimination task. Both experiments suggest that human judges are unable to distinguish between human and synthetic vowels prepared with the synthesizer described here.

#5Audio-Visual Speech Asynchrony Modeling in a Talking Head

Alexey Karpov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia)
Liliya Tsirulnik (United Institute of Informatics Problems of the National Academy of Sciences, Minsk, Belarus)
Zdeněk Krňoul (University of West Bohemia in Pilsen, Czech Republic)
Andrey Ronzhin (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia)
Boris Lobanov (United Institute of Informatics Problems of the National Academy of Sciences, Minsk, Belarus)
Miloš Železný (University of West Bohemia in Pilsen, Czech Republic)

An audio-visual speech synthesis system with modeling of asynchrony between auditory and visual speech modalities is proposed in the paper. A corpus-based study of real recordings gave us the required data for understanding the problem of modality asynchrony, which is partially caused by the co-articulation phenomena. A set of context-dependent timing rules and recommendations was elaborated so that the synchronization of auditory and visual speech cues of the animated talking head resembles that of natural human speech. The cognitive evaluation of the model-based talking head for Russian with implementation of the original asynchrony model has shown high intelligibility and naturalness of audio-visual synthesized speech.

#6The Effects of Fundamental Frequency and Formant Space on Speaker Discrimination through Bone-conducted Ultrasonic Hearing

Takayuki Kagomiya (Institute for Human Science and Biomedical Engineering, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Seiji Nakagawa (Institute for Human Science and Biomedical Engineering, National Institute of Advanced Industrial Science and Technology (AIST), Japan)

Human listeners can perceive speech signals from a voice-modulated ultrasonic carrier presented through a bone-conduction stimulator, even if they have sensorineural hearing loss. As an application of this phenomenon, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). This research examined whether formant space and F0 can be cues for speaker discrimination in BCU hearing as well as via air-conduction (AC) hearing. A series of speaker discrimination experiments revealed that both formant space and F0 can serve as cues for speaker discrimination even via BCUHA. However, sensitivity to formant space in BCU hearing is lower than in AC hearing.

#7Automatic Detection and Prediction of Topic Changes through Automatic Detection of Register Variations and Pause Duration

Celine De Looze (Laboratoire Parole et Langage, CNRS et Université de Provence, Aix-en-Provence, France)
Stephane Rauzy (Laboratoire Parole et Langage, CNRS et Université de Provence, Aix-en-Provence, France)

In this article a clustering algorithm, allowing the automatic detection of speakers’ register changes, is presented. Together with automatic detection of pause duration, it has been shown to be efficient for the automatic detection and prediction of topic changes. The need to take into account other parameters such as tempo and intensity, in the framework of Linear Discriminant Analysis, is proposed in order to improve the identification of the topic structure of discourse. Index Terms: register variations, pause duration, topic changes, automatic detection and prediction.

#8Analyzing Features for Automatic Age Estimation on Cross-Sectional Data

Werner Spiegl (Chair of Pattern Recognition (LME), University Erlangen-Nuremberg, Germany)
Georg Stemmer (SVOX Deutschland GmbH, Munich, Germany)
Eva Lasarcyk (Dep. of Computational Linguistics and Phonetics, Saarland University, Germany)
Varada Kolhatkar (Dep. of Computer Science, University of Minnesota Duluth, USA)
Andrew Cassidy (The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA)
Blaise Potard (CRIN, Nancy, France)
Stephen Shum (International Computer Science Institute, University of California at Berkeley, USA)
Young Chol Song (Dep. of Computer Science, Stony Brook University, USA)
Puyang Xu (The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA)
Peter Beyerlein (Dep. Bioinformatics, University of Applied Sciences Wildau, Berlin, Germany)
James Harnsberger (Speech Perception Laboratory, University of Florida, USA)
Elmar Noeth (Chair of Pattern Recognition (LME), University Erlangen-Nuremberg, Germany)

We develop an acoustic feature set for the estimation of a person's age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs), which are extended by various prosodic features, pitch and formant frequencies. From experiments on the University of Florida Vocal Aging Database we can draw different conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error of between 4% and 20%. Improvements are even larger when perceptual age labels are taken as a reference. On the other hand, reasonable results with a mean absolute error in age estimation of about 12 years are already achieved using a simple gender-independent setup and MFCCs only. Future experiments will evaluate the robustness of the prosodic features against channel variability on other databases and investigate the differences between perceptual and chronological age labels.
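
The two quantities this abstract reports, mean absolute error and its relative reduction over a baseline, can be computed as in the short Python sketch below. The function names and the ages in the usage line are illustrative assumptions, not data or code from the paper.

```python
def mean_absolute_error(true_ages, predicted_ages):
    """Average absolute difference, in years, between reference and
    estimated ages."""
    if len(true_ages) != len(predicted_ages) or not true_ages:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(t - p) for t, p in zip(true_ages, predicted_ages))
    return total / len(true_ages)


def relative_reduction(baseline_error, new_error):
    """Fraction by which new_error improves on baseline_error
    (0.2 means a 20% relative reduction)."""
    return (baseline_error - new_error) / baseline_error


# Made-up example: three speakers with reference and estimated ages.
print(mean_absolute_error([20, 40, 60], [30, 35, 72]))
```

Under this definition, a drop from a baseline error of 12 years to 9.6 years would correspond to a 20% relative reduction.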

#9Intercultural Differences and Commonality in Evaluation of Pathological Voice Quality: Perceptual and Acoustical Comparisons between RASATI and GRBASI Scales

Emi Juliana Yamauchi (Graduate School of Comprehensive Scientific Research, Prefectural University of Hiroshima, Hiroshima, Japan)
Satoshi Imaizumi (Department of Communication Sciences and Disorders, Prefectural University of Hiroshima, Hiroshima, Japan)
Tomoyuki Haji (Kurashiki Central Hospital, Okayama, Japan)

This paper analyzed differences and commonality in pathological voice quality evaluation between two different scaling systems, GRBASI and RASATI. The results identified significant interrelations between the scales. Harshness, included in RASATI, is described as noisiness and strain in the GRBASI scale. Roughness is found to be the most consistent factor and the easiest to identify by listeners of different linguistic backgrounds. Intercultural agreement in pathological voice quality evaluation seems to be possible.

#10F0 cues for the discourse functions of “hã” in Hindi

Kalika Bali (Microsoft Research Labs India)

Affirmative particles are often employed in conversational speech to convey more than their literal semantic meaning. The discourse information conveyed by such particles can have consequences in both Speech Understanding and Speech Production for a Spoken Dialogue System. This paper analyses the different discourse functions of the affirmative particle hã (“yes”) in Hindi and explores the role of fundamental frequency (f0) as a cue to disambiguating these functions.

#11Audio spatialisation strategies for multitasking during teleconferences

Stuart N. Wrigley (University of Sheffield)
Simon Tucker (University of Sheffield)
Guy J. Brown (University of Sheffield)
Steve Whittaker (University of Sheffield)

Multitasking during teleconferences is becoming increasingly common: participants continue their work whilst monitoring the audio for topics of interest. Our previous work has established the benefit of spatialised audio presentation on improving multitasking performance. In this study, we investigate the different spatialisation strategies employed by subjects in order to aid their multitasking performance and improve their user experience. Subjects were given the freedom to place each participant at a different location in the acoustic space both in terms of azimuth and distance. Their strategies were based upon cues regarding keywords and which participant will utter them. Our findings suggest that subjects employ consistent strategies with regard to the location of target and distracter talkers. Furthermore, manipulation of the acoustic space plays an important role in multitasking performance and the user experience.

#12Speech rate effects on linguistic change

Alexsandro Meireles (Federal University of Espírito Santo)
Plínio Barbosa (State University of Campinas)

This work deals with the possible role of speech rate in diachronic change from antepenultimate stress words to penultimate stress words. Our results suggest that speech rate may explain this process of linguistic change, since the medial post-stressed vowel reduces more, without deletion, than the final post-stressed vowel from normal to fast rate. These results were confirmed by Friedman's ANOVA. One-way ANOVA also indicated that the duration of the medial post-stressed vowel is significantly smaller than that of the final post-stressed vowel. Besides, linguistic changes influenced by speech rate act according to dialect and gender.

#13Mandarin Spontaneous Narrative Planning—Prosodic Evidence from National

Chiu-yu Tseng (Academia Sinica)
Zhao-yu Su (Academia Sinica)
Lin-shan Lee (National Taiwan University)

This paper discusses discourse planning of pre-organized spontaneous narratives (SpnNS) in comparison with read speech (RS). F0 and tempo modulations are compared by speech paragraph size and discourse boundaries. The speaking rate of SpnNS from university classroom lectures is 2 to 3 times that of RS by professionals; paragraph phrasing of SpnNS is 6 times that of RS. Patterns of paragraph association are distinct for SpnNS and RS. Sub-paragraph and paragraph units in RS are marked by distinct relative F0 resets and boundary pause duration, but by patterns of intensity contrasts in SpnNS instead. Consistent across both data sets is the finding that combined relative supra-segmental cues reflecting global prosodic properties are more discriminative for distinguishing discourse boundaries than any single cue in isolation, supporting higher-level discourse planning in the acoustic signals. We believe these findings can be directly applied to speech technology development.

Thu-Ses2-P4:
ASR: new paradigms II

Time:Thursday 13:30 Place:Hewison Hall Type:Poster
Chair:Michael Schuster

#1The Case for Case-Based Automatic Speech Recognition

Viktoria Maier (Department of Speech and Hearing, University of Sheffield, Sheffield, United Kingdom)
Roger K. Moore (Department of Speech and Hearing, University of Sheffield, Sheffield, United Kingdom)

In order to avoid global parameter settings which are locally suboptimal, this paper argues for the inclusion of more knowledge (in particular procedural knowledge) into automatic speech recognition (ASR) systems. Two related fields provide inspiration for this new perspective: (a) ‘cognitive architectures’ indicate how experience with related problems can give rise to more (expert) knowledge, and (b) ‘case-based reasoning’ provides an extended framework which is relevant to any similarity-based recognition system. The outcome of this analysis is a proposal for a new approach termed ‘Case-Based ASR’.

#2A Self-Labeling Speech Corpus: Collecting Spoken Words with an Online Educational Game

Ian McGraw (MIT)
Alexander Gruenstein (MIT)
Andrew Sutherland (MIT, Quizlet.com)

We explore a new approach to collecting and transcribing speech data by using online educational games. One such game, Voice Race, elicited over 55,000 utterances over a 22 day period, representing 18.7 hours of speech. Voice Race was designed such that the transcripts for a significant subset of utterances can be automatically inferred using the contextual constraints of the game. Game context can also be used to simplify transcription to a multiple choice task, which can be performed by non-experts. We found that one third of the speech collected with Voice Race could be automatically transcribed with over 98% accuracy; and that an additional 49% could be labeled cheaply by Amazon Mechanical Turk workers. We demonstrate the utility of the self-labeled speech in an acoustic model adaptation task, which resulted in a reduction in the Voice Race utterance error rate. The collected utterances cover a wide variety of vocabulary, and should be useful across a range of research.

#3A noise robust method for pattern discovery in quantized time series: the concept matrix approach

Okko Johannes Räsänen (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Unto Kalervo Laine (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)
Toomas Altosaar (Department of Signal Processing and Acoustics, Helsinki University of Technology, Finland)

An efficient method for pattern discovery from discrete time series is introduced in this paper. The method utilizes two parallel streams of data: a discrete unit time series and a set of labeled events. From these inputs it builds associative models between systematically co-occurring structures existing in both streams. The models are based on transitional probabilities of events at several different time scales. Learning and recognition processes are incremental, making the approach suitable for on-line learning tasks. The capabilities of the algorithm are demonstrated in a continuous speech recognition task operating in varying noise levels.

#4Using Parallel Architectures in Speech Recognition

Patrick Cardinal (Centre de Recherche Informatique de Montréal)
Pierre Dumouchel (École de Technologie Supérieure)
Gilles Boulianne (Centre de Recherche Informatique de Montréal)

The speed of modern processors has remained constant over the last few years and thus, to be scalable, applications must be parallelized. In addition to the main CPU, almost every computer is equipped with a Graphics Processing Unit (GPU), which is in essence a specialized parallel processor. This paper explores how the performance of speech recognition systems can be enhanced by using a GPU for the acoustic computations and multi-core CPUs for the Viterbi search in a large vocabulary application. The multi-core implementation of our speech recognition system runs 1.3 times faster than the single-threaded CPU implementation. Addition of the GPU for dedicated acoustic computations increases the speed by a factor of 2.8, leading to a word accuracy improvement of 16.6% absolute at real-time, compared to the single-threaded CPU implementation.

#5Example-Based Speech Recognition using Formulaic Phrases

Christopher James Watkins (University of East Anglia)
Stephen James Cox (University of East Anglia)

In this paper, we describe the design of an ASR system that is based on identifying and extracting formulaic phrases from a corpus and then, rather than building statistical models of them, performing example-based recognition of these phrases. We describe a method for combining formulaic phrases into a bigram language model that results in a 13% decrease in WER on a monophone HMM recogniser over the baseline. We show that using this model with phrase templates in the example-based recogniser gives a significant improvement in WER compared to word templates, but performance still falls short of the HMM recogniser. We also describe an LDA decision tree classifier that reduces the search space of the DTW decoder by 40% while at the same time decreasing WER.

#6Parallel Fast Likelihood Computation for LVCSR using Mixture Decomposition

Naveen Parihar (Dept. of Electrical and Computer Engineering, Mississippi State University, USA)
Ralf Schlueter (Human Lang. and Pattern Recognition, Comp. Sc. Dept., RWTH Aachen University, Germany)
David Rybach (Human Lang. and Pattern Recognition, Comp. Sc. Dept., RWTH Aachen University, Germany)
Eric Hansen (Dept. of Computer Science and Engineering, Mississippi State University, USA)

This paper describes a simple and robust method for improving the runtime of likelihood computation on multi-core processors without degrading system accuracy. The method improves runtime by parallelizing likelihood computations on a multi-core processor. Mixtures are decomposed among the cores and each core computes the likelihood of the mixtures allocated to it. We study two approaches to mixture decomposition – chunk-based and decision-tree-based. When applied to the RWTH TC-STAR EPPS English LVCSR system on an Intel Core2 Quad processor with varying pruning-beam width settings, the method resulted in a 54% to 70% improvement in the likelihood computation runtime, and an 18% to 59% improvement in the overall runtime.
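
A minimal Python sketch of the chunk-based idea follows: the set of Gaussian densities is split into contiguous chunks and each chunk is scored by a separate worker. It is an illustration under stated assumptions rather than the RWTH implementation: the function names, the diagonal-covariance form, and the use of `ThreadPoolExecutor` are choices made here, and threads in CPython show only the decomposition, not a real multi-core speedup.

```python
import math
from concurrent.futures import ThreadPoolExecutor


def log_gaussian(x, mean, var):
    """Log density of a diagonal-covariance Gaussian at feature vector x."""
    return sum(-0.5 * (math.log(2 * math.pi * v) + (xi - m) ** 2 / v)
               for xi, m, v in zip(x, mean, var))


def score_chunk(x, chunk):
    """Score every (mean, var) density in one chunk against x."""
    return [log_gaussian(x, mean, var) for mean, var in chunk]


def chunked(items, n_chunks):
    """Split items into at most n_chunks contiguous, near-equal chunks."""
    size = max(1, math.ceil(len(items) / n_chunks))
    return [items[i:i + size] for i in range(0, len(items), size)]


def parallel_scores(x, densities, n_workers=4):
    """Chunk-based decomposition: one chunk of densities per worker.
    Results come back in the original density order."""
    chunks = chunked(densities, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(lambda c: score_chunk(x, c), chunks))
    return [score for part in parts for score in part]
```

Because the chunks are contiguous and `map` preserves order, the parallel result is identical to scoring all densities sequentially, which matches the abstract's claim that accuracy is not degraded.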

#7An indexing weight for voice-to-text search

Chen Liu (Applied Research and Technology Center, Motorola, Schaumburg, IL 60196, USA)

The TF–IDF (term frequency–inverse document frequency) weight is a well-known indexing weight in information retrieval and text mining. However, it is not suitable for the increasingly popular voice-to-text search, as it does not take into account the impact of voice in the search process. We propose a method for calculating a new indexing weight, which is used as guidance for selection of suitable queries for voice-to-text search. In designing the new weight, we combine prominence factors from both the text and acoustic domains. Experimental results show significant improvement in the average search success rate with the new indexing weight.
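
For reference, the baseline TF-IDF weight mentioned in this abstract can be sketched in Python as below. This shows one common variant (relative term frequency times the log of inverse document frequency); it does not reproduce the new voice-aware weight, whose formula is not given here. The function name and the toy documents are assumptions.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, where
    weight = (term count / document length) * log(N / document frequency)."""
    n_docs = len(documents)
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(doc))  # count each term once per document
    weights = []
    for doc in documents:
        counts = Counter(doc)
        weights.append({
            term: (count / len(doc)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights


# Toy documents, already tokenised.
docs = [["call", "home"], ["call", "office", "office"]]
print(tf_idf(docs))
```

A term that appears in every document, such as "call" in the toy data, gets weight zero, which is why the text-only weight says nothing about how reliably a term is recognised when spoken.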

#8On invariant structural representation for speech recognition: theoretical validation and experimental improvement

Yu Qiao (The University of Tokyo)
Nobuaki Minematsu (The University of Tokyo)
Keikichi Hirose (The University of Tokyo)

This paper describes our recent progress on invariant structural representation of speech. Theoretically, we prove that the maximum likelihood based decomposition can lead to the same structural representation for a sequence and its transformed version. Practically, we introduce a method of discriminant analysis of eigen-structure to deal with two limitations of the structural representation, namely, high dimensionality and too strong invariance. In one experiment, we examine the performance of structural representations with respect to vocal tract length (VTL) differences. The experimental results indicate that structural representations are much more robust to VTL changes than HMM. In another experiment, we evaluate the proposed method through recognizing connected Japanese vowels. The proposed method achieves a recognition rate of 99.0%, which is higher than those of the previous structure based recognition methods and word HMM.

#9Articulatory Feature Asynchrony Analysis and Compensation in Detection-Based ASR

I-Fan Chen (Institute of Information Science, Academia Sinica, Taipei)
Hsin-Min Wang (Institute of Information Science, Academia Sinica, Taipei)

This paper investigates the effects of two types of imperfection, namely detection errors and articulatory feature asynchrony, of the front-end articulatory feature detector on the performance of a detection-based ASR system. Based on a set of variable-controlled experiments, we find that articulatory feature asynchrony is the major issue that should be addressed in detection-based ASR. To this end, we propose several methods to reduce the asynchrony or the effects of asynchrony. The results are quite promising; for example, currently, we can achieve 67.67% phone accuracy in the TIMIT free phone recognition task with only 11 binary-valued articulatory features.

#10CRANDEM: Conditional Random Fields for Word Recognition

Jeremy Morris (Department of Computer Science and Engineering, The Ohio State University)
Eric Fosler-Lussier (Department of Computer Science and Engineering, The Ohio State University)

To date, the use of Conditional Random Fields (CRFs) in automatic speech recognition has been limited to the tasks of phone classification and phone recognition. In this paper, we present a framework for using CRF models in a word recognition task that extends the well-known Tandem HMM framework to CRFs. We show results that compare favorably to a set of standard baselines, and discuss some of the benefits and potential pitfalls of this method.

#11HEAR: An Hybrid Episodic-Abstract speech Recognizer

Sébastien Demange (Katholieke Universiteit Leuven ESAT/PSI)
Dirk Van Compernolle (Katholieke Universiteit Leuven ESAT/PSI)

This paper presents a new architecture for automatic continuous speech recognition called HEAR - Hybrid Episodic-Abstract speech Recognizer. HEAR relies on both parametric speech models (HMMs) and episodic memory. We propose an evaluation on the Wall Street Journal corpus, a standard continuous speech recognition task, and compare the results with a state-of-the-art HMM baseline. HEAR is shown to be a viable and competitive architecture. While HMMs have been studied and optimized for decades, their performance seems to converge to a limit which is lower than human performance. In contrast, episodic memory modeling for speech recognition as applied in HEAR offers flexibility to enrich the recognizer with information the HMMs lack. This opportunity as well as future work are presented in a discussion.