10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Thu-Ses2-P1: Speaker and speech variability, Paralinguistic and nonlinguistic cues
Time: Thursday 13:30
Place: Hewison Hall
Type: Poster
Chair: Christer Gobl
| #1 | A Novel Codebook Search Technique for Estimating the Open Quotient
Yen-Liang Shue (Department of Electrical Engineering, University of California, Los Angeles) Jody Kreiman (Division of Head and Neck Surgery, UCLA School of Medicine) Abeer Alwan (Department of Electrical Engineering, University of California, Los Angeles)
The open quotient (OQ), loosely defined as the proportion of time the glottis is open during phonation, is an important parameter in many source models. Accurate estimation of OQ from acoustic signals is a non-trivial process as it involves the separation of the source signal from the vocal-tract transfer function. Often this process is hampered by the lack of direct physiological data with which to calibrate algorithms. In this paper, an analysis-by-synthesis method using a codebook of harmonically-based Liljencrants-Fant (LF) source models in conjunction with a constrained optimizer was used to obtain estimates of OQ from four subjects. The estimates were compared with physiological measurements from high-speed imaging. Results showed relatively high correlations between the estimated and measured values for only two of the speakers, suggesting that existing source models may be unable to accurately represent some source signals.
|
| #2 | Long Term Examination of Intra-Session and Inter-Session Speaker Variability
Aaron Lawson (RADC Inc.) Allen Stauffer (RADC Inc.) Brett Smolenski (RADC Inc.) Benjamin Pokines (Oasis Systems) Matthew Leonard (University of Texas at Dallas) Edward Cupples (RADC Inc.)
Session variability in speaker recognition is a well-recognized phenomenon, but it is poorly understood, largely due to a dearth of robust longitudinal data. The current study uses a large, long-term speaker database to quantify both speaker variability changes within a conversation and the impact of speaker variability changes over the long term (3 years). Results demonstrate that 1) the change in accuracy over the course of a conversation is statistically very robust and 2) the aging effect over three years is statistically negligible. Finally, we demonstrate that voice change during the course of a conversation is, in large part, comparable across sessions.
|
| #3 | Distorted visual information influences audiovisual perception of voicing
Ragnhild Eg (Department of Psychology, Norwegian University of Science and Technology (NTNU)) Dawn Behne (Department of Psychology, Norwegian University of Science and Technology (NTNU))
Research has shown that visual information becomes less reliable when images are severely distorted. Furthermore, while voicing is generally identified from acoustical cues, it may also provide perception with visual cues. The current study investigated the impact of video distortion on the audiovisual perception of voicing. Audiovisual stimuli were presented to 30 participants with the original video quality, or with reduced video resolution (75x60 pixels, 45x36 pixels). Results revealed that in addition to increased auditory reliance with video distortion, particularly for voiceless stimuli, perception of voiceless stimuli was more influenced by the visual modality than voiced stimuli.
|
| #4 | Perceived naturalness of a synthesizer of disordered voices
Samia Fraj (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium.) Francis Grenez (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium.) Jean Schoentgen (Laboratory of Images, Signals & telecommunication devices, Université Libre de Bruxelles, Brussels, Belgium. National Fund for Scientific Research, Belgium.)
The presentation describes a synthesizer of normal and disordered voice timbres and their perceptual evaluation with respect to naturalness. The simulator uses a shaping function model, which enables controlling the perturbations of the frequency and harmonic richness of the glottal area signal via the control of the instantaneous frequency and amplitude of two harmonic driving functions. Several types of perturbations are simulated. Perceptual experiments, which involve stimuli of synthetic and human vowels with normal values of perturbations, have been carried out. The first has been based on a binary synthetic/natural classification. The second has involved a discrimination task. Both experiments suggest that human judges are unable to distinguish between human and synthetic vowels prepared with the synthesizer described here.
|
| #5 | Audio-Visual Speech Asynchrony Modeling in a Talking Head
Alexey Karpov (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia) Liliya Tsirulnik (United Institute of Informatics Problems of the National Academy of Sciences, Minsk, Belarus) Zdeněk Krňoul (University of West Bohemia in Pilsen, Czech Republic) Andrey Ronzhin (St. Petersburg Institute for Informatics and Automation of the Russian Academy of Sciences, Russia) Boris Lobanov (United Institute of Informatics Problems of the National Academy of Sciences, Minsk, Belarus) Miloš Železný (University of West Bohemia in Pilsen, Czech Republic)
This paper proposes an audio-visual speech synthesis system that models the asynchrony between the auditory and visual speech modalities. A corpus-based study of real recordings provided the data needed to understand this asynchrony, which is partially caused by co-articulation phenomena. A set of context-dependent timing rules and recommendations was developed so that the synchronization of auditory and visual speech cues in the animated talking head resembles natural human speech. The cognitive evaluation of the model-based talking head for Russian, with the original asynchrony model implemented, has shown high intelligibility and naturalness of the synthesized audio-visual speech.
|
| #6 | The Effects of Fundamental Frequency and Formant Space on Speaker Discrimination through Bone-conducted Ultrasonic Hearing
Takayuki Kagomiya (Institute for Human Science and Biomedical Engineering, National Institute of Advanced Industrial Science and Technology (AIST), Japan) Seiji Nakagawa (Institute for Human Science and Biomedical Engineering, National Institute of Advanced Industrial Science and Technology (AIST), Japan)
Human listeners can perceive speech signals from a voice-modulated ultrasonic carrier presented through a bone-conduction stimulator, even if they are sensorineural hearing loss patients. As an application of this phenomenon, we have been developing a bone-conducted ultrasonic hearing aid (BCUHA). This research examined whether formant space and F0 can serve as cues for speaker discrimination in BCU hearing as they do in air-conduction (AC) hearing. A series of speaker discrimination experiments revealed that both formant space and F0 can serve as cues for speaker discrimination even via BCUHA. However, sensitivity to formant space is lower in BCU hearing than in AC hearing.
|
| #7 | Automatic Detection and Prediction of Topic Changes through Automatic Detection of Register Variations and Pause Duration
Céline De Looze (Laboratoire Parole et Langage, CNRS et Université de Provence, Aix-en-Provence, France) Stéphane Rauzy (Laboratoire Parole et Langage, CNRS et Université de Provence, Aix-en-Provence, France)
This article presents a clustering algorithm that allows the automatic detection of speakers’ register changes. Together with automatic detection of pause duration, it has been shown to be effective for the automatic detection and prediction of topic changes. To improve the identification of the topic structure of discourse, we propose taking into account other parameters, such as tempo and intensity, in the framework of Linear Discriminant Analysis. Index Terms: register variations, pause duration, topic changes, automatic detection and prediction.
|
| #8 | Analyzing Features for Automatic Age Estimation on Cross-Sectional Data
Werner Spiegl (Chair of Pattern Recognition (LME), University Erlangen-Nuremberg, Germany) Georg Stemmer (SVOX Deutschland GmbH, Munich, Germany) Eva Lasarcyk (Dep. of Computational Linguistics and Phonetics, Saarland University, Germany) Varada Kolhatkar (Dep. of Computer Science, University of Minnesota Duluth, USA) Andrew Cassidy (The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA) Blaise Potard (CRIN, Nancy, France) Stephen Shum (International Computer Science Institute, University of California at Berkeley, USA) Young Chol Song (Dep. of Computer Science, Stony Brook University, USA) Puyang Xu (The Center for Language and Speech Processing, Johns Hopkins University, Baltimore, USA) Peter Beyerlein (Dep. Bioinformatics, University of Applied Sciences Wildau, Berlin, Germany) James Harnsberger (Speech Perception Laboratory, University of Florida, USA) Elmar Noeth (Chair of Pattern Recognition (LME), University Erlangen-Nuremberg, Germany)
We develop an acoustic feature set for the estimation of a person's age from a recorded speech signal. The baseline features are Mel-frequency cepstral coefficients (MFCCs), which are extended by various prosodic features, pitch and formant frequencies. Experiments on the University of Florida Vocal Aging Database lead to two conclusions. On the one hand, adding prosodic, pitch and formant features to the MFCC baseline leads to relative reductions of the mean absolute error of between 4% and 20%. Improvements are even larger when perceptual age labels are taken as a reference. On the other hand, reasonable results with a mean absolute error in age estimation of about 12 years are already achieved using a simple gender-independent setup and MFCCs only. Future experiments will evaluate the robustness of the prosodic features against channel variability on other databases and investigate the differences between perceptual and chronological age labels.
|
| #9 | Intercultural Differences and Commonality in Evaluation of Pathological Voice Quality: Perceptual and Acoustical Comparisons between RASATI and GRBASI Scales
Emi Juliana Yamauchi (Graduate School of Comprehensive Scientific Research, Prefectural University of Hiroshima, Hiroshima, Japan) Satoshi Imaizumi (Department of Communication Sciences and Disorders, Prefectural University of Hiroshima, Hiroshima, Japan) Tomoyuki Haji (Kurashiki Central Hospital, Okayama, Japan)
This paper analyzed differences and commonality in pathological voice quality evaluation between two different scaling systems, GRBASI and RASATI. The results identified significant interrelations between the scales. Harshness, included in RASATI, is described as noisiness and strain in the GRBASI scale. Roughness is found to be the most consistent factor and the easiest for listeners of different linguistic backgrounds to identify. Intercultural agreement in pathological voice quality evaluation seems to be possible.
|
| #10 | F0 cues for the discourse functions of “hã” in Hindi
Kalika Bali (Microsoft Research Labs India)
Affirmative particles are often employed in conversational speech to convey more than their literal semantic meaning. The discourse information conveyed by such particles can have consequences in both Speech Understanding and Speech Production for a Spoken Dialogue System. This paper analyses the different discourse functions of the affirmative particle hã (“yes”) in Hindi and explores the role of fundamental frequency (f0) as a cue to disambiguating these functions.
|
| #11 | Audio spatialisation strategies for multitasking during teleconferences
Stuart N. Wrigley (University of Sheffield) Simon Tucker (University of Sheffield) Guy J. Brown (University of Sheffield) Steve Whittaker (University of Sheffield)
Multitasking during teleconferences is becoming increasingly common: participants continue their work whilst monitoring the audio for topics of interest. Our previous work has established the benefit of spatialised audio presentation on improving multitasking performance. In this study, we investigate the different spatialisation strategies employed by subjects in order to aid their multitasking performance and improve their user experience. Subjects were given the freedom to place each participant at a different location in the acoustic space both in terms of azimuth and distance. Their strategies were based upon cues regarding keywords and which participant will utter them. Our findings suggest that subjects employ consistent strategies with regard to the location of target and distracter talkers. Furthermore, manipulation of the acoustic space plays an important role in multitasking performance and the user experience.
|
| #12 | Speech rate effects on linguistic change
Alexsandro Meireles (Federal University of Espírito Santo) Plínio Barbosa (State University of Campinas)
This work deals with the possible role of speech rate in diachronic change from antepenultimate stress words to penultimate stress words. Our results suggest that speech rate may explain this process of linguistic change, since the medial post-stressed vowel reduces more, without deletion, than the final post-stressed vowel from normal to fast rate. These results were confirmed by Friedman's ANOVA. One-way ANOVA also indicated that the duration of the medial post-stressed vowel is significantly shorter than that of the final post-stressed vowel. In addition, linguistic changes influenced by speech rate depend on dialect and gender.
|
| #13 | Mandarin Spontaneous Narrative Planning—Prosodic Evidence from National
Chiu-yu Tseng (Academia Sinica) Zhao-yu Su (Academia Sinica) Lin-shan Lee (National Taiwan University)
This paper discusses discourse planning of pre-organized spontaneous narratives (SpnNS) in comparison with read speech (RS). F0 and tempo modulations are compared by speech paragraph size and discourse boundaries. The speaking rate of SpnNS from university classroom lectures is 2 to 3 times that of RS by professionals; paragraph phrasing of SpnNS is 6 times that of RS. Patterns of paragraph association are distinct for SpnNS and RS. Sub-paragraph and paragraph units in RS are marked by distinct relative F0 resets and boundary pause duration, but in SpnNS they are marked by patterns of intensity contrasts instead. Common to both data sets is the finding that combined relative supra-segmental cues reflecting global prosodic properties discriminate discourse boundaries better than any single cue in isolation, supporting higher-level discourse planning in the acoustic signals. We believe these findings can be directly applied to speech technology development.
|