10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-P3: Statistical Parametric Synthesis II
| Time: | Wednesday 10:00 |
| Place: | Hewison Hall |
| Type: | Poster |
| Chair: | Simon King |
| #1 | A Bayesian Approach to Hidden Semi-Markov Model Based Speech Synthesis
Kei Hashimoto (Nagoya Institute of Technology) Yoshihiko Nankaku (Nagoya Institute of Technology) Keiichi Tokuda (Nagoya Institute of Technology)
This paper proposes a Bayesian approach to hidden semi-Markov model (HSMM) based speech synthesis. Recently, hidden Markov model (HMM) based speech synthesis based on the Bayesian approach was proposed. The Bayesian approach is a statistical technique for estimating reliable predictive distributions by treating model parameters as random variables. In the Bayesian approach, all processes for constructing the system are derived from one single predictive distribution which exactly represents the problem of speech synthesis. However, there is an inconsistency between training and synthesis: although the speech is synthesized from HMMs with explicit state duration probability distributions, HMMs are trained without them. In this paper, we introduce an HSMM, which is an HMM with explicit state duration probability distributions, into the HMM-based Bayesian speech synthesis system. Experimental results show that the use of HSMM improves the naturalness of the synthesized speech.
|
| #2 | Rich Context Modeling for High Quality HMM-Based TTS
Zhi-Jie Yan (Microsoft Research Asia) Yao Qian (Microsoft Research Asia) Frank K. Soong (Microsoft Research Asia)
This paper presents a rich context modeling approach to high quality HMM-based speech synthesis. We first analyze the over-smoothing problem in conventional decision tree tying-based HMMs, and then propose to model the training speech tokens with rich context models. A special training procedure is adopted for reliable estimation of the rich context model parameters. In synthesis, a search algorithm following a context-based pre-selection is performed to determine the optimal rich context model sequence, which generates natural and crisp output speech. Experimental results show that spectral envelopes synthesized by the rich context models have crisper formant structures and evolve with richer details than those obtained by the conventional models. The speech quality improvement is also perceived by listeners in a subjective preference test, in which 76% of the sentences synthesized using rich context modeling are preferred.
|
| #3 | Tying covariance matrices to reduce the footprint of HMM-based speech synthesis systems
Keiichiro Oura (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan) Heiga Zen (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan) Yoshihiko Nankaku (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan) Akinobu Lee (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan) Keiichi Tokuda (Department of Computer Science and Engineering, Nagoya Institute of Technology, Japan)
This paper proposes a technique for reducing the footprint of HMM-based speech synthesis systems by tying all covariance matrices. HMM-based speech synthesis systems usually have a smaller footprint than unit-selection synthesis systems because statistics rather than speech waveforms are stored. However, further reduction is essential to put them on embedded devices which have very small memory. Based on the empirical knowledge that covariance matrices have a smaller impact on the quality of synthesized speech than mean vectors, we propose a technique for clustering mean vectors while tying all covariance matrices. Subjective listening test results show that the proposed technique can shrink the footprint of an HMM-based speech synthesis system while retaining the quality of synthesized speech.
|
| #4 | The HMM Synthesis Algorithm of an Embedded Unified Speech Recognizer and Synthesizer
Guntram Strecha (Technische Universität Dresden, Germany) Matthias Wolff (Technische Universität Dresden, Germany) Frank Duckhorn (Technische Universität Dresden, Germany) Sören Wittenberg (Technische Universität Dresden, Germany) Constanze Tschöpe (Fraunhofer Institute for Non-Destructive Testing, Dresden, Germany)
In this paper we present an embedded unified speech recognizer and synthesizer using identical, speaker independent Hidden-Markov-Models. The system was prototypically realized on a signal processor extended by a field programmable gate array. In a first section we will give a brief overview of the system. The main part of the paper deals with a specially designed unit based HMM synthesis algorithm. In a last section we state the results of an informal listening evaluation of the speech synthesizer.
|
| #5 | Syllable HMM based Mandarin TTS and Comparison with Concatenative TTS
Zhiwei Shuang (University of Science and Technology of China, IBM China Research Lab) Shiyin Kang (Tsinghua University) Qin Shi (IBM China Research Lab) Yong Qin (IBM China Research Lab) Lianhong Cai (Tsinghua University)
This paper introduces a Syllable HMM based Mandarin TTS system. 10-state left-to-right HMMs are used to model each syllable. We leverage the corpus and the front end of a concatenative TTS system to build the Syllable HMM based TTS system. Furthermore, we utilize the unique consonant/vowel structure of Mandarin syllables to improve the voiced/unvoiced decision of HMM states. Evaluation results show that the Syllable HMM based Mandarin TTS system with a model size of 5.3 MB can achieve an overall quality close to that of a concatenative TTS system with a data size of 1 GB.
|
| #6 | Pulse Density Representation of Spectrum for Statistical Speech Processing
Yoshinori Shiga (National Institute of Information and Communications Technology (NICT), Japan)
This study investigates a new spectral representation that is suitable for statistical parametric speech synthesis. Statistical speech processing involves spectral averaging in the training process; however, averaging spectra in the domain of conventional speech parameters over-smooths the resulting means, which degrades the quality of the speech synthesised. In the proposed representation, high-energy parts of the spectrum, such as sections of dominant formants, are represented by a group of high-density pulses in the frequency domain. These pulses' locations (i.e., frequencies) are then parameterised. The representation is theoretically capable of averaging spectra with less over-smoothing effect. The experimental results provide the optimal values of factors necessary for the encoding and decoding of the proposed representation towards the future applications of speech synthesis.
|
| #7 | Parameterization of Vocal Fry in HMM-Based Speech Synthesis
Hanna Silén (Department of Signal Processing, Tampere University of Technology, Finland) Elina Helander (Department of Signal Processing, Tampere University of Technology, Finland) Jani Nurminen (Nokia Devices R&D, Tampere, Finland) Moncef Gabbouj (Department of Signal Processing, Tampere University of Technology, Finland)
HMM-based speech synthesis offers a way to generate speech with different voice qualities. However, sometimes databases contain certain inherent voice qualities that need to be parametrized properly. One example of this is vocal fry typically occurring at the end of utterances. A popular mixed excitation vocoder for HMM-based speech synthesis is STRAIGHT. The standard STRAIGHT is optimized for modal voices and may not produce high quality with other voice types. Fortunately, due to the flexibility of STRAIGHT, different F0 and aperiodicity measures can be used in the synthesis without any inherent degradations in speech quality. We have replaced the STRAIGHT excitation with a representation based on a robust F0 measure and a carefully determined two-band voicing. According to our analysis-synthesis experiments, the new parameterization can improve the speech quality. In HMM-based speech synthesis, the quality is significantly improved especially due to the better modeling of vocal fry.
|
| #8 | A Deterministic plus Stochastic Model of the Residual Signal for Improved Parametric Speech Synthesis
Thomas Drugman (Faculté Polytechnique de Mons) Geoffrey Wilfart (Acapela Group) Thierry Dutoit (Faculté Polytechnique de Mons)
Speech generated by parametric synthesizers generally suffers from a typical buzziness. In order to alleviate this problem, a more suitable model of the excitation should be adopted. For this, we propose an adaptation of the Deterministic plus Stochastic Model (DSM) for the residual. In this model, the excitation is divided into two distinct spectral bands delimited by a maximum voiced frequency. The deterministic part concerns the low-frequency contents and consists of a decomposition of pitch-synchronous residual frames on an orthonormal basis obtained by Principal Component Analysis, while the stochastic component is a high-pass filtered noise. The proposed residual model is integrated within an HMM-based speech synthesizer and is compared to the traditional excitation through a subjective test. Results show a significant improvement for both male and female voices. The proposed model is also shown to be suitable for integration in commercial applications.
|
| #9 | A decision tree-based clustering approach to state definition in an excitation modeling framework for HMM-based speech synthesis
Ranniery Maia (National Institute of Information and Communications Technology, Japan) Tomoki Toda (Nara Institute of Science and Technology, Japan) Keiichi Tokuda (Nagoya Institute of Technology, Japan) Shinsuke Sakai (National Institute of Information and Communications Technology, Japan) Satoshi Nakamura (National Institute of Information and Communications Technology, Japan)
This paper presents a decision tree-based algorithm to cluster residual segments assuming an excitation model based on state-dependent filtering of pulse train and white noise. The decision tree construction principle is the same as the one applied to speech recognition. Here parent nodes are split using the residual maximum likelihood criterion. Once these excitation decision trees are constructed for residual signals segmented by full context models, using questions related to the full context of the training sentences, they can be utilized for excitation modeling in speech synthesis based on hidden Markov models (HMM). Experimental results have shown that the algorithm in question is very effective in terms of clustering residual signals given segmentation, pitch marks and full context questions, resulting in filters with good residual modeling properties.
|
| #10 | An improved minimum generation error based model adaptation for HMM-based speech synthesis
Yi-Jian Wu (Microsoft) Long Qin (Carnegie Mellon University) Keiichi Tokuda (Nagoya Institute of Technology)
A minimum generation error (MGE) criterion has been proposed for model training in HMM-based speech synthesis. In this paper, we apply the MGE criterion to model adaptation for HMM-based speech synthesis, and introduce an MGE linear regression (MGELR) based model adaptation algorithm, where the regression matrices used to transform source models are optimized so as to minimize the generation errors of adaptation data. In addition, we incorporate the recent improvements of the MGE criterion into MGELR-based model adaptation, including state alignment under the MGE criterion and using log spectral distortion (LSD) instead of Euclidean distance as the spectral distortion measure. In the experimental results, the adaptation performance was improved after incorporating these two techniques, and the formal listening tests showed that the quality and speaker similarity of synthesized speech after MGELR-based adaptation were significantly improved over the original MLLR-based adaptation.
|
| #11 | Two-pass decision tree construction for unsupervised adaptation of HMM-based synthesis models
Matthew Gibson (Cambridge University)
Hidden Markov model (HMM)-based speech synthesis systems possess several advantages over concatenative synthesis systems. One such advantage is the relative ease with which HMM-based systems are adapted to speakers not present in the training dataset. Speaker adaptation methods used in the field of HMM-based automatic speech recognition (ASR) are adopted for this task. In the case of unsupervised speaker adaptation, previous work has used a supplementary set of acoustic models to firstly estimate the transcription of the adaptation data. By defining a mapping between HMM-based synthesis models and ASR-style models, this paper introduces an approach to the unsupervised speaker adaptation task for HMM-based speech synthesis models which avoids the need for supplementary acoustic models. Further, this enables unsupervised adaptation of HMM-based speech synthesis models without the need to perform linguistic analysis of the estimated transcription of the adaptation data.
|
| #12 | Speaker adaptation using a parallel phone set pronunciation dictionary for Thai-English Bilingual TTS
Anocha Rugchatjaroen (National Electronics and Computer Technology Center (NECTEC), Thailand) Nattanun Thatphithakkul (National Electronics and Computer Technology Center (NECTEC), Thailand) Ananlada Chotimongkol (National Electronics and Computer Technology Center (NECTEC), Thailand) Chai Wutiwiwatchai (National Electronics and Computer Technology Center (NECTEC), Thailand) Ausdang Thangthai (National Electronics and Computer Technology Center (NECTEC), Thailand)
This paper develops a bilingual Thai-English TTS system from two monolingual HMM-based TTS systems. An English Nagoya HMM-based TTS system (HTS) provides correct pronunciations of English words but the voice is different from the voice in a Thai HTS system. We apply a CSMAPLR adaptation technique to make the English voice sound more similar to the Thai voice. To overcome a phone mapping problem that normally occurs with a pair of languages that have dissimilar phone sets, we utilize a cross-language pronunciation mapping through a parallel phone set pronunciation dictionary. The results from the subjective listening test show that English words synthesized by our proposed system are more intelligible (with 0.61 higher MOS) than those of the existing bilingual Thai-English TTS. Moreover, with the proposed adaptation method, the synthesized English words sound more similar to synthesized Thai words.
|
| #13 | HMM-based Automatic Eye-blink Synthesis from Speech
Michal Dziemianko (Centre for Speech Technology Research, University of Edinburgh, UK) Gregor Hofer (Centre for Speech Technology Research, University of Edinburgh, UK) Hiroshi Shimodaira (Centre for Speech Technology Research, University of Edinburgh, UK)
In this paper we present a novel technique to automatically synthesize eye blinking from a speech signal. Animating the eyes of a talking head is important as they are a major focus of attention during interaction. The developed system predicts eye blinks from the speech signal and generates animation trajectories automatically employing a "Trajectory Hidden Markov Model". The evaluation of the recognition performance showed that eye blinks can be predicted from speech with an F-score value upwards of 52%, which is well above chance. Additionally, a perceptual evaluation was conducted, which confirmed that adding eye blinking significantly improves the perception of the character. Finally, it showed that the speech-synchronised synthesized blinks outperform random blinking in naturalness ratings.
|