T-7: Fundamentals and recent advances in HMM-based speech synthesis
Presented by
Keiichi Tokuda and Heiga Zen
Abstract
Over the last ten years, the quality of speech synthesis has drastically improved with the rise of general corpus-based
speech synthesis. In particular, state-of-the-art unit-selection speech synthesis can generate natural-sounding,
high-quality speech. However, to construct human-like talking machines, speech synthesis systems must be able
to generate speech with arbitrary speaker voice characteristics, various speaking styles including native
and non-native speaking styles in different languages, varying emphasis and focus, and/or emotional expressions.
Such flexibility is still difficult to achieve with unit-selection synthesizers, since they need a large-scale
speech corpus for each voice.
In recent years, a kind of statistical parametric speech synthesis based on hidden Markov models (HMMs) has been
developed. The system has the following features:
- The original speaker’s voice characteristics can easily be reproduced because all speech features, including
spectral, excitation, and duration parameters, are modeled in a unified HMM framework and then generated from
the trained HMMs themselves.
- Using a very small amount of adaptation speech data, voice characteristics can easily be modified by transforming
the HMM parameters with a speaker adaptation technique used in speech recognition systems.
Because of these features, the HMM-based speech synthesis approach is expected to be useful for constructing speech
synthesizers that offer the flexibility found in human voices.
In this tutorial, the system architecture is outlined, and then the basic techniques used in the system, including
algorithms for speech parameter generation from HMMs, are described with simple examples. The relation to the unit-
selection approach, trajectory modeling, recent improvements, and evaluation methodologies are summarized.
Techniques developed for increasing the flexibility and improving the speech quality are also reviewed.
Speaker Biography
Keiichi Tokuda received the Dr.Eng. degree from Tokyo Institute of Technology in 1989. He is now the director
of the Speech Processing Laboratory and a Professor in the Department of Computer Science and Engineering at
Nagoya Institute of Technology. He has been an invited researcher at ATR Spoken Language Translation Research
Laboratories and was a visiting researcher at Carnegie Mellon University from 2001 to 2002. He has been working
on HMM-based speech synthesis since he proposed an algorithm for speech parameter generation from HMMs in
1995. He is also the principal designer of the open-source software packages HTS (http://hts.sp.nitech.ac.jp/) and
SPTK (http://sp-tk.sourceforge.net/). In 2005, Keiichi Tokuda and Dr. Alan Black (CMU) organized the largest
ever evaluation of corpus-based speech synthesis techniques, the Blizzard Challenge, which has progressed to an
annual event. He was a member of the Speech Technical Committee of the IEEE Signal Processing Society from
2000 to 2003. Currently he is a member of the ISCA Advisory Council and an associate editor of IEEE Transactions
on Audio, Speech & Language Processing, and acts as organizer and reviewer for many major speech conferences,
workshops and journals. He has published over 60 journal papers and over 150 conference papers, and has received
5 paper awards.
Heiga Zen received the Dr.Eng. degree in computer science and engineering from Nagoya Institute of Technology
in 2006. He is currently a Research Engineer in the Speech Technology Group of Toshiba Research Europe Ltd.,
Cambridge Research Laboratory. He was an intern researcher at the ATR Spoken Language Translation Research
Laboratories in 2003 and an intern/co-op researcher at the IBM T. J. Watson Research Center from 2004 to 2005.
From April 2006 to July 2008, he was a postdoctoral research associate at the Nagoya Institute of Technology. He
has been working on HMM-based speech synthesis for 8 years, since joining Prof. Tokuda’s research group in 2000.
He was also the main developer and maintainer of HTS, one of the main developers of the Festival Speech Synthesis
System, one of the main developers of SPTK, and one of the active contributors to the hidden Markov model toolkit
(HTK). He has published 10 journal papers and over 40 conference papers, and has received 5 paper awards.