10th Annual Conference of the International Speech Communication Association
Interspeech 2009 Brighton
|
Technical Programme
This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.
Wed-Ses1-O4: Voice Transformation I
| Time: | Wednesday 10:00 |
| Place: | East Wing 3 |
| Type: | Oral |
| Chair: | Yannis Stylianou |
| 10:00 | Many-to-many eigenvoice conversion with reference voice
Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology); Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology); Hiroshi Saruwatari (Graduate School of Information Science, Nara Institute of Science and Technology); Kiyohiro Shikano (Graduate School of Information Science, Nara Institute of Science and Technology)
We propose many-to-many voice conversion (VC) techniques to convert an arbitrary source voice into an arbitrary target voice. We have previously proposed one-to-many eigenvoice conversion (EVC) and many-to-one EVC. In EVC, an eigenvoice GMM (EV-GMM) is trained in advance using multiple parallel data sets of a reference speaker and many pre-stored speakers. The EV-GMM is flexibly adapted to an arbitrary speaker using a small amount of data. In this paper, we realize many-to-many VC by sequentially performing many-to-one EVC and one-to-many EVC through the reference speaker using the same EV-GMM. Experimental results demonstrate the effectiveness of the proposed method.
|
| 10:20 | Alleviating the One-to-Many Mapping Problem in Voice Conversion with Context-Dependent Modeling
Elizabeth Godoy (Orange Labs); Olivier Rosec (Orange Labs); Thierry Chonavel (Telecom Bretagne)
This paper addresses the "one-to-many" mapping problem in Voice Conversion (VC) by exploring source-to-target mappings in GMM-based spectral transformation. Specifically, we examine differences using source-only versus joint source/target information in the classification stage of transformation, effectively illustrating a "one-to-many effect" in the traditional acoustically-based GMM. We propose combating this effect by using phonetic information in the GMM learning and classification. We then show the success of our proposed context-dependent modeling with transformation results using an objective error criterion. Finally, we discuss implications of our work in adapting current approaches to VC.
|
| 10:40 | Efficient Modeling of Temporal Structure of Speech For Applications in Voice Transformation
Binh Phu Nguyen (School of Information Science, Japan Advanced Institute of Science and Technology); Akagi Masato (School of Information Science, Japan Advanced Institute of Science and Technology)
Voice transformation aims to change the style of given utterances. Most voice transformation methods process speech signals in the time-frequency domain. In the time domain, when processing spectral information, conventional methods do not consider relations between neighboring frames. If unexpected modifications occur, discontinuities arise between frames, which degrades speech quality. This paper proposes a new model of the temporal structure of speech that ensures the smoothness of the transformed speech and thereby improves speech quality in voice transformation. Specifically, we propose an improvement of the temporal decomposition (TD) technique to model the temporal structure of speech. We investigate TD in two applications, concatenative speech synthesis and spectral voice conversion. Experimental results confirm the effectiveness of TD in improving the quality of the transformed speech.
|
| 11:00 | Cross-Language Voice Conversion Based on Eigenvoices
Malorie Charlier (Faculté Polytechnique de Mons); Yamato Ohtani (Graduate School of Information Science, Nara Institute of Science and Technology); Tomoki Toda (Graduate School of Information Science, Nara Institute of Science and Technology); Alexis Moinet (Faculté Polytechnique de Mons); Thierry Dutoit (Faculté Polytechnique de Mons)
This paper presents a novel cross-language voice conversion (VC) method based on eigenvoice conversion (EVC). Cross-language VC is a technique for converting voice quality between two speakers who speak different languages. In general, parallel data consisting of utterance pairs from those two speakers are not available. To deal with this problem, we apply EVC to cross-language VC because the EVC framework can build the conversion model without using parallel data. The results of subjective evaluations demonstrate that the proposed method yields significant performance improvements compared with a conventional cross-language VC method based on frame selection.
|
| 11:20 | Voice Conversion using K-Histograms and Frame Selection
Alejandro José Uriz (FI-UNMDP); Pablo Daniel Agüero (FI-UNMDP); Antonio Bonafonte (Universitat Politècnica de Catalunya, Barcelona, Spain); Juan Carlos Tulli (FI-UNMDP)
The goal of voice conversion systems is to modify the voice of a source speaker so that it is perceived as if it had been uttered by another specific speaker. Many approaches found in the literature are based on statistical models and introduce oversmoothing in the target features. Our proposal is a new model that combines several techniques used in unit selection for text-to-speech with a non-Gaussian mathematical transformation model. Subjective results support the proposed approach.
|
| 11:40 | Online Model Adaptation for Voice Conversion using Model-based Speech Synthesis Technique
Dalei Wu (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA); Baojie Li (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA); Hui Jiang (Department of Computer Science and Engineering, York University, 4700 Keele Street, Toronto, Ontario M3J 1P3, CANADA); Qianjie Fu (House Ear Institute, 2100 West Third Street, Los Angeles, CA 90057, USA)
In this paper, we present a novel voice conversion method using model-based speech synthesis that can be used in applications where prior knowledge or training data is not available from the source speaker. In the proposed method, training data from a target speaker is used to build a GMM-based speech model, and voice conversion is then performed for each utterance from the source speaker according to the pre-trained target speaker model. To reduce the mismatch between source and target speakers, we propose online model adaptation based on maximum likelihood linear regression (MLLR) to improve model selection accuracy. Objective and subjective evaluations suggest that the proposed methods are quite effective in generating acceptable voice quality for voice conversion even without training data from source speakers.
|