10th Annual Conference of the International Speech Communication Association

ISCA Interspeech 2009 Brighton

Technical Programme

This is the final programme for this session. For oral sessions, the timing on the left is the current presentation order, but this may still change, so please check at the conference itself.

Mon-Ses2-O4:
Speech Analysis and Processing I

Time: Monday 13:30 Place: East Wing 3 Type: Oral
Chair: Ben Milner

13:30 Nearly Perfect Detection of Continuous F0 Contour and Frame Classification for TTS Synthesis

Thomas Ewender (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Sarah Hoffmann (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)
Beat Pfister (Speech Processing Group, Computer Engineering and Networks Laboratory ETH Zurich, Switzerland)

We present a new method for the estimation of a continuous fundamental frequency (F0) contour. The algorithm implements a global optimization and yields virtually error-free F0 contours for high-quality speech signals. Such F0 contours are subsequently used to extract a continuous fundamental wave. Some local properties of this wave, together with a number of other speech features, allow the frames of a speech signal to be classified into five classes: voiced, unvoiced, mixed, irregularly glottalized and silence. The presented F0 detection and frame classification can be applied to F0 modeling and prosodic modification of speech segments in high-quality concatenative speech synthesis.

13:50 AM-FM ESTIMATION FOR SPEECH BASED ON A TIME-VARYING SINUSOIDAL MODEL

Yannis Pantazis (University of Crete)
Olivier Rosec (Orange Labs)
Yannis Stylianou (University of Crete)

In this paper we present a method based on a time-varying sinusoidal model for robust and accurate estimation of amplitude and frequency modulations (AM-FM) in speech. The suggested approach has two main steps. First, speech is represented by a sinusoidal model with time-varying amplitudes. Specifically, the model makes use of a first-order time polynomial with complex coefficients for capturing instantaneous amplitude and frequency (phase) components. Next, the model parameters are updated by using the previously estimated instantaneous phase information. Thus, an iterative scheme for AM-FM decomposition of speech is suggested. It was validated on synthetic AM-FM signals and tested on the reconstruction of voiced speech signals, with the signal-to-error reconstruction ratio (SERR) used as the measure. Compared to the standard sinusoidal representation, the suggested approach was found to improve the corresponding SERR by 47%, resulting in over 30 dB of SERR.

14:10 Voice Source Waveform Analysis and Synthesis using Principal Component Analysis and Gaussian Mixture Modelling

Jon Gudnason (Imperial College London)
Mark Thomas (Imperial College London)
Patrick Naylor (Imperial College London)
Daniel Ellis (Columbia University)

The paper presents a voice source waveform modeling technique based on principal component analysis (PCA) and Gaussian mixture modeling (GMM). The voice source is obtained by inverse-filtering speech with the estimated vocal tract filter. This decomposition is useful in speech analysis, synthesis, recognition and coding. Here, a data-driven approach is presented for signal decomposition and classification based on the principal components of the voice source. The principal components are analyzed and the 'prototype' voice source signals corresponding to the Gaussian mixture means are examined. We show how an unknown signal can be decomposed into its components and/or prototypes and resynthesized. We show how the techniques are suited to both low-bitrate and high-quality analysis/synthesis schemes.
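As a rough illustration of the decompose-and-resynthesize idea in this abstract, the following is a minimal sketch of the PCA stage only, assuming the voice source has already been cut into fixed-length, aligned cycles. The function names (`fit_pca`, `decompose`, `resynthesize`) and the synthetic training cycles are hypothetical and not taken from the paper, and the GMM prototype stage is omitted.

```python
import numpy as np

def fit_pca(frames, n_components):
    """Learn a mean frame and the leading principal components (rows)."""
    mean = frames.mean(axis=0)
    # SVD of the centred data: the rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(frames - mean, full_matrices=False)
    return mean, vt[:n_components]

def decompose(frame, mean, components):
    """Project one frame onto the principal components."""
    return components @ (frame - mean)

def resynthesize(weights, mean, components):
    """Rebuild a frame from its component weights."""
    return mean + weights @ components

# Synthetic stand-in for voice source cycles (assumption: 64 samples per
# cycle, each cycle a random mix of three fixed shapes).
rng = np.random.default_rng(0)
n = np.arange(64)
shapes = np.stack([np.sin(np.pi * k * n / 64) for k in (1, 2, 3)])
training_frames = rng.normal(size=(200, 3)) @ shapes

mean, components = fit_pca(training_frames, n_components=3)

# An unseen cycle drawn from the same three shapes.
unknown = rng.normal(size=3) @ shapes
weights = decompose(unknown, mean, components)
rebuilt = resynthesize(weights, mean, components)
```

In this toy setting three weights per cycle are enough for exact resynthesis because the data were built from three shapes. With real inverse-filtered speech the number of retained components would trade bitrate against quality.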

14:30 Model-Based Estimation Of Instantaneous Pitch In Noisy Speech

Jung Ook Hong (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

In this paper we propose a model-based approach to instantaneous pitch estimation in noisy speech, by way of incorporating pitch smoothness assumptions into the well-known harmonic model. In this approach, the latent pitch contour is modeled using a basis of smooth polynomials, and is fit to waveform data by way of a harmonic model whose partials have time-varying amplitudes. The resultant nonlinear least squares estimation task is accomplished through the Gauss-Newton method with a novel initialization step that serves to greatly increase algorithm efficiency. We demonstrate the accuracy and robustness of our method through comparisons to state-of-the-art pitch estimation algorithms using both simulated and real waveform data.
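The objective behind such a fit can be sketched as follows. This is a simplified illustration rather than the authors' algorithm: the pitch contour is an ordinary polynomial in time, the partial amplitudes are held constant over the frame (the abstract allows them to vary), and the Gauss-Newton search and its initialization are not shown. The names `harmonic_basis` and `residual_energy` are hypothetical.

```python
import numpy as np
from numpy.polynomial import polynomial as P

def harmonic_basis(t, pitch_coeffs, n_harmonics):
    """Cosine/sine columns for each harmonic of a polynomial pitch contour.

    pitch_coeffs are in ascending order: f0(t) = c0 + c1*t + ... (Hz).
    The phase is 2*pi times the integral of f0, taken to be zero at t = 0.
    """
    phase = 2 * np.pi * P.polyval(t, P.polyint(pitch_coeffs))
    columns = []
    for h in range(1, n_harmonics + 1):
        columns.append(np.cos(h * phase))
        columns.append(np.sin(h * phase))
    return np.column_stack(columns)

def residual_energy(signal, t, pitch_coeffs, n_harmonics):
    """Squared error left after the best linear fit of the amplitudes."""
    basis = harmonic_basis(t, pitch_coeffs, n_harmonics)
    amplitudes, *_ = np.linalg.lstsq(basis, signal, rcond=None)
    residual = signal - basis @ amplitudes
    return float(residual @ residual)

# Noise-free toy frame: 40 ms at 8 kHz with f0(t) = 120 + 40 t Hz.
fs = 8000.0
t = np.arange(320) / fs
true_phase = 2 * np.pi * (120.0 * t + 20.0 * t ** 2)
signal = (np.cos(true_phase)
          + 0.5 * np.sin(2 * true_phase)
          + 0.25 * np.cos(3 * true_phase))

err_true = residual_energy(signal, t, [120.0, 40.0], n_harmonics=3)
err_off = residual_energy(signal, t, [125.0, 40.0], n_harmonics=3)
```

A nonlinear least squares solver would search over `pitch_coeffs` to minimise this residual; the comparison between `err_true` and `err_off` only shows that the objective is smaller at the true contour than at a mistuned one.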

14:50 Complex Cepstrum-based Decomposition of Speech for Glottal Source Estimation

Thomas Drugman (Faculté Polytechnique de Mons)
Baris Bozkurt (Izmir Institute of Technology)
Thierry Dutoit (Faculté Polytechnique de Mons)

Homomorphic analysis is a well-known method for the separation of non-linearly combined signals. More particularly, the use of the complex cepstrum for source-tract deconvolution has been discussed in various articles. However, there exists no study which proposes a glottal flow estimation methodology based on the cepstrum and reports effective results. In this paper, we show that the complex cepstrum can be effectively used for glottal flow estimation by separating the causal and anticausal components of a windowed speech signal, as done by the Zeros of the Z-Transform (ZZT) decomposition. Based on exactly the same principles presented for ZZT decomposition, windowing should be applied such that the windowed speech signals exhibit mixed-phase characteristics which conform to the speech production model, in which the anticausal component is mainly due to the glottal flow open phase. The advantage of the complex cepstrum-based approach compared to the ZZT decomposition is its much higher speed.
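The causal/anticausal split can be illustrated on a toy mixed-phase sequence. The sketch below is a generic FFT-based complex cepstrum, not the authors' implementation. It assumes the frame has no net linear-phase (delay) term and no spectral zeros on the unit circle, and it does not address the windowing requirements discussed in the abstract. All function names are hypothetical.

```python
import numpy as np

def complex_cepstrum(frame):
    """Inverse FFT of log-magnitude plus j times the unwrapped phase."""
    spectrum = np.fft.fft(frame)
    log_spectrum = (np.log(np.abs(spectrum))
                    + 1j * np.unwrap(np.angle(spectrum)))
    return np.fft.ifft(log_spectrum).real

def split_causal_anticausal(cepstrum):
    """Separate non-negative and negative quefrencies (FFT index order)."""
    n = len(cepstrum)
    causal = np.zeros(n)
    anticausal = np.zeros(n)
    causal[: n // 2] = cepstrum[: n // 2]       # quefrencies 0 .. n/2 - 1
    anticausal[n // 2 :] = cepstrum[n // 2 :]   # wrapped negative quefrencies
    return causal, anticausal

def cepstrum_to_signal(cepstrum):
    """Map a cepstrum back to the time domain."""
    return np.fft.ifft(np.exp(np.fft.fft(cepstrum))).real

# Toy mixed-phase frame: (1 + a z^-1)(1 + b z) with |a|, |b| < 1.
# The anticausal tap at time -1 is stored at the last index (circular time).
n = 64
a, b = 0.5, 0.3
frame = np.zeros(n)
frame[0] = 1 + a * b
frame[1] = a
frame[-1] = b

causal_cep, anticausal_cep = split_causal_anticausal(complex_cepstrum(frame))
causal_part = cepstrum_to_signal(causal_cep)          # approx. 1 + a z^-1
anticausal_part = cepstrum_to_signal(anticausal_cep)  # approx. 1 + b z
```

Under the speech production model described above, the anticausal part would correspond mainly to the glottal open phase. Here it is simply the maximum-phase factor of the toy sequence.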

15:10 Approximate Intrinsic Fourier Analysis of Speech

Frank Tompkins (Statistics and Information Sciences Laboratory, Harvard University)
Patrick J. Wolfe (Statistics and Information Sciences Laboratory, Harvard University)

Popular parametric models of speech sounds such as the source-filter model provide a fixed means of describing the variability inherent in speech waveform data. However, nonlinear dimensionality reduction techniques such as the intrinsic Fourier analysis method of Jansen and Niyogi provide a more flexible means of adaptively estimating such structure directly from data. Here we employ this approach to learn a low-dimensional manifold whose geometry is meant to reflect the structure implied by the human speech production system. We derive a novel algorithm to efficiently learn this manifold for the case of many training examples, the setting of both greatest practical interest and greatest computational difficulty. We then demonstrate the utility of our method by way of a proof-of-concept phoneme identification system that operates effectively in the intrinsic Fourier domain.