Analysis of Duration Prediction Accuracy in HMM-Based Speech Synthesis

Hanna Silén, Elina Helander, Jani Nurminen, Moncef Gabbouj, Tampere University of Technology, Department of Signal Processing

Appropriate phoneme durations are essential for high-quality speech synthesis. In hidden Markov model-based text-to-speech (HMM-TTS), durations are typically modeled statistically using state duration probability distributions, with durations for unseen contexts obtained by prediction. The use of rich context features enables synthesis without high-level linguistic knowledge. In this paper, we compare the accuracy of state duration modeling with that of phone duration modeling using simple prediction techniques. In addition to decision tree-based techniques, regression techniques suited to rich context features with high collinearity are discussed and evaluated.
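
To illustrate why collinear context features call for a suitable regression technique, the following is a minimal sketch that uses ridge regression as a stand-in; the abstract does not name the specific regression methods evaluated, so the choice of ridge is an assumption. The data are synthetic, and the names `fit_ridge`, `predict`, `lam`, and the log-duration target are hypothetical and introduced only for this example.

```python
import numpy as np


def fit_ridge(X, y, lam=1.0):
    """Closed-form ridge regression (an assumed stand-in technique).

    The intercept is handled by centering X and y, so it is not penalized.
    """
    x_mean = X.mean(axis=0)
    y_mean = y.mean()
    Xc = X - x_mean
    yc = y - y_mean
    n_features = X.shape[1]
    # Solve (Xc^T Xc + lam * I) w = Xc^T yc; the lam * I term keeps the
    # system well conditioned even when columns of X are nearly identical.
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(n_features), Xc.T @ yc)
    b = y_mean - x_mean @ w
    return w, b


def predict(X, w, b):
    """Apply the fitted linear model to a feature matrix."""
    return X @ w + b


# Synthetic data (not from the paper): three almost identical feature
# columns mimic highly collinear context features.
rng = np.random.default_rng(0)
n = 200
base = rng.normal(size=(n, 1))
X = np.hstack([base + 1e-3 * rng.normal(size=(n, 1)) for _ in range(3)])

# Hypothetical target: log-duration depending linearly on the shared factor.
log_dur = 4.0 + 0.5 * base[:, 0] + 0.05 * rng.normal(size=n)

w, b = fit_ridge(X, log_dur, lam=1.0)
rmse = float(np.sqrt(np.mean((predict(X, w, b) - log_dur) ** 2)))
print("weights:", w, "intercept:", b, "RMSE:", rmse)
```

With nearly duplicated columns, an unregularized least-squares fit can assign large weights of opposite sign to the copies. The penalty term instead spreads the effect roughly evenly across them, which is the behavior this sketch is meant to show.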