Combining greedy algorithms with expert guided manipulation for the definition of a balanced prosodic Spanish-Catalan radio news corpus

David Escudero-Mancebo, Cesar Gonzalez Ferreras, Juan María Garrido Almiñana, Enma Rodero, Lourdes Aguilar, Antonio Bonafonte, Universidad de Valladolid

This article reports the process of building a bilingual (Spanish-Catalan) text corpus that is balanced in parallel for both languages with respect to prosodic features. We propose an expert guideline for text manipulation that, in combination with greedy algorithms, significantly improves the quality of the selected corpus. Applying this methodology to a radio news corpus empirically supports the proposed strategy.
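
To make the greedy component concrete, the following is a minimal sketch of a coverage-driven greedy selection for a single language. It is an illustration only, not the procedure used in this work: the function name `greedy_select`, the sentence identifiers, and the prosodic feature labels are all hypothetical, and the selection criterion (number of newly covered labels) is an assumption chosen for simplicity.

```python
from typing import Dict, List, Set


def greedy_select(candidates: Dict[str, Set[str]], max_sentences: int) -> List[str]:
    """Pick sentences one at a time, each time taking the one that adds
    the most not-yet-covered feature labels (assumed criterion)."""
    covered: Set[str] = set()
    selected: List[str] = []
    remaining = dict(candidates)  # copy so the caller's dict is not modified
    while remaining and len(selected) < max_sentences:
        # Iterate over sorted keys so ties are broken deterministically.
        best = max(sorted(remaining), key=lambda s: len(remaining[s] - covered))
        if not remaining[best] - covered:
            break  # no remaining candidate adds new coverage
        selected.append(best)
        covered |= remaining.pop(best)
    return selected


# Hypothetical toy data: sentence id -> set of prosodic feature labels.
toy = {
    "s1": {"H*_initial", "L%_final"},
    "s2": {"H*_initial", "L+H*_medial", "H%_final"},
    "s3": {"L%_final"},
    "s4": {"L*_medial"},
}
picked = greedy_select(toy, max_sentences=3)
print(picked)
```

In a sketch like this, a sentence that is never picked because its features are already covered is simply skipped. Expert-guided manipulation would act on the candidate texts themselves, which this code does not attempt to model.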