Nobuaki Minematsu, University of Tokyo
Perceptual invariance against the large acoustic variability in speech has long been discussed in speech science and engineering, and it remains an open question [2,3]. Recently, we proposed a candidate answer based on mathematically guaranteed relational invariance [4,5]. Here, completely transform-invariant features, f-divergences, are extracted from the speech dynamics of an utterance and are used to represent that utterance. In this paper, this representation is interpreted from the viewpoints of telecommunications and evolutionary anthropology. Speech production is often regarded as a process of modulating the baseline timbre of a speaker's voice by manipulating the vocal organs, i.e., spectrum modulation. Extraction of the linguistic content from an utterance can then be viewed as a process of spectrum demodulation. This modulation-demodulation model of speech communication links well to known morphological and cognitive differences between humans and apes. The model also claims that linguistic content is transmitted mainly by supra-segmental features.
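The transform invariance of f-divergences can be illustrated with a minimal sketch (not the paper's implementation): the Bhattacharyya distance, a member of the f-divergence family, between two 1-D Gaussian speech-event distributions is unchanged when both distributions undergo the same invertible affine "speaker" transform. All means, variances, and transform parameters below are hypothetical values chosen for illustration.

```python
import math

def bhattacharyya_gauss(m1, s1, m2, s2):
    # Bhattacharyya distance (an f-divergence) between N(m1, s1^2) and N(m2, s2^2)
    return (0.25 * (m1 - m2) ** 2 / (s1 ** 2 + s2 ** 2)
            + 0.5 * math.log((s1 ** 2 + s2 ** 2) / (2.0 * s1 * s2)))

# two hypothetical speech-event distributions in some feature space
m1, s1 = 1.0, 0.5
m2, s2 = 2.0, 0.8
d_orig = bhattacharyya_gauss(m1, s1, m2, s2)

# apply the same invertible affine transform x -> a*x + b to both distributions,
# modeling a speaker-dependent distortion of the feature space
a, b = 3.0, -1.5
d_warped = bhattacharyya_gauss(a * m1 + b, abs(a) * s1,
                               a * m2 + b, abs(a) * s2)

# the divergence between the two events is preserved under the transform
print(abs(d_orig - d_warped) < 1e-12)
```

The affine case is shown here only because it is easy to verify by hand; the invariance claimed in the paper holds for any invertible transform of the feature space.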