Speech separation algorithms for improving human-machine interaction
Francisco Javier Ibarrola
University of Illinois at Urbana-Champaign
Over the last decade, technological advances and the massive adoption of portable electronic devices with high computational capacity have significantly moved the boundaries of human-machine interaction. Many such applications, including automatic translation systems, emotional-state recognition devices, and digital personal assistants, use speech as their input. Although these applications currently perform reasonably well under controlled laboratory conditions, a main limitation arises when moving them into everyday life: the recording devices unfailingly capture not only the target speech but also sounds from other sources active at the same time. The need to isolate or clean the sound captured by a given recording device has given rise to the field of Source Separation. Several approaches exist for performing this task, based mainly on characteristics of the sources (statistical independence, sparsity, etc.), but they share two main disadvantages: they require at least as many microphones as active sound sources in order to work reasonably well, and their performance is severely degraded by reverberation and other ambient disturbances. In this proposal, we plan to combine Machine Learning techniques (mainly Neural Networks and Nonnegative Matrix Factorization-based representations) to build a new Speech Source Separation method that greatly enhances separation quality, overcoming the current limitations of human-machine interaction applications.
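To illustrate the kind of representation the proposal builds on, the sketch below factors a toy nonnegative "spectrogram" matrix V into spectral templates W and time activations H via the classical Lee-Seung multiplicative updates for the Frobenius objective. This is a minimal, generic NMF illustration, not the proposed method; all names and parameters here are hypothetical.

```python
import numpy as np

def nmf(V, rank, n_iter=200, eps=1e-9):
    """Factor a nonnegative matrix V (m x n) as W @ H, with W (m x rank)
    and H (rank x n), using multiplicative updates (illustrative sketch)."""
    rng = np.random.default_rng(0)
    m, n = V.shape
    W = rng.random((m, rank)) + eps
    H = rng.random((rank, n)) + eps
    for _ in range(n_iter):
        # Multiplicative updates preserve nonnegativity of W and H
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
    return W, H

# Toy example: a mixture of two spectral templates over 100 time frames
rng = np.random.default_rng(1)
true_W = rng.random((64, 2))   # two hypothetical spectral templates
true_H = rng.random((2, 100))  # their activations over time
V = true_W @ true_H

W, H = nmf(V, rank=2)
rel_err = np.linalg.norm(V - W @ H) / np.linalg.norm(V)
```

In a source-separation setting, columns of W learned from each speaker's training data can be used to attribute energy in a mixture spectrogram to individual sources; the proposal aims to go beyond this baseline by combining such representations with neural networks.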