Speaker-independent vowel recognition combining voice features and mouth shape image with neural network



This paper describes a neural network approach intended to improve the performance of a voice recognition system for unrestricted speakers by using not only acoustic voice features but also image features of the mouth shape. The FFT power spectrum of the acoustic speech signal was used as the voice feature. In addition, gray-level images, binary images, and geometrical shape features of the mouth were used as supplementary information, and these were compared to determine which kinds of image features are effective for voice recognition by a neural network.
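
As an illustration of how an FFT power spectrum and a binary mouth image might be joined into a single network input, the following is a minimal sketch and not the implementation used in this work. The frame length (256 samples), the number of spectral bands (32), the Hann window, the log compression, the image size (16 x 16), and the threshold (0.5) are all assumptions chosen for the example, and the function names are hypothetical.

```python
import numpy as np

def fft_power_spectrum(frame, n_bands=32):
    """Log power spectrum of one speech frame, pooled into n_bands bands.

    The window, band count, and log compression are illustrative assumptions.
    """
    windowed = frame * np.hanning(len(frame))
    power = np.abs(np.fft.rfft(windowed)) ** 2
    # Average neighbouring FFT bins so the feature vector has a fixed, small size.
    bands = np.array_split(power, n_bands)
    pooled = np.array([band.mean() for band in bands])
    return np.log(pooled + 1e-10)

def binarize_mouth(gray, threshold=0.5):
    """Turn a gray-level mouth image (values in [0, 1]) into a flat 0/1 vector.

    Pixels darker than the threshold are marked 1 (assumed to be the mouth opening).
    """
    return (gray < threshold).astype(np.float32).ravel()

def combined_features(frame, gray):
    """Concatenate voice and image features into one input vector."""
    return np.concatenate([fft_power_spectrum(frame), binarize_mouth(gray)])

# Synthetic stand-ins for a speech frame and a mouth image (not real data).
rng = np.random.default_rng(0)
frame = rng.standard_normal(256)
gray = rng.random((16, 16))

x = combined_features(frame, gray)
print(x.shape)
```

The resulting vector would then be presented to the input layer of a classifier with one output per vowel. Simple concatenation is only one possible way to combine the two feature types.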

For unrestricted speakers, a vowel recognition rate of about 80 percent was obtained using voice features alone. This increased to about 92 percent when voice features were combined with binary images. The method can be applied not only to improving voice recognition but also to aiding communication for hearing-impaired people.