Name: Akash Krishnan and Matthew Fernandez
School: Oregon Episcopal School, Portland, Oregon
Project Title: "The Classification and Recognition of Emotions in Prerecorded Speech"
Overview: Speech processing is a growing and important field. Our project focuses on speech audio processing, specifically emotional speech processing. The idea of recognizing emotion with computer algorithms and digital signal processing has existed for about 30 years, but it has only recently begun to attract wider interest. Speech recognition, also known as word or sentence recognition, is the ability of a computer to convert an audio signal into text (speech-to-text). Emotion recognition is the ability of a computer to determine the emotional content of an audio signal. Many of the same techniques are used for both word recognition and emotion recognition; the major difference lies in the physical and psychological aspects of human speech. Current emotion detectors are about 50% accurate at identifying the dominant emotion from among four to five emotions. Because accuracy this low leaves few feasible applications, a second goal of our project was to develop life-changing applications, each with a specific and distinct purpose. Some of our ideas are broad, and others are ready to be implemented and tested quickly.
We procured a database of 534 emotional audio files in German covering seven different emotions. The more emotions there are to distinguish, the harder each one is to identify; many people cannot tell apart more than four to five emotions at a time. Our background research indicates that an average German speaker identifies these seven emotions correctly about 80% of the time. We wrote a computer program in Matlab to extract 57 features. The first 18 features are statistical values. The remaining 39 are Mel-Frequency Cepstral Coefficients (MFCCs): 13 original coefficients plus their first and second derivatives. Using these 57 features, we developed a two-part, feature-based classification engine that trains on and tests audio signals for emotional content. We also plot the MFCCs and re-sort these graphs from least to greatest, a novel method that increases our accuracy significantly. In addition, we remove silence from the signals, another step that improves on current technology.
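The preprocessing steps described above can be illustrated with a minimal sketch. It is written in Python rather than the Matlab used in the project, and it is not the project's actual code. Several details are assumptions made only for illustration: silence is detected by a fixed amplitude threshold per frame, derivatives are approximated by a central difference, each of the 13 MFCCs is treated as a list of per-frame values, and "re-sorting" is taken to mean ordering each such list from least to greatest. The names `remove_silence`, `delta`, and `sorted_mfcc_tracks` are hypothetical.

```python
from typing import List

N_STAT = 18  # statistical features (count taken from the text)
N_MFCC = 13  # original MFCC coefficients (count taken from the text)


def remove_silence(samples: List[float], frame_len: int, threshold: float) -> List[float]:
    """Keep only frames whose mean absolute amplitude reaches the threshold.

    Assumption: a fixed threshold; the project's silence criterion is not given.
    """
    kept: List[float] = []
    for start in range(0, len(samples), frame_len):
        frame = samples[start:start + frame_len]
        level = sum(abs(x) for x in frame) / len(frame)
        if level >= threshold:
            kept.extend(frame)
    return kept


def delta(track: List[float]) -> List[float]:
    """Central-difference estimate of the time derivative (edges are clamped)."""
    n = len(track)
    return [(track[min(t + 1, n - 1)] - track[max(t - 1, 0)]) / 2.0 for t in range(n)]


def sorted_mfcc_tracks(mfcc: List[List[float]]) -> List[List[float]]:
    """Return coefficients, first and second derivatives, each sorted ascending.

    `mfcc` holds one list of per-frame values for each coefficient.
    """
    first = [delta(track) for track in mfcc]
    second = [delta(track) for track in first]
    return [sorted(track) for track in mfcc + first + second]
```

With 13 coefficient tracks as input, `sorted_mfcc_tracks` returns 39 tracks, which together with the 18 statistical values gives the 57 features mentioned in the text. Sorting discards the time order of each track and keeps only the distribution of its values.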
Our results are well above current industry and research levels. When we train our program on 100% of the database and test on the same files, we achieve 77% accuracy. To test the program further, we trained on half of the database and tested on the other half. Accuracy on this test dropped to 71% with a variance of 2% (based on 20 tests); we expected lower results because the program has less knowledge of each emotion. These results advance the field, and we also have preliminary results with a second database. This database has five emotions and over 18,000 files; each file contains one word spoken by an emotional child. Our program was 95% accurate when trained on 100% of the database and 91% accurate when trained on 50%. This same 50% training set was used in an Interspeech 2009 contest, where the winner was 65% accurate, so our results are well above the best results of other researchers.
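The half-and-half evaluation described above can be sketched as follows. This is an illustration in Python, not the project's Matlab code. The project's two-part classification engine is not specified here, so a simple nearest-centroid classifier stands in for it; that substitution, the random shuffling, the fixed seed, and the use of population variance are all assumptions. The names `train_centroids`, `predict`, `accuracy`, and `repeated_half_split` are hypothetical.

```python
import random
from statistics import mean, pvariance
from typing import Dict, List, Tuple

Sample = Tuple[List[float], str]  # (feature vector, emotion label)


def train_centroids(train: List[Sample]) -> Dict[str, List[float]]:
    """Average the feature vectors of each emotion (stand-in classifier)."""
    sums: Dict[str, List[float]] = {}
    counts: Dict[str, int] = {}
    for feats, label in train:
        acc = sums.setdefault(label, [0.0] * len(feats))
        for i, value in enumerate(feats):
            acc[i] += value
        counts[label] = counts.get(label, 0) + 1
    return {label: [v / counts[label] for v in acc] for label, acc in sums.items()}


def predict(centroids: Dict[str, List[float]], feats: List[float]) -> str:
    """Pick the emotion whose centroid is closest in squared Euclidean distance."""
    return min(
        centroids,
        key=lambda label: sum((a - b) ** 2 for a, b in zip(centroids[label], feats)),
    )


def accuracy(centroids: Dict[str, List[float]], test: List[Sample]) -> float:
    """Fraction of test samples whose predicted emotion matches the label."""
    hits = sum(1 for feats, label in test if predict(centroids, feats) == label)
    return hits / len(test)


def repeated_half_split(data: List[Sample], runs: int = 20, seed: int = 0) -> Tuple[float, float]:
    """Train on a random half and test on the other half, `runs` times.

    Returns the mean accuracy and the population variance of the accuracies.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(runs):
        shuffled = data[:]
        rng.shuffle(shuffled)
        half = len(shuffled) // 2
        centroids = train_centroids(shuffled[:half])
        scores.append(accuracy(centroids, shuffled[half:]))
    return mean(scores), pvariance(scores)
```

Calling `accuracy(train_centroids(data), data)` corresponds to the "train on 100% and test on the same files" setting, while `repeated_half_split(data)` corresponds to the 20 half-and-half tests. Scores from the first setting tend to be optimistic because the classifier has already seen every test file.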
To further our research, we plan to evaluate our program on new emotion databases and to attempt to identify the age or gender of speakers. This year we reached new heights in emotion recognition through our innovative approach using the novel MFCC re-sorting. Next year, we plan to explore phoneme identification. By identifying the phonemes, the short sound units that make up syllables, in a signal, we will further eliminate sounds that do not contribute to emotion and tailor our recognition to specific sounds. Autistic children have difficulty interpreting emotion, so we are applying our research to create a device that signals the emotion in speech by showing a happy, sad, angry, or other face on a wristwatch.