Prediction-based classification for audiovisual discrimination between laughter and speech


Petridis, Stavros and Pantic, Maja and Cohn, Jeffrey F. (2011) Prediction-based classification for audiovisual discrimination between laughter and speech. In: IEEE International Conference on Automatic Face & Gesture Recognition and Workshops, FG 2011, 21-25 March 2011, Santa Barbara, CA (pp. 619-626).

Abstract: Recent evidence in neuroscience supports the theory that prediction of spatial and temporal patterns in the brain plays a key role in human actions and perception. Inspired by these findings, a system that discriminates laughter from speech by modeling the spatial and temporal relationship between audio and visual features is presented. The underlying assumption is that this relationship is different between speech and laughter. Neural networks are trained which learn the audio-to-visual and visual-to-audio feature mapping together with the time evolution of audio and visual features for both classes. Classification of a new frame / sequence is performed via prediction. All the networks produce a prediction of the expected audio / visual features, and their prediction errors are combined for each class. The model which best describes the audiovisual feature relationship, i.e., results in the lowest prediction error, provides its label to the input frame / sequence. Using 4 different datasets, the proposed system is compared to standard feature-level fusion on cross-database experiments. In almost all test cases, prediction-based classification outperforms feature-level fusion. Similar conclusions are drawn when adding artificial feature-level noise to the datasets.
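The classification-via-prediction idea described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: it substitutes per-class linear least-squares predictors for the paper's neural networks, ignores temporal modeling, and uses a simple sum of mean-squared errors as the combination rule. All class and variable names are illustrative.

```python
import numpy as np

class PredictionBasedClassifier:
    """For each class, fit an audio-to-visual and a visual-to-audio
    predictor; label a new sequence with the class whose predictors
    yield the lowest combined prediction error."""

    def __init__(self):
        self.models = {}  # label -> (W_audio_to_visual, W_visual_to_audio)

    @staticmethod
    def _fit_linear(X, Y):
        # Append a bias column and solve min ||[X 1] W - Y||^2.
        Xb = np.hstack([X, np.ones((len(X), 1))])
        W, *_ = np.linalg.lstsq(Xb, Y, rcond=None)
        return W

    def fit(self, audio, visual, labels):
        # Train one mapping pair per class on that class's frames only.
        for c in np.unique(labels):
            A, V = audio[labels == c], visual[labels == c]
            self.models[c] = (self._fit_linear(A, V), self._fit_linear(V, A))

    def _error(self, audio, visual, c):
        W_a2v, W_v2a = self.models[c]
        Ab = np.hstack([audio, np.ones((len(audio), 1))])
        Vb = np.hstack([visual, np.ones((len(visual), 1))])
        e_v = np.mean((Ab @ W_a2v - visual) ** 2)  # audio-to-visual error
        e_a = np.mean((Vb @ W_v2a - audio) ** 2)  # visual-to-audio error
        return e_a + e_v  # combined prediction error for class c

    def predict(self, audio, visual):
        # The class whose models best describe the audiovisual
        # relationship (lowest error) provides the label.
        return min(self.models, key=lambda c: self._error(audio, visual, c))
```

As a usage sketch, one can generate two synthetic classes whose visual features are different linear functions of the same audio features; the classifier then recovers the class of held-out frames purely from which mapping predicts them best.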
Item Type:Conference or Workshop Item
Copyright:© 2011 IEEE
Electrical Engineering, Mathematics and Computer Science (EEMCS)

