The Second Emotion Recognition In The Wild Challenge (EmotiW 2014)

@ ACM International Conference on Multimodal Interaction (ICMI 2014), Istanbul



The Emotion Recognition In The Wild Challenge and Workshop (EmotiW) 2014 Grand Challenge consists of an audio-video based emotion classification task that mimics real-world conditions. Traditionally, emotion recognition has been performed on laboratory-controlled data. While undoubtedly worthwhile at the time, such lab-controlled data poorly represents the environments and conditions faced in real-world situations. With the increase in the number of video clips available online, it is worthwhile to explore the performance of emotion recognition methods that work 'in the wild'. The goal of this Grand Challenge is to extend and carry forward the common platform for evaluating emotion recognition methods in real-world conditions that was defined in EmotiW 2013 at the ACM International Conference on Multimodal Interaction 2013.

The database for the 2014 challenge is Acted Facial Expressions in the Wild (AFEW) 4.0, which has been collected from movies depicting close-to-real-world conditions. Three sets, for training, validation and testing, will be made available. The challenge seeks participation from researchers working on emotion recognition who intend to create, extend and validate their methods on data captured in real-world conditions.

The database will be divided into three sets for the challenge: training, validation and testing. The current version of AFEW 4.0, available at cs.anu.edu.au/few, contains two sets; extended versions of these sets will be used for training and validation, while new, unseen data will be used for testing. The task is to classify a sample audio-video clip into one of seven categories: Anger, Disgust, Fear, Happiness, Neutral, Sadness and Surprise. The labelled training and validation sets will be made available early, and the new, unlabelled test set will be released on 4 July 2014. There are no separate video-only, audio-only or audio-video challenges; participants are free to use either modality or both, and results for all methods will be combined into a single set at the end. Participants are allowed to use their own features and classification methods. The labels of the test set will not be released. Participants must adhere to the defined training, validation and test splits. In their papers, they may report results obtained on the training and validation sets, but only the results on the test set will be taken into account for the overall Grand Challenge results. Each participating team is required to submit a paper describing its method.
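To make the evaluation concrete, a minimal sketch of how results over the seven categories could be scored as overall classification accuracy is shown below. The label strings follow the challenge definition; the example clips and predictions are illustrative placeholders, not challenge data.

```python
# Hypothetical scoring sketch: overall classification accuracy over the
# seven EmotiW emotion categories. Labels/predictions below are made up.
EMOTIONS = ["Anger", "Disgust", "Fear", "Happiness",
            "Neutral", "Sadness", "Surprise"]

def accuracy(true_labels, predicted_labels):
    """Fraction of clips whose predicted category matches the ground truth."""
    assert len(true_labels) == len(predicted_labels)
    correct = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return correct / len(true_labels)

# Toy example: 4 clips, 2 correct predictions -> accuracy 0.5
truth = ["Anger", "Fear", "Happiness", "Neutral"]
preds = ["Anger", "Happiness", "Happiness", "Sadness"]
print(accuracy(truth, preds))  # 0.5
```

Since only test-set results count towards the Grand Challenge ranking, any metric computed on the training or validation sets in this way is for development purposes only.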

Baseline
We provide audio and video baselines. For video, the face is localized using the Mixture of Parts framework of Zhu and Ramanan (2012), and tracking is performed using the IntraFace library. The fiducial points generated by IntraFace are used to align the face. After alignment, LBP-TOP features are extracted from non-overlapping 4x4 spatial blocks, and the LBP-TOP features from all blocks are concatenated into a single feature vector. A non-linear RBF-kernel SVM is learnt for emotion classification. The video-only baseline system achieves 34.4% classification accuracy. The audio baseline is computed by extracting features using the OpenSmile toolkit and learning a linear SVM classifier; the audio-only system gives 26.2% classification accuracy. Feature-level fusion is also performed, where the audio and video features are concatenated and a non-linear RBF-kernel SVM is learnt; here the performance drops to 28.2% classification accuracy. On the test set, the video-only accuracy is 33.7%, the audio-only accuracy is 26.7% and the audio-video feature-fusion accuracy is 24.6%.
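The feature-level fusion step described above can be sketched as follows. This is not the organisers' baseline code: the feature arrays, their dimensionalities and the SVM hyperparameters are placeholders standing in for the real LBP-TOP (video) and OpenSmile (audio) features, with scikit-learn's SVC used as the RBF-kernel SVM.

```python
# Sketch of feature-level fusion: concatenate per-clip audio and video
# feature vectors, then train an RBF-kernel SVM over 7 emotion classes.
# All features below are random placeholders, not LBP-TOP/OpenSmile output.
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n_train, n_val = 60, 20
video_dim, audio_dim = 64, 32   # placeholder feature dimensionalities

Xv_train = rng.normal(size=(n_train, video_dim))  # "video" features
Xa_train = rng.normal(size=(n_train, audio_dim))  # "audio" features
y_train = rng.integers(0, 7, size=n_train)        # 7 emotion classes

Xv_val = rng.normal(size=(n_val, video_dim))
Xa_val = rng.normal(size=(n_val, audio_dim))

# Feature-level fusion: simple concatenation of the two modalities
X_train = np.hstack([Xv_train, Xa_train])
X_val = np.hstack([Xv_val, Xa_val])

clf = SVC(kernel="rbf", C=1.0, gamma="scale")  # non-linear RBF-kernel SVM
clf.fit(X_train, y_train)
pred = clf.predict(X_val)  # one predicted class label per validation clip
```

Because concatenation lets a high-dimensional, noisy modality dominate the kernel, fused features can underperform the better single modality, consistent with the accuracy drop reported for the fusion baseline.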

For any queries, please email: EmotiW2014@gmail.com