Advancing End-to-End Indonesian Speech Emotion Recognition Using XLSR-Wav2Vec2

Speech emotion recognition (SER) is crucial for enhancing human-computer interaction, particularly in languages with limited speech resources, such as Indonesian. This study compares three approaches to Indonesian SER on the IndoWaveSentiment dataset: a Support Vector Machine (SVM) using Mel-Frequency Cepstral Coefficient (MFCC) features, a Convolutional Neural Network (CNN) using Mel-spectrograms, and a self-supervised model, Cross-Lingual Speech Representations (XLSR) Wav2Vec2, which learns directly from raw audio signals. While the SVM with a linear kernel performs well given the limited data size and the CNN benefits from optimized convolutional settings, both rely on manual feature engineering. In contrast, XLSR-Wav2Vec2 leverages end-to-end learning and cross-lingual speech representations to extract rich acoustic features without handcrafted preprocessing. The model fine-tuned only on Indonesian speech (monolingual) achieved the highest accuracy (0.9167) and F1-score (0.9166), outperforming both baselines. These findings confirm the effectiveness of self-supervised models in low-resource settings and open future directions for expanding SER capabilities in Indonesian.
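
To make the end-to-end approach concrete, the sketch below shows how an XLSR-Wav2Vec2 checkpoint can be adapted for utterance-level emotion classification with Hugging Face Transformers, taking raw audio as input with no handcrafted features. This is a minimal sketch: the checkpoint name, emotion label set, and the choice to freeze the convolutional feature encoder are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch of XLSR-Wav2Vec2 for speech emotion classification.
# Assumptions: facebook/wav2vec2-large-xlsr-53 checkpoint, 16 kHz mono audio,
# and a placeholder label set (the IndoWaveSentiment labels may differ).
import torch
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2ForSequenceClassification

LABELS = ["angry", "happy", "neutral", "sad", "surprise"]  # assumed labels

# Feature extractor only normalizes and pads the raw waveform (no MFCCs).
extractor = Wav2Vec2FeatureExtractor(
    feature_size=1,
    sampling_rate=16_000,
    padding_value=0.0,
    do_normalize=True,
    return_attention_mask=True,
)

# Pretrained XLSR encoder with a newly initialized classification head.
model = Wav2Vec2ForSequenceClassification.from_pretrained(
    "facebook/wav2vec2-large-xlsr-53",
    num_labels=len(LABELS),
)
# Freeze the CNN feature encoder so fine-tuning updates only the
# Transformer layers and the classification head.
model.freeze_feature_encoder()
model.eval()

def classify(waveform, sampling_rate=16_000):
    """Classify one raw-audio utterance (1-D float array) into an emotion label."""
    inputs = extractor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        logits = model(**inputs).logits
    return LABELS[int(logits.argmax(dim=-1))]
```

In a full fine-tuning run, the same model would be trained on labeled Indonesian utterances (e.g., with the Trainer API) before inference; the monolingual fine-tuning reported above is what yielded the best accuracy and F1-score.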

Authors:
Kuncahyo Setyo Nugroho, Syahroni Wahyu Iriananda, Ismail Akbar, Mahmud Isnan, Bens Pardamean

2025 International Conference on Information Technology Research and Innovation (ICITRI)
