Improving Indonesian Emotion Detection using OpenAI o4-mini Text Normalization

As an important psychological response, emotion influences human decision-making, cognitive processes, and communication patterns. The basic human emotions are joy, love, anger, sadness, and fear. Many previous studies have classified these emotions from Twitter texts. However, informal and slang words significantly degrade classification performance. Despite attempts to address the problem, none of the existing approaches can adapt to new informal or slang words that do not appear during classifier training or fine-tuning. This study proposes using the OpenAI o4-mini model, specifically o4-mini-high, to normalize the text, with the hypothesis that the classifier achieves better performance through enriched context and semantics. IndoBERT and IndoBERTweet models fine-tuned on the o4-normalized Indonesian public Twitter dataset show significant improvement (9.48% F1-score increase for IndoBERT; 6.94% F1-score increase for IndoBERTweet), while the logistic regression model shows only a minor improvement (1.47% F1-score increase). The minor improvement is attributed to the sparse TF-IDF representations, which prevent the logistic regression model from leveraging the enriched semantics. Further embedding analysis using principal component analysis (PCA) reveals different semantic groupings for IndoBERT and IndoBERTweet, owing to their different training domains. Nevertheless, both IndoBERT and IndoBERTweet distinguish basic human emotions in Twitter text markedly better after OpenAI o4-mini text normalization.
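The abstract attributes the small logistic regression gain to sparse TF-IDF representations. One way to picture that limitation is the minimal sketch below, which is not the authors' pipeline: it builds TF-IDF vectors by hand and compares an informal sentence with a normalized counterpart. The helper names (`tf_idf_vectors`, `cosine`, `pair_similarity`) and the example sentences are hypothetical illustrations, not taken from the paper or its dataset.

```python
import math
from collections import Counter


def tf_idf_vectors(docs):
    """Return one sparse {token: weight} dict per document, using smoothed IDF."""
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: in how many documents each token appears.
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append({
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1.0)
            for tok, count in counts.items()
        })
    return vectors


def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(tok, 0.0) for tok, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def pair_similarity(text_a, text_b):
    """TF-IDF cosine similarity of two texts, fitted on just that pair."""
    vec_a, vec_b = tf_idf_vectors([text_a, text_b])
    return cosine(vec_a, vec_b)


# Hypothetical sentence pairs (assumed for illustration only).
# Informal form vs. a normalized form that shares one token ("suka").
sim_partial = pair_similarity("gw gk suka bgt", "saya tidak suka sekali")
# Informal form vs. a normalized form that shares no token at all.
sim_disjoint = pair_similarity("gw gk demen", "saya tidak suka")

print(f"one shared token: {sim_partial:.3f}")
print(f"no shared tokens: {sim_disjoint:.3f}")
```

In this sketch every surface token is its own dimension, so two sentences with the same meaning but no shared tokens are orthogonal. Under that assumption, normalization can merge spelling variants into one feature, but a linear model on such features still sees only token identity rather than the contextual meaning that a fine-tuned transformer can exploit.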
Authors:
Matthew Martianus Henry, Dave Christian Thio, Kuncahyo Setyo Nugroho, Bens Pardamean
2025 10th International Conference on Computer Science and Computational Intelligence (ICCSCI)