LLM2Vec Sentence Embeddings Analysis in Sentiment Classification

Sentiment analysis is a crucial method in business intelligence for extracting insights, and it typically begins with sentiment classification. One of the latest frameworks for generating sentence embeddings for sentiment classification is LLM2Vec, which allows Transformer decoder-based models to generate sentence embeddings for text representation. The framework is claimed to be language-agnostic; this study tests that claim by applying it to sentiment analysis of Tokopedia tweets. The base decoder models used in the LLM2Vec framework were Llama 3 8B, Llama 2 7B, Sheared Llama 1.3B, and Mistral 7B. Two BERT-based models, the Indonesian SBERT model and IndoBERT trained with the SimCSE approach, were employed for comparison. The generated embeddings were classified using logistic regression, SVM, and an MLP classifier. Classifiers using embeddings generated by LLM2Vec with Llama 3 8B and Mistral 7B achieve performance on par with classifiers that use IndoBERT SimCSE embeddings, while classifiers using embeddings generated by LLM2Vec with Llama 2 7B and Sheared Llama 1.3B achieve much lower performance. Classifiers with Indonesian SBERT embeddings achieve the highest F1 score. Despite the slightly lower performance, this study demonstrates the language-agnostic capability of LLM2Vec in colloquial Bahasa Indonesia sentiment analysis, especially with Llama 3 8B and Mistral 7B, since none of the base decoders was trained on a Bahasa Indonesia corpus.
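
The classification stage described above (fixed sentence embeddings fed to logistic regression, SVM, and an MLP, then compared by F1 score) can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the authors' code: the embeddings are synthetic stand-ins for LLM2Vec or SBERT vectors, and the three-class label set, the 80/20 split, the hyperparameters, and macro averaging of F1 are all assumptions, since the abstract does not specify them. The helper name `evaluate_classifiers` is hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.svm import SVC


def evaluate_classifiers(embeddings, labels, seed=0):
    """Fit three classifiers on precomputed embeddings and return macro F1 per model."""
    # Assumed 80/20 stratified split; the abstract does not state the protocol.
    X_train, X_test, y_train, y_test = train_test_split(
        embeddings, labels, test_size=0.2, random_state=seed, stratify=labels
    )
    # Hyperparameters here are illustrative defaults, not the paper's settings.
    classifiers = {
        "logistic_regression": LogisticRegression(max_iter=1000),
        "svm": SVC(),
        "mlp": MLPClassifier(hidden_layer_sizes=(64,), max_iter=500, random_state=seed),
    }
    scores = {}
    for name, clf in classifiers.items():
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        # Macro averaging is an assumption; the abstract only says "F1 score".
        scores[name] = f1_score(y_test, predictions, average="macro")
    return scores


# Synthetic stand-in for sentence embeddings: three noisy clusters in 32 dimensions.
# Real embeddings would come from an encoder such as LLM2Vec or SBERT.
rng = np.random.default_rng(0)
labels = rng.integers(0, 3, size=300)
centers = rng.normal(size=(3, 32)) * 3
embeddings = centers[labels] + rng.normal(size=(300, 32))

scores = evaluate_classifiers(embeddings, labels)
for name, score in scores.items():
    print(f"{name}: macro F1 = {score:.3f}")
```

Because the embeddings are computed once and held fixed, swapping one encoder for another only changes the `embeddings` array, which is what makes a side-by-side comparison of embedding models with the same classifiers straightforward.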

Authors:
Matthew Martianus Henry, Nur Adhianti Heryanto, Mahmud Isnan, Dian Kurnianingrum, Chyntia Ika Ratnapuri, Bens Pardamean

2024 IEEE Conference on Data and Software Engineering
