Sentence Similarity Task Performance with Data Augmentation and Pre-trained Transformers

In Natural Language Processing (NLP), especially for data retrieval and text mining, sentence similarity is a crucial task. This paper examines how integrating text augmentation with pre-trained Transformers can improve sentence-similarity performance. Text augmentation generates additional training examples by applying transformations such as random swap, synonym replacement, random deletion, and random insertion to the original sentences. By expanding the dataset in this way, we aim to capture a wider variety of sentence variations. We used 1,800 pairs of human-written English sentences drawn from two test datasets: the Stanford Natural Language Inference (SNLI) corpus (version 1.0) and the Sentences Involving Compositional Knowledge (SICK) dataset. The Bidirectional Encoder Representations from Transformers (BERT) model, which is highly effective in NLP, employs a self-attention mechanism to capture contextual information; accordingly, this paper uses a pre-trained Transformer model to encode each input sentence into a high-dimensional representation. As a first step, we extend the original sentence pairs by applying easy data augmentation operations to part of the text. We then encode the transformed sentences with Sentence-BERT (SBERT) and compute similarity over the resulting embeddings. We found that random swap achieved the highest F1-score with the all-mpnet-base-v2 model, whereas for the experimental group using the all-roberta-large-v1 model, random insertion achieved the highest F1-score.
Authors:
Andrea Stevens Karnyoto, Fitya Syarifa Mozar, Mahmud Isnan, Gregorius Natanael Elwirehardja, Bens Pardamean
2025 International Conference on Cybernetics and Intelligent Systems (ICORIS)
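The following is a minimal Python sketch of the kind of pipeline the abstract describes: one EDA-style transformation (random swap) applied to a sentence, followed by SBERT embedding and cosine-similarity scoring with one of the checkpoints named above. It assumes the sentence-transformers library; the random_swap helper and the example sentences are illustrative assumptions, not code from the paper.

import random
from sentence_transformers import SentenceTransformer, util

def random_swap(sentence: str, n_swaps: int = 1) -> str:
    # Illustrative EDA operation: swap two randomly chosen word positions, n_swaps times.
    words = sentence.split()
    for _ in range(n_swaps):
        if len(words) < 2:
            break
        i, j = random.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return " ".join(words)

# One of the pre-trained SBERT checkpoints mentioned in the abstract.
model = SentenceTransformer("all-mpnet-base-v2")

original = "A man is playing a guitar on stage."   # hypothetical example sentence
augmented = random_swap(original, n_swaps=1)

# Encode both sentences into dense embeddings and score them with cosine similarity,
# the standard way SBERT is used for sentence-similarity tasks.
embeddings = model.encode([original, augmented], convert_to_tensor=True)
score = util.cos_sim(embeddings[0], embeddings[1]).item()
print(f"cosine similarity: {score:.4f}")

In practice, each augmented pair would be scored this way and the resulting similarities thresholded to compute metrics such as the F1-score reported in the abstract; the threshold and evaluation details are not specified here and would follow the paper's experimental setup.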