Twitter Dataset for Hate Speech and Cyberbullying Detection in Indonesian Language

During the 2019 election period in Indonesia, many hate speech and cyberbullying cases have occurred in social media platforms including Twitter. The government tries to filter every negative content to be spread out during this period. However, to detect hate speech is not an easy task. This paper presents the process of developing a dataset that can be used to build a hate speech detection model. More than 1 million tweets have been successfully collected from using Twitter API. The basic preprocessing and preliminary study using machine learning was implemented. Latent Dirichlet Allocation (LDA) algorithm was used to extract the topic for each tweet to see whether these topics can be associated with debate themes. Pretrained sentiment analysis was also applied to the dataset to generate a polarity score for each tweet. From 83,752 tweets included in the analysis step, the number of positive and negative tweets are almost the same.

Conference: International Conference on Information Management and Technology 2019, Bali, Indonesia

Trisna Febriana, Arif Budiarto

Read Full Paper