Unsupervised News Topic Modelling with Doc2Vec and Spherical Clustering

In the digital and Internet era, companies are racing to profile their target users based on their online activities. One reliable source is the news articles that users read, which can represent their interests. However, extracting latent information from news articles is not an easy task for a human. In this paper, we introduced a practical model to automatically extract latent information from news articles with predetermined topics. Our proposed model used unsupervised learning, thus alleviating the need for humans to label news items manually. Doc2vec was used to generate a vector representation of each article. Afterward, a spherical clustering algorithm was applied to group the articles based on their similarity. A supervised Long Short-Term Memory (LSTM) model was built as a baseline against which to compare the clustering performance. The best 1, best 3, and best 5 scores were used to evaluate our model. The results showed that our model could not outperform the LSTM model on the best 1 score. However, the best 5 score indicated that our model was sufficiently robust to cluster the articles based on topic similarity. Additionally, the proposed unsupervised model was implemented on both an on-premise server and a cloud server. Surprisingly, our proposed method ran faster on the cloud server despite its fewer CPU cores.
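The clustering step summarized above can be illustrated with a minimal sketch. It assumes that the article vectors have already been produced (for example by Doc2vec) and uses a plain spherical k-means, which groups unit-length vectors by cosine similarity. The function names `spherical_kmeans` and `best_n_clusters`, the toy vectors, and the reading of a "best N" score as "the correct topic is among the N most similar clusters" are illustrative assumptions, not details taken from the paper.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit vectors."""
    return sum(x * y for x, y in zip(a, b))


def spherical_kmeans(vectors, k, iterations=20, seed=0):
    """Cluster vectors by cosine similarity; returns (centroids, labels)."""
    rng = random.Random(seed)
    data = [normalize(v) for v in vectors]
    # Initialise centroids from k randomly chosen (already normalised) points.
    centroids = [list(v) for v in rng.sample(data, k)]
    labels = [0] * len(data)
    for _ in range(iterations):
        # Assignment step: pick the centroid with the highest cosine similarity.
        labels = [max(range(k), key=lambda c: dot(v, centroids[c])) for v in data]
        # Update step: mean of the members, projected back onto the unit sphere.
        for c in range(k):
            members = [v for v, lab in zip(data, labels) if lab == c]
            if members:
                mean = [sum(col) / len(members) for col in zip(*members)]
                centroids[c] = normalize(mean)
    return centroids, labels


def best_n_clusters(vector, centroids, n):
    """Indices of the n centroids most similar to the vector, best first."""
    unit = normalize(vector)
    ranked = sorted(
        range(len(centroids)),
        key=lambda c: dot(unit, centroids[c]),
        reverse=True,
    )
    return ranked[:n]


# Toy stand-ins for article vectors: two loosely separated directions.
toy_vectors = [[1.0, 0.1], [0.9, 0.2], [1.0, 0.0],
               [0.1, 1.0], [0.0, 0.8], [0.2, 1.0]]
centroids, labels = spherical_kmeans(toy_vectors, k=2)
```

Under this reading, a best 5 score would count an article as correctly handled when its true topic corresponds to any of the five clusters returned by `best_n_clusters(vector, centroids, 5)`; the paper's exact scoring procedure may differ.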
International Conference on Computer Science and Computational Intelligence 2020
Arif Budiarto, Reza Rahutomo, Hendra Novyantara Putra, Tjeng Wawan Cenggoro, Muhamad Fitra Kacamarga, and Bens Pardamean