Mar 21, 2023
BLEND360 — Aishwarya Bhangale, Daphney Valiatingara, Meet Paradia, Kristin (Jiating) Chen, Brett Li, Jesse fa*gan
In the previous post, we introduced the theoretical grounding of the four most widely used algorithms in topic modeling. This article compares those models, drawing on published research findings.
Related Research
In the world of text analysis, researchers are continually seeking new ideas and techniques for discovering topics. In a recent study by Egger & Yu (2022), four popular topic modeling algorithms were compared: LDA, NMF, Top2Vec, and BERTopic. The authors evaluated how these algorithms performed on short text data from social media and identified their respective strengths and weaknesses.
The study used a Twitter dataset of approximately 31,800 unique tweets containing hashtags such as #covidtravel, #covid, and #travel. The transformer-based models, BERTopic and Top2Vec, worked on the tweets in their original form, while LDA and NMF used preprocessed data: stop-word removal, tokenization, stemming, and lemmatization.
After constructing the models, the researchers compared them in pairs grouped by algorithmic similarity: LDA vs. NMF and Top2Vec vs. BERTopic.
The authors argue that evaluating topic modeling techniques on a single metric, such as coherence, is insufficient. The study therefore also relied on human judgment and domain expertise to provide a more comprehensive evaluation of each algorithm's performance on the given dataset.
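To make the preprocessing steps concrete, here is a minimal sketch using only the Python standard library. It is not the study's pipeline: the stop-word list is a tiny illustrative one, and `crude_stem` is a hypothetical stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption); real pipelines use a fuller one.
STOP_WORDS = {"a", "an", "and", "are", "for", "in", "is", "of", "the", "to", "we"}

# Naive suffix stripping, standing in for a real stemmer or lemmatizer.
SUFFIXES = ("ing", "ed", "s")


def tokenize(text):
    """Lowercase the text and keep alphabetic tokens; '#' is dropped from hashtags."""
    return re.findall(r"[a-z]+", text.lower())


def crude_stem(token):
    """Strip one common suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Tokenize, remove stop words, then stem what is left."""
    return [crude_stem(t) for t in tokenize(text) if t not in STOP_WORDS]


tweet = "Flights cancelled again, we are waiting for the travel bubble #covidtravel"
tokens = preprocess(tweet)
print(tokens)
```

The output of a step like this is what bag-of-words models such as LDA and NMF would consume, whereas the embedding models in the study were given the raw tweet text.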
Comparing LDA and NMF
The researchers used coherence scores to determine the optimal number of topics for each model: 14 for LDA and 10 for NMF. Topic names were assigned based on each topic's highest-weighted words, ranked by TF-IDF weight. The study concluded that NMF's results aligned more closely with human judgment, so NMF outperformed LDA. Because topic extraction with LDA and NMF depends heavily on hyperparameter choices, this outcome is not entirely surprising.
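As an illustration of how a coherence score can rank candidate models, the sketch below implements the UMass variant in plain Python. This is an assumption made for the example: the study may have used a different coherence measure. The toy corpus and the two candidate models are made up. In practice the candidates would be models fitted with different numbers of topics.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic: sum of log((D(wi, wj) + 1) / D(wi)) over word pairs.

    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        # Number of documents that contain all of the given words.
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() preserves ranking order, so `earlier` is ranked above `later`.
    for earlier, later in combinations(top_words, 2):
        score += math.log((doc_freq(earlier, later) + 1) / doc_freq(earlier))
    return score


def mean_coherence(topics, documents):
    """Average coherence over all topics of one model."""
    return sum(umass_coherence(t, documents) for t in topics) / len(topics)


# Toy tokenized corpus (hypothetical).
documents = [
    ["flight", "cancel", "refund"],
    ["flight", "cancel", "airline"],
    ["flight", "refund", "airline"],
    ["vaccine", "quarantine", "test"],
    ["vaccine", "test", "travel"],
    ["quarantine", "test", "travel"],
]

# Top words per topic for two hypothetical candidate models.
candidate_models = {
    "model_a": [["flight", "cancel", "refund"], ["vaccine", "test", "quarantine"]],
    "model_b": [["flight", "vaccine", "test"], ["cancel", "travel", "refund"]],
}

scores = {name: mean_coherence(topics, documents) for name, topics in candidate_models.items()}
best_model = max(scores, key=scores.get)
print(best_model, scores)
```

Topics whose top words tend to appear in the same documents score higher, which is why `model_a` wins here. As the authors point out, a score like this is a starting point and not a replacement for human judgment.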
Comparing BERTopic and Top2Vec
Both transformer algorithms used in the study allow researchers to search the results for specific terms or keywords and retrieve the most relevant topics. In this study, the researchers searched for terms such as “cancel,” “flight,” and “travel bubble.”
For instance, when searching for the keyword “flight,” the study found that the topics generated by the Top2Vec model focused more on policy and regulation, while the topics generated by the BERTopic model were more concerned with the nature of air transport. This kind of keyword search lets researchers narrow their focus to the topics most relevant to their research interests within the larger output generated by the algorithm, and then explore those topics further.
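Conceptually, this kind of search ranks topics by how close they are to the query term in embedding space. The sketch below shows the idea with cosine similarity over hand-made vectors. It is a toy illustration, not either library's implementation: `find_related_topics`, the topic labels, and all vector values are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def find_related_topics(query_vector, topic_vectors, top_n=2):
    """Return the top_n (topic, similarity) pairs, most similar first."""
    scored = [(name, cosine(query_vector, vec)) for name, vec in topic_vectors.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hand-made 3-dimensional vectors for illustration only (hypothetical values).
topic_vectors = {
    "airline refunds": [0.9, 0.1, 0.0],
    "quarantine rules": [0.1, 0.9, 0.2],
    "vaccine rollout": [0.0, 0.3, 0.9],
}
flight_vector = [0.8, 0.2, 0.1]  # stands in for the embedding of "flight"

results = find_related_topics(flight_vector, topic_vectors)
for name, similarity in results:
    print(f"{name}: {similarity:.2f}")
```

In a real model the query and topic vectors come from the same learned embedding space, so two models trained differently can return different topics for the same keyword, as the study observed for “flight.”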
Summary of Model Comparisons
This research has yielded some valuable insights into the performance of various topic modeling algorithms. Relying on human judgment and domain expertise to evaluate the models, the study concluded that BERTopic and NMF were the best performers for this dataset, followed by Top2Vec and LDA. While both BERTopic and NMF were able to identify distinct topics, BERTopic had the added advantage of discovering related topics around a specific term, providing even deeper insights into the data. Overall, the study found that BERTopic performed well across the aspects it evaluated and also allowed further reduction of the number of topics.
On the other hand, Top2Vec's resulting topics often overlapped and contained multiple concepts, and the model is less suited to datasets with fewer than 1,000 documents. Despite this, Top2Vec was still able to uncover unique findings that the other topic modeling approaches missed. For instance, it identified the topic of ‘politicians’ in the hierarchical reduction of topics.
More traditional algorithms, such as LDA and NMF, did not yield easily separable or interpretable topics. LDA required detailed assumptions about hyperparameters to achieve optimal results, while NMF showed better results when using Term Frequency-Inverse Document Frequency (TF-IDF) weights instead of raw word frequencies.
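To see why TF-IDF can help compared with raw word counts, consider the textbook weighting tf × log(N / df), sketched below on a made-up corpus. Libraries often use smoothed variants of this formula, so treat the exact numbers as illustrative only.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * log(N / df)."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term appears.
    doc_freq = Counter(term for doc in documents for term in set(doc))
    weighted = []
    for doc in documents:
        counts = Counter(doc)
        weighted.append(
            {term: count * math.log(n_docs / doc_freq[term]) for term, count in counts.items()}
        )
    return weighted


# Toy tokenized tweets (hypothetical); "covid" appears in every one.
documents = [
    ["covid", "flight", "cancel", "covid"],
    ["covid", "quarantine", "hotel"],
    ["covid", "vaccine", "travel", "covid"],
]

raw_counts = Counter(documents[0])
weights = tf_idf(documents)

print("raw counts:", raw_counts.most_common())
print("tf-idf:", weights[0])
```

By raw count, "covid" dominates the first tweet, but because it appears in every document its TF-IDF weight drops to zero, leaving the more distinctive terms "flight" and "cancel" to shape the topic.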
It's worth noting that the nature of a dataset affects the choice of topic modeling technique. BERTopic and Top2Vec offer many advantages: no preprocessing, support for hierarchical topic reduction, scalability, and advanced search and visualization capabilities. Even so, they may not be the best choice for longer texts. A separate comparison by Cosimo Albanese (2022) suggests that BERTopic and Top2Vec perform better on shorter texts, such as social media posts or news headlines, because embedding models can only take a limited number of tokens into account when building semantic representations. Conversely, for smaller datasets or longer texts, LDA or NMF may produce better results: they are faster, use fewer resources, and take the number of topics as an input specified beforehand.
These research findings help data scientists narrow down algorithm selection based on criteria such as the nature of the data and each model's assumptions. We were curious to see how these algorithms compare on another dataset, so we ran a hands-on comparison and will share the findings in an upcoming post. Stay tuned! :D
If you want to know more, please let us know, comment below and/or send us a message at [email protected].
Blend360 co-creates data science solutions with clients to achieve their business goals. Clients are the hero in our story, and we provide the tools and expertise needed to succeed. We use advanced analytics, machine learning, and AI to enable data-driven decisions that drive growth and innovation. As a leader in the data science industry, we collaborate with businesses of all sizes to create innovative solutions that make a difference.
References
- Cosimo Albanese, N. (2022, September 19). Topic Modeling with LSA, pLSA, LDA, NMF, BERTopic, Top2Vec: a Comparison. Towards Data Science. https://towardsdatascience.com/topic-modeling-with-lsa-plsa-lda-nmf-bertopic-top2vec-a-comparison-5e6ce4b1e4a5
- Deerwester, S., Dumais, S. T., & Furnas, G. W. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41(6), 391–407. https://asistdl.onlinelibrary.wiley.com/doi/10.1002/%28SICI%291097-4571%28199009%2941%3A6%3C391%3A%3AAID-ASI1%3E3.0.CO%3B2-9
- Egger, R., & Yu, J. (2022). A Topic Modeling Comparison Between LDA, NMF, Top2Vec, and BERTopic to Demystify Twitter Posts. Frontiers in Sociology, 7, 886498. https://www.frontiersin.org/articles/10.3389/fsoc.2022.886498/full