Topic Modelling: A Comparison Between LDA, NMF, BERTopic and Top2Vec — Part I (2024)

BLEND360: Aishwarya Bhangale, Daphney Valiatingara, Meet Paradia, Kristin (Jiating) Chen, Brett Li, Jesse Fagan

In the previous post, we introduced the theoretical grounding of the four most widely used algorithms in topic modeling. This article focuses on how those models compare, based on published research findings.

Related Research

In the world of text analysis, researchers are continually seeking new ideas and techniques for discovering topics. In a recent study by Egger & Yu (2022), four popular topic modeling algorithms were compared: LDA, NMF, Top2Vec, and BERTopic. The authors evaluated how these algorithms performed on short text data from social media and identified their respective strengths and weaknesses.

The study used a Twitter dataset of approximately 31,800 unique tweets containing hashtags such as #covidtravel, #covid, and #travel. While embedding-based models like BERTopic and Top2Vec worked on the tweets’ original structure, LDA and NMF used preprocessed data, with stop-word removal, tokenization, stemming, and lemmatization applied.
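To make the preprocessing steps concrete, here is a minimal sketch using only the Python standard library. The stop-word list and the `crude_stem` helper are our own simplified stand-ins (assumptions, not the study's pipeline); a real project would more likely use a full stop-word list and a proper stemmer or lemmatizer from a library such as NLTK or spaCy.

```python
import re

# Tiny illustrative stop-word list (hypothetical); real pipelines use a much fuller one.
STOP_WORDS = {"a", "an", "and", "are", "due", "for", "in", "is", "of", "on", "the", "to"}


def crude_stem(token: str) -> str:
    """Strip a few common English suffixes; a crude stand-in for real stemming/lemmatization."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(tweet: str) -> list[str]:
    """Lowercase, drop URLs, tokenize on letters, remove stop words, then stem."""
    text = tweet.lower()
    text = re.sub(r"https?://\S+", " ", text)  # drop URLs
    tokens = re.findall(r"[a-z]+", text)       # tokenize; '#' and digits are discarded
    return [crude_stem(t) for t in tokens if t not in STOP_WORDS]


print(preprocess("Flights are cancelled due to #covid restrictions"))
# prints ['flight', 'cancell', 'covid', 'restriction']
```

The output shows why stemming can hurt readability ("cancell"), which is one reason models that skip preprocessing are attractive for short social media text.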

After constructing the models, the researchers compared their performance based on algorithmic similarity: LDA vs. NMF and Top2Vec vs. BERTopic.

The authors argue that evaluating topic modeling techniques on a single metric, such as coherence, is insufficient. The study therefore relied on human judgment and domain expertise to provide a more comprehensive evaluation of each algorithm’s performance on the given dataset.

Comparing LDA and NMF

The researchers used coherence scores to determine the optimal number of topics and to identify the best topics. They found 14 to be the optimal number of topics for LDA and 10 for NMF. Topic names were assigned based on each topic’s highest-weighted words, ranked by TF-IDF weight. Interestingly, the study concluded that NMF’s results were more aligned with human judgment, ultimately outperforming LDA. However, topic extraction with LDA and NMF depends heavily on hyperparameter choices, so it is not entirely surprising that most of the results matched expectations.
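This post does not say which coherence measure the study used, so the sketch below uses UMass coherence, one common variant, purely as an illustration. The toy documents and word lists are invented. The idea is that a topic scores higher when its top words tend to appear in the same documents.

```python
import math
from itertools import combinations


def umass_coherence(top_words, documents):
    """UMass coherence of one topic.

    Sums log((D(w_i, w_j) + 1) / D(w_i)) over word pairs, where w_i is ranked
    above w_j in top_words and D counts the documents containing the given words.
    Assumes every top word occurs in at least one document.
    """
    doc_sets = [set(doc) for doc in documents]

    def doc_freq(*words):
        return sum(all(w in doc for w in words) for doc in doc_sets)

    score = 0.0
    # combinations() yields (higher-ranked, lower-ranked) pairs in list order
    for higher, lower in combinations(top_words, 2):
        score += math.log((doc_freq(higher, lower) + 1) / doc_freq(higher))
    return score


# Invented, already-tokenized toy corpus
docs = [
    ["flight", "cancel", "refund"],
    ["flight", "cancel", "airline"],
    ["travel", "bubble", "quarantine"],
    ["travel", "quarantine", "hotel"],
]

coherent_topic = umass_coherence(["flight", "cancel"], docs)
mixed_topic = umass_coherence(["flight", "quarantine"], docs)
print(coherent_topic > mixed_topic)  # prints True
```

To pick a topic count this way, one would fit a model for each candidate count, average the coherence over its topics, and keep the count with the highest average.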

Comparing BERTopic and Top2Vec

Both embedding-based algorithms used in the study let researchers search the results for specific terms or keywords to identify related topics. In this study, the researchers searched for terms such as “cancel,” “flight,” and “travel bubble.”

For instance, when searching for the keyword “flight,” the study found that the topics generated by the Top2Vec model were focused more on policy and regulation, while the topics generated by the BERTopic model were more concerned with the nature of air transport. By searching for specific keywords, researchers can identify topics that are most relevant to their research interests and explore them further. Essentially, this feature allows researchers to narrow down their focus and target specific topics within the larger output generated by the algorithm.
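Conceptually, this kind of keyword search ranks topics by how close their vectors are to the keyword's vector in the shared embedding space. The sketch below illustrates that idea with cosine similarity; the three-dimensional vectors, topic labels, and the `rank_topics` helper are hypothetical and invented for illustration. In the libraries themselves, this lookup is exposed as `find_topics` (BERTopic) and `search_topics` (Top2Vec), which the sketch does not call.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def rank_topics(keyword_vec, topic_vecs, top_n=2):
    """Return the top_n (topic_label, similarity) pairs closest to the keyword."""
    scored = [(label, cosine(keyword_vec, vec)) for label, vec in topic_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_n]


# Hypothetical embeddings; real ones have hundreds of dimensions.
topic_vecs = {
    "airline refunds": [0.9, 0.1, 0.0],
    "quarantine rules": [0.1, 0.9, 0.1],
    "vaccine news": [0.0, 0.2, 0.9],
}
flight_vec = [0.8, 0.3, 0.0]  # stand-in for the embedding of "flight"

results = rank_topics(flight_vec, topic_vecs)
for label, score in results:
    print(f"{label}: {score:.2f}")
```

Because the ranking depends entirely on the embedding space, two models can return quite different topics for the same keyword, which is consistent with the "flight" example above.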

Summary of Model Comparisons

This research has yielded valuable insights into the performance of various topic modeling algorithms. Relying on human judgment and domain expertise to evaluate the models, the study concluded that BERTopic and NMF were the best performers for this dataset, followed by Top2Vec and LDA. While both BERTopic and NMF identified distinct topics, BERTopic had the added advantage of discovering related topics around a specific term, providing deeper insight into the data. Overall, BERTopic was found to perform well across all aspects of the comparison, and it also allowed for further reduction of topics.

On the other hand, Top2Vec’s resulting topics often overlapped and contained multiple concepts, making it less ideal for datasets with fewer than 1,000 documents. Despite this, Top2Vec still uncovered findings that other topic modeling approaches missed. For instance, its hierarchical topic reduction surfaced a ‘politicians’ topic that the other algorithms did not identify.
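For readers unfamiliar with hierarchical topic reduction, the sketch below shows one way it can work: repeatedly merge the smallest topic into its most similar neighbor until the desired number of topics remains. To our understanding, this is roughly the approach Top2Vec documents, but the code is an illustrative sketch and not either library's implementation. The `merge_smallest_topics` helper, topic labels, vectors, and sizes are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def merge_smallest_topics(topics, num_topics):
    """Reduce topics, given as {label: (vector, size)}, to num_topics entries.

    Each step removes the smallest topic and folds it into the most similar
    remaining topic, whose vector becomes the size-weighted average of the two.
    """
    topics = dict(topics)  # work on a copy
    while len(topics) > num_topics:
        smallest = min(topics, key=lambda k: topics[k][1])
        vec_s, size_s = topics.pop(smallest)
        nearest = max(topics, key=lambda k: cosine(vec_s, topics[k][0]))
        vec_n, size_n = topics[nearest]
        total = size_s + size_n
        merged = [(a * size_s + b * size_n) / total for a, b in zip(vec_s, vec_n)]
        topics[nearest] = (merged, total)
    return topics


# Invented topics: label -> (2-dimensional vector, number of documents)
topics = {
    "flights": ([1.0, 0.0], 50),
    "airlines": ([0.9, 0.1], 10),
    "quarantine": ([0.0, 1.0], 40),
}

reduced = merge_smallest_topics(topics, num_topics=2)
print(sorted(reduced))  # prints ['flights', 'quarantine']
```

Here the small "airlines" topic is absorbed by "flights", which is the kind of consolidation that helps when the initial topics overlap.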

More traditional algorithms, such as LDA and NMF, did not yield easily separable or interpretable topics. LDA required detailed assumptions about hyperparameters to achieve optimal results, while NMF showed better results when using Term Frequency-Inverse Document Frequency (TF-IDF) weights rather than raw word frequencies.
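A small worked example shows why TF-IDF can help. The sketch below uses the textbook, unsmoothed formula tf * log(N / df) on an invented toy corpus; libraries such as scikit-learn apply smoothed variants, so their exact numbers differ. A word that appears in every document, like "covid" here, dominates the raw counts but gets a TF-IDF weight of zero.

```python
import math
from collections import Counter

# Invented, already-tokenized toy corpus
docs = [
    ["covid", "flight", "cancel"],
    ["covid", "travel", "bubble"],
    ["covid", "covid", "quarantine"],
]


def tfidf(doc, corpus):
    """Weight each term in doc by tf * log(N / df), the unsmoothed textbook TF-IDF."""
    n_docs = len(corpus)
    weights = {}
    for term, tf in Counter(doc).items():
        df = sum(term in d for d in corpus)  # number of documents containing the term
        weights[term] = tf * math.log(n_docs / df)
    return weights


raw = Counter(docs[2])
weighted = tfidf(docs[2], docs)

print(raw["covid"], raw["quarantine"])  # prints 2 1
print(weighted["covid"] < weighted["quarantine"])  # prints True
```

Down-weighting ubiquitous terms lets the more distinctive word, "quarantine", drive the topic instead.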

It’s worth noting that the nature of a dataset can affect the choice of topic modeling technique. While BERTopic and Top2Vec offer many advantages, such as requiring no preprocessing, supporting hierarchical topic reduction, scaling well, and providing advanced search and visualization capabilities, they may not be the best choice for longer text data. Another study, by Cosimo Albanese (2022), suggests that BERTopic and Top2Vec perform better on shorter text, such as social media posts or news headlines, because embedding models can only consider a limited number of tokens when building semantic representations. Conversely, for smaller datasets or longer texts, faster and less resource-intensive models for which the number of topics is specified beforehand, such as LDA or NMF, may produce better results.

These research findings help data scientists narrow down algorithm selection based on criteria such as the nature of the data and the modeling assumptions. We were curious to apply these algorithms to another dataset for a hands-on comparison, so we did, and we will share the findings in an upcoming post. Stay tuned! :D

If you want to know more, please let us know: comment below and/or send us a message at [email protected].

Blend360 co-creates data science solutions with clients to achieve their business goals. Clients are the hero in our story, and we provide the tools and expertise needed to succeed. We use advanced analytics, machine learning, and AI to enable data-driven decisions that drive growth and innovation. As a leader in the data science industry, we collaborate with businesses of all sizes to create innovative solutions that make a difference.
