Topic models :: Tutorials for quanteda (2024)

Table of Contents
LDA Seeded LDA

Topics models are unsupervised document classification techniques. By modeling distributions of topics over words and words over documents, topic models identify the most discriminatory groups of documents automatically.

require(quanteda)require(quanteda.corpora)require(seededlda)require(lubridate)
corp_news <- download("data_corpus_guardian")

We will select only news articles published in 2016 using corpus_subset() and the year function from the lubridate package.

corp_news_2016 <- corpus_subset(corp_news, year(date) == 2016)ndoc(corp_news_2016)
## [1] 1959

Further, after removal of function words and punctuation in dfm(), we will only keep the top 20% of the most frequent features (min_termfreq = 0.8) that appear in less than 10% of all documents (max_docfreq = 0.1) using dfm_trim() to focus on common but distinguishing features.

toks_news <- tokens(corp_news_2016, remove_punct = TRUE, remove_numbers = TRUE, remove_symbol = TRUE)toks_news <- tokens_remove(toks_news, pattern = c(stopwords("en"), "*-time", "updated-*", "gmt", "bst"))dfmat_news <- dfm(toks_news) %>% dfm_trim(min_termfreq = 0.8, termfreq_type = "quantile", max_docfreq = 0.1, docfreq_type = "prop")

quanteda does not implement topic models, but you can fit LDA and seeded-LDA with the seededlda package.

LDA

k = 10 specifies the number of topics to be discovered. This is an important parameter and you should try a variety of values and validate the outputs of your topic models thoroughly.

tmod_lda <- textmodel_lda(dfmat_news, k = 10)

You can extract the most important terms for each topic from the model using terms().

terms(tmod_lda, 10)
## topic1 topic2 topic3 topic4 topic5 topic6 ## [1,] "oil" "corbyn" "son" "officers" "clinton" "refugees"## [2,] "markets" "johnson" "parents" "violence" "sanders" "brussels"## [3,] "prices" "leadership" "felt" "prison" "cruz" "talks" ## [4,] "investors" "boris" "park" "victims" "hillary" "french" ## [5,] "shares" "shadow" "room" "sexual" "obama" "summit" ## [6,] "rates" "jeremy" "love" "abuse" "trump's" "refugee" ## [7,] "banks" "tory" "mother" "criminal" "bernie" "migrants"## [8,] "trading" "doctors" "story" "officer" "ted" "turkey" ## [9,] "quarter" "junior" "father" "crime" "rubio" "benefits"## [10,] "sector" "khan" "knew" "incident" "senator" "asylum" ## topic7 topic8 topic9 topic10 ## [1,] "sales" "syria" "australia" "climate" ## [2,] "apple" "isis" "australian" "water" ## [3,] "customers" "military" "labor" "energy" ## [4,] "google" "islamic" "turnbull" "food" ## [5,] "users" "un" "budget" "development"## [6,] "technology" "syrian" "funding" "gas" ## [7,] "games" "forces" "housing" "drug" ## [8,] "game" "muslim" "senate" "medical" ## [9,] "iphone" "peace" "education" "hospital" ## [10,] "app" "china" "coalition" "patients"

You can then obtain the most likely topics using topics() and save them as a document-level variable.

head(topics(tmod_lda), 20)
## text136751 text136585 text139163 text169133 text153451 text163885 text157885 ## topic10 topic2 topic7 topic3 topic4 topic10 topic3 ## text173244 text137394 text169408 text184646 text127410 text134923 text169695 ## topic2 topic9 topic3 topic2 topic4 topic2 topic1 ## text147917 text157535 text177078 text174393 text181782 text143323 ## topic3 topic7 topic10 topic3 topic3 topic2 ## 10 Levels: topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 ... topic10
# assign topic as a new document-level variabledfmat_news$topic <- topics(tmod_lda)# cross-table of the topic frequencytable(dfmat_news$topic)
## ## topic1 topic2 topic3 topic4 topic5 topic6 topic7 topic8 topic9 topic10 ## 200 218 233 236 192 83 204 185 191 210

Seeded LDA

In the seeded LDA, you can pre-define topics in LDA using a dictionary of “seed” words.

# load dictionary containing seed wordsdict_topic <- dictionary(file = "../dictionary/topics.yml")print(dict_topic)
## Dictionary object with 5 key entries.## - [economy]:## - market*, money, bank*, stock*, bond*, industry, company, shop*## - [politics]:## - lawmaker*, politician*, election*, voter*## - [society]:## - police, prison*, school*, hospital*## - [diplomacy]:## - ambassador*, diplomat*, embassy, treaty## - [military]:## - military, soldier*, terrorist*, marine, navy, army

The number of topics is determined by the number of keys in the dictionary. Next, we can fit the seeded LDA model using textmodel_seededlda() and specify the dictionary with our relevant keywords.

tmod_slda <- textmodel_seededlda(dfmat_news, dictionary = dict_topic)

Some of the topic words are seed words, but the seeded LDA identifies many other related words.

terms(tmod_slda, 20)
## economy politics society diplomacy military ## [1,] "markets" "clinton" "hospital" "refugees" "military" ## [2,] "banks" "sanders" "prison" "syria" "labor" ## [3,] "oil" "cruz" "schools" "brussels" "australian"## [4,] "energy" "obama" "violence" "isis" "corbyn" ## [5,] "climate" "hillary" "officers" "talks" "australia" ## [6,] "sales" "trump's" "hospitals" "un" "turnbull" ## [7,] "prices" "bernie" "victims" "syrian" "johnson" ## [8,] "stock" "ted" "cases" "french" "budget" ## [9,] "food" "rubio" "sexual" "turkey" "leadership"## [10,] "rates" "senator" "drug" "islamic" "cabinet" ## [11,] "sector" "gun" "officer" "border" "shadow" ## [12,] "banking" "politicians" "abuse" "aid" "funding" ## [13,] "businesses" "primary" "crime" "summit" "coalition" ## [14,] "investors" "race" "facebook" "refugee" "senate" ## [15,] "housing" "elections" "medical" "migrants" "terrorist" ## [16,] "shares" "kasich" "child" "agreement" "boris" ## [17,] "costs" "candidates" "mental" "forces" "doctors" ## [18,] "average" "delegates" "died" "military" "tory" ## [19,] "development" "photograph" "parents" "russian" "parties" ## [20,] "trading" "supporters" "mother" "asylum" "jeremy"

topics() returns dictionary keys as the most likely topics of documents.

head(topics(tmod_slda), 20)
## text136751 text136585 text139163 text169133 text153451 text163885 text157885 ## economy military economy society society economy politics ## text173244 text137394 text169408 text184646 text127410 text134923 text169695 ## military military society military society military economy ## text147917 text157535 text177078 text174393 text181782 text143323 ## society economy economy politics society military ## Levels: economy politics society diplomacy military
# assign topics from seeded LDA as a document-level variable to the dfmdfmat_news$topic2 <- topics(tmod_slda)# cross-table of the topic frequencytable(dfmat_news$topic2)
## ## economy politics society diplomacy military ## 479 219 633 233 388
Topic models :: Tutorials for quanteda (2024)
Top Articles
What Happens If You Get Caught Shoplifting From Walmart? - Attorney 911
Planning to retire at 40? Wait, have you done your math right?
Scheelzien, volwassenen - Alrijne Ziekenhuis
Where are the Best Boxing Gyms in the UK? - JD Sports
jazmen00 x & jazmen00 mega| Discover
Kathleen Hixson Leaked
Ffxiv Palm Chippings
Chatiw.ib
Don Wallence Auto Sales Vehicles
South Park Season 26 Kisscartoon
Costco The Dalles Or
1movierulzhd.fun Reviews | scam, legit or safe check | Scamadviser
Blairsville Online Yard Sale
Craigslist Kennewick Pasco Richland
Comenity Credit Card Guide 2024: Things To Know And Alternatives
Obituary Times Herald Record
Delectable Birthday Dyes
Valentina Gonzalez Leak
Katherine Croan Ewald
Webcentral Cuny
Praew Phat
The best TV and film to watch this week - A Very Royal Scandal to Tulsa King
Zack Fairhurst Snapchat
Missed Connections Inland Empire
Halo Worth Animal Jam
R. Kelly Net Worth 2024: The King Of R&B's Rise And Fall
Hampton University Ministers Conference Registration
Costco Gas Hours St Cloud Mn
Panola County Busted Newspaper
Naya Padkar Gujarati News Paper
Greensboro sit-in (1960) | History, Summary, Impact, & Facts
Gillette Craigslist
Sinfuldeed Leaked
*!Good Night (2024) 𝙵ull𝙼ovie Downl𝚘ad Fr𝚎e 1080𝚙, 720𝚙, 480𝚙 H𝙳 HI𝙽DI Dub𝚋ed Fil𝙼yz𝚒lla Isaidub
How often should you visit your Barber?
Boneyard Barbers
Mrstryst
Frommer's Belgium, Holland and Luxembourg (Frommer's Complete Guides) - PDF Free Download
Rage Of Harrogath Bugged
Temu Y2K
15 Best Things to Do in Roseville (CA) - The Crazy Tourist
Infinite Campus Farmingdale
Umiami Sorority Rankings
Nimbleaf Evolution
Swsnj Warehousing Inc
Large Pawn Shops Near Me
Hillsborough County Florida Recorder Of Deeds
Lesson 5 Homework 4.5 Answer Key
antelope valley for sale "lancaster ca" - craigslist
Solving Quadratics All Methods Worksheet Answers
Supervisor-Managing Your Teams Risk – 3455 questions with correct answers
Latest Posts
Article information

Author: Amb. Frankie Simonis

Last Updated:

Views: 6074

Rating: 4.6 / 5 (56 voted)

Reviews: 87% of readers found this page helpful

Author information

Name: Amb. Frankie Simonis

Birthday: 1998-02-19

Address: 64841 Delmar Isle, North Wiley, OR 74073

Phone: +17844167847676

Job: Forward IT Agent

Hobby: LARPing, Kitesurfing, Sewing, Digital arts, Sand art, Gardening, Dance

Introduction: My name is Amb. Frankie Simonis, I am a hilarious, enchanting, energetic, cooperative, innocent, cute, joyous person who loves writing and wants to share my knowledge and understanding with you.