Building Language Models in NLP

Introduction

A language model in NLP is a statistical model that assigns a probability to a sequence of words, estimating how likely each word is given the words that came before it. It helps predict which word is most likely to appear next in a sentence, and is therefore widely used in predictive text input, speech recognition, machine translation, spelling correction and similar applications. The input to a language model is usually a training corpus of example sentences; the output is a probability distribution over sequences of words. Depending on how much context we condition on, we get different models: a unigram model ignores the preceding words entirely, a bigram model conditions on the previous word, a trigram model on the previous two words, and an n-gram model in general on the previous n-1 words.

This article was published as a part of the Data Science Blogathon.

Table of contents

  • What are Language Models?
    • Why Language Models?
  • Reading the Raw Text Corpus
  • Preprocessing the Raw Text
  • Creating Unigram, Bigram and Trigram Language Models
  • Predicting Next Three Words Using Bigram and Trigram Models
  • Conclusion
  • Frequently Asked Questions

What are Language Models?

Language models are a fundamental component of natural language processing (NLP) systems. A language model is a statistical model that assigns probabilities to sequences of words, allowing it to predict which word or sequence of words is most likely to occur next given the previous words.

Language models play a crucial role in many NLP applications:

  • Predictive text input (auto-complete)
  • Speech recognition
  • Machine translation
  • Spelling and grammar correction
  • Generating human-like text

Language models in NLP range from relatively simple n-gram models, which consider only a short window of the most recent words, to advanced neural network models like transformers, which can capture long-range contextual dependencies.

What sets language models apart is their ability to capture linguistic knowledge in a statistical framework that computers can process. This allows NLP systems to generate, understand, and translate natural language with increasing fluency and accuracy.

Moreover, large language models pretrained on vast amounts of text have emerged as a powerful basis for transfer learning to many downstream NLP tasks, substantially advancing the field’s capabilities.

Why Language Models?

Language models form the backbone of natural language processing. They are a way of transforming qualitative information about text into quantitative information that machines can work with. They have applications across a wide range of industries such as tech, finance, healthcare and the military. All of us encounter language models daily, be it the predictive text input on our mobile phones or a simple Google search. Hence language models form an integral part of almost every natural language processing application.


In this article, we will learn how to build unigram, bigram and trigram language models on a raw text corpus and use them for next-word prediction.

Reading the Raw Text Corpus

We will begin by reading the text corpus which is an excerpt from Oliver Twist. You can download the text file from here. Once it is downloaded, read the text file and find the total number of characters in it.

file = open("rawCorpus.txt", "r")
rawReadCorpus = file.read()
print("Total no. of characters in read dataset: {}".format(len(rawReadCorpus)))
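
As an aside, if the corpus contains curly quotes or other non-ASCII characters, it can be safer to open the file with an explicit encoding and a context manager. The snippet below is a minimal alternative sketch, assuming the file is UTF-8 encoded:

# Alternative reading sketch (assumes rawCorpus.txt is UTF-8 encoded):
# the with-block closes the file automatically once reading is done.
with open("rawCorpus.txt", "r", encoding="utf-8") as f:
    rawReadCorpus = f.read()
print("Total no. of characters in read dataset: {}".format(len(rawReadCorpus)))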

We need to import the nltk library to perform some basic text processing tasks which we will do with the help of the following code :

import nltk
nltk.download('punkt')  # tokenizer models required by sent_tokenize and word_tokenize
from nltk.tokenize import word_tokenize, sent_tokenize

Preprocessing the Raw Text

First, we need to remove all newlines and special characters from the text corpus. We do that with the following code:

import string
string.punctuation = string.punctuation + '“' + '”' + '-' + '’' + '‘' + '—'
string.punctuation = string.punctuation.replace('.', '')
file = open('rawCorpus.txt').read()

# preprocess data to remove newlines and special characters
file_new = ""
for line in file:
    line_new = line.replace("\n", " ")
    file_new += line_new
preprocessedCorpus = "".join([char for char in file_new if char not in string.punctuation])

After removing newlines and special characters, we can break up the corpus to obtain the words and the sentences using sent_tokenize and word_tokenize from nltk.tokenize. Let us print the first 5 sentences and the first 5 words obtained from the corpus :

sentences = sent_tokenize(preprocessedCorpus)
print("1st 5 sentences of preprocessed corpus are : ")
print(sentences[0:5])
words = word_tokenize(preprocessedCorpus)
print("1st 5 words/tokens of preprocessed corpus are : ")
print(words[0:5])


We also need to remove stopwords from the corpus. Stopwords are commonly used words like ‘and’, ‘the’ and ‘at’ which do not add any special meaning or significance to a sentence. A list of stopwords is available with nltk, and they can be removed from the corpus using the following code:

nltk.download('stopwords')
from nltk.corpus import stopwords

stop_words = set(stopwords.words('english'))
filtered_tokens = [w for w in words if w.lower() not in stop_words]

Creating Unigram, Bigram and Trigram Language Models

We can create n-grams using the ngrams module from nltk.util. N-grams are sequences of n consecutive words occurring in the corpus. For example, in the sentence “I love dogs”, ‘I’, ‘love’ and ‘dogs’ are unigrams, while ‘I love’ and ‘love dogs’ are bigrams; ‘I love dogs’ itself is a trigram, i.e. a contiguous sequence of three words. We obtain unigrams, bigrams and trigrams from the corpus using the following code:

from collections import Counter
from nltk.util import ngrams

unigrams = []
bigrams = []
trigrams = []
for content in sentences:
    content = word_tokenize(content.lower())
    # drop the sentence-ending periods so they do not end up in the n-grams
    content = [word for word in content if word != '.']
    unigrams.extend(content)
    bigrams.extend(ngrams(content, 2))
    trigrams.extend(ngrams(content, 3))

print("Sample of n-grams:\n" + "-------------------------")
print("--> UNIGRAMS:\n" + str(unigrams[:5]) + " ...\n")
print("--> BIGRAMS:\n" + str(bigrams[:5]) + " ...\n")
print("--> TRIGRAMS:\n" + str(trigrams[:5]) + " ...\n")
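
Before moving on, it may help to see what ngrams() returns on its own. The following is a small, self-contained illustration on the “I love dogs” example from above; it is not part of the main pipeline.

# Quick illustration of nltk.util.ngrams on a toy sentence.
from nltk.tokenize import word_tokenize
from nltk.util import ngrams

tokens = word_tokenize("I love dogs")
print(list(ngrams(tokens, 1)))  # [('I',), ('love',), ('dogs',)]
print(list(ngrams(tokens, 2)))  # [('I', 'love'), ('love', 'dogs')]
print(list(ngrams(tokens, 3)))  # [('I', 'love', 'dogs')]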


Next, we filter out the n-grams that consist only of stopwords such as articles, prepositions and determiners. For unigrams this simply drops stopwords like ‘the’ and ‘a’; for bigrams and trigrams it drops sequences made up entirely of stopwords, such as the bigram ‘in the’. We use the following code for the removal of stopwords from n-grams.

def stopwords_removal(n, a):
    # keep n-grams that contain at least one non-stopword
    b = []
    if n == 1:
        for word in a:
            if word not in stop_words:
                b.append(word)
    else:
        for pair in a:
            if any(word not in stop_words for word in pair):
                b.append(pair)
    return b

unigrams_Processed = stopwords_removal(1, unigrams)
bigrams_Processed = stopwords_removal(2, bigrams)
trigrams_Processed = stopwords_removal(3, trigrams)

print("Sample of n-grams after processing:\n" + "-------------------------")
print("--> UNIGRAMS:\n" + str(unigrams_Processed[:5]) + " ...\n")
print("--> BIGRAMS:\n" + str(bigrams_Processed[:5]) + " ...\n")
print("--> TRIGRAMS:\n" + str(trigrams_Processed[:5]) + " ...\n")

This gives us unigram, bigram and trigram lists that are free of stopword-only entries.


We can obtain the count or frequency of each n-gram appearing in the corpus. This will be useful later when we need to calculate the probabilities of the next possible word based on the previous n-grams. We write a function get_ngrams_freqDist which returns a dictionary mapping each n-gram in the list passed to it to its frequency. We obtain the frequencies of all unigrams, bigrams and trigrams in this way.

def get_ngrams_freqDist(n, ngramList):
    ngram_freq_dict = {}
    for ngram in ngramList:
        if ngram in ngram_freq_dict:
            ngram_freq_dict[ngram] += 1
        else:
            ngram_freq_dict[ngram] = 1
    return ngram_freq_dict

unigrams_freqDist = get_ngrams_freqDist(1, unigrams)
unigrams_Processed_freqDist = get_ngrams_freqDist(1, unigrams_Processed)
bigrams_freqDist = get_ngrams_freqDist(2, bigrams)
bigrams_Processed_freqDist = get_ngrams_freqDist(2, bigrams_Processed)
trigrams_freqDist = get_ngrams_freqDist(3, trigrams)
trigrams_Processed_freqDist = get_ngrams_freqDist(3, trigrams_Processed)
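
As an aside, collections.Counter (already imported above) would build the same frequency tables in a single line each. A minimal equivalent sketch follows; the _alt names are only for illustration.

from collections import Counter

# Counter is a dict subclass that counts hashable items, so it can replace get_ngrams_freqDist.
unigrams_freqDist_alt = Counter(unigrams)
bigrams_freqDist_alt = Counter(bigrams)
trigrams_freqDist_alt = Counter(trigrams)
# Lookups behave like the dictionaries above, except that a missing n-gram returns 0 instead of raising a KeyError.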

Predicting Next Three Words Using Bigram and Trigram Models

The chain rule is used to compute the probability of a sentence under a language model. Let w1 w2 … wn be a sentence, where w1, w2, …, wn are the individual words. Then the probability of the sentence occurring is given by the following formula:

P(w1 w2 … wn) = P(w1) P(w2 | w1) P(w3 | w1 w2) … P(wn | w1 w2 … wn-1)

For example, the probability of the sentence “I love dogs” is given by :

P(I love dogs) = P(I)P(love | I)P(dogs | I love)

Now the individual probabilities can be obtained in the following way :

P(I) = Count(‘I’) / Total no. of words

P(love | I) = Count(‘I love’) / Count(‘I’)

P(dogs | I love) = Count(‘I love dogs’) / Count(‘I love’)

Note that Count(‘I’), Count(‘I love’) and Count(‘I love dogs’) are the frequencies of the respective unigram, bigram and trigram which we computed earlier using the get_ngrams_freqDist function.

Now, when we use a bigram model to compute the probabilities, the probability of each new word depends only on its previous word. That is, for the previous example, the probability of the sentence becomes :

P(I love dogs) = P(I)P(love | I)P(dogs | love)

Similarly, for a trigram model, the probability will be given by :

P(I love dogs) = P(I)P(love | I)P(dogs | I love) since the probability of each new word depends on the previous two words.

In other words, a trigram model conditions the prediction of each word on the two words immediately preceding it, effectively sliding a three-word window across the sentence.
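
To make the arithmetic concrete, here is a small sketch that scores a sentence under the unsmoothed bigram approximation, using the frequency dictionaries built earlier (bigram_sentence_probability is a hypothetical helper, not part of the original article). Notice that a single bigram that never occurs in the corpus drives the whole product to zero, which is exactly the problem discussed next.

# Unsmoothed bigram sentence probability: P(w1) * P(w2 | w1) * ... * P(wn | wn-1).
def bigram_sentence_probability(sentence):
    tokens = word_tokenize(sentence.lower())
    prob = unigrams_freqDist.get(tokens[0], 0) / len(unigrams)   # P(w1) = Count(w1) / total words
    for prev, curr in ngrams(tokens, 2):
        bigram_count = bigrams_freqDist.get((prev, curr), 0)
        unigram_count = unigrams_freqDist.get(prev, 0)
        if bigram_count == 0 or unigram_count == 0:
            return 0.0                                           # unseen bigram => zero probability
        prob *= bigram_count / unigram_count                     # P(curr | prev)
    return prob

print(bigram_sentence_probability("he was a very good boy"))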

However, there is a catch in this kind of modelling. Suppose some bigram does not appear in the training set but does appear in a test sentence. We would then assign it a probability of 0, making the probability of the whole test sentence 0, which is undesirable. Smoothing is used to overcome this problem: the counts are smoothed (or regularized) so that some probability mass is reassigned to unseen events. One common technique is add-one (Laplace) smoothing, which we will use in this article. For a bigram model, add-one smoothing adds 1 to each bigram count in the numerator and V, the number of unique words in the corpus, to the count of the preceding unigram in the denominator.

P(wi | wi-1) = (Count(wi-1 wi) + 1) / (Count(wi-1) + V), and similarly for a trigram model, P(wi | wi-2 wi-1) = (Count(wi-2 wi-1 wi) + 1) / (Count(wi-2 wi-1) + V)
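
For a quick sanity check with made-up numbers: suppose the word ‘he’ occurs 40 times in the corpus, the bigram ‘he went’ never occurs, and the corpus has V = 5,000 unique words. The unsmoothed estimate of P(went | he) would be 0/40 = 0, while the add-one estimate is (0 + 1) / (40 + 5,000) ≈ 0.0002, small but no longer zero.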

Now that we have understood what smoothed bigram and trigram models are, let us write the code to compute them. For prediction we will use the unprocessed bigrams and trigrams, i.e. the ones from which articles, determiners and other stopwords have not been removed.

smoothed_bigrams_probDist = {}
V = len(unigrams_freqDist)
for i in bigrams_freqDist:
    smoothed_bigrams_probDist[i] = (bigrams_freqDist[i] + 1) / (unigrams_freqDist[i[0]] + V)

smoothed_trigrams_probDist = {}
for i in trigrams_freqDist:
    smoothed_trigrams_probDist[i] = (trigrams_freqDist[i] + 1) / (bigrams_freqDist[i[0:2]] + V)
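
Note that these dictionaries only hold probabilities for bigrams and trigrams that actually occur in the corpus, so a genuinely unseen bigram still has no entry. The helper below (smoothed_bigram_prob is a hypothetical name, not from the original code) shows how the same add-one formula would assign it a non-zero probability on the fly:

# Add-one smoothed P(word | prev_word), defined even for bigrams never seen in the corpus.
def smoothed_bigram_prob(prev_word, word):
    bigram_count = bigrams_freqDist.get((prev_word, word), 0)  # 0 for unseen bigrams
    unigram_count = unigrams_freqDist.get(prev_word, 0)        # 0 for unseen words
    return (bigram_count + 1) / (unigram_count + V)            # V = vocabulary size, defined above

print(smoothed_bigram_prob("he", "said"))  # always strictly greater than 0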

Next, we try to predict the next three words of three test sentences using the computed smoothed bigram and trigram language models.

testSent1 = "There was a sudden jerk, a terrific convulsion of the limbs; and there he"
testSent2 = "They made room for the stranger, but he sat down"
testSent3 = "The hungry and destitute situation of the infant orphan was duly reported by"

First, we tokenize the test sentences into component words and obtain the last unigrams and bigrams appearing in them.

token_1 = word_tokenize(testSent1)
token_2 = word_tokenize(testSent2)
token_3 = word_tokenize(testSent3)

ngram_1 = {1: [], 2: []}
ngram_2 = {1: [], 2: []}
ngram_3 = {1: [], 2: []}
for i in range(2):
    ngram_1[i+1] = list(ngrams(token_1, i+1))[-1]
    ngram_2[i+1] = list(ngrams(token_2, i+1))[-1]
    ngram_3[i+1] = list(ngrams(token_3, i+1))[-1]
print("Sentence 1: ", ngram_1, "\nSentence 2: ", ngram_2, "\nSentence 3: ", ngram_3)

Next, we write functions to predict the next word and the next 3 words respectively of the three test sentences using the smoothed bigram model.

# predict_next_word returns the most probable (word, probability) pair that follows last_word[0].
def predict_next_word(last_word, probDist):
    next_word = {}
    for k in probDist:
        if k[0] == last_word[0]:
            next_word[k[1]] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1)
    return high[0]

# predict_next_3_words builds two candidate three-word continuations, starting from the
# two most probable words that follow the last token of the test sentence.
def predict_next_3_words(token, probDist):
    pred1 = []
    pred2 = []
    next_word = {}
    for i in probDist:
        if i[0] == token:
            next_word[i[1]] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2)
    w1a = high[0]
    w1b = high[1]
    w2a = predict_next_word(w1a, probDist)
    w3a = predict_next_word(w2a, probDist)
    w2b = predict_next_word(w1b, probDist)
    w3b = predict_next_word(w2b, probDist)
    pred1.append(w1a)
    pred1.append(w2a)
    pred1.append(w3a)
    pred2.append(w1b)
    pred2.append(w2b)
    pred2.append(w3b)
    return pred1, pred2

# '\033[1m' and '\033[0m' are ANSI escape codes that print the predicted words in bold.
print("Predicting next 3 possible word sequences with smoothed bigram model : ")
pred1, pred2 = predict_next_3_words(ngram_1[1][0], smoothed_bigrams_probDist)
print("1a)" + testSent1 + " " + '\033[1m' + pred1[0][0] + " " + pred1[1][0] + " " + pred1[2][0] + '\033[0m')
print("1b)" + testSent1 + " " + '\033[1m' + pred2[0][0] + " " + pred2[1][0] + " " + pred2[2][0] + '\033[0m')
pred1, pred2 = predict_next_3_words(ngram_2[1][0], smoothed_bigrams_probDist)
print("2a)" + testSent2 + " " + '\033[1m' + pred1[0][0] + " " + pred1[1][0] + " " + pred1[2][0] + '\033[0m')
print("2b)" + testSent2 + " " + '\033[1m' + pred2[0][0] + " " + pred2[1][0] + " " + pred2[2][0] + '\033[0m')
pred1, pred2 = predict_next_3_words(ngram_3[1][0], smoothed_bigrams_probDist)
print("3a)" + testSent3 + " " + '\033[1m' + pred1[0][0] + " " + pred1[1][0] + " " + pred1[2][0] + '\033[0m')
print("3b)" + testSent3 + " " + '\033[1m' + pred2[0][0] + " " + pred2[1][0] + " " + pred2[2][0] + '\033[0m')

Running this prints two candidate three-word continuations (labelled a and b) for each test sentence under the smoothed bigram model.

We obtain predictions from the smoothed trigram model similarly.

# These redefine the bigram helpers above to work with trigram keys (tuples of three words).
def predict_next_word(last_word, probDist):
    next_word = {}
    for k in probDist:
        if k[0:2] == last_word:
            next_word[k[2]] = probDist[k]
    k = Counter(next_word)
    high = k.most_common(1)
    return high[0]

def predict_next_3_words(token, probDist):
    pred = []
    next_word = {}
    for i in probDist:
        if i[0:2] == token:
            next_word[i[2]] = probDist[i]
    k = Counter(next_word)
    high = k.most_common(2)
    w1a = high[0]
    tup = (token[1], w1a[0])
    w2a = predict_next_word(tup, probDist)
    tup = (w1a[0], w2a[0])
    w3a = predict_next_word(tup, probDist)
    pred.append(w1a)
    pred.append(w2a)
    pred.append(w3a)
    return pred

print("Predicting next 3 possible word sequences with smoothed trigram model : ")
pred = predict_next_3_words(ngram_1[2], smoothed_trigrams_probDist)
print("1)" + testSent1 + " " + '\033[1m' + pred[0][0] + " " + pred[1][0] + " " + pred[2][0] + '\033[0m')
pred = predict_next_3_words(ngram_2[2], smoothed_trigrams_probDist)
print("2)" + testSent2 + " " + '\033[1m' + pred[0][0] + " " + pred[1][0] + " " + pred[2][0] + '\033[0m')
pred = predict_next_3_words(ngram_3[2], smoothed_trigrams_probDist)
print("3)" + testSent3 + " " + '\033[1m' + pred[0][0] + " " + pred[1][0] + " " + pred[2][0] + '\033[0m')

This prints a single predicted three-word continuation for each test sentence under the smoothed trigram model.

Conclusion

Language models are powerful tools for predicting the likelihood of word sequences in natural language. This article demonstrated how to build unigram, bigram, and trigram language models from a raw text corpus and use them for next-word prediction. Add-one smoothing was applied to the n-gram counts to address the zero-probability problem for unseen n-grams. While the bigram and trigram models produced reasonable next-word predictions on the example sentences, more advanced neural language models can achieve superior performance by capturing longer-range dependencies. Nonetheless, understanding n-gram language models lays a foundation for the more complex approaches used in modern natural language processing.


Frequently Asked Questions

Q1. What are language models in NLP?

A. Language models are statistical models that determine the probability of a sequence of words occurring in a sentence or text, based on the previous words. They are fundamental to many NLP tasks like predictive text, speech recognition, and machine translation.

Q2. How to build an NLP model?

A. To build an NLP model, you typically need to preprocess text data, extract relevant features (e.g. n-grams, word embeddings), choose an appropriate model architecture (e.g. neural networks, ensemble methods), train the model on labeled data, tune hyperparameters, and evaluate performance.

Q3. How to build a language model?

A. To build a language model, you tokenize the text into words/n-grams, count their frequencies, and estimate probabilities of word sequences using techniques like maximum likelihood estimation with smoothing. Neural language models use neural networks trained on text to model these probabilities.

Q4. How are language models built?

A. Language models are built by first preprocessing a text corpus, then extracting n-grams (sequences of n words) and counting their frequencies. Probabilities of word sequences are estimated from these counts, often using smoothing techniques to account for unseen n-grams. Advanced neural language models learn these probabilities automatically through training on text data.
