INLP Assignment - 1 : Language Modeling Using N-Gram Model

Files included:

tokenizer.py
language_model.py
generator.py
8 txt files for perplexity scores of each LM for train and test set.
Report containing analysis of generation and perplexity scores with examples
README.md

The corpus files and .npy files containing the calculated probabilities can be found here

The .npy files that store the probabilities for the different models must be in the same directory as these scripts. The paths to the .npy files are hardcoded in the .py files, so the scripts will not run otherwise.
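
As a rough illustration of how a probability table can be stored in and read back from a .npy file, here is a minimal sketch. It assumes the table is a plain Python dict keyed by n-gram tuples; the actual layout of the provided .npy files is not documented here, and the file name toy_prob_dict.npy is hypothetical.

```python
import os
import tempfile

import numpy as np

# Hypothetical toy table: n-gram tuple -> probability (layout is an assumption).
prob_dict = {("what", "are"): 0.5, ("are", "you"): 0.25}

path = os.path.join(tempfile.mkdtemp(), "toy_prob_dict.npy")

# np.save wraps the dict in a 0-d object array and pickles it.
np.save(path, prob_dict)

# allow_pickle=True is required to read object arrays back;
# .item() unwraps the 0-d array into the original dict.
loaded = np.load(path, allow_pickle=True).item()
print(loaded[("what", "are")])  # 0.5
```

Only load .npy files from a source you trust, since allow_pickle=True unpickles arbitrary objects.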

Instructions to run the files

tokenizer.py
When run, this file inputs a text and outputs the tokenized text.
To run this file use python3 tokenizer.py
Example usage:

-> python3 tokenizer.py
   Your text: In 'Pride and Prejudice' by Jane Austen,
Elizabeth Bennett meets Mr Darcy at a ball hosted by her friend @charles_bingly.
They dance, but Mr Darcy finds her behaviour "tolerable, but not handsome enough
to tempt him" #rude. She later visits Pemberley, Mr Darcy's estate, where she
learns more about his character. Check out more information at
https://janeausten.co.uk.

-> [['In', "'", 'Pride', 'and', 'Prejudice', "'", 'by', 'Jane', 'Austen', ',', 'Elizabeth', 'Bennett', 'meets', 'Mr', 'Darcy', 'at', 'a', 'ball', 'hosted', 'by', 'her', 'friend', '<MENTION>', '.'], ['They', 'dance', ',', 'but', 'Mr', 'Darcy', 'finds', 'her','behaviour', '"', 'tolerable', ',', 'but', 'not', 'handsome', 'enough', 'to', 'tempt', 'him', '"', '<HASHTAG>', '.'], ['She', 'later', 'visits', 'Pemberley', ',', 'Mr', "Darcy's", 'estate', ',', 'where', 'she', 'learns', 'more', 'about', 'his', 'character', '.'], ['Check', 'out', 'more', 'information', 'at', '<URL>', '.']]
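
The behaviour shown in the example above (placeholders for URLs, mentions and hashtags, punctuation split into separate tokens, one token list per sentence) can be approximated with regular expressions. The following is a minimal sketch and not the code in tokenizer.py; the patterns and the tokenize helper are assumptions made for illustration.

```python
import re

# Placeholder substitutions (patterns are illustrative assumptions).
# The URL pattern must end on a word character or '/', so a trailing
# full stop is left in the text as its own token.
SUBSTITUTIONS = [
    (re.compile(r"https?://\S*[\w/]"), " <URL> "),
    (re.compile(r"#\w+"), " <HASHTAG> "),
    (re.compile(r"@\w+"), " <MENTION> "),
]
# A token is a placeholder, a word (optionally with inner apostrophes),
# or a single punctuation character.
TOKEN = re.compile(r"<[A-Z]+>|\w+(?:'\w+)*|[^\w\s]")
SENTENCE_END = {".", "!", "?"}


def tokenize(text):
    """Return a list of sentences, each a list of tokens."""
    for pattern, placeholder in SUBSTITUTIONS:
        text = pattern.sub(placeholder, text)
    sentences, current = [], []
    for token in TOKEN.findall(text):
        current.append(token)
        if token in SENTENCE_END:  # close the sentence at ., ! or ?
            sentences.append(current)
            current = []
    if current:  # keep a trailing sentence without end punctuation
        sentences.append(current)
    return sentences


print(tokenize("Mr Darcy's estate. Visit https://janeausten.co.uk."))
# [['Mr', "Darcy's", 'estate', '.'], ['Visit', '<URL>', '.']]
```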

language_model.py
When run, this file inputs a sentence and outputs the likelihood score of the text.
To run this file use python3 language_model.py <lm_type> <path_to_corpus>
Example usage:

-> python3 language_model.py i Ulysses-James_Joyce.txt
   Input sentence: Hello there what are you doing

-> probab score: 1.9699820759850983e-22
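
To show where such a small score comes from: an n-gram model scores a sentence by multiplying one conditional probability per token, so the product shrinks quickly with sentence length. The sketch below uses an unsmoothed bigram model on a toy corpus. The helper names and the use of <s> and </s> padding are assumptions; language_model.py may compute the score differently.

```python
from collections import Counter


def train_bigram(sentences):
    """Return maximum-likelihood bigram probabilities P(word | prev)."""
    context_counts, bigram_counts = Counter(), Counter()
    for sent in sentences:
        padded = ["<s>"] + sent + ["</s>"]
        for prev, word in zip(padded, padded[1:]):
            context_counts[prev] += 1
            bigram_counts[(prev, word)] += 1
    return {bg: c / context_counts[bg[0]] for bg, c in bigram_counts.items()}


def sentence_probability(probs, sentence):
    """Multiply the bigram probabilities along the sentence."""
    padded = ["<s>"] + sentence + ["</s>"]
    score = 1.0
    for prev, word in zip(padded, padded[1:]):
        # Without smoothing, an unseen bigram makes the whole score 0.
        score *= probs.get((prev, word), 0.0)
    return score


corpus = [["what", "are", "you", "doing"], ["what", "are", "they", "doing"]]
probs = train_bigram(corpus)
print(sentence_probability(probs, ["what", "are", "you", "doing"]))  # 0.5
```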

generator.py
When run, this file inputs a sentence and outputs the top k choices for next word prediction with their probability.
To run this file use python3 generator.py <lm_type> <path_to_corpus> k, where k is the number of next-word candidates to print.
Example usage:

-> python3 generator.py i Ulysses-James_Joyce.txt 4
   Input sentence: What are you doing

-> the 0.19387780092955073
   I 0.14182319825969222
   round 0.1390787516408624
   here 0.1390597917333057
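
Top-k next-word prediction amounts to collecting every word that can follow the current context and keeping the k most probable. A minimal sketch, assuming a hypothetical table that maps (context, word) pairs to probabilities (the toy values are made up and are not the ones used by generator.py):

```python
import heapq


def top_k_next(probs, context, k):
    """Return the k most probable (word, probability) pairs after context."""
    candidates = {w: p for (ctx, w), p in probs.items() if ctx == context}
    return heapq.nlargest(k, candidates.items(), key=lambda item: item[1])


# Hypothetical toy table: (context tuple, next word) -> probability.
toy = {
    (("you", "doing"), "the"): 0.4,
    (("you", "doing"), "here"): 0.3,
    (("you", "doing"), "I"): 0.2,
    (("you", "doing"), "round"): 0.1,
    (("are", "you"), "doing"): 1.0,
}
for word, p in top_k_next(toy, ("you", "doing"), 2):
    print(word, p)
# the 0.4
# here 0.3
```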

Generation

Punctuation marks are discarded when the text is processed, so the model does not generate punctuation during sequence generation. The maximum number of generated tokens is set to 20. To change it, edit max_tokens in the generate_sequence function of the class N_Gram_model.

For unsmoothed model:

Prompt: 'I am not where I want to be...'
Results:
n=1: 'I am not where I want to be </s>'

n=2: 'I am not where I want to be in the whole of the whole of the whole of the whole of the whole of the whole of the'

n=3: 'I am not where I want to be in London and when at last that he had been brought up for the sake of discovering them To be'

n=4: 'I am not where I want to be told why my views were directed to Longbourn instead of to yours A house in town I conclude They are'

n=5: 'I am not where I want to be told whether I ought or ought not to make our acquaintance in general understand Wickham's character They are gone off'

We observe that the quality of the generated text improves as n increases. With a larger n, the model conditions on a longer context (the previous n-1 tokens) before choosing the next token, which leads to coherent and meaningful sequences compared to the gibberish produced when n is low. If a context was not seen during training, the model immediately outputs the EOS tag (</s>) and stops generating. For out-of-domain (OOD) prompts, increasing n therefore does not improve the generated text: the output is consistently poor.

Prompt: 'Earth is the third planet in the solar system...'

Results:
n=1: 'Earth is the third planet in the solar system </s>'

n=2: 'Earth is the third planet in the solar system </s>'

n=3: 'Earth is the third planet in the solar system </s>'

n=4: 'Earth is the third planet in the solar system </s>'

n=5: 'Earth is the third planet in the solar system </s>'
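
The two behaviours seen in the unsmoothed results (a repeating loop at n=2 and an immediate </s> for an unseen context) can be reproduced with a small sketch. It assumes greedy decoding, meaning the most frequent continuation is always chosen. That is an assumption consistent with the repetitive n=2 output, not a description of generate_sequence, and the train and generate helpers are hypothetical.

```python
from collections import Counter, defaultdict


def train(sentences, n):
    """Count next-word frequencies for every (n-1)-token context."""
    table = defaultdict(Counter)
    for sent in sentences:
        padded = ["<s>"] * (n - 1) + sent + ["</s>"]
        for i in range(len(padded) - n + 1):
            context = tuple(padded[i:i + n - 1])
            table[context][padded[i + n - 1]] += 1
    return table


def generate(table, prompt, n, max_tokens=20):
    """Greedily extend prompt by at most max_tokens tokens."""
    tokens = list(prompt)
    for _ in range(max_tokens):
        context = tuple(tokens[-(n - 1):]) if n > 1 else ()
        if context not in table:  # unseen context: emit EOS and stop
            tokens.append("</s>")
            break
        word = table[context].most_common(1)[0][0]
        tokens.append(word)
        if word == "</s>":
            break
    return tokens


corpus = [["the", "whole", "of", "the", "town"],
          ["the", "whole", "of", "the", "whole"]]
table = train(corpus, 2)
print(" ".join(generate(table, ["the"], 2, max_tokens=6)))
# the whole of the whole of the
print(" ".join(generate(table, ["Earth"], 2)))
# Earth </s>
```

Sampling from the distribution instead of taking the most frequent word would break such loops, at the cost of less predictable output.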

For smoothed models:

Prompt: 'What are you doing here...'

Results: 'What are you doing here Stephen It flows purling widely flowing floating foampool flower unfurling They talk excitedly Little piece of original verse written by'


Prompt: 'King Macbeth was...'
Result: 'King Macbeth was <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK> <UNK>'

We observe that for an unseen context the model outputs the <UNK> tag, which stands for an unknown token. For out-of-domain data, <UNK> is again generated very frequently, leading to an undesirable output sequence. The probabilities calculated after smoothing are used for generation here.
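
The file name Ulysses_prob_dict_good_turing.npy suggests that one of the smoothed models uses Good-Turing smoothing. As a rough illustration of how smoothing reserves probability mass for unseen events, here is a sketch of the textbook Good-Turing count adjustment c* = (c + 1) * N_{c+1} / N_c, where N_c is the number of n-grams seen exactly c times. The fallback to the raw count when N_{c+1} is zero is a simplification, and this is not necessarily the variant used in this assignment.

```python
from collections import Counter


def good_turing_counts(ngram_counts):
    """Return (adjusted counts, probability mass reserved for unseen n-grams)."""
    # N_c: how many distinct n-grams were seen exactly c times.
    freq_of_freq = Counter(ngram_counts.values())
    total = sum(ngram_counts.values())
    adjusted = {}
    for ngram, c in ngram_counts.items():
        if freq_of_freq[c + 1] > 0:
            adjusted[ngram] = (c + 1) * freq_of_freq[c + 1] / freq_of_freq[c]
        else:
            # No n-gram was seen c + 1 times: keep the raw count (simplification).
            adjusted[ngram] = float(c)
    unseen_mass = freq_of_freq[1] / total  # N_1 / N
    return adjusted, unseen_mass


# Toy counts: three n-grams seen once, two seen twice, one seen three times.
counts = {("a",): 1, ("b",): 1, ("c",): 1, ("d",): 2, ("e",): 2, ("f",): 3}
adjusted, unseen_mass = good_turing_counts(counts)
print(adjusted[("d",)], unseen_mass)  # 1.5 0.3
```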

Perplexity scores

To generate the perplexity scores for a particular corpus, use the write_perplexity function of the class N_Gram_model. First load the .npy file that stores the probabilities for the training corpus, then call write_perplexity. A txt file with the average perplexity and the sentence-wise perplexities will be created. The following script produces the perplexity score files.

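# Assumption: N_Gram_model is defined in language_model.py.
from language_model import N_Gram_model

# 'g' is presumably the lm_type that matches the Good-Turing file loaded below.
model = N_Gram_model('Ulysses-James_Joyce.txt', 'g')
model.read_file()
model.preprocess()
model.set_up()
# Load the stored probabilities for the training corpus.
model.load('Ulysses_prob_dict_good_turing.npy')
# Write the average and sentence-wise perplexity to a txt file.
model.write_perplexity()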
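
For reference, the perplexity of a sentence with N tokens is the inverse of its probability normalised by length, PP = P(w_1 ... w_N)^(-1/N), usually computed in log space to avoid underflow. The sketch below shows that computation on made-up per-token probabilities. The helper names are hypothetical, and averaging the sentence-wise scores is an assumption about what write_perplexity reports.

```python
import math


def perplexity(token_probs):
    """Perplexity of one sentence given the probability of each token."""
    log_sum = sum(math.log(p) for p in token_probs)
    return math.exp(-log_sum / len(token_probs))


def average_perplexity(sentences_probs):
    """Return (mean of the sentence perplexities, list of sentence perplexities)."""
    scores = [perplexity(probs) for probs in sentences_probs]
    return sum(scores) / len(scores), scores


# A sentence whose four tokens each have probability 0.25 has perplexity 4:
# the model is as uncertain as a uniform choice among four words.
print(round(perplexity([0.25, 0.25, 0.25, 0.25]), 6))  # 4.0
```

With an unsmoothed model, any token probability of 0 makes math.log fail, which is one reason smoothed probabilities are needed for test-set perplexity.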

Repository: ishank31/Language-Modeling-Using-N-Gram-model