Information Summarizer: Powered by Wikipedia

Abstract—Summarized data helps a user understand content at a quick glance and saves time. In this project, we will build an information summarizer that can find and shorten information, using Wikipedia as the primary source. The information collected from this source will be provided to the user in summarized form.

Keywords—Wikipedia, NLP, BART, PEGASUS, word_tokenizer

INTRODUCTION

Since its introduction, Wikipedia has proven to be one of the best sources of information. It contains detailed articles on topics in various domains. Even though this content is informative, it can take time to read through all of it. Most users look up a keyword only to obtain concise information that they can go through in a matter of minutes; this is where text summarizers come into action. A text summarizer goes through a given text and produces a brief version that captures the whole content. In this project, we aim to create a search engine that takes an input from the user, collects information on it from Wikipedia, and returns shortened information to the user.

Data Collection

The data for this analysis was collected from Wikipedia. We use the wikipedia Python library to obtain the content related to the required topic: we first give a specific keyword as input to the library, and the library then returns the article matching that keyword.
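A minimal sketch of this retrieval step is shown below, assuming the wikipedia package is installed; "Climate change" is an assumed example keyword, matching the sample article analysed later in this write-up.

```python
import wikipedia

# "Climate change" is an assumed example keyword for illustration.
keyword = "Climate change"
page = wikipedia.page(keyword)

article = page.content    # full article body, including section headings
print(page.title)
print(article[:300])      # preview the first 300 characters
```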

Data Preprocessing

The text returned by the wikipedia library consists of in-depth content divided by subheadings, together with all the references and citations of the article. We need to remove these elements before moving on to the modeling stage.

First, we remove the citations and other unwanted characters from the text. For this, we use the re library to search for and strip these patterns.
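A minimal cleaning sketch using re follows; the exact patterns are assumptions, since the write-up only states that citations and stray characters are removed.

```python
import re

def clean_article(text):
    """Remove citation markers, section headings, and extra whitespace."""
    text = re.sub(r"\[[0-9]+\]", " ", text)     # citation markers such as [12]
    text = re.sub(r"==+[^=]+==+", " ", text)    # section headings such as == History ==
    text = re.sub(r"\s+", " ", text)            # collapse runs of whitespace
    return text.strip()

clean_text = clean_article(article)
```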

Extractive Text Summarization

Extractive summarization is a summarization technique based on word frequency. To perform it, we first calculate a rank for each individual word in the article.

A word's rank is calculated from how frequently it repeats within the article. Here, we use the tokenized data to compute these word frequencies.
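A sketch of the frequency computation with NLTK's word tokenizer is shown below; the lower-casing and stop-word removal are assumptions on top of what the write-up states.

```python
from collections import Counter

import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

# Tokenize the cleaned article and count each remaining word.
stop_words = set(stopwords.words("english"))
tokens = word_tokenize(clean_text.lower())   # clean_text from the preprocessing step

word_frequencies = Counter(
    w for w in tokens if w.isalpha() and w not in stop_words
)
print(word_frequencies.most_common(10))
```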

Once the list of words and their respective frequencies is computed, we score the individual sentences. To calculate the sentence scores, we first split the article into sentences. For each sentence, we then add up the previously calculated ranks of the individual words it contains; the resulting value is the score for that sentence. This process is repeated for every sentence, and the values are stored for further processing. Once the sentence scores are calculated, we start generating the text summary.
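The scoring step could look like the following, reusing the word_frequencies counter from the previous sketch.

```python
from nltk.tokenize import sent_tokenize, word_tokenize

# Split the article into sentences and score each one as the sum of the
# frequencies of the words it contains.
sentences = sent_tokenize(clean_text)

sentence_scores = {}
for sentence in sentences:
    sentence_scores[sentence] = sum(
        word_frequencies.get(w, 0) for w in word_tokenize(sentence.lower())
    )
```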

Extractive summarization works on the relation between the article and its most repeated words: it assumes that frequently repeated words carry more importance than less frequent ones, so the sentences containing the most repeated words have the highest importance. Having calculated the score of each individual sentence, we now sort the sentences in decreasing order of score. The generated summary is the string containing the top sentences in this list.
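Selecting the top sentences is then a sort over the score dictionary; N = 10 is an arbitrary illustrative choice.

```python
# Keep the N highest-scoring sentences as the extractive summary.
N = 10
top_sentences = sorted(sentence_scores, key=sentence_scores.get, reverse=True)[:N]
extractive_summary = " ".join(top_sentences)
print(extractive_summary)
```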

Analysing the Highest scored sentence

The sentence with the highest score in our sample article is given below

“Global warming usually refers to human-induced warming of the Earth system, whereas climate change can refer to natural or anthropogenic change.”

In the scatterplot above, we can see how much each individual word contributes towards the final score. The score of this particular sentence is

Lowest scored sentence and Sentence Score Calculation (from generated summary)

The sentence with the lowest score in our sample article is given below

“Current inequalities between men and women, between rich and poor, and between different ethnicities have been observed to worsen as a consequence of climate variability and climate change”

In the scatterplot above, we can see how much each individual word contributes towards the final score. The score of this particular sentence is 300.82.

Abstractive Summarization

In abstractive summarization, the model tries to generate the summary based on its understanding of the article. The generated summary may contain new or paraphrased sentences compared to the original article.

In this project we will combine two abstractive summarization models to generate the output. The models are as follows:

  1. Pegasus model
  2. BART model

Pegasus model.

The Pegasus model is a pretrained NLP model suited to various language processing tasks. It is built on a seq2seq architecture and trained on numerous datasets ranging from web articles to news articles. In this project we use the pretrained model as-is to generate the output.
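A minimal sketch of running the pretrained checkpoint through the transformers pipeline is shown below; google/pegasus-xsum is the checkpoint listed in the references, while the input slice and length parameters are illustrative assumptions.

```python
from transformers import pipeline

# Load the pretrained PEGASUS summarizer from the Hugging Face hub.
summarizer = pipeline("summarization", model="google/pegasus-xsum")

# PEGASUS accepts a limited number of input tokens, so the input is
# truncated here; the slice length is an arbitrary illustrative choice.
pegasus_summary = summarizer(
    clean_text[:3000], max_length=128, min_length=30
)[0]["summary_text"]
print(pegasus_summary)
```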

BART model.

BART, or Bidirectional and Auto-Regressive Transformers, is a very strong NLP model. It is a transformer that combines a bidirectional encoder and an autoregressive decoder into one seq2seq model.

In this project, we will fine-tune the BART model for text summarization. For this, we use the "xsum" dataset obtained from the Hugging Face hub.

XSUM dataset

The XSUM dataset is a news summarization dataset; it consists of 11,332 summaries and their respective source documents. We use this dataset to fine-tune the BART model.
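A condensed fine-tuning sketch with the datasets and transformers libraries follows. The checkpoint (facebook/bart-base), the training slice, and all hyperparameters are assumptions made for illustration; the write-up only states that BART is fine-tuned on XSum.

```python
from datasets import load_dataset
from transformers import (
    BartForConditionalGeneration,
    BartTokenizer,
    DataCollatorForSeq2Seq,
    Seq2SeqTrainer,
    Seq2SeqTrainingArguments,
)

# A small slice of XSum keeps this sketch cheap to run.
dataset = load_dataset("xsum", split="train[:2000]")

checkpoint = "facebook/bart-base"   # assumed checkpoint
tokenizer = BartTokenizer.from_pretrained(checkpoint)
model = BartForConditionalGeneration.from_pretrained(checkpoint)

def preprocess(batch):
    # XSum pairs a full "document" with a one-sentence "summary".
    inputs = tokenizer(batch["document"], max_length=512, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=64, truncation=True)
    inputs["labels"] = labels["input_ids"]
    return inputs

tokenized = dataset.map(preprocess, batched=True, remove_columns=dataset.column_names)

trainer = Seq2SeqTrainer(
    model=model,
    args=Seq2SeqTrainingArguments(
        output_dir="bart-xsum-finetuned",
        per_device_train_batch_size=4,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```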

Summary Analysis

We will now compare the output of the extractive summarization with the combined output of the abstractive summarization.

Size comparison of the summaries.

In the pie chart, we can see that the summarization techniques reduced the large article to less than one-third of its original size. The summary obtained from the pretrained model is the smallest of the three.
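A sketch of how such a pie chart might be produced with Matplotlib (cited in the references) follows; it assumes clean_text, extractive_summary, and abstractive_summary hold the article and the two generated summaries from the earlier steps.

```python
import matplotlib.pyplot as plt

# Word counts of the original article and the two summaries; the variables
# are assumed to come from the earlier steps of the pipeline.
sizes = {
    "Original article": len(clean_text.split()),
    "Extractive summary": len(extractive_summary.split()),
    "Abstractive summary": len(abstractive_summary.split()),
}

plt.pie(sizes.values(), labels=sizes.keys(), autopct="%1.1f%%")
plt.title("Relative sizes of the article and its summaries")
plt.show()
```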

Word composition of the summaries

In the word clouds below, we can observe that the most frequent words remain the same across the summaries. While words such as climate change, climate, and human are common to all three word clouds, words such as African and mitigated are rare or unique to one.

Fig. Word cloud representing the extractive summary

Fig. Word cloud representing the abstractive summary

Fig. Word cloud representing the original article
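The word clouds can be generated with the wordcloud package; a sketch under the same variable assumptions as above:

```python
import matplotlib.pyplot as plt
from wordcloud import WordCloud

texts = {
    "Original article": clean_text,
    "Extractive summary": extractive_summary,
    "Abstractive summary": abstractive_summary,
}

fig, axes = plt.subplots(1, 3, figsize=(18, 5))
for ax, (title, text) in zip(axes, texts.items()):
    cloud = WordCloud(width=600, height=400, background_color="white").generate(text)
    ax.imshow(cloud, interpolation="bilinear")
    ax.set_title(title)
    ax.axis("off")
plt.show()
```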

Common words among the top 50 words

Here we consider the top 50 words in each text, ranked by how frequently they appear. The Venn diagram below depicts the number of words common to each category.

In the Venn diagram, 11 words are common to all three sets of data, and no words are shared exclusively between the extractive and abstractive summaries.
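A sketch of the overlap computation using matplotlib_venn is shown below; the top_words helper is a hypothetical name introduced here for illustration.

```python
from collections import Counter

import matplotlib.pyplot as plt
from matplotlib_venn import venn3
from nltk.tokenize import word_tokenize

def top_words(text, n=50):
    """Return the n most frequent alphabetic words in a text (hypothetical helper)."""
    counts = Counter(w for w in word_tokenize(text.lower()) if w.isalpha())
    return {w for w, _ in counts.most_common(n)}

venn3(
    [top_words(clean_text), top_words(extractive_summary), top_words(abstractive_summary)],
    set_labels=("Original article", "Extractive summary", "Abstractive summary"),
)
plt.title("Overlap among the top 50 words")
plt.show()
```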

Usage of common words

The stacked bar chart above shows the frequency counts of the 11 common words identified in the Venn diagram. The proportion of each word varies among the three sets of data.

Most common Bigrams

Bigrams are combinations of two consecutive words appearing in a text. In the bar charts below, we can see the top 12 bigrams in each category.
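A sketch of the n-gram counting with NLTK follows; the same helper also covers the trigrams discussed in the next section (n = 3).

```python
from collections import Counter

from nltk import ngrams
from nltk.tokenize import word_tokenize

def top_ngrams(text, n=2, k=12):
    """Return the k most common n-grams of alphabetic tokens in a text."""
    tokens = [w for w in word_tokenize(text.lower()) if w.isalpha()]
    return Counter(ngrams(tokens, n)).most_common(k)

print(top_ngrams(clean_text, n=2))   # top 12 bigrams
print(top_ngrams(clean_text, n=3))   # top 12 trigrams
```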

The word combinations (climate, change), (greenhouse, gas), and (global, warming) retain their positions in all three texts.

Most Common Trigrams

Trigrams are combinations of three consecutive words appearing in the text; they can be computed with the same sketch shown above, using n = 3.

The trigram (greenhouse, gas, emitted) was common to both the original article and the extractive summary, whereas the trigrams in the abstractive summary were unique to it.

Top 50 words

Now let's look at the top 50 common words appearing in all three texts.

Through the graph above, we can see that a few words remain common to all three texts. This similarity arises from the strong relationship between these words and the keyword/article.

Conclusion

When the two summarization techniques were compared, abstractive summarization gave a more natural and concise summary of the content. The extractive method merely reproduces sentences from the original article, so it may not capture the exact concept of the original.

The abstractive method produces sentences based on the model's understanding of the given data, making it the more viable solution. However, abstractive summarization requires a lot of data to train the model, and the selection of a proper training dataset is important when fine-tuning it.

References

  1. "Text Summarization in Python." Great Learning, https://www.mygreatlearning.com/blog/text-summarization-in-python/.
  2. "Abstractive Summarization Using Deep Learning." Section, https://www.section.io/engineering-education/abstractive-summarization-using-deep-learning/.
  3. "pegasus-xsum." Hugging Face, https://huggingface.co/google/pegasus-xsum.
  4. "Visualization with Python." Matplotlib, https://matplotlib.org/.
