Information Summarizer: Powered by Wikipedia
Abstract—Summarized data can help a user understand content at a quick glance and saves them time. In this project, we build an information summarizer that can find and shorten information. We use Wikipedia as the primary source of information; the content collected from this source is provided to the user in a summarized form.
Keywords—Wikipedia, NLP, BART, PEGASUS, word_tokenizer
Since its introduction, Wikipedia has proven to be one of the best sources of information. It contains elaborate articles on topics across many domains. Although this content is informative, it can take a long time to read through in full. Most users look up a keyword only to obtain concise information they can read in a matter of minutes; this is where text summarizers come into action. A text summarizer goes through a given piece of content and produces a brief passage that summarizes the whole. In this project, we aim to create a search engine that takes an input from the user, collects information on that input from Wikipedia, and returns a shortened version to the user.
The data for this analysis was collected from the Wikipedia database. We used the Wikipedia library to obtain content related to the requested topic. To collect the data, we first pass a specific keyword as input to the Wikipedia library, which then returns the article matching that keyword.
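The retrieval step could look like the short sketch below, assuming the wikipedia Python package is used (the helper name fetch_article is ours for illustration):

```python
import wikipedia

def fetch_article(keyword: str) -> str:
    """Return the plain-text content of the best-matching Wikipedia page."""
    # Search Wikipedia for the keyword and take the top-ranked result.
    results = wikipedia.search(keyword)
    if not results:
        raise ValueError(f"No Wikipedia results found for '{keyword}'")
    # Load the matching page and return its raw article text.
    page = wikipedia.page(results[0], auto_suggest=False)
    return page.content

article = fetch_article("Climate change")
```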
Here, the data generated by the Wikipedia library consists of in-depth content divided by subheadings, along with all the references and citations of the article. We need to remove these elements before moving on to the modeling stage.
First, we remove the citations and other unwanted characters from the text. For this, we use the re library to search for and remove these patterns.
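A minimal cleaning sketch with the re library is shown below; the exact patterns are illustrative and would be tuned to the article being processed:

```python
import re

def clean_article(text: str) -> str:
    """Remove citation markers, section headers, and extra whitespace."""
    # Strip bracketed citation markers such as [1] or [citation needed].
    text = re.sub(r"\[[^\]]*\]", " ", text)
    # Remove the "== Heading ==" markers kept in the raw page content.
    text = re.sub(r"={2,}[^=]+={2,}", " ", text)
    # Collapse repeated whitespace and newlines into single spaces.
    return re.sub(r"\s+", " ", text).strip()

clean_text = clean_article(article)
```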
Extractive summarization is a summarizing technique based on word frequency. To perform extractive summarization, we first need to calculate a ranking for each individual word in the article.
The word ranking is calculated from the frequency with which each individual word is repeated in that particular article. Here, we use the tokenized data to calculate the word frequencies.
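A possible implementation of this step, assuming NLTK's word_tokenize and its English stop-word list:

```python
from collections import Counter
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download("punkt", quiet=True)
nltk.download("stopwords", quiet=True)

def word_rankings(text: str) -> Counter:
    """Rank each word by how often it is repeated in the article."""
    stop_words = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop_words]
    # The raw repetition count of a word is used as its rank.
    return Counter(words)

word_rank = word_rankings(clean_text)
```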
Once the list of words and their respective frequencies is calculated, we start scoring the individual sentences. To calculate the sentence scores, we first split the article into individual sentences. Then, for each sentence, we add up the ranks (calculated previously) of the individual words contained in that sentence. The resulting value is the score for that sentence. This process is repeated for every sentence, and the values are stored for further processing. Once the sentence scores are calculated, we begin creating the text summary.
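Sentence scoring could then be sketched as follows, using NLTK's sent_tokenize to split the article:

```python
from nltk.tokenize import sent_tokenize, word_tokenize

def sentence_scores(text: str, word_rank) -> dict:
    """Score each sentence by summing the ranks of the words it contains."""
    scores = {}
    for sentence in sent_tokenize(text):
        scores[sentence] = sum(word_rank.get(w.lower(), 0)
                               for w in word_tokenize(sentence))
    return scores

scores = sentence_scores(clean_text, word_rank)
```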
Extractive summarization works on the basis of the relation between the article and its most repeated words: it assumes that the words repeated most often are of higher importance than the less frequent ones. Therefore, the sentences containing the largest number of frequently repeated words have higher importance. Having calculated the score of each individual sentence, we now sort the sentences in decreasing order of their scores. The generated summary is the string containing the top sentences in this list.
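Selecting the top-ranked sentences might then look like the following; the summary length of 10 sentences is an arbitrary choice for illustration:

```python
import heapq

def extractive_summary(scores: dict, n_sentences: int = 10) -> str:
    """Join the n highest-scoring sentences into a summary string."""
    top = heapq.nlargest(n_sentences, scores, key=scores.get)
    return " ".join(top)

extractive_text = extractive_summary(scores)
```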
The sentence with the highest score in our sample article is given below:
“Global warming usually refers to human-induced warming of the Earth system, whereas climate change can refer to natural or anthropogenic change.”
In the scatterplot above, we can see how much each individual word contributes towards the final score. The score of this particular sentence is the highest in the article.
The sentence with the lowest score in our sample article is given below:
“Current inequalities between men and women, between rich and poor, and between different ethnicities have been observed to worsen as a consequence of climate variability and climate change”
In the scatterplot above, we can see how much each individual word contributes towards the final score. The score of this particular sentence is 300.82.
In abstractive summarization, the model tries to generate the summary based on its understanding of the article. Here, the generated summary might contain new or paraphrased sentences compared to the original article.
In this project, we combine two abstractive summarization models to generate the output. The models are as follows:
- Pegasus model
- BART model
The Pegasus model is a pretrained NLP model suitable for various language processing tasks. The model is built on a seq2seq architecture and trained on numerous datasets ranging from web articles to news articles. In this project, we make use of the pretrained model to generate the output.
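Generating a summary with the pretrained google/pegasus-xsum checkpoint (listed in the references) could be sketched with the Hugging Face transformers library as follows; the generation settings shown are illustrative:

```python
from transformers import PegasusTokenizer, PegasusForConditionalGeneration

model_name = "google/pegasus-xsum"
tokenizer = PegasusTokenizer.from_pretrained(model_name)
model = PegasusForConditionalGeneration.from_pretrained(model_name)

def pegasus_summary(text: str) -> str:
    """Generate an abstractive summary with the pretrained PEGASUS model."""
    # Truncate long articles to the model's maximum input length.
    inputs = tokenizer(text, truncation=True, max_length=512, return_tensors="pt")
    summary_ids = model.generate(**inputs, num_beams=4, max_length=128)
    return tokenizer.decode(summary_ids[0], skip_special_tokens=True)

pegasus_text = pegasus_summary(clean_text)
```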
BART, or Bidirectional and Auto-Regressive Transformers, is a very strong NLP model. It is a transformer that combines a bidirectional encoder and an autoregressive decoder into one seq2seq model.
In this project, we fine-tune the BART model for text summarization. For this, we make use of the "xsum" dataset obtained from the Hugging Face database.
The XSUM dataset is a news summarization dataset; it consists of 11,332 summaries and their respective articles. We use this dataset for fine-tuning the BART model.
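A condensed sketch of the fine-tuning setup, assuming the datasets and transformers libraries and the facebook/bart-base checkpoint (the hyperparameters shown are illustrative, not the exact values used):

```python
from datasets import load_dataset
from transformers import (BartTokenizer, BartForConditionalGeneration,
                          DataCollatorForSeq2Seq, Seq2SeqTrainingArguments,
                          Seq2SeqTrainer)

dataset = load_dataset("xsum")
tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

def preprocess(batch):
    # Tokenize the article ("document") and its reference "summary".
    model_inputs = tokenizer(batch["document"], max_length=1024, truncation=True)
    labels = tokenizer(text_target=batch["summary"], max_length=128, truncation=True)
    model_inputs["labels"] = labels["input_ids"]
    return model_inputs

tokenized = dataset.map(preprocess, batched=True,
                        remove_columns=dataset["train"].column_names)

training_args = Seq2SeqTrainingArguments(
    output_dir="bart-xsum",
    per_device_train_batch_size=4,
    num_train_epochs=1,
    learning_rate=2e-5,
)
trainer = Seq2SeqTrainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    data_collator=DataCollatorForSeq2Seq(tokenizer, model=model),
)
trainer.train()
```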
We will now compare the output of the extractive summarization with the combined output of the abstractive summarization.
In the pie chart, we can see that the summarization techniques have reduced the large original text to less than one third of its actual size. The summary obtained through the pretrained model is the smallest in size compared to the other two summaries.
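The size comparison behind the pie chart could be reproduced roughly as below; the text variables are hypothetical names for the outputs of the earlier steps:

```python
import matplotlib.pyplot as plt

# Hypothetical variables holding the original article and the generated summaries.
texts = {
    "Original article": clean_text,
    "Extractive summary": extractive_text,
    "Abstractive summary": abstractive_text,
}
sizes = [len(t.split()) for t in texts.values()]
plt.pie(sizes, labels=list(texts.keys()), autopct="%1.1f%%")
plt.title("Relative sizes of the original article and its summaries")
plt.show()
```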
In the word clouds below, we can observe that the most frequent words remain the same in both summaries. While words such as climate change, climate, and human are common to all three word clouds, words such as African and mitigated are rare or unique to a single cloud.
Fig. Word cloud representing the extractive summary
Fig. Word cloud representing the abstractive summary
Fig. Word cloud representing the original article
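Word clouds such as the ones above could be produced with the wordcloud package; the snippet below is an illustrative sketch assuming NLTK's stop-word list for filtering:

```python
from wordcloud import WordCloud
import matplotlib.pyplot as plt
from nltk.corpus import stopwords

def show_wordcloud(text: str, title: str) -> None:
    """Render a word cloud of the most frequent non-stopword words."""
    wc = WordCloud(width=800, height=400, background_color="white",
                   stopwords=set(stopwords.words("english")))
    plt.imshow(wc.generate(text), interpolation="bilinear")
    plt.title(title)
    plt.axis("off")
    plt.show()

show_wordcloud(extractive_text, "Extractive summary")
```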
Here we consider the top 50 words in each category, ranked by how frequently they appear. The Venn diagram below depicts the number of words common to each category.
In the Venn diagram below, there are 11 words common to all three sets of data, and there are no words unique to the extractive and abstractive summaries.
In the stacked bar chart above, we show the frequency counts of the 11 common words seen in the Venn diagram. The proportion of these words varies among the three sets of data.
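The common-word counts behind the Venn diagram can be derived from simple set operations over the top-50 word lists, for example:

```python
from collections import Counter
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def top_words(text: str, n: int = 50) -> set:
    """Return the set of the n most frequent non-stopword words in a text."""
    stop_words = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop_words]
    return {word for word, _ in Counter(words).most_common(n)}

# Hypothetical text variables for the original article and the two summaries.
common = top_words(clean_text) & top_words(extractive_text) & top_words(abstractive_text)
print(len(common), sorted(common))
```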
Bigrams are combinations of two words that appear together in the text. In the bar charts below, we can see the top 12 bigrams in each category.
The word combinations (climate, change), (greenhouse, gas), and (global, warming) retained their positions in all three sets of data.
Trigrams are combinations of three words appearing together in the text.
The trigram (greenhouse, gas, emitted) was common to both the original article and the extractive summary, whereas the trigrams in the abstractive summary remained unique to it.
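The bigram and trigram rankings discussed above could be computed with NLTK's ngrams helper, roughly as follows:

```python
from collections import Counter
from nltk import ngrams
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords

def top_ngrams(text: str, n: int, k: int = 12):
    """Return the k most frequent n-grams of non-stopword tokens."""
    stop_words = set(stopwords.words("english"))
    words = [w.lower() for w in word_tokenize(text)
             if w.isalpha() and w.lower() not in stop_words]
    return Counter(ngrams(words, n)).most_common(k)

print(top_ngrams(clean_text, 2))   # top bigrams, e.g. ('climate', 'change')
print(top_ngrams(clean_text, 3))   # top trigrams
```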
Now let us look at the top 50 common words appearing in all three sets of data.
From the graph above, we can see that a few words remain common across all three texts. This similarity is due to the strong relationship between these words and the keyword/article.
When both summarization techniques were compared, abstractive summarization gave a more natural and concise summary of the content. The extractive summarization method reproduced sentences from the original article, so it may not represent the exact concept of the original article.
The abstractive method produces sentences based on the model's understanding of the given data, making it the most viable solution. However, abstractive summarization requires a lot of data for training the model, and the selection of a proper training dataset is important when fine-tuning the model.
- “Great Learning Team.” Text Summarization in Python, https://www.mygreatlearning.com/blog/text-summarization-in-python/.
- “Abstractive Summarization Using Deep Learning.” Section, https://www.section.io/engineering-education/abstractive-summarization-using-deep-learning/.
- “Pegasus-xsum.” Hugging Face, https://huggingface.co/google/pegasus-xsum.
- “Visualization with Python.” Matplotlib, https://matplotlib.org/