Using OCR Text Mining and NMF Topic Modeling to Analyze Historical Newspapers
Topic modeling is a general method that is useful for analyzing newspaper text. It uses machine learning algorithms to identify the underlying topics, themes, or patterns present in a collection of newspaper articles. By clustering similar articles together based on their topics, researchers can better understand the issues, debates, and trends that shaped the news coverage.
Non-negative Matrix Factorization (NMF) is a common topic modeling method. It is a matrix factorization technique that decomposes a matrix of word frequencies into two matrices: one representing a set of topics and the other the weight of each topic in each document. In more detail, NMF factorizes a matrix into two non-negative matrices, often referred to as the basis and mixture matrices. In the context of topic modeling, the document-term matrix is factorized into a document-topic matrix (the mixture matrix) and a topic-term matrix (the basis matrix). The topic-term matrix represents the distribution of words in each topic: each row holds the weights of the terms for one topic, and the highest-weighted terms characterize that topic. The document-topic matrix represents the distribution of topics in each document: each row holds the weights of the topics for one document, which can be normalized and read as the likelihood that the document belongs to each topic, and so identifies the most probable topics for that document. The goal of the factorization is to minimize the reconstruction error between the original document-term matrix and the product of the two matrices.
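To make the factorization concrete, the following is a minimal sketch of NMF on a toy document-term matrix, using the classic multiplicative update rules for the Frobenius reconstruction error. The function name nmf_multiplicative, the toy matrix, and the iteration count are illustrative assumptions and are not taken from the study; a library implementation would normally be used in practice.

```python
import numpy as np


def nmf_multiplicative(V, n_topics, n_iter=200, seed=0, eps=1e-10):
    """Factorize V (documents x terms) into W (documents x topics) and
    H (topics x terms), both non-negative, so that V is close to W @ H."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = V.shape
    # Random non-negative starting point (an assumption of this sketch).
    W = rng.random((n_docs, n_topics)) + eps
    H = rng.random((n_topics, n_terms)) + eps
    errors = []
    for _ in range(n_iter):
        # Multiplicative updates keep every entry non-negative.
        H *= (W.T @ V) / (W.T @ W @ H + eps)
        W *= (V @ H.T) / (W @ H @ H.T + eps)
        # Reconstruction error: Frobenius norm of V - W @ H.
        errors.append(np.linalg.norm(V - W @ H))
    return W, H, errors


# Toy document-term matrix: 4 documents, 4 terms, two obvious groups.
V = np.array([
    [3, 2, 0, 0],
    [2, 3, 0, 0],
    [0, 0, 4, 1],
    [0, 0, 1, 4],
], dtype=float)

W, H, errors = nmf_multiplicative(V, n_topics=2)
print("document-topic matrix shape:", W.shape)
print("topic-term matrix shape:", H.shape)
print("error after first and last iteration:", errors[0], errors[-1])
```

The document-topic matrix here plays the role of the mixture matrix and the topic-term matrix the role of the basis matrix; the error printed at the end is the quantity the factorization tries to minimize.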
In the preprocessing step, we first clean the text data to reduce the number of terms and improve the quality of the features by removing irrelevant or noisy data. This step also includes stemming or lemmatization to reduce words to their root form. We remove stop words, non-alphabetic characters, and short words, as well as terms such as 'mr', 'mrs', 'sald', 'th', and 'amherstburg', which do not add any value to the results.
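The cleaning described above can be sketched roughly as follows. This is only an illustration: the general stop-word set is a tiny hypothetical subset, the minimum word length of 3 is an assumed threshold, the function name clean_text is made up for the example, and stemming or lemmatization is left out because the text does not say which tool was used. Only the five custom terms come from the description above.

```python
import re

# Illustrative subset of general stop words (assumption, not the study's list).
GENERAL_STOP_WORDS = {"the", "and", "of", "to", "in", "a", "was", "at", "on", "for"}
# Corpus-specific terms named in the text as adding no value.
CUSTOM_STOP_WORDS = {"mr", "mrs", "sald", "th", "amherstburg"}
STOP_WORDS = GENERAL_STOP_WORDS | CUSTOM_STOP_WORDS

MIN_LENGTH = 3  # assumed threshold for "short words"


def clean_text(raw):
    """Lowercase, drop non-alphabetic characters, then drop short words
    and stop words. Returns the remaining tokens in order."""
    text = raw.lower()
    # Replace anything that is not a letter or whitespace with a space.
    text = re.sub(r"[^a-z\s]", " ", text)
    tokens = text.split()
    return [t for t in tokens if len(t) >= MIN_LENGTH and t not in STOP_WORDS]


sample = "Mr. Smith sald th council of Amherstburg met on Sunday, 14th June 1925!"
print(clean_text(sample))
# ['smith', 'council', 'met', 'sunday', 'june']
```

OCR errors such as 'sald' are handled here simply by listing them as custom stop words, which mirrors the description above rather than attempting to correct them.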
The preprocessed text is then converted into a numerical representation, the document-term matrix (DTM). The NMF model is trained on this vectorized text, with the number of topics specified in advance by experts. We set the vectorizer to consider only the 10,000 most frequent terms in the corpus, create an NMF model with 5 topics, and fit it to the document-term matrix.
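A minimal sketch of this step, assuming scikit-learn is the library in use (the text does not name it, although the parameter names reported with the 1920 to 1930 results match scikit-learn's NMF). The toy corpus is invented for illustration, and CountVectorizer is an assumption; a TF-IDF vectorizer would be a common alternative.

```python
from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical, already-preprocessed documents standing in for newspaper issues.
docs = [
    "church sunday school held home county",
    "farm acres sale house street apply",
    "council mayor reeve motion town road",
    "team game league score club pins",
    "elected reeve councillors acclamation january december",
    "church sunday home school miss essex",
]

# Keep only the 10,000 most frequent terms, as described in the text.
vectorizer = CountVectorizer(max_features=10000)
dtm = vectorizer.fit_transform(docs)  # document-term matrix (sparse)

# Five topics, with the solver settings reported for the 1920-1930 run.
# random_state is an added assumption to make the sketch repeatable.
nmf = NMF(n_components=5, init="nndsvda", solver="mu",
          max_iter=10000, random_state=0)
doc_topic = nmf.fit_transform(dtm)   # document-topic matrix
topic_term = nmf.components_         # topic-term matrix

print("DTM shape:", dtm.shape)
print("document-topic shape:", doc_topic.shape)
print("topic-term shape:", topic_term.shape)
```

On a real corpus the document list would hold one cleaned text per newspaper issue, and the number of topics would be chosen by the domain experts as described above.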
Once the model is trained, the document-term matrix is factorized into two non-negative matrices: a document-topic matrix and a topic-term matrix. The model is then used to infer the topic distribution for each document in the corpus: it relates each document's term vector to the topic-term matrix and produces a weight for each topic, which serves as the probability that the document belongs to that topic. In summary, the document-term matrix represents the frequency of words in each document, the document-topic matrix represents the distribution of topics in each document, and the topic-term matrix represents the distribution of words in each topic.
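One caveat is that raw NMF weights are non-negative but do not sum to one, so they are not probabilities by construction. A simple way to read them as a topic distribution is to normalize each document's row, as in the sketch below. The function name to_topic_shares and the example matrix are hypothetical, and row normalization is one possible convention rather than necessarily the one used in this study.

```python
import numpy as np


def to_topic_shares(doc_topic):
    """Normalize each row of a document-topic matrix so it sums to 1.
    Rows that are all zero are left as zeros."""
    totals = doc_topic.sum(axis=1, keepdims=True)
    # Avoid division by zero for documents with no topic weight at all.
    safe_totals = np.where(totals == 0, 1.0, totals)
    return doc_topic / safe_totals


# Hypothetical document-topic weights: 3 documents, 3 topics.
doc_topic = np.array([
    [2.0, 1.0, 1.0],
    [0.0, 0.0, 0.0],
    [0.0, 3.0, 1.0],
])

shares = to_topic_shares(doc_topic)
print(shares)
# Index of the highest-weighted topic per document.
print(shares.argmax(axis=1))
```

Note that for an all-zero row argmax still returns topic index 0, so such documents should be checked separately before a dominant topic is assigned.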
The results obtained with NMF may vary depending on the parameters used in the algorithm, such as the regularization parameter. However, the general approach remains the same: factorize a document-term matrix into two non-negative matrices and interpret the resulting topics based on the words with the highest weights in the topic-term matrix.
Finally, the topics generated by the model are interpreted by examining the top words in each topic, that is, the words with the highest weights for that topic in the topic-term matrix. We obtain the top 10 words for each topic and retrieve the top 10 documents with the highest weight for that topic. To get an idea of the types of documents that fall under each topic, experts analyze each list of words and assign a topic label to it based on the logical and semantic relationships among the words. After going through the related documents (newspaper issues) in each topic category and searching the historical literature, we found a close relationship between our findings and the trends reported in the literature.
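Extracting the top words and top documents amounts to sorting rows of the topic-term matrix and columns of the document-topic matrix. The sketch below shows one way to do this with NumPy; the helper names top_terms and top_documents, the small matrices, and the file names are all hypothetical and chosen only so the example can be checked by hand.

```python
import numpy as np


def top_terms(topic_term, feature_names, n=10):
    """Return, for each topic, the n terms with the highest weights."""
    result = []
    for weights in topic_term:
        order = np.argsort(weights)[::-1][:n]  # indices, highest weight first
        result.append([feature_names[i] for i in order])
    return result


def top_documents(doc_topic, filenames, topic, n=10):
    """Return the n documents with the highest weight for one topic."""
    order = np.argsort(doc_topic[:, topic])[::-1][:n]
    return [filenames[i] for i in order]


# Hypothetical vocabulary and topic-term weights (2 topics, 4 terms).
feature_names = ["church", "sale", "sunday", "farm"]
topic_term = np.array([
    [0.9, 0.0, 0.7, 0.1],
    [0.0, 0.8, 0.1, 0.6],
])

# Hypothetical file names and document-topic weights (3 documents, 2 topics).
filenames = ["a.txt", "b.txt", "c.txt"]
doc_topic = np.array([
    [0.5, 0.1],
    [0.2, 0.9],
    [0.8, 0.3],
])

print(top_terms(topic_term, feature_names, n=2))
# [['church', 'sunday'], ['sale', 'farm']]
print(top_documents(doc_topic, filenames, topic=0, n=2))
# ['c.txt', 'a.txt']
```

The word lists produced this way are what the experts then label by hand; the code only ranks terms and documents and does not assign topic names.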
Here are the results of extracting topics from the years 1920 to 1930. The NMF parameters were set as follows: init='nndsvda', solver='mu', max_iter=10000.
Based on the list of words in each category, possible topics are suggested below. These are only possible interpretations based on the words provided.
• Top 10 words for topic 1: ['home', 'sunday', 'held', 'county', 'school', 'miss', 'phone', 'essex', 'years', 'church']
This topic could be centered around the idea of local events and activities that bring the community together.
• Top 5 documents for topic 1: ['1925-06-19.txt', '1925-09-25.txt', '1926-02-12.txt', '1926-09-24.txt', '1927-07-08.txt']
The newspapers may have been reporting on important events planned for the coming Sunday. According to the literature, the period from late 1925 to early 1927 in Canada was a time of political, economic, and social change, with significant events taking place in a variety of areas. For example, Canada held a federal election in October 1925, and the political activity surrounding it lasted into 1926.
• Top 10 words for topic 2: ['bank', 'farm', 'sullivan', 'street', 'house', 'acres', 'sale', 'phone', 'good', 'apply']
This topic could include articles related to the buying and selling of property, as suggested by "house", "acres", and "sale". The words "street" and "phone" could also point to real estate agents or property listings. The word "apply" may refer to the application instructions or requirements given in listings for buying and selling property.
• Top 5 documents for topic 2: ['1920-04-23.txt', '1920-08-13.txt', '1920-08-20.txt', '1920-08-27.txt', '1922-04-28.txt']
Looking at the literature, the early 20th century was a period of rapid growth and development in Canada, particularly in urban areas, and the real estate market was very active during this time.
• Top 10 words for topic 3: ['motion', 'mayor', 'windsor', 'sandwich', 'essex', 'road', 'town', 'reeve', 'county', 'council']
This topic could explore the role of mayors, reeves, and town councils in governing Sandwich and Essex County over time, and how this governance has impacted the development of local infrastructure such as roads and transportation.
• Top 5 documents for topic 3: ['1928-07-27.txt', '1929-09-13.txt', '1930-05-23.txt', '1930-05-30.txt', '1930-06-27.txt']
The literature describes a diverse range of political, economic, and cultural developments in Canada during the late 1920s and early 1930s. In October 1929, the Wall Street Crash in the United States marked the beginning of the Great Depression in North America. The economic downturn had a significant impact on Canada, leading to high unemployment and hardship throughout the 1930s. In 1930, the Canadian government raised tariffs on imported goods with the aim of protecting Canadian industries. Also, the Bank of Montreal building in Montreal, Quebec, officially opened in July 1930.
• Top 10 words for topic 4: ['evening', 'teams', 'club', 'pins', 'league', 'score', 'games', 'high', 'game', 'team']
This topic appears to cover local recreational sports leagues and clubs. The word "pins", together with "league", "teams", "high", and "score", points most plausibly to bowling leagues and their weekly results.
• Top 5 documents for topic 4: ['1928-03-02.txt', '1928-03-16.txt', '1928-04-20.txt', '1928-11-09.txt', '1928-11-16.txt']
Looking at the literature, 1928 was an Olympic year in which Canadian athletes competed at both the Winter Games in St. Moritz and the Summer Games in Amsterdam. Canada had first taken part in the modern Olympic Games in 1900.
• Top 10 words for topic 5: ['elected', 'years', 'councillors', 'january', 'deputy', 'year', 'acclamation', 'reeve', 'december', 'christmas']
This topic could examine the history of an important political event that was related to the beginning of the winter season.
• Top 5 documents for topic 5: ['1924-12-26.txt', '1925-12-25.txt', '1927-12-23.txt', '1927-12-30.txt', '1928-12-21.txt']
Looking at the literature, this topic could refer to January elections in Canadian municipal politics, including the reasons why some municipalities held elections in January, the impact of winter weather on voter turnout, and the challenges and opportunities of campaigning during the holiday season. Moreover, in December 1924, Boxing Day was first observed as a statutory holiday in Canada. Mackenzie King, the leader of the Liberal Party, was Prime Minister for most of this decade; he served three separate terms, 1921-1926, 1926-1930, and 1935-1948.
All in all, these interpretations rest on a literature review that cannot be comprehensive. Nevertheless, the period from 1920 to 1930 appears to show economic and social growth in Canada following a depression, supported by good political management.