This code accompanies the article "Between News and History: Identifying Networked Topics of Collective Attention on Wikipedia".
Additional data (too large for GitHub) is available through the project Dropbox. Downloading and processing all of the data takes some time, but the data readme indicates where each file is used and what is required to skip some of the processing stages.
See demo_community_detection.ipynb for an example that extracts the Wikipedia articles related to a given news event.
Project abstract
The digital information landscape has introduced a new dimension to understanding how we collectively react to new information and preserve it at the societal level. This, together with the emergence of platforms such as Wikipedia, has challenged traditional views on the relationship between current events and historical accounts of events, with an ever-shrinking divide between "news" and "history". Wikipedia's place as the Internet's primary reference work thus poses the question of how it represents both traditional encyclopaedic knowledge and evolving important news stories. In other words, how is information on and attention towards current events integrated into the existing topical structures of Wikipedia? To address this, we develop a temporal community detection approach to topic detection that takes into account both the short-term dynamics of attention and long-term article network structures. We apply this method to a dataset of one year of current events on Wikipedia to identify clusters distinct from those that would be found solely from page view time series correlations or static network structure. By comparing the relative importance of collective attention dynamics and link structures, we are able to resolve which topics more strongly reflect unfolding current events and which reflect more established knowledge. We also offer important developments by identifying and describing the emergent topics on Wikipedia. This work provides a means of distinguishing how these information and attention clusters are related to Wikipedia’s twin faces of encyclopaedic knowledge and current events, which is crucial to understanding the production and consumption of knowledge in the digital age.
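To make the idea of combining attention dynamics with link structure concrete, here is a minimal, illustrative sketch. It is not the method used in the article or in this repository. It assumes toy page view series and a toy hyperlink list, weights each link by the Pearson correlation of its endpoints' page views, and then groups articles by connected components over the strongly correlated links, a deliberately crude stand-in for temporal community detection. All names (`views`, `links`, `threshold_components`, the 0.5 threshold) are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length series (0.0 if either is flat)."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def attention_weighted_edges(links, views):
    """Weight each hyperlink by the page view correlation of its endpoints."""
    return {(a, b): max(pearson(views[a], views[b]), 0.0) for a, b in links}


def threshold_components(weighted, threshold):
    """Keep links with weight >= threshold and return connected components."""
    adjacency = {}
    for (a, b), weight in weighted.items():
        adjacency.setdefault(a, set())
        adjacency.setdefault(b, set())
        if weight >= threshold:
            adjacency[a].add(b)
            adjacency[b].add(a)
    seen, components = set(), []
    for start in sorted(adjacency):
        if start in seen:
            continue
        stack, component = [start], set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            component.add(node)
            stack.extend(adjacency[node] - seen)
        components.append(component)
    return components


# Toy data (hypothetical): daily page views and hyperlinks between articles.
views = {
    "Event": [1, 1, 10, 8, 2],     # spike around a news event
    "Person": [2, 2, 12, 9, 3],    # spikes together with "Event"
    "Country": [5, 6, 5, 6, 5],    # steady background attention
    "History": [4, 5, 4, 5, 4],    # moves with "Country", not with the event
}
links = [("Event", "Person"), ("Event", "Country"), ("Country", "History")]

weighted = attention_weighted_edges(links, views)
result = threshold_components(weighted, threshold=0.5)
print(result)
```

In this toy case the link between "Event" and "Country" exists in the static network but carries almost no shared attention, so the two articles end up in different groups. Using link structure alone would have merged all four articles into one cluster.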