Data Mining Historical Newspaper Metadata (Europeana Newspaper Project)

jpmoreux/EN-data_mining

EN-data_mining

Synopsis

Newspapers from European digital library collections are part of the data set processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement, performed by CCS, describes the structure of each issue and its articles (spatial extent, title and subtitle, classification of content types) using the METS/ALTO formats.

A set of bibliographical and descriptive metadata about the content (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. Shell and XSLT scripts run with Xalan-Java extract some of these metadata from the METS manifest or the OCR files.
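As an illustration of what deriving page-level counts from an OCR file can look like, here is a minimal Python sketch. It is not the repository's XSLT or Perl code. It assumes that words are ALTO `String` elements, text blocks are `TextBlock` elements and illustrations are `Illustration` elements, and it ignores the namespace so that it works across ALTO versions. The function name `alto_page_stats` and the sample document are made up for the example.

```python
import xml.etree.ElementTree as ET
from collections import Counter


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts before tag names."""
    return tag.rsplit("}", 1)[-1]


def alto_page_stats(alto_xml):
    """Count a few layout elements in one ALTO page given as an XML string."""
    root = ET.fromstring(alto_xml)
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {
        "words": counts["String"],
        "text_blocks": counts["TextBlock"],
        "illustrations": counts["Illustration"],
    }


# Tiny hand-written ALTO fragment, for illustration only.
SAMPLE = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout><Page ID="P1"><PrintSpace>
    <TextBlock ID="B1"><TextLine>
      <String CONTENT="Le"/><SP/><String CONTENT="Matin"/>
    </TextLine></TextBlock>
    <Illustration ID="I1"/>
  </PrintSpace></Page></Layout>
</alto>"""

print(alto_page_stats(SAMPLE))
# {'words': 2, 'text_blocks': 1, 'illustrations': 1}
```

Summing these per-page dictionaries over all the pages of an issue would give issue-level figures comparable to the ones described above.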

Detailed presentation:

Installation

You can use either the XSLT sheets (run from DOS scripts) or the Perl script (faster).

Sample documents are stored in the "DOCS" folder. The metadata are generated in a "STATS" folder.

XSLT

Two DOS shell scripts:

  • batch-EN.bat
  • xslt.cmd

Two XSLT sheets:

  • analyseAltosCCS.xsl
  • calculeStatsMETS_CSV.xsl

The XSLT sheets are run with Xalan-Java. The path to the Java bin directory must be set in xslt.cmd.
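For readers who want to call Xalan-Java outside the DOS scripts, the sketch below builds a command line for Xalan's standard command-line class, `org.apache.xalan.xslt.Process`, with its `-IN`, `-XSL` and `-OUT` options. It is written in Python for illustration and does not reproduce the contents of xslt.cmd. The helper name `xalan_command` and the input and output file names are hypothetical, and the Xalan jars are assumed to be on the Java classpath already.

```python
def xalan_command(java, xml_in, xsl, out):
    """Build the argument list for one Xalan-Java transformation.

    java   : path to the java executable (the value xslt.cmd needs too)
    xml_in : XML file to transform
    xsl    : XSLT sheet to apply
    out    : file that receives the result
    """
    return [
        str(java),
        "org.apache.xalan.xslt.Process",  # Xalan-Java command-line entry point
        "-IN", str(xml_in),
        "-XSL", str(xsl),
        "-OUT", str(out),
    ]


# Hypothetical file names, for illustration only.
cmd = xalan_command("java", "DOCS/sample/manifest.xml",
                    "calculeStatsMETS_CSV.xsl", "STATS/sample.csv")
print(" ".join(cmd))
```

The returned list can be passed to `subprocess.run(cmd, check=True)` to launch the transformation.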

For each document, its metadata are stored in the STATS folder in two formats:

  • XML (raw metadata, with detailed values for each page)
  • CSV (metadata at the issue level)

An aggregated file (metadata.csv) contains all the CSV metadata.
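The aggregation step can be pictured with the following Python sketch, which concatenates per-document CSV files into a single file. It is not the batch script's implementation. It assumes that every per-document CSV starts with the same header row and uses commas as separators, neither of which is confirmed here. The function name `aggregate_csv` is made up for the example.

```python
import csv
from pathlib import Path


def aggregate_csv(stats_dir, out_name="metadata.csv"):
    """Merge every *.csv in stats_dir into out_name; return the data row count.

    Assumption: each input file begins with an identical header row,
    which is written once at the top of the aggregated file.
    """
    stats_dir = Path(stats_dir)
    out_path = stats_dir / out_name
    # List the inputs first and skip the aggregate itself on re-runs.
    sources = sorted(p for p in stats_dir.glob("*.csv") if p.name != out_name)
    header_written = False
    rows = 0
    with out_path.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        for src in sources:
            with src.open(newline="", encoding="utf-8") as f:
                reader = csv.reader(f)
                header = next(reader, None)
                if header is None:  # empty file, nothing to copy
                    continue
                if not header_written:
                    writer.writerow(header)
                    header_written = True
                for row in reader:
                    writer.writerow(row)
                    rows += 1
    return rows
```

Calling `aggregate_csv("STATS")` would write `STATS/metadata.csv` and return the number of data rows copied.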

Test
  1. Open a DOS terminal.
  2. Change dir to the batch folder
  3. Run batch-EN.bat

Perl

Faster and richer (more metadata) than the XSLT scripts.

One Perl script: extractMD.pl

For each document, its metadata are stored in the STATS folder in your preferred formats: XML, JSON, CSV, txt

Test
  1. Open a shell terminal.
  2. Change dir to the batch folder
  3. Run perl extractMD.pl DOCS xml json

Charts

See on GitHub and here.

(Made with Highcharts)

Datasets

The complete set of derived data contains about 4,500,000 atomic metadata values from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) in the Gallica (www.gallica.fr) press collections:

  • Le Matin
  • Le Gaulois
  • Le Petit journal illustré
  • Le Journal des débats politiques et littéraires
  • Le Petit Parisien
  • Ouest-Eclair

See Datasets

License

CC0

This work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380)
