GitHub - jpmoreux/EN-data_mining: Data Mining Historical Newspaper Metadata (Europeana Newspaper Project)

EN-data_mining

Data Mining Historical Newspaper Metadata (Europeana Newspaper Project)

Synopsis

Newspapers from European digital library collections are part of the data set processed with OLR (Optical Layout Recognition) by the Europeana Newspapers project (www.europeana-newspapers.eu). The OLR refinement (performed by CCS) describes the structure of each issue and its articles (spatial extent, title and subtitle, classification of content types) using the METS/ALTO formats.

A set of bibliographical and descriptive metadata relating to content (date of publication, number of pages, articles, words, illustrations, etc.) is derived from each digital document. Shell and XSLT scripts called with Xalan-Java extract some of these metadata from the METS manifest or the OCR files.
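To illustrate the kind of counting involved, here is a minimal Python sketch (not the repository's XSLT or Perl code) that counts text blocks, words and illustrations in one ALTO page. It assumes the standard ALTO element names TextBlock, String and Illustration, and ignores the XML namespace because it differs between ALTO versions. The sample page and the function names are made up for this example.

```python
import xml.etree.ElementTree as ET
from collections import Counter

# Minimal, made-up ALTO-like page used only for this illustration.
SAMPLE_ALTO = """<alto xmlns="http://www.loc.gov/standards/alto/ns-v2#">
  <Layout>
    <Page ID="P1">
      <PrintSpace>
        <TextBlock ID="TB1">
          <TextLine>
            <String CONTENT="Le"/>
            <SP/>
            <String CONTENT="Matin"/>
          </TextLine>
        </TextBlock>
        <Illustration ID="IL1"/>
      </PrintSpace>
    </Page>
  </Layout>
</alto>"""


def local_name(tag):
    """Strip the '{namespace}' prefix that ElementTree puts on tags."""
    return tag.rsplit("}", 1)[-1]


def count_alto_elements(xml_text):
    """Count text blocks, words (String elements) and illustrations."""
    root = ET.fromstring(xml_text)
    counts = Counter(local_name(el.tag) for el in root.iter())
    return {
        "blocks": counts["TextBlock"],
        "words": counts["String"],
        "illustrations": counts["Illustration"],
    }


page_stats = count_alto_elements(SAMPLE_ALTO)
print(page_stats)
```

Matching on the local element name keeps the sketch independent of the ALTO schema version; a stricter implementation would bind the namespace explicitly.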

Detailed presentation:

Installation

You can use either the XSLT sheets (run from DOS scripts) or the Perl script (faster).

Sample documents are stored in the "DOCS" folder. The metadata are generated in a "STATS" folder.

XSLT

Two DOS shell scripts:

batch-EN.bat
xslt.cmd

Two XSLT sheets:

analyseAltosCCS.xsl
calculeStatsMETS_CSV.xsl

The XSLT sheets are run with Xalan-Java. The path to the Java bin folder must be set in xslt.cmd.

For each document, the metadata are stored in the STATS folder in two formats:

XML (raw metadata, with detailed values for each page)
CSV (metadata at the issue level)

An aggregated file (metadata.csv) contains all the CSV metadata.
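The aggregation step can be pictured with the following Python sketch. It is only an illustration, not the batch script's actual logic: it assumes that every per-issue CSV starts with the same header row and uses a comma delimiter, and the file names and column names in the demo are hypothetical.

```python
import csv
import tempfile
from pathlib import Path


def aggregate_csv(stats_dir, output_name="metadata.csv", delimiter=","):
    """Concatenate every per-issue CSV in stats_dir into one file.

    Assumption: each file starts with the same header row, which is
    written once at the top of the aggregated file.
    """
    stats_dir = Path(stats_dir)
    output_path = stats_dir / output_name
    # List the sources first and skip the aggregated file itself.
    sources = sorted(p for p in stats_dir.glob("*.csv") if p.name != output_name)
    header_written = False
    with output_path.open("w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out, delimiter=delimiter)
        for source in sources:
            with source.open(newline="", encoding="utf-8") as f:
                rows = list(csv.reader(f, delimiter=delimiter))
            if not rows:
                continue  # ignore empty files
            if not header_written:
                writer.writerow(rows[0])
                header_written = True
            writer.writerows(rows[1:])
    return output_path


# Demo on two made-up issue files in a temporary folder.
with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "issue1.csv").write_text(
        "date,pages,words\n1900-01-01,4,12000\n", encoding="utf-8")
    Path(tmp, "issue2.csv").write_text(
        "date,pages,words\n1900-01-02,6,18500\n", encoding="utf-8")
    merged = aggregate_csv(tmp).read_text(encoding="utf-8").splitlines()

print(merged)
```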

Test

Open a DOS terminal.
Change directory to the batch folder.
batch-EN.bat

Perl

Faster and richer (more metadata) than the XSLT scripts.

One Perl script: extractMD.pl

For each document, the metadata are stored in the STATS folder in your preferred formats: XML, JSON, CSV, txt.

Test

Open a shell terminal.
Change directory to the batch folder.
perl extractMD.pl DOCS xml json
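
The idea of selecting output formats on the command line can be sketched in Python as follows. This is not how extractMD.pl is implemented; the record fields, the function names and the restriction to JSON and CSV are assumptions made for the illustration.

```python
import csv
import io
import json

# Hypothetical issue-level record; the field names are illustrative only.
record = {"date": "1900-01-01", "pages": 4, "articles": 35, "words": 12000}


def to_json(rec):
    """Serialize the record as indented JSON text."""
    return json.dumps(rec, ensure_ascii=False, indent=2)


def to_csv(rec):
    """Serialize the record as a header line plus one data line."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=list(rec), lineterminator="\n")
    writer.writeheader()
    writer.writerow(rec)
    return buffer.getvalue()


# One serializer per supported format name.
EXPORTERS = {"json": to_json, "csv": to_csv}


def export(rec, formats):
    """Return {format: serialized text} for each requested format."""
    unknown = [f for f in formats if f not in EXPORTERS]
    if unknown:
        raise ValueError(f"unsupported format(s): {unknown}")
    return {f: EXPORTERS[f](rec) for f in formats}


outputs = export(record, ["json", "csv"])
print(outputs["csv"])
```

Keeping one function per format in a lookup table makes it easy to add a further format without touching the calling code.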

Charts

See on GitHub and here.

(Made with Highcharts)

Datasets

The complete set of derived data contains about 4,500,000 atomic metadata values from six national and regional French newspapers (1814-1945, 880,000 pages, 150,000 issues) in the Gallica (www.gallica.fr) press collections:

Le Matin
Le Gaulois
Le Petit journal illustré
Le Journal des débats politiques et littéraires
Le Petit Parisien
Ouest-Eclair

See Datasets

License

CC0

This work has been part-funded through the EU Competitiveness and Innovation Framework Programme grant Europeana Newspapers (Ref. 297380).
