interviewBubble edited this page Aug 7, 2018 · 6 revisions

The DBpedia Information Extraction Framework

The DBpedia community uses a flexible and extensible framework to extract different kinds of structured information from Wikipedia. The DBpedia extraction framework is written in Scala 2.8. The framework is available from the DBpedia GitHub repository (GNU GPL license). The change log may reveal more recent developments. More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki

When the above link is 100% complete, we will remove this section.

For historic reasons, the documentation of the old, superseded PHP-based DBpedia extraction framework is still available here.

Contents

  1. Getting started
  2. Overview
  3. Core Module
  4. Dump extraction Module
    1. Configuration
    2. Running the dump extraction
    3. Running Abstract Extraction
  5. Step by Step Guide
  6. Server Module
    1. Configuration
    2. Running the extraction server

1. Getting started

Before you can start developing, you need to take care of some prerequisites:

  • DBpedia Extraction Framework Get the most recent revision from the Github repository.
  • git clone git://github.com/dbpedia/extraction-framework.git (read only)
  • Java Development Kit The DBpedia extraction framework uses Java. Get the most recent JDK from http://java.com/.
  • The DBpedia extraction framework currently requires at least Java 7 (JDK v1.7.0) for full functionality.
  • You can compile and run it with an earlier JDK by deleting or blanking the following two files. (The launchers purge-download and purge-extract in the dump module won't work, but they are not vitally necessary.)
  • core/src/main/scala/org/dbpedia/extraction/util/RichPath.scala
  • dump/src/main/scala/org/dbpedia/extraction/dump/clean/Clean.scala
  • Maven is used for project management and build automation. Get it from: http://maven.apache.org/. Either Maven 2 or 3 should work.
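The prerequisites above can be verified from the command line. This is a minimal sketch that only checks whether each tool is on the PATH; it does not check version numbers (the requirements listed above are JDK 7+ and Maven 2 or 3):

```shell
#!/bin/sh
# Report whether each build prerequisite is installed.
# Prints one line per tool: "<tool>: found" or "<tool>: MISSING".
for tool in git java mvn; do
  if command -v "$tool" >/dev/null 2>&1; then
    echo "$tool: found"
  else
    echo "$tool: MISSING"
  fi
done
```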

This is enough to compile and run the DBpedia extraction framework. The required input files — the Wikimedia dumps — will be downloaded by the extraction code if configured to do so (see below).

If you'd like to use an IDE for coding, there are a number of options.

2. Overview

The DBpedia extraction framework is structured into different modules:

  • Core Module : Contains the core components of the framework.
  • Dump extraction Module : Contains the DBpedia dump extraction application.

The Live, Wiktionary, and Server modules are not necessary for running the extraction and can be disabled from the pom.xml for that use case.

3. Core Module

(Data flow diagram: http://www4.wiwiss.fu-berlin.de/dbpedia/wiki/DataFlow.png)

Components

  • Source : The Source package provides an abstraction over a source of MediaWiki pages.
  • WikiParser : The WikiParser package specifies a parser which transforms a MediaWiki page source into an Abstract Syntax Tree (AST).
  • Extractor : An Extractor is a mapping from a page node to a graph of statements about it.
  • Destination : The Destination package provides an abstraction over a destination of RDF statements.

In addition to the core components, a number of utility packages offer essential functionality used by the extraction code.

For details about a package follow the links.

You can find the complete scaladoc here.

4. Dump extraction Module

More recent configuration options can be found here: https://github.com/dbpedia/extraction-framework/wiki/Extraction-Instructions.

When the above link is 100% complete, we will remove this section.

The framework is undergoing a lot of refactoring, and the following sections are not 100% correct. For now, you should use the 'dump' branch for dump-based extraction.

$ git clone git://github.com/dbpedia/extraction-framework.git
$ cd extraction-framework
$ mvn clean install
$ cd dump
$ ../run download config=download.properties.file
$ ../run extraction extraction.properties.file

The extraction and download property files contain many documentation comments and can easily be adapted to your needs.

4.1. Configuration

All configurations are read from a Java properties file named dump/config.properties. After a fresh checkout you will need to copy it from the .default file and modify it according to your needs (e.g., removing languages you do not want to extract).

The following properties are available:

  • dumpDir The directory where the dumps are located. The framework expects subdirectories of the form enwiki/[date] inside it.
  • updateDumps If true, the extraction framework will download every dump which is either missing or not up-to-date. If you want to use your own dumps or don't want the framework to update the dumps, set it to false and be sure that (uncompressed) dumps are available in dumpDir/<lang>/<date>/<lang>wiki-<date>-pages-articles.xml.
  • outputDir The output directory.
  • languages The languages of the Wikipedia dumps to be extracted.
  • extractors The extractor classes to be used for the extraction. See Available Extractors. Language specific extractors can be configured using a property of the format extractors.{wikiCode}, e.g., extractors.en
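As a concrete illustration, a minimal dump/config.properties using only the properties documented above might look like this (a sketch: the paths, languages, and extractor class names are example values to be adapted; check the .default file for the authoritative key list):

```properties
# Where the (uncompressed) dumps live, in <lang>/<date>/ subdirectories
dumpDir=/srv/dbpedia/dumps
# Use local dumps only; do not download or update anything
updateDumps=false
# Where the extracted triples are written
outputDir=/srv/dbpedia/output
# Languages of the Wikipedia dumps to extract
languages=en,it
# Extractors applied to every language
extractors=org.dbpedia.extraction.mappings.LabelExtractor
# Additional extractors for English only
extractors.en=org.dbpedia.extraction.mappings.MappingExtractor
```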

4.2. Running the dump extraction

Before you can start the extraction you need to install the framework into your maven repository by running mvn install from the extraction directory.

The dump extraction is started by running mvn scala:run from the directory extraction/dump.

4.3. Running Abstract Extraction

Abstracts are not generated by the simple wiki parser; they are produced by a local Wikipedia clone using a modified MediaWiki installation.

In order to generate clean abstracts from Wikipedia articles, one needs to render wiki templates as they would be rendered in the original Wikipedia instance. So in order for the DBpedia abstract extractor to work, a running MediaWiki instance loaded with Wikipedia data in a database is necessary.

See:

http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/extraction_framework/file/d580c99b5bbc/core/src/main/scala/org/dbpedia/extraction/mappings/AbstractExtractor.scala#l66

http://dbpedia.hg.sourceforge.net/hgweb/dbpedia/dbpedia/file/efc0afb0faa3/abstractExtraction/README.txt

5. Step by Step Guide

Extraction on Ubuntu.

Internationalization

If you leave the updateDumps setting at false, you can download the dump you are interested in from http://dumps.wikimedia.org/backup-index.html: pick the latest complete one for <lang>wiki (e.g., itwiki) and choose the pages-articles.xml.bz2 file (e.g., itwiki-20120226-pages-articles.xml.bz2). The input file must be placed in dumpDir/<lang>/<date> (e.g., /srv/dbpedia/dumps/it/20120226/itwiki-20120226-pages-articles.xml.bz2, if dumpDir is /srv/dbpedia/dumps).
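The expected location can be sketched as a small shell snippet (it follows the dumpDir/<lang>/<date> layout described above; the language code and date are the example values from this section):

```shell
#!/bin/sh
# Compute where a manually downloaded dump must be placed so that the
# framework (with updateDumps=false) can find it.
DUMP_DIR=/srv/dbpedia/dumps   # your dumpDir setting
WIKI_LANG=it                  # Wikipedia language code
DUMP_DATE=20120226            # dump date from dumps.wikimedia.org

TARGET="$DUMP_DIR/$WIKI_LANG/$DUMP_DATE/${WIKI_LANG}wiki-$DUMP_DATE-pages-articles.xml.bz2"
echo "$TARGET"
# → /srv/dbpedia/dumps/it/20120226/itwiki-20120226-pages-articles.xml.bz2
```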

6. Server Module

This module is intended for testing the framework.

6.1. Configuration

There are two Scala classes that configure the parameters of the server:

  • In org.dbpedia.extraction.server.Configuration, you can configure the possible languages and the URL of the mappings wiki API.
  • In org.dbpedia.extraction.server.ExtractionManager, in the function loadExtractor, you can configure the extractors that should be used by the extraction server. See Available Extractors.

6.2. Running the extraction server

Before you can start the server you need to install the framework into your maven repository by running mvn install from the extraction directory.

The extraction server is started by running mvn scala:run from the directory extraction/server. The standard port is 9999.

A browser window should open in which you can specify the language and the URI that you would like to extract.
