Probably the largest free dataset available on the Internet is the full XML dump of the English Wikipedia. Uncompressed, this dataset is about 5.5 TB and still growing. Its sheer size poses serious challenges for analysis. In theory, Hadoop would be a great tool for analyzing this dataset, but it turns out that this is not necessarily the case.
Jimmy Lin wrote Cloud9, a Hadoop InputReader that can handle the stub Wikipedia dump files (the stub dump files contain the same fields as the full dump files, with the exception of the text of each revision). Unfortunately, his InputReader does not support the full XML dump files.
The XML dump files are organized as follows: each dump file starts with some metadata tags, followed by the tags that contain the revisions. Hadoop has a StreamXmlRecordReader that allows you to grab an XML fragment and send it as input to a mapper. This poses two problems:
- Some pages are so large (tens of gigabytes) that you will inevitably run into out-of-memory errors.
- Splitting by tag leads to serious information loss, as you no longer know to which page a revision belongs.
Hence, Hadoop's StreamXmlRecordReader is not suitable to analyze the full Wikipedia dump files.
During the last couple of weeks, the Wikimedia Foundation fellows of the Summer of Research have been working hard on tackling this problem. In particular, a big thank you to Yusuke Matsubara, Shawn Walker, Aaron Halfaker and Fabian Kaelin. We have released a customized InputFormat for the full Wikipedia dump files that supports both the compressed (bz2) and uncompressed files. The project is called WikiHadoop and the code is available on GitHub at https://github.com/whym/wikihadoop
##Features of WikiHadoop
WikiHadoop offers the following:
- WikiHadoop uses Hadoop's streaming interface, so you can write your own mapper in Python, Ruby, Hadoop Pipes or Java (a minimal Python sketch follows this list).
- You can choose between sending one or two revisions to a mapper. If you choose two revisions, WikiHadoop will send two consecutive revisions from a single page to the mapper; these can be used to create a diff (what has been added / removed). The syntax for this option is:
-D org.wikimedia.wikihadoop.previousRevision=false (true is the default)
- You can specify which namespaces to include when parsing the XML files by entering a regular expression; the default behavior is to include all namespaces. The syntax for this option is:
-D org.wikimedia.wikihadoop.ignorePattern='xxxx'
- You can parse both bz2 compressed and uncompressed files using WikiHadoop.
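To give a feel for the streaming interface, here is a minimal sketch of a mapper written in Python. It is not the example mapper shipped with WikiHadoop, and the exact record format produced by StreamWikiDumpInputFormat is an assumption here (a <page> fragment containing one or two <revision> elements, arriving on standard input), so treat it as a starting point rather than a reference implementation.

    #!/usr/bin/env python
    # Minimal sketch of a streaming mapper for WikiHadoop (illustrative only).
    # Assumption: each map input record is a <page> fragment that contains one
    # or two <revision> elements, delivered on standard input.
    import re
    import sys

    REVISION_RE = re.compile(r'<revision>.*?</revision>', re.DOTALL)
    TITLE_RE = re.compile(r'<title>(.*?)</title>')

    def main():
        data = sys.stdin.read()
        for page in data.split('</page>'):
            title_match = TITLE_RE.search(page)
            if not title_match:
                continue
            title = title_match.group(1)
            revisions = REVISION_RE.findall(page)
            # Emit one tab-separated line per page fragment: title and revision count.
            sys.stdout.write('%s\t%d\n' % (title, len(revisions)))

    if __name__ == '__main__':
        main()

With previousRevision left at its default of true, each page fragment should contain two consecutive revisions, which is what makes computing a diff possible.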
##Getting Ready
- Install and configure Hadoop 0.21. You need Hadoop 0.21 because it has streaming support for bz2 files, which Hadoop 0.20 lacks. Good places to look for help on configuration are http://www.michael-noll.com/tutorials/running-hadoop-on-ubuntu-linux-multi-node-cluster/ and http://hadoop.apache.org/common/docs/current/cluster_setup.html.
- Download WikiHadoop and extract the source tree. Confirm there is a directory called mapreduce.
- Download Hadoop Common and extract the source tree. Confirm there is a directory called mapreduce.
- Move to the top directory of the source tree of your copy of Hadoop Common.
- Merge the mapreduce directory of your copy of WikiHadoop into that of Hadoop Common.
rsync -r ../wikihadoop/mapreduce/ mapreduce/
- Move to the directory called mapreduce/src/contrib/streaming under the source tree of Hadoop Common.
cd mapreduce/src/contrib/streaming
- Run Ant to build a jar file.
ant jar
- Find the jar file at mapreduce/build/contrib/streaming/hadoop-${version}-streaming.jar under the Hadoop Common source tree.
##Tutorial
So now we are ready to start crunching some serious data!
- Download the latest full dump from http://dumps.wikimedia.org/enwiki/latest/. Look for the files that start with enwiki-latest-pages-meta-history and end with 'bz2'. You can also download the 7z files but then you will need to decompress them. Hadoop cannot stream 7z files at the moment.
- Copy the bz2 files to HDFS. Make sure you have enough space; you can delete the bz2 files from your regular partition after they have been copied to HDFS.
hdfs dfs -copyFromLocal /path/to/dump/files/enwiki--pages-meta-history.xml.bz2 /path/on/hdfs/
You can check whether the files were successfully copied to HDFS via:
hdfs dfs -ls /path/on/hdfs/
- Once the files are in HDFS, you can launch Hadoop by entering the following command:
hadoop jar hadoop-0.<version>-streaming.jar -D mapred.child.ulimit=3145728 -D mapreduce.task.timeout=0 -D mapreduce.input.fileinputformat.split.minsize=400000000 -D mapred.output.compress=true -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 -output /path/on/hdfs/out -mapper <name of mapper> -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
The mapreduce.input.fileinputformat.split.minsize option sets the file split size; a smaller split size means more seeking and slower processing time.
- You can customize your job with the following parameters:
- -D org.wikimedia.wikihadoop.ignorePattern='xxxx'. This is a regular expression that determines which namespaces to include. The default is to include all namespaces.
- -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec. This compresses the output of the mapper using the LZO compression algorithm. This is optional but it saves hard disk space.
##Real Life Application
At the Wikimedia Foundation we wanted to have a more fine-grained understanding of the different types of editors that we have. To analyze this, we need to generate the diffs between two revisions to see what type of content an editor has added and removed. In the examples folder of https://github.com/whym/wikihadoop you can find our mapper function that creates diffs based on the two revisions it receives from WikiHadoop. We set the number of reducers to 0 as there is no aggregation over the diffs; we just want to store them.
You can launch this as follows:
hadoop jar hadoop-0.<version>-streaming.jar -D mapred.child.ulimit=3145728 -D mapreduce.task.timeout=0 -D mapreduce.input.fileinputformat.split.minsize=400000000 -D mapred.reduce.tasks=0 -D mapred.output.compress=true -D mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec -input /path/on/hdfs/enwiki-<date>-pages-meta-history<file>.xml.bz2 -output /path/on/hdfs/out -mapper /path/to/wikihadoop/examples/mapper.py -inputformat org.wikimedia.wikihadoop.StreamWikiDumpInputFormat
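For illustration, here is a stripped-down sketch of what such a diff-producing mapper might look like. The actual mapper.py in the examples folder is more complete; this sketch assumes (as above) that each input record is a <page> fragment containing two consecutive <revision> elements with <text> bodies, and it simply emits a unified diff of the two texts.

    #!/usr/bin/env python
    # Rough sketch of a diff-producing mapper (illustrative only; see
    # examples/mapper.py in the WikiHadoop repository for the real one).
    # Assumption: each input record is a <page> fragment with two consecutive
    # <revision> elements, each carrying a <text>...</text> body.
    import difflib
    import re
    import sys

    TEXT_RE = re.compile(r'<text[^>]*>(.*?)</text>', re.DOTALL)

    def main():
        data = sys.stdin.read()
        for page in data.split('</page>'):
            texts = TEXT_RE.findall(page)
            if len(texts) < 2:
                continue
            old, new = texts[0], texts[1]
            # Emit a unified diff of the two revision texts.
            for line in difflib.unified_diff(old.splitlines(), new.splitlines(), lineterm=''):
                sys.stdout.write(line + '\n')

    if __name__ == '__main__':
        main()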
Depending on the number of nodes in your cluster, the number of cores in each node, and the memory on each node, this job will run for quite a while. We are running it on a three-node mini-cluster with quad-core machines, and the job takes about 14 days to parse the entire set of English Wikipedia dump files.