Incorrect reading of BZip2 input splits #312

Open
ebastien opened this issue Dec 27, 2013 · 0 comments
When reading from a large bzip2 text file (i.e. larger than the HDFS block size), all the mappers read the whole text file instead of their assigned split. For instance, reading a 140MB bzip2 text file from HDFS with a block size of 128MB spawns two mappers, each of which reads the entire 140MB input file.

The input splits reported in the log file differ from those of an equivalent Java job reading the same file: Scoobi cuts the input at the HDFS block boundary (0 to 128MB for the first split, then 128MB to 140MB for the second), whereas the Java job cuts roughly at the middle of the input file (0 to 70MB and 70MB to 140MB).
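
For reference, the split boundaries above can be compared against Hadoop's stock FileInputFormat arithmetic: splitSize = max(minSize, min(maxSize, blockSize)), with a 1.1 "slop" factor that folds a small remainder into the last split instead of creating a tiny trailing one. The sketch below is an illustrative simplification, not Scoobi or Hadoop code; the class name SplitSketch and the {start, length} pair representation are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hadoop's default slop factor: a final split may be up to 10%
    // larger than splitSize rather than spilling into a tiny extra split.
    static final double SPLIT_SLOP = 1.1;

    // Returns {start, length} pairs, mimicking FileInputFormat.getSplits():
    // splitSize = max(minSize, min(maxSize, blockSize)).
    static List<long[]> splits(long fileLen, long blockSize,
                               long minSize, long maxSize) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        List<long[]> out = new ArrayList<>();
        long remaining = fileLen;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            out.add(new long[] { fileLen - remaining, splitSize });
            remaining -= splitSize;
        }
        if (remaining > 0) {
            out.add(new long[] { fileLen - remaining, remaining });
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 140MB on 128MB blocks: 140/128 ~ 1.09, which is within the
        // 1.1 slop, so this default arithmetic yields a single split.
        for (long[] s : splits(140 * mb, 128 * mb, 1L, Long.MAX_VALUE)) {
            System.out.println(s[0] + ".." + (s[0] + s[1]));
        }
    }
}
```

Note that under these default parameters a 140MB file on 128MB blocks falls within the slop and would stay a single split, so both the 0–128MB/128–140MB cut and the 0–70MB/70–140MB cut suggest non-default split sizing is in play somewhere.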

I notice that Scoobi manipulates the InputFormat, the InputSplit and the RecordReader in the Source.read method (DataSource.scala#81). Could it be that Hadoop's bzip2 split logic gets jammed at this point?
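
To make the suspected failure mode concrete: a line-oriented RecordReader is expected to honor its split bounds by discarding the partial first line (unless the split starts at byte 0) and reading only while the current position is within the split. If that bookkeeping is lost when the reader is re-wrapped, every mapper consumes the whole file, which is the symptom above. The following is a hypothetical in-memory sketch of that contract, not Hadoop's actual LineRecordReader:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedLineReader {
    // Reads only the lines belonging to the split [start, end] of `data`.
    static List<String> readSplit(byte[] data, int start, int end) {
        int[] pos = { start };
        // A split rarely begins on a line boundary: the partial first
        // line belongs to (and is read by) the previous split.
        if (start != 0) readLine(data, pos);
        List<String> lines = new ArrayList<>();
        // The line straddling `end` is read by this split; the next
        // split discards it as its partial first line.
        while (pos[0] <= end && pos[0] < data.length) {
            lines.add(readLine(data, pos));
        }
        return lines;
    }

    static String readLine(byte[] data, int[] pos) {
        StringBuilder sb = new StringBuilder();
        while (pos[0] < data.length && data[pos[0]] != '\n') {
            sb.append((char) data[pos[0]++]);
        }
        if (pos[0] < data.length) pos[0]++; // consume '\n'
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbb\nccc\nddd\n".getBytes();
        // Two splits cut mid-line at byte 6: together they yield every
        // line exactly once, with no overlap.
        System.out.println(readSplit(file, 0, 6));   // [aaa, bbb]
        System.out.println(readSplit(file, 6, 16));  // [ccc, ddd]
    }
}
```

With a splittable codec such as bzip2 the same idea applies, except the start/end offsets are additionally realigned to compression block markers before reading; if Scoobi's wrapping drops the adjusted bounds, each reader would fall back to the full file.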
