Incorrect reading of BZip2 input splits #312

Open
ebastien opened this issue Dec 27, 2013 · 0 comments
When reading from a large bzip2 text file (i.e. larger than the HDFS block size), all the mappers read the whole text file instead of their assigned split. For instance, reading a 140MB bzip2 text file from HDFS with a block size of 128MB spawns two mappers, each of which reads the entire 140MB input file.

The input splits reported in the log file differ from those of an equivalent Java job reading the same file: Scoobi cuts the input at the HDFS block boundary (0 to 128MB for the first split, then 128MB to 140MB for the second), whereas the Java job cuts roughly at the middle of the input file (0 to 70MB and 70MB to 140MB).
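
For reference, the split boundaries above can be compared against Hadoop's stock FileInputFormat arithmetic: splitSize = max(minSize, min(maxSize, blockSize)), with a 1.1 "slop" factor that folds a small remainder into the last split instead of creating a tiny trailing one. The sketch below is an illustrative simplification, not Scoobi or Hadoop code; the class name SplitSketch and the {start, length} pair representation are mine:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitSketch {
    // Hadoop's default slop factor: a final split may be up to 10%
    // larger than splitSize rather than spilling into a tiny extra split.
    static final double SPLIT_SLOP = 1.1;

    // Returns {start, length} pairs, mimicking FileInputFormat.getSplits():
    // splitSize = max(minSize, min(maxSize, blockSize)).
    static List<long[]> splits(long fileLen, long blockSize,
                               long minSize, long maxSize) {
        long splitSize = Math.max(minSize, Math.min(maxSize, blockSize));
        List<long[]> out = new ArrayList<>();
        long remaining = fileLen;
        while (((double) remaining) / splitSize > SPLIT_SLOP) {
            out.add(new long[] { fileLen - remaining, splitSize });
            remaining -= splitSize;
        }
        if (remaining > 0) {
            out.add(new long[] { fileLen - remaining, remaining });
        }
        return out;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // 140MB on 128MB blocks: 140/128 ~ 1.09, which is within the
        // 1.1 slop, so this default arithmetic yields a single split.
        for (long[] s : splits(140 * mb, 128 * mb, 1L, Long.MAX_VALUE)) {
            System.out.println(s[0] + ".." + (s[0] + s[1]));
        }
    }
}
```

Note that under these default parameters a 140MB file on 128MB blocks falls within the slop and would stay a single split, so both the 0–128MB/128–140MB cut and the 0–70MB/70–140MB cut suggest non-default split sizing is in play somewhere.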

I notice that Scoobi manipulates the InputFormat, the InputSplit and the RecordReader in the Source.read method (DataSource.scala#81). Could it be that Hadoop's bzip2 split logic gets jammed at this point?
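
To make the suspected failure mode concrete: a line-oriented RecordReader is expected to honor its split bounds by discarding the partial first line (unless the split starts at byte 0) and reading only while the current position is within the split. If that bookkeeping is lost when the reader is re-wrapped, every mapper consumes the whole file, which is the symptom above. The following is a hypothetical in-memory sketch of that contract, not Hadoop's actual LineRecordReader:

```java
import java.util.ArrayList;
import java.util.List;

public class BoundedLineReader {
    // Reads only the lines belonging to the split [start, end] of `data`.
    static List<String> readSplit(byte[] data, int start, int end) {
        int[] pos = { start };
        // A split rarely begins on a line boundary: the partial first
        // line belongs to (and is read by) the previous split.
        if (start != 0) readLine(data, pos);
        List<String> lines = new ArrayList<>();
        // The line straddling `end` is read by this split; the next
        // split discards it as its partial first line.
        while (pos[0] <= end && pos[0] < data.length) {
            lines.add(readLine(data, pos));
        }
        return lines;
    }

    static String readLine(byte[] data, int[] pos) {
        StringBuilder sb = new StringBuilder();
        while (pos[0] < data.length && data[pos[0]] != '\n') {
            sb.append((char) data[pos[0]++]);
        }
        if (pos[0] < data.length) pos[0]++; // consume '\n'
        return sb.toString();
    }

    public static void main(String[] args) {
        byte[] file = "aaa\nbbb\nccc\nddd\n".getBytes();
        // Two splits cut mid-line at byte 6: together they yield every
        // line exactly once, with no overlap.
        System.out.println(readSplit(file, 0, 6));   // [aaa, bbb]
        System.out.println(readSplit(file, 6, 16));  // [ccc, ddd]
    }
}
```

With a splittable codec such as bzip2 the same idea applies, except the start/end offsets are additionally realigned to compression block markers before reading; if Scoobi's wrapping drops the adjusted bounds, each reader would fall back to the full file.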
