When reading from a large bzip2 text file (i.e. larger than the HDFS block size), all the mappers read the whole text file instead of their assigned split. For instance, reading a 140MB bzip2 text file from HDFS with a block size of 128MB spawns two mappers, each of which reads the whole 140MB input file.
The input splits reported in the log are not the same as those produced when reading the same input file with an equivalent Java job: Scoobi cuts the input file at the HDFS block boundary (0 to 128MB for the first split, 128MB to 140MB for the second), whereas the Java job cuts roughly in the middle of the file (0 to 70MB and 70MB to 140MB).
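For reference, a minimal standalone sketch (not Scoobi code) that prints the splits Hadoop's own TextInputFormat computes for a file, so the boundaries can be compared with what Scoobi reports. The input path is hypothetical and the Hadoop 2.x style `Job.getInstance` call is an assumption; it should be run with the cluster configuration on the classpath:

```scala
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.mapreduce.Job
import org.apache.hadoop.mapreduce.lib.input.{FileInputFormat, TextInputFormat}
import scala.collection.JavaConverters._

object PrintSplits {
  def main(args: Array[String]): Unit = {
    val job = Job.getInstance(new Configuration())
    // Hypothetical path to the 140MB bzip2 file on HDFS
    FileInputFormat.addInputPath(job, new Path("/data/input.txt.bz2"))

    // Ask the plain TextInputFormat for its splits and print their extents
    val splits = new TextInputFormat().getSplits(job).asScala
    splits.foreach(s => println(s"$s length=${s.getLength}"))
  }
}
```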
I notice that Scoobi manipulates the InputFormat, the InputSplit and the RecordReader in the Source.read method (DataSource.scala#81). Could it be that Hadoop's bzip2 split handling breaks down at this point?
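For comparison, here is a minimal sketch of a delegating RecordReader whose initialize passes the original split straight through to Hadoop's LineRecordReader. This is not Scoobi's actual wrapper, just an illustration of the invariant that would keep the bzip2 split logic working: the delegate must see the split's real start/length so the splittable bzip2 codec can seek to the matching compressed block, rather than decompressing the whole file in every mapper.

```scala
import org.apache.hadoop.io.{LongWritable, Text}
import org.apache.hadoop.mapreduce.{InputSplit, RecordReader, TaskAttemptContext}
import org.apache.hadoop.mapreduce.lib.input.LineRecordReader

// Hypothetical wrapper, for illustration only
class DelegatingTextReader extends RecordReader[LongWritable, Text] {
  private val delegate = new LineRecordReader()

  override def initialize(split: InputSplit, context: TaskAttemptContext): Unit = {
    // Forward the original split unchanged: its start/length are what lets
    // the splittable bzip2 codec position itself on its compressed block
    // boundary. Substituting a split that covers the whole file would make
    // every mapper read all 140MB.
    delegate.initialize(split, context)
  }

  override def nextKeyValue(): Boolean = delegate.nextKeyValue()
  override def getCurrentKey(): LongWritable = delegate.getCurrentKey()
  override def getCurrentValue(): Text = delegate.getCurrentValue()
  override def getProgress(): Float = delegate.getProgress()
  override def close(): Unit = delegate.close()
}
```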