[SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either from BlockManager or WAL in HDFS #2931

Closed
wants to merge 13 commits

Conversation

@tdas tdas commented Oct 24, 2014

As part of the initiative of preventing data loss on streaming driver failure, this sub-task implements a BlockRDD that is backed by HDFS. This BlockRDD can either read data from Spark's BlockManager, or read the data from file segments in the write ahead log in HDFS.

Most of this code has been written by @harishreedharan
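As a rough illustration of the read path described above, here is a minimal, self-contained sketch (simplified stand-in types, not the actual patch): the RDD first asks the BlockManager for the block and, only if the block is gone, falls back to reading the corresponding file segment from the write ahead log in HDFS. The names FileSegment, BlockStore, readSegment, and readBlockOrSegment are illustrative assumptions.

    // Simplified stand-ins for the Spark/Hadoop classes; illustration only.
    case class FileSegment(path: String, offset: Long, length: Long)

    trait BlockStore {
      // Returns the block's data if it is still held by the BlockManager.
      def get(blockId: String): Option[Iterator[Array[Byte]]]
    }

    // Hypothetical WAL reader: reads `length` bytes at `offset` from the log file.
    def readSegment(segment: FileSegment): Iterator[Array[Byte]] =
      Iterator.empty // placeholder for the actual HDFS read

    def readBlockOrSegment(
        store: BlockStore,
        blockId: String,
        segment: FileSegment): Iterator[Array[Byte]] = {
      store.get(blockId) match {
        case Some(data) => data                  // block still in the BlockManager
        case None       => readSegment(segment)  // fall back to the WAL segment in HDFS
      }
    }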

tdas commented Oct 24, 2014

@JoshRosen Can you take a look?

SparkQA commented Oct 24, 2014

Test build #22152 has started for PR 2931 at commit eadde56.

  • This patch merges cleanly.

SparkQA commented Oct 24, 2014

Test build #22152 timed out for PR 2931 at commit eadde56 after a configured wait of 120m.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22152/

tdas commented Oct 24, 2014

Jenkins, test this.

SparkQA commented Oct 24, 2014

Test build #420 has started for PR 2931 at commit eadde56.

  • This patch merges cleanly.

SparkQA commented Oct 24, 2014

Test build #420 has finished for PR 2931 at commit eadde56.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@harishreedharan

The HdfsBackedRDDSuite is passing - not sure why there are some other failures. Maybe we are missing some cleanup?

val partition = split.asInstanceOf[HDFSBackedBlockRDDPartition]
val locations = getBlockIdLocations()
locations.getOrElse(partition.blockId,
  HdfsUtils.getBlockLocations(partition.segment.path, hadoopConfiguration)
Contributor Author:

Can you explain how this code gets the block locations of the segment of the file that the partition needs? The offsets don't seem to be passed on to HdfsUtils.getBlockLocations.

Contributor:

Fixed this one in the PR sent to your repo.

Contributor Author:

Fixed.
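For reference, a hedged sketch of how offset and length can be passed through to Hadoop's block-location API. The actual change lives in HdfsUtils in this PR; the body below is an illustration rather than the real code.

    import org.apache.hadoop.conf.Configuration
    import org.apache.hadoop.fs.Path

    // Return the hosts holding the HDFS blocks that overlap the
    // [offset, offset + length) range of the write ahead log file.
    def getFileSegmentLocations(
        path: String, offset: Long, length: Long, conf: Configuration): Seq[String] = {
      val hadoopPath = new Path(path)
      val fs = hadoopPath.getFileSystem(conf)
      val fileStatus = fs.getFileStatus(hadoopPath)
      val blockLocations = Option(fs.getFileBlockLocations(fileStatus, offset, length))
      blockLocations.map(_.flatMap(_.getHosts).toSeq).getOrElse(Seq.empty)
    }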

}

// Hadoop Configuration is not serializable, so broadcast it as a SerializableWritable.
val broadcastedHadoopConf = sc.broadcast(new SerializableWritable(hadoopConfiguration))
Contributor:

Over in #2935, @davies is planning to add some code to SerializableWritable to address the Hadoop Configuration constructor thread-safety issue, so you shouldn't have to do it here once we've merged that patch.

Contributor:

Does it make sense to take the SerializableWritable as the argument in the constructor (as being done in #2935) or should we just take the hadoopConf and wrap it in the SerializableWritable once that is merged? We don't want to change the interface later.

Contributor:

For now I am leaving this as is. Let's revisit this later if needed.
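For readers unfamiliar with the pattern under discussion, here is a minimal sketch of the wrap/unwrap dance; only SparkContext.broadcast and SerializableWritable are taken from the snippet above, and the variable names are illustrative.

    import org.apache.hadoop.conf.Configuration
    import org.apache.spark.{SerializableWritable, SparkContext}

    // Driver side: Configuration is not Serializable, so wrap it before broadcasting.
    def broadcastHadoopConf(sc: SparkContext, hadoopConf: Configuration) =
      sc.broadcast(new SerializableWritable(hadoopConf))

    // Executor side (e.g. inside compute()): broadcastedHadoopConf.value returns the
    // SerializableWritable, and its .value is the usable Configuration:
    //   val conf: Configuration = broadcastedHadoopConf.value.value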

@JoshRosen

I left a pass of fairly shallow style comments; I'll loop back later to offer more substantive feedback and to actually check that I understand this logic.

Make sure getBlockLocations uses offset and length to find the blocks on...
SparkQA commented Oct 25, 2014

Test build #22189 has started for PR 2931 at commit c709f2f.

  • This patch merges cleanly.

SparkQA commented Oct 25, 2014

Test build #22189 has finished for PR 2931 at commit c709f2f.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HDFSBackedBlockRDDPartition(
    • class HDFSBackedBlockRDD[T: ClassTag](

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22189/

SparkQA commented Oct 27, 2014

Test build #22300 has started for PR 2931 at commit 9c86a61.

  • This patch merges cleanly.

SparkQA commented Oct 27, 2014

Test build #22300 has finished for PR 2931 at commit 9c86a61.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class HDFSBackedBlockRDDPartition(
    • class HDFSBackedBlockRDD[T: ClassTag](

@JoshRosen

This looks good to me.

    @transient override val blockIds: Array[BlockId],
    @transient val segments: Array[WriteAheadLogFileSegment],
    val storeInBlockManager: Boolean,
    val storageLevel: StorageLevel
Contributor:

Nitpick: the common style in Spark is

    val storageLevel: StorageLevel)
  extends BlockRDD[T](sc, blockIds) {

rxin commented Oct 29, 2014

@harishreedharan / @tdas I made a few more comments. Most are just nits that I've left earlier.

@harishreedharan

Thanks @rxin. Updates coming soon.

tdas commented Oct 30, 2014

@rxin I updated. The only part I am not in agreement with is the preferred location logic.

SparkQA commented Oct 30, 2014

Test build #22521 has started for PR 2931 at commit ed5fbf0.

  • This patch merges cleanly.

@harishreedharan

Apart from the readability, does one have a performance benefit over the other?

SparkQA commented Oct 30, 2014

Test build #22521 has finished for PR 2931 at commit ed5fbf0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriteAheadLogBackedBlockRDDPartition(
    • class WriteAheadLogBackedBlockRDD[T: ClassTag](

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22521/

tdas commented Oct 30, 2014

@harishreedharan I don't think so. The block location lookup is called only once in both, and the HDFS location lookup is called only once and only if required. I don't think there is any performance difference between these two possible implementations.

def blockLocations = getBlockIdLocations().get(partition.blockId)
def segmentLocations = HdfsUtils.getFileSegmentLocations(
  partition.segment.path, partition.segment.offset, partition.segment.length, hadoopConfig)
blockLocations.orElse(segmentLocations).getOrElse(Seq.empty)
Contributor:

It's not an over-my-dead-body type of thing, but I think declaring two inline functions for this, coupled with orElse / getOrElse, is less intuitive to most people.

Maybe others can chime in here. @shivaram @pwendell

Contributor:

Yeah, it's not ideal; I think the easiest to understand is something like

if (blockLocations.isDefined) {
  blockLocations.get
} else if (segmentLocations.isDefined) {
  segmentLocations.get
} else {
  Seq.empty
}

but this isn't too bad if the above isn't possible.

Contributor:

Actually, this discussion is moot because we should just let getFileSegmentLocations return Seq[String] rather than Option[Seq[String]]; then this only needs two branches, accomplishable with a single getOrElse.

Contributor Author:

This is the final version I am going with, then.

    val blockLocations = getBlockIdLocations().get(partition.blockId)
    def segmentLocations = HdfsUtils.getFileSegmentLocations(...)
    blockLocations.getOrElse(segmentLocations)

Contributor:

Correct. Once we make that change, I think the getOrElse and the if..else solutions are equivalent - one is the Scala way of doing things, and the other is the "traditional" way. The one using def/lazy val is really the more Scala-idiomatic way of doing it.

I have no preference for either method, but I would generally consider the overhead and performance incurred by each, and I am not enough of a Scala expert to know.
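To make the evaluation-order point concrete, here is a small self-contained sketch (simplified names, not the patch itself): because segmentLocations is declared with def, the HDFS lookup only runs when the block is no longer known to the BlockManager.

    // Stand-ins for the real lookups; illustration only.
    def blockManagerLocations(blockId: String): Option[Seq[String]] =
      Map("block-0" -> Seq("host1", "host2")).get(blockId)

    def walSegmentLocations(path: String, offset: Long, length: Long): Seq[String] = {
      println("querying HDFS for segment locations") // observable side effect
      Seq("host3")
    }

    def preferredLocations(blockId: String): Seq[String] = {
      val blockLocations = blockManagerLocations(blockId)
      // Declaring this with def (or lazy val) defers the HDFS call until needed.
      def segmentLocations = walSegmentLocations("/wal/log-1", 0L, 100L)
      blockLocations.getOrElse(segmentLocations)
    }

    // preferredLocations("block-0") returns Seq("host1", "host2") without touching HDFS;
    // preferredLocations("block-9") falls back to the WAL lookup and returns Seq("host3").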

rxin commented Oct 30, 2014

@tdas you also missed one other comment ...

tdas commented Oct 30, 2014

@rxin, crap, I missed that. Personally I find the parenthesis ending on the next line more logical, since braces always end on the next line, but I will do it in the interest of consistency.

SparkQA commented Oct 30, 2014

Test build #22537 has started for PR 2931 at commit 4a5866f.

  • This patch merges cleanly.

SparkQA commented Oct 30, 2014

Test build #22537 has finished for PR 2931 at commit 4a5866f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriteAheadLogBackedBlockRDDPartition(
    • class WriteAheadLogBackedBlockRDD[T: ClassTag](

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22537/

SparkQA commented Oct 30, 2014

Test build #22562 has started for PR 2931 at commit 209e49c.

  • This patch merges cleanly.

tdas commented Oct 30, 2014

Alright! I think we have converged on the best solution here. I am going to wait for the tests to pass and then merge. Thanks @rxin and @JoshRosen for all the feedback!

@tdas tdas changed the title [SPARK-4027][Streaming] HDFSBasedBlockRDD to read received either from BlockManager or WAL in HDFS [SPARK-4027][Streaming] WriteAheadLogBackedBlockRDD to read received either from BlockManager or WAL in HDFS Oct 30, 2014
SparkQA commented Oct 30, 2014

Test build #22562 has finished for PR 2931 at commit 209e49c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • class WriteAheadLogBackedBlockRDDPartition(
    • class WriteAheadLogBackedBlockRDD[T: ClassTag](

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22562/

@asfgit asfgit closed this in fb1fbca Oct 30, 2014