Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation #2665

Closed
wants to merge 12 commits into from

Conversation

soumitrak
Copy link
Contributor

SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

SPARK-3660 : Initial RDD for updateStateByKey transformation
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

1 similar comment
@AmplabJenkins
Copy link

Can one of the admins verify this patch?

@tdas
Copy link
Contributor

tdas commented Oct 21, 2014

@soumitrak This is a good addition. Here are some high level suggestions.

  1. Please update the title as [SPARK-3660][STREAMING]....
  2. Please add unit tests for this new functionality
  3. Lets not create another example. Please update the StatefulNetworkWordCount exammple. That should be cool.

@tdas
Copy link
Contributor

tdas commented Oct 21, 2014

Jenkins, this is ok to test.

@SparkQA
Copy link

SparkQA commented Oct 21, 2014

QA tests have started for PR 2665 at commit dde4271.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 21, 2014

QA tests have finished for PR 2665 at commit dde4271.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22012/
Test PASSed.

@tdas
Copy link
Contributor

tdas commented Oct 24, 2014

@soumitrak Hey any thoughts?

@soumitrak
Copy link
Contributor Author

Sorry, I was off for couple of days.

  1. I will send a new pull request with title changed
  2. Is there a unit test for StatefulNetworkWordCount? I can use that to create a test.
  3. I will modify StatefulNetworkWordCount as suggested.

----- Original Message -----
From: "Tathagata Das" notifications@github.com
To: "apache/spark" spark@noreply.github.com
Cc: "soumitrak" kumar.soumitra@gmail.com
Sent: Friday, October 24, 2014 11:57:05 AM
Subject: Re: [spark] Adding support of initial value for state update. (#2665)

@soumitrak Hey any thoughts?


Reply to this email directly or view it on GitHub .

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22190 has started for PR 2665 at commit 8f40ca0.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22190 timed out for PR 2665 at commit 8f40ca0 after a configured wait of 120m.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22190/
Test FAILed.

@soumitrak
Copy link
Contributor Author

TD, I have incorporated your feedback, and add a testcase. Let me know if there is anything else.

@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22206 has started for PR 2665 at commit 4efa58b.

  • This patch merges cleanly.

@soumitrak soumitrak changed the title Adding support of initial value for state update. [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation Oct 25, 2014
@SparkQA
Copy link

SparkQA commented Oct 25, 2014

Test build #22206 has finished for PR 2665 at commit 4efa58b.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/22206/
Test FAILed.

@tdas
Copy link
Contributor

tdas commented Nov 7, 2014

Jenkins, test this please.

@SparkQA
Copy link

SparkQA commented Nov 7, 2014

Test build #23074 has started for PR 2665 at commit 4efa58b.

  • This patch does not merge cleanly.

@SparkQA
Copy link

SparkQA commented Nov 7, 2014

Test build #23074 has finished for PR 2665 at commit 4efa58b.

  • This patch fails Spark unit tests.
  • This patch does not merge cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23074/
Test FAILed.

@soumitrak
Copy link
Contributor Author

TD, Let me know how I can help.

@tdas
Copy link
Contributor

tdas commented Nov 10, 2014

Jenkins, test this please.

* Set up required DStreams to test the DStream operation using the sequence
* of input collections, and initial sequence.
*/
def setupStreamsWithInitial[U: ClassTag, V: ClassTag](
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not sure it is necessary to add this method.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

See other comment.

@SparkQA
Copy link

SparkQA commented Nov 10, 2014

Test build #23166 has started for PR 2665 at commit 4efa58b.

  • This patch does not merge cleanly.

new HashPartitioner (numInputPartitions), true, initialRDD)
}

testOperationWithInitial(initial, inputData, updateStateOperation, outputData, true)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Instead of setting up and using testOperationWithInitial just for this unit test, lets try to do the alternative.

    val updateStateOperation = (s: DStream[String]) => {
      val initialRDD = s.context.sparkContext.makeRDD(initial)
      val updateFunc = (values: Seq[Int], state: Option[Int]) => {
        Some(values.sum + state.getOrElse(0))
      }
      val newUpdateFunc = (iterator: Iterator[(String, Seq[Int], Option[Int])]) => {
        iterator.flatMap(t => updateFunc(t._2, t._3).map(s => (t._1, s)))
      }
      s.map(x => (x, 1)).updateStateByKey[Int](newUpdateFunc,
        new HashPartitioner (numInputPartitions), true, initialRDD)
    }

And then use the usual testOperation

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23219 has finished for PR 2665 at commit 3da51a2.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LeaderElectionAgent
    • trait LeaderElectable
    • trait PersistenceEngine
    • abstract class StandaloneRecoveryModeFactory(conf: SparkConf)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23219/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23223 has started for PR 2665 at commit 9781135.

  • This patch merges cleanly.

@@ -443,6 +443,23 @@ class JavaPairDStream[K, V](val dstream: DStream[(K, V)])(
scalaFunc
}

private def convertUpdateStateFunctionWithIterator[S]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Incorrect indentation

 private def convertUpdateStateFunctionWithIterator[S](
      in: JFunction2[JList[V], Optional[S], Optional[S]]   // 4 space indent
    ):  (Iterator[(K, Seq[V], Option[S])]) => Iterator[(K, S)] = {   // 2 space indent

@tdas
Copy link
Contributor

tdas commented Nov 11, 2014

This is starting to look good. Please address the comments (mainly Scala API addition), and we will good to go. This is a good addition to updateStateByKey.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23223 has finished for PR 2665 at commit 9781135.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23223/
Test PASSed.

@SparkQA
Copy link

SparkQA commented Nov 11, 2014

Test build #23225 has started for PR 2665 at commit 304f636.

  • This patch merges cleanly.

@SparkQA
Copy link

SparkQA commented Nov 12, 2014

Test build #23225 has finished for PR 2665 at commit 304f636.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
    • trait LeaderElectionAgent
    • trait LeaderElectable
    • trait PersistenceEngine
    • abstract class StandaloneRecoveryModeFactory(conf: SparkConf)

@AmplabJenkins
Copy link

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23225/
Test FAILed.

@SparkQA
Copy link

SparkQA commented Nov 12, 2014

Test build #23231 has started for PR 2665 at commit ee8980b.

  • This patch merges cleanly.


JavaPairDStream<String, Integer> updated = pairStream.updateStateByKey(
new Function2<List<Integer>, Optional<Integer>, Optional<Integer>>() {
@Override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit (do this if you have to make more changes): incorrect indentation. should be indented by more than the line with new Function2...

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't have any other change to do. Actually I copied the previous method.

@SparkQA
Copy link

SparkQA commented Nov 12, 2014

Test build #23231 has finished for PR 2665 at commit ee8980b.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins
Copy link

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/23231/
Test PASSed.

@tdas
Copy link
Contributor

tdas commented Nov 12, 2014

Alright I am merging this.

@tdas
Copy link
Contributor

tdas commented Nov 12, 2014

Thanks very much for the changes!

@asfgit asfgit closed this in 36ddeb7 Nov 12, 2014
@soumitrak
Copy link
Contributor Author

TD, Thanks for getting this through.

tianyi pushed a commit to asiainfo/spark that referenced this pull request Dec 4, 2014
SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

Author: Soumitra Kumar <kumar.soumitra@gmail.com>

Closes apache#2665 from soumitrak/master and squashes the following commits:

ee8980b [Soumitra Kumar] Fixed copy/paste issue.
304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
9781135 [Soumitra Kumar] Fixed test, and renamed variable.
3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
tianyi pushed a commit to asiainfo/spark that referenced this pull request Dec 4, 2014
SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

Author: Soumitra Kumar <kumar.soumitra@gmail.com>

Closes apache#2665 from soumitrak/master and squashes the following commits:

ee8980b [Soumitra Kumar] Fixed copy/paste issue.
304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
9781135 [Soumitra Kumar] Fixed test, and renamed variable.
3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
tianyi pushed a commit to asiainfo/spark that referenced this pull request Dec 4, 2014
SPARK-3660 : Initial RDD for updateStateByKey transformation

I have added a sample StatefulNetworkWordCountWithInitial inspired by StatefulNetworkWordCount.

Please let me know if any changes are required.

Author: Soumitra Kumar <kumar.soumitra@gmail.com>

Closes apache#2665 from soumitrak/master and squashes the following commits:

ee8980b [Soumitra Kumar] Fixed copy/paste issue.
304f636 [Soumitra Kumar] Added simpler version of updateStateByKey API with initialRDD and test.
9781135 [Soumitra Kumar] Fixed test, and renamed variable.
3da51a2 [Soumitra Kumar] Adding updateStateByKey with initialRDD API to JavaPairDStream.
2f78f7e [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
d4fdd18 [Soumitra Kumar] Renamed variable and moved method.
d0ce2cd [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
31399a4 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
4efa58b [Soumitra Kumar] [SPARK-3660][STREAMING] Initial RDD for updateStateByKey transformation
8f40ca0 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
dde4271 [Soumitra Kumar] Merge remote-tracking branch 'upstream/master'
fdd7db3 [Soumitra Kumar] Adding support of initial value for state update. SPARK-3660 : Initial RDD for updateStateByKey transformation
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants