
[BEAM-3371] Enable running integration tests on Spark #6244

Merged
merged 4 commits into from
Oct 2, 2018

Conversation

lgajowy
Contributor

@lgajowy lgajowy commented Aug 17, 2018

This PR enables running IOITs on the Spark runner. One can do that by setting up a remote Spark cluster (Spark master) and running this command:

./gradlew clean integrationTest -p sdks/java/io/file-based-io-tests/ -DintegrationTestPipelineOptions='["--numberOfRecords=1000", "--filenamePrefix=PREFIX", "--runner=TestSparkRunner", "--sparkMaster=spark://LGs-Mac.local:7077", "--tempLocation=/tmp/"]' -DintegrationTestRunner=spark --tests org.apache.beam.sdk.io.text.TextIOIT --info

I experienced some difficulties with the TFRecordIOIT, ParquetIOIT and XmlIOIT tests:

  • XmlIOIT fails on an assertion (the hashcode of the PCollection is different than it should be)
  • ParquetIOIT suffers from a dependency mismatch (java.lang.NoSuchMethodError: org.apache.parquet.hadoop.ParquetWriter$Builder.<init>(Lorg/apache/parquet/io/OutputFile;)V)
  • TFRecordIOIT cannot find the created file: java.io.FileNotFoundException: No files found for spec: PREFIX_1534520885787*

Those issues are of a different nature than the one described in BEAM-3371 and have to be tackled separately.

@iemejia Could you take a look when you're back? There's no need to rush - I'll be off for 2 weeks now.

CC: @pabloem


Follow this checklist to help us incorporate your contribution quickly and easily:

  • Format the pull request title like [BEAM-XXX] Fixes bug in ApproximateQuantiles, where you replace BEAM-XXX with the appropriate JIRA issue, if applicable. This will automatically link the pull request to the issue.
  • If this contribution is large, please file an Apache Individual Contributor License Agreement.

It will help us expedite review of your Pull Request if you tag someone (e.g. @username) to look at it.


@lgajowy lgajowy requested a review from iemejia August 17, 2018 15:50
@lgajowy
Contributor Author

lgajowy commented Aug 17, 2018

BTW: why is the filesToStage option duplicated in DataflowWorkerPoolOptions, FlinkPipelineOptions and SparkPipelineOptions? Do you think we could move it to the PipelineOptions interface? IMO it would make things even easier (maybe not only for this fix).
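For illustration, the consolidation being asked about could look roughly like this. This is a hypothetical sketch, not Beam's actual class hierarchy; the interface and class names (StagingOptions, SparkOptions, SparkOptionsImpl) are made up:

```java
import java.util.Arrays;
import java.util.List;

public class SharedOptionSketch {

    // Hypothetical common interface declaring filesToStage once.
    interface StagingOptions {
        List<String> getFilesToStage();
        void setFilesToStage(List<String> files);
    }

    // Runner-specific options would inherit the option
    // instead of each redeclaring it.
    interface SparkOptions extends StagingOptions {}
    interface FlinkOptions extends StagingOptions {}

    // Minimal implementation just to show the shared accessor in use.
    static class SparkOptionsImpl implements SparkOptions {
        private List<String> filesToStage;

        @Override
        public List<String> getFilesToStage() {
            return filesToStage;
        }

        @Override
        public void setFilesToStage(List<String> files) {
            this.filesToStage = files;
        }
    }

    public static void main(String[] args) {
        SparkOptions options = new SparkOptionsImpl();
        options.setFilesToStage(Arrays.asList("a.jar", "b.jar"));
        System.out.println(options.getFilesToStage().size()); // prints 2
    }
}
```

The point of the sketch is only that each runner-specific options interface would pick up the accessor pair by inheritance rather than redeclaring it.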

@lgajowy
Contributor Author

lgajowy commented Aug 17, 2018

Run Java PreCommit

@lgajowy
Contributor Author

lgajowy commented Aug 17, 2018

@lgajowy
Contributor Author

lgajowy commented Sep 3, 2018

@iemejia rebased and fixed the spotless issue. Could you take a look?

@lgajowy
Contributor Author

lgajowy commented Sep 10, 2018

@iemejia ping :)

Alternatively (if you're very busy), please suggest some other reviewer.

@iemejia
Member

iemejia commented Sep 11, 2018

You are absolutely right @lgajowy, I should have passed this review on since I was a bit loaded these last days. Asking @aromanenko-dev to take a look (thanks Alexey!).

Contributor

@aromanenko-dev aromanenko-dev left a comment


Thanks! I added a couple of minor notes.


List<String> result = PipelineResources.prepareFilesForStaging(filesToStage, temporaryLocation);

assertThat(result, is(empty()));
Contributor


Could you add another entry to the filesToStage list that is an existing path, and assert that it removes ONLY the non-existent one?

Contributor Author


ok, good idea.
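The extra assertion being requested can be illustrated in isolation. This is a standalone sketch of the filtering behaviour using java.io directly, not Beam's actual PipelineResources implementation; the method name dropNonExistent is made up:

```java
import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class StagingFilterSketch {

    // Stand-in for the behaviour under test: keep only paths
    // that actually exist on disk.
    static List<String> dropNonExistent(List<String> filesToStage) {
        return filesToStage.stream()
            .filter(path -> new File(path).exists())
            .collect(Collectors.toList());
    }

    public static void main(String[] args) throws IOException {
        // One real file, one bogus path.
        File existing = Files.createTempFile("staged", ".jar").toFile();
        existing.deleteOnExit();

        List<String> filesToStage =
            Arrays.asList(existing.getAbsolutePath(), "/no/such/file.jar");

        List<String> result = dropNonExistent(filesToStage);

        // ONLY the non-existent entry should be removed.
        if (result.size() != 1
                || !result.get(0).equals(existing.getAbsolutePath())) {
            throw new AssertionError("unexpected result: " + result);
        }
        System.out.println("only the non-existent entry was removed");
    }
}
```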

@lgajowy
Contributor Author

lgajowy commented Sep 17, 2018

Hi @aromanenko-dev - I applied the suggestions. PTAL.

@aromanenko-dev
Contributor

Retest this please

@lgajowy
Contributor Author

lgajowy commented Sep 18, 2018

@aromanenko-dev I amended it a little bit and left a comment (PTAL).

@lgajowy
Contributor Author

lgajowy commented Sep 19, 2018

@mxm @aromanenko-dev thanks!

I added tests for Flink. Because the Spark runner invokes the code in the run() method and creates a SparkContext there (but does not delete it later?), I had some trouble creating similar tests for Spark, so I didn't submit them in this PR. Maybe we can do this later. I got the following error when I ran similar tests:

Cannot reuse spark context with different spark master URL. Existing: local[1], requested: spark://localhost:7077.

I tried to reuse the context (with the ReuseContextRule.java class). I also tried to stop the context somehow, but all of this led me nowhere. The code I used is here: https://github.com/apache/beam/compare/master...lgajowy:spark-integration-tests-2?expand=1#diff-9336eb87a4aea9ba0f254a1318f1fc90

@mxm @aromanenko-dev could you take a look again?

Contributor

@mxm mxm left a comment


Thanks @lgajowy looks good for the Flink side.

@@ -435,7 +451,7 @@ public void visitPrimitiveTransform(TransformHierarchy.Node node) {
protected <TransformT extends PTransform<? super PInput, POutput>>
TransformEvaluator<TransformT> translate(
TransformHierarchy.Node node, TransformT transform, Class<TransformT> transformClass) {
//--- determine if node is bounded/unbounded.
// --- determine if node is bounded/unbounded.
Contributor


unnecessary change

Contributor Author


reverted.

@@ -49,6 +63,7 @@ public void shouldRecognizeAndTranslateStreamingPipeline() {
.apply(
ParDo.of(
new DoFn<Long, String>() {

Contributor


unnecessary change

Contributor Author


reverted.


@Test
public void shouldNotPrepareFilesToStageWhenFlinkMasterIsSetToLocal() throws IOException {
FlinkPipelineOptions options = testPreparingResourcesToStage("[local]");
Contributor


These 3 similar tests above could be a single parameterised test, but that's not a big deal for now.
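The suggestion can be sketched as one data-driven loop instead of three near-identical test methods. This is a hypothetical standalone illustration, not Beam's actual test code; the helper name and the set of "local-style" master values ("[local]", "[auto]", empty string) are assumptions:

```java
import java.util.Arrays;
import java.util.List;

public class LocalMasterCasesSketch {

    // Hypothetical stand-in for the check made by each of the three
    // similar tests: no files are prepared for local-style masters.
    static boolean shouldPrepareFilesToStage(String flinkMaster) {
        return !(flinkMaster.equals("[local]")
            || flinkMaster.equals("[auto]")
            || flinkMaster.isEmpty());
    }

    public static void main(String[] args) {
        // One loop over the cases replaces three near-identical methods.
        List<String> localStyleMasters = Arrays.asList("[local]", "[auto]", "");
        for (String master : localStyleMasters) {
            if (shouldPrepareFilesToStage(master)) {
                throw new AssertionError("expected no staging for: " + master);
            }
        }
        System.out.println("all local-style masters skip staging");
    }
}
```

In JUnit this would typically be expressed with a parameterised runner, but the loop shows the same consolidation idea.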

@aromanenko-dev
Contributor

Since the Flink side is OK (thanks to @mxm for the review) and the other part is fine for me as well, it LGTM. Let's wait for green tests; I think we can merge it after.

@aromanenko-dev
Contributor

@lgajowy Could you squash all commits before merging?

@lgajowy
Contributor Author

lgajowy commented Sep 20, 2018

Run Java PreCommit

@lgajowy
Contributor Author

lgajowy commented Sep 20, 2018

@aromanenko-dev I rewrote the history, squashing the commits that were irrelevant. Initially, this PR was all about Spark (BEAM-3371), but I also added one commit related to Flink (BEAM-3370), which is about adding the tests. All this is in light of the recent dev list discussion and docs.

Let me know if this looks good to you too - I have never really received feedback about the shape of the history, although I try to take care of it. It seems that we (committers especially) should be more aware of the git history, so this PR is a good chance to actually get feedback and practice. :)

@aaltay
Member

aaltay commented Sep 28, 2018

What is the status of this PR?

@lgajowy
Contributor Author

lgajowy commented Sep 30, 2018

@aaltay @aromanenko-dev suggested some additional manual checks of what happens if we run this using spark-submit. I didn't test that yet; I should be able to do it later this week.

@aromanenko-dev
Contributor

Run Flink ValidatesRunner

@aromanenko-dev
Contributor

Run Spark ValidatesRunner

@aromanenko-dev
Contributor

@lgajowy I performed testing on my side - I created a fat jar with a basic WordBasic pipeline using mvn artifacts built from the branch of this PR and ran it on my virtual YARN/Spark cluster. I didn't see any issues with that, so I think it LGTM and we can merge it.

@aromanenko-dev aromanenko-dev merged commit e79918c into apache:master Oct 2, 2018
@lgajowy
Contributor Author

lgajowy commented Oct 2, 2018

@aromanenko-dev thanks a lot! :)

@lgajowy lgajowy deleted the spark-integration-tests branch May 20, 2019 14:49
@lgajowy lgajowy restored the spark-integration-tests branch May 20, 2019 14:49
@lgajowy lgajowy deleted the spark-integration-tests branch May 20, 2019 14:51
5 participants