Add a config option to print DAG. #4257

Closed
KaiXinXiaoLei wants to merge 5 commits from KaiXinXiaoLei/DAGprint


KaiXinXiaoLei

Add a config option "spark.rddDebug.enable" that controls whether DAG info is printed. When "spark.rddDebug.enable" is true, information about the DAG is printed to the log.

@AmplabJenkins

Can one of the admins verify this patch?

@ScrapCodes
Member

Can't you print the same details by calling rdd.toDebugString in your program?
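
For readers unfamiliar with it, here is a minimal sketch of printing the lineage by hand with `rdd.toDebugString`. The object name `DagPrintSketch`, the helper `wordCountLineage`, the `local[2]` master, and the sample data are illustrative assumptions and are not part of this patch.

```scala
import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.rdd.RDD

// Note: on Spark versions before 1.3, reduceByKey also needs
// `import org.apache.spark.SparkContext._` for the pair-RDD implicits.
object DagPrintSketch {

  // Builds a small word-count job (illustrative data) and returns its lineage.
  def wordCountLineage(sc: SparkContext): String = {
    val counts: RDD[(String, Int)] = sc.parallelize(Seq("a b", "b c"))
      .flatMap(_.split(" "))
      .map(word => (word, 1))
      .reduceByKey(_ + _)
    // toDebugString describes the RDD and its recursive dependencies as text.
    counts.toDebugString
  }

  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("dag-print-sketch").setMaster("local[2]")
    val sc = new SparkContext(conf)
    try {
      println(wordCountLineage(sc))
    } finally {
      sc.stop()
    }
  }
}
```

This works only for RDDs the application code can reach directly, which is the limitation raised later in the thread.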

@KaiXinXiaoLei
Author

This adds a configuration parameter "spark.rddDebug.enable", whose default value is false. When "spark.rddDebug.enable" is true, the DAG tree is printed in the log.
For example:
15/01/26 22:38:37 INFO SparkContext: RDD.toDebugString:
(12) MappedRDD[13] at map at dagPrint.scala:26 []
| MappedRDD[12] at map at dagPrint.scala:25 []
| FlatMappedValuesRDD[11] at join at dagPrint.scala:24 []
| MappedValuesRDD[10] at join at dagPrint.scala:24 []
| CoGroupedRDD[9] at join at dagPrint.scala:24 []
| ShuffledRDD[4] at reduceByKey at dagPrint.scala:17 []
+-(2) MappedRDD[3] at map at dagPrint.scala:16 []
| MappedRDD[2] at map at dagPrint.scala:15 []
| /data.txt MappedRDD[1] at textFile at dagPrint.scala:14 []
| /data.txt HadoopRDD[0] at textFile at dagPrint.scala:14 []
+-(2) MappedRDD[8] at map at dagPrint.scala:22 []
| MappedRDD[7] at map at dagPrint.scala:21 []
| /data2.txt MappedRDD[6] at textFile at dagPrint.scala:20 []
| /data2.txt HadoopRDD[5] at textFile at dagPrint.scala:20 []

When the option is false, no DAG info is printed.

@rxin
Contributor

rxin commented Jan 31, 2015

I think @ScrapCodes's question is ... is this necessary, given users can easily print the dag themselves?

@pwendell
Contributor

pwendell commented Feb 2, 2015

@rxin I have noticed that very few users know about toDebugString. Maybe we should open a JIRA to add better documentation for that function (i.e. discuss it in the programming guide). Overall, I agree with you and @ScrapCodes in that I'm not sure this particular flag is super useful.

@AmplabJenkins

Can one of the admins verify this patch?

@rxin
Contributor

rxin commented Feb 2, 2015

Actually, now that I think about it, users don't always control the RDDs, for example when they are using the ML pipeline API or calling some other libraries. Maybe there is some merit in including this, especially if it is off by default?

@pwendell
Contributor

pwendell commented Feb 3, 2015

Fair, but the issue is that in some cases (e.g. GraphX) the printed representation of the DAG can be many hundreds of lines long. Could that potentially explode the output?

@KaiXinXiaoLei
Author

I think that when running an application, users only need to change a config option, not modify the code and rebuild the binary. The default value of this config option is false, so users will only change it when they need the DAG info. So I think the chance of exploded output is small, if it exists at all.

@KaiXinXiaoLei
Author

@pwendell
For example, say I have built a binary package and deployed it on a Spark cluster. If I now want to get the DAG info, I don't need to change the code and build again; I just change a config option. Once I have the info, I change the config option back to false. So I don't think exploded output is possible. Thanks.
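
The on/off behaviour described above could be sketched roughly as follows. This is a hypothetical helper and not the code in this patch: `LineageGateSketch` and `lineageToLog` are made-up names, the lineage text in `main` is a placeholder, and only `SparkConf.set` and `SparkConf.getBoolean` are real Spark calls. The lineage string is passed by name, so it is never built while the option is false.

```scala
import org.apache.spark.SparkConf

object LineageGateSketch {

  // Hypothetical helper: returns the text to log, or None when the flag is off.
  // `lineage` is a by-name parameter, so it is evaluated only if the flag is true.
  def lineageToLog(conf: SparkConf, lineage: => String): Option[String] =
    if (conf.getBoolean("spark.rddDebug.enable", false)) {
      Some("RDD.toDebugString:\n" + lineage)
    } else {
      None
    }

  def main(args: Array[String]): Unit = {
    // SparkConf(false) skips loading defaults from system properties.
    val off = new SparkConf(false)
    val on = new SparkConf(false).set("spark.rddDebug.enable", "true")

    // With the flag off, the by-name argument is never evaluated.
    println(lineageToLog(off, sys.error("never evaluated")))
    // With the flag on, the (placeholder) lineage text is wrapped for logging.
    println(lineageToLog(on, "placeholder lineage text"))
  }
}
```

Whether very long lineages (the GraphX case mentioned earlier) are acceptable when the flag is on is a separate question that this sketch does not address.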

@srowen
Member

srowen commented Feb 9, 2015

ok to test

@SparkQA

SparkQA commented Feb 9, 2015

Test build #27096 has started for PR 4257 at commit adcb14f.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 9, 2015

Test build #27096 has finished for PR 4257 at commit adcb14f.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27096/
Test PASSed.

@rxin
Contributor

rxin commented Feb 9, 2015

Maybe we can include this, provided that it is off by default.

That said, I think a better name for this option is "spark.logLineage".

change config option from "spark.rddDebug.enable" to "spark.logLineage"
@SparkQA

SparkQA commented Feb 10, 2015

Test build #27149 has started for PR 4257 at commit 83c2b32.

  • This patch merges cleanly.

@KaiXinXiaoLei
Author

@rxin
OK. I have changed the option name to "spark.logLineage". Thanks.
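
With the rename, enabling the option from application code might look like the sketch below. The object name `LogLineageConfSketch` is an assumption for illustration. In practice the property would more likely be supplied outside the code (for example through `spark-submit --conf spark.logLineage=true`), which is the no-rebuild workflow argued for earlier in the thread.

```scala
import org.apache.spark.SparkConf

object LogLineageConfSketch {

  // Builds a configuration with lineage logging switched on.
  // SparkConf(false) avoids picking up unrelated system properties.
  def withLineageLogging(): SparkConf =
    new SparkConf(false).set("spark.logLineage", "true")

  def main(args: Array[String]): Unit = {
    val conf = withLineageLogging()
    // The option is read as a boolean with a default of false.
    println(conf.getBoolean("spark.logLineage", false))
  }
}
```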

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27154 has started for PR 4257 at commit c27ee76.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27154 has finished for PR 4257 at commit c27ee76.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27154/
Test FAILed.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27149 has finished for PR 4257 at commit 83c2b32.

  • This patch fails MiMa tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test FAILed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27149/
Test FAILed.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27166 has started for PR 4257 at commit d9fe42e.

  • This patch merges cleanly.

@SparkQA

SparkQA commented Feb 10, 2015

Test build #27166 has finished for PR 4257 at commit d9fe42e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@AmplabJenkins

Test PASSed.
Refer to this link for build results (access rights to CI server needed):
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/27166/
Test PASSed.

@rxin
Contributor

rxin commented Feb 10, 2015

Thanks. I've merged this.

asfgit pushed a commit that referenced this pull request Feb 10, 2015
Add a config option "spark.rddDebug.enable" to check whether to print DAG info. When "spark.rddDebug.enable" is true, it will print information about DAG in the log.

Author: KaiXinXiaoLei <huleilei1@huawei.com>

Closes #4257 from KaiXinXiaoLei/DAGprint and squashes the following commits:

d9fe42e [KaiXinXiaoLei] change  log info
c27ee76 [KaiXinXiaoLei] change log info
83c2b32 [KaiXinXiaoLei] change config option
adcb14f [KaiXinXiaoLei] change the file.
f4e7b9e [KaiXinXiaoLei] add a option to print DAG

(cherry picked from commit 31d435e)
Signed-off-by: Reynold Xin <rxin@databricks.com>
@asfgit asfgit closed this in 31d435e Feb 10, 2015
7 participants