[SPARK-29431][WebUI] Improve Web UI / Sql tab visualization with cached dataframes. #26082

planga82 · 2019-10-10T16:29:13Z

What changes were proposed in this pull request?

With this pull request I want to improve the Web UI / SQL tab visualization. The main problem is that when the plan contains a cached DataFrame, the SQL visualization doesn't show any information about the part of the plan that has been cached.

Before the change

After the change

Why are the changes needed?

When a SQL plan contains cached DataFrames, the SQL tab loses the graphical information for those DataFrames.

Does this PR introduce any user-facing change?

Yes, in the SQL tab.

How was this patch tested?

Unit tests and manual tests through the Spark shell.

wangyum · 2019-10-13T13:37:35Z

ok to test

SparkQA · 2019-10-13T17:07:42Z

Test build #111998 has finished for PR 26082 at commit 967b4f5.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

planga82 · 2019-10-16T19:25:25Z

Conflicts resolved

SparkQA · 2019-10-16T23:00:43Z

Test build #112183 has finished for PR 26082 at commit 7d0ff6a.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

planga82 · 2019-10-17T06:25:24Z

@joshrosen-stripe @wangyum Do you think this is an interesting feature for you too?

joshrosen-stripe · 2019-10-25T17:00:41Z

I don't have the bandwidth to shepherd / review this right now, but I am following along because I'm excited to see this UX pain-point get addressed: the SQL tab is one of my go-to debugging tools, but (prior to this PR) it was really unhelpful when using caching.

I do have a question about the new UX, though:

With today's existing behavior, the "number of rows scanned" metrics from inputs always reflect the total data volume read: if I have a table with 100k rows and I scan the whole thing then I'll see "100k input rows" on the scan node.

With your PR, I'd expect to see 100k rows scanned during the job which initially populates the cache.

What happens if we're reading the cache a second time? It sounds like we'd still display the cached part of the plan, but what metrics would we show? If we had 100% cache hits, would we see empty SQL metrics on the UI (e.g. zero rows scanned)? If we had a mixture of cache hits and misses, would we see metrics corresponding only to what's been recomputed?

I have slight concerns that the metrics might be confusing except to eagle-eyed readers who spot that there's a cache node in the middle of a plan. Maybe we could color nodes upstream of a cache? Or somehow give a clearer visual indication of the cache nodes, maybe via a different color or something? I'm not sure what's the right approach here.

planga82 · 2019-10-26T19:55:38Z

Hi @joshrosen-stripe
What you raise is very interesting. I have run some tests on it. I think the results are coherent, but I'm not sure they are clear enough.
First job with cached dataframe (fully cached)

Second job using the same cached dataframe

In the second job, no data is processed in the cached dataframe.
First job with partial cached dataframe

Second job that uses the previous dataframe but needs to cache the whole dataframe.

For me it is clear enough if you know how cached dataframes work. What do you think?

planga82 · 2019-10-31T07:06:09Z

Hi @srowen,
Do you think this is an interesting feature?
Thanks

srowen · 2019-10-31T15:59:47Z

I don't have a strong opinion on this. Yes I guess the question is whether it clarifies more than confuses. So this always adds the 'relation' to the graph? Sounds plausible but I don't know if for some reason that isn't meant to be shown here with the other nodes. I don't necessarily think the metrics are an issue, but is there any indication at all that the source is a cached DataFrame in the output?

planga82 · 2019-11-01T09:20:58Z

About your questions: yes, it always adds the relation to the graph when an InMemoryTableScan node appears. From what I can see, this kind of node only appears when the cache method is called, so there are no other uses of InMemoryTableScan.
There is no special indication that this is a cached dataframe, only the node type (InMemoryTableScan). We could try giving the node a different color to make it clearer.

srowen · 2019-11-01T13:26:01Z

I think it could be OK. @cloud-fan do you have any opinions - is there any downside to showing more info this way?

srowen · 2019-11-05T14:49:31Z

Expanding to @dongjoon-hyun maybe for a check - does this sound reasonable? I don't see a downside other than a bigger graph, but it contains more info.

cloud-fan · 2019-11-06T05:48:12Z

First job with partial cached dataframe

what do you mean by partial cached df?

In general this idea LGTM. Can you compare the UI with a parquet scan? Similar to parquet scan, the cached table also supports filter pushdown. Maybe better to display it so that users can understand why a query doesn't read all the rows of a cached table.

planga82 · 2019-11-08T20:30:31Z

By partial dataframe I mean, for example, the case where you have a cached dataframe and the first action you run on it is show(10): only a few elements of this dataframe will be cached. You could then run count() on the cached dataframe, which causes the rest of the elements to be cached. This is what I'm trying to show with the images.
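The partial-caching behaviour described above, and the earlier question about which metrics a second job would show, can be sketched with a toy model. This is only an illustration under stated assumptions: `LazyCache`, `take`, and `sourceRowsScanned` are hypothetical names, the cache is assumed to fill one whole partition at a time, and none of this is Spark's actual implementation.

```scala
import scala.collection.mutable

// Toy model of a lazily populated cache (hypothetical, not Spark code).
// Each partition is cached as a whole the first time a job reads it.
final class LazyCache(partitions: Vector[Vector[Int]]) {
  private val cached = mutable.Map.empty[Int, Vector[Int]]

  // Rows read from the underlying source by the most recent job.
  var sourceRowsScanned: Long = 0L

  private def readPartition(i: Int): Vector[Int] =
    cached.getOrElseUpdate(i, {
      // Cache miss: the source is scanned and the metric grows.
      sourceRowsScanned += partitions(i).size
      partitions(i)
    })

  // Like show(n): stops pulling partitions once n rows are available.
  def take(n: Int): Vector[Int] = {
    sourceRowsScanned = 0L
    var out = Vector.empty[Int]
    var i = 0
    while (out.size < n && i < partitions.size) {
      out ++= readPartition(i)
      i += 1
    }
    out.take(n)
  }

  // Like count(): visits every partition, cached or not.
  def count(): Long = {
    sourceRowsScanned = 0L
    partitions.indices.map(i => readPartition(i).size.toLong).sum
  }
}
```

In this model, a `take(10)` over four partitions of 25 rows scans only the first partition, a following `count()` scans the three remaining partitions, and a further `count()` scans nothing because every partition is a cache hit.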

I attach an example with a simple parquet file
sc.parallelize(1 to 100000).toDF("x").withColumn("x1", col("x") + 1).write.parquet("test.parquet")
val df = spark.read.parquet("test.parquet").filter(col("x") < 100).cache()
df.filter(col("x") < 50).count

df.filter(col("x") < 50).count

planga82 · 2019-11-25T19:46:09Z

@dongjoon-hyun What is your opinion about these changes? Thank you

SparkQA · 2019-11-25T21:58:23Z

Test build #114426 has finished for PR 26082 at commit 09b4ed3.

This patch fails Spark unit tests.
This patch merges cleanly.
This patch adds no public classes.

SparkQA · 2019-11-26T00:06:41Z

Test build #114424 has finished for PR 26082 at commit c030902.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

thompsonmax · 2020-01-07T18:44:13Z

Hello, is there any ETA on when this might be merged? I have some follow-up changes for these JIRAs that I'd like to open a PR for.

srowen · 2020-01-07T19:41:25Z

I don't have an opinion on the UI change; I know there is also some rumbling that it's already got far too much going on. I wouldn't open a bunch of JIRAs or changes just yet, unless they are clear wins or simplifications.

viirya · 2020-02-03T20:48:44Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanInfo.scala

+    val relation = plan match {
+      case inMemTab: InMemoryTableScanExec => Seq(fromSparkPlan(inMemTab.relation.cachedPlan))
+      case _ => Seq()
+    }


Can't we just add InMemoryTableScanExec.relation.cachedPlan to children? Then I think you don't need to add relation to SparkPlanInfo.

I did it that way because I didn't want to affect other parts of the code. Since InMemoryTableScan treats it as a relation (not a child), I did the same in SparkPlanInfo.
It's possible that adding it directly to children wouldn't have any unexpected effects, but I'm not sure.

I think SparkPlanInfo is only used for UI? If so, adding cachedPlan into children should be fine.

cloud-fan · 2020-02-04T13:36:34Z

sql/core/src/main/scala/org/apache/spark/sql/execution/SparkPlanInfo.scala

@@ -33,7 +34,8 @@ class SparkPlanInfo(
    val simpleString: String,
    val children: Seq[SparkPlanInfo],
    val metadata: Map[String, String],
-    val metrics: Seq[SQLMetricInfo]) {
+    val metrics: Seq[SQLMetricInfo],
+    val relation: Seq[SparkPlanInfo] = Seq()) {


I'm a little worried about adding a new parameter to SparkPlanInfo. What are its semantics in the general case?

I tried to replicate the structure of InMemoryTableScan, where the relation is a separate element and not a child.
This attribute has a default value, so it has no impact where it is not used; it is only used in this case. Do you think there is a better way to solve it without the new attribute? Thanks!

Can we treat InMemoryTableScan.relation as its child in the UI?

planga82 · 2020-02-04T22:27:01Z

@viirya @cloud-fan I'm going to change the PR to add directly to the child. Thanks for your comments
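A minimal sketch of the approach agreed on above, treating the cached plan as a child when building the UI tree. The types here (`Plan`, `Node`, `CachedScan`, `PlanInfo`, `PlanInfoSketch`) are simplified, hypothetical stand-ins and not the real `SparkPlan`, `InMemoryTableScanExec`, or `SparkPlanInfo` classes, so this shows only the shape of the idea, not the merged change.

```scala
// Hypothetical, simplified stand-ins for Spark's physical plan nodes.
sealed trait Plan {
  def name: String
  def children: Seq[Plan]
}

final case class Node(name: String, children: Seq[Plan] = Nil) extends Plan

// Models a cache scan: a leaf in the physical plan whose cached plan
// lives in a separate field rather than in `children`.
final case class CachedScan(cachedPlan: Plan) extends Plan {
  val name: String = "InMemoryTableScan"
  val children: Seq[Plan] = Nil
}

// Models the UI-side tree that the SQL tab renders.
final case class PlanInfo(name: String, children: Seq[PlanInfo])

object PlanInfoSketch {
  // Convert a plan to its UI representation, exposing the cached plan
  // of a cache scan as an ordinary child.
  def fromPlan(plan: Plan): PlanInfo = {
    val children = plan match {
      case scan: CachedScan => Seq(scan.cachedPlan)
      case other            => other.children
    }
    PlanInfo(plan.name, children.map(fromPlan))
  }

  // Indented text rendering, one node per line.
  def render(info: PlanInfo, depth: Int = 0): String =
    (("  " * depth + info.name) +: info.children.map(render(_, depth + 1))).mkString("\n")
}
```

With this shape, no extra field is needed on the UI-side class: the part of the plan below the cache simply appears under the cache scan node, which is the simplification suggested in the review.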

planga82 · 2020-02-15T18:52:07Z

@viirya @cloud-fan I have updated the PR with the comments. Sorry for the wait!

SparkQA · 2020-02-15T22:42:38Z

Test build #118482 has finished for PR 26082 at commit c383de4.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

cloud-fan · 2020-02-17T09:01:13Z

LGTM, cc @gengliangwang

github-actions · 2020-05-28T00:30:51Z

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

EnricoMi · 2020-05-28T05:23:04Z

@cloud-fan @gengliangwang what are the next steps with this PR once @planga82 has resolved the conflicts with master?

cloud-fan · 2020-06-02T02:33:19Z

Let's fix the conflict and get it merged.

gengliangwang

LGTM.

gengliangwang · 2020-06-02T09:01:50Z

@planga82 the conflict was simple. I wasn't sure whether you have been active recently, so I just resolved it for you.

gengliangwang · 2020-06-02T09:02:37Z

ok to test

EnricoMi · 2020-06-02T09:02:40Z

Thanks guys for picking this up again, awesome!

planga82 · 2020-06-02T09:22:26Z

Thanks @gengliangwang !

SparkQA · 2020-06-02T13:55:28Z

Test build #123427 has finished for PR 26082 at commit a02a53e.

This patch passes all tests.
This patch merges cleanly.
This patch adds no public classes.

gengliangwang · 2020-06-03T00:24:35Z

Thanks, merging to master

dongjoon-hyun added the WEB UI label Oct 10, 2019

planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch from 967b4f5 to 7d0ff6a Compare October 16, 2019 19:24

planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch 2 times, most recently from c030902 to 09b4ed3 Compare November 25, 2019 19:43

viirya reviewed Feb 3, 2020

cloud-fan reviewed Feb 4, 2020

planga82 added 2 commits February 15, 2020 09:19

[SPARK-29431][WebUI] Improve Sql tab cached dataframes

bc06c1d

[SPARK-29431][WebUI] Add inmemorytablescan relation to sparkInfo chil…

c383de4

…d nodes

planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch from 09b4ed3 to c383de4 Compare February 15, 2020 18:48

github-actions bot added the Stale label May 28, 2020

github-actions bot closed this May 29, 2020

cloud-fan removed the Stale label Jun 2, 2020

cloud-fan reopened this Jun 2, 2020

Merge branch 'master' into feature/SPARK-29431_SQL_Cache_webUI

a02a53e

probot-autolabeler bot added the SQL label Jun 2, 2020

gengliangwang approved these changes Jun 2, 2020

gengliangwang closed this in e4db3b5 Jun 3, 2020
