Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-29431][WebUI] Improve Web UI / Sql tab visualization with cached dataframes. #26082

Closed

Conversation

planga82
Copy link
Contributor

What changes were proposed in this pull request?

With this pull request I want to improve the Web UI / SQL tab visualization. The principal problem that I find is when you have a cache in your plan, the SQL visualization don’t show any information about the part of the plan that has been cached.

Before the change
image
After the change
image

Why are the changes needed?

When we have a SQL plan with cached dataframes we lose the graphical information of this dataframe in the sql tab

Does this PR introduce any user-facing change?

Yes, in the sql tab

How was this patch tested?

Unit testing and manual tests throught spark shell

@wangyum
Copy link
Member

wangyum commented Oct 13, 2019

ok to test

@SparkQA
Copy link

SparkQA commented Oct 13, 2019

Test build #111998 has finished for PR 26082 at commit 967b4f5.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@planga82 planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch from 967b4f5 to 7d0ff6a Compare October 16, 2019 19:24
@planga82
Copy link
Contributor Author

Conflicts resolved

@SparkQA
Copy link

SparkQA commented Oct 16, 2019

Test build #112183 has finished for PR 26082 at commit 7d0ff6a.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@planga82
Copy link
Contributor Author

planga82 commented Oct 17, 2019

@joshrosen-stripe @wangyum Do you think it is and interesting feature for you too?

@joshrosen-stripe
Copy link
Contributor

I don't have the bandwidth to shepherd / review this right now, but I am following along because I'm excited to see this UX pain-point get addressed: the SQL tab is one of my go-to debugging tools, but (prior to this PR) it was really unhelpful when using caching.

I do have a question about the new UX, though:

With today's existing behavior, the "number of rows scanned" metrics from inputs always reflect the total data volume read: if I have a table with a 100k rows and I scan the whole thing then I'll see "100k input rows" on the scan node.

With your PR, I'd expect to see 100k rows scanned during the job which initially populates the cache.

What happens if we're reading the cache a second time? It sounds like we'd still display the cached part of the plan, but what metrics would we show? If we had 100% cache hits, would we see empty SQL metrics on the UI (e.g. zero rows scanned)? If we had a mixture of cache hits and misses, would we see metrics corresponding only to what's been recomputed?

I have slight concerns that the metrics might be confusing except to eagle-eyed readers who spot that there's a cache node in the middle of a plan. Maybe we could color nodes upstream of a cache? Or somehow give a clearer visual indication of the cache nodes, maybe via a different color or something? I'm not sure what's the right approach here.

@planga82
Copy link
Contributor Author

Hi @joshrosen-stripe
It's very interesting what are you talking about. I have done some tests on it. I think the results are coherent but I'm not sure if enough clear.
First job with cached dataframe (full cached)
image
Second job using the same cached dataframe
image
At the second job thre is no data processed at the cached dataframe.
First job with partial cached dataframe
image
Second job that uses the previous dataframe but need to cache all the dataframe.
image
For me is enough clear if you know how cached dataframes are. ¿What do you think?

@planga82
Copy link
Contributor Author

Hi @srowen,
Do you think it is and interesting feature?
Thanks

@srowen
Copy link
Member

srowen commented Oct 31, 2019

I don't have a strong opinion on this. Yes I guess the question is whether it clarifies more than confuses. So this always adds the 'relation' to the graph? Sounds plausible but I don't know if for some reason that isn't meant to be shown here with the other nodes. I don't necessarily think the metrics are an issue, but is there any indication at all that the source is a cached DataFrame in the output?

@planga82
Copy link
Contributor Author

planga82 commented Nov 1, 2019

About your questions, yes, it add the relation to the graph always when appears InMemoryTableScan node. Looking for this kind of node it seems only appear when the chache method is called so there is no other uses of InMemorytableScan.
There is no special indication that this is a cached dataframe, only the node type (InMemoryTableScan). We could try to put a diferent color to the node to clarify it.

@srowen
Copy link
Member

srowen commented Nov 1, 2019

I think it could be OK. @cloud-fan do you have any opinions - is there any downside to showing more info this way?

@srowen
Copy link
Member

srowen commented Nov 5, 2019

Expanding to @dongjoon-hyun maybe for a check - does this sound reasonable? I don't see a downside other than a bigger graph, but it contains more info.

@cloud-fan
Copy link
Contributor

First job with partial cached dataframe

what do you mean by partial cached df?

In general this idea LGTM. Can you compare the UI with a parquet scan? Similar to parquet scan, the cached table also supports filter pushdown. Maybe better to display it so that users can understand why a query doesn't read all the rows of a cached table.

@planga82
Copy link
Contributor Author

planga82 commented Nov 8, 2019

With partial dataframe I mean, for example, when you have a cached dataframe, and the first action you make on it is show(10), only a few elements of this dataframe will be cached. You could do count() on the cached dataframe and this will make the rest of the elements be cached. This is what I'm trying to show with the images.

I attach an example with a simple parquet file
sc.parallelize(1 to 100000).toDF("x").withColumn("x1",col("x") + 1) .write.parquet("test.parquet")
val df = spark.read.parquet("test.parquet").filter(col("x")<100).cache()
res3.filter(col("x")<50).count
image
res3.filter(col("x")<50).count
image

@planga82 planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch 2 times, most recently from c030902 to 09b4ed3 Compare November 25, 2019 19:43
@planga82
Copy link
Contributor Author

@dongjoon-hyun What is your opinion about this changes? Thank you

@SparkQA
Copy link

SparkQA commented Nov 25, 2019

Test build #114426 has finished for PR 26082 at commit 09b4ed3.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA
Copy link

SparkQA commented Nov 26, 2019

Test build #114424 has finished for PR 26082 at commit c030902.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@thompsonmax
Copy link

Hello, is there any ETA on when this might be merged? I have some follow-up changes I'd like to create a PR for these JIRAs.

@srowen
Copy link
Member

srowen commented Jan 7, 2020

I don't have an opinion on the UI change; I know there is also some rumbling that it's already got far too much going on. I wouldn't open a bunch of JIRAs or changes just yet, unless they are clear wins or simplifications.

Comment on lines 79 to 82
val relation = plan match {
case inMemTab: InMemoryTableScanExec => Seq(fromSparkPlan(inMemTab.relation.cachedPlan))
case _ => Seq()
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cannot we just add InMemoryTableScanExec.relation.cachedPlan into children? Then I think you don't need to add relation to SparkPlanInfo.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I do in that way because I don't want to impact in other parts of the code. As in InMemoryTableScan is treated as a relation(not a children) I do in the same way in SparkPlanInfo.
It's possible that if we add directly to children it don't affect unexpectedly but I'm not sure.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think SparkPlanInfo is only used for UI? If so, adding cachedPlan into children should be fine.

@@ -33,7 +34,8 @@ class SparkPlanInfo(
val simpleString: String,
val children: Seq[SparkPlanInfo],
val metadata: Map[String, String],
val metrics: Seq[SQLMetricInfo]) {
val metrics: Seq[SQLMetricInfo],
val relation: Seq[SparkPlanInfo] = Seq()) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm a little worried about adding a new parameter to SparkPlanInfo. What's the semantic of it for general cases?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I tried to replicate the same structure that in InMemoryTableScan where the relation is a different element and is not a children.
This attribute has a default value so it not impact in any case where is not used. Only in this case is used. Do you think there is a better way to solve it without the new attribute? Thanks!

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we treat InMemoryTableScan.relation as its child in the UI?

@planga82
Copy link
Contributor Author

planga82 commented Feb 4, 2020

@viirya @cloud-fan I'm going to change the PR to add directly to the child. Thanks for your comments

@planga82 planga82 force-pushed the feature/SPARK-29431_SQL_Cache_webUI branch from 09b4ed3 to c383de4 Compare February 15, 2020 18:48
@planga82
Copy link
Contributor Author

@viirya @cloud-fan I have updated the PR with the comments. Sorry for the wait!

@SparkQA
Copy link

SparkQA commented Feb 15, 2020

Test build #118482 has finished for PR 26082 at commit c383de4.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@cloud-fan
Copy link
Contributor

LGTM, cc @gengliangwang

@github-actions
Copy link

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label May 28, 2020
@EnricoMi
Copy link
Contributor

@cloud-fan @gengliangwang what are the next steps with this PR once @planga82 resolved conflicts with master?

@github-actions github-actions bot closed this May 29, 2020
@cloud-fan cloud-fan removed the Stale label Jun 2, 2020
@cloud-fan cloud-fan reopened this Jun 2, 2020
@cloud-fan
Copy link
Contributor

Let's fix the conflict and get it merged.

Copy link
Member

@gengliangwang gengliangwang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM.

@gengliangwang
Copy link
Member

@planga82 the conflict was simple. I am not sure if you are active recently. So I just resolved it for you.

@gengliangwang
Copy link
Member

ok to test

@EnricoMi
Copy link
Contributor

EnricoMi commented Jun 2, 2020

Thanks guys for picking this up again, awesome!

@planga82
Copy link
Contributor Author

planga82 commented Jun 2, 2020

Thanks @gengliangwang !

@SparkQA
Copy link

SparkQA commented Jun 2, 2020

Test build #123427 has finished for PR 26082 at commit a02a53e.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@gengliangwang
Copy link
Member

Thanks, merging to master

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.