[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

mingkangli-db · 2024-07-17T18:01:16Z

What changes were proposed in this pull request?

This pull request proposes logging a warning message when the .unpersist() method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

Does this PR introduce any user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

Was this patch authored or co-authored using generative AI tooling?

No.

jiangxb1987

LGTM

mridulm · 2024-07-19T06:05:13Z

It is unclear to me what the value of adding this to unpersist is.
Use of local checkpoint'ing, irrespective of whether it was persist'ed or not, is risky.

For example, if there is a node loss, things will fail - even without an unpersist call.

Ngone51 · 2024-07-19T08:19:10Z

It is unclear to me what the value of adding this to unpersist is.

I think it at least gives users an explict warning in first place rather than relying on the lazily triggered error msg in the case of the misuse of local checkpoint. And local checkpoint is still used by the public API today, Dataset#localCheckpoint(). So might be ok to add it.

mridulm · 2024-07-20T20:24:37Z

My point is, use of persist/unpersist is orthogonal to this - so why special case only here ?
Better would be to add it when user invokes local checkpoint

mingkangli-db · 2024-07-23T00:51:30Z

My point is, use of persist/unpersist is orthogonal to this - so why special case only here ?

Yes, you are right that the use of persist/unpersist can be orthogonal to this. However, we special case because calling.unpersist() after local checkpointing will lead to job failure if any recomputation happens. .unpersist() is usually a safe operation because the RDD can still be recomputed if unpersist() is called, but not in this case.

Better would be to add it when user invokes local checkpoint

When user invokes .localCheckpoint(), this is fine as long as they understand the risk of having no fault-tolerance, but the combo "localCheckpoint() + unpersist() + any action" leading to job failure can happen even when there is no executor failure, so this is just to make users aware of the potential consequences.

mridulm · 2024-07-23T07:52:04Z

If there is no executor failure/loss, whether unpersist is called or not, there should be no failure as there is locally checkpointed data on the node.
Are you seeing this not being the case ?
(It has been quite a while since I looked at this though, so hopefully I am not misremembering)

mingkangli-db · 2024-07-23T17:16:42Z

If there is no executor failure/loss, whether unpersist is called or not, there should be no failure as there is locally checkpointed data on the node. Are you seeing this not being the case ? (It has been quite a while since I looked at this though, so hopefully I am not misremembering)

You're right that when there is no executor failure, there is no failure when you call .localCheckpoint() and .unpersist() by themselves. However, if you perform any action on the RDD afterwards, it will lead to an exception. This happens because, after unpersisting, the lineage is cut, and the RDD cannot be recomputed. For example:

  val checkpointedDf = sql("select * from range(10)")
    .localCheckpoint(eager = false)
  val rdd = checkpointedDf.queryExecution.analyzed.asInstanceOf[LogicalRDD].rdd
  checkpointedDf.collect()
  rdd.unpersist()
  
// EXCEPTION HAPPENS HERE
    checkpointedDf.collect()

The last line results in an exception because the checkpointed RDD is evicted, even without node failure. Logging a warning after rdd.unpersist() would make it clearer to users what the consequence would be.

mridulm · 2024-07-23T19:40:47Z

Ah, I see what you mean - sorry, I had forgotten that this is simply relying on storage level and not actually writing to the local disk (similar to hdfs for regular checkpoint).

Make sense to add this.

mridulm · 2024-07-23T19:42:59Z

Merged to master, thanks for fixing this @mingkangli-db !
Thanks for the reviews @jiangxb1987 , @Ngone51 :-)

Also thanks to everyone for answering my queries - I really was misremembering how local checkpoint was happening.

…inted RDDs ### What changes were proposed in this pull request? This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method. ### Why are the changes needed? Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging. ### Does this PR introduce _any_ user-facing change? Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior. ### How was this patch tested? This PR does not change any existing behavior and therefore no tests are added. ### Was this patch authored or co-authored using generative AI tooling? No. Closes apache#47391 from mingkangli-db/warning_unpersist. Authored-by: Mingkang Li <mingkang.li@databricks.com> Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>

add warning message

9688c84

github-actions bot added the CORE label Jul 17, 2024

mingkangli-db marked this pull request as ready for review July 17, 2024 18:01

jiangxb1987 approved these changes Jul 17, 2024

View reviewed changes

Ngone51 approved these changes Jul 19, 2024

View reviewed changes

mridulm approved these changes Jul 23, 2024

View reviewed changes

mridulm closed this in 118167f Jul 23, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

mingkangli-db commented Jul 17, 2024

jiangxb1987 left a comment

mridulm commented Jul 19, 2024 •

edited

Loading

Ngone51 commented Jul 19, 2024

mridulm commented Jul 20, 2024

mingkangli-db commented Jul 23, 2024 •

edited

Loading

mridulm commented Jul 23, 2024 •

edited

Loading

mingkangli-db commented Jul 23, 2024

mridulm commented Jul 23, 2024

mridulm commented Jul 23, 2024

[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

Conversation

mingkangli-db commented Jul 17, 2024

What changes were proposed in this pull request?

Why are the changes needed?

Does this PR introduce any user-facing change?

How was this patch tested?

Was this patch authored or co-authored using generative AI tooling?

jiangxb1987 left a comment

Choose a reason for hiding this comment

mridulm commented Jul 19, 2024 • edited Loading

Ngone51 commented Jul 19, 2024

mridulm commented Jul 20, 2024

mingkangli-db commented Jul 23, 2024 • edited Loading

mridulm commented Jul 23, 2024 • edited Loading

mingkangli-db commented Jul 23, 2024

mridulm commented Jul 23, 2024

mridulm commented Jul 23, 2024

mridulm commented Jul 19, 2024 •

edited

Loading

mingkangli-db commented Jul 23, 2024 •

edited

Loading

mridulm commented Jul 23, 2024 •

edited

Loading