Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SPARK-48928] Log Warning for Calling .unpersist() on Locally Checkpointed RDDs #47391

Closed
wants to merge 1 commit into from

Conversation

mingkangli-db
Copy link
Contributor

What changes were proposed in this pull request?

This pull request proposes logging a warning message when the .unpersist() method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

Does this PR introduce any user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

Was this patch authored or co-authored using generative AI tooling?

No.

@github-actions github-actions bot added the CORE label Jul 17, 2024
@mingkangli-db mingkangli-db marked this pull request as ready for review July 17, 2024 18:01
Copy link
Contributor

@jiangxb1987 jiangxb1987 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@mridulm
Copy link
Contributor

mridulm commented Jul 19, 2024

It is unclear to me what the value of adding this to unpersist is.
Use of local checkpoint'ing, irrespective of whether it was persist'ed or not, is risky.

For example, if there is a node loss, things will fail - even without an unpersist call.

@Ngone51
Copy link
Member

Ngone51 commented Jul 19, 2024

It is unclear to me what the value of adding this to unpersist is.

I think it at least gives users an explict warning in first place rather than relying on the lazily triggered error msg in the case of the misuse of local checkpoint. And local checkpoint is still used by the public API today, Dataset#localCheckpoint(). So might be ok to add it.

@mridulm
Copy link
Contributor

mridulm commented Jul 20, 2024

My point is, use of persist/unpersist is orthogonal to this - so why special case only here ?
Better would be to add it when user invokes local checkpoint

@mingkangli-db
Copy link
Contributor Author

mingkangli-db commented Jul 23, 2024

My point is, use of persist/unpersist is orthogonal to this - so why special case only here ?

Yes, you are right that the use of persist/unpersist can be orthogonal to this. However, we special case because calling.unpersist() after local checkpointing will lead to job failure if any recomputation happens. .unpersist() is usually a safe operation because the RDD can still be recomputed if unpersist() is called, but not in this case.

Better would be to add it when user invokes local checkpoint

When user invokes .localCheckpoint(), this is fine as long as they understand the risk of having no fault-tolerance, but the combo "localCheckpoint() + unpersist() + any action" leading to job failure can happen even when there is no executor failure, so this is just to make users aware of the potential consequences.

@mridulm
Copy link
Contributor

mridulm commented Jul 23, 2024

If there is no executor failure/loss, whether unpersist is called or not, there should be no failure as there is locally checkpointed data on the node.
Are you seeing this not being the case ?
(It has been quite a while since I looked at this though, so hopefully I am not misremembering)

@mingkangli-db
Copy link
Contributor Author

If there is no executor failure/loss, whether unpersist is called or not, there should be no failure as there is locally checkpointed data on the node. Are you seeing this not being the case ? (It has been quite a while since I looked at this though, so hopefully I am not misremembering)

You're right that when there is no executor failure, there is no failure when you call .localCheckpoint() and .unpersist() by themselves. However, if you perform any action on the RDD afterwards, it will lead to an exception. This happens because, after unpersisting, the lineage is cut, and the RDD cannot be recomputed. For example:

  val checkpointedDf = sql("select * from range(10)")
    .localCheckpoint(eager = false)
  val rdd = checkpointedDf.queryExecution.analyzed.asInstanceOf[LogicalRDD].rdd
  checkpointedDf.collect()
  rdd.unpersist()
  
// EXCEPTION HAPPENS HERE
    checkpointedDf.collect()

The last line results in an exception because the checkpointed RDD is evicted, even without node failure. Logging a warning after rdd.unpersist() would make it clearer to users what the consequence would be.

@mridulm
Copy link
Contributor

mridulm commented Jul 23, 2024

Ah, I see what you mean - sorry, I had forgotten that this is simply relying on storage level and not actually writing to the local disk (similar to hdfs for regular checkpoint).

Make sense to add this.

@mridulm mridulm closed this in 118167f Jul 23, 2024
@mridulm
Copy link
Contributor

mridulm commented Jul 23, 2024

Merged to master, thanks for fixing this @mingkangli-db !
Thanks for the reviews @jiangxb1987 , @Ngone51 :-)

Also thanks to everyone for answering my queries - I really was misremembering how local checkpoint was happening.

ilicmarkodb pushed a commit to ilicmarkodb/spark that referenced this pull request Jul 29, 2024
…inted RDDs

### What changes were proposed in this pull request?

This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

### Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

### How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47391 from mingkangli-db/warning_unpersist.

Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
fusheng-rd pushed a commit to fusheng-rd/spark that referenced this pull request Aug 6, 2024
…inted RDDs

### What changes were proposed in this pull request?

This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

### Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

### How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47391 from mingkangli-db/warning_unpersist.

Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
attilapiros pushed a commit to attilapiros/spark that referenced this pull request Oct 4, 2024
…inted RDDs

### What changes were proposed in this pull request?

This pull request proposes logging a warning message when the `.unpersist()` method is called on RDDs that have been locally checkpointed. The goal is to inform users about the potential risks associated with unpersisting locally checkpointed RDDs without changing the current behavior of the method.

### Why are the changes needed?

Local checkpointing truncates the lineage of an RDD, preventing it from being recomputed from its source. If a locally checkpointed RDD is unpersisted, it loses its data and cannot be regenerated, potentially leading to job failures if subsequent actions or transformations are attempted on the RDD (which was seen on some user workloads). Logging a warning message helps users avoid such pitfalls and aids in debugging.

### Does this PR introduce _any_ user-facing change?

Yes, this PR adds a warning log message when .unpersist() is called on a locally checkpointed RDD, but it does not alter any existing behavior.

### How was this patch tested?

This PR does not change any existing behavior and therefore no tests are added.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes apache#47391 from mingkangli-db/warning_unpersist.

Authored-by: Mingkang Li <mingkang.li@databricks.com>
Signed-off-by: Mridul Muralidharan <mridul<at>gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants