Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

TEST-#6460: don't use 'repr' to force materialization #6461

Merged
merged 10 commits into from
Sep 5, 2023

Conversation

anmyachev
Copy link
Collaborator

@anmyachev anmyachev commented Aug 5, 2023

What do these changes do?

  • first commit message and PR title follow format outlined here

    NOTE: If you edit the PR title to match this format, you need to add another commit (even if it's empty) or amend your last commit for the CI job that checks the PR title to pick up the new PR title.

  • passes flake8 modin/ asv_bench/benchmarks scripts/doc_checker.py
  • passes black --check modin/ asv_bench/benchmarks scripts/doc_checker.py
  • signed commit with git commit -s
  • Resolves Do not use repr function to materialize the result of a function execution #6460
  • tests added and passing
  • module layout described at docs/development/architecture.rst is up-to-date

@anmyachev anmyachev marked this pull request as ready for review August 6, 2023 15:09
@anmyachev anmyachev requested review from a team as code owners August 6, 2023 15:09
Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be at least two places with missing arguments... and I've left a few suggestions on using "more public" API instead of protected ._to_pandas()

modin/experimental/pandas/test/test_io_exp.py Outdated Show resolved Hide resolved
modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/pandas/test/test_series.py Outdated Show resolved Hide resolved
modin/pandas/test/utils.py Outdated Show resolved Hide resolved
@vnlitvinov
Copy link
Collaborator

Also @anmyachev I'm not entirely sure this fixes the problem unearthed in #6442 - I think the analysis there rightfully assumes that "at least on Ray, that error is propagated from the worker to the main process as soon as it happens, which means that another worker process may still be doing stuff in the background. The error "completes" the test, so we think we are ready for teardown", and I don't think ._to_pandas() would be any different here - i.e. that it would wait for all tasks to finish before erroring out. I suppose we'd have to use mgr.wait_partitions() or something similar somehow instead.

@anmyachev
Copy link
Collaborator Author

anmyachev commented Aug 7, 2023

Also @anmyachev I'm not entirely sure this fixes the problem unearthed in #6442 - I think the analysis there rightfully assumes that "at least on Ray, that error is propagated from the worker to the main process as soon as it happens, which means that another worker process may still be doing stuff in the background. The error "completes" the test, so we think we are ready for teardown", and I don't think ._to_pandas() would be any different here - i.e. that it would wait for all tasks to finish before erroring out. I suppose we'd have to use mgr.wait_partitions() or something similar somehow instead.

Actually, in order to get an exception on the main process, it must be explicitly materialized through ray.get function.
Consider the example:

import ray
import time
ray.init(num_cpus=4)

def example():
    raise ValueError("something isn't working")

rem_example = ray.remote(example)
future = rem_example.remote() # call it
time.sleep(10)  # to make sure exception is happened in remote `example` function
print("after sleep")  # this code works even if a exception is happened

ray.get(future)  # get the exception on the main process
print("after get")  # this code doesn't work, there was an exception in the previous line

UPD: I realized that we are talking about two different problems. The first is that we do not materialize all partitions due to usage of repr function. The second is that if an error occurs in one of the partitions, then we do not wait for the materialization of other partitions to finish, where resources may still be occupied.

@vnlitvinov
Copy link
Collaborator

The second is that if an error occurs in one of the partitions, then we do not wait for the materialization of other partitions to finish, where resources may still be occupied.

Yeah, that was the problem I talked about.

For the repr() building only bits of the dataframe - this is something I didn't consider, but you're right here. However I think that, if we're at squashing that "files not closed" bug, we should fix both problems in this PR...

@anmyachev
Copy link
Collaborator Author

Need to check if the problem is the root cause for #2533.

@anmyachev anmyachev force-pushed the issue6460 branch 2 times, most recently from a9bcffb to ff9c4ea Compare August 27, 2023 15:35
Comment on lines 877 to 882
try:
result = cls._execution_wrapper.materialize(blocks_to_materialize)
except Exception:
# wait for the end of computations so that the processes free up the resources used
cls._execution_wrapper.wait(blocks_to_materialize)
raise
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@YarShev this construction causes the unidist to hangs, although dask and ray work. Can you suggest what's wrong?

@vnlitvinov
Copy link
Collaborator

Need to check if the problem is the root cause for #2533.

For that we first need to see if that issue is still occurring... 😺

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall looks good, but CI is still not green, so cannot approve.

Comment on lines 881 to 882
cls._execution_wrapper.wait(blocks_to_materialize)
raise
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think this is exactly the right place to put this synchronization point... At least it should be waited conditionally based on some configuration variable, as I would think that in most scenarios the user doesn't want to wait for all computations when it's already known that an exception is happening.

Also, this is fixing a problem which isn't entirely related to the original issue, so maybe it's better to open a new separate PR with this change.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also, this is fixing a problem which isn't entirely related to the original issue, so maybe it's better to open a new separate PR with this change.

I agree, let's continue in another PR.

Comment on lines 877 to 883
try:
result = cls._execution_wrapper.materialize(blocks_to_materialize)
except Exception:
# wait for the end of computations so that the processes free up the resources used
cls._execution_wrapper.wait(blocks_to_materialize)
raise
return result
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if that code is to stay, we can make it look nicer:

Suggested change
try:
result = cls._execution_wrapper.materialize(blocks_to_materialize)
except Exception:
# wait for the end of computations so that the processes free up the resources used
cls._execution_wrapper.wait(blocks_to_materialize)
raise
return result
try:
return cls._execution_wrapper.materialize(blocks_to_materialize)
except Exception:
# wait for the end of computations so that the processes free up the resources used
cls._execution_wrapper.wait(blocks_to_materialize)
raise

Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Code at a glance looks good, but tests are failing.
@anmyachev please take a look at CI.

anmyachev and others added 10 commits September 5, 2023 18:20
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Co-authored-by: Vasily Litvinov <fam1ly.n4me@yandex.ru>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
This reverts commit 1341cf7.
Signed-off-by: Anatoly Myachev <anatoly.myachev@intel.com>
Copy link
Collaborator

@vnlitvinov vnlitvinov left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@vnlitvinov vnlitvinov merged commit 98fbf75 into modin-project:master Sep 5, 2023
37 checks passed
@anmyachev anmyachev deleted the issue6460 branch September 5, 2023 19:10
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Do not use repr function to materialize the result of a function execution
2 participants