Use bigger fixture tree for distributed tests #4

Draft
wants to merge 11 commits into
base: main
from

Conversation

shcheklein
Member

A follow up from https://github.com/iterative/studio/pull/10211

Since the default tree fixture is too small, UDFDistributor was smart enough to run a single worker (despite us passing 2 as the UDF parameter in tests).

I don't know whether that was always the case, but at the moment it means we are testing an edge case (a single worker running as part of the LocalWorkerProcess, I believe) and never actually launch other workers (which takes a different code path).
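
To illustrate why a small fixture can hide the multi-worker path, here is a minimal sketch of the kind of heuristic that would produce this behavior. The function name, the `rows_per_worker` threshold, and the logic itself are assumptions made for illustration; this is not UDFDistributor's actual code.

```python
import math


def effective_worker_count(requested: int, total_rows: int, rows_per_worker: int = 100) -> int:
    """Hypothetical heuristic: never start more workers than the data can keep busy."""
    if requested < 1:
        raise ValueError("requested must be at least 1")
    # Enough workers to cover all rows, but always at least one.
    needed = max(1, math.ceil(total_rows / rows_per_worker))
    return min(requested, needed)
```

Under these assumed numbers, a 10-row fixture with 2 requested workers collapses to a single worker, while a 500-row fixture keeps both, which is the effect the bigger fixture is meant to achieve.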

Introducing this bigger fixture reproduced the issue (one of them) in production - shutdown getting stuck. I'm looking into this (unless someone has some immediate ideas) and I'll fix other tests (unless I hit some walls).

Thanks @dtulga for pointing me to these tests.

@shcheklein shcheklein requested a review from dtulga July 10, 2024 22:01
@shcheklein shcheklein self-assigned this Jul 10, 2024
Contributor

@dtulga dtulga left a comment


LGTM, thanks for improving this test! Originally this test didn't run only a local worker; I think the change went unnoticed because it was caused by a combination of other changes.

@shcheklein shcheklein added the bug and tests labels Jul 10, 2024
@shcheklein
Member Author

shcheklein commented Jul 10, 2024

> Originally this wasn't only running a local worker

It seems it's actually non-trivial to run the second worker properly, since it requires a running Celery worker and I don't see a fixture for that. Or were you running it a different way? (Don't spend too much time on this, but if you remember or can find the initial way we were running the second worker, that might speed up my research.)

> Introducing this bigger fixture reproduced the issue (one of them) in production - shutdown getting stuck. I'm looking into this (unless someone has some immediate ideas) and I'll fix other tests (unless I hit some walls).

Re this: I think I found the root cause, and we potentially need to check some other places.

It hangs on worker_result_group.join() in shutdown_datachain_workers. Why? Because apparently worker_result_group.revoke() is not enough: celery/celery#8888 😢 (this also describes it well: https://stackoverflow.com/questions/39191238/revoke-a-task-from-celery). So even if we put a timeout, etc., if workers are offline (and in this case there was no Celery node to run the worker_result_group in the first place: we don't launch them, and on Studio there was a queue name discrepancy), we still won't remove the task from the queue, and it might run again or clutter the queue. Not sure if there is a simple workaround ... looking more into that.
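
A toy model of the behavior described above may help. It assumes (as the linked issue and answer describe) that a revoke is only a broadcast to workers that are currently online, that each worker keeps the revoked ids in its own memory, and that the message itself stays in the broker queue. `ToyBroker`, `ToyWorker`, and `revoke` are made-up names for illustration and do not use Celery's API.

```python
from collections import deque


class ToyBroker:
    """Holds queued task ids; it knows nothing about revocation."""

    def __init__(self):
        self.queue = deque()

    def publish(self, task_id):
        self.queue.append(task_id)


class ToyWorker:
    """Keeps the set of revoked ids in its own memory."""

    def __init__(self, broker):
        self.broker = broker
        self.revoked = set()
        self.executed = []

    def on_revoke(self, task_id):
        self.revoked.add(task_id)

    def drain(self):
        # Consume everything in the queue, skipping ids this worker knows are revoked.
        while self.broker.queue:
            task_id = self.broker.queue.popleft()
            if task_id not in self.revoked:
                self.executed.append(task_id)


def revoke(task_id, online_workers):
    """Broadcast to online workers only; the broker queue is left untouched."""
    for worker in online_workers:
        worker.on_revoke(task_id)


broker = ToyBroker()
broker.publish("task-1")
revoke("task-1", online_workers=[])  # no worker is online to hear the revoke

late_worker = ToyWorker(broker)  # starts only after the revoke was broadcast
late_worker.drain()
print(late_worker.executed)  # ['task-1']: the "revoked" task still ran
```

In this model a timeout on the join would stop the caller from hanging, but the task would still sit in the queue until some worker consumes it.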

@dtulga
Contributor

dtulga commented Jul 11, 2024

Originally, I was using this script: https://github.com/iterative/dvcx-server/blob/b232559d773dcee8cadc9f1ac8730c0856b94ff8/clickhouse-db-adapter/scripts/run_with_distributed_workers.py to run the tests with distributed workers, but this has been changed a few times since then.

Now they are supposed to be run with this fixture: https://github.com/iterative/studio/blob/v2.126.1/backend/clickhouse-db-adapter/tests/conftest.py#L18 but I'm not sure how it is supposed to work with these tests. The fixture was added in this PR: https://github.com/iterative/dvcx-server/pull/332, which also removed the usage of run_with_distributed_workers.py for tests.

And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).
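
One way to cover the "no workers with the correct queue" case could be a check before dispatching work. The sketch below is only an illustration: `queue_has_consumer` is a hypothetical helper, and it assumes its input has the shape returned by Celery's `app.control.inspect().active_queues()` (a dict mapping worker hostname to a list of queue descriptions with a `"name"` key, or `None` when no worker replies).

```python
def queue_has_consumer(active_queues, queue_name):
    """Return True if at least one worker reports consuming from queue_name.

    Hypothetical helper. active_queues is a dict of worker hostname to a
    list of queue dicts (each with a "name" key), or None if no worker replied.
    """
    if not active_queues:
        return False
    return any(
        queue.get("name") == queue_name
        for queues in active_queues.values()
        for queue in queues or []
    )
```

A test or the shutdown path could call something like this with the inspected queues and fail fast, instead of waiting on a join that no worker will ever satisfy.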

@shcheklein
Member Author

> Originally, I was using this script

Yep, I saw the script ... I thought it was more of a local helper. Good to have more context. Thanks. Assuming the fixture works - can we drop the script or do you use it locally also?

> And thanks for finding that strange Celery behavior! I didn't know that's how Celery worked, and yes, we don't have a test for the case where there are no workers running (or no workers with the correct queue, at least).

yep, I don't see an easy fix for this so far. I'll keep looking for a while.

@dtulga
Contributor

dtulga commented Jul 11, 2024

I have been using that script for local debugging, yes. And I don't see an obvious fix for this particular kind of Celery issue either, but I'll think of possible solutions as well.

codecov bot commented Jul 15, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 83.86%. Comparing base (cc5994e) to head (773aef6).
Report is 2 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main       #4   +/-   ##
=======================================
  Coverage   83.86%   83.86%           
=======================================
  Files          91       91           
  Lines        9479     9479           
  Branches     1855     1855           
=======================================
  Hits         7950     7950           
  Misses       1211     1211           
  Partials      318      318           
Flag Coverage Δ
datachain 83.80% <ø> (ø)


cloudflare-workers-and-pages bot commented Jul 16, 2024

Deploying datachain-documentation with Cloudflare Pages

Latest commit: b0146b0
Status: ✅  Deploy successful!
Preview URL: https://89e14606.datachain-documentation.pages.dev
Branch Preview URL: https://fix-distributed-test.datachain-documentation.pages.dev


@mattseddon
Member

If you ever get back to this, can you please move the tests from DatasetQuery to DataChain?

@skshetry
Member

I'd suggest closing this for now and reopening it when you get back to it. It has been open for quite a while now.

Labels
bug Something isn't working tests
4 participants