Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[dask] Random failures in Dask tests during teardown #3829

Closed
StrikerRUS opened this issue Jan 24, 2021 · 38 comments
Closed

[dask] Random failures in Dask tests during teardown #3829

StrikerRUS opened this issue Jan 24, 2021 · 38 comments
Assignees

Comments

@StrikerRUS
Copy link
Collaborator

Need to make Dask tests less flaky and ideally remove all randomness from them.
Log from one of failed job:

2021-01-24T00:16:23.5034764Z ============================= test session starts ==============================
2021-01-24T00:16:23.5036640Z platform linux -- Python 3.8.5, pytest-6.2.1, py-1.10.0, pluggy-0.13.1
2021-01-24T00:16:23.5037201Z rootdir: /__w/1/s
2021-01-24T00:16:23.5037554Z collected 248 items
2021-01-24T00:16:23.5037809Z 
2021-01-24T00:16:23.8406259Z ../tests/python_package_test/test_basic.py ............                  [  4%]
2021-01-24T00:16:32.3036057Z ../tests/python_package_test/test_consistency.py ......                  [  7%]
2021-01-24T00:18:49.4493918Z ../tests/python_package_test/test_dask.py .........................E.... [ 18%]
2021-01-24T00:19:32.8918399Z ...........                                                              [ 23%]
2021-01-24T00:19:32.8930118Z ../tests/python_package_test/test_dual.py s                              [ 23%]
2021-01-24T00:19:36.0702660Z ../tests/python_package_test/test_engine.py ............................ [ 35%]
2021-01-24T00:20:19.2796809Z ........................................                                 [ 51%]
2021-01-24T00:20:20.0033215Z ../tests/python_package_test/test_plotting.py .....                      [ 53%]
2021-01-24T00:20:21.7294635Z ../tests/python_package_test/test_sklearn.py ........................... [ 64%]
2021-01-24T00:20:24.6758838Z ......x.........................................x....................... [ 93%]
2021-01-24T00:20:24.8587472Z .................                                                        [100%]
2021-01-24T00:20:24.8588302Z 
2021-01-24T00:20:24.8590837Z ==================================== ERRORS ====================================
2021-01-24T00:20:24.8593230Z ______ ERROR at teardown of test_regressor_quantile[0.5-scipy_csr_matrix] ______
2021-01-24T00:20:24.8595011Z 
2021-01-24T00:20:24.8595723Z loop = <tornado.platform.asyncio.AsyncIOLoop object at 0x7f41dc8e2070>
2021-01-24T00:20:24.8596248Z 
2021-01-24T00:20:24.8596718Z     @pytest.fixture
2021-01-24T00:20:24.8597247Z     def cluster_fixture(loop):
2021-01-24T00:20:24.8597875Z         with cluster() as (scheduler, workers):
2021-01-24T00:20:24.8598780Z >           yield (scheduler, workers)
2021-01-24T00:20:24.8599277Z 
2021-01-24T00:20:24.8600455Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:522: 
2021-01-24T00:20:24.8601420Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-01-24T00:20:24.8602558Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/contextlib.py:120: in __exit__
2021-01-24T00:20:24.8603278Z     next(self.gen)
2021-01-24T00:20:24.8604897Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:677: in cluster
2021-01-24T00:20:24.8605790Z     loop.run_sync(
2021-01-24T00:20:24.8606854Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/tornado/ioloop.py:530: in run_sync
2021-01-24T00:20:24.8607455Z     return future_cell[0].result()
2021-01-24T00:20:24.8608539Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:740: in disconnect_all
2021-01-24T00:20:24.8609314Z     await asyncio.gather(*[disconnect(addr, timeout, rpc_kwargs) for addr in addresses])
2021-01-24T00:20:24.8609992Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:736: in disconnect
2021-01-24T00:20:24.8610409Z     await asyncio.wait_for(do_disconnect(), timeout=timeout)
2021-01-24T00:20:24.8610740Z _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
2021-01-24T00:20:24.8610993Z 
2021-01-24T00:20:24.8611655Z fut = <Task cancelled name='Task-1672' coro=<disconnect.<locals>.do_disconnect() done, defined at /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:731>>
2021-01-24T00:20:24.8612132Z timeout = 3
2021-01-24T00:20:24.8612290Z 
2021-01-24T00:20:24.8612547Z     async def wait_for(fut, timeout, *, loop=None):
2021-01-24T00:20:24.8612885Z         """Wait for the single Future or coroutine to complete, with timeout.
2021-01-24T00:20:24.8613165Z     
2021-01-24T00:20:24.8613395Z         Coroutine will be wrapped in Task.
2021-01-24T00:20:24.8613619Z     
2021-01-24T00:20:24.8613880Z         Returns result of the Future or coroutine.  When a timeout occurs,
2021-01-24T00:20:24.8614211Z         it cancels the task and raises TimeoutError.  To avoid the task
2021-01-24T00:20:24.8614519Z         cancellation, wrap it in shield().
2021-01-24T00:20:24.8614738Z     
2021-01-24T00:20:24.8614982Z         If the wait is cancelled, the task is also cancelled.
2021-01-24T00:20:24.8615219Z     
2021-01-24T00:20:24.8615440Z         This function is a coroutine.
2021-01-24T00:20:24.8615837Z         """
2021-01-24T00:20:24.8616126Z         if loop is None:
2021-01-24T00:20:24.8616491Z             loop = events.get_running_loop()
2021-01-24T00:20:24.8616842Z         else:
2021-01-24T00:20:24.8617230Z             warnings.warn("The loop argument is deprecated since Python 3.8, "
2021-01-24T00:20:24.8617671Z                           "and scheduled for removal in Python 3.10.",
2021-01-24T00:20:24.8618001Z                           DeprecationWarning, stacklevel=2)
2021-01-24T00:20:24.8618236Z     
2021-01-24T00:20:24.8618457Z         if timeout is None:
2021-01-24T00:20:24.8618692Z             return await fut
2021-01-24T00:20:24.8618900Z     
2021-01-24T00:20:24.8619099Z         if timeout <= 0:
2021-01-24T00:20:24.8619367Z             fut = ensure_future(fut, loop=loop)
2021-01-24T00:20:24.8619586Z     
2021-01-24T00:20:24.8619797Z             if fut.done():
2021-01-24T00:20:24.8620035Z                 return fut.result()
2021-01-24T00:20:24.8620256Z     
2021-01-24T00:20:24.8620448Z             fut.cancel()
2021-01-24T00:20:24.8620700Z             raise exceptions.TimeoutError()
2021-01-24T00:20:24.8620916Z     
2021-01-24T00:20:24.8621134Z         waiter = loop.create_future()
2021-01-24T00:20:24.8621423Z         timeout_handle = loop.call_later(timeout, _release_waiter, waiter)
2021-01-24T00:20:24.8621870Z         cb = functools.partial(_release_waiter, waiter)
2021-01-24T00:20:24.8622114Z     
2021-01-24T00:20:24.8622325Z         fut = ensure_future(fut, loop=loop)
2021-01-24T00:20:24.8622583Z         fut.add_done_callback(cb)
2021-01-24T00:20:24.8622788Z     
2021-01-24T00:20:24.8622978Z         try:
2021-01-24T00:20:24.8623227Z             # wait until the future completes or the timeout
2021-01-24T00:20:24.8623494Z             try:
2021-01-24T00:20:24.8623708Z                 await waiter
2021-01-24T00:20:24.8623984Z             except exceptions.CancelledError:
2021-01-24T00:20:24.8624269Z                 fut.remove_done_callback(cb)
2021-01-24T00:20:24.8624525Z                 fut.cancel()
2021-01-24T00:20:24.8624742Z                 raise
2021-01-24T00:20:24.8624945Z     
2021-01-24T00:20:24.8625143Z             if fut.done():
2021-01-24T00:20:24.8625396Z                 return fut.result()
2021-01-24T00:20:24.8625624Z             else:
2021-01-24T00:20:24.8625875Z                 fut.remove_done_callback(cb)
2021-01-24T00:20:24.8626173Z                 # We must ensure that the task is not running
2021-01-24T00:20:24.8626458Z                 # after wait_for() returns.
2021-01-24T00:20:24.8626749Z                 # See https://bugs.python.org/issue32751
2021-01-24T00:20:24.8627045Z                 await _cancel_and_wait(fut, loop=loop)
2021-01-24T00:20:24.8627335Z >               raise exceptions.TimeoutError()
2021-01-24T00:20:24.8627603Z E               asyncio.exceptions.TimeoutError
2021-01-24T00:20:24.8627813Z 
2021-01-24T00:20:24.8628430Z /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/asyncio/tasks.py:490: TimeoutError
2021-01-24T00:20:24.8629018Z ---------------------------- Captured stderr setup -----------------------------
2021-01-24T00:20:24.8629735Z distributed.http.proxy - INFO - To route to workers diagnostics web server please install jupyter-server-proxy: python -m pip install jupyter-server-proxy
2021-01-24T00:20:24.8630361Z distributed.scheduler - INFO - Clear task state
2021-01-24T00:20:24.8630898Z distributed.scheduler - INFO -   Scheduler at:     tcp://127.0.0.1:44627
2021-01-24T00:20:24.8631460Z distributed.scheduler - INFO -   dashboard at:            127.0.0.1:8787
2021-01-24T00:20:24.8632061Z distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:37417
2021-01-24T00:20:24.8632636Z distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:37417
2021-01-24T00:20:24.8633223Z distributed.worker - INFO -          dashboard at:            127.0.0.1:39337
2021-01-24T00:20:24.8633809Z distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:44627
2021-01-24T00:20:24.8634364Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8635026Z distributed.worker - INFO -               Threads:                          1
2021-01-24T00:20:24.8635590Z distributed.worker - INFO -                Memory:                    8.35 GB
2021-01-24T00:20:24.8636318Z distributed.worker - INFO -       Local Directory: /__w/1/s/python-package/_test_worker-5b8cbc6f-8efa-4b05-9689-fb0c0d1a23e7/dask-worker-space/worker-u3uatu6i
2021-01-24T00:20:24.8636956Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8637525Z distributed.worker - INFO -       Start worker at:      tcp://127.0.0.1:43619
2021-01-24T00:20:24.8638108Z distributed.worker - INFO -          Listening to:      tcp://127.0.0.1:43619
2021-01-24T00:20:24.8638682Z distributed.worker - INFO -          dashboard at:            127.0.0.1:45409
2021-01-24T00:20:24.8639281Z distributed.worker - INFO - Waiting to connect to:      tcp://127.0.0.1:44627
2021-01-24T00:20:24.8639825Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8640389Z distributed.worker - INFO -               Threads:                          1
2021-01-24T00:20:24.8640947Z distributed.worker - INFO -                Memory:                    8.35 GB
2021-01-24T00:20:24.8641677Z distributed.worker - INFO -       Local Directory: /__w/1/s/python-package/_test_worker-552c819c-6b5f-48f9-ba28-8f06bdeb6cca/dask-worker-space/worker-r7o51tm7
2021-01-24T00:20:24.8642400Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8643150Z distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:37417', name: tcp://127.0.0.1:37417, memory: 0, processing: 0>
2021-01-24T00:20:24.8644130Z distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:37417
2021-01-24T00:20:24.8644648Z distributed.core - INFO - Starting established connection
2021-01-24T00:20:24.8645190Z distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:44627
2021-01-24T00:20:24.8645741Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8646479Z distributed.scheduler - INFO - Register worker <Worker 'tcp://127.0.0.1:43619', name: tcp://127.0.0.1:43619, memory: 0, processing: 0>
2021-01-24T00:20:24.8647145Z distributed.core - INFO - Starting established connection
2021-01-24T00:20:24.8647703Z distributed.scheduler - INFO - Starting worker compute stream, tcp://127.0.0.1:43619
2021-01-24T00:20:24.8648228Z distributed.core - INFO - Starting established connection
2021-01-24T00:20:24.8648761Z distributed.worker - INFO -         Registered to:      tcp://127.0.0.1:44627
2021-01-24T00:20:24.8649307Z distributed.worker - INFO - -------------------------------------------------
2021-01-24T00:20:24.8649792Z distributed.core - INFO - Starting established connection
2021-01-24T00:20:24.8650365Z distributed.scheduler - INFO - Receive client connection: Client-ab0ff4de-5dd9-11eb-a0a4-0242ac120002
2021-01-24T00:20:24.8650916Z distributed.core - INFO - Starting established connection
2021-01-24T00:20:24.8651415Z ----------------------------- Captured stdout call -----------------------------
2021-01-24T00:20:24.8651744Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8652009Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8652292Z [LightGBM] [Info] Trying to bind port 13240...
2021-01-24T00:20:24.8652568Z [LightGBM] [Info] Trying to bind port 13241...
2021-01-24T00:20:24.8652835Z [LightGBM] [Info] Binding port 13240 succeeded
2021-01-24T00:20:24.8653109Z [LightGBM] [Info] Binding port 13241 succeeded
2021-01-24T00:20:24.8653370Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8653637Z [LightGBM] [Info] Listening...
2021-01-24T00:20:24.8653867Z [LightGBM] [Info] Listening...
2021-01-24T00:20:24.8654121Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8654383Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8654656Z [LightGBM] [Warning] Set TCP_NODELAY failed
2021-01-24T00:20:24.8654999Z [LightGBM] [Info] Connected to rank 1
2021-01-24T00:20:24.8655265Z [LightGBM] [Info] Connected to rank 0
2021-01-24T00:20:24.8655563Z [LightGBM] [Info] Local rank: 0, total number of machines: 2
2021-01-24T00:20:24.8655922Z [LightGBM] [Info] Local rank: 1, total number of machines: 2
2021-01-24T00:20:24.8656535Z [LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
2021-01-24T00:20:24.8657188Z [LightGBM] [Warning] num_threads is set=1, n_jobs=-1 will be ignored. Current value: num_threads=1
2021-01-24T00:20:24.8657785Z ----------------------------- Captured stderr call -----------------------------
2021-01-24T00:20:24.8658347Z distributed.scheduler - INFO - Remove client Client-ab0ff4de-5dd9-11eb-a0a4-0242ac120002
2021-01-24T00:20:24.8658925Z distributed.scheduler - INFO - Remove client Client-ab0ff4de-5dd9-11eb-a0a4-0242ac120002
2021-01-24T00:20:24.8659545Z distributed.scheduler - INFO - Close client connection: Client-ab0ff4de-5dd9-11eb-a0a4-0242ac120002
2021-01-24T00:20:24.8660125Z --------------------------- Captured stderr teardown ---------------------------
2021-01-24T00:20:24.8660683Z distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:43619
2021-01-24T00:20:24.8661227Z distributed.worker - INFO - Stopping worker at tcp://127.0.0.1:37417
2021-01-24T00:20:24.8662073Z distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:43619', name: tcp://127.0.0.1:43619, memory: 0, processing: 0>
2021-01-24T00:20:24.8662757Z distributed.core - INFO - Removing comms to tcp://127.0.0.1:43619
2021-01-24T00:20:24.8663508Z distributed.scheduler - INFO - Remove worker <Worker 'tcp://127.0.0.1:37417', name: tcp://127.0.0.1:37417, memory: 0, processing: 0>
2021-01-24T00:20:24.8667564Z distributed.core - INFO - Removing comms to tcp://127.0.0.1:37417
2021-01-24T00:20:24.8669350Z distributed.scheduler - INFO - Lost all workers
2021-01-24T00:20:24.8669803Z =============================== warnings summary ===============================
2021-01-24T00:20:24.8670149Z tests/python_package_test/test_basic.py::test_basic
2021-01-24T00:20:24.8670475Z tests/python_package_test/test_engine.py::test_reference_chain
2021-01-24T00:20:24.8671155Z tests/python_package_test/test_engine.py::test_init_with_subset
2021-01-24T00:20:24.8671957Z tests/python_package_test/test_engine.py::test_fpreproc
2021-01-24T00:20:24.8672336Z tests/python_package_test/test_engine.py::test_dataset_params_with_reference
2021-01-24T00:20:24.8673643Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1317: UserWarning: Overriding the parameters from Reference Dataset.
2021-01-24T00:20:24.8675308Z     warnings.warn('Overriding the parameters from Reference Dataset.')
2021-01-24T00:20:24.8675564Z 
2021-01-24T00:20:24.8675849Z tests/python_package_test/test_basic.py::test_add_features_equal_data_on_alternating_used_unused
2021-01-24T00:20:24.8676236Z tests/python_package_test/test_basic.py::test_add_features_same_booster_behaviour
2021-01-24T00:20:24.8676739Z tests/python_package_test/test_engine.py::test_sliced_data
2021-01-24T00:20:24.8677122Z tests/python_package_test/test_engine.py::test_monotone_penalty_max
2021-01-24T00:20:24.8677471Z tests/python_package_test/test_engine.py::test_forced_bins
2021-01-24T00:20:24.8678576Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:332: UserWarning: Usage of np.ndarray subset (sliced data) is not recommended due to it will double the peak memory cost in LightGBM.
2021-01-24T00:20:24.8679370Z     warnings.warn("Usage of np.ndarray subset (sliced data) is not recommended "
2021-01-24T00:20:24.8679638Z 
2021-01-24T00:20:24.8680313Z tests/python_package_test/test_basic.py::test_add_features_equal_data_on_alternating_used_unused
2021-01-24T00:20:24.8680755Z tests/python_package_test/test_basic.py::test_add_features_same_booster_behaviour
2021-01-24T00:20:24.8681600Z tests/python_package_test/test_basic.py::test_add_features_from_different_sources
2021-01-24T00:20:24.8684071Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:2014: UserWarning: Cannot add features from NoneType type of raw data to NoneType type of raw data.
2021-01-24T00:20:24.8685702Z   Set free_raw_data=False when construct Dataset to avoid this
2021-01-24T00:20:24.8686204Z     warnings.warn(err_msg)
2021-01-24T00:20:24.8686547Z 
2021-01-24T00:20:24.8686985Z tests/python_package_test/test_basic.py::test_add_features_equal_data_on_alternating_used_unused
2021-01-24T00:20:24.8687652Z tests/python_package_test/test_basic.py::test_add_features_same_booster_behaviour
2021-01-24T00:20:24.8688295Z tests/python_package_test/test_basic.py::test_add_features_from_different_sources
2021-01-24T00:20:24.8689578Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:2016: UserWarning: Reseting categorical features.
2021-01-24T00:20:24.8690362Z   You can set new categorical features via ``set_categorical_feature`` method
2021-01-24T00:20:24.8690937Z     warnings.warn("Reseting categorical features.\n"
2021-01-24T00:20:24.8691350Z 
2021-01-24T00:20:24.8691803Z tests/python_package_test/test_basic.py::test_add_features_from_different_sources
2021-01-24T00:20:24.8693121Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:2014: UserWarning: Cannot add features from list type of raw data to ndarray type of raw data.
2021-01-24T00:20:24.8694042Z   Freeing raw data
2021-01-24T00:20:24.8694464Z     warnings.warn(err_msg)
2021-01-24T00:20:24.8694806Z 
2021-01-24T00:20:24.8695307Z tests/python_package_test/test_basic.py::test_add_features_from_different_sources
2021-01-24T00:20:24.8696312Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:2014: UserWarning: Cannot add features from list type of raw data to csr_matrix type of raw data.
2021-01-24T00:20:24.8696803Z   Freeing raw data
2021-01-24T00:20:24.8697097Z     warnings.warn(err_msg)
2021-01-24T00:20:24.8697300Z 
2021-01-24T00:20:24.8697709Z tests/python_package_test/test_basic.py::test_add_features_from_different_sources
2021-01-24T00:20:24.8698523Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:2014: UserWarning: Cannot add features from list type of raw data to DataFrame type of raw data.
2021-01-24T00:20:24.8698993Z   Freeing raw data
2021-01-24T00:20:24.8699262Z     warnings.warn(err_msg)
2021-01-24T00:20:24.8699448Z 
2021-01-24T00:20:24.8699770Z tests/python_package_test/test_consistency.py: 10 warnings
2021-01-24T00:20:24.8700507Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/engine.py:149: UserWarning: Found `num_trees` in params. Will use it instead of argument
2021-01-24T00:20:24.8701035Z     warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
2021-01-24T00:20:24.8701307Z 
2021-01-24T00:20:24.8701611Z tests/python_package_test/test_consistency.py: 10 warnings
2021-01-24T00:20:24.8702342Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1106: UserWarning: data keyword has been found in `params` and will be ignored.
2021-01-24T00:20:24.8702853Z   Please use data argument of the Dataset constructor to pass this parameter.
2021-01-24T00:20:24.8703433Z     warnings.warn('{0} keyword has been found in `params` and will be ignored.\n'
2021-01-24T00:20:24.8703694Z 
2021-01-24T00:20:24.8704007Z tests/python_package_test/test_dask.py::test_errors
2021-01-24T00:20:24.8704666Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/dask.py:229: RuntimeWarning: coroutine '_wait' was never awaited
2021-01-24T00:20:24.8705081Z     wait(parts)
2021-01-24T00:20:24.8705269Z 
2021-01-24T00:20:24.8705586Z tests/python_package_test/test_dask.py::test_errors
2021-01-24T00:20:24.8706383Z   /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:938: RuntimeWarning: coroutine 'PooledRPCCall.__getattr__.<locals>.send_recv_from_rpc' was never awaited
2021-01-24T00:20:24.8706998Z     gc.collect()
2021-01-24T00:20:24.8707182Z 
2021-01-24T00:20:24.8707499Z tests/python_package_test/test_engine.py::test_binary
2021-01-24T00:20:24.8708218Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/engine.py:149: UserWarning: Found `num_iteration` in params. Will use it instead of argument
2021-01-24T00:20:24.8708763Z     warnings.warn("Found `{}` in params. Will use it instead of argument".format(alias))
2021-01-24T00:20:24.8709033Z 
2021-01-24T00:20:24.8709347Z tests/python_package_test/test_engine.py::test_pandas_categorical
2021-01-24T00:20:24.8711042Z tests/python_package_test/test_engine.py::test_linear_trees
2021-01-24T00:20:24.8711475Z tests/python_package_test/test_engine.py::test_save_and_load_linear
2021-01-24T00:20:24.8712383Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1589: UserWarning: categorical_feature in Dataset is overridden.
2021-01-24T00:20:24.8712840Z   New categorical_feature is [0]
2021-01-24T00:20:24.8713430Z     warnings.warn('categorical_feature in Dataset is overridden.\n'
2021-01-24T00:20:24.8713693Z 
2021-01-24T00:20:24.8714014Z tests/python_package_test/test_engine.py::test_pandas_categorical
2021-01-24T00:20:24.8714730Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1589: UserWarning: categorical_feature in Dataset is overridden.
2021-01-24T00:20:24.8715462Z   New categorical_feature is ['A']
2021-01-24T00:20:24.8715977Z     warnings.warn('categorical_feature in Dataset is overridden.\n'
2021-01-24T00:20:24.8716217Z 
2021-01-24T00:20:24.8716559Z tests/python_package_test/test_engine.py::test_pandas_categorical
2021-01-24T00:20:24.8717243Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1589: UserWarning: categorical_feature in Dataset is overridden.
2021-01-24T00:20:24.8717870Z   New categorical_feature is ['A', 'B', 'C', 'D']
2021-01-24T00:20:24.8718391Z     warnings.warn('categorical_feature in Dataset is overridden.\n'
2021-01-24T00:20:24.8718653Z 
2021-01-24T00:20:24.8718974Z tests/python_package_test/test_engine.py::test_pandas_categorical
2021-01-24T00:20:24.8719677Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1589: UserWarning: categorical_feature in Dataset is overridden.
2021-01-24T00:20:24.8720722Z   New categorical_feature is ['A', 'B', 'C', 'D', 'E']
2021-01-24T00:20:24.8721323Z     warnings.warn('categorical_feature in Dataset is overridden.\n'
2021-01-24T00:20:24.8721587Z 
2021-01-24T00:20:24.8721902Z tests/python_package_test/test_engine.py::test_pandas_categorical
2021-01-24T00:20:24.8722609Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1589: UserWarning: categorical_feature in Dataset is overridden.
2021-01-24T00:20:24.8723043Z   New categorical_feature is []
2021-01-24T00:20:24.8723699Z     warnings.warn('categorical_feature in Dataset is overridden.\n'
2021-01-24T00:20:24.8723960Z 
2021-01-24T00:20:24.8724281Z tests/python_package_test/test_engine.py::test_pandas_sparse
2021-01-24T00:20:24.8724670Z tests/python_package_test/test_sklearn.py::TestSklearn::test_pandas_sparse
2021-01-24T00:20:24.8725610Z   /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/pandas/core/generic.py:5670: PerformanceWarning: Concatenating sparse arrays with multiple fill values: '[0, nan, False]'. Picking the first and converting the rest.
2021-01-24T00:20:24.8726254Z     return self._mgr.as_array(transpose=self._AXIS_REVERSED)
2021-01-24T00:20:24.8726483Z 
2021-01-24T00:20:24.8726818Z tests/python_package_test/test_engine.py::test_int32_max_sparse_contribs
2021-01-24T00:20:24.8727673Z   /home/AzDevOps_azpcontainer/miniconda/envs/test-env/lib/python3.8/site-packages/scipy/sparse/_index.py:82: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.
2021-01-24T00:20:24.8728234Z     self._set_intXint(row, col, x.flat[0])
2021-01-24T00:20:24.8728440Z 
2021-01-24T00:20:24.8728869Z tests/python_package_test/test_engine.py::test_init_with_subset
2021-01-24T00:20:24.8729563Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1843: UserWarning: Cannot subset str type of raw data.
2021-01-24T00:20:24.8730004Z   Returning original raw data
2021-01-24T00:20:24.8730320Z     warnings.warn("Cannot subset {} type of raw data.\n"
2021-01-24T00:20:24.8730538Z 
2021-01-24T00:20:24.8730848Z tests/python_package_test/test_engine.py::test_monotone_constraints
2021-01-24T00:20:24.8731236Z tests/python_package_test/test_engine.py::test_monotone_penalty
2021-01-24T00:20:24.8731615Z tests/python_package_test/test_engine.py::test_monotone_penalty_max
2021-01-24T00:20:24.8731998Z tests/python_package_test/test_engine.py::test_get_split_value_histogram
2021-01-24T00:20:24.8732741Z tests/python_package_test/test_sklearn.py::TestSklearn::test_pandas_categorical
2021-01-24T00:20:24.8733558Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1586: UserWarning: Using categorical_feature in Dataset.
2021-01-24T00:20:24.8734209Z     warnings.warn('Using categorical_feature in Dataset.')
2021-01-24T00:20:24.8734448Z 
2021-01-24T00:20:24.8734783Z tests/python_package_test/test_engine.py::test_dataset_update_params
2021-01-24T00:20:24.8735529Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:1106: UserWarning: categorical_feature keyword has been found in `params` and will be ignored.
2021-01-24T00:20:24.8736186Z   Please use categorical_feature argument of the Dataset constructor to pass this parameter.
2021-01-24T00:20:24.8736785Z     warnings.warn('{0} keyword has been found in `params` and will be ignored.\n'
2021-01-24T00:20:24.8737047Z 
2021-01-24T00:20:24.8737370Z tests/python_package_test/test_plotting.py::test_plot_metrics
2021-01-24T00:20:24.8737817Z   /__w/1/s/tests/python_package_test/test_plotting.py:178: UserWarning: More than one metric available, picking one to plot.
2021-01-24T00:20:24.8738233Z     ax0 = lgb.plot_metric(evals_result0)
2021-01-24T00:20:24.8738439Z 
2021-01-24T00:20:24.8738819Z tests/python_package_test/test_sklearn.py::TestSklearn::test_binary_classification_with_custom_objective
2021-01-24T00:20:24.8739687Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/sklearn.py:934: UserWarning: Cannot compute class probabilities or labels due to the usage of customized objective function.
2021-01-24T00:20:24.8740198Z   Returning raw scores instead.
2021-01-24T00:20:24.8740522Z     warnings.warn("Cannot compute class probabilities or labels "
2021-01-24T00:20:24.8740768Z 
2021-01-24T00:20:24.8741105Z tests/python_package_test/test_sklearn.py::TestSklearn::test_stacking_classifier
2021-01-24T00:20:24.8741892Z   <frozen importlib._bootstrap>:219: RuntimeWarning: numpy.ufunc size changed, may indicate binary incompatibility. Expected 192 from C header, got 216 from PyObject
2021-01-24T00:20:24.8742323Z 
2021-01-24T00:20:24.8742647Z tests/python_package_test/test_sklearn.py: 12 warnings
2021-01-24T00:20:24.8743370Z   /home/AzDevOps_azpcontainer/.local/lib/python3.8/site-packages/lightgbm/basic.py:623: UserWarning: Converting data to scipy sparse matrix.
2021-01-24T00:20:24.8744025Z     warnings.warn('Converting data to scipy sparse matrix.')
2021-01-24T00:20:24.8744260Z 
2021-01-24T00:20:24.8744754Z -- Docs: https://docs.pytest.org/en/stable/warnings.html
2021-01-24T00:20:24.8745164Z =========================== short test summary info ============================
2021-01-24T00:20:24.8745775Z ERROR ../tests/python_package_test/test_dask.py::test_regressor_quantile[0.5-scipy_csr_matrix]
2021-01-24T00:20:24.8746243Z = 245 passed, 1 skipped, 2 xfailed, 74 warnings, 1 error in 242.90s (0:04:02) ==
2021-01-24T00:20:25.1851259Z ##[error]Bash exited with code '255'.
@StrikerRUS
Copy link
Collaborator Author

#3708 (comment)

@jameslamb jameslamb changed the title Random failures in Dask tests during teardown [dask] Random failures in Dask tests during teardown Jan 26, 2021
@jameslamb
Copy link
Collaborator

To avoid merge conflicts with #3855, this is the next Dask issue I'm picking up. Will try some strategies on a PR on my fork.

@jameslamb jameslamb self-assigned this Jan 26, 2021
@jameslamb
Copy link
Collaborator

I've run the sdist task 40 consecutive times across 4 builds (each build with 5 Linux sdist, 5 Linux_latest sdist each).

All of them succeeded.

jameslamb#39

I can't use the sh-ubuntu pool on my personal account. So next I'm going to open a draft PR on a LightGBM branch and test here. I'll comment out all other CIs to minimize the impact on our CI quotas.

Since everything in the job is dockerized, and since I just saw 40 consecutive successes, my theories about the source of flakiness are now related to the differences between the Microsoft-hosted runners and the sh-ubuntu ones. Such as

  • maybe the default client fixture uses more workers than there are logical CPUs available, and this leads to flaky teardowns
  • maybe the disk is slower on sh-ubuntu runners, and this leads some IO-bound process in Dask to be slower, which leads to timeouts
  • maybe the available memory is smaller, and that leads Dask to spill to Disk, which makes it slower, which leads to timeouts

@StrikerRUS or @guolinke , are you able to share the specs for the sh-ubuntu pool? CPU, available memory, disk space, and anything else like that that you can share?

@jameslamb
Copy link
Collaborator

I'm testing on #3869 , but want to add here that I have another idea for helping with this.

We use client.close() in each test, and I think we could pass None or a large positive value to those client.close() calls and that it would help avoid these errors.

https://github.com/dask/distributed/blob/b5c36b587f0e3295fe59db330603d5c78b39331f/distributed/client.py#L1424-L1425

https://github.com/dask/distributed/blob/b5c36b587f0e3295fe59db330603d5c78b39331f/distributed/client.py#L1407-L1408

@StrikerRUS
Copy link
Collaborator Author

@jameslamb

I've run the sdist task

I saw regular and bdist tasks failed more often. Maybe Python version matters?

@jameslamb
Copy link
Collaborator

jameslamb commented Jan 27, 2021

Ok, I can try again with regular and bdist. I had started with sdist because that was the most recent place I saw the issue.

If that still doesn't reveal anything, we can try changing Python version. If you notice the error from this issue in other builds, please leave a commit here with the TASK, operating system, and Python version. I'll do the same.

But I just ran 50 of them (25 Linux, 25 Linux_latest) consecutively on sh-ubuntu pool, on a LightGBM branch, without seeing this error once: #3869 (comment)

@jameslamb
Copy link
Collaborator

just saw this timeout error in a build from #3866 https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8823&view=logs&j=fb919173-48b9-522d-5342-9a59a13eb10b&t=7fcbeee5-d4c1-571c-8072-1d5874591254, on the Linux bdist task (Python 3.7)

I am almost done with testing on #3869, and will share some information here soon about how often this seems to happen and how we might fix it.

@StrikerRUS
Copy link
Collaborator Author

Yeah, those errors are not very frequent, so it may require more runs and time to prepare any statistics.

@jameslamb
Copy link
Collaborator

Alright, so! On #3869 , I've run 300 tests of the Python CI jobs. I did this 10 at a time, in batches with 5 running on Linux and 5 on Linux_Latest.

296 / 300 of these succeeded, and all 4 failures were the timeout error described in this issue. See the description of #3869 for build logs.

All 4 failures were on Linux (the jobs that run in https://github.com/guolinke/lightgbm-ci-docker), not Linux_latest.

Python 3.7 python 3.8
sdist 50/50 50/50
bdist 49/50 49/50
regular 50/50 48/50

Proposed Solution

I've decided against the idea of using the @gen_cluster fixture from distributed, which I originally thought about in #3708 (comment). That fixture requires you to use Dask's asynchronous API (where results of client.something() calls need to be await-ed), so I think it would not work nicely with the existing Dask module.

My plan next is to try just setting a much higher timeout (like 5 minutes) in each of the client.close() calls in the Dask tests (#3829 (comment)). I feel comfortable doing this because this issue is in a test fixture and not related to LightGBM's behavior.

My plan is to make this change and then test it with another 300 builds. This time, I'll split them up between only bdist and regular (#3829 (comment)). If all 300 pass, we can hope that this issue has been resolved. But that can take a day or two, so before I do I want to ask... @StrikerRUS what do you think about that approach?

@StrikerRUS
Copy link
Collaborator Author

@jameslamb Does adding timeout=300 mean that each test in test_dask.py can potentially run for 5 min?

@jameslamb
Copy link
Collaborator

Yes, that's the absolute worst-case scenario. But to be clear, it doesn't mean "wait 5 minutes, then exit". It means "wait until the client has been closed successfully OR 5 minutes have elapsed".

I think that worst-case scenario is very very unlikely, given the results above where only 4 out of 300 jobs failed.

And by the way, I realized that two of those failures came from test_classifier_pred_contrib, which currently doesn't use client.close() at all! (an oversight on my part, based on the order things were merged).

@StrikerRUS
Copy link
Collaborator Author

Maybe we can go in a more naive and resource-saving way?
I mean, not waste your time and our Docker pulls for running 300 tests again right now, but set timeout=60 and maintain the repo in a normal way. Then if anyone sees failure at teardown, go and increase timeout for one more minute. And then monitor again, increment timeout if needed, and so on... WDYT?

@jameslamb
Copy link
Collaborator

It's fine with me. I just anticipated that you would want more proof before a change like that. I'm happy to stop testing right now and just make a PR that sets the timeout to 60.

@StrikerRUS
Copy link
Collaborator Author

No, I'm totally fine having a place where we can simply increment a value and consequence of such actions will lead us to eliminating the error!

BTW, what is default value for timeout?

@jameslamb
Copy link
Collaborator

No, I'm totally fine having a place where we can simply increment a value and consequence of such actions will lead us to eliminating the error!

Sorry, I shouldn't have made assumptions 😬

BTW, what is default value for timeout?

It's hard to nail down, because it depends on your specific version of distributed and any local settings you have. For me, I have distributed 2021.1.0 and have never changed any Dask config settings in the config files it respects.

from dask.distributed import Client
c = Client()
print(c._timeout)

# 10

For me, this shows the timeout on my default distributed client is 10 seconds. The default behavior for client.close() is to wait 2x this long: https://github.com/dask/distributed/blob/b5c36b587f0e3295fe59db330603d5c78b39331f/distributed/client.py#L1407-L1408.

When I added a raise RuntimeError(client._timeout) in one of our tests in CI, just to see, I saw that the ._timeout attribute for the test fixture we use is 5 (which means the total timeout in our client.close() calls would be 10 seconds). But I don't know if that's reliable. I don't even know for sure that the timeout in client.close() is actually the reason we're seeing this problem 🤷‍♂️

@StrikerRUS
Copy link
Collaborator Author

Wow, how complicated!

5, 10, 2x ...
Maybe then we can start from timeout=30? I believe our current tests are quite simple and faster than 30sec.

@jameslamb
Copy link
Collaborator

I believe our current tests are quite simple and faster than 30sec.

The timeout I'm proposing that we change is not related to how long it takes the rest of the test to run. It specifically controls how long it takes to close down the cluster with client.close() at the end of the test. I'd rather leave it at 60s and not try to cut it too close. Remember I wanted to set it to several minutes 😛

Like I mentioned above, we're talking about an absolute-worst-case addition of 1 minute per test. In my tests on #3869 I saw that only 4 out of 300 builds failed, and on those 4 builds each had only 1 test case in test_dask.py fail. So I think it's reasonable to guess that we're risking that 1 out of every 75 builds might have 1 test that is AT MOST 60s longer than it normally is. In exchange for not having those failures require re-running jobs manually. I think that risk is well worth it, and that the potential cost is so low that we should err on the side of a larger timeout.

@StrikerRUS
Copy link
Collaborator Author

OK, got your position.

@jameslamb
Copy link
Collaborator

@trivialfis or @hcho3 could you help us with something? No need to read all of the discussion above in this issue.

I noticed that in https://github.com/dmlc/xgboost/blob/c3c8e66fc9d285c265b07d78fb6866770a65347b/tests/python/test_with_dask.py, you use a mix of three strategies for creating Dask clients:

  • @gen_cluster fixture from distributed
  • client fixture from distributed
  • manually creating LocalCluster and a client for it inside the test body

Could you explain to us what about the tests there made you choose one or the other? I've read https://distributed.dask.org/en/latest/develop.html#writing-tests but I have a feeling you discovered other things that aren't explained there, that could help us to make better decisions in LightGBM's Dask module.

Thank you so much!

@jameslamb
Copy link
Collaborator

Moving #3866 (comment) over here to keep all conversation in this issue.

Even with the attempted fix from #3869 , another "dask teardown timeout" failure was just observed on a CI run for #3866 : https://dev.azure.com/lightgbm-ci/lightgbm-ci/_build/results?buildId=8861&view=logs&j=c28dceab-947a-5848-c21f-eef3695e5f11&t=fa158246-17e2-53d4-5936-86070edbaacf

I see this in the log, which I didn't notice before

fut = <Task cancelled name='Task-1855' coro=<disconnect.<locals>.do_disconnect() done, defined at /opt/conda/envs/test-env/lib/python3.8/site-packages/distributed/utils_test.py:731>>
timeout = 3

That timeout = 3 tells me that the client.close() timeout is not the relevant one for this issue, since we set it to a much higher value in test_dask.py.

It looks like that value is coming from https://github.com/dask/distributed/blob/1297b18ff09276f6ad1553d40ab3ce77acf0fc0e/distributed/utils_test.py#L728-L736. I'm reopening this issue, and can try to work through other options some time in the near future. Maybe there is a way to influence that specific timeout, or maybe we need to try a totally different approach for providing a Dask cluster in tests, that we have tighter control over.

@StrikerRUS
Copy link
Collaborator Author

Just a remark: I haven't seen this error for a while.

@jameslamb
Copy link
Collaborator

I haven't seen this in a while either. I'm ok with closing it for now if you'd like.

@trivialfis
Copy link

Sorry I missed the previous notification. Our tests are not quite stable either, during teardown there could be a number of issues from race between scheduler and workers to the timeout you are seeing.

@trivialfis
Copy link

trivialfis commented Mar 31, 2021

I think the synchronization during shutdown still have room for improvement, that's one of the reasons we keep using the Python context in our demos/tests:

with LocalCluster() as cluster:
    with Client(cluster) as client:
        main()

Otherwise something like this might result into even weirder error dask/distributed#3081:

cluster = LocalCluster()
client = Client(cluster)

@StrikerRUS
Copy link
Collaborator Author

@jameslamb

I'm ok with closing it for now if you'd like.

I'll defer the decision to you.

@jmoralez
Copy link
Collaborator

jmoralez commented Mar 31, 2021

Since our CI time has been increasing with the additions of boosting_type and now tree_learner I decided to dig a bit into why the tests took so long, here's a quote of #4088 (comment).

@jameslamb do you know what's the advantage of using the client fixture? (Apart from avoiding the client construction). I just tried running test_regressor by initializing the client at the top and the time goes from 137s down to 98s.
Here's the profile with the fixture:
Screenshot from 2021-03-30 20-22-14
And without:
image
So basically with the fixture 70% of the time is spent in setting up and tearing down the cluster. Without it only 60% of the time is spent there and seems to be faster.

Followup
I've been investigating more and found dask/distributed#3540, which led me to the implementation by seanlaw, here: https://github.com/TDAmeritrade/stumpy/blob/68092c931610db725f2c74b6a5155868666eb14f/tests/test_mstumped.py#L10-L14 and holy cow that's fast. Using that test_regressor runs in 18 seconds. Would you support a PR adding this? It'd basically be replacing the client fixture with the dask_cluster fixture and instantiating a client from it in every test. When the client gets closed it's memory gets released so I believe this is the same as what we have right now, every test gets a fresh client but it's significantly faster because we're not creating a cluster everytime.

I'd like to get some feedback about using this approach. This only builds one cluster that gets reused by all the tests so the time is spent actually doing work and not waiting for setup/teardown. I already tried running all of the tests with this approach and it takes 218 seconds on my machine, the only failure I saw was in

error_msg = "Binding port %s failed" % open_ports[0]
with pytest.raises(lgb.basic.LightGBMError, match=error_msg):
with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
s.bind(('127.0.0.1', open_ports[0]))
dask_model.fit(dX, dy, group=dg)
where the test just hangs and I believe it's because one of the workers fails (the one with the port already bound) but the other one keeps waiting, so adding a client.restart there does the job.

@jameslamb
Copy link
Collaborator

Really interesting, thanks for the research! dask-lightgbm used the client fixture from, and I never proposed changing it in LightGBM just because it's mentioned at https://distributed.dask.org/en/latest/develop.html. I read that and thought "well, the Dask maintainers said this is the preferred way, and they use it in all their tests, so it must be the right way".

I would definitely support the change you're proposing! I think it's ok to have all of the tests depend on the same cluster, since we run the tests serially and since that dependency might help us catch problems where LightGBM doesn't clean up after itself.

HOWEVER, I think we should avoid merging such a PR for at least a week. We recently merged some stability fixes for lightgbm.dask (#4133, #4132), and I want to see a week of of builds go by across all PRs in this repo to see if they fixed the instability in the Dask tests (#4057, #4074, #4099). If they do, then we'll have some confidence that any new flakiness in the tests was the result of this overhaul of the way tests access a Dask cluster.

So you could submit a pull request whenever you have time, but I'd want to wait at least a week before it's merged.

@jmoralez
Copy link
Collaborator

jmoralez commented Apr 1, 2021

Thank you, I agree with that. I actually thought of mentioning in the dask issue that seanlaw's approach should be in those docs.

Related to #4099, I actually found something similar when playing around with the single cluster approach. I was using the branch for the voting algorithm, which includes the test that checks that if you try feature parallel you get an error, however it wasn't always raising it. I investigated and found that even though the data had two partitions sometimes they ended up in the same worker and the training basically ignored tree_learner. I'd like to submit a PR as well to modify _create_data such that before returning, it uses the default client to persist the collections and does a rebalance. This would ensure that all the tests are actually performing distributed training, I saw some client.rebalance() in a couple of tests but I believe all of them should do it. What do you think?

@jameslamb
Copy link
Collaborator

I'm torn on this. So far I've been against using .persist() on all of the Dask collections in all of the tests. I don't expect LightGBM's users to always do that before using lightgbm.dask, and I'm worried that we'll miss things in the tests if we .persist() everything. I want to be sure that the test setup doesn't drift so far from the normal usage of lightgbm.dask that the tests run fine but then users experience problems.

As I type that, I had a thought...I would support a PR that uses .persist() + .wait() + .rebalance() in _create_data() IF it adds a separate test that we can use to catch problems with not-yet-persisted data. Like a test that uses dask.Array.from_delayed() to create Dask arrays for data, label, init_score, group, sample_weight, etc. And another that uses da.random.random() or something. That test doesn't need to check correctness of the results (like computing evaluation metrics), but would just be a smoke test that training succeeds even when the data haven't been persisted.


Thank you, I agree with that. I actually thought of mentioning in the dask issue that seanlaw's approach should be in those docs.

I think you should! It's important that the things we learn here in LightGBM make it back to the broader Dask community.

@jmoralez
Copy link
Collaborator

jmoralez commented Apr 1, 2021

I see what you mean. I usually don't persist my collections when using lightgbm.dask or xgboost.dask because you get two copies of your data, so it's more memory efficient not doing it.

I think another (easier) approach would be just changing the defaults in _create_data. Right now it's n_samples=1_000 and chunk_size=500, so by default you get two partitions (these used to be 100 and 50 so you'd get two partitions as well). We could maybe change chunk_size=100 or chunk_size=50 so that it's unlikely that all the parts go to one worker.

@jameslamb
Copy link
Collaborator

I totally agree with that. I'd strongly support a PR to change the default to 10 or 20 partitions if you want to make one.

@jameslamb
Copy link
Collaborator

Let's try the "change default partitioning" first (not adding .persist()) and see if it resolves the reliability problems in tests.

@jmoralez
Copy link
Collaborator

jmoralez commented Apr 1, 2021

I'm running all the tests with chunk_size=50 (20 partitions) to see if something fails, if it doesn't I'll open the PR. I tried the .persist() + client.rebalance() this morning and a couple of tests failed with the other boosting types, I believe goss and dart one each.

@jameslamb
Copy link
Collaborator

Ok cool, thank you. These sources of instability are so tough to track down because they're transient, so I'd rather change things slowly and evaluate after each one how often we see the tests failing.

@jameslamb
Copy link
Collaborator

It has been a long time since I've seen these test failures.

What do you think @jmoralez , could we close this issue and #4149?

@jmoralez
Copy link
Collaborator

I think the client context manager on #4159 helped with this issue and I'm +1 on closing it.

#4149's purpose was to make sure that we were always doing distributed training and not having some tests train on a single "machine" out of bad luck. I believe that's still a nice thing to have, however that's blocked by what I believe is a bug in classification for distributed training (#4220, #5025). I can close it and come back to it once that's been resolved.

@jameslamb
Copy link
Collaborator

ahhhhh right #4220 ! Ok that was the piece of context I couldn't quite remember.

Alright, I'm going to close this, and saw you closed #4149. Thank you!

@github-actions
Copy link

This issue has been automatically locked since there has not been any recent activity since it was closed.
To start a new related discussion, open a new issue at https://github.com/microsoft/LightGBM/issues
including a reference to this.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Aug 15, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

4 participants