dask xgboost fit: munmap_chunk(): invalid pointer: 0x00007fa5380304b0 #6469
This is what I tried. The problem happens every time I run the same thing, so I can probably bisect back to the hash where it started. Will try that tomorrow. It plausibly is related to the eval_set, which should be supported now. I'm using the scikit-learn API with eval_set passed, which in the past was not supported for early stopping, but I think it is now. But maybe there is a problem. |
Yes, if I just don't pass eval_set to the dask fit, then there is no such crash. So the new early stopping support for dask must be broken. I was hoping for that early stopping support; it's the main reason I wanted to try 1.3.0. |
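For concreteness, here is a minimal sketch of the kind of call being described: the xgboost.dask scikit-learn wrapper with eval_set passed for early stopping. The scheduler file name, data shapes, and parameters are illustrative assumptions, not the exact reproducer.

```python
# Hypothetical reduction of the failing workload: scikit-learn style dask
# estimator with eval_set and early stopping (all names/shapes are assumed).
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

client = Client(scheduler_file="dask_scheduler.json")  # assumed cluster setup

# Tiny synthetic data, similar in spirit to the "1 column, 1000 rows" case.
X = da.random.random((1000, 1), chunks=(250, 1))
y = da.random.random(1000, chunks=250)

model = xgb.dask.DaskXGBRegressor(tree_method="gpu_hist", n_estimators=100)
model.client = client
# The crash is reported to happen only when eval_set is passed here.
model.fit(X, y, eval_set=[(X, y)], early_stopping_rounds=5)
preds = model.predict(X)
```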
@pseudotensor So far it doesn't show up on our tests. Will look into it once we have an MRE (minimal reproducible example).
Sometimes the error happens in a weird way. For example, one of the workers dies due to an OOM/UCX failure or whatever, then gets restarted by dask, resulting in a weird state in the underlying MPI-style (rabit) synchronization. |
Just ran it with higgs, seems fine. |
@trivialfis Here is a repro:
Client side appears to just hang, but at least one of the workers hits the error mentioned:
The local CUDA cluster on a 2-GPU system did not crash like this. Removing the eval_set does not cause a crash either. |
I will look into it tomorrow. That's pretty new... hopefully I can get the sanitizer to work on distributed workers. |
Odd that the issue got closed remotely, although I only referenced it. Bad GitHub. |
@trivialfis Actually the problem is more pervasive. I hit it even when the eval_set is empty. It's the exact same case as above, but with this pickle: dask_model_398f86ad-bd64-4503-954d-6f7c486a2b11.pickle.zip. The error is the same, but I'm sharing full logs to avoid confusion: dask-cuda-worker.stderr.log.zip is the full worker log for the 1 remote worker with 2 GPUs. The scheduler host also has workers for its 2 GPUs, but no such errors appear there. Note that this particular case is a super trivial 1 column, 1000 rows fit. So, compared to 1.2.1, the 1.3.0 RC is much less stable for dask. This is still just using the native nightly xgboost-1.3.0_SNAPSHOT+d6386e45e8ed19c238ee06544dabbcbf56e02bbc-py3-none-manylinux2010_x86_64.whl directly, with no changes. Of course, the script run is a tiny bit different:
I've tried various changes, with no difference:
Also, even though the dask worker seems to come back up, if I Ctrl-C the python script and try again, things then just hang completely, even without any new errors. I have to restart the dask scheduler/workers. |
Bisecting this with a reduced, simple version of the script:
I'm bisecting by nightly wheels, in case that helps, since it only takes 5 seconds to fail or run (here FAILED means illegal access on the remote worker plus a hang, same as on master). FINISHED, BUT with the same failure on the remote worker. So the script completed, yet the remote worker still had an illegal access; no idea how the script still ran, unless dask ignored that worker entirely somehow. The remote worker had rabit failures and then followed up with the same illegal-access output:
FINISHED, BUT with the same illegal-access error and the sequence of rabit messages shown before. So this diff is the bad one: |
In that "diff" I see changes to rabit, which probably explains why the rabit stuff shows up. I guess the rabit changes broke dask in this case, or perhaps those changes were trying to fix something that otherwise didn't work, etc. You guys would know better. To be clear, some later commit the rabit stuff must goes away, but one is left with the same illegal error and hang instead. As an aside, it would be nice if rabit in xgboost was such that xgboost dask/etc. fitting/predicting would be robust to workers going down. Is that how it is supposed to work? It definitely hasn't been that way for me ever, any crash of worker hangs everything, as I've shard before. Still same here for 1.3.0 head of master. I'm not sure what one expects, but definitely being fragile to a single worker going down is bad. But illegal failure and hang are also super bad. |
Thanks for bisecting the issue. I haven't been able to reproduce it at the moment; I'm trying out a remote cluster right now. Looking at the diff you found, it seems the change to rabit is just logging. Maybe the log got redirected to the tracker, which is causing the issue; I will continue looking into it.
XGBoost is robust to exceptions, but not to hard crashes. |
I'll try to bisect the case where there is no more rabit failure, just the illegal access + hang. rabit + illegal (kept trying, failing multiple times, and the script failed): https://s3-us-west-2.amazonaws.com/xgboost-nightly-builds/xgboost-1.3.0_SNAPSHOT%2B12d27f43ff117462d607d0949fb89bccbc405a49-py3-none-manylinux2010_x86_64.whl rabit + illegal (kept trying, failing multiple times, and the script failed): This is the change that lost the rabit allreduce message but kept the illegal access + hang: Again some logging changes, but also some actual non-test dask changes and some global config stuff. So yes, I'd be suspicious of the logging + dask changes, since both appeared in the original case where the illegal access was hit. Most likely it's the dask changes, however. FYI, that type of error is seemingly related to trying to free a pointer that did not come from malloc (i.e. you don't own the pointer): https://stackoverflow.com/questions/6199729/how-to-solve-munmap-chunk-invalid-pointer-error-in-c |
So overall, the illegal error probably comes from the python-package/xgboost/dask.py changes here: https://github.com/dmlc/xgboost/compare/8a17610666dd1286fbeb20a96f6aab92ef63a994..d711d648cb83409da00ff48e1890e1e9e386856 Of course, it's also possible that the changes just exposed an existing issue etc. |
If you guys can make a debug build of any particular hash, I'm happy to run it if that would give more useful information. |
FYI, if I run the script without GPU, I get the exact same kind of error:
i.e. just letting it switch, and connecting to normal dask workers, I see:
Of course, it could just be ignoring the client setting and doing GPU work on the dask-workers. If I run with 'hist' as well, I see:
It hangs, but no illegal-access errors are printed. So it probably failed in a related way, just not caught the same way. But the hang is already bad, and it seems like this is not GPU specific. |
Nope. The log is not redirected. I still can't reproduce it on a remote cluster. My setup is 2 nodes with 4-8 GPUs each, plus a scheduler on a third, different node that is CPU only. @hcho3 We have seen a similar log in #6469 (comment), but for the spark package (the "failed to shutdown" thing). |
Going back to the latest build and trying just CPU mode:
Of course, the nightly was built to run on GPU, but I assume that won't matter. Error hit:
It stalls a lot before moving on; the whole script should only take 1 second. And it hangs. Still no illegal-access error, though. But I guess it's a related problem, not GPU specific.
@pseudotensor How's the scheduler doing? |
FYI:
|
And the local worker with the scheduler never sees errors. |
Will be back in 20 minutes. @pseudotensor Is there a way to reach you offline via a voice call? I'm a bit overwhelmed by all the information you have provided, and it might be more productive if we can talk directly. |
Ya, happy to chat in 20, email me at pseudotensor@gmail.com . |
Going back to the PASSED hash just before the problems started also passes and works fine in CPU-only mode.
So to summarize, even in CPU mode it hangs hard and has various errors, even if it does not hit the same illegal-access issue. Steps:
System 1) dask-scheduler --scheduler-file dask_scheduler.json &
System 2) dask-worker --scheduler-file dask_scheduler.json &
System 1 again)
with the script (a minimal sketch of this kind of script follows after this comment):
and the remote worker hits:
and the script never gets past "fit", it just hangs.
|
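For reference, a minimal sketch of the kind of CPU-only script described in these steps, using the functional xgboost.dask API with tree_method='hist'; the data shapes and parameters are assumptions, not the exact script.

```python
# Hypothetical CPU-only variant of the failing script, run on System 1
# against the scheduler file created above (shapes/params are assumed).
from dask.distributed import Client
import dask.array as da
import xgboost as xgb

client = Client(scheduler_file="dask_scheduler.json")

X = da.random.random((1000, 1), chunks=(250, 1))
y = da.random.random(1000, chunks=250)

dtrain = xgb.dask.DaskDMatrix(client, X, y)
# The reported hang happens inside this training call when a remote worker is involved.
output = xgb.dask.train(client, {"tree_method": "hist"}, dtrain, num_boost_round=10)
print(output["history"])
```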
Hmm, one of the workers is not doing well; I assume that would be the local worker. |
The remote worker is always the one that has the connect failures |
I can't reproduce your network environment. I need to ask for help from others who are more familiar with network administration. |
Just trying to think out loud, feel free to ignore me. There are 2 issues related to the failure: the first is a rabit worker refusing to connect to the tracker, the second is that it crashes after the connection failure. There are 2 systems hosting the scheduler, workers, and client. 2 workers are allocated on these 2 different systems, and 1 of them is local to the scheduler. The failed rabit worker is trying to connect to localhost to talk to the tracker, which is apparently wrong. The tracker address is obtained by the client process running a function on the scheduler that resolves the scheduler machine's own hostname (socket.gethostbyname(socket.gethostname()) in dask.py). That last part is the most suspicious; it was done to support a reverse proxy environment and get a correct IP address. Zooming into the possible issue: what if the client is local to the scheduler and hence obtained the IP as localhost, which is then passed to the remote worker that resides on a different machine? |
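A small diagnostic sketch of the suspected failure mode, using the same socket call that the dask.py code in question relies on; whether it resolves to a loopback address depends on the machine's /etc/hosts, so this is a guess to verify rather than a confirmed cause.

```python
# Run on the scheduler machine: this mirrors how the rabit tracker host
# was being determined. On many Ubuntu setups /etc/hosts maps the hostname
# to 127.0.1.1 (or 127.0.0.1), so remote workers would be told to connect
# to a loopback address they cannot reach.
import socket

hostname = socket.gethostname()
resolved = socket.gethostbyname(hostname)
print(hostname, "->", resolved)  # a 127.x.x.x result would explain the connect failure
```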
Yes, there are now seemingly 2 issues, perhaps related, unsure:
The same number of workers on the same host has no such issue. The systems have no special setup; I see no reason it should be trying to connect to localhost=127.0.0.1.
|
FYI, even if I just have 1 scheduler on system A and 1 worker on system B, the same problem happens. A local worker does not matter. I also tried a different remote system; it also hangs, but differently:
|
It still feels race related. If I install from scratch and try again, I don't see the same rabit error (same systems used), but instead it is super slow to do the same script; there is no reason it should take 40 seconds. Also, the remote worker never shows any logs for the task. It's as if dask just ends up using the local client to run things. Dask looks like this: dispatched_train stands at 0/1 for a long time. But I've tried multiple remote workers for the 1 (only) scheduler system, and it's always the same for that script: it completes, but takes ridiculously long. I have a separate system I put a remote worker on, also the same fresh install, and that one hangs. |
FYI, even though the errors/hangs/times are not consistent, the bisection is consistent. That is, if I go back to a PASSED case, then that same scheduler/worker setup on any of the systems always works and is always quick. So I think it still has to be the dask.py changes I pointed to, @trivialfis. The problems have been too consistently major since that change. |
I still couldn't reproduce it. It runs pretty well on my setups. |
Could you please try applying this patch and see if it fixes the connection issue? It reverts some of the changes used to support k8s.

diff --git a/python-package/xgboost/dask.py b/python-package/xgboost/dask.py
index 602f7b26..5b8cadcc 100644
--- a/python-package/xgboost/dask.py
+++ b/python-package/xgboost/dask.py
@@ -67,11 +67,9 @@ distributed = LazyLoader('distributed', globals(), 'dask.distributed')
 LOGGER = logging.getLogger('[xgboost.dask]')
 
-def _start_tracker(n_workers):
+def _start_tracker(host, n_workers):
     """Start Rabit tracker """
     env = {'DMLC_NUM_WORKER': n_workers}
-    import socket
-    host = socket.gethostbyname(socket.gethostname())
     rabit_context = RabitTracker(hostIP=host, nslave=n_workers)
     env.update(rabit_context.slave_envs())
@@ -603,7 +601,9 @@ def _dmatrix_from_list_of_parts(is_quantile, **kwargs):
 async def _get_rabit_args(n_workers: int, client):
     '''Get rabit context arguments from data distribution in DaskDMatrix.'''
-    env = await client.run_on_scheduler(_start_tracker, n_workers)
+    host = distributed.comm.get_address_host(client.scheduler.address)
+    env = await client.run_on_scheduler(
+        _start_tracker, host.strip('/:'), n_workers)
     rabit_args = [('%s=%s' % item).encode() for item in env.items()]
     return rabit_args
|
Finally reproduced. |
The above patch should fix it. But on the other hand it will break k8s and reverse proxy env. |
Hi @trivialfis, glad you were able to repro. Sorry, I got a bit busy and wasn't yet able to try the patch you suggested. Is there any need to try it anymore, or is it all figured out? Thanks! How about the illegal pointer? That only showed up in the GPU case. |
It's fixed in xgboost, no need to try the patch now. I haven't been able to get the invalid pointer error, as it's a second error caused by the currently fixed one. But right now I don't have another GPU device to form an SSH cluster. I used a laptop and a desktop to reproduce your environment after failing to do so with a virtual machine. But my laptop is CPU only...
Using rapids 0.14, conda, Ubuntu 16/18, cuda 10.0, cuda 11.1 driver, dask/distributed 2.17 that matches rapids 0.14.
After updating to the 1.3.0 master nightly, I'm hitting this with any dask fit. It's pervasive, so I'll probably have to go back to 1.2.1 for now unless there is an easy fix. Once it happens, the worker is restarted and xgboost hangs.
It's late here, so I'll post a repro if possible over the weekend.
FYI @trivialfis