Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[rabit harden] include osx in tests, address time_wait on port assignment, enable all tests #90

Merged
merged 21 commits into from
Apr 24, 2019

Conversation

chenqin
Copy link
Contributor

@chenqin chenqin commented Apr 9, 2019

follow up clean up dmlc-core header file copy pr https://github.com/chenqin/rabit/commit/ecd4bf7aaee477cea6ac14aa52f8276192f085f6.

@chenqin chenqin changed the title [rabit harden] include regression tests on xgboost cmake/java_tests [rabit harden] include regression tests on xgboost java_tests Apr 9, 2019
Copy link
Member

@CodingCat CodingCat left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

have we fixed the test mentioned in #86 (comment)?

.travis.yml Outdated Show resolved Hide resolved
.travis.yml Outdated Show resolved Hide resolved
.travis.yml Outdated Show resolved Hide resolved
.travis.yml Show resolved Hide resolved
@chenqin
Copy link
Contributor Author

chenqin commented Apr 9, 2019

have we fixed the test mentioned in #86 (comment)?

That should be another pr after we updated xgboost master. The rationale is previous and this pr haven't change how rabit functions other than reshuffle parameters and delete redundant code.

I already got some idea how to fix flaky test by introducing extra check in allreduce_robust before reset links infinitely. But yeah, I think that should be seperate thing after we have good baseline dmlc/xgboost#4352

scripts/travis_script.sh Outdated Show resolved Hide resolved
remove xgb java tests
@chenqin chenqin changed the title [rabit harden] include regression tests on xgboost java_tests [rabit harden] include osx in tests Apr 10, 2019
@chenqin
Copy link
Contributor Author

chenqin commented Apr 11, 2019

for somereason, trybind actually allows two process bind to same port.

cq@cq-MS-7B84:~/xgboost/rabit/test$ make -f test.mk 
../dmlc-core/tracker/dmlc-submit --cluster local --num-workers=10 --local-num-attempt=10 model_recover 10000 mock=0,0,1,0 mock=1,1,1,0
2019-04-10 15:17:51,502 INFO start listen on 10.0.1.25:9091
[0]ReConnectLinks cmd start port 9010 range (9010 - 10010)
[1]ReConnectLinks cmd start port 9011 range (9010 - 10010)
[0]sock_listen.Accept cmd start port # 9010 0 num_accept 2
[2]ReConnectLinks cmd start port 9012 range (9010 - 10010)
[1]sock_listen.Accept cmd start port # 9011 0 num_accept 2
[3]ReConnectLinks cmd start port 9013 range (9010 - 10010)
[2]sock_listen.Accept cmd start port # 9012 0 num_accept 2
[4]ReConnectLinks cmd start port 9014 range (9010 - 10010)
[2]sock_listen.Accept cmd start port # 9012 1 num_accept 2
[3]sock_listen.Accept cmd start port # 9013 0 num_accept 1
[5]ReConnectLinks cmd start port 9015 range (9010 - 10010)
[4]sock_listen.Accept cmd start port # 9014 0 num_accept 1
[6]ReConnectLinks cmd start port 9016 range (9010 - 10010)
[1]sock_listen.Accept cmd start port # 9011 1 num_accept 2
[5]sock_listen.Accept cmd start port # 9015 0 num_accept 1
[7]ReConnectLinks cmd start port 9017 range (9010 - 10010)
[6]sock_listen.Accept cmd start port # 9016 0 num_accept 1
[8]ReConnectLinks cmd start port 9018 range (9010 - 10010)
[7]sock_listen.Accept cmd start port # 9017 0 num_accept 2
[9]ReConnectLinks cmd start port 9019 range (9010 - 10010)
[0]sock_listen.Accept cmd start port # 9010 1 num_accept 2
[8]sock_listen.Accept cmd start port # 9018 0 num_accept 1
[7]sock_listen.Accept cmd start port # 9017 1 num_accept 2
[0] reload-trail=0, init iter=0
[1] reload-trail=0, init iter=0
[9] reload-trail=0, init iter=0
[7] reload-trail=0, init iter=0
[6] reload-trail=0, init iter=0
[2] reload-trail=0, init iter=0
[4] reload-trail=0, init iter=0
[5] reload-trail=0, init iter=0
2019-04-10 15:17:52,491 INFO @tracker All of 10 nodes getting started
[8] reload-trail=0, init iter=0
[3] reload-trail=0, init iter=0
[0] !!!TestMax pass, iter=0
[3] !!!TestMax pass, iter=0
[6] !!!TestMax pass, iter=0
[0]@@@Hit Mock Error:Broadcast
[4] !!!TestMax pass, iter=0
[5] !!!TestMax pass, iter=0
[2] !!!TestMax pass, iter=0
[8] !!!TestMax pass, iter=0
[9] !!!TestMax pass, iter=0
[7] !!!TestMax pass, iter=0
[1] !!!TestMax pass, iter=0
[9]ReConnectLinks cmd recover port 9014 range (9010 - 10010)
[8]ReConnectLinks cmd recover port 9015 range (9010 - 10010)
[9]sock_listen.Accept cmd recover port # 9014 0 num_accept 3
[7]ReConnectLinks cmd recover port 9016 range (9010 - 10010)
[8]sock_listen.Accept cmd recover port # 9015 0 num_accept 1
[9]sock_listen.Accept cmd recover port # 9014 1 num_accept 3
[1]ReConnectLinks cmd recover port 9017 range (9010 - 10010)
[3]ReConnectLinks cmd recover port 9018 range (9010 - 10010)
[2]ReConnectLinks cmd recover port 9019 range (9010 - 10010)
[1]sock_listen.Accept cmd recover port # 9017 0 num_accept 3
[3]sock_listen.Accept cmd recover port # 9018 0 num_accept 2
[6]ReConnectLinks cmd recover port 9020 range (9010 - 10010)
[1]sock_listen.Accept cmd recover port # 9017 1 num_accept 3
[7]sock_listen.Accept cmd recover port # 9016 0 num_accept 1
[5]ReConnectLinks cmd recover port 9021 range (9010 - 10010)
[6]sock_listen.Accept cmd recover port # 9020 0 num_accept 1
[4]ReConnectLinks cmd recover port 9022 range (9010 - 10010)
[2]sock_listen.Accept cmd recover port # 9019 0 num_accept 1
[3]sock_listen.Accept cmd recover port # 9018 1 num_accept 2
[5]sock_listen.Accept cmd recover port # 9021 0 num_accept 1

@chenqin chenqin changed the title [rabit harden] include osx in tests [rabit harden] include osx in tests, fix flaky integration test Apr 11, 2019
@chenqin
Copy link
Contributor Author

chenqin commented Apr 11, 2019

have we fixed the test mentioned in #86 (comment)?

Done

@chenqin
Copy link
Contributor Author

chenqin commented Apr 12, 2019

osx build error due to unable to download package. should be fine if rerun. local build here.
https://travis-ci.org/chenqin/rabit/builds/518971901

@trivialfis
Copy link
Member

trivialfis commented Apr 12, 2019

@chenqin thanks, restarted the test.

@chenqin
Copy link
Contributor Author

chenqin commented Apr 12, 2019

all tests green @CodingCat @trivialfis

@CodingCat
Copy link
Member

check my comments about mpi in osx

@CodingCat
Copy link
Member

CodingCat commented Apr 12, 2019

hmmmm.....MPI in Mac doesn't compile....we missed some dependency?

and test itself is still flaky, I restarted linux one (thought it was a Travis issue) but found mac one stuck there later (I didn't restart for your further investigation)

@chenqin
Copy link
Contributor Author

chenqin commented Apr 12, 2019

hmmmm.....MPI in Mac doesn't compile....we missed some dependency?

Seems brew install can't do with cxx binding, needs to recompile from source everytime. Hopefully it can be done within 10mins before travis timeout. found this blog, going to try this path
https://d-meiser.github.io/2016/01/10/mpi-travis.html

and test itself is still flaky, I restarted linux one (thought it was a Travis issue) but found mac one stuck there later (I didn't restart for your further investigation)

Will take a look, this time is mac

@chenqin chenqin changed the title [rabit harden] include osx in tests, fix flaky integration test [rabit harden] include osx in tests, address time_wait on port assignment Apr 15, 2019
@chenqin
Copy link
Contributor Author

chenqin commented Apr 16, 2019

dmlc/dmlc-core#524

@CodingCat
Copy link
Member

isn't the test still hanging?

@chenqin
Copy link
Contributor Author

chenqin commented Apr 18, 2019

We need to fix dmlc-core first.

@CodingCat
Copy link
Member

We need to fix dmlc-core first.

check dmlc/dmlc-core#525 (comment)

@CodingCat
Copy link
Member

looks like we have very consistent behavior in linux with pip3

.travis.yml Outdated Show resolved Hide resolved
scripts/mpi.sh Outdated Show resolved Hide resolved
test/test.mk Show resolved Hide resolved
@chenqin chenqin changed the title [rabit harden] include osx in tests, address time_wait on port assignment [rabit harden] include osx in tests, address time_wait on port assignment, enable all tests Apr 21, 2019
@CodingCat CodingCat requested a review from hcho3 April 22, 2019 17:46
@chenqin
Copy link
Contributor Author

chenqin commented Apr 24, 2019

friendly ping

Copy link
Contributor

@hcho3 hcho3 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for cleaning the MPI tests. We should create an example with MPI in the future.

@trivialfis trivialfis merged commit e3d51d3 into dmlc:master Apr 24, 2019
@trivialfis
Copy link
Member

trivialfis commented Apr 24, 2019

Merge. I hope you don't mind me changing the commit message. ;-)

I'm more worry about dependency on dmlc-core. The other day I tried to use C++ API from mxnet, it turns out there's no way to install it from build system with duplicated dmlc-core as git submodules. If the common directory is not common enough, we might just remove it completely, or try to use a proper package manager instead of git submodule.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants