Disable ccache log print due to threadunsafety #11997
Conversation
What is the source of the deadlocks? That sounds like a bug. I'm in favor of this, but I have a few questions: Why do we print the following on each command:
Where are the ccache statistics coming from in each command? |
The problem is this line: https://github.com/apache/incubator-mxnet/pull/11997/files#diff-1335fbaf3930b1438d9be18edb07a1a6L122 The stats.lock may only be accessed with proper locks in place; simply grepping it causes problems, especially considering we are on NFS. That command is also the reason for the log you pasted. Does that answer your questions? |
24832a9 to 129c974
LGTM, please merge. I'm not sure what problems it can cause other than a corrupted file. It shouldn't crash anything or block; can you provide more info about this? Just curious. |
This file is shared by all compilers. This means that if we don't access it with proper locking |
There are normally no locking guarantees on NFS. As far as I recall, it usually doesn't follow POSIX semantics, for performance reasons. |
When will you merge? |
I have to wait for an approval. |
Description
Accessing the shared ccache statistics is not thread-safe and causes deadlocks, as seen in http://jenkins.mxnet-ci.amazon-ml.com/blue/organizations/jenkins/incubator-mxnet/detail/master/1310/pipeline. The statistics printouts were only there for debugging purposes, so this PR removes them.
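For readers unfamiliar with the failure mode, here is a minimal sketch of what "access with proper locks" means for a file shared between processes. It is not ccache's own locking mechanism and not the code touched by this PR: the file name `stats` and the helpers `write_counters`, `read_counters`, and `demo` are hypothetical. It assumes a local POSIX filesystem where `fcntl.flock` advisory locks are honored, which, as noted in the conversation, cannot be taken for granted on NFS.

```python
import fcntl
import os
import tempfile


def write_counters(path, counters):
    """Rewrite the (hypothetical) counters file under an exclusive advisory lock."""
    # "a+" creates the file if needed without truncating it before the lock is held.
    with open(path, "a+") as f:
        fcntl.flock(f, fcntl.LOCK_EX)  # blocks until no other process holds a lock
        try:
            f.seek(0)
            f.truncate()  # safe: we are the only writer right now
            f.write(" ".join(str(c) for c in counters))
            f.flush()
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def read_counters(path):
    """Read the counters under a shared lock so a half-written file is never seen."""
    with open(path, "r") as f:
        fcntl.flock(f, fcntl.LOCK_SH)  # several readers may hold this at once
        try:
            return [int(tok) for tok in f.read().split()]
        finally:
            fcntl.flock(f, fcntl.LOCK_UN)


def demo():
    """Write twice, then read back the latest counters from a temporary file."""
    with tempfile.TemporaryDirectory() as d:
        path = os.path.join(d, "stats")
        write_counters(path, [1, 2])
        write_counters(path, [3, 0, 7])
        return read_counters(path)


if __name__ == "__main__":
    print(demo())
```

A plain `grep` on such a file takes no lock at all, so it bypasses whatever protocol the writers follow. That is the kind of unsynchronized access this PR removes.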