Core dump with Census workload on AMD #1918
Comments
Unfortunately, this error does not reproduce on Ubuntu with an Intel CPU, so the cause may be specific to CentOS or to AMD CPUs, or it may be an incorrectly created conda environment. To check whether the environment is at fault, the benchmark can be run using |
Thanks for the quick response! A conda env problem is unlikely, since NY-taxi runs successfully. But I will try the command you suggested when I get that machine again (we borrowed the AMD machine from another team). |
You are welcome! Let us know if you have any updates on this issue. |
We found that benchmark execution fails at the column multiplication step on the line https://github.com/intel-go/omniscripts/blob/master/census/census_pandas_ibis.py#L78. The error can be reproduced in Modin with the following code:
Benchmark execution also hangs on the Intel CPU machine (the error was not reproduced there earlier because, on the Intel CPU machine, it only appears for a higher value of |
We found that the core dump on the Intel machine is caused by the limited space available in the default plasma directory; the problem can be fixed by redefining this directory in the |
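To make the space check behind this finding concrete, here is a minimal sketch that uses only the Python standard library and does not call Ray at all. It compares the free space of candidate directories against the 200 GiB object store size reported in the log. The helper names `free_bytes` and `pick_plasma_directory` are hypothetical, and the candidate list (`/dev/shm`, which is typically the default shared-memory location on Linux, plus the system temp directory) is an assumption for illustration, not the fix applied in the benchmark scripts.

```python
import shutil
import tempfile

GIB = 1024 ** 3  # bytes in one GiB


def free_bytes(directory):
    """Return the bytes currently free on the filesystem that holds `directory`."""
    return shutil.disk_usage(directory).free


def pick_plasma_directory(candidates, required_bytes):
    """Return the first candidate with at least `required_bytes` free, else None.

    Hypothetical helper: it only inspects disk usage and configures nothing.
    """
    for directory in candidates:
        try:
            if free_bytes(directory) >= required_bytes:
                return directory
        except OSError:
            # The directory is missing or unreadable, so skip it.
            continue
    return None


if __name__ == "__main__":
    # 200 GiB, which equals the 214748364800 bytes printed by the benchmark.
    required = 200 * GIB
    # Assumed candidates; adjust to the machine being checked.
    candidates = ["/dev/shm", tempfile.gettempdir()]
    choice = pick_plasma_directory(candidates, required)
    print(f"requested object store size: {required} bytes")
    print(f"directory with enough free space: {choice}")
```

If no candidate has enough room, the function returns `None`, which would suggest either lowering the requested object store size or pointing the plasma directory at a larger filesystem.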
Waiting for Mandy to try again and confirm a positive result. |
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
System information
Modin version (modin.__version__): 0.8.0+9.ge517a09.dirty
python run_ibis_benchmark.py -bench_name census -no_ibis true -df 1 -data_file /dataset/census/ipums_education2income_1970-2010.csv.gz -pandas_mode Modin_on_ray -ray_tmpdir /tmp
Describe the problem
The following errors occur when running the census benchmark with Modin on Ray on an AMD machine:
2020-08-11 14:54:55,172 INFO resource_spec.py:212 -- Starting Ray with 200.0 GiB memory available for workers and up to 200.0 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-08-11 14:54:55,386 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-08-11 14:54:55,561 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
2020-08-11 14:54:55,563 WARNING services.py:1517 -- WARNING: object_store_memory is not verified when plasma_directory is set.
Pandas backend: Modin on Ray with tmp directory /tmp and memory 214748364800
(pid=raylet) F0811 14:58:02.397012 161995 161995 node_manager.cc:563] Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***
Source code / logs
modin_census_amd.docx
AMD info:
[root@localhost ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 1500.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4500.16
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca