Core dump with Census workload on AMD #1918

Closed
mandy-li opened this issue Aug 11, 2020 · 6 comments
Labels
bug 🦗 Something isn't working Ray ⚡ Issues related to the Ray engine

Comments

@mandy-li

System information

  • OS Platform and Distribution (e.g., Linux Ubuntu 16.04): CentOS 7.6 (3.10.0-957.el7.x86_64)
  • Modin version (modin.__version__): 0.8.0+9.ge517a09.dirty
  • Python version: Python 3.7.8
  • Code we can use to reproduce: run the census workload from the benchmark script repo with the following command:

python run_ibis_benchmark.py -bench_name census -no_ibis true -df 1 -data_file /dataset/census/ipums_education2income_1970-2010.csv.gz -pandas_mode Modin_on_ray -ray_tmpdir /tmp

Describe the problem

The following errors occur when running the census workload with Modin on Ray on an AMD machine:

2020-08-11 14:54:55,172 INFO resource_spec.py:212 -- Starting Ray with 200.0 GiB memory available for workers and up to 200.0 GiB for objects. You can adjust these settings with ray.init(memory=, object_store_memory=).
2020-08-11 14:54:55,386 WARNING services.py:923 -- Redis failed to start, retrying now.
2020-08-11 14:54:55,561 INFO services.py:1165 -- View the Ray dashboard at localhost:8265
2020-08-11 14:54:55,563 WARNING services.py:1517 -- WARNING: object_store_memory is not verified when plasma_directory is set.
Pandas backend: Modin on Ray with tmp directory /tmp and memory 214748364800
(pid=raylet) F0811 14:58:02.397012 161995 161995 node_manager.cc:563] Check failed: node_id != self_node_id_ Exiting because this node manager has mistakenly been marked dead by the monitor.
(pid=raylet) *** Check failure stack trace: ***

Source code / logs

modin_census_amd.docx

AMD info:

[root@localhost ~]# lscpu
Architecture: x86_64
CPU op-mode(s): 32-bit, 64-bit
Byte Order: Little Endian
CPU(s): 256
On-line CPU(s) list: 0-255
Thread(s) per core: 2
Core(s) per socket: 64
Socket(s): 2
NUMA node(s): 2
Vendor ID: AuthenticAMD
CPU family: 23
Model: 49
Model name: AMD EPYC 7742 64-Core Processor
Stepping: 0
CPU MHz: 1500.000
CPU max MHz: 2250.0000
CPU min MHz: 1500.0000
BogoMIPS: 4500.16
Virtualization: AMD-V
L1d cache: 32K
L1i cache: 32K
L2 cache: 512K
L3 cache: 16384K
NUMA node0 CPU(s): 0-63,128-191
NUMA node1 CPU(s): 64-127,192-255
Flags: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush mmx fxsr sse sse2 ht syscall nx mmxext fxsr_opt pdpe1gb rdtscp lm constant_tsc art rep_good nopl xtopology nonstop_tsc extd_apicid aperfmperf eagerfpu pni pclmulqdq monitor ssse3 fma cx16 sse4_1 sse4_2 x2apic movbe popcnt aes xsave avx f16c rdrand lahf_lm cmp_legacy svm extapic cr8_legacy abm sse4a misalignsse 3dnowprefetch osvw ibs skinit wdt tce topoext perfctr_core perfctr_nb bpext perfctr_l2 cpb cat_l3 cdp_l3 hw_pstate sme retpoline_amd ssbd ibrs ibpb stibp vmmcall fsgsbase bmi1 avx2 smep bmi2 cqm rdt_a rdseed adx smap clflushopt clwb sha_ni xsaveopt xsavec xgetbv1 cqm_llc cqm_occup_llc cqm_mbm_total cqm_mbm_local clzero irperf xsaveerptr arat npt lbrv svm_lock nrip_save tsc_scale vmcb_clean flushbyasid decodeassists pausefilter pfthreshold avic v_vmsave_vmload vgif umip overflow_recov succor smca

@mandy-li mandy-li added the bug 🦗 Something isn't working label Aug 11, 2020
@amyskov
Contributor

amyskov commented Aug 12, 2020

Unfortunately, this error does not reproduce on Ubuntu with an Intel CPU, so the cause may be specific to CentOS or the AMD CPU, or it may be an incorrectly created conda environment. To check whether the environment is the cause, the benchmark can be run with the run_ibis_tests.py script (which creates a new environment and runs the benchmark) using the following command:
python run_ibis_tests.py -executable '' -task build,benchmark --env_name new_env_name --env_check True --save_env True --modin_path /path/to/modin/ -no_ibis true -df 1 -data_file /dataset/census/ipums_education2income_1970-2010.csv.gz -pandas_mode Modin_on_ray -ray_tmpdir /tmp -bench_name census

@mandy-li
Author

Thanks for the quick response! A conda environment problem is unlikely, since the NY-taxi benchmark runs successfully. But I will try the command you suggested when I get that machine again (we borrowed the AMD machine from another team).

@amyskov
Contributor

amyskov commented Aug 13, 2020

You are welcome! Let us know if you have any updates on this issue.

@amyskov
Contributor

amyskov commented Aug 15, 2020

Benchmark execution was found to fail at the column multiplication step in the line https://github.com/intel-go/omniscripts/blob/master/census/census_pandas_ibis.py#L78. The error can be reproduced in Modin with the following code:

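import numpy as np
import modin.pandas as pd

# Number of index values to draw (before deduplication).
initial_index_size = int(1e8)

# Sorted, unique integer index drawn from [0, 1.5 * initial_index_size).
# np.unique already returns its result sorted, so no separate np.sort is needed.
index = np.unique(np.random.choice(int(initial_index_size * 1.5), size=initial_index_size))

# Two series of random digits 0-9, one value per index entry.
ser1 = pd.Series(np.random.choice(10, size=index.size))
ser2 = pd.Series(np.random.choice(10, size=index.size))

ser1.index = index
ser2.index = index

# Element-wise multiplication is the step where execution fails.
ans = ser1 * ser2

print(ans)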

Execution also hangs on an Intel CPU machine. The benchmark error was not reproduced there earlier because on the Intel machine it only occurs at a higher value of the initial_index_size parameter.
Intel_CPU_machine_logs.txt
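
To get a feel for how much data the reproducer above pushes through the object store, here is a rough back-of-the-envelope sketch. It assumes 64-bit integers for both the values and the index (8 bytes each) and ignores any Modin or Ray overhead and intermediate copies, so it is only a lower bound. The helper names `expected_unique` and `estimate_series_bytes` are made up for this illustration and are not part of Modin or Ray.

```python
import math


def expected_unique(population, draws):
    """Expected number of distinct values when drawing `draws` times,
    uniformly and with replacement, from `population` possible values."""
    # population * (1 - (1 - 1/population) ** draws), computed stably.
    return population * (1.0 - math.exp(draws * math.log1p(-1.0 / population)))


def estimate_series_bytes(rows, value_bytes=8, index_bytes=8):
    """Lower-bound size of one series: values plus index.

    ASSUMPTION: int64 values and an int64 index (8 bytes each);
    Modin/Ray bookkeeping and temporary copies are not counted.
    """
    return rows * (value_bytes + index_bytes)


if __name__ == "__main__":
    initial_index_size = int(1e8)  # same value as in the reproducer
    rows = int(expected_unique(int(initial_index_size * 1.5), initial_index_size))
    per_series = estimate_series_bytes(rows)
    print(f"expected rows per series: {rows:,}")
    print(f"estimated size per series: {per_series / 1024**3:.2f} GiB")
    # ser1, ser2 and the product each need at least this much space.
    print(f"three series together: {3 * per_series / 1024**3:.2f} GiB")
```

With the reproducer's values this works out to roughly 73 million rows, or a bit over 1 GiB per series before any copies, so raising initial_index_size increases the space needed proportionally.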

@amyskov
Contributor

amyskov commented Sep 3, 2020

The core dump on the Intel machine turned out to be caused by the limited space in the default plasma directory. The problem can be fixed by redefining this directory in ray.init(). This solution may also fix the issue on the AMD machine.
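
As a minimal sketch of that workaround, the snippet below picks a directory with enough free space and builds the keyword arguments for ray.init(). Several things here are assumptions rather than facts from this issue: the candidate directories (including /dev/shm as the presumed Linux default), the 8 GiB size, and the helper names `pick_plasma_directory` and `build_ray_init_kwargs`. The keyword is `plasma_directory` in the Ray releases current at the time of this issue; newer releases spell it `_plasma_directory`, so check the installed version. The actual ray.init() call is left commented out so the sketch runs without Ray installed.

```python
import os
import shutil
import tempfile


def pick_plasma_directory(candidates, required_bytes):
    """Return the first existing directory with at least `required_bytes`
    of free space, or None if no candidate qualifies."""
    for path in candidates:
        if os.path.isdir(path) and shutil.disk_usage(path).free >= required_bytes:
            return path
    return None


def build_ray_init_kwargs(plasma_dir, object_store_bytes,
                          param_name="plasma_directory"):
    """Build keyword arguments for ray.init().

    ASSUMPTION: `param_name` matches the installed Ray version
    ("plasma_directory" in older releases, "_plasma_directory" in newer ones).
    """
    return {param_name: plasma_dir, "object_store_memory": object_store_bytes}


if __name__ == "__main__":
    # Hypothetical candidates; replace with directories on a large enough disk.
    candidates = ["/dev/shm", "/tmp", tempfile.gettempdir()]
    needed = 8 * 1024**3  # illustrative object store size, not a recommendation

    chosen = pick_plasma_directory(candidates, needed)
    if chosen is None:
        print("no candidate directory has enough free space")
    else:
        kwargs = build_ray_init_kwargs(chosen, needed)
        print("would call ray.init with:", kwargs)
        # import ray
        # ray.init(**kwargs)
```

Checking free space up front avoids discovering the limit only when the raylet aborts partway through a long run.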

@Garra1980 Garra1980 added the Ray ⚡ Issues related to the Ray engine label Sep 11, 2020
@Garra1980
Collaborator

Waiting for Mandy to try again and confirm a positive result.

amyskov added a commit to amyskov/modin that referenced this issue Sep 11, 2020
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
amyskov added a commit to amyskov/modin that referenced this issue Sep 11, 2020
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
devin-petersohn pushed a commit that referenced this issue Sep 14, 2020
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
aregm pushed a commit to aregm/modin that referenced this issue Sep 16, 2020
Signed-off-by: Alexander Myskov <alexander.myskov@intel.com>
@aregm aregm closed this as completed Sep 17, 2020