prov/cxi: PR#9791 breaks build on LUMI #9835

Open
thomasgillis opened this issue Feb 22, 2024 · 7 comments

Comments

@thomasgillis
Contributor

thomasgillis commented Feb 22, 2024

Hi all,

I am trying to build the cxi provider on LUMI. The update merged in #9791 breaks the build because the installed libcxi is too old.
I am using the main branch with the patch suggested in #9789:

CC       prov/cxi/test/multinode/prov_cxi_test_multinode_test_barrier-test_barrier.o
In file included from prov/cxi/test/multinode/test_coll.c:29:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
In file included from prov/cxi/test/multinode/test_frmwk.c:28:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty
In file included from prov/cxi/test/multinode/multinode_frmwk.c:67:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
In file included from prov/cxi/test/multinode/test_barrier.c:51:
./prov/cxi/include/cxip.h: In function 'cxip_cmdq_empty':
./prov/cxi/include/cxip.h:2799:16: warning: implicit declaration of function 'cxi_cq_empty'; did you mean 'cxi_eq_empty'? [-Wimplicit-function-declaration]
 2799 |         return cxi_cq_empty(cmdq->dev_cmdq);
      |                ^~~~~~~~~~~~
      |                cxi_eq_empty

Here are the commands used:

module load PrgEnv-gnu-amd
module load libfabric/1.15.2.0
./autogen.sh
./configure --enable-cxi --with-rocr=${ROCM_PATH} --with-json=${HOME}/json-c-json-c-0.13.1-20180305 --prefix=$(pwd)/_inst
make install -j

and the versions of the relevant libraries:

rpm -qa | grep cxi
cray-libcxi-retry-handler-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-devel-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-devel-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-cxi-driver-kmp-cray_shasta_c-0.9_k5.14.21_150400.24.46_12.0.71-34.7__g22b90ec.SSHOT2.0.2.x86_64
cray-libcxi-dracut-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-libcxi-utils-0.9-SSHOT2.0.2_20230428225319_d0f6cbe0189c.x86_64
cray-cxi-driver-udev-0.9-34.7__g22b90ec.SSHOT2.0.2.x86_64

I understand that open-sourcing cxi is a tedious effort and that the versioning problem might not be resolved easily or quickly. This issue is intended to track the problems we currently face. In the meantime, I have reverted the changes; the branch is available here: https://github.com/thomasgillis/libfabric/tree/dev-cxi
With the PR reverted, the code compiles correctly on LUMI.

@thomasgillis thomasgillis changed the title prov/cxi: PR#9791 break build prov/cxi: PR#9791 break build on LUMI Feb 22, 2024
@thomasgillis thomasgillis changed the title prov/cxi: PR#9791 break build on LUMI prov/cxi: PR#9791 breaks build on LUMI Feb 28, 2024
@jain-jainendra

jain-jainendra commented Mar 27, 2024

@thomasgillis Thanks for the fix. I am able to build libfabric with cxi using your branch, but my application fails at runtime with the following error (I am using Sandia OpenSHMEM with libfabric and cxi as the provider):
[0000] WARN: transport_ofi.c:1420: query_for_fabric
[0000] OFI transport did not find any valid fabric services (provider==cxi)
[0000] ERROR: init.c:466: shmem_internal_heap_postinit
[0000] Transport init failed (-61)
Can you suggest a solution?

@thomasgillis
Contributor Author

It seems to be a provider selection issue in OpenSHMEM, so I am afraid I cannot help you here :)
I would reach out to them directly.

@raffenet
Contributor

Copy/pasting my comment from #9793 (comment). We would really like to be able to build the cxi provider on our production Slingshot systems. I'm not totally sure how we get there from here, but we may be able to utilize ALCF resources for CI.

FWIW, I've reached out to folks at ALCF to see if there's anything that can be done to support, at minimum, build testing of cxi on the Polaris machine here at Argonne. Ideally, once cxi is able to build on a production system, CI could prevent further breaking changes from going in. @jswaro is that something that would be of interest?

@hppritcha
Contributor

On Perlmutter the configury does better than on systems with an older Slingshot release (Perlmutter has 2.1.2), but it still fails with complaints about __user in a cxi-related header file:

configure:35099: WARNING: cxi_prov_hw.h: present but cannot be compiled
configure:35099: WARNING: cxi_prov_hw.h:     check for missing prerequisite headers?
configure:35099: WARNING: cxi_prov_hw.h: see the Autoconf documentation
configure:35099: WARNING: cxi_prov_hw.h:     section "Present But Cannot Be Compiled"
configure:35099: WARNING: cxi_prov_hw.h: proceeding with the compiler's result
configure:35099: checking for cxi_prov_hw.h
configure:35099: result: no
configure:35108: checking uapi/misc/cxi.h usability
configure:35108: gcc -c -O2 -DNDEBUG -pipe -fvisibility=hidden -Wall -Wundef -Wpointer-arith    conftest.c >&5
In file included from conftest.c:147:
/usr/include/uapi/misc/cxi.h:76:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   76 |         void __user *resp;
      |                     ^
/usr/include/uapi/misc/cxi.h:82:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   82 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:96:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
   96 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:110:22: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
  110 |         void __user  *resp;
      |                      ^
/usr/include/uapi/misc/cxi.h:130:21: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token
  130 |         void __user *resp;
      |                     ^
/usr/include/uapi/misc/cxi.h:144:38: error: expected ':', ',', ';', '}' or '__attribute__' before '*' token

Is this what you also see, @raffenet?

@hppritcha
Contributor

Oh, I'm on main at 717ebc5.

@raffenet
Contributor

is this what you also see @raffenet

I think @thomasgillis ran into this and ended up just adding

#define __user

somewhere to make that issue go away, because it's just a hint anyway.

@mcaubet

mcaubet commented Oct 8, 2024

Hi Thomas,

I tried using your branch, but I see some odd behavior. CXI appears to be linked properly:

🔥 [caubet_m@login001:~]# ldd $(which fi_info) | grep cxi
        libcxi.so.1 => /usr/lib64/libcxi.so.1 (0x00007f61079c4000)

However, the provider is not listed. Here is a shorter example using only the CXI provider:

🔥 [caubet_m@login001:~]# export FI_PROVIDER=cxi
🔥 [caubet_m@login001:~]# fi_info
fi_getinfo: -61 (No data available)

We run cray-libcxi-0.9-SSHOT2.1.3_20240529150829_3d1dc9246116.x86_64. I was wondering whether you see (or saw) a similar issue.
