Example job submissions used to build the environments (Polaris/PBS, then ThetaGPU/Cobalt):

qsub -A datascience -q preemptable -l select=1:ncpus=64:ngpus=4,filesystems=swift,walltime=02:30:00 -joe -- install_datascience_conda_fromsource.sh /soft/datascience/conda/2023-01-10

qsub -t 170 -n 1 -q full-node -A datascience -M <email> -o build.out -e build.out ./install_datascience_conda.sh /lus/theta-fs0/software/thetagpu/conda/2023-01-11
How to download the CUDA Toolkit from NVIDIA and extract just the toolkit (not installing the driver, etc.) without sudo permissions:

E.g. on Polaris, download from NVIDIA, selecting Target Platform:
- Linux > x86_64 > SLES > Version 15 > runfile (local):
- https://developer.nvidia.com/cuda-downloads?target_os=Linux&target_arch=x86_64&Distribution=SLES&target_version=15&target_type=runfile_local
cd /soft/compilers/cudatoolkit
wget https://developer.download.nvidia.com/compute/cuda/11.7.1/local_installers/cuda_11.7.1_515.65.01_linux.run
sh cuda_11.7.1_515.65.01_linux.run --silent --toolkit --toolkitpath=$PWD/cuda-11.7.1
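A minimal sketch of pointing a shell at the extracted toolkit afterwards (the path follows the `--toolkitpath` above; the `bin`/`lib64` layout is the usual CUDA install layout, adjust if it differs):

```bash
# Use the locally extracted CUDA toolkit (no driver install, no sudo).
export CUDA_HOME=/soft/compilers/cudatoolkit/cuda-11.7.1
export PATH="${CUDA_HOME}/bin:${PATH}"
export LD_LIBRARY_PATH="${CUDA_HOME}/lib64:${LD_LIBRARY_PATH}"

# Sanity check: should report release 11.7
nvcc --version
```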
- Create bash scripts for testing environments based on notes in https://anl.app.box.com/notes/1001252052445 (WIP: https://anl.app.box.com/notes/1124584874420)
- Revisit potential cloning issues from around 2023-01-13, namely Romit vs. Taylor's error logs
- ThetaGPU script is not set up to install parallel h5py like the Polaris script does (see the sketch after this list)
- Add MXNet
- Move future conda environments from Python 3.8 to 3.9 (a requirement for HPE Dragon, e.g.), or even 3.10 for better Python error messages. Done: switched to 3.10 starting in January 2023
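A minimal sketch of the parallel h5py build that the ThetaGPU script could mirror (the `cray-hdf5-parallel` module name and `cc` wrapper are Polaris-side assumptions; on ThetaGPU the OpenMPI `mpicc` would take their place):

```bash
# Build h5py from source against an MPI-enabled HDF5 (parallel h5py).
module load cray-hdf5-parallel      # assumption: provides parallel HDF5 and its paths
export HDF5_MPI="ON"                # ask h5py for the MPI-enabled build
export CC=cc                        # Cray compiler wrapper -> Cray MPICH
pip install --no-cache-dir --no-binary=h5py h5py
# HDF5_DIR=/path/to/parallel/hdf5 may also be needed if the compiler wrapper
# does not inject the HDF5 include/library paths.

# Verify the MPI features are enabled:
python -c "import h5py; print(h5py.get_config().mpi)"
```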
- `conda-forge` just has `numpy` (the non-metapackage?) and no `numpy-base`, unlike `defaults`? https://stackoverflow.com/questions/50699252/anaconda-environment-installing-packages-numpy-base
- Why does ThetaGPU seem to demand an OpenMPI/UCX module built against CUDA 11.8 and not 11.4 when TF/Torch/etc. are built with 11.8, yet Cray MPICH on Polaris doesn't seem to care about the minor version of CUDA loaded at runtime and used to build the deep learning libraries?
- Double check that the `rpath` solution to DeepSpeed's dynamic linking to `libaio` is working. Kinda. Still need to do this at runtime: `CFLAGS="-I${CONDA_PREFIX}/include/" LDFLAGS="-L${CONDA_PREFIX}/lib/" ds_report`, presumably because of the JITing? How will this work for users in practice?
- Why does `pip install "sdv>=0.17.1"` reinstall numpy everywhere, and also break torch and install other junk on Polaris? numpy 1.24.1 ---> 1.22.4 on ThetaGPU, even though the existing version seems to match??? I was skipping installing sdv, but switching from Python 3.8 to 3.10 somehow avoided the problems on Polaris. That doesn't make much sense, as even the conditional `python_version<'3.10'` requirements seemed sufficiently loose to not cause problems. Maybe it was the CTGAN dependency, since it is the only one which directly lists `torch` in its `setup.py`? But those version requirements are also loose; the working theory was that some dependency was picky about the numpy version, which forced a re-install of PyTorch.
  - https://github.com/sdv-dev/SDV/blob/master/setup.py
  - https://github.com/sdv-dev/CTGAN/blob/master/setup.py
cd /soft/datascience/conda/2023-01-10/deephyper
pip install --dry-run ".[analytics,hvd,nas,popt,autodeuq,sdv]"
- Binary search the deps of `sdv` in the future if it causes problems again (see the sketch after the pip log below).
Collecting numpy<2,>=1.20.0
Downloading numpy-1.22.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (16.9 MB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 16.9/16.9 MB 228.4 MB/s eta 0:00:00
Attempting uninstall: numpy
Found existing installation: numpy 1.24.1
Uninstalling numpy-1.24.1:
Successfully uninstalled numpy-1.24.1
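A rough sketch of that dependency search, reusing the `--dry-run` trick from above; the dependency list here is illustrative only and should be copied from sdv's `setup.py`:

```bash
# Check sdv's direct dependencies one at a time and flag any that would
# touch numpy or torch in the current environment.
deps=("ctgan" "copulas" "deepecho" "rdt" "sdmetrics")   # illustrative list only
for dep in "${deps[@]}"; do
    echo "=== ${dep} ==="
    pip install --dry-run "${dep}" 2>&1 | grep -E "numpy|torch" \
        || echo "no numpy/torch changes"
done
```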
- Consider fixes for problems arising from mixing `conda-forge` and `defaults` packages. Edit: now trying `conda install -c defaults -c conda-forge ...` on the one line.
- Check that `conda install` after the environment is made doesn't return a bunch of inconsistent package warnings.
https://conda.io/projects/conda/en/latest/user-guide/tasks/manage-channels.html
https://conda-forge.org/docs/user/tipsandtricks.html
conda config --set channel_priority strict
conda install -c defaults -c conda-forge somepackage
which puts `defaults` at top priority. Or:
conda install conda-forge::somepackage
and this will not change the channel priority.
Also, interestingly, quoting the conda-forge docs linked above:
> To solve these issues, conda-forge has created special dummy builds of the mpich and openmpi libraries that are simply shell packages with no contents. These packages allow the conda solver to produce correct environments while avoiding installing MPI binaries from conda-forge. You can install the dummy package with the following command
$ conda install "mpich=x.y.z=external_*"
$ conda install "openmpi=x.y.z=external_*"
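For example (a sketch, not taken from the actual install scripts; the pinned version and the `MPICC=cc` wrapper are assumptions), the dummy package keeps the solver happy while mpi4py is built from source against the system Cray MPICH:

```bash
# Contentless "external" MPI build satisfies conda metadata only.
conda install "mpich=4.0.2=external_*"   # version is an assumption; match the system MPICH
# Build mpi4py against the system MPI via the Cray compiler wrapper.
MPICC=cc python -m pip install --no-cache-dir --no-binary=mpi4py mpi4py
```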
- When did I start adding things from `conda-forge`?
  - Answer: `mamba` might be the only package actually needed that is not on the main `anaconda`/`defaults` channel. (See the sketch at the end of these notes for listing which installed packages came from `conda-forge`.)
- When did I retroactively add `python-libaio` to existing environments? https://github.com/vpelletier/python-libaio
  - Answer: `python-libaio` is also an example of a package that is not on `defaults` but is on `conda-forge`. However, for DeepSpeed built from source, we might only need `libaio`, which is on `defaults`. Sam requested `python-libaio` on 2022-11-09, but I don't think it was ever installed via these scripts or retroactively in existing conda environments (he was experimenting in a clone).
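One way to answer the "what actually came from `conda-forge`" questions above for an existing environment (the environment path below is a placeholder):

```bash
# Show the channel each installed package came from, keep only conda-forge ones.
conda list -p /path/to/conda/env --show-channel-urls | grep conda-forge
```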