Make a video #1840

Open
Agalakdak opened this issue Aug 30, 2024 · 19 comments

Comments

@Agalakdak

Hello. I have been trying to run at least one test for a long time, and I keep getting errors. Please record a video or share a link showing what a normal, trouble-free run should look like.

@Agalakdak
Author

I will donate $10.

@psyhtest
Contributor

Which test? Which errors?

@Agalakdak
Author

Agalakdak commented Aug 30, 2024

I have run into so many errors that I don't know where to start.
For example, I followed this document: https://docs.mlcommons.org/inference/install/
Then I went straight to https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/
When I entered
cm run script --tags=install,python-venv --name=mlperf

I got the message CM error: automation script not found!

This error first appeared today; up to this point the instructions had worked. The log is below.
user@user:~$ mkdir test
user@user:~$ cd test/
user@user:~/test$ python3 -m venv cm
user@user:~/test$ source cm/bin/activate
(cm) user@user:~/test$ pip install cm4mlops
Collecting cm4mlops
Using cached cm4mlops-0.2-py3-none-any.whl
Collecting cmind
Using cached cmind-2.3.5.tar.gz (63 kB)
Preparing metadata (setup.py) ... done
Collecting pyyaml
Using cached PyYAML-6.0.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (751 kB)
Collecting requests
Using cached requests-2.32.3-py3-none-any.whl (64 kB)
Collecting giturlparse
Using cached giturlparse-0.12.0-py2.py3-none-any.whl (15 kB)
Collecting setuptools>=60
Using cached setuptools-74.0.0-py3-none-any.whl (1.3 MB)
Collecting wheel
Using cached wheel-0.44.0-py3-none-any.whl (67 kB)
Collecting charset-normalizer<4,>=2
Using cached charset_normalizer-3.3.2-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (142 kB)
Collecting idna<4,>=2.5
Using cached idna-3.8-py3-none-any.whl (66 kB)
Collecting certifi>=2017.4.17
Downloading certifi-2024.8.30-py3-none-any.whl (167 kB)
━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ 167.3/167.3 KB 1.3 MB/s eta 0:00:00
Collecting urllib3<3,>=1.21.1
Using cached urllib3-2.2.2-py3-none-any.whl (121 kB)
Using legacy 'setup.py install' for cmind, since package 'wheel' is not installed.
Installing collected packages: wheel, urllib3, setuptools, pyyaml, idna, giturlparse, charset-normalizer, certifi, requests, cmind, cm4mlops
Attempting uninstall: setuptools
Found existing installation: setuptools 59.6.0
Uninstalling setuptools-59.6.0:
Successfully uninstalled setuptools-59.6.0
Running setup.py install for cmind ... done
Successfully installed certifi-2024.8.30 charset-normalizer-3.3.2 cm4mlops-0.2 cmind-2.3.5 giturlparse-0.12.0 idna-3.8 pyyaml-6.0.2 requests-2.32.3 setuptools-74.0.0 urllib3-2.2.2 wheel-0.44.0
(cm) user@user:~/test$ cm run script --tags=install,python-venv --name=mlperf

CM error: automation script not found!
(cm) user@user:~/test$
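
One possible cause of "automation script not found" is that the automation repository is not present in the local CM repos directory. The sketch below checks for it and pulls it if it is missing. It assumes the default location $HOME/CM/repos and the repository alias mlcommons@cm4mlops (both appear in logs later in this thread); ensure_cm_repo is a hypothetical helper name, and nothing in the thread confirms that this resolves the error.

```shell
# Hypothetical helper: make sure the CM automation repo is present locally.
# Assumes the default CM repos location ($HOME/CM/repos); adjust if yours differs.
ensure_cm_repo() {
    repo="${1:-mlcommons@cm4mlops}"
    repo_dir="$HOME/CM/repos/$repo"
    if [ -d "$repo_dir" ]; then
        echo "found: $repo_dir"
    else
        echo "missing: $repo_dir, pulling $repo"
        cm pull repo "$repo"
    fi
}

# Usage (inside the activated virtual environment):
#   ensure_cm_repo
#   cm run script --tags=install,python-venv --name=mlperf
```

If the directory is already there, the cause is likely elsewhere, and the full log is the next thing to share.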

@Agalakdak
Author

Agalakdak commented Aug 30, 2024

I ran this script

cm run --tags=run-mlperf,inference,_find-performance,_full,_r4.0 --model=3d-unet-99 --implementation=intel --framework=pytorch --category=edge - -scenario=Offline --execution_mode=test --device=cpu --quiet --test_query_count=50

There is no libffi7 package on my Ubuntu 23.04
Log below

sudo DEBIAN_FRONTEND=noninteractive apt-get install -y libffi7
Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
E: Unable to find package libffi7

CM error: Portable CM script failed (name = get-generic-sys-util, return code = 256)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!
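
The failure above comes down to apt having no libffi7 candidate on this release. A minimal sketch for checking which package name is available before installing follows. The helper names are hypothetical, and listing libffi8 as an alternative is an assumption (newer Ubuntu releases ship it instead of libffi7, as far as I know). It does not change what the CM script itself requests.

```shell
# Succeeds if apt knows a package with this exact name.
pkg_available() {
    apt-cache show "$1" >/dev/null 2>&1
}

# Prints the first candidate name that apt can find; fails if none exist.
first_available() {
    for pkg in "$@"; do
        if pkg_available "$pkg"; then
            echo "$pkg"
            return 0
        fi
    done
    return 1
}

# Usage:
#   first_available libffi7 libffi8 || echo "no libffi runtime package found"
```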

@arjunsuresh
Contributor

arjunsuresh commented Aug 31, 2024

Hi @Agalakdak The docs page uses the Docker option by default to avoid such OS-dependent issues. Is there a reason you don't want to use Docker?

@Agalakdak
Author

Agalakdak commented Sep 3, 2024

Hi @arjunsuresh! Sorry for the late reply; I tried different ways to solve it. I used the command

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 --model=3d-unet-99 --implementation=intel --framework=pytorch --category=edge --scenario=Offline --execution_mode=test --device=cpu --docker --quiet --test_query_count=50

From here https://docs.mlcommons.org/inference/benchmarks/medical_imaging/3d-unet/

And at the last step I got an error
129.8 /home/cmuser/CM/repos/local/cache/b6acf79e843b4c1e/miniconda3/bin/conda install -y -c intel mkl-include
130.2 Collecting package metadata (current_repodata.json): ...working... failed
132.3
132.3 UnavailableInvalidChannel: HTTP 403 FORBIDDEN for channel intel https://conda.anaconda.org/intel
132.3
132.3 The channel is not accessible or is invalid.
132.3
132.3 You will need to adjust your conda configuration to proceed.
132.3 Use conda config --show channels to view your configuration's current state,
132.3 and use conda config --show-sources to view config file locations.
132.3
132.3
132.5 Detected version: 3.10.12
132.5 Detected version: 3.10.12
132.5 Detected version: 22.0.2
132.5
132.5 Extra PIP CMD:
132.5
132.5 Detected version: 3.0.0
132.5 Detected version: 24.7.1
132.5 Detected version: 3.8.0
132.5
132.5 CM error: Portable CM script failed (name = install-generic-conda-package, return code = 256)

And some more logs in the file
error_with_docker.txt

I don't understand what to do with this. What information can I provide you to solve the problem?
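
The log shows conda receiving HTTP 403 for the intel channel. A rough way to see what status a URL returns from your own network is sketched below. The helper names are hypothetical, and probing the channel URL with curl is only an approximation of what conda requests, so treat the result as a hint and not a diagnosis.

```shell
# Print the HTTP status code returned for a URL (000 if the request fails).
http_status() {
    curl -s -o /dev/null -w '%{http_code}' "$1"
}

# Turn a status code into a short, human-readable verdict.
describe_status() {
    case "$1" in
        2??|3??) echo "reachable" ;;
        403)     echo "forbidden" ;;
        404)     echo "not found" ;;
        *)       echo "unreachable or unexpected ($1)" ;;
    esac
}

# Usage:
#   describe_status "$(http_status https://conda.anaconda.org/intel)"
```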

@Agalakdak
Author

Agalakdak commented Sep 3, 2024

@arjunsuresh, I tried to run another benchmark. But there was an error there too. Please help me figure it out.

Command
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.0 \
--model=retinanet \
--implementation=intel \
--framework=pytorch \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cpu \
--docker --quiet \
--test_query_count=100

Error log
1762.7 environment: line 1: 51417 Killed ${CM_PYTHON_BIN_WITH_PATH} "$@"
1762.8
1762.8 CM error: Portable CM script failed (name = get-dataset-openimages, return code = 256)
1762.8
1762.8
1762.8 ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
1762.8 Note that it is often a portability issue of a third-party tool or a native script
1762.8 wrapped and unified by this CM script (automation recipe). Please re-run
1762.8 this script with --repro flag and report this issue with the original
1762.8 command line, cm-repro directory and full log here:
1762.8
1762.8 https://github.com/mlcommons/cm4mlops/issues
1762.8
1762.8 The CM concept is to collaboratively fix such issues inside portable CM scripts
1762.8 to make existing tools and native scripts more portable, interoperable
1762.8 and deterministic. Thank you!
1762.8
1762.8
1762.8 Using MLCommons Inference source from '/home/cmuser/CM/repos/local/cache/f93090d427b8435f/inference'
1762.8


1 warning found (use docker --debug to expand):

  • SecretsUsedInArgOrEnv: Do not use ARG or ENV instructions for sensitive data (ARG "CM_GH_TOKEN") (line 14)
    ubuntu_22.04.Dockerfile:47

Full error log
error_with_docker2.txt
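
A "Killed" line from the shell means the process received SIGKILL. A common cause is the kernel's out-of-memory killer, but that is an assumption here; the thread does not confirm it. The sketch below (find_killed is a hypothetical helper) locates such lines in a saved log, and the usage comments list two standard commands for checking memory on the host.

```shell
# Print log lines (with line numbers) where the shell reported a killed process.
find_killed() {
    grep -n 'Killed' "$1"
}

# Usage:
#   find_killed error_with_docker2.txt
#   free -h                                  # memory available on the host
#   sudo dmesg | grep -i 'out of memory'     # kernel OOM-killer messages, if any
```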

@arjunsuresh
Contributor

Hi @Agalakdak We do have a problem with the Intel implementation, as reported here. We'll work with Intel to fix these issues. Even then, the Intel implementation is expected to work only on the latest Intel server/workstation CPUs. We'll update the documentation to say so.

@Agalakdak
Author

Hi @arjunsuresh, thanks for the prompt reply. I'll check that code on a dual Intel Xeon Gold 6346 system a little later.

@Agalakdak
Author

@arjunsuresh Can I clarify something for future runs? Are there any known problems with the "Quadro RTX 5000" and "Nvidia A40" GPUs?

@arjunsuresh
Contributor

@Agalakdak Nvidia doesn't officially support them for MLPerf inference, but we have typically had good success running the Nvidia code on such GPUs without much difficulty. Do you have a plan for what you want to benchmark?

@Agalakdak
Author

@arjunsuresh Yes, sure. First, I'd like to run one of the inference benchmarks and compare the results with the reference results. If that works, I'll try to run a training benchmark across several GPUs using a Docker container, and then use the results to find bottlenecks in the system (if there are any).

Today I'll try to run as many benchmarks as possible. And then I'll write about the results. If you need any additional information about the system, let me know.

@arjunsuresh
Contributor

@Agalakdak If you want to run as many benchmarks as possible, the best starting point is the Nvidia implementation. Even when issues come up, they are usually quick to resolve. If you just want to get a result, the reference implementation is good for smaller models like resnet50 and bert-99, as it runs on almost any CPU.

And if you are referring to MLPerf training benchmarks, that's very different from inference, even though many of the inference models come from MLPerf training. Currently there is no automated way to run training benchmarks, and the only option is to follow the submitter READMEs in the results repository.
https://github.com/mlcommons/training_results_v4.0

@Agalakdak
Author

@arjunsuresh Oh, thanks a lot for the help, but I'm afraid I have another question. I successfully ran "Text to Image using Stable Diffusion"
cm run script --tags=run-mlperf,inference,_r4.1 \
--model=sdxl \
--implementation=reference \
--framework=pytorch \
--category=edge \
--scenario=Offline \
--execution_mode=valid \
--device=cuda \
--quiet
My GPU did do some work. But...
After all this I ended up inside a container and can't find any results, neither in the container nor in the logs

error_with_docker3.txt

@arjunsuresh
Contributor

arjunsuresh commented Sep 3, 2024

@Agalakdak That's only the first step. You need to run the following command from the documentation page inside that Docker container.

@Agalakdak
Author

Agalakdak commented Sep 4, 2024

Hello @arjunsuresh, unfortunately a new day brings new problems.
On one of the systems I started SD, and the test has been running for several hours (I don't know whether I need to stop it manually or whether it will finish on its own). On the other system it doesn't start.
I ran the same command

cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
--model=sdxl \
--implementation=nvidia \
--framework=tensorrt \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--docker --quiet \
--test_query_count=50

What can I do with this error?

/usr/include/x86_64-linux-gnu/bits/mathcalls.h(110): error: identifier "_Float32" is undefined
Error limit reached.
100 errors detected in the compilation of "print_cuda_devices.cu".
Compilation terminated.

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!

Full log with error
server_err_sd_1.log

@arjunsuresh
Contributor

@Agalakdak The problem is that CUDA compilation is not working on the host machine. It is not actually required, though we have never seen this issue before. Let me share the option to skip it.

@anandhu-eng are you able to share this option?
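
While waiting for that option, it may help to record which CUDA toolkit and host compiler are installed, since an undefined `_Float32` during an nvcc build is often a sign that the toolkit is older than the system headers it compiles against. That cause is an assumption here and is not confirmed in the thread. The sketch below extracts the release number from `nvcc --version` output; nvcc_release is a hypothetical helper name.

```shell
# Read `nvcc --version` output on stdin and print the CUDA release, e.g. 12.4.
nvcc_release() {
    sed -n 's/.*release \([0-9][0-9]*\.[0-9][0-9]*\).*/\1/p'
}

# Usage on the host:
#   nvcc --version | nvcc_release
#   gcc --version | head -n 1
```

Posting both values alongside the full log makes a version mismatch easier to spot.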

@Agalakdak
Author

@arjunsuresh Hi, I encountered a similar problem when I wanted to run ResNet50
I launched:
cm run script --tags=run-mlperf,inference,_find-performance,_full,_r4.1-dev \
--model=resnet50 \
--implementation=reference \
--framework=onnxruntime \
--category=edge \
--scenario=Offline \
--execution_mode=test \
--device=cuda \
--docker --quiet \
--test_query_count=1000

It worked fine and I got inside the container. In the container I ran
cm run script --tags=run-mlperf,inference,_r4.1-dev,_all-scenarios --model=resnet50 --implementation=reference --framework=onnxruntime --category=edge --execution_mode=valid --device=cuda --quiet

And I got (I assume) a similar error.

INFO:root: ! call /home/cmuser/CM/repos/mlcommons@cm4mlops/script/get-cuda-devices/run.sh from tmp-run.sh
rm: cannot remove 'a.out': No such file or directory

Checking compiler version ...

nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2024 NVIDIA Corporation
Built on Tue_Feb_27_16:19:38_PST_2024
Cuda compilation tools, release 12.4, V12.4.99
Build cuda_12.4.r12.4/compiler.33961263_0

Compiling program ...

Running program ...

/home/cmuser
INFO:root:========================================================
INFO:root:Print file tmp-run.out:
INFO:root:
INFO:root:Error: problem obtaining number of CUDA devices: 100

INFO:root:

CM error: Portable CM script failed (name = get-cuda-devices, return code = 256)

^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
Note that it is often a portability issue of a third-party tool or a native script
wrapped and unified by this CM script (automation recipe). Please re-run
this script with --repro flag and report this issue with the original
command line, cm-repro directory and full log here:

https://github.com/mlcommons/cm4mlops/issues

The CM concept is to collaboratively fix such issues inside portable CM scripts
to make existing tools and native scripts more portable, interoperable
and deterministic. Thank you!

Log with error
error_resnet50_docker_log.txt

@arjunsuresh
Contributor

Hi @Agalakdak We also sometimes see the error below when using Nvidia GPUs inside a container:

INFO:root:Error: problem obtaining number of CUDA devices: 100

A quick fix is to exit the container, use docker ps -a to find the container ID (say, id), and then run docker start id && docker attach id. You should be back where you were, but with working Nvidia GPUs.

We have also removed the requirement to have NVCC on the host system. Please run cm pull repo and you should be able to run sdxl.
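
The restart steps above can be sketched as two small shell helpers. The function names are hypothetical, and picking the most recently created container is an assumption that only holds if the MLPerf container was the last one you started; otherwise pass the ID from docker ps -a yourself. For context, 100 is, to my knowledge, the CUDA runtime code for "no CUDA-capable device is detected", which fits a container that has lost sight of the GPUs.

```shell
# Print the ID of the most recently created container (running or stopped).
latest_container_id() {
    docker ps -l -q
}

# Restart a stopped container and re-attach to it.
reattach_container() {
    docker start "$1" && docker attach "$1"
}

# Usage, after exiting the container:
#   reattach_container "$(latest_container_id)"
#   nvidia-smi        # inside the container, to confirm the GPUs are visible
```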
