MLPerf ResNet inference workflow failing on Windows #311

Closed
gfursin opened this issue Sep 30, 2024 · 17 comments

@gfursin
Contributor

gfursin commented Sep 30, 2024

Hi @anandhu-eng and @arjunsuresh,

I just tried the CM-MLPerf workflow for ResNet benchmark with the latest CM4MLOps mlperf-inference branch on my Windows machine and it fails when downloading a model.

It's weird because we have a GitHub test for this benchmark on Windows...

Would you mind checking whether it works on your side, please? If not, I am curious why our tests didn't catch that.

Here is the command I used on Windows 11 (copy/paste from our GitHub test):

cm pull repo mlcommons@cm4mlops --branch=mlperf-inference

cm rm cache -f

cm run script --tags=run-mlperf,inference,_submission,_short --submitter="cTuning" --hw_name=default --model=resnet50 --implementation=python --backend=onnxruntime --device=cpu --scenario=Offline --test_query_count=500 --target_qps=1 -v --quiet 

and here is the error:

Downloading from https://zenodo.org/record/4735647/files/resnet50_v1.onnx

File resnet50_v1.onnx already present, original checksum and computed checksum matches! Skipping Download..

Traceback (most recent call last):
  File "C:\!Progs\Python310\Scripts\cm-script.py", line 33, in <module>
    sys.exit(load_entry_point('cmind==2.3.8.1', 'console_scripts', 'cm')())
  File "C:\!Progs\Python310\lib\site-packages\cmind\cli.py", line 37, in run
    r = cm.access(argv, out='con')
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1477, in _run
    r = customize_code.preprocess(ii)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\script\run-mlperf-inference-app\customize.py", line 231, in preprocess
    r = cm.access(ii)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 761, in access
    return cm.access(i)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1537, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta,  env, state, const, const_state, add_deps_recursive,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 2882, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 3033, in _run_deps
    r = self.cmind.access(ii)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1367, in _run
    r = self._call_run_deps(deps, self.local_env_keys, local_env_keys_from_meta, env, state, const, const_state, add_deps_recursive,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 2882, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 3033, in _run_deps
    r = self.cmind.access(ii)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1537, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta,  env, state, const, const_state, add_deps_recursive,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 2882, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 3033, in _run_deps
    r = self.cmind.access(ii)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1537, in _run
    r = self._call_run_deps(prehook_deps, self.local_env_keys, local_env_keys_from_meta,  env, state, const, const_state, add_deps_recursive,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 2882, in _call_run_deps
    r = script._run_deps(deps, local_env_keys, env, state, const, const_state, add_deps_recursive, recursion_spaces,
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 3033, in _run_deps
    r = self.cmind.access(ii)
  File "C:\!Progs\Python310\lib\site-packages\cmind\core.py", line 605, in access
    r = action_addr(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 212, in run
    r = self._run(i)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\automation\script\module.py", line 1477, in _run
    r = customize_code.preprocess(ii)
  File "D:\Work1\CM\repos\mlcommons@cm4mlops\script\download-file\customize.py", line 210, in preprocess
    env['CM_DOWNLOAD_CMD'] =  env['CM_DOWNLOAD_CMD'].replace('&', '^&').replace('|', '^|').replace('(', '^(').replace(')', '^)')
KeyError: 'CM_DOWNLOAD_CMD'
                                                                                                                

Thanks!
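The `KeyError` in the traceback above comes from reading `env['CM_DOWNLOAD_CMD']` unconditionally, apparently on a path where the download is skipped and the key was never set. A minimal sketch of a guarded version of that escaping step is shown below. The helper names are hypothetical, and this is not necessarily how the fix was implemented in the PR.

```python
def escape_cmd_metachars(cmd: str) -> str:
    """Prefix cmd.exe metacharacters with '^', as the failing line does."""
    for ch in "&|()":
        cmd = cmd.replace(ch, "^" + ch)
    return cmd


def escape_download_cmd(env: dict) -> dict:
    """Escape CM_DOWNLOAD_CMD only when it is present and non-empty.

    Hypothetical helper: if the download was skipped and the key is
    missing, the env is returned unchanged instead of raising KeyError.
    """
    cmd = env.get("CM_DOWNLOAD_CMD")
    if cmd:
        env["CM_DOWNLOAD_CMD"] = escape_cmd_metachars(cmd)
    return env


if __name__ == "__main__":
    print(escape_download_cmd({"CM_DOWNLOAD_CMD": "wget a&b | tee (log)"}))
    print(escape_download_cmd({}))  # no key set: returned as-is
```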

@arjunsuresh
Contributor

Hi @gfursin this PR should fix it. I think it passed in the GitHub Action because the cache was clean.

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

Thank you @arjunsuresh, but I now got another error at the same place with the same command (after cleaning cache):

CM error: Downloaded path D:\Work1\CM\repos\local\cache\0ad2afc45a05440d\resnet50_v1.onnx does not exist. Probably CM_DOWNLOAD_FILENAME is not set and CM_DOWNLOAD_URL given is not pointing to a file!

@arjunsuresh
Contributor

oh. Looks like the script is broken on Windows. @anandhu-eng can you please check?

We actually don't have a GitHub test for R50 on Windows - I believe the workflow has other issues on Windows too.

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

Oh, I didn't see that it was excluded from the test ... It would be nice to fix it and add it to the tests, since I fixed and used it this spring and it worked ...

gfursin assigned gfursin and anandhu-eng and unassigned gfursin Oct 1, 2024
@anandhu-eng
Contributor

Hi @arjunsuresh @gfursin , the issue occurs because md5sum is not available on Windows for verifying checksums, and it was not properly caught in our code. We could use certutil on Windows for the same purpose. Let me fix it.
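As a point of comparison with `md5sum` and `certutil`, an MD5 check can also be done with Python's standard `hashlib`, which removes the dependency on an external tool. The sketch below is only an illustration of that option, not what the CM scripts do. The function names are hypothetical.

```python
import hashlib


def md5_of_file(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, reading it in chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def checksum_matches(path, expected_md5):
    """Compare the computed digest with the expected one, ignoring case."""
    return md5_of_file(path) == expected_md5.strip().lower()
```

Because this runs inside the Python process, it behaves the same on Windows and Linux and does not depend on what is on `PATH`.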

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

Actually, we do have md5sum for Windows. I added it to the script get-sys-utils-cm a while ago (https://zenodo.org/record/6501550/files/cm-artifact-os-windows-32.zip).

That's where I keep the min set of aux tools for Windows (including wget).

I remember I made our download scripts compatible with Windows, and md5sum was working last year using this dependency ... @anandhu-eng - I suggest using it. I will try to update it with the latest wget.exe today too ...

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

@anandhu-eng . In fact, we use get-sys-utils-min on Windows instead of get-sys-utils-cm . It has md5sum that I use on Windows.

@arjunsuresh
Contributor

Hi @gfursin I think the problem started when we added md5sum check for the downloads via cmutil. Since this is done in preprocess function, md5sum check is done via a subprocess call.

@anandhu-eng
Contributor

Sorry @gfursin , you are right. I am testing the script after passing the values of +PATH to the subprocess
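A rough sketch of what passing extra path entries to a subprocess can look like is given below. It assumes the `+PATH` value is available as a list of directories; the function names and that assumption are illustrative and not taken from the CM code. One detail worth noting: on Windows, the `env` given to `subprocess.run` does not affect how the executable itself is located, so the sketch resolves it explicitly with `shutil.which`.

```python
import os
import shutil
import subprocess


def env_with_extra_path(extra_paths, base_env=None):
    """Return a copy of the environment with extra_paths prepended to PATH."""
    env = dict(os.environ if base_env is None else base_env)
    current = env.get("PATH", "")
    parts = [p for p in extra_paths if p] + ([current] if current else [])
    env["PATH"] = os.pathsep.join(parts)
    return env


def run_tool(args, extra_paths):
    """Run a tool (for example md5sum) found via the extended PATH."""
    env = env_with_extra_path(extra_paths)
    # Resolve the executable against the extended PATH ourselves, because
    # the child's env is not used for the lookup on Windows.
    exe = shutil.which(args[0], path=env["PATH"])
    if exe is None:
        raise FileNotFoundError(f"{args[0]} not found on extended PATH")
    return subprocess.run([exe, *args[1:]], env=env,
                          capture_output=True, text=True, check=False)
```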

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

> Hi @gfursin I think the problem started when we added md5sum check for the downloads via cmutil. Since this is done in preprocess function, md5sum check is done via a subprocess call.

Oh, yes, sure! That's the reason why I was not using a subprocess call for such tests but ran md5sum via run.bat - in that case we could reuse dependencies from other CM scripts... I remember that I managed to make md5sum work on Windows and Linux like that. That's one of the reasons to have Windows tests: to avoid breaking such cross-platform functionality ;) ...

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

> Sorry @gfursin , you are right. I am testing the script after passing the values of +PATH to the subprocess

Thank you @anandhu-eng .

@gfursin
Contributor Author

gfursin commented Oct 1, 2024

I just noticed that with the latest updates, all CM workflows for Windows are failing, even a basic test with image-classification:

cmr "python image-classification onnx"
cmr "python image-classification torch"

...

INFO:root:* cm run script "python image-classification onnx"
INFO:root:  * cm run script "detect os"
INFO:root:         ! cd D:\Work1\CM\repos\fgg\fgg.work
INFO:root:         ! call D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\run.bat from tmp-run.bat
INFO:root:         ! call "postprocess" from D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\customize.py
INFO:root:  * cm run script "get sys-utils-min"
INFO:root:       ! load D:\Work1\CM\repos\local\cache\937028e67dc34760\cm-cached-state.json
INFO:root:  * cm run script "get sys-utils-cm"
INFO:root:       ! load D:\Work1\CM\repos\local\cache\1ec8fce112e84778\cm-cached-state.json
INFO:root:  * cm run script "get python3"
INFO:root:       ! load D:\Work1\CM\repos\local\cache\b27f55b438e94d81\cm-cached-state.json
INFO:root:Path to Python: C:\!Progs\Python310\python.exe
INFO:root:Python version: 3.10.11
INFO:root:  * cm run script "get dataset imagenet image-classification original _run-during-docker-build"
INFO:root:    * cm run script "detect os"
INFO:root:           ! cd D:\Work1\CM\repos\local\cache\ca90e727f1414812
INFO:root:           ! call D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\run.bat from tmp-run.bat
INFO:root:           ! call "postprocess" from D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\customize.py
INFO:root:    * cm run script "get sys-utils-min"
INFO:root:         ! load D:\Work1\CM\repos\local\cache\937028e67dc34760\cm-cached-state.json
INFO:root:    * cm run script "download-and-extract file _extract _url.http://cKnowledge.org/ai/data/ILSVRC2012_img_val_500.tar"
INFO:root:      * cm run script "download file _cmutil _url.http://cKnowledge.org/ai/data/ILSVRC2012_img_val_500.tar"
INFO:root:        * cm run script "detect os"
INFO:root:               ! cd D:\Work1\CM\repos\local\cache\ca90e727f1414812
INFO:root:               ! call D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\run.bat from tmp-run.bat
INFO:root:               ! call "postprocess" from D:\Work1\CM\repos\mlcommons@cm4mlops\script\detect-os\customize.py
INFO:root:        * cm run script "get sys-utils-min"
INFO:root:             ! load D:\Work1\CM\repos\local\cache\937028e67dc34760\cm-cached-state.json
INFO:root:             ! cd D:\Work1\CM\repos\local\cache\ca90e727f1414812
INFO:root:             ! call D:\Work1\CM\repos\mlcommons@cm4mlops\script\download-file\run.bat from tmp-run.bat
md5sum: standard input: no properly formatted MD5 checksum lines found
INFO:root:             ! call "postprocess" from D:\Work1\CM\repos\mlcommons@cm4mlops\script\download-file\customize.py

CM error: Downloaded path D:\Work1\CM\repos\local\cache\ca90e727f1414812\ILSVRC2012_img_val_500.tar does not exist. Probably CM_DOWNLOAD_FILENAME is not set and CM_DOWNLOAD_URL given is not pointing to a file!

Downloading from http://cKnowledge.org/ai/data/ILSVRC2012_img_val_500.tar

File ILSVRC2012_img_val_500.tar already present, original checksum and computed checksum matches! Skipping Download..

It's a top priority to fix the core CM functionality for Windows and add tests to avoid breaking it.

Thanks a lot!!!!
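The `md5sum: standard input: no properly formatted MD5 checksum lines found` message in the log above is what `md5sum -c` prints when its input does not contain lines in the form it expects. In the default GNU format each line is a 32-digit hex digest, a space, then a space or `*`, then the file name. The sketch below builds and validates such a line. It is only an illustration of the format, the helper names are hypothetical, and it does not cover the BSD-style tagged lines that GNU `md5sum` also accepts.

```python
import re

# Default GNU md5sum check-line format: "<32 hex digits> <space or *><name>"
_MD5_LINE = re.compile(r"^[0-9a-fA-F]{32} [ *].+$")


def format_md5_line(digest, filename):
    """Build a check line in text mode (two spaces before the name)."""
    return f"{digest.lower()}  {filename}"


def is_valid_md5_line(line):
    """Return True if the line matches the default md5sum -c format."""
    return bool(_MD5_LINE.match(line.rstrip("\r\n")))
```

A line that picks up stray quote characters around it, for example from shell quoting, would fail this check.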

@anandhu-eng
Contributor

Hi @gfursin , I have made updates in PR #318. The download error is fixed, but I got another error at a later stage of the run:

**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.11.4
** Copyright (c) 2022 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'
INFO:root:             ! call "postprocess" from C:\Users\Asus\CM\repos\anandhu-eng@cm4mlops\script\get-generic-python-lib\customize.py
DEBUG:root:        - Running postprocess ...
DEBUG:root:        - Removing tmp tag in the script cached output 8762b402cd5148bd ...
INFO:root:        - cache UID: 8762b402cd5148bd
INFO:root:        - running time of script "get,install,generic,generic-python-lib": 9.85 sec.
DEBUG:root:      - Processing env after dependencies ...
DEBUG:root:        # potential PIP version string (if needed): ==master
DEBUG:root:      - Running preprocess ...
DEBUG:root:      - Running native script "C:\Users\Asus\CM\repos\anandhu-eng@cm4mlops\script\get-mlperf-inference-loadgen\run.bat" from temporal script "tmp-run.bat" in "C:\Users\Asus\CM\repos\local\cache\635e6c4ea2d843a7" ...
INFO:root:           ! cd C:\Users\Asus\CM\repos\local\cache\635e6c4ea2d843a7
INFO:root:           ! call C:\Users\Asus\CM\repos\anandhu-eng@cm4mlops\script\get-mlperf-inference-loadgen\run.bat from tmp-run.bat
**********************************************************************
** Visual Studio 2022 Developer Command Prompt v17.11.4
** Copyright (c) 2022 Microsoft Corporation
**********************************************************************
[vcvarsall.bat] Environment initialized for: 'x64'
'´╗┐@echo' is not recognized as an internal or external command,
operable program or batch file.
=======================================================
Current path in CM script: C:\Users\Asus\CM\repos\local\cache\635e6c4ea2d843a7

Switching to C:\Users\Asus\CM\repos\local\cache\4fc750c8f3244b2c\inference\loadgen

Running python.exe setup.py develop
running develop
C:\Users\Asus\AppData\Local\Programs\Python\Python312\Lib\site-packages\setuptools\command\develop.py:40: EasyInstallDeprecationWarning: easy_install command is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` and ``easy_install``.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://github.com/pypa/setuptools/issues/917 for details.
        ********************************************************************************

!!
  easy_install.initialize_options(self)
C:\Users\Asus\AppData\Local\Programs\Python\Python312\Lib\site-packages\setuptools\_distutils\cmd.py:66: SetuptoolsDeprecationWarning: setup.py install is deprecated.
!!

        ********************************************************************************
        Please avoid running ``setup.py`` directly.
        Instead, use pypa/build, pypa/installer or other
        standards-based tools.

        See https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html for details.
        ********************************************************************************

!!
  self.initialize_options()
running egg_info
creating mlcommons_loadgen.egg-info
writing mlcommons_loadgen.egg-info\PKG-INFO
writing dependency_links to mlcommons_loadgen.egg-info\dependency_links.txt
writing top-level names to mlcommons_loadgen.egg-info\top_level.txt
writing manifest file 'mlcommons_loadgen.egg-info\SOURCES.txt'
dependency trace_generator.h won't be automatically included in the manifest: the path doesn't exist
reading manifest file 'mlcommons_loadgen.egg-info\SOURCES.txt'
writing manifest file 'mlcommons_loadgen.egg-info\SOURCES.txt'
running build_ext
building 'mlperf_loadgen' extension
creating build
creating build\temp.win-amd64-cpython-312
creating build\temp.win-amd64-cpython-312\Release
creating build\temp.win-amd64-cpython-312\Release\bindings
creating build\temp.win-amd64-cpython-312\Release\generated
"C:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\bin\HostX86\x64\cl.exe" /c /nologo /O2 /W3 /GL /DNDEBUG /MD -DMAJOR_VERSION=4 -DMINOR_VERSION=1 -I. -IC:\Users\Asus\AppData\Local\Programs\Python\Python312\Lib\site-packages\pybind11\include -IC:\Users\Asus\AppData\Local\Programs\Python\Python312\include -IC:\Users\Asus\AppData\Local\Programs\Python\Python312\Include "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Tools\MSVC\14.41.34120\include" "-IC:\Program Files\Microsoft Visual Studio\2022\Community\VC\Auxiliary\VS\include" "-IC:\Program Files (x86)\Windows Kits\NETFXSDK\4.8\include\um" /EHsc /Tpbindings/c_api.cc /Fobuild\temp.win-amd64-cpython-312\Release\bindings/c_api.obj /std:c++latest /EHsc /bigobj
c_api.cc
C:\Users\Asus\CM\repos\local\cache\4fc750c8f3244b2c\inference\loadgen\bindings\c_api.h(24): fatal error C1083: Cannot open include file: 'stddef.h': No such file or directory
error: command 'C:\\Program Files\\Microsoft Visual Studio\\2022\\Community\\VC\\Tools\\MSVC\\14.41.34120\\bin\\HostX86\\x64\\cl.exe' failed with exit code 2

run command:

cm run script --tags=run-mlperf,inference,_submission,_short --submitter="cTuning" --hw_name=default --model=resnet50 --implementation=python --backend=onnxruntime --device=cpu --scenario=Offline --test_query_count=500 --target_qps=1 -v --quiet
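The `'´╗┐@echo' is not recognized` line in the log above looks like a UTF-8 byte-order mark (bytes `EF BB BF`) at the start of a batch file, shown in an OEM console code page. `cmd.exe` treats those bytes as part of the first command. The log does not say which `.bat` file is affected, so the sketch below is just a hypothetical helper for checking a file and stripping the mark.

```python
import codecs
from pathlib import Path


def strip_utf8_bom(path):
    """Remove a leading UTF-8 BOM from a file in place.

    Returns True if a BOM was found and removed, False otherwise.
    """
    p = Path(path)
    data = p.read_bytes()
    if data.startswith(codecs.BOM_UTF8):
        p.write_bytes(data[len(codecs.BOM_UTF8):])
        return True
    return False
```

Saving batch files as UTF-8 without a BOM (or as plain ASCII) avoids the problem in the first place.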

@gfursin
Contributor Author

gfursin commented Oct 2, 2024

Thank you very much @anandhu-eng for the quick response - much appreciated. By the way, the download is fixed on Windows and I can run image-classification without an issue. I also tried the above workflow and it failed with a slightly different error:

Switching to D:\Work1\CM\repos\local\cache\325b1e24eb624877\inference\loadgen

Running python.exe setup.py develop
running develop
C:\!Progs\Python310\lib\site-packages\setuptools\command\easy_install.py:144: EasyInstallDeprecationWarning: easy_install command is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
C:\!Progs\Python310\lib\site-packages\setuptools\command\install.py:34: SetuptoolsDeprecationWarning: setup.py install is deprecated. Use build and pip and other standards-based tools.
  warnings.warn(
C:\!Progs\Python310\lib\site-packages\pkg_resources\__init__.py:123: PkgResourcesDeprecationWarning: rch is an invalid version and will not be supported in a future release
  warnings.warn(
running egg_info
writing mlcommons_loadgen.egg-info\PKG-INFO
writing dependency_links to mlcommons_loadgen.egg-info\dependency_links.txt
writing top-level names to mlcommons_loadgen.egg-info\top_level.txt
reading manifest file 'mlcommons_loadgen.egg-info\SOURCES.txt'
writing manifest file 'mlcommons_loadgen.egg-info\SOURCES.txt'
running build_ext
building 'mlperf_loadgen' extension
error: Microsoft Visual C++ 14.0 or greater is required. Get it with "Microsoft C++ Build Tools": https://visualstudio.microsoft.com/visual-cpp-build-tools/

CM error: Portable CM script failed (name = get-mlperf-inference-loadgen, return code = 1)

It is weird because MSVC 18+ is detected - let me dig into it ...

@gfursin
Contributor Author

gfursin commented Oct 3, 2024

@anandhu-eng

Oh, I think it's because we have to call vcvars64.bat from Visual Studio to set up all the variables for MSVC, but it's not called anymore ...

Such scripts were picked up in run.bat to build C/C++ programs (I actually developed the possibility to run native bat scripts in CM4MLOps especially for this reason to support complex sub-dependencies that require extra bat files to set up environment variables).

It worked fine when I tested it a few months ago but it seems that the functionality to build and run programs has changed since then and run.bat is not called?

Is it possible to check how to bring such support back, please?

The way I designed CM and CM4MLOps was to always keep backwards compatibility of CM scripts with all platforms and gradually enhance it (Linux, Windows, MacOS, etc) - this is an important feature of CM and I suggest keeping this concept ...

Thanks a lot!
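For readers unfamiliar with the `vcvars64.bat` step: the batch file only sets environment variables in the `cmd.exe` session that calls it, which is why it has to run in the same session as the compiler. One common way to reuse those variables from Python is to call the batch file and then dump the environment with `set`. The sketch below shows that idea under stated assumptions: the function names are hypothetical, the capture step is Windows-only, and it is not how CM's `run.bat` mechanism works.

```python
import subprocess


def parse_set_output(text):
    """Parse the NAME=value lines printed by cmd.exe's `set` into a dict."""
    env = {}
    for line in text.splitlines():
        name, sep, value = line.partition("=")
        if sep and name:  # skip banner lines and entries with empty names
            env[name] = value
    return env


def capture_vcvars_env(vcvars_bat):
    """Windows only: run vcvars64.bat and return the resulting environment.

    vcvars_bat is the full path to the batch file (an assumption here;
    its location depends on the Visual Studio installation).
    """
    result = subprocess.run(f'"{vcvars_bat}" >nul && set', shell=True,
                            capture_output=True, text=True, check=True)
    return parse_set_output(result.stdout)
```

The returned dict can then be passed as `env=` to a later `subprocess.run` call that invokes the build.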

@arjunsuresh
Contributor

Hi @gfursin I don't think we ever touched this feature. MLPerf inference R50 is actually working fine on Windows, as seen in this runlog, and it uses pip install mlcommons_loadgen, thereby avoiding the need for a compiler. So the calling of run.bat is working fine in CM. The Visual Studio failure might be some issue specific to Visual Studio. Touching anything on Windows wastes a lot of time, so we are only maintaining the currently working GitHub Actions for MLPerf inference R50, RetinaNet and ABTF for Windows.
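A small sketch of the "install a prebuilt package instead of compiling" approach follows. It assumes that a prebuilt `mlcommons_loadgen` package is available for the local Python version and platform, and that it provides the `mlperf_loadgen` module named in the build log above. The helper names are hypothetical.

```python
import importlib.util
import sys


def is_importable(module_name):
    """Return True if a top-level module can be found without importing it."""
    return importlib.util.find_spec(module_name) is not None


def pip_install_command(package):
    """Build a pip command that targets the current interpreter."""
    return [sys.executable, "-m", "pip", "install", package]
```

A caller could check `is_importable("mlperf_loadgen")` first and only run `pip_install_command("mlcommons_loadgen")` through `subprocess.run` when the module is missing.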

@arjunsuresh
Contributor

Closing this issue, as the Visual Studio compilation issue is a separate problem.
