[PyTorch] Update to 2.2.1 and minor changes #1466

HGuillemet · 2024-02-02T09:48:35Z

Work in Progress

Included in this PR:

- Update to PyTorch 2.2.1
- Restore ExampleStack and TensorExampleStack constructors
- Generate more overloads for methods taking an ArrayRef
- Add parsing of CUDAFunctions.h (wrappers around commonly used CUDA API functions)
- ~~Add AOTInductor (new way to run models exported from Python)~~
- Virtualize FunctionPreHook and FunctionPostHook to enable setting hooks on the autograd graph
- Passing a vector of tensors, and some other classes, by value no longer moves out (destroys) the original data
- Add Module.asXXX to test whether a Module is of a specific subclass and cast it
- Add new AMD Float8 types Float8_e5m2fnuz and Float8_e4m3fnuz, which have an unsigned zero
- Add preload of nvrtc-builtins (fixes Problems deploying pytorch 2.1.2-1.5.10 #1468)
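The Module.asXXX item can be illustrated with a plain-Java toy model. The assumption here (not stated in this PR) is that the bindings often hand back a generic Module wrapper even when the native object is a subclass, so Java's instanceof cannot see the real type; an asX method instead checks the native dynamic type, like a C++ dynamic_cast, and returns a typed wrapper or null. All names below (ToyModule, ToyLinear, NativeObj, asLinear) are hypothetical and are not the org.bytedeco.pytorch API.

```java
// Toy model for illustration only; these are NOT the org.bytedeco.pytorch classes.
public class ModuleCastSketch {
    // Stand-in for a native C++ object; `kind` plays the role of its dynamic type.
    static final class NativeObj {
        final String kind;
        NativeObj(String kind) { this.kind = kind; }
    }

    // Generic Java wrapper, as returned when the binding only knows the base type.
    static class ToyModule {
        final NativeObj ptr;
        ToyModule(NativeObj ptr) { this.ptr = ptr; }

        // Hypothetical asLinear(): checks the native dynamic type (like dynamic_cast)
        // and returns a typed wrapper around the same native object, or null.
        ToyLinear asLinear() {
            return "Linear".equals(ptr.kind) ? new ToyLinear(ptr) : null;
        }
    }

    static final class ToyLinear extends ToyModule {
        ToyLinear(NativeObj ptr) { super(ptr); }
    }

    public static void main(String[] args) {
        ToyModule child = new ToyModule(new NativeObj("Linear"));
        System.out.println(child instanceof ToyLinear);          // false: Java wrapper is the base class
        ToyLinear lin = child.asLinear();
        System.out.println(lin != null && lin.ptr == child.ptr); // true: same native object
        ToyModule conv = new ToyModule(new NativeObj("Conv2d"));
        System.out.println(conv.asLinear());                     // null
    }
}
```

Returning null rather than throwing lets the same call serve as both the type test and the cast, which matches the "test if a Module is of a specific subclass (and do the cast)" description.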

sbrunk · 2024-02-02T14:13:57Z

@HGuillemet thanks for the update, and especially for adding the AOTInductor mapping. I think it's an interesting new option for using the new optimizations, at least for inference. Training still needs to happen in Python in that case, but we can export the model to a C++ library and then use it from Java through the bindings.

Have you been able to try it already? I'll give it a try myself; just curious whether you got something working already.

HGuillemet · 2024-02-02T14:30:57Z

No, I haven't yet.
I'm training from Java, so there is little chance I'll use it in the near future. I must rely on TorchScript when I need to import Python models, and not many people bother to write TorchScript-compatible models :(

HGuillemet · 2024-02-05T22:02:38Z

I don't understand the linking error on Windows:

```
jnitorch_cuda.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __cdecl torch::inductor::AOTIModelContainerRunnerCuda::~AOTIModelContainerRunnerCuda(void)" (__imp_??1AOTIModelContainerRunnerCuda@inductor@torch@@QEAA@XZ)
jnitorch_cuda.obj : error LNK2001: unresolved external symbol "__declspec(dllimport) public: __cdecl torch::inductor::AOTIModelContainerRunnerCuda::AOTIModelContainerRunnerCuda(char const *,unsigned __int64,char const *)" (__imp_??0AOTIModelContainerRunnerCuda@inductor@torch@@QEAA@PEBD_K0@Z)
```

The AOTIModelContainerRunnerCuda constructor is defined in aoti_model_container_runner_cuda.h, which is included from jnitorch_cuda.cpp, and no destructor is declared anywhere for this class.
Any idea?
@saudet, could this be related to ccache? If so, could you clear the cache?

saudet · 2024-02-06T00:37:48Z

There's probably some template somewhere that requires them. You'll probably get the same error on Linux and Mac if you try to link with -Wl,--no-undefined, so try to fix the errors you get with that, and it should fix those errors on Windows too.

HGuillemet · 2024-02-06T12:18:48Z

Thanks for the suggestion. Adding the linker option raised an error about cudnn not being linked to jnitorch_cuda. Let's see, but I doubt it's related to the error on Windows.

HGuillemet · 2024-02-06T12:53:19Z

It seems it was spotted and fixed upstream in pytorch/pytorch@79ba397, after the 2.2.0 release.
I guess we'd better postpone the inclusion of the AOTInductor feature to the next release.

HGuillemet · 2024-02-07T17:41:41Z

To enable the setting of hooks in autograd graphs, I need to virtualize FunctionPreHook and FunctionPostHook, which have a virtual method taking a ref to a vector of tensors and returning a vector of tensors. Compilation passes only if I remove the valueTypes in this info:

```java
infoMap.put(new Info("std::vector<torch::Tensor>", "std::vector<at::Tensor>", "std::vector<torch::autograd::Variable>", "torch::autograd::variable_list")
    .valueTypes("@Cast({\"\", \"std::vector<torch::Tensor>\"}) @StdMove TensorVector")
    .pointerTypes("TensorVector").define());
```

I wonder why this valueTypes is here.
It saves a copy when we pass a vector of tensors to a native function, but on the other hand it destroys the vector, which the user may still need after the call.
If I understand correctly, when a native function takes an rvalue reference (&&), the parser generates @ByRef(true), which is enough to avoid copies.
@saudet could you share your infinite knowledge on this point?
Could anything break if I remove the valueTypes? First attempts suggest it does not.

There are some other types with this kind of @Cast @StdMove valueTypes (DataPtr, Storage, TensorMaybeOwned, TensorBaseMaybeOwned, TensorName, EdgeVector).
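The copy-versus-move trade-off described above can be simulated in plain Java. This is only an analogy: Integer stands in for a tensor and List for std::vector, and the two hypothetical methods mimic a native function receiving the vector by value, either as a copy or moved in (as the @StdMove valueTypes would do). A moved-from std::vector is typically left empty, although the standard library in general only promises a valid but unspecified state.

```java
import java.util.ArrayList;
import java.util.List;

// Plain-Java simulation: Integer stands in for a tensor, List for std::vector.
public class MoveVsCopySketch {
    // Like passing std::vector by value with a copy: the callee works on its own list.
    static int sumByCopy(List<Integer> caller) {
        List<Integer> param = new ArrayList<>(caller); // copy, caller keeps its data
        int sum = 0;
        for (int x : param) sum += x;
        return sum;
    }

    // Like passing std::move(v): the callee takes over the elements and the
    // caller's list is left empty, as a moved-from vector usually is.
    static int sumByMove(List<Integer> caller) {
        List<Integer> param = new ArrayList<>(caller);
        caller.clear(); // simulate the moved-from state
        int sum = 0;
        for (int x : param) sum += x;
        return sum;
    }

    public static void main(String[] args) {
        List<Integer> v = new ArrayList<>(List.of(1, 2, 3));
        System.out.println(sumByCopy(v) + " " + v.size()); // 6 3
        System.out.println(sumByMove(v) + " " + v.size()); // 6 0
    }
}
```

The second call returns the same result but leaves the caller with nothing, which is the behavior that removing @StdMove from the valueTypes avoids.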

saudet · 2024-02-08T04:27:01Z

If you're not getting any compile errors, then I guess PyTorch's API was improved so that we don't need them anymore, yes

HGuillemet · 2024-02-11T20:35:28Z

@sbrunk could you run your tests on this PR?
Anything you'd like to be added?

sbrunk · 2024-02-12T21:38:30Z

> @sbrunk could you run your tests on this PR? Anything you'd like to be added?

Tests are looking good!

saudet · 2024-03-01T12:00:05Z

If you're not planning on making more changes for now, we can merge this?

HGuillemet · 2024-03-01T12:12:14Z

Ok for me.

saudet · 2024-03-05T00:33:18Z

```
2024-03-04T11:44:04.1783504Z Caused by: java.io.IOException: No space left on device
```

Could you try to fix this? We probably just need to uninstall a couple of large unnecessary packages...

HGuillemet · 2024-03-05T07:53:07Z

I had already added a bunch of rm commands in deploy-ubuntu once, to delete downloaded archives after their installation, and you reverted that.
I can try to add them again; that seems the easiest and fastest way to make room.
In a new PR?

saudet · 2024-03-05T08:13:28Z

Really? Could you point me to that revert and I'll try it on the actions branch here

HGuillemet · 2024-03-05T08:21:41Z

I see it was in deploy-centos, in fact:
3e3fe5c

I'm reviewing deploy-ubuntu and adding similar cleanup. Shall I push the commit here or in a new PR?

saudet · 2024-03-05T08:25:29Z

Ah, that won't be enough. We'll probably need to remove a lot more stuff. You can try it here, but we won't know whether it works until an actual deploy, so I don't know. I guess checking how many more GB we can free with df -h is a good indication for now.

HGuillemet · 2024-03-05T09:33:25Z

I pushed aaa37a1 on my branch but it doesn't update this PR now that it's merged.

Anyway, this commit will indeed only save ~700 MB for the PyTorch build, from removing the MKL archive.

If that's not enough, what about running a Maven clean phase on the main artifact after its deploy phase and before the deploy phase of the platform/ext artifact? This should get rid of the cppbuild directory (about 7 or 8 GB).

saudet · 2024-03-05T09:49:13Z

That could work, yes

HGuillemet added 5 commits January 31, 2024 09:51

Restore ExampleStack and TensorExampleStack constructors

95f80e5

Generate more overloads of methods taking an arrayref

f4a464f

Update to PyTorch 2.2.0

1c4eec0

Add parsing of CUDAFunctions.h

d1d1b54

Add AOTInductor

9d0db45

HGuillemet marked this pull request as draft February 2, 2024 09:57

HGuillemet added 7 commits February 3, 2024 08:56

Fix compilation error on Windows

6e8fc44

Cleanup cppbuild.sh

82cbc2c

Fix linking error on windows

00b84d2

Moved AOTIModelContainerRunnerCuda to main presets

79348b0

Revert "Moved AOTIModelContainerRunnerCuda to main presets"

e6e6aae

This reverts commit 79348b0.

Add DynamicLibrary.h to JNI

9ebb661

Revert "Add DynamicLibrary.h to JNI"

344fee4

This reverts commit 9ebb661.

Link jnitorch_cuda with cudnn

4129ee0

Remove upcast on Module

b63ca78

HGuillemet added 7 commits February 8, 2024 08:07

Virtualize FunctionPreHook and FunctionPostHook. Remove @StdMove on TensorVector valueTypes.

84bfcec

Remove @StdMove on Storage valueTypes.

d381eba

Remove @StdMove on MaybeOwned<Tensor>

97e1ce6

Remove @StdMove on TensorName

a41a0c1

Remove @StdMove on EdgeVector and DimnameVector

fff554f

Disable AOTInductor

425fe39

Add Module.asX

605704a

HGuillemet marked this pull request as ready for review February 11, 2024 20:37

saudet requested a review from sbrunk February 12, 2024 01:40

HGuillemet mentioned this pull request Feb 13, 2024

Problems deploying pytorch 2.1.2-1.5.10 #1468

Closed

Add preload of nvrtc-builtins

5732dca

HGuillemet mentioned this pull request Feb 17, 2024

[pytorch] 2.1.2-1.5.10 release run example meet error [ Pointer address of argument 0 is NULL.] #1473

Closed

saudet reviewed Feb 18, 2024

pytorch/src/main/java/org/bytedeco/pytorch/presets/torch_cuda.java

Update to Pytorch 2.2.1

7cb77c8

saudet reviewed Mar 2, 2024

pytorch/include_list.pl

saudet reviewed Mar 2, 2024

pytorch/src/gen/java/org/bytedeco/pytorch/AdaptiveAvgPool1dImpl.java

Update CHANGELOG.md and fix nits

04ccfc9

HGuillemet changed the title ~~[PyTorch] WIP~~ [PyTorch] Update to 2.2.1 and minor changes Mar 2, 2024

saudet approved these changes Mar 3, 2024

saudet merged commit 575a44e into bytedeco:master Mar 3, 2024
7 checks passed

HGuillemet mentioned this pull request Mar 5, 2024

[CI] Save disk space in deploy-ubuntu #1482

Closed
