This repository has been archived by the owner on Nov 17, 2023. It is now read-only.

How to debug hybridize() failures? #10875

Closed
jaanli opened this issue May 9, 2018 · 6 comments

Comments

@jaanli

jaanli commented May 9, 2018

My model runs fine without hybridize, but I need the speed boost.

When I call hybridize I get a cryptic failure message:

  File "normal_def/train.py", line 225, in <module>
    train()
  File "normal_def/train.py", line 178, in train
    users, items, item_counts, set_sizes, user_meals)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 413, in __call__
    return self.forward(*args)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 621, in forward
    return self._call_cached_op(x, *args)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/gluon/block.py", line 528, in _call_cached_op
    out = self._cached_op(*cargs)
  File "/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/_ctypes/ndarray.py", line 149, in __call__
    ctypes.byref(out_stypes)))
  File "/usr/local/anaconda3/lib/python3.6/site-packages/mxnet/base.py", line 149, in check_call
    raise MXNetError(py_str(_LIB.MXGetLastError()))
mxnet.base.MXNetError: Error in operator normaldef0_fastpoissonlogprob0__mul0: [17:27:23] src/operator/nn/../tensor/../elemwise_op_common.h:123: Check failed: assign(&dattr, (*vec)[i]) Incompatible attr in node normaldef0_fastpoissonlogprob0__mul0 at 1-th input: expected [9963,128], got [9963,1]

There are many multiplications and nodes of that shape. How can I figure out why hybridize is not working, i.e. which line of code normaldef0_fastpoissonlogprob0__mul0 corresponds to?

Thanks!

@zhreshold
Member

From the logs, it looks like multiplication behaves differently in the hybridized and non-hybridized versions.
TL;DR: you might need to use broadcast_mul explicitly rather than * in your code at the point where the first normaldef0_fastpoissonlogprob0__mul is encountered.
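
For later readers, here is a minimal sketch of how the error message itself can narrow down the offending line. It assumes the Gluon 1.x naming scheme, in which each block contributes a prefix made of its lowercased class name plus a counter, and an operator overload such as * produces an op named like _mul0 (the first such op in that scope). The helper parse_shape_error and its return format are made up for illustration and are not part of MXNet. The "1-th input" index appears to be zero-based, so it would point at the right-hand operand of the multiplication.

```python
import re

# Tail of the error message from the traceback above.
MESSAGE = (
    "Incompatible attr in node normaldef0_fastpoissonlogprob0__mul0 "
    "at 1-th input: expected [9963,128], got [9963,1]"
)

PATTERN = re.compile(
    r"Incompatible attr in node (?P<node>\S+) at (?P<index>\d+)-th input: "
    r"expected \[(?P<expected>[\d,]*)\], got \[(?P<got>[\d,]*)\]"
)


def _shape(text):
    """Turn '9963,128' into (9963, 128)."""
    return tuple(int(d) for d in text.split(",") if d)


def parse_shape_error(message):
    """Hypothetical helper: pull the node name and shapes out of the message."""
    match = PATTERN.search(message)
    if match is None:
        return None
    node = match.group("node")
    # Ops created by operator overloads are named like "_mul0", so the block
    # prefix and the op name end up joined by a double underscore.
    prefix, sep, op = node.rpartition("__")
    return {
        "node": node,
        "blocks": prefix.split("_") if sep else [],  # outermost block first
        "op": "_" + op if sep else node,
        "input_index": int(match.group("index")),
        "expected": _shape(match.group("expected")),
        "got": _shape(match.group("got")),
    }


if __name__ == "__main__":
    for key, value in parse_shape_error(MESSAGE).items():
        print(key, "=", value)
```

Read this way, the node is the first * inside the hybrid_forward of the block whose prefix is fastpoissonlogprob0, nested in the block with prefix normaldef0, and the operand with a trailing dimension of 1 is the one that needs broadcasting.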

@jaanli
Author

jaanli commented May 10, 2018

Yup, that was it - thanks!

Hopefully there will be a better way to figure this out in the future, or a list of guidelines so that code does not have to be changed after debugging. I had to change about 5 lines, replacing the minus, plus, and multiply operators with F.broadcast_sub, F.broadcast_add, and F.broadcast_mul.
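
As a rough sketch of what such a change can look like (not the original code from this issue), the function below is written in the style of a hybrid_forward body. The name affine_terms, its arguments, and the shapes are assumptions for illustration. F stands for the module Gluon passes to hybrid_forward: mxnet.nd before hybridize() and mxnet.sym after it.

```python
def affine_terms(F, x, scale, shift, bias):
    """Hypothetical hybrid_forward-style helper.

    Assumed shapes: x is (batch, dim); scale, shift and bias are (batch, 1).
    """
    # Was: x * scale - shift + bias
    # Under hybridize(), the bare operators become elementwise ops that
    # require identical shapes; the broadcast_* ops state the intent.
    scaled = F.broadcast_mul(x, scale)
    centered = F.broadcast_sub(scaled, shift)
    return F.broadcast_add(centered, bias)
```

Inside a HybridBlock, this would be called from hybrid_forward(self, F, ...) with the F that Gluon supplies, so the same code runs in both the imperative and the hybridized mode.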

@zhreshold
Member

@piiswrong Is this a bug, or can we simply not detect when the broadcast operators should be used?
@hetong007 had similar problems yesterday.

@piiswrong
Contributor

The +, -, *, / operators behave differently for symbol and ndarray due to legacy reasons (the broadcast_* operators cannot do reverse shape inference).

We should explain this difference in the error message.
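
To make the difference concrete, here is a toy sketch of the two shape rules in plain Python. It is not MXNet's implementation, and the function names are invented. My reading of the "reverse" inference point is that with an elementwise op, knowing one shape fixes all the others, whereas with a broadcast op the output shape leaves several input shapes possible.

```python
import itertools


def elemwise_shape(lhs, rhs):
    """Strict rule: both operand shapes must be identical."""
    if lhs != rhs:
        raise ValueError(
            "Incompatible attr: expected %s, got %s" % (list(lhs), list(rhs))
        )
    return lhs


def broadcast_shape(lhs, rhs):
    """Toy broadcast rule for equal-rank shapes: size-1 axes stretch."""
    if len(lhs) != len(rhs):
        raise ValueError("this sketch only handles equal-rank shapes")
    out = []
    for a, b in zip(lhs, rhs):
        if a != b and 1 not in (a, b):
            raise ValueError("cannot broadcast %d against %d" % (a, b))
        out.append(b if a == 1 else a)
    return tuple(out)


def possible_broadcast_inputs(out_shape):
    """Every input shape that could broadcast to out_shape (equal rank)."""
    options = [(d, 1) if d != 1 else (1,) for d in out_shape]
    return list(itertools.product(*options))


if __name__ == "__main__":
    print(broadcast_shape((9963, 128), (9963, 1)))
    print(possible_broadcast_inputs((9963, 128)))
    try:
        elemwise_shape((9963, 128), (9963, 1))
    except ValueError as err:
        print(err)
```

With the shapes from the traceback, the strict rule rejects the pair, the broadcast rule accepts it, and the last helper shows why a broadcast output shape alone does not determine the inputs.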

@piiswrong piiswrong added Doc and removed Bug labels May 29, 2018
@szha
Member

szha commented May 30, 2018

I included the eventual solution in this comment: #9686 (comment)

@sojiadeshina
Contributor

@szha Can we close out this issue now, since it's been tracked by the roadmap issue? Thanks.
