
[torchbench] Detectron2 benchmarks failing to run. #6336

Closed
ysiraichi opened this issue Jan 19, 2024 · 23 comments

@ysiraichi
Collaborator

🐛 Bug

After #6296, a few detectron2 benchmarks started failing when using XLA:

python xla/benchmarks/experiment_runner.py \
    --suite-name torchbench --accelerator cuda --repeat 2 \
    --test eval --xla PJRT --dynamo openxla \
    -k detectron2_fasterrcnn_r_50_c4
Traceback (most recent call last):
  File "xla/benchmarks/experiment_runner.py", line 906, in <module>
    main()
  File "xla/benchmarks/experiment_runner.py", line 902, in main
    runner.run()
  File "xla/benchmarks/experiment_runner.py", line 59, in run
    self.run_single_config()
  File "xla/benchmarks/experiment_runner.py", line 247, in run_single_config
    metrics, last_output = self.run_once_and_gather_metrics(
  File "xla/benchmarks/experiment_runner.py", line 324, in run_once_and_gather_metrics
    output, _ = loop(iter_fn=self._default_iter_fn)
  File "xla/benchmarks/experiment_runner.py", line 293, in loop
    output, timing, trace = iter_fn(benchmark_experiment, benchmark_model,
  File "xla/benchmarks/experiment_runner.py", line 209, in _default_iter_fn
    output = benchmark_model.model_iter_fn(
  File "xla/benchmarks/benchmark_model.py", line 155, in eval
    pred = self.module(*inputs)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 150, in forward
    return self.inference(batched_inputs)
  File "/lib/python3.8/site-packages/detectron2/modeling/meta_arch/rcnn.py", line 208, in inference
    proposals, _ = self.proposal_generator(images, features, None)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 454, in forward
    pred_objectness_logits, pred_anchor_deltas = self.rpn_head(features)
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/rpn.py", line 175, in forward
    pred_objectness_logits.append(self.objectness_logits(t))
  File "torch/nn/modules/module.py", line 1511, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "torch/nn/modules/module.py", line 1520, in _call_impl
    return forward_call(*args, **kwargs)
  File "torch/nn/modules/conv.py", line 460, in forward
    return self._conv_forward(input, self.weight, self.bias)
  File "torch/nn/modules/conv.py", line 456, in _conv_forward
    return F.conv2d(input, weight, bias, self.stride,
RuntimeError: Input type (float) and bias type (c10::Half) should be the same

Environment

cc @miladm @JackCaoG

@JackCaoG
Collaborator

This error seems to come from PyTorch, which is weird. Do we know why the bias is fp16?

@ysiraichi
Collaborator Author

I don't think #6296 is wrong or should be reverted, since I believe it's the best way to compare against Inductor: instantiate the module on the original accelerator, and then move it to XLA. That said, I can think of two solutions:

  1. (easy) Special-case these models, so that only they are instantiated on the XLA device
  2. (hard) Investigate what's actually going on there

In particular, I believe (2) is the better option, so I will focus on that.

@ysiraichi
Collaborator Author

do we know why bias is fp16?

Not really sure. I still have to investigate.

@ysiraichi
Collaborator Author

But, yes, this seems like PyTorch is doing something weird.

@ysiraichi ysiraichi self-assigned this Jan 19, 2024
@ysiraichi
Collaborator Author

I've just confirmed that instantiating the model on the XLA device solves the error, i.e. replacing the line below with str(self.benchmark_experiment.get_device()).

device=self.benchmark_experiment.accelerator,

@vanbasten23
Collaborator

I've just confirmed that instantiating the model on the XLA device solves the error, i.e. replacing the line below with str(self.benchmark_experiment.get_device()).

device=self.benchmark_experiment.accelerator,

I thought you wanted to instantiate the model on CUDA and then move it to XLA?

@ysiraichi
Collaborator Author

Yes, I do think that's better. I was just confirming that this was the change that caused these models to break.

@ysiraichi
Collaborator Author

After further investigation, I found that the issue is due to a combination of two factors:

  • The model and the example inputs are converted to float16
  • The XLA_USE_FP16 environment variable is set

This causes AtenFromXlaTensor (at the end of a relu dispatch) to call MaybeUpcastToHostTorchType, which converts the float16 result back to a float32 tensor.
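
For illustration, a hypothetical minimal repro of this interaction might look like the snippet below (not verified end-to-end; it assumes XLA_USE_FP16=1 and a working PJRT device are configured in the environment before torch_xla initializes):

import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
conv = torch.nn.Conv2d(3, 8, 3).to(device).half()   # parameters become float16
x = torch.randn(1, 3, 16, 16, device=device).half()

t = torch.relu(x)  # with XLA_USE_FP16, the relu result is reported back as float32
conv(t)            # expected to hit the same float/c10::Half mismatch as above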


Why wasn't it failing before?

Torchbench already converts the model and example inputs to float16 when the device is cuda, because the DEFAULT_EVAL_CUDA_PRECISION variable is set for detectron2 models. However, before #6296 the models were being initialized with something other than cuda (using str(self.benchmark_experiment.get_device())). Thus, the model was never actually converted to float16, which avoided the error.


Possible Solutions

I can think of a few possible solutions:

  1. Do not set XLA_USE_FP16=1, since the model is already being converted to float16
  2. Remember the original data type of the input tensor before it is downcast to float16
  3. Maintain a set of models, INIT_WITH_XLA_DEVICE, special-casing them so that they don't get downcast

Of those, I think (1) is the best. Maybe we should also throw a specific error when float16 tensors are used while the XLA_USE_FP16 environment variable is set.
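
For illustration only, such a guard could look roughly like the sketch below (check_fp16_model is hypothetical and not part of the benchmark code):

import os
import torch

def check_fp16_model(model: torch.nn.Module) -> None:
    # Hypothetical check: fail loudly when the model was already downcast to
    # float16 while XLA_USE_FP16 is also set, since the two conversions conflict.
    if os.environ.get("XLA_USE_FP16") == "1" and any(
            p.dtype == torch.float16 for p in model.parameters()):
        raise RuntimeError(
            "Model parameters are already float16; unset XLA_USE_FP16 "
            "or keep the model in float32.")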

@miladm @JackCaoG @vanbasten23
Let me know what you think.

@ysiraichi
Collaborator Author

Apparently, after doing (1), I am getting another error:

  File "/lib/python3.8/site-packages/detectron2/modeling/proposal_generator/proposal_utils.py", line 121, in find_top_rpn_proposals
    keep = batched_nms(boxes.tensor, scores_per_img, lvl, nms_thresh)
  File "/lib/python3.8/site-packages/detectron2/layers/nms.py", line 20, in batched_nms
    return box_ops.batched_nms(boxes.float(), scores, idxs, iou_threshold)
...
  File "/lib/python3.8/site-packages/torchvision/torchvision/ops/boxes.py", line 109, in resume_in__batched_nms_vanilla_at_107
    curr_keep_indices = nms(boxes[curr_indices], scores[curr_indices], iou_threshold)
  File "/lib/python3.8/site-packages/torchvision/torchvision/ops/boxes.py", line 41, in nms
    return torch.ops.torchvision.nms(boxes, scores, iou_threshold)
  File "torch/_ops.py", line 825, in __call__
    return self_._op(*args, **(kwargs or {}))
RuntimeError: dets (Float) should have the same type as scores (Half)

In summary:

  • Detectron2 is casting one of the nms inputs to float32
  • nms execution is falling back to the CPU implementation, which doesn't accept mixed input types
    • Interestingly, the CUDA implementation does accept them (see the cast sketch below)
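
For illustration, a caller-side cast like the sketch below would sidestep the CPU fallback's dtype check; this is a hypothetical workaround, not the fix proposed here:

import torch
from torchvision.ops import boxes as box_ops

def batched_nms_same_dtype(boxes: torch.Tensor, scores: torch.Tensor,
                           idxs: torch.Tensor, iou_threshold: float) -> torch.Tensor:
    # Cast scores alongside the boxes.float() cast detectron2 already performs,
    # so dets and scores reach torchvision's nms with the same dtype.
    return box_ops.batched_nms(boxes.float(), scores.float(), idxs, iou_threshold)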

Given the problem above, I think we should use solution (3) in the short term.
Let me know what you all think.

@JackCaoG
Collaborator

FYI: nn.Module.to(torch.float16) has a bug; I opened pytorch/pytorch#115792 for it. This is why we still have to use XLA_USE_FP16.

@ysiraichi
Collaborator Author

I see. So, maybe a solution is to pass --precision fp32 when instantiating the benchmark, while having XLA_USE_FP16 set. What do you think?

@JackCaoG
Collaborator

That seems to be a reasonable workaround until upstream fixes the model.to issue (which I think they are working on; I saw some PRs floating around).

@ysiraichi ysiraichi changed the title Detectron2 benchmarks failing to run. [torchbench] Detectron2 benchmarks failing to run. Jan 22, 2024
@ysiraichi
Collaborator Author

This issue was temporarily fixed by #6389. #6404 details a better fix for this upcasting problem; the actual problem description is in #6403.

@ysiraichi
Collaborator Author

Apparently, this issue was not due to the conversion bug (pytorch/pytorch#115792) as we once thought; it is a real problem (more details in this comment).

@ysiraichi ysiraichi reopened this Feb 12, 2024
@ysiraichi
Collaborator Author

@miladm @JackCaoG

Here's what I found when looking into this issue (nms falling back to the CPU kernel): even though there's an implementation of nms inside PyTorch/XLA, it appears to be hooked up only to a Python function in torch_xla/core/functions.py.

Registering an XLA implementation with the dispatcher should solve this problem. That said, I don't think we can leverage the current codegen infrastructure, since nms is a torchvision kernel.

What I think could be done: register the XLA implementation manually by:

TORCH_LIBRARY_IMPL(torchvision, XLA, m) {
  m.impl(TORCH_SELECTIVE_NAME("torchvision::nms"), TORCH_FN(xla_nms_kernel));
}
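
A rough Python-side equivalent using torch.library could look like the sketch below (xla_nms is a hypothetical wrapper around the PyTorch/XLA nms implementation, not an existing function):

import torch
import torchvision  # ensures the torchvision::nms schema is registered

def xla_nms(boxes, scores, iou_threshold):
    # Hypothetical: dispatch to the PyTorch/XLA nms lowering here.
    raise NotImplementedError

_lib = torch.library.Library("torchvision", "IMPL")
_lib.impl("nms", xla_nms, "XLA")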

Let me know what you think.

@JackCaoG
Collaborator

I don't think the XLA nms is well tested; it's better to figure out what that op does and test it before we register it as the default implementation.

@ysiraichi
Collaborator Author

@JackCaoG While the solution in this comment works, I thought it would make more sense to implement a CompositeExplicitAutograd version directly in TorchVision. What do you think?

@JackCaoG
Collaborator

What difference does it make to mark it as CompositeExplicitAutograd?

@ysiraichi
Collaborator Author

The difference is that it would decompose into (hopefully) already-supported operations. That said, I'm thinking of the following plan:

  • Make an XLA kernel for nms in PyTorch/XLA (original idea)
  • Talk with the TorchVision maintainers and create a CompositeExplicitAutograd kernel for nms (a rough sketch of what such a decomposition could look like is below)
  • Kill the XLA kernel implementation

I think that it would be better to have the composite kernel because:

  • It's easier to maintain
  • One less kernel to maintain inside PyTorch/XLA
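
Purely as an illustration of what such a decomposition could look like, here is an eager-style reference written only with ordinary tensor ops (the Python-level greedy loop is just to show the math; a real CompositeExplicitAutograd kernel would need a graph-friendly formulation):

import torch

def nms_reference(boxes: torch.Tensor, scores: torch.Tensor,
                  iou_threshold: float) -> torch.Tensor:
    # boxes: (N, 4) as (x1, y1, x2, y2); returns indices of kept boxes,
    # sorted by decreasing score, following TorchVision's nms semantics.
    order = scores.argsort(descending=True)
    boxes = boxes[order]
    areas = (boxes[:, 2] - boxes[:, 0]) * (boxes[:, 3] - boxes[:, 1])
    suppressed = torch.zeros(boxes.shape[0], dtype=torch.bool, device=boxes.device)
    keep = []
    for i in range(boxes.shape[0]):
        if suppressed[i]:
            continue
        keep.append(order[i])
        # IoU of box i against every box; already-processed boxes are unaffected.
        lt = torch.maximum(boxes[i, :2], boxes[:, :2])
        rb = torch.minimum(boxes[i, 2:], boxes[:, 2:])
        wh = (rb - lt).clamp(min=0)
        inter = wh[:, 0] * wh[:, 1]
        iou = inter / (areas[i] + areas - inter)
        suppressed |= iou > iou_threshold
    if not keep:
        return torch.empty(0, dtype=torch.long, device=boxes.device)
    return torch.stack(keep)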

@ysiraichi
Collaborator Author

@JackCaoG Question: how important is it to keep the old behavior?

  • Current nms signature:
    nms(boxes, scores, score_threshold, iou_threshold, output_size)
  • TorchVision nms signature (example call below):
    nms(boxes, scores, iou_threshold)
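
For reference, a minimal call against the TorchVision signature (the values are just an example):

import torch
from torchvision.ops import nms

boxes = torch.tensor([[0., 0., 10., 10.],
                      [1., 1., 11., 11.]])
scores = torch.tensor([0.9, 0.8])
keep = nms(boxes, scores, iou_threshold=0.5)  # the overlapping lower-scoring box is dropped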

@JackCaoG
Collaborator

Ehh, not very? I guess no one is using our nms at the moment.

@ysiraichi
Collaborator Author

ysiraichi commented Mar 22, 2024

So, can we kill it in favor of the TorchVision variant?

@JackCaoG
Collaborator

yea
