Fix some more core aten ops #6342

Merged: wonjoolee95 merged 7 commits into master from wonjoo/core-aten-ops-week-6 on Jan 23, 2024
Conversation

@wonjoolee95 (Collaborator) commented Jan 22, 2024

Fixes #5896, fixes #5867, fixes #5884, fixes #5889

@wonjoolee95 requested a review from ManfeiBai on January 22, 2024 06:43
@ManfeiBai (Collaborator) left a comment:

LGTM

@wonjoolee95 changed the title from "Fix sore core aten ops" to "Fix some more core aten ops" on Jan 22, 2024
@wonjoolee95 force-pushed the wonjoo/core-aten-ops-week-6 branch from ae77bfa to 34786e8 on January 22, 2024 20:23
@wonjoolee95 force-pushed the wonjoo/core-aten-ops-week-6 branch from 34786e8 to 0478d2f on January 23, 2024 08:32
@wonjoolee95 merged commit 99a1341 into master on Jan 23, 2024
18 checks passed
@cota (Collaborator) commented Jan 24, 2024

I've bisected a large number of failures (all torchbench inference on XLA:GPU) down to this commit.

Some example failures:

INFO:__main__:Run with --model-config={"model_name": "BERT_pytorch"} --experiment-config={"accelerator": "cuda", "xla": "PJRT", "xla_flags": null, "dynamo": "openxla", "test": "train"}
ERROR:torchbench_model:Cannot load benchmark model
Traceback (most recent call last):
  File "/home/ecg/nightly_runs/2024-01-24/pytorch/xla/benchmarks/torchbench_model.py", line 288, in default_precision_flag
    benchmark = self.load_benchmark()
  File "/home/ecg/nightly_runs/2024-01-24/pytorch/xla/benchmarks/torchbench_model.py", line 267, in load_benchmark
    return benchmark_cls(
  File "/home/ecg/nightly_runs/2024-01-24/benchmark/torchbenchmark/util/model.py", line 24, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/ecg/nightly_runs/2024-01-24/benchmark/torchbenchmark/models/BERT_pytorch/__init__.py", line 148, in __init__
    trainer = BERTTrainer(bert, len(vocab), train_dataloader=train_data_loader, test_dataloader=test_data_loader,
  File "/home/ecg/nightly_runs/2024-01-24/benchmark/torchbenchmark/models/BERT_pytorch/bert_pytorch/trainer/pretrain.py", line 38, in __init__
    self.device = torch.device(device)
TypeError: device() received an invalid combination of arguments - got (bool), but expected one of:
 * (torch.device device)
      didn't match because some of the arguments have invalid types: (!bool!)
 * (str type, int index)
 * 
INFO:__main__:Run with --model-config={"model_name": "Background_Matting"} --experiment-config={"accelerator": "cuda", "xla": "PJRT", "xla_flags": null, "dynamo": "openxla", "test": "eval"}
ERROR:torchbench_model:Cannot load benchmark model
Traceback (most recent call last):
  File "/home/ecg/nightly_runs/2024-01-24/pytorch/xla/benchmarks/torchbench_model.py", line 288, in default_precision_flag
    benchmark = self.load_benchmark()
  File "/home/ecg/nightly_runs/2024-01-24/pytorch/xla/benchmarks/torchbench_model.py", line 267, in load_benchmark
    return benchmark_cls(
  File "/home/ecg/nightly_runs/2024-01-24/benchmark/torchbenchmark/util/model.py", line 24, in __call__
    obj = type.__call__(cls, *args, **kwargs)
  File "/home/ecg/nightly_runs/2024-01-24/benchmark/torchbenchmark/models/Background_Matting/__init__.py", line 72, in __init__
    netB.to(self.device)
  File "/home/ecg/nightly_runs/2024-01-24/pytorch/torch/nn/modules/module.py", line 1137, in to
    raise TypeError('nn.Module.to only accepts floating point or complex '
TypeError: nn.Module.to only accepts floating point or complex dtypes, but got desired dtype=torch.bool

Does this ring any bells?

@wonjoolee95 (Collaborator, Author) commented:

Thanks for catching this. It's hard to identify the offending op just from the trace, but this PR essentially only touches two ops -- aten::reciprocal and aten::sigmoid. Let me revert this PR's changes to these two ops for now and investigate.
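
For context, a minimal sanity-check sketch (not taken from this PR, and assuming torch_xla is installed with an XLA device available) that exercises the two lowerings in question and compares them against the CPU reference:

```python
import torch
import torch_xla.core.xla_model as xm

device = xm.xla_device()
x = torch.rand(4, 4, device=device) + 0.5  # keep values away from zero for reciprocal

# The two ops this PR touches, checked against their CPU results.
print(torch.allclose(torch.reciprocal(x).cpu(), torch.reciprocal(x.cpu())))
print(torch.allclose(torch.sigmoid(x).cpu(), torch.sigmoid(x.cpu())))
```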

wonjoolee95 added a commit that referenced this pull request Jan 25, 2024
@wonjoolee95 (Collaborator, Author) commented Jan 25, 2024

Reading the error, it's complaining that a boolean is being passed to the .device() and .to() methods. At a quick glance, the errors seem unrelated to this PR's changes, but let me continue to investigate.
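
For reference, a hypothetical snippet that reproduces the failure mode in the BERT_pytorch traceback above (a Python bool reaching torch.device where a device string was expected):

```python
import torch

device = True                       # hypothetically, a bool was passed instead of e.g. "cuda"
self_device = torch.device(device)
# TypeError: device() received an invalid combination of arguments - got (bool), ...
```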

@cota, is there something that describes the setup for me to reproduce this (running torchbench) on GPU?

cota added a commit to cota/pytorch-xla that referenced this pull request Jan 26, 2024
This reverts commit 4ab7a24.

Turns out that the revert was unnecessary; things broke
from a different commit. This reverts the revert, i.e.
it reinstates pytorch#6342.

@cota (Collaborator) commented Jan 26, 2024

@wonjoolee95 I redid the bisection, paying closer attention this time. It turns out that the problem was introduced by a prior commit, not by this PR. My apologies! :(
Things are now working on master, and I have confirmed that reinstating this PR still works.
I've sent #6387 to reapply this change.
