
Use benchmark_cls for checking precision. #6375

Merged · 1 commit · Jan 25, 2024

Conversation

@ysiraichi (Collaborator) commented Jan 24, 2024

This PR makes it so we don't have to call load_benchmark just to check which precision to use.
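
For illustration, a minimal sketch of the idea, assuming the precision defaults live as class attributes on the benchmark class (the attribute names follow the discussion below; the helper name is hypothetical):

    # Sketch: read the class-level precision defaults directly off the
    # benchmark class, instead of instantiating it (which loads the model).
    benchmark_cls = self._load_benchmark_cls()  # hypothetical helper name

    train_precision = getattr(
        benchmark_cls, "DEFAULT_TRAIN_CUDA_PRECISION", None)
    eval_precision = getattr(
        benchmark_cls, "DEFAULT_EVAL_CUDA_PRECISION", None)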

cc @miladm @JackCaoG

@zpcore (Collaborator) commented Jan 24, 2024

Refer to this issue for context: #6286. Thanks for making the fix.

The key point, I think, is to prevent leaving behind a dangling object that has, e.g., moved a model to the XLA device. del benchmark doesn't resolve the issue because the object has already claimed the PJRT runtime. A later initialization then triggers the stack-dump error: RuntimeError: Bad StatusOr access: UNKNOWN: TPU initialization failed: open(/dev/vfio/0): Device or resource busy: Device or resource busy; Couldn't open iommu group /dev/vfio/0.
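
To illustrate the failure mode (a hedged sketch; the constructor signature follows the snippet below, and the initialization side effects are as described above):

    # Instantiating the benchmark moves the model to the XLA device, which
    # initializes the PJRT runtime and claims the TPU for this process.
    benchmark = benchmark_cls(test="eval", device="xla", batch_size=1)

    # Deleting the Python object does not release the runtime: the TPU has
    # already been claimed, so a later initialization (e.g., from a
    # subprocess) fails with "Device or resource busy".
    del benchmark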

@zpcore zpcore requested a review from will-cromar January 24, 2024 21:29
@zpcore (Collaborator) commented Jan 24, 2024

Since we only need to detect the precision, we can fetch the information directly without invoking:

    benchmark_cls(
        test=self.benchmark_experiment.test,
        device=device,
        batch_size=self.benchmark_experiment.batch_size,
    )

I think we can call a load_benchmark_precision helper like the following, instead of load_benchmark, to get the precision directly.

    def load_benchmark_precision(self):
      # Requires `import importlib` at module level.
      try:
        module = importlib.import_module(
            f"torchbenchmark.models.{self.model_name}")
      except ModuleNotFoundError:
        module = importlib.import_module(
            f"torchbenchmark.models.fb.{self.model_name}")
      benchmark_train_precision = getattr(
          module.Model, "DEFAULT_TRAIN_CUDA_PRECISION", None)
      benchmark_eval_precision = getattr(
          module.Model, "DEFAULT_EVAL_CUDA_PRECISION", None)
      return benchmark_train_precision, benchmark_eval_precision
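
A hedged usage sketch, assuming benchmark_experiment.test is either "train" or "eval" as in the constructor call above:

    train_precision, eval_precision = self.load_benchmark_precision()
    # Pick the default that matches the experiment's test mode.
    precision = (
        train_precision
        if self.benchmark_experiment.test == "train" else eval_precision)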

WDYT?

@ysiraichi (Collaborator, Author) commented:

Right. Correct me if I'm misunderstanding things, but isn't that exactly what I'm doing here?

@zpcore (Collaborator) commented Jan 25, 2024

> Right. Correct me if I'm misunderstanding things, but isn't that exactly what I'm doing here?

Hah, you are right. I didn't notice that you called benchmark_cls instead.

Now it LGTM!

@zpcore zpcore self-requested a review January 25, 2024 17:41
@zpcore zpcore merged commit a1e51e4 into master Jan 25, 2024
18 checks passed
@lezcano lezcano changed the title from "Use benchmark_cls for checking precision.`" to "Use benchmark_cls for checking precision." Feb 5, 2024
bhavya01 pushed a commit that referenced this pull request Apr 22, 2024
Successfully merging this pull request may close these issues.

benchmarks/torchbench_model: some benchmarks fail to load and kill experiment_runner's main process