Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

WIP debug test only: AMD OpenCL peak flops patch #3006

Closed
wants to merge 5 commits into from

Conversation

RichardHaselgrove
Copy link
Contributor

Fixes #2988 and #3001 (test only)

Description of the Change
A recent fault in AMD OpenCL driver releases results in a meaningless overflow value being generated for the device GFLOPS peak. This in turn causes tasks to fail (for new machines or new applications) with Error 197: EXIT_TIME_LIMIT_EXCEEDED

Alternate Designs
Research and apply alternative ATI driver metrics to replace the OpenCL 'Max clock frequency'

Release Notes
For testing only. GFLOPS Peak should be capped at 1.e12

This takes the active ingredient from David's #3001 'fix possible overflow' and applies it in lib/coproc.cpp at COPROC_ATI::set_peak_flops()

Posting so that tests can be made on the AppVeyor build: I don't have the necessary hardware.
@RichardHaselgrove
Copy link
Contributor Author

RichardHaselgrove commented Feb 6, 2019

SU user has confirmed that this build now reports

06-Feb-2019 14:29:16 (low) [] OpenCL: AMD/ATI GPU 0: AMD Radeon(TM) Vega 8 Graphics (driver version 2766.5 (PAL,HSAIL), device version OpenCL 2.0 AMD-APP (2766.5), 6567MB, 6567MB available, 1000 GFLOPS peak)

@JuhaSointusalo
Copy link
Contributor

Easier to organize thoughts when writing so putting this here and now instead of in meeting. I would suggest that

  • the patch to be moved so that it covers all vendors. Just because AMD released buggy driver this time doesn't mean others won't ever do it.
  • a message added that tells the computed peak_flops was rejected and replaced by a default value. Look at what David did to see how the message is added. I don't think there is much point in listing which input value was bad because it could have been any of them but include what the rejected peak_flops was.
  • the default value lowered significantly. For example, my mobile Intel GPU has 192 GFLOPS peak. Having peak_flops lower than in reality should be safe I think.
  • the patch to be redone so that it includes only what's in master.

@JuhaSointusalo
Copy link
Contributor

  • Look at what David did to see how the message is added.

My reply to that was a bit short. I had not noticed the patch was in library code. Getting useful messages out of library code is not something that is explained in one sentence, or a five minute job, or one line code change. Right now I couldn't tell how to do it and it's quite possible the changes necessary to do it would be too large to fit in to a hotfix.

@RichardHaselgrove
Copy link
Contributor Author

The easiest way might be to pass back a 'value limited' flag to the calling routine, and let the client code respond with message code when the flag is present.

@JuhaSointusalo
Copy link
Contributor

#3001 implemented a workaround for the driver bug in a different way. Closing this as unnecessary.

@JuhaSointusalo JuhaSointusalo deleted the RDH_AMD_flops_patch branch February 12, 2019 21:28
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants