Optimize GPU components #489

Merged
merged 3 commits into main from optimize-gpu-components on Oct 5, 2023

Conversation

PhilippeMoussalli
Contributor

PR that modifies all current GPU components by:

  • Batching both the preprocessing and inference to avoid OOM issues (a minimal sketch follows below)
  • Disabling a pytorch API that caused illegal memory access issues. More on this issue here
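
A minimal sketch of the batching described above (illustrative only; model, preprocess, and batch_size are placeholder names, not the component's actual code):

import torch

def batched_inference(images, model, preprocess, batch_size=32, device="cuda"):
    results = []
    with torch.no_grad():
        for start in range(0, len(images), batch_size):
            batch = images[start:start + batch_size]
            inputs = preprocess(batch).to(device)  # preprocess only this batch
            outputs = model(inputs)                # run inference on the batch
            results.append(outputs.cpu())          # move results off the GPU right away
    return torch.cat(results)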

r, g, b = tuple(avg_color)
draw.rectangle(((x1, y1), (x2, y2)), fill=(int(r), int(g), int(b)))

if cropped_image.any():
Contributor Author

This is not really related to this PR but is needed to tackle edge cases

Member

@RobbeSneyders left a comment

Thanks @PhilippeMoussalli! We might want to create a model inference component in the future which packages the batching functionality, so users only need to implement their code per batch.
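
A possible shape for such a component (purely illustrative; the class and method names here are hypothetical, not an existing API):

from abc import ABC, abstractmethod

class BatchedInferenceComponent(ABC):
    """Base class that owns the batching loop; subclasses only implement per-batch logic."""

    def __init__(self, batch_size: int = 32):
        self.batch_size = batch_size

    @abstractmethod
    def process_batch(self, batch):
        """Run preprocessing + inference on a single batch."""

    def transform(self, data):
        # The base class slices the input into batches, so user code never sees the full partition.
        results = []
        for start in range(0, len(data), self.batch_size):
            results.extend(self.process_batch(data[start:start + self.batch_size]))
        return results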

@@ -12,75 +13,97 @@

logger = logging.getLogger(__name__)

os.environ['CUDA_LAUNCH_BLOCKING'] = "1"
Member

We no longer need this, right? It was just for debugging I think.
Same in the other components.

Contributor Author

Kept it in case we run into other issues later on and need further debugging. We can remove it once we've tested enough GPU components at scale and are sure that everything runs fine.

Member

I don't think it helped us, since we are already running single-threaded. It also didn't change the stack trace; the first one was already correct. And it seems like it really should not be used when not debugging.

Contributor Author

oh ok, I'll remove it then

Member

@RobbeSneyders left a comment

Thanks!

@PhilippeMoussalli PhilippeMoussalli merged commit 015dc0a into main Oct 5, 2023
8 checks passed
@PhilippeMoussalli PhilippeMoussalli deleted the optimize-gpu-components branch October 5, 2023 11:56
@PhilippeMoussalli
Contributor Author

When using the default Dask scheduler (threaded), it is important to take into account that all GPU-related processing (preprocessing, inference) has to be batched to avoid running into OOM issues.

To scale the model efficiently, inference can be run on multiple GPUs using PyTorch Data Parallelism (this does not work for every model), which parallelizes the batches across the available GPUs. One important consideration there is to either use a single-threaded scheduler (not recommended) or limit the number of Dask workers to the number of GPUs, dask.config.set(num_workers=<#GPU>), to avoid running into issues. Other alternatives could include assigning GPUs to spawned processes (not tested yet).
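
A minimal sketch of that setup (illustrative only; the Linear model is a stand-in for the actual inference model, and at least one CUDA device is assumed):

import dask
import torch

n_gpus = torch.cuda.device_count()
dask.config.set(num_workers=n_gpus)        # one Dask worker per GPU

model = torch.nn.Linear(512, 512)          # stand-in for the actual inference model
if n_gpus > 1:
    model = torch.nn.DataParallel(model)   # splits each batch across the visible GPUs; not supported by every model
model = model.to("cuda")                   # assumes a CUDA device is available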

To test and diagnose GPU components, both nvtop and htop can be used to monitor GPU and CPU usage. This can help identify bottlenecks and pinpoint whether a GPU component is compute- or memory-bound.

Further things that still need to be clarified:

  • Whether to run a model using the processes or the threaded scheduler (so far, the threaded scheduler has shown to be faster, and most resources seem to indicate using threads (link)); see the snippet after this list.
  • How to parallelize GPU and CPU tasks efficiently: limiting the number of workers can leave some workers/CPU cores idle (when the number of GPUs in one machine is smaller than the number of CPU cores). There is some room for optimization.
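
A small snippet for benchmarking both schedulers (the Dask dataframe here is a placeholder workload, not the actual component):

import dask
import dask.dataframe as dd
import pandas as pd

if __name__ == "__main__":
    ddf = dd.from_pandas(pd.DataFrame({"x": range(1_000)}), npartitions=4)  # placeholder workload

    with dask.config.set(scheduler="threads"):
        ddf.x.sum().compute()       # run with the threaded scheduler

    with dask.config.set(scheduler="processes"):
        ddf.x.sum().compute()       # run with the multiprocessing scheduler (needs the __main__ guard)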
