Fix typos and simplify wording in performance tips docs #2179

Merged · 3 commits · Dec 1, 2023
14 changes: 7 additions & 7 deletions docs/src/tutorials/performance.jl
@@ -8,12 +8,12 @@
 # * Identify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;
 # * Find stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.

-# If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance:
-# * Memory optimizations are the most important area for performance. Hence optimizing memory accesses, e.g., avoiding needless global accesses (buffering in shared memory instead) and coalescing accesses can lead to big performance improvements;
-# * Launching more threads on each streaming multiprocessor can be acheived by lowering register pressure and reducing shared memory usage, the tips below outline the various ways in which register pressure can be reduced;
-# * Using Float32's instead of Float64's can provide significantly better performance;
-# * Avoid using control flow instructions such as `if` which cause branches, e.g. replace an `if` with an `ifelse` if possible;
-# * Increase the arithmetic intensity in order for the GPU to be able to hide the latency of memory accesses.
+# If that isn't sufficient, and you have identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to try, in order of importance:
+# * Optimizing memory accesses, e.g. avoiding needless global accesses (buffering in shared memory instead) or coalescing accesses;
+# * Launching more threads on each streaming multiprocessor by lowering register pressure and reducing shared memory usage; the tips below outline various ways to reduce register pressure;
+# * Using `Float32`s instead of `Float64`s;
+# * Avoiding control flow instructions such as `if`, which cause branches; e.g. try replacing an `if` with an `ifelse`;
+# * Increasing the arithmetic intensity so the GPU can hide the latency of memory accesses.
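Several of the tips in that list can be illustrated with a short CUDA.jl sketch. This hypothetical kernel is not part of the PR; it shows `Float32` literals (the `f0` suffix), branch-free selection via `ifelse`, and coalesced access (thread `i` touches element `i`):

```julia
using CUDA

function relu_kernel!(y, x)
    # Global thread index; adjacent threads access adjacent elements,
    # so global memory accesses coalesce.
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        # `ifelse` evaluates both arms but avoids a divergent branch.
        # `0f0` is a Float32 literal, avoiding promotion to Float64.
        @inbounds y[i] = ifelse(x[i] > 0f0, x[i], 0f0)
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = similar(x)
@cuda threads=256 blocks=4 relu_kernel!(y, x)
```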

# ### Inlining

@@ -26,7 +26,7 @@

# ### FastMath

-# Use `@fastmath` to use faster versions of common mathematical functions and for even faster square roots use `@cuda fastmath=true`.
+# Use `@fastmath` for faster versions of common mathematical functions, and `@cuda fastmath=true` for even faster square roots.
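For example (a hypothetical kernel, not part of this PR), the two options combine like this:

```julia
using CUDA

function sqrt_kernel!(y, x)
    i = threadIdx().x
    # `@fastmath` relaxes IEEE semantics for math functions like `sqrt`.
    @inbounds y[i] = @fastmath sqrt(x[i])
    return
end

x = CUDA.rand(Float32, 256)
y = similar(x)
# `fastmath=true` on the launch additionally lowers square roots to a
# faster, less accurate instruction.
@cuda threads=256 fastmath=true sqrt_kernel!(y, x)
```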

# ## Resources
