Fix typos and simplify wording in performance tips docs #2179

Merged · 3 commits · Dec 1, 2023
14 changes: 7 additions & 7 deletions docs/src/tutorials/performance.jl
@@ -8,12 +8,12 @@
 # * Identify problematic kernel invocations: you may be launching thousands of kernels which could be fused into a single call;
 # * Find stalls, where the CPU isn't submitting work fast enough to keep the GPU busy.

-# If that isn't sufficient, and you identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to look out for in order of importance:
-# * Memory optimizations are the most important area for performance. Hence optimizing memory accesses, e.g., avoiding needless global accesses (buffering in shared memory instead) and coalescing accesses can lead to big performance improvements;
-# * Launching more threads on each streaming multiprocessor can be acheived by lowering register pressure and reducing shared memory usage, the tips below outline the various ways in which register pressure can be reduced;
-# * Using Float32's instead of Float64's can provide significantly better performance;
-# * Avoid using control flow instructions such as `if` which cause branches, e.g. replace an `if` with an `ifelse` if possible;
-# * Increase the arithmetic intensity in order for the GPU to be able to hide the latency of memory accesses.
+# If that isn't sufficient, and you have identified a kernel that executes slowly, you can try using NSight Compute to analyze that kernel in detail. Some things to try, in order of importance:
+# * Optimizing memory accesses, e.g. avoiding needless global accesses (buffering in shared memory instead) or coalescing accesses;
+# * Launching more threads on each streaming multiprocessor by lowering register pressure and reducing shared memory usage; the tips below outline various ways to reduce register pressure;
+# * Using `Float32`s instead of `Float64`s;
+# * Avoiding control flow instructions such as `if`, which cause branches; e.g. try replacing an `if` with an `ifelse`;
+# * Increasing the arithmetic intensity so the GPU can hide the latency of memory accesses.
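Several of the tips in that list can be illustrated with a short CUDA.jl sketch. This hypothetical kernel is not part of the PR; it shows `Float32` literals (the `f0` suffix), branch-free selection via `ifelse`, and coalesced access (thread `i` touches element `i`):

```julia
using CUDA

function relu_kernel!(y, x)
    # Global thread index; adjacent threads access adjacent elements,
    # so global memory accesses coalesce.
    i = threadIdx().x + (blockIdx().x - 1) * blockDim().x
    if i <= length(x)
        # `ifelse` evaluates both arms but avoids a divergent branch.
        # `0f0` is a Float32 literal, avoiding promotion to Float64.
        @inbounds y[i] = ifelse(x[i] > 0f0, x[i], 0f0)
    end
    return
end

x = CUDA.rand(Float32, 1024)
y = similar(x)
@cuda threads=256 blocks=4 relu_kernel!(y, x)
```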

# ### Inlining

@@ -26,7 +26,7 @@

# ### FastMath

-# Use `@fastmath` to use faster versions of common mathematical functions and for even faster square roots use `@cuda fastmath=true`.
+# Use `@fastmath` for faster versions of common mathematical functions, and `@cuda fastmath=true` for even faster square roots.
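For example (a hypothetical kernel, not part of this PR), the two options combine like this:

```julia
using CUDA

function sqrt_kernel!(y, x)
    i = threadIdx().x
    # `@fastmath` relaxes IEEE semantics for math functions like `sqrt`.
    @inbounds y[i] = @fastmath sqrt(x[i])
    return
end

x = CUDA.rand(Float32, 256)
y = similar(x)
# `fastmath=true` on the launch additionally lowers square roots to a
# faster, less accurate instruction.
@cuda threads=256 fastmath=true sqrt_kernel!(y, x)
```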

# ## Resources
