Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite map_impl and extend guide #9

Merged
merged 14 commits into from
Sep 24, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/build_api.py
Original file line number Diff line number Diff line change
Expand Up @@ -154,6 +154,7 @@ def build_index_page(groups):
"all",
"any",
"count",
"dot",
],
"Mathematical": [
("abs", "abs(const V&)"),
Expand Down
4 changes: 3 additions & 1 deletion docs/guides.rst
Original file line number Diff line number Diff line change
Expand Up @@ -5,5 +5,7 @@ Guides

guides/introduction.rst
guides/promotion.rst
guides/prelude.rst
guides/memory.rst
guides/accuracy.rst
guides/constant.rst
guides/example.md
112 changes: 112 additions & 0 deletions docs/guides/accuracy.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# Accuracy Level

For certain operations, there might be alternative versions available that provide better performance at the cost of lower accuracy.
In other words, they are faster but also have a small error.

## Fast Math

For several operations in single precision (float32), there are "fast math" versions. These functions are faster since they are hardware accelerated, but are not IEEE compliant for all inputs.

To use this functionality, use the `fast_*` functions from Kernel Float.

```cpp
kf::vec<float, 4> x = {1.0f, 2.0f, 3.0f, 4.0f};

// Sine
kf::vec<float, 4> a = kf::fast_sin(x);

// Square root
kf::vec<float, 4> b = kf::fast_sqrt(x);

// Reciprocal `1/x`
kf::vec<float, 4> c = kf::fast_rcp(x);

// Division `a/b`
kf::vec<float, 4> d = kf::fast_div(a, b);
```

These functions are only functional for 32-bit and 16-bit floats.
For other input types, the operation falls back to the regular version.

## Approximate Math

For 16-bit floats, several approximate functions are provided.
These use approximations (typically low-degree polynomials) to calculate rough estimates of the functions.
This can be very fast but also less accurate.


To use this functionality, use the `approx_*` functions from Kernel Float. For other input types, the operation falls back to the `fast_*` variant.

```cpp
kf::vec<half, 4> x = {1.0, 2.0, 3.0, 4.0};

// Sine
kf::vec<half, 4> a = kf::approx_sin(x);

// Square root
kf::vec<half, 4> b = kf::approx_sqrt(x);

// Reciprocal `1/x`
kf::vec<half, 4> c = kf::approx_rcp(x);

// Division `a/b`
kf::vec<half, 4> d = kf::approx_div(a, b);
```

You can adjust the degree of approximation by supplying an integer template parameter:


```cpp
// Sine approximation with polynomial of degree 1
kf::vec<half, 4> a = kf::approx_sin<1>(x);

// Polynomial of degree 2
kf::vec<half, 4> a = kf::approx_sin<2>(x);

// Polynomial of degree 3
kf::vec<half, 4> a = kf::approx_sin<3>(x);
```

## Tuning Accuracy Level

Many functions in Kernel Float accept an additional Accuracy option as a template parameter.
This allows you to tune the accuracy level without changing the function name.

There are four possible values for this parameter:

- `kf::accurate_policy`: Use the most accurate version of the function available.
- `kf::fast_policy`: Use the "fast math" version.
- `kf::approx_policy<N>`: Use the approximate version with degree `N`.
- `kf::default_policy`: Use a global default policy (see the next section).

For example, consider this code:

```cpp
kf::vec<float, 2> input = {1.0f, 2.0f};

// Use the default policy
kf::vec<float, 2> a = kf::cos(input);

// Use the default policy
kf::vec<float, 2> b = kf::cos<kf::default_policy>(input);

// Use the most accurate policy
kf::vec<float, 2> c = kf::cos<kf::accurate_policy>(input);

// Use the fastest policy
kf::vec<float, 2> d = kf::cos<kf::fast_policy>(input);

// Use the approximate policy
kf::vec<float, 2> e = kf::cos<kf::approx_policy<3>>(input);

// You can use aliases to define your own policy
using my_own_policy = kf::fast_policy;
kf::vec<float, 2> f = kf::cos<my_own_policy>(input);
```

## Setting `default_policy`

By default, `kf::default_policy` is set to `kf::accurate_policy`.

Set the preprocessor option `KERNEL_FLOAT_FAST_MATH=1` to change the default policy to `kf::fast_policy`.
This will use fast math for all functions and data types that support it.
112 changes: 112 additions & 0 deletions docs/guides/example.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,112 @@
# Full CUDA example

This page explains a CUDA program that estimates the value of pi using Kernel Float.


## Overview

The program calculates Pi by generating random points within a unit square and counting how many fall inside the unit circle inscribed within that square. The ratio of points inside the circle to the total number of points approximates Pi/4.

The kernel is shown below:


```c++
namespace kf = kernel_float;

using float_type = float;
static constexpr int VECTOR_SIZE = 4;

__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;

kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
kf::vec<int, VECTOR_SIZE> yi = thread_y;

kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);

kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);

int n = kf::count(dist <= float_type(1));

if (n > 0) atomicAdd(global_count, n);
}
```


## Code Explanation

Let's go through the code step by step.

```cpp
// Alias `kernel_float` as `kf`
namespace kf = kernel_float;
```

This creates an alias for `kernel_float`.

```cpp
// Define the float type and vector size
using float_type = float;
static constexpr int VECTOR_SIZE = 4;
```

Define `float_type` as an alias for `float` to make it easy to change precision if needed.
The vector size is set to 4, meaning each thread will process 4 data points.

```cpp
__global__ void calculate_pi_kernel(int nx, int ny, int* global_count) {
```

The CUDA kernel. There are `nx` points along the x axis and `ny` points along the y axis.

```cpp
int thread_x = blockIdx.x * blockDim.x + threadIdx.x;
int thread_y = blockIdx.y * blockDim.y + threadIdx.y;
```

Compute the global x- and y-index of this thread.

```cpp
kf::vec<int, VECTOR_SIZE> xi = thread_x * VECTOR_SIZE + kf::range<int, VECTOR_SIZE>();
kf::vec<int, VECTOR_SIZE> yi = thread_y;
```

Compute the points that this thread will process.
The x coordinates start at `thread_x * VECTOR_SIZE` and then the vector `[0, 1, 2, ..., VECTOR_SIZE-1]`.
The y coordinates are all `thread_y`.

```cpp
kf::vec<float_type, VECTOR_SIZE> xf = kf::cast<float_type>(xi) / float_type(nx);
kf::vec<float_type, VECTOR_SIZE> yf = kf::cast<float_type>(yi) / float_type(ny);
```

Divide `xi` and `yi` by `nx` and `ny` to normalize them to `[0, 1]` range.

```cpp
kf::vec<float_type, VECTOR_SIZE> dist_squared = xf * xf + yf * yf;
```

Compute the squared distance from the origin (0, 0) to each point from `xf`,`yf`.

```cpp
kf::vec<float_type, VECTOR_SIZE> dist = kf::sqrt(dist_squared);
```

Take the element-wise square root.

```cpp
int n = kf::count(dist <= float_type(1));
```

Count the number of points in the unit circle (i.e., for which the distance is less than 4).
The expression `dist <= 1` returns a vector of booleans and `kf::count` counts the number of `true` values.

```cpp
atomicAdd(global_count, n);
```

Add `n` to the `global_count` variable.
This must be done using an atomic operation since multiple thread will write this variable simultaneously.
Loading
Loading