parallel ntt on cpu #591
Conversation
Force-pushed 43d726f to 84d99a4
#include <functional>
#include <unordered_map>

#define H1 15
H1 is a bad name; its meaning is not clear. Use clear names everywhere.
 * @method bool operator==(const NttTaskCordinates& other) const Compares two task coordinates for equality.
 */
struct NttTaskCordinates {
  int h1_layer_idx = 0;
Again, h0/h1 are not clear names.
int find_or_generate_coset(std::unique_ptr<S[]>& arbitrary_coset);
void h1_reorder(E* elements);
eIcicleError
reorder_and_refactor_if_needed(E* elements, NttTaskCordinates ntt_task_cordinates, bool is_top_hirarchy);
What does "refactor" mean here? Do you mean normalize?
{
}

eIcicleError reorder_by_bit_reverse(NttTaskCordinates ntt_task_cordinates, E* elements, bool is_top_hirarchy);
You can return an eIcicleError here, but why is it needed?
std::vector<int> nof_pointing_to_counter; // Number of counters for each layer

// Each h1_subntt has its own set of counters
std::vector<std::vector<std::vector<std::shared_ptr<int>>>>
I don't think you need a shared_ptr here; a plain int is enough.
Also, is this 3D cube dense or sparse? All those indirections may be inefficient if you access it a lot. Maybe you don't.
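As a minimal sketch of the suggestion above (names are hypothetical, not from the PR): the shared_ptr cube can be replaced by one flat `std::vector<int>`, which removes both the per-counter heap allocation and the triple pointer chase.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical sketch: a dense 3D counter cube stored as one contiguous
// std::vector<int>, indexed by (layer, block, subntt). This avoids the
// shared_ptr indirection and the nested vector allocations.
class CounterCube
{
public:
  CounterCube(int layers, int blocks, int subntts)
      : blocks_(blocks), subntts_(subntts),
        counters_(static_cast<size_t>(layers) * blocks * subntts, 0)
  {
  }

  // Row-major flattening: (layer, block, subntt) -> single offset.
  int& at(int layer, int block, int subntt)
  {
    return counters_[(static_cast<size_t>(layer) * blocks_ + block) * subntts_ + subntt];
  }

private:
  int blocks_;
  int subntts_;
  std::vector<int> counters_; // dense storage; a hash map would fit a sparse cube better
};
```

If the cube is sparse in practice, the same interface could be backed by an `unordered_map` keyed on the flattened index instead.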
private:
  int h1_layer_idx;
  int nof_h0_layers;
  std::vector<int> nof_pointing_to_counter; // Number of counters for each layer
Not a clear name. The comment would actually make a better name.
@@ -105,3 +105,8 @@ class CpuDeviceAPI : public DeviceAPI
};

REGISTER_DEVICE_API("CPU", CpuDeviceAPI);

class CpuDeviceAPIREF : public CpuDeviceAPI
Let's remove this before merging.
icicle/tests/test_field_api.cpp
Outdated
// Randomize config
const int logn = rand() % 10 + 3;
// for (int i = 0; i < 1000; i++) {
I assume you will either remove the commented-out code or uncomment it.
icicle/backend/cpu/include/cpu_ntt.h
Outdated
using namespace field_config;
using namespace icicle;

#define PARALLEL 0
??
Force-pushed 1a591cf to 7f30437
great work @ShanieWinitz
Looks good; some fixes can be applied, as we discussed.
NttTaskCordinates ntt_task_cordinates = {0, 0, 0, 0, 0};
NttTasksManager<S, E> ntt_tasks_manager(logn);
const int nof_threads = std::thread::hardware_concurrency();
auto tasks_manager = new TasksManager<NttTask<S, E>>(nof_threads - 1);
Is this indeed the optimal number of threads?
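A related robustness point, sketched below with a hypothetical helper name: `std::thread::hardware_concurrency()` is allowed to return 0 when the core count cannot be determined, and the code above would then pass `-1` to the task manager. A small guard avoids that edge case regardless of what the optimal count turns out to be.

```cpp
#include <algorithm>
#include <thread>

// Hypothetical sketch: hardware_concurrency() may return 0 if the value
// is not computable, so clamp before reserving one core for the
// dispatching (main) thread.
inline unsigned worker_thread_count()
{
  unsigned hw = std::thread::hardware_concurrency();
  if (hw == 0) hw = 1;         // unknown topology: assume a single core
  return std::max(1u, hw - 1); // never request zero worker threads
}
```

Whether `hw - 1` is actually optimal for the NTT workload would still need benchmarking; this only makes the fallback safe.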
icicle/backend/cpu/include/cpu_ntt.h
Outdated
const int logn = int(log2(size));
const uint64_t total_memory_size = size * config.batch_size;
"memory size" usually means bytes; something like total_input_size or total_nof_elements would fit better.
break;
const int coset_stride = ntt.find_or_generate_coset(arbitrary_coset);

ntt.copy_and_reorder_if_needed(input, output);
Why copy if no reordering is needed? Also consider in-place reordering.
Implementing a DIF NTT can save the reordering in some cases.
}
} else {
  // Just copy, no reordering needed
  std::copy(input, input + total_memory_size, output);
Is this automatically skipped when input == output, or do you need to add a condition?
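To the question above: `std::copy` is well-defined when source and destination are identical, but it still walks the full range, so an explicit guard is needed to actually skip the work. A minimal sketch with a hypothetical helper name:

```cpp
#include <algorithm>
#include <cstdint>

// Hypothetical helper: copy only when the buffers differ. std::copy with
// input == output is a full (pointless) pass over the data, so the guard
// turns the in-place case into a true no-op.
template <typename E>
void copy_if_needed(const E* input, E* output, uint64_t n)
{
  if (input != output) std::copy(input, input + n, output);
}
```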
// Apply coset multiplication based on the available coset information
if (arbitrary_coset) {
  current_elements[batch_stride * i] = current_elements[batch_stride * i] * arbitrary_coset[idx];
} else if (coset_stride != 0) {
Why do we need to check != 0 here?
icicle/backend/cpu/include/cpu_ntt.h
Outdated
log_nof_subntts_chunks = ntt.ntt_sub_logn.hierarchy_1_layers_sub_logn[0] - log_nof_h1_subntts_todo_in_parallel;
nof_subntts_chunks = 1 << log_nof_subntts_chunks;

for (int h1_subntts_chunck_idx = 0; h1_subntts_chunck_idx < nof_subntts_chunks; h1_subntts_chunck_idx++) {
Duplicate code; it can be factored into a loop or a function.
: 1;
for (ntt_task_cordinates.hierarchy_0_layer_idx = 0;
     ntt_task_cordinates.hierarchy_0_layer_idx <
     NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx].size();
Replace this long expression with an alias.
? this->ntt_sub_logn.size
: 1 << this->ntt_sub_logn.hierarchy_1_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx];
uint64_t temp_output_size = this->config.columns_batch ? size * this->config.batch_size : size;
auto temp_output = std::make_unique<E[]>(temp_output_size);
Why is the reordering done via temporary memory? Can't the refactoring be part of the threaded NTT task?
for (ntt_task_cordinates.hierarchy_0_subntt_idx = 0;
     ntt_task_cordinates.hierarchy_0_subntt_idx < (1 << log_nof_subntts);
     ntt_task_cordinates.hierarchy_0_subntt_idx++) {
  ntt_tasks_manager.push_task(this, input, ntt_task_cordinates, false);
Why use the same function if it does two different things?
ntt_task_cordinates.hierarchy_0_layer_idx <
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx].size();
ntt_task_cordinates.hierarchy_0_layer_idx++) {
if (ntt_task_cordinates.hierarchy_0_layer_idx == 0) {
Instead of these conditions, just run the layers one after the other.
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][1];
int log_nof_blocks =
  NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][2];
for (ntt_task_cordinates.hierarchy_0_block_idx = 0;
The double loop can be simplified into a single loop using mod (or shifts), and then all of this code can become a single loop over the layer index.
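A minimal sketch of that flattening (the function name is hypothetical): since block and sub-NTT counts are powers of two, one combined index splits back into the pair with a shift and a mask, so the nested dispatch loops collapse into one.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical sketch: flatten the nested (block, subntt) loops into a
// single loop over a combined task index. With power-of-two counts the
// split is a shift (high bits) and a mask (low bits).
std::vector<std::pair<int, int>> flatten_block_subntt(int log_nof_blocks, int log_nof_subntts)
{
  std::vector<std::pair<int, int>> order;
  const int total = 1 << (log_nof_blocks + log_nof_subntts);
  for (int idx = 0; idx < total; ++idx) {
    const int block_idx = idx >> log_nof_subntts;              // high bits
    const int subntt_idx = idx & ((1 << log_nof_subntts) - 1); // low bits
    order.emplace_back(block_idx, subntt_idx);
  }
  return order;
}
```

The same trick extends one level up: a single loop over (layer, block, subntt) indices, decoded from one counter, replaces the per-layer copies of this code.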
int log_nof_blocks =
  NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][2];
for (ntt_task_cordinates.hierarchy_0_block_idx = 0;
     ntt_task_cordinates.hierarchy_0_block_idx < (1 << log_nof_blocks);
The "block" and "sub-NTT" namings are unclear.
.hierarchy_1_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx]); // input + subntt_idx * subntt_size

this->reorder_by_bit_reverse(ntt_task_cordinates, current_input, false); // R --> N
Why reorder at this level? I would expect only index calculations here.
uint64_t tw_idx = (this->direction == NTTDir::kForward)
                    ? ((this->domain_max_size / ntt_size) * j * i)
                    : this->domain_max_size - ((this->domain_max_size / ntt_size) * j * i);
elements_of_current_batch[elem_mem_idx] = elements_of_current_batch[elem_mem_idx] * this->twiddles[tw_idx];
Use dedicated memory for the twiddles, for cache efficiency, instead of reading from the global table.
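A minimal sketch of that suggestion (helper name and stride pattern are hypothetical): gather the strided twiddles a sub-NTT will use into a small contiguous buffer once, so the hot butterfly loop then reads sequential cache lines instead of striding through the global twiddle table.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical sketch: collect count twiddles at a fixed stride from the
// global table into a contiguous local buffer. The strided traffic is
// paid once per task; the inner loop afterwards is purely sequential.
template <typename S>
std::vector<S> gather_twiddles(const S* twiddles, uint64_t stride, uint64_t count)
{
  std::vector<S> local(count);
  for (uint64_t j = 0; j < count; ++j)
    local[j] = twiddles[j * stride]; // single strided pass at task start
  return local;
}
```

For the inverse direction, the same buffer could be filled with the `domain_max_size - tw_idx` entries so the hot loop stays branch-free as well.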
Force-pushed c85ba69 to 9b37c77
Force-pushed 9b37c77 to ce91edd
Force-pushed ce91edd to 05a930f