
parallel ntt on cpu #591

Merged: 7 commits merged into yshekel/V3 on Sep 2, 2024
Conversation

@ShanieWinitz (Contributor) commented Aug 28, 2024:

parallel ntt on cpu

#include <functional>
#include <unordered_map>

#define H1 15
@yshekel (Collaborator) commented Aug 29, 2024:

H1 is a bad name; it's not clear what it means. Use descriptive names everywhere.

* @method bool operator==(const NttTaskCordinates& other) const Compares two task coordinates for equality.
*/
struct NttTaskCordinates {
int h1_layer_idx = 0;
Collaborator:

Again, h0/h1 are not clear names.
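A possible rename along the lines the reviewer asks for, matching the `hierarchy_0`/`hierarchy_1` naming that later hunks in this PR adopt. This is an illustrative sketch (field set and corrected `NttTaskCoordinates` spelling are assumptions, not the PR's final code):

```cpp
#include <tuple>

// Hypothetical renamed coordinates: "h0"/"h1" spelled out as hierarchy levels.
struct NttTaskCoordinates {
  int hierarchy_1_layer_idx = 0;
  int hierarchy_1_subntt_idx = 0;
  int hierarchy_0_layer_idx = 0;
  int hierarchy_0_subntt_idx = 0;
  int hierarchy_0_block_idx = 0;

  // Compares two task coordinates for equality, field by field.
  bool operator==(const NttTaskCoordinates& other) const {
    return std::tie(hierarchy_1_layer_idx, hierarchy_1_subntt_idx,
                    hierarchy_0_layer_idx, hierarchy_0_subntt_idx,
                    hierarchy_0_block_idx) ==
           std::tie(other.hierarchy_1_layer_idx, other.hierarchy_1_subntt_idx,
                    other.hierarchy_0_layer_idx, other.hierarchy_0_subntt_idx,
                    other.hierarchy_0_block_idx);
  }
};
```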

int find_or_generate_coset(std::unique_ptr<S[]>& arbitrary_coset);
void h1_reorder(E* elements);
eIcicleError
reorder_and_refactor_if_needed(E* elements, NttTaskCordinates ntt_task_cordinates, bool is_top_hirarchy);
Collaborator:

What does "refactor" mean here? Do you mean normalize?

{
}

eIcicleError reorder_by_bit_reverse(NttTaskCordinates ntt_task_cordinates, E* elements, bool is_top_hirarchy);
Collaborator:

You can return eIcicleError here, but why is it needed?

std::vector<int> nof_pointing_to_counter; // Number of counters for each layer

// Each h1_subntt has its own set of counters
std::vector<std::vector<std::vector<std::shared_ptr<int>>>>
Collaborator:

I don't think you need a shared_ptr here; a plain int would do.
Also, is this 3D cube dense or sparse? I'm not sure how efficient all those indirections are if it's accessed a lot. Maybe it isn't.
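A sketch of the reviewer's suggestion: a dense counter cube can be one flat `std::vector<int>` indexed by (layer, subntt, counter), replacing the triple-nested vector of `shared_ptr<int>`. One allocation, no pointer chasing, better locality. The class name and API here are hypothetical:

```cpp
#include <cstddef>
#include <vector>

// Flat, dense replacement for vector<vector<vector<shared_ptr<int>>>>.
class CounterCube {
public:
  CounterCube(int nof_layers, int nof_subntts, int counters_per_subntt)
      : nof_subntts_(nof_subntts),
        counters_per_subntt_(counters_per_subntt),
        counters_(static_cast<size_t>(nof_layers) * nof_subntts * counters_per_subntt, 0) {}

  // Row-major index arithmetic instead of three pointer dereferences.
  int& at(int layer, int subntt, int counter) {
    return counters_[(static_cast<size_t>(layer) * nof_subntts_ + subntt) * counters_per_subntt_ + counter];
  }

private:
  int nof_subntts_;
  int counters_per_subntt_;
  std::vector<int> counters_;
};
```

This assumes the cube is full; if it is sparse, a hash map keyed by coordinates would be the alternative.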

private:
int h1_layer_idx;
int nof_h0_layers;
std::vector<int> nof_pointing_to_counter; // Number of counters for each layer
Collaborator:

Not a clear name; the comment would actually make a better name.

@@ -105,3 +105,8 @@ class CpuDeviceAPI : public DeviceAPI
};

REGISTER_DEVICE_API("CPU", CpuDeviceAPI);

class CpuDeviceAPIREF : public CpuDeviceAPI
Collaborator:

Let's remove this before merging.


// Randomize config
const int logn = rand() % 10 + 3;
// for (int i = 0; i < 1000; i++) {
Collaborator:

I assume you will either remove the commented-out code or uncomment it before merging.


using namespace field_config;
using namespace icicle;

#define PARALLEL 0
Collaborator:

??

@ShanieWinitz force-pushed the swinitz_cpuNtt_parallel_1 branch 3 times, most recently from 1a591cf to 7f30437, on August 31, 2024 20:07
@omershlo (Member):

great work @ShanieWinitz

@HadarIngonyama (Contributor) left a review:

Looks good; some fixes can be applied as we discussed.

NttTaskCordinates ntt_task_cordinates = {0, 0, 0, 0, 0};
NttTasksManager<S, E> ntt_tasks_manager(logn);
const int nof_threads = std::thread::hardware_concurrency();
auto tasks_manager = new TasksManager<NttTask<S, E>>(nof_threads - 1);
Contributor:

Is this indeed the optimal number of threads?
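On the thread-count question: `std::thread::hardware_concurrency()` is only a hint and is allowed to return 0, so the `nof_threads - 1` expression can go to -1 on exotic platforms. A minimal defensive sketch (the function name is illustrative, not the PR's API):

```cpp
#include <algorithm>
#include <thread>

// Safer default worker count: reserve one core for the dispatching thread,
// but never drop below one worker, even when hardware_concurrency() is 0.
inline int default_nof_workers() {
  const unsigned hw = std::thread::hardware_concurrency(); // may legally be 0
  return std::max(1, static_cast<int>(hw) - 1);
}
```

Whether `hw - 1` is actually optimal for this workload would still need benchmarking, as the reviewer implies.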

const int logn = int(log2(size));
const uint64_t total_memory_size = size * config.batch_size;
Contributor:

"memory size" usually means bytes; something like "total input size" or "total number of elements" would be more fitting.

break;
const int coset_stride = ntt.find_or_generate_coset(arbitrary_coset);

ntt.copy_and_reorder_if_needed(input, output);
Contributor:

Why copy if no reorder is needed? Also consider in-place reordering.

Contributor:

Implementing a DIF NTT can save the reordering in some cases.

}
} else {
// Just copy, no reordering needed
std::copy(input, input + total_memory_size, output);
Contributor:

Is this automatically skipped when input == output, or do you need to add a condition?
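It is not automatically safe: `std::copy` requires that the output iterator not lie inside the input range, so calling it with `input == output` is formally undefined behavior, not a no-op. A hypothetical guard around the copy in question:

```cpp
#include <algorithm>
#include <cstdint>

// Skip the copy entirely for in-place calls; std::copy's precondition
// forbids the destination from overlapping [input, input + total_size).
template <typename E>
void copy_if_out_of_place(const E* input, E* output, uint64_t total_size) {
  if (input == output) return; // in-place: nothing to move
  std::copy(input, input + total_size, output);
}
```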

// Apply coset multiplication based on the available coset information
if (arbitrary_coset) {
current_elements[batch_stride * i] = current_elements[batch_stride * i] * arbitrary_coset[idx];
} else if (coset_stride != 0) {
Contributor:

why do we need to check !=0?

log_nof_subntts_chunks = ntt.ntt_sub_logn.hierarchy_1_layers_sub_logn[0] - log_nof_h1_subntts_todo_in_parallel;
nof_subntts_chunks = 1 << log_nof_subntts_chunks;

for (int h1_subntts_chunck_idx = 0; h1_subntts_chunck_idx < nof_subntts_chunks; h1_subntts_chunck_idx++) {
Contributor:

Duplicate code; this could be a loop or a function.

: 1;
for (ntt_task_cordinates.hierarchy_0_layer_idx = 0;
ntt_task_cordinates.hierarchy_0_layer_idx <
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx].size();
Contributor:

replace with alias

? this->ntt_sub_logn.size
: 1 << this->ntt_sub_logn.hierarchy_1_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx];
uint64_t temp_output_size = this->config.columns_batch ? size * this->config.batch_size : size;
auto temp_output = std::make_unique<E[]>(temp_output_size);
Contributor:

why is reordering done via temp memory? can't the refactoring be part of the thread ntt task?

for (ntt_task_cordinates.hierarchy_0_subntt_idx = 0;
ntt_task_cordinates.hierarchy_0_subntt_idx < (1 << log_nof_subntts);
ntt_task_cordinates.hierarchy_0_subntt_idx++) {
ntt_tasks_manager.push_task(this, input, ntt_task_cordinates, false);
Contributor:

why use the same function if it does 2 different things?

ntt_task_cordinates.hierarchy_0_layer_idx <
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx].size();
ntt_task_cordinates.hierarchy_0_layer_idx++) {
if (ntt_task_cordinates.hierarchy_0_layer_idx == 0) {
Contributor:

instead of conditions just run layers one after the other

NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][1];
int log_nof_blocks =
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][2];
for (ntt_task_cordinates.hierarchy_0_block_idx = 0;
Contributor:

The double loop can be simplified into a single loop using mod/shift arithmetic, and then all of this code can become a single loop over the layer index.
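Since both the block count and the sub-NTT count are powers of two, the nested (block_idx, subntt_idx) loops can become one loop over a flat index that is split with a shift and a mask. A sketch with illustrative names:

```cpp
#include <utility>

// Decompose a flat iteration index into (block_idx, subntt_idx):
// high bits select the block, low bits select the sub-NTT within it.
inline std::pair<int, int> split_flat_idx(int flat_idx, int log_nof_subntts) {
  const int block_idx = flat_idx >> log_nof_subntts;              // high bits
  const int subntt_idx = flat_idx & ((1 << log_nof_subntts) - 1); // low bits
  return {block_idx, subntt_idx};
}
```

The outer code then runs `for (int flat = 0; flat < nof_blocks * nof_subntts; ++flat)` and recovers both coordinates per iteration.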

int log_nof_blocks =
NttCpu<S, E>::ntt_sub_logn.hierarchy_0_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx][2];
for (ntt_task_cordinates.hierarchy_0_block_idx = 0;
ntt_task_cordinates.hierarchy_0_block_idx < (1 << log_nof_blocks);
Contributor:

block and sub ntt namings are unclear

.hierarchy_1_layers_sub_logn[ntt_task_cordinates.hierarchy_1_layer_idx]); // input + subntt_idx *
// subntt_size

this->reorder_by_bit_reverse(ntt_task_cordinates, current_input, false); // R --> N
Contributor:

why reorder at this level? I would expect only index calculations
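"Only index calculations" here would mean folding the R-to-N permutation into the load index, reading each operand through its bit-reversed address instead of physically permuting the sub-NTT first. An illustrative helper (not the PR's code):

```cpp
#include <cstdint>

// Reverse the low `logn` bits of x; used as a load-time address transform,
// e.g. elements[rev_bits(i, log_subntt_size)] instead of reordering first.
inline uint64_t rev_bits(uint64_t x, int logn) {
  uint64_t r = 0;
  for (int i = 0; i < logn; ++i) {
    r = (r << 1) | ((x >> i) & 1);
  }
  return r;
}
```

The trade-off is scattered reads in the butterfly loop versus one up-front permutation pass; which wins depends on sub-NTT size and cache behavior.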

uint64_t tw_idx = (this->direction == NTTDir::kForward)
? ((this->domain_max_size / ntt_size) * j * i)
: this->domain_max_size - ((this->domain_max_size / ntt_size) * j * i);
elements_of_current_batch[elem_mem_idx] = elements_of_current_batch[elem_mem_idx] * this->twiddles[tw_idx];
Contributor:

Keep dedicated memory for the twiddles for cache efficiency instead of reading from the global table.
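One way to read this suggestion: before running a sub-NTT, gather the strided twiddles it will touch into a small contiguous buffer, so the hot butterfly loop reads sequential cache lines rather than scattered entries of the global table. A sketch with a generic `Field` standing in for the scalar type `S`:

```cpp
#include <cstdint>
#include <vector>

// One strided gather pass up front; the hot loop then indexes
// the returned buffer contiguously (local[0], local[1], ...).
template <typename Field>
std::vector<Field> gather_twiddles(const Field* global_twiddles,
                                   uint64_t stride, uint64_t count) {
  std::vector<Field> local(count);
  for (uint64_t i = 0; i < count; ++i) {
    local[i] = global_twiddles[i * stride];
  }
  return local;
}
```

Whether the extra copy pays off depends on how many times each twiddle is reused within the task.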

@ShanieWinitz force-pushed the swinitz_cpuNtt_parallel_1 branch 7 times, most recently from c85ba69 to 9b37c77, on September 2, 2024 12:43
@yshekel merged commit 6a43bde into yshekel/V3 on Sep 2, 2024
19 checks passed
@yshekel deleted the swinitz_cpuNtt_parallel_1 branch on September 2, 2024 16:46
5 participants