# Design with Shared Memory Using Catapult HLS C++

Sr. AE, Yuan-Teng Chang yuan-teng.chang@siemens.com

Restricted | © Siemens 2023 | Siemens Digital Industries Software | Where today meets tomorrow.



#### **Shared Memories Between Design Blocks**

There are two ways to share a memory between classes or member functions that are mapped to design blocks:

- Using an ac\_channel or ac\_shared with the required coding style
- Explicitly coding a separate design block that contains a memory



## Create shared memories with ac\_channel

03\_Shared\_Mem/ac\_channel




#### ac\_channel Shared Memory Interfaces Between Blocks

Design blocks need a way to create block-level memory interfaces.

- Interface Synthesis allows arrays on the function interface to be mapped to a memory interface

The required coding style is to pack arrays inside a struct and read/write the struct through an ac\_channel:

- Required style for interconnecting design blocks using shared memories
- Array operations are still performed locally
- The struct/array is mapped to memory using Interface Synthesis

```
typedef ac_int<8,false> dType;
struct chanStruct
{
    ac_int<8,false> data[NUM_WORDS];
};
```



#### **Memory Write Interfaces**

Define a "local" variable using struct type that packs the array

Operate locally on the array inside the struct

Write the struct containing the array into the channel

- Write to the channel only once!

Have the HLS tool map the interface channel to a memory interface
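The write-side steps above can be sketched in plain C++. This is a software-only illustration, not Catapult code: `Channel` is a hypothetical stand-in for ac\_channel built on `std::deque`, `std::uint8_t` stands in for `ac_int<8,false>`, and `NUM_WORDS = 8` and the function name `write_block` are assumptions chosen for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

namespace write_sketch {

constexpr int NUM_WORDS = 8;   // assumed size, for illustration only
typedef std::uint8_t dType;    // stand-in for ac_int<8,false>

struct chanStruct {
    dType data[NUM_WORDS];     // array packed inside a struct
};

// Hypothetical stand-in for ac_channel<T>: a simple software FIFO.
// read() assumes the caller has made sure data is available.
template <typename T>
struct Channel {
    std::deque<T> fifo;
    void write(const T &v) { fifo.push_back(v); }
    T read() { T v = fifo.front(); fifo.pop_front(); return v; }
    bool available(unsigned n) const { return fifo.size() >= n; }
};

// Write side: declare the local struct, operate on it locally,
// then write it into the channel exactly once.
void write_block(Channel<dType> &din, Channel<chanStruct> &dout) {
    chanStruct tmp;                       // "local" struct variable
    for (int i = 0; i < NUM_WORDS; i++) { // local array operations
        tmp.data[i] = din.read();
    }
    dout.write(tmp);                      // single channel write
    // tmp is not touched after this point
}

} // namespace write_sketch
```

In this model one call to `write_block` leaves exactly one struct in the output channel, which mirrors the "write the channel once" rule.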



#### **Memory Write Interface Pitfalls**

The temporary struct with the array is optimized away if the code follows the recommended style

- Be careful when performing memory writes
- Keep local memory operations between temporary struct declaration and memory channel write

#### **Memory Read Interfaces**

Define a "local" variable using struct type that packs the array

Restrict operations on the local struct to be between the local struct declaration and the memory channel read

- Doing anything else will cause a local memory to be generated
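A read-side counterpart can be sketched the same way. Again this is a software-only illustration under assumptions: `Channel` is a hypothetical `std::deque`-based stand-in for ac\_channel, `std::uint8_t` replaces `ac_int<8,false>`, and `NUM_WORDS = 8` and the name `read_block` are made up for the example.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

namespace read_sketch {

constexpr int NUM_WORDS = 8;   // assumed size, for illustration only
typedef std::uint8_t dType;    // stand-in for ac_int<8,false>

struct chanStruct {
    dType data[NUM_WORDS];
};

// Hypothetical stand-in for ac_channel<T>: a simple software FIFO.
// read() assumes the caller has made sure data is available.
template <typename T>
struct Channel {
    std::deque<T> fifo;
    void write(const T &v) { fifo.push_back(v); }
    T read() { T v = fifo.front(); fifo.pop_front(); return v; }
};

// Read side: one channel read fills the local struct, and the
// array elements are then consumed from that local struct.
void read_block(Channel<chanStruct> &din, Channel<dType> &dout) {
    chanStruct tmp = din.read();               // single channel read
    for (int i = NUM_WORDS - 1; i >= 0; i--) { // reverse order, as an example
        dout.write(tmp.data[i]);
    }
}

} // namespace read_sketch
```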




#### ac\_channel\_shared\_mem.h

```
class top
{
private:
    ac_channel<chanStruct> shr_mem; // static memory channel

    #pragma hls_design
    void BLOCK0(ac_channel<dType> &din,
                ac_channel<chanStruct> &dout)
    {
        chanStruct tmp; // temporary struct holding the array
        WRITE: for (int i=0; i<NUM_WORDS; i++) { // local operations
            tmp.data[i] = din.read();
        }
        dout.write(tmp); // single memory channel write
        // Do not access the local array after this point
    }

    #pragma hls_design
    void BLOCK1(ac_channel<chanStruct> &din,
                ac_channel<dType> &dout)
    {
        chanStruct tmp; // temporary struct holding the array
        tmp = din.read(); // single memory channel read
        READ: for (int i=NUM_WORDS-1; i>=0; i--) {
            dout.write(tmp.data[i]);
        }
    }
    // remainder of the class is not shown on the slide
```
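The listing stops before the top-level function that the testbench calls as `dut.run(din_chan, dout_chan)`. The sketch below shows one plausible way such a function could connect the two blocks through the shared channel. The body of `run()` is an assumption, not the author's code, and the whole block is a software-only model: `Channel` is a hypothetical `std::deque`-based stand-in for ac\_channel, `std::uint8_t` replaces `ac_int<8,false>`, and `NUM_WORDS = 8` is chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

namespace top_sketch {

constexpr int NUM_WORDS = 8;   // assumed size, for illustration only
typedef std::uint8_t dType;    // stand-in for ac_int<8,false>

struct chanStruct { dType data[NUM_WORDS]; };

// Hypothetical stand-in for ac_channel<T>: a simple software FIFO.
template <typename T>
struct Channel {
    std::deque<T> fifo;
    void write(const T &v) { fifo.push_back(v); }
    T read() { T v = fifo.front(); fifo.pop_front(); return v; }
};

class top {
public:
    // Assumed top-level function: BLOCK0 fills the shared memory
    // channel, BLOCK1 drains it.
    void run(Channel<dType> &din, Channel<dType> &dout) {
        BLOCK0(din, shr_mem);
        BLOCK1(shr_mem, dout);
    }

private:
    Channel<chanStruct> shr_mem;   // channel that would map to the memory

    void BLOCK0(Channel<dType> &din, Channel<chanStruct> &dout) {
        chanStruct tmp;
        for (int i = 0; i < NUM_WORDS; i++) tmp.data[i] = din.read();
        dout.write(tmp);           // single channel write
    }

    void BLOCK1(Channel<chanStruct> &din, Channel<dType> &dout) {
        chanStruct tmp = din.read();   // single channel read
        for (int i = NUM_WORDS - 1; i >= 0; i--) dout.write(tmp.data[i]);
    }
};

} // namespace top_sketch
```

With this wiring, each call to `run()` consumes `NUM_WORDS` inputs and produces the same values in reverse order.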



#### **Testbench (tb.h)**

```
CCS_MAIN(int argv, char **argc)
{
    top dut;
    ac_channel<dType> din_chan;
    ac_channel<dType> dout_chan;

    for (int j=0; j<2; j++)
    {
        fprintf(stdout, "iter = %2d start...\n", j);
        // Stimuli
        for (int i=0; i<NUM_WORDS; i++)
        {
            dType dat = i+j*NUM_WORDS;
            din_chan.write(dat);
            fprintf(stdout, "din @%3d = %4d\n", i, (int) dat);
        }

        // DUT
        dut.run(din_chan, dout_chan);

        // Response
        int cnt = 0;
        while (dout_chan.available(1))
        {
            dType dat = dout_chan.read();
            fprintf(stdout, "dout @%3d = %4d\n", cnt++, (int) dat);
        }
        fprintf(stdout, "iter = %2d finish!!!\n", j);
    }
}
```
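The testbench only prints the response. A small helper like the one below could predict what each printed value should be, assuming the DUT returns every frame of `NUM_WORDS` values in reverse order, as the READ loop in BLOCK1 suggests. The function names and the `num_words` parameter are hypothetical, and `std::uint8_t` is used as a stand-in for the 8-bit unsigned `dType`.

```cpp
#include <cassert>
#include <cstdint>

namespace tb_sketch {

// Value the testbench writes at position i of iteration j,
// truncated to 8 bits like an unsigned 8-bit dType.
inline std::uint8_t stimulus(int j, int i, int num_words) {
    return static_cast<std::uint8_t>(i + j * num_words);
}

// Expected k-th output of iteration j, assuming the DUT emits
// each frame in reverse order.
inline std::uint8_t expected_response(int j, int k, int num_words) {
    return stimulus(j, num_words - 1 - k, num_words);
}

} // namespace tb_sketch
```

For example, with an assumed `num_words` of 32, the first output of the second iteration would be 63 and the last would be 32.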

#### **Define Shared Memory Architecture**

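Launch Catapult with the directives script:

[mentor@RHEL74 ac\_channel]\$ catapult -p ultra -f directives.tcl &

The Stage Replication option of the shared memory resource selects the architecture:

- 1: a single shared memory architecture
- 2: a ping-pong architecture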
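The difference between the two settings can be pictured with a small conceptual model. This is not Catapult's implementation; it only captures the idea that with one memory copy the producer has to wait until the consumer has finished the current frame, while with two copies it can fill the other buffer in the meantime. The function names and the frame-counting scheme are assumptions made for the sketch.

```cpp
#include <cassert>

namespace stage_sketch {

// Conceptual model: the producer may start another frame only while
// fewer than `stages` frames are written but not yet read.
inline bool producer_can_write(int frames_written, int frames_read, int stages) {
    return (frames_written - frames_read) < stages;
}

// Which memory copy a given frame would land in (0 = ping, 1 = pong
// when stages == 2).
inline int buffer_index(int frame, int stages) {
    return frame % stages;
}

} // namespace stage_sketch
```

With `stages = 1`, one unread frame blocks the producer. With `stages = 2`, the producer can run one frame ahead, and consecutive frames alternate between the two copies.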

Constraint Editor settings for the shared memory (screenshot not reproduced):

- Resource: shr\_mem:cns
- Resource Type: ccs\_sample\_mem.ccs\_ram\_sync\_1R1W
- RdDelay\_100ps: 5
- Stage Replication: 2
- Packing Mode: absolute
- Block Size: 0
- Externalize: checked

#### **RTL Simulation**



## Create shared memories with ac\_shared

03\_Shared\_Mem/ac\_shared




#### ac\_shared\_memory.h

```
class input
{
public:
    input():RAM_select(true) {}

    #pragma hls_design interface
    void run ( ac_shared<dType[NUM_WORDS]> &ping,
               ac_shared<dType[NUM_WORDS]> &pong,
               ac_sync &sync_RAMs,
               ac_channel<dType> &in ) {
        for ( int i = 0; i < NUM_WORDS; i++ ) {
            dType data = in.read();
            if ( RAM_select ) {
                ping[i] = data;
            } else {
                pong[i] = data;
            }
        }
        RAM_select = !RAM_select;
        sync_RAMs.sync_out();
    }

private:
    bool RAM_select;
};
```

```
class output
{
public:
    output():RAM_select(true) {}

    #pragma hls_design interface
    void run ( ac_shared<dType[NUM_WORDS]> &ping,
               ac_shared<dType[NUM_WORDS]> &pong,
               ac_sync &sync_RAMs,
               ac_channel<dType> &out ) {
        dType data;
        sync_RAMs.sync_in();
        for (int i=NUM_WORDS-1; i>=0; i--) {
            if ( RAM_select ) {
                data = ping[i];
            } else {
                data = pong[i];
            }
            out.write(data);
        }
        RAM_select = !RAM_select;
    }

private:
    bool RAM_select;
};

#pragma hls_design top
class ping_pong_shared_RAM
{
public:
    ping_pong_shared_RAM() {}

    #pragma hls_design interface
    void CCS_BLOCK(run) ( ac_channel<dType> &in,
                          ac_channel<dType> &out ) {
        in_inst.run(ping,pong,sync_RAMs,in);
        out_inst.run(ping,pong,sync_RAMs,out);
    }

private:
    input in_inst;
    output out_inst;
    ac_shared<dType[NUM_WORDS]> ping, pong;
    ac_sync sync_RAMs;
};
```
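The ping-pong selection in these classes can be tried out in a software-only model. The sketch below is not Catapult code and rests on several assumptions: plain arrays stand in for `ac_shared<dType[NUM_WORDS]>`, the ac\_sync handshake is omitted because the two blocks run one after the other here, `Channel` is a hypothetical `std::deque`-based stand-in for ac\_channel, and `NUM_WORDS = 8` is chosen for illustration.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>

namespace pingpong_sketch {

constexpr int NUM_WORDS = 8;      // assumed size, for illustration only
typedef std::uint8_t dType;       // stand-in for ac_int<8,false>
typedef dType Ram[NUM_WORDS];     // stand-in for ac_shared<dType[NUM_WORDS]>

// Hypothetical stand-in for ac_channel<T>: a simple software FIFO.
template <typename T>
struct Channel {
    std::deque<T> fifo;
    void write(const T &v) { fifo.push_back(v); }
    T read() { T v = fifo.front(); fifo.pop_front(); return v; }
};

struct input {
    bool RAM_select = true;       // true: write ping, false: write pong
    void run(Ram &ping, Ram &pong, Channel<dType> &in) {
        for (int i = 0; i < NUM_WORDS; i++) {
            dType data = in.read();
            if (RAM_select) { ping[i] = data; } else { pong[i] = data; }
        }
        RAM_select = !RAM_select; // next frame goes to the other RAM
    }
};

struct output {
    bool RAM_select = true;       // true: read ping, false: read pong
    void run(Ram &ping, Ram &pong, Channel<dType> &out) {
        for (int i = NUM_WORDS - 1; i >= 0; i--) {
            out.write(RAM_select ? ping[i] : pong[i]);
        }
        RAM_select = !RAM_select;
    }
};

struct ping_pong_shared_RAM {
    input in_inst;
    output out_inst;
    Ram ping = {};
    Ram pong = {};
    void run(Channel<dType> &in, Channel<dType> &out) {
        in_inst.run(ping, pong, in);
        out_inst.run(ping, pong, out);
    }
};

} // namespace pingpong_sketch
```

In this model the first frame lands in `ping` and the second in `pong`, and each frame is read back in reverse order from the RAM it was written to.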



#### **Define Shared Memory Architecture**

[mentor@RHEL74 ac\_shared]\$ catapult -f directives.tcl

Constraint Editor settings for the ping memory (screenshot not reproduced):

- Design hierarchy: ping\_pong\_shared\_RAM with instances in\_inst and out\_inst, interconnect sync\_RAMs:cns (1x1)
- Resource: ping.d:rsc
- Resource Type: ccs\_sample\_mem.ccs\_ram\_sync\_1R1W
- RdDelay\_100ps: 5
- Packing Mode: absolute
- Externalize: checked

#### **RTL Simulation**

Waveform (figure not reproduced): the DUT signals in\_rsc\_rdy/vld/dat, out\_rsc\_rdy/vld/dat, and the ping\_d\_rsc\_\* and pong\_d\_rsc\_\* memory ports (radr, wadr, d, we, re, q). The annotations mark alternating Input Write and Output Read phases on the Ping and Pong memories.

# Thank you

