GitHub - att-innovate/firework: A heterogeneous system for offloading Protocol Buffer serialization onto dedicated FPGA hardware

Preface

So, you want to build your very own hardware accelerator, or more precisely, hardware-accelerated system. Perhaps the desire stems from bottlenecks in the software you routinely use, and the frequent coffee breaks are beginning to attract dubious looks from your coworkers. Another possibility is that today, your curiosity has exceeded a certain threshold, and satisfaction only comes in the form of a deep intuition for how software is executed by the underlying hardware in a given system. My last conjecture is that you finally got your hands on an FPGA-wielding development kit and wish to learn more about the system-level optimization knobs that are at your disposal. Regardless of the reason, you're here, and I hope this tutorial fulfills your needs as either a starting point for further extension of this work, a template for your own hardware accelerator project, or even as fundamental training. For me, I'm just happy knowing that 15 months of work has helped at least one other individual.

In this tutorial, my goal is to cover the development of a complete digital system: from setting up a development environment with the necessary EDA tools to designing a hardware accelerator at the RTL level (and implementing the design in an FPGA), system integration (establishing communication between the hardware accelerator and an ARM CPU), writing a device driver (providing user space applications access to the newly developed hardware accelerator) and ultimately, to modifying user space applications such that they utilize the hardware accelerator. While the knowledge gained and experience acquired will be extremely rewarding, undertaking a project of this size requires commitment, perseverance, and the ability to debug without the assistance of Stack Overflow or similar forums. I emphasize the last requirement (and it's a good skill to have in general) for two reasons:

Development at this level is not heavily documented in public forums. A Google search of a specific error message might yield zero results. Often your only source of information will come from sifting through technical reference manuals such as this one. Perhaps Firework will help spark a new wave of interest in open sourcing hardware accelerator and system designs by making this work seem less daunting. The best resource to my knowledge for community support can be found at RocketBoards.org, although this may only be useful for a certain set of development boards.
It shows you really understand how things work. This is especially important when designing hardware and working with embedded systems, where bugs could not only arise from errors in the user space application but also from the use of incompatible system libraries, the device driver, limited system resources, misinterpreting the timing requirements of handshake signals in a bus protocol, functional errors in your RTL code, or, my personal favorite, from differences between expected and actual behavior of hardware blocks.

Before continuing, I'd like to note that I don't claim to be an expert hardware accelerator/system designer nor that my design is optimal; the scope of this project alone is enough to lead to several outcomes (and getting the thing to work was a victory as far as I'm concerned). That's the beauty of open source; several minds are greater than one, and I hope that collaboration and the collective knowledge will lead to new and interesting ideas and perhaps better designs. I welcome any and all feedback and recommendations for improvement!

Introduction

Firework is an open source hardware-accelerated system designed for offloading Protocol Buffer serialization from the system's CPU. (That was a loaded sentence, I know. If I did my job correctly, by the end of this tutorial it'll make much more sense.) Generally speaking, Firework demonstrates the process of identifying components of a software application as candidates for hardware acceleration, designing hardware to efficiently perform (and replace) that computation, and building a system that deviates from the traditional paradigm of executing instructions sequentially on a CPU. Before I continue, it's necessary to give the term system a precise definition. In the context of hardware acceleration, I define a system as the combination of hardware and software that together perform a specific function. Therefore, the goal of this and any other hardware accelerator project is to improve a system's performance through the co-optimization of the hardware and software that comprise that system. It's also worth noting that the hardware community distinguishes between hardware acceleration and offloading, although the precise difference is a bit ambiguous. I classify Firework as an attempt to perform the latter since, in my design, I move the computation involved in Protocol Buffer serialization from the system's ARM CPU to a custom processor (i.e., the hardware accelerator) that's implemented in the system's FPGA fabric.

One goal of Firework was to target software that's deployed across a large-scale, production datacenter. That way, the developed hardware-accelerated system could theoretically replace a generic server (or servers) supporting that software. (This work is part of a larger effort to explore the use of specialized hardware in a datacenter setting.) Naturally, the first step was to identify such a candidate software application. Fortunately, I came across the paper Profiling a warehouse-scale computer whose authors essentially performed the search for me and provided motivation for hardware acceleration as well; in a three-year study of the workloads running across Google's fleet of datacenters, they identified a collection of low-level software, which they've coined the datacenter tax, that serve as common building blocks for Google's production services. It was found that this datacenter tax "can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips." Perfect! The following figure pulled from the paper identifies these constituents and their individual contributions to the datacenter tax:

This led to my choice of Protocol Buffers (protobuf in the figure above, which accounts for ~3-4% of total CPU cycles consumed across Google's datacenters) as the candidate software for hardware acceleration. Protocol Buffers are Google's "language-neutral, platform-neutral extensible mechanism for serializing structured data". In other words, this software, which consists of a compiler and runtime library, is used to efficiently serialize structured data (e.g., a C++ object) into a stream of bytes which may be subsequently stored or transmitted over some network to a receiving service that's able to reconstruct the original data structure from that stream of bytes. Before continuing to the Prerequisites section, I recommend reading Profiling a warehouse-scale computer for more context, going through the Protocol Buffers C++ tutorial to understand how Protocol Buffers are used, and learning how Protocol Buffers are encoded, as this is essential to the design of the hardware accelerator.
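Because the encoding shapes the accelerator's datapaths, here is a minimal shell sketch of the base-128 varint scheme Protocol Buffers use for integer values: seven payload bits per byte, least-significant group first, with the high bit set on every byte except the last. The function name varint_encode is hypothetical and is not part of protobuf or Firework; it only handles non-negative integers.

```shell
# Hypothetical helper (not part of protobuf or Firework): print the varint
# encoding of a non-negative integer as space-separated hex bytes.
# The ( ) body runs in a subshell, so n and out stay local to the call.
varint_encode() (
    n=$1
    out=""
    # Emit 7 bits at a time, least-significant group first; set the
    # continuation bit (0x80) while more groups remain.
    while [ "$n" -ge 128 ]; do
        out="$out$(printf '%02x ' $(( (n & 0x7f) | 0x80 )))"
        n=$(( n >> 7 ))
    done
    printf '%s%02x\n' "$out" "$n"
)

varint_encode 1     # 01
varint_encode 300   # ac 02
```

The value 300 is the example used in Google's encoding guide: it needs two bytes because it doesn't fit in seven bits.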

I chose to use Altera's (now Intel's) Arria 10 SoC Development Kit as the platform for implementing Firework (i.e., the hardware-accelerated system). In the section Choosing a development board, I'll discuss what led to the choice of this specific board.

Firework consists of six main components that together implement a hardware-accelerated system and provide a means of measuring its performance. Each of these components is listed below, and I've provided a brief description of what they are and links to their repositories.

Firework's six main components

protobuf-serializer: the hardware accelerator. This repository contains the RTL design of a 4-stage pipelined, parallel processor that performs Protocol Buffer serialization, written in Verilog, and packaged as a Quartus Prime project. The nomenclature stems from four pipeline stages in the design, two parallel datapaths for processing incoming varint and raw data, and a controller that consists of seven independent finite-state machines (FSMs). Also provided are several testbenches to verify the logic of each FSM as well as the processor as a whole.
a10-soc-devkit-ghrd: the system design. This repository contains a modified Arria 10 SoC Golden Hardware Reference Design (GHRD) with the protobuf-serializer hardware accelerator added as a memory-mapped FPGA peripheral. This too is packaged as a Quartus Prime project with the main SoC subsystem being a Qsys system design.
linux-socfpga: the Linux kernel. This repository contains the Linux kernel, modified and maintained by Altera, for use with Arria 10 SoC platforms. We'll configure and compile the kernel from source to enable certain CONFIG options (e.g., CONFIG_KALLSYMS which provides kernel-level symbols) that are needed to profile the system.
driver: the device driver. This repository contains the source code of a platform device driver for the protobuf-serializer hardware accelerator. This driver, a loadable kernel module, provides an interface to the newly developed hardware accelerator and hence, specifies the modifications that need to be made to the Protocol Buffer runtime library. It's ultimately responsible for relaying data from/to user space to/from the hardware accelerator that resides in the FPGA.
protobuf: the modified Protocol Buffer runtime library. This repository is a fork of version v3.0.2 of Google's Protocol Buffers open source project. The runtime library is modified to send data from user space memory to the protobuf-serializer hardware accelerator for serialization (via the device driver), replacing the code that otherwise performed that serialization on the CPU.
profiling: the Protocol Buffer test applications and collected profiling data. This directory contains test applications that are executed on the Arria 10 SoC in conjunction with perf to collect profiling data. The profiling data gives us the ability to understand and compare the hardware-accelerated system's performance to that of a standard software/CPU system. Here we finally see how a user space application is executed on top of the hardware-accelerated system.

Although Firework specifically covers the design of a hardware accelerator for Protocol Buffer serialization using an Arria 10 SoC Development Kit as the platform for implementation, I made an effort to generalize the design process. The following sequence of high-level steps serves as a general approach to any hardware accelerator project, and the remainder of this tutorial provides in-depth coverage of each of these steps as they pertain to Firework.

High-level steps in building a hardware-accelerated system

I have one final note before we continue: this work can be quite challenging. It's essential to figure out a routine that works for you and to know how to maintain a mental capacity for creativity over long periods of time, as this work is largely an art. For me, taking breaks when I feel the processor that is my brain overheating definitely helps. Another source of longevity is the inspiring words of world-renowned pop star Katy Perry. Good luck!

Prerequisites

Although not explicitly listed as a step, you should already have a project in mind or at least an idea of which software or algorithm you wish to accelerate. Otherwise, follow along with my choice of Protocol Buffers as described in the introduction and used throughout this tutorial.

1. Choose a development board

The first step is to choose a board that's appropriate for your project and goals. Since my objective was to build a hardware-accelerated system for a datacenter application that both improves performance and frees up the CPU, I was in search of a board that's capable of running Linux and could theoretically replace a white box server in a datacenter setting. The Arria 10 SoC Development Kit seemed to fit this description perfectly; it combines an Arria 10 FPGA with a dual-core ARM Cortex-A9 processor (called the Hard Processor System, or HPS) in a single system-on-chip (SoC) package. The FPGA fabric could be used to implement a custom RTL design that performs Protocol Buffer serialization while the HPS could be used to support both Linux and the user space application. Plus, think about how cool you'd look with one of these bad boys sitting on your desk:

Although it is easiest to replicate and extend Firework using the Arria 10 SoC Development Kit, the main component - protobuf-serializer (i.e., the hardware accelerator) - is written in Verilog (with the exception of the FIFOs used in the design - I pulled them from Altera's IP Cores library) and is compatible with other ARM-based systems. The modularity of the hardware accelerator stems from the fact that it's designed as an ARM AMBA AXI slave peripheral (i.e., its top-level I/O ports implement an AXI slave interface) and ARM CPUs serve as AXI masters. More details of the hardware accelerator design and ARM AMBA AXI bus protocol are covered in the sections Designing and Implementing the hardware accelerator (FPGA peripheral) and System integration (Arria 10 GHRD).

At a minimum, you'll need a board with an FPGA (i.e., programmable logic) to implement a hardware accelerator. Other board requirements are specific to your project and goals. Questions you might ask to determine these requirements include:

Is it a standalone algorithm or component of a larger software project I wish to accelerate? (i.e., do I need a CPU?)
Is it a bare metal or Linux application (i.e., do I require the use of an operating system)?
Am I replacing existing hardware? How does my design fit into the larger system?
Does it require network access?
Am I working with large amounts of data? What are my storage requirements?
Which bottlenecks in the algorithm/software do I wish to alleviate?
Would a custom hardware design outperform the existing CPU/memory architecture?

Don't underestimate the importance of this step. Acquiring a board can be an investment, and its fit with your project will certainly impact its success. Spending time asking and answering questions like these will also help to reaffirm your understanding of your project and goals.

2. Set up your development environment (OS, VNC server/client, EDA tools, licensing)

Before we get to the fun, we need to put our IT hats on. The next step is to set up your hardware development environment. Your setup is primarily going to be influenced by the development board you choose, the corresponding set of EDA tools needed to implement designs on that board, and the computing resources available to you. The complexity of your project's design could also influence your setup; larger, more complex designs might require you to purchase premium, licensed versions of the EDA tools for full functionality. (Personally, I think this outdated business model is something the hardware industry needs to work on since the cost of the development board and software licenses alone adds yet another barrier to innovation in the hardware space. I'm happy to see Amazon taking steps in the right direction; they've begun rolling out F1 Instances in their EC2 cloud providing access to Xilinx FPGAs for hardware acceleration. I haven't tried using them myself, but I imagine it makes getting started with a hardware accelerator project much easier and cheaper than via the method I describe below.)

For me, working with the Arria 10 SoC Development Kit meant installing then-Altera's EDA tools on a remote server which I had access to via my company's LAN. As it turns out, setting up a hardware development environment on a remote server is not a trivial task. Fortunately for you, I've gone through the process and will cover the steps below. Although your setup may be different (board, tools, etc.), hopefully this section will give you an idea of what it takes to set up any environment.

I used the following server, operating system, and VNC software (for remote desktop access) in my setup:

Dell PowerEdge R720xd (server where the EDA tools are installed)
CentOS 7 (free operating system compatible with the EDA tools)
TigerVNC (VNC server software running on the server so it can be remotely accessed)
RealVNC VNC Viewer (VNC client software running on my laptop)

My MacBook served as the sole interface to the remote server where the hardware development is performed and to the Arria 10 SoC Development Kit (via a terminal + serial connection) which sat on my desk. Note that any laptop with the proper VNC client and serial communication software (minicom, PuTTY, etc.) installed will work. To this day, I'm still fascinated by the setup I ended up building: 3 computers, 2 geographic locations, and 1 interface to access them all:

In the picture above, the monitor displays the remote server's CentOS 7 desktop with Quartus Prime open, and the terminal opened in the laptop's display shows the Arria 10 SoC booting Linux. The reason why I used a server equipped with two Intel Xeon E5-2670 CPUs (each with eight 2-way SMT cores), 256 GB of RAM, and 8 TB of storage is that Quartus Prime - the main EDA tool we'll be using for hardware development - has a "recommended system requirement" (...is it recommended or required?) of 18-48 GB of RAM when working with Arria 10 devices. My laptop (like most others, I'm guessing) is simply not powerful enough to support this software in a Linux virtual machine. I also know from experience that compiling hardware designs can take quite some time, so 32 parallel threads of execution will definitely come in handy if fully utilized (hint: -j 32 tells make to execute 32 recipes in parallel).
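The figure of 32 threads is just sockets x cores per socket x SMT ways. As a small sketch, the hypothetical helper hw_threads below reproduces that arithmetic, and the second half asks the running machine for its logical CPU count so the -j value doesn't have to be hard-coded. It prints the make invocation instead of starting a build.

```shell
# Hypothetical helper: hardware threads = sockets * cores/socket * SMT ways.
hw_threads() {
    echo $(( $1 * $2 * $3 ))
}

hw_threads 2 8 2    # the R720xd described above: 32

# Ask the running system how many logical CPUs are online, falling back to 1.
njobs=$(getconf _NPROCESSORS_ONLN 2>/dev/null || echo 1)
echo "make -j $njobs"    # printed only; no build is started
```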

The reason why I chose CentOS 7 as the operating system to install on the server is less obvious. If you look at the Operating System Support for Quartus Prime and other EDA tools we'll be using, notice that only 64-bit variants of the Windows and Red Hat Enterprise Linux (RHEL) operating systems are supported. Well, unlike most other Linux distributions, RHEL isn't free (thanks to the 'E'; see https://en.wikipedia.org/wiki/Enterprise_software), so unless you already have access to it, CentOS is RHEL's altruistic cousin. Although CentOS 7 is not explicitly supported, it works. Believe me. If you prefer to use Windows, that's fine too (well it's not, but that's beyond the scope of this tutorial).

It's quite a humbling experience to set up a server for the first time, and you'll certainly think twice before sending the next angry ticket to your IT support desk. Without further ado, here are the steps it took to set up my development environment.

Remotely installing CentOS 7 on a Dell PowerEdge R720xd server

To access the remote server, we'll initially use its built-in integrated Dell Remote Access Controller (iDRAC). This tool has many powerful features, including the ability to monitor logged events, power cycle the server, and even install an OS - all remotely. Assuming the server is connected to your LAN, has been assigned an IP address, and the iDRAC is enabled, open any web browser and enter its IP address to access the iDRAC login screen. It should look something like this:

If this is your first time using the iDRAC, the default username and password are root and calvin, respectively. When you log in, you'll see a summary page with several tabs (on the left side of the page and on the top of some pages) that provide a plethora of stats/info about your server. Now, we're ready to use the iDRAC to remotely install CentOS 7 on the server.

Download the CentOS 7 DVD ISO image. Choose DVD instead of Minimal since we'll need to install a GNOME Desktop on the server.
Log in to the iDRAC and go to the System Summary page (default page upon login).
In the System Summary > Virtual Console Preview window, click Launch. This will download a Virtual Console Client called viewer.jnlp.

Run the viewer.jnlp Java application. Your computer may complain about it being from an unidentified developer, but there's a way around this. Right click viewer.jnlp > Open With > Java Web Start (default) and click Open in the window that appears.

This will open another window, Security Warning. Click Continue.

You can never be too cautious. This will open a new Warning - Security window that asks, Do you want to run this application? I think you know the answer. Click Run.

Great, we've finally opened the Virtual Console Client! Click anywhere in the window so that its menu appears in the OS X menu bar at the top of your screen. Select Virtual Media > Connect Virtual Media. I couldn't take a screenshot of this step because when the Virtual Console Client is in focus, it captures all keyboard events. See the menu bar in the screenshot below.

Once connected, select Virtual Media > Map CD/DVD ... and select the CentOS 7 ISO image we just downloaded. Then click Map Device.

In the Virtual Console Client menu again, select Next Boot > Virtual CD/DVD/ISO. Click OK in the window that appears. This tells the server to boot from the CentOS 7 installer ISO we downloaded on the next boot.

In the Virtual Console Client menu, select Power > Reset System (warm boot). You'll see "No Signal" appear on the window followed by the CentOS 7 installer screen when the server finishes rebooting. With Install CentOS 7 highlighted, press enter to begin installation.

Follow the prompts until you get to the Software Selection window. Select GNOME Desktop as your Base Environment. Select the Legacy X Window System Compatibility, Compatibility Libraries, and Development Tools add-ons in the list to the right. Click Done when your window looks like the one below.

Select a disk to install to, select Automatically configure partitioning, and (optionally) encrypt your data; I didn't. Click Done and proceed with the installation by selecting Begin Installation in the main menu.

During the installation, it'll ask you to create a user account. Make the user an administrator and set the root password. DON'T FORGET TO SET THE ROOT PASSWORD. We'll need root access when installing the VNC server daemon. Click Reboot when the installation completes and voila!

The server is now running CentOS 7. In the initial boot, it'll ask you to accept a license. Follow the prompts on the screen to accept the license, let it finish booting, and log in as the user you just created. Keep the Virtual Console Client open; we'll use it to set up the VNC server/client software in the next step.

Setting up VNC server and client software

If you're not familiar with VNC, the basic idea is that you're interacting with a remote computer's desktop environment. That is, your keyboard and mouse events are sent to that computer over the network, and the corresponding GUI actions are relayed back to your screen. This is useful when you need to remotely access an application that has a GUI. As you may have guessed, this is the case for Quartus Prime and the other EDA tools we'll be using.

I used TigerVNC as the server software running on the Dell PowerEdge R720xd and RealVNC VNC Viewer as the client software running on my MacBook. I followed this tutorial by Sadequl Hussain to get the VNC setup working for me, and I've summarized the steps below. I recommend going through his tutorial as it does a great job explaining each step in detail!

VNC server

Using the Virtual Console Client from the previous section, make sure you're logged in as the user you created and open a terminal. First, let's install the TigerVNC server software.

sudo yum install tigervnc-server

The previous step creates a template service unit configuration file for the vncserver service in /lib/systemd/system/. We need to copy this file to /etc/systemd/system and make a few modifications in order to have systemd automatically start this service for the logged in user when the server is reset or turned on.

sudo cp /lib/systemd/system/vncserver@.service /etc/systemd/system/vncserver@:5.service

Using the text editor of your choice (e.g., vim), open the file and replace every <USER> instance with the name of the user you created. I created a user with the name fpga and highlight where those changes are made in the screenshot below. Also add the option -geometry 2560x1440 to the ExecStart= line replacing 2560x1440 with the resolution of the screen you plan to run the VNC Viewer client on. This will make full screen mode look pretty on your laptop or monitor.
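The same two edits can also be scripted. The sketch below applies them with sed to a two-line stand-in for the template; the ExecStart and PIDFile lines are assumed to match what tigervnc-server ships on CentOS 7, so compare them against your own copy first. The helper name patch_vnc_unit is hypothetical, and it writes to stdout rather than editing a file in place.

```shell
# Hypothetical helper: fill in a vncserver@.service template read from stdin.
#   $1 = user name, $2 = geometry (WIDTHxHEIGHT)
patch_vnc_unit() {
    sed -e "s/<USER>/$1/g" \
        -e "/^ExecStart=/s/vncserver %i/vncserver %i -geometry $2/"
}

# Stand-in for the two template lines that mention <USER> (assumed contents).
template='ExecStart=/usr/sbin/runuser -l <USER> -c "/usr/bin/vncserver %i"
PIDFile=/home/<USER>/.vnc/%H%i.pid'

printf '%s\n' "$template" | patch_vnc_unit fpga 2560x1440
```

To apply it for real, the template file could be fed in on stdin and the output piped through sudo tee to /etc/systemd/system/vncserver@:5.service, which would stand in for the copy-then-edit steps.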

Use systemctl to reload the systemd manager configuration.

sudo systemctl daemon-reload

Enable the vncserver unit instance. Note the number 5 in vncserver@:5.service was chosen arbitrarily but means the vncserver running for this user will be listening for incoming connections on port 5905.

sudo systemctl enable vncserver@:5.service

Configure the firewall used by CentOS 7 to allow traffic through port 5905.

sudo firewall-cmd --permanent --zone=public --add-port=5905/tcp
sudo firewall-cmd --reload
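The 5905 in the firewall rule follows directly from the display number: by convention, VNC display :N listens on TCP port 5900 + N. A small sketch of that mapping, with vnc_port as a hypothetical helper name; the unit name and firewall rule are printed, not executed.

```shell
# Hypothetical helper: map a VNC display number to its TCP port (5900 + N).
vnc_port() {
    echo $(( 5900 + $1 ))
}

display=5
port=$(vnc_port "$display")
# Print the matching unit name and firewall rule rather than running them.
echo "unit: vncserver@:${display}.service"
echo "rule: firewall-cmd --permanent --zone=public --add-port=${port}/tcp"
```

If you pick a different display number for another user, both the unit instance name and the opened port change together.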

In the terminal, run vncserver and it'll ask you to set a password for opening VNC connections with this user. Note, this should be different from the user's CentOS 7 login password.
With all the configurations made, let's make sure vncserver is running for this user before setting up the VNC Viewer client software.

sudo systemctl daemon-reload
sudo systemctl restart vncserver@:5.service
sudo systemctl status vncserver@:5.service

You should see active (running) in the output of the last command:

VNC client

Now that we have our VNC server running, let's set up the VNC client software on the device we'll use to remotely access the server (a MacBook for me). On your laptop, download the RealVNC VNC Viewer client, install, and open it. In the VNC Server field, enter 127.0.0.1:5900 but don't click Connect just yet.

If you're familiar with computer networking, you may be wondering why we entered the IP address of the localhost and port 5900 instead of the IP address of our remote server and port 5905 (the port we set the VNC server to listen to for incoming connections). That's because after the initial authentication, all data communicated between the VNC server and client is unencrypted and hence susceptible to interception. To secure this communication channel, we'll set up an SSH tunnel encrypting the data communicated over the network.

Open a terminal and enter the following command, replacing <ip-address> with the IP address of your server and <user> with the name of the user you created when installing CentOS 7.

ssh -L 5900:<ip-address>:5905 <user>@<ip-address> -N

It'll ask for the user's CentOS 7 password. Enter the password and it should leave the terminal in a hanging state - this means we've established our SSH tunnel and are ready to connect to the VNC server.
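To make the port plumbing explicit, here is a sketch that assembles the tunnel command from its parts: a local port on the laptop is forwarded through SSH to the VNC port on the server. The helper name tunnel_cmd, the address 192.0.2.10 (a documentation-only address), and the user fpga are placeholders, and the function only prints the command.

```shell
# Hypothetical helper: print the ssh command that forwards a local port to
# the VNC port on the server, without opening a remote shell (-N).
#   $1 = server address, $2 = user, $3 = remote VNC port, $4 = local port
tunnel_cmd() {
    echo "ssh -N -L $4:$1:$3 $2@$1"
}

tunnel_cmd 192.0.2.10 fpga 5905 5900
# ssh -N -L 5900:192.0.2.10:5905 fpga@192.0.2.10
```

While the tunnel is up, 127.0.0.1:5900 in the viewer is the local end of that forward, which is why the client is pointed at the localhost rather than at the server.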

Click Connect in the VNC Viewer client and you should see the following warning, which we can now safely ignore.

Click Continue. Enter the password you set up for the VNC server and click OK.

Congratulations! We just established our first remote desktop session with the CentOS 7 server! If you hover your cursor above the top-middle of the window, a menu will appear. Click on the icon in the middle to enter Full screen mode. If you set the -geometry option with your screen's resolution, it should take up the entire screen. Click on this icon again to exit Full screen mode.

Leave the VNC client session open; this is now our main way to interact with the remote server. We'll use it to first install and then use the EDA tools to design our FPGA peripheral hardware.

Installing Intel's EDA tools

Some EDA tools are free while others require a paid license to work. There may be free versions of tools that normally require a license with limited features, but the features included may be enough to satisfy your needs. Although I used licensed tools, I recommend using free versions when possible. You may even consider choosing a development board based on the "free-ness" of the tools needed for compiling designs for that board (e.g., I later realized that a licensed version of Quartus Prime was required to work with Arria 10 development boards).

For the hardware development in this project, I used the following EDA tools:

Quartus Prime Standard Edition
Qsys System Integration Tool
ModelSim-Intel FPGA Edition (formerly ModelSim-Altera Edition)

Note that although I used version 16.0 of these tools during initial development, I've since updated my environment with the latest release of these tools (17.1 at the time of writing this tutorial) and have updated the designs included in this repository to work with them. Although the general design process is the same, I've noticed newer versions of the tools are more stable and include some key bug fixes, one of which I take credit for and describe in the licensing section.

The following steps should be completed on the CentOS 7 server through your open VNC client session.

From Intel's Download Center, download version 17.1 of the Quartus Prime Standard Edition software and device support for the Arria 10 (parts 1, 2, and 3). Select Linux as the operating system and Direct Download as the download method (setting up the Akamai DLM3 Download Manager takes extra effort and is unnecessary for this one-time download). You have a few options for downloading the necessary files. I avoided the Complete Download since it's a pretty huge file (26.8 GB) and contains unnecessary device support. Instead, download the following files from the Multiple File Download section:

Quartus Prime Standard Edition Software (Device support not included)
Quartus Prime Device Package 1 (Arria 10)

You should now have the following tarballs sitting in your ~/Downloads (or other default) directory:

Quartus-17.1.0.590-linux.tar
Quartus-17.1.0.590-devices-1.tar

Before we begin extracting and installing the software, it's important to choose a root directory for all of your development; several files and directories will be generated as we progress through the tutorial, and without organization, things can get unwieldy fast. Out of habit, I chose ~/workspace as my root directory. In all code blocks to follow in this tutorial, replace every instance of ~/workspace with the location of your root directory.

Extract Quartus-17.1.0.590-linux.tar and run the setup.sh script.

cd ~/Downloads
tar -xf Quartus-17.1.0.590-linux.tar
./setup.sh

This will open an installer GUI. Click Next, accept the agreement, and click Next again to reach the Installation directory window. By default, it chooses ~/intelFPGA/17.1. Replace this path with the root directory you chose followed by intelFPGA/17.1. Your window should look similar to the screenshot below, with fpga replaced by the name of your user:

Click Next to reach the Select Components window. Select the following components:

Click Next twice to proceed with the installation. You'll receive the following Info pop-up dialog which you can safely ignore (we'll install Arria 10 device support files next):

When the installation is complete, the following window appears letting you optionally create a desktop shortcut for Quartus Prime. Click Finish to exit the installer.

Extract Quartus-17.1.0.590-devices-1.tar and run the dev1_setup.sh script.

tar -xf Quartus-17.1.0.590-devices-1.tar
./dev1_setup.sh

An installer similar to the one for Quartus Prime will open, and we'll use it to install Arria 10 device support for Quartus Prime. Click Next, accept the agreement, and click Next again to reach the Installation directory window. Like before, replace the default path with your root directory followed by intelFPGA/17.1, and click Next. This takes us to the Select Components window. Check the Devices box which automatically selects Arria 10 Part 1, Arria 10 Part 2, and Arria 10 Part 3:

Click Next twice to proceed with the installation, and click Finish to exit the installer.

We now have Quartus Prime Standard Edition, the Qsys System Integration Tool (embedded in Quartus Prime), and ModelSim-Intel FPGA Edition installed with support for Arria 10 devices on the remote CentOS 7 server. Yay!

If you didn't create a desktop shortcut, let's add the Quartus Prime binary (quartus) to the PATH environment variable so you can simply type quartus in a terminal to open the tool. I'll leave it to you to decide whether you want to do this permanently (i.e., editing ~/.bashrc) or temporarily (i.e., export PATH=$PATH:<path-to-quartus> in an open terminal). The path to quartus is: ~/workspace/intelFPGA/17.1/quartus/bin/.

If you were to open Quartus Prime now, a License Setup Required pop-up dialog would appear (see screenshot below). In the next section, we'll go over how to serve the license we acquire from Intel, which gives us full access to Quartus Prime Standard Edition, ModelSim-Intel FPGA Edition, and Arria 10 device support and eliminates this pop-up dialog (in Quartus Prime version 17.1 at least; earlier versions may still require initial setup).

Setting up and serving a license

Remember when I said setting up an environment for hardware development is a nontrivial task? Well, licensing EDA tools is the crux of its nontrivial-ness. I don't even know where to begin, from the difficulty in identifying which EDA tools or features of these tools (e.g., the MegaCore IP Library in Quartus Prime) require licenses to whether I'll even be using certain tools or features for my project. That's only the beginning; where and how to even acquire a license isn't obvious, there are different license types to choose from (fixed and floating), and when you finally receive a license, you need to add some not-so-obvious information to the file before it can be properly served by a license manager (that's right, yet another piece of software required to get Quartus Prime and other EDA tools working).

Recall that I mentioned taking credit for a bug that was fixed in later releases of Quartus Prime than the one I originally used (version 16.0). I'll describe that bug now [cue The Final Countdown]. The license you receive is tied to the MAC address of the NIC of the computer you're planning to run the software on. The purpose of doing this is to limit the license's use to only one, uniquely-identified computer. That's fine, it's a way of making sure a license only works for the intended user (who paid for it). Well, thanks to an act of incredibly poor engineering, Quartus Prime used to ONLY recognize NICs that are named eth0 from Linux's perspective. Upon further investigation, I found out that CentOS 7's choice of em1, em2, etc. as the names for the network interfaces on my server is actually the modern naming convention used for NICs (the eth0, eth1, etc. naming convention is obsolete). The only way to get Quartus Prime to recognize that the proper license was being served was, like this guy, to rename my em1 NIC to eth0, a nontrivial task.

Luckily, I complained about this bug and you won't have to worry about it (Quartus now recognizes all NICs on your machine). I'll also try to spare you from reading this 46-page manual on licensing by summarizing the necessary steps to edit and serve your license below.

Acquire a license. I wouldn't bother with the Self-Service Licensing Center. Instead contact an Intel licensing representative directly and ask for a license for Quartus Prime Standard Edition and ModelSim-Intel FPGA Edition. Quartus Prime Standard Edition (or Pro) is required when working with Arria 10 devices and is used to implement the FPGA hardware design. I used ModelSim-Intel FPGA Edition to run simulations and perform functional verification of each subsystem of the design as well as the hardware accelerator as a whole. Although I haven't tried, you may be able to run the simulations I've included using the free version of ModelSim (ModelSim-Intel FPGA Starter Edition).
You should now have received a license from Intel. Create a directory called license in your root directory and move the license there. I used scp to copy the license from my macbook to the CentOS 7 server (see screenshot below as an example of using this command). Accessing your email from the server directly is another option.

We need to edit the license before it can be served. First, copy and rename the license to /usr/local/flexlm/licenses/license.dat. This is where we'll tell the FLEXlm license manager to look for a license file called license.dat to serve.

sudo mkdir -p /usr/local/flexlm/licenses
sudo cp ~/workspace/license/1-FQMLCP_License.dat /usr/local/flexlm/licenses/license.dat

Using the text editor of your choice, open license.dat and add the following lines near the top. See the screenshot below to see where I made the changes.

SERVER <hostname> <MAC address of NIC>
VENDOR alterad "<your-working-dir>/intelFPGA/17.1/quartus/linux64/alterad"
VENDOR mgcld "<your-working-dir>/intelFPGA/17.1/modelsim_ae/linuxaloem/mgcld"
USE_SERVER

To obtain your server's <hostname>, run uname -n in a terminal.

You should already know the MAC address of your NIC since it was needed to acquire the license. Just in case, run ifconfig to list the network interfaces on your machine, identify which one you're actively using (I have em1 connected to the network), and the MAC address will be the 12-digit, colon-separated hex value following ether:

The next two VENDOR lines specify the locations of the Quartus Prime and ModelSim daemons that FLEXlm runs to serve their features. These daemons were included with the Quartus Prime Standard Edition and ModelSim-Intel FPGA Edition installations.

Now we're ready to serve our license! Run the following command to start the FLEXlm license manager and serve our license. Note that FLEXlm was also included in the Quartus Prime installation.

cd ~/workspace
intelFPGA/17.1/quartus/linux64/lmgrd -c /usr/local/flexlm/licenses/license.dat

... and we were so close! I'm referring to the error /lib64/ld-lsb-x86-64.so.3: bad ELF interpreter: No such file or directory you probably received the first time running the license manager:

Giving credit to this post, there's nothing a simple symbolic link can't fix! We have a functional 64-bit program loader running on the CentOS 7 server, I promise; the symlink we'll create lets FLEXlm call it ld-lsb-x86-64.so.3 if it likes.

sudo ln -s /lib64/ld-linux-x86-64.so.2 /lib64/ld-lsb-x86-64.so.3

Run the command to start FLEXlm again and voila! You'll see a log of information from FLEXlm (lmgrd), the Altera daemon (alterad), and the ModelSim daemon (mgcld). Completely irrelevant, but I'm guessing lmgrd stands for license manager daemon.
Uh-oh! Upon further inspection, we see that FLEXlm actually failed to launch the ModelSim daemon (mgcld) at time 23:42:24:

That's not good. What could be the problem? Well, I tried running the daemon directly and received the following error message:

That looks familiar! This time it's the 32-bit version of the program loader that's missing, but why does mgcld want to use a 32-bit program loader on a 64-bit architecture? If we revisit the Operating System Support page and look closely at the ModelSim-Intel FPGA Edition entry in the table, there's a superscript on the checkmark that corresponds to the following note:

So obvious! Shame on us. Let's install the 32-bit program loader, ld-linux.so.2, which belongs to the glibc package.

sudo yum install ld-linux.so.2

Note, i686 in the output refers to Intel's 32-bit x86 architecture while x86_64 refers to 64-bit architecture. Let's run mgcld again to confirm we fixed the issue.

That looks promising!

Restart FLEXlm to successfully serve both Quartus Prime and ModelSim features this time. First, use ps to identify the currently-running lmgrd's process ID (PID) and kill it with the kill -9 command.

ps aux | grep lmgrd
sudo kill -9 <PID>

Now we can start FLEXlm again:

cd ~/workspace
intelFPGA/17.1/quartus/linux64/lmgrd -c /usr/local/flexlm/licenses/license.dat

As long as you don't see any error messages in the log, all features of Quartus Prime and ModelSim should now be properly served. Here's what the tail end of my log looks like:

Open Quartus Prime. Note, if you're using version 17.1 or later, the license should automatically be recognized and you can skip ahead to step 11. Otherwise, you'll see the License Setup Required pop-up dialog appear again, but this time we're ready to specify the license we're using. Select the If you have a valid license file, specify the location of your license file option and click OK.

Specify the location of the license file and click OK.

Voila! We now have Quartus Prime Standard Edition running with its (and ModelSim-Intel FPGA Edition's) features properly licensed and activated!

That was quite the process, I know. To summarize, we learned how to remotely interact with a Dell PowerEdge R720xd server using its built-in iDRAC controller, install CentOS 7 on the server with a GNOME Desktop environment, install VNC server and client software on the server and macbook respectively, install the EDA tools (Quartus Prime Standard Edition, ModelSim-Intel FPGA Edition) we'll be using with the Arria 10 SoC Development Kit, and acquire and serve a license for the features we need. With our board selected and hardware development environment set up, we're now ready to begin designing the hardware-accelerated system!

3. Understand the software you wish to accelerate

This is perhaps the most important step. Time spent here will directly impact your approach to the problem, the design of your hardware accelerator, and ultimately your success in improving system performance. A philosophy I adhere to is that one's understanding of how something works is directly proportional to that individual's ability to debug issues or improve upon its design. When you're attempting to replace components of a large software project with specialized hardware, this is especially true. The goal of this step is to fundamentally understand the movement of and operations on data. This will help you identify performance bottlenecks in the software. Is the system memory bandwidth-limited? Is it computation-limited? Answering these questions provides insight into what can be tuned to improve performance.

When working with someone else's software, extra time is needed to understand how the code is structured and to identify source files, data structures, functions, etc. that are relevant to the computation you wish to accelerate. This was my case in choosing to work with Protocol Buffers and focusing specifically on Protocol Buffer serialization. I had no prior experience using the software nor was I familiar with varint encoding (a mechanism used heavily in the serialization code). The obvious first step was to acquire a basic understanding of how Protocol Buffers are used in applications; I achieved this by going through the online documentation, building and installing the software, and running the example applications provided. After that, I was ready to dig deeper for an understanding of how the Protocol Buffer serialization code works.

There are powerful tools available that one can use to navigate the source code of a software project and inspect running applications; using these tools helped me identify the source code responsible for Protocol Buffer serialization and understand it to a level at which I was able to begin designing the hardware that would eventually replace it. One such tool is the marriage of vim and ctags. I used this combination to review a Protocol Buffer application line-by-line and jump to the source files containing definitions of class methods as they were invoked. This helped me better understand the relationship between the compiler-generated code and runtime library that are components of any Protocol Buffer application. I then used the GNU Debugger (gdb) to step through the same application and inspect stack traces as it actively serialized a Protocol Buffer Message. This helped me understand the sequence (and frequency) of method invocations involved in the serialization. The last tool I'll mention is perf - a powerful profiling tool used in performance tuning. perf makes use of performance counters, kernel tracepoints, and even user-level statically defined tracing (USDT) to periodically sample a running application and provide detailed reports on where it spends its time (i.e., the various code paths taken and what percentage of the total execution time they account for). As demonstrated in section 9. Profiling the hardware-accelerated system, we'll use it to profile applications running on both standard (purely CPU/software model) and hardware-accelerated systems and analyze differences in their execution on the two systems.

In the remainder of this section, I'll provide an overview of the Protocol Buffer software, walk through an example of serializing a message, specify which version of the Protocol Buffer software we use in Firework (and show how to build it from source), demonstrate how I used vim+ctags and gdb to identify and understand the source code relevant to Protocol Buffer serialization, and discuss how time spent analyzing the WireFormatLite and CodedOutputStream classes and their relation to the various message field types led to a key realization and simplification of the hardware accelerator design. I'll conclude this section with a brief discussion about the importance of using perf at this stage as well, a lesson I learned after the fact.

Overview of Protocol Buffers and message serialization

From the Developer Guide, "Protocol buffers are a flexible, efficient, automated mechanism for serializing structured data". In the land of Protocol Buffers, structured data (or data structures) are called messages. Messages consist of a series of key-value pairs called fields, similar to JSON objects. Fields can be basic types (e.g., integers, booleans, strings), arrays, or even other embedded messages. The general idea is that you define the messages you want to use in your application in a .proto file and use the Protocol Buffer compiler (protoc) to generate specialized code that implements these messages in the language of your choice (e.g., C++ classes). The compiler-generated code provides accessors for individual fields along with methods that work closely with the Protocol Buffer runtime library (libprotobuf.so.10.0.0) to serialize/parse entire messages to/from streams or storage containers. Protocol Buffers are extensible in the sense that you can add new fields to messages without disrupting existing applications that use older formats; this is achieved by marking fields as optional rather than required.

For a more complete understanding of what Protocol Buffers are and how they're used, and to learn about varint encoding and how messages are serialized, I recommend going through the following material (which also serves as a prerequisite for the remaining content in this section and subsequent sections in this tutorial):

Protocol Buffers: Protocol Buffer home page
Developer Guide: a good introduction
Protocol Buffer Basics: C++ : tutorial on using Protocol Buffers in C++ (language used in Firework)
Encoding: describes varint encoding and message serialization

From the Encoding page, we learned that field keys are actually composed of two values - a field number (or tag) and a wire type - when the containing message is serialized into its binary wire format. Field numbers are simply the integers assigned to the fields of a message as defined in the .proto file, and wire types, also integer values, are determined by the type of a field's value (e.g., int32, fixed64, string, embedded message). Wire types provide information that's needed to determine the size, in bytes, of a field's encoded value following the key.

To serialize a message, its fields are encoded and written sequentially in ascending field number. A field is written with its encoded key first followed by its encoded value. Finally, keys are encoded as varints with the value: (field number << 3) | wire type, and the way a field's value is encoded depends on its wire type. To better understand how messages are serialized, let's walk through an example below.

Using the address book example application from the C++ tutorial, let's serialize, by hand, an AddressBook message that contains one Person message. Our Person message has the following fields:

name:       Kevin Durant
id:         35
email:      kd@warriors.com
phones:
    number: 4155551988
    type:   MOBILE

First up, we encode the AddressBook message's only field:

repeated Person people = 1;

We see that its field number is 1, and referring to the table in the Message Structure section of the Encoding page, its wire type is 2 (length-delimited) since the value's type is an embedded message (i.e., a Person). With this information, we're now ready to generate the value of the field's key and subsequently encode it as a varint. Left shifting the field number by three bits gives us 0b00001000, and ORing this value with the wire type produces the key's value of 0b00001010 (or 0a in hex, 10 in decimal). Varint encoding this value next will give us our encoded key, the first data to be written when serializing our message.

Recall that base 128 varints use the MSB of each byte to indicate whether it's the final byte of data; this means the largest value a single byte of varint data can represent is 127 (i.e., 0b01111111 for the final byte and 0b11111111 for any other byte). With this in mind, the varint encoding algorithm can be described as follows (note, negative integers are treated as very large positive numbers):

Is the value of this integer less than 128? If yes, prepend a 0b0 to its least significant 7 bits, and this produces the final byte of data. Otherwise, prepend a 0b1 to the least significant 7 bits, and this produces the next byte of data. Right shift the integer by 7 bits and repeat from the beginning.

Turning back to our example, since the value of our key (10) is less than 128, we prepend a 0b0 to the least significant 7 bits, giving us our 1-byte varint encoded key, 0a in hex. (Note, I'll use hex notation for all encoded data from this point forward.)

Next we encode the field's value, an embedded Person message. For length-delimited fields, encoded values consist of two parts: a varint encoded length followed by the specified number of bytes of data. The length in this case corresponds to the size of the encoded Person message. As I'll demonstrate later, the compiler-generated code provides methods for calculating and caching the sizes of populated messages as an optimization during serialization, so I'll just tell you that the Person message with the values above results in 47 bytes of encoded data. You can confirm this number after we've serialized all its fields.

Since 47 is less than 128, our varint encoded length is simply 2f. The next 47 bytes of the length-delimited value consist of the encoded fields (keys and values) of the embedded Person message in ascending field number.

The first field of our Person message is:

required string name = 1;

The field number is 1 and wire type is 2 (length-delimited) since the value's type is a string. In the same way we arrived at our first varint encoded key, this field's varint encoded key is also 0a.

The length we need to varint encode is simply the size of the string Kevin Durant, which is 12 bytes. Since 12 is less than 128, our varint encoded length is 0c. The second part of our length-delimited value consists of the 12 UTF-8 encoded characters of the string Kevin Durant, which are 4b 65 76 69 6e 20 44 75 72 61 6e 74. We now have our first completely encoded field!

The next field of our Person message is:

required int32 id = 2;

This time our field number is 2 and wire type is 0 (varint) since the value's type is an int32. Following the same procedure, our varint encoded key is 10.

The value of an int32 field is also varint encoded. Since 35 is less than 128, our varint encoded value is 23 (and as fate would have it, Kevin Durant's number in hex matches that of Michael Jordan's in decimal).

The third field of our Person message is:

optional string email = 3;

The field number is 3 and wire type is 2 (length-delimited) giving us a varint encoded key of 1a. (Hopefully at this point you've recognized that integers 0-127 result in 1-byte varints.) The string kd@warriors.com has 15 characters giving us a varint encoded length of 0f, and following this are the UTF-8 encoded characters 6b 64 40 77 61 72 72 69 6f 72 73 2e 63 6f 6d.

The fourth and final field of our Person message is also an embedded message:

repeated PhoneNumber phones = 4;

The field number is 4 and wire type is 2 (length-delimited) giving us a varint encoded key of 22.

The size of this embedded message is 12 bytes which gives us a varint encoded length of 0c, but this isn't so obvious. Jumping into the PhoneNumber message, as usual we encode its first field:

required string number = 1;

The field number is 1 and wire type is 2 giving us a varint encoded key of 0a. The phone number 4155551988 has 10 characters giving us a varint encoded length of 0a, and this is followed by the UTF-8 encoded characters 34 31 35 35 35 35 31 39 38 38 (don't confuse the phone number for a number, remember it's a string). But wait a minute... encoding the first field already gives us 12 bytes which I claimed was the size of the entire embedded PhoneNumber message. What about its second field:

optional PhoneType type = 2 [default = HOME];

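Since this field is marked as optional and has HOME specified as the default value of type, a serialized PhoneNumber message missing this field is assumed to have a PhoneType value of HOME. In other words, we don't have to include a field that holds its default value, and a parser would still be able to reconstruct this message's corresponding C++ object. The field would have to be included if type was assigned a value other than the default. Note that our example sets type to MOBILE and we still left the field out, which only holds if MOBILE, rather than HOME, is the default in the addressbook.proto you actually compile; the hexdump later in this section agrees with the 12-byte count, so check the PhoneType definition in your copy of the file if your bytes differ.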

With that said, we've finished serializing the embedded PhoneNumber message which means we've also finished serializing the embedded Person message which means we've also also finished serializing our AddressBook message! To summarize, the following bytes of data comprise the entire serialized AddressBook message:

0a 2f 0a 0c 4b 65 76 69 6e 20 44 75 72 61 6e 74 10 23 1a 0f 6b 64 40 77 61 72 72 69 6f 72 73 2e 63 6f 6d 22 0c 0a 0a 34 31 35 35 35 35 31 39 38 38

Building the Protocol Buffer compiler and runtime libraries from source, running the example C++ applications

Now that we're familiar with how Protocol Buffers are used and understand how messages are serialized, let's build the software from source and run the example applications provided. As an overview, we're going to clone the google/protobuf repository, see which versions of the software are available, create a new branch corresponding to release v3.0.2 of the Protocol Buffers software, and finally, build it from source. v3.0.2 was the latest version available at the time I worked on Firework and hence the version I forked and modified for use in the hardware-accelerated system. (The modified protobuf repository is located here: firework/protobuf.) We'll use both the modified and unmodified protobuf libraries later to profile the hardware-accelerated and standard systems, respectively. Note, I performed the following steps on the CentOS 7 server.

Download the Protocol Buffer source code repository found here.

git clone https://github.com/google/protobuf.git

Using git tag, let's list the tags included in the protobuf repository we just downloaded. These tags correspond to different releases of the software. Since we're interested in release v3.0.2, we'll create and check out a new branch corresponding to the tag v3.0.2, all in one command.

cd protobuf
git tag
git checkout -b protobuf-v3.0.2 v3.0.2
git branch -v

Inspecting the output of git branch -v, we see that we've indeed switched to the new branch. The output should look something like this:

Follow the C++ Installation - Unix instructions to build and install the Protocol Buffer compiler (protoc) and runtime libraries from source. Stop when you reach the Compiling dependent packages section. Here are some helpful notes on building:

To install the build tools in the first step, replace apt-get with yum since we're using CentOS and not Ubuntu. There is no g++ package in the CentOS repositories; the package you're interested in is gcc-c++ (which I think is more appropriately named). We should already have these tools installed since we selected Development Tools when installing CentOS, but it doesn't hurt to run this command anyway in case any tools are missing or there are updates.
Figure out how many threads your server can execute in parallel and use the -j <num> option when running make for a faster build (e.g., make -j 32 for me)
Here's the output you want to see after running make -j <num> check:

Running sudo make install places protoc in /usr/local/bin/ and the runtime libraries in /usr/local/lib/:

Now that we have the Protocol Buffer compiler and runtime libraries built and installed, let's use them to compile and run the C++ example applications, add_person.cc and list_people.cc. First we have to use protoc to generate C++ classes for the messages defined in addressbook.proto. Following the instructions from the section Compiling Your Protocol Buffers in the Protocol Buffer Basics: C++ tutorial:

cd ~/workspace/protobuf/examples
protoc -I=./ --cpp_out=./ addressbook.proto

This generates two new files, addressbook.pb.h and addressbook.pb.cc:

Now we have all the necessary components for compiling the C++ applications: the Protocol Buffer runtime libraries (e.g., libprotobuf.so.10.0.0), the compiler-generated C++ class definitions for the messages AddressBook, Person, and PhoneNumber used in our applications, and of course the applications themselves: add_person.cc, list_people.cc. Resuming where we left off in the C++ Installation - Unix guide, the section Compiling dependent packages shows how to use pkg-config to compile and link applications against a package called protobuf. First, we need to tell pkg-config where it can find the file protobuf.pc:

pkg-config --cflags --libs protobuf
ls /usr/local/lib/pkgconfig/
export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig/
pkg-config --cflags --libs protobuf

The sequence above shows how pkg-config fails to find the protobuf package at first and what you need to do to fix it. The second time pkg-config --cflags --libs protobuf is run, we see the compiler and linker flags used for building any protobuf application:

Finally, let's build the C++ example applications.

g++ add_person.cc addressbook.pb.cc `pkg-config --cflags --libs protobuf` -o add_person
g++ list_people.cc addressbook.pb.cc `pkg-config --cflags --libs protobuf` -o list_people

Your protobuf/examples directory should now contain two new binaries, add_person and list_people.

Let's run add_person to create a new Person, add it to a new AddressBook, and serialize and store the entire AddressBook message in a file called my_addressbook.

./add_person my_addressbook

Uh-oh! You probably received the following error message:

This error message means the program loader (ld-linux-x86-64.so.2 on my system) was unable to find the runtime library (i.e., shared object file) called libprotobuf.so.10 that add_person needs to run. To solve this problem, we need to set the LD_LIBRARY_PATH environment variable with the path containing libprotobuf.so.10:

echo $LD_LIBRARY_PATH
export LD_LIBRARY_PATH=/usr/local/lib
echo $LD_LIBRARY_PATH

Now we should be able to run our application error-free. Run ./add_person my_addressbook once more and follow the prompts to create a Person with the following fields:

name:       Kevin Durant
id:         35
email:      kd@warriors.com
phones:
    number: 4155551988
    type:   MOBILE

If you're unfamiliar with LD_LIBRARY_PATH; the difference between ld (the poorly-named linker) and /lib/ld-linux.so.2 (the program loader); the directories /lib, /usr/lib, and /usr/local/lib; and/or the difference between files named libprotobuf.so, libprotobuf.so.10, and libprotobuf.so.10.0.0 then I HIGHLY RECOMMEND you read this page on shared libraries through section 3.3.2. LD_DEBUG before continuing.

Now let's run list_people to parse the serialized AddressBook message contained in the file my_addressbook and list its contents.

./list_people my_addressbook

Great! We see the Person message we just created in the last step. Actually, this isn't that exciting. We can do better. Let's use the hexdump utility to inspect the contents of my_addressbook, which contains the serialized AddressBook message (i.e., binary data):

hexdump -C my_addressbook

Lo and behold, we see 49 bytes of data, starting with 0a and ending in 38. If you compare this byte-for-byte with the AddressBook message we serialized earlier, you'll see they match. Incredible! Now that we're confident in our abilities to compile and run Protocol Buffer applications and serialize Protocol Buffer messages, let's see how the Protocol Buffer runtime library and compiler-generated code work together to perform serialization.

Stepping through add_person.cc: `vim` + `ctags`

In this section, we'll use vim and ctags to take a closer look at add_person.cc and dive into the code responsible for Protocol Buffer serialization. If you've never used it before, ctags is the holy grail of navigating large software projects that contain several source files. Used in conjunction with vim, this creates a powerful, GUI-free method of understanding how an application's source code is structured. This is particularly useful when working with embedded systems or accessing a remote computer via ssh where a terminal may be your only interface (i.e., you can't use more powerful IDEs like Eclipse, Atom, etc.).

If you've never used vim before, take some time to become familiar with the keyboard commands. In fact, a keyboard is your only way of interacting with vim. Find a tutorial that covers the various modes of operation and basic navigation; perhaps this is a good starting point. Once you've got the basics down, here's the tutorial I used to learn how to use ctags with vim. It may be difficult to internalize all the relevant commands in such little time; a useful thing to do is to make your own cheat sheet of commands frequently used or that you're trying to learn and keep it somewhere accessible. I placed Post-it notes on my monitor for basic vim navigation and ctags-specific commands:

Off the top of my head, here are some other fundamental commands you'll find yourself using:

Command	What it does
`i`	enters insert mode (i.e., you can begin editing the file. Hit `ESC` to exit this or any other mode)
`u`	undo
`ctrl + r`	redo
`dd`	deletes a line
`p`	paste whatever's stored in `vim`'s buffer (try `dd p` and see what happens)
`:w`	save
`:wq`	save and quit
`:set nu`	displays line numbers
`:set nonu`	hides line numbers
`/<reg_exp>`	search forward through the file for `<reg_exp>`
`n`	jump to the next match of `<reg_exp>`
`?<reg_exp>`	search backward through the file for `<reg_exp>`
`n`	after a `?` search, jump to the previous match of `<reg_exp>` :)
`:f`	display the file that's currently open
`:sp <file>`	opens `<file>` and splits the window (note, only one file is in focus at a time)
`ctrl + w`, `ctrl + w`	toggles focus between split windows
`ctrl + w`, `up arrow`	switches context to the window above the one currently in focus
`ctrl + w`, `down arrow`	I think you're smart enough to infer its function
`:q`	quit (exits the split window in focus or all of `vim` if only one file is open)

Without further ado, let's dive into add_person.cc and learn how messages are serialized.

In the protobuf/src directory (i.e., the root directory containing the Protocol Buffer runtime library's source code), let's use ctags to generate an index of all identifiers used in the source code. This creates a new tags file which maps identifiers to their containing source file(s). Note, we should still be on branch protobuf-v3.0.2.

cd ~/workspace/protobuf/src
ctags -R *

From the same directory, open add_person.cc with vim:

vim ../examples/add_person.cc

Note that if you switch to another directory (e.g., protobuf/examples) and open add_person.cc, the tag-search commands we're going to use wouldn't work. You need to open a file with vim or invoke vim -t <tag> from the directory containing the tags file.

You should already be familiar with what add_person.cc does and how it's used. If not, revisit the Protocol Buffer Basics: C++ tutorial. We want to focus on the code that's involved in serializing messages. Fortunately, there's only one line in the entire program that we care about: line 85. Navigate to this line and place your cursor anywhere over the identifier SerializeToOstream(), not address_book or (&output). Note, I use the notation FooBar() to refer to functions or methods, purposely excluding any parameter list to avoid clutter. (Pro tip: try entering the command :85 in vim.)

With the cursor over SerializeToOstream(), enter the command ctrl + ] to jump to its definition. This takes us to line 175 of the file, google/protobuf/message.cc:

This marks our first use of ctags, and we immediately learn a few things:

SerializeToOstream() is a method of the Message class (i.e., the base class that AddressBook is derived from and whose accessible members it inherits), not of AddressBook itself
Because SerializeToOstream() belongs to Message, we're looking at code that constitutes the Protocol Buffer runtime library, not compiler-generated code
SerializeToOstream() is simply a wrapper function around SerializeToZeroCopyStream(), which we'll jump to next

Place the cursor anywhere over SerializeToZeroCopyStream() on line 178, and enter ctrl + ] once more. This takes us to line 272 of the file, google/protobuf/message_lite.cc:

Alright, things are starting to get interesting. We see that SerializeToZeroCopyStream() is a method of the MessageLite class - the base class that Message is derived from (as seen here in the C++ API). This method instantiates a CodedOutputStream object and in turn calls SerializeToCodedStream().

Place your cursor anywhere over SerializeToCodedStream() and enter ctrl + ]. This takes us to line 234 of the same file. This method simply checks that the message object has been initialized (i.e., all required fields have been assigned values) and calls SerializePartialToCodedStream(). Place your cursor over this method and enter ctrl + ] once more. We see that it's defined right below starting on line 239:

SerializePartialToCodedStream() is the first method with relevant content, and it's actually central to message serialization. It's in this method where the Protocol Buffer runtime library finally interacts with the compiler-generated code. As you may recall, I previously mentioned that the compiler-generated code provides methods for calculating and caching the sizes of populated messages as an optimization during serialization. Well, on line 241 we call one such method, ByteSize(), of the AddressBook class. If you try jumping to ByteSize()'s definition with ctrl + ] like we've done so far, you'll see that vim + ctags yields several hundred options, and none of these options are actually useful. This makes sense since we only created an index (i.e., tags file) of the runtime library's source code, and I just said ByteSize() comes from the compiler-generated code. It took me quite some time to realize that not only does ByteSize() come from the compiler-generated code, it's also defined multiple times - once for each message in your .proto file. Since back in add_person.cc it was an AddressBook object that invoked SerializeToOstream(), it's AddressBook::ByteSize() that's invoked here which is defined on line 1174 of the file, protobuf/examples/addressbook.pb.cc.

After the message object's size is cached, SerializePartialToCodedStream() then accesses the CodedOutputStream object's buffer, determines if it has enough room for the serialized message, and if so, calls SerializeWithCachedSizesToArray() to write the message to it directly. Otherwise, it calls SerializeWithCachedSizes() and passes it the entire CodedOutputStream object to write the message. Both methods perform the same function, but the former is optimized. Like ByteSize(), the methods SerializeWithCachedSizesToArray() and SerializeWithCachedSizes() also are part of the compiler-generated code and have multiple definitions, one for each message in the .proto file.

I'll demonstrate in the next section how gdb makes it very easy to identify the exact codepaths taken on each function or method invocation, fixing this deficiency of vim + ctags.

Let's step into SerializeWithCachedSizesToArray() to begin learning about the compiler-generated code's role in serializing messages. As mentioned, this method is defined once for each message; we'll look at the Person class's since it's more informative than the AddressBook's and PhoneNumber's definitions. For completeness, place your cursor over SerializeWithCachedSizesToArray() on line 250 and enter ctrl + ]. This displays several options; enter f repeatedly until you've reached option 69:

This is the source file we're interested in; it contains the definition of this member of the MessageLite class. Continue entering f until you've reached the bottom and are prompted to enter the number of the entry you wish to jump to:

Enter 69 and press enter. This takes us to line 257 of the file, google/protobuf/message_lite.h:

Ah-ha! Here we see that this method is a virtual member of the MessageLite class. This means derived classes can redefine this method, so hopefully now you believe me when I say this is where AddressBook and other compiler-generated message classes take over. Actually, if you look at the Person class's implementation of this method (lines 198-200 of the file addressbook.pb.h), you'll see that it's identical to MessageLite's and also calls InternalSerializeWithCachedSizesToArray(). We'll jump into Person::InternalSerializeWithCachedSizesToArray() next.

Enter :q to exit vim. Still in ~/workspace/protobuf/src, open the file, ../examples/addressbook.pb.cc. This is the source file that the compiler (i.e., protoc) spits out defining the three messages, AddressBook, Person, and PhoneNumber, and hence it's where we'll find Person::InternalSerializeWithCachedSizesToArray().
Once open, let's display line numbers and search for the method.

:set nu
/ Person::InternalSerializeWithCachedSizesToArray

This brings us to the method's definition on line 677:

Don't let Google's extensive use of namespaces and the scope resolution operator frighten you; this method is actually very straightforward. Recall that the Person message has four fields as defined in the .proto file, copied below for convenience:

In order of ascending field number, Person::InternalSerializeWithCachedSizesToArray() calls a WireFormatLite::Write*ToArray() method for each field, where the particular Write*ToArray() method corresponds to the field value's type:

WireFormatLite::WriteStringToArray() for name
WireFormatLite::WriteInt32ToArray() for id
WireFormatLite::WriteStringToArray() for email
WireFormatLite::InternalWriteMessageNoVirtualToArray() for phones

WireFormatLite is a class that's internal to the Protocol Buffer runtime library, and its Write*ToArray() methods are used to serialize individual fields of a message and write them to an output buffer. As you can see, the compiler-generated code specifies the order in which fields are written and the runtime library performs the actual writing. Let's see how the id field is serialized by jumping into WireFormatLite::WriteInt32ToArray().

Place your cursor over the identifier WriteInt32ToArray() and enter ctrl + ]. This takes us to line 672 of the file, google/protobuf/wire_format_lite_inl.h:

We see that this method calls two other WireFormatLite methods: WriteTagToArray() and WriteInt32NoTagToArray(). This should be intuitive now since we previously learned that a field is written key first followed by its encoded value (key and tag are synonyms). Let's look at WriteInt32NoTagToArray() next.

Place your cursor over the identifier WriteInt32NoTagToArray() and enter ctrl + ]. This takes us to line 608 of the same file:

We see that this method is simply a wrapper function that calls the CodedOutputStream class's WriteVarint32SignExtendedToArray() method. From the C++ API, the CodedOutputStream class "encodes and writes binary data which is composed of varint-encoded integers and fixed-width pieces". This sounds very promising; let's jump into WriteVarint32SignExtendedToArray() next. By the way, if we're comparing this call stack to the movie Inception, I think time is moving backwards at this point.

Place your cursor over the identifier WriteVarint32SignExtendedToArray() and enter ctrl + ]. This takes us to line 1164 of the file, google/protobuf/io/coded_stream.h:

In this method, we see that the path taken next depends on the sign of the int32 field's value. Let's take the red pill and venture down the WriteVarint32ToArray() rabbit hole. Proceed with caution; we may actually create a black hole in this next jump.

Place your cursor over the identifier WriteVarint32ToArray() and enter ctrl + ]. This takes us to line 1145 of the same file:

And with this my friend, I'm proud to say we've FINALLY reached what we've been (secretly) looking for: the code that actually performs varint encoding. Take some time to go through this method and confirm it implements the varint-encoding algorithm I described in English in the section, Overview of Protocol Buffers and message serialization.

To summarize, we've used vim and ctags to navigate our way through the Protocol Buffer runtime library and compiler-generated code, starting with an example application (add_person.cc) that initiates message serialization. We've learned that the runtime library and message classes defined in the compiler-generated code work hand-in-hand to encode and serialize individual fields, looking at how int32 fields are varint encoded and serialized as an example. Although not explicitly stated, we've also learned that two classes central to serializing messages are WireFormatLite and CodedOutputStream, partially defined in the files google/protobuf/wire_format_lite_inl.h and google/protobuf/io/coded_stream.h, respectively. I don't expect this next bit to be immediately intuitive, but it's actually quite significant that all message serialization and encoding logic is contained in these two classes (and really just the latter) and not in the compiler-generated code. I'll elaborate further on this in the section, Analyzing the Protocol Buffer serialization code.

I'm jumping ahead of myself, but it's WriteVarint32ToArray() and other methods of the CodedOutputStream class that we'll be modifying after we build the hardware accelerator, integrate it into the larger SoC system, and write a device driver that enables user space access to the newly developed FPGA peripheral. It's actually the device driver that creates an interface for the modifications we'll need to make!

Before we get there, let's see how using gdb to step through the same application and inspecting stack traces along the way provides a straightforward way of identifying relevant code paths and containing source files.

Stepping through add_person: `gdb`

In the last section, we used vim and ctags to navigate our way through add_person.cc, learning more about the source code that's responsible for serializing messages. While this combo was very effective, it wasn't perfect. We were sometimes left with several hundred options to choose from in identifying the next codepath to take, and sometimes none of these paths were the one we needed. The GNU Project Debugger (a.k.a., the GNU Debugger, or simply gdb) fixes this problem, providing the ability to easily identify the exact codepaths taken and eliminating any guesswork. This powerful tool allows you to walk through an application as it executes, insert breakpoints, halt execution, inspect the values of variables, step into functions and methods as they're called, and inspect the call stack, among other things. While this tool's primary use is debugging, it can also be used effectively to understand how a program works. In this section, we'll use gdb to walk through add_person, providing the same information for the Person message we previously serialized by hand, and make use of gdb's backtrace feature to take a closer look at the codepaths that ultimately lead to WriteVarint32ToArray() and other methods of the CodedOutputStream class.

If you've never used gdb before, here are some useful tutorials that cover everything from compiling your program with the debugging information gdb needs, to basic commands for the gdb command-line interface, to retrieving information about the call stack:

GDB Tutorial: a good intro, quick tutorial covering basics
RMS's gdb Debugger Tutorial: a more comprehensive tutorial (and the one I used to learn gdb)
8.2 Backtraces: using gdb's backtrace feature to inspect the call stack

Once you're confident in your ability to use gdb, continue below to step through add_person. Note, I'm using the CentOS 7 server with the Protocol Buffer runtime library already built and installed, as well as add_person. I also assume you didn't permanently set the PKG_CONFIG_PATH or LD_LIBRARY_PATH environment variables; you can safely skip these steps below if they're already set. Also note that we should still be on branch protobuf-v3.0.2 of the protobuf repository.

gdb should already be installed on your machine, but just in case, run the following:

sudo yum install gdb

As you know, applications need to be compiled with the -g flag in order for gdb to be able to step through them properly. This flag tells the compiler to include debugging information in the executable (or shared object file, as in the case of the Protocol Buffer runtime library, libprotobuf.so.10.0.0) it generates. Note that both libprotobuf.so.10.0.0 and add_person are simply variants of an object file, and debugging information is added to these files in the form of additional .debug_* sections. Therefore, we can use the objdump utility to list the sections contained in these files and look for the inclusion of .debug_* sections, confirming they were compiled with -g.

objdump -h /usr/local/lib/libprotobuf.so.10.0.0 | grep debug

I used grep to filter the output, listing only lines that contain the string: debug. Here, we see that the Protocol Buffer runtime library was indeed compiled with -g. Now, let's check add_person.

cd ~/workspace/protobuf/examples
objdump -h add_person | grep debug

Uh-oh, we don't see any .debug_* sections in the output! That's ok, luckily it's an easy fix. Let's compile another version with debugging symbols and name it add_person_dbg.

export PKG_CONFIG_PATH=/usr/local/lib/pkgconfig
g++ add_person.cc addressbook.pb.cc `pkg-config --cflags --libs protobuf` -g -o add_person_dbg
objdump -h add_person_dbg | grep debug

... and we're all set! I encourage you to compare the size of the two binaries, add_person and add_person_dbg, using ls -lh. This should provide insight as to why production software is stripped of debugging symbols.

Now we're ready to invoke gdb and pass it add_person_dbg as the program we wish to "debug".

export LD_LIBRARY_PATH=/usr/local/lib
gdb add_person_dbg
run

Since we didn't provide the name of an address book file, add_person_dbg prints a Usage: statement and exits, as expected. If you look at gdb's output following this statement (highlighted in the screenshot above), we see that we're missing debugging information for the GNU C Library (glibc), GCC low-level runtime library (libgcc), and GNU Standard C++ Library (libstdc++), all of which add_person_dbg uses.

Let's exit the current gdb session and install these libraries' missing debugging symbols. Let's also remove my_addressbook (if it exists) such that the next invocation of add_person_dbg creates a new one that'll only contain the Person message we enter.

q
sudo debuginfo-install glibc libgcc libstdc++
rm my_addressbook

With debugging symbols for the executable, Protocol Buffer runtime library, and core libraries they use, we're now ready to step through add_person_dbg. Before we run the application under gdb, let's insert a breakpoint at line 85 of add_person.cc. This is where SerializeToOstream() is called, and hence where the AddressBook message's serialization is initiated.

gdb add_person_dbg
break 85
run my_addressbook

Following the prompts, enter the information of our example Person message from before (see below). After the last field is entered, gdb will continue executing add_person_dbg and halt when it reaches the first instruction corresponding to line 85 of the source file add_person.cc.

Run the list command without any arguments. This displays the 10 lines of code centered around where the program was halted (i.e., line 85). We already knew that we want to step into SerializeToOstream(), but list comes in handy when you've stepped into a function and need to figure out where to go next. Running list again displays the next 10 lines of code, and so on.

Run the step command to instruct gdb to step into SerializeToOstream(). This takes us to line 175 of the file, google/protobuf/message.cc. Note that if we ran next instead, gdb would've executed the subroutine completely, returned, and halted execution at the instruction immediately following the call to SerializeToOstream().

Inspecting the output of list, we see that SerializeToZeroCopyStream() is on line 178, and this is the method we want to jump into next. One way of getting there is to set another breakpoint at line 178 and run the command, continue.

Set a breakpoint at line 178 and run the commands continue, step, and list. This takes us into the method SerializeToZeroCopyStream() at line 273 of the file, google/protobuf/message_lite.cc.

Next, we want to jump into SerializeToCodedStream() at line 275. Before we do this, however, let's run the command backtrace (or bt, or info stack) to see what our call stack looks like at this point.

We see SerializeToZeroCopyStream() at the top of the stack (entry #0), which is currently the active stack frame. Underneath are stack frames for SerializeToOstream() of the Message class and main() of add_person.cc, neither of which has finished executing. Take some time to become familiar with reading the output of this command. It's gdb's ability to step into functions and provide informative stack traces that allows us to eventually reach CodedOutputStream::WriteVarint32ToArray() without any guesswork. Let's continue on that journey!

Set a breakpoint at line 275, continue to this breakpoint, and step into SerializeToCodedStream(). This takes us to line 236 of the same file. Repeating this pattern, continue stepping into the following methods (where each one is called somewhere in the body of the one prior):

MessageLite::SerializePartialToCodedStream()
AddressBook::SerializeWithCachedSizesToArray()
AddressBook::InternalSerializeWithCachedSizesToArray()
WireFormatLite::InternalWriteMessageNoVirtualToArray()
WireFormatLite::WriteTagToArray()
CodedOutputStream::WriteTagToArray()
CodedOutputStream::WriteVarint32ToArray()

Once in CodedOutputStream::WriteVarint32ToArray(), set a breakpoint at line 1147 (the first line of this method's body). This way, we can subsequently run continue to reach this method again, run a backtrace, and inspect the call stack to see which field's tag, length-delimited size, or value is actively being varint encoded.

Note, it is not trivial to reach this point. It might take several tries for you to figure out when to step, when to run next, and when to continue, as some method invocations span several lines and call other methods themselves, using the return value as a parameter. If you've properly navigated your way to the first time CodedOutputStream::WriteVarint32ToArray() is called, running bt should result in the following call stack:

Let's take a look at what's happening at this point in the program's execution. Starting with AddressBook::InternalSerializeWithCachedSizesToArray() (#4 in the call stack), this compiler-generated method calls the runtime library's WireFormatLite::InternalWriteMessageNoVirtualToArray() once for each embedded Person message it needs to serialize. This makes sense; recall that an AddressBook message only has one field, a repeated Person people = 1;. As for any field, first we encode and write this embedded message field's key. With a field number of 1 and wire type of 2 (length-delimited), the value we need to varint encode (giving us our key) is 10 in decimal. Correspondingly, WireFormatLite::InternalWriteMessageNoVirtualToArray() calls WireFormatLite::WriteTagToArray(), setting its parameters type and field_number to WireFormatLite::WIRETYPE_LENGTH_DELIMITED and 1, respectively. (Again, note that Tag in WriteTagToArray() is synonymous with my use of the word key; this method is for writing field keys.) This method in turn calls CodedOutputStream::WriteTagToArray(), setting its parameter value to 10. Last, CodedOutputStream::WriteTagToArray() calls CodedOutputStream::WriteVarint32ToArray(), also setting its parameter value to 10, which is where the number 10 is varint encoded into the hex value 0a, finally producing our first byte of serialized data: the first varint encoded key of the first field of our AddressBook message. Cool, eh? :)

We expect the program to varint encode the size (in bytes) of the embedded Person message next, since it's a length-delimited field. Let's run continue to take us to the next invocation of CodedOutputStream::WriteVarint32ToArray(), and inspect the stack.

Highlighted above, we see that the number 47 is being varint encoded next; recall that this is the size of our example Person message.

Run continue followed by bt once more, and let's take a look at what the stack is telling us:

We see that WireFormatLite::InternalWriteMessageNoVirtualToArray() (#5 in the call stack) calls Person::InternalSerializeWithCachedSizesToArray(), which is the compiler-generated method we analyzed in the last section and know is responsible for serializing the Person class's individual fields! This method calls WireFormatLite::WriteStringToArray() first, setting its parameters value and field_number to "Kevin Durant" and 1, respectively, to proceed with serializing the Person message's first field: required string name = 1;. As usual, we write this field's key first, which leads to varint encoding the value 10 in CodedOutputStream::WriteVarint32ToArray(). Before we conclude this section, let's see which methods are responsible for serializing the value ("Kevin Durant") of our name field.

Set a breakpoint at line 736 of the file, google/protobuf/io/coded_stream.cc, run continue, and run bt one final time.

break coded_stream.cc:736
continue
bt

Here, we see that WireFormatLite::WriteStringToArray() (#1 in the call stack) calls CodedOutputStream::WriteRawToArray(), setting its parameter size to 12: the length of the string "Kevin Durant". This method in turn calls memcpy() on line 736 (highlighted above), instructing the system to copy 12 bytes of data from the memory area pointed to by data to the memory area pointed to by target. For those brave enough, let's jump into memcpy() next.

Run step and list.

Here we see that we've now entered code that belongs to the GNU C Library, glibc. We also see that memcpy() is implemented in x86 assembly. Details of this method's implementation are beyond the scope of this tutorial, but the important takeaway is that this is where the Protocol Buffer runtime library hands off execution to the system for copying raw bytes of data from one memory location to another (i.e., to our output stream).

In the next section, we'll see how keys and values of all field types can be categorized into two high-level types of data: varint data and raw data.

Analyzing the Protocol Buffer serialization code

In the previous two sections, we saw that the compiler-generated code works very closely with the Protocol Buffer runtime library to serialize fields of a message. We took a closer look at Person::InternalSerializeWithCachedSizesToArray() and saw that it makes a sequence of calls to WireFormatLite::Write*ToArray() methods to serialize its fields in order of ascending field number. I like to call InternalSerializeWithCachedSizesToArray() the "lowest high-level serialization method"; its purpose is to provide the order in which fields of a message (e.g., a Person message) are encoded and written to an output buffer. Next, we looked at the WireFormatLite::Write*ToArray() methods that it calls and saw that they in turn call CodedOutputStream::Write*ToArray() methods which actually perform the encoding and writing of field keys and values. Finally, we learned that CodedOutputStream::WriteVarint32ToArray() is the method responsible for varint encoding 32-bit integers.

As it turns out, there are six methods of the CodedOutputStream class that contain all the logic for encoding and writing keys and values of all field types:

WriteVarint32ToArray()

WriteVarint64ToArrayInline()

WriteLittleEndian32ToArray()

WriteLittleEndian64ToArray()

WriteRaw()

WriteRawToArray()

This brings us to the first of two key realizations I had when analyzing the code:

1. All logic for serializing fields of any message is contained within the Protocol Buffer runtime library, NOT in the compiler-generated code

This is significant because it means that we don't have specialized encoding logic in the compiler-generated code for messages defined in a .proto file but rather a consistent mechanism for serializing fields: calling one of the six CodedOutputStream methods above. It's these methods that serve as templates for designing the datapath of the hardware accelerator. They also shed light on how data would eventually be communicated from the SoC's ARM Cortex-A9 CPU to the hardware accelerator, leading to my choice of designing the peripheral as an ARM AMBA AXI4 slave peripheral. Had the logic been part of the compiler-generated code, it might have been much more difficult, or even impossible, to design a single hardware accelerator that supports serialization for all Protocol Buffer applications.

The second key realization came from thinking about how all the supported field types map to these six methods:

2. Keys are encoded as varints, and values of all field types can be categorized as either varints or raw data

Let's look more closely at how the supported field types map to one of six wire types in the following table, pulled from the Encoding page:

Note, I chose to omit support for wire types 3 and 4 from the hardware accelerator design because groups have been deprecated starting with the proto3 version of the Protocol Buffers language, and we're working with version v3.0.2.

Varint encoding is used in three places: for field keys; for the values of wire type 0 fields (from the table: int32, int64, uint32, uint64, sint32, sint64, bool, enum); and for the size prefix of length-delimited fields.

Wire type 2 is used for length-delimited fields (string, bytes, embedded messages, packed repeated fields), and their values can be thought of as variable-sized raw data that simply needs to be copied byte-for-byte to an output buffer. Wire type 1 is used for 64-bit data (fixed64, sfixed64, double) whose payload is always 8 bytes, and wire type 5 is used for 32-bit data (fixed32, sfixed32, float) whose payload is always 4 bytes. Values of these two wire types are also copied byte-for-byte, so you can classify them as raw data as well.

This second insight, that all serialized data could be classified into two broad categories, led to a major simplification and optimization in the hardware accelerator design: I could build a datapath that consists of two parallel channels for processing incoming varints and raw data and stitch together the encoded data into a unified output buffer, preserving the order in which the fields are serialized. There isn't a precise way for me to explain how I came to this, and it's a prime example of how RTL design is an art. I took the time to fundamentally understand the operations being performed on data and how the data is moving at a level even lower than the abstraction provided above by the Protocol Buffer language.

I elaborate further on the hardware accelerator design and how it supports the various field types in the section, Design and Implement the hardware accelerator (FPGA peripheral).

A brief note on `perf`

If I could go back, I would also use perf at this stage to learn how frequently different methods are invoked and what percentage of total execution time the various codepaths account for, focusing on those belonging to message serialization. perf also yields important information on time spent in the user space application vs. core libraries and even time spent executing kernel code. It may very well be the case that a considerable amount of the execution time is spent in areas you weren't aware of and hence didn't target for acceleration (e.g., the overhead of calling memcpy() in the CodedOutputStream::Write*() methods when serializing raw data). With this information, you can gauge whether specialized hardware could even outperform the software, and if so, which codepaths you should target for optimization (i.e., design hardware to replace). It was a combination of lacking experience in analyzing system performance, thinking the paper provided sufficient motivation, and being overwhelmed with too many other unknowns in this project that led to skipping this step.

Hardware Development

If you've made it this far, congratulations! Now that we have our board selected, our development environment set up with the necessary EDA tools installed, and understand how Protocol Buffer serialization (and the code that performs it) works, we're ready to begin designing our hardware accelerator. It took a lot of preparation to get to this point, and in the following sections, we'll finally begin to realize our vision.

First, let's see how the six CodedOutputStream::Write*() methods identified in the last section are translated into a hardware accelerator (i.e., an RTL design or FPGA peripheral). Then, we'll integrate the newly developed FPGA peripheral into the larger Arria 10 Golden Hardware Reference Design (GHRD) where it'll serve as a co-processor to the ARM Cortex-A9 CPU onboard.

4. Design and Implement the hardware accelerator (FPGA peripheral)

In the last section, I claimed that understanding the software was the most important step of the process. Well, I think designing the hardware accelerator is the most interesting (and challenging) step, as much of it relies on problem solving, piecing together information from various sources, and creativity. Before I could begin designing the hardware accelerator (i.e., the top-level I/O ports, FSMs, and datapath of the RTL design), I first needed to understand how it would integrate into the larger Arria 10 SoC so that it could communicate with the ARM Cortex-A9 CPU onboard and I could interact with it via software running in a Linux environment. Knowing this would essentially define the interface (i.e., top-level I/O ports or external-facing signals) the hardware accelerator would have to implement and leave its internal design to my creativity.

If you've worked with FPGAs in the past to build simple digital designs, perhaps in the lab section of an introductory course on logic design, it's important to understand that building a hardware accelerator requires a different train of thought. It's not a standalone piece of hardware that we're implementing in the FPGA (e.g., a counter whose input is connected to a push button and output is displayed on a 7-segment display); we're designing an FPGA peripheral that's just one small component of a much larger, complex digital system: a computer. Specifically, we're designing a custom processor that performs Protocol Buffer serialization as instructed by an ARM Cortex-A9 CPU, which in turn executes a Protocol Buffer application. In this configuration, the hardware accelerator serves as a co-processor to the ARM CPU and is viewed as a memory-mapped FPGA peripheral from its perspective. What differentiates our FPGA design from other standalone projects is that, aside from the complexity of what we're building, its top-level I/O ports are solely connected to the system interconnect (i.e., system bus) which enables its discovery and access by the ARM CPU. That is, we're not interacting with the hardware accelerator directly but rather via software running on the ARM CPU. It may be troubling at first to think you're going to build a very sophisticated digital circuit whose only means of interaction is via instructions executed on an ARM CPU (sorry, no dip switches or LEDs here!), but rest assured, I'll show you how to methodically approach the problem and make it seem less like the leap of faith it was for me.

How did I come to realize that this is the role our hardware accelerator would play within the larger Arria 10 SoC? It required a lot of patience and piecing together information from various sources. The following is a list of separate initiatives I undertook that, although not comprehensive, I later identified as key contributors to obtaining the information I needed to begin designing the hardware accelerator:

Going through several free online courses Intel provides in their FPGA Technical Training curriculum, which covers everything from the general FPGA design process to how their EDA tools are used at the various stages of design, simulation, and testing (functional verification, static timing analysis, etc.) and more
Going through RocketBoards.org's series of Getting Started tutorials and spending time to fundamentally understand the Arria 10 GHRD architecture, design files, Quartus Prime and Qsys projects, example FPGA peripherals and how they're accessed from Linux. I realized the GHRD would be the starting point for my own Qsys system design where the hardware accelerator would be added to it as another component
Going through the Altera SoC Workshop Series presentations on software development for Altera SoCs and writing Linux device drivers for memory-mapped FPGA peripherals
Learning about the FPGA-to-HPS, lightweight HPS-to-FPGA, and HPS-to-FPGA interfaces that provide a means of connecting the Arria 10 SoC's Hard Processor System (HPS) (~synonymous to "ARM CPU") to designs residing in the FPGA fabric. Here I found that certain address spaces are reserved for communicating with FPGA peripherals over these interfaces and learned how to assign addresses to FPGA peripherals in a Qsys system
Going through one course in particular from the Advanced Hardware section of Intel's FPGA Technical Training curriculum that connected everything I learned up to that point and gave me the last bit of information I needed to be able to confidently begin designing the hardware accelerator
(and finally) Reading the AMBA AXI and ACE Protocol Specification to learn about the system bus (or interconnect) used to build ARM-based SoCs and specifically, AXI4 interfaces and the signals that comprise them (since it was determined this is the interface our hardware accelerator would implement!)

The remainder of this section is broken down into three parts. First, I'll elaborate on the initiatives above and walk through the resources and useful exercises that helped me gain intuition on how hardware accelerators are built on Arria 10 SoC platforms. I'll conclude this part with a discussion on the course that put all the pieces together followed by an overview of the AMBA AXI4 interface. Next, I'll present the hardware accelerator's RTL design (high-level architecture, pipeline stages, datapaths, FSMs, etc.) and describe how it implements the CodedOutputStream::Write*() methods identified in the last section. In the last part, I'll show how to implement the design in Quartus Prime (using Verilog to capture its behavior) and how to verify its functional correctness by using ModelSim-Intel FPGA to run gate-level simulations.

a. Building hardware accelerators on the Arria 10 SoC platform

Let's take a look at the online training, examples, and other available resources that provided the necessary context for building hardware accelerators on the Arria 10 SoC platform.

i. Intel FPGA Technical Training (and other online resources)

If you don't have prior experience working with FPGAs or using design tools such as Quartus Prime, then Intel's FPGA Technical Training curriculum is a great starting point. Intel provides several free online courses that cover almost every aspect of the FPGA design process: from the history of programmable logic devices to the use of HDLs (Verilog, VHDL) for capturing circuit behavior, tutorials on using Quartus Prime, Qsys, and other tools at the various stages (and levels) of design, performing static timing analysis, using on-chip logic analyzers for debugging, and much more. Take some time to go through the catalog and select the courses that you feel are necessary to fill gaps in your skillset. Of the courses available, below are the ones that I found relevant to this project. If you're already familiar with a particular topic or tool, then feel free to skip the training. At a minimum, I recommend going through the bolded courses as a refresher on how to use Verilog and Quartus Prime to implement RTL designs and Qsys to build complete digital systems.

(Optional) Background information on programmable logic and FPGAs

Verilog, Quartus Prime

ModelSim-Intel FPGA, writing testbenches, functional verification

Overview of Mentor Graphic's ModelSim Software
EECS 270: Quartus Software Tutorial (focusing on section C. Simulation)

Static timing analysis, TimeQuest Timing Analyzer, SDC Constraints

Qsys, system design

Introduction to Qsys
Creating a System Design with Qsys

Leaving this section, you should have a solid grasp on writing behavioral and structural Verilog, going from design entry to a generated programming file using Quartus Prime, writing Verilog testbenches, running gate-level simulations using ModelSim (for functional verification), and using Qsys to build complete digital systems. Static timing analysis is also an important step in the design process, but I'll be honest: I skipped it. I was pretty confident that the various paths in my design would meet basic timing requirements, and luckily I was right. Unless you have experience building sophisticated digital circuits, I don't recommend skipping this step.

ii. RocketBoards.org

RocketBoards.org is another great source of general information, tutorials, example designs, source code, etc. that help you get up and running with your SoC FPGA project. This website also provides access to a community of other Intel SoC FPGA developers for project collaboration and assistance with questions you may have (see the RocketBoards.org forum). You'll find example FPGA designs, development kit-specific reference system designs (e.g., Arria 10 GSRD and GHRD), Arria 10 (and other board) documentation, and a series of Getting Started tasks that help familiarize you with the Arria 10 SoC Development Kit:

Take some time to go through this website and become familiar with the various resources available. Once you've done that, I recommend navigating to the Getting Started page, selecting the proper Board and Tool Version, and going through the various Tasks to become familiar with the Arria 10 SoC Development Kit and its various hardware and software components. Although all of the tasks provide useful information, I recommend completing:

1 - Booting Linux from SD Card
2 - Sample Linux Applications

for now which show how to boot Linux on the Arria 10 SoC Development Kit for the first time and how to interact with the Arria 10 GHRD's FPGA peripherals through the example Linux applications provided. Tasks 4, 5, 7, 8, and 9 are also relevant, but hold off on them for now; we'll revisit them in later sections of the tutorial after we've built the hardware accelerator.

Once you've gone through tasks 1 and 2 above and have successfully interacted with the FPGA peripherals (e.g., LED PIO module) - components of the Arria 10 GHRD - a useful exercise is to apply what you've learned in the previous section about Quartus Prime and Qsys to compile the Arria 10 GHRD Quartus Prime project and learn about its architecture by opening and inspecting the GHRD's main Qsys subsystem. The Arria 10 GHRD project files can be downloaded here for version 17.1 of the Quartus Prime and Qsys toolset (see screenshot below):

Become intimate with its architecture, various components, connections, interfaces, signals, top-level I/O, addresses assigned to the FPGA peripherals (i.e., soft IP), and design files. For example, you should leave this section knowing not only how to use the provided Linux applications to toggle the LEDs ON/OFF, but also the connections in the Arria 10 GHRD between the HPS and LED PIO IP core and assigned addresses that make this interaction possible. Refer to Chapter 22 - PIO Core of the Embedded Peripherals IP User Guide to learn more about Parallel Input/Output (PIO) IP cores and how one is configured and instantiated in the Arria 10 GHRD to control the external LEDs. Also, see the screenshots below which highlight parts of the Qsys system design relevant to controlling LEDs.

This is an important prerequisite for the next step, System integration (Arria 10 GHRD), as we'll add the hardware accelerator we develop here to the Arria 10 GHRD!

iii. Altera SoC Workshop Series

At this point, you should be familiar with the FPGA design process, using Quartus Prime and Qsys to implement FPGA and system designs respectively, running Linux on the Arria 10 SoC Development Kit, and the Arria 10 GHRD architecture. Next, it's important to understand how Linux applications interact with user-designed FPGA peripherals, or soft IP, such as the PIO IP core that controls the LEDs and the hardware accelerator that we're going to build and add to the GHRD. Where can we obtain such an understanding? The Altera SoC Workshop Series is your one-stop shop for learning about the mechanisms that enable this user space interaction with custom FPGA designs, such as:

the FPGA to HPS, LW HPS to FPGA, and HPS to FPGA bridges that provide a means of communication between the Arria 10 SoC's Hard Processor System (HPS) and FPGA peripherals
a section of the ARM CPU's physical address map that's reserved for memory-mapped I/O
and writing Linux device drivers for memory-mapped FPGA peripherals (workshop #3 covers general Linux device driver techniques and walks through the source code of drivers written for the GHRD's FPGA peripherals)

The Altera SoC Workshop Series includes three workshops:

Each workshop consists of presentation material (i.e., a slide deck) and a take-home lab. Don't bother with the labs; they're designed for a different development board, and this tutorial essentially serves as a replacement. Although all three workshops provide useful information, there's a lot of redundancy and information that applies to later sections of this tutorial. Therefore, I'll do my best to summarize the information that's relevant to this section. It may be a good idea to skim through the presentations anyway, but don't worry about trying to understand every detail. Note, I'm focusing on workshops 1 and 2 here; workshop 3 covers Linux driver development and is the main material of section 7. Write a device driver (interface between FPGA peripheral and user space application).

Some of the information found (only?) in these presentations is so fundamental to understanding how FPGA designs interact with the Arria 10 SoC's HPS that it makes me wonder why it's buried in such an obscure location and not developed further; some of the concepts described below I had to deduce myself. If everything in this and other sections of a. Building hardware accelerators on the Arria 10 SoC platform were succinctly contained in some README file, a "Quick Start Guide", or perhaps a "What You Need to Know for Building FPGA Designs on the Arria 10 SoC" tutorial, it would've made reaching the design phase of this project significantly less frustrating. Without further ado, let's take a look at the key takeaways from workshops 1 and 2.

Below are four different Arria 10 SoC block diagrams (why there are four, I'm not quite sure), and each one presents a slightly different view of its architecture, connections, system bus, etc. Let's see what we can learn from each, starting with the first:

In the diagram above, we see the Arria 10 SoC's two main components: the Hard Processor System (HPS) in light blue and FPGA in light green. Also shown are the hardware blocks that constitute the HPS, including the two ARM Cortex-A9 CPUs at its heart. Enclosed in the red rectangle are two components of particular interest: the HPS to FPGA and FPGA to HPS bridges. Knowing that we'll eventually want the Protocol Buffer application (running in the HPS) to send data to our hardware accelerator (running in the FPGA) and vice versa, it appears we'll have to communicate this data over one or more of these bridges. Let's take a look at the next diagram to see what other details we can infer about these bridges:

Starting at the ARM Cortex-A9 MPCore block, we see that its L2 Cache is connected to an external block, the system Interconnect, which in turn connects to the HPS to FPGA and FPGA to HPS bridges (here they incorrectly appear to be components of the FPGA; don't let that bother you). What does this tell us? Well, if you're familiar with the basic principles of computer organization and design and memory hierarchies, then it appears that the ARM CPUs have a path to the FPGA by means of executing instructions that store/load data to/from memory. For example, for writes, data propagates through the CPU's L1 cache to the shared L2 cache, and instead of being directed to the system's main memory via the Multi-port DDR SDRAM Controller, it'll make its way through the system Interconnect and to the FPGA via the HPS to FPGA bridge. Alright, now we're getting somewhere! Let's see how we can build on this in the next diagram.

Note, "CORE" is synonymous to "FPGA" in the LW HPS TO CORE BRIDGE component (they were lazy with the labeling). Notice that in the first two diagrams, there are two arrows extending from the HPS to FPGA bridge. Here, we see that it actually consists of two separate components: a (lightweight) LW HPS to FPGA bridge and a (full or high-performance) HPS to FPGA bridge. We're also given new information about all three bridges:

LW HPS to FPGA is labeled AXI 32
HPS to FPGA is labeled AXI 32/64/128 300 MHz
FPGA to HPS is also labeled AXI 32/64/128 300 MHz

AXI refers to ARM's Advanced Microcontroller Bus Architecture (AMBA) Advanced eXtensible Interface (AXI). It's an interface specification that functional blocks (e.g., our hardware accelerator) of an ARM-based SoC design implement to facilitate communication over the system interconnect. The LW HPS to FPGA bridge implements a 32-bit AXI interface, and both the HPS to FPGA and FPGA to HPS bridges have configurable 32-, 64-, or 128-bit AXI interfaces that operate at 300 MHz (I'm guessing the lightweight bridge's interface does too). Although it may not be immediately apparent, this implies that our hardware accelerator will need to implement an AXI interface in order to communicate with the ARM CPU over one of these bridges. Let's see what we can learn from the last diagram:

This block diagram shares the most detailed information yet. Here, we see the L3 Interconnect consists of an L3 Main Switch (which has a 64-bit AXI bus connection to the HPS to FPGA bridge) and an L3 Slave Peripheral Switch (which has a 32-bit AXI bus connection to the LW HPS to FPGA bridge). For clarity, I've marked the paths from the ARM CPU to these bridges in red and blue rectangles, respectively.

We gain another key insight from this diagram: the HPS to FPGA and LW HPS to FPGA bridges are designed to communicate with slave peripherals in the FPGA, whereas the FPGA to HPS bridge is designed to communicate with master peripherals in the FPGA. This leaves us with the choice of designing the hardware accelerator as an AXI slave or master component depending on its role in the design and which bridge we choose to interface with the HPS. I'm getting ahead of myself, but in the next section I'll describe how one course in particular along with my knowledge of the Arria 10 GHRD led to the epiphany that ARM CPUs are AXI masters, and hence, the hardware accelerator needs to be designed as an AXI slave peripheral. Also, I decided to use the HPS to FPGA bridge over its lightweight variant because why not?

One question remains: how does the ARM CPU send data to the HPS to FPGA bridge? The answer's found in the Arria 10 SoC's physical memory map.

Earlier I claimed that the ARM CPU has a path to the FPGA by means of executing instructions that store/load data to/from memory. From its perspective, any address it writes to (or reads from) specifies a location in the system's main memory. That is, it's completely unaware of the L1 and L2 Cache, L3 Interconnect, SDRAM Controller Subsystem, or even the fact that it's using virtual addresses which the memory management unit (MMU) must translate into physical addresses before the SDRAM controller can map them to actual locations in the SDRAM chips that constitute the main memory. In this abstraction, let's see how data written to certain addresses is directed to FPGA peripherals over the LW HPS to FPGA and HPS to FPGA bridges rather than the system's memory.

Above we see the Arria 10 SoC's physical memory map from various perspectives, including the L3 Interconnect, the MPU, and the FPGA's direct view of SDRAM (different from that of the FPGA-to-HPS Bridge). Note, MPU stands for Microprocessor Unit, the name Intel's HPS documentation gives to the Cortex-A9 subsystem; for our purposes, it's fine to think of it as the "HPS" or simply "ARM CPU". Focusing on either of the first two columns, we learn something significant: certain address spaces are reserved for communicating with FPGA slave peripherals over the HPS to FPGA (light-blue region) and LW HPS to FPGA (orange region) bridges. That is, the HPS accesses FPGA peripherals by means of memory-mapped I/O. The next slide gives us the sizes of these two address spaces (which oddly aren't shown in the one above), and hence we can deduce their boundaries:

With 960 MB of address space reserved for FPGA slave peripherals over the HPS to FPGA bridge, and 2 MB reserved for the LW HPS to FPGA bridge, we arrive at the following:

Interface	Size of Address Space	Start Address	End Address	Calculation
HPS to FPGA	960 MB	`0xc0000000`	`0xfbffffff`	(3*2^30 + 960*2^20) - 1
LW HPS to FPGA	2 MB	`0xff200000`	`0xff3fffff`	(0xff200000 + 2*2^20) - 1
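
The arithmetic in the table can be double-checked with a few lines of Python. This is only a sketch: the start addresses and sizes come from the table above, the slide's "MB" is assumed to mean 2^20 bytes, and the `region` and `bridge_for` helpers are hypothetical names introduced purely for illustration.

```python
MIB = 2**20  # assumption: the slide's "MB" means 2^20 bytes

def region(start, size_mib):
    """Return (start, end) of an address region; end is inclusive."""
    return start, start + size_mib * MIB - 1

# Start addresses and sizes taken from the table above.
h2f_start, h2f_end = region(0xC0000000, 960)
lwh2f_start, lwh2f_end = region(0xFF200000, 2)

def bridge_for(addr):
    """Name the bridge whose address window contains addr, or None."""
    if h2f_start <= addr <= h2f_end:
        return "HPS to FPGA"
    if lwh2f_start <= addr <= lwh2f_end:
        return "LW HPS to FPGA"
    return None

print(f"HPS to FPGA:    {h2f_start:#010x} - {h2f_end:#010x}")
print(f"LW HPS to FPGA: {lwh2f_start:#010x} - {lwh2f_end:#010x}")
```

Any physical address that falls inside one of these two windows is routed to the corresponding bridge instead of SDRAM, which is what `bridge_for` models.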

A funny sidenote: I noticed at the time of creating this tutorial that the slide above, from Workshop 1, also appears in Workshop 2. However, the title has been modified, the last bullet conveys more specific information, and two additional bullets have been inserted that provide useful information I later discovered on my own.

Hopefully your understanding of how Linux applications access FPGA peripherals on the Arria 10 SoC platform is much clearer now. We learned that the HPS, running the Protocol Buffer application, can communicate with AXI slave peripherals in the FPGA, like our hardware accelerator, over an HPS to FPGA bridge by reading/writing to addresses within a certain address space. There are still a few details that elude us (e.g., how do we assign addresses to FPGA peripherals? How do we access these physical addresses from user space applications?), but rest assured, we'll find these answers in the next subsection and in the later section, 7. Write a device driver (interface between FPGA peripheral and user space application).
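
To make memory-mapped I/O a little more concrete, here is a minimal Python sketch of what reading and writing a 32-bit register through a memory mapping looks like. It is not the approach this tutorial takes (that's the device driver in section 7), and it makes several assumptions: the helper names `write_word` and `read_word` are hypothetical, little-endian byte order is assumed, and an anonymous mapping stands in for the real one so the code runs on any machine. On the board, a mapping of the HPS to FPGA window would typically be obtained from /dev/mem (requires root) or from a driver, and real code would more likely be written in C to control the width of each bus access.

```python
import mmap
import struct

H2F_BASE = 0xC0000000     # start of the HPS to FPGA window (from the table above)
MAP_SIZE = mmap.PAGESIZE  # one page is plenty for a handful of registers

def write_word(mem, offset, value):
    """Store a 32-bit little-endian word at a byte offset within the mapping."""
    struct.pack_into("<I", mem, offset, value)

def read_word(mem, offset):
    """Load a 32-bit little-endian word from a byte offset within the mapping."""
    return struct.unpack_from("<I", mem, offset)[0]

# Assumption: on the board this would be something like
#   fd = os.open("/dev/mem", os.O_RDWR | os.O_SYNC)
#   mem = mmap.mmap(fd, MAP_SIZE, offset=H2F_BASE)
# Here an anonymous mapping stands in for the FPGA peripheral.
mem = mmap.mmap(-1, MAP_SIZE)

write_word(mem, 0x0, 0xDEADBEEF)  # on hardware, this store would travel over the bridge
value = read_word(mem, 0x0)       # on hardware, this load would read the peripheral
print(hex(value))
mem.close()
```

The point is only that, once a mapping exists, talking to the peripheral is nothing more than loads and stores at offsets from a base address.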

iv. Putting the pieces together

Custom IP Development Using Avalon and AXI Interfaces --> "ah-ha! moment": develop hardware accelerator as an AXI4 slave, integrate into Arria 10 GHRD system as a memory mapped FPGA peripheral communicating via HPS2FPGA bridge :D
Custom IP Development Using Avalon and AXI Interfaces

I've kinda ruined the surprise already...

The ARM CPU (AXI master) initiates a sequence of write transactions sending data to the hardware accelerator (AXI slave) for it to serialize. The hardware accelerator must also be capable of responding to subsequent read transactions, sending serialized data from its output buffer to the ARM CPU.
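
The write-then-read flow described above can be mimicked in software with a toy model. This is a sketch under loud assumptions: `ToySlave` is a hypothetical class that has nothing to do with the actual RTL, it ignores AXI channels and handshaking entirely, and it uses Protocol Buffers' base-128 varint encoding of unsigned 32-bit values as a stand-in for "serialization".

```python
from collections import deque

def encode_varint(value):
    """Base-128 varint encoding of an unsigned integer (Protocol Buffers wire format)."""
    out = bytearray()
    while True:
        byte = value & 0x7F          # take the low 7 bits
        value >>= 7
        if value:
            out.append(byte | 0x80)  # set the continuation bit: more bytes follow
        else:
            out.append(byte)         # last byte: continuation bit clear
            return bytes(out)

class ToySlave:
    """Hypothetical stand-in for the accelerator: writes go in, serialized bytes come out."""

    def __init__(self):
        self.output_buffer = deque()

    def write(self, word):
        # Models a write transaction: the master sends a 32-bit value to serialize.
        self.output_buffer.extend(encode_varint(word & 0xFFFFFFFF))

    def read(self):
        # Models a read transaction: return the next serialized byte, or None if empty.
        return self.output_buffer.popleft() if self.output_buffer else None

slave = ToySlave()
for word in (1, 300):   # the "ARM CPU" issues a sequence of writes...
    slave.write(word)

result = bytearray()    # ...then reads the serialized data back
byte = slave.read()
while byte is not None:
    result.append(byte)
    byte = slave.read()

print(result.hex())     # 01ac02
```

Here the value 1 encodes to the single byte 0x01 and 300 encodes to 0xac 0x02, so the master reads back three bytes in total.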

v. AMBA AXI and ACE Protocol Specification

Read the ARM AMBA AXI4 specification

Now that we know what needs to be done/how it fits, we're ready to design the hardware accelerator.

b. The hardware accelerator RTL design: an AXI4 slave peripheral

As a prerequisite to understanding the hardware accelerator design in the following section, I'm assuming you have a background in or basic understanding of the principles of logic design and digital systems, combinational vs. sequential logic, controllers (FSMs), datapaths, RTL design, the use of HDLs to implement hardware at a behavioral level, the concept of pipelining, and designing custom processors. I've essentially listed material covered in the book, Digital Design by Frank Vahid, and I recommend reviewing this material before proceeding. If at any point you find yourself unfamiliar with a particular topic, you can refer to this book or other sources to learn more and continue with the tutorial.

Began designing FSM... quickly realized read/write transactions are independent, require separate FSMs --> pipelined processor design
Go over the design of the processor in detail (processor architecture, top-level I/O, FSMs, datapaths)

c. Implementing the RTL design using Quartus Prime and ModelSim-Intel FPGA

Writing custom RTL vs. OpenCL?
Verilog-2001 vs. SystemVerilog?
Already enough complexity: see what's available in the IP Catalog (used FIFO IP Cores)
What determines top-level I/O?: Qsys-generated HDL skeleton for AXI4 slave interface :D
Quartus Prime project 'protobuf-serializer' --> design entry (Verilog), compilation
ModelSim-Intel FPGA Edition: testbenches + gate-level simulation for functional verification
Woo hoo! Time to integrate...

Open a terminal and download the entire Firework repository.

cd ~/workspace
git clone https://github.com/att-innovate/firework.git

Open the Quartus Prime Standard Edition software.

quartus &

Luckily, we don't have to create a new project using the New Project Wizard. protobuf-serializer from the Firework repository is a complete Quartus Prime project that we can open, containing the RTL for our hardware accelerator, Quartus Prime Project File (.qpf), and Quartus Prime Settings File (.qsf) which contains all project-wide assignments and settings (e.g., the 10AS066N3F40E2SG device we're targeting). Note that for historical reasons, the (.qpf) and (.qsf) files are named varint-encoder.qpf and varint-encoder.qsf, instead of protobuf-serializer.qpf and protobuf-serializer.qsf, respectively. varint-encoder is the name I originally gave to the hardware accelerator thinking it would only process 32-bit varint data before it eventually morphed into a much more sophisticated piece of hardware. From the Home screen, select Open Project.

5. System integration (Arria 10 GHRD)

Intro
- Arria 10 GHRD, Qsys
- Training that helps: Custom IP Development Using Avalon and AXI Interfaces
- Interfaces (clock, reset, interrupts, Avalon, AXI, conduits)
- Most powerful Qsys tool: auto-generated interconnect (you develop an AXI slave interface, simply connect to AXI master component)
Training (Intel online training)
- Intel online training
- Rocketboards.org
  - https://rocketboards.org/foswiki/Documentation/A10GSRDV160CompilingHardwareDesign
Tutorial
- [a10-soc-devkit-ghrd]
- Adding protobuf-serializer to the GHRD!

Software Development

The operating system, device driver, user space application

6. Create an FPGA peripheral-aware Linux image

Intro
- Discuss why running Linux is important (mimics real datacenter setting)
- Talk about Yocto Project, embedded Linux, etc.
- Angstrom Linux distribution maintained for the Arria 10, other Altera boards
- Working with Linux source code, configuring, compiling, and zImage for bootable microSD card
- Overview of the boot process
- Rocketboards.org training on creating the U-Boot bootloader, Linux device tree, rootfs, and formatting the microSD card
Training
- Rocketboards.org
Tutorial
- [linux-socfpga]
- Configure and compile Linux kernel (cross-compile, see link below)
  - https://rocketboards.org/foswiki/Documentation/A10GSRDV160CompilingLinuxKernel
- Steps to create bootable microSD card (Rocketboards.org training, repeat here or tell user to follow?)

7. Write a device driver (interface between FPGA peripheral and user space application)

Intro
- Overview of the driver I wrote (misc device driver)
- Journey from writing to memory address --> input at hw-acc
Training
- Altera SoC Workshop Series
Tutorial
- [driver]
- Setting up driver environment on the Arria 10
- Installing linux-socfpga source (same source used to create zimage in step 6.)!
- setting up kbuild environment

8. Closing the loop: modify the user space application

Intro
- Device driver provides the interface
- Replace functions implementing computation w/ statements sending data to FPGA peripheral
Training
- http://tldp.org/HOWTO/Program-Library-HOWTO/shared-libraries.html
Tutorial
- [att-innovate/protobuf]
- building & installing google/protobuf on the Arria 10
- building & installing att-innovate/protobuf on the Arria 10

Measuring System Performance

We're done building the hardware-accelerated system. Now let's see how its performance compares to that of the standard system.

9. Profiling the hardware-accelerated system

Training
- http://www.brendangregg.com/perf.html
Intro
- clock_gettime(CLOCK_PROCESS_CPUTIME_ID, &res)
- perf
- symbols, stack traces, cross-compilation
Tutorial
- [profiling]
- switching between std & hw-acc libraries
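
As a rough illustration of the clock_gettime(CLOCK_PROCESS_CPUTIME_ID, ...) measurement listed above, here is a minimal Python sketch that uses the same clock through the standard library's time.clock_gettime (available on Unix-like systems). The `cpu_time_of` helper and the `busy_work` workload are hypothetical placeholders; in the real measurements the timed region would be the serialization calls of the standard and hardware-accelerated libraries.

```python
import time

def cpu_time_of(func, *args):
    """Run func(*args) and return (result, CPU seconds this process spent running it)."""
    start = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
    result = func(*args)
    end = time.clock_gettime(time.CLOCK_PROCESS_CPUTIME_ID)
    return result, end - start

def busy_work(n):
    # Hypothetical stand-in for the code being profiled.
    return sum(i * i for i in range(n))

result, elapsed = cpu_time_of(busy_work, 100_000)
print(f"CPU time: {elapsed:.6f} s")
```

Because this clock counts CPU time consumed by the process rather than wall-clock time, time spent sleeping or waiting is not included in the measurement.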


License
