Accelerating the Number Theoretic Transform for High-Performance Privacy-Preserving Machine Learning Using NVIDIA GPUs
Privacy concerns in today's high-performance machine learning systems have thrust privacy-preserving computation into the spotlight, and the gap between homomorphic encryption (HE) and machine learning must be bridged. HE is a cryptographic system that enables computation directly on encrypted data, giving users strong privacy (and security) guarantees while they use the same services they enjoy today unprotected. While promising, HE has seen little adoption due to extremely high computational overheads that render it impractical. Previous research shows that the Number Theoretic Transform (NTT) is one of the bottlenecks in high-performance privacy-preserving machine learning systems. In this project, I accelerate this algorithm using CUDA on NVIDIA GPUs.
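For context, the NTT is a discrete Fourier transform over a finite field, computed with the same butterfly structure as the FFT but using modular arithmetic instead of complex numbers. The minimal sketch below shows one decimation-in-time butterfly stage as a CUDA kernel; the modulus, twiddle layout, and kernel name are illustrative assumptions for this README, not the parameters or code of the inplace_dif, inplace_dit, and pease variants benchmarked below.

```cuda
#include <cstdint>

// Illustrative only: 998244353 = 119 * 2^23 + 1 is a common NTT-friendly prime;
// the moduli used in HE parameter sets (and in this repository) may differ.
__device__ __constant__ uint32_t MOD = 998244353u;

__device__ uint32_t mod_mul(uint32_t a, uint32_t b) {
    return (uint32_t)(((uint64_t)a * b) % MOD);   // product fits in 64 bits
}

// One decimation-in-time butterfly stage of an in-place NTT of length n.
// Assumes the input is already in bit-reversed order and twiddles[k] holds
// w^k for a primitive n-th root of unity w modulo MOD.
// `half` is the butterfly half-length of the current stage (1, 2, 4, ..., n/2).
__global__ void ntt_dit_stage(uint32_t* a, const uint32_t* twiddles, int n, int half) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= n / 2) return;                     // one thread per butterfly
    int group = tid / half;                       // which butterfly group
    int j     = tid % half;                       // offset inside the group
    int i0    = group * 2 * half + j;             // "top" element
    int i1    = i0 + half;                        // "bottom" element
    uint32_t w = twiddles[j * (n / (2 * half))];  // stride into the twiddle table
    uint32_t u = a[i0];
    uint32_t v = mod_mul(a[i1], w);
    a[i0] = (u + v) % MOD;
    a[i1] = (u + MOD - v) % MOD;
}
```

A full transform would launch such a stage log2(n) times with half = 1, 2, ..., n/2; the DIF and Pease variants reorder the same butterflies to change the memory-access pattern.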
Navigate to each folder and open the header file to change hyperparameters such as the maximum number of threads and the maximum number of blocks launched each time the kernel is invoked (a hypothetical sketch of such a header follows the commands below). Finally, run these commands:
make clean
make all
./ntt
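For reference, the tunable constants might look like the following; the names and default values here are hypothetical, and the actual header in each folder defines its own identifiers.

```cuda
// Hypothetical names -- check the header file in each folder for the real identifiers.
#define MAX_THREADS 512    // maximum threads per block for each kernel launch
#define MAX_BLOCKS  1024   // maximum blocks launched per kernel invocation
```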
Results were measured with CUDA 11.6 and CUDA 11.3 on an Ubuntu 20.04 system with NVIDIA driver version 470.19. The hardware is an NVIDIA GTX 1080 GPU and an Intel Core i7-8700 CPU with 32 GB of memory.
Max threads per block | inplace_dif (ms) | inplace_dit (ms) | pease (ms) |
---|---|---|---|
1 | 3.1673 | 3.13434 | 0.253502 |
2 | 1.68092 | 1.68709 | 0.250495 |
4 | 0.831348 | 1.01629 | 0.234751 |
8 | 0.472502 | 0.641575 | 0.295248 |
16 | 0.348652 | 0.51664 | 0.228345 |
32 | 0.23294 | 0.403408 | 0.235941 |
64 | 0.185696 | 0.382066 | 0.232705 |
128 | 0.156169 | 0.336544 | 0.235728 |
256 | 0.170691 | 0.32358 | 0.234253 |
512 | 0.186446 | 0.338545 | 0.255258 |
1024 | 0.237379 | 0.408641 | 0.255612 |
Table 1: Total time (ms) for the CUDA implementations using memory copy, as a function of the maximum number of threads per block
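If the memcpy totals include the host-device transfers, a measurement along the following lines would produce them. This is only a sketch: `ntt_forward` is a hypothetical placeholder for whichever kernel sequence (inplace_dif, inplace_dit, or pease) is being timed.

```cuda
#include <cstdint>
#include <cuda_runtime.h>

// Sketch: bracket copy-in, kernel launches, and copy-out with CUDA events.
float time_ntt_with_memcpy(uint32_t* h_data, uint32_t* d_data, size_t n) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    cudaMemcpy(d_data, h_data, n * sizeof(uint32_t), cudaMemcpyHostToDevice);
    // ntt_forward(d_data, n);   // hypothetical: launch the NTT kernels here
    cudaMemcpy(h_data, d_data, n * sizeof(uint32_t), cudaMemcpyDeviceToHost);
    cudaEventRecord(stop);

    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;   // total elapsed time in milliseconds
}
```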
Algorithm | Total time, memcpy (ms) | Total time, shared memory (ms) |
---|---|---|
inplace_dif | 0.186446 | 0.224256 |
inplace_dit | 0.338545 | 0.108614 |
pease | 0.255258 | 0.329698 |
Table 2: Total time for each algorithm with a maximum of 512 threads per block
Algorithm | Total time, memcpy (ms) | Total time, shared memory (ms) |
---|---|---|
inplace_dif | 0.237379 | 0.227119 |
inplace_dit | 0.408641 | 0.132228 |
pease | 0.255612 | 0.308603 |
Table 3: Total time for each algorithm with a maximum of 1024 threads per block
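Tables 2 and 3 suggest that staging data in shared memory helps inplace_dit in particular. The sketch below illustrates the general idea, reusing the illustrative prime and twiddle convention from the butterfly sketch above; the kernel name, tile size, and structure are assumptions, not the repository's actual shared-memory implementation. One block copies a contiguous tile into shared memory, runs every butterfly stage that stays inside the tile, and writes the tile back, so those stages touch global memory only twice per element.

```cuda
#include <cstdint>

#define TILE 1024            // assumed tile size; must be a power of two
#define P 998244353u         // same illustrative NTT prime as in the sketch above

__global__ void ntt_tile_shared(uint32_t* a, const uint32_t* twiddles, int n) {
    __shared__ uint32_t s[TILE];
    int base = blockIdx.x * TILE;

    // Stage the tile in shared memory.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        s[i] = a[base + i];
    __syncthreads();

    // Run every butterfly stage whose span fits inside the tile.
    for (int half = 1; half < TILE; half <<= 1) {
        for (int t = threadIdx.x; t < TILE / 2; t += blockDim.x) {
            int group = t / half, j = t % half;
            int i0 = group * 2 * half + j, i1 = i0 + half;
            uint32_t w = twiddles[j * (n / (2 * half))];
            uint32_t u = s[i0];
            uint32_t v = (uint32_t)(((uint64_t)s[i1] * w) % P);
            s[i0] = (u + v) % P;
            s[i1] = (u + P - v) % P;
        }
        __syncthreads();
    }

    // Write the transformed tile back to global memory.
    for (int i = threadIdx.x; i < TILE; i += blockDim.x)
        a[base + i] = s[i];
}
```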
This work was supported in part by the Applications Driving Architectures (ADA) Research Center, a JUMP Center co-sponsored by SRC and DARPA. This research was also developed with funding from the Defense Advanced Research Projects Agency (DARPA), under the Data Protection in Virtual Environments (DPRIVE) program, contract HR0011-21-9-0003. The views, opinions, and/or findings expressed are those of the author and should not be interpreted as representing the official views or policies of the Department of Defense or the U.S. Government.