We won third prize in Phase 2 of the PETs Prize Challenge!
The award was announced at the Summit for Democracy, as reported by The White House.
The past decade has witnessed a surge in financial crimes across the public and private sectors, with scams costing financial institutions an average of $102m in 2022.
Developing a mechanism for battling financial crimes is a pressing task that requires in-depth collaboration among multiple institutions, yet such collaboration poses significant technical challenges due to the privacy and security requirements of distributed financial data. For example, consider the Society for Worldwide Interbank Financial Telecommunication (SWIFT) system, which generates 42 million transactions per day across its 11,000 global institutions. Training a model to detect fraudulent transactions requires not only secured SWIFT transactions but also the private account activities of those involved in each transaction, held in the corresponding bank systems.
The distributed nature of both samples and features prevents most existing learning systems from being directly adopted to handle the data mining task. In this research, we collectively address these challenges by proposing a hybrid federated learning system (HyFL) that offers secure and privacy-aware learning and inference for financial crime detection.
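To make the "distributed samples and features" point concrete, here is a toy illustration. It is not HyFL's protocol, and every field name and value in it is made up for the example. It only shows that scoring a single transaction needs data held by the transaction network and by two different banks, none of which can see the others' records.

```python
from typing import Dict, List, Set

# Hypothetical toy data. Held by the transaction network: one row per transaction.
swift_transactions: List[Dict] = [
    {"tx_id": "t1", "sender_account": "A1", "receiver_account": "B7", "amount": 1200.0},
    {"tx_id": "t2", "sender_account": "A2", "receiver_account": "B7", "amount": 50.0},
]

# Hypothetical toy data. Held privately by each bank: one row per account.
bank_accounts: Dict[str, Dict[str, Dict]] = {
    "bank01": {"A1": {"flagged": 0}, "A2": {"flagged": 1}},
    "bank02": {"B7": {"flagged": 0}},
}


def owning_bank(account: str) -> str:
    """Return the bank that holds an account, or raise KeyError if none does."""
    for bank, accounts in bank_accounts.items():
        if account in accounts:
            return bank
    raise KeyError(account)


def parties_needed(tx: Dict) -> Set[str]:
    """Parties whose private data contributes features for one transaction."""
    return {
        "swift",
        owning_bank(tx["sender_account"]),
        owning_bank(tx["receiver_account"]),
    }


if __name__ == "__main__":
    for tx in swift_transactions:
        print(tx["tx_id"], sorted(parties_needed(tx)))
```

Accounts are split across banks (a partition over samples), while transaction fields and account fields live with different parties (a partition over features). A system for this setting has to handle both at once.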
We conduct extensive empirical studies to evaluate the proposed framework's detection performance and privacy-protection capability, as well as its robustness against common malicious attacks on collaborative learning. More details can be found in our paper.
The following two diagrams show the training and testing communication flows, respectively.
The implementation of HyFL is based on the PETs Prize Challenge runtime and Flower. To run HyFL, you will need Docker installed on your system. You will also need to download the datasets for training and testing.
The datasets are available on the PETs competition website. This repository contains a data/ directory. When running commands to test the solution locally, the contents of this directory will be mounted into the launched Docker container. This allows you to do local evaluation using the challenge's development data.
The data/ directory has been prepopulated with some example directory scaffolding to copy the data into. It should look like this:
data
└── fincrime/
    ├── centralized/
    │   ├── test/
    │   │   └── data.json
    │   └── train/
    │       └── data.json
    ├── scenario01/
    │   ├── test/
    │   │   ├── bank01/
    │   │   ├── partitions.json
    │   │   └── swift/
    │   └── train/
    │       ├── bank01/
    │       ├── partitions.json
    │       └── swift/
    └── scenarios.txt
Here is an explanation to help you understand this directory structure:
- Federated:
  - There is a scenarios.txt file. This is a newline-delimited file that lists partitioning scenarios. The evaluation runner will loop through the scenarios present here. In the real evaluation runtime, there will be three scenarios defined. In the example provided here, there is one partitioning scenario named scenario01 for each track.
  - Each scenario has a corresponding subdirectory (e.g., data/fincrime/scenario01/).
  - Inside the scenario directory, you will see train/ and test/ subdirectories. These will contain data for the respective stages.
  - Inside the train/ or test/ subdirectory, you will see a few things:
    - partitions.json is a JSON configuration file that lists each client in the scenario and paths to that client's data partition files. The top-level key is the partition/client ID (cid in the simulation code). The inner JSON object lists the data filenames that will be provided to your client factory function. The inner object's keys should match the argument names in the client factory signature. (Docs)
    - Subdirectories for each data partition/client. The directory names should match the client IDs found in partitions.json. The simulation code will expect to find data files in each of these subdirectories matching the filenames in partitions.json. (You will need to copy your development data into here.)
    - In the test/{cid}/ subdirectories, there will also be predictions_format.csv files. These will help you write your predictions in the correct format. Paths to these files will be provided to your test client factory function. You will need to populate these for local testing.
- Centralized:
  - There is also a centralized/ subdirectory (e.g., data/fincrime/centralized/). This will contain data for centralized evaluation.
  - As with the federated scenarios, the centralized directory contains train/ and test/ subdirectories.
  - Inside the train/ or test/ subdirectory, you will see a data.json. This is a JSON configuration file that lists the data files that the training/test code will have access to. The keys should match the argument names of the data paths provided to your fit or predict functions (Docs).
  - The evaluation code will expect to find data files alongside data.json that match the filenames in data.json. (You will need to copy your development data into here.)
  - The evaluation code also expects to find a test/predictions_format.csv. This will help you write your predictions in the correct format. A path to this file will be provided to your predict function. You can download the full centralized version of the predictions_format.csv file for the development dataset on the data download page for your track.
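Before launching the container, it can help to confirm that the files named in the JSON configuration files are actually present. The following is a minimal sketch of such a check, not part of the challenge runtime. The function names are hypothetical. It assumes each value in partitions.json is either a filename inside the client's subdirectory or a path relative to the train/ or test/ directory (both are accepted). It does not check the per-client predictions_format.csv files.

```python
import json
from pathlib import Path
from typing import List


def check_federated(track_dir: Path) -> List[str]:
    """Return a list of problems; an empty list means the layout looks consistent."""
    scenarios_file = track_dir / "scenarios.txt"
    if not scenarios_file.is_file():
        return [f"missing {scenarios_file}"]
    problems: List[str] = []
    # scenarios.txt is newline-delimited; skip blank lines.
    scenarios = [s.strip() for s in scenarios_file.read_text().splitlines() if s.strip()]
    for scenario in scenarios:
        for stage in ("train", "test"):
            stage_dir = track_dir / scenario / stage
            partitions_file = stage_dir / "partitions.json"
            if not partitions_file.is_file():
                problems.append(f"missing {partitions_file}")
                continue
            partitions = json.loads(partitions_file.read_text())
            for cid, files in partitions.items():
                for arg_name, filename in files.items():
                    # Assumption: accept "<cid>/<filename>" or a stage-relative path.
                    candidates = [stage_dir / cid / filename, stage_dir / filename]
                    if not any(p.is_file() for p in candidates):
                        problems.append(
                            f"missing {filename} for client {cid!r} "
                            f"(argument {arg_name!r}) in {stage_dir}"
                        )
    return problems


def check_centralized(track_dir: Path) -> List[str]:
    """Check that every file named in centralized data.json sits alongside it."""
    problems: List[str] = []
    for stage in ("train", "test"):
        stage_dir = track_dir / "centralized" / stage
        config = stage_dir / "data.json"
        if not config.is_file():
            problems.append(f"missing {config}")
            continue
        for arg_name, filename in json.loads(config.read_text()).items():
            if not (stage_dir / filename).is_file():
                problems.append(f"missing {stage_dir / filename} (argument {arg_name!r})")
    predictions_format = track_dir / "centralized" / "test" / "predictions_format.csv"
    if not predictions_format.is_file():
        problems.append(f"missing {predictions_format}")
    return problems


if __name__ == "__main__":
    track = Path("data") / "fincrime"
    for problem in check_federated(track) + check_centralized(track):
        print(problem)
```

Run from the repository root, the script prints one line per missing file and nothing when every referenced file is found.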
In order to run evaluation locally, you will need to copy the development dataset into this directory structure. First, download the development datasets from the challenge data download page. Then, copy the data files either into the client subdirectories for federated data, matching the filenames in partitions.json, or into the train/ or test/ subdirectories for centralized data, matching the filenames in data.json. You will additionally need a predictions_format.csv file in the test/ subdirectories.
For the federated data, it is up to you to partition the development data before copying it into the data directory.
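One simple way to partition is to give every bank its own client and keep all transactions with a single swift client. The sketch below does that for one stage directory. It is only an illustration under several assumptions: the bank CSV has a column identifying the owning bank (the default name "Bank" is a guess), the argument names "swift_data_path" and "bank_data_path" are placeholders that must be replaced with the names in your client factory signature, and partitions.json entries are plain filenames inside each client directory.

```python
import csv
import json
import shutil
from collections import defaultdict
from pathlib import Path
from typing import Dict


def partition_stage(
    bank_csv: Path,
    swift_csv: Path,
    stage_dir: Path,
    bank_column: str = "Bank",  # assumed column name; adjust to your data
) -> Dict[str, Dict[str, str]]:
    """Split bank rows into one client per bank and write partitions.json."""
    # Group the bank rows by the institution that owns them.
    rows_by_bank = defaultdict(list)
    with bank_csv.open(newline="") as f:
        reader = csv.DictReader(f)
        fieldnames = reader.fieldnames or []
        for row in reader:
            rows_by_bank[row[bank_column]].append(row)

    partitions: Dict[str, Dict[str, str]] = {}

    # The transaction data goes to a single "swift" client, unmodified.
    swift_dir = stage_dir / "swift"
    swift_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(swift_csv, swift_dir / swift_csv.name)
    partitions["swift"] = {"swift_data_path": swift_csv.name}  # placeholder key

    # Each bank becomes its own client: bank01, bank02, ... in sorted order.
    for index, bank_id in enumerate(sorted(rows_by_bank), start=1):
        cid = f"bank{index:02d}"
        client_dir = stage_dir / cid
        client_dir.mkdir(parents=True, exist_ok=True)
        with (client_dir / bank_csv.name).open("w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(rows_by_bank[bank_id])
        partitions[cid] = {"bank_data_path": bank_csv.name}  # placeholder key

    (stage_dir / "partitions.json").write_text(json.dumps(partitions, indent=2))
    return partitions
```

A call such as `partition_stage(Path("bank.csv"), Path("swift_train.csv"), Path("data/fincrime/scenario01/train"))` (with hypothetical input filenames) would populate the train/ directory. The same function can be run again for test/, after which the predictions_format.csv files still need to be added by hand.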
Run the following command to build the Docker image with centralized and federated methods.
make build
For a centralized package, set the following environment variables:
export PACKAGE_TRACK=fincrime
export PACKAGE_TYPE=centralized
For a federated package, set the following environment variables:
export PACKAGE_TRACK=fincrime
export PACKAGE_TYPE=federated
Then, pack the package source code using
make pack-package
This will create a package directory with both centralized and federated methods.
mkdir -p package/
cd package_src/fincrime; zip -r ../../package/package.zip ./*
adding: solution_centralized.py (deflated 74%)
adding: solution_federated.py (deflated 85%)
Run the following command to start training and testing
make test-package
To switch between the centralized and federated methods, change the PACKAGE_TYPE environment variable accordingly.
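If you switch between the two methods often, the environment variables and make targets above can be wrapped in a small driver. The following is a minimal sketch and not part of the repository. The helper names `package_env` and `run_package` are hypothetical, and it assumes it is run from the repository root, where the Makefile lives.

```python
import os
import subprocess
from typing import Dict, List, Optional

# The make targets described above, in the order they are run.
MAKE_TARGETS = ["pack-package", "test-package"]


def package_env(
    package_type: str,
    track: str = "fincrime",
    base: Optional[Dict[str, str]] = None,
) -> Dict[str, str]:
    """Return a copy of the environment with PACKAGE_TRACK and PACKAGE_TYPE set."""
    if package_type not in ("centralized", "federated"):
        raise ValueError(f"unknown package type: {package_type!r}")
    env = dict(os.environ if base is None else base)
    env["PACKAGE_TRACK"] = track
    env["PACKAGE_TYPE"] = package_type
    return env


def run_package(package_type: str, dry_run: bool = False) -> List[List[str]]:
    """Pack and test one package type; with dry_run=True only return the commands."""
    env = package_env(package_type)
    commands = [["make", target] for target in MAKE_TARGETS]
    for command in commands:
        if not dry_run:
            # check=True stops at the first failing make target.
            subprocess.run(command, env=env, check=True)
    return commands


if __name__ == "__main__":
    # Preview the commands for both methods without running them.
    for package_type in ("centralized", "federated"):
        print(package_type, run_package(package_type, dry_run=True))
```

Passing the variables through `env=` keeps the setting local to each run, so a centralized run cannot leak its PACKAGE_TYPE into a later federated one.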
After running the package and saving the results, run the following command to clean up the temporary files generated. This will clean up Python caches and delete the package directory.
make clean
We acknowledge contributions to this implementation from the following two authors:
- Fan Dong, DENOS Lab, University of Calgary, Canada
- Haobo Zhang, Illidan Lab, Michigan State University, USA