This repository contains all experiments from the paper [Oedingen2024] and allows for the reproduction of its results. Additionally, we provide the data used for the experiments, enabling further research on detecting AI-generated code.
[Oedingen2024]: Oedingen, M., Engelhardt, R. C., Denz, R., Hammer, M., & Konen, W. (2024). ChatGPT Code Detection: Techniques for Uncovering the Source of Code. arXiv preprint arXiv:2405.15512. https://arxiv.org/abs/2405.15512
The data used for this project is a collection of Python code snippets from various sources:
- APPS Data: here; Paper: here
- CodeChef Data: here; Paper: /
- CodeContests Data: here; Paper: here
- HackerEarth Data: here; Paper: /
- HumanEval Data: here; Paper: here
- MBPP Data: here; Paper: here
- MTrajK Data: here; Paper: /
We look forward to incorporating further data sources:
- CSES Data: here; Paper: /
- DS-1000 Data: here; Paper: here
- edabit Data: here; Paper: /
- LeetCode Data: here; Paper: /
- 150k PSCD Data: here; Paper: here
The dataset consists of four relevant columns:
- `id`: The unique identifier of the code snippet, a combination of the source and the snippet's task id in its original source.
- `source`: The source of the code snippet.
- `code`: The Python code snippet.
- `label`: The label of the code snippet: `1` if the snippet is GPT-generated and `0` if it is human-written.
- `embedding`: The 'Ada' embedding of the code snippet (present only in the embedded variant of the dataset).
In total, the dataset contains 40158 code snippets, of which 20079 are GPT-generated and 20079 are human-written.
The dataset is balanced, meaning that for each `id` the numbers of GPT-generated and human-written code snippets are equal.
In our paper, we applied a stricter filtering of duplicate code snippets, which resulted in a dataset of 31448 code snippets.
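The paper's exact duplicate-filtering procedure is not reproduced here, but a minimal sketch of exact-duplicate removal could look as follows (the whitespace-insensitive normalization is our assumption for illustration; the paper's stricter filtering may differ):

```python
def deduplicate(snippets):
    """Drop snippets whose normalized code has already been seen.

    Normalizing by stripping blank lines and surrounding whitespace is an
    illustrative assumption; the paper's filtering may be stricter.
    """
    seen, unique = set(), []
    for snippet in snippets:
        key = "\n".join(
            line.strip() for line in snippet["code"].splitlines() if line.strip()
        )
        if key not in seen:
            seen.add(key)
            unique.append(snippet)
    return unique
```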
Download the dataset without embeddings from here and with embeddings from here.
Unzip the downloaded file and place it in the `Datasets` folder to use it for the experiments.
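Each line of the `.jsonl` files is one JSON object with the columns described above, so the dataset can be loaded with the standard library alone (pandas' `read_json(..., lines=True)` works as well); a small sketch:

```python
import json

def load_snippets(path):
    # One JSON object per line: id, source, code, label
    # (plus embedding in the embedded variant).
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]

# Example (assumes the unzipped file is in place):
# snippets = load_snippets("Datasets/Unformatted_Balanced.jsonl")
# n_gpt = sum(s["label"] for s in snippets)  # number of GPT-generated snippets
```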
├── Bayes_Classifier
│ ├── bayes_class.py
├── ML_Algorithms
│ ├── Decision_Tree
│ │ ├── DT_Ada.py
│ │ ├── DT_TFIDF.py
│ ├── Deep_Neural_Network
│ │ ├── DNN_Ada.py
│ │ ├── DNN_TFIDF.py
│ │ ├── DNN_Word2Vec.py
│ ├── Feature_Based
│ │ ├── feature_algorithms.py
│ │ ├── feature_extractor.py
│ ├── Gaussian Mixture Model
│ │ ├── GMM_Ada.py
│ │ ├── GMM_TFIDF.py
│ │ ├── GMM_Word2Vec.py
│ ├── Logistic_Regression
│ │ ├── LR_Ada.py
│ │ ├── LR_TFIDF.py
│ ├── Oblique_Predictive_Clustering_Tree
│ │ ├── OPCT_Ada.py
│ │ ├── OPCT_TFIDF.py
│ ├── Random_Forest
│ │ ├── RF_Ada.py
│ │ ├── RF_TFIDF.py
│ ├── eXtreme_Gradient_Boosting
│ │ ├── XGB_Ada.py
│ │ ├── XGB_TFIDF.py
├── Datasets
│ ├── Unformatted_Balanced.jsonl
│ ├── Unformatted_Balanced_Embedded.jsonl
├── Utility
│ ├── utils.py
├── Models
│ ├── DNN_Ada.pkl
│ ├── XGB_TFIDF.pkl
│ ├── Vectorizer_TFIDF.pkl
├── Results
├── main.py
├── requirements.txt
├── README.md
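The `*_TFIDF.py` scripts above classify code based on TF-IDF representations (see also `Vectorizer_TFIDF.pkl`, presumably a pickled scikit-learn vectorizer). As a rough, dependency-free sketch of the underlying idea, not the repository's actual implementation:

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """docs: list of token lists (e.g. code split into tokens).

    Returns one {token: weight} dict per document. The smoothed IDF
    log((1 + n) / (1 + df)) + 1 mirrors scikit-learn's default, which is
    an assumption about the repository's setup.
    """
    n = len(docs)
    df = Counter(token for doc in docs for token in set(doc))
    idf = {t: math.log((1 + n) / (1 + df[t])) + 1.0 for t in df}
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        total = len(doc)
        vectors.append({t: (c / total) * idf[t] for t, c in counts.items()})
    return vectors
```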
To run the experiments, you need to install the required packages. You can do this by running the following command:
pip install -r requirements.txt
After installing the required packages, you can run the experiments by executing the `main.py` file:
python main.py --dataset <dataset> --embedding <embedding> --algorithm <algorithm> --seed <seed>
Parameters:
- `dataset`: The dataset to use for the experiments. Possible values are `Formatted` or `Unformatted`.
- `embedding`: The embedding to use for the experiments. Possible values are `Ada`, `TFIDF`, or `Word2Vec`.
- `algorithm`: The algorithm to use for the experiments. Possible values are `DT`, `DNN`, `features`, `GMM`, `LR`, `OPCT`, `RF`, or `XGB`.
- `seed`: The seed to use for the experiments, i.e., the distribution of problems between the training and test set.
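For example, a run of XGBoost on TF-IDF features of the unformatted dataset with seed 42 (parameter values taken from the lists above) would look like:

```shell
python main.py --dataset Unformatted --embedding TFIDF --algorithm XGB --seed 42
```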
- Build the model prediction API container with Cog:
cd app
cog build -t xgb-tfidf-model
- Run docker-compose:
docker-compose up
- Go to http://localhost:3000 to see the UI
- Go to http://localhost:5002/docs to see the Swagger UI
- Curl the predictions endpoint directly:
curl http://localhost:5002/predictions -X POST \
  --header "Content-Type: application/json" \
  --data '{"input": {"code": "hello"}}'
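The same request can be issued from Python with the standard library; the payload shape `{"input": {"code": ...}}` is taken from the curl example above:

```python
import json
from urllib import request

def build_prediction_request(code, url="http://localhost:5002/predictions"):
    # POST JSON of the form {"input": {"code": ...}}, mirroring the curl call.
    payload = json.dumps({"input": {"code": code}}).encode("utf-8")
    return request.Request(
        url,
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# To actually send it (requires the containers to be up):
# with request.urlopen(build_prediction_request("hello")) as resp:
#     print(resp.read())
```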
- Python 3.8
- Cog for building model prediction API container
- Docker
- SvelteKit
If you use this code or data, please cite the following paper:
@article{ai5030053,
author = {Oedingen, Marc and Engelhardt, Raphael C. and Denz, Robin and Hammer, Maximilian and Konen, Wolfgang},
doi = {10.3390/ai5030053},
issn = {2673-2688},
journal = {AI},
number = {3},
pages = {1066--1094},
title = {ChatGPT Code Detection: Techniques for Uncovering the Source of Code},
url = {https://www.mdpi.com/2673-2688/5/3/53},
volume = {5},
year = {2024},
}