ChatGPT Code Detection: Techniques for Uncovering the Source of Code

Introduction

This repository contains all experiments from the paper [Oedingen2024] and allows the results to be reproduced. Additionally, we provide the data used for the experiments, enabling further research on detecting AI-generated code.

[Oedingen2024]: Oedingen, M., Engelhardt, R. C., Denz, R., Hammer, M., & Konen, W. (2024). ChatGPT Code Detection: Techniques for Uncovering the Source of Code. arXiv preprint arXiv:2405.15512. https://arxiv.org/abs/2405.15512

Data

The data used for this project is a collection of Python code snippets from various sources:

  • APPS Data: here; Paper: here
  • CodeChef Data: here; Paper: /
  • CodeContests Data: here; Paper: here
  • HackerEarth Data: here; Paper: /
  • HumanEval Data: here; Paper: here
  • MBPP Data: here; Paper: here
  • MTrajK Data: here; Paper: /

Further data sources we look forward to including:

  • CSES Data: here; Paper: /
  • DS-1000 Data: here; Paper: here
  • edabit Data: here; Paper: /
  • LeetCode Data: here; Paper: /
  • 150k PSCD Data: here; Paper: here

Dataset Description

The dataset consists of five relevant columns:

  • id: The unique identifier of the code snippet, a combination of its source and its task ID in the original source.
  • source: The source of the code snippet.
  • code: The Python code snippet.
  • label: The label of the code snippet, which is 1 if the code snippet is GPT-generated and 0 if the code snippet is human-written.
  • embedding: The OpenAI 'Ada' (text-embedding-ada-002) embedding of the code snippet.

In total, the dataset contains 40,158 code snippets: 20,079 GPT-generated and 20,079 human-written. The dataset is balanced, meaning that for each id there is an equal number of GPT-generated and human-written snippets. In our paper, we applied a stricter duplicate filtering, which resulted in a dataset of 31,448 code snippets.

Download the dataset without embeddings from here and with embeddings from here. Unzip the downloaded file and place it in the Datasets folder to use it for the experiments.
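The JSONL files can be inspected directly with pandas. A minimal sketch, assuming the file has been placed at Datasets/Unformatted_Balanced.jsonl as in the project structure below:

import pandas as pd

# One JSON object per line; columns: id, source, code, label
# (plus embedding in the embedded variant of the dataset).
df = pd.read_json("Datasets/Unformatted_Balanced.jsonl", lines=True)

print(df.shape)                     # expected: 40,158 rows
print(df["label"].value_counts())   # expected: 20,079 per class
print(df.loc[df["label"] == 1, "code"].iloc[0])  # a GPT-generated snippet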

Structure of the Project

├── Bayes_Classifier
│   ├── bayes_class.py
├── ML_Algorithms
│   ├── Decision_Tree
│   │   ├── DT_Ada.py
│   │   ├── DT_TFIDF.py
│   ├── Deep_Neural_Network
│   │   ├── DNN_Ada.py
│   │   ├── DNN_TFIDF.py
│   │   ├── DNN_Word2Vec.py
│   ├── Feature_Based
│   │   ├── feature_algorithms.py
│   │   ├── feature_extractor.py
│   ├── Gaussian Mixture Model
│   │   ├── GMM_Ada.py
│   │   ├── GMM_TFIDF.py
│   │   ├── GMM_Word2Vec.py
│   ├── Logistic_Regression
│   │   ├── LR_Ada.py
│   │   ├── LR_TFIDF.py
│   ├── Oblique_Predictive_Clustering_Tree
│   │   ├── OPCT_Ada.py
│   │   ├── OPCT_TFIDF.py
│   ├── Random_Forest
│   │   ├── RF_Ada.py
│   │   ├── RF_TFIDF.py
│   ├── eXtreme_Gradient_Boosting
│   │   ├── XGB_Ada.py
│   │   ├── XGB_TFIDF.py
├── Datasets
│   ├── Unformatted_Balanced.jsonl
│   ├── Unformatted_Balanced_Embedded.jsonl
├── Utility
│   ├── utils.py
├── Models
│   ├── DNN_Ada.pkl
│   ├── XGB_TFIDF.pkl
│   ├── Vectorizer_TFIDF.pkl
├── Results
├── main.py
├── requirements.txt
├── README.md

Usage

To run the experiments, you need to install the required packages. You can do this by running the following command:

pip install -r requirements.txt

After installing the required packages, you can run the experiments by executing the main.py file:

python main.py --dataset <dataset> --embedding <embedding> --algorithm <algorithm> --seed <seed>

Parameters:

  • dataset: The dataset to use for the experiments. Possible values are Formatted or Unformatted.
  • embedding: The embedding to use for the experiments. Possible values are Ada, TFIDF, or Word2Vec.
  • algorithm: The algorithm to use for the experiments. Possible values are DT, DNN, features, GMM, LR, OPCT, RF, or XGB.
  • seed: The seed to use for the experiments; it determines the distribution of problems between the training and test sets.
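For example, to train the XGBoost classifier on TF-IDF features of the unformatted dataset (the seed value here is arbitrary):

python main.py --dataset Unformatted --embedding TFIDF --algorithm XGB --seed 42

Independent of the provided scripts, the core of the TF-IDF experiments is vectorizing the raw code strings and fitting a classifier on top, with the train/test split grouped by problem id so that GPT and human solutions to the same task never end up on both sides. A minimal sketch with scikit-learn (logistic regression stands in for any of the classifiers above; this is not the repository's exact pipeline):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_json("Datasets/Unformatted_Balanced.jsonl", lines=True)

# Split by problem id, mirroring the role of --seed: all snippets of a
# problem land either in the training set or in the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df["code"], df["label"], groups=df["id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Vectorize the raw code strings with TF-IDF and fit a simple classifier.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train["code"])
X_test = vectorizer.transform(test["code"])
clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])

print("Accuracy:", accuracy_score(test["label"], clf.predict(X_test)))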

Application

Getting Started

  1. Install Cog and Docker.
  2. Build the Cog Docker container:
cd app
cog build -t xgb-tfidf-model
  3. Run docker-compose:
docker-compose up
  4. Go to http://localhost:3000 to see the UI.
  5. Go to http://localhost:5002/docs to see the Swagger UI.
  6. Query the predictions endpoint directly with curl:
curl http://localhost:5002/predictions -X POST \
--header "Content-Type: application/json" \
--data '{"input": {"code": "hello"}}'
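The predictions endpoint can also be called from Python; a minimal sketch using the requests library against the payload shape shown above:

import requests

# Query the Cog prediction API started via docker-compose.
response = requests.post(
    "http://localhost:5002/predictions",
    json={"input": {"code": "print('hello world')"}},
)
response.raise_for_status()
print(response.json())  # the model's prediction for the submitted snippet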

Tech Stack

  • Python 3.8
  • Cog for building model prediction API container
  • Docker
  • SvelteKit

Reference

If you use this code or data, please cite the following paper:

@article{ai5030053,
	author = {Oedingen, Marc and Engelhardt, Raphael C. and Denz, Robin and Hammer, Maximilian and Konen, Wolfgang},
	doi = {10.3390/ai5030053},
	issn = {2673-2688},
	journal = {AI},
	number = {3},
	pages = {1066--1094},
	title = {ChatGPT Code Detection: Techniques for Uncovering the Source of Code},
	url = {https://www.mdpi.com/2673-2688/5/3/53},
	volume = {5},
	year = {2024},
}
