ChatGPT Code Detection: Techniques for Uncovering the Source of Code

Introduction

This repository contains all experiments from the paper [Oedingen2024] and allows the results to be reproduced. Additionally, we provide the data used for the experiments, enabling further research on detecting AI-generated code.

[Oedingen2024]: Oedingen, M., Engelhardt, R. C., Denz, R., Hammer, M., & Konen, W. (2024). ChatGPT Code Detection: Techniques for Uncovering the Source of Code. arXiv preprint arXiv:2405.15512. https://arxiv.org/abs/2405.15512

Data

The data used for this project is a collection of Python code snippets from various sources:

  • APPS Data: here; Paper: here
  • CodeChef Data: here; Paper: /
  • CodeContests Data: here; Paper: here
  • HackerEarth Data: here; Paper: /
  • HumanEval Data: here; Paper: here
  • MBPP Data: here; Paper: here
  • MTrajK Data: here; Paper: /

Further data sources we look forward to including:

  • CSES Data: here; Paper: /
  • DS-1000 Data: here; Paper: here
  • edabit Data: here; Paper: /
  • LeetCode Data: here; Paper: /
  • 150k PSCD Data: here; Paper: here

Dataset Description

The dataset consists of five relevant columns:

  • id: The unique identifier of the code snippet, a combination of its source and its task ID in the original source.
  • source: The source of the code snippet.
  • code: The Python code snippet.
  • label: The label of the code snippet, which is 1 if the code snippet is GPT-generated and 0 if the code snippet is human-written.
  • embedding: The OpenAI 'Ada' (text-embedding-ada-002) embedding of the code snippet.

In total, the dataset contains 40,158 code snippets: 20,079 GPT-generated and 20,079 human-written. The dataset is balanced, meaning that for each id there is an equal number of GPT-generated and human-written snippets. In our paper, we applied a stricter duplicate filtering, which resulted in a dataset of 31,448 code snippets.

Download the dataset without embeddings from here and with embeddings from here. Unzip the downloaded file and place it in the Datasets folder to use it for the experiments.
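The JSONL files can be inspected directly with pandas. A minimal sketch, assuming the file has been placed at Datasets/Unformatted_Balanced.jsonl as in the project structure below:

import pandas as pd

# One JSON object per line; columns: id, source, code, label
# (plus embedding in the embedded variant of the dataset).
df = pd.read_json("Datasets/Unformatted_Balanced.jsonl", lines=True)

print(df.shape)                     # expected: 40,158 rows
print(df["label"].value_counts())   # expected: 20,079 per class
print(df.loc[df["label"] == 1, "code"].iloc[0])  # a GPT-generated snippet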

Structure of the Project

├── Bayes_Classifier
│   ├── bayes_class.py
├── ML_Algorithms
│   ├── Decision_Tree
│   │   ├── DT_Ada.py
│   │   ├── DT_TFIDF.py
│   ├── Deep_Neural_Network
│   │   ├── DNN_Ada.py
│   │   ├── DNN_TFIDF.py
│   │   ├── DNN_Word2Vec.py
│   ├── Feature_Based
│   │   ├── feature_algorithms.py
│   │   ├── feature_extractor.py
│   ├── Gaussian Mixture Model
│   │   ├── GMM_Ada.py
│   │   ├── GMM_TFIDF.py
│   │   ├── GMM_Word2Vec.py
│   ├── Logistic_Regression
│   │   ├── LR_Ada.py
│   │   ├── LR_TFIDF.py
│   ├── Oblique_Predictive_Clustering_Tree
│   │   ├── OPCT_Ada.py
│   │   ├── OPCT_TFIDF.py
│   ├── Random_Forest
│   │   ├── RF_Ada.py
│   │   ├── RF_TFIDF.py
│   ├── eXtreme_Gradient_Boosting
│   │   ├── XGB_Ada.py
│   │   ├── XGB_TFIDF.py
├── Datasets
│   ├── Unformatted_Balanced.jsonl
│   ├── Unformatted_Balanced_Embedded.jsonl
├── Utility
│   ├── utils.py
├── Models
│   ├── DNN_Ada.pkl
│   ├── XGB_TFIDF.pkl
│   ├── Vectorizer_TFIDF.pkl
├── Results
├── main.py
├── requirements.txt
├── README.md

Usage

To run the experiments, you need to install the required packages. You can do this by running the following command:

pip install -r requirements.txt

After installing the required packages, you can run the experiments by executing the main.py file:

python main.py --dataset <dataset> --embedding <embedding> --algorithm <algorithm> --seed <seed>

Parameters:

  • dataset: The dataset to use for the experiments. Possible values are Formatted or Unformatted.
  • embedding: The embedding to use for the experiments. Possible values are Ada, TFIDF, or Word2Vec.
  • algorithm: The algorithm to use for the experiments. Possible values are DT, DNN, features, GMM, LR, OPCT, RF, or XGB.
  • seed: The seed to use for the experiments; it determines the distribution of problems between the training and test sets.
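For example, to train the XGBoost classifier on TF-IDF features of the unformatted dataset (the seed value here is arbitrary):

python main.py --dataset Unformatted --embedding TFIDF --algorithm XGB --seed 42

Independent of the provided scripts, the core of the TF-IDF experiments is vectorizing the raw code strings and fitting a classifier on top, with the train/test split grouped by problem id so that GPT and human solutions to the same task never end up on both sides. A minimal sketch with scikit-learn (logistic regression stands in for any of the classifiers above; this is not the repository's exact pipeline):

import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GroupShuffleSplit

df = pd.read_json("Datasets/Unformatted_Balanced.jsonl", lines=True)

# Split by problem id, mirroring the role of --seed: all snippets of a
# problem land either in the training set or in the test set.
gss = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
train_idx, test_idx = next(gss.split(df["code"], df["label"], groups=df["id"]))
train, test = df.iloc[train_idx], df.iloc[test_idx]

# Vectorize the raw code strings with TF-IDF and fit a simple classifier.
vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train["code"])
X_test = vectorizer.transform(test["code"])
clf = LogisticRegression(max_iter=1000).fit(X_train, train["label"])

print("Accuracy:", accuracy_score(test["label"], clf.predict(X_test)))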

Application

Getting Started

  1. Install Cog and Docker.
  2. Build the Cog Docker container:
cd app
cog build -t xgb-tfidf-model
  3. Run docker-compose:
docker-compose up
  4. Go to http://localhost:3000 to see the UI.
  5. Go to http://localhost:5002/docs to see the Swagger UI.
  6. Query the predictions endpoint directly with curl:
curl http://localhost:5002/predictions -X POST \
--header "Content-Type: application/json" \
--data '{"input": {"code": "hello"}}'
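The predictions endpoint can also be called from Python; a minimal sketch using the requests library against the payload shape shown above:

import requests

# Query the Cog prediction API started via docker-compose.
response = requests.post(
    "http://localhost:5002/predictions",
    json={"input": {"code": "print('hello world')"}},
)
response.raise_for_status()
print(response.json())  # the model's prediction for the submitted snippet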

Tech Stack

  • Python 3.8
  • Cog for building model prediction API container
  • Docker
  • SvelteKit

Reference

If you use this code or data, please cite the following paper:

@article{ai5030053,
	author = {Oedingen, Marc and Engelhardt, Raphael C. and Denz, Robin and Hammer, Maximilian and Konen, Wolfgang},
	doi = {10.3390/ai5030053},
	issn = {2673-2688},
	journal = {AI},
	number = {3},
	pages = {1066--1094},
	title = {ChatGPT Code Detection: Techniques for Uncovering the Source of Code},
	url = {https://www.mdpi.com/2673-2688/5/3/53},
	volume = {5},
	year = {2024},
}
