
Repository for source codes for the paper titled "Can Large Language Model Detect Plagiarism in Source Code?" which was accepted and presented at IEEE International Conference on Foundation and Large Language Models (FLLM2024).

If you want to use this code, please cite our article describing this solution:

IEEE style

W. Brach, K. Košťál and M. Ries, "Can Large Language Model Detect Plagiarism in Source Code?," 2024 IEEE International Conference on Foundation and Large Language Models (FLLM2024), Dubai, United Arab Emirates, 2024, pp. 1-8.

LLM-plagiarism-check

This project builds a system for detecting source code plagiarism with Large Language Models (LLMs) via the DSPy framework. Given two input code files, it determines whether plagiarism has occurred and explains the result.
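To make the input and output of such a check concrete, here is a minimal sketch in plain Python. It is not the repository's DSPy code: the `Verdict` type, the prompt wording, and the `Verdict: yes/no` reply format are all assumptions made for illustration. The actual model definitions live in models.py.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    """Hypothetical result type: a yes/no decision plus a free-text explanation."""
    plagiarized: bool
    explanation: str


def build_prompt(code_a: str, code_b: str) -> str:
    """Assemble an illustrative prompt asking for a verdict and a reason."""
    return (
        "Compare the two source files below and decide whether one "
        "plagiarizes the other.\n"
        "Answer on the first line with 'Verdict: yes' or 'Verdict: no', "
        "then explain on the following lines.\n\n"
        f"--- File A ---\n{code_a}\n\n--- File B ---\n{code_b}\n"
    )


def parse_response(text: str) -> Verdict:
    """Parse a reply in the assumed 'Verdict: yes/no' plus explanation format."""
    first, _, rest = text.strip().partition("\n")
    label = first.lower().removeprefix("verdict:").strip()
    if label not in {"yes", "no"}:
        raise ValueError(f"unexpected verdict line: {first!r}")
    return Verdict(plagiarized=(label == "yes"), explanation=rest.strip())
```

The prompt from `build_prompt` would be sent to whichever LLM is in use, and the reply passed to `parse_response`. Requiring a fixed first line keeps the decision machine-readable while leaving the explanation free-form.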

Installation

# Clone the repository
git clone https://github.com/fiit-ba/LLM-plagiarism-check.git
cd LLM-plagiarism-check

# Create a virtual environment
python3 -m venv llm-plagiarism-check

# Activate the virtual environment
source llm-plagiarism-check/bin/activate

# Install the required packages
pip install -r requirements.txt

Usage

The project consists of the following components:

Jupyter Notebooks

  • check.ipynb: This is where we compile and train our DSPy programs.
  • eval.ipynb: Use this notebook to evaluate the performance of our DSPy programs.
  • jplag.ipynb: Run this to calculate the JPlag benchmark.
  • analysis.ipynb: This notebook contains all our plots and analysis of results.
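As a rough illustration of what evaluating a plagiarism detector can involve, the sketch below computes standard binary classification metrics from true and predicted labels. This README does not state which metrics eval.ipynb reports, so the choice of accuracy, precision, recall, and F1, as well as the function name `binary_metrics`, are assumptions.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for boolean labels.

    y_true and y_pred are equal-length sequences where True means
    "plagiarized". Ratios with a zero denominator are reported as 0.0.
    """
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not t and not p)

    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }
```

The same function can score both the LLM predictions and a baseline such as JPlag, provided the baseline's similarity scores are first thresholded into boolean labels.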

Python Scripts

  • dataloader.py: Provides support for loading our research data.
  • models.py: Contains the model definitions for our DSPy programs.

Data Directories

  • data/IR-Plag-Dataset/: This directory contains our plagiarism dataset, sourced from this GitHub repository.
  • data/jplag/: Used for the JPlag benchmark calculations.
  • data/metadata/: Stores metadata for our DSPy programs.
  • data/results/: Where we save our research results.
  • data/train.tsv: Our training dataset for DSPy.
  • programs/: Contains DSPy programs.
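For orientation, a tab-separated training file such as data/train.tsv can be read with Python's standard `csv` module. The sketch below is not the repository's dataloader.py, and the column names `code_a`, `code_b`, and `label` are hypothetical. Check the actual header of the file before relying on them.

```python
import csv
import io


def load_pairs(handle):
    """Read code pairs from a TSV file handle.

    Assumes (hypothetically) a header row with the columns
    code_a, code_b and label, where label is "1" for plagiarism.
    """
    reader = csv.DictReader(handle, delimiter="\t")
    return [
        {
            "code_a": row["code_a"],
            "code_b": row["code_b"],
            "label": row["label"] == "1",
        }
        for row in reader
    ]


# In-memory sample standing in for a real file, using the assumed layout.
sample = io.StringIO(
    "code_a\tcode_b\tlabel\n"
    "int a = 1;\tint b = 1;\t1\n"
    "int a = 1;\tprint('x')\t0\n"
)
pairs = load_pairs(sample)
```

With a real file, pass `open("data/train.tsv", newline="", encoding="utf-8")` in place of the in-memory sample.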

Contact

William Brach - @williambrach - william.brach@stuba.sk
