prithviraj-maurya/detect_llm_generated_essay

Detecting LLM-Generated Essays: A Machine Learning Approach

Keywords: Large Language Models (LLMs), ChatGPT, academia, sentiment analysis, educational technology, natural language processing, qualitative content analysis, Artificial Intelligence tool usage, and higher education.

INTRODUCTION

The integration of large language models (LLMs) into educational settings has ushered in a new era of possibilities and challenges[3]. While these advanced language models offer unprecedented opportunities for students to enhance their writing skills and access a wealth of information, concerns have arisen regarding their potential impact on academic integrity, particularly in the realm of essay writing. With the ability to produce text virtually indistinguishable from human writing, LLMs pose a significant challenge to educators in preserving the authenticity of students’ work and preventing plagiarism.

This research project delves into the evolving landscape of education, where technological advancements, particularly in natural language processing, have given rise to both excitement and apprehension. The focus of this study lies in the development of a machine-learning model that can effectively discern essays generated by LLMs from those composed by middle and high school students. The ultimate goal is to provide educators with a reliable tool to uphold academic integrity while acknowledging the ethical use of LLMs as educational aids.

In this context, it is crucial to recognize that LLMs can be powerful allies in the learning process when used responsibly. As tools that can assist students in refining their writing skills, deepening their understanding of various subjects, and fostering creativity, LLMs offer valuable contributions to education. However, striking a balance between harnessing these benefits and safeguarding against potential misuse is imperative for educators and institutions.

Recent work in this area has laid the foundation for understanding and addressing challenges related to AI-generated text in educational contexts. For instance, Zellers et al. [2] proposed methodologies for defending against neural fake news, acknowledging the growing concerns associated with the generation of misleading content using advanced language models. Additionally, Gehrmann et al. [3] describe statistical approaches for identifying the origin of a text, differentiating between model-generated and human-authored content. Building upon these foundations, our research explores machine learning techniques with the aim of developing a robust model that educators can use as a practical aid in differentiating between authentic student work and content generated by advanced language models. Through this exploration, we hope to provide insights that not only contribute to the field of LLM text detection but also empower educators to navigate the evolving landscape of technology in education with confidence.

Research Question

How can machine learning techniques be effectively employed to identify essays generated by large language models (LLMs) compared to those authored by middle and high school students?

BACKGROUND

The release of large language models (LLMs) in the realm of education marks a shift in how students engage with written expression and information synthesis. LLMs, such as OpenAI’s GPT-3[12], have demonstrated an exceptional ability to generate coherent and contextually relevant text, raising both excitement and concerns within academic circles. As students increasingly turn to these advanced language models to assist them in essay writing and other written assignments, educators are confronted with the challenge of distinguishing between authentic student work and content generated by machines.

The integration of LLMs into educational practices is driven by their capacity to facilitate learning, providing students with tools to refine their writing skills, explore complex topics, and easily access information. However, the transformative potential of LLMs comes with a critical caveat — the risk of compromising academic integrity. Students armed with tools that can produce text virtually indistinguishable from human writing challenge traditional methods of plagiarism detection and present a formidable obstacle to maintaining the authenticity of academic assessments.

As LLMs become more accessible and pervasive, the need for effective mechanisms to identify machine-generated content becomes paramount. Striking a delicate balance between fostering responsible use of LLMs for educational purposes and preventing misuse is essential to ensuring the continued trustworthiness of academic assessments.

It is crucial to emphasize that while concerns about the potential misuse of LLMs are valid, these models can also potentially revolutionize education positively. Responsible integration of LLMs can enhance the learning experience, offering students valuable resources for research, critical thinking, and creative expression. Through this research endeavor, we aspire to contribute insights that foster a nuanced understanding of the intersection between advanced language models and academic integrity, ultimately providing educators with the tools to navigate this complex terrain effectively.

METHODS

Data

To ensure the robustness and generalizability of our model, we have assembled a diverse and extensive dataset of essays sourced from Kaggle[1]. This dataset encompasses essays from both middle and high school students, as well as essays generated by various large language models (LLMs), including GPT-3[12] and BERT[11]. The inclusion of human-written and machine-generated essays is instrumental in creating a well-balanced dataset that accurately represents a spectrum of writing styles and qualities, crucial for achieving accurate detection.

The dataset divides the text into two classes, "AI-generated" and "Human-written," denoted by the binary labels "1" and "0" respectively. The problem is formulated as a binary classification task: predict whether an essay was written by a human (0) or generated by an AI model (1). Kaggle collected over 10,000 student essays, written in response to specific prompts, and the same prompts were given to LLMs such as ChatGPT[12] through their APIs to produce the AI-generated essays. Using the same prompts, we generated further examples with additional LLMs such as BERT and GPT, adding 2,000 essays to the dataset.

The dataset is accessible through the Kaggle API, requiring authentication with a Kaggle account and adherence to the platform’s terms and conditions. The data files are available in the CSV (Comma Separated Values) format and can be read by libraries such as "Pandas" in the Python environment. It comprises four columns:

ID: A unique identifier for each text entry.

Prompt: The prompt (question) given to the students to write the essay; the same prompt was given to an LLM to generate an essay.

Text: The essay content, whether human-authored or AI-generated.

Generated: A binary label indicating whether the essay is AI-generated (1) or human-written (0).

| Id | Prompt | Text | Generated |
|---|---|---|---|
| 0059830c | Write an explanatory essay to inform fellow citizens about the advantages of limiting car usage. | Cars have been around since they became famous in the 1900s, when Henry Ford created and built the first ModelT. Cars have played a major role in our every day lives since then… | 0 |

A sample row from the dataset
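As a minimal sketch of reading data in this layout with Pandas, the snippet below parses a tiny in-memory CSV. The lowercase column names (`id`, `prompt`, `text`, `generated`), the shortened essay texts, and the second row's identifier are assumptions made for illustration; the header spelling and file name in the actual Kaggle download may differ.

```python
import io

import pandas as pd

# Hypothetical two-row sample in the four-column layout described above.
# The second row (id "ffffffff") is invented purely for illustration.
SAMPLE_CSV = io.StringIO(
    "id,prompt,text,generated\n"
    '0059830c,"Write an explanatory essay","Cars have been around since the 1900s",0\n'
    'ffffffff,"Write an explanatory essay","Limiting car usage offers many benefits",1\n'
)


def load_essays(source) -> pd.DataFrame:
    """Read essays and check that the label column is strictly binary."""
    df = pd.read_csv(source, dtype={"id": str})
    if not set(df["generated"].unique()) <= {0, 1}:
        raise ValueError("label column must contain only 0 and 1")
    return df


essays = load_essays(SAMPLE_CSV)
# Number of essays per class: 0 = human-written, 1 = AI-generated.
class_counts = essays["generated"].value_counts().to_dict()
print(class_counts)
```

Checking the class counts right after loading is a cheap way to spot label imbalance before any model is trained.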

Methodology

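In our model evaluations, we selected the ROC AUC score[5] as the key metric. The score ranges from 0 to 1, where 0.5 corresponds to random guessing and 1 to perfect separation of the two classes. Because it summarizes the trade-off between the true-positive rate and the false-positive rate across all decision thresholds, it does not depend on choosing a single classification cutoff.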
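To make the metric concrete, here is a small worked example. It relies on the standard interpretation of ROC AUC as the probability that a randomly chosen positive example (here, an AI-generated essay) receives a higher score than a randomly chosen negative one, with ties counted as half. The function name `roc_auc` and the four scores are illustrative only; in practice a library routine would normally be used.

```python
def roc_auc(labels, scores):
    """Pairwise (Mann-Whitney) estimate of ROC AUC; ties count as half a win."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one example of each class")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Illustrative labels (1 = AI-generated) and classifier scores.
labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(labels, scores)
print(auc)  # 0.75: three of the four positive/negative pairs are ranked correctly
```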

1. Baseline Model:

To establish a baseline for our essay detection task, we started with a straightforward Bag of Words[6] encoding coupled with a logistic regression[8] classifier. To explore an alternative text representation, we experimented with TF-IDF[7] vectorization, which did not significantly outperform the Bag of Words model. In pursuit of a more complex model, we added a gradient-boosted tree classifier from the XGBoost[9] library. To leverage the complementary strengths of the two models, we implemented an ensemble combining logistic regression and XGBoost.
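The first two steps of this baseline can be sketched with scikit-learn as follows. This is an assumption-laden illustration rather than the project's notebook: the toy essays are invented, the helper `evaluate` is hypothetical, and the hyperparameters are library defaults apart from `max_iter`.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 0 = human-written, 1 = AI-generated.
train_texts = [
    "i think cars are bad cuz they make smog and my dad hates traffic",
    "we shoud drive less becuase gas costs alot and the bus is fine",
    "my town has to many cars and i think biking is more fun anyway",
    "In conclusion, limiting car usage offers numerous benefits for society.",
    "Furthermore, reduced car usage significantly lowers harmful emissions.",
    "Moreover, limiting car usage fosters healthier, sustainable communities.",
]
train_labels = [0, 0, 0, 1, 1, 1]
test_texts = [
    "i think the bus is fine and cars make to much smog",
    "Furthermore, limiting car usage offers significant benefits.",
]
test_labels = [0, 1]


def evaluate(vectorizer):
    """Fit vectorizer + logistic regression and return held-out ROC AUC."""
    model = make_pipeline(vectorizer, LogisticRegression(max_iter=1000))
    model.fit(train_texts, train_labels)
    probs = model.predict_proba(test_texts)[:, 1]  # P(AI-generated)
    return roc_auc_score(test_labels, probs)


results = {
    "bag_of_words": evaluate(CountVectorizer()),
    "tfidf": evaluate(TfidfVectorizer()),
}
print(results)
```

Wrapping the vectorizer and classifier in one pipeline ensures the vocabulary is learned from the training split only, so the held-out score is not inflated by leakage.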

2. Recurrent Neural Networks (RNNs):

Delving into deep learning, we constructed a custom model featuring a Bidirectional Long Short-Term Memory (LSTM)[10] layer followed by a Dense layer. During experimentation, the model’s performance peaked at 5 epochs, and training for longer appeared to cause overfitting. The plot below illustrates the accuracy and loss for each epoch of training:

[Figure: training accuracy and loss per epoch]
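The architecture described above (an embedding, a bidirectional LSTM, and a dense output layer) might look roughly like the following. The framework (PyTorch), the class name `BiLSTMClassifier`, and all layer sizes are assumptions chosen for illustration; the source does not state which framework or dimensions were used.

```python
import torch
from torch import nn


class BiLSTMClassifier(nn.Module):
    """Embedding -> bidirectional LSTM -> dense layer producing one logit."""

    def __init__(self, vocab_size: int, embed_dim: int = 100, hidden_dim: int = 64):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim, padding_idx=0)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.dense = nn.Linear(2 * hidden_dim, 1)

    def forward(self, token_ids: torch.Tensor) -> torch.Tensor:
        embedded = self.embedding(token_ids)            # (batch, seq, embed_dim)
        _, (h_n, _) = self.lstm(embedded)               # h_n: (2, batch, hidden_dim)
        # Concatenate the final forward and backward hidden states.
        features = torch.cat([h_n[0], h_n[1]], dim=1)   # (batch, 2 * hidden_dim)
        return self.dense(features).squeeze(1)          # raw logits, shape (batch,)


torch.manual_seed(0)
model = BiLSTMClassifier(vocab_size=1000)
batch = torch.randint(1, 1000, (4, 20))   # 4 fake essays of 20 token ids each
probs = torch.sigmoid(model(batch))       # probability of "AI-generated"
print(probs.shape)
```

For brevity this sketch ignores padding; with variable-length essays one would typically pack the sequences so that padded positions do not influence the final hidden states.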

3. Transformers:

Transitioning to transformer models, we explored the application of the pre-trained Distilled BERT Uncased transformer[11]. Leveraging pre-trained embeddings from BERT, we fine-tuned the transformer model on our dataset. This exploration of transformer models provides valuable insights into the efficacy of leveraging state-of-the-art language models for the specific task of distinguishing between human-authored and LLM-generated essays.
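A single fine-tuning step for a DistilBERT sequence classifier can be sketched with the Hugging Face `transformers` library as below. The helper names (`build_classifier`, `train_step`), the learning rate, and the checkpoint name `distilbert-base-uncased` are assumptions, not details confirmed by the source. So that the example runs without downloading weights, it defaults to a tiny randomly initialized model and random token ids; real use would pass a checkpoint name and tokenized essays instead.

```python
import torch
from transformers import DistilBertConfig, DistilBertForSequenceClassification


def build_classifier(pretrained_name=None):
    """Two-class DistilBERT classifier; tiny random model if no checkpoint is named."""
    if pretrained_name is not None:
        # Downloads pre-trained weights, e.g. "distilbert-base-uncased" (assumed name).
        return DistilBertForSequenceClassification.from_pretrained(
            pretrained_name, num_labels=2
        )
    tiny = DistilBertConfig(
        vocab_size=100, dim=32, n_layers=1, n_heads=2, hidden_dim=64, num_labels=2
    )
    return DistilBertForSequenceClassification(tiny)


def train_step(model, optimizer, input_ids, attention_mask, labels):
    """Run one gradient update and return the cross-entropy loss."""
    model.train()
    optimizer.zero_grad()
    out = model(input_ids=input_ids, attention_mask=attention_mask, labels=labels)
    out.loss.backward()
    optimizer.step()
    return out.loss.item()


torch.manual_seed(0)
model = build_classifier()  # offline smoke test with the tiny random model
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
input_ids = torch.randint(1, 100, (4, 16))      # stand-in for tokenized essays
attention_mask = torch.ones_like(input_ids)
labels = torch.tensor([0, 1, 0, 1])             # 0 = human, 1 = AI-generated
loss = train_step(model, optimizer, input_ids, attention_mask, labels)
print(loss)
```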

These varied approaches encompass both traditional machine learning and cutting-edge deep learning techniques, allowing us to comprehensively assess the strengths and limitations of different models in addressing the unique challenges posed by LLM-generated content detection in the context of academic writing.

Limitations

1. Representation Bias in External Data:

The inclusion of external data from different AI models introduces potential biases specific to those models, limiting the generalizability of our findings.

2. Sensitivity to Model Choices:

The sensitivity of our models to tokenization and embedding methods may impact their generalization to diverse datasets or domains.

3. Challenges in Deep Learning Generalization:

The Bidirectional LSTM model exhibited potential overfitting, highlighting the delicate balance required in deep-learning models for optimal generalization.

These limitations underscore the need for cautious interpretation and further refinement of models for robust essay detection in diverse educational contexts.

Analysis

Our analysis encompasses a multifaceted exploration of different models applied to the task of distinguishing between essays generated by large language models (LLMs) and those authored by middle and high school students. We conducted an in-depth investigation, utilizing various machine learning and deep learning techniques to achieve a better understanding of the strengths and limitations of each approach.

1. Baseline Model:

The baseline model, employing a Bag of Words encoding with logistic regression, provided a reference point for comparison. This initial model yielded a ROC AUC score of 0.78. Switching to TF-IDF[7] vectorization did not yield a substantial improvement. Adding the XGBoost[9] model gave a notable performance boost, consistent with the effectiveness of gradient-boosting techniques. The ensemble combining logistic regression[8] and XGBoost achieved a ROC AUC score of 0.818.
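One way to combine a linear model with a boosted-tree model is a soft-voting ensemble, which averages the predicted class probabilities of its members. The sketch below uses scikit-learn's `VotingClassifier`; to stay self-contained it substitutes scikit-learn's `GradientBoostingClassifier` for XGBoost, and the toy texts, the soft-voting choice, and the hyperparameters are all assumptions rather than the project's configuration.

```python
from sklearn.ensemble import GradientBoostingClassifier, VotingClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy corpus invented for illustration; 0 = human-written, 1 = AI-generated.
texts = [
    "i think cars are bad cuz they make smog",
    "we shoud drive less becuase gas costs alot",
    "my town has to many cars and biking is fun",
    "In conclusion, limiting car usage offers numerous benefits.",
    "Furthermore, reduced car usage lowers harmful emissions.",
    "Moreover, limiting car usage fosters sustainable communities.",
]
labels = [0, 0, 0, 1, 1, 1]

ensemble = make_pipeline(
    TfidfVectorizer(),
    VotingClassifier(
        estimators=[
            ("lr", LogisticRegression(max_iter=1000)),
            # Stand-in for XGBoost: another gradient-boosted tree model.
            ("gb", GradientBoostingClassifier(n_estimators=20, random_state=0)),
        ],
        voting="soft",  # average the members' predicted probabilities
    ),
)
ensemble.fit(texts, labels)
probs = ensemble.predict_proba(["Furthermore, car usage offers benefits."])[0]
print(probs)  # [P(human-written), P(AI-generated)]
```

Soft voting suits ROC AUC evaluation because it yields a continuous score per essay, whereas hard voting would only produce class labels.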

2. Recurrent Neural Networks (RNNs):

Our foray into deep learning involved the development of a custom model featuring a Bidirectional LSTM layer. The training process, monitored over epochs, revealed a phenomenon of diminishing returns after 5 epochs, suggesting a delicate balance between model complexity and overfitting. The training plot illustrated fluctuations in accuracy and loss, guiding our decision to optimize the model with a limited number of epochs.
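Limiting the number of epochs in this way is commonly automated with early stopping: training halts once the validation score has failed to improve for a fixed number of epochs. The following is a minimal, framework-free sketch; the function name `best_epoch`, the `patience` value, and the score history are hypothetical and are not measurements from this study.

```python
def best_epoch(val_scores, patience=3):
    """Return (epoch, score) of the best validation score seen before stopping.

    Stops once `patience` consecutive epochs fail to improve on the best score.
    Epoch numbers are 1-based.
    """
    best_idx, best_score, stale = 0, float("-inf"), 0
    for idx, score in enumerate(val_scores):
        if score > best_score:
            best_idx, best_score, stale = idx, score, 0
        else:
            stale += 1
            if stale >= patience:
                break  # no improvement for `patience` epochs in a row
    return best_idx + 1, best_score


# Hypothetical validation ROC AUC per epoch (illustrative values only).
history = [0.70, 0.74, 0.77, 0.79, 0.80, 0.79, 0.78, 0.76, 0.81]
epoch, score = best_epoch(history, patience=3)
print(epoch, score)  # 5 0.8
```

Note that the final value in `history` is never examined because training has already stopped; a larger `patience` trades extra compute for the chance of catching such late improvements.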

3. Transformers:

Exploring transformer models, we adopted the pre-trained Distilled BERT[11] Uncased transformer and fine-tuned it on our dataset. Despite the greater complexity of the transformer architecture, the model achieved a ROC AUC score of 0.775. This suggests that pre-trained language models are applicable to the task at hand, although further optimization and experimentation may be required to maximize performance.

Our analysis reveals a vast difference between traditional machine learning models and cutting-edge deep learning architectures. The baseline models provide a solid reference point, while the more complex models showcase the potential for significant performance gains. The delicate balance between model complexity and overfitting, highlighted in the RNN analysis, emphasizes the importance of careful model selection and tuning. Additionally, the transformer model’s performance indicates the evolving landscape of natural language processing, where pre-trained models offer a robust foundation for specific tasks.

In conclusion, our comprehensive analysis provides valuable insights into the different models for detecting LLM-generated essays. Each approach contributes unique perspectives, guiding future research directions and offering practical considerations for educators seeking robust tools to ensure academic integrity in the digital age.

RESULTS


Table II summarizes the results of our experiments on detecting essays generated by large language models (LLMs) compared to those authored by middle and high school students. It shows how changing the embeddings, the model architecture, the dataset size, and the number of training epochs affects the ROC AUC score.

| Notebook | ROC AUC Score | Dataset size | Model | Notes |
|---|---|---|---|---|
| Baseline | 0.664 | 10,000 | LR + XGB | Experiment 1 |
| Baseline | 0.758 | 12,078 | LR + XGB | Used glove_100d embeddings |
| RNN | 0.804 | 12,078 | LSTM | Experiment 2 |
| RNN | 0.721 | 12,078 | LSTM | Increased epochs from 5 to 20 |
| Transformers | 0.73 | 10,000 | BERT | Experiment 3 |
| Transformers | 0.775 | 12,078 | BERT | Added more data |

CONCLUSIONS

Our exploration into the detection of essays generated by large language models (LLMs) versus those authored by students has yielded compelling insights and results. The baseline model, employing a Voting classifier combining logistic regression (LR) and XGBoost (XGB), demonstrated robust performance with a ROC AUC score of 0.818, laying a solid foundation for our experiments.

Further experiments involved variations in tokenization methods and incorporating deep learning models. The classifier using embeddings showed varied performance, indicating the sensitivity of the model to different embedding techniques. The Bidirectional LSTM model initially achieved a ROC AUC score of 0.804, but increasing the number of epochs from 5 to 20 reduced the score to 0.721, indicating overfitting.

Exploring transformer models, the Distilled BERT Uncased model showcased promising results with a ROC AUC score of 0.775. Importantly, incorporating external data from different AI models (GPT) further improved the performance, underscoring the potential benefits of leveraging diverse sources in training datasets.

In conclusion, our study presents a comprehensive analysis of various models for detecting LLM-generated essays, emphasizing the need for careful model selection, parameter tuning, and consideration of external data sources. The results contribute valuable insights to the ongoing discourse on maintaining academic integrity in the face of evolving technologies, providing educators with informed tools for essay authenticity verification. Future research may explore additional model architectures and further refine the interplay between machine learning and deep learning techniques in this domain.

References

  1. Jules King, Perpetual Baffour, Scott Crossley, Ryan Holbrook, Maggie Demkin. (2023). LLM - Detect AI Generated Text. Kaggle.

  2. Zellers, R., et al. (2019). Defending against neural fake news. In Proceedings of the 33rd International Conference on Neural Information Processing Systems.

  3. Gehrmann, S., Strobelt, H., & Rush, A. (2019). GLTR: Statistical Detection and Visualization of Generated Text (pp. 111–116).

  4. Tian, Y., Chen, H., Wang, X., Bai, Z., Zhang, Q., Li, R., Xu, C., & Wang, Y. (2023, September 29). Multiscale positive-unlabeled detection of AI-generated texts. arXiv.org.

  5. Fawcett, T. (2006). An introduction to ROC analysis. Pattern Recognition Letters, 27(8), 861–874.

  6. Qader, W., Ameen, M. M., & Ahmed, B. (2019). An Overview of Bag of Words; Importance, Implementation, Applications, and Challenges (pp. 200-204). 10.1109/IEC47844.2019.8950616.

  7. Reddy, A. J., Rocha, G., & Esteves, D. (2018). Defactonlp: Fact verification using entity recognition, TFIDF vector comparison and decomposable attention. arXiv preprint arXiv:1809.00509.

  8. Vimal, B. (2020). Application of Logistic Regression in Natural Language Processing. International Journal of Engineering Research and. V9. 10.17577/IJERTV9IS060095.

  9. Chen, T., & Guestrin, C. (2016, August). Xgboost: A scalable tree boosting system. In Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining (pp. 785-794).

  10. Wang, S., & Jiang, J. (2015). Learning natural language inference with LSTM. arXiv preprint arXiv:1512.08849.

  11. Devlin, J., Chang, M. W., Lee, K., & Toutanova, K. (2018). Bert: Pre-training of deep bidirectional transformers for language understanding. arXiv preprint arXiv:1810.04805.

  12. Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., & Amodei, D. (2020). Language models are few-shot learners. Advances in neural information processing systems, 33, 1877-1901.
