This repository contains the code written for the paper "Fine-tuning GPT-3 for Synthetic Danish News Generation" (Almasi & Schiønning, 2023).
The project involved fine-tuning GPT-3 to produce synthetic news articles in Danish and evaluating the model in binary classification tasks. The evaluation relied on both human participants (A) and machine classifiers (B).
For the details of this evaluation, please refer to Almasi & Schiønning (2023).
Due to copyright and GDPR constraints, only the test data and the synthetically generated GPT-3 data are uploaded to this GitHub repository. For all other purposes, dummy data is provided to reproduce the pipelines (see also *Project Structure*). To run any of the pipelines, follow the instructions in the *Pipeline* section.
For any other questions regarding the project, please contact the authors.
The repository is structured as follows:
| Folder/File | Description |
|---|---|
| `dummy_data` | Dummy data to run the GPT-3 pipeline, reproduce the plots from experiment A (human participants), and reproduce the technical pipelines from experiment B (machine classifiers). Created to mimic the actual data to the extent possible. |
| `dummy_results` | Files produced by running the dummy scripts in `src`. Due to the limited dummy data, these may not contain any intelligible information. |
| `data` | Contains the 96 test articles used in both experiments A and B (i.e., for evaluating both human participants and machine classifiers) and the 609 articles generated by GPT-3 for fine-tuning BERT. |
| `plots` | Plots used in Almasi & Schiønning (2023). |
| `results` | Results from the machine classifiers presented in Almasi & Schiønning (2023). |
| `src` | All code, organised in the folders `process_articles`, `gpt3`, and `classifiers`. |
| `tokens` | Empty folder in which to place `openai_token.txt` (for the GPT-3 pipeline) and `hf_token.txt` (to push the model to the Hugging Face Hub; optional). |
| `setup.sh` | Run to install general requirements and packages in a virtual environment. Note that additional setup may be required for the individual pipelines. |
| `simple_classifier.sh` | Run to reproduce the simple classifier pipelines. |
| `bert_classifier.sh` | Run to reproduce the BERT pipeline. |
Please note that the files in `results`, `plots`, and `data` contain actual data pertaining to Almasi & Schiønning (2023), while the files in `dummy_data` and `dummy_results` do not.
For this project, Python (version 3.10) and R were used. Python's venv module needs to be installed for the setup to work.
To install the necessary requirements in a virtual environment (`env`), please run `setup.sh` in the terminal:
```bash
bash setup.sh
```
The individual technical pipelines may require extra setup. These steps are explained in their respective READMEs.
Refer to the README.md located in `src/process_articles` to reproduce the article preprocessing.
To fine-tune and/or generate text with GPT-3 on dummy data, refer to the README.md located in `src/gpt3`.
⚠️ NOTE!
The current script fine-tunes "text-davinci", but this model will be deprecated on 4 January 2024. You can read more about this at https://openai.com/blog/gpt-4-api-general-availability.
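For orientation only, here is a minimal sketch of what generation with a fine-tuned model could look like using the legacy `openai` Python SDK (pre-v1.0), not the project's actual script. The model ID and prompt below are hypothetical placeholders; the API key location follows the `tokens` folder described above:

```python
# Minimal sketch with the legacy openai SDK (pre-v1.0); the fine-tuned
# model ID and the prompt are hypothetical placeholders.
import openai
from pathlib import Path

# the pipeline expects the API key in tokens/openai_token.txt
openai.api_key = Path("tokens/openai_token.txt").read_text().strip()

response = openai.Completion.create(
    model="davinci:ft-personal-2023-01-01-00-00-00",  # placeholder fine-tuned model ID
    prompt="Skriv en nyhedsartikel om ...",           # Danish prompt, as in the paper
    max_tokens=500,
    temperature=0.8,
)
print(response["choices"][0]["text"])
```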
To run the analysis, please refer to the R Markdown file `exp-a-analysis.Rmd` in the `src` folder.
To construct the machine classifiers (`BOW`, `TF-IDF`, and fine-tuned `BERT`), follow the instructions in the README.md located in `src/classifiers`.
⚠️ NOTE!
While the fine-tuning of `NbAiLab/nb-bert-large` is done on dummy data, the inference is done with the actual fine-tuned classifier on the real test data.
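As an illustration only (not the project's actual pipeline), a simple baseline along the lines of the `TF-IDF` classifier could be sketched with scikit-learn; the file and column names here are hypothetical:

```python
# Illustrative TF-IDF baseline (not the project's actual pipeline);
# the file names and column names below are hypothetical.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report
from sklearn.pipeline import make_pipeline

# hypothetical dummy-data layout: one text column, one human/GPT-3 label column
train = pd.read_csv("dummy_data/train.csv")
test = pd.read_csv("dummy_data/test.csv")

# TF-IDF features fed into a logistic regression classifier
clf = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
clf.fit(train["text"], train["label"])

print(classification_report(test["label"], clf.predict(test["text"])))
```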
The fine-tuned BERT can be accessed from the Hugging Face Hub: `MinaAlmasi/dknews-NB-BERT-AI-classifier`
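A minimal usage sketch with 🤗 Transformers, assuming only that the model ID above is a text-classification checkpoint (the returned label names depend on the model's config):

```python
# Load the fine-tuned classifier from the Hugging Face Hub and run inference
from transformers import pipeline

classifier = pipeline(
    "text-classification",
    model="MinaAlmasi/dknews-NB-BERT-AI-classifier",
)

# Danish example text; the label names returned depend on the model's config
print(classifier("Dette er en kort dansk nyhedsartikel om vejret i morgen."))
```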
For any questions regarding the paper or reproducibility of the project, you can contact us:
- drasbaek@post.au.dk (Anton Drasbæk Schiønning)
- mina.almasi@post.au.dk (Mina Almasi)