This Repo is for Kaggle - LLM - Detect AI Generated Text
```bash
pip install -r requirements.txt
```

```bash
export KAGGLE_USERNAME="your_kaggle_username"
export KAGGLE_KEY="your_api_key"
```

```bash
cd large_dataset
sudo apt install unzip

kaggle datasets download -d lizhecheng/llm-detect-ai-generated-text-dataset
unzip llm-detect-ai-generated-text-dataset.zip

kaggle datasets download -d thedrcat/daigt-v2-train-dataset
unzip daigt-v2-train-dataset.zip

kaggle datasets download -d thedrcat/daigt-v3-train-dataset
unzip daigt-v3-train-dataset.zip

kaggle competitions download -c llm-detect-ai-generated-text
unzip llm-detect-ai-generated-text.zip

kaggle datasets download -d lizhecheng/daigt-datasets
unzip daigt-datasets.zip
```

```bash
cd generate_dataset
# run real_texts_based.ipynb
```

```bash
cd generate_dataset
# run change_style_0117.ipynb
```

```bash
cd models
python deberta_trainer.py
```

```bash
cd models
chmod +x ./run.sh
./run.sh
```

```bash
cd models
chmod +x ./run_awp.sh
./run_awp.sh
```

```bash
cd models
python *_cls.py
```

5. Run Classification Models with Features from the Writing Quality Competition

```bash
cd models_essay_features
python *_cls.py
```

Thanks to Kaggle and THE LEARNING AGENCY LAB for hosting this meaningful competition. In addition, I would like to thank all the Kagglers who shared datasets and innovative ideas. Although my score dropped once again on the private leaderboard, I fortunately managed to hold on to a silver medal.

- `n_grams = (3, 5)` worked best for me; I did not try `n_grams` larger than 5.
- `min_df = 2` can boost the scores of `SGD` and `MultinomialNB` by almost 0.02, but reduces the scores of `CatBoost` and `LGBM` by almost 0.01.
- When I used `min_df = 2`, I could train on up to 57k rows without encountering an out-of-memory error. However, when I didn't use `min_df = 2`, I could only train on a maximum of 45k.
- For `SGD` and `MultinomialNB`, I created a new dataset combining the DAIGT V2 Train Dataset, DAIGT V4 Magic Generations, and Gemini Pro LLM - DAIGT. I could achieve an LB score of 0.960 with only these two models (a minimal sketch of this setup follows the list).
- For `CatBoost` and `LGBM`, I still used the original DAIGT V2 Train Dataset, which gave great results on the LB.
- I tried `RandomForest` on the DAIGT V2 Train Dataset, which achieved an LB score of 0.930. I also tried `MLP` on the same dataset and got an LB score of 0.939.
- Reducing the number of `CatBoost` iterations and increasing the learning rate achieves a better score and greatly decreases execution time.
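
Below is a minimal sketch of the `TfidfVectorizer` + `MultinomialNB`/`SGD` setup described in the list above. The CSV path, the column names, and every hyperparameter other than `ngram_range=(3, 5)` and `min_df=2` are illustrative assumptions rather than this repo's exact settings; scikit-learn's default word-level analyzer is also assumed, since the write-up does not say whether the n-grams were word- or character-level.

```python
# Minimal sketch of the TfidfVectorizer + MultinomialNB / SGD setup described above.
# File name, column names, and most hyperparameters are illustrative assumptions.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

df = pd.read_csv("train_essays.csv")  # assumed columns: "text", "label"
X_train, X_val, y_train, y_val = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

# ngram_range=(3, 5) and min_df=2 are the settings reported above; the default
# word-level analyzer is an assumption, not something stated in the write-up.
vectorizer = TfidfVectorizer(ngram_range=(3, 5), min_df=2, sublinear_tf=True)
X_train_vec = vectorizer.fit_transform(X_train)
X_val_vec = vectorizer.transform(X_val)

nb = MultinomialNB(alpha=0.02)  # alpha is a guess, not the author's value
sgd = SGDClassifier(loss="modified_huber", max_iter=8000, tol=1e-4, random_state=42)

nb.fit(X_train_vec, y_train)
sgd.fit(X_train_vec, y_train)

# Equal-weight blend of the two probability outputs
blend = 0.5 * nb.predict_proba(X_val_vec)[:, 1] + 0.5 * sgd.predict_proba(X_val_vec)[:, 1]
print("Validation AUC:", roc_auc_score(y_val, blend))
```

The equal 0.5/0.5 blend at the end mirrors the (MultinomialNB, SGD) weights used in the ensemble table below.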
I divided all the models into two major categories to generate prediction results, since these two categories of models used different datasets and parameters (a small blending sketch follows the table below).

| Combo 1 | Weights 1 | Combo 2 | Weights 2 | Final Weights | LB | PB | Chosen |
|---|---|---|---|---|---|---|---|
| (MultinomialNB, SGD) | [0.5, 0.5] | (LGBM, RandomForest) | [0.5, 0.5] | [0.4, 0.6] | 0.970 | 0.907 | Yes |
| (MultinomialNB, SGD) | [0.10, 0.31] | (LGBM, CatBoost) | [0.28, 0.67] | [0.3, 0.7] | 0.966 | 0.908 | Yes |
| (MultinomialNB, SGD) | [0.5, 0.5] | (CatBoost, RandomForest) | [20.0, 8.0] | [0.20, 0.80] | 0.969 | 0.929 | After Deadline |
| (MultinomialNB, SGD) | [0.5, 0.5] | (CatBoost, RandomForest, MLP) | [4.0, 1.5, 0.3] | [0.20, 0.80] | 0.970 | 0.928 | After Deadline |
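
As a rough illustration of the two-level blend the table describes, here is a sketch using the third row's weights. The prediction arrays are random placeholders, and normalizing the within-combo weights (so that `[20.0, 8.0]` behaves like roughly `[0.71, 0.29]`) is an assumption made for readability, not a detail confirmed by the write-up.

```python
# Sketch of the two-level weighted blend from the table above (third row's weights).
import numpy as np

def blend(preds, weights):
    """Weighted average of per-model probability arrays, with normalized weights."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()
    return np.sum([w * p for w, p in zip(weights, preds)], axis=0)

rng = np.random.default_rng(0)
n_test = 1000
# Stand-ins for the real per-model test predictions
pred_mnb, pred_sgd, pred_cat, pred_rf = (rng.random(n_test) for _ in range(4))

combo1 = blend([pred_mnb, pred_sgd], [0.5, 0.5])   # (MultinomialNB, SGD)
combo2 = blend([pred_cat, pred_rf], [20.0, 8.0])   # (CatBoost, RandomForest)

# Final weights [0.20, 0.80]: the CatBoost-heavy combo gets the larger share
final_pred = 0.20 * combo1 + 0.80 * combo2
```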
Notebook Links:

- LB 0.970 / PB 0.928: MNB + SGD + CB + RF + MLP
- LB 0.969 / PB 0.929: MNB + SGD + RF + CB
As a result, although `CatBoost`'s LB score is relatively low compared to the other models, it proves to be very robust. Giving `CatBoost` a higher weight therefore leads to better performance on the PB.

- Setting `max_df` or `max_features` did not work for me.

- I tried to generate a new dataset with `gpt-3.5-turbo`, but could not get a good result on my dataset.

  ```python
  model_input = "The following is a human-written article. Now, please rewrite this article in your writing style, also optimize sentence structures and correct grammatical errors. You can appropriately add or remove content associated with the article, but should keep the general meaning unchanged. Just return the modified article.\n" + "article: " + human_text
  ```

- Tried `SelectKBest` with `chi2` to reduce the dimensionality of the vectorized sparse matrix; the LB score dropped.

  ```python
  from sklearn.feature_selection import SelectKBest, chi2

  k = int(num_features / 4)
  chi2_selector = SelectKBest(chi2, k=k)
  X_train_chi2_selected = chi2_selector.fit_transform(X_train, y_train)
  X_test_chi2_selected = chi2_selector.transform(X_test)
  ```

- Tried `TruncatedSVD` too. However, since the dimensionality of the original sparse matrix is too large, I could only set the new dimension to a very low number, which caused the LB score to drop a lot. (Setting a large output dimension for the reduction can still lead to an out-of-memory error, because `TruncatedSVD` is computed through matrix multiplication, which means the newly generated matrix also occupies memory.)

  ```python
  from sklearn.decomposition import TruncatedSVD

  n_components = int(num_features / 4)
  svd = TruncatedSVD(n_components=n_components)
  X_train_svd = svd.fit_transform(X_train)
  X_test_svd = svd.transform(X_test)
  ```

- Tried features from the last Writing Quality competition, such as the ratio of words with length greater than 5, 6, ..., 10; the ratio of sentences with length greater than 25, 50, 75; and different aggregations of word, sentence, and paragraph features (a small feature sketch follows this list).
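
As a rough sketch of the hand-crafted features described in the last bullet, the helper below computes the word-length and sentence-length ratios plus one simple aggregation. The function name, the choice to measure sentence length in words, and the specific aggregation are assumptions for illustration, not the code used in `models_essay_features`.

```python
# Sketch of the writing-quality-style hand-crafted features described above.
# Sentence length is measured in words here, which is an assumption.
import re

import pandas as pd

def essay_features(texts: pd.Series) -> pd.DataFrame:
    feats = pd.DataFrame(index=texts.index)
    word_lists = texts.apply(lambda t: t.split())
    sent_lists = texts.apply(lambda t: [s for s in re.split(r"[.!?]+", t) if s.strip()])

    # Ratio of words with length greater than 5, 6, ..., 10
    for length in range(5, 11):
        feats[f"word_len_gt_{length}"] = word_lists.apply(
            lambda ws: sum(len(w) > length for w in ws) / max(len(ws), 1)
        )

    # Ratio of sentences with length (in words) greater than 25, 50, 75
    for length in (25, 50, 75):
        feats[f"sent_len_gt_{length}"] = sent_lists.apply(
            lambda ss: sum(len(s.split()) > length for s in ss) / max(len(ss), 1)
        )

    # One simple aggregation: mean word length per essay
    feats["mean_word_len"] = word_lists.apply(
        lambda ws: sum(len(w) for w in ws) / max(len(ws), 1)
    )
    return feats

# Usage: features = essay_features(df["text"])
```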
Large language models are indeed more robust than tree models. However, in this competition they also place higher demands on the quality of the training data. I used the large publicly shared datasets from the discussions but did not achieve ideal results, so it is essential to have the machine rewrite human-written articles in order to increase the difficulty of discriminating between the two classes.
I gained a lot from this competition and look forward to applying what I've learned in the next one. Team Avengers will keep moving forward.
GitHub: Here