Poetry generation has been a challenging task in the field of Natural Language Processing, as it requires the model to understand the nuances of language, sentiment, and style. In this paper, we propose using Large Language Models to generate Vietnamese poems of various genres from natural language prompts, thereby facilitating an intuitive process with enhanced content control.
Our most efficacious model, the GPT-3 Babbage variant, achieves a custom evaluation score of 0.8, specifically tailored to the "luc bat" genre of Vietnamese poetry. Furthermore, we explore the idea of paraphrasing poems into normal text prompts and obtain a relatively high score of 0.781 in the "luc bat" genre. This experiment presents the potential for cross-language poem-to-poem translation, with translated poems as the inputs, while maintaining complete control over the generated content.
The original dataset is a collection of 171188 Vietnamese poems in different genres: luc-bat, 5-chu, 7-chu, 8-chu, 4-chu. Download here.
For more detail, refer to the Acknowledgments section.
We also created our own datasets for prompt-based generation in the resource/dataset folder.
We trained a custom genre classifier based on BERT, with an accuracy of 99.7%, to identify the correct genre before scoring. For more detail, refer to our vietnamese-poem-classifier. This is helpful during the blind test (where the genre is not specified).
The training code is in this repo. To train the classifier, run:
python poem_classifier_training.py
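The blind-test flow described above (classify the genre first, then score against that genre's rules) can be sketched roughly as follows. This is only an illustration: `evaluate_poem`, `classify_genre`, and `score_for_genre` are hypothetical names, not functions from this repo, and the two callables are placeholders for the BERT classifier and the genre-specific scorer.

```python
from typing import Callable, Optional, Tuple


def evaluate_poem(
    poem: str,
    genre: Optional[str],
    classify_genre: Callable[[str], str],          # hypothetical: wraps the genre classifier
    score_for_genre: Callable[[str, str], float],  # hypothetical: wraps the rule-based scorer
) -> Tuple[str, float]:
    """Score a poem, predicting its genre first when none is given (blind test)."""
    if genre is None:
        # Blind test: the prompt did not specify a genre, so ask the classifier.
        genre = classify_genre(poem)
    return genre, score_for_genre(poem, genre)


if __name__ == "__main__":
    # Stand-in callables for demonstration only; they do not reflect real model behaviour.
    predicted, score = evaluate_poem(
        "some generated poem",
        None,
        classify_genre=lambda p: "luc bat",
        score_for_genre=lambda p, g: 0.5,
    )
    print(predicted, score)
```

Passing the classifier and scorer in as arguments keeps the sketch independent of how either one is actually implemented.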
We use a custom function to score the quality of a poem, based solely on its conformance to the rigid rules of the various types of Vietnamese poems. It uses 3 criteria, Length (L), Tone (T) and Rhyme (R), as follows: score = L/10 + 3T/10 + 6R/10
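As a minimal sketch of the weighting only, the formula can be written as below. It assumes each of L, T and R has already been computed as a value between 0 and 1; how the individual criteria are measured is not shown here, and the function name `poem_score` is illustrative rather than taken from the repo.

```python
def poem_score(length: float, tone: float, rhyme: float) -> float:
    """Combine per-criterion scores as score = L/10 + 3T/10 + 6R/10.

    Assumption: each input is a conformance ratio in [0, 1].
    """
    for name, value in (("length", length), ("tone", tone), ("rhyme", rhyme)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    # Weights 1/10, 3/10 and 6/10 sum to 1, so the result also lies in [0, 1].
    return (length + 3 * tone + 6 * rhyme) / 10


if __name__ == "__main__":
    # A poem with perfect length but only half of the tone and rhyme rules satisfied.
    print(poem_score(1.0, 0.5, 0.5))
```

Because rhyme carries 60% of the weight, a poem with correct line lengths but poor rhyme still scores low.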
Table 1: Result comparison of models
Models | Luc Bat | Blind | 7 Chu | 8 Chu | 5 Chu | 4 Chu |
---|---|---|---|---|---|---|
text-to-poem | | | | | | |
ChatGPT (zero-shot) | 0.440 | 0.345 | 0.292 | 0.197 | 0.284 | 0.238 |
Davinci (1000 samples) | 0.580 | - | - | - | - | - |
BLOOM (20k samples) | 0.678 | 0.596 | 0.367 | 0.279 | 0.480 | 0.440 |
Babbage (20k samples) | 0.718 | - | - | - | - | - |
Babbage | 0.805 | 0.795 | 0.661 | 0.500 | 0.382 | 0.392 |
poem-to-poem | | | | | | |
Babbage | 0.781 | - | - | - | - | - |
Currently, the Luc Bat genre scores highest due to its sheer sample size. The model also tends to generate Luc Bat when the genre is not specified, so it scores very high in the blind test as well.
The open-source version uses a LoRA for BLOOM-7b1 in 8-bit and can run on Colab. You can try it here (it will probably run out of memory and crash; it used to run fine, but new library versions conflict a lot).
@misc{huynh2024vietnamese,
title={Vietnamese Poem Generation & The Prospect Of Cross-Language Poem-To-Poem Translation},
author={Triet Minh Huynh and Quan Le Bao},
year={2024},
eprint={2401.01078},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
This project was inspired by the evaluation method from fsoft-ailab's SP-GPT2 Poem-Generator. The dataset is also taken from their repo.