
Construct text corpus data and a corresponding model for automatic sarcasm detection in Korean.


SpellOnYou/korean-sarcasm


Kocasm: Korean automatic sarcasm detection

  • Why this name? Kocasm is a blend of Korean + sarcasm

Why is irony detection important?

Because sarcasm inverts or distorts the literal meaning of a sentence, it is closely tied to sentiment classification.

Preparing the data

  • HTML data gathered from Twitter
  • The data carries two labels, 1 and 0.
    • label 1: sarcasm, label 0: randomly gathered
  • For the Korean data, tweets matching queries for hashtags such as 역설, 아무말, 운수좋은날, 笑, 뭐래 아닙니다, 그럴리없다, 어그로, irony sarcastic, sarcasm were labeled as True (label 1), so the data still contains a lot of noise.
  • The dataset was then pre-processed by (1) anonymizing users, (2) removing hashtags, and (3) removing URLs.
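
The labeling and cleaning steps listed above can be sketched roughly as follows. This is an illustration under stated assumptions, not the repository's actual pipeline: the function names `label_tweet` and `clean_tweet`, the regular expressions, the `@user` placeholder, and the shortened hashtag set are all hypothetical, and the sketch assumes Python 3, where `\w` matches Hangul.

```python
import re

# Hypothetical subset of the query hashtags named in the list above.
SARCASM_HASHTAGS = {"역설", "아무말", "운수좋은날", "어그로", "sarcasm"}

# Assumed patterns; the real pipeline may tokenize tweets differently.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def label_tweet(text):
    """Return 1 if the tweet carries any sarcasm-query hashtag, else 0."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return int(bool(tags & SARCASM_HASHTAGS))


def clean_tweet(text):
    """Remove URLs, anonymize user mentions, and remove hashtags."""
    text = URL_RE.sub("", text)           # URLs first, so '#' or '@' inside them is not touched
    text = MENTION_RE.sub("@user", text)  # assumption: mentions become a fixed placeholder
    text = HASHTAG_RE.sub("", text)       # drop the hashtag that was used for labeling
    return " ".join(text.split())         # collapse leftover whitespace


if __name__ == "__main__":
    raw = "@jiwon 오늘도 야근이라니 정말 행복하다 #운수좋은날 https://t.co/abc123"
    print(label_tweet(raw), clean_tweet(raw))
```

The label has to be computed before cleaning, because cleaning removes the hashtag that the label depends on. Removing that hashtag also keeps a model from learning the label directly from the query term.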

preprocessing-pipeline

If you have any other questions about the corpus, please contact me:
- jiwon.kim.096@gmail.com

If you want to compare with another dataset, refer to: [English]

Language Model (still being edited)

  • I was strongly inspired by MirunaPislar's code and referred to it heavily, but I tried to make my code more Pythonic and closer to PyTorch style. I am still modifying the code.

  • Kocasm is compatible with: Python 2.7-3.7

To use your own data, clone this repository and...

export DATA_DIR=/path/to/data
export PREP_DIR=/path/to/preprocess
export SAVE_DIR=/path/to/save
export BERT_PRETRAIN=/path/to/pretrain  # placeholder path: pretrained weights passed to --pretrain_file

python tf_attention_models.py \
    --mode train \
    --model_cfg config/attention_base.json \
    --data_file $DATA_DIR/jiwon/train.csv \
    --test_file $DATA_DIR/jiwon/test.csv \
    --pretrain_file $BERT_PRETRAIN \
    --vocab $PREP_DIR/vocab.txt \
    --save_dir $SAVE_DIR \
    --max_len 128

Citation

If you find this dataset useful, please cite it as:

@misc{kim2019kocasm,
  author = {Kim, Jiwon and Cho, Won Ik},
  title = {Kocasm: Korean Automatic Sarcasm Detection},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SpellOnYou/korean-sarcasm}}
}

See also

Linguistics and computer science work related to sarcasm

Implementation as proposed by Yang et al. in "Hierarchical Attention Networks for Document Classification" (2016)
