Kocasm: Korean Automatic Sarcasm Detection

  • Why this name? Kocasm is a blend of "Korean" and "sarcasm".

Why is irony detection important?

Because sarcasm inverts or distorts the literal meaning of a sentence, it is closely tied to sentiment classification.

Preparing the data

  • HTML data gathered from Twitter.
  • Each example carries a binary label, 1 or 0.
    • Label 1: sarcasm; label 0: randomly gathered tweets.
  • For the Korean data, tweets returned by queries for hashtags such as 역설, 아무말, 운수좋은날, 笑, 뭐래 아닙니다, 그럴리없다, 어그로, irony sarcastic, sarcasm were labeled as sarcastic (label 1), so the labels still contain a lot of noise.
  • The dataset was then pre-processed by (1) anonymizing users, (2) removing hashtags, and (3) removing URLs.
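The labeling and the three pre-processing steps listed above can be sketched roughly as follows. This is a minimal illustration, not the repository's actual pipeline: the regular expressions, the `@user` placeholder, the `SARCASM_TAGS` subset, and the function names `label_tweet` and `preprocess` are all assumptions made for the example. Note that the label has to be derived before the hashtags are stripped.

```python
import re

# Hypothetical subset of the query hashtags listed above.
SARCASM_TAGS = {"역설", "어그로", "sarcasm"}

# Assumed patterns; the real pipeline may tokenize differently.
URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
HASHTAG_RE = re.compile(r"#(\w+)")


def label_tweet(text):
    """Return 1 if the tweet carries one of the sarcasm hashtags, else 0."""
    tags = {tag.lower() for tag in HASHTAG_RE.findall(text)}
    return 1 if tags & SARCASM_TAGS else 0


def preprocess(text):
    """Apply the three cleaning steps and collapse leftover whitespace."""
    text = URL_RE.sub("", text)            # (3) remove URLs
    text = MENTION_RE.sub("@user", text)   # (1) anonymize user mentions
    text = HASHTAG_RE.sub("", text)        # (2) remove hashtags
    return " ".join(text.split())


raw = "@alice Great, another Monday #sarcasm https://t.co/xyz"
label = label_tweet(raw)   # computed on the raw text, before cleaning
clean = preprocess(raw)
```

Because the hashtag is both the labeling signal and something removed from the text, a model trained on the cleaned data cannot simply memorize the query tags.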

[Figure: preprocessing pipeline]

If you have any other questions about the corpus, please contact me:
- jiwon.kim.096@gmail.com

To compare with other datasets, see: [English]

Language Model (still being edited)

  • This code is strongly inspired by MirunaPislar's code, which I referred to heavily, but I have tried to write it in a more Pythonic and PyTorch-idiomatic style. I am still modifying the code.

  • Kocasm is compatible with Python 2.7-3.7.

To train on your own data, clone this repository and run:

export DATA_DIR=/path/to/data
export PREP_DIR=/path/to/preprocess
export SAVE_DIR=/path/to/save
# BERT_PRETRAIN (used by --pretrain_file below) must also be set before running.

python tf_attention_models.py \
    --mode train \
    --model_cfg config/attention_base.json \
    --data_file $DATA_DIR/jiwon/train.csv \
    --test_file $DATA_DIR/jiwon/test.csv \
    --pretrain_file $BERT_PRETRAIN \
    --vocab $PREP_DIR/vocab.txt \
    --save_dir $SAVE_DIR \
    --max_len 128

Citation

If you find this dataset useful, please cite it as:

@misc{kim2019kocasm,
  author = {Kim, Jiwon and Cho, Won Ik},
  title = {Kocasm: Korean Automatic Sarcasm Detection},
  year = {2019},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/SpellOnYou/korean-sarcasm}}
}

See also

Linguistics and computer science resources related to sarcasm

Implementation as proposed by Yang et al. in "Hierarchical Attention Networks for Document Classification" (2016)
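
For readers unfamiliar with that paper, the attention pooling step it describes can be sketched in plain Python as below. This is an illustrative sketch only and is not taken from this repository: the function name `attention_pool` and its argument layout are assumptions, and the recurrent encoder that would normally produce the hidden vectors is omitted. Each hidden vector is passed through a one-layer projection with tanh, scored against a learned context vector, and the softmax-normalized scores weight the sum of the hidden vectors.

```python
import math


def attention_pool(hidden, W, b, u_w):
    """Attention-weighted sum of hidden vectors.

    hidden: list of T vectors, each of length d
    W:      k x d projection matrix (list of k rows)
    b:      bias vector of length k
    u_w:    context vector of length k
    Returns (pooled vector of length d, list of T attention weights).
    """
    scores = []
    for h in hidden:
        # u = tanh(W h + b)
        u = [math.tanh(sum(w * x for w, x in zip(row, h)) + b_i)
             for row, b_i in zip(W, b)]
        # similarity between u and the context vector
        scores.append(sum(u_i * c_i for u_i, c_i in zip(u, u_w)))

    # numerically stable softmax over the T scores
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    alphas = [e / total for e in exps]

    # weighted sum of the original hidden vectors
    dim = len(hidden[0])
    pooled = [sum(a * h[j] for a, h in zip(alphas, hidden)) for j in range(dim)]
    return pooled, alphas


hidden = [[1.0, 0.0], [0.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
pooled, alphas = attention_pool(hidden, identity, [0.0, 0.0], [1.0, 0.0])
```

In the hierarchical model the same pooling is applied twice, once over words to build sentence vectors and once over sentences to build a document vector.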