ToMuchInfo (NAVER AI Hackathon)
Authors : 이상헌, 조용래, 박성남
- Preprocessing
- Tokenize
- Feature Extraction
- Embedding
- Model
- Ensemble (optional)
Preprocessing
- normalizers.py : corrects bad words and typos (sketched below)
- LSUV.py, ironyer.py : apply LSUV init (https://arxiv.org/pdf/1511.06422.pdf)
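To make the normalization step concrete, here is a minimal sketch of a dictionary-based corrector. The replacement table and the naive substring matching are invented for illustration; the actual rules live in normalizers.py.

```python
import re

# Hypothetical replacement table; the real mappings in normalizers.py
# were curated by hand and are far larger.
REPLACEMENTS = {
    "잼있": "재밌",  # common typo -> corrected form (illustrative)
    "넘": "너무",    # naive substring replacement, for illustration only
}

def normalize(text: str) -> str:
    """Correct known bad words and typos, then collapse character spam."""
    for bad, good in REPLACEMENTS.items():
        text = text.replace(bad, good)
    # Collapse runs of 3+ identical characters ("ㅋㅋㅋㅋㅋ" -> "ㅋㅋ").
    return re.sub(r"(.)\1{2,}", r"\1\1", text)

print(normalize("넘 잼있다 ㅋㅋㅋㅋㅋ"))  # 너무 재밌다 ㅋㅋ
```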
Tokenize
- DummyTokenizer : dummy tokenizer that splits a sentence by spaces
- JamoTokenizer : splits text into jamos (sketched after this list)
- JamoMaskedTokenizer : splits text into jamos and masks movie and actor names
- TwitterTokenizer : tokenizes text using konlpy's Twitter module
- SoyNLPTokenizer : tokenizes text using SoyNLP's MaxScoreTokenizer
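The jamo-based tokenizers rely on standard Unicode Hangul decomposition: a composed syllable in U+AC00–U+D7A3 splits arithmetically into its initial, medial, and final jamo. A minimal sketch of the idea (the repo's JamoTokenizer may differ in detail):

```python
CHOSUNG = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")
JUNGSUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")
JONGSUNG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")

def to_jamos(text):
    """Split composed Hangul syllables into jamos; pass other chars through."""
    out = []
    for ch in text:
        code = ord(ch) - 0xAC00
        if 0 <= code < 11172:  # composed Hangul syllable block
            out.append(CHOSUNG[code // 588])
            out.append(JUNGSUNG[(code % 588) // 28])
            if code % 28:  # final consonant is optional
                out.append(JONGSUNG[code % 28])
        else:
            out.append(ch)
    return out

print(to_jamos("영화"))  # ['ㅇ', 'ㅕ', 'ㅇ', 'ㅎ', 'ㅗ', 'ㅏ']
```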
Feature Extraction
- LengthFeatureExtractor : length of the token sequence
- ImportantWordFeaturesExtractor : counts of negative words, swear words, and twist words (sketched below)
- MovieActorFeaturesExtractor : finds frequently mentioned actors/movies and one-hot encodes them
- AbnormalWordExtractor : one-hot encodes words that looked meaningful from manual inspection of the data
- SleepnessExtractor : number of expressions saying the movie was sleep-inducing
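As a sketch of how these extractors work, the snippet below counts occurrences from fixed word lists, ImportantWordFeaturesExtractor-style. The word lists here are invented for the example; the real ones were curated by hand.

```python
# Illustrative word lists, not the project's actual vocabulary.
NEGATIVE_WORDS = {"최악", "노잼", "지루"}
SWEAR_WORDS = {"쓰레기"}
TWIST_WORDS = {"하지만", "그런데", "반전"}

def important_word_features(tokens):
    """Count negative, swear, and twist words in a tokenized review."""
    return [
        sum(t in NEGATIVE_WORDS for t in tokens),
        sum(t in SWEAR_WORDS for t in tokens),
        sum(t in TWIST_WORDS for t in tokens),
    ]

print(important_word_features(["최악", "이다", "하지만", "반전", "있다"]))  # [1, 0, 2]
```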
Embedding
- RandomDictionary : simply maps each word to an index and returns it
- FastTextDictionary : loads a pretrained FastText embedding and embeds with it
- FastTextVectorizer : trains FastText on the train set and uses it for embedding
- Word2VecVectorizer : trains Word2Vec on the train set and uses it for embedding
- TfidfVectorizer : tf-idf vectorizes text with sklearn, fitted on the train set (sketched below)
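Since TfidfVectorizer wraps sklearn, a minimal sketch of that step, with toy reviews standing in for the real data:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy reviews stand in for the real train/test sets.
train_reviews = ["정말 재밌는 영화", "지루하고 최악인 영화"]
test_reviews = ["정말 지루하고 최악"]

vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_reviews)  # fit on the train set only
X_test = vectorizer.transform(test_reviews)        # reuse the fitted vocabulary
print(X_train.shape, X_test.shape)
```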
Model
- VDCNN : Very Deep Convolutional Networks for Text Classification
- WordCNN : Convolutional Neural Networks for Sentence Classification
- BiLSTM : Text Classification Improved by Integrating Bidirectional LSTM with Two-dimensional Max Pooling (sketched below)
- CNNTextInception : Merging Recurrence and Inception-Like Convolution for Sentiment Analysis
- DCNN-LSTM : our team's own architecture
- LSTM_Attention : Attention-Based Bidirectional Long Short-Term Memory Networks for Relation Classification
- RCNN : Recurrent Convolutional Neural Networks for Text Classification
- TDSM : Character-Based Text Classification using Top Down Semantic Model for Sentence Representation
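All of the above are neural text classifiers; as one example, here is a minimal BiLSTM sketch, assuming a PyTorch implementation and simplifying the paper's two-dimensional max pooling to max-pooling over time. Hyperparameters are illustrative, not the repo's.

```python
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim=128, hidden_dim=64):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim,
                            batch_first=True, bidirectional=True)
        self.fc = nn.Linear(2 * hidden_dim, 1)  # regress a single score

    def forward(self, token_ids):                       # (batch, seq_len)
        outputs, _ = self.lstm(self.embed(token_ids))   # (batch, seq_len, 2*hidden)
        pooled, _ = outputs.max(dim=1)                  # max-pool over time steps
        return self.fc(pooled).squeeze(-1)              # (batch,)

model = BiLSTMClassifier(vocab_size=1000)
scores = model(torch.randint(0, 1000, (4, 20)))  # 4 reviews, 20 tokens each
```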
Ensemble (optional)
- Average : trains several models simultaneously in each epoch and checkpoints each one once its validation loss stops improving; only the best-performing models are kept, and their predictions are averaged
- XGBRegressor : finds the best epoch for each model, runs them all at once, and fits xgboost on their outputs to produce the final prediction (sketched below)
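A sketch of the XGBRegressor stacking idea, with the simple average shown for comparison. The arrays here are randomly generated placeholders for real base-model predictions, and the 0–10 score range is an assumption for the example.

```python
import numpy as np
from xgboost import XGBRegressor

# Placeholder predictions: rows are samples, columns are base models.
model_preds = np.random.rand(1000, 8)
y_true = np.random.rand(1000) * 10  # assumed review-score range

# Average ensemble: keep the well-performing models and take the mean.
avg_pred = model_preds[:, :5].mean(axis=1)

# XGBRegressor ensemble: learn how to combine the base-model outputs.
stacker = XGBRegressor(n_estimators=100, max_depth=3)
stacker.fit(model_preds, y_true)
final_pred = stacker.predict(model_preds)  # in practice, on held-out predictions
```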