[TODO] Research adding special tokens and write example code #20

Closed

SeongIkKim opened this issue Apr 28, 2021 · 1 comment

Labels: documentation, enhancement
SeongIkKim commented Apr 28, 2021 (edited)

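📌 Summary

(Before) Why would adding a special token improve model performance? Doesn't the extra dimension just scatter the information?
(After) Key point: special tokens can be added by replacing the [unusedXXX] tokens in vocab.txt. However, since our task is not domain-specific, we should not expect a large gain.
- Tokens that are used often but that the model splits apart can be grouped and treated as a single entity, which may improve the model's understanding.
- Special tokens can also be added with add_special_tokens followed by resize_token_embeddings. This preserves the existing pretrained embedding layer. However, it is generally not the method used when the goal is simply to capture entities the model misses.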

📔 Details

Before

Find out how the embedding vectors change when a special token is added
Find out how the tokenizer length has to change when a special token is added
Look for preprocessing methods that help the model tokenize better

After

Does adding special tokens improve performance?

Are special tokens [CLS] [SEP] absolutely necessary while fine tuning BERT?

Fine-tuning a pretrained model with the special tokens inserted in the same format as pretraining performed about 20% better than fine-tuning without them.
- This is because the model then sees the data in a form close to what it saw during pretraining.
- BERT for NER
  
  BERT also expects that each sentence starts with a [CLS] token and ends with a [SEP] token. These special tokens are not particularly relevant for the NER task, considering that classification is done token-wise and the special tokens have no associated tag. Nevertheless, they should be included so that the fine-tuning input is not too different from the pre-training input.
- Simply adding special tokens does not by itself improve performance.
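The point about keeping the fine-tuning input close to the pretraining input can be sketched for a token-level task as follows. This is a minimal illustration, not the code from the linked discussions: the function name `wrap_for_bert` and the `IGNORE_LABEL` value are assumptions, chosen so that the special tokens carry no tag of their own.

```python
from typing import List, Tuple

CLS, SEP = "[CLS]", "[SEP]"
# Assumption: a label id that the loss function is configured to skip.
IGNORE_LABEL = -100


def wrap_for_bert(tokens: List[str], label_ids: List[int]) -> Tuple[List[str], List[int]]:
    """Add [CLS]/[SEP] around a token sequence and extend the labels to match."""
    if len(tokens) != len(label_ids):
        raise ValueError("tokens and label_ids must have the same length")
    wrapped_tokens = [CLS] + tokens + [SEP]
    # The special tokens have no associated NER tag, so they get the ignore label.
    wrapped_labels = [IGNORE_LABEL] + label_ids + [IGNORE_LABEL]
    return wrapped_tokens, wrapped_labels


tokens, labels = wrap_for_bert(["seoul", "is", "big"], [1, 0, 0])
print(tokens)
print(labels)
```

The labels stay aligned with the tokens, and the two extra positions can be excluded from the loss.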

Adding new token for fast tokenizer not working · Issue #507 · huggingface/tokenizers

The goal in that issue is to add a special token so that a particular word (expected to appear often) is not split during tokenization.
- Calling add_special_tokens does keep the word from being split.
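To make the splitting behaviour concrete, here is a toy greedy longest-match tokenizer in the WordPiece style. It is only a sketch under stated assumptions: the tiny vocabulary, the `tokenize` helper, and its `added_tokens` parameter are hypothetical and do not reproduce the Hugging Face implementation. It shows the idea that a registered token is matched as a whole before any sub-word splitting happens.

```python
from typing import Iterable, List, Set

UNK = "[UNK]"


def wordpiece(word: str, vocab: Set[str]) -> List[str]:
    """Greedy longest-match-first split of a single word."""
    pieces: List[str] = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry a ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [UNK]  # no piece fits, so the whole word becomes unknown
        pieces.append(match)
        start = end
    return pieces


def tokenize(text: str, vocab: Set[str], added_tokens: Iterable[str] = ()) -> List[str]:
    """Whitespace-split the text; added tokens are kept whole, the rest is split."""
    protected = set(added_tokens)
    out: List[str] = []
    for word in text.split():
        if word in protected:
            out.append(word)
        else:
            out.extend(wordpiece(word, vocab))
    return out


vocab = {"the", "token", "##izer", "##s", "work"}
before = tokenize("the tokenizers work", vocab)
after = tokenize("the tokenizers work", vocab, added_tokens=["tokenizers"])
print(before)
print(after)
```

Without the added token the word is broken into three pieces; with it, the word survives as one token that then needs its own embedding row.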

When a special token is added, does it replace one of the unk tokens?

BERT tokenizer - set special tokens · Issue #599 · huggingface/transformers

The question: just as GPT uses various special tokens to improve performance, can BERT be fine-tuned with custom special tokens to improve performance?
- The answer: yes, and a few unused tokens were left in the vocabulary for that purpose.
  
  Hi Adrian, BERT already has a few unused tokens that can be used similarly to the special_tokens of GPT/GPT-2.
- Answer from a Google researcher on BERT about adding domain-specific vocabulary
  
  My recommendation would be to just use the existing wordpiece vocab and run pre-training for more steps on the in-domain text, and it should learn the compositionality "for free". Keep in mind that with a wordpiece vocabulary there are basically no out-of-vocabulary words, and you don't really know which words were seen in the pre-training and not. Just because a word was split up by word pieces doesn't mean it's rare, in fact many words which were split into wordpieces were seen 5,000+ times in the pre-training data.
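  - The best option is to continue pretraining for more steps on plenty of in-domain text so that the model learns the domain vocabulary directly. Because a wordpiece vocabulary has essentially no OOV problem, important (or frequent) words are seen often during pretraining anyway, and the model learns them naturally as compositions of existing word pieces.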
  But if you want to add more vocab you can either:
  > (a) Just replace the "[unusedX]" tokens with your vocabulary. Since these were not used they are effectively randomly initialized.
  > (b) Append it to the end of the vocab, and write a script which generates a new checkpoint that is identical to the pre-trained checkpoint, but with a bigger vocab where the new embeddings are randomly initialized (for initialized we used tf.truncated_normal_initializer(stddev=0.02)). This will likely require mucking around with some tf.concat() and tf.assign() calls.
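  - If you want to modify the vocab, swap out one of the unused tokens. There is no need to initialize anything yourself, since those slots already exist and are effectively randomly initialized.
  - Alternatively, append tokens to the vocab and build a new checkpoint in which the new embeddings are randomly initialized. BERT reportedly used tf.truncated_normal_initializer(stddev=0.02).
- Issue about embeddings after fine-tuning
  - One of the unused tokens in vocab.txt was replaced with a new special token [NEW], which was then passed to add_special_tokens.
  - However, the additional training data was very small compared with the data used for pretraining, so the effect was marginal.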
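Option (a) from the quoted answer, renaming `[unusedX]` slots in vocab.txt, could look roughly like the sketch below. The helper `replace_unused` and the short in-memory vocabulary are hypothetical; a real vocab.txt would be read from and written back to disk, one token per line. The important property is that every other token keeps its index, so the pretrained embedding rows still line up.

```python
import re
from typing import List

# Matches placeholder entries such as [unused0], [unused1], ...
UNUSED = re.compile(r"^\[unused\d+\]$")


def replace_unused(vocab_lines: List[str], new_tokens: List[str]) -> List[str]:
    """Return a copy of the vocab with the first [unusedN] slots renamed."""
    existing = set(vocab_lines)
    # Drop duplicates and tokens that are already in the vocab.
    pending = [t for t in dict.fromkeys(new_tokens) if t not in existing]
    result = list(vocab_lines)
    for i, line in enumerate(result):
        if not pending:
            break
        if UNUSED.match(line):
            result[i] = pending.pop(0)
    if pending:
        raise ValueError(f"not enough [unusedN] slots for: {pending}")
    return result


vocab = ["[PAD]", "[unused0]", "[unused1]", "[UNK]", "[CLS]", "[SEP]", "the"]
patched = replace_unused(vocab, ["[NEW]"])
print(patched)
```

Because the vocabulary size is unchanged, no embedding resize is needed with this approach; the renamed slot simply starts from its effectively random pretrained row.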

If new special tokens are inserted into a pretrained tokenizer, won't that make the existing information unusable?

Calling add_special_tokens and then resize_token_embeddings avoids this problem.
- How to add some new special tokens to a pretrained tokenizer? huggingface/tokenizers#247
- Resizing adds rows for the new tokens without altering the existing pretrained embedding weights.
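Why resizing does not destroy the pretrained information can be illustrated with plain lists. This is a conceptual sketch only, not the library's implementation: `resize_embeddings` and `truncated_normal` are hypothetical helpers, and the choice to draw new rows from a truncated normal with stddev 0.02 is an assumption borrowed from the BERT answer quoted earlier in this issue.

```python
import random
from typing import List

Matrix = List[List[float]]


def truncated_normal(rng: random.Random, stddev: float) -> float:
    """Draw from N(0, stddev) and redraw anything beyond two standard deviations."""
    while True:
        x = rng.gauss(0.0, stddev)
        if abs(x) <= 2 * stddev:
            return x


def resize_embeddings(old: Matrix, new_vocab_size: int,
                      stddev: float = 0.02, seed: int = 0) -> Matrix:
    """Copy the existing rows unchanged and append randomly initialized rows."""
    if not old or new_vocab_size < len(old):
        raise ValueError("this sketch only grows a non-empty embedding matrix")
    rng = random.Random(seed)
    dim = len(old[0])
    resized = [row[:] for row in old]  # pretrained rows are preserved as-is
    for _ in range(new_vocab_size - len(old)):
        resized.append([truncated_normal(rng, stddev) for _ in range(dim)])
    return resized


pretrained = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
resized = resize_embeddings(pretrained, new_vocab_size=3)
print(len(resized), resized[:2] == pretrained)
```

Only the appended row is new; it carries no learned meaning until fine-tuning updates it, which is consistent with the small effect reported above when little extra data is available.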
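In fact, there is no need to use special tokens at all. To turn a frequently used word into a single token, use the spare [unusedXXX] tokens that were left for this purpose: replace unusedXXX in vocab.txt with the desired token text and then fine-tune. This is the commonly used approach.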
- How to use my own additional vocabulary dictionary? google-research/bert#396


SeongIkKim added the enhancement label

SeongIkKim self-assigned this

SeongIkKim added the documentation label

Contributor

ggm1207 commented Apr 28, 2021

Oh, you're taking this on! Thank you!

SeongIkKim closed this as completed

sooyounlee added this to the Week1 milestone
