The official repository for "UnihanLM: Coarse-to-Fine Chinese-Japanese Language Model Pretraining with the Unihan Database", AACL-IJCNLP 2020
The pretrained model is available at 🤗 Hugging Face Model Hub: https://huggingface.co/microsoft/unihanlm-base
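A minimal loading sketch with 🤗 Transformers (assuming the hub repository ships the usual config and tokenizer files; follow the model card if it specifies a particular model class):

```python
from transformers import AutoTokenizer, AutoModel

# Load the released UnihanLM checkpoint from the Hugging Face Model Hub.
tokenizer = AutoTokenizer.from_pretrained("microsoft/unihanlm-base")
model = AutoModel.from_pretrained("microsoft/unihanlm-base")
```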
We have made the code to find the Unihan clusters and the cached cluster IDs available here.
Please follow our paper and use the training code from facebookresearch/XLM.
- Preprocess your corpus by replacing every character with the first character of its Unihan cluster (a minimal sketch is given after this list).
- After cluster-level pretraining, copy the embedding of the first character in each cluster to the other characters in the same cluster:
import torch
# Copy the first (representative) character's embedding to the rest of its cluster.
with torch.no_grad():
    for cluster in clusters:
        for character_id in cluster[1:]:
            embedding.weight[character_id] = embedding.weight[cluster[0]]
- Re-preprocess the corpus with the standard procedure, this time keeping the original characters (no cluster replacement).
- Restart training on the re-preprocessed corpus.
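As a reference for the first step, here is a minimal sketch of the cluster-level (coarse) corpus preprocessing. It assumes `clusters` is the list of character clusters loaded from the cached cluster IDs in this repository; the helper names `build_char_map` and `coarsen` are illustrative, not part of the released code:

```python
def build_char_map(clusters):
    """Map every character in a cluster to the cluster's first (representative) character."""
    char_map = {}
    for cluster in clusters:
        representative = cluster[0]
        for ch in cluster:
            char_map[ch] = representative
    return char_map

def coarsen(text, char_map):
    """Replace each character with its cluster representative; leave other characters untouched."""
    return "".join(char_map.get(ch, ch) for ch in text)

# Illustrative example: if 国, 國 and 囯 fall into one cluster whose first
# character is 国, then coarsen("中國", char_map) -> "中国".
```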
Note: XLM is released under a CC BY-NC 4.0 licence.