This repository contains a web crawler designed to collect the Korean name Hanja (漢字) character set from the Supreme Court of Korea and Naver Dictionary.
The dataset includes characters that can be legally used in names for birth registration or name changes in Korea.
In June 2024, the Supreme Court of Korea expanded the number of Hanja characters that can be used in names from 8,319 to 9,389, an increase of 1,070 characters. This expansion marks the largest increase since 2014. The new characters include rarely used or complex Hanja such as '㖀(률)', '疋(아)', '䬈(태)', and '汩(골)'.
The regulation on name Hanja was first introduced in December 1990 to prevent the inconvenience caused by using rarely used or difficult Hanja in names. The initial list included 2,731 characters based on educational Hanja and commonly used characters. Over the years, the list has been gradually expanded through 11 amendments, reflecting the evolving naming practices and societal needs.
For comparison, China limits name characters to 3,500, and Japan restricts them to 2,999 (2,136 common-use kanji + 863 name-use kanji).
News Reference: 대법원, 이름에 사용할 수 있는 한자 '1000자 이상' 확대
- Supreme Court of Korea
- 대한민국 법원 전자가족관계등록시스템 인명용 한자 조회
- Korean Court Electronic Family Relationship Registration System Chinese Character Search for Personal Names
- https://efamily.scourt.go.kr/cs/CsBltnWrtList.do?bltnbordId=0000010
- Naver Dictionary
In Korean, the same Hanja character can have different pronunciations.
Additionally, since the query results are categorized by pronunciation, there is a possibility of the same character appearing multiple times.
Hanja character in dataset has not yet been deduplicated.
-
Supreme Court of Korea (10,163 characters in total)
data-gov.json
: Raw datadata-gov.csv
: Cleaned data
-
Naver Dictionary (8,957 characters in total)
data-naver.json
: Raw datadata-naver.csv
: Cleaned data
- Install requirements.
pip install requests
- Run the code.
Noticed that there is a one-second interval between each loop to prevent excessive requests and avoid overloading the server's connection.
# select the source: Supreme Court of Korea (gov) or Naver Dictionary (naver)
cd naver
# or
cd gov
# crawl raw data by request
python crawler.py
# clean json data to csv
python cleaner.py
Contributions are welcome!
Please open an issue or submit a pull request if you find a bug or have suggestions for improvements.
This project is licensed under the MIT License.
See the LICENSE file for details.