A library for extracting historical toponyms from texts, geocoding them, and visualizing the results on maps.
pip install his_geo --upgrade
from his_geo import extractor
from his_geo import geocoder
prompt = """
I would like you to take on the roles of both a Geographer and a Historian.
You possess extensive knowledge in Chinese geography and history, with a particular expertise in historical toponymy.
Your task is to extract precise location references of historical toponyms from texts.
When I provide a scholarly text analyzing the location of one or several historical toponyms, please identify and extract both the toponyms and their corresponding location references from the text.
Keep the following in mind:
1. If the text presents differing opinions of the same historical toponym's location from various scholars, only extract the most correct location reference that the author of the text acknowledges or agrees with. Do not include location references that the author disputes.
2. If a toponym is mentioned in the text but no location is provided, please skip this toponym.
3. Present the extracted information always in Chinese and strictly adhere to the following format:
"Toponym 1", "Location 1"
"Toponym 2", "Location 2"
Please do not include any explanation, verb or extraneous information.
The text is as follows:
"""
model = "chatgpt"
model_version = "gpt-4o-2024-08-06" # "gpt-4-turbo-2024-04-09"
api_key = "Your API key"
An API key can be obtained from the OpenAI website.
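To avoid hardcoding the key in a script, it can be read from an environment variable instead. The following is a minimal sketch; the helper name `load_api_key` and the variable name `OPENAI_API_KEY` are assumptions for illustration and are not part of his_geo.

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the API key stored in an environment variable.

    Both the function name and the default variable name are
    hypothetical; adjust them to your own setup.
    """
    key = os.environ.get(env_var, "").strip()
    if not key:
        # Fail early with a clear message instead of sending an empty key.
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the extractor."
        )
    return key
```

With this helper, `api_key = load_api_key()` can replace the literal string above.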
llm_extractor = extractor.Extractor(prompt, output_dir="The output directory",
model=model, model_version=model_version, api_key=api_key)
Load the texts from a CSV or Excel file:
import pandas as pd
data = pd.read_csv("Your CSV file") # data = pd.read_excel("Your EXCEL file")
texts = data["The Text Column"].tolist()
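A column read from a spreadsheet may contain empty cells, which pandas represents as `NaN` floats. The sketch below filters those out before extraction while keeping the original row position of each text. The helper `clean_texts` is hypothetical and not part of his_geo.

```python
import math


def clean_texts(values):
    """Return (position, text) pairs for the non-empty entries of `values`.

    Skips None, NaN floats, and strings that are empty after stripping
    whitespace. The position is the index in the original list, so
    results can still be matched back to their source rows.
    """
    cleaned = []
    for position, value in enumerate(values):
        if value is None:
            continue
        if isinstance(value, float) and math.isnan(value):
            continue
        text = str(value).strip()
        if text:
            cleaned.append((position, text))
    return cleaned
```

For example, `texts = [text for _, text in clean_texts(texts)]` yields a list that contains only usable strings.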
Run the extractor:
results = llm_extractor.extract_texts(texts)
# Extracting text 0 to ./extracted_results_chatgpt_gpt-4o-2024-08-06.json
# Extracting text 1 to ./extracted_results_chatgpt_gpt-4o-2024-08-06.json
# Extracting text 2 to ./extracted_results_chatgpt_gpt-4o-2024-08-06.json
# Extracting text 3 to ./extracted_results_chatgpt_gpt-4o-2024-08-06.json
# Extracting text 4 to ./extracted_results_chatgpt_gpt-4o-2024-08-06.json
# ......
The results are saved automatically to the output_dir you set.
print(texts[0])
# '此为最早明确见于文献记载中的楚县,亦是春秋置县之首例。权本为子姓小国,后为楚武王所灭,并被改建为县。《左传》庄公十八年载:“初,楚武王克权,使斗缗尹之。”斗缗为楚国大夫,“尹之”,就是以斗缗为权县的长官,来管理县内的有关事务。楚武王在位时间为公元前740年至前690年。《水经·沔水注》曰:“沔水自荆城东南流,迳当阳县之章山东,山上有故城,太尉陶侃伐杜曾所筑也。……沔水又东,右会权口,水出章山,东南流权城北,古之权国也。”《大清一统志》卷342安陆府古迹权城下亦云:“在钟祥县西南。”是权县当位于今湖北省荆门县东南。杨伯峻《春秋左传注》以为在今湖北省当阳县东南,恐非,当是将古当阳县(位于今荆门县西南)与今当阳县错混而致误。后斗缗据权县而叛楚,楚武王率军“围而杀之”。随后“迁权于那处,使阎敖尹之”(《左传》庄公十八年),即楚武王把权县原有的臣民迁往“那处”,并在那处设县,让阎敖为县尹,负责那处的地方政务。又,徐少华认为“迁权于那处”的应是指权国旧贵族及部分平民,在权县当仍有大多数平民留于当地而为县民,不可能全面迁走而使权成为弃地,权县仍当继续存在。其说恐未必与当时的事情发展相符。因权与那处颇近,权迁那处后,权已演变为一居民点,即一般的楚邑,而权县应当不复存在了。'
print(results[0])
# '权县,今湖北省荆门市东南'
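The prompt asks for one `"Toponym", "Location"` pair per line, while the sample result above uses a full-width comma and no quotation marks. The following sketch turns such a result string into pairs and tolerates both styles. The function `parse_extraction` is an assumption for illustration, not part of his_geo, and it does not handle commas inside a toponym.

```python
import re


def parse_extraction(raw):
    """Parse lines of 'toponym, location' into (toponym, location) tuples.

    Accepts ASCII or full-width commas as the separator and strips
    surrounding straight or curly quotation marks. Lines that do not
    contain a separator are skipped.
    """
    pairs = []
    for line in raw.splitlines():
        line = line.strip()
        if not line:
            continue
        # Split only on the first comma so the location may contain commas.
        parts = re.split(r"\s*[,,]\s*", line, maxsplit=1)
        if len(parts) != 2:
            continue
        toponym, location = (part.strip().strip("\"“”'") for part in parts)
        if toponym and location:
            pairs.append((toponym, location))
    return pairs
```

Applied to the sample result, `parse_extraction(results[0])` gives one pair whose second element can be passed to the geocoder as an address.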
Load the addresses from a CSV or Excel file:
data = pd.read_csv("Your CSV file") # data = pd.read_excel("Your EXCEL file")
addresses = data["The Address Column"].tolist()
geocoder_test = geocoder.Geocoder(addresses,
lang="ch",
preferences=["modern", "historic"],
if_certainty=True)
Currently, only 'ch' (Chinese) is supported for lang.
# Normalize the toponyms, then match each address against locations in the existing database
geocoder_test.match_address()
# Detect whether an address contains directional information, which makes the calculation more accurate
geocoder_test.detect_direction()
# Calculate a coordinate for each address
geocoder_test.calculate_point()
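The internals of `detect_direction()` and `calculate_point()` are not described here. As a rough illustration of the general idea only, the sketch below splits a trailing direction term such as 东南 (southeast) off an address and shifts a reference coordinate along that bearing. Every name in it (`DIRECTION_BEARINGS`, `split_direction`, `offset_point`) and the flat-earth approximation are assumptions, not the library's method.

```python
import math

# Compass bearings in degrees, clockwise from north (illustrative only).
DIRECTION_BEARINGS = {
    "东北": 45.0, "东南": 135.0, "西南": 225.0, "西北": 315.0,
    "北": 0.0, "东": 90.0, "南": 180.0, "西": 270.0,
}


def split_direction(address):
    """Split a trailing direction term off an address.

    Returns (place, direction), with direction None if no term is found.
    Limitation: place names that themselves end in a direction character
    (for example 山东 or 河南) would be split incorrectly.
    """
    # Try two-character terms first so 东南 is not read as 南.
    for term in sorted(DIRECTION_BEARINGS, key=len, reverse=True):
        if address.endswith(term) and len(address) > len(term):
            return address[: -len(term)], term
    return address, None


def offset_point(lat, lon, bearing_deg, distance_km):
    """Shift (lat, lon) by distance_km along bearing_deg.

    Uses a flat-earth approximation (about 111.32 km per degree of
    latitude), which is adequate only for short distances.
    """
    bearing = math.radians(bearing_deg)
    dlat = distance_km * math.cos(bearing) / 111.32
    dlon = distance_km * math.sin(bearing) / (111.32 * math.cos(math.radians(lat)))
    return lat + dlat, lon + dlon
```

For the sample result, `split_direction("今湖北省荆门市东南")` separates the place 今湖北省荆门市 from the direction 东南, whose bearing could then be used to offset the matched coordinate.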
# Use a name other than `map` to avoid shadowing the Python built-in
result_map = geocoder_test.visualize()
result_map
geocoder_test.data.to_csv("The file path", encoding='utf-8-sig')
Method | Model | Precision | Recall | F1 score |
---|---|---|---|---|
NER | albert-base-chinese-ner | 0.400 | 0.732 | 0.486 |
NER | bert-base-chinese-ner | 0.410 | 0.745 | 0.494 |
NER | roberta-base-finetuned-cluener2020-chinese | 0.548 | 0.914 | 0.644 |
Prompting | gpt-3.5-turbo-0125 | 0.684 | 0.785 | 0.709 |
Prompting | gpt-4-turbo-2024-04-09 | 0.733 | 0.811 | 0.756 |
Prompting | gpt-4o-2024-08-06 | 0.829 | 0.848 | 0.831 |
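The F1 scores in the table are not the harmonic mean of the listed precision and recall; for example, the harmonic mean of 0.829 and 0.848 is about 0.838, not 0.831. One possible explanation, which is an assumption and not stated in the source, is that the three metrics were computed per text and then averaged. The sketch below shows that style of evaluation on sets of extracted items; the function names are hypothetical.

```python
def precision_recall_f1(predicted, gold):
    """Compute set-based precision, recall, and F1 for one text."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


def macro_average(pairs):
    """Average per-text precision, recall, and F1 over (predicted, gold) pairs."""
    scores = [precision_recall_f1(predicted, gold) for predicted, gold in pairs]
    count = len(scores)
    return tuple(sum(score[i] for score in scores) / count for i in range(3))
```

Under this scheme the averaged F1 can be lower than the harmonic mean of the averaged precision and recall, which matches the pattern seen in the table.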