Skip to content

A new dataset HarveyNER with fine-grained locations annotated in tweets with strong baseline models using Curriculum Learning.

Notifications You must be signed in to change notification settings

brickee/HarveyNER

Repository files navigation

HarveyNER

We introduce a new dataset HarveyNER with fine-grained locations annotated in tweets. This dataset presents unique challenges and characterizes many complex and long location mentions in informal descriptions. We built strong baseline models using Curriculum Learning and experimented with different heuristic curricula to better recognize diffcult location mentions. alt HarveyNER focuses on the coordinate-oriented locations so we mainly annotate Point that can be precisely pinned to a map and Area that occupies a small polygon of a map. Considering that some disasters can affect line-like objects (e.g., a food can affect the neighbors of a whole river), we also include Road and River types.

  • Points: denote an exact location that a geocoordinate can be assigned. E.g., a uniquely named building, intersections of roads or rivers.
  • Areas: denote geographical entities such as city subdivisions, neighborhoods, etc.
  • Roads: denote a road or a section of a road.
  • Rivers: denote a river or a section of a river.

Statistics

Data Split Train Valid Test Total
All Tweets 3,967 1,301 1,303 6,571
Tweet w/ Entity 1,087 366 353 1,806
Tweet w/o Entity 2,880 935 950 4,765
All Entity Type 1,581 523 500 2,604
Point 591 206 202 999
Area 715 236 212 1,163
Road 158 51 57 266
River 117 30 29 176

Dataset

Please use the latest version in the data directory

Requirement

Please see requirement. You can ceate a conda environment using the bert_ner.yaml file:

$ conda env create -f bert_ner.yml

Run

$ python run_ner_loc.py --data_dir=data/tweets --bert_model=bert-base-uncased --task_name=ner --max_seq_length=48 --num_train_epochs=50 --learning_rate=5e-5 --bert_lr=5e-5 --train_batch_size=32 --eval_batch_size=32 --do_train --do_eval --do_predict --seed=42  --do_lower_case --warmup_proportion=0.1 --curriculum=commonness --netural --complexity_lambda=0.6 --maximum_lambda=1 --anti

Citation

If you extend or use this dataset, please cite the paper where it was introduced.

@inproceedings{chen-etal-2022-crossroads,
    title = "Crossroads, Buildings and Neighborhoods: A Dataset for Fine-grained Location Recognition",
    author = "Chen, Pei  and Xu, Haotian  and Zhang, Cheng  and Huang, Ruihong",
    booktitle = "NAACL",
    year = "2022",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2022.naacl-main.243",
}

About

A new dataset HarveyNER with fine-grained locations annotated in tweets with strong baseline models using Curriculum Learning.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages