Code and resources for the Author Name Disambiguation (AND) tool developed in the Scalable Author Disambiguation (SCAD) project.
Currently under construction!
$ mkdir scad
$ cd scad
$ git clone https://github.com/nlpAThits/scad-tool.git
$ cd scad-tool
$ conda create --name scad-env python=3.6
$ source activate scad-env
$ pip install -r scad-requirements-wo-wombat.txt
$ git clone https://github.com/nlpAThits/WOMBAT.git
$ pip install WOMBAT/.
$ git clone https://github.com/conll/reference-coreference-scorers.git
$ unzip 'resources/wombat/*.zip' -d resources/wombat/
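Start the server in the background on localhost, port 50001:
$ python scad-server/app.py localhost 50001 &
To stop the server, kill its process, replacing PID with the server's process ID:
$ kill PID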
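Because the server is started in the background, it can be useful to confirm that it is accepting connections before running a client. The following is a minimal sketch using only the Python standard library. The helper name `wait_for_server` and its timeout values are illustrative assumptions and not part of the SCAD code base; only the host and port are taken from the start command.

```python
import socket
import time


def wait_for_server(host, port, timeout=10.0, interval=0.2):
    """Return True once a TCP connection to host:port succeeds,
    or False if none succeeds within `timeout` seconds.

    Hypothetical helper for illustration; not part of the SCAD tool.
    """
    deadline = time.monotonic() + timeout
    while True:
        try:
            # create_connection raises OSError if nothing is listening.
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval)


if __name__ == "__main__":
    # Host and port as used in the server start command.
    if wait_for_server("localhost", 50001):
        print("SCAD server is accepting connections")
    else:
        print("SCAD server did not come up in time")
```

This only checks that the port is open; it makes no assumption about the server's API endpoints.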
This project includes a simple Python client that reads publication records from a JSON file and disambiguates them by making API calls against the SCAD server. The following command processes the publications belonging to the block "a smith" from the KISTI corpus, using the semantic matching method avg_of_cos with a dblp-trained word2vec resource (cf. below):
$ python scad-client/run_simple_scad_client.py \
--scad_url http://localhost:50001 \
--pubfile publicationdata/full-kisti-plain-sng-sorted.json \
--blocking_pattern "'name': '(a[^\']* smith)'" \
--name_matching_method match:shortname \
--paramfile resources/scad_params.json \
--resourcefile resources/scad_resources.json \
--evaluate
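The value passed to --blocking_pattern above is a regular expression whose capture group selects the names that belong to the block. The sketch below shows which names that pattern accepts. It assumes the pattern is evaluated with Python's `re` module against a string in which each record contains a fragment of the form `'name': '...'`; the sample records and the helper `block_name` are hypothetical and are not taken from the KISTI data or the client code.

```python
import re

# Pattern exactly as passed on the command line. Inside the shell's double
# quotes the sequence \' is passed through unchanged, and Python's re module
# treats \' as a literal single quote.
BLOCKING_PATTERN = r"'name': '(a[^\']* smith)'"


def block_name(record_text):
    """Return the captured author name if the record falls into the block,
    otherwise None. Hypothetical helper for illustration only."""
    match = re.search(BLOCKING_PATTERN, record_text)
    return match.group(1) if match else None


if __name__ == "__main__":
    # Made-up record strings, not real corpus entries.
    samples = [
        "{'name': 'a smith', 'title': 'x'}",
        "{'name': 'alan smith', 'title': 'y'}",
        "{'name': 'b smith', 'title': 'z'}",
        "{'name': 'a smithson', 'title': 'w'}",
    ]
    for text in samples:
        print(text, "->", block_name(text))
```

Under these assumptions, the pattern accepts any name that starts with "a" and ends with the surname "smith", and rejects other initials and longer surnames such as "smithson".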
The matching methods to use are specified in resources/scad_params.json.
Visualized example results can be found at https://nlpathits.github.io/scad-tool/ (Use 'Open in new tab/window')