geonameID type is now Option[String] #689
Thanks @EgoLaparra. That was fast. Do I need to bump up the version number on geonorm in build.sbt to get something other than numbers then?
No, there is no need to update the version. The id returned by geonorm is a String, and the conversion into Int was made in eidos.
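To illustrate the point above, here is a minimal sketch of why converting the String id to an Int on the eidos side would lose non-numeric ids. The object and method names are hypothetical and are not taken from eidos or geonorm, and "ETH.1.2.3_1" is a made-up id that only follows the ETH.X.X.X_X shape mentioned later in this thread.

```scala
import scala.util.Try

// Hypothetical helper, not the actual eidos code.
object GeoIdExample {
  // Old behaviour (sketch): force the id into an Int.
  // Any id that is not purely numeric is dropped.
  def toIntId(id: String): Option[Int] = Try(id.toInt).toOption

  // New behaviour (sketch): keep the id exactly as geonorm returned it.
  def toStringId(id: String): Option[String] = Option(id).filter(_.nonEmpty)
}
```

With this sketch, a numeric GeoNames id survives both conversions, while a GADM-style id survives only the `Option[String]` version.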
Am I not hoping to get some strings from previously unidentifiable locations?
Sorry. They are probably in the new index with the new name.
Merging shortly then.
@bethard, @EgoLaparra, in eidos.conf, geoNamesIndexURL is "http://clulab.cs.arizona.edu/models/geonames-index.zip". Is this still the file that needs to be downloaded? In GeoNormFinder, it will be stored as "geonames+woredas.zip", but the name change is worrisome. Should it have been changed in eidos.conf as well, or was the online version updated instead, if that was necessary?
Both files are available on the web, so I'm assuming that the answer is yes.
Sorry, I forgot to change eidos.conf. Yes, it should point to "geonames+woredas.zip".
I updated and am comparing the old output to the new output, hoping to see something other than a number in the geoID. That hasn't happened yet, but I am curious about these differences:
They seem a little worrisome.
In the 1532 documents just processed there was no geoId that was a string. For the examples above, after some neural network finds that node x is the best match, how is the text of the node compared to the document text?
For the table of differences, everything but Addis Ababa was a bad location anyway. For Addis Ababa, the change is from https://www.geonames.org/344979/addis-ababa.html to https://www.geonames.org/444178/adis-abeba-astedader.html. I think either of those would be acceptable; it depends on whether we think they mean the city itself or the metropolitan area, I believe. These kinds of changes are to be expected any time the content of the index changes. GeoNorm looks up the text found by the neural location finder in a Lucene index, and then uses a linear classifier to rerank the results of that Lucene lookup. Both Lucene and the classifier may give different results if the index changes. Getting no geo IDs of the form ETH.X.X.X_X is a little surprising, though it's possible that the named entity tagger isn't finding any of the woreda names that weren't already in GeoNames. If you wanted to look further into it, you could see if any of the woreda names here are being found as locations: http://clulab.cs.arizona.edu/models/gadm_woredas.txt
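One way to check the output for ids of the ETH.X.X.X_X form is sketched below. This is a rough illustration only: the object name, the `Kind` hierarchy, and both regular expressions are assumptions based on the id shapes mentioned in this thread (purely numeric GeoNames ids, and GADM ids such as the made-up "ETH.1.2.3_1"), not on any documented geonorm or eidos API.

```scala
// Hypothetical helper for inspecting geo ids in processed output.
object GeoIdKind {
  sealed trait Kind
  case object GeoNames extends Kind
  case object Gadm extends Kind
  case object Unknown extends Kind

  // Assumed shapes: digits only for GeoNames, and a three-letter country
  // code followed by dotted numbers and "_<number>" for GADM.
  private val geoNamesPattern = """\d+""".r
  private val gadmPattern = """[A-Z]{3}(?:\.\d+)+_\d+""".r

  // A Regex used as an extractor must match the whole string.
  def classify(id: String): Kind = id match {
    case geoNamesPattern() => GeoNames
    case gadmPattern() => Gadm
    case _ => Unknown
  }

  // Count how many ids in a collection look like GADM ids.
  def countGadm(ids: Seq[String]): Int = ids.count(classify(_) == Gadm)
}
```

Running `GeoIdKind.countGadm` over the geo ids collected from a batch of documents would give a quick indication of whether any woreda entries are being matched at all.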
Thanks. Analemmo did it in the webapp. Maybe I grepped before all the files were complete, but two are now in the output:
How can we avoid the bad locations?
The bad locations are a result of the neural location finder making mistakes. There's no simple way to fix that. Do you have any idea of how frequent they are?
I'm emailing some files with the collected geographic texts, some 9312 found across 1532 documents. Perhaps there will be some useful patterns.
Changes to allow GADM ID as described in #688