geonameID type is now Option[String] #689

Merged 1 commit on Oct 11, 2019

@EgoLaparra
Contributor

Changes to allow GADM ID as described in #688

  • The type of geonameID of GeoPhraseID has been changed to Option[String].
  • The geoNamesIndex has been changed to geonames+woredas.zip.
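
To illustrate why the type change matters, here is a minimal sketch and not the actual Eidos code. The case class is a hypothetical stand-in for GeoPhraseID (only the Option[String] type of geonameID comes from this PR), and fromRaw, legacyIntId and isNumeric are made-up helper names. It shows that a GADM-style ID survives as a string but would be lost by an Int conversion.

```scala
import scala.util.Try

object GeoIdSketch {
  // Hypothetical stand-in for the real GeoPhraseID; only the type of
  // geonameID (Option[String]) is taken from the PR description.
  final case class GeoPhraseID(text: String, geonameID: Option[String])

  // Keep the ID exactly as a string; a null or blank ID becomes None.
  def fromRaw(text: String, rawId: String): GeoPhraseID =
    GeoPhraseID(text, Option(rawId).map(_.trim).filter(_.nonEmpty))

  // What an Int-based representation could hold: numeric GeoNames IDs
  // convert, GADM-style IDs such as "ETH.2.2.2_1" do not.
  def legacyIntId(id: String): Option[Int] = Try(id.toInt).toOption

  // True for purely numeric IDs (the GeoNames style seen in this thread).
  def isNumeric(id: String): Boolean = id.nonEmpty && id.forall(_.isDigit)
}
```

With this sketch, GeoIdSketch.fromRaw("Afdera", "ETH.2.2.2_1") keeps the full ID, while GeoIdSketch.legacyIntId("ETH.2.2.2_1") yields None.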

@kwalcock
Member

Thanks @EgoLaparra. That was fast. Do I need to bump up the version number on geonorm in build.sbt to get something other than numbers then?

@EgoLaparra
Contributor Author

No, there is no need to update the version. The ID returned by geonorm is already a String; the conversion to Int was done in eidos.
@bethard, is that correct?

@kwalcock
Member

Am I not hoping to get some strings from previously unidentifiable locations?

@kwalcock
Member

Sorry. They are probably in the new index with the new name.

@kwalcock
Member

kwalcock left a comment

Merging shortly then.

@kwalcock merged commit bfcabd8 into master on Oct 11, 2019
@kwalcock deleted the egoitz-geoIDsAsStrings branch on October 11, 2019, 00:27

@kwalcock
Member

@bethard, @EgoLaparra, in eidos.conf, geoNamesIndexURL is "http://clulab.cs.arizona.edu/models/geonames-index.zip". Is this still the file that needs to be downloaded? In GeoNormFinder, it will be stored as "geonames+woredas.zip", but the name change is worrisome. Should the URL have been changed in eidos.conf as well, or was the online version updated, if that was necessary?

@kwalcock
Member

Both files are available on the web, so I'm assuming that the answer is yes.

@EgoLaparra
Contributor Author

Sorry, I forgot to change eidos.conf. Yes, it should point to "geonames+woredas.zip".

@kwalcock
Member

I updated and am comparing the old output to new output, hoping to see something other than a number in the geoID. That hasn't happened yet, but I am curious about these differences:

Text          Old geoId   New geoId
2015          7439356     6852983
Black         4050081     5765865
(blank)       10522145    10513141
(blank)       10522145    10513141
mid-May       2157544     6073561
progression   11833383    3596416
Addis Ababa   344979      444178
R2            10529703    10517701

They seem a little worrisome.

@kwalcock
Member

In the 1532 documents just processed there was no geoId that was a string. For the examples above, after some neural network finds that node x is the best match, how is the text of the node compared to the document text?

@bethard
Contributor

bethard commented Oct 14, 2019

For the table of differences, everything but Addis Ababa was a bad location anyway. For Addis Ababa, the change is from https://www.geonames.org/344979/addis-ababa.html to https://www.geonames.org/444178/adis-abeba-astedader.html. I think either of those would be acceptable; the question is whether we think the text means the city itself or the metropolitan area, I believe.

These kinds of changes are to be expected any time the content of the index changes. GeoNorm looks up the text found by the neural location finder in a Lucene index, and then uses a linear classifier to rerank the results of that Lucene lookup. Both Lucene and the classifier may give different results if the index changes.
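
A rough sketch of the lookup-then-rerank idea, to show why a changed index can change the top answer. This is not GeoNorm's implementation: the Candidate class, the feature vectors, the weights and the IDs are all made up for illustration, and the Lucene retrieval step is replaced by a plain list of already retrieved candidates.

```scala
object RerankSketch {
  // Hypothetical candidate: an index entry plus invented numeric features
  // (for example a retrieval score and some prior on the entry).
  final case class Candidate(id: String, name: String, features: Vector[Double])

  // Linear score: dot product of the weight vector and the features.
  def score(weights: Vector[Double], c: Candidate): Double =
    weights.zip(c.features).map { case (w, f) => w * f }.sum

  // Rerank the retrieved candidates, highest score first.
  def rerank(weights: Vector[Double], retrieved: Seq[Candidate]): Seq[Candidate] =
    retrieved.sortBy(c => -score(weights, c))
}
```

If the index gains an entry, the retrieved list changes, and a candidate that was previously absent can outscore the old winner even though the weights are unchanged:

```scala
val weights = Vector(1.0, 0.5)                                      // made-up weights
val a = RerankSketch.Candidate("old-1", "Place A", Vector(0.9, 0.2)) // made-up entry
val b = RerankSketch.Candidate("new-2", "Place B", Vector(0.8, 0.6)) // entry only in the new index
RerankSketch.rerank(weights, Seq(a)).head.id      // old index: "old-1"
RerankSketch.rerank(weights, Seq(a, b)).head.id   // new index: "new-2"
```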

Getting no geo IDs of the form ETH.X.X.X_X is a little surprising, though it's possible that the named entity tagger isn't finding any of the woreda names that weren't already in GeoNames. If you wanted to look further into it, you could see if any of the woreda names here are being found as locations: http://clulab.cs.arizona.edu/models/gadm_woredas.txt
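
One way to run that check, as a sketch under stated assumptions: the regular expression is inferred only from IDs of the form ETH.X.X.X_X and may not cover every GADM ID, the woreda names are assumed to be already loaded into a Set (the layout of gadm_woredas.txt is not described here, so no parsing is shown), and the helper names are hypothetical.

```scala
object GadmCheckSketch {
  // Assumed shape: three capital letters, dot-separated numbers, then "_" and a number.
  private val GadmId = """[A-Z]{3}(\.\d+)+_\d+""".r

  // True when the whole ID matches the assumed GADM shape.
  def isGadmId(id: String): Boolean = GadmId.pattern.matcher(id).matches()

  // Location texts from the output that match a woreda name, ignoring case
  // and surrounding whitespace.
  def foundWoredas(woredaNames: Set[String], locationTexts: Seq[String]): Seq[String] = {
    val normalized = woredaNames.map(_.trim.toLowerCase)
    locationTexts.filter(t => normalized.contains(t.trim.toLowerCase)).distinct
  }
}
```

Filtering the output geoIDs with isGadmId shows whether any string IDs were produced, and foundWoredas shows whether the location finder is picking up woreda names at all.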

@kwalcock
Member

Thanks. Analemmo did it in the webapp. Maybe I grepped before all the files were complete, but two are now in the output:

        "text" : "Aleta Wendo",
        "geoID" : "ETH.10.19.1_1"

        "text" : "Afdera",
        "geoID" : "ETH.2.2.2_1"

How can we avoid the bad locations?

@bethard
Contributor

bethard commented Oct 14, 2019

The bad locations are a result of the neural location finder making mistakes. There's no simple way to fix that. Do you have any idea of how frequent they are?

@kwalcock
Member

I'm emailing some files with the collected geographic texts, some 9312 found across 1532 documents. Perhaps there will be some useful patterns.
