Add default geolocation rules #1744

Draft: wants to merge 24 commits into base: master

joverlee521
Contributor

Description of proposed changes

Add default geolocation rules that can be used with `augur curate apply-geolocation-rules`. This PR only adds the default rules; it does not incorporate them into the `augur curate` command. I plan to do that separately.

The rules were originally copied from ncov-ingest and then sorted and modified to match places to the geographic location rather than diplomatic semantics (as discussed on Slack).

I welcome any suggestions on organization, and any sharp eyes that can flag odd rules that should be removed or modified.

Related issue(s)

Resolves #1488

Checklist

  • Automated checks pass
  • Check if you need to add a changelog message
  • Check if you need to add tests
  • Check if you need to update docs

This commit copies the latest geolocation rules from ncov-ingest¹
which we've been using as our "central" geolocation rules across
pathogen ingest workflows. The subsequent commits will modify the rules
to remove GISAID-specific entries and incorporate the rules as the
default rules for the `augur curate apply-geolocation-rules` command.

¹ <https://github.com/nextstrain/ncov-ingest/blob/71ff771dc83ca5c5d14ea6de70132ea2e52a2ab6/source-data/gisaid_geoLocationRules.tsv>
Sort geolocation rules by the annotated country to make manual curation
easier.

I did this using visidata and the replayable commands are saved at
<https://gist.github.com/joverlee521/6d257c5b5045dd928dd374556baada9a>
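For readers who do not use visidata, the sort can be approximated in a few lines of Python. This is only a sketch, not the commands saved in the gist. It assumes each rule line has two tab-separated columns (raw value, then annotation), that each column has the form `region/country/division/location`, and that lines starting with `#` are comments. The function name and the example rules are hypothetical.

```python
def sort_rules_by_annotated_country(tsv_text: str) -> list[tuple[str, str]]:
    """Return (raw, annotated) pairs sorted by annotated country, then raw value."""
    rules = []
    for line in tsv_text.splitlines():
        # Skip blank lines and comment lines (assumed to start with "#").
        if not line.strip() or line.startswith("#"):
            continue
        parts = line.split("\t")
        if len(parts) < 2:
            continue  # not a complete rule, ignore it in this sketch
        rules.append((parts[0], parts[1]))

    def sort_key(rule: tuple[str, str]) -> tuple[str, str]:
        raw, annotated = rule
        fields = annotated.split("/")
        # The country is the second field of the annotation.
        country = fields[1] if len(fields) > 1 else ""
        return (country, raw)

    return sorted(rules, key=sort_key)


# Hypothetical rules, for illustration only.
EXAMPLE_TSV = "\n".join([
    "# hypothetical rules",
    "Oceania/USA/Guam/*\tOceania/Guam/Guam/*",
    "Europe/France/Mayotte/*\tAfrica/Mayotte/Mayotte/*",
    "North America/USA/Guam/*\tOceania/Guam/Guam/*",
])

for raw, annotated in sort_rules_by_annotated_country(EXAMPLE_TSV):
    print(f"{raw}\t{annotated}")
```

Sorting on `(country, raw)` keeps all rules for one annotated country together, which is what makes manual review easier.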
Add clear demarcation of sections of geolocation rules.
Within each section, the rules are sorted alphabetically by the
raw geolocation values.

Subsequent commits will clean up the general geolocation rules
since they seem to be specific to the ncov-ingest data.
Remove multiple general geolocation rules that are specific to the
ncov-ingest data.

1. Removed the rule matching the date string `24 de Diciembre`.
2. Removed rules that use abbreviations that seem too general to
   match to specific locations.
3. Removed the general rule for "Milwaukee" because there are multiple
   locations named Milwaukee and not all of them are counties.
Part of effort to match places to the geographic location rather than
diplomatic semantics.¹ Similar to the geolocation rules used in
rabies,² match "French Guiana" to its geographic location:

    region = South America
    country = French Guiana
    division = French Guiana

¹ <https://bedfordlab.slack.com/archives/CFEU0GNNS/p1736375470477489>
² <nextstrain/rabies#22>
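To make the effect of such a rule concrete, here is a minimal sketch of applying one rule to one record. It is not the `augur curate` implementation. It assumes a rule is a raw `region/country/division/location` value and an annotation separated by a tab, that `*` on the raw side matches any value, and that `*` on the annotated side keeps the record's original value. The rule line and the record are hypothetical, and matching here is exact and case-sensitive as a simplification.

```python
def apply_rule(record: tuple[str, ...], rule_line: str) -> tuple[str, ...]:
    """Apply a single geolocation rule to a (region, country, division, location) record."""
    raw, annotated = rule_line.split("\t")
    raw_fields = raw.split("/")
    new_fields = annotated.split("/")

    # A raw field of "*" matches any value (assumed wildcard semantics).
    matches = all(r == "*" or r == value for r, value in zip(raw_fields, record))
    if not matches:
        return record

    # An annotated field of "*" keeps the record's original value (assumption).
    return tuple(value if new == "*" else new for new, value in zip(new_fields, record))


# Hypothetical rule moving a record to its geographic location.
RULE = "Europe/France/French Guiana/*\tSouth America/French Guiana/French Guiana/*"

print(apply_rule(("Europe", "France", "French Guiana", "Cayenne"), RULE))
# → ('South America', 'French Guiana', 'French Guiana', 'Cayenne')
```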
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Réunion" to its geographic location:

    region = Africa
    country = Réunion
    division = Réunion
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Guadeloupe" to its geographic location:

    region = North America
    country = Guadeloupe
    division = Guadeloupe
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Martinique" to its geographic location:

    region = North America
    country = Martinique
    division = Martinique
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Mayotte" to its geographic location:

    region = Africa
    country = Mayotte
    division = Mayotte
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "New Caledonia" to its geographic location:

    region = Oceania
    country = New Caledonia
    division = New Caledonia
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Wallis and Futuna" to its geographic location:

    region = Oceania
    country = Wallis and Futuna
    division = Wallis and Futuna

Also consolidates "Wallis and Futuna" and "Wallis and Futuna Islands".
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Canary Islands" to its geographic location:

    region = Africa
    country = Canary Islands
    division = Canary Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Sint Eustatius" to its geographic location:

    region = North America
    country = Sint Eustatius
    division = Sint Eustatius

Also updates location to use the fully spelled out name instead of
the abbreviated "St Eustatius" to match other locations.
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Anguilla" to its geographic location:

    region = North America
    country = Anguilla
    division = Anguilla
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "British Virgin Islands" to its geographic location:

    region = North America
    country = British Virgin Islands
    division = British Virgin Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Cayman Islands" to its geographic location:

    region = North America
    country = Cayman Islands
    division = Cayman Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Montserrat" to its geographic location:

    region = North America
    country = Montserrat
    division = Montserrat
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Turks and Caicos Islands" to its geographic location:

    region = North America
    country = Turks and Caicos Islands
    division = Turks and Caicos Islands
@joverlee521 linked an issue on Feb 4, 2025 that may be closed by this pull request
@joverlee521
Contributor Author

Thanks @kimandrews for flagging the USA territories! Will update rules for those in new commits.

Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "American Samoa" to its geographic location:

    region = Oceania
    country = American Samoa
    division = American Samoa
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Guam" to its geographic location:

    region = Oceania
    country = Guam
    division = Guam
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Northern Mariana Islands" to its geographic location:

    region = Oceania
    country = Northern Mariana Islands
    division = Northern Mariana Islands
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "Puerto Rico" to its geographic location:

    region = North America
    country = Puerto Rico
    division = Puerto Rico
Similar to previous commit, this is part of effort to match places to
the geographic location rather than diplomatic semantics.
Match "US Virgin Islands" to its geographic location:

    region = North America
    country = US Virgin Islands
    division = US Virgin Islands
Adds a unit test to check the default geolocation rules.
It currently checks for "cyclic" rules, where a rule's annotation also
appears in the raw matching column. This won't catch all cyclic rules
because of wildcard matching, but it's better than nothing.

This already flags some cyclic rules in the file that will be fixed in
subsequent commits.
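The idea behind the check can be sketched as follows. This is an illustration of the approach described above, not the test added in this PR. It assumes rules are available as `(raw, annotated)` string pairs; the function name and the example rules are hypothetical. Because it compares strings exactly, it shares the limitation noted above: a wildcard raw value that would match an annotation is not detected.

```python
def find_cyclic_rules(rules: list[tuple[str, str]]) -> list[tuple[str, str]]:
    """Return rules whose annotation also appears verbatim as a raw value."""
    raw_values = {raw for raw, _ in rules}
    # Exact string comparison only; wildcard matches are not expanded.
    return [(raw, annotated) for raw, annotated in rules if annotated in raw_values]


# Hypothetical rules, for illustration only.
EXAMPLE_RULES = [
    ("Europe/France/Mayotte/*", "Africa/Mayotte/Mayotte/*"),
    ("Africa/Mayotte/Mayotte/*", "Europe/France/Mayotte/*"),
    ("Oceania/USA/Guam/*", "Oceania/Guam/Guam/*"),
]

for raw, annotated in find_cyclic_rules(EXAMPLE_RULES):
    print(f"cyclic: {raw} -> {annotated}")
```

In this example the two Mayotte rules point at each other and are flagged, while the Guam rule is not.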
@joverlee521 marked this pull request as draft on February 5, 2025 01:39
Development

Successfully merging this pull request may close these issues.

Curate standard geolocation rules