🔨 refactor guinea worm steps (#3837)
* 🔨 refactor guinea worm steps

* ✨ remove backfilling for certification status

---------

Co-authored-by: Tuna Acisu <tuna.acisu@ourworldindata.com>
antea04 and Tuna Acisu committed Feb 5, 2025
1 parent 2fb265e commit b29e876
Showing 12 changed files with 175 additions and 100 deletions.
7 changes: 7 additions & 0 deletions dag/archive/fasttrack.yml
Original file line number Diff line number Diff line change
@@ -16,3 +16,10 @@ steps:
- snapshot://fasttrack/2023-05-31/cholera.csv
data://grapher/fasttrack/2023-01-03/long_term_homicide_rates_in_europe:
- snapshot://fasttrack/2023-01-03/long_term_homicide_rates_in_europe.csv
# Guinea worm data
data://grapher/fasttrack/2023-06-16/guinea_worm:
- snapshot://fasttrack/2023-06-16/guinea_worm.csv
data://grapher/fasttrack/2024-06-17/guinea_worm:
- snapshot://fasttrack/2024-06-17/guinea_worm.csv
data://grapher/fasttrack/2023-06-28/guinea_worm:
- snapshot://fasttrack/2023-06-28/guinea_worm.csv
5 changes: 5 additions & 0 deletions dag/archive/health.yml
@@ -29,6 +29,7 @@ steps:
- data://meadow/who/2024-04-26/avian_influenza_ah5n1
data://grapher/who/2024-04-26/avian_influenza_ah5n1:
- data://garden/who/2024-04-26/avian_influenza_ah5n1

# OECD Road Accidents
data://grapher/oecd/2023-08-11/road_accidents:
- data://garden/oecd/2023-08-11/road_accidents
@@ -245,6 +246,10 @@ steps:
data://grapher/who/2022-07-17/who_vaccination:
- data://garden/who/2022-07-17/who_vaccination

# Guinea Worm Eradication Program
data://grapher/who/2023-06-30/guinea_worm:
- data://garden/who/2023-06-29/guinea_worm

# Polio vaccine schedule - to archive
data://meadow/who/2024-04-22/polio_vaccine_schedule:
- snapshot://who/2024-04-22/polio_vaccine_schedule.xlsx
2 changes: 1 addition & 1 deletion dag/archive/main.yml
@@ -318,7 +318,7 @@ steps:
data://grapher/worldbank_wdi/2022-05-26/wdi:
- data://garden/worldbank_wdi/2022-05-26/wdi

# UN SDG - archive
# UN SDG
data://meadow/un/2023-01-24/un_sdg:
- snapshot://un/2023-01-24/un_sdg.feather
data://garden/un/2023-01-24/un_sdg:
7 changes: 0 additions & 7 deletions dag/fasttrack.yml
@@ -58,11 +58,6 @@ steps:
- snapshot-private://fasttrack/latest/lead_paint_regulation_who.csv
data://grapher/fasttrack/latest/whm_treatment_gap_anxiety_disorders:
- snapshot://fasttrack/latest/whm_treatment_gap_anxiety_disorders.csv

data://grapher/fasttrack/2023-06-16/guinea_worm:
- snapshot://fasttrack/2023-06-16/guinea_worm.csv
data://grapher/fasttrack/2023-06-28/guinea_worm:
- snapshot://fasttrack/2023-06-28/guinea_worm.csv
data-private://grapher/fasttrack/latest/fiscal_top1_shares_country_standardized:
- snapshot-private://fasttrack/latest/fiscal_top1_shares_country_standardized.csv
data-private://grapher/fasttrack/latest/pain_hours_hen_systems:
@@ -183,8 +178,6 @@ steps:
# - snapshot-private://fasttrack/latest/draft_joe_gini_diff_1980_2018_take2.csv
# data-private://grapher/fasttrack/latest/draft_joe_top1share_diff_1980_2018:
# - snapshot-private://fasttrack/latest/draft_joe_top1share_diff_1980_2018.csv
data://grapher/fasttrack/2024-06-17/guinea_worm:
- snapshot://fasttrack/2024-06-17/guinea_worm.csv
data://grapher/fasttrack/latest/emissions_energy_heating_cooling_iea:
- snapshot://fasttrack/latest/emissions_energy_heating_cooling_iea.csv
data-private://grapher/fasttrack/latest/voter_turnout_by_age__sheet1:
11 changes: 4 additions & 7 deletions dag/health.yml
@@ -117,16 +117,13 @@ steps:
- data://garden/unicef/2023-06-16/diarrhea

# Guinea worm
data://meadow/who/2023-06-29/guinea_worm:
data://meadow/who/2023-06-29/guinea_worm_certification:
- snapshot://who/2023-06-29/guinea_worm.csv
data://garden/who/2023-06-29/guinea_worm:
- data://meadow/who/2023-06-29/guinea_worm
- data://grapher/fasttrack/2023-06-28/guinea_worm
data://grapher/who/2023-06-30/guinea_worm:
- data://garden/who/2023-06-29/guinea_worm
data://garden/who/2023-06-29/guinea_worm_certification:
- data://meadow/who/2023-06-29/guinea_worm_certification

data://garden/who/2024-06-17/guinea_worm:
- data://garden/who/2023-06-29/guinea_worm
- data://garden/who/2023-06-29/guinea_worm_certification
- snapshot://fasttrack/2024-06-17/guinea_worm.csv
data://grapher/who/2024-06-17/guinea_worm:
- data://garden/who/2024-06-17/guinea_worm
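
Review note: the +/- markers are not visible in this view, so the following is only one reading of the `dag/health.yml` hunk above. It assumes the `guinea_worm_certification` lines and the `snapshot://fasttrack/2024-06-17/guinea_worm.csv` dependency are the additions, and that the old `guinea_worm` meadow/garden/grapher entries are the removals. Under that assumption, the resulting dependency chain can be sketched with the standard-library `graphlib` module (the `dag` dict and `build_order` name are illustrative, not part of the ETL):

```python
from graphlib import TopologicalSorter

# Each step maps to the steps it depends on (assumed post-refactor state).
dag = {
    "data://meadow/who/2023-06-29/guinea_worm_certification": [
        "snapshot://who/2023-06-29/guinea_worm.csv",
    ],
    "data://garden/who/2023-06-29/guinea_worm_certification": [
        "data://meadow/who/2023-06-29/guinea_worm_certification",
    ],
    "data://garden/who/2024-06-17/guinea_worm": [
        "data://garden/who/2023-06-29/guinea_worm_certification",
        "snapshot://fasttrack/2024-06-17/guinea_worm.csv",
    ],
    "data://grapher/who/2024-06-17/guinea_worm": [
        "data://garden/who/2024-06-17/guinea_worm",
    ],
}

# static_order() yields every node after all of its dependencies.
build_order = list(TopologicalSorter(dag).static_order())
for step in build_order:
    print(step)
```

If this reading is right, the 2024-06-17 garden step no longer goes through a fasttrack grapher step and reads the fasttrack snapshot directly.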

This file was deleted.

@@ -43,6 +43,9 @@ all_sources:
published_by: Dracunculiasis Eradication Portal, World Health Organization
date_accessed: '2023-06-28'
url: https://www.who.int/teams/control-of-neglected-tropical-diseases/dracunculiasis/dracunculiasis-eradication-portal



dataset:
title: Guinea worm reported cases and certification (WHO)
licenses:
@@ -52,7 +55,7 @@ dataset:
- *source_certification
- *source_reported_cases
tables:
guinea_worm:
guinea_worm_certification:
variables:
year_certified:
title: Year country is certified free from guinea worm
@@ -66,10 +69,3 @@ tables:
unit: ''
description:
*certification_description
guinea_worm_reported_cases:
title: Reported cases of guinea worm disease in humans
unit: 'reported cases'
description:
*reported_cases_description
display:
numDecimalPlaces: 0
@@ -1,10 +1,10 @@
"""Load a meadow dataset and create a garden dataset."""

from itertools import product
from typing import cast

import pandas as pd
from owid.catalog import Dataset, Table
from owid.catalog import processing as pr
from structlog import get_logger

from etl.data_helpers import geo
@@ -15,6 +15,8 @@
# Get paths and naming conventions for current step.
paths = PathFinder(__file__)

LATEST_YEAR = 2022


def run(dest_dir: str) -> None:
log.info("guinea_worm.start")
@@ -23,32 +25,30 @@ def run(dest_dir: str) -> None:
# Load inputs.
#
# Load meadow dataset.
ds_meadow = cast(Dataset, paths.load_dependency(short_name="guinea_worm", version="2023-06-29", channel="meadow"))
ds_fasttrack = cast(
Dataset, paths.load_dependency(short_name="guinea_worm", version="2023-06-28", channel="grapher")
ds_meadow = cast(
Dataset, paths.load_dependency(short_name="guinea_worm_certification", version="2023-06-29", channel="meadow")
)

# Read table from meadow dataset.
tb = ds_meadow["guinea_worm"]
tb_fasttrack = ds_fasttrack["guinea_worm"].reset_index().astype({"year": int})
tb = ds_meadow["guinea_worm_certification"]

#
# Process data.
#
log.info("guinea_worm.harmonize_countries")
log.info("guinea_worm_certification.harmonize_countries")
tb: Table = geo.harmonize_countries(
df=tb, countries_file=paths.country_mapping_path, excluded_countries_file=paths.excluded_countries_path
df=tb,
countries_file=paths.country_mapping_path,
)
tb = update_with_latest_status(tb)
# Create time-series of certification
tb_time_series = create_time_series(tb)
tb_time_series = update_time_series_with_latest_information(tb_time_series)
# Combine datasets
tb = combine_datasets(tb, tb_time_series, tb_fasttrack)
tb = tb.drop(columns=[col for col in tb.columns if col not in ["country", "year", "year_certified"]])
tb = add_year_certified(tb, tb_time_series)

tb["year_certified"] = tb["year_certified"].astype("str")
# tb = tb.dropna(axis=0, subset=["year_certified", "certification_status", "guinea_worm_reported_cases"], how="all")
tb = add_missing_years(tb)
# Fill na with 0
tb["guinea_worm_reported_cases"] = tb["guinea_worm_reported_cases"].fillna(0)
tb = tb.set_index(["country", "year"])
#
# Save outputs.
@@ -62,19 +62,6 @@ def run(dest_dir: str) -> None:
log.info("guinea_worm.end")


def add_missing_years(df: Table) -> Table:
"""
Add full spectrum of year-country combinations to fast-track dataset so we have zeros where there is missing data
"""
years = df["year"].drop_duplicates().to_list()
countries = df["country"].drop_duplicates().to_list()
comb_df = pd.DataFrame(list(product(countries, years)), columns=["country", "year"])

df = Table(pd.merge(df, comb_df, on=["country", "year"], how="outer"), short_name=paths.short_name)

return df


def update_with_latest_status(df: Table) -> Table:
"""
Update with latest information as dataset only runs up to 2017
@@ -89,15 +76,15 @@ def update_with_latest_status(df: Table) -> Table:
return df


def create_time_series(df: Table) -> Table:
def create_time_series(df: Table):
"""
Pivoting the table so that we can have a time-series of the guinea worm status and how it has changed over time
"""
df_time = df.iloc[:, 0:24].drop(df.columns[[1]], axis=1)

df_time.columns = df_time.columns.str.replace("_", "")
years = df_time.drop("country", axis=1).columns.values
df_piv = pd.melt(df_time, id_vars="country", value_vars=years)
df_piv = pr.melt(df_time, id_vars="country", value_vars=years)
df_piv = df_piv.replace(
{
"value": {
@@ -114,34 +101,52 @@ def create_time_series(df: Table) -> Table:
return df_piv


def update_time_series_with_latest_information(df: Table) -> Table:
def update_time_series_with_latest_information(tb: Table):
"""
For each country we replicate the status as it was in 2017 and then adjust the countries where this status has changed
"""
df["year"] = df["year"].astype("int")
years_to_add = [2018, 2019, 2020, 2021, 2022]
tb["year"] = tb["year"].astype("int")
years_to_add = list(range(2018, LATEST_YEAR + 1))

year_to_copy = df[df["year"] == 2017].copy()
year_to_copy = tb[tb["year"] == 2017].copy()

for year in years_to_add:
year_to_copy["year"] = year
df = pd.concat([df, year_to_copy], ignore_index=True)
tb = pr.concat([tb, year_to_copy], ignore_index=True)

assert any(df["year"].isin(years_to_add))
df.loc[(df["country"] == "Angola") & (df["year"] >= 2020), "certification_status"] = "Endemic"
df.loc[(df["country"] == "Kenya") & (df["year"] >= 2018), "certification_status"] = "Certified disease free"
df.loc[(df["country"] == "Democratic Republic of Congo") & (df["year"] >= 2022), "certification_status"] = (
assert any(tb["year"].isin(years_to_add))
tb.loc[(tb["country"] == "Angola") & (tb["year"] >= 2020), "certification_status"] = "Endemic"
tb.loc[(tb["country"] == "Kenya") & (tb["year"] >= 2018), "certification_status"] = "Certified disease free"
tb.loc[(tb["country"] == "Democratic Republic of Congo") & (tb["year"] >= 2022), "certification_status"] = (
"Certified disease free"
)

return df


def combine_datasets(tb: Table, tb_time_series: Table, tb_fasttrack: Table) -> Table:
tb["year"] = 2022
tb = tb[["country", "year", "year_certified"]]

tb_combined = pd.merge(tb, tb_time_series, on=["country", "year"], how="outer")
tb_combined = pd.merge(tb_combined, tb_fasttrack, on=["country", "year"], how="outer")
tb_combined = Table(tb_combined, short_name=paths.short_name)
return tb_combined
return tb


def add_year_certified(tb: Table, tb_time_series: Table) -> Table:
tb_time_series["year_certified"] = pd.NA
for cntry in tb_time_series["country"].unique():
year_certified = tb[tb["country"] == cntry]["year_certified"].max()
if year_certified in ["Endemic", "Pre-certification", "Pending surveillance"]:
# set all years to the certification status of that year
tb_time_series.loc[tb_time_series["country"] == cntry, "year_certified"] = tb_time_series.loc[
tb_time_series["country"] == cntry, "certification_status"
]
else:
year_certified = int(year_certified)
# years after certification should have the year of certification
tb_time_series.loc[
(tb_time_series["country"] == cntry) & (tb_time_series["year"] >= year_certified),
"year_certified",
] = year_certified
# years before certification should have respective status of that year
tb_time_series.loc[
(tb_time_series["country"] == cntry) & (tb_time_series["year"] < year_certified),
"year_certified",
] = tb_time_series.loc[
(tb_time_series["country"] == cntry) & (tb_time_series["year"] < year_certified),
"certification_status",
]

return tb_time_series
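
Review note: the combined effect of `update_time_series_with_latest_information` and `add_year_certified` can be hard to follow in diff form. Below is a minimal pure-Python sketch of the same rules on plain dicts rather than `Table` objects. The helper names `extend_status` and `year_certified_series`, the `overrides` parameter, and the toy input are all hypothetical and for illustration only; the real step works on tables with pandas-style indexing.

```python
from typing import Dict, Optional

LATEST_YEAR = 2022  # mirrors the constant introduced in the step
NOT_CERTIFIED = {"Endemic", "Pre-certification", "Pending surveillance"}


def extend_status(
    status_by_year: Dict[int, str],
    overrides: Optional[Dict[int, str]] = None,
) -> Dict[int, str]:
    """Copy the 2017 status forward to LATEST_YEAR, then apply known changes.

    `overrides` maps a start year to the status that holds from that year on.
    """
    extended = dict(status_by_year)
    for year in range(2018, LATEST_YEAR + 1):
        extended[year] = status_by_year[2017]
    for start_year, status in sorted((overrides or {}).items()):
        for year in extended:
            if year >= start_year:
                extended[year] = status
    return extended


def year_certified_series(final_value: str, status_by_year: Dict[int, str]) -> Dict[int, str]:
    """Per-year `year_certified` values for a single country."""
    if final_value in NOT_CERTIFIED:
        # Never certified: every year carries that year's status.
        return dict(status_by_year)
    certified_in = int(final_value)
    # From certification onward: the certification year; before it: the status.
    return {
        year: str(certified_in) if year >= certified_in else status
        for year, status in status_by_year.items()
    }


# Toy input for a made-up country certified in 2018 (not real data).
toy = {2016: "Pre-certification", 2017: "Pre-certification"}
extended = extend_status(toy, overrides={2018: "Certified disease free"})
result = year_certified_series("2018", extended)
print(result)
```

The sketch returns strings throughout because `run` casts `year_certified` with `astype("str")`; whether that cast is applied before or after the equivalent of this logic in the real step is worth double-checking against the file itself.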
64 changes: 64 additions & 0 deletions etl/steps/data/garden/who/2024-06-17/guinea_worm.meta.yml
@@ -0,0 +1,64 @@
definitions:
certification: |-
The current and historical values for the status of Guinea worm disease (Dracunculiasis) as certified by the WHO. To be certified as free of guinea worm disease, a country must have reported zero indigenous cases through active surveillance for at least three consecutive years.
Data regarding certification status is taken from:
https://web.archive.org/web/20211024081702/https://apps.who.int/dracunculiasis/dradata/html/report_Countries_t0.html
This is supplemented with more recent changes to Guinea worm disease certification:
- Angola has had endemic status since 2020: https://www.who.int/news/item/23-09-2020-eradicating-dracunculiasis-human-cases-and-animal-infections-decline-as-angola-becomes-endemic
- Kenya was certified guinea worm free in 2018: https://www.who.int/news/item/21-03-2018-dracunculiasis-eradication-south-sudan-claims-interruption-of-transmission-in-humans
- DRC was certified guinea worm free in 2022: https://www.who.int/news/item/15-12-2022-the-democratic-republic-of-the-congo-certified-free-of-dracunculiasis-transmission-by-who
reported_cases: |-
Reported cases of guinea worm disease (Dracunculiasis) as recorded by WHO.
For Cameroon, Central African Republic, Cote d'Ivoire, Mauritania, Senegal and Yemen data is gathered from:
1986-2017: https://web.archive.org/web/20220208133814/https://apps.who.int/dracunculiasis/dradata/html/report_Countries_i2.html
2018: Table 1a: https://web.archive.org/web/20230629130727/https://apps.who.int/iris/bitstream/handle/10665/324786/WER9420-233-251.pdf?sequence=1&isAllowed=y
2019: Table 1a: https://web.archive.org/web/20230629130619/https://apps.who.int/iris/bitstream/handle/10665/332086/WER9520-209-227-eng-fre.pdf?sequence=1&isAllowed=y
2020: Table 1a: https://web.archive.org/web/20230226162934/https://apps.who.int/iris/bitstream/handle/10665/341529/WER9621-173-194-eng-fre.pdf?sequence=1&isAllowed=y
2021: Table 1a: https://web.archive.org/web/20230226163027/https://apps.who.int/iris/bitstream/handle/10665/354576/WER9721-22-225-247-eng-fre.pdf?sequence=1&isAllowed=y
2022: Table 1a: https://web.archive.org/web/20230629124651/https://apps.who.int/iris/bitstream/handle/10665/367924/WER9820-205-224.pdf?sequence=1&isAllowed=y
For all other countries data is gathered from the following sources:
1980-2020: https://www.who.int/teams/control-of-neglected-tropical-diseases/dracunculiasis/dracunculiasis-eradication-portal
2021: Table 1a: https://web.archive.org/web/20230226163027/https://apps.who.int/iris/bitstream/handle/10665/354576/WER9721-22-225-247-eng-fre.pdf?sequence=1&isAllowed=y
2022: Table 1a: https://web.archive.org/web/20230629124651/https://apps.who.int/iris/bitstream/handle/10665/367924/WER9820-205-224.pdf?sequence=1&isAllowed=y
Global totals are calculated yearly as the sum of the number of reported cases in each country.
tables:
guinea_worm:
variables:
year_certified:
title: Year of certification
description_short: Year country is certified free from guinea worm
unit: ''
display:
numDecimalPlaces: 0
description_from_producer: |-
{definitions.certification}
certification_status:
title: Certification status over time
description_short: |-
Certification status of guinea worm disease over time. A country has to have no cases of guinea worm disease for three years to be "certified disease free".
description_key:
- A country has to have no cases of guinea worm disease for three years while being actively surveilled to be "certified disease free".
- A disease outbreak is endemic in a country if it is consistently present in the country. Countries labeled pre-certification are endemic countries that have reported zero indigenous cases in the calendar year. Countries labeled pending surveillance have not had sufficient testing to determine their status or the number of cases in the country.
unit: ''
description_from_producer: |-
{definitions.certification}
guinea_worm_reported_cases:
title: Reported cases of guinea worm disease in humans
unit: 'reported cases'
description_from_producer: |-
{definitions.reported_cases}
display:
numDecimalPlaces: 0