Address data inconsistencies and absence of versioning or sourcing in gapminder data #577

Closed
dsmedia opened this issue Jul 6, 2024 · 6 comments · Fixed by #580

Comments

@dsmedia
Contributor

dsmedia commented Jul 6, 2024

Gapminder data, from a Swedish non-profit, is a popular part of this repository and truly fascinating to explore using visualization tools in the Vega ecosystem. While working on an Altair example, I discovered what looked like a simple issue in the gapminder.json dataset. When I looked into fixing it with a simple pull request, the right solution turned out to be more complex, so I wanted to lay out my thoughts here for feedback.

The immediate issue I found is that the life expectancy data for North and South Korea appears to have been swapped. For 2005, this repository's dataset shows South Korea's life expectancy as 67.297 years and North Korea's as 78.623 years. This contradicts current Gapminder life expectancy data (v14), which reports approximately the reverse. This raises the question of whether other errors are lurking in the dataset.
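A minimal sketch of the check described above, using only the two 2005 values quoted in this issue. The field names (`country`, `year`, `life_expect`) are assumptions about the layout of gapminder.json, and the helper `life_expect` is hypothetical.

```python
# Records mirror the two 2005 values quoted in this issue. The field names
# ("country", "year", "life_expect") are assumed for illustration.
records = [
    {"country": "South Korea", "year": 2005, "life_expect": 67.297},
    {"country": "North Korea", "year": 2005, "life_expect": 78.623},
]


def life_expect(rows, country, year):
    """Return the life expectancy for one country-year, or None if absent."""
    for row in rows:
        if row["country"] == country and row["year"] == year:
            return row["life_expect"]
    return None


south = life_expect(records, "South Korea", 2005)
north = life_expect(records, "North Korea", 2005)

# Current Gapminder data reports roughly the reverse ordering, so a True
# flag here is the inconsistency that prompted the issue.
looks_swapped = north > south
print(f"South Korea: {south}, North Korea: {north}, looks swapped: {looks_swapped}")
```

In practice the records would be loaded from the repository's JSON file rather than typed inline.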

Resolving this issue is complicated by the absence of sourcing or versioning details for the gapminder data in SOURCES.md. The json file in this repository appears to be based on an older version of the dataset that I could not locate. For instance, Afghanistan's 1955 life expectancy is 30.332 years in the vega-datasets json, which aligns closely with Gapminder's v11 data (32.48 years), but differs from the current v14 (43.88 years).
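As a rough illustration of the version comparison above, the sketch below picks the Gapminder version whose Afghanistan 1955 value lies closest to the one in vega-datasets. Only the three figures quoted in this issue are used; the variable names are hypothetical.

```python
# Afghanistan's 1955 life expectancy, as quoted in this issue.
vega_value = 30.332
gapminder_versions = {"v11": 32.48, "v14": 43.88}

# Absolute gap between the vega-datasets value and each Gapminder version.
gaps = {
    version: abs(value - vega_value)
    for version, value in gapminder_versions.items()
}

# The version with the smallest gap is the more plausible ancestor,
# although neither matches exactly.
closest = min(gaps, key=gaps.get)
for version, gap in gaps.items():
    print(f"{version}: differs by {gap:.3f} years")
print(f"closest version: {closest}")
```

A single data point is weak evidence, so a real comparison would repeat this over many country-year pairs.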

Given what the vega-datasets README states about versioning, there seem to be a few options for a solution:

  1. Patch release: If the Korea data swap is confirmed as a formatting error, it could potentially be addressed in a patch release. That said, I still haven't been able to locate an older version of a Gapminder file containing data that matches the vega-datasets json.

  2. Minor release: Updating the dataset with current Gapminder figures without changing field names or file names could be done in a minor release. This could address the outdated data issue. But the data could be significantly different (as in the Afghanistan life expectancy data) and some country names may have changed.

  3. Major release: If we need to change field names (e.g., updating regional classification field name "cluster" to align with current Gapminder terminology) or significantly alter file contents, a major release would be necessary.

Regardless of the chosen approach, I propose:

  1. Considering whether to add a disclaimer in the repository about the intended use cases for the data (given that the repository can contain errors, may be out of date, isn't actively maintained, and is meant mainly for demo purposes), and/or encouraging users with non-demonstration use cases to refer back to the original sources rather than rely on the vega-datasets repository.
  2. Considering how best to adhere to appropriate sourcing requirements for datasets, such as attribution. Gapminder's license page lists attribution requirements.
  3. Updating SOURCES.md with detailed sourcing information.
  4. It is also worth considering the handling of the other gapminder file in vega-datasets, gapminder-health-income.csv, which I haven't looked at.
@domoritz
Member

domoritz commented Jul 6, 2024

Let's add a comment. Something to the effect of #111 (comment). We can still update the datasets, but let's at least use a minor version bump so that we don't accidentally break test cases that rely on exact values.

@dsmedia
Contributor Author

dsmedia commented Jul 7, 2024

Thanks. Given the possibility that the Korea data issues were added intentionally for instructional purposes (as noted for other datasets here, here, and here) perhaps we leave the data file as is, and just add a data usage note in the README.txt (and SOURCES.md?).

There's probably a case for placing this note prominently (higher up) in the docs given the acknowledgement here that some are using this repository in unintended ways.

Maybe something like the below? I'd be happy to open a PR for this if you think it would be helpful.

Data Usage Note

These datasets are intended only for instructional and demonstration purposes. Datasets may contain intentional inconsistencies or errors to provide opportunities for data cleaning exercises and to illustrate common data quality issues.

@domoritz
Member

domoritz commented Jul 7, 2024

Let's add a data usage note to the readme only. Yes, please send a pull request.

I think we can still update the Gapminder data to a known version number that we can link to. I think that would be worth a minor version bump, and I'd love it if you could send a pull request that updates the dataset and SOURCES.md accordingly, since right now it's empty.

dsmedia added a commit to dsmedia/vega-datasets that referenced this issue Jul 9, 2024
This pull request addresses part of vega#577 by updating the README file with a data usage note. 

Still to be done (also tracked in vega#577): 
- refresh gapminder.json from gapminder source
- update SOURCES.md with sourcing information for the dataset
@dsmedia
Contributor Author

dsmedia commented Jul 9, 2024

The remaining two tasks will be addressed together in a separate pull request.

domoritz added a commit that referenced this issue Jul 9, 2024
* Updates README.md with data usage note

This pull request addresses part of #577 by updating the README file with a data usage note. 

Still to be done (also tracked in #577): 
- refresh gapminder.json from gapminder source
- update SOURCES.md with sourcing information for the dataset

* Update README.md

---------

Co-authored-by: Dominik Moritz <domoritz@gmail.com>
@dsmedia
Contributor Author

dsmedia commented Jul 10, 2024

Before the minor version bump, I wanted to highlight the significance of some of the revisions made by Gapminder to its demographic data since the last update in this repo nine years ago. I've prepared visualizations for review (seemed appropriate given the audience!) prior to submitting the PR, since the data changes may flow downstream to many existing charts that rely on the dataset. Some of the changes (South Korea and North Korea) appear to fix errors in the series; others reflect new estimates from sources deemed credible by Gapminder, particularly around major world events that have had a sizable impact on life expectancy at birth. There do not appear to be annotated explanations for each of the major revisions in the Gapminder data series. I also figure this may be a helpful exercise for others considering updating any of the vega-datasets series in the future.

The scatter plots below show countries with notable revisions in life expectancy and fertility data, two of the three data series in this repo's gapminder.json. Countries are included if they have at least one year with a "significant" deviation (defined arbitrarily by me as 5+ years for life expectancy, or 0.75+ babies per woman for fertility) between old and revised data. Points represent 5-year intervals from 1955 to 2005.

Of the 63 countries in the series, most have smaller deviations and are not shown here. There are also revisions to population data, but I've not shown them here.
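The selection rule described above could be sketched as follows. This is an illustration rather than the code used for the charts: it assumes deviations are compared as absolute differences, the field names and the function `significant_countries` are hypothetical, and "Exampleland" is a made-up row included only to show a country that falls below the threshold. The Afghanistan values are the ones quoted earlier in this issue.

```python
LIFE_EXPECT_THRESHOLD = 5.0  # years
FERTILITY_THRESHOLD = 0.75   # babies per woman


def significant_countries(old_rows, new_rows, field, threshold):
    """Return countries with at least one year whose old and revised
    values for `field` differ by `threshold` or more."""
    new_by_key = {(r["country"], r["year"]): r[field] for r in new_rows}
    flagged = set()
    for row in old_rows:
        key = (row["country"], row["year"])
        # Country-years missing from the revised data are skipped.
        if key in new_by_key and abs(new_by_key[key] - row[field]) >= threshold:
            flagged.add(row["country"])
    return sorted(flagged)


old_rows = [
    {"country": "Afghanistan", "year": 1955, "life_expect": 30.332},
    {"country": "Exampleland", "year": 1955, "life_expect": 60.0},  # made up
]
new_rows = [
    {"country": "Afghanistan", "year": 1955, "life_expect": 43.88},
    {"country": "Exampleland", "year": 1955, "life_expect": 61.0},  # made up
]

flagged = significant_countries(
    old_rows, new_rows, "life_expect", LIFE_EXPECT_THRESHOLD
)
print(flagged)
```

The same function applies to fertility by passing the fertility field name and `FERTILITY_THRESHOLD`.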

image

image

As a sanity check, I made a quick comparison with the World Bank data for Afghanistan's life expectancy at birth. This data is a closer fit with the revised Gapminder data than with the nine-year-old version we had.

image

Among other changes since the last update:

  • Aruba is no longer in the series. (In the PR I plan to substitute a new country for Aruba to keep the country count the same.)
  • Hong Kong will be renamed "Hong Kong, China" in line with the new Gapminder format.
  • I plan to add 2010 and 2015 data (two extra rows per country).

@domoritz
Member

Thanks for the detailed analysis. I think we should stay true to the original data if possible rather than augmenting or modifying it ourselves. So I would say let's not substitute, but do update the data, names, and rows.
