Address data inconsistencies and absence of versioning or sourcing in gapminder data #577

dsmedia · 2024-07-06T16:46:03Z

Gapminder data, from a Swedish non-profit, is a popular part of this repository and truly fascinating to explore using visualization tools in the Vega ecosystem. While working on an Altair example, I discovered what looked like a simple issue in the gapminder.json dataset. As I looked into fixing it with a quick pull request, the right solution seemed more complex, so I wanted to lay out my thoughts here for feedback.

The immediate issue is that the life expectancy data for North and South Korea appears to have been swapped. For 2005, this repository's dataset shows South Korea's life expectancy as 67.297 years and North Korea's as 78.623 years. This contradicts current Gapminder life expectancy data (v14), which reports approximately the reverse, and it raises the question of whether other errors are lurking in the dataset.
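A minimal sketch of how the suspected swap could be checked, using only the two 2005 values quoted above. The field names `country`, `year`, and `life_expect` are assumptions about the gapminder.json schema, and the helper `life_expect_in` is hypothetical.

```python
# Two records holding the 2005 values quoted above. The field names
# ("country", "year", "life_expect") are assumed, not confirmed here.
records = [
    {"country": "South Korea", "year": 2005, "life_expect": 67.297},
    {"country": "North Korea", "year": 2005, "life_expect": 78.623},
]


def life_expect_in(rows, country, year):
    """Return the life expectancy for one country-year, or None if absent."""
    for row in rows:
        if row["country"] == country and row["year"] == year:
            return row["life_expect"]
    return None


south = life_expect_in(records, "South Korea", 2005)
north = life_expect_in(records, "North Korea", 2005)

# Current Gapminder data (v14) reports roughly the reverse ordering, so a
# South Korean value below the North Korean one is the red flag.
looks_swapped = south < north
print(f"South Korea: {south}, North Korea: {north}, looks swapped: {looks_swapped}")
```

Running the same check against the full file would only require loading it with `json.load` in place of the inline `records` list.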

Resolving this issue is complicated by the absence of sourcing or versioning details for the gapminder data in SOURCES.md. The json file in this repository appears to be based on an older version of the dataset that I could not locate. For instance, Afghanistan's 1955 life expectancy is 30.332 years in the vega-datasets json, which aligns closely with Gapminder's v11 data (32.48 years), but differs from the current v14 (43.88 years).
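As a rough illustration of the comparison above, the sketch below measures how far the repository's Afghanistan 1955 value sits from each quoted Gapminder figure. It uses only the three numbers in this comment; a single data point cannot identify the source version, so this only shows which quoted version is nearer.

```python
repo_value = 30.332  # Afghanistan, 1955, from this repo's gapminder.json
candidates = {"v11": 32.48, "v14": 43.88}  # Gapminder figures quoted above

# Absolute gap, in years, between the repo value and each version's value.
gaps = {
    version: round(abs(value - repo_value), 3)
    for version, value in candidates.items()
}
closest = min(gaps, key=gaps.get)
print(gaps, closest)
```

Even the nearer version differs by about two years, which is consistent with the json being based on some other, older release.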

Given what the vega-datasets README states about versioning, there seem to be a few options for a solution:

Patch release: If the Korea data swap is confirmed as a formatting error, it could potentially be addressed in a patch release. That said, I still haven't been able to locate an older version of a Gapminder file containing data that matches the vega-datasets json.
Minor release: Updating the dataset with current Gapminder figures without changing field names or file names could be done in a minor release. This could address the outdated data issue. But the data could be significantly different (as in the Afghanistan life expectancy data) and some country names may have changed.
Major release: If we need to change field names (e.g., updating regional classification field name "cluster" to align with current Gapminder terminology) or significantly alter file contents, a major release would be necessary.

Regardless of the chosen approach, I propose:

Considering whether to add a disclaimer of some kind in the repository about the intended / appropriate use cases for the data (given the repository can have errors, may be out of date, isn't actively maintained, that it's more for demo purposes) and/or encouraging that non-demonstration use cases refer back to the original sources rather than rely on the vega-datasets repository.
Considering how best to adhere to appropriate sourcing requirements for datasets, such as attribution. Gapminder's license page lists attribution requirements.
Updating SOURCES.md with detailed sourcing information.

It is also worth considering the handling of the other gapminder file in vega-datasets, gapminder-health-income.csv, which I haven't looked at.


domoritz · 2024-07-06T17:10:02Z

Let's add a comment. Something to the effect of #111 (comment). We can still update the datasets, but let's at least use a minor version bump so that we don't accidentally break test cases that rely on exact values.

dsmedia · 2024-07-07T13:48:34Z

Thanks. Given the possibility that the Korea data issues were added intentionally for instructional purposes (as noted for other datasets here, here, and here) perhaps we leave the data file as is, and just add a data usage note in the README.txt (and SOURCES.md?).

There's probably a case for placing this note prominently (higher up) in the docs given the acknowledgement here that some are using this repository in unintended ways.

Maybe something like the below? I'd be happy to open a PR for this if you think it would be helpful.

Data Usage Note

These datasets are intended only for instructional and demonstration purposes. Datasets may contain intentional inconsistencies or errors to provide opportunities for data cleaning exercises and to illustrate common data quality issues.

domoritz · 2024-07-07T18:29:07Z

Let's add a data usage note to the readme only. Yes, please send a pull request.

I think we can still update the Gapminder data to a known version that we can link to. I think that would be worth a minor version bump, and I'd love it if you could send a pull request that updates the dataset and SOURCES.md accordingly, since right now it's empty.

This pull request addresses part of vega#577 by updating the README file with a data usage note. Still to be done (also tracked in vega#577):
- refresh gapminder.json from gapminder source
- update SOURCES.md with sourcing information for the dataset

dsmedia · 2024-07-09T01:51:03Z

update README file with a data usage note (submitted as PR #578, "docs: updates README.md with data usage note")
refresh gapminder.json from gapminder source
update SOURCES.md with sourcing information for the dataset

The remaining two tasks will be addressed together in a separate pull request.

* Updates README.md with data usage note

This pull request addresses part of #577 by updating the README file with a data usage note. Still to be done (also tracked in #577):
- refresh gapminder.json from gapminder source
- update SOURCES.md with sourcing information for the dataset

* Update README.md

Co-authored-by: Dominik Moritz <domoritz@gmail.com>

dsmedia · 2024-07-10T01:48:56Z

Before the minor version bump, I wanted to highlight the significance of some of the revisions made by Gapminder to its demographic data since the last update in this repo nine years ago. I've prepared visualizations for review (seemed appropriate given the audience!) prior to submitting the PR, since the data changes may flow downstream to many existing charts that rely on the dataset. Some of the changes (South Korea and North Korea) appear to fix errors in the series; others reflect new estimates from sources deemed credible by Gapminder, particularly around major world events that have had a sizable impact on life expectancy at birth. There do not appear to be annotated explanations for each of the major revisions in the Gapminder data series. I also figure this may be a helpful exercise for others considering updating any of the vega-datasets series in the future.

The scatter plots below show countries with notable revisions in life expectancy and fertility data, two of the three data series in this repo's gapminder.json. Countries are included if they have at least one year with a "significant" deviation between old and revised data (defined arbitrarily by me as 5+ years for life expectancy, or 0.75+ babies per woman for fertility). Points represent 5-year intervals from 1955 to 2005.
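The inclusion rule above could be expressed roughly as follows. This is a sketch under stated assumptions, not the code used to produce the plots: the function name `significant_revisions` and the field names `country`, `year`, `life_expect`, and `fertility` are assumed, and the thresholds are read as absolute differences.

```python
# Thresholds from the comment above: 5+ years of life expectancy or
# 0.75+ babies per woman counts as a "significant" revision.
THRESHOLDS = {"life_expect": 5.0, "fertility": 0.75}


def significant_revisions(old_rows, new_rows, thresholds=THRESHOLDS):
    """Return sorted countries with at least one country-year whose
    absolute change in any thresholded field meets its threshold."""
    new_by_key = {(r["country"], r["year"]): r for r in new_rows}
    flagged = set()
    for old in old_rows:
        new = new_by_key.get((old["country"], old["year"]))
        if new is None:
            continue  # country-year is missing from the revised data
        for field, limit in thresholds.items():
            if field in old and field in new:
                if abs(new[field] - old[field]) >= limit:
                    flagged.add(old["country"])
    return sorted(flagged)


# Afghanistan 1955 uses the two figures quoted earlier in this thread
# (30.332 in this repo vs. 43.88 in v14). The "Exampleland" rows are made
# up purely to show a country that is not flagged.
old = [
    {"country": "Afghanistan", "year": 1955, "life_expect": 30.332},
    {"country": "Exampleland", "year": 1955, "life_expect": 60.0},
]
new = [
    {"country": "Afghanistan", "year": 1955, "life_expect": 43.88},
    {"country": "Exampleland", "year": 1955, "life_expect": 61.0},
]
print(significant_revisions(old, new))
```

Country-years missing from the revised data are skipped rather than flagged, which matters for cases such as a country being dropped or renamed between versions.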

Out of 63 countries in the series, most have smaller deviations than those shown; those are not shown here. There are also revisions to population data, but I've not shown them here.

As a sanity check, I made a quick comparison with the World Bank data for Afghanistan's life expectancy at birth. That data is a closer fit with the revised Gapminder data than with the nine-year-old version we had.

Among other changes since the last update:

-- Aruba is no longer in the series (In the PR I plan to substitute a new country for Aruba to keep the country count the same)
-- Hong Kong will be renamed "Hong Kong, China" in line with the new Gapminder format.
-- I plan to add 2010 and 2015 data (two extra rows per country)

domoritz · 2024-07-10T18:57:50Z

Thanks for the detailed analysis. I think we should stay true to the original data if possible rather than augmenting or modifying it ourselves. So I would say let's not substitute, but do update the data, names, and rows.

dsmedia mentioned this issue Jul 6, 2024

Include gallery example of point paths on hover from Vega gallery vega/altair#2980

Closed

dsmedia mentioned this issue Jul 9, 2024

docs: updates README.md with data usage note #578

Merged

This was referenced Jul 11, 2024

docs: add gapminder source details to SOURCES.md #579

Closed

feat: update gapminder.json and add source information #580

Merged

domoritz closed this as completed in #580 Jul 16, 2024
