Adding a publicly available example for the WHO dataset #149

Merged
merged 3 commits into from
Feb 24, 2023

Conversation

rosecers
Collaborator

@rosecers rosecers commented Dec 8, 2022

A publicly available version of the examples used in the paper text.

@agoscinski
Collaborator

I think most test errors will go away once this branch is rebased.

I don't think we can upload the dataset to our repo. There is no license information on Kaggle that would allow redistribution: https://academia.stackexchange.com/a/63157

I will contact the person first, but if that does not work, I think we have to download the dataset from Kaggle within the notebook to be on the safe side. Since there is no direct download link, I think we have to use the Kaggle API, make it a dependency, and use

kaggle datasets download -d kumarajarshi/life-expectancy-who

with some account information, which we would also need to include here. Not ideal; I would prefer that the owner simply allow us to upload it.
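As a rough illustration of what the in-notebook download could look like, here is a minimal Python sketch that wraps the command quoted above. It is not the code from this PR. The function names and the `data` target directory are hypothetical, and the sketch assumes the credentials are supplied through the `KAGGLE_USERNAME` and `KAGGLE_KEY` environment variables that the Kaggle CLI reads.

```python
import os
import subprocess

# Dataset slug quoted in the comment above.
DATASET = "kumarajarshi/life-expectancy-who"


def build_download_command(dataset=DATASET, target_dir="data"):
    """Return the Kaggle CLI call as an argument list.

    `target_dir` is a hypothetical location; `-p` sets the output path
    and `--unzip` extracts the downloaded archive.
    """
    return ["kaggle", "datasets", "download", "-d", dataset,
            "-p", target_dir, "--unzip"]


def download_dataset(dataset=DATASET, target_dir="data"):
    """Run the download, failing early if credentials are missing."""
    # Assumption: credentials come from environment variables
    # (for example CI secrets) rather than from a committed file.
    missing = [name for name in ("KAGGLE_USERNAME", "KAGGLE_KEY")
               if not os.environ.get(name)]
    if missing:
        raise RuntimeError("missing Kaggle credentials: " + ", ".join(missing))
    os.makedirs(target_dir, exist_ok=True)
    subprocess.run(build_download_command(dataset, target_dir), check=True)
```

Keeping the command construction separate from the call that runs it means the command can be inspected without network access or a Kaggle account.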

I think the notebooks are nice as they are. Only a few things:

  • I would remove the savefig at the end.
  • Can we make the notebook WhoDataset-PCovR.ipynb faster? The "## Train the Different Kernel DR Techniques" section takes far too long.
  • Clear the output.

@agoscinski
Collaborator

Okay, contacting people on Kaggle is only possible with a higher-tier account, and for that you need to fill out your profile, contribute some content, and get upvotes. I think downloading the dataset from Kaggle with some credentials is the easier solution. I made an account with my EPFL address and added the token to the yml for the tests. I think the chance of abuse is very small.

What I did (should be a

  • removed the WHO dataset and added the Kaggle download command to the notebook
  • commented out some computations in the WHO notebooks and inserted the results directly, which drastically reduces computation time
  • removed savefig
  • rebased everything on main

Collaborator

@agoscinski agoscinski left a comment

Let me know if you agree with the changes, @rosecers. I am okay with squash-merging it.

@ceriottm
Collaborator

ceriottm commented Jan 8, 2023

Heya. Isn't this data taken from somewhere? Must be. If that's the case, we can also fetch it from the original source so we make it available in a more open format.

@agoscinski
Collaborator

agoscinski commented Jan 8, 2023

Yes, the specific dataset is taken from Kaggle: https://www.kaggle.com/datasets/kumarajarshi/life-expectancy-who
You need an account to access the dataset. This PR added credentials to the yml files so that it can be downloaded with an account I made specifically for this purpose.

It has no license attached, so redistributing this dataset seems problematic to me. But I think the dataset itself is a merge of different WHO datasets, like this one:
https://apps.who.int/gho/data/node.main.688
Maybe it's worth spending the time to merge the different datasets ourselves, since the WHO licence allows redistribution:

The CC BY-NC-SA 3.0 IGO licence allows users to freely copy, reproduce, reprint, distribute, translate and adapt the work for non-commercial purposes, provided WHO is acknowledged as the source using the following suggested citation: [...]

https://www.who.int/data/gho/data/themes/mortality-and-global-health-estimates/ghe-life-expectancy-and-healthy-life-expectancy
EDIT: wrong link, meant https://www.who.int/about/policies/publishing/copyright
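To make the "merge the different datasets ourselves" idea concrete, here is a minimal Python sketch of an outer join of several indicator tables on (country, year). It is only an illustration under stated assumptions: the function name, the indicator names, and the row layout are hypothetical, and the values in the usage example are placeholders, not real WHO figures.

```python
def merge_indicators(tables):
    """Outer-join indicator tables on (country, year).

    `tables` maps an indicator name to rows of (country, year, value).
    Missing values stay None so gaps remain visible after the merge.
    """
    names = list(tables)
    merged = {}
    for name, rows in tables.items():
        for country, year, value in rows:
            # One record per (country, year), with a slot for every indicator.
            record = merged.setdefault((country, year), dict.fromkeys(names))
            record[name] = value
    return merged


# Placeholder values for illustration only, not real WHO figures.
tables = {
    "life_expectancy": [("Country A", 2000, 1.0), ("Country A", 2001, 2.0)],
    "population": [("Country A", 2000, 10.0)],
}
merged = merge_indicators(tables)
```

Keeping missing entries as `None` instead of dropping the row makes it easier to see which country-year pairs are incomplete before any analysis is rerun.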

@ceriottm
Collaborator

ceriottm commented Jan 8, 2023

Uhm. I think that if they used those datasets, then based on the terms of CC BY-NC-SA 3.0 IGO, the Kaggle dataset should also be distributed under CC BY-NC-SA 3.0 IGO (otherwise they're in breach of the -SA provisions).
I'd say that if it takes less than a couple of hours, it might be better to assemble a file from the original WHO data (and distribute it with a clear CC BY-NC-SA 3.0 IGO licence); otherwise I'd read CC BY-NC-SA 3.0 IGO carefully and check whether it implies we can reuse the Kaggle assembly, assuming it's also CC BY-NC-SA 3.0 IGO.

@agoscinski
Collaborator

Ah, right, this should fall under the ShareAlike constraint:

ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

https://creativecommons.org/licenses/by-nc-sa/3.0/igo/

Some licenses allow derived works to be made proprietary, so I was not sure until now. I will check whether all the data is available from WHO, and whether it's much effort to merge it (I guess not). I agree it would be nicer to have it here as a dataset than to download it.

@agoscinski
Collaborator

agoscinski commented Jan 8, 2023

I compared the data from the WHO website with the data from Kaggle, and I can't find any interpretation that makes the two agree with each other (checked for Afghanistan):
https://www.who.int/data/gho/data/themes/topics/indicator-groups/indicator-group-details/GHO/life-expectancy-and-healthy-life-expectancy

I looked into the dataset and it has serious issues. Look at the population of Russia:

Country             Year  Population
Russian Federation  2014  143819666.0
Russian Federation  2013  14356911.0
Russian Federation  2012  14321676.0
Russian Federation  2011  14296868.0
Russian Federation  2010  142849449.0
Russian Federation  2009  142785342.0
Russian Federation  2008  14274235.0
Russian Federation  2007  1428588.0
Russian Federation  2006  14349528.0
Russian Federation  2005  143518523.0
Russian Federation  2004  1446754.0
Russian Federation  2003  144648257.0
Russian Federation  2002  1453646.0
Russian Federation  2001  14597683.0

We also need to redo the analysis once we have made our own dataset.
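The inconsistency in the Population column listed above can also be checked mechanically. The following Python sketch is a heuristic, not the check used in this PR: it flags every year whose value is less than half of the largest value in the series, on the assumption that a country's population does not halve within the covered period. The function name and the `ratio` threshold are hypothetical.

```python
# Population values for the Russian Federation as listed in the comment above.
RUSSIA_POPULATION = {
    2014: 143819666.0, 2013: 14356911.0, 2012: 14321676.0,
    2011: 14296868.0, 2010: 142849449.0, 2009: 142785342.0,
    2008: 14274235.0, 2007: 1428588.0, 2006: 14349528.0,
    2005: 143518523.0, 2004: 1446754.0, 2003: 144648257.0,
    2002: 1453646.0, 2001: 14597683.0,
}


def flag_suspicious_years(series, ratio=0.5):
    """Return the years whose value is below `ratio` times the series maximum.

    Assumption: far smaller values are more likely dropped digits than
    real demographic changes.
    """
    reference = max(series.values())
    return sorted(year for year, value in series.items()
                  if value < ratio * reference)


suspicious = flag_suspicious_years(RUSSIA_POPULATION)
# suspicious == [2001, 2002, 2004, 2006, 2007, 2008, 2011, 2012, 2013]
```

The maximum is used as the reference rather than the median because 9 of the 14 listed values are affected, so the median itself would be one of the implausible entries.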

@rosecers rosecers force-pushed the doc/WHO_Examples branch 2 times, most recently from 7868ee1 to 35fd969 Compare January 18, 2023 23:36
@rosecers
Collaborator Author

@agoscinski I have updated the examples. During my rebase, I noticed a lot of documentation changes from your pushes -- do you want these in this PR?

@agoscinski
Collaborator

The documentation changes should come from the last merged PR. We still need

  • a .rst description for who_dataset.csv
  • to move the dataset to the datasets folder
  • to rename the dataset to something more fitting (e.g. mortality.csv), because it's a mix of the WHO and World Bank datasets

@agoscinski
Collaborator

Also, when I run the notebook, I get different results than in the paper. I just want to be sure everything is correct.
[image: who-selection]

@rosecers rosecers force-pushed the doc/WHO_Examples branch 2 times, most recently from 4fa5cf4 to 79d85ab Compare February 17, 2023 16:39
examples/who_data.rst (outdated review comment, resolved)
@agoscinski agoscinski force-pushed the doc/WHO_Examples branch 3 times, most recently from 052ca93 to c967a65 Compare February 24, 2023 14:11
@agoscinski agoscinski merged commit cb1c07c into main Feb 24, 2023
@agoscinski agoscinski deleted the doc/WHO_Examples branch February 24, 2023 14:45
3 participants