Web Scraping & Data access/storage #19

Open
juan-umana opened this issue Oct 11, 2022 · 1 comment

juan-umana (Member) commented Oct 11, 2022

Hi everyone. We'd like to start a discussion on web scraping and access/storage of data from official websites, in our case from Colombia. Epidemiological data are stored on the SIVIGILA site, which cannot be reached from some countries abroad (e.g., Canada) because the connection times out. This example motivates us to consider some kind of local server/website to store the data (legal issues would need to be addressed), or a service that redirects queries and acts as a VPN. We initially thought of shipping preloaded datasets within the library, but they are too large. What are your thoughts on this idea? And how do you think we can ensure data access for potential users?
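To illustrate the failure mode, here is a minimal sketch in R (the URL below is a placeholder, not necessarily the real SIVIGILA endpoint) that probes the site with a bounded timeout, so the request fails fast instead of hanging:

```r
# Probe the site with a bounded timeout; an unreachable host fails fast
# instead of hanging indefinitely.
library(httr)

sivigila_url <- "https://portalsivigila.ins.gov.co"  # placeholder URL

resp <- tryCatch(
  GET(sivigila_url, timeout(10)),  # give up after 10 seconds
  error = function(e) e
)

if (inherits(resp, "error")) {
  message("SIVIGILA unreachable from here: ", conditionMessage(resp))
} else {
  message("Reachable, HTTP status: ", status_code(resp))
}
```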

Bisaloo (Member) commented Oct 12, 2022

Thanks for opening this issue!

As you mention, this involves a mix of technical and legal issues. Here are the options I see, in order of preference:

  • get in touch with the admins of the SIVIGILA website to see if it would be possible to unblock access from abroad. This would remove any potential legal issue and clearly acknowledge SIVIGILA as the creator and maintainer of these datasets. They might even have an undocumented API you could use to download the data, which would be easier than web scraping.

  • if the legal issues around data redistribution are resolved, the data could be deposited in an open archive such as Zenodo. We actually already have one package linked with Epiverse-TRACE that downloads data from Zenodo: socialmixr (cc @sbfnk). This has the benefit of making the data available to other tools: users might want to download the data directly from Zenodo, or write a Python/Julia/etc. script to fetch it, rather than use your R package. In other words, your work would be useful even to non-R users. Finally, once the data is on Zenodo you don't have to worry about it: it is permanently archived with a persistent identifier (DOI). A minimal download sketch follows this list.

  • if the legal issues around data redistribution are resolved, the data could be stored in an S3-compatible space, such as AWS S3. This is what the rnaturalearth R package does. It is by far the most complex option from a technical point of view, as you need to keep your S3 server up: if it stops working at some point (e.g., funding runs out), the entire package will break and become useless. A second sketch after this list shows the download pattern.
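Here is a minimal sketch of the Zenodo option, assuming the data has been deposited in a Zenodo record; the record ID and file name below are hypothetical placeholders:

```r
# Fetch a file from a (hypothetical) Zenodo record via the Zenodo REST API.
library(jsonlite)

record_id <- "1234567"  # hypothetical record ID, replace with the real deposit
record <- fromJSON(paste0("https://zenodo.org/api/records/", record_id))

# The record metadata lists each deposited file with a direct download link
file_url <- record$files$links$self[1]

dest <- file.path(tempdir(), "sivigila_data.csv")  # hypothetical file name
download.file(file_url, dest, mode = "wb")
dat <- read.csv(dest)
```

Wrapping this in a function with local caching would give package users a single call to fetch the data.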

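And a minimal sketch of the S3 option, again with a hypothetical bucket name and object key; a public S3 object is served over plain HTTPS, so users need no AWS credentials or SDK:

```r
# Fetch a file from a (hypothetical) public S3 bucket over plain HTTPS.
bucket_url <- "https://epiverse-data.s3.amazonaws.com"  # hypothetical bucket
object_key <- "sivigila/weekly_cases.csv"               # hypothetical key

dest <- file.path(tempdir(), basename(object_key))
download.file(paste(bucket_url, object_key, sep = "/"), dest, mode = "wb")
dat <- read.csv(dest)
```

Note that the user-facing code is nearly identical to the Zenodo case; the real difference is who guarantees the hosting stays up.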