Web Scraping & Data access/storage #19

Open
juan-umana opened this issue Oct 11, 2022 · 1 comment

juan-umana (Member) commented Oct 11, 2022

Hi everyone. We'd like to start a discussion on web scraping and access/storage of data from official websites, in our case from Colombia. Epidemiological data are stored on the SIVIGILA site, which cannot be reached from some countries abroad (e.g., Canada) because the connection times out. This example motivates us to consider some kind of local server/website to store the data (legal issues would need to be addressed), or a service that redirects queries and acts as a VPN. We initially thought of shipping preloaded datasets within the library, but they are too large. What are your thoughts on this idea? And how do you think we can ensure data access for potential users?
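To illustrate the failure mode, here is a minimal sketch in R (the URL below is a placeholder, not necessarily the real SIVIGILA endpoint) that probes the site with a bounded timeout, so the request fails fast instead of hanging:

```r
# Probe the site with a bounded timeout; an unreachable host fails fast
# instead of hanging indefinitely.
library(httr)

sivigila_url <- "https://portalsivigila.ins.gov.co"  # placeholder URL

resp <- tryCatch(
  GET(sivigila_url, timeout(10)),  # give up after 10 seconds
  error = function(e) e
)

if (inherits(resp, "error")) {
  message("SIVIGILA unreachable from here: ", conditionMessage(resp))
} else {
  message("Reachable, HTTP status: ", status_code(resp))
}
```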

Bisaloo (Member) commented Oct 12, 2022

Thanks for opening this issue!

As you mention, this involves a mix of technical and legal issues. Here are the options I see, in order of preference:

  • get in touch with the admins of the SIVIGILA website to see if it would be possible to unblock access from abroad. This would remove any potential legal issue and clearly acknowledge SIVIGILA as the creator and maintainer of these datasets. They might even have an undocumented API you could use to download the data, which would be easier than web scraping.

  • if the legal issues around data redistribution are resolved, the data could be deposited in an open archive such as Zenodo. We actually already have one package linked with Epiverse-TRACE that downloads data from Zenodo: socialmixr (cc @sbfnk). This has the benefit of making the data available to other tools: users might want to download the data directly from Zenodo, or write a Python/Julia/etc. script to fetch it, rather than use your R package. In other words, your work would be useful even to non-R users. Finally, once the data is on Zenodo you don't have to worry about it: it is permanently archived with a persistent identifier (DOI). A minimal download sketch follows this list.

  • if the legal issues around data redistribution are resolved, the data could be stored in an S3-compatible space, such as AWS S3. This is what the rnaturalearth R package does. It is by far the most complex option from a technical point of view, as you need to keep your S3 server up: if it stops working at some point (e.g., funding runs out), the entire package will break and become useless. A second sketch after this list shows the download pattern.
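Here is a minimal sketch of the Zenodo option, assuming the data has been deposited in a Zenodo record; the record ID and file name below are hypothetical placeholders:

```r
# Fetch a file from a (hypothetical) Zenodo record via the Zenodo REST API.
library(jsonlite)

record_id <- "1234567"  # hypothetical record ID, replace with the real deposit
record <- fromJSON(paste0("https://zenodo.org/api/records/", record_id))

# The record metadata lists each deposited file with a direct download link
file_url <- record$files$links$self[1]

dest <- file.path(tempdir(), "sivigila_data.csv")  # hypothetical file name
download.file(file_url, dest, mode = "wb")
dat <- read.csv(dest)
```

Wrapping this in a function with local caching would give package users a single call to fetch the data.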

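And a minimal sketch of the S3 option, again with a hypothetical bucket name and object key; a public S3 object is served over plain HTTPS, so users need no AWS credentials or SDK:

```r
# Fetch a file from a (hypothetical) public S3 bucket over plain HTTPS.
bucket_url <- "https://epiverse-data.s3.amazonaws.com"  # hypothetical bucket
object_key <- "sivigila/weekly_cases.csv"               # hypothetical key

dest <- file.path(tempdir(), basename(object_key))
download.file(paste(bucket_url, object_key, sep = "/"), dest, mode = "wb")
dat <- read.csv(dest)
```

Note that the user-facing code is nearly identical to the Zenodo case; the real difference is who guarantees the hosting stays up.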