Daily Refresh of COVID-19 data integration resultset (Public s3 Bucket)

polyglotDataNerd/poly-spark-covid


COVID-19 Public Data

Tracks COVID19 Cases

This project consolidates and standardizes data from two of the major COVID19 data repositories, which are otherwise disparate sources.

Dependencies:

Intention

  • The intention of this repo is to understand and analyze the consolidated data using Spark DataFrames.

Frequency

  • The refresh frequency depends on the frequency of the extract and load pipeline. All data is sourced from s3, from the objects that the data pipeline extracts.

Output

  • Refer to Readme.md
  • The s3 prefix is public, so individual source files can be downloaded along with the orc sources
    • All objects are compressed in GZIP format

    • Download Consolidated Dataset

        -Johns Hopkins
        aws s3 ls s3://poly-testing/covid/jhu  --recursive
        2020-04-01 08:43:11          0 covid/jhu/
        2020-04-02 05:30:58     329761 covid/jhu/UID_ISO_FIPS_LookUp_Table.csv
        2020-04-01 08:43:19          0 covid/jhu/raw/
        2020-04-17 05:13:43     314337 covid/jhu/raw/04-16-2020.csv
        2020-04-17 05:14:48    1223240 covid/jhu/transformed/2020-04-17/jhu_2020-04-17.gz
        
        -Data Scraper
        aws s3 ls s3://poly-testing/covid/cds  --recursive
        2020-04-17 05:14:49     819222 covid/cds/2020-04-17/cds_2020-04-17.gz
        
        -Combined
        aws s3 ls s3://poly-testing/covid/combined  --recursive
        2020-04-17 05:25:51          0 covid/orc/_SUCCESS
        2020-04-17 05:25:49    3834451 covid/orc/covid19_combined.gz
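
The listings above suggest a date-based key layout for the daily JHU and CDS objects. As a minimal sketch, the helpers below build those keys for a given day and decompress a downloaded GZIP object into text lines. The function names (`jhu_key`, `cds_key`, `s3_uri`, `gunzip_lines`) are hypothetical and not part of this repo. Two assumptions are made: that the layout seen for 2020-04-17 holds for other dates, and that the objects are UTF-8 text.

```python
import gzip
from datetime import date
from typing import List

# Bucket name taken from the listings above.
BUCKET = "poly-testing"


def jhu_key(day: date) -> str:
    """Key of the transformed Johns Hopkins object for a day.

    Assumption: the layout seen for 2020-04-17 applies to every date.
    """
    stamp = day.isoformat()  # e.g. "2020-04-17"
    return f"covid/jhu/transformed/{stamp}/jhu_{stamp}.gz"


def cds_key(day: date) -> str:
    """Key of the Data Scraper (cds) object for a day (same assumption)."""
    stamp = day.isoformat()
    return f"covid/cds/{stamp}/cds_{stamp}.gz"


def s3_uri(key: str) -> str:
    """Full s3:// URI for a key in the public bucket."""
    return f"s3://{BUCKET}/{key}"


def gunzip_lines(payload: bytes) -> List[str]:
    """Decompress GZIP bytes and split into lines.

    Assumption: the decompressed content is UTF-8 text.
    """
    return gzip.decompress(payload).decode("utf-8").splitlines()
```

A URI produced by `s3_uri(jhu_key(date(2020, 4, 17)))` can be passed to `aws s3 cp` to download the object. Once the file's bytes are read locally, `gunzip_lines` returns its rows as strings.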
      
