wshawn2020/DataAnalysisPipeline

Setting up a data analysis pipeline with Bash and Python scripts that will: start a PostgreSQL database in a Docker container, create an appropriate table for persisting the data, insert data from a compressed CSV into the created table, and query the database to produce a new CSV with summary data.


What We're Going to Do

You will produce a set of Bash and Python scripts that will:

  • Start a PostgreSQL database in a Docker container
  • Create an appropriate table for persisting the data described below
  • Insert data from the compressed CSV into the created table
  • Query the database to produce a new CSV with summary data

Dependencies

To complete this task, you will need to use:

  • Bash
  • Docker
  • Gzip tools (optional, you could use Python stdlib)
  • Python
    • Conda
    • psycopg2
    • Pandas

Input Data

There is a compressed CSV file in this repository, data.csv.gz, which contains open, high, low, and close prices, as well as total traded shares, for a set of stocks that trade on the ASX exchange; a sketch of one possible table definition and load follows the column list below.

The columns are:

  • date
  • symbol
  • name
  • open_price
  • high_price
  • low_price
  • close_price
  • volume
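
As an illustration only, a table matching these columns might be created and loaded along the following lines. The table name, column types, header handling, and connection settings are assumptions rather than anything taken from this repository's scripts, so check data.csv.gz and docker-compose.yml for the real details.

    # Hypothetical sketch: create a table for the columns above and bulk-insert
    # the rows from the gzipped CSV. All names and credentials are placeholders.
    import csv
    import gzip

    import psycopg2
    from psycopg2.extras import execute_values

    DDL = """
    CREATE TABLE IF NOT EXISTS stock_prices (
        date        DATE    NOT NULL,
        symbol      TEXT    NOT NULL,
        name        TEXT,
        open_price  NUMERIC(12, 4),
        high_price  NUMERIC(12, 4),
        low_price   NUMERIC(12, 4),
        close_price NUMERIC(12, 4),
        volume      BIGINT
    );
    """

    with gzip.open("data.csv.gz", "rt", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)         # assumes the first row is a header
        rows = list(reader)  # values stay as text; PostgreSQL casts them on insert

    with psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                          user="postgres", password="example") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
            execute_values(
                cur,
                "INSERT INTO stock_prices (date, symbol, name, open_price, "
                "high_price, low_price, close_price, volume) VALUES %s",
                rows,
            )

Using the stdlib gzip module here keeps the Gzip tools dependency optional, as noted in the Dependencies section.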

Output Data

The output data is expected to be in CSV format, with the following schema: <symbol>,<mean_change_pct>,<max_high_price>,<min_low_price>,<median_volume>

Where the computed values are (an illustrative pandas sketch follows this list):

  • <mean_change_pct> The mean of the percentage change between <close_price> for a given symbol and the preceding date's value
  • <max_high_price> The maximum <high_price> for a given symbol over all data
  • <min_low_price> The minimum <low_price> for a given symbol over all data
  • <median_volume> The median <volume> for a given symbol over all data
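
As a rough illustration of these definitions only (not this repository's implementation), the same summary could be computed straight from the compressed file with pandas, which infers gzip compression from the .gz extension:

    # Hypothetical sketch: per-symbol summary values computed with pandas.
    import pandas as pd

    df = pd.read_csv("data.csv.gz", parse_dates=["date"])
    df = df.sort_values(["symbol", "date"])

    # Day-over-day percentage change of close_price within each symbol.
    df["change_pct"] = df.groupby("symbol")["close_price"].pct_change() * 100

    summary = (
        df.groupby("symbol")
          .agg(mean_change_pct=("change_pct", "mean"),
               max_high_price=("high_price", "max"),
               min_low_price=("low_price", "min"),
               median_volume=("volume", "median"))
          .reset_index()
    )

The first row of each symbol has no preceding close, so its change is NaN and is simply excluded from the mean here; how you treat that edge case is up to you.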

docker-compose.yml

We have included a docker-compose.yml file which will bootstrap a PostgreSQL instance, along with an Adminer web UI that may be useful for debugging your data.

PostgreSQL will be listening on the default port 5432 and Adminer on port 8080.
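
For example, once the services are up (e.g. with docker compose up -d), a quick connectivity check from Python might look like this; the database name, user, and password shown are placeholders, so use whatever docker-compose.yml actually defines:

    # Hypothetical connectivity check against the Dockerized PostgreSQL.
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5432,
                            dbname="postgres",   # placeholder: see docker-compose.yml
                            user="postgres",     # placeholder: see docker-compose.yml
                            password="example")  # placeholder: see docker-compose.yml
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])
    conn.close()

Adminer itself can be opened in a browser at http://localhost:8080.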

env.yml

We have included an env.yml Conda environment file that you may use to bootstrap your Python dependencies. If you do not already have Conda, you can obtain it at conda.io.
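
Assuming Conda is already installed, the environment described by env.yml can typically be created with conda env create -f env.yml and then activated with conda activate followed by the environment name declared inside env.yml.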

Summary

Key Points

The pipeline is complete: it launches the environment, creates and populates the table in the PostgreSQL database, and exports the CSV file.

There is a Bash script called launch_pipeline.sh which runs the whole pipeline automatically; just enter ./launch_pipeline.sh in a Bash shell from the project's root path.

If you want to make adaptations manually, all the Python scripts are located in the script folder. The entry point (the main function) is in ./script/main.py, and the DATABASE definition is in ./script/database.py.

Once all of the pipeline's steps have finished, the exported CSV file can be found at ./export/report.csv.
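
For reference, the export step could be approached with a summary query of roughly the following shape; this is only a sketch of one possible approach, and the table name, connection settings, and exact SQL are assumptions rather than the contents of ./script/main.py:

    # Hypothetical sketch: compute the per-symbol summary in PostgreSQL and
    # write it to ./export/report.csv. Names and credentials are placeholders.
    import csv
    import os

    import psycopg2

    SUMMARY_SQL = """
    WITH changes AS (
        SELECT symbol,
               100.0 * (close_price - LAG(close_price) OVER w)
                     / NULLIF(LAG(close_price) OVER w, 0) AS change_pct,
               high_price, low_price, volume
        FROM stock_prices
        WINDOW w AS (PARTITION BY symbol ORDER BY date)
    )
    SELECT symbol,
           AVG(change_pct)                                      AS mean_change_pct,
           MAX(high_price)                                      AS max_high_price,
           MIN(low_price)                                       AS min_low_price,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY volume)  AS median_volume
    FROM changes
    GROUP BY symbol
    ORDER BY symbol;
    """

    os.makedirs("./export", exist_ok=True)
    with psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                          user="postgres", password="example") as conn:
        with conn.cursor() as cur:
            cur.execute(SUMMARY_SQL)
            with open("./export/report.csv", "w", newline="") as fh:
                csv.writer(fh).writerows(cur.fetchall())

The sketch writes the rows without a header line, since the Output Data section above only specifies the column order.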

To persist the data in the database, there is a small adaptation in docker-compose.yml: it creates a volume and mounts it to a host path.
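
In practice this kind of change usually amounts to adding a volumes entry under the database service in docker-compose.yml, for example a bind mount from a host directory onto /var/lib/postgresql/data (the data directory of the official postgres image); the exact paths used are whatever docker-compose.yml declares.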

Environment

It is suggested that you make sure the environment dependencies listed below are installed before running the Bash script launch_pipeline.sh.

Dependencies:

  • bash
  • conda
  • docker
