wshawn2020/DataAnalysisPipeline

Setting up a data analysis pipeline with Bash and Python scripts that will: start a PostgreSQL database in a Docker container, create an appropriate table for persisting the data, insert data from a compressed CSV into the created table, and query the database to produce a new CSV with summary data.


What We're Going to Do

You will produce a set of Bash and Python scripts that will:

  • Start a PostgreSQL database in a Docker container
  • Create an appropriate table for persisting the data described below
  • Insert data from the compressed CSV into the created table
  • Query the database to produce a new CSV with summary data

Dependencies

To complete this task, you will need to use:

  • Bash
  • Docker
  • Gzip tools (optional, you could use Python stdlib)
  • Python
    • Conda
    • psycopg2
    • Pandas

Input Data

There is a compressed CSV file in this repository, data.csv.gz, which contains open, high, low, and close prices, as well as total traded shares, for a set of stocks that trade on the ASX exchange; a sketch of one possible table definition and load follows the column list below.

The columns are:

  • date
  • symbol
  • name
  • open_price
  • high_price
  • low_price
  • close_price
  • volume
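
As an illustration only, a table matching these columns might be created and loaded along the following lines. The table name, column types, header handling, and connection settings are assumptions rather than anything taken from this repository's scripts, so check data.csv.gz and docker-compose.yml for the real details.

    # Hypothetical sketch: create a table for the columns above and bulk-insert
    # the rows from the gzipped CSV. All names and credentials are placeholders.
    import csv
    import gzip

    import psycopg2
    from psycopg2.extras import execute_values

    DDL = """
    CREATE TABLE IF NOT EXISTS stock_prices (
        date        DATE    NOT NULL,
        symbol      TEXT    NOT NULL,
        name        TEXT,
        open_price  NUMERIC(12, 4),
        high_price  NUMERIC(12, 4),
        low_price   NUMERIC(12, 4),
        close_price NUMERIC(12, 4),
        volume      BIGINT
    );
    """

    with gzip.open("data.csv.gz", "rt", newline="") as fh:
        reader = csv.reader(fh)
        next(reader)         # assumes the first row is a header
        rows = list(reader)  # values stay as text; PostgreSQL casts them on insert

    with psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                          user="postgres", password="example") as conn:
        with conn.cursor() as cur:
            cur.execute(DDL)
            execute_values(
                cur,
                "INSERT INTO stock_prices (date, symbol, name, open_price, "
                "high_price, low_price, close_price, volume) VALUES %s",
                rows,
            )

Using the stdlib gzip module here keeps the Gzip tools dependency optional, as noted in the Dependencies section.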

Output Data

The output data is expected to be in CSV format, with the following schema: <symbol>,<mean_change_pct>,<max_high_price>,<min_low_price>,<median_volume>

Where the computed values are (an illustrative pandas sketch follows this list):

  • <mean_change_pct> The mean of the percentage change between <close_price> for a given symbol and the preceding date's value
  • <max_high_price> The maximum <high_price> for a given symbol over all data
  • <min_low_price> The minimum <low_price> for a given symbol over all data
  • <median_volume> The median <volume> for a given symbol over all data
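
As a rough illustration of these definitions only (not this repository's implementation), the same summary could be computed straight from the compressed file with pandas, which infers gzip compression from the .gz extension:

    # Hypothetical sketch: per-symbol summary values computed with pandas.
    import pandas as pd

    df = pd.read_csv("data.csv.gz", parse_dates=["date"])
    df = df.sort_values(["symbol", "date"])

    # Day-over-day percentage change of close_price within each symbol.
    df["change_pct"] = df.groupby("symbol")["close_price"].pct_change() * 100

    summary = (
        df.groupby("symbol")
          .agg(mean_change_pct=("change_pct", "mean"),
               max_high_price=("high_price", "max"),
               min_low_price=("low_price", "min"),
               median_volume=("volume", "median"))
          .reset_index()
    )

The first row of each symbol has no preceding close, so its change is NaN and is simply excluded from the mean here; how you treat that edge case is up to you.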

docker-compose.yml

We have included a docker-compose.yml file which will bootstrap a PostgreSQL instance, along with an Adminer web UI that may be useful for debugging your data.

PostgreSQL will be listening on the default port 5432 and Adminer on port 8080.
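
For example, once the services are up (e.g. with docker compose up -d), a quick connectivity check from Python might look like this; the database name, user, and password shown are placeholders, so use whatever docker-compose.yml actually defines:

    # Hypothetical connectivity check against the Dockerized PostgreSQL.
    import psycopg2

    conn = psycopg2.connect(host="localhost", port=5432,
                            dbname="postgres",   # placeholder: see docker-compose.yml
                            user="postgres",     # placeholder: see docker-compose.yml
                            password="example")  # placeholder: see docker-compose.yml
    with conn, conn.cursor() as cur:
        cur.execute("SELECT version();")
        print(cur.fetchone()[0])
    conn.close()

Adminer itself can be opened in a browser at http://localhost:8080.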

env.yml

We have included an env.yml Conda environment file that you may use to bootstrap your Python dependencies. If you do not already have Conda, you can obtain it at conda.io.
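
Assuming Conda is already installed, the environment described by env.yml can typically be created with conda env create -f env.yml and then activated with conda activate followed by the environment name declared inside env.yml.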

Summary

Key Points

The pipeline is complete: it launches the environment, creates and populates the table in the PostgreSQL database, and exports the CSV file.

There is a Bash script called launch_pipeline.sh which runs the whole pipeline automatically; just enter ./launch_pipeline.sh in a Bash shell from the project's root path.

If you want to make adaptations manually, all the Python scripts are located in the script folder. The entry point (the main function) is in ./script/main.py, and the DATABASE definition is in ./script/database.py.

Once all of the pipeline's steps have finished, the exported CSV file can be found at ./export/report.csv.
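
For reference, the export step could be approached with a summary query of roughly the following shape; this is only a sketch of one possible approach, and the table name, connection settings, and exact SQL are assumptions rather than the contents of ./script/main.py:

    # Hypothetical sketch: compute the per-symbol summary in PostgreSQL and
    # write it to ./export/report.csv. Names and credentials are placeholders.
    import csv
    import os

    import psycopg2

    SUMMARY_SQL = """
    WITH changes AS (
        SELECT symbol,
               100.0 * (close_price - LAG(close_price) OVER w)
                     / NULLIF(LAG(close_price) OVER w, 0) AS change_pct,
               high_price, low_price, volume
        FROM stock_prices
        WINDOW w AS (PARTITION BY symbol ORDER BY date)
    )
    SELECT symbol,
           AVG(change_pct)                                      AS mean_change_pct,
           MAX(high_price)                                      AS max_high_price,
           MIN(low_price)                                       AS min_low_price,
           PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY volume)  AS median_volume
    FROM changes
    GROUP BY symbol
    ORDER BY symbol;
    """

    os.makedirs("./export", exist_ok=True)
    with psycopg2.connect(host="localhost", port=5432, dbname="postgres",
                          user="postgres", password="example") as conn:
        with conn.cursor() as cur:
            cur.execute(SUMMARY_SQL)
            with open("./export/report.csv", "w", newline="") as fh:
                csv.writer(fh).writerows(cur.fetchall())

The sketch writes the rows without a header line, since the Output Data section above only specifies the column order.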

To persist the data in the database, there is a small adaptation in docker-compose.yml: it creates a volume and mounts it to a host path.
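
In practice this kind of change usually amounts to adding a volumes entry under the database service in docker-compose.yml, for example a bind mount from a host directory onto /var/lib/postgresql/data (the data directory of the official postgres image); the exact paths used are whatever docker-compose.yml declares.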

Environment

It is suggested that you make sure the environment dependencies listed below are installed before running the Bash script launch_pipeline.sh.

Dependencies:

  • bash
  • conda
  • docker
