kevinmonisit/notebook-pipeline-runner

Notebook Pipeline Runner

  • Runs a pipeline of Jupyter Notebooks
  • Prevents multiple instances of the program from running concurrently
  • Handles logging and errors
  • Emails the user when the pipeline is complete or if an error occurs

Requirements

Make sure you have pip3 and python3 installed. If you don't, run

sudo apt update

and then

sudo apt-get install python3-pip

Then, install the requirements by running

pip3 install -r requirements.txt

Then type

python3 main.py

Requirements.txt * IMPORTANT *

requirements.txt is the file that lists all the dependencies for the program. Right now, it contains only the dependencies needed to run the dummy-test notebooks.

If your notebooks use any other external library, you must add it to requirements.txt. If you don't, the pipeline WILL fail with an error mid-run.

After adding the libraries, run pip3 install -r requirements.txt to install these dependencies into your environment.
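One way to check which libraries the notebooks actually import is to scan their code cells. The sketch below is not part of this repository; the function names are hypothetical, and it assumes the notebooks are standard .ipynb JSON files. It reports import names, which do not always match the package name on PyPI (for example, sklearn is installed as scikit-learn), so treat the output as a checklist rather than a ready-made requirements.txt.

```python
import ast
import json
import sys


def imported_modules(notebook_path):
    """Return the top-level module names imported by a notebook's code cells."""
    with open(notebook_path, encoding="utf-8") as f:
        notebook = json.load(f)

    modules = set()
    for cell in notebook.get("cells", []):
        if cell.get("cell_type") != "code":
            continue
        source = cell.get("source", "")
        if isinstance(source, list):  # .ipynb usually stores source as a list of lines
            source = "".join(source)
        # Drop IPython magics and shell escapes, which are not valid Python.
        lines = [ln for ln in source.splitlines()
                 if not ln.lstrip().startswith(("%", "!"))]
        try:
            tree = ast.parse("\n".join(lines))
        except SyntaxError:
            continue  # skip cells that still do not parse
        for node in ast.walk(tree):
            if isinstance(node, ast.Import):
                modules.update(alias.name.split(".")[0] for alias in node.names)
            elif isinstance(node, ast.ImportFrom) and node.module and node.level == 0:
                modules.add(node.module.split(".")[0])
    return modules


def third_party_modules(notebook_path):
    """Filter out standard-library modules (needs Python 3.10+ to filter)."""
    stdlib = getattr(sys, "stdlib_module_names", frozenset())
    return sorted(imported_modules(notebook_path) - set(stdlib))
```

Calling third_party_modules('./notebooks/notebook.ipynb') returns a sorted list of names to compare against requirements.txt by hand.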

Setup

On the server (DigitalOcean, etc.) where you want to run the pipeline, run the commands

git clone git@github.com:kevinmonisit/notebook-pipeline-runner.git
cd notebook-pipeline-runner

  1. Make sure you have Python 3 and that the requirements are installed via pip3 install -r requirements.txt.

  2. Go to SendGrid.com and create an account. At the beginning of account registration, before the account is created, you have to create a verified sender and verify the email address you wish to send from. Set the Sender Email to the address you wish to send alerts from. (Note: You cannot send emails to yourself with SendGrid.com)

  3. Once your account is created, you should see a dashboard.

  4. On the right, there is a "Settings" button. Click on that.

  5. Click on API KEYS.

  6. On the top right, click Create API Key.

  7. Set the API Key Permissions to Restricted Access, and then give the API Key permission for "Mail Send".

  8. Make sure to copy the API key.

  9. Open the file test.env and set SENDGRID_API_KEY='<API KEY>', replacing <API KEY> with the key you copied. Make sure the single quotes are there; add them if they are missing.

  10. Rename the file test.env to .env.

  11. Fill out the .env, replacing SENDGRID_FROM_EMAIL and SENDGRID_TO_EMAIL with the emails you choose. SENDGRID_FROM_EMAIL should be the verified sender email you set up in your SendGrid.com account.

  12. After setting up the .env file, you can run the command python3 main.py from the project directory.

  13. An email will be sent to SENDGRID_TO_EMAIL stating that pipeline initialization has started.

Note: SendGrid API allows for 100 free emails per day.
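Before the first run, it can help to confirm that the .env file is filled in. The following is a minimal sketch using only the standard library; it is not how the project itself reads .env (that is not shown here), and the helper names are hypothetical. It assumes simple KEY='value' lines and treats a value that still starts with < (such as '<API KEY>') as unset.

```python
REQUIRED_KEYS = ("SENDGRID_API_KEY", "SENDGRID_FROM_EMAIL", "SENDGRID_TO_EMAIL")


def parse_env(text):
    """Parse KEY=value lines, ignoring blanks and # comments, stripping quotes."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        values[key.strip()] = value.strip().strip("'\"")
    return values


def missing_keys(text, required=REQUIRED_KEYS):
    """Return required keys that are absent, empty, or still a placeholder."""
    values = parse_env(text)
    return [key for key in required
            if not values.get(key) or values[key].startswith("<")]
```

For example, missing_keys(open('.env').read()) returns an empty list when all three settings have been filled in.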

To bypass confirmation

Bypass the confirmation prompt by typing

python3 main.py --bypass-confirm
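For readers curious how such a flag is commonly wired up, here is a hedged sketch using argparse. Only the flag name --bypass-confirm comes from this README; the function names, the prompt text, and the accepted answers are assumptions, and main.py may implement this differently.

```python
import argparse


def build_parser():
    """Build a parser that understands the --bypass-confirm flag."""
    parser = argparse.ArgumentParser(description="Run the notebook pipeline.")
    parser.add_argument("--bypass-confirm", action="store_true",
                        help="skip the interactive confirmation prompt")
    return parser


def should_run(args, ask=input):
    """Return True if the pipeline should start."""
    if args.bypass_confirm:
        return True  # no prompt, e.g. when started from cron
    answer = ask("Run the pipeline? [y/N] ")  # hypothetical prompt text
    return answer.strip().lower() in ("y", "yes")
```

Skipping the prompt matters for unattended runs, since a scheduled job has no terminal to answer it.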

Modifying the Pipeline

The process:

[ your personal computer ] -- [transferring requirements.txt and notebooks] --> [ server ]
  1. Before you run the pipeline on the server, go to the environment (personal computer, etc.) where you usually run the notebooks and type python3 -m pip freeze > requirements.txt. This creates a requirements.txt listing every package installed in that environment, which covers the dependencies needed to run the pipeline normally.

  2. Then, replace the requirements.txt file in the project directory with the one you just created.

  3. On the server, run pip3 install -r requirements.txt to install the dependencies into its environment.

  4. Type python3 main.py --bypass-confirm to run the dummy pipeline and make sure everything works. Once that is verified, you can modify the pipeline to your liking.

  5. To modify the pipeline, edit the main.py file. Its main function contains a list called notebooks, which holds the path to each notebook in the order they will be run. Modify this list to your liking.

notebooks = ['./notebooks/notebook.ipynb',
             './notebooks/notebook2.ipynb',
             './notebooks/notebook4.ipynb',
             './notebooks/notebookERROR.ipynb',
             './notebooks/notebook4.ipynb'
             ]
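To illustrate what running the list in order means, here is a rough sketch of a sequential runner. It is an assumption, not the code in main.py: it uses the jupyter nbconvert command line to execute each notebook and assumes the pipeline stops at the first failure. The names execute_notebook and run_pipeline are hypothetical.

```python
import subprocess


def execute_notebook(path):
    """Execute one notebook in place; raises CalledProcessError on failure.

    Assumption: relies on the `jupyter nbconvert` CLI, which may differ
    from what main.py actually uses.
    """
    subprocess.run(
        ["jupyter", "nbconvert", "--to", "notebook", "--execute", "--inplace", path],
        check=True,
    )


def run_pipeline(notebooks, run_one=execute_notebook):
    """Run notebooks in list order and stop at the first one that fails.

    Returns (completed_paths, failed_path_or_None).
    """
    completed = []
    for path in notebooks:
        try:
            run_one(path)
        except Exception:
            return completed, path  # later notebooks are not attempted
        completed.append(path)
    return completed, None
```

Under these assumptions, the dummy list above would run its first three notebooks, fail on notebookERROR.ipynb, and never reach the final entry.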

To Run as CRON Job

In the terminal, run

pwd to get the path to the directory containing the main.py file.

Then type

crontab -e

This will open the crontab file in your default text editor. It will most likely be Vim. Press i to enter insert mode, and then add the following line to the file:

0 0 * * * python3 /path/to/main.py --bypass-confirm

Replace /path/to/main.py with the actual path to the main.py file. Then press Esc, type :wq, and press Enter to save and exit the file.

You should be good to go. The program will now run every day at midnight. If you want to modify the date/time at which it runs, you can check out https://crontab.guru/.
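One thing to watch for: cron typically starts jobs from the user's home directory, so relative paths such as ./notebooks/notebook.ipynb or .env may not resolve unless the job first changes into the project directory. The helper below is a hypothetical sketch that builds a crontab line doing exactly that; the cd step is a suggestion and is not taken from this project.

```python
import shlex


def build_cron_line(project_dir, schedule="0 0 * * *"):
    """Build a crontab entry that runs main.py from inside project_dir.

    project_dir should be the absolute path printed by pwd.
    """
    directory = shlex.quote(project_dir)  # protects paths containing spaces
    return f"{schedule} cd {directory} && python3 main.py --bypass-confirm"


# Example with a made-up path:
# build_cron_line("/srv/notebook-pipeline-runner")
```

Paste the returned line into the file opened by crontab -e in place of the entry above.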

Testing

To run the test script that verifies the program will not run multiple instances concurrently:

chmod +x test.sh

./test.sh
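For background, a common way to prevent concurrent instances on Linux and macOS is an exclusive, non-blocking file lock. The sketch below shows that technique with fcntl.flock; whether this project uses it is an assumption, and acquire_lock and the lock file path are hypothetical names.

```python
import fcntl


def acquire_lock(lock_path):
    """Return an open file holding an exclusive lock, or None if already locked."""
    lock_file = open(lock_path, "w")
    try:
        # LOCK_NB makes the call fail immediately instead of waiting.
        fcntl.flock(lock_file, fcntl.LOCK_EX | fcntl.LOCK_NB)
    except OSError:
        lock_file.close()
        return None
    return lock_file  # keep this object alive; closing it releases the lock
```

A program would call acquire_lock at startup and exit if it returns None. The operating system releases the lock when the process ends, even after a crash, so no stale lock file needs cleaning up.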

About

Automation of notebook pipeline for Taimaka Project
