wojtekwanczyk/domain-scraper

domain-scraper 🔍

Configure domain-scraper by setting the following environment variables; send-summary needs them to work:

  • DOMAINS_SUBSCRIBERS - comma-separated list of valid email addresses
  • GMAIL_APP_USERNAME - Gmail username
  • GMAIL_APP_PASSWORD - Gmail app password for that account
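
As a rough illustration of the variables listed above, the snippet below exports placeholder values and runs a small sanity check on the subscriber list. The addresses, the password string and the check_subscribers helper are all hypothetical; the helper is not part of domain-scraper and only tests that each entry looks roughly like an email address.

```shell
# Placeholder values: replace them with your own before running the tool.
export DOMAINS_SUBSCRIBERS="alice@example.com,bob@example.com"
export GMAIL_APP_USERNAME="sender@example.com"
export GMAIL_APP_PASSWORD="replace-with-your-app-password"

# Hypothetical pre-flight check (not part of domain-scraper).
# The body runs in a subshell, so the IFS and "set -f" changes stay local.
check_subscribers() (
    [ -n "$1" ] || exit 1          # an empty list is rejected
    set -f                         # no filename globbing while splitting
    IFS=','                        # split the list on commas
    for addr in $1; do
        case "$addr" in
            ?*@?*.?*) ;;           # something@something.something
            *) echo "invalid address: $addr" >&2; exit 1 ;;
        esac
    done
)

check_subscribers "$DOMAINS_SUBSCRIBERS" && echo "subscribers look OK"
```

The pattern is deliberately loose; it catches obvious mistakes such as a missing @ but does not prove that an address can receive mail.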

Usage

Run domain-scraper -h to see the available commands and options.

Using docker image

Remember to pass the variables GMAIL_APP_USERNAME, GMAIL_APP_PASSWORD and DOMAINS_SUBSCRIBERS to the docker run command with -e, or list them in an env.list file and use the --env-file flag, e.g.

mkdir -p emails/input emails/archive
image_name="domain-scraper"
app_path="/domain-scraper"
docker run --rm \
    --mount "type=bind,source=${PWD}/emails,target=${app_path}/emails" \
    --volume db:${app_path}/db:rw \
    -e GMAIL_APP_USERNAME -e GMAIL_APP_PASSWORD -e DOMAINS_SUBSCRIBERS \
    "${image_name}" domain-scraper -ps
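
The --env-file variant mentioned above could look roughly like the sketch below. It assumes the three variables are already exported in the current shell: a line in env.list that contains only a variable name makes Docker copy that value from the host environment, so no secret is written to disk. The run_scraper function name is hypothetical and simply wraps the docker run command from this section.

```shell
# Write env.list; a bare variable name (no "=value") tells Docker to take
# the value from the environment of the shell that runs "docker run".
cat > env.list <<'EOF'
GMAIL_APP_USERNAME
GMAIL_APP_PASSWORD
DOMAINS_SUBSCRIBERS
EOF

# Hypothetical wrapper around the docker run command shown above,
# with --env-file replacing the three -e flags.
run_scraper() {
    image_name="domain-scraper"
    app_path="/domain-scraper"
    mkdir -p emails/input emails/archive
    docker run --rm \
        --mount "type=bind,source=${PWD}/emails,target=${app_path}/emails" \
        --volume "db:${app_path}/db:rw" \
        --env-file env.list \
        "${image_name}" domain-scraper -ps
}
```

Call run_scraper once the image has been built. If you prefer to store the values in the file, use NAME=value lines without quotes, because Docker treats quotes in an env file as part of the value.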

minikube deployment

eval $(minikube docker-env) # point the docker CLI at minikube's Docker daemon so the image is built there
minikube mount "$HOME:/hosthome" # keep this running in a separate terminal

# secrets.yaml must define GMAIL_APP_USERNAME, GMAIL_APP_PASSWORD and DOMAINS_SUBSCRIBERS
kubectl apply -f secrets.yaml
kubectl create -f cronjob.yaml
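
One possible way to prepare secrets.yaml is sketched below. It is an assumption, not the repository's actual file: the Secret name email-secrets is borrowed from the TODO list and must match whatever cronjob.yaml references, and the default values are placeholders used only when the variables are not already exported. Running it overwrites any existing secrets.yaml in the current directory.

```shell
# Fall back to placeholder values when the variables are not set.
: "${GMAIL_APP_USERNAME:=sender@example.com}"
: "${GMAIL_APP_PASSWORD:=replace-with-your-app-password}"
: "${DOMAINS_SUBSCRIBERS:=alice@example.com,bob@example.com}"

# Generate a Secret manifest; stringData accepts plain (non-base64) values.
# The name "email-secrets" is an assumption taken from the TODO list.
cat > secrets.yaml <<EOF
apiVersion: v1
kind: Secret
metadata:
  name: email-secrets
type: Opaque
stringData:
  GMAIL_APP_USERNAME: "${GMAIL_APP_USERNAME}"
  GMAIL_APP_PASSWORD: "${GMAIL_APP_PASSWORD}"
  DOMAINS_SUBSCRIBERS: "${DOMAINS_SUBSCRIBERS}"
EOF
```

Values containing double quotes or backslashes would need escaping in the generated YAML, and the file holds credentials in plain text, so keep it out of version control.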

TODO:

  • move scanned emails to separate dir to avoid duplication
  • add requirements.txt file
  • add html alternative message template + use it
  • add the possibility to read INPUT_DIR, ARCHIEVE_DIR, DB_FILE from env
  • add option for send-summary to send all emails instead of only new emails
  • add setup.cfg, entrypoint and test building
  • add Dockerfile, build and test the image
  • add variables to deployment with secrets.yaml (email-secrets)
  • configure persistent volume to store db
  • add kubernetes yaml, verify the deployment
  • do some refactoring, especially variable naming (msg, messages_to_send)
  • add docstrings
  • add type declarations
  • add logging and remove all prints
  • add option to parse only one email
  • create helm chart from the repo
  • add coverage measurement
  • write unit tests
  • evaluate domain parsing - separate IPv4/IPv6 parsing from domains
  • add exception handling for when a file in INPUT_DIR is not an email file or does not contain a Received header
  • add validation for input files, to check if they are actually emails
  • add validation for DOMAINS_SUBSCRIBERS

About

Scrape domains from raw emails in local directory
