A Data Warehouse with Django and Scrapy

A Django app that stores scraped website data so the data can later be used as a source for imports.

It's a work in progress and not ready for use in a production environment.

Many parts of this project are based on previous work I have done. See the credits section below.

It's highly likely that this project will change significantly over time 💥

How it works so far

Obtain links to all pages to scrape: scrapy crawl sitemap
Collect the page content for each sitemap page: scrapy crawl pages
Build the blocks from the scraped data (page content): python manage.py build_blocks
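To give a feel for what "building blocks" from page content could mean, here is a minimal sketch. It is not the project's build_blocks command: the function name split_into_blocks, the {"type", "value"} block shape, and the rule "one block per top-level element" are all assumptions made for illustration. The project lists BeautifulSoup for HTML parsing; this sketch swaps in the standard library's html.parser so it runs without extra dependencies.

```python
from html.parser import HTMLParser

# Elements that never have a closing tag, so they must not change nesting depth.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class TopLevelBlockParser(HTMLParser):
    """Collect each top-level element of an HTML fragment as one block.

    Hypothetical block shape: {"type": <tag name>, "value": <text content>}.
    Limitation: relies on explicit closing tags (no implicit </p> handling).
    """

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.blocks = []
        self._depth = 0      # current nesting depth inside a top-level element
        self._tag = None     # tag name of the top-level element being read
        self._text = []      # text fragments collected for that element

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        if self._depth == 0:
            self._tag = tag
            self._text = []
        self._depth += 1

    def handle_endtag(self, tag):
        if tag in VOID_TAGS or self._depth == 0:
            return
        self._depth -= 1
        if self._depth == 0:
            # Collapse runs of whitespace so the stored value is clean text.
            value = " ".join("".join(self._text).split())
            self.blocks.append({"type": self._tag, "value": value})

    def handle_data(self, data):
        if self._depth > 0:
            self._text.append(data)


def split_into_blocks(html):
    """Return a list of block dicts, one per top-level element in `html`."""
    parser = TopLevelBlockParser()
    parser.feed(html)
    parser.close()
    return parser.blocks
```

For example, split_into_blocks("<h2>Hello</h2><p>Some <strong>bold</strong> text.</p>") yields one "h2" block and one "p" block, with nested markup flattened into the paragraph's text.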

Setup

You'll need a running WordPress site to scrape data from. I used a local install of WordPress with the default theme and sample content.

Clone this repo
Create a virtualenv and install the requirements: poetry install, then poetry shell
Create the database tables and a superuser for the project: python manage.py migrate, then python manage.py createsuperuser
Obtain links to all pages to scrape: run scrapy crawl sitemap from the warehouse/sitemap/spiders directory
Collect the page content for each sitemap page: run scrapy crawl pages from the warehouse/pages/spiders directory
Build the blocks from the scraped data (page content): run python manage.py build_blocks from the root directory of the project
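The last three steps have to run in order and from different directories, so they could be wrapped in a small helper. The following is a sketch only, assuming it is run inside the Poetry shell with the directory layout described above; the names STEPS and run_pipeline are hypothetical and not part of this repo.

```python
import subprocess
from pathlib import Path

# (command, working directory relative to the project root), in run order.
# Directories are taken from the setup steps above.
STEPS = [
    (["scrapy", "crawl", "sitemap"], "warehouse/sitemap/spiders"),
    (["scrapy", "crawl", "pages"], "warehouse/pages/spiders"),
    (["python", "manage.py", "build_blocks"], "."),
]


def run_pipeline(project_root=".", runner=subprocess.run):
    """Run each step in order, stopping at the first failure.

    `runner` defaults to subprocess.run; check=True makes a non-zero exit
    code raise CalledProcessError so later steps never see partial data.
    """
    root = Path(project_root)
    for command, workdir in STEPS:
        runner(command, cwd=root / workdir, check=True)
```

Calling run_pipeline() from the root directory of the project would then crawl the sitemap, crawl the pages, and build the blocks in one go. The runner parameter exists so the sequence can be checked without actually starting a crawl.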

TODO

Add tests
Refine the Django admin interface
Add a JSON API so a Wagtail site can access the data for import
and more...

Dependencies

Production

Poetry for dependency management
Scrapy for scraping
Django for the web app
BeautifulSoup for parsing HTML

Development

Pre-commit for running checks before each commit
Black for code formatting
Flake8 for code linting
isort for import sorting

License

MIT

Credits

Previous work I have done and where I have pulled ideas from
