This is a library to scrape and reconcile all payments made by a hierarchy of NHS institutions over time. It is the last of three projects on public procurement data (the first two were centgovspend and TSRC-NCVO-CSDP). Code for an interactive dashboard is found at src/dashboard, and an extremely unfinished prototype of the dashboard itself is at:
with the help of Ian M. Knowles. Open-access versions of the two headline academic papers ("The Role of Non-Profits in Public Health Service Provision: Evidence from 25,338 heterogeneous procurement datasets" with John Mohan, and "Is outsourcing healthcare services to the private sector associated with higher mortality rates? An observational analysis of privatisation in England's NHS, 2013-2020" with Ben Goodair and Aaron Reeves) will be hosted on the Open Science Framework (OSF) in due course, and linked here. A full, build-passing notebook for the first of these two papers can be found here. If you would like to collaborate on these or related work, please don't hesitate to get in touch! Two spin-off repositories, for pdf-parsing and institutional data curation, can be found here and here respectively.
NHSSpend tries to minimize the number of prerequisite installations outside of the standard library, and we recommend an Anaconda installation to provide a comprehensive set of basic tools. However, some additional packages are necessary given the scale of the undertaking. These are listed in the requirements.txt file (generated by pipreqs). The pdfparser is based on a version of the pdftableparser library, and the Charity Commission data is extracted using the charity-commission-extract library from NCVO. The Elasticsearch functionality is a custom implementation.
The data originates from one of two lists of recognised NHS institutions (Trusts and CCGs) and the main NHS England data provision page. These lists are used to create mappings to websites and to track the status of the data (data/data_support/ccg_list.xlsx and data/data_support/trust_list.xlsx), with a number of different parameters fed into the scraper (src/NHSscraper.py). The data curation exercise stopped in April 2020 in order to focus on the analysis of the data (the compressed datasets are found in the data/merged/* subdirectory of this repository). This is also partly due to the Covid-19 pandemic and the restructuring of Clinical Commissioning Groups more generally (where 18 mergers took the number of CCGs from 191 to 136). However, please do raise issues on here if you think any of those institutions are mislabelled or outdated. If you want to update this list (and the subsequent scrapers), please raise an issue or get in touch (this is an ongoing work in progress until there is a centrally convened resource provided by the Government Data Service).
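As a rough illustration of how an institution list can drive a scraper, the sketch below builds a mapping from institution to website and status, then selects the sites still worth visiting. It is not the repository's code: the column names, the example rows and both function names are hypothetical, and an in-memory CSV stands in for the .xlsx files so that the example runs on the standard library alone.

```python
import csv
import io

# Hypothetical extract of an institution list; the column names are assumptions.
RAW = """name,abbreviation,url,status
Example CCG,example_ccg,https://example.org/ccg/spend,active
Example Trust,example_trust,https://example.org/trust/spend,merged
"""


def load_institutions(handle):
    """Map each institution's abbreviation to its name, website and data status."""
    return {
        row["abbreviation"]: {
            "name": row["name"],
            "url": row["url"],
            "status": row["status"],
        }
        for row in csv.DictReader(handle)
    }


def scrape_targets(institutions):
    """Return (abbreviation, url) pairs for institutions still marked active."""
    return [
        (abbr, info["url"])
        for abbr, info in institutions.items()
        if info["status"] == "active"
    ]


institutions = load_institutions(io.StringIO(RAW))
targets = scrape_targets(institutions)
```

Keeping the status alongside the URL means that merged or dissolved institutions can stay in the list for historical data without being scraped again.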
The procurement data itself is provided under the Open Government Licence (OGL). Guidance for publishing spend over £25,000 is published by HM Treasury.
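For readers working with raw spend files, the following minimal sketch shows one way to apply the £25,000 publication threshold to payment rows. The field names, the example rows and the helper functions are illustrative assumptions, not the schema or code used in this repository.

```python
from decimal import Decimal

# Publication threshold named in the guidance: spend over £25,000.
THRESHOLD = Decimal("25000")


def parse_amount(raw):
    """Convert a string such as '£25,000.50' into a Decimal."""
    cleaned = raw.replace("£", "").replace(",", "").strip()
    return Decimal(cleaned)


def over_threshold(payments, threshold=THRESHOLD):
    """Keep only the payments whose amount is strictly above the threshold."""
    return [p for p in payments if parse_amount(p["amount"]) > threshold]


# Hypothetical rows; the field names are not the repository's schema.
payments = [
    {"supplier": "Example Supplier Ltd", "amount": "£30,000.00"},
    {"supplier": "Another Supplier", "amount": "25,000"},
    {"supplier": "Small Vendor", "amount": "£1,250.75"},
]

kept = over_threshold(payments)
```

Decimal is used rather than float so that amounts sitting exactly on the threshold compare without rounding surprises.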
The es_configure.md file describes the reconciliation approach. These reconciliations are then manually verified and merged back into the procurement data.
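To give a feel for what reconciliation involves, here is a small sketch that matches supplier names against a register of organisation names and leaves unmatched names for manual review. It deliberately swaps in the standard library's difflib for Elasticsearch, so it is not the approach documented in es_configure.md; the names, the suffix list, the cutoff and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

# Company suffixes dropped before comparison (an illustrative, incomplete list).
SUFFIXES = {"ltd", "limited", "plc"}


def normalise(name):
    """Lower-case a name, strip punctuation and drop common company suffixes."""
    cleaned = "".join(ch for ch in name.lower() if ch.isalnum() or ch.isspace())
    return " ".join(t for t in cleaned.split() if t not in SUFFIXES)


def best_match(supplier, register, cutoff=0.85):
    """Return (register_name, score) for the closest entry, or None below cutoff."""
    target = normalise(supplier)
    scored = [
        (SequenceMatcher(None, target, normalise(entry)).ratio(), entry)
        for entry in register
    ]
    score, name = max(scored)
    return (name, score) if score >= cutoff else None


# Hypothetical register entries and supplier strings.
register = ["Acme Health Services Limited", "Northern Care Trust"]
suppliers = ["ACME HEALTH SERVICES LTD.", "Unrelated Stationery Co"]

# Matches are candidates only; they still need manual verification before merging.
candidates = {s: best_match(s, register) for s in suppliers}
```

A supplier that maps to None stays unreconciled rather than being forced onto its nearest neighbour, which keeps the manual verification step focused on plausible matches.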
It is possible that you are reading this because you are most interested in a copy of the output data! The scraped, parsed, cleaned and reconciled data can be found at NHSSpend/data/data_final. Please see the readme.md in that subdirectory for information on each of the fields.
Repo structure is based on the tree utility.
├ readme.md
├ es_configure.md
├ requirements.txt
├ src
│ └ analysis
│ │ ├ charity_analysis_notebook.ipynb
│ │ ├ general_analysis_functions.py
│ │ ├ helper_functions.py
│ │ ├ charity_analysis_functions.py
│ ├ scrape_and_parse_ccgs.py
│ ├ scrape_and_parse_trusts.py
│ ├ scraping_tools.py
│ ├ generate_output.py
│ ├ ingest_everything.py
│ ├ merge_and_evaluate_tools.py
│ ├ NHSSpend.py
│ ├ parsing_tools.py
│ ├ pdf_table_parser.py
│ ├ preconciliation.py
├ dashboard
├ data
│ └ data_support/*
│ └ data_cc/*
│ └ data_ch/*
│ └ data_dashboard/*
│ └ data_final/*
│ └ data_masteringest/*
│ └ data_merge/*
│ └ data_nhsccgs/*
│ └ data_nhsdigital/*
│ └ data_nhsengland/*
│ └ data_nhstrusts/*
│ └ data_reconciled/*
│ └ data_shapefiles/*
│ └ data_summary/*
├ papers
│ └ corporate_networks
│ └ figures
│ └ tables
│ └ third_sector
├ logging
│ │ ├ nhsspend.log
│ └ eval_logs
├ tokens
This work was primarily funded by the British Academy. In addition to this, generous funding was provided by John Mohan for the undertaking of a 'data audit' by Steve Barnard. An earlier 'proof of concept' of the project was funded by ESRC Grant ES/M010392/1 (PI John Mohan) and undertaken at the Third Sector Research Centre. Additional thanks are due to Max Hattersly, Ben Goodair and Yu Pei for all of their work on data verification.
This code is made available under a GNU GENERAL PUBLIC LICENSE 3.0.
- More docstrings
- Publish related academic papers
Last updated: 2021-07-01