There are two repos which should be checked out in a common directory.
One is euctr-tracker-code
, and the other is euctr-tracker-data
.
The latter contains JSON files which are loaded into memory by the
frontend
Django app in euctr-tracker-code
loads the contents of
the JSON files into memory.
Thus we avoid using a database. Originally this design made sense as the data structure was simple, and the source data was in a third party database over which we had no control. This makes less sense now as we run our own database for the source data, and the model has become more complex; we do, however, maintain the advantage of maintaining all the historic data in git.
Once a month, we run a shell script
crawl-and-dump-eutrialstracker_live.sh
that does two things:
- It executes command that uses
scrapy
to scrape the EUCTR register, via cron. This writes to a local postgres database,euctr
. It takes up to a day to run - it must do a full scrape each time, due to the design of the website we are scraping. - It runs the Django management command
get_trials_from_db
, which creates a CSV based on this data, and writes it to theeuctr-tracker-data
directory.
Every day, loaddata-eutrialstracker_live.sh
is also run. If
any uncommitted changes are found in the euctr-tracker-data
directory, it executes the Django management command
update_trials_json
, to create the set of JSON files which drive the
website.
However, this second shell script will initially (deliberately) exit
early, because the sponsors CSV file (see Terminology section,
below) contains rows that have not had the normalised sponsor column
updated. This will happen every time new sponsors are added. When the
update_trials_json
script exits early, it emails a recipient
specified in the django settings file, telling them intervention is
needed, and attaching the file.
The recipient should edit the CSV and update it directly in the
euctr-tracker-data
repo; loaddata-eutrialstracker_live.sh
will
then pick this up on the next run. Finally, this script will commit
all the latest generated files to the euctr-tracker-data
repostory.
Install these Python development packages before you begin. For example, on a Debian-based system:
apt install python3 python3-venv build-essential python3-dev phantomjs
Using Python 3, create and enter a virtualenv, as described here. For example:
python3.5 -m venv venv
. venv/bin/activate
Install required Python packages.
pip install -r requirements.txt
Set environment variables required.
export EUCTR_SECRET_KEY= # random longish string for sessions
export EUCTR_DEBUG= # yes or no
export EUCTR_OPENTRIALS_DB=postgres://<account_name>:<password>@<servername>/warehouse
export EUCTR_CRAWLERA_APIKEY= # for crawler proxying
Checkout the data respository.
cd ..
git clone git@github.com:ebmdatalab/euctr-tracker-data.git
cd -
Run the application.
cd euctr
./manage.py runserver
There are a few tests.
./manage.py test
We use fabric to deploy over SSH to a pet server.
fab deploy:live
The code and data are updated via git from the master branch of their repositories.
The configuration is in fabfile.py
and the deploy
directory.
When settings up a new server:
- Put environment settings live in
/etc/profile.d/eutrialstracker_live.sh
- Put SSH keys for the git repositories in
/var/www/eutrialstracker_live/ssh-keys
The frontend application reads data from static JSON files in
the ../euctr-tracker-data/
directory.
- Set the location of a PostgreSQL database. This is only used as an intermediate store for the crawler to keep data in. The live website doesn't use it.
export EUCTR\_OPENTRIALS\_DB=postgres://username:password@hostname/dbname
If you need the schema, look in euctr/crawl/schema.sql
.
- Run the EUCTR crawler to populate the PostgreSQL database by running with a date range:
cd euctr
./manage.py run_crawler 2004-01-01 2017-09-19
Or update results from a particular query, e.g. a specific trial id:
./manage.py run_crawler --query=2004-000012-13
- Update
../euctr-tracker-data/trials.csv
from the PostgreSQL database by running:
./manage.py get_trials_from_db
This assumes the table is called "euctr". It uses the SQL script
opentrials-to-csv.sql
for the calculations and conversions needed.
- Regenerate the JSON files from the CSV file by running:
./manage.py update_trials_json
In certain conditions, the scraper can currently automatically commit bad data into euctr-tracker-data. This will then be continuously deployed, leading to errors in production. In this case, you will need to roll-back the data by:
- remove the offending commit(s)
This can be done by reverting a bad commit, or by re-creating the default branch at a known-good commit.
- push the revised branch to GitHub
This can be done with a PR (in the case of a reversion), or by a force push (in the case of re-creating a branch)
- deploy to production
If using a reversion & PR, this will deploy continuously. If using a re-creation & force push, manual fixups will be required on the production deployment.
- re-run the scraper, with appropriate fixes
Note that the scraper initially scrapes into a PostgreSQL database. It updates the rows each time it is run (rather than creating new rows). This means that old data should be automatically removed from the database, and that although data remains from the bad scrape in the database at the time of starting a new scrape, it should be completely overwritten by the new scrape.
The spreadsheet ../euctr-tracker-data/normalized_sponsor_names.xlsx
contains
normalized versions of the names for trials that are listed in the register.
normalized_name_only
: This column represents normalization based only on
sponsor name or very simple research. Different spelling/abbreviations of a
sponsor name and divisions/subsidiaries of the same sponsor will be normalized
to the same parent sponsor name (ex: DrugCompany Inc., DrugCompany Incorporated
and DrugCompany Generics would all be normalized to the generic DrugCompany
Inc.). This also includes, where possible, instances in which a sponsor name is
implemented in different languages and also attempts to consolidate hospital
systems under a single normalized name (ex: Putting hospitals in the Copenhagen
University hospital system together or matching an old NHS Trust name to the
new name). If a sponsor is identified in their sponsor name as part of another
company (ex: “doing business as” or “a wholly owned subsidiary of” ) then that
company will be normalized to the parent company identified (ex:
“GenericCompany Inc, a wholly owned subsidiary of Drugs Inc.” would normalize
to “Drugs Inc.“). Attempts are also made to make sponsor names more uniform and
readable with the removal of extraneous text or names in all capital letters.
normalized_name
: This column represents an effort to normalize based on more in
depth research into mergers, acquisitions and name changes for corporate
entities. Proof is sought to account changes in corporate ownership and where
it is believed responsibility for reporting would ultimately be vested. Proof
of notable acquisitions is given in the spreadsheet for changes made in this
column.