GitHub - jherskovic/DataFakehouse: Generates synthetic data warehouses for data exploration

This is a simple set of scripts that creates a simulated data warehouse. It
makes many simplifying assumptions, but it’s good enough for some kinds of
research (especially into data mining techniques) and for teaching purposes.
Here are some of the more relevant simplifying assumptions:

1. All diseases are chronic. 
2. Care is episodic. In other words, this is an encounter-based setting, like an
   outpatient clinic.
3. Patients have a condition from the start, or they don’t. Conditions don’t 
   appear during the course of care. 
4. There’s a standard set of labs that is ordered every single time a patient 
   with a condition visits. You can think of vitals as ‘labs’ if that helps. 
5. The number of potential conditions is small. This can be increased easily, if
   necessary.
6. All lab values follow a normal (Gaussian) distribution, both the healthy and
   the abnormal ones.
7. We know the ground truth about whether a patient has a condition or not 
   (great for computing sensitivity and specificity!)
8. Conditions may (or may not) be billed for. Billing is based on the 
   physicians' diagnoses during a visit, but it is also based on whether the
   condition in question is being treated at the institution or not. In other
   words, the physician may know that you have Bagelitis, but if it's being
   treated elsewhere we won't bill for it. This approximates the US billing 
   model; YMMV.
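
To illustrate assumptions 6 and 7, here is a minimal sketch (not code from
populate_db.py) of how normally distributed lab values plus a known ground
truth allow sensitivity and specificity to be computed directly. The means,
standard deviations, prevalence, threshold, and function names are all
hypothetical and chosen only for illustration.

```python
from __future__ import division, print_function
import random

# Hypothetical parameters; none of these come from the actual scripts.
NORMAL_MEAN, NORMAL_SD = 5.0, 0.5      # lab values for patients without the condition
ABNORMAL_MEAN, ABNORMAL_SD = 7.0, 0.8  # lab values for patients with the condition
PREVALENCE = 0.2                       # fraction of patients with the condition
THRESHOLD = 6.0                        # values above this are flagged as abnormal


def simulate_patients(n, rng):
    """Return a list of (has_condition, lab_value) pairs."""
    patients = []
    for _ in range(n):
        # Assumption 3: the condition is fixed from the start.
        has_condition = rng.random() < PREVALENCE
        # Assumption 6: both populations are Gaussian.
        if has_condition:
            value = rng.gauss(ABNORMAL_MEAN, ABNORMAL_SD)
        else:
            value = rng.gauss(NORMAL_MEAN, NORMAL_SD)
        patients.append((has_condition, value))
    return patients


def sensitivity_specificity(patients, threshold):
    """Score a simple threshold test against the known ground truth."""
    tp = fn = tn = fp = 0
    for has_condition, value in patients:
        flagged = value > threshold
        if has_condition and flagged:
            tp += 1
        elif has_condition:
            fn += 1
        elif flagged:
            fp += 1
        else:
            tn += 1
    return tp / (tp + fn), tn / (tn + fp)


rng = random.Random(42)  # fixed seed so the run is repeatable
patients = simulate_patients(10000, rng)
sens, spec = sensitivity_specificity(patients, THRESHOLD)
print("sensitivity=%.3f specificity=%.3f" % (sens, spec))
```

Because the ground truth is stored alongside each simulated value, no manual
chart review is needed to score a classifier; moving THRESHOLD trades
sensitivity against specificity.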
   
Requires PostgreSQL 8.3 or greater, a reasonably modern Python 2.x, and
psycopg2.

See create_db.sh for parameters, populate_db.py to tweak the probabilities of
events, and db_creation.sql to tweak the prevalence of diseases and the
likelihood that they are billed for at your fake institution.
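
As a rough illustration of the billing rule described in assumption 8, the
sketch below bills a condition only when it is both diagnosed at the visit and
treated at this institution. It is not taken from the repository: the function
name, the dictionary layout, and the diagnosis probability are hypothetical,
and the real likelihoods live in db_creation.sql and populate_db.py.

```python
from __future__ import print_function
import random


def bill_visit(patient_conditions, diagnosis_probability, rng):
    """Return the conditions billed for one visit.

    patient_conditions maps a condition name to True when that condition is
    treated at this institution and False when it is treated elsewhere.
    """
    billed = []
    for name in sorted(patient_conditions):
        # The physician may or may not record the diagnosis at this visit.
        diagnosed = rng.random() < diagnosis_probability
        # Bill only if diagnosed AND treated here.
        if diagnosed and patient_conditions[name]:
            billed.append(name)
    return billed


# Hypothetical patient: Bagelitis is known but treated elsewhere.
patient = {"Bagelitis": False, "Hypothetical condition": True}
rng = random.Random(0)
visits = [bill_visit(patient, 0.9, rng) for _ in range(5)]
for number, billed in enumerate(visits, 1):
    print("visit %d billed: %s" % (number, billed))
```

Under these assumptions Bagelitis never shows up in the billing data even
though the patient truly has it, which is why billing codes alone undercount
conditions in this kind of warehouse.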