This application is the Bulk FHIR layer that sits on top of Data Query to provide anonymized data.
Read more
- Patient
Limitations of the current MVP include:
- Data is currently DSTU2 format and not R4 as dictated by the specification.
- Security is implemented with access tokens and not SMART Authorization. `system/*.read` scopes are not currently available.
- `Patient/$export` does not support the optional `_type` and `_since` parameters.
The VA houses the largest medical history database in the US.
To support a data set this large, some deviations from the specification have been made.
The kick off request (`/Patient/$export`) does not actually initiate the bulk packaging.
Instead, bulk data is prepared in advance on a periodic basis, e.g. monthly.
A Publication is the periodic collection of data.
The Bulk FHIR application makes one Publication available to all consumers.
The Bulk FHIR endpoints still function as specified.
- The `/Patient/$export` endpoint will return the location of the Status endpoint.
- The Status endpoint will always return a Complete response. It is never In-Progress.
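The consumer side of this flow can be sketched as follows. This is a minimal illustration, not the application's client code. It assumes the kick-off response carries the Status location in a `Content-Location` header and that the Complete response body lists files in an `output` array of `type`/`url` entries, as the Bulk Data specification describes. The function names, host, and file names are hypothetical.

```python
from typing import Dict, List


def status_url_from_kickoff(headers: Dict[str, str]) -> str:
    """Return the Status endpoint location from a kick-off response's headers."""
    # HTTP header names are case-insensitive, so normalize before the lookup.
    normalized = {name.lower(): value for name, value in headers.items()}
    return normalized["content-location"]


def file_urls_from_status(manifest: dict, resource_type: str = "Patient") -> List[str]:
    """Collect the file URLs for one resource type from a Complete status body."""
    return [
        entry["url"]
        for entry in manifest.get("output", [])
        if entry.get("type") == resource_type
    ]


# Hypothetical responses; the host and file names are made up for illustration.
kickoff_headers = {"Content-Location": "https://example.invalid/bulk/status/1"}
complete_body = {
    "output": [
        {"type": "Patient", "url": "https://example.invalid/files/Patient-0001.ndjson"},
        {"type": "Patient", "url": "https://example.invalid/files/Patient-0002.ndjson"},
    ]
}

status_url = status_url_from_kickoff(kickoff_headers)
patient_files = file_urls_from_status(complete_body)
```

Because the Status endpoint always reports Complete, a client written this way needs no polling loop, although one written against the general specification would still work.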
Publications are very large. Depending on the resource type, the number of records can range from tens of millions to billions. Publications are made of many files, which are identified in the Complete status response. A Publication can have thousands of files, each file containing tens of thousands of records.
Publications are created in a rolling wave. For example, the January publication is made available in February. The February publication will be built automatically in the background over the month and made available in March.
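The rolling wave amounts to a one-month offset between the data month and the month of availability. A small sketch of that arithmetic is shown below; the function name is hypothetical and the exact day of release within the month is not specified here, so only the year and month are computed.

```python
from typing import Tuple


def available_month(data_year: int, data_month: int) -> Tuple[int, int]:
    """Return the (year, month) in which a publication becomes available."""
    if not 1 <= data_month <= 12:
        raise ValueError("data_month must be between 1 and 12")
    # A publication is released the month after the one it was built in,
    # so December rolls over into January of the next year.
    if data_month == 12:
        return data_year + 1, 1
    return data_year, data_month + 1
```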
Personally identifiable information (PII) data is removed or synthesized. The following generalizations apply:
- Optional data that is considered PII is removed
- Dates are truncated to the year, e.g. `2005-01-01T12:34:56Z`
- Remove `.address`, `.contact[]`, `.id`, `.identifier[]`, `.photo`, `.telecom`.
- Remove `.multipleBirthInteger` and populate `.multipleBirthBoolean` if applicable.
- Synthesize `.name` using generated values. Only `.name.given`, `.name.family`, and `.name.text` will be populated.
- Synthesize `.birthDate`. Patients that are greater than 90 years old will have their birth date adjusted such that they appear 90. For example, if the current year is 2019 and the patient is 92, their birth date will be `1929-01-01T12:34:56Z`
- Synthesize `.deceasedDateTime`
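The rules above can be sketched on a Patient represented as a plain dictionary. This is an illustration under several assumptions, not the application's anonymizer: age is approximated as the difference between years, the time portion of a timestamp is left unchanged, a present `multipleBirthInteger` is taken to mean `multipleBirthBoolean` is true, `deceasedDateTime` is synthesized by truncating it to the year, and the generated name is a fixed placeholder. All function and constant names are hypothetical.

```python
import copy

REMOVED_FIELDS = ("address", "contact", "id", "identifier", "photo", "telecom")
MAX_APPARENT_AGE = 90


def truncate_to_year(timestamp: str) -> str:
    """Reset month and day to January 1st, keeping the year and any time part."""
    # Assumption: values start with YYYY-MM-DD; anything after that is kept.
    return f"{timestamp[:4]}-01-01{timestamp[10:]}"


def synthesize_birth_date(birth_date: str, current_year: int) -> str:
    """Truncate to the year and cap the apparent age at 90."""
    birth_year = int(birth_date[:4])
    # Assumption: age is approximated as current_year - birth_year.
    birth_year = max(birth_year, current_year - MAX_APPARENT_AGE)
    return f"{birth_year:04d}-01-01{birth_date[10:]}"


def anonymize_patient(patient: dict, current_year: int) -> dict:
    """Return an anonymized copy of a Patient; the input is not modified."""
    result = copy.deepcopy(patient)
    for field_name in REMOVED_FIELDS:
        result.pop(field_name, None)
    if "multipleBirthInteger" in result:
        # Assumption: a birth order implies the patient was a multiple birth.
        del result["multipleBirthInteger"]
        result["multipleBirthBoolean"] = True
    if "name" in result:
        # Placeholder for generated values; DSTU2 models family as a list.
        result["name"] = [
            {"given": ["Anon"], "family": ["Patient"], "text": "Anon Patient"}
        ]
    if "birthDate" in result:
        result["birthDate"] = synthesize_birth_date(result["birthDate"], current_year)
    if "deceasedDateTime" in result:
        result["deceasedDateTime"] = truncate_to_year(result["deceasedDateTime"])
    return result
```

Working on a deep copy keeps the source record intact, which makes it easier to compare a record before and after anonymization when testing.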
Read more
- https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html
- https://privacyruleandresearch.nih.gov/pr_08.asp
- https://www.law.cornell.edu/cfr/text/45/164.514
- https://www.hhs.gov/sites/default/files/ocr/privacy/hipaa/understanding/coveredentities/minimumnecessary.pdf
- Data Query is responsible for enabling access to bulk FHIR compliant records through VA internal APIs that are protected from general access.
- The Incredible Bulk communicates with Data Query through internal, protected APIs.
  - Internal calls to Data Query require the `DATA_QUERY_INTERNAL_ACCESS_KEY` found in the deployment unit
- The Incredible Bulk is responsible for Publication management and anonymization.
  - Internal calls to the publication endpoint require the `KONG_INTERNAL_PROTECTED_OP_TOKENS` found in the deployment unit
- Publication files are created by The Incredible Bulk but served to consumers directly from S3 (via Kong)
  - Consumer access through Kong requires the sharing of the `KONG_PUBLIC_PROTECTED_OP_TOKENS` found in the deployment unit
- Timers are implemented using Kubernetes batch CronJob containers that periodically poke Publication endpoints.
When building files, The Incredible Bulk will gather data from Data Query where it will be anonymized and written to S3.
- A Publication is created using `POST /internal/publication`
  - Data Query will be interrogated to determine records that are available.
  - The number of files required will be determined and groups of records will be associated to each file.
  - The status of each file will be `NOT_STARTED`
- A timer will trigger file building using `POST /internal/publication/any/file/next`
  - The first file that has a status of `NOT_STARTED` for the oldest Publication will be chosen.
  - Records will be extracted from Data Query, anonymized, and written to S3 for storage.
- Once all files are created (status is `COMPLETE`) for the Publication, the entire Publication will be considered `COMPLETE` and made immediately available to consumers on future status calls. (The status endpoint is returned as part of the `/Patient/$export` call.)
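The lifecycle above can be modeled in memory as a rough sketch. It is not the application's implementation: the class names, the file naming scheme, the `created` ordering field, and the fixed `records_per_file` partitioning are all assumptions, and the extraction, anonymization, and S3 write are reduced to a placeholder comment.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

NOT_STARTED = "NOT_STARTED"
IN_PROGRESS = "IN_PROGRESS"
COMPLETE = "COMPLETE"


@dataclass
class PublicationFile:
    file_id: str
    first_record: int  # inclusive index of the first record in this file
    last_record: int   # inclusive index of the last record in this file
    status: str = NOT_STARTED


@dataclass
class Publication:
    publication_id: str
    created: int  # creation order; a smaller value means an older Publication
    files: List[PublicationFile] = field(default_factory=list)

    def is_complete(self) -> bool:
        """A Publication is complete once every one of its files is COMPLETE."""
        return all(f.status == COMPLETE for f in self.files)


def create_publication(
    publication_id: str, created: int, record_count: int, records_per_file: int
) -> Publication:
    """Split the available records into files, all starting as NOT_STARTED."""
    file_count = -(-record_count // records_per_file)  # ceiling division
    files = [
        PublicationFile(
            file_id=f"{publication_id}-{index:04d}",
            first_record=index * records_per_file,
            last_record=min((index + 1) * records_per_file, record_count) - 1,
        )
        for index in range(file_count)
    ]
    return Publication(publication_id, created, files)


def next_file(
    publications: List[Publication],
) -> Optional[Tuple[Publication, PublicationFile]]:
    """Pick the first NOT_STARTED file of the oldest Publication that has one."""
    for publication in sorted(publications, key=lambda p: p.created):
        for candidate in publication.files:
            if candidate.status == NOT_STARTED:
                return publication, candidate
    return None


def build_next(publications: List[Publication]) -> Optional[str]:
    """Build one file and return its id, or None when nothing is left to build."""
    chosen = next_file(publications)
    if chosen is None:
        return None
    _, target = chosen
    target.status = IN_PROGRESS
    # Placeholder: extract records from Data Query, anonymize, write to S3.
    target.status = COMPLETE
    return target.file_id
```

Calling `build_next` once per timer tick builds one file at a time, which matches the idea of spreading the work out rather than building a whole Publication in one request.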
Notes
- A second timer will periodically check for incomplete Publication files. For example, if an instance of The Incredible Bulk crashes while building a file, the file will have been marked as `IN_PROGRESS` but can never complete. This timer will look for such files and reset their status to `NOT_STARTED` so that they can be re-attempted.
- Specific files can be built using `POST /internal/publication/{id}/file/{fileId}`
- Publications can be listed using `GET /internal/publication`
- Status can be queried using `GET /internal/publication/{id}`
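The recovery timer described in the notes can be sketched as a function that resets files stuck in `IN_PROGRESS`. How the application actually decides that a file is stuck is not stated here; this sketch assumes each file records when its build started and treats anything older than a configurable limit as abandoned. The record layout and names are hypothetical.

```python
from datetime import datetime, timedelta
from typing import Dict, List


def reset_stale_files(
    files: List[Dict], now: datetime, max_build_time: timedelta
) -> List[str]:
    """Reset IN_PROGRESS files that have been building for too long.

    Returns the ids of the files that were reset to NOT_STARTED.
    """
    reset_ids = []
    for entry in files:
        if entry["status"] != "IN_PROGRESS":
            continue
        # Assumption: "started" holds the time the build was claimed.
        if now - entry["started"] > max_build_time:
            entry["status"] = "NOT_STARTED"
            entry["started"] = None
            reset_ids.append(entry["file_id"])
    return reset_ids
```

The limit should be comfortably longer than the slowest expected file build, otherwise a healthy build could be reset and duplicated.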
- The Implementation Guide (IG) has been updated since this PoC.
- OAuth is not currently supported. A simple API key authentication method is used. The API key provides an "all or nothing" approach. We have no mechanism for allowing access to different resources for different users.
- The IG assumes bulk data files are generated on demand. Our data set is very large and not well suited for on-demand creation. Instead, data sets are created monthly, taking many days for Patient alone. Files are built in batches to avoid overloading the servers and database.
- Data sets are very large, e.g. Observation has billions of records. Transferring this data to clients will be time consuming. Per the specification, records are included in multiple files. We must find the balance in files that are very large and having a very large number of files. Even with large files, there will still be a great number of them. This
- The Bulk FHIR specification defines STU3 structures; this PoC returns DSTU2-flavored structures.
- Only the Patient resource is implemented. Support for Observation, Condition, Procedure, etc. is absent.
- The current solution builds comprehensive publications monthly. There is significant cost (in time) to produce the data set. There is no support for incremental updates, which could be problematic for users that wish to stay current.
- We do not support optional endpoints or parameters for the following
  - groups or group level data export
  - system level export, e.g. `services/fhir/v0/stu3$export`
  - query parameters:
    - `_outputFormat` (we only support output `application/fhir+ndjson`)
    - `_since` (time based filtering)
    - `_type` (We only support Patient)
    - no experimental parameters, e.g. type filters
  - delete operations
  - new optional `Expires` header is not supported but should be
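Since `application/fhir+ndjson` is the only supported output, each Publication file holds one JSON resource per line. A minimal sketch of how a consumer might read such content is shown below; the function name and the sample records are hypothetical, and for files with tens of thousands of records a consumer would likely stream line by line rather than hold the whole text in memory.

```python
import json
from typing import Iterator


def read_ndjson(text: str) -> Iterator[dict]:
    """Yield one parsed resource per non-empty line of NDJSON text."""
    for line in text.splitlines():
        if line.strip():
            yield json.loads(line)


# Hypothetical file content with two anonymized Patient records.
sample = (
    '{"resourceType": "Patient", "birthDate": "1929-01-01"}\n'
    '{"resourceType": "Patient", "birthDate": "2005-01-01"}\n'
)
patients = list(read_ndjson(sample))
```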