Airlock - workspace data export (Design) #33
Let's consider the workflow identified at https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research as an initial data export workflow.
The dataflow described is a pipeline/data supply chain with governance (i.e. contracts, licensing, etc.) controlling ingress and egress. Staging inputs and outputs can be considered an activity in a pipeline, with different roles, responsibilities and tools to those used to produce the artefacts themselves.

Now, the primary resource within a TRE that provides governance boundaries is the "workspace". One idea would be to consider ways to connect/orchestrate workspaces, each with distinct ingress/egress policies governed and aligned with contracts/SLAs/agreements etc. as appropriate. For example:

Ingress Staging Workspace (linking, pseudonymisation, de-identification, etc.) -> ML Workspace (artefact production) -> Egress Staging Workspace (as above)

By using the workspace construct to flexibly provision environment and data controls (in the context of the principles of the Five Safes) we can create assurances that the workspace meets the legal requirements, etc. A pipeline of workspaces would also start work towards federation of TREs.
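To make this concrete, here is a minimal sketch (all workspace names and policy identifiers are hypothetical, not part of the TRE today) of how such a pipeline of workspaces with per-workspace ingress/egress policies could be described declaratively:

```python
from dataclasses import dataclass, field


@dataclass
class Workspace:
    name: str
    purpose: str
    ingress_policy: str   # e.g. a data sharing agreement governing inputs
    egress_policy: str    # e.g. review rules governing outputs
    allowed_downstream: list[str] = field(default_factory=list)  # pull whitelist


PIPELINE = [
    Workspace(
        name="ingress-staging",
        purpose="linking, pseudonymisation, de-identification",
        ingress_policy="data-sharing-agreement-001",
        egress_policy="release-deidentified-only",
        allowed_downstream=["ml-workspace"],
    ),
    Workspace(
        name="ml-workspace",
        purpose="artefact production (models, reports)",
        ingress_policy="accept-from-ingress-staging-only",
        egress_policy="outputs-require-review",
        allowed_downstream=["egress-staging"],
    ),
    Workspace(
        name="egress-staging",
        purpose="output checking before anything leaves the TRE",
        ingress_policy="accept-from-ml-workspace-only",
        egress_policy="manual-approval",
    ),
]
```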
Interesting way of thinking about it. So everything is a data import "job" specific to a destination workspace. Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of? There still needs to be an approval flow - do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move it from a transfer location into their workspace shared storage should it be "approved"? How does the approver then get it to the Researcher's outside location - say a client machine? Is this a special case? Do they need to be able to download their approved exports?
> So everything is a data import "job" specific to a destination workspace.

Yes, everything is a "process" or "job" as you say. This is useful for risk assessment and mitigation. It clearly identifies security zones with the ability to implement "controls" between processes. This could cover aspects of data de-identification and environment configuration for downstream processes.

> Can all users request to send data to any workspace (maybe even in another TRE), or only to workspaces they are a member of?

That's a policy decision depending on the data rights, but it should be the "Workspace Owner's" responsibility (or their delegate's). I think it's important to recognise that the flow of data is also a flow of ownership and rights that constrain downstream processing. In terms of defining workspace connectivity, I'd suggest that a workspace owner would specify a whitelist of downstream workspaces that are allowed to request artefacts. Workspace owners should not need to be members of downstream workspaces. I think it should be the other way around, with downstream "Workspace Owners" or external processes requesting upstream artefacts (pull).

> There still needs to be an approval flow - do we have a Data Reviewer role (say Workspace Owner for now), who can see all artefacts that are pending approval to get into their workspace, and simply move it from a transfer location into their workspace shared storage should it be "approved"?

Yes, the artefact requests will need to be approved, and it could be the workspace owner for workspace processes, or delegated to some "Data Reviewer". If a workspace request is to an upstream service outside of the current TRE (e.g. an NHS provider) it would follow the same sort of protocol. That's how it happens today.

I am concerned with the assumption of data moving/being copied through workspaces. We need to be careful about that. The assumption that data is copied through workspace processes may not be feasible for larger datasets. However, with a supply chain model it will be possible to deploy workspaces next to the data without moving the data to the workspace. There's also the whole world of data virtualisation options in relation to TREs and workspaces. What we should be doing is managing the flow of rights to access data, independent of the data tech if possible.

> How does the approver then get it to the Researcher's outside location - say a client machine? Is this a special case? Do they need to be able to download their approved exports?

I'm not sure what you mean. Some workspaces may, under certain conditions, allow for artefacts to be exported from the TRE. We should also remember that the processing of data is undertaken in the context of a risk management process. Take a look at the diagram below that outlines the trust, dependency and rights relationships between processes. In the example, we have
What's important is the explicit injection of risk management into processes to allow for points of governance and stewardship, along with an overarching model of risk. This is one flow, and there are primitives (resource sharing, resource encapsulation, etc.) that we can consider and use to build trusted data supply chains.
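As a rough illustration of the pull model described above (field names and statuses are assumptions, not an agreed design), a downstream workspace's request for an upstream artefact would only proceed if the downstream workspace is on the upstream owner's whitelist, and would then wait for approval:

```python
from dataclasses import dataclass


@dataclass
class ArtefactRequest:
    artefact: str
    upstream: str        # workspace that holds/produced the artefact
    downstream: str      # workspace (or external process) requesting it
    status: str = "pending"


def submit_request(req: ArtefactRequest, whitelists: dict[str, set[str]]) -> ArtefactRequest:
    """Reject immediately if the downstream workspace is not on the upstream
    owner's whitelist; otherwise leave the request pending for the upstream
    Workspace Owner (or their delegated Data Reviewer) to approve."""
    if req.downstream not in whitelists.get(req.upstream, set()):
        req.status = "rejected"
    return req


# Example: the ML workspace asks the ingress staging workspace for a linked dataset.
whitelists = {"ingress-staging": {"ml-workspace"}}
request = submit_request(
    ArtefactRequest(artefact="linked-cohort-v1", upstream="ingress-staging", downstream="ml-workspace"),
    whitelists,
)
print(request.status)  # "pending" - awaiting upstream approval
```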
@joalmeid @daltskin we need to decide if we are treating "airlock import/export" as a single feature or two for the initial iteration. I'm happy either way; I had them as one, but then split them as we were going to focus on export first, and it was a requirement for some workspaces to have one but not the other. There is a lot of common ground, however it might be worth considering requirements first.

I think @mjbonifa's diagram above is useful. As a start I'm thinking each workspace has an inbox and outbox, only accessible by the Data Steward/Airlock Manager for that workspace?

@mjbonifa I understand in some scenarios data will not move, and this is about granting access, but I feel the first iteration, based on previous work and customer requirements, should focus on the movement of artefacts. I also think we should focus on external locations as source and destination, although remembering that another workspace's inbox as an external location should be considered down the line.

The next stage is to define how data is transferred from:
Previously we have done this with file shares and Storage Explorer - that might be acceptable for staging and download, but we need a more streamlined method with approvals.
@marrobi it makes sense for each workspace to have an inbox and outbox as you describe them; that construct allows us to work towards the above scenarios.

Yes to remembering that another workspace's inbox as an external location should be considered down the line - maybe we should break this out into another issue?

Another point is that the diagram above could be interpreted incorrectly. The workspaces for risk management could be anywhere within a dataflow depending on the scenario. These are not just at the boundary between the TRE and processes external to the TRE, as could be assumed.

Understood that a first version will copy data through the process for now.
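A minimal sketch of the inbox/outbox convention discussed here (container names, role expectations and release targets are all assumptions for illustration, not the implemented layout):

```python
from dataclasses import dataclass


@dataclass
class StagedItem:
    workspace_id: str
    blob_name: str
    direction: str       # "import" (inbox) or "export" (outbox)
    approved: bool = False


def staging_container(item: StagedItem) -> str:
    """Resolve the staging container for an item, e.g. 'ws-001-airlock-inbox'."""
    suffix = "airlock-inbox" if item.direction == "import" else "airlock-outbox"
    return f"{item.workspace_id}-{suffix}"


def release_target(item: StagedItem) -> str:
    """Where an approved item is copied to: workspace shared storage for imports,
    an external/egress location for exports. Only the workspace's Data Steward /
    Airlock Manager is expected to mark an item as approved."""
    if not item.approved:
        raise PermissionError("Item has not been approved by the Airlock Manager")
    return f"{item.workspace_id}-shared" if item.direction == "import" else "external-egress"
```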
Two new features
Re dataflow, I think we need clearer use cases; it feels to me this is more of a data pipeline with defined activities, including approval, between each movement of data/change of data access. The workspace inbox/outbox would be stages in that pipeline.
There have been some previous experiences/attempts at this. Coming from a fresh perspective, and focusing on a first iteration for airlock processes, I believe in a few facts to start with:
I posted this in 1109 but it makes more sense here. I have reviewed the stories for data ingress/egress and moving large data files around, and I'm thinking that there may be a simpler solution to this problem; it does require some manual input, but there are a lot of advantages.

1. Researcher uploads data via a specific workspace in the TRE web portal

This process solves the following:
The "airgap" is created by default and is a function of the security credentials of the PI. Unless the PI actively does something data does not move. |
If in the shared space, then appropriate governance rules will need to be created for the connections between the workspace and shared services. If in the workspace, then the appropriate processes need to be established to get the audit data and any other info out of the workspace. I'd suggest that the audit function is a separate issue and relates to many events related to the workspace. Hopefully there's a place to discuss that elsewhere.
Yes I agree; to bring the requirement for delegation into scope we would need to introduce the staging area back into the process - we can't delegate access to a file etc. if the rest of the TRE doesn't have visibility. The process would be as follows:
The same process can then be reversed to get data out of the TRE, because the permissions are only shared between the PI/Delegate and researcher. The audit log is outside of the workspace but inside the TRE so that access can be gained for audit purposes; it should only contain metadata, so no IG issues. The email would be a shared service, but I'm open to suggestions on how notifications are handled. Also agree that we need to decide what gets audited separately.
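To illustrate the "metadata only" audit log idea (field names are illustrative, not a proposed schema), an audit record might capture who moved what, when, and with whose approval, without holding any of the data itself:

```python
import json
from dataclasses import asdict, dataclass
from datetime import datetime, timezone


@dataclass
class AirlockAuditRecord:
    request_id: str
    workspace_id: str
    direction: str       # "import" or "export"
    file_name: str
    file_sha256: str     # a content hash, never the content itself
    size_bytes: int
    requested_by: str
    approved_by: str
    decision: str        # "approved" or "rejected"
    decided_at: str


record = AirlockAuditRecord(
    request_id="req-0042",
    workspace_id="ws-research-01",
    direction="export",
    file_name="model-results.csv",
    file_sha256="<sha256-of-file>",
    size_bytes=10_485_760,
    requested_by="researcher@example.org",
    approved_by="pi@example.org",
    decision="approved",
    decided_at=datetime.now(timezone.utc).isoformat(),
)

# Metadata-only record, safe to keep outside the workspace but inside the TRE.
print(json.dumps(asdict(record), indent=2))
```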
@CalMac-tns thanks for the input, a good discussion. I am conscious that we need to focus on requirements - from multiple sets of requirements - rather than specific implementation details at this point. There is also a bit of confusion around import vs export - we need to consider both flows. @joalmeid is going to create some user stories from the various sets of requirements, then we will work to create a technical architecture that meets them.
We've gone through a set of existing requirements, guidance from HDRUK, and current inputs on the GH airlock import/export issue. Considerations:
Main Goals:

Envisioned high-level Workflow:

```mermaid
graph TD
    A(fa:fa-box-open Workspace Storage) --> |Export request created|B(Internal Storage)
    B -->C{fa:fa-spinner Approval?}
    C --> |Approved| D(External Storage)
    C -.- |Rejected| E(fa:fa-ban Deleted);
```
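The diagram implies a simple request lifecycle; here is a minimal sketch of that state machine (status names are illustrative, not a final design):

```python
from enum import Enum


class AirlockStatus(str, Enum):
    DRAFT = "draft"            # request created, data staged in internal storage
    IN_REVIEW = "in_review"    # waiting on the Data Steward / Airlock Manager
    APPROVED = "approved"      # data released to external (or workspace) storage
    REJECTED = "rejected"      # staged data is deleted


# Allowed transitions between statuses.
TRANSITIONS = {
    AirlockStatus.DRAFT: {AirlockStatus.IN_REVIEW},
    AirlockStatus.IN_REVIEW: {AirlockStatus.APPROVED, AirlockStatus.REJECTED},
    AirlockStatus.APPROVED: set(),
    AirlockStatus.REJECTED: set(),
}


def advance(current: AirlockStatus, new: AirlockStatus) -> AirlockStatus:
    """Move a request to a new status, refusing illegal transitions."""
    if new not in TRANSITIONS[current]:
        raise ValueError(f"Cannot move a request from {current.value} to {new.value}")
    return new
```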
Draft User stories:

- As a Workspace TRE User/Researcher
- As a Workspace TRE User/Researcher
- As an automated process in TRE
- As a TRE Workspace Data Steward
- As a TRE Workspace Data Steward
- As a TRE User/Researcher
- As a TRE Admin
Thanks for the input @joalmeid. Taking the story a step further, how would PII scanning actually work within the TRE? How would we deal with a situation where a PI (Principal Investigator) delegates authority to someone else? In the story above, a delegate would be similar to a Data Steward but would only get the role once delegated from the PI?
Does the Data Steward assignment need to be considered for the scope of this story?
In the process we are working to, the Data Steward (aka Delegate) may not be known at the time of workspace creation; the PI will delegate the responsibility at a later date, once the data sets have been received.
A user would be able to be assigned the Workspace Data Steward role at any point in time. Does that fit that need?
@marrobi Finally adding to the discussion here... I have started building Bicep templates to implement the flow found in the Architecture Center reference (https://docs.microsoft.com/en-us/azure/architecture/example-scenario/ai/secure-compute-for-research). This reference was written by one of my peers and it's commonly implemented in US EDU. At this time, my templates are missing the Logic App to perform the approval. I am also reconsidering hosting the Azure Data Factory in the hub. It might just make sense to deploy the whole thing as a workspace service, instead of partially as a shared service and partially as a workspace service. See the GitHub repo here: https://github.com/SvenAelterman/AzureTRE-ADF

I've also read something about virus scanning. Azure Storage has that built-in, but last time I checked, there's no event when the scan is finished and it can take hours. So I had previously developed an integration with VirusTotal, which can be found here: https://blog.aelterman.com/2021/02/21/on-demand-malware-scanning-for-azure-storage-blobs-with-virustotal/
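For context on the VirusTotal approach linked above, a minimal sketch of the hash-lookup step (the endpoint and header follow the public VirusTotal v3 API; key handling and the upload path for never-seen files are deliberately omitted):

```python
import hashlib

import requests

VT_API_KEY = "<virustotal-api-key>"  # assumption: would live in Key Vault in practice


def sha256_of(path: str) -> str:
    """Hash the staged file so we can look it up without uploading it."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(1024 * 1024), b""):
            digest.update(chunk)
    return digest.hexdigest()


def virustotal_verdict(path: str) -> str:
    """Return 'unknown', 'clean' or 'flagged' for a staged file."""
    response = requests.get(
        f"https://www.virustotal.com/api/v3/files/{sha256_of(path)}",
        headers={"x-apikey": VT_API_KEY},
        timeout=30,
    )
    if response.status_code == 404:
        return "unknown"  # never seen before; a full scan would require an upload
    response.raise_for_status()
    stats = response.json()["data"]["attributes"]["last_analysis_stats"]
    return "flagged" if stats.get("malicious", 0) > 0 else "clean"
```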
Thanks @SvenAelterman. We're trying to crystallize the requirements and flow that make sense in a TRE. We haven't broken it down into implementation yet, but I'm sure it will help. The airlock processes are definitely bound to a workspace.
@SvenAelterman a thought - what if we had a Data Factory self-hosted integration runtime in the core/hub resource processor subnet? As this has access to all the workspace VNets (because the resource processor has to carry out data plane operations), there would be no need to add managed private endpoints. Also, any outbound traffic can be routed via the Azure Firewall to prevent data exfiltration and for auditing purposes. What do you think vs a managed network?
@marrobi Hadn't thought about that yet. It could be a useful solution. At the same time, it's yet one more VM to manage. I am not sure how most customers would balance that.
@SvenAelterman it could be run on a container instance maybe, as per https://docs.microsoft.com/en-us/azure/data-factory/how-to-run-self-hosted-integration-runtime-in-windows-container, but that doesn't support auto-update. If a VM, it could likely be B-series, with auto-updates of the OS and the integration runtime.
I had worked on a simplified way of achieving ingress/egress as a short-term solution on my fork here. The solution uses two storage accounts that sit within a workspace - one of which is the current workspace SA which currently hosts

On deploying the base workspace, an AD Group is deployed with the intention that workspace PIs would be added to the group to gain the required permissions to carry out ingress/egress actions. Access to the public or "airlock" SA would be via Storage Explorer to upload/retrieve files, with a script leveraging the PI's permissions used to copy files between the two SAs (currently a bash script sitting in the

This solution does not fully achieve what is intended for this feature, although it may provide a starting point for what is to be produced.
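As a sketch of the copy step between the two SAs (the fork uses a bash script; this Python equivalent uses hypothetical account and container names, and assumes the PI's account keys are available to the script):

```python
from datetime import datetime, timedelta

from azure.storage.blob import (
    BlobSasPermissions,
    BlobServiceClient,
    generate_blob_sas,
)

AIRLOCK_ACCOUNT = "stairlockws001"      # hypothetical public/"airlock" SA
WORKSPACE_ACCOUNT = "stworkspacews001"  # hypothetical workspace SA
AIRLOCK_KEY = "<airlock-account-key>"
WORKSPACE_KEY = "<workspace-account-key>"


def copy_approved_blob(container: str, blob_name: str) -> None:
    """Copy a single approved blob from the airlock SA into the workspace SA."""
    # Cross-account copies need a readable source URL, so attach a short-lived
    # read-only SAS to the source blob.
    sas = generate_blob_sas(
        account_name=AIRLOCK_ACCOUNT,
        container_name=container,
        blob_name=blob_name,
        account_key=AIRLOCK_KEY,
        permission=BlobSasPermissions(read=True),
        expiry=datetime.utcnow() + timedelta(hours=1),
    )
    source_url = (
        f"https://{AIRLOCK_ACCOUNT}.blob.core.windows.net/{container}/{blob_name}?{sas}"
    )

    dest_service = BlobServiceClient(
        account_url=f"https://{WORKSPACE_ACCOUNT}.blob.core.windows.net",
        credential=WORKSPACE_KEY,
    )
    dest_blob = dest_service.get_blob_client(container=container, blob=blob_name)
    # Server-side asynchronous copy; poll get_blob_properties().copy.status if needed.
    dest_blob.start_copy_from_url(source_url)


if __name__ == "__main__":
    copy_approved_blob("ingress", "dataset.csv")
```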
Design is done.
Preventing data exfiltration is of absolute importance, but there is a need to be able to export certain products of the work that has been done within the workspace, such as ML models, new data sets to be pushed back to the data platform, reports, and similar artifacts.
A high-level egress workflow would look like: