Add docs/deciding_what_data_to_stream_with_dfe_analytics.md #178

Merged
merged 5 commits into from
Dec 13, 2024

Conversation

stevenleggdfe
Contributor

No description provided.

Collaborator

@asatwal asatwal left a comment

Just a minor question, but otherwise looks good to me.

README.md Outdated
@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
| `config/analytics_blocklist.yml` | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task |
| `config/analytics_custom_events.yml` | Optional file including list of all custom event names |

-**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**
+**It is imperative that you perform a full check of the fields that are being sent, exclude those for which you have no legal basis for processing under GDPR in `config/analytics_blocklist.yml`, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained. Further guidance on this is available [here](./docs/deciding_what_data_to_stream_with_dfe_analytics.md).**
Collaborator

If there are PII fields in the database, then the advice here is to add these to config/analytics_hidden_pii.yml. If we are not interested in these fields at all, should we advise adding them to config/analytics_blocklist.yml? I expect we can't just ignore them altogether, as analytics will throw an error?

Contributor Author

I've clarified this. I think the bigger distinction users need to draw is between PII and human readable PII; all data they're streaming should be treated as PII.
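
The distinction raised in this thread (block a field entirely, or stream it but mark it as hidden) can be sketched as a small consistency check. This is an illustrative Python sketch only, not the gem's `analytics:check` implementation: the table and field names are hypothetical, the dictionaries merely stand in for the YAML config files, and it assumes that hidden fields must also appear in the list of streamed fields.

```python
# Hypothetical schema and config for one table; all names are illustrative.
DATABASE_FIELDS = {"candidates": {"id", "email_address", "date_of_birth", "created_at"}}
STREAMED = {"candidates": {"id", "email_address", "created_at"}}  # sent to BigQuery
BLOCKED = {"candidates": {"date_of_birth"}}  # never sent
HIDDEN = {"candidates": {"email_address"}}  # sent, but marked as hidden


def check_field_config(database_fields, streamed, blocked, hidden):
    """Return a list of problems; an empty list means every field has a decision."""
    problems = []
    for table, fields in database_fields.items():
        sent = streamed.get(table, set())
        not_sent = blocked.get(table, set())
        marked_hidden = hidden.get(table, set())
        # A field cannot simply be ignored: it must be streamed or blocked.
        for field in sorted(fields - sent - not_sent):
            problems.append(f"{table}.{field}: neither streamed nor blocked")
        for field in sorted(sent & not_sent):
            problems.append(f"{table}.{field}: both streamed and blocked")
        # Assumption: hiding only makes sense for fields that are streamed.
        for field in sorted(marked_hidden - sent):
            problems.append(f"{table}.{field}: hidden but not streamed")
    return problems


problems = check_field_config(DATABASE_FIELDS, STREAMED, BLOCKED, HIDDEN)
```

In this sketch a field with no decision recorded against it is reported as a problem, which reflects the reviewer's expectation that fields cannot be left out of the config altogether.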


There are some common myths about what is and is not personal data. These are not true:
- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.

Suggest using the definition of pseudonymisation here, i.e. "…the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information".

Contributor Author

Added
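
To illustrate why pseudonymised data remains personal data, here is a minimal Python sketch. It is not how `dfe-analytics` pseudonymises anything: the salted SHA-256 scheme, the salt value and the event records are all assumptions made for illustration. The salt plays the role of the "additional information" in the definition quoted above.

```python
import hashlib


def pseudonymise(value: str, salt: str) -> str:
    """Replace an identifier with a stable pseudonym (salted SHA-256 hex digest)."""
    return hashlib.sha256((salt + value).encode("utf-8")).hexdigest()


SALT = "example-salt"  # hypothetical secret: the "additional information"

# Hypothetical events keyed by a human-readable identifier.
events = [
    {"user": "teacher@example.com", "page": "/apply"},
    {"user": "teacher@example.com", "page": "/references"},
    {"user": "other@example.com", "page": "/apply"},
]

pseudonymised = [
    {"user": pseudonymise(e["user"], SALT), "page": e["page"]} for e in events
]

# The same person still maps to the same pseudonym, so their activity stays linkable.
same_person = pseudonymised[0]["user"] == pseudonymised[1]["user"]
different_person = pseudonymised[0]["user"] != pseudonymised[2]["user"]

# Anyone holding the salt and a candidate identifier can re-attribute the records.
target = pseudonymise("teacher@example.com", SALT)
reattributed = [e["page"] for e in pseudonymised if e["user"] == target]
```

The pseudonym hides the email address from a casual reader, but every record about the individual is still tied together and can be attributed back to them by anyone with the additional information.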


There are some common myths about what is and is not personal data. These are not true:
- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.

The ICO guidance https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4 includes an example. Suggest we provide examples relevant to teacher services for the key parts of the guidance.

Contributor Author

I've added examples to the second two myths; the first already had them.

There are some common myths about what is and is not personal data. These are not true:
- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.
- "Removing personal identifiers from data stops it being personal data" - this myth can be true in some cases (for example, when data is 'anonymised' by aggregating together and only total figures are stored and processed); however it is not reliably true for data streamed by ```dfe-analytics```. This is because this streamed data includes every click on the service made by a user on the service, every database change made about that user's data, and in addition provides the capability to join this data together. This is potentially a significant volume of information about the user. Even with the identifiers removed it would still be possible in many cases to work out the user's identity.

This para isn't clear to me.
Suggest we include an example of how, once the personal identifiers are removed, the clicks a user makes and the changes made about that user's data could be used to work out their identity.

Contributor Author

Good idea, I've added an example
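
The example added to the document is not reproduced in this thread. As a rough illustration of the kind of re-identification being discussed, the Python sketch below joins events that carry no direct identifiers and compares the resulting profile with outside knowledge. All records, field names and the staff list are hypothetical.

```python
# Hypothetical streamed events with direct identifiers already removed.
# "session" is only a join key; it says nothing about who the user is.
events = [
    {"session": "s1", "field": "school", "value": "Example Primary School"},
    {"session": "s1", "field": "subject", "value": "Physics"},
    {"session": "s1", "field": "start_year", "value": "2019"},
    {"session": "s2", "field": "school", "value": "Example Primary School"},
    {"session": "s2", "field": "subject", "value": "English"},
]

# Hypothetical outside knowledge, such as a published staff list.
staff_list = [
    {"name": "A. Teacher", "school": "Example Primary School", "subject": "Physics", "start_year": "2019"},
    {"name": "B. Teacher", "school": "Example Primary School", "subject": "English", "start_year": "2021"},
    {"name": "C. Teacher", "school": "Another School", "subject": "Physics", "start_year": "2019"},
]


def build_profile(events, session):
    """Join all events for one session into a single profile."""
    return {e["field"]: e["value"] for e in events if e["session"] == session}


def matching_people(profile, staff_list):
    """Return the names of everyone whose known details match the whole profile."""
    return [
        person["name"]
        for person in staff_list
        if all(person.get(key) == value for key, value in profile.items())
    ]


school_only = matching_people({"school": "Example Primary School"}, staff_list)
joined = matching_people(build_profile(events, "s1"), staff_list)
```

A single attribute such as the school matches more than one person, but the joined profile narrows the match to one individual, even though no name, email address or ID was streamed.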

@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
| `config/analytics_blocklist.yml` | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task |
| `config/analytics_custom_events.yml` | Optional file including list of all custom event names |

**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**

Do they need to exclude PII, or can they hide it instead? If they can hide it, can we signpost them to further guidance on how to do this?

Contributor Author

@Ravi-Sachdev-0 I think you're commenting on the previous version of this file before the changes I'm proposing in this PR. The old version is in red on the left, the new one in green on the right. This further guidance is linked to in the new documentation from the new version of this paragraph.

README.md Outdated
@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
| `config/analytics_blocklist.yml` | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task |
| `config/analytics_custom_events.yml` | Optional file including list of all custom event names |

-**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**
+**It is imperative that you perform a full check of the fields that are being sent, exclude those for which you have no legal basis for processing under GDPR in `config/analytics_blocklist.yml`, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained. Further guidance on this is available [here](./docs/deciding_what_data_to_stream_with_dfe_analytics.md).**

Given different services have different DPIAs, and collect different types of data, shouldn't we be referencing their own DPIAs when referring to their legal basis for collecting data?

Contributor Author

Yes, I've added "as set out in your service's Data Protection Impact Assessment (DPIA)"


This guide sets out how teams should decide what data to stream with ```dfe-analytics``` and links to resources to explain how to do this.

```dfe-analytics``` is designed to be used alongside the [```dfe-analytics-dataform``` Dataform package](https://github.com/DFE-Digital/dfe-analytics-dataform/) to transform streamed data. The approach below relies on using both together.

Can we include a diagram and description of what dfe-analytics and the Dataform package are? I expect that non-technical people and/or new teams will need this level of detail.

Contributor Author

I've added this

- they understand what information is held by their unit or directorate
- risks to their information are being addressed
- information is appropriately protected and marked
- information is used in compliance with all legal requirements, such as the Data Protection Act 2018, UK GDPR, Freedom of Information/Environmental Information Regulations, the Public Records Act and the Inquiries Act.

Should we be signposting the DPIA team here, in addition to the IAO?

Contributor Author

I've added "This includes ensuring that a Data Protection Impact Assessment has been carried out for their service, supported by the relevant departmental teams." - I didn't want to be specific as the exact team that does this seems to keep changing!


IAOs and their teams should also consider public tasks which may need to be performed by other teams within DfE, both now and in the future, when evaluating how long data should be retained for.

Teams **should** consider [implementing a data retention schedule](https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#data-retention-schedules) in ```dfe-analytics-dataform``` to ensure that data streamed by ```dfe-analytics``` is deleted automatically once it is no longer required.

Should we flag that we've delivered functionality to enable actioning of data retention schedules?

Can we make it clear that teams use this functionality themselves to implement their data retention schedules?
Who do teams need to seek advice and sign-off from for their data retention schedules? Is this their IAO?

Does this guidance (https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#data-retention-schedules) cover all the things a team needs to do to implement their data retention schedule?

Contributor Author

Added "It is the IAO's ultimate responsibility to sign off on this."

I think it's already clear that teams use this functionality themselves from the sentence "Teams must consider implementing a data retention schedule in dfe-analytics-dataform..."

The guidance does cover all the things a team needs to do to implement their data retention schedule but could use some attention - I'll attempt this as a separate PR on that repo.
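
As a rough illustration of what a data retention schedule does, the Python sketch below selects events that have outlived a retention period. It is not the `dfe-analytics-dataform` configuration or API: the 365-day period, the event records and the function name are hypothetical, and a real schedule would be agreed with the IAO.

```python
from datetime import datetime, timedelta, timezone

RETENTION_DAYS = 365  # hypothetical retention period


def is_past_retention(occurred_at, now, retention_days=RETENTION_DAYS):
    """Return True when an event is older than the retention period."""
    return now - occurred_at > timedelta(days=retention_days)


# A fixed "now" keeps the example reproducible.
now = datetime(2024, 12, 13, tzinfo=timezone.utc)

# Hypothetical streamed events with the time each one occurred.
events = [
    {"id": 1, "occurred_at": datetime(2023, 1, 10, tzinfo=timezone.utc)},
    {"id": 2, "occurred_at": datetime(2024, 6, 1, tzinfo=timezone.utc)},
]

# Events selected here are the ones an automated schedule would delete.
to_delete = [e["id"] for e in events if is_past_retention(e["occurred_at"], now)]
```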

- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.
- "Removing personal identifiers from data stops it being personal data" - this myth can be true in some cases (for example, when data is 'anonymised' by aggregating together and only total figures are stored and processed); however it is not reliably true for data streamed by ```dfe-analytics```. This is because this streamed data includes every click on the service made by a user on the service, every database change made about that user's data, and in addition provides the capability to join this data together. This is potentially a significant volume of information about the user. Even with the identifiers removed it would still be possible in many cases to work out the user's identity.

As a result it is recommend that teams consider **all** data streamed into BigQuery by ```dfe-analytics``` to be personal data unless proven otherwise.

recommend(ed)

Contributor Author

Fixed


## How do we hide more sensitive data when it is streamed?
Data that is more sensitive than most personal data **should** be marked as such when it is streamed into BigQuery by ```dfe-analytics``` - if it is streamed at all. This includes:
- Human-readable PII, such as names, email addresses, home addresses and TRNs

This is a really important point.
Can we include a definition of what human-readable PII is, as well as a complete list of these types of data? I'm assuming it's not in the 100s or 1000s of types of PII...

Contributor Author

It's not possible to be exhaustive because any ID a service decides to give out to its users (i.e. not just an internal ID) is human readable PII. But I've expanded the list as much as I can!


This data should be both:
1. hidden when it is streamed using ```dfe-analytics``` [hidden field functionality](https://github.com/DFE-Digital/dfe-analytics/tree/SL-dev?tab=readme-ov-file#5-send-database-events)
2. hidden when it is transformed using ```dfe-analytics-dataform``` and Dataform as described [here](https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#hidden-fields).

Suggest we include the types of data here (pasted below), rather than linking it. Unless we expect that this data could change and it is best to link to the most up to date version?

personal data revealing racial or ethnic origin;
personal data revealing political opinions;
personal data revealing religious or philosophical beliefs;
personal data revealing trade union membership;
genetic data;
biometric data (where used for identification purposes);
data concerning health;
data concerning a person’s sex life; and
data concerning a person’s sexual orientation.

Contributor Author

Added


This will ensure that by default, BigQuery users only have access to a pseudonymised version of this data. Only users who have been explicitly granted permissions to do so will have access to this data in its raw form.

It is a common misunderstanding that this feature is designed to 'hide personal data'. This is incorrect as almost all data in BigQuery is personal data (see above), not just more sensitive data like this. Using this feature to hide this data enhances its security but does not mean that other data in BigQuery can be considered not to be personal data.

Who are the recommended contacts for further info if needed?

  • ICO.org
  • DPIA office
  • IAO
  • TWD data insights

Are there other contacts teams should consider, depending on the type of query?

Contributor Author

Added a section

@stevenleggdfe stevenleggdfe merged commit a38572b into main Dec 13, 2024
4 of 5 checks passed
@stevenleggdfe stevenleggdfe deleted the SL-dev branch December 13, 2024 13:35
3 participants