Add docs/deciding_what_data_to_stream_with_dfe_analytics.md #178

stevenleggdfe · 2024-12-02T14:20:34Z

No description provided.

asatwal

Just a minor question, but otherwise looks good to me.

asatwal · 2024-12-11T11:19:29Z

README.md

@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
 | `config/analytics_blocklist.yml`      | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task                                                                                              |
 | `config/analytics_custom_events.yml`  | Optional file including list of all custom event names                                                             |

-**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**
+**It is imperative that you perform a full check of the fields that are being sent, exclude those for which you have no legal basis for processing under GDPR in `config/analytics_blocklist.yml`, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained. Further guidance on this is available [here](./docs/deciding_what_data_to_stream_with_dfe_analytics.md).**


If there are PII fields in the database, then the advice here is to add these to config/analytics_hidden_pii.yml. If we are not interested in these fields at all, should we advise adding them to config/analytics_blocklist.yml? I expect we can't just ignore them altogether, as analytics will throw an error?

I've clarified this. I think the bigger distinction users need to draw is between PII and human readable PII; all data they're streaming should be treated as PII.

Ravi-Sachdev-0 · 2024-12-11T13:11:54Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+There are some common myths about what is and is not personal data. These are not true:
+- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
+- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.


Suggest using the definition of pseudonymisation here, i.e. "…the processing of personal data in such a manner that the personal data can no longer be attributed to a specific data subject without the use of additional information"
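To make the definition concrete, here is a minimal Python sketch of pseudonymisation, assuming a keyed hash (HMAC-SHA256) is used as the replacement identifier. The key, the field names and the `pseudonymise` helper are all hypothetical, and this is not a description of how `dfe-analytics` implements hidden fields. It only illustrates that the same person always maps to the same pseudonym, so records stay linkable, and that anyone holding the key (the "additional information") can attribute them again.

```python
import hashlib
import hmac

# Hypothetical secret held separately from the data: the "additional information".
SECRET_KEY = b"example-key-held-by-the-data-controller"


def pseudonymise(identifier: str, key: bytes = SECRET_KEY) -> str:
    """Replace an identifier with a keyed hash (a pseudonym)."""
    return hmac.new(key, identifier.encode("utf-8"), hashlib.sha256).hexdigest()


# Hypothetical records about the same individual in two different tables.
application = {"user_id": pseudonymise("teacher-1234"), "subject": "Physics"}
web_request = {"user_id": pseudonymise("teacher-1234"), "path": "/apply/step-2"}

# The pseudonym is stable, so the two records can still be joined.
assert application["user_id"] == web_request["user_id"]

# Anyone holding the key can test a candidate identity against the pseudonym.
candidates = ["teacher-0001", "teacher-1234"]
matches = [c for c in candidates if pseudonymise(c) == application["user_id"]]
print(matches)
```

Because the data can be attributed to an individual again using the key, it remains personal data even though no readable identifier is stored.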

Ravi-Sachdev-0 · 2024-12-11T13:15:04Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+There are some common myths about what is and is not personal data. These are not true:
+- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
+- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.


The ICO guidance https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4 includes an example. Suggest we provide examples relevant to teacher services for the key parts of the guidance.

I've added examples to the second two myths; the first already had them.

Ravi-Sachdev-0 · 2024-12-11T13:22:43Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+There are some common myths about what is and is not personal data. These are not true:
+- "Only human readable personal data is personal data - email addresses, names etc." - in fact, anything that uniquely identifies an individual is personal data. Database identifiers, IP addresses etc. can all be personal data even if it is not possible for a human to identify who some data belongs to.
+- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.
+- "Removing personal identifiers from data stops it being personal data" - this myth can be true in some cases (for example, when data is 'anonymised' by aggregating together and only total figures are stored and processed); however it is not reliably true for data streamed by ```dfe-analytics```. This is because this streamed data includes every click on the service made by a user on the service, every database change made about that user's data, and in addition provides the capability to join this data together. This is potentially a significant volume of information about the user. Even with the identifiers removed it would still be possible in many cases to work out the user's identity.


This para isn't clear to me.
Suggest we include an example of how, once the personal identifiers are removed, the clicks a user makes and the changes made to that user's data could be used to work out their identity.

Good idea, I've added an example
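As an illustration of the point in this thread (a hypothetical sketch, not the example added to the guide), the Python below shows how records with every direct identifier removed can still single out individuals. The field names and values are invented; the assumption is only that streamed data carries several ordinary attributes per user that can be joined together.

```python
from collections import Counter

# Hypothetical streamed records with names, emails and user IDs removed.
records = [
    {"subject": "Physics", "postcode_district": "M1", "start_year": 2024},
    {"subject": "Physics", "postcode_district": "M1", "start_year": 2024},
    {"subject": "Latin", "postcode_district": "TR19", "start_year": 2024},
    {"subject": "Maths", "postcode_district": "M1", "start_year": 2023},
]

# Attributes that are not identifiers on their own but may be in combination.
QUASI_IDENTIFIERS = ("subject", "postcode_district", "start_year")


def unique_combinations(rows, keys=QUASI_IDENTIFIERS):
    """Return the attribute combinations that occur exactly once."""
    counts = Counter(tuple(row[k] for k in keys) for row in rows)
    return [combo for combo, n in counts.items() if n == 1]


singled_out = unique_combinations(records)
print(singled_out)
```

In this invented data the Latin and Maths records are each the only one with their combination of attributes, so anyone who already knows those facts about a person could pick out that person's record despite the missing identifiers.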

Ravi-Sachdev-0 · 2024-12-11T13:25:21Z

README.md

@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
 | `config/analytics_blocklist.yml`      | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task                                                                                              |
 | `config/analytics_custom_events.yml`  | Optional file including list of all custom event names                                                             |

-**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**


Do they need to exclude PII, or can they hide it instead? If they can hide it, can we signpost them to further guidance on how to do this?

@Ravi-Sachdev-0 I think you're commenting on the previous version of this file before the changes I'm proposing in this PR. The old version is in red on the left, the new one in green on the right. This further guidance is linked to in the new documentation from the new version of this paragraph.

Ravi-Sachdev-0 · 2024-12-11T13:26:46Z

README.md

@@ -196,7 +196,7 @@ The `dfe:analytics:install` generator will also initialize some empty config fil
 | `config/analytics_blocklist.yml`      | Autogenerated file to list all fields we will NOT send to BigQuery, to support the `analytics:check` task                                                                                              |
 | `config/analytics_custom_events.yml`  | Optional file including list of all custom event names                                                             |

-**It is imperative that you perform a full check of the fields that are being sent, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained.**
+**It is imperative that you perform a full check of the fields that are being sent, exclude those for which you have no legal basis for processing under GDPR in `config/analytics_blocklist.yml`, and exclude those containing personally-identifiable information (PII) in `config/analytics_hidden_pii.yml`, in order to comply with the requirements of the [Data Protection Act 2018](https://www.gov.uk/data-protection), unless an exemption has been obtained. Further guidance on this is available [here](./docs/deciding_what_data_to_stream_with_dfe_analytics.md).**


Given different services have different DPIAs, and collect different types of data, shouldn't we be referencing their own DPIAs when referring to their legal basis for collecting data?

Yes, I've added "as set out in your service's Data Protection Impact Assessment (DPIA)"

Ravi-Sachdev-0 · 2024-12-11T13:28:41Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+This guide sets out how teams should decide what data to stream with ```dfe-analytics``` and links to resources to explain how to do this.
+
+```dfe-analytics``` is designed to be used alongside the [```dfe-analytics-dataform``` Dataform package](https://github.com/DFE-Digital/dfe-analytics-dataform/) to transform streamed data. The approach below relies on using both together.


Can we include a diagram and description of what dfe-analytics and the Dataform package are? I expect that non-technical people and/or new teams will need this level of detail.

I've added this

Ravi-Sachdev-0 · 2024-12-11T13:29:54Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+- they understand what information is held by their unit or directorate
+- risks to their information are being addressed
+- information is appropriately protected and marked
+- information is used in compliance with all legal requirements, such as the Data Protection Act 2018, UK GDPR, Freedom of Information/Environmental Information Regulations, the Public Records Act and the Inquiries Act. 


Should we be signposting the DPIA team here, in addition to the IAO?

I've added "This includes ensuring that a Data Protection Impact Assessment has been carried out for their service, supported by the relevant departmental teams." - I didn't want to be specific as the exact team that does this seems to keep changing!

Ravi-Sachdev-0 · 2024-12-11T14:00:14Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+IAOs and their teams should also consider public tasks which may need to be performed by other teams within DfE, both now and in the future, when evaluating how long data should be retained for.
+
+Teams **should** consider [implementing a data retention schedule](https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#data-retention-schedules) in ```dfe-analytics-dataform``` to ensure that data streamed by ```dfe-analytics``` is deleted automatically once it is no longer required.


Should we flag that we've delivered functionality to enable actioning of data retention schedules?

Can we make it clear that teams use this functionality themselves to implement their data retention schedules?
Who do teams need to seek advice and sign-off from for their data retention schedules? Is this their IAO?

Does this guidance (https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#data-retention-schedules) cover all the things a team needs to do to implement their data retention schedule?

Added "It is the IAO's ultimate responsibility to sign off on this."

I think it's already clear that teams use this functionality themselves from the sentence "Teams must consider implementing a data retention schedule in dfe-analytics-dataform..."

The guidance does cover all the things a team needs to do to implement their data retention schedule but could use some attention - I'll attempt this as a separate PR on that repo.

Ravi-Sachdev-0 · 2024-12-11T14:01:36Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+- "Replacing identifiers of individuals with other identifiers stops it being personal data" - this is a process known as pseudonymisation. This process may improve data security in some situations, but does not stop data being personal data. See [ICO guidance](https://ico.org.uk/for-organisations/uk-gdpr-guidance-and-resources/personal-information-what-is-it/what-is-personal-data/what-is-personal-data/#pd4) for more information.
+- "Removing personal identifiers from data stops it being personal data" - this myth can be true in some cases (for example, when data is 'anonymised' by aggregating together and only total figures are stored and processed); however it is not reliably true for data streamed by ```dfe-analytics```. This is because this streamed data includes every click on the service made by a user on the service, every database change made about that user's data, and in addition provides the capability to join this data together. This is potentially a significant volume of information about the user. Even with the identifiers removed it would still be possible in many cases to work out the user's identity.
+
+As a result it is recommend that teams consider **all** data streamed into BigQuery by ```dfe-analytics``` to be personal data unless proven otherwise.


recommend(ed)

Ravi-Sachdev-0 · 2024-12-11T14:06:45Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+## How do we hide more sensitive data when it is streamed?
+Data that is more sensitive than most personal data **should** be marked as such when it is streamed into BigQuery by ```dfe-analytics``` - if it is streamed at all. This includes:
+- Human-readable PII, such as names, email addresses, home addresses and TRNs


This is a really important point.
Can we include a definition of what human-readable PII is, as well as a complete list of these types of data? I'm assuming it's not in the 100s or 1000s of types of PII...

It's not possible to be exhaustive because any ID a service decides to give out to its users (i.e. not just an internal ID) is human readable PII. But I've expanded the list as much as I can!

Ravi-Sachdev-0 · 2024-12-11T14:08:26Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+This data should be both:
+1. hidden when it is streamed using ```dfe-analytics``` [hidden field functionality](https://github.com/DFE-Digital/dfe-analytics/tree/SL-dev?tab=readme-ov-file#5-send-database-events)
+2. hidden when it is transformed using ```dfe-analytics-dataform``` and Dataform as described [here](https://github.com/DFE-Digital/dfe-analytics-dataform/?tab=readme-ov-file#hidden-fields).


Suggest we include the types of data here (pasted below), rather than linking to them. Unless we expect that this data could change and it is best to link to the most up-to-date version?

personal data revealing racial or ethnic origin;
personal data revealing political opinions;
personal data revealing religious or philosophical beliefs;
personal data revealing trade union membership;
genetic data;
biometric data (where used for identification purposes);
data concerning health;
data concerning a person’s sex life; and
data concerning a person’s sexual orientation.

Ravi-Sachdev-0 · 2024-12-11T14:11:12Z

docs/deciding_what_data_to_stream_with_dfe_analytics.md

+
+This will ensure that by default, BigQuery users only have access to a pseudonymised version of this data. Only users who have been explicitly granted permissions to do so will have access to this data in its raw form.
+
+It is a common misunderstanding that this feature is designed to 'hide personal data'. This is incorrect as almost all data in BigQuery is personal data (see above), not just more sensitive data like this. Using this feature to hide this data enhances its security but does not mean that other data in BigQuery can be considered not to be personal data.


Who are the recommended contacts for further info if needed?

ICO.org

DPIA office

IAO

TWD data insights

Are there others teams should consider depending on the types of query?

Added a section

Add docs/deciding_what_data_to_stream_with_dfe_analytics.md

8ab32b4

asatwal approved these changes Dec 11, 2024

Ravi-Sachdev-0 reviewed Dec 11, 2024

Ravi-Sachdev-0 requested changes Dec 11, 2024

stevenleggdfe added 4 commits December 12, 2024 16:30

Address PR comments in README.md

9b8228c

Add platform architecture overview

f4dfc12

Rename image.png to teacher_services_analytics_platform_overview.png

79c3d37

Address PR comments on deciding_what_data_to_stream_with_dfe_analytics.md

6e20c59

stevenleggdfe requested review from Ravi-Sachdev-0 and asatwal December 12, 2024 17:18

Ravi-Sachdev-0 approved these changes Dec 13, 2024

stevenleggdfe merged commit a38572b into main Dec 13, 2024
4 of 5 checks passed

stevenleggdfe deleted the SL-dev branch December 13, 2024 13:35
