
Re-Evaluate Anonymisation and Security Measure names for Correctness #15

Closed
coolharsh55 opened this issue Jun 24, 2021 · 16 comments

@coolharsh55
Collaborator

Migrated ISSUE-33: The categorisation of Pseudoanonymisation and Encryption is not (semantically) correct

State: RAISED
Raised by: Harshvardhan J. Pandit
Opened on: 2019-11-26
Description: (from presentation to Kantara CISWG) Anonymisation is a subclass of Pseudoanonymisation, which conflicts semantically: it specifies that anonymisation is a type of pseudoanonymisation, which may not be intended. Also, Pseudoanonymisation and Encryption should not be grouped together (as a concept).
Reporter: Harsh
Notes: suggested to start a discussion on this issue.

@mayaborges

I agree that Anonymisation should not be a subclass of Pseudoanonymisation, given that data cannot be both anonymised and pseudoanonymised. It could be argued that Anonymisation could be either Full (or True) Anonymisation or Pseudoanonymisation, in which case Pseudoanonymisation would be a subclass of Anonymisation, but that may introduce confusion between Anonymisation and Full Anonymisation and therefore be undesirable. So having Anonymisation and Pseudoanonymisation as parallels may be the best solution.

A possible name for a superclass for both types of anonymisation as well as encryption might be Data Obfuscation.

@coolharsh55
Collaborator Author

Hi Maya, thanks for the input; I agree with your arguments. I tried looking up the EDPB and ISO definitions for these terms and how they are used, and they are similar to what you propose. But other usages (e.g. industry, technical) consider 'anonymisation' a broad range of techniques which also includes pseudo-anonymisation.

Then there is further confusion as to what data is produced as an outcome of these processes. An anonymisation process may still produce personal data (non-anonymous) if it is associated with an identifier. For example, consider the case where an identifier is associated with an exact location, and the anonymisation technique replaces this with the country. The data has now gone through an anonymisation process but is still personal data. So there is a distinction between anonymisation as a technical term and anonymisation as applied under the GDPR.

To support your proposal, maybe we can have Anonymisation as the general class of anonymisation-related techniques, and specifically PseudoAnonymisation and CompleteAnonymisation as subclasses. Data Obfuscation involves other techniques in addition to anonymisation, so it can be the parent class of Anonymisation once those other concepts have been identified.

@coolharsh55
Collaborator Author

Recording conversation at PEPR'22 about Anonymisation, where Damien pointed out this problem. The potential operation is changing "Anonymisation" to "AnonymisationMeasure" and "CompleteAnonymisation" to "Anonymisation" so as to bring these concepts in line with what is defined legally and in standards (e.g. ISO 29100) while keeping the 'taxonomy' of anonymisation approaches in tech/org measures.

@TedTed

TedTed commented Jun 28, 2022

Thanks Harshvardhan! To add a bit more explanation to this, I see a fairly serious risk in calling "Anonymization" the concept that corresponds to "the class of measures/processes that are used in order to make data less identifiable": we end up in a situation where people might use "Anonymization" on their data, and end up with data that is not "anonymized" according to ISO standards & EU regulation. This confusion happens frequently in the media, due to the use of the word "anonymization" to mean "de-identification" in the US. I've seen this create problems in my previous role at a big tech company, which is partly why we decided to only call something "anonymization" if it reached the high bar of making it impossible to re-identify people.

I strongly support changing "CompleteAnonymization" to simply "Anonymization", so that something is called "Anonymization" if and only if it leads to anonymized data, and the confusion disappears. Changing "Anonymization" to "AnonymizationMeasure" helps people understand that this might not be enough, so this definitely seems much better to me. It might not be enough, though. An alternative would be to call this "DeidentificationMeasure", and rename the process of removing identifiers to something like "IdentifierRedaction" to avoid confusion. Yet another alternative, clearer but verbose, would be something like "ReidentificationRiskMitigation", to better capture the idea of a "measure towards making it harder to identify people".

@coolharsh55 coolharsh55 changed the title Categorisation of Pseudo-Anonymisation and Encryption is not correct Re-Evaluate Anonymisation and Security Measure names for Correctness Jun 28, 2022
@coolharsh55
Collaborator Author

Thanks @TedTed ; I have updated the title on this issue to (re-)evaluate all names in tech/org measures with this perspective, and make changes where necessary.

@coolharsh55 coolharsh55 added this to the DPV v1 milestone Jun 30, 2022
@coolharsh55
Collaborator Author

Hi All, thanks for the feedback. The structure is now as follows:

  • DataAnonymisationTechnique
    • Anonymisation
    • Pseudonymisation
    • Deidentification
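As a minimal illustrative sketch (plain Python rather than DPV's actual RDF serialisation; the hierarchy-walking helper below is hypothetical, not part of DPV), the grouping above amounts to three parallel subclasses under one umbrella term:

```python
# Hypothetical sketch of the proposed grouping as subclass relations.
# The dict maps each concept to its parent, mirroring the list above.
PARENT = {
    "Anonymisation": "DataAnonymisationTechnique",
    "Pseudonymisation": "DataAnonymisationTechnique",
    "Deidentification": "DataAnonymisationTechnique",
}

def is_subclass_of(concept: str, ancestor: str) -> bool:
    """Walk the parent chain to test transitive subclass membership."""
    while concept in PARENT:
        concept = PARENT[concept]
        if concept == ancestor:
            return True
    return False

print(is_subclass_of("Pseudonymisation", "DataAnonymisationTechnique"))  # True
print(is_subclass_of("Pseudonymisation", "Anonymisation"))               # False
```

Note that in this arrangement Pseudonymisation and Anonymisation are siblings, so neither is implied to be a kind of the other.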

@derhagen

I fail to see the added value of introducing Deidentification over DataAnonymisationTechnique, which are defined as

DataAnonymisationTechnique: Use of anonymisation techniques that reduce the identifiability in data
Deidentification: Removal of identity or information to reduce identifiability

By definition, any measure that reduces the identifiability of data needs to "remove information" in some sense. Therefore, Deidentification does not narrow down the space of techniques, and should either be further specified or omitted. Was Deidentification included with a reference to HIPAA? Even in that case, we should consider replacing Deidentification with the "Expert Determination" and "Safe Harbor" methods as mentioned here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

Apart from that, even though you renamed Anonymization to DataAnonymizationTechnique, I discovered this issue because I thought "Wait, Pseudonymization is not an Anonymization technique!". What about something along the lines of DataObfuscationTechnique?

@derhagen

This discussion should probably be held in parallel with NonPersonalData and its subclasses, where some tidying up might be necessary. The Note of AnonymisedData refers to AnonymisedDataWithinScope, which does not seem to exist yet (ContextuallyAnonymisedData is a proposed term), and according to the ENISA source, SyntheticData "can be personal data, which are manipulated in a way to limit the potentials for individuals’ re-identification", which is not entirely aligned with DPV's definition.

@derhagen

The GDPR's approach to anonymity (Recital 26) is a rather risk-based "reasonable likelihood" test, based on

  • the costs of and
  • the amount of time required for identification, taking into consideration
  • the available technology
  • at the time of the processing and
  • [future] technological developments

Hence, these factors should be represented more precisely in the respective Class descriptions. As all of this is an active area of research and (in my opinion) not conclusively addressed by courts, it might make sense to mark these Classes as unstable or proposed, if that is possible?

@coolharsh55
Collaborator Author

Hi.

I fail to see the added value of introducing Deidentification over DataAnonymisationTechnique, which are defined as...

Deidentification is a specific category of anonymisation techniques that focuses on reducing identifiability. Anonymisation is broader than identifier removal because it also relates to potential re-combinations with other datasets that create identifiability.

Was Deidentification included with a reference to HIPAA? Even in that case, we should consider to replace Deidentification with the "Expert Determination" and "Safe Harbor" methods as mentioned here: https://www.hhs.gov/hipaa/for-professionals/privacy/special-topics/de-identification/index.html

Deidentification is a common term in this domain; e.g. there is even an ISO standard about it (20889:2018, https://www.iso.org/standard/69373.html). For HIPAA, the title explicitly states de-identification, which is a strong argument for representing that concept. Further types of de-identification processes should be modelled as subclasses/sub-types of Deidentification, and not replace it. I would prefer the ISO terminology over HIPAA in this case as it is broader in scope and represents greater technical consensus, with HIPAA concepts added later within the resulting hierarchy (if needed). Pseudonymisation is declared as a DataAnonymisationTechnique (and not as a type of Anonymisation) for the sake of grouping anonymisation-related concepts together under an umbrella term.
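As a hypothetical sketch of the point above (plain Python rather than DPV's RDF; "ExpertDetermination" and "SafeHarbor" are illustrative HIPAA-derived names, not actual DPV terms), the HIPAA methods would sit as subclasses under Deidentification rather than replacing it:

```python
# Hypothetical sketch: HIPAA's de-identification methods modelled as
# subclasses of Deidentification, which itself sits under the umbrella term.
# None of these names are confirmed DPV IRIs; they illustrate the hierarchy only.
PARENT = {
    "Deidentification": "DataAnonymisationTechnique",
    "ExpertDetermination": "Deidentification",  # hypothetical HIPAA-derived term
    "SafeHarbor": "Deidentification",           # hypothetical HIPAA-derived term
}

def is_a(concept: str, ancestor: str) -> bool:
    """Transitively check whether `concept` falls under `ancestor`."""
    while concept in PARENT:
        concept = PARENT[concept]
        if concept == ancestor:
            return True
    return False

print(is_a("SafeHarbor", "DataAnonymisationTechnique"))  # True
```

This keeps Deidentification intact as the ISO-aligned concept while still allowing regulation-specific methods to be attached beneath it later.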

The Note of AnonymisedData refers to AnonymisedDataWithinScope, which does not seem to exist yet (ContextuallyAnonymisedData is a proposed term), and according to the ENISA source, SyntheticData "can be personal data, which are manipulated in a way to limit the potentials for individuals’ re-identification", which is not entirely aligned with DPV's definition.

AnonymisedDataWithinScope has been changed to ContextuallyAnonymisedData, and the note has been updated. Where SyntheticData is also personal data, the data should also be declared as a subclass/type of PersonalData. The note states it can be personal or non-personal. The description is taken from the ENISA guide on Data Protection Engineering: https://www.enisa.europa.eu/publications/data-protection-engineering

The GDPR (Recital 26) approach to anonymity is based on a rather risk-based "reasonable likeliness", based on
* the costs of and
* the amount of time required for identification, taking into consideration
* the available technology
* at the time of the processing and
* [future] technological developments

Hence, these factors should be represented more precisely in the respective Class descriptions. As all of this is an active area of research and (in my opinion) not conclusively addressed by courts, it might make sense to mark these Classes as unstable or proposed, if that is possible?

I see the value in representing this as a concept, but am unsure as to how it should be associated with processing information. My guess is to provide it as an organisational measure, similar to policies and assessments: an IdentifiabilityAssessment as an OrganisationalMeasure with the stated Recital 26 concepts as descriptions. I do not think we should represent each of those factors individually as concepts and properties only for the scope of identifiability. Costs, time for technical processes, technology availability (e.g. TRL in SotA), and future predictions are far too broad and relevant to a lot of other concepts, so they should be modelled with a greater scope (and careful consideration). I can add these as proposed concepts if you or someone else is willing to take on the task of investigating them.

@coolharsh55
Collaborator Author

We discussed this in today's meeting and are okay with the current list. We're keeping this open in case there are further discussions. Otherwise we will close this in the coming weeks as completed.

@TedTed

TedTed commented Nov 23, 2022

For context, does the "current list" refer to this comment or to the state of the world prior to this issue?

@coolharsh55
Collaborator Author

Current list as in the concepts that are in DPV as of now, after the comments.

@derhagen

Sorry for the late response, but I continue to raise the argument that Pseudonymization is not an anonymisation technique.

Thank you for your clarification of Deidentification; I think the fact that it refers to a term from an ISO standard should be mentioned in the Class description. Strictly following the Class descriptions as they are right now, Deidentification and DataAnonymisationTechnique describe equivalent things, without the additional knowledge of the mentioned ISO standard.

With respect to the Recital 26 criteria for anonymised data, I didn't propose to add these as organizational measures - even though that's a good idea - but simply to add a reference to Recital 26 and the mentioned criteria to the Class description or note, as they define what anonymised data is in the first place.

@coolharsh55
Collaborator Author

coolharsh55 commented Nov 24, 2022

Hi. Thanks for your comment, I understand your point, and the need to change this.

I continue to raise the argument that Pseudonymization is not an anonymisation technique.

Yes, strictly speaking this is correct, though the concept DataAnonymisationTechnique was intended to group related concepts together, as noted by the Irish Data Protection Commission in their Guidance on Anonymisation and Pseudonymisation (pg. 12). Still, as you state, it would be better to avoid this confusion. So, based on the rationale laid out in NIST's NISTIR 8053 De-Identification of Personal Information, these concepts are organised as follows:

  • DeIdentification as the top concept
    • Anonymisation
      • CompleteAnonymisation (edited to remove concept)
    • Pseudonymisation
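A minimal sketch of the revised arrangement (again plain Python rather than DPV's RDF; the `ancestors` helper is illustrative, not part of DPV), with DeIdentification as the grouping parent per NISTIR 8053:

```python
# Hypothetical sketch of the revised hierarchy: DeIdentification is the top
# concept, and Pseudonymisation is no longer placed under Anonymisation.
PARENT = {
    "Anonymisation": "DeIdentification",
    "Pseudonymisation": "DeIdentification",
}

def ancestors(concept: str) -> list:
    """Collect all transitive parents of a concept, nearest first."""
    result = []
    while concept in PARENT:
        concept = PARENT[concept]
        result.append(concept)
    return result

# Pseudonymisation is grouped with Anonymisation under DeIdentification,
# without implying it is itself a kind of Anonymisation.
print(ancestors("Pseudonymisation"))  # ['DeIdentification']
```

This resolves the semantic objection raised earlier in the thread: grouping is achieved via the neutral parent rather than by nesting one technique under the other.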

to add a reference to Recital 26 and the mentioned criteria to the Class description or note, as they define what anonymised data is in the first place

Instead of GDPR's recitals, the techniques have been linked to ISO 29100:2011 Security Techniques -- Privacy Framework definitions which are more broadly used.

coolharsh55 added a commit that referenced this issue Nov 24, 2022
The following typos in IRIs were fixed using the new SHACL shapes from
previous commit:
- dpv:expiry relation instead of dpv:hasExpiry relation in consent
- dpv:hasConsequenceOn was used as a parent even though it was proposed.
  The term has been promoted to accepted status
- Typos in Technical measures where Crypto- was mistyped as Cryto-

Errors in labels:
- MaintainCreditCheckingDatabase
- MaintainCreditRatingDatabase

The following terms were updated:
- GDPR's legal bases where text has been added from Art.6 and the parent
  terms have been aligned with main spec's legal bases (including
  creation of new terms to match granularity)
- Anonymisation and Pseudonymisation have been changed to be types of
  Deidentification techniques (as the grouping parent concept) to
  distinguish them following discussions in #15
- DPV-LEGAL has laws and DPAs for USA from contributions by @JonathanBowker
@coolharsh55 coolharsh55 modified the milestones: DPV v1, DPV v1.1 May 10, 2023
@coolharsh55 coolharsh55 modified the milestones: DPV v1.1, dpv v2 Apr 13, 2024
@coolharsh55
Collaborator Author

Reviewed and closed based on implementation in https://w3id.org/dpv#vocab-TOM-technical which contains the described structure.
