PIMO is a tool for data masking. It can mask data from a JSONline stream and return another JSONline stream thanks to a masking configuration contained in a yaml file.
PIMO requires a yaml configuration file to works. By default, the file is named masking.yml
and is placed in the working directory. The file must respect the following format :
version: "1"
seed: 42
masking:
- selector:
jsonpath: "example.example"
mask:
type: "argument"
# Optional cache (coherence preservation)
cache: "cacheName"
# another mask on a different location
- selector:
jsonpath: "example.example2"
mask:
type: "argument"
preserve: "null"
caches:
cacheName:
# Optional bijective cache (enable re-identification if the cache is dumped on disk)
unique: true
version
is the version of the masking file.
seed
is to give every random mask the same seed, it is optional and if it is not defined, the seed is derived from the current time to increase randomness.
masking
is used to define the pipeline of masks that is going to be applied.
selector
is made of a jsonpath and a mask.
jsonpath
defines the path of the entry that has to be masked in the json file.
mask
defines the mask that will be used for the entry defined by selector
.
cache
is optional, if the current entry is already in the cache as key the associated value is returned without executing the mask. Otherwise the mask is executed and a new entry is added in the cache with the orignal content as key
and the masked result as value
. The cache have to be declared in the caches
section of the YAML file.
preserve
is optional, and is used to keep some values unmasked in the json file. Allowed preserve
options are: "null"
(null values), "empty"
(empty string ""
), and "blank"
(both empty
and null
values). Additionally, preserve
can be used with mask fromCache
to preserve uncached values. (usage: preserve: "notInCache"
)
Multiple masks can be applied on the same jsonpath location, like in this example :
- selector:
jsonpath: "example"
masks:
- add: "hello"
- template: "{{.example}} World!"
- remove: true
Masks can be applied on multiple selectors, like in this example:
- selectors:
- jsonpath: "example"
- jsonpath: "example2"
mask:
add: "hello"
The following types of masks can be used :
- Pure randomization masks
regex
is to mask using a regular expression given in argument.randomInt
is to mask with a random int from a range with arguments min and max.randomDecimal
is to mask with a random decimal from a range with arguments min, max and precision.randDate
is to mask a date with a random date betweendateMin
anddateMax
.randomDuration
is to mask a date by adding or removing a random time betweenMin
andMax
.randomChoice
is to mask with a random value from a list in argument.weightedChoice
is to mask with a random value from a list with probability, both given with the argumentschoice
andweight
.randomChoiceInUri
is to mask with a random value from an external resource.
- K-Anonymization
- Re-identification and coherence preservation
hash
is to mask with a value from a list by matching the original value, allowing to mask a value the same way every time.hashInUri
is to mask with a value from an external resource, by matching the original value, allowing to mask a value the same way every time.fromCache
is a mask to obtain a value from a cache.ff1
mask allows the use of FPE which enable private-key based re-identification.
- Formatting
dateParser
is to change a date format.template
is to mask a data with a template using other values from the jsonline.template-each
is like template but will apply on each value of an array.
- Data structure manipulation
remove
is to mask a field by completely removing it.add
is a mask to add a field to the jsonline.add-transient
same asadd
but the field is not exported in the output jsonline.
- Others
constant
is to mask the value by a constant value given in argument.command
is to mask with the output of a console command given in argument.incremental
is to mask data with incremental value starting fromstart
with a step ofincrement
.fluxUri
is to replace by a sequence of values defined in an external resource.replacement
is to mask a data with another data from the jsonline.pipe
is a mask to handle complex nested array structures, it can read an array as an object stream and process it with a sub-pipeline.luhn
can generate valid numbers using the Luhn algorithm (e.g. french SIRET or SIREN).markov
can generate pseudo text based on a sample text.
A full masking.yml
file example, using every kind of mask, is given with the source code.
In case two types of mask are entered with the same selector, the program can't extract the masking configuration and will return an error. The file wrongMasking.yml
provided with the source illustrate that error.
To use PIMO to mask a data.json
, use in the following way :
./pimo <data.json >maskedData.json
This takes the data.json
file, masks the data contained inside it and put the result in a maskedData.json
file. If data are in a table (for example multiple names), then each field of this table will be masked using the given mask. The following flags can be used:
--repeat=N
This flag will make pimo mask every input N-times (useful for dataset generation).--skip-line-on-error
This flag will totally skip a line if an error occurs masking a field.--skip-field-on-error
This flag will return output without a field if an error occurs masking this field.--empty-input
This flag will give PIMO a{}
input, usable with--repeat
flag.--config=filename.yml
This flag allow to use another file for config than the defaultmasking.yml
.--load-cache cacheName=filename.json
This flag load an initial cache content from a file (json line format{"key":"a", "value":"b"}
).--dump-cache cacheName=filename.json
This flag dump final cache content to a file (json line format{"key":"a", "value":"b"}
).--verbosity <level>
or-v<level>
This flag increase verbosity on the stderr output, possible values: none (0), error (1), warn (2), info (3), debug (4), trace (5).--debug
This flag complete the logs with debug information (source file, line number).--log-json
Set this flag to produce JSON formatted logs (demo9 goes deeper into logging and structured logging)--mask
Declare a simple masking definition in command line (minified YAML format:--mask "value={fluxUri: 'pimo://nameFR'}"
, or--mask "value=[{add: ''},{fluxUri: 'pimo://nameFR'}]"
for multiple masks). For advanced use case (e.g. if caches needed)masking.yml
file definition will be preferred.--repeat-until <condition>
This flag will make PIMO keep masking every input until the condition is met. Condition format is using Template. Last output verifies the condition.--repeat-while <condition>
This flag will make PIMO keep masking every input while the condition is met. Condition format is using Template.
This section will give examples for every types of mask.
Please check the demo folder for more advanced examples.
- selector:
jsonpath: "phone"
mask:
regex: "0[1-7]( ([0-9]){2}){4}"
This example will mask the phone
field of the input jsonlines with a random string respecting the regular expression.
- selector:
jsonpath: "name"
mask:
constant: "Bill"
This example will mask the name
field of the input jsonlines with the value of the constant
field.
- selector:
jsonpath: "name"
mask:
randomChoice:
- "Mickael"
- "Mathieu"
- "Marcelle"
This example will mask the name
field of the input jsonlines with random values from the randomChoice
list.
- selector:
jsonpath: "name"
mask:
randomChoiceInUri: "file://names.txt"
This example will mask the name
field of the input jsonlines with random values from the list contained in the name.txt file. The different URI usable with this selector are : pimo
, file
and http
/https
.
A value can be injected in URI with the template syntax. For example, file://name{{.gender}}.txt
select a line in name_F.txt
if the current jsonline is {gender : "F"}
.
- selector:
jsonpath: "age"
mask:
randomInt:
min: 25
max: 32
This example will mask the age
field of the input jsonlines with a random number between min
and max
included.
- selector:
jsonpath: "score"
mask:
randomDecimal:
min: 0
max: 17.23
precision: 2
This example will mask the score
field of the input jsonlines with a random float between min
and max
, with the number of decimal chosen in the precision
field.
- selector:
jsonpath: "name"
mask:
command: "echo -n Dorothy"
This example will mask the name
field of the input jsonlines with the output of the given command. In this case, Dorothy
.
- selector:
jsonpath: "surname"
mask:
weightedChoice:
- choice: "Dupont"
weight: 9
- choice: "Dupond"
weight: 1
This example will mask the surname
field of the input jsonlines with a random value in the weightedChoice
list with a probability proportional at the weight
field.
- selector:
jsonpath: "town"
mask:
hash:
- "Emerald City"
- "Ruby City"
- "Sapphire City"
This example will mask the town
field of the input jsonlines with a value from the hash
list. The value will be chosen thanks to a hashing of the original value, allowing the output to be always the same in case of identical inputs.
- selector:
jsonpath: "name"
mask:
hashInUri: "pimo://nameFR"
This example will mask the name
field of the input jsonlines with a value from the list nameFR contained in pimo, the same way as for hash
mask. The different URI usable with this selector are : pimo
, file
and http
/https
.
- selector:
jsonpath: "date"
mask:
randDate:
dateMin: "1970-01-01T00:00:00Z"
dateMax: "2020-01-01T00:00:00Z"
This example will mask the date
field of the input jsonlines with a random date between dateMin
and dateMax
. In this case the date will be between the 1st January 1970 and the 1st January 2020.
- selector:
jsonpath: "last_contact"
mask:
duration: "-P2D"
This example will mask the last_contact
field of the input jsonlines by decreasing its value by 2 days. The duration field should match the ISO 8601 standard for durations.
- selector:
jsonpath: "date"
mask:
dateParser:
inputFormat: "2006-01-02"
outputFormat: "01/02/06"
This example will change every date from the date field from the inputFormat
to the outputFormat
. The format should always display the following date : Mon Jan 2 15:04:05 -0700 MST 2006
. Either field is optional and in case a field is not defined, the default format is RFC3339, which is the base format for PIMO, needed for duration
mask and given by randDate
mask.
- selector:
jsonpath: "date"
mask:
randomDuration:
min: "-P2D"
max: "-P27D"
This example will mask the date
field of the input jsonlines by decreasing its value by a random value between 2 and 27 days. The durations should match the ISO 8601 standard.
- selector:
jsonpath: "id"
mask:
incremental:
start: 1
increment: 1
This example will mask the id
field of the input jsonlines with incremental values. The first jsonline's id
will be masked by 1, the second's by 2, etc...
- selector:
jsonpath: "name4"
mask:
replacement: "name"
This example will mask the name4
field of the input jsonlines with the field name
of the jsonline. This selector must be placed after the name
selector to be masked with the new value and it must be placed before the name
selector to be masked by the previous value.
- selector:
jsonpath: "mail"
mask:
template: "{{.surname}}.{{.name}}@gmail.com"
This example will mask the mail
field of the input jsonlines respecting the given template. In the masking.yml
config file, this selector must be placed after the fields contained in the template to mask with the new values and before the other fields to be masked with the old values. In the case of a nested json, the template must respect the following example :
- selector:
jsonpath: "user.mail"
mask:
template: "{{.user.surname}}.{{.user.name}}@gmail.com"
The format for the template should respect the text/template
package : https://golang.org/pkg/text/template/
The template mask can format the fields used. The following example will create a mail address without accent or upper case:
- selector:
jsonpath: "user.mail"
mask:
template: "{{.surname | NoAccent | upper}}.{{.name | NoAccent | lower}}@gmail.com"
Available functions for templates come from http://masterminds.github.io/sprig/.
- selector:
jsonpath: "array"
mask:
template-each:
template: "{{title .value}}"
item: "value"
This will affect every values in the array field. The field must be an array ({"array": ["value1", "value2]
).
The item
property is optional and defines the name of the current item in the templating string (defaults to "it"). There is another optional property index
, if defined then a property with the given name will be available in the templating string (e.g. : index: "idx"
can be used in template with {{.idx}}
).
The format for the template should respect the text/template
package : https://golang.org/pkg/text/template/
See also the Template mask for other options, all functions are applicable on template-each.
- selector:
jsonpath: "useless-field"
mask:
remove: true
This field will mask the useless-field
of the input jsonlines by completely deleting it.
- selector:
jsonpath: "newField"
mask:
add: "newvalue"
This example will create the field newField
containing the value newvalue
. This value can be a string, a number, a boolean...
The field will be created in every input jsonline that doesn't already contains this field.
- selector:
jsonpath: "newField"
mask:
add-transient: "newvalue"
This example will create the field newField
containing the value newvalue
. This value can be a string, a number, a boolean... It can also be a template.
The field will be created in every input jsonline that doesn't already contains this field, and it will be removed from the final JSONLine output.
This mask is used for temporary field that is only available to other fields during the execution.
- selector:
jsonpath: "id"
mask:
fluxURI: "file://id.csv"
This example will create an id
field in every output jsonline. The values will be the ones contained in the id.csv
file in the same order as in the file. If the field already exist on the input jsonline it will be replaced and if every value of the file has already been assigned, the input jsonlines won't be modified.
- selector:
jsonpath: "id"
mask:
fromCache: "fakeId"
caches:
fakeId :
unique: true
This example will replace the content of id
field by the matching content in the cache fakeId
. Cache have to be declared in the caches
section.
Cache content can be loaded from jsonfile with the --load-cache fakeId=fakeId.jsonl
option or by the cache
option on another field.
If no matching is found in the cache, fromCache
block the current line and the next lines are processing until a matching content go into the cache.
- selector:
jsonpath: "siret"
mask:
ff1:
radix: 10
keyFromEnv: "FF1_ENCRYPTION_KEY"
This example will encrypt the siret
column with the private key base64-encoded in the FF1_ENCRYPTION_KEY environment variable. Use the same mask with the option decrypt: true
to re-identify the unmasked value.
Be sure to check the full FPE demo to get more details about this mask.
- selector:
jsonpath: "age"
mask:
range: 5
This mask will replace an integer value {"age": 27}
with a range like this {"age": "[25;29]"}
.
If the data structure contains arrays of object like in the example below, this mask can pipe the objects into a sub pipeline definition.
data.jsonl
{
"organizations": [
{
"domain": "company.com",
"persons": [
{
"name": "leona",
"surname": "miller",
"email": ""
},
{
"name": "joe",
"surname": "davis",
"email": ""
}
]
},
{
"domain": "company.fr",
"persons": [
{
"name": "alain",
"surname": "mercier",
"email": ""
},
{
"name": "florian",
"surname": "legrand",
"email": ""
}
]
}
]
}
masking.yml
version: "1"
seed: 42
masking:
- selector:
# this path points to an array of persons
jsonpath: "organizations.persons"
mask:
# it will be piped to the masking pipeline definition below
pipe:
# the parent object (a domain) will be accessible with the "_" variable name
injectParent: "_"
masking:
- selector:
jsonpath: "name"
mask:
# fields inside the person object can be accessed directly
template: "{{ title .name }}"
- selector:
jsonpath: "surname"
mask:
template: "{{ title .surname }}"
- selector:
jsonpath: "email"
mask:
# the value stored inside the parent object is accessible through "_" thanks to the parent injection
template: "{{ lower .name }}.{{ lower .surname }}@{{ ._.domain }}"
In addition to the injectParent
property, this mask also provide the injectRoot
property to inject the whole structure of data.
It is possible to simplify the masking.yml
file by referencing an external yaml definition :
version: "1"
seed: 42
masking:
- selector:
jsonpath: "organizations.persons"
mask:
pipe:
injectParent: "domain"
file: "./masking-person.yml"
Be sure to check demo to get more details about this mask.
The Luhn algorithm is a simple checksum formula used to validate a variety of identification numbers.
The luhn
mask can calculate the checksum for any value.
- selector:
jsonpath: "siret"
mask:
luhn: {}
In this example, the siret
value will be appended with the correct checksum, to create a valid SIRET number (french business identifier).
The mask can be parametered to use a different universe of valid characters, internally using the Luhn mod N algorithm.
- selector:
jsonpath: "siret"
mask:
luhn:
universe: "abcde"
Markov chains produces pseudo text based on an sample text.
sample.txt
I want a cheese burger
I need a cheese cake
masking.yml
- selector:
jsonpath: "comment"
mask:
markov:
max-size: 20
sample: "file://sample.txt"
separator: " "
This example will mask the surname comment of the input jsonlines with a random value comment generated by the markov mask with an order of 2
. The different possibilities generated from sample.txt will be :
I want a cheese burger
I need a cheese burger
I want a cheese cake
I need a cheese cake
The separator
field defines the way the sample text will be split (""
for splitting into characters, " "
for splitting into words)
PIMO can generate a Mermaid syntax flow chart to visualize the transformation process.
for example the command pimo flow masking.yml > masing.mmd
with that masking.yml file generate following chart :
To integrate with Visual Studio Code (opens new window), download the YAML extension.
Then, edit your Visual Studio Code settings yaml.schemas
to containing the following configuration:
{
"yaml.schemas": {
"https://raw.githubusercontent.com/CGI-FR/PIMO/main/schema/v1/pimo.schema.json": "/**/*masking*.yml"
}
}
Using this configuration, the schema will be applied on every YAML file containing the word `masking`` in their name.
- CGI France ✉Contact support
- Pole Emploi
Copyright (C) 2021 CGI France
PIMO is free software: you can redistribute it and/or modify it under the terms of the GNU General Public License as published by the Free Software Foundation, either version 3 of the License, or (at your option) any later version.
PIMO is distributed in the hope that it will be useful, but WITHOUT ANY WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License for more details.
You should have received a copy of the GNU General Public License along with PIMO. If not, see http://www.gnu.org/licenses/.