A redaction tool for structured data. Run wernicke with JSON on stdin, get
redacted values out. Preserves structure and (to some extent) semantics. You
might want this because you have test data where the actual values are
sensitive. Because the changes are consistent within the data and the overall
data structure is preserved, there is a better chance your data will stay
suitable for testing, even though it's been scrubbed.
Most people run wernicke from a shell, so you either have
json_producing_thing | wernicke
or
wernicke < some_file.json > redacted.json
EDN is also supported. See wernicke --help for additional information.
IPs, MAC addresses, timestamps, various AWS identifiers, and a few other types of strings are redacted to strings of the same type: IPs to IPs, SGs to SGs, et cetera. If these strings have an alphanumeric id, that id will have the same length.

Example input:
{
  "long_val": "ABBBAAAABBBBAAABBBAABB",
  "ip": "10.0.0.1",
  "mac": "ff:ff:ff:ff:ff:ff",
  "timestamp": "2017-01-01T12:34:56.000Z",
  "ec2": "ip-10-0-0-1.ec2.internal",
  "security_group": "sg-12345",
  "vpc": "vpc-abcdef",
  "aws_access_key": "AKIAXXXXXXXXXXXXXXXX",
  "aws_role_cred": "AROAYYYYYYYYYYYYYYYY"
}

Example output:
{
  "long_val": "teyjdaeqEYGw18fRIt5vLo",
  "ip": "254.65.252.245",
  "mac": "aa:3e:91:ab:3b:3a",
  "timestamp": "2044-19-02T20:32:55.72Z",
  "ec2": "ip-207-255-185-237.ec2.internal",
  "security_group": "sg-887b8",
  "vpc": "vpc-a9d96a",
  "aws_access_key": "AKIAQ5E7IHRMOW7YABLS",
  "aws_role_cred": "AROA6QA7SQTM6YWS4F0H"
}

Redaction happens in arbitrarily nested structures.

Example input:
{
  "a": {
    "b": [
      "c",
      "d",
      {
        "e": "10.0.0.1"
      }
    ]
  }
}

Example output:
{
  "a": {
    "b": [
      "c",
      "d",
      {
        "e": "1.212.241.246"
      }
    ]
  }
}

In addition to values in the tree, keys are also redacted, even nested ones.

Example input:
{
  "vpc-12345": {
    "sg-abcdef": {
      "instance_count": 5
    }
  }
}

Example output:
{
  "vpc-ec60f": {
    "sg-086fd3": {
      "instance_count": 5
    }
  }
}

Redaction also happens in the middle of strings.

Example input:
{
  "x": "i-abc123 is in sg-12345"
}

Example output:
{
  "x": "i-26a1bf is in sg-77aff"
}

The redacted values will change across runs (this is necessary to make redaction irreversible).

Example input:
{
  "ip": "10.0.0.1",
  "mac": "ff:ff:ff:ff:ff:ff"
}

Example output:
{
  "ip": "246.220.253.214",
  "mac": "dc:08:90:75:e3:91"
}

Redacted values _are_ consistent within runs. If the input contains the same value multiple times it will get redacted identically. This allows you to still do correlation in the result.

Example input:
{
  "ip": "10.0.0.1",
  "also_ip": "10.0.0.1"
}

Example output:
{
  "ip": "247.226.167.9",
  "also_ip": "247.226.167.9"
}

(These examples were pretty-printed for viewing comfort, but wernicke does not do that for you. Try jq.)
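The behavior shown in the examples (values differ between runs, repeated values are redacted identically within a run, and redaction reaches nested values, keys and the middle of strings) can be illustrated with a small Python sketch. This is not wernicke's implementation: the Redactor class, its method names and the IPv4-only pattern are hypothetical and only meant to make the idea concrete.

```python
import re
import secrets

# Illustrative only: matches dotted-quad IPv4 addresses, nothing else.
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


class Redactor:
    """Hypothetical per-run redactor: one instance represents one run."""

    def __init__(self):
        # The mapping lives only for this run, so a new run yields new values.
        self._seen = {}

    def _random_ip(self):
        return ".".join(str(secrets.randbelow(256)) for _ in range(4))

    def _replace(self, match):
        original = match.group(0)
        # Reuse the earlier replacement so repeated values stay correlated.
        if original not in self._seen:
            self._seen[original] = self._random_ip()
        return self._seen[original]

    def redact(self, value):
        """Walk nested dicts and lists, redacting keys and string values."""
        if isinstance(value, dict):
            return {self.redact(k): self.redact(v) for k, v in value.items()}
        if isinstance(value, list):
            return [self.redact(v) for v in value]
        if isinstance(value, str):
            # sub() also rewrites matches in the middle of longer strings.
            return IPV4.sub(self._replace, value)
        return value  # numbers, booleans and None pass through unchanged


data = {"ip": "10.0.0.1", "also_ip": "10.0.0.1", "note": ["host 10.0.0.1 is up", 5]}
result = Redactor().redact(data)
```

Here result["ip"] and result["also_ip"] hold the same random address, and the same address appears inside the "note" string. A second Redactor instance would pick a different one, since nothing is kept between runs.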
Download from https://github.com/latacora/wernicke/releases
We try to do something reasonable for most use cases. If you have generally
useful redactions, please consider contributing them. However, sometimes
redaction behavior really does need to be configured. Pass an EDN literal on the
command line like so:
wernicke --config '{:some-rules "detailed below"}'
Right now this requires a pretty extensive understanding of how wernicke works. We want to make this more accessible, though! If there's a specific thing you want to accomplish, feel free to write a ticket.
For example, to redact all numbers, add the following structure to your EDN:
{:extra-rules
 [{:name :numbers
   :type :regex
   :pattern "\\d*"}]}
The extra rules will be compiled before use, so e.g. you do not need to specify the parsed regex structure for this to work.
To disable one of the default rules, add the following structure to your EDN:
{:disabled-rules [:latacora.wernicke.patterns/arn-re]}
This still requires you to know what the rule names are. You can find these in
latacora.wernicke.core/default-config.
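When wernicke is driven from a script instead of an interactive shell, passing the EDN as a single argument avoids shell quoting problems with the backslashes in the pattern. The following is a minimal sketch, assuming a wernicke executable is installed and on PATH. The helper names wernicke_command and run_wernicke are hypothetical; the --config flag, the stdin/stdout behavior and the two EDN literals are the ones described above.

```python
import subprocess

# Raw string keeps the EDN exactly as typed: "\\d*" stays a two-backslash
# EDN string literal, which denotes the regex \d*.
EXTRA_RULES = r'{:extra-rules [{:name :numbers :type :regex :pattern "\\d*"}]}'
DISABLED_RULES = "{:disabled-rules [:latacora.wernicke.patterns/arn-re]}"


def wernicke_command(config=None):
    """Build the argument list; no shell is involved, so no extra quoting."""
    command = ["wernicke"]
    if config is not None:
        command += ["--config", config]
    return command


def run_wernicke(json_text, config=None):
    """Send JSON text to wernicke on stdin and return what it writes to stdout.

    Assumes the wernicke executable is installed and on PATH.
    """
    completed = subprocess.run(
        wernicke_command(config),
        input=json_text,
        capture_output=True,
        text=True,
        check=True,  # raise CalledProcessError on a non-zero exit status
    )
    return completed.stdout
```

A call such as run_wernicke(some_json_text, EXTRA_RULES) or run_wernicke(some_json_text, DISABLED_RULES) then corresponds to the command-line invocations above, with some_json_text standing in for whatever JSON you want redacted.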
To run the project directly from a source checkout:
$ clj -m latacora.wernicke.cli
To run the project's tests:
$ clj -A:test
To build a native image:
$ clj -A:native-image
(This requires GraalVM to be installed with SubstrateVM, and the GRAAL_HOME
environment variable to be set.)
Named after Carl Wernicke, a German physician who did research on the brain. Wernicke's aphasia is a condition where patients demonstrate fluent speech with intact syntax but with nonsense words. This tool is kind of like that: the resulting structure is maintained but all the words are swapped out with (internally consistent) nonsense.
Copyright © Latacora, LLC
This program and the accompanying materials are made available under the terms of the Eclipse Public License 2.0 which is available at http://www.eclipse.org/legal/epl-2.0.