pii-scrubber is an extensible Go library for identifying and masking PII in text and objects.

aavaz-ai/pii-scrubber

Concepts

Entity

An Entity represents an identifiable piece of text that we are interested in, e.g. a date, an email address, or a credit-card number.

EntityScrubber

An EntityScrubber is responsible for detecting and masking an entity in the provided input. The library provides pre-built scrubbers for the following entities:

Date          Entity = "DATE"
Time          Entity = "TIME"
CreditCard    Entity = "CREDIT_CARD"
Phone         Entity = "PHONE"
Link          Entity = "LINK"
Email         Entity = "EMAIL"
IP            Entity = "IP"
NotKnownPort  Entity = "UNKNOWN_PORT"
BtcAddress    Entity = "BTC_ADDRESS"
StreetAddress Entity = "STREET_ADDRESS"
ZipCode       Entity = "ZIP_CODE"
PoBox         Entity = "PO_BOX"
SSN           Entity = "SSN"
MD5Hex        Entity = "MD5_HEX"
SHA1Hex       Entity = "SHA1_HEX"
SHA256Hex     Entity = "SHA_256_HEX"
GUID          Entity = "GUID"
ISBN          Entity = "ISBN"
MACAddress    Entity = "MAC_ADDRESS"
IBAN          Entity = "IBAN"
GitRepo       Entity = "GIT_REPO"
StrictLink    Entity = "STRICT_LINK"

Users can override the implementation of any existing scrubber, or add their own custom entities and corresponding scrubbers. Every entity scrubber implements the following interface:

type EntityScrubber interface {
	Match(text string) [][]int
	Mask(detectedEntity []byte, config *EntityConfig) []byte
}

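The Match function takes the text as input and returns the locations of every occurrence of the Entity in that text. Each location is an index pair, in the same form that regexp.FindAllStringIndex returns in the examples below.

The Mask function receives one detected instance of the Entity, together with the EntityConfig for that entity, and returns the masked bytes that replace it.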
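
To make the contract concrete, here is a minimal sketch of how the spans returned by Match and the bytes returned by Mask could be combined into scrubbed text. It is an assumption about the flow, not the library's actual implementation: the EntityConfig stand-in, digitRunScrubber, and applyScrubber are hypothetical names defined locally so that the example runs with only the standard library.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// EntityConfig is a local stand-in for the library's config type (assumption).
type EntityConfig struct{}

// EntityScrubber mirrors the interface shown above.
type EntityScrubber interface {
	Match(text string) [][]int
	Mask(detectedEntity []byte, config *EntityConfig) []byte
}

// digitRun matches runs of five or more digits; it is illustrative only.
var digitRun = regexp.MustCompile(`\d{5,}`)

// digitRunScrubber is a hypothetical scrubber used only for this sketch.
type digitRunScrubber struct{}

// Match returns the [start, end) byte offsets of every digit run.
func (digitRunScrubber) Match(text string) [][]int {
	return digitRun.FindAllStringIndex(text, -1)
}

// Mask ignores the detected bytes and returns a fixed placeholder.
func (digitRunScrubber) Mask(detectedEntity []byte, config *EntityConfig) []byte {
	return []byte("<NUMBER>")
}

// applyScrubber rebuilds text, replacing every span reported by Match with the
// output of Mask. It assumes the spans are sorted and non-overlapping, which
// regexp.FindAllStringIndex guarantees.
func applyScrubber(text string, s EntityScrubber, cfg *EntityConfig) string {
	var out strings.Builder
	prev := 0
	for _, span := range s.Match(text) {
		start, end := span[0], span[1]
		out.WriteString(text[prev:start])
		out.Write(s.Mask([]byte(text[start:end]), cfg))
		prev = end
	}
	out.WriteString(text[prev:])
	return out.String()
}

func main() {
	// Prints: call <NUMBER> or <NUMBER> now
	fmt.Println(applyScrubber("call 9140520809 or 12345 now", digitRunScrubber{}, nil))
}
```

Keeping detection (Match) separate from replacement (Mask) means the same matcher can be reused with different masking behaviour.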

Installation

To install the library, run the following command in your Go project:

go get github.com/aavaz-ai/pii-scrubber

Usage

The Scrubber interface exposes two high-level functions:

  • ScrubTexts: scrubs PII out of string data
  • ScrubStruct: an abstraction built on top of ScrubTexts that makes it easier to scrub PII from specified fields of an object

Scrub PII from String

Example:

	texts := []string{
		"Hi my phone number is +919140520809",
	}

	scrubber, err := piiscrubber.NewDefaultScrubber()
	if err != nil {
		panic(err)
	}

	response, err := scrubber.ScrubTexts(texts)
	if err != nil {
		panic(err)
	}

	fmt.Println(response)

Output:

["Hi my phone number is <PHONE_NUMBER>"]

Scrub PII from Objects

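Example: fields tagged `pii:"true"` are scrubbed, including the values of tagged maps and the fields of tagged nested structs; untagged fields such as Age and Position are left unchanged.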

	type Address struct {
		Location string
		ZipCode  string
	}

	type User struct {
		Name             string            `pii:"true"`
		CustomAttributes map[string]string `pii:"true"`
		Age              int
		Position         string
		Address          *Address `pii:"true"`
		Email            string   `pii:"true"`
	}

	v := User{
		Name: "Anshal +9140528009",
		CustomAttributes: map[string]string{
			"PIIKey": "Hello here is my credit card 6011553157232994",
		},
		Age:      10,
		Position: "Software Engineer",
		Address: &Address{
			Location: "My 488-23-3729",
			ZipCode:  "22132",
		},
		Email: "abc@gmail.com",
	}

	scrubber, err := piiscrubber.NewDefaultScrubber()
	if err != nil {
		panic(err)
	}

Output (the scrubbed struct after passing v to ScrubStruct, rendered as JSON):

{
  "Name": "Anshal <PHONE_NUMBER>",
  "CustomAttributes": {
    "PIIKey": "Hello here is my credit card <CREDIT_CARD>"
  },
  "Age": 10,
  "Position": "Software Engineer",
  "Address": {
    "Location": "My <US_SSN>",
    "ZipCode": "<ZIP_CODE>"
  },
  "Email": "<EMAIL_ADDRESS>"
}
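
For readers curious how a `pii:"true"` tag can drive field selection, below is a minimal sketch using the standard reflect package. It is not the library's implementation: scrubTagged and the profile type are hypothetical, and the sketch handles only top-level string fields, whereas the example above also covers maps and nested structs.

```go
package main

import (
	"fmt"
	"reflect"
)

// profile is a hypothetical type used only for this sketch.
type profile struct {
	Name     string `pii:"true"`
	Position string
	Age      int
}

// scrubTagged applies scrub to every exported string field tagged pii:"true"
// on the struct that ptr points to. ptr must be a non-nil pointer to a struct;
// nested structs, maps, and non-string fields are skipped in this sketch.
func scrubTagged(ptr interface{}, scrub func(string) string) {
	v := reflect.ValueOf(ptr).Elem()
	t := v.Type()
	for i := 0; i < t.NumField(); i++ {
		field := t.Field(i)
		if field.Tag.Get("pii") != "true" || field.Type.Kind() != reflect.String {
			continue
		}
		v.Field(i).SetString(scrub(v.Field(i).String()))
	}
}

func main() {
	p := profile{Name: "Anshal", Position: "Software Engineer", Age: 10}
	scrubTagged(&p, func(string) string { return "<REDACTED>" })
	// Prints: {Name:<REDACTED> Position:Software Engineer Age:10}
	fmt.Printf("%+v\n", p)
}
```

Passing a pointer matters here: reflection can only set fields on an addressable value.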

Advanced Usage

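In the following example, we implement an orgNameEntityScrubber that matches a certain organisation's name and masks it with a placeholder value. The custom entity is registered in CustomEntityScrubbers and listed in BlacklistedEntities alongside the built-in entities to scrub: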

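// orgNameRegex is compiled once at package level instead of on every Match call.
var orgNameRegex = regexp.MustCompile("Enterpret")

type orgNameEntityScrubber struct{}

// Match returns the index pairs of every occurrence of the organisation's name.
func (s *orgNameEntityScrubber) Match(text string) [][]int {
	return orgNameRegex.FindAllStringIndex(text, -1)
}

// Mask replaces each detected occurrence with a fixed placeholder.
func (s *orgNameEntityScrubber) Mask(detectedEntity []byte, config *piiscrubber.EntityConfig) []byte {
	return []byte("<ORG_PLACEHOLDER>")
}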

func main() {

	texts := []string{
		"Hi this is Anshal, my contact is +919140520809, I am currently working at Enterpret",
	}

	orgNameEntity := piiscrubber.Entity("ORG_NAME")

	scrubber, err := piiscrubber.NewWithCustomEntityScrubbers(piiscrubber.NewWithCustomEntityScrubbersParams{
		BlacklistedEntities: []piiscrubber.Entity{
			piiscrubber.CreditCard,
			piiscrubber.Phone,
			piiscrubber.Email,
			piiscrubber.SSN,
			orgNameEntity,
		},
		CustomEntityScrubbers: map[piiscrubber.Entity]piiscrubber.EntityScrubber{
			orgNameEntity: &orgNameEntityScrubber{},
		},
	})
	if err != nil {
		panic(err)
	}

	response, err := scrubber.ScrubTexts(texts)
	if err != nil {
		panic(err)
	}

	fmt.Println(response)
}

Output:

["Hi this is Anshal, my contact is <PHONE_NUMBER>, I am currently working at <ORG_PLACEHOLDER>"]

In the following example, we override the built-in CreditCard scrubber with a creditCardOverrideScrubber. Because the override matches only the literal string "4263 9826 4026 9299", the unspaced number in the input is left unmasked:



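// creditCardRegex matches one literal card number; it is compiled once at package level.
// Replace it with your own credit-card detection logic.
var creditCardRegex = regexp.MustCompile("4263 9826 4026 9299")

type creditCardOverrideScrubber struct{}

// Match returns the index pairs of every occurrence of the pattern above.
func (s *creditCardOverrideScrubber) Match(text string) [][]int {
	return creditCardRegex.FindAllStringIndex(text, -1)
}

// Mask replaces each detected card number with a custom placeholder.
func (s *creditCardOverrideScrubber) Mask(detectedEntity []byte, config *piiscrubber.EntityConfig) []byte {
	return []byte("<CUSTOM_CREDIT_CARD>")
}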

func main() {

	texts := []string{
		"Hi this is Anshal, my credit card is 4263982640269299, and 4263 9826 4026 9299, I am currently working at Enterpret",
	}

	scrubber, err := piiscrubber.NewWithCustomEntityScrubbers(piiscrubber.NewWithCustomEntityScrubbersParams{
		BlacklistedEntities: []piiscrubber.Entity{
			piiscrubber.CreditCard,
			piiscrubber.Email,
			piiscrubber.SSN,
		},
		CustomEntityScrubbers: map[piiscrubber.Entity]piiscrubber.EntityScrubber{
			piiscrubber.CreditCard: &creditCardOverrideScrubber{},
		},
	})
	if err != nil {
		panic(err)
	}

	response, err := scrubber.ScrubTexts(texts)
	if err != nil {
		panic(err)
	}

	fmt.Println(response)
}

Output:

["Hi this is Anshal, my credit card is 4263982640269299, and <CUSTOM_CREDIT_CARD>, I am currently working at Enterpret"]
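
A custom Match does not have to be a literal pattern. As a hedged sketch of more realistic detection logic, the code below finds 16-digit candidates (optionally separated by single spaces or hyphens) and keeps only those that pass the Luhn checksum. The names cardCandidate, luhnValid, and matchCards are hypothetical and not part of the library; the body of matchCards is the kind of logic that could go inside a custom scrubber's Match method.

```go
package main

import (
	"fmt"
	"regexp"
)

// cardCandidate matches 16 digits optionally separated by single spaces or hyphens.
var cardCandidate = regexp.MustCompile(`\b(?:\d[ -]?){15}\d\b`)

// luhnValid reports whether the digits in s pass the Luhn checksum.
// Non-digit characters such as spaces and hyphens are ignored.
func luhnValid(s string) bool {
	sum, digits := 0, 0
	double := false
	for i := len(s) - 1; i >= 0; i-- {
		c := s[i]
		if c < '0' || c > '9' {
			continue
		}
		d := int(c - '0')
		if double {
			d *= 2
			if d > 9 {
				d -= 9
			}
		}
		sum += d
		double = !double
		digits++
	}
	return digits > 0 && sum%10 == 0
}

// matchCards returns the [start, end) spans of candidates that pass the Luhn check.
func matchCards(text string) [][]int {
	var spans [][]int
	for _, span := range cardCandidate.FindAllStringIndex(text, -1) {
		if luhnValid(text[span[0]:span[1]]) {
			spans = append(spans, span)
		}
	}
	return spans
}

func main() {
	text := "cards: 4263982640269299 and 4263 9826 4026 9299, not 1234567812345678"
	for _, span := range matchCards(text) {
		fmt.Println(text[span[0]:span[1]])
	}
}
```

With this input the sketch reports both the spaced and the unspaced form of the card number and rejects the third number, whose checksum is invalid.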



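EntityConfig provides a limited set of parameters for customising how a detected entity is masked. More advanced requirements can be addressed by overriding the EntityScrubber itself. In the following example, the CreditCard entity is masked with 'X' while its last four characters are left unmasked: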

	texts := []string{
		"Hi this is Anshal, my contact is +919140520809, and credit card is 4263982640269299, I am currently working at Enterpret",
	}

	scrubber, err := piiscrubber.NewWithCustomEntityScrubbers(piiscrubber.NewWithCustomEntityScrubbersParams{
		BlacklistedEntities: []piiscrubber.Entity{
			piiscrubber.CreditCard,
			piiscrubber.Phone,
			piiscrubber.Email,
			piiscrubber.SSN,
		},
		Config: map[piiscrubber.Entity]*piiscrubber.EntityConfig{
			piiscrubber.CreditCard: {
				UnmaskedSuffixOffset: 4,
				// runePtr is a local helper (not shown) that returns a pointer to the given rune.
				MaskWithChar: runePtr('X'),
			},
		},
	})
	if err != nil {
		panic(err)
	}

	response, err := scrubber.ScrubTexts(texts)
	if err != nil {
		panic(err)
	}

	fmt.Println(response)

Output:

["Hi this is Anshal, my contact is <PHONE_NUMBER>, and credit card is XXXXXXXXXXXX9299, I am currently working at Enterpret"]
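
The masking behaviour seen in this output can be sketched in a few lines. This is an illustration of suffix-preserving masking in general, not the library's code: maskKeepingSuffix is a hypothetical helper, and how the library itself interprets UnmaskedSuffixOffset beyond this example is an assumption.

```go
package main

import (
	"fmt"
	"strings"
)

// maskKeepingSuffix replaces every rune of detected with maskChar except the
// last keep runes, which stay visible. keep is clamped to [0, len(detected)].
func maskKeepingSuffix(detected string, keep int, maskChar rune) string {
	runes := []rune(detected)
	if keep < 0 {
		keep = 0
	}
	if keep > len(runes) {
		keep = len(runes)
	}
	cut := len(runes) - keep
	return strings.Repeat(string(maskChar), cut) + string(runes[cut:])
}

func main() {
	// Prints: XXXXXXXXXXXX9299
	fmt.Println(maskKeepingSuffix("4263982640269299", 4, 'X'))
}
```

Working on runes rather than bytes keeps the masked output the same visible length as the input even when the input contains multi-byte characters.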



Tests

Unit Tests

cd tests/unit-tests
go test ./...

Coverage Tests

cd tests/benchmarks/coverage
go run ./...

Output:

_______STREET_ADDRESS_______
Total: 200
Caught: 162
Coverage: 81%

=====================
_______CREDIT_CARD_______
Total: 26
Caught: 23
Coverage: 88%

=====================
_______EMAIL_______
Total: 13
Caught: 13
Coverage: 100%

=====================

Performance Tests

cd tests/benchmarks/performance
go test -bench=.

Output:

goos: darwin
goarch: amd64
pkg: github.com/aavaz-ai/pii-scrubber/tests/benchmarks/performance
cpu: VirtualApple @ 2.50GHz
Benchmark_1000Sentences-10            87          12112771 ns/op
Benchmark_100Sentences-10            543           2194837 ns/op
