PII data file #828
base: dev
Please consider using the pdf2Parquet transform (instead of pdfplumber) to ingest the PDF document. It might be a bit more cumbersome to use in its current release, but we are actually making improvements to this.

You should not be using the private method _redact_pii(). Instead, you should use the pdf2Parquet transform to create a parquet file and then use the transform() method to redact the content.

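A side note on the pdfplumber extraction step that these comments discuss: the notebook joins the result of `page.extract_text()` for every page directly. In some pdfplumber versions a page without a text layer can yield `None` instead of an empty string, which would make `"\n".join(...)` raise a `TypeError`. The sketch below shows one defensive way to join per-page text. The helper name `join_page_texts` and the sample values are hypothetical and only stand in for real `extract_text()` results.

```python
from typing import Iterable, Optional


def join_page_texts(page_texts: Iterable[Optional[str]], separator: str = "\n") -> str:
    """Join per-page text, treating pages that yielded no text as empty strings."""
    return separator.join(text or "" for text in page_texts)


# Stand-in values for what page.extract_text() might return for three pages
# (assumption: the middle page has no extractable text).
sample_pages = ["INVOICE", None, "Total Amount Due: $1,293.84"]
print(join_page_texts(sample_pages))
```

In the notebook this would amount to passing `(page.extract_text() for page in pdf.pages)` to the helper. Switching to the pdf2Parquet transform, as suggested above, would sidestep the issue entirely.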
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,349 @@ | ||
{ | ||
"cells": [ | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"## Extracting Text from PDF and Configuring PII Redactor" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"\n", | |
"**Author**: Pooja Holkar\n", | |
"\n", | |
"**Email**: poholkar@in.ibm.com\n", | |
"\n", | |
"Click the badge to open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)\n", | |
"\n", | |
"\n", | |
"### What is a PII Redactor?\n", | |
"\n", | |
"A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n", | |
"\n", | |
"- Names\n", | |
"- Email addresses\n", | |
"- Phone numbers\n", | |
"- Addresses\n", | |
"- Financial details (e.g., credit card numbers)\n", | |
"\n", | |
"### Overview of the use case\n", | |
"In this use case, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n", | |
"\n", | |
" **Workflow Overview**\n", | |
"\n", | |
"The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n", | |
"\n", | |
" **Redactor Configuration**\n", | |
"\n", | |
"The system is configured to recognize specific PII entities relevant to invoices, such as:\n", | |
"\n", | |
"- Customer names\n", | |
"- Email addresses\n", | |
"- Phone numbers\n", | |
"- Shipping addresses\n", | |
"\n", | |
" **PII Detection and Redaction**\n", | |
"\n", | |
"The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n", | |
"\n", | |
" **Output**\n", | |
"\n", | |
"The redacted text is displayed alongside a summary of all identified PII entities for auditing purposes.\n", | |
"\n", | |
"### Why is PII Redaction Important?\n", | |
"\n", | |
" **Data Privacy Compliance**: Helps meet regulations such as GDPR, HIPAA, and CCPA, which mandate safeguarding personal information.\n", | |
"\n", | |
" **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n", | |
"\n", | |
" **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n" | |
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Pre-req: Install data-prep-kit dependencies" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 1, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"%%capture logpip --no-stderr\n", | ||
"!pip install data-prep-toolkit==0.2.2\n", | ||
"!pip install 'data-prep-toolkit-transforms[all]==0.2.2'\n", | ||
"!pip install pdfplumber \n", | ||
"!pip install flair \n", | ||
"!pip install spacy \n", | ||
"!pip install presidio_analyzer \n", | ||
"!pip install presidio_anonymizer==2.2.355" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 2, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"import pdfplumber\n", | ||
"from pii_redactor_transform import PIIRedactorTransform\n" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"### Step 1: Inspect the Data\n", | |
"\n", | |
"We will use a simple invoice PDF:\n", | |
"\n", | ||
"[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 3, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"--2024-12-08 17:51:23-- https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf\n", | ||
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n", | ||
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n", | ||
"HTTP request sent, awaiting response... 200 OK\n", | ||
"Length: 33150 (32K) [application/octet-stream]\n", | ||
"Saving to: ‘Invoice.pdf.1’\n", | ||
"\n", | ||
"Invoice.pdf.1 100%[===================>] 32.37K --.-KB/s in 0.04s \n", | ||
"\n", | ||
"2024-12-08 17:51:23 (841 KB/s) - ‘Invoice.pdf.1’ saved [33150/33150]\n", | ||
"\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 4, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"pdf_path=\"Invoice.pdf\"" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 2: Extract Text from PDF\n", | ||
"\n", | ||
"This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 5, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"with pdfplumber.open(pdf_path) as pdf:\n", | ||
" text = \"\\n\".join(page.extract_text() for page in pdf.pages)\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 3: Configure the PII Redactor\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"\n", | ||
"This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 6, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"\n", | ||
"config = {\n", | ||
" \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n", | ||
" \"operator\": \"replace\",\n", | ||
" \"transformed_contents\": \"redacted_contents\",\n", | ||
" \"score_threshold\": 0.6\n", | ||
"}" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 4: Initialize and Run the PII Redactor\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 7, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stderr", | ||
"output_type": "stream", | ||
"text": [ | ||
"17:51:24 INFO - Loading model from flair/ner-english-large\n" | ||
] | ||
}, | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"2024-12-08 17:51:39,469 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n" | ||
] | ||
} | ||
], | ||
"source": [ | ||
"\n", | ||
"redactor = PIIRedactorTransform(config)\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 5: Apply the Redactor to Text Data\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities." | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 8, | ||
"metadata": {}, | ||
"outputs": [], | ||
"source": [ | ||
"\n", | ||
"redacted_text, detected_entities = redactor._redact_pii(text)\n", | ||
"\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"### Step 6: Display the Redaction Results\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "markdown", | ||
"metadata": {}, | ||
"source": [ | ||
"This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n" | ||
] | ||
}, | ||
{ | ||
"cell_type": "code", | ||
"execution_count": 9, | ||
"metadata": {}, | ||
"outputs": [ | ||
{ | ||
"name": "stdout", | ||
"output_type": "stream", | ||
"text": [ | ||
"Redacted Text:\n", | ||
" INVOICE\n", | ||
"Apple Inc.\n", | ||
"Invoice Details:\n", | ||
"Invoice Number: INV-2024-001\n", | ||
"Invoice Date: November 15, 2024\n", | ||
"Due Date: November 30, 2024\n", | ||
"Billing Information:\n", | ||
"Customer Name: <PERSON>\n", | ||
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n", | ||
"Email: <EMAIL_ADDRESS>\n", | ||
"Phone: <PHONE_NUMBER>\n", | ||
"Shipping Information:\n", | ||
"Recipient Name: <PERSON>\n", | ||
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n", | ||
"Item Details:\n", | ||
"Description Quantity Unit Price Total\n", | ||
"MacBook Air (13-inch, M2) 1 $999.00 $999.00\n", | ||
"AppleCare+ for MacBook Air 1 $199.00 $199.00\n", | ||
"Subtotal: $1,198.00\n", | ||
"Tax (8%): $95.84\n", | ||
"Total Amount Due: $1,293.84\n", | ||
"Payment Method: Credit Card (Visa)\n", | ||
"Transaction ID: 9876543210ABCDE\n", | ||
"Notes:\n", | ||
"Thank you for your purchase!\n", | ||
"For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n", | ||
"Detected Entities:\n", | ||
" ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n" | ||
] | ||
} | ||
], | |
"source": [ | |
"# Step 6: Print the results\n", | |
"print(\"Redacted Text:\\n\", redacted_text)\n", | ||
"print(\"Detected Entities:\\n\", detected_entities)" | ||
] | ||
}, | ||
{ | |
"cell_type": "markdown", | |
"metadata": {}, | |
"source": [ | |
"<br>\n", | |
"<br>\n", | |
"\n", | |
"### This notebook demonstrates how to redact PII entities from text extracted from a PDF invoice" | |
] | ||
} | ||
], | ||
"metadata": { | ||
"kernelspec": { | ||
"display_name": "Python 3", | ||
"language": "python", | ||
"name": "python3" | ||
}, | ||
"language_info": { | ||
"codemirror_mode": { | ||
"name": "ipython", | ||
"version": 3 | ||
}, | ||
"file_extension": ".py", | ||
"mimetype": "text/x-python", | ||
"name": "python", | ||
"nbconvert_exporter": "python", | ||
"pygments_lexer": "ipython3", | ||
"version": "3.11.10" | ||
} | ||
}, | ||
"nbformat": 4, | ||
"nbformat_minor": 4 | ||
} |
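To make the configuration in the notebook above more concrete, here is a minimal, library-free sketch of what the `entities`, `operator: "replace"`, and `score_threshold` settings are meant to achieve. It is not the Presidio or data-prep-kit implementation. The `Detection` class and the `apply_replace_operator` function are hypothetical names, the detections are hand-written rather than produced by a model, and the choice to keep scores greater than or equal to the threshold is an assumption.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Detection:
    """A hypothetical detected span: character offsets, entity type, confidence."""
    start: int
    end: int
    entity_type: str
    score: float


def apply_replace_operator(text, detections, entities, score_threshold):
    """Replace each kept detection with a <ENTITY_TYPE> placeholder.

    Assumes detections do not overlap. A detection is kept when its entity
    type is configured and its score is at or above the threshold.
    """
    kept = sorted(
        (d for d in detections
         if d.entity_type in entities and d.score >= score_threshold),
        key=lambda d: d.start,
    )
    # Replace from right to left so earlier offsets remain valid.
    for d in reversed(kept):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    return text, [d.entity_type for d in kept]


text = "Email: jane@example.com Phone: 555-0100"
email, phone = "jane@example.com", "555-0100"
detections = [
    Detection(text.index(email), text.index(email) + len(email), "EMAIL_ADDRESS", 0.95),
    # Deliberately low score: this one falls below the 0.6 threshold.
    Detection(text.index(phone), text.index(phone) + len(phone), "PHONE_NUMBER", 0.40),
]

redacted, kept_types = apply_replace_operator(
    text, detections, {"EMAIL_ADDRESS", "PHONE_NUMBER"}, score_threshold=0.6
)
print(redacted)    # Email: <EMAIL_ADDRESS> Phone: 555-0100
print(kept_types)  # ['EMAIL_ADDRESS']
```

The low-scoring phone detection is left untouched, which mirrors how a threshold can leave some PII unredacted. That is worth keeping in mind when choosing a value such as 0.6.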
@PoojaHolkar I tested the following on Colab and they seem to work:
@PoojaHolkar You should not install transforms[all], only transforms[pii_redactor]. Also, you should not include the transitive dependencies of the transforms. Here is what the pip install block should look like:
%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install pdfplumber
!pip install spacy
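
After running an install block like the one above, a quick check of what actually landed in the runtime can help confirm that nothing is missing. The sketch below uses only the standard library. The helper name `installed_versions` is hypothetical, and the package list simply mirrors the distribution names from the pip commands above.

```python
from importlib import metadata


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Distribution names taken from the pip install block above.
expected = [
    "data-prep-toolkit",
    "data-prep-toolkit-transforms",
    "pdfplumber",
    "spacy",
]

for name, version in installed_versions(expected).items():
    print(f"{name}: {version or 'not installed'}")
```

Because the install cell captures pip output with `%%capture`, a check like this is one way to surface a failed install without re-reading the captured log.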