PII data file #828

Open
wants to merge 45 commits into
base: dev
45 commits
c59bc6c
Update README.md
shahrokhDaijavad Nov 15, 2024
9c58af3
Update README.md
shahrokhDaijavad Nov 18, 2024
501570c
Update README.md
shahrokhDaijavad Nov 18, 2024
30c8a19
Update README.md
shahrokhDaijavad Nov 18, 2024
ebfe95e
Update README inweb2parquet
shahrokhDaijavad Nov 18, 2024
eb5d0ad
Update README.md for the web2parquet
shahrokhDaijavad Nov 18, 2024
af8cdd8
Update README-list.md
shahrokhDaijavad Nov 18, 2024
3738e51
Update README-list.md
shahrokhDaijavad Nov 18, 2024
d87c992
Update README.md
Padarn Nov 16, 2024
b0beaf5
Update README.md
shahrokhDaijavad Nov 18, 2024
a00380b
Create test
PoojaHolkar Nov 19, 2024
7be8cb1
PII input file
PoojaHolkar Nov 24, 2024
2aceec1
PII_redactor code example
PoojaHolkar Nov 24, 2024
d16f0f7
invoice data
PoojaHolkar Nov 25, 2024
ee735cd
upload data
PoojaHolkar Nov 25, 2024
d825e8b
upload data
PoojaHolkar Nov 25, 2024
22ec3fd
Delete examples/notebooks/PII/Invoice.pdf
PoojaHolkar Nov 25, 2024
a1965c2
Delete examples/notebooks/PII/invoicedata/test.py
PoojaHolkar Nov 25, 2024
ea9d692
notebook recipe for PII redaction code
PoojaHolkar Nov 25, 2024
f47b45f
update pdf2parquet README
dolfim-ibm Nov 13, 2024
1f1764e
add data_files_to_use
dolfim-ibm Nov 13, 2024
4d2d212
doc_chunk README
dolfim-ibm Nov 13, 2024
d4bb363
text_encoder README
dolfim-ibm Nov 13, 2024
44f9905
Added notebook for pdf2parquet
Nov 20, 2024
b4a23a5
Added doc chunk minimal notebook
touma-I Nov 20, 2024
964f3a1
Update pdf2parquet.ipynb
shahrokhDaijavad Nov 20, 2024
b7d4161
Update pdf2parquet.ipynb
shahrokhDaijavad Nov 20, 2024
b3aff29
minimal sample notebook for how transform can be invoked
touma-I Nov 20, 2024
2763d17
restoring the make venv
shahrokhDaijavad Nov 20, 2024
0204795
unification of notebooks
shahrokhDaijavad Nov 20, 2024
3205ef2
added constraint for pydantic to prevent llama-index-core from picki…
touma-I Nov 22, 2024
ef6eccd
updated README file and added a sample notebook
Nov 19, 2024
818bd07
removed python code in README and minor changes in the notebook
Nov 19, 2024
ed0f084
updated with relative path and added markdown for notebook
Nov 21, 2024
6cb60a9
Update web2parquet.ipynb
shahrokhDaijavad Nov 22, 2024
0abb743
Update Run_your_first_PII_redactor_transform.ipynb
PoojaHolkar Nov 25, 2024
a567fcc
updated code
pholkar1 Nov 27, 2024
9fd1b77
Delete examples/notebooks/PII/test
PoojaHolkar Nov 27, 2024
dbd977c
Added google colab version for users
pholkar1 Dec 4, 2024
5c15dad
Merge branch 'IBM:dev' into dev
PoojaHolkar Dec 5, 2024
10fde20
Delete examples/notebooks/Input-Test-Data/Invoice.pdf
PoojaHolkar Dec 5, 2024
0de92a3
Merge branch 'IBM:dev' into dev
PoojaHolkar Dec 6, 2024
1099e48
colab version notebook
pholkar1 Dec 6, 2024
9e51577
colab version notebook
pholkar1 Dec 8, 2024
032e8f0
colab running version update
pholkar1 Dec 8, 2024
349 changes: 349 additions & 0 deletions examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb
Collaborator

@PoojaHolkar I tested the following on Colab and it seems to work:

!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'

from pii_redactor_transform import PIIRedactorTransform

Collaborator

@PoojaHolkar You should not be installing transforms[all], only transforms[pii_redactor]. Also, you should not be including the transitive dependencies of the transforms. Here is what the pip install block should look like:

%%capture logpip --no-stderr
!pip install data-prep-toolkit==0.2.2
!pip install 'data-prep-toolkit-transforms[pii_redactor]==0.2.2'
!pip install pdfplumber
!pip install spacy
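
Because `%%capture` hides pip's output, it can help to confirm afterwards which versions were actually installed. A minimal sketch using only the standard library; the helper name `installed_version` is made up for illustration, and the distribution names are simply the ones from the install block above:

```python
from importlib import metadata


def installed_version(dist_name):
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


# Distribution names taken from the pip install block in this thread.
for name in ("data-prep-toolkit", "data-prep-toolkit-transforms", "pdfplumber", "spacy"):
    print(f"{name}: {installed_version(name) or 'not installed'}")
```

Running this in a cell right after the install cell gives a quick check that the `==0.2.2` pins took effect.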

Collaborator

Please consider using the pdf2parquet transform (instead of pdfplumber) to ingest the PDF document. It might be a bit more cumbersome to use in its current release, but we are making improvements to it.

Collaborator

You should not be using the private method _redact_pii(). Instead, use the pdf2parquet transform to create a Parquet file and then use the transform() method to redact the content.
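
For readers of this thread, the effect of the notebook's `"operator": "replace"` and `"score_threshold": 0.6` settings can be illustrated without the toolkit. The sketch below is a toy stand-in, not the `PIIRedactorTransform` implementation: the `Detection` tuple, the `redact` helper, and the hand-written spans and scores are all hypothetical. It drops detections below the threshold and substitutes an `<ENTITY_TYPE>` placeholder for each remaining span, working right to left so earlier offsets stay valid.

```python
from typing import NamedTuple


class Detection(NamedTuple):
    """A hypothetical detector result: character span, entity type, confidence."""
    start: int
    end: int
    entity_type: str
    score: float


def redact(text, detections, score_threshold=0.6):
    """Replace each sufficiently confident span with an <ENTITY_TYPE> placeholder."""
    kept = [d for d in detections if d.score >= score_threshold]
    # Replace from the end of the string so earlier offsets remain valid.
    for d in sorted(kept, key=lambda d: d.start, reverse=True):
        text = text[:d.start] + f"<{d.entity_type}>" + text[d.end:]
    # Report the entity types in reading order.
    return text, [d.entity_type for d in sorted(kept, key=lambda d: d.start)]


# Made-up sample text and detections, for illustration only.
sample = "Customer Name: Jane Roe, Email: jane@example.com"
name_start = sample.index("Jane Roe")
email_start = sample.index("jane@example.com")
detections = [
    Detection(name_start, name_start + len("Jane Roe"), "PERSON", 0.85),
    Detection(email_start, email_start + len("jane@example.com"), "EMAIL_ADDRESS", 1.0),
    Detection(0, 8, "ORGANIZATION", 0.3),  # below the threshold, so ignored
]
redacted, entities = redact(sample, detections)
print(redacted)
print(entities)
```

The real transform works on a table column rather than a bare string, which is why going through `transform()` on the Parquet output is the supported path.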

@@ -0,0 +1,349 @@
{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"## Extracting Text from PDF and Configuring PII Redactor"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"**Author**: Pooja Holkar,\n",
"**Email**: poholkar@in.ibm.com\n",
"\n",
"Click the badge to open this notebook in Google Colab: [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)\n",
"\n",
"\n",
"### What is a PII Redactor?\n",
"\n",
"A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n",
"\n",
"- Names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Addresses\n",
"- Financial details (e.g., credit card numbers)\n",
"\n",
"### Overview of the use case\n",
"In this use case, the PII Redactor is applied to text extracted from invoices to ensure sensitive customer information is not exposed during processing, sharing, or storage.\n",
"\n",
" **Workflow Overview**\n",
"\n",
"The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n",
"\n",
" **Redactor Configuration**\n",
"\n",
"The system is configured to recognize specific PII entities relevant to invoices, such as:\n",
"\n",
"- Customer names\n",
"- Email addresses\n",
"- Phone numbers\n",
"- Shipping addresses\n",
"\n",
" **PII Detection and Redaction**\n",
"\n",
"The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n",
"\n",
" **Output**\n",
"\n",
"The redacted text is displayed alongside a list of the detected PII entity types for auditing purposes.\n",
"\n",
"### Why is PII Redaction Important?\n",
"\n",
" **Data Privacy Compliance**: Supports compliance with regulations such as GDPR, HIPAA, or CCPA that mandate safeguarding personal information.\n",
"\n",
" **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n",
"\n",
" **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Pre-req: Install data-prep-kit dependencies"
]
},
{
"cell_type": "code",
"execution_count": 1,
"metadata": {},
"outputs": [],
"source": [
"%%capture logpip --no-stderr\n",
"!pip install data-prep-toolkit==0.2.2\n",
"!pip install 'data-prep-toolkit-transforms[all]==0.2.2'\n",
"!pip install pdfplumber \n",
"!pip install flair \n",
"!pip install spacy \n",
"!pip install presidio_analyzer \n",
"!pip install presidio_anonymizer==2.2.355"
]
},
{
"cell_type": "code",
"execution_count": 2,
"metadata": {},
"outputs": [],
"source": [
"import pdfplumber\n",
"from pii_redactor_transform import PIIRedactorTransform\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 1: Inspect the Data\n",
"\n",
"We will use a simple invoice PDF:\n",
"\n",
"[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)"
]
},
{
"cell_type": "code",
"execution_count": 3,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"--2024-12-08 17:51:23-- https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf\n",
"Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n",
"Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
"HTTP request sent, awaiting response... 200 OK\n",
"Length: 33150 (32K) [application/octet-stream]\n",
"Saving to: ‘Invoice.pdf.1’\n",
"\n",
"Invoice.pdf.1 100%[===================>] 32.37K --.-KB/s in 0.04s \n",
"\n",
"2024-12-08 17:51:23 (841 KB/s) - ‘Invoice.pdf.1’ saved [33150/33150]\n",
"\n"
]
}
],
"source": [
"!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'"
]
},
{
"cell_type": "code",
"execution_count": 4,
"metadata": {},
"outputs": [],
"source": [
"pdf_path = \"Invoice.pdf\""
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 2: Extract Text from PDF\n",
"\n",
"This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string."
]
},
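{
"cell_type": "code",
"execution_count": 5,
"metadata": {},
"outputs": [],
"source": [
"with pdfplumber.open(pdf_path) as pdf:\n",
"    # Guard against pages without extractable text, for which extract_text() may return None\n",
"    text = \"\\n\".join((page.extract_text() or \"\") for page in pdf.pages)\n",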
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 3: Configure the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"\n",
"This configuration defines the parameters for identifying and redacting Personally Identifiable Information (PII) in the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 6,
"metadata": {},
"outputs": [],
"source": [
"\n",
"config = {\n",
" \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n",
" \"operator\": \"replace\",\n",
" \"transformed_contents\": \"redacted_contents\",\n",
" \"score_threshold\": 0.6\n",
"}"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 4: Initialize and Run the PII Redactor\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step initializes the PII Redactor using the previously defined configuration and prepares it for processing the extracted text."
]
},
{
"cell_type": "code",
"execution_count": 7,
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"17:51:24 INFO - Loading model from flair/ner-english-large\n"
]
},
{
"name": "stdout",
"output_type": "stream",
"text": [
"2024-12-08 17:51:39,469 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
]
}
],
"source": [
"\n",
"redactor = PIIRedactorTransform(config)\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 5: Apply the Redactor to Text Data\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities."
]
},
{
"cell_type": "code",
"execution_count": 8,
"metadata": {},
"outputs": [],
"source": [
"\n",
"redacted_text, detected_entities = redactor._redact_pii(text)\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"### Step 6: Display the Redaction Results\n"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n"
]
},
{
"cell_type": "code",
"execution_count": 9,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Redacted Text:\n",
" INVOICE\n",
"Apple Inc.\n",
"Invoice Details:\n",
"Invoice Number: INV-2024-001\n",
"Invoice Date: November 15, 2024\n",
"Due Date: November 30, 2024\n",
"Billing Information:\n",
"Customer Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Email: <EMAIL_ADDRESS>\n",
"Phone: <PHONE_NUMBER>\n",
"Shipping Information:\n",
"Recipient Name: <PERSON>\n",
"Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
"Item Details:\n",
"Description Quantity Unit Price Total\n",
"MacBook Air (13-inch, M2) 1 $999.00 $999.00\n",
"AppleCare+ for MacBook Air 1 $199.00 $199.00\n",
"Subtotal: $1,198.00\n",
"Tax (8%): $95.84\n",
"Total Amount Due: $1,293.84\n",
"Payment Method: Credit Card (Visa)\n",
"Transaction ID: 9876543210ABCDE\n",
"Notes:\n",
"Thank you for your purchase!\n",
"For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n",
"Detected Entities:\n",
" ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n"
]
}
],
"source": [
"# Step 6: Print the results\n",
"print(\"Redacted Text:\\n\", redacted_text)\n",
"print(\"Detected Entities:\\n\", detected_entities)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<br>\n",
"<br>\n",
"\n",
"### Summary: this notebook demonstrates how to redact PII entities from text extracted from a PDF invoice"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
Binary file added examples/notebooks/PII/invoicedata/Invoice.pdf
Binary file not shown.