IBM · PoojaHolkar · Nov 15, 2024 · Nov 18, 2024 · Nov 18, 2024 · Nov 18, 2024
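The notebook added in this diff prints the detected entities as a flat list in Step 6. The following is a minimal sketch of how that list could be tallied into a per-type count for auditing. It assumes `detected_entities` is a plain list of entity-type strings, as the recorded notebook output suggests; `summarize_entities` is a hypothetical helper name and is not part of data-prep-kit.

```python
from collections import Counter

# Entity types copied from the "Detected Entities" output recorded in the notebook.
detected_entities = [
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PERSON", "LOCATION", "LOCATION", "LOCATION", "EMAIL_ADDRESS",
    "PHONE_NUMBER",
]


def summarize_entities(entities):
    """Hypothetical helper: map each entity type to its number of detections."""
    return dict(Counter(entities))


summary = summarize_entities(detected_entities)

# Print one line per entity type, sorted alphabetically for stable output.
for entity_type, count in sorted(summary.items()):
    print(f"{entity_type}: {count}")
```

A count per entity type is easier to compare across documents than the raw list, and a type with zero detections where some are expected is a quick signal that the `entities` or `score_threshold` settings need review.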
diff --git a/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb b/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb
@@ -0,0 +1,349 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Extracting Text from PDF and Configuring PII Redactor"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
+    "**Author**: Pooja Holkar,\n",
+    "**Email**: poholkar@in.ibm.com\n",
+    "\n",
+    "Click the badge to open this notebook in Google Colab:  [![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/IBM/data-prep-kit/blob/dev/examples/notebooks/PII/Run_your_first_PII_redactor_transform.ipynb)\n",
+    "\n",
+    "\n",
+    "### What is a PII Redactor?\n",
+    "\n",
+    "A PII (Personally Identifiable Information) Redactor is a tool designed to identify and redact sensitive information in text data. PII includes details that can be used to identify an individual, such as:\n",
+    "\n",
+    "- Names\n",
+    "- Email addresses\n",
+    "- Phone numbers\n",
+    "- Addresses\n",
+    "- Financial details (e.g., credit card numbers)\n",
+    "\n",
+    "### Overview of the use case\n",
+    "In this use case, the PII Redactor is applied to text extracted from an invoice so that sensitive customer information is not exposed during processing, sharing, or storage.\n",
+    "\n",
+    "**Workflow Overview**\n",
+    "\n",
+    "The text from the invoice (a PDF document in this case) is extracted using the pdfplumber library.\n",
+    "\n",
+    "**Redactor Configuration**\n",
+    "\n",
+    "The redactor is configured to recognize specific PII entities relevant to invoices, such as:\n",
+    "\n",
+    "- Customer names\n",
+    "- Email addresses\n",
+    "- Phone numbers\n",
+    "- Shipping addresses\n",
+    "\n",
+    "**PII Detection and Redaction**\n",
+    "\n",
+    "The redactor scans the extracted text and applies redaction rules, replacing sensitive details with placeholders.\n",
+    "\n",
+    "**Output**: The redacted text is displayed alongside a list of the PII entity types that were detected, for auditing purposes.\n",
+    "\n",
+    "### Why is PII Redaction Important?\n",
+    "\n",
+    " **Data Privacy Compliance**: Helps meet regulations such as GDPR, HIPAA, and CCPA, which mandate safeguarding personal information.\n",
+    "\n",
+    " **Risk Mitigation**: Prevents unauthorized access to or misuse of sensitive data.\n",
+    "\n",
+    " **Automation Benefits**: Simplifies and accelerates the process of securing information in large-scale document handling.\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Pre-req: Install data-prep-kit dependencies"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "%%capture logpip --no-stderr\n",
+    "!pip install data-prep-toolkit==0.2.2\n",
+    "!pip install 'data-prep-toolkit-transforms[all]==0.2.2'\n",
+    "!pip install pdfplumber \n",
+    "!pip install flair \n",
+    "!pip install spacy \n",
+    "!pip install presidio_analyzer \n",
+    "!pip install presidio_anonymizer==2.2.355"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import pdfplumber\n",
+    "from pii_redactor_transform import PIIRedactorTransform\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 1: Inspect the Data \n",
+    "\n",
+    "We will use a simple invoice PDF:\n",
+    "\n",
+    "[invoicedata](https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "--2024-12-08 17:51:23--  https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf\n",
+      "Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.111.133, 185.199.110.133, ...\n",
+      "Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.\n",
+      "HTTP request sent, awaiting response... 200 OK\n",
+      "Length: 33150 (32K) [application/octet-stream]\n",
+      "Saving to: ‘Invoice.pdf.1’\n",
+      "\n",
+      "Invoice.pdf.1       100%[===================>]  32.37K  --.-KB/s    in 0.04s   \n",
+      "\n",
+      "2024-12-08 17:51:23 (841 KB/s) - ‘Invoice.pdf.1’ saved [33150/33150]\n",
+      "\n"
+     ]
+    }
+   ],
+   "source": [
+    "!wget 'https://raw.githubusercontent.com/PoojaHolkar/data-prep-kit/refs/heads/dev/examples/notebooks/PII/invoicedata/Invoice.pdf'"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "pdf_path = \"Invoice.pdf\""
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 2: Extract Text from PDF\n",
+    "\n",
+    "This step uses the pdfplumber library to open and read a PDF file. The code processes each page of the PDF to extract text and concatenates it into a single string."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "# Extract text page by page; use an empty string for any page that yields no text\n",
+    "with pdfplumber.open(pdf_path) as pdf:\n",
+    "    text = \"\\n\".join(page.extract_text() or \"\" for page in pdf.pages)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 3: Configure the PII Redactor\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "\n",
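+    "This configuration defines how Personally Identifiable Information (PII) is identified and redacted in the extracted text: `entities` lists the entity types to detect, `operator` sets how detected spans are handled (here, `replace` substitutes a placeholder such as `<PERSON>`), `transformed_contents` names the output field for the redacted text, and `score_threshold` is the minimum detection score required for redaction."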
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 6,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "\n",
+    "config = {\n",
+    "    \"entities\": [\"PERSON\", \"EMAIL_ADDRESS\", \"PHONE_NUMBER\", \"LOCATION\"],\n",
+    "    \"operator\": \"replace\",\n",
+    "    \"transformed_contents\": \"redacted_contents\",\n",
+    "    \"score_threshold\": 0.6\n",
+    "}"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 4: Initialize and Run the PII Redactor\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This step initializes the PII Redactor with the configuration defined above. Initialization loads the `flair/ner-english-large` NER model, so the first run can take some time."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 7,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stderr",
+     "output_type": "stream",
+     "text": [
+      "17:51:24 INFO - Loading model from flair/ner-english-large\n"
+     ]
+    },
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "2024-12-08 17:51:39,469 SequenceTagger predicts: Dictionary with 20 tags: <unk>, O, S-ORG, S-MISC, B-PER, E-PER, S-LOC, B-ORG, E-ORG, I-PER, S-PER, B-MISC, I-MISC, E-MISC, I-ORG, B-LOC, E-LOC, I-LOC, <START>, <STOP>\n"
+     ]
+    }
+   ],
+   "source": [
+    "\n",
+    "redactor = PIIRedactorTransform(config)\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 5: Apply the Redactor to Text Data\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This step applies the initialized PII redactor to the extracted text, redacting sensitive information and providing details about the identified entities."
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 8,
+   "metadata": {},
+   "outputs": [],
+   "source": [
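+    "# _redact_pii is underscore-prefixed (internal by convention); it returns the\n",
+    "# redacted text and the list of entity types that were detected.\n",
+    "redacted_text, detected_entities = redactor._redact_pii(text)\n"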
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "### Step 6: Display the Redaction Results\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "This step outputs the results of the redaction process, including the redacted text and the details of the detected PII entities.\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [
+    {
+     "name": "stdout",
+     "output_type": "stream",
+     "text": [
+      "Redacted Text:\n",
+      " INVOICE\n",
+      "Apple Inc.\n",
+      "Invoice Details:\n",
+      "Invoice Number: INV-2024-001\n",
+      "Invoice Date: November 15, 2024\n",
+      "Due Date: November 30, 2024\n",
+      "Billing Information:\n",
+      "Customer Name: <PERSON>\n",
+      "Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
+      "Email: <EMAIL_ADDRESS>\n",
+      "Phone: <PHONE_NUMBER>\n",
+      "Shipping Information:\n",
+      "Recipient Name: <PERSON>\n",
+      "Address: 123 <LOCATION>, Apt 45, <LOCATION>, <LOCATION> 62704\n",
+      "Item Details:\n",
+      "Description Quantity Unit Price Total\n",
+      "MacBook Air (13-inch, M2) 1 $999.00 $999.00\n",
+      "AppleCare+ for MacBook Air 1 $199.00 $199.00\n",
+      "Subtotal: $1,198.00\n",
+      "Tax (8%): $95.84\n",
+      "Total Amount Due: $1,293.84\n",
+      "Payment Method: Credit Card (Visa)\n",
+      "Transaction ID: 9876543210ABCDE\n",
+      "Notes:\n",
+      "Thank you for your purchase!\n",
+      "For assistance, please contact our support team at <EMAIL_ADDRESS> or 1-800-MY-APPLE.\n",
+      "Detected Entities:\n",
+      " ['PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PERSON', 'LOCATION', 'LOCATION', 'LOCATION', 'EMAIL_ADDRESS', 'PHONE_NUMBER']\n"
+     ]
+    }
+   ],
+   "source": [
+    "# Step 6: Print the results\n",
+    "print(\"Redacted Text:\\n\", redacted_text)\n",
+    "print(\"Detected Entities:\\n\", detected_entities)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "<br>\n",
+    "<br>\n",
+    "\n",
+    "### This notebook demonstrated how to extract text from a PDF invoice and redact its PII entities with the PII Redactor transform"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "Python 3",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.10"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 4
+}
diff --git a/examples/notebooks/PII/invoicedata/Invoice.pdf b/examples/notebooks/PII/invoicedata/Invoice.pdf