Added notebook for pdf2parquet
Signed-off-by: Maroun Touma <touma@marouns-mbp.watson.ibm.com>
Maroun Touma committed Nov 20, 2024
1 parent c4c9e5e commit abec823
Showing 1 changed file with 212 additions and 0 deletions.
212 changes: 212 additions & 0 deletions transforms/language/pdf2parquet/pdf2parquet.ipynb
@@ -0,0 +1,212 @@
{
"cells": [
{
"cell_type": "markdown",
"id": "afd55886-5f5b-4794-838e-ef8179fb0394",
"metadata": {},
"source": [
"##### **** These pip installs need to be adapted to use the appropriate release level. Alternatively, the venv running JupyterLab can be pre-configured with a requirements file that includes the right release.\n",
"\n",
"##### **** Example for transform developers working from a git clone\n",
"```\n",
"make venv\n",
"source venv/bin/activate && pip install jupyterlab\n",
"```"
]
},
{
"cell_type": "code",
"execution_count": 1,
"id": "4c45c3c6-e4d7-4e61-8de6-32d61f2ce695",
"metadata": {},
"outputs": [],
"source": [
"%%capture\n",
"## These commands are here for reference only\n",
"# Users and application developers must use the right tag for the latest release from PyPI\n",
"#!pip install data-prep-toolkit\n",
"#!pip install data-prep-toolkit-transforms\n",
"#!pip install data-prep-connector"
]
},
{
"cell_type": "markdown",
"id": "407fd4e4-265d-4ec7-bbc9-b43158f5f1f3",
"metadata": {
"jp-MarkdownHeadingCollapsed": true
},
"source": [
"##### **** Configure the transform parameters. This notebook only shows the use of pdf2parquet_double_precision. For the complete list, refer to the README.md for this transform\n",
"##### \n",
"| parameter:type | Description |\n",
"| --- | --- |\n",
"| data_files_to_use: list | list of file extensions in the input folder to use for running the transform |\n",
"| pdf2parquet_double_precision: int | number of decimal places to which floating-point values in the output are rounded |\n"
]
},
{
"cell_type": "markdown",
"id": "ebf1f782-0e61-485c-8670-81066beb734c",
"metadata": {},
"source": [
"##### **** Import required classes and modules"
]
},
{
"cell_type": "code",
"execution_count": 2,
"id": "c2a12abc-9460-4e45-8961-873b48a9ab19",
"metadata": {},
"outputs": [],
"source": [
"import ast\n",
"import os\n",
"import sys\n",
"\n",
"from data_processing.runtime.pure_python import PythonTransformLauncher\n",
"from data_processing.utils import ParamsUtils\n",
"from pdf2parquet_transform_python import Pdf2ParquetPythonTransformConfiguration\n"
]
},
{
"cell_type": "markdown",
"id": "7234563c-2924-4150-8a31-4aec98c1bf33",
"metadata": {},
"source": [
"##### **** Set up runtime parameters for this transform"
]
},
{
"cell_type": "code",
"execution_count": 3,
"id": "e90a853e-412f-45d7-af3d-959e755aeebb",
"metadata": {},
"outputs": [],
"source": [
"\n",
"# create parameters\n",
"input_folder = os.path.join(\"python\", \"test-data\", \"input\")\n",
"output_folder = os.path.join(\"python\", \"output\")\n",
"local_conf = {\n",
" \"input_folder\": input_folder,\n",
" \"output_folder\": output_folder,\n",
"}\n",
"params = {\n",
" # Data access. Only required parameters are specified\n",
" \"data_local_config\": ParamsUtils.convert_to_ast(local_conf),\n",
" \"data_files_to_use\": ast.literal_eval(\"['.pdf','.docx','.pptx','.zip']\"),\n",
" # execution info\n",
" \"runtime_pipeline_id\": \"pipeline_id\",\n",
" \"runtime_job_id\": \"job_id\",\n",
" # pdf2parquet params\n",
" \"pdf2parquet_double_precision\": 0,\n",
"}"
]
},
{
"cell_type": "markdown",
"id": "7949f66a-d207-45ef-9ad7-ad9406f8d42a",
"metadata": {},
"source": [
"##### **** Use the Python runtime to invoke the transform"
]
},
{
"cell_type": "code",
"execution_count": 4,
"id": "0775e400-7469-49a6-8998-bd4772931459",
"metadata": {},
"outputs": [
{
"name": "stderr",
"output_type": "stream",
"text": [
"13:23:55 INFO - pdf2parquet parameters are : {'batch_size': -1, 'artifacts_path': None, 'contents_type': <pdf2parquet_contents_types.MARKDOWN: 'text/markdown'>, 'do_table_structure': True, 'do_ocr': True, 'ocr_engine': <pdf2parquet_ocr_engine.EASYOCR: 'easyocr'>, 'bitmap_area_threshold': 0.05, 'pdf_backend': <pdf2parquet_pdf_backend.DLPARSE_V2: 'dlparse_v2'>, 'double_precision': 0}\n",
"13:23:55 INFO - pipeline id pipeline_id\n",
"13:23:55 INFO - code location None\n",
"13:23:55 INFO - data factory data_ is using local data access: input_folder - python/test-data/input output_folder - python/output\n",
"13:23:55 INFO - data factory data_ max_files -1, n_sample -1\n",
"13:23:55 INFO - data factory data_ Not using data sets, checkpointing False, max files -1, random samples -1, files to use ['.pdf', '.docx', '.pptx', '.zip'], files to checkpoint ['.parquet']\n",
"13:23:55 INFO - orchestrator pdf2parquet started at 2024-11-20 13:23:55\n",
"13:23:55 INFO - Number of files is 2, source profile {'max_file_size': 0.3013172149658203, 'min_file_size': 0.2757863998413086, 'total_file_size': 0.5771036148071289}\n",
"13:23:55 INFO - Initializing models\n",
"13:23:58 INFO - Processing archive_doc_filename='2305.03393v1-pg9.pdf' \n",
"13:23:59 INFO - Processing archive_doc_filename='2408.09869v1-pg1.pdf' \n",
"13:24:00 INFO - Completed 1 files (50.0%) in 0.029 min\n",
"13:24:03 INFO - Completed 2 files (100.0%) in 0.08 min\n",
"13:24:03 INFO - Done processing 2 files, waiting for flush() completion.\n",
"13:24:03 INFO - done flushing in 0.0 sec\n",
"13:24:03 INFO - Completed execution in 0.132 min, execution result 0\n"
]
}
],
"source": [
"%%capture\n",
"sys.argv = ParamsUtils.dict_to_req(d=params)\n",
"launcher = PythonTransformLauncher(runtime_config=Pdf2ParquetPythonTransformConfiguration())\n",
"launcher.launch()\n"
]
},
{
"cell_type": "markdown",
"id": "c3df5adf-4717-4a03-864d-9151cd3f134b",
"metadata": {},
"source": [
"##### **** The output folder now contains the transformed Parquet files and a metadata.json file."
]
},
{
"cell_type": "code",
"execution_count": 5,
"id": "7276fe84-6512-4605-ab65-747351e13a7c",
"metadata": {},
"outputs": [
{
"data": {
"text/plain": [
"['python/output/redp5110-ch1.parquet',\n",
" 'python/output/metadata.json',\n",
" 'python/output/archive1.parquet']"
]
},
"execution_count": 5,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"import glob\n",
"glob.glob(\"python/output/*\")"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "fef6667e-71ed-4054-9382-55c6bb3fda70",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.11.10"
}
},
"nbformat": 4,
"nbformat_minor": 5
}
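
The last code cell of the notebook lists everything in `python/output` with `glob`. As a follow-up, the listing could be split into the Parquet files and the parsed `metadata.json`. The sketch below uses only the standard library. The helper name `summarize_output` is hypothetical and not part of data-prep-toolkit, and nothing is assumed about the keys stored in `metadata.json`.

```python
import glob
import json
import os


def summarize_output(output_folder):
    """Return (sorted Parquet paths, parsed metadata dict or None).

    Hypothetical helper for illustration; not part of data-prep-toolkit.
    """
    # Collect only the Parquet files, sorted for a stable listing.
    parquet_files = sorted(glob.glob(os.path.join(output_folder, "*.parquet")))

    # metadata.json sits next to the Parquet files in the notebook's listing;
    # return None when it is absent instead of raising.
    metadata_path = os.path.join(output_folder, "metadata.json")
    metadata = None
    if os.path.isfile(metadata_path):
        with open(metadata_path, encoding="utf-8") as fh:
            metadata = json.load(fh)

    return parquet_files, metadata


if __name__ == "__main__":
    # Same output folder as the notebook's parameters cell.
    files, metadata = summarize_output(os.path.join("python", "output"))
    print(f"{len(files)} parquet file(s)")
    for path in files:
        print(" ", path)
    print("metadata.json found:", metadata is not None)
```

Returning `None` for missing metadata keeps the helper usable before the transform has run, when the output folder may be empty or absent.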
