edited pdf transform documentation for consistency (#523)
* edited pdf transform documentation for consistency

* updated webpage for beautiful soup 4
Tyrest authored Aug 10, 2023
1 parent 3e2e35e commit a20f005
Showing 2 changed files with 35 additions and 32 deletions.
58 changes: 29 additions & 29 deletions docs/guide/transforms/pdf_transform.md
@@ -1,8 +1,8 @@
The PDF transform allows users to extract text from pdf files. Autolabel offers both direct text extraction, useful for extracting text from pdfs that contain text, and optical character recognition (OCR) text extraction, useful for extracting text from pdfs that contain images. To use this transform, follow these steps:

<ol>
<li>Install dependencies
For direct text extraction, install the <code>pdfplumber</code> package:
## Installation

For direct text extraction, install the <code>pdfplumber</code> package:

```bash
pip install pdfplumber
@@ -14,10 +14,32 @@ For OCR text extraction, install the <code>pdf2image</code> and <code>pytesseract</code> packages:
pip install pdf2image pytesseract
```
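
If it helps to see what these OCR dependencies are used for, here is a minimal, illustrative sketch of OCR extraction with <code>pdf2image</code> and <code>pytesseract</code>. This is not Autolabel's internal implementation, and the file path is a placeholder; the transform handles this step for you based on the parameters described below.

```python
# Illustrative sketch only; not Autolabel's internal code.
from pdf2image import convert_from_path
import pytesseract

# "example.pdf" is a placeholder path.
images = convert_from_path("example.pdf")  # render each page as an image
text = "\n".join(pytesseract.image_to_string(image) for image in images)
print(text)
```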

</li>
<li>Add the transform to your config file
## Parameters for this transform

<ol>
<li>file_path_column: the name of the column containing the file paths of the pdf files to extract text from</li>
<li>ocr_enabled: a boolean indicating whether to use OCR text extraction or not</li>
<li>page_format: a string containing the format to use for each page of the pdf file. The following fields can be used in the format string:
<ul>
<li>page_num: the page number of the page</li>
<li>page_content: the content of the page</li>
</ul>
</li>
<li>page_sep: a string containing the separator to use between each page of the pdf file</li>
</ol>

### Output Format

The page_format and page_sep parameters define how the text extracted from the pdf will be formatted. For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a page_format of <code>{page_num} - {page_content}</code> and a page_sep of <code>\n</code> would result in the following output:

```python
"1 - Hello,\n2 - World!"
```

The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.
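
As a rough illustration of how <code>page_format</code> and <code>page_sep</code> combine (a sketch only, assuming the per-page text has already been extracted):

```python
# Sketch: joining already-extracted page text with page_format and page_sep.
page_format = "{page_num} - {page_content}"
page_sep = "\n"
pages = ["Hello,", "World!"]  # one entry per page

output = page_sep.join(
    page_format.format(page_num=i + 1, page_content=content)
    for i, content in enumerate(pages)
)
print(repr(output))  # '1 - Hello,\n2 - World!'
```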

## Using the transform

below is an example of a pdf transform to extract text from a pdf file:
Below is an example of a pdf transform to extract text from a pdf file:

```json
{
@@ -41,29 +41,63 @@
}
```

The `params` field contains the following parameters:

<ul>
<li>file_path_column: the name of the column containing the file paths of the pdf files to extract text from</li>
<li>ocr_enabled: a boolean indicating whether to use OCR text extraction or not</li>
<li>page_format: a string containing the format to use for each page of the pdf file. The following fields can be used in the format string:
<ul>
<li>page_num: the page number of the page</li>
<li>page_content: the content of the page</li></li>
</ul>
<li>page_sep: a string containing the separator to use between each page of the pdf file
</ul>

For example, if the pdf file contained 2 pages with "Hello," on the first page and "World!" on the second, a page_format of <code>{page_num} - {page_content}</code> and a page_sep of <code>\n</code> would result in the following output:

```python
"1 - Hello,\n2 - World!"
```

The metadata column contains a dict with the field "num_pages" indicating the number of pages in the pdf file.

</li>
<li>Run the transform
## Run the transform

```python
from autolabel import LabelingAgent, AutolabelDataset
9 changes: 6 additions & 3 deletions docs/guide/transforms/webpage_transform.md
@@ -1,15 +1,17 @@
The Webpage transform supports loading and processing webpage URLs. Given a URL, this transform sends a request to load the webpage and then parses the returned page to collect the text to send to the LLM.

Use this transform yourself here in a Colab - [![open in colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/drive/1PwrdBUUX1u4X2SWjgKYNxB11Gb7XEIZs#scrollTo=1f17f05a)

To use this transform, follow these steps:

## Installation

Use the following command to download all dependencies for the webpage transform.
Use the following command to download all dependencies for the webpage transform. `beautifulsoup4` must be version `4.12.2` or higher.

```bash
pip install bs4 httpx fake_useragent
pip install beautifulsoup4 httpx fake_useragent
```

Make sure to install these dependencies before running the transform.
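
To confirm the installed `beautifulsoup4` meets the documented minimum, a quick check like the following can be used (a small illustrative snippet, not part of the transform itself):

```python
# Sanity check: beautifulsoup4 should report version 4.12.2 or higher.
import bs4

print(bs4.__version__)
```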

## Parameters for this transform
@@ -41,6 +43,7 @@ Below is an example of a webpage transform to extract text from a webpage:
```

## Run the transform

```python
from autolabel import LabelingAgent, AutolabelDataset
agent = LabelingAgent(config)
