Include help in README, closes #3
simonw committed Jun 29, 2022
1 parent a78d47e commit 393343c
Showing 3 changed files with 80 additions and 3 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/test.yml
@@ -30,4 +30,4 @@ jobs:
- name: Run tests
run: |
pytest
cog --check README.md
79 changes: 78 additions & 1 deletion README.md
@@ -17,7 +17,7 @@ Install this tool using `pip`:

    pip install s3-ocr

-## Usage
+## Starting OCR against every PDF in a bucket

The `start` command loops through every PDF file in a bucket (every file ending in `.pdf`) and submits it to [Textract](https://aws.amazon.com/textract/) for OCR processing.

@@ -29,6 +29,33 @@ You can start the process running like this:

OCR can take some time. The results of the OCR will be stored in `textract-output` in your bucket.

<!-- [[[cog
import cog
from s3_ocr import cli
from click.testing import CliRunner
runner = CliRunner()
result = runner.invoke(cli.cli, ["start", "--help"])
help = result.output.replace("Usage: cli", "Usage: s3-ocr")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: s3-ocr start [OPTIONS] BUCKET

  Start OCR tasks for all files in this bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

```
<!-- [[[end]]] -->
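
The cog snippet above captures the `--help` text, swaps the program name in the usage line, and wraps the result in a code fence. As a rough illustration of that string handling only, here is a minimal sketch using a hypothetical helper, `render_help_block`, which is not part of `s3_ocr`; it needs neither click nor cog to run:

```python
def render_help_block(help_output, prog_name="s3-ocr"):
    """Wrap captured --help text in a Markdown code fence.

    Hypothetical helper that mirrors the string handling in the cog snippet.
    """
    # The cog snippet replaces "Usage: cli" because that is how the usage
    # line appears when the command is invoked through CliRunner.
    text = help_output.replace("Usage: cli", "Usage: " + prog_name)
    # Build the fence at runtime so this example can itself sit inside a fence.
    fence = "`" * 3
    return "{0}\n{1}\n{0}".format(fence, text)


sample = "Usage: cli start [OPTIONS] BUCKET\n"
print(render_help_block(sample))
```

Because click's output already ends with a newline, the rendered block has a blank line before the closing fence.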

## Changes made to your bucket

To keep track of which files have been submitted for processing, `s3-ocr` will create a JSON file for every file that it adds to the OCR queue.
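
To make the bookkeeping idea concrete, here is a minimal sketch of how marker files could be used to find PDFs that still need submitting. It assumes the marker for `report.pdf` is stored as `report.pdf.s3-ocr.json` and that already-marked files are skipped; both are assumptions for illustration, and `pdfs_needing_ocr` is a hypothetical function rather than the tool's own code:

```python
# Assumed marker naming: "<original key>.s3-ocr.json"
MARKER_SUFFIX = ".s3-ocr.json"


def pdfs_needing_ocr(keys):
    """Return the PDF keys that have no marker file alongside them."""
    existing = set(keys)
    return [
        key
        for key in keys
        if key.endswith(".pdf") and key + MARKER_SUFFIX not in existing
    ]


bucket_keys = ["a.pdf", "a.pdf.s3-ocr.json", "b.pdf", "notes.txt"]
print(pdfs_needing_ocr(bucket_keys))
```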
@@ -61,6 +88,29 @@ The `s3-ocr status <bucket-name>` command shows a rough indication of progress t
```
It compares the jobs that have been submitted, based on `.s3-ocr.json` files, to the jobs that have their results written to the `textract-output/` folder.
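
The comparison described here can be sketched over a plain list of bucket keys. This is only an illustration: `rough_progress` is a hypothetical function, and the layout `textract-output/<job_id>/...` for finished jobs is an assumption, not something this README states:

```python
def rough_progress(keys):
    """Return (completed, submitted) job counts from a list of bucket keys."""
    # One marker file per submitted job.
    submitted = [key for key in keys if key.endswith(".s3-ocr.json")]
    # Assumption: each finished job writes under textract-output/<job_id>/...
    completed_job_ids = {
        key.split("/")[1]
        for key in keys
        if key.startswith("textract-output/") and key.count("/") >= 2
    }
    return len(completed_job_ids), len(submitted)


keys = [
    "a.pdf",
    "a.pdf.s3-ocr.json",
    "b.pdf",
    "b.pdf.s3-ocr.json",
    "textract-output/job1/1",
    "textract-output/job1/2",
]
done, total = rough_progress(keys)
print("{} of {} jobs complete".format(done, total))
```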

<!-- [[[cog
result = runner.invoke(cli.cli, ["status", "--help"])
help = result.output.replace("Usage: cli", "Usage: s3-ocr")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: s3-ocr status [OPTIONS] BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

```
<!-- [[[end]]] -->

## Creating a SQLite index of your OCR results

The `s3-ocr index <database_file> <bucket>` command creates a SQLite database containing the results of the OCR, and configures SQLite full-text search for the text:
@@ -91,6 +141,29 @@ CREATE TABLE [fetched_jobs] (
```
The database is designed to be used with [Datasette](https://datasette.io).
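
For readers unfamiliar with SQLite full-text search, here is a minimal sketch of the general idea using only the standard library `sqlite3` module (the tool itself depends on `sqlite-utils`). The `pages` table, its columns, and the function names are hypothetical and chosen for illustration; the sketch also assumes your SQLite build includes the FTS5 extension:

```python
import sqlite3


def build_index(rows):
    """Create an in-memory database with a full-text index over page text."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical schema: one row per OCR'd page.
    conn.execute("CREATE TABLE pages (path TEXT, page INTEGER, text TEXT)")
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?)", rows)
    # External-content FTS5 table that indexes pages.text by rowid.
    conn.execute(
        "CREATE VIRTUAL TABLE pages_fts USING fts5(text, content='pages')"
    )
    # Populate the index from the existing rows in pages.
    conn.execute("INSERT INTO pages_fts(pages_fts) VALUES ('rebuild')")
    return conn


def search(conn, query):
    """Return (path, page) pairs whose text matches the FTS query."""
    sql = (
        "SELECT pages.path, pages.page FROM pages_fts "
        "JOIN pages ON pages.rowid = pages_fts.rowid "
        "WHERE pages_fts MATCH ? ORDER BY pages.path, pages.page"
    )
    return conn.execute(sql, (query,)).fetchall()


conn = build_index([
    ("a.pdf", 1, "invoice for widgets"),
    ("a.pdf", 2, "terms and conditions"),
    ("b.pdf", 1, "widgets catalogue"),
])
print(search(conn, "widgets"))
```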

<!-- [[[cog
result = runner.invoke(cli.cli, ["index", "--help"])
help = result.output.replace("Usage: cli", "Usage: s3-ocr")
cog.out(
    "```\n{}\n```".format(help)
)
]]] -->
```
Usage: s3-ocr index [OPTIONS] DATABASE BUCKET

  Show status of OCR jobs for a bucket

Options:
  --access-key TEXT     AWS access key ID
  --secret-key TEXT     AWS secret access key
  --session-token TEXT  AWS session token
  --endpoint-url TEXT   Custom endpoint URL
  -a, --auth FILENAME   Path to JSON/INI file containing credentials
  --help                Show this message and exit.

```
<!-- [[[end]]] -->

## Development

To contribute to this tool, first check out the code. Then create a new virtual environment:
@@ -106,3 +179,7 @@ Now install the dependencies and test dependencies:
To run the tests:

    pytest

To regenerate the README file with the latest `--help`:

    cog -r README.md
2 changes: 1 addition & 1 deletion setup.py
@@ -32,6 +32,6 @@ def get_long_description():
s3-ocr=s3_ocr.cli:cli
""",
install_requires=["click", "boto3", "sqlite-utils"],
-extras_require={"test": ["pytest"]},
+extras_require={"test": ["pytest", "cogapp"]},
python_requires=">=3.7",
)
