Native JSON Template Support
bosd committed Mar 30, 2023
1 parent 2ec8c5c commit 0d2d16c
Showing 5 changed files with 28 additions and 20 deletions.
4 changes: 2 additions & 2 deletions README.md
@@ -10,7 +10,7 @@ process.
1. extracts text from PDF files using different techniques, like
`pdftotext`, `text`, `pdfminer`, `pdfplumber` or OCR -- `tesseract`, or
`gvision` (Google Cloud Vision).
-2. searches for regex in the result using a YAML-based template system
+2. searches for regex in the result using a YAML or JSON-based template system
3. saves results as CSV, JSON or XML or renames PDF files to match the content.

With the flexible template system you can:
@@ -138,7 +138,7 @@ the list to add your own. If deployed by a bigger organisation, there
should be an interface to edit templates for new suppliers. 80-20 rule.
For a short tutorial on how to add new templates, see [TUTORIAL.md](TUTORIAL.md).

-Templates are based on Yaml. They define one or more keywords to find
+Templates are based on Yaml or JSON. They define one or more keywords to find
the right template, one or more exclude_keywords to further narrow it down
and regexp for fields to be extracted. They could also be a static value,
like the full company name.
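
As a rough illustration of the matching described in this README passage, the sketch below accepts a template when its keywords are found in the extracted text and none of its exclude_keywords are. The function name `matches_template`, the sample template, and the rule that every keyword must match are assumptions for illustration, not a statement of how invoice2data implements it.

```python
import re


def matches_template(template, text):
    """Return True if `text` looks like it belongs to `template`.

    Assumption: every keyword must match and no exclude keyword may match.
    """
    keywords = template.get("keywords", [])
    excluded = template.get("exclude_keywords", [])
    if not keywords:
        return False
    if not all(re.search(keyword, text) for keyword in keywords):
        return False
    return not any(re.search(keyword, text) for keyword in excluded)


# Hypothetical template, written as the dict a YAML or JSON file would load to.
template = {
    "issuer": "Example Corp",  # static value rather than a regex
    "keywords": ["Example Corp", "Invoice"],
    "exclude_keywords": ["Credit note"],
}

is_invoice = matches_template(template, "Example Corp Invoice 42")
is_credit_note = matches_template(template, "Example Corp Invoice / Credit note")
print(is_invoice, is_credit_note)
```

Keywords are treated as regular expressions here, which is why a pattern such as `WS\s?Retail` can serve as a keyword.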
2 changes: 1 addition & 1 deletion TUTORIAL.md
@@ -5,7 +5,7 @@ invoice. Each template should work on all invoices of a company or
subsidiary (e.g. Amazon Germany vs Amazon US).

Adding templates is easy and shouldn't take longer than adding 2-3
-invoices by hand. We use a simple YML-format. Many options are optional
+invoices by hand. We use a simple YAML or json-format. Many options are optional
and you just need them to take care of edge cases.

Existing templates can be found in the [templates folder](https://github.com/invoice-x/invoice2data/tree/master/src/invoice2data/extract/templates) of the installed
9 changes: 7 additions & 2 deletions src/invoice2data/extract/loader.py
@@ -1,10 +1,11 @@
"""
This module abstracts templates for invoice providers.
-Templates are initially read from .yml files and then kept as class.
+Templates are initially read from .yml or .json files and then kept as class.
"""

import os
+import json
try:
    from yaml import load, YAMLError, CSafeLoader as SafeLoader
except ImportError: # pragma: no cover
@@ -71,7 +72,11 @@ def read_templates(folder=None):
except YAMLError as error:
logger.warning("Failed to load %s template:\n%s", name, error)
continue

+else:
+try:
+tpl = json.loads(template_file.read())
+except ValueError as error:
+logger.warning("json Loader Failed to load %s template:\n%s", name, error)
tpl["template_name"] = name

# Test if all required fields are in template:
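
The JSON branch added in this hunk logs a warning when parsing fails, but as shown it does not appear to skip the file the way the YAML branch does with `continue`, so `tpl["template_name"] = name` could run against a missing or stale `tpl`. A minimal sketch of one way to guard against that follows. The helper name `parse_json_template` and the sample inputs are hypothetical and not part of invoice2data.

```python
import json
import logging

logger = logging.getLogger(__name__)


def parse_json_template(name, text):
    """Parse JSON template text; return a dict, or None if it cannot be used."""
    try:
        tpl = json.loads(text)
    except ValueError as error:  # json.JSONDecodeError subclasses ValueError
        logger.warning("Failed to load %s template:\n%s", name, error)
        return None
    if not isinstance(tpl, dict):
        logger.warning("Template %s is not a JSON object", name)
        return None
    tpl["template_name"] = name
    return tpl


# Hypothetical inputs standing in for files found in the templates folder.
candidates = [
    ("good.json", '{"issuer": "Example"}'),
    ("bad.json", "{not json"),
]

templates = []
for name, text in candidates:
    tpl = parse_json_template(name, text)
    if tpl is None:
        continue  # skip broken templates instead of reusing a stale value
    templates.append(tpl)

print([tpl["template_name"] for tpl in templates])
```

Returning `None` and letting the caller `continue` keeps the failure handling for JSON in line with what the YAML branch already does.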
18 changes: 18 additions & 0 deletions src/invoice2data/extract/templates/com/com.flipkart.WSRetail.json
@@ -0,0 +1,18 @@
+{
+  "issuer": "Flipkart",
+  "fields": {
+    "amount": "GrandTotal(\\d+\\.\\d+)",
+    "date": "InvoiceDate:(\\d{1,4}\\-\\d{1,2}\\-\\d{1,4})",
+    "invoice_number": "InvoiceNo:(\\S+)",
+    "order_id": "OrderID:(\\w{2}\\d{16,18})"
+  },
+  "keywords": [
+    "flipkart",
+    "WS\\s?Retail",
+    "OD"
+  ],
+  "options": {
+    "currency": "INR",
+    "remove_whitespace": true
+  }
+}
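
To show how the field patterns in this template could be applied, here is a small sketch that loads the `fields` and `options` parts with `json.loads` and runs each regex over sample text. The function `extract_fields` and the sample invoice text are made up for illustration, and the sketch assumes `remove_whitespace` strips spaces but keeps line breaks. It does not reproduce invoice2data's own extraction code.

```python
import json
import re

# The "fields" and "options" parts of the template, as JSON text.
# A raw string keeps the doubled backslashes exactly as they appear in the file.
TEMPLATE_JSON = r"""
{
  "fields": {
    "amount": "GrandTotal(\\d+\\.\\d+)",
    "date": "InvoiceDate:(\\d{1,4}\\-\\d{1,2}\\-\\d{1,4})",
    "invoice_number": "InvoiceNo:(\\S+)",
    "order_id": "OrderID:(\\w{2}\\d{16,18})"
  },
  "options": {"remove_whitespace": true}
}
"""


def extract_fields(template, text):
    """Apply each field regex to `text` and collect the first capture group."""
    if template.get("options", {}).get("remove_whitespace"):
        # Assumption: only spaces are removed, line breaks stay in place.
        text = text.replace(" ", "")
    result = {}
    for field, pattern in template["fields"].items():
        match = re.search(pattern, text)
        result[field] = match.group(1) if match else None
    return result


template = json.loads(TEMPLATE_JSON)

# Hypothetical invoice text, not taken from a real document.
sample = (
    "Order ID: OD123456789012345678\n"
    "Invoice No: FAB123\n"
    "Invoice Date: 01-12-2014\n"
    "Grand Total 319.00\n"
)

fields = extract_fields(template, sample)
print(fields)
```

The doubled backslashes in the JSON file are required because JSON treats a single backslash as an escape character. After `json.loads`, each pattern holds single backslashes, for example `\d+`, which is what `re` expects.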
15 changes: 0 additions & 15 deletions src/invoice2data/extract/templates/com/com.flipkart.WSRetail.yml

This file was deleted.
