A Document AI Package - Jupyter notebook tutorials

Breaking changes

With the release of deepdoctection v.0.33.0, the package has been refactored and is no longer compatible with earlier releases. If you are on an earlier version, please update to the latest version or use the version of this repo that is tagged v.0.32.0.
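Before opening a notebook, it can help to check which side of the breaking change the local installation is on. The following is a minimal sketch that uses only the standard library. The helper names (`parse_version`, `is_compatible`, `installed_version`) are hypothetical and not part of deepdoctection, and the version parsing is deliberately simplified: it looks only at the leading numeric components and treats pre-release suffixes as the release itself.

```python
from importlib import metadata

# First refactored release, taken from the note above.
MIN_VERSION = (0, 33, 0)


def parse_version(text):
    """Turn '0.33.0' or 'v.0.33.0' into (0, 33, 0).

    Simplified on purpose: only leading digits of each dot-separated
    piece are read, and the result is padded to three components.
    """
    parts = []
    for piece in text.lstrip("vV.").split("."):
        digits = ""
        for ch in piece:
            if not ch.isdigit():
                break
            digits += ch
        if not digits:
            break
        parts.append(int(digits))
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts)


def is_compatible(version_text, minimum=MIN_VERSION):
    """Return True if version_text is at or above the refactored release."""
    return parse_version(version_text) >= minimum


def installed_version(package="deepdoctection"):
    """Return the installed version string, or None if the package is absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    found = installed_version()
    if found is None:
        print("deepdoctection is not installed")
    elif is_compatible(found):
        print(f"deepdoctection {found}: use the current notebooks")
    else:
        print(f"deepdoctection {found}: update, or use the repo tag v.0.32.0")
```

For stricter comparisons, including pre-releases, a dedicated version parser would be a better fit than this hand-rolled tuple comparison.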

Jupyter Notebooks for deepdoctection

This repo contains the Jupyter notebooks that used to be in the main deepdoctection repo. If you encounter problems, feel free to open an issue in the deepdoctection repository.

In addition, the repo contains a folder with examples that are used in the notebooks.

Get_Started.ipynb:

Introduction to deepdoctection
Analyzer
Output structure: Page, Layouts, Tables
Saving and reading a parsed document

Pipelines.ipynb:

Pipelines
Analyzer configuration
Pipeline components
Layout detection models
OCR matching and reading order

Analyzer_Configuration.ipynb:

Analyzer Configuration
How to change configuration
High level Configuration
Layout models
Table transformer
Custom model
Table segmentation
Text extraction
PDFPlumber
Tesseract
DocTr
AWS Textract
Word matching
Text ordering

Analyzer_with_Table_Transformer.ipynb:

Analyzer configuration for running Table Transformer
General configuration
Table segmentation

Doclaynet_with_YOLO.ipynb:

Writing a predictor from a third party library
Adding the model wrapper for YOLO
Adding the model to the ModelCatalog
Modifying the factory class to build the Analyzer
Running the Analyzer with the YoloDetector

Doclaynet_Analyzer_Config.ipynb:

Advanced Analyzer Configuration
Adding the model wrapper for YOLO
Configuration to parse the page with respect to granular layout segments
Extracting figures
Relating captions to figures and tables

Custom_Pipeline.ipynb:

Model catalog and registries
Predictors
Instantiating Pipeline backbones
Instantiating Pipelines

Datasets_and_Eval.ipynb:

Creation of custom datasets
Evaluation
Fine tuning models

Data_structure.ipynb:

Diving deeper into the data structure
Page and Image
ObjectTypes
ImageAnnotation and sub categories
Adding an ImageAnnotation
Adding a ContainerAnnotation to an ImageAnnotation
Sub images from given ImageAnnotation

Using_LayoutLM_for_sequence_classification.ipynb:

Fine tuning LayoutLM for sequence classification on a custom dataset
Evaluation
Building and running a production pipeline

Running_pre_trained_models_from_third_party_libraries.ipynb:

Installing and running pre-trained models provided by Layout-Parser
Adding new categories

The next three notebooks are experiments on a custom token classification dataset that has been made available through Hugging Face. They show how to train and evaluate each model of the LayoutLM family and how to track experiments with W&B.

Layoutlm_v1_on_custom_token_classification.ipynb:

LayoutLMv1 for financial report NER
Defining object types
Visualization and display of ground truth
Defining Dataflow and Dataset
Defining a split and saving the split distribution as W&B artifact
LayoutLMv1 training
Further exploration of evaluation
Evaluation with confusion matrix
Visualizing predictions and ground truth
Evaluation on test set
Changing training parameters and settings

Layoutlm_v2_on_custom_token_classification.ipynb:

LayoutLMv2 for financial report NER
Defining ObjectTypes, Dataset and Dataflow
Loading W&B artifact and building dataset split
Exploring the language distribution across the split
Evaluation
LayoutXLM for financial report NER
Training XLM models on separate languages

Layoutlm_v3_on_custom_token_classification.ipynb:

LayoutLMv3 for financial report NER
Evaluation
Conclusion

To use the notebooks, deepdoctection must be installed.
