With release v0.33.0, deepdoctection has been refactored and is no longer compatible with previous releases. If you are on an earlier version, please update to the latest release or check out the state of this repo tagged with v0.32.0.
In this repo you will find Jupyter notebooks that used to be part of the main deepdoctection repo. If you encounter problems, feel free to open an issue in the deepdoctection repository.
In addition, the repo contains a folder with examples that are used in the notebooks.
- Introduction to deepdoctection
- Analyzer
- Output structure: Page, Layouts, Tables
- Saving and reading a parsed document
- Pipelines
- Analyzer configuration
- Pipeline components
- Layout detection models
- OCR matching and reading order
- Analyzer Configuration
- How to change configuration
- High level Configuration
- Layout models
- Table transformer
- Custom model
- Table segmentation
- Text extraction
- PDFPlumber
- Tesseract
- DocTr
- AWS Textract
- Word matching
- Text ordering
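To give a feel for what the notebooks above cover, here is a minimal sketch of the analyzer workflow. The document path is a placeholder, and the `config_overwrite` keys are assumptions based on the configuration topics listed above; check them against your installed version:

```python
import deepdoctection as dd

# Build the default analyzer. config_overwrite changes single config values;
# the keys below (switching OCR from Tesseract to DocTr) are assumptions.
analyzer = dd.get_dd_analyzer(
    config_overwrite=["OCR.USE_TESSERACT=False", "OCR.USE_DOCTR=True"]
)

df = analyzer.analyze(path="path/to/doc.pdf")  # placeholder path
df.reset_state()  # required before iterating a deepdoctection dataflow

for page in df:
    print(page.text)      # text in reading order
    for table in page.tables:
        print(table.csv)  # segmented table content
```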
Analyzer_with_Table_Transformer.ipynb:
- Analyzer configuration for running Table Transformer
- General configuration
- Table segmentation
- Writing a predictor from a third party library
- Adding the model wrapper for YOLO
- Adding the model to the `ModelCatalog`
- Modifying the factory class to build the Analyzer
- Running the Analyzer with the YoloDetector
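The core of the wrapper is the mapping from raw model output to deepdoctection's `DetectionResult` objects. Below is a rough sketch under the assumption of the Ultralytics YOLO API; in the notebook the wrapper derives from deepdoctection's `ObjectDetector` base class (which requires further methods, e.g. `clone`), so treat names and field layouts as assumptions:

```python
import deepdoctection as dd
from ultralytics import YOLO  # third-party dependency; API assumed

class YoloDetector:
    """Sketch: maps YOLO outputs to deepdoctection DetectionResult objects."""

    def __init__(self, path_weights: str, categories: dict):
        self.model = YOLO(path_weights)
        self.categories = categories  # mapping from class id to category name

    def predict(self, np_img):
        result = self.model(np_img)[0]
        return [
            dd.DetectionResult(
                box=box.tolist(),          # xyxy pixel coordinates
                score=float(score),
                class_id=int(cls) + 1,     # deepdoctection category ids start at 1
                class_name=self.categories[int(cls) + 1],
            )
            for box, score, cls in zip(
                result.boxes.xyxy, result.boxes.conf, result.boxes.cls
            )
        ]
```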
Doclaynet_Analyzer_Config.ipynb:
- Advanced Analyzer Configuration
- Adding the model wrapper for YOLO
- Configuration to parse the page with respect to granular layout segments
- Extracting figures
- Relating captions to figures and tables
- Model catalog and registries
- Predictors
- Instantiating Pipeline backbones
- Instantiating Pipelines
- Creation of custom datasets
- Evaluation
- Fine tuning models
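As a quick taste of the model catalog introduced above, the snippet below lists registered model profiles; profile names vary by release, so the printed entries are simply whatever your installed version registers:

```python
import deepdoctection as dd

# The ModelCatalog is deepdoctection's registry of pre-trained weights.
# Listing the registered profiles shows what can be plugged into a pipeline.
for profile_name in dd.ModelCatalog.get_profile_list()[:10]:
    print(profile_name)

# A profile bundles weights, config and category metadata for one model;
# replace the placeholder name with any entry printed above.
# profile = dd.ModelCatalog.get_profile("some/profile/name")
```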
- Diving deeper into the data structure
- Page and Image
- `ObjectTypes`
- `ImageAnnotation` and sub categories
- Adding an `ImageAnnotation`
- Adding a `ContainerAnnotation` to an `ImageAnnotation`
- Sub images from given `ImageAnnotation`
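As a pointer to what the data-structure notebook covers, here is a hedged sketch of navigating a parsed page; the attribute names follow deepdoctection's `Page` API, and the input path is a placeholder:

```python
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()
df = analyzer.analyze(path="path/to/doc.pdf")  # placeholder path
df.reset_state()
page = next(iter(df))  # a Page object wrapping the underlying Image

# Every detected layout segment is an ImageAnnotation underneath;
# the Page view exposes them through convenient attributes.
for layout in page.layouts:
    print(layout.category_name, layout.bounding_box)
```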
Using_LayoutLM_for_sequence_classification.ipynb:
- Fine tuning LayoutLM for sequence classification on a custom dataset
- Evaluation
- Building and running a production pipeline
Running_pre_trained_models_from_other_libraries.ipynb:
- Installing and running pre-trained models provided by Layout-Parser
- Adding new categories
The next three notebooks are experiments on a custom dataset for token classification that has been made available through Hugging Face. They show how to train and evaluate each model of the LayoutLM family and how to track experiments with W&B.
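Since the split bookkeeping in these notebooks relies on W&B artifacts, here is a minimal, self-contained sketch of that pattern (project and file names are made up):

```python
import wandb

# Log a dataset split as a versioned W&B artifact (names are placeholders).
run = wandb.init(project="layoutlm-token-classification")
artifact = wandb.Artifact("dataset-split", type="dataset")
artifact.add_file("split.json")  # e.g. the train/val/test sample ids
run.log_artifact(artifact)
run.finish()

# Later runs can pull back the identical split for reproducible evaluation.
run = wandb.init(project="layoutlm-token-classification")
artifact_dir = run.use_artifact("dataset-split:latest").download()
run.finish()
```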
Layoutlm_v1_on_custom_token_classification.ipynb:
- LayoutLMv1 for financial report NER
- Defining object types
- Visualization and display of ground truth
- Defining Dataflow and Dataset
- Defining a split and saving the split distribution as W&B artifact
- LayoutLMv1 training
- Further exploration of evaluation
- Evaluation with confusion matrix
- Visualizing predictions and ground truth
- Evaluation on test set
- Changing training parameters and settings
Layoutlm_v2_on_custom_token_classification.ipynb:
- LayoutLMv2 for financial report NER
- Defining `ObjectTypes`, Dataset and Dataflow
- Loading W&B artifact and building dataset split
- Exploring the language distribution across the split
- Evaluation
- LayoutXLM for financial report NER
- Training XLM models on separate languages
Layoutlm_v3_on_custom_token_classification.ipynb:
- LayoutLMv3 for financial report NER
- Evaluation
- Conclusion
To use the notebooks, deepdoctection must be installed.
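If you are unsure which version you have, a quick check (assuming a standard pip install):

```python
from importlib.metadata import version

# The notebooks in this repo assume deepdoctection v0.33.0 or later.
print(version("deepdoctection"))
```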