Added option to modify config parameter used by Tesseract in LayoutLMV2/LayoutXLM Processor #17005

kelvinAI · 2022-04-29T07:33:15Z

What does this PR do?

Gives users the option to set the config parameter that Tesseract uses during feature extraction. For example, the page segmentation mode (PSM) can be changed during transcription by passing '--psm 10' as the config parameter when invoking image_to_data.

Changing the PSM value greatly influences the end result of LayoutLMV2/XLM, and the best PSM value depends on the document formatting. See the Tesseract documentation on page segmentation modes (PSM).

pytesseract.image_to_data(image, lang=lang, output_type="dict", config="--psm 10")
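
The config argument is a plain string of Tesseract command-line flags. A minimal sketch of building and validating such a string follows. The names build_tesseract_config and extra are hypothetical and used only for illustration; they are not part of pytesseract or of this PR. The sketch assumes Tesseract's documented PSM range of 0 to 13.

```python
from typing import Optional

# Tesseract documents page segmentation modes 0 through 13.
VALID_PSM = range(0, 14)


def build_tesseract_config(psm: Optional[int] = None, extra: str = "") -> str:
    """Hypothetical helper: build a config string such as '--psm 10'."""
    parts = []
    if psm is not None:
        if psm not in VALID_PSM:
            raise ValueError(f"psm must be between 0 and 13, got {psm}")
        parts.append(f"--psm {psm}")
    if extra:
        # Any additional flags are passed through unchanged.
        parts.append(extra.strip())
    return " ".join(parts)


print(build_tesseract_config(10))
```

The resulting string can be passed as the config argument shown above.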

Users can now set the tesseract config parameter during Processor initialization, like so:

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", ocr_lang="eng", tesseract_config="--psm 5")
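
As a rough illustration of how the setting could reach the OCR call, the sketch below forwards tesseract_config to an image_to_data-style function and collects the recognized words and their boxes. The names run_ocr and fake_image_to_data are hypothetical. The fake stands in for pytesseract.image_to_data so the example runs without Tesseract installed. This is not the actual code in feature_extraction_layoutlmv2.py.

```python
from typing import Any, Callable, Dict, List, Tuple

OcrFn = Callable[..., Dict[str, List[Any]]]


def run_ocr(
    image: Any, lang: str, tesseract_config: str, ocr_fn: OcrFn
) -> Tuple[List[str], List[List[int]]]:
    """Call an image_to_data-style function and keep only non-empty words."""
    data = ocr_fn(image, lang=lang, output_type="dict", config=tesseract_config)
    words: List[str] = []
    boxes: List[List[int]] = []
    rows = zip(data["text"], data["left"], data["top"], data["width"], data["height"])
    for text, x, y, w, h in rows:
        if not text.strip():
            continue  # skip rows that carry no recognized text
        words.append(text)
        boxes.append([x, y, x + w, y + h])  # (left, top, right, bottom)
    return words, boxes


def fake_image_to_data(image, lang=None, output_type=None, config=""):
    """Stand-in for pytesseract.image_to_data (assumption: same keyword names)."""
    fake_image_to_data.last_config = config  # record what was forwarded
    return {
        "text": ["", "Hello", " ", "world"],
        "left": [0, 10, 0, 60],
        "top": [0, 5, 0, 5],
        "width": [0, 40, 0, 45],
        "height": [0, 12, 0, 12],
    }


words, boxes = run_ocr(None, "eng", "--psm 5", fake_image_to_data)
print(words, boxes)
```

Passing the OCR function in as an argument keeps the sketch testable; with Tesseract installed, pytesseract.image_to_data could be supplied in place of the fake.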

Before submitting

[❌] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
[✔️] Did you read the contributor guideline, Pull Request section?
[❌] Was this discussed/approved via a Github issue or the forum? Please add a link to it if that's the case.
[✔️] Did you make sure to update the documentation with your changes? Here are the documentation guidelines, and here are tips on formatting docstrings.
[❌] Did you write any new necessary tests?

Feel free to modify as needed.
Thanks

@NielsRogge @LysandreJik

HuggingFaceDocBuilderDev · 2022-04-29T07:47:23Z

The documentation is not available anymore as the PR was closed or merged.

…huggingface#17076) * [ fast_tokenizers.mdx ] - Added translation to portuguese to tutorial * Delete docs/source/pt-br directory * [ fast_tokenizers.mdx ] - Continuing work on file * [ fast_tokenizers.mdx ] - Continuing work on file * Add fast tokenizers to _toctree.yml * Eliminated config and toctree.yml * Nits in fast_tokenizers.mdx Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

…6184) * Translated version of model_sharing to spanish * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Update docs/source_es/model_sharing.mdx * Addind model sharing to _toctree.yml Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

* file copied and toctree updated * Intro and configuration translated * model section translated * enter hotfix * Translation over, correction pending * Typos and corrections * Update docs/source/es/create_a_model.mdx Co-authored-by: Omar U. Espejel <espejelomar@gmail.com> * Update docs/source/es/create_a_model.mdx Co-authored-by: Omar U. Espejel <espejelomar@gmail.com> * Update docs/source/es/create_a_model.mdx Co-authored-by: Omar U. Espejel <espejelomar@gmail.com> * Update docs/source/es/create_a_model.mdx Co-authored-by: Omar U. Espejel <espejelomar@gmail.com> Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

* [doc] performance/scalability revamp * link the new docs * no : * mixed precision * work on the first doc * expand the main doc * Trigger CI * style * revamp single GPU training section * work on training performance * remove files not used anymore or will be added later * final touches * fix rebase * Add hardware section to toctree * fix toctree again * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * remove `fast_tokenizers` entry that was copied in rebase * add warning about DP vs DDP * remove todo * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * fix missing closure of codeblock * Update docs/source/en/perf_train_gpu_many.mdx Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * sync with huggingface#16860 * update toc Co-authored-by: leandro <leandro.vonwerra@spoud.io> Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fixed bug run_mlm_flax_stream.py Fixed bug caused by an update to tokenizer keys introduced in recent transformers versions (between `4.6.2` and `4.18.0`) where additional keys were introduced to the tokenizer output. * Update run_mlm_flax_stream.py * adding missing paranthesis * formatted to black * remove cols from dataset instead * reformat to black * moved rem. columns to map * formatted to black Co-authored-by: KennethEnevoldsen <kennethcenevolsen@gmail.com>

…#17219) * adding partial checkpoint support for optimizer state * formatted trainer.py * Refactoring based on comments * reformatting * Update src/transformers/trainer.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> * Update src/transformers/trainer.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Cavdar <dcavdar@a07817b12d7e.ant.amazon.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* add new preprocessing arguments * add new filters * add new filters to readme * fix config and test count, update function names and docstrings * reformat code * update readme * Update readme * rename config_test filter Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * rename few_assignments filter Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * rename tokenizer in arguments Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * rename functions and add limit_line argument for config_test filter * update threshold for config_test filter Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>

* add pretokenization arguments * add pretokenization script * add support for pretokenized data * reformat code * fix run command for training * fix model call from config * remove a package * add comments on pretokenization in the readme * remove explicit parallelization Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * update readme Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * update readme -remove username Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * update readme -remove username Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> * keep data parallelization * reformat code * reformat code * update readme * reformat code * Update examples/research_projects/codeparrot/README.md Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com> Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>

* save intermediate * add wav2vec2 conformer * add more code * more * first test passes * make all checkpoints work * update * up * more clean ups * save clean-up * save clean-up * save more * remove bogus * finalize design conformer * remove vision * finish all tests * more changes * finish code * add doc tests * add slow tests * fix autoconfig test * up * correct docstring * up * update * fix * Apply suggestions from code review Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com> * Update docs/source/en/model_doc/wav2vec2-conformer.mdx * upload * save copied from * correct configs * fix model outputs * add to docs * fix imports * finish * finish code * correct copied from * correct again * correct make fix * improve make fix copies * save * correct fix copy from * correct init structure * correct * fix import * apply suggestions Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com> Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

…` classes (huggingface#17639) * add new bloom classes * (feat) add bloom classification tests; make style * style: change import in test * add some typehints to bloom classes * merge main into branch * fix: input checking in bloom seq classification * fix tests * change model class tests * fix few tests - more tests should pass - one test left * make token classifier return hidden states * style: make BLOOM typehints consistent Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com> Co-authored-by: younesbelkada <younesbelkada@gmail.com> Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

NielsRogge · 2022-06-15T13:29:25Z

Hi,

Sorry for the late reply. I'll review now.

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py

kelvinAI · 2022-06-16T10:08:56Z

LayoutLMv2FeatureExtractor constructor must be modified to accept tesseract_config instead of tess_config for this change. Hang on, I'll work on it.

… during feature extraction
- Added optional 'tess_config' kwarg when setting up LayoutLMV2 processor that is used by pytesseract during feature extraction
- E.g. can be used to modify psm values by setting tess_config to '--psm 7'
- Different psm values significantly influence the output of layoutlmv2

kelvinAI · 2022-06-16T14:11:36Z

Tried to rebase and merge with upstream but it is now changing too many files. I've created a fresh new PR here #17733

LysandreJik requested a review from NielsRogge April 29, 2022 12:45

ydshieh and others added 28 commits May 13, 2022 13:47

install dev. version of accelerate (huggingface#17243)

7198b63

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Fix push CI channel (huggingface#17242)

506899d

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Add PR title to push CI report (huggingface#17246)

50d1867

* add PR title to push CI report * add link Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Fix obvious typos in flax decoder impl (huggingface#17279)

e86faec

Change config.encoder_ffn_dim -> config.decoder_ffn_dim for decoder.

TF - Fix convnext classification example (huggingface#17261)

d3d87b4

Remove next sentence prediction from supported ONNX tasks (huggingfac…

a5d1839

…e#17276)

Align logits and labels in OPT (huggingface#17237)

95b6bef

Mlflowcallback fix nonetype error (huggingface#17171)

2f611f8

* Fix edge cases TypeError: 'NoneType' object is not callable * fix style

Automatically sort auto mappings (huggingface#17250)

ddb1a47

* Automatically sort auto mappings * Better class extraction * Some auto class magic * Adapt test and underlying behavior * Remove re-used config * Quality

Make TrainerHyperParameterSigOptIntegrationTest slow test (huggingfac…

66b3e10

…e#17288) Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Better error in the Auto API when a dep is missing (huggingface#17289)

9b0d286

Fix FlavaForPreTrainingIntegrationTest CI test (huggingface#17232)

3fb82f7

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Use the PR URL in CI report (huggingface#17269)

8600d77

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

logging documentation update (huggingface#17174)

053a80c

* logging documentation * style Co-authored-by: Sander Land <sander@chatdesk.com>

docs(transformers): fix typo (huggingface#17263)

6cb7187

Add Tensorflow Swin model (huggingface#16988)

f6a6388

Co-authored-by: Matt <Rocketknight1@users.noreply.github.com> Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

[Tests] Fix slow opt tests (huggingface#17282)

e705e12

* fix opt tests * remove unused tok * make style * make flake8 happy * Update tests/models/opt/test_modeling_opt.py

Fix test_model_parallelization (huggingface#17249)

f0395cf

* Fix test_model_parallelization * Modify

Fix missing job action button in CI report (huggingface#17270)

1ac2b8f

* use matrix.machine_type * fix job names used in job_link Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

haileyschoelkopf and others added 4 commits June 14, 2022 17:10

FX function refactor (huggingface#17625)

7ec9128

* Function refactor * Update src/transformers/utils/fx.py Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

[LongT5] disable model parallel test (huggingface#17702)

120649b

fix tolerance for a bloom slow test (huggingface#17634)

d453ea6

NielsRogge reopened this Jun 15, 2022

NielsRogge reviewed Jun 15, 2022

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py (outdated)

NielsRogge reviewed Jun 15, 2022

src/transformers/models/layoutlmv2/feature_extraction_layoutlmv2.py (outdated)

ydshieh and others added 10 commits June 15, 2022 17:43

Change push CI to run on workflow_run event (huggingface#17692)

b76290f

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>

Documentation: RemBERT fixes (huggingface#17641)

242cc6e

* rembert: fix python codeblock * rembert: use correct google/rembert checkpoint name in documentation * rembert: use correct google/rembert checkpoint name in TF documentation

[Wav2Vec2Conformer] Official release (huggingface#17709)

7f14839

* [Wav2Vec2Conformer] Official release * remove from not-in-readme

Revert "Change push CI to run on workflow_run event (huggingface#17692)…

50415b8

…" (huggingface#17717) This reverts commit b76290f.

Update requirements.txt (huggingface#17719)

6ebeeee

CLI: Add flag to push TF weights directly into main (huggingface#17720)

c3c62b5

* Add flag to push weights directly into main

normalize keys_to_ignore (huggingface#17722)

66f8933

Sort the model doc Toc Alphabetically (huggingface#17723)

3981ee8

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…

88a34e8

…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…

7e164f3

…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

kelvinAI and others added 3 commits June 16, 2022 18:17

Updated variable names to be more explicit

34d060c

Fix mask token in the example (huggingface#17725)

2eadb7e

VIsualBert uses bert-base-uncased tokenizer, therefore, instead of {mask}, the mask token should be [MASK]

Fix tf shared embedding (huggingface#17730)

f44e2c2

* fix the naming * from pt in test for now * make style * slow test and removed from_pt

kelvinAI requested a review from NielsRogge June 16, 2022 12:27

kelvinAI and others added 5 commits June 16, 2022 21:47

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…

57f2e40

…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…

1a5ec60

…v2.py Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>

Updated variable names to be more explicit

cf7c657

Merge branch 'layoutlmv2_tessconfig' of https://github.com/kelvinAI/t…

fd79650

…ransformers into layoutlmv2_tessconfig

kelvinAI closed this Jun 16, 2022
