Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added option to modify config parameter used by Tesseract in LayoutLMV2/LayoutXLM Processor #17005

Closed
wants to merge 366 commits into from

Conversation

kelvinAI
Copy link
Contributor

@kelvinAI kelvinAI commented Apr 29, 2022

What does this PR do?

Giving user option to set config parameter used by Tesseract when performing feature extraction. Eg. to change psm levels while performing transcription by passing in '--psm 10' to config parameter while invoking image_to_data

It is shown that changing the psm values greatly influences the end result of LayoutLMV2/XLM, and the specific psm value is different depending on the document formatting. Refer : PSM

pytesseract.image_to_data(image, lang=lang, output_type="dict", config="--psm 10") 

Users can now set the tesseract config parameter during Processor initialization, like so:

processor = LayoutLMv2Processor.from_pretrained("microsoft/layoutlmv2-base-uncased", ocr_lang="eng", tesseract_config="--psm 5")

Before submitting

  • [❌] This PR fixes a typo or improves the docs (you can dismiss the other checks if that's the case).
  • [✔️] Did you read the contributor guideline,
    Pull Request section?
  • [❌] Was this discussed/approved via a Github issue or the forum? Please add a link
    to it if that's the case.
  • [✔️] Did you make sure to update the documentation with your changes? Here are the
    documentation guidelines, and
    here are tips on formatting docstrings.
  • [❌] Did you write any new necessary tests?

Feel free to modify as needed.
Thanks

@NielsRogge @LysandreJik

@HuggingFaceDocBuilderDev
Copy link

HuggingFaceDocBuilderDev commented Apr 29, 2022

The documentation is not available anymore as the PR was closed or merged.

@LysandreJik LysandreJik requested a review from NielsRogge April 29, 2022 12:45
ydshieh and others added 28 commits May 13, 2022 13:47
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* add PR title to push CI report

* add link

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
…huggingface#17076)

* [ fast_tokenizers.mdx ] - Added translation to portuguese to tutorial

* Delete docs/source/pt-br directory

* [ fast_tokenizers.mdx ] - Continuing work on file

* [ fast_tokenizers.mdx ] - Continuing work on file

* Add fast tokenizers to _toctree.yml

* Eliminated config and toctree.yml

* Nits in fast_tokenizers.mdx

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>
…6184)

* Translated version of model_sharing to spanish

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Update docs/source_es/model_sharing.mdx

* Addind model sharing to _toctree.yml

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>
* file copied and toctree updated

* Intro and configuration translated

* model section translated

* enter hotfix

* Translation over, correction pending

* Typos and corrections

* Update docs/source/es/create_a_model.mdx

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

* Update docs/source/es/create_a_model.mdx

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

* Update docs/source/es/create_a_model.mdx

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

* Update docs/source/es/create_a_model.mdx

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>

Co-authored-by: Omar U. Espejel <espejelomar@gmail.com>
Change config.encoder_ffn_dim -> config.decoder_ffn_dim for decoder.
* [doc] performance/scalability revamp

* link the new docs

* no :

* mixed precision

* work on the first doc

* expand the main doc

* Trigger CI

* style

* revamp single GPU training section

* work on training performance

* remove files not used anymore or will be added later

* final touches

* fix rebase

* Add hardware section to toctree

* fix toctree again

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* remove `fast_tokenizers` entry that was copied in rebase

* add warning about DP vs DDP

* remove todo

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* fix missing closure of codeblock

* Update docs/source/en/perf_train_gpu_many.mdx

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* sync with huggingface#16860

* update toc

Co-authored-by: leandro <leandro.vonwerra@spoud.io>
Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fixed bug run_mlm_flax_stream.py

Fixed bug caused by an update to tokenizer keys introduced in recent transformers versions (between `4.6.2` and `4.18.0`) where additional keys were introduced to the tokenizer output.

* Update run_mlm_flax_stream.py

* adding missing paranthesis

* formatted to black

* remove cols from dataset instead

* reformat to black

* moved rem. columns to map

* formatted to black

Co-authored-by: KennethEnevoldsen <kennethcenevolsen@gmail.com>
…#17219)

* adding partial checkpoint support for optimizer state

* formatted trainer.py

* Refactoring based on comments

* reformatting

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

* Update src/transformers/trainer.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>

Co-authored-by: Cavdar <dcavdar@a07817b12d7e.ant.amazon.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* add new preprocessing arguments

* add new filters

* add new filters to readme

* fix config and test count, update function names and docstrings

* reformat code

* update readme

* Update readme

* rename config_test filter

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename few_assignments filter

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename tokenizer in arguments

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* rename functions and add limit_line argument for config_test filter

* update threshold for config_test filter

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
* add pretokenization arguments

* add pretokenization script

* add support for pretokenized data

* reformat code

* fix run command for training

* fix model call from config

* remove a package

* add comments on pretokenization in the readme

* remove explicit parallelization

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme -remove username

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* update readme -remove username

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

* keep data parallelization

* reformat code

* reformat code

* update readme

* reformat code

* Update examples/research_projects/codeparrot/README.md

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>

Co-authored-by: Leandro von Werra <lvwerra@users.noreply.github.com>
Co-authored-by: Loubna ben allal <loubnabenallal@gmail.com>
* Fix edge cases TypeError: 'NoneType' object is not callable

* fix style
* Automatically sort auto mappings

* Better class extraction

* Some auto class magic

* Adapt test and underlying behavior

* Remove re-used config

* Quality
…e#17288)

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* logging documentation

* style

Co-authored-by: Sander Land <sander@chatdesk.com>
Co-authored-by: Matt <Rocketknight1@users.noreply.github.com>
Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
* fix opt tests

* remove unused tok

* make style

* make flake8 happy

* Update tests/models/opt/test_modeling_opt.py
* Fix test_model_parallelization

* Modify
* save intermediate

* add wav2vec2 conformer

* add more code

* more

* first test passes

* make all checkpoints work

* update

* up

* more clean ups

* save clean-up

* save clean-up

* save more

* remove bogus

* finalize design conformer

* remove vision

* finish all tests

* more changes

* finish code

* add doc tests

* add slow tests

* fix autoconfig test

* up

* correct docstring

* up

* update

* fix

* Apply suggestions from code review

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>

* Update docs/source/en/model_doc/wav2vec2-conformer.mdx

* upload

* save copied from

* correct configs

* fix model outputs

* add to docs

* fix imports

* finish

* finish code

* correct copied from

* correct again

* correct make fix

* improve make fix copies

* save

* correct fix copy from

* correct init structure

* correct

* fix import

* apply suggestions

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
Co-authored-by: Anton Lozhkov <aglozhkov@gmail.com>
* use matrix.machine_type

* fix job names used in job_link

Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
haileyschoelkopf and others added 4 commits June 14, 2022 17:10
…` classes (huggingface#17639)

* add new bloom classes

* (feat) add bloom classification tests; make style

* style: change import in test

* add some typehints to bloom classes

* merge main into branch

* fix: input checking in bloom seq classification

* fix tests

* change model class tests

* fix few tests

- more tests should pass
- one test left

* make token classifier return hidden states

* style: make BLOOM typehints consistent

Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>

Co-authored-by: younesbelkada <younesbelkada@gmail.com>
Co-authored-by: Younes Belkada <49240599+younesbelkada@users.noreply.github.com>
* Function refactor

* Update src/transformers/utils/fx.py

Co-authored-by: Sylvain Gugger <35901082+sgugger@users.noreply.github.com>
@NielsRogge NielsRogge reopened this Jun 15, 2022
@NielsRogge
Copy link
Contributor

Hi,

Sorry for the late reply. I'll review now.

ydshieh and others added 10 commits June 15, 2022 17:43
Co-authored-by: ydshieh <ydshieh@users.noreply.github.com>
* rembert: fix python codeblock

* rembert: use correct google/rembert checkpoint name in documentation

* rembert: use correct google/rembert checkpoint name in TF documentation
* [Wav2Vec2Conformer] Official release

* remove from not-in-readme
…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
@kelvinAI
Copy link
Contributor Author

LayoutLMv2FeatureExtractor constructor must be modified to accept tesseract_config instead of tess_config for this change. Hang on, I'll work on it.

kelvinAI and others added 3 commits June 16, 2022 18:17
VIsualBert uses bert-base-uncased tokenizer, therefore, instead of {mask}, the mask token should be [MASK]
* fix the naming

* from pt in test for now

* make style

* slow test and removed from_pt
@kelvinAI kelvinAI requested a review from NielsRogge June 16, 2022 12:27
kelvinAI and others added 5 commits June 16, 2022 21:47
… during feature extraction

- Added optional 'tess_config' kwarg when setting up LayoutLMV2 processor that is used by pytesseract during feature extraction
- Eg. Can be used to modify psm values by setting tess_config to '--psm 7'
- Different psm values significantly influences the output of layoutlmv2
…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
…v2.py

Co-authored-by: NielsRogge <48327001+NielsRogge@users.noreply.github.com>
@kelvinAI
Copy link
Contributor Author

Tried to rebase and merge with upstream but it is now changing too many files. I've created a fresh new PR here #17733

@kelvinAI kelvinAI closed this Jun 16, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.