Added option to modify config parameter used by Tesseract in LayoutLMV2/LayoutXLM Processor #17005

Closed
wants to merge 366 commits into from
Changes from 1 commit
Commits
7198b63
install dev. version of accelerate (#17243)
ydshieh May 13, 2022
506899d
Fix push CI channel (#17242)
ydshieh May 13, 2022
50d1867
Add PR title to push CI report (#17246)
ydshieh May 13, 2022
f902481
[ fast_tokenizers.mdx ] - Added translation to portuguese to tutorial…
Fellip15 May 13, 2022
16be422
Translated version of model_sharing.mdx doc to spanish (#16184)
Gerard-170 May 13, 2022
ee393c0
Guide to create custom models in Spanish (#17158)
ignacioct May 13, 2022
e86faec
Fix obvious typos in flax decoder impl (#17279)
cloudhan May 16, 2022
d3d87b4
TF - Fix convnext classification example (#17261)
gante May 16, 2022
71abd3a
[WIP] [doc] performance/scalability revamp (#15723)
stas00 May 16, 2022
71d18d0
fixed bug in run_mlm_flax_stream.py (#17203)
KennethEnevoldsen May 16, 2022
518dd12
Updated checkpoint support for Sagemaker Model Parallel (#17219)
cavdard May 16, 2022
e730e12
Update codeparrot data preprocessing (#16944)
loubnabnl May 16, 2022
05a9057
CodeParrot data pretokenization (#16932)
loubnabnl May 16, 2022
a5d1839
Remove next sentence prediction from supported ONNX tasks (#17276)
lewtun May 16, 2022
95b6bef
Align logits and labels in OPT (#17237)
MichelBartels May 16, 2022
2f611f8
Mlflowcallback fix nonetype error (#17171)
orieg May 16, 2022
ddb1a47
Automatically sort auto mappings (#17250)
sgugger May 16, 2022
66b3e10
Make TrainerHyperParameterSigOptIntegrationTest slow test (#17288)
ydshieh May 16, 2022
9b0d286
Better error in the Auto API when a dep is missing (#17289)
sgugger May 16, 2022
3fb82f7
Fix FlavaForPreTrainingIntegrationTest CI test (#17232)
ydshieh May 16, 2022
8600d77
Use the PR URL in CI report (#17269)
ydshieh May 16, 2022
053a80c
logging documentation update (#17174)
sanderland May 16, 2022
6cb7187
docs(transformers): fix typo (#17263)
k-zehnder May 16, 2022
f6a6388
Add Tensorflow Swin model (#16988)
amyeroberts May 16, 2022
e705e12
[Tests] Fix slow opt tests (#17282)
patrickvonplaten May 16, 2022
f0395cf
Fix test_model_parallelization (#17249)
lkm2835 May 16, 2022
5a99573
Add Wav2Vec2Conformer (#16812)
patrickvonplaten May 16, 2022
1ac2b8f
Fix missing job action button in CI report (#17270)
ydshieh May 17, 2022
a26ab95
Fix wrong PT/TF categories in CI report (#17272)
ydshieh May 17, 2022
ec7f8af
[ConvNeXT] Fix drop_path_rate (#17280)
NielsRogge May 17, 2022
6d21142
fix retribert's `test_torch_encode_plus_sent_to_model` (#17231)
SaulLu May 17, 2022
651e48e
Fix tests of mixed precision now that experimental is deprecated (#17…
Rocketknight1 May 17, 2022
349f1c8
Rewrite TensorFlow train_step and test_step (#17057)
Rocketknight1 May 17, 2022
1f13ba8
correct opt (#17301)
patrickvonplaten May 17, 2022
28a0811
Improve mismatched sizes management when loading a pretrained model (…
regisss May 17, 2022
10704e1
[Test] Fix W2V-Conformer integration test (#17303)
patrickvonplaten May 17, 2022
38ddab1
Doctest longformer (#16441)
KMFODA May 17, 2022
986dd5c
Fix style
sgugger May 17, 2022
032d63b
Fix dummy creation script (#17304)
sgugger May 17, 2022
0511305
Add PR author in CI report + merged by info (#17298)
ydshieh May 17, 2022
bad3583
Add support for pretraining recurring span selection to Splinter (#17…
jvcop May 17, 2022
d9050dc
[LED] fix global_attention_mask not being passed for generation and d…
caesar-one May 17, 2022
c352640
fix (#17310)
patrickvonplaten May 17, 2022
d6b8e9c
Add trajectory transformer (#17141)
CarlCochet May 17, 2022
7ba1d4e
Add type hints for ProphetNet (Pytorch) (#17223)
jQuinRivero May 18, 2022
60ad734
[T5] Fix init in TF and Flax for pretraining (#17294)
patrickvonplaten May 18, 2022
1c9d1f4
Updating the docs for `max_seq_len` in QA pipeline (#17316)
Narsil May 18, 2022
2cb2ea3
Accepting real pytorch device as arguments. (#17318)
Narsil May 18, 2022
fe28eb9
remove (#17325)
ydshieh May 18, 2022
91ede48
Fix typo (#17328)
kamalkraj May 18, 2022
5fdb54e
Add Information Gain Filtration algorithm (#16953)
mraunak May 18, 2022
4710702
Fix style
sgugger May 18, 2022
adc0ff2
Add CvT (#17299)
NielsRogge May 18, 2022
6da76b9
Add onnx export cuda support (#17183)
JingyaHuang May 18, 2022
b3b9f99
Fix test_t5_decoder_model_past_large_inputs (#17320)
ydshieh May 18, 2022
060fe61
Not send successful report (#17329)
ydshieh May 18, 2022
6e195eb
docs for typical decoding (#17186)
jadermcs May 18, 2022
1762ded
Fix metric calculation in examples and setup tests to run on multi-gp…
muellerzr May 18, 2022
6aad387
fix (#17337)
ydshieh May 18, 2022
1b20c97
Fix ci_url might be None (#17332)
ydshieh May 18, 2022
3601aa8
[tests] fix copy-n-paste error (#17312)
stas00 May 18, 2022
a4386d7
[BC] Fixing usage of text pairs (#17324)
Narsil May 19, 2022
2b28229
Adding `batch_size` test to QA pipeline. (#17330)
Narsil May 19, 2022
e8714c0
[OPT] Run test in lower precision on GPU (#17353)
patrickvonplaten May 19, 2022
518bd02
[Generation] Fix Transition probs (#17311)
patrickvonplaten May 19, 2022
5d6feec
fix for 17292 (#17293)
nadahlberg May 19, 2022
48c2269
Fix bug in Wav2Vec2 pretrain example (#17326)
ddobokki May 19, 2022
5419205
[Test OPT] Add batch generation test opt (#17359)
patrickvonplaten May 19, 2022
3fd7de4
Pin dill to fix examples (#17368)
sgugger May 20, 2022
b9bb417
Fix a typo relative_postion_if_large -> relative_position_if_large (#…
stancld May 20, 2022
b48ac1a
Fix CodeParrot training script (#17291)
loubnabnl May 23, 2022
7b8cb26
Correct & Improve Doctests for LayoutLMv2 (#17168)
garyhlai May 23, 2022
c86aad6
Fix cvt docstrings (#17367)
AnugunjNaman May 23, 2022
1cd01b0
Fix Comet ML integration (#17381)
mxschmdt May 23, 2022
2e7e428
Traced models serialization and torchscripting fix (#17206)
michaelbenayoun May 23, 2022
56f5059
Use Accelerate in `from_pretrained` for big model inference (#17341)
sgugger May 23, 2022
71cced8
OPTForCausalLM lm_head input size should be config.word_embed_proj_di…
vfbd May 23, 2022
13541b4
Add support for `device_map="auto"` to OPT (#17382)
sgugger May 23, 2022
31ee80d
Add LayoutLMv3 (#17060)
NielsRogge May 24, 2022
d980929
Enabling `imageGPT` auto feature extractor. (#16871)
Narsil May 24, 2022
374a2f6
Clean up CLIP tests (#17380)
NielsRogge May 24, 2022
71e6027
[WIP] Adding GPT-NeoX-20B (#16659)
zphang May 24, 2022
1ef9a1e
Bump tensorflow in /examples/research_projects/decision_transformer (…
dependabot[bot] May 24, 2022
4d727bd
Fix expected value for OPT test `test_inference_no_head` (#17395)
ydshieh May 25, 2022
bd908e9
Fix README localizer script (#17407)
sgugger May 25, 2022
56b35ce
Make check_init script more robust and clean inits (#17408)
sgugger May 25, 2022
31484af
Add test for new model parallelism features (#17401)
sgugger May 25, 2022
897a8dd
Support compilation via Torchdynamo, AOT Autograd, NVFuser (#17308)
anijain2305 May 25, 2022
35e2d13
Upd AutoTokenizer.from_pretrained doc examples (#17416)
c00k1ez May 25, 2022
284fc6c
Add link to Hub PR docs in model cards (#17421)
lewtun May 25, 2022
740a157
fix link in performance docs (#17419)
lvwerra May 25, 2022
a9eca74
Wav2vec2 finetuning shared file system (#17423)
patrickvonplaten May 25, 2022
70484a8
Adding the Portuguese version of the tasks/sequence_classification.md…
jonatasgrosman May 25, 2022
5e7f085
Added es version of bertology.mdx doc (#17255)
jQuinRivero May 25, 2022
8f46ac9
Spanish translation of the files sagemaker.mdx and image_classificati…
SimplyJuanjo May 25, 2022
2295bca
Spanish translation of the file preprocessing.mdx (#16299)
yharyarias May 26, 2022
7535d92
Pin protobouf that breaks TensorBoard in PyTorch (#17440)
sgugger May 26, 2022
98f6e1e
Fix model parallelism test (#17439)
sgugger May 26, 2022
7999ec1
[OPT] Fix bos token id default (#17441)
patrickvonplaten May 26, 2022
d156898
Improve notrainer examples (#17449)
pacman100 May 27, 2022
13fd673
Fix typo (remove parenthesis) (#17415)
mikcnt May 31, 2022
04681c1
typo IBERT in __repr__ quant_mode (#17398)
scratchmex May 31, 2022
28d0048
Fx support for multiple model architectures (#17393)
michaelbenayoun May 31, 2022
2ef09ec
Fix nits (#17349)
omarespejel May 31, 2022
b0e0ac8
[Generate] Fix output scores greedy search (#17442)
patrickvonplaten May 31, 2022
c1a1386
Fix ViTMAEModelTester (#17470)
ydshieh May 31, 2022
975dd2b
TF: GPT-2 generation supports left-padding (#17426)
gante May 31, 2022
567d9c0
Disk offload fix (#17428)
sgugger May 31, 2022
5af3895
Added XLM onnx config (#17030)
nandwalritik May 31, 2022
400b309
Docker image build in parallel (#17434)
ydshieh May 31, 2022
8f8b3cb
Fix checkpoint name (#17484)
ydshieh May 31, 2022
dfc3846
Setup for Italian translation and add quicktour.mdx translation (#17472)
mfumanelli May 31, 2022
52e7c92
Add HF.co for PRs / Issues regarding specific model checkpoints (#17485)
patrickvonplaten May 31, 2022
6ee1474
Accumulate tokens into batches in `PreTrainedTokenizerBase.add_tokens…
Witiko May 31, 2022
f394a2a
[Json configs] Make json prettier for all saved tokenizer files & ens…
patrickvonplaten May 31, 2022
7822a9b
Opt in flax and tf (#17388)
ArthurZucker May 31, 2022
ba286fe
[GPT2Tokenizer] Fix GPT2 with bos token (#17498)
patrickvonplaten May 31, 2022
4f38808
Add OnnxConfig for SqueezeBert iss17314 (#17315)
artemisep Jun 1, 2022
811da2b
Fixed wrong error message for missing weight file (#17216)
123jimin Jun 1, 2022
24092b1
Fix typo of variable names for key and query projection layer (#17155)
Jun 1, 2022
d91da4c
Add warning when using older version of torch for ViltFeatureExtracto…
xhluca Jun 1, 2022
b1160c0
Fix wav2vec2 export onnx model with attention_mask error (#16004)
nilboy Jun 1, 2022
bdc0171
Refactor classes to inherit from nn.Module instead of nn.Sequential (…
amyeroberts Jun 1, 2022
3042ea4
Fix `tokenizer` type annotation in `pipeline(...)` (#17500)
willfrey Jun 1, 2022
6813439
Exclude Databricks from notebook env (#17496)
sgugger Jun 1, 2022
4390151
Fix MP and CPU offload tests for Funnel and GPT-Neo (#17503)
sgugger Jun 1, 2022
4d1ce39
Debug LukeForMaskedLM (#17499)
ryokan0123 Jun 1, 2022
693720e
Fix LayoutXLMProcessorTest (#17506)
ydshieh Jun 1, 2022
1d2b57b
Fix CTRL tests (#17508)
ydshieh Jun 1, 2022
84aaadd
Adding LeViT Model by Facebook (#17466)
AnugunjNaman Jun 1, 2022
028d4b7
Deal with the error when task is regression (#16330)
fireindark707 Jun 1, 2022
3766df4
Fix flakey no-trainer test (#17515)
muellerzr Jun 1, 2022
ca1f1c8
CLI: tool to convert PT into TF weights and open hub PR (#17497)
gante Jun 1, 2022
58fb3c9
Fix Tapas tests (#17510)
ydshieh Jun 1, 2022
0932adb
Split push CI into 2 workflows (#17369)
ydshieh Jun 2, 2022
659b27f
Print more library versions in CI (#17384)
ydshieh Jun 2, 2022
216499b
Fix CI tests hang forever (#17471)
ydshieh Jun 2, 2022
f128ccb
Clean README in post release job as well. (#17519)
sgugger Jun 2, 2022
588d8f1
Fix when Accelerate is not installed (#17518)
sgugger Jun 2, 2022
048dd73
Check list of models in the main README and sort it (#17517)
sgugger Jun 2, 2022
085321c
Update configuration_auto.py (#17527)
kamalkraj Jun 2, 2022
046c5ea
Implemented loss for training AudioFrameClassification (#17513)
MorenoLaQuatra Jun 2, 2022
2f59ad1
[trainer/deepspeed] load_best_model (reimplement re-init) (#17151)
stas00 Jun 2, 2022
013462c
fix OPT-Flax CI tests (#17512)
ArthurZucker Jun 2, 2022
1c220ce
Update URL for Hub PR docs (#17532)
lewtun Jun 2, 2022
607acd4
Add Gated-SiLU to T5 (#17420)
DanielHesslow Jun 3, 2022
5c17918
Allow from transformers import TypicalLogitsWarper (#17477)
teticio Jun 3, 2022
babeff5
Add support for Perceiver ONNX export (#17213)
deutschmn Jun 3, 2022
1c57242
Fix bug - layer names and activation from previous refactor (#17524)
amyeroberts Jun 3, 2022
8343901
Fix all offload and MP tests (#17533)
sgugger Jun 3, 2022
254d9c0
Update run_glue_no_trainer.py (#17546)
bofenghuang Jun 3, 2022
c4e58cd
Clean imports to fix test_fetcher (#17531)
sgugger Jun 3, 2022
72f5b94
Update index.mdx (#17547)
BritneyMuller Jun 3, 2022
26e5e12
[deepspeed] fix load_best_model test (#17550)
stas00 Jun 3, 2022
da71df1
fix integration test levit (#17555)
AnugunjNaman Jun 6, 2022
4aed1dc
Adding the Portuguese version of the tasks/token_classification.mdx d…
jonatasgrosman Jun 6, 2022
f6ad0e0
Add installation.mdx Italian translation (#17530)
mfumanelli Jun 6, 2022
2e37ef3
Remove RuntimeErrors for NaN-checking in 20B (#17563)
zphang Jun 6, 2022
34a886f
Translation/italian: added pipeline_tutorial.mdx [Issue: #17459] (#17…
nickprock Jun 6, 2022
d28b7aa
[deepspeed / testing] reset global state (#17553)
stas00 Jun 6, 2022
19a8a30
Add magic method to our TF models to convert datasets with column inf…
Rocketknight1 Jun 6, 2022
ad71965
Remove circular imports in layoutlm/__init__.py (#17576)
regisss Jun 6, 2022
9aa230a
Use latest stable PyTorch/DeepSpeed for Push & Scheduled CI (#17417)
ydshieh Jun 7, 2022
b6a65ae
Fix circular import in onnx.utils (#17577)
sgugger Jun 7, 2022
b118730
Fix gendered sentence in Spanish translation(#17558)
omarespejel Jun 7, 2022
9e72eb4
Skip disk offload test for T5
sgugger Jun 7, 2022
3cab902
Add examples telemetry (#17552)
sgugger Jun 7, 2022
5c8f601
Fx support for Deberta-v[1-2], Hubert and LXMERT (#17539)
michaelbenayoun Jun 7, 2022
706bb83
quicktour.mdx en -> pt translation (#17074)
vitorfrois Jun 7, 2022
119e3c0
M-CTC-T Model (#16402)
cwkeam Jun 7, 2022
c6cea5a
fix (#17589)
ydshieh Jun 7, 2022
78c695e
CLI: add stricter automatic checks to `pt-to-tf` (#17588)
gante Jun 8, 2022
9d99489
Add TFData2VecVision for semantic segmentation (#17271)
sayakpaul Jun 8, 2022
264128c
Explicit versions in docker files (#17586)
ydshieh Jun 8, 2022
ae7bae8
fix `train_new_from_iterator` in the case of byte-level tokenizers (#…
SaulLu Jun 8, 2022
34097b3
Extend Transformers Trainer Class to Enable CPU AMP and Integrate Int…
jianan-gu Jun 8, 2022
ee82c86
Fix link for community notebooks (#17602)
ngoquanghuy99 Jun 8, 2022
7d0b6fc
CLI: Properly detect encoder-decoder models (#17605)
gante Jun 8, 2022
e160a5d
Fix telemetry URL (#17608)
sgugger Jun 8, 2022
e9d5138
TF: Merge PT and TF behavior for Bart when no decoder_input_ids are p…
gante Jun 8, 2022
66e8656
CLI: Print all different tensors on exception (#17612)
gante Jun 8, 2022
dfc76b2
has_attentions - consistent test skipping logic and tf tests (#17495)
amyeroberts Jun 9, 2022
ca2a55e
BLOOM (#17474)
younesbelkada Jun 9, 2022
5323094
Add ONNX support for ResNet (#17585)
regisss Jun 9, 2022
e0be053
Add ONNX support for ConvNeXT (#17627)
regisss Jun 9, 2022
9fc3423
Use shape_list to safely get shapes for Swin (#17591)
amyeroberts Jun 9, 2022
2908064
Mention in the doc we drop support for fairscale (#17610)
sgugger Jun 9, 2022
2351729
Adding `top_k` argument to `text-classification` pipeline. (#17606)
Narsil Jun 9, 2022
c70dacd
Fix very long job failure text in Slack report (#17630)
ydshieh Jun 9, 2022
90ed9ae
fix use_amp rename after pr 17138 (#17636)
stas00 Jun 9, 2022
c38f4e1
Running a pipeline of `float16`. (#17637)
Narsil Jun 9, 2022
75343de
[modeling_utils] torch_dtype/auto floating dtype fixes (#17614)
stas00 Jun 9, 2022
da0bed5
Pre-build DeepSpeed (#17607)
ydshieh Jun 9, 2022
fba0b6a
convert assertion to raised exception in debertav2 (#17619)
sam-h-bean Jun 9, 2022
df1ec6b
didn't exist in pt-1.9 (#17644)
stas00 Jun 9, 2022
e0b58fb
Translation/autoclass (#17615)
mfumanelli Jun 10, 2022
af4a1ec
Skip tests until bug is fixed. (#17646)
sgugger Jun 10, 2022
6e93d94
Move Clip image utils to image_utils.py (#17628)
alaradirik Jun 10, 2022
49becba
Enable crop_center method to handle (W, H, C) images (#17626)
alaradirik Jun 10, 2022
1d46330
Bump cookiecutter in /examples/research_projects/decision_transformer…
dependabot[bot] Jun 10, 2022
2bc3051
Fix style
LysandreJik Jun 10, 2022
cdaed36
Fix style
LysandreJik Jun 10, 2022
fd1e670
Add skip logic for attentions test - Levit (#17633)
amyeroberts Jun 10, 2022
b880909
Fix dtype getters (#17656)
sgugger Jun 10, 2022
35b1603
Fixes #17128 . (#17356)
mygithubid1 Jun 10, 2022
c99ddcc
🐛 Properly raise `RepoNotFoundError` when not authenticated (#17651)
SBrandeis Jun 10, 2022
3114df4
update README.md (#17657)
loubnabnl Jun 10, 2022
5e428b7
[BigBirdFlaxTests] Make tests slow (#17658)
patrickvonplaten Jun 10, 2022
b4eef63
[Data2Vec] Speed up test (#17660)
patrickvonplaten Jun 10, 2022
13e875c
[Generation Test] Make fast test actually fast (#17661)
patrickvonplaten Jun 10, 2022
39e1461
fix typo from emtpy to empty (#17643)
domenicrosati Jun 10, 2022
224bde9
Avoid GPU OOM for a TF Rag test (#17638)
ydshieh Jun 10, 2022
a5282ab
Fix typo in adding_a_new_model README (#17679)
ayushtues Jun 13, 2022
66336dc
Add Visual Question Answering (VQA) pipeline (#17286)
sijunhe Jun 13, 2022
c1daf72
Fixed documentation typo, parameter name is evaluation_strategy, not …
sainttttt Jun 13, 2022
7308358
explicitly set utf8 for Windows (#17664)
Jun 13, 2022
a1344db
Fix dtype getter (#17668)
sgugger Jun 13, 2022
5483388
Update modeling_gpt_neox.py (#17575)
willfrey Jun 13, 2022
457d4a3
Add Ray's scope to training arguments (#17629)
Jun 13, 2022
4aabf9b
enable cpu distribution training using mpirun (#17570)
sywangyi Jun 13, 2022
1690094
Add FP16 Support for SageMaker Model Parallel (#17386)
haohanchen-aws Jun 13, 2022
a72f1c9
Add `LongT5` model (#16792)
stancld Jun 13, 2022
df15703
Fix doc builder Dockerfile (#17435)
ydshieh Jun 14, 2022
3b29c9f
Extend Transformers Trainer Class to Enable PyTorch Torchscript for I…
jianan-gu Jun 14, 2022
53496ac
[LongT5] Rename checkpoitns (#17700)
patrickvonplaten Jun 14, 2022
9068fa6
Rag end2end new (#17650)
Jun 14, 2022
3960ce9
Include a comment to reflect Amy's contributions (#17689)
sayakpaul Jun 14, 2022
bd43151
Swin main layer (#17693)
amyeroberts Jun 14, 2022
edb672a
Add `BloomForSequenceClassification` and `BloomForTokenClassification…
haileyschoelkopf Jun 14, 2022
7ec9128
FX function refactor (#17625)
michaelbenayoun Jun 14, 2022
120649b
[LongT5] disable model parallel test (#17702)
patil-suraj Jun 14, 2022
d453ea6
fix tolerance for a bloom slow test (#17634)
younesbelkada Jun 14, 2022
b76290f
Change push CI to run on workflow_run event (#17692)
ydshieh Jun 15, 2022
242cc6e
Documentation: RemBERT fixes (#17641)
stefan-it Jun 15, 2022
7f14839
[Wav2Vec2Conformer] Official release (#17709)
patrickvonplaten Jun 15, 2022
50415b8
Revert "Change push CI to run on workflow_run event (#17692)" (#17717)
ydshieh Jun 15, 2022
6ebeeee
Update requirements.txt (#17719)
jeffra Jun 15, 2022
c3c62b5
CLI: Add flag to push TF weights directly into main (#17720)
gante Jun 15, 2022
66f8933
normalize keys_to_ignore (#17722)
stas00 Jun 15, 2022
3981ee8
Sort the model doc Toc Alphabetically (#17723)
sgugger Jun 15, 2022
88a34e8
Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…
kelvinAI Jun 16, 2022
7e164f3
Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…
kelvinAI Jun 16, 2022
34d060c
Updated variable names to be more explicit
kelvinAI Jun 16, 2022
2eadb7e
Fix mask token in the example (#17725)
Jiayi-Pan Jun 16, 2022
f44e2c2
Fix tf shared embedding (#17730)
ArthurZucker Jun 16, 2022
0dca70e
Added option for users to modify config parameter used by pytesseract…
kelvinAI Apr 25, 2022
57f2e40
Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…
kelvinAI Jun 16, 2022
1a5ec60
Update src/transformers/models/layoutlmv2/feature_extraction_layoutlm…
kelvinAI Jun 16, 2022
cf7c657
Updated variable names to be more explicit
kelvinAI Jun 16, 2022
fd79650
Merge branch 'layoutlmv2_tessconfig' of https://github.com/kelvinAI/t…
kelvinAI Jun 16, 2022
@@ -46,11 +46,11 @@ def normalize_box(box, width, height):
     ]


-def apply_tesseract(image: Image.Image, lang: Optional[str]):
+def apply_tesseract(image: Image.Image, lang: Optional[str], tess_config: Optional[str]):
     """Applies Tesseract OCR on a document image, and returns recognized words + normalized bounding boxes."""

     # apply OCR
-    data = pytesseract.image_to_data(image, lang=lang, output_type="dict")
+    data = pytesseract.image_to_data(image, lang=lang, output_type="dict", config=tess_config)
     words, left, top, width, height = data["text"], data["left"], data["top"], data["width"], data["height"]

     # filter empty words and corresponding coordinates
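
The hunk above shows only the OCR call itself. As a rough illustration of how the option string travels with that call, the sketch below forwards `tess_config` to an OCR callable, then drops empty words and scales the boxes. Several things here are assumptions rather than code from the PR: the `ocr_words_and_boxes` name, the injectable `image_to_data` parameter, and the fake OCR function exist only so the example runs without Tesseract installed, and the 0-1000 scaling in `normalize_box` is assumed because its body is not part of this diff.

```python
from typing import Callable, Dict, List, Optional, Tuple


def normalize_box(box: List[int], width: int, height: int) -> List[int]:
    # Assumed behaviour: scale pixel coordinates to a 0-1000 grid.
    return [
        int(1000 * (box[0] / width)),
        int(1000 * (box[1] / height)),
        int(1000 * (box[2] / width)),
        int(1000 * (box[3] / height)),
    ]


def ocr_words_and_boxes(
    image_size: Tuple[int, int],
    image_to_data: Callable[..., Dict[str, list]],
    lang: Optional[str] = None,
    tess_config: Optional[str] = "",
) -> Tuple[List[str], List[List[int]]]:
    """Hypothetical stand-in for apply_tesseract that takes the OCR call as a parameter."""
    width, height = image_size
    # Forward the option string unchanged, as the PR does with config=tess_config.
    data = image_to_data(lang=lang, output_type="dict", config=tess_config)

    words, boxes = [], []
    for text, x, y, w, h in zip(data["text"], data["left"], data["top"], data["width"], data["height"]):
        if not text.strip():
            continue  # skip empty words together with their coordinates
        # Convert (left, top, width, height) to (x0, y0, x1, y1) before scaling.
        boxes.append(normalize_box([x, y, x + w, y + h], width, height))
        words.append(text)
    return words, boxes


def fake_image_to_data(lang=None, output_type="dict", config=""):
    # Stand-in for pytesseract.image_to_data; records the config it received.
    fake_image_to_data.seen_config = config
    return {
        "text": ["Hello", " ", "world"],
        "left": [50, 0, 100],
        "top": [25, 0, 50],
        "width": [50, 0, 50],
        "height": [25, 0, 25],
    }


words, boxes = ocr_words_and_boxes((200, 100), fake_image_to_data, lang="eng", tess_config="--psm 6")
print(words, boxes, fake_image_to_data.seen_config)
```

Passing the OCR function in as an argument is a choice made for this sketch only; it makes the forwarding of `config` observable in a test without a Tesseract binary.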
@@ -103,6 +103,8 @@ class LayoutLMv2FeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM
         ocr_lang (`Optional[str]`, *optional*):
             The language, specified by its ISO code, to be used by the Tesseract OCR engine. By default, English is
             used.
+        tess_config (`Optional[str]`, *optional*):
+            Optional arguments forwarded to `config` parameter when calling Tesseract. For example to change psm modes.

     <Tip>

@@ -112,13 +114,23 @@ class LayoutLMv2FeatureExtractor(FeatureExtractionMixin, ImageFeatureExtractionM

     model_input_names = ["pixel_values"]

-    def __init__(self, do_resize=True, size=224, resample=Image.BILINEAR, apply_ocr=True, ocr_lang=None, **kwargs):
+    def __init__(
+        self,
+        do_resize=True,
+        size=224,
+        resample=Image.BILINEAR,
+        apply_ocr=True,
+        ocr_lang=None,
+        tess_config="",
+        **kwargs
+    ):
         super().__init__(**kwargs)
         self.do_resize = do_resize
         self.size = size
         self.resample = resample
         self.apply_ocr = apply_ocr
         self.ocr_lang = ocr_lang
+        self.tess_config = tess_config

     def __call__(
         self, images: ImageInput, return_tensors: Optional[Union[str, TensorType]] = None, **kwargs
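
The `tess_config` value is a raw Tesseract option string, so callers assemble it themselves. One possible way to do that is sketched below. `build_tess_config` is a hypothetical helper, not part of this PR or of pytesseract; it assumes Tesseract's `--psm` flag (page segmentation mode, 0 to 13) and `--oem` flag (OCR engine mode, 0 to 3).

```python
from typing import Optional


def build_tess_config(psm: Optional[int] = None, oem: Optional[int] = None) -> str:
    """Hypothetical helper: build an option string suitable for `tess_config`."""
    parts = []
    if psm is not None:
        if not 0 <= psm <= 13:
            raise ValueError(f"psm must be between 0 and 13, got {psm}")
        parts.append(f"--psm {psm}")
    if oem is not None:
        if not 0 <= oem <= 3:
            raise ValueError(f"oem must be between 0 and 3, got {oem}")
        parts.append(f"--oem {oem}")
    # With no options this returns "", which matches the default in __init__.
    return " ".join(parts)


print(repr(build_tess_config()))
print(repr(build_tess_config(psm=6)))
print(repr(build_tess_config(psm=6, oem=1)))
```

Validating the ranges up front is optional; it simply surfaces a bad mode as a Python error instead of a failure inside the Tesseract process.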
@@ -201,7 +213,7 @@ def __call__(
             words_batch = []
             boxes_batch = []
             for image in images:
-                words, boxes = apply_tesseract(self.to_pil_image(image), self.ocr_lang)
+                words, boxes = apply_tesseract(self.to_pil_image(image), self.ocr_lang, self.tess_config)
                 words_batch.append(words)
                 boxes_batch.append(boxes)

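
Taken together, the hunks store the option string once at construction time and pass it to every OCR call in the batch loop. That pattern can be sketched without the library as follows. `TinyExtractor`, the `ocr_fn` parameter, and `recording_ocr` are hypothetical stand-ins for `LayoutLMv2FeatureExtractor` and `apply_tesseract`, used so the example runs without Tesseract or transformers; resizing and tensor conversion are left out.

```python
from typing import Callable, List, Optional, Tuple

OcrResult = Tuple[List[str], List[List[int]]]


class TinyExtractor:
    """Hypothetical stand-in that mirrors only the OCR-related attributes."""

    def __init__(
        self,
        ocr_fn: Callable[[object, Optional[str], str], OcrResult],
        apply_ocr: bool = True,
        ocr_lang: Optional[str] = None,
        tess_config: str = "",
    ):
        self.ocr_fn = ocr_fn
        self.apply_ocr = apply_ocr
        self.ocr_lang = ocr_lang
        self.tess_config = tess_config  # stored once, reused for every image

    def __call__(self, images: list) -> dict:
        output = {"images": images}
        if self.apply_ocr:
            words_batch, boxes_batch = [], []
            for image in images:
                # Same positional order as the PR: image, language, config.
                words, boxes = self.ocr_fn(image, self.ocr_lang, self.tess_config)
                words_batch.append(words)
                boxes_batch.append(boxes)
            output["words"] = words_batch
            output["boxes"] = boxes_batch
        return output


calls = []


def recording_ocr(image, lang, config):
    # Fake OCR: records its arguments and returns one word per image.
    calls.append((image, lang, config))
    return [f"word-{image}"], [[0, 0, 1000, 1000]]


extractor = TinyExtractor(recording_ocr, ocr_lang="eng", tess_config="--psm 6")
result = extractor(["page1", "page2"])
print(result["words"], calls)
```

Because the string lives on the instance, every image in a batch is processed with the same Tesseract options; changing them means constructing a new extractor or reassigning the attribute.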