Got "Killed" running ocrd-cis-ocropy-clip #74

Open
stefanCCS opened this issue Oct 14, 2020 · 5 comments

Comments

@stefanCCS

Running the workflow below on an image ends with the process being "Killed", like this:

09:12:59.552 INFO processor.OcropyClip - INPUT FILE 0 / PAGE-1
09:12:59.922 INFO processor.OcropyClip - Page "OCR-D-SEG-REG-DESKEW-1" uses 300.000000 DPI
Killed

To get the image, please contact me on Gitter.

Workflow used:

ocrd-cis-ocropy-binarize \
    -I OCR-D-IMG \
    -O OCR-D-BIN
ocrd-tesserocr-segment-region \
    -I OCR-D-BIN \
    -O OCR-D-SEG-REG
ocrd-tesserocr-deskew \
    -I OCR-D-SEG-REG \
    -O OCR-D-SEG-REG-DESKEW
ocrd-cis-ocropy-clip \
    -I OCR-D-SEG-REG-DESKEW \
    -O OCR-D-SEG-REG-DESKEW-CLIP

@bertsky
Collaborator

bertsky commented Oct 14, 2020

Thanks @stefanCCS for the report. I cannot say much without more context – either in the form of debug-level log output (e.g. by running with -l DEBUG) or the affected page's image.

But that workflow itself is also flawed – please see here.

@stefanCCS
Author

Additional information with -l DEBUG:

11:15:58.371 DEBUG ocrd.resolver.workspace_from_url - Deriving dst_dir /home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip from /home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip/mets.xml
11:15:58.372 DEBUG ocrd.resolver.workspace_from_url - workspace_from_url
mets_basename='mets.xml'
mets_url='/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip/mets.xml'
src_baseurl='/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip'
dst_dir='/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip'
11:15:58.372 DEBUG ocrd.resolver.download_to_directory - directory=|/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip| url=|/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip/mets.xml| basename=|mets.xml| if_exists=|skip| subdir=|None|
11:15:58.372 DEBUG ocrd.resolver.download_to_directory - Stop early, src_path and dst_path are the same: '/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip/mets.xml' (url: '/home/ocrdadmin/ocrd_all/myData/workspaces/CrashCisOcropyClip/mets.xml')
11:15:58.401 DEBUG ocrd.workspace_validator - input_file_grp=['OCR-D-SEG-REG-DESKEW'] output_file_grp=['OCR-D-SEG-REG-DESKEW-CLIP']
11:15:58.401 DEBUG ocrd.processor.helpers.run_processor - Running processor <class 'ocrd_cis.ocropy.clip.OcropyClip'>
11:15:58.403 DEBUG ocrd.processor.helpers.run_processor - Processor instance <ocrd_cis.ocropy.clip.OcropyClip object at 0x7fa39105c630> (ocrd-cis-ocropy-clip v0.1.4 doing layout/segmentation/region)
11:15:58.405 INFO processor.OcropyClip - INPUT FILE 0 / PAGE-1
11:15:58.405 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-SEG-REG-DESKEW, ID=OCR-D-SEG-REG-DESKEW-1, mimetype=application/vnd.prima.page+xml, url=OCR-D-SEG-REG-DESKEW/OCR-D-SEG-REG-DESKEW-1.xml, local_filename=OCR-D-SEG-REG-DESKEW/OCR-D-SEG-REG-DESKEW-1.xml]/>  [_recursion_count=0]
11:15:58.451 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-IMG, ID=OCR-D-IMG-1, mimetype=image/tiff, url=OCR-D-IMG/advertisement-MMKB27_017779053_00004_tiff.tif, local_filename=OCR-D-IMG/advertisement-MMKB27_017779053_00004_tiff.tif]/>  [_recursion_count=0]
11:16:01.903 DEBUG ocrd.workspace.image_from_page - page 'OCR-D-SEG-REG-DESKEW-1' has  orientation=0 skew=0.00
11:16:01.903 DEBUG ocrd.workspace.image_from_page - Using AlternativeImage 1 (,binarized) for page 'OCR-D-SEG-REG-DESKEW-1'
11:16:01.909 DEBUG ocrd.workspace.download_file - download_file <OcrdFile fileGrp=OCR-D-BIN, ID=OCR-D-BIN-1.IMG-BIN, mimetype=image/png, url=OCR-D-BIN/OCR-D-BIN-1.IMG-BIN.png, local_filename=OCR-D-BIN/OCR-D-BIN-1.IMG-BIN.png]/>  [_recursion_count=0]
11:16:02.128 INFO processor.OcropyClip - Page "OCR-D-SEG-REG-DESKEW-1" uses 300.000000 DPI
Killed

@stefanCCS
Author

To get the image, please contact me via private chat on Gitter.

@bertsky
Collaborator

bertsky commented Oct 14, 2020

Thanks! I am able to reproduce this now.

Unfortunately, it's not strictly a bug, just inefficient programming. Clipping necessarily has to compare N by N regions somehow. And when the coordinates suggest that a pair intersects, the algorithm ultimately needs to look at both regions' masks into the page to check for overlapping connected components. So to avoid calculating the same masks over and over again, I decided to pre-calculate them (trading memory, i.e. RSS size, for CPU time). But if the images are large (yours is 4726x6883) and there are many regions (Tesseract found 315 of them), then there is a lot to store in memory. This just scales badly. Efficiency was not a primary concern of the second project phase (and this processor is a stop-gap anyway).
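To make the scaling concrete, here is a rough back-of-the-envelope estimate, not taken from the processor's code. It assumes one full-page mask per region at one byte per pixel (as a boolean NumPy array would use); the actual implementation may store masks differently. The helper name `page_mask_bytes` is hypothetical, and the half-size case assumes the segmenter still finds the same 315 regions.

```python
def page_mask_bytes(width, height, n_regions, bytes_per_pixel=1):
    """Memory needed to hold one full-page mask per region.

    bytes_per_pixel=1 is an assumption (e.g. a boolean array),
    not something confirmed by the processor's source.
    """
    return width * height * n_regions * bytes_per_pixel


# Figures from this issue: a 4726x6883 page with 315 regions.
full = page_mask_bytes(4726, 6883, 315)

# Same page with each dimension halved (region count assumed unchanged).
half = page_mask_bytes(4726 // 2, 6883 // 2, 315)

print(f"full size: {full / 1e9:.2f} GB")  # full size: 10.25 GB
print(f"half size: {half / 1e9:.2f} GB")  # half size: 2.56 GB
```

Under these assumptions the pre-calculated masks alone need roughly 10 GB, which would plausibly get the process killed by the kernel's out-of-memory handling on a typical workstation. Halving each dimension cuts the footprint to about a quarter.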

I'm not saying "won't fix", but I am not sure whether we should really prioritize this right now. I don't see an easy way out to be honest. (I could probably move away from page masks to pair-wise masks spanning the joint bboxes.)
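The pair-wise idea mentioned in parentheses could look roughly like the sketch below. This is an illustration only, not the processor's implementation: the box format `(x0, y0, x1, y1)`, the helper names, and the sample regions are all hypothetical. The point is that a mask is only needed for pairs whose boxes intersect, and only over their joint bounding box rather than the whole page.

```python
from itertools import combinations


def intersects(a, b):
    """True if two boxes (x0, y0, x1, y1) overlap with positive area."""
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]


def joint_bbox(a, b):
    """Smallest box containing both a and b."""
    return (min(a[0], b[0]), min(a[1], b[1]), max(a[2], b[2]), max(a[3], b[3]))


def pair_mask_pixels(a, b):
    """Pixels needed for two masks (one per region) over the joint box."""
    x0, y0, x1, y1 = joint_bbox(a, b)
    return 2 * (x1 - x0) * (y1 - y0)


# Hypothetical region boxes; only the first two overlap.
regions = [
    (100, 100, 900, 400),
    (800, 350, 1600, 700),
    (3000, 5000, 3500, 5200),
]

candidates = []
for (i, a), (j, b) in combinations(enumerate(regions), 2):
    if intersects(a, b):
        candidates.append((i, j, joint_bbox(a, b), pair_mask_pixels(a, b)))

print(candidates)  # [(0, 1, (100, 100, 1600, 700), 1800000)]
```

With this layout, masks are created and discarded one pair at a time, so peak memory depends on the largest joint box rather than on the page size times the number of regions. The cost is that masks for a region taking part in several pairs are recomputed.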

Perhaps the best workaround for now is to downscale your images (e.g. from 300 DPI to 150 DPI).

@stefanCCS
Author

Ok, understood.
Resized my image to 50%, and it works fine (the image was 300 DPI before, so it is now 150 DPI).
--> I am not the one who decides if or when this issue should be fixed (I can work with the workaround of resizing my images, as I am in a general evaluation phase).
