Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Draft for slim containers #386

Closed
wants to merge 19 commits into from
Closed

Draft for slim containers #386

wants to merge 19 commits into from

Conversation

joschrew
Copy link

@joschrew joschrew commented Jul 7, 2023

Description:
All processors run in their own docker-container as a processing-worker. Also there are containers: processing-server, rabbitmq and mongodb running. Executables for processors are delegators to ocrd network client.

Related issue:
#69 (comment)

Example usage:

Clone this repo:
git clone git@github.com:joschrew/ocrd_all.git

Change to repository
cd ocrd_all

Core, ocrd_cis and ocrd_tesserocr are needed for the example-run:
git submodule update --init core/ ocrd_cis/ ocrd_tesserocr/

Create the venv and docker-compose.yaml:
make -f Makefile-slim slim-venv

Create datadir (necessary to get the workspaces to the containers):
mkdir data

Start the containers:
docker-compose up -d

Get a workspace for testing:

curl "https://raw.githubusercontent.com/OCR-D/ocrd-webapi-implementation/main/things/example_ws.ocrd.zip" --output foo.zip
unzip foo.zip "data/*"

Activate the venv:
. venv2/bin/activate

Run a processor on the workspace:
ocrd-cis-ocropy-binarize -I OCR-D-IMG -O OCR-D-BIN -m mets.xml

Copy link
Collaborator

@bertsky bertsky left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early stage, dry review – still trying to make it run for me.

slim-containers-files/delegator_template.py Outdated Show resolved Hide resolved
import subprocess

# Later the address (or rather the port) should be dynamic
processing_server_address = "http://localhost:8000"
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why not the exposed OCRD_PS_PORT here? Or even better, instead of the host side (localhost), we should be fine with the Docker network's DNS:

Suggested change
processing_server_address = "http://localhost:8000"
processing_server_address = "http://ocrd-processing-server:8000"

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How does that work (I mean the thing with the hostname, I think I know how to set set Port via env)? When I am on the host it cannot resolve ocrd-processing-server out of the box. What do I have to change additionally to make it work to query the container from the host via its service/host name?

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Not from the host (network), but from the virtual Docker network (i.e. from inside another container). See Compose documentation. (The port then is the container internal port BTW.)

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

After trying some things I think I cannot solve this properly. First the hostname cannot be the service name because the delegator is run from the host. And I cannot read the port from .env because I don't know the working dir where the delegator is executed from. So I decided to set the port dynamic in the Makefile like the processor-name. The proposed suggestion at least does not work. Commit: 7736969

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

First the hostname cannot be the service name because the delegator is run from the host.

Oh, right! Sorry, weak thinking on my side.

And I cannot read the port from .env because I don't know the working dir where the delegator is executed from.

Ok, good point. However, we could write all the .env values into the venv's activate by make, dynamically (i.e. setting the exact values used for the servers):

%/bin/activate:
	$(PYTHON) -m venv $(subst /bin/activate,,$@)
	. $@ && pip install --upgrade pip setuptools wheel
	@echo ". $(CURDIR)/.env" >> $@

IMO the .env should be the central source for configuration, so you should be able to modify it to your needs after it was generated.

Copy link
Author

@joschrew joschrew Jul 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see a problem with that approach, because it only works with the venv activated in bash (not csh or fish) and invoking executables directly would not work with this approach (e.g.: venv/bin/ocrd-some-processor ...). Btw in the example code this does not work: @echo ". $(CURDIR)/.env" >> $@, exporting is needed to be available to a started python process e.g. the processor.
But I think it is a good idea to use the .env for the config. For now I have implemented another approach approach: In the Makefile the env-path is set into the delegator which then reads the port from it. I know it is not ideal so I think we should talk about this again and agree on a proper solution.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Have reverted the last commit regarding this and decided to use the proposed approach

Comment on lines 20 to 24
cmd = [
"ocrd", "network", "client", "processing", "processor",
processor_name, "--address", processing_server_address
]
subprocess.run(cmd + args[1:])
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The Processing Server API is asynchronous, and so is its network client. So this CLI will not give the same user experience as the native CLI (for example, you cannot script these calls).

Thus, either we add some callback mechanism here and block until the job is done, or we switch to the Processor Server API which is synchronous (but has no client CLI yet).

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It could be as simple as passing "--callback-url", "http://localhost:" + PORT + "/" here. The PORT should be generated before by an internal background task like so:

PORT = None
async def serve_callback():
    nonlocal PORT
    # set up web server
    def get(path):
        raise Exception("done")
    ...
    # run web server on arbitrary available port
    PORT = ...

server = asyncio.create_task(serve_callback)

Then after the subprocess.run, you can simply block by doing await server. The whole thing must be in a function that you asyncio.run(...).

Copy link
Author

@joschrew joschrew Jul 19, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a waiting mechanism with the python build-in HTTPServer. I tried using your proposed async approach but discarded that before I could completely finish. Probably async is possible but it is way more complicated (that's why I finally opted out) and I think a "non-async" http server has the same effect and the waiting-costs for running a "normal" http server vs. a async http server is neglectable.
I cannot say I have much more than an idea of how it internally works but I think a python-process with an async server gets scheduled by the operating system as well anyway. The async-stuff with polling and so on is "only" python internal. It would gain time if we had another async process running alongside which could then be run when the server is waiting.
So I don't think async is helpful here anyway, that's why I use the synchrony server for waiting.

Comment on lines +7 to +11
depends_on:
- ocrd-processing-server
- ocrd-mongodb
- ocrd-rabbitmq
# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If timing is an issue, I suggest to change the dependency type:

Suggested change
depends_on:
- ocrd-processing-server
- ocrd-mongodb
- ocrd-rabbitmq
# restart: The worker creates its queue but rabbitmq needs a few seconds to be available
depends_on:
- ocrd-processing-server
ocrd-mongodb:
condition: service_started
ocrd-rabbitmq:
condition: service_started

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlining problem is the following: The container for the queue is started and running, but it needs 1-3 seconds that queue creation is possible. But the processing worker tries to create it's queue right away.

This suggestion (service_started) is not working because the queue-service is considered started from docker-compose but it is in reality not ready to be used right away although it is considered started. I have a similar issue in another project and already tried a few things solving this. I think the only solution to this problem from the docker side would be to implement a (manual) health-check for the rabbitmq container. But therefore I'd have to extend the rabbit-mq image which I do not want.

For this PR to function some extension to core is needed anyway. There I want to add an optional queue-creation-timeout to the worker startup so that it waits a few seconds with adding its queue or to try again a few times. But this restart-fix was the fastest way to do that that's why it is here and I agree that it should be removed finally. I will remark this as solved as soon as the needed changes to core are made (which need one change to this PR as well).


{{ processor_name }}:
extends:
file: slim-containers-files/{{ processor_group_name}}/docker-compose.yaml
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So IIUC in the final setup, when we have correct Dockerfiles and compose files in all modules, this will simply become {{ module_name }}/docker-compose.yaml?

Makefile-slim Outdated Show resolved Hide resolved
slim-containers-files/ps-config.yaml Outdated Show resolved Hide resolved
slim-containers-files/ps-config.yaml Outdated Show resolved Hide resolved
Makefile-slim Outdated
Comment on lines 1 to 12
export PYTHON ?= python3
VIRTUAL_ENV = $(CURDIR)/venv2
BIN = $(VIRTUAL_ENV)/bin
ACTIVATE_VENV = $(BIN)/activate
OCRD_MODULES = OCRD_CIS OCRD_TESSEROCR
OCRD_CIS = ocrd-cis-ocropy-binarize ocrd-cis-ocropy-dewarp
OCRD_TESSEROCR = ocrd-tesserocr-recognize ocrd-tesserocr-segment-region
PROCESSORS = $(foreach mod,$(OCRD_MODULES),$(foreach proc,$($(mod)), $(proc) ))
DELEGATORS = $(foreach proc,$(PROCESSORS),$(BIN)/$(proc))

slim-venv: docker-compose.yaml .env $(DELEGATORS) | $(VIRTUAL_ENV)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None of this would be needed if you added to the existing Makefile directly. We already have (sensible+configurable) definitions for VIRTUAL_ENV, OCRD_MODULES (not a variable but the true submodule name) and OCRD_EXECUTABLES (your PROCESSORS). There is even an existing delegator mechanism (used for sub-venvs on some modules).

Makefile-slim Show resolved Hide resolved
Copy link
Member

@kba kba left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good so far but I need a bit more explanation of all the parts involved, what gets generated and what is actually run.

slim-containers-files/Dummy-Core-Dockerfile Outdated Show resolved Hide resolved
slim-containers-files/delegator_template.py Outdated Show resolved Hide resolved
joschrew added 2 commits July 12, 2023 16:49
Changes not yet merged to core are needed for this PR to work. This
must be reset later
@joschrew
Copy link
Author

I have started the process of addressing all comments but this will take a while. I think I'll add them step by step. I will report back if that is finished.

@joschrew joschrew force-pushed the master branch 2 times, most recently from 1de5dbb to 10725b9 Compare July 20, 2023 07:00
joschrew and others added 10 commits July 20, 2023 09:02
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>
Co-authored-by: Robert Sachunsky <38561704+bertsky@users.noreply.github.com>

processing_server_address = "http://localhost:{{ OCRD_PS_PORT }}"
env_path = "{{ OCRD_SLIM_ENV_PATH }}"
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Maybe use https://github.com/theskumar/python-dotenv for that, so you don't have to parse sh variable assignments?

This reverts commit cfd1a79. Goal is
to read .env when activating the venv and use the env-variable to set
the processing server port.
context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
ocrd network processing-worker ocrd-tesseroc-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
ocrd network processing-worker ocrd-tesserocr-recognize --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue

context: ../../ocrd_tesserocr
dockerfile: ../slim-containers-files/ocrd_tesserocr/Dockerfile
command:
ocrd network processing-worker ocrd-tesserocr-segment-region --database $MONGODB_URL --queue $RABBITMQ_URL --create-queue
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should now use the ocrd-tesserocr-segment-region worker syntax instead to get instance caching, right?

@joschrew joschrew closed this by deleting the head repository Nov 12, 2024
@bertsky
Copy link
Collaborator

bertsky commented Nov 12, 2024

superseded by #449

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants