Allow building with thin module Docker containers #69
Comments
I think OCR-D has to decide if you want a "have it all" environment. Then either separate venvs per processor or some kind of Docker setup could be the solution. I'd go for a Docker solution because that also solves dependency issues outside of the Python world. And it is actually possible to do without over-engineering too hard. I.e. I use a |
As long as we don't have REST integration, all-in-one is all we have. Isolation within that approach will always be up to demand and interdependencies. Whether we then delegate to local venvs or thin Docker containers is indeed another freedom of choice (which we ideally should also make a user choice here).
Yes, but it may also increase space and time cost unnecessarily under some circumstances (depending on which modules are enabled and what platform the host is). So I'd really like that to be for the user to decide ultimately.
Yes of course, once we wrap the processor CLIs in another shell script layer, we can make that local venv or Docker run again. And for the latter we only need to ensure we pass on all the arguments. (Perhaps we could even avoid spinning up a new container with each invocation by using The only thing that troubles me with delegating to thin module Docker containers is that we more or less surrender version control. It's really difficult to do that with Docker images solely based on digest numbers. But we could of course petition module providers to use the |
Unfortunately, we now have a situation where both tensorflow (tf2) and tensorflow-gpu (tf1) can be installed side-by-side, so scripts won't fail at startup anymore but when doing the customary After discussing with @bertsky I see no alternative to isolated venvs. Implementing this in the Makefile is tedious but doable. However, for our Docker builds, we need to decide on a mechanism to create an entry-point "venv broker" script that activates the right environment or similar. I'm stumped on how we can sensibly support But the situation right now, with runtime errors instead of startup failures, is unacceptable; I see no alternative to package isolation. If anyone else does see an alternative, I'd be happy to hear. |
Are there plans to upgrade models and software to TF 2? I think it would help to get an overview of all processors which still use TF 1, with an estimate of whether and when they will run with TF 2 and who is responsible for that. And we should have all TF 1 based processors in their own group As soon as there is a separate group of TF 1 executables, the Makefile implementation could be straightforward by calling |
From the direct requirements:
But we'll also have to check transitive dependencies. |
ocrd_pc_segmentation depends indirectly on tensorflow<2.1.0,>=2.0.0. |
Because of the close relationship between Python version(s) and available prebuilt TensorFlow versions, we must be prepared that TF1 might require a different (= older) Python version than TF2. I already have that situation when I want to build with a recent Linux distribution, and because of a bug in Debian / Ubuntu it is currently not possible to create a Python 3.7 venv when Python 3.8 is installed, too. This is of course not relevant for the old Ubuntu which is currently our reference platform. |
I did not look into the details of the code. Does it spawn processes for the single steps (then it should work), or do all steps run in the same Python process? |
The README is up-to-date w.r.t. that. TF2 migration can be less or more effort, depending on what part of the API the module (or its dependent) relies on. TF offers an upgrade script to rewrite the code to use Plus TF 2.2 brings even more breaking changes (as you observed for Since we have modules like
This would work, but why make an exception for TF? We have seen other conflicting dependencies already, and we know that
The recipe is simple (and has already been discussed above): the top-level PATH directory (which will be a tiny venv merely for |
I was not clear: That can also be done with a mechanism like the one you describe. No reason not to use venv in a Docker image, on the contrary.
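For illustration, the "venv broker" idea discussed above could look roughly like the following. This is only a sketch under stated assumptions: the `VENV_ROOT` variable, the `sub-venv/<group>/bin` layout and the function names `resolve_processor` and `delegate` are made up for this example and are not the actual ocrd_all implementation.

```bash
#!/bin/bash
# Sketch of a "venv broker": find the sub-venv that provides a processor
# and run it there with all arguments passed on unchanged.
# Assumed (hypothetical) layout: $VENV_ROOT/sub-venv/<group>/bin/<processor>

resolve_processor() {
  local name="$1" root="${VENV_ROOT:-/usr/local}" dir
  for dir in "$root"/sub-venv/*/bin; do
    # An unmatched glob stays literal; the -x test then simply fails.
    if [ -x "$dir/$name" ]; then
      printf '%s\n' "$dir/$name"
      return 0
    fi
  done
  return 1
}

delegate() {
  local name="$1" target
  shift
  # Declare and assign separately so the exit status of the lookup is kept.
  if ! target=$(resolve_processor "$name"); then
    echo "no sub-venv provides $name" >&2
    return 127
  fi
  # Calling the executable by its full path uses that venv's interpreter.
  "$target" "$@"
}
```

A wrapper installed in the top-level PATH directory would then only need to call something like `delegate "$(basename "$0")" "$@"`. The same wrapper works unchanged inside a Docker image, since it depends only on the directory layout.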
After sleeping on it, I am indeed unstumped. I misremembered |
O what a mess! I now tried an installation of ocrd_all with a 2nd venv for
@mikegerber, I am afraid that |
@stweil please report to the respective repo (yes, probably |
@bertsky, |
(Sorry for the late reply to this, I'm reviewing open issues.) I'm trying to understand this and I think you're saying that going from, for example, |
This is not about containers, but images. And digest numbers are the only reliable identification that Docker images get unconditionally (without extra steps at build time). But then digest numbers would have to be mapped to the git submodule commits that ocrd_all already uses, which seems unmanageable to me. So practically I guess everyone would just try to get the most recent image and pray they never have to go backwards.
ocrd_all is more fine-grained than version numbers / release tags – it manages the submodules' commits. So if you replace version with commit then yes, that's what I mean. All Docker builds need to automatically include their git revisions. With that in place, and with some script foo, we could selectively exchange native installations with thin Docker containers per module as needed – to the point where all modules are Dockerized, so the top level (be it native or Docker itself) becomes thin itself. |
Anyway, with #118 ff. the original issue (different TF requirements) has been solved. The topic has since moved on to how we integrate/compose thin module Docker images (where available) as an alternative, without giving up version control. I therefore suggest renaming the issue to |
Since we now have a script mechanism in place delegating to sub-venvs, we could start delegating to thin Docker containers. But we have to consider that we would be calling Docker containers from Docker containers. It's doable, but needs to be accounted for. In particular, the existing mount points and bind mounts need to be passed on. (The situation is different for @mikegerber's solution IIUC, because its outer layer is native, not Docker.)
So maybe we should start by devising a scheme for including the git version number into all thin/module images. ocrd/all already uses these labels:
Let's extend that to all existing submodule images, i.e.
Then we can follow up with a PR here that inserts the docker (pull and) run for the revision of the respective submodule into the CLI delegator script. |
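To make the proposal more concrete, here is a rough sketch of what a CLI delegator for thin module containers might do. Several things are assumptions for the sake of the example and do not come from this thread: that module images are tagged with the submodule's git revision (`<image>:<revision>`), that the workspace is bind-mounted at `/data`, and the names `delegate_to_container` and `DOCKER`. It only covers the case of a native outer layer; when the delegator itself runs inside a container, `$PWD` would be a path inside that container and the host-side path would have to be passed on instead.

```bash
#!/bin/bash
# Sketch: run one processor call in a thin module container, pinned to the
# git revision of the corresponding submodule.
# Hypothetical conventions: image tag == git revision, workspace at /data.

delegate_to_container() {
  local image="$1" revision="$2" processor="$3"
  shift 3
  # DOCKER can be overridden (e.g. DOCKER=echo) to inspect the command
  # without running it; it is deliberately unquoted to allow "sudo docker".
  ${DOCKER:-docker} run --rm \
    --user "$(id -u):$(id -g)" \
    --volume "$PWD:/data" --workdir /data \
    "$image:$revision" "$processor" "$@"
}
```

The revision could be taken from the checked-out submodule, for example with `git -C ocrd_tesserocr rev-parse --short HEAD`, so that the container being run always corresponds to the commit ocrd_all manages.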
Yes, the containers are intended to provide dependency-isolated processors to the native/host-side workflow script ("the outer layer"). |
Besides spinning up multiple CLI-only containers (somehow) sharing volumes, we could also integrate containers as true network services, but merely by installing a thin OpenSSH server layer on top of each module's CLI offerings. This was done for |
Not necessary anymore: we now have the possibility to build network services for module processors by either
both options could be predefined in a docker-compose.yml, each using the same (module) image but differing entry points (i.e.
Thus, in ocrd_all,

    ...
      ocrd-tesserocr-recognize:
        extends:
          file: ocrd_tesserocr/docker-compose.yml
          service: ocrd-tesserocr-recognize
        command: ocrd network processing-worker ocrd-tesserocr-recognize --database $MONGO_URL --queue $RABBITMQ_URL
        depends_on:
          - ocrd-processing-server
          - ocrd-mongo-db
          - ocrd-rabbit-mq
    ...

    ...
      ocrd-tesserocr-recognize:
        extends:
          file: ocrd_tesserocr/docker-compose.yml
          service: ocrd-tesserocr-recognize
        command: ocrd network processor-server ocrd-tesserocr-recognize --database $MONGO_URL --address ocrd-tesserocr-recognize:80
        depends_on:
          - ocrd-mongo-db
... where configuration (i.e. setting environment variables) can happen via .env mechanism or shell. Now, what's left is generating CLI entry points that delegate to each respective REST endpoint:

    ifneq ($(findstring ocrd_tesserocr, $(OCRD_MODULES)),)
    OCRD_TESSEROCR := $(BIN)/ocrd-tesserocr-binarize
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-crop
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-deskew
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-recognize
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-segment-line
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-segment-region
    OCRD_TESSEROCR += $(BIN)/ocrd-tesserocr-segment-word
    OCRD_EXECUTABLES += $(OCRD_TESSEROCR)
    $(OCRD_TESSEROCR): ocrd_tesserocr
    	$(file >$@,$(call delegator,$(@F)))
    	chmod +x $@
    endif
    ...
    define delegator
    #!/bin/bash
    ocrd network client processing process $(1) "$$@"
    endef

So this would create executable files like This whole approach could replace both the sub-venvs and I am not sure I have the full picture of what we should do, though. Thoughts @kba ? |
This looks fairly complete to me, I'm currently trying out the updated |
How would you address the problem regarding getting the workspace to be processed to the processing-worker? Currently when running and using ocrd in docker the current directory is volume-mounted into /data of the container. |
@joschrew But we do have to talk about efficiency, in fact we already did – last time was when conceptualising the METS Server. There in particular, I laid out the existing (in the sense of currently available) implicit transfer model (backed by ad-hoc So these questions unfortunately hinge on a lot of unfinished business:
By now, everything needed has been completed:
– implemented and already in productive use.
– some issues remain with
– the Processing Server now schedules jobs and manages their interdependencies, both for page ranges and We still need some more client CLIs to reduce complexity (like avoiding However, https://github.com/joschrew/workflow-endpoint-usage-example already contains a complete, flexibly configurable deployment, generating Compose files and using ocrd-all-tool.json – but it is still based on
I gave this – and the larger subject of volumes and resource locations – some thought:

Current situation

In core we provide 4 dynamically configurable locations for every processor:
At runtime, Because of that, resmgr has to acknowledge the same resource locations for each processor. That entails looking up the resource locations (by calling the processor runtime with To short-circuit the dynamic calls (which have significant latency, esp. if resmgr must do it for Now, in a native installation of ocrd_all, we simply install all tools and
In our Docker rules for the fat container, we also did this as part of the build recipe. But we then added a few tricks making it easier for users to have persistent volumes for their models (including both the pre-installed ones and any user-downloaded ones):
This includes all cases, including ocrd_tesserocr which additionally uses the same trick to hide away its

Future solution

For slim containers, there will be no more single Dockerfile, and we cannot expect all modules to agree on the same "trick" alias in their local Dockerfile. Rather, since services have to be defined in a docker-compose.yml that is generated (from the available git modules and/or the scenario config) anyway, we can also generate the named volume path and environment variables for them as we like. So we don't need the Now, what does that mean for First of all, we do not have a single distribution target anymore, but a bunch of images each with their own

    RUN cat ocrd-tool.json | jq .tools[] > `python -c "import ocrd; print(ocrd.__path__[0])"`/ocrd-all-tool.json

Finally, for
So it really is complicated, and we don't have a good concept how to query and install processor resources in networked ocrd_all. @kba argued elsewhere his workaround is to Opinions? |
Elaborating a bit on option 2: of course, the (generated) docker-compose.yml for each module could also provide an additional server entry point – a simple REST API wrapper for resmgr CLI. Its (generated) volume and variable config would have to match the respective Processing Worker (or Processor Server) to be used. But the local resmgr would not need to "know" anything beyond what it can see in its thin container – a local ocrd-all-tool.json and ocrd-all-module-dir.json precomputed for the processors of that module at build time, plus the filesystem in that container and mounted volumes. In addition, to get the same central resmgr user experience (for all processor executables at the same time), one would still need
Regardless, crucially, this central component needs to know about all the deployed resmgr services – essentially holding a mapping from processor executables to module resmgr server host-port pairs. This could be generated along with the docker-compose.yml (in a new format like |
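As a rough illustration of such a mapping, the central component could look up the resmgr service for a given processor executable as sketched below. The file format (one `<executable> <host>:<port>` pair per line), the example host names and the function name `resmgr_endpoint` are purely hypothetical and only serve to show the idea; the actual format is still open.

```bash
#!/bin/bash
# Sketch: look up which module resmgr service is responsible for a
# processor executable.
# Hypothetical mapping file format, one pair per line:
#   ocrd-tesserocr-recognize ocrd-tesserocr:8001

resmgr_endpoint() {
  local executable="$1" mapping="$2"
  # Print the host:port of the first matching line; fail if there is none.
  awk -v exe="$executable" '
    $1 == exe { print $2; found = 1; exit }
    END { exit !found }
  ' "$mapping"
}
```

A central resmgr client could then forward each request to the endpoint returned for the executable in question, and report an error for processors whose module has no resmgr service deployed.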
In #68 @bertsky :