Use miniforge to get python #39

Closed
yuvipanda wants to merge 1 commit

yuvipanda
Contributor

We were using apt to get the system python earlier.
However, that is pinned to the distro, and can't be upgraded
separately. This is problematic for cases with reticulate
that want to use a newer python version.

Instead, we can get python from
miniforge (https://github.com/conda-forge/miniforge/).
Miniforge is a community-maintained conda installer that gives
us an independent, non-root installation of Python that is
separate from the system python. It also has a bunch of other
very useful characteristics that come from being fully
separated and isolated from the system python.

This is what I now recommend people use over system python in
their own data science-y docker images. Felt like a good
time to put this upstream in rocker too!

  • Remove pin of pip, since miniforge3 comes with a new enough pip
  • Use CONDA_DIR to point to python base, not VENV_DIR
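
A minimal sketch of what a miniforge-based install could look like, for readers who want something concrete. This is not the script from this PR: the `/opt/conda` default for `CONDA_DIR`, the `INSTALL_MINIFORGE` switch and the helper names are assumptions made for illustration. The URL follows the naming of the installers published on the miniforge releases page, and `-b -p` are the installer's batch-mode and prefix flags.

```shell
#!/bin/sh
# Sketch only: CONDA_DIR's default and the INSTALL_MINIFORGE switch are assumptions.
set -eu

CONDA_DIR="${CONDA_DIR:-/opt/conda}"

# Build the installer URL for a given CPU architecture (e.g. x86_64, aarch64).
miniforge_url() {
    arch="${1:-$(uname -m)}"
    echo "https://github.com/conda-forge/miniforge/releases/latest/download/Miniforge3-Linux-${arch}.sh"
}

# Download the installer and run it in batch mode (-b) into the prefix (-p).
# Assumes wget and bash are available and that CONDA_DIR does not exist yet.
install_miniforge() {
    installer="$(mktemp)"
    wget -q -O "$installer" "$(miniforge_url)"
    bash "$installer" -b -p "$CONDA_DIR"
    rm -f "$installer"
}

# Only install when explicitly asked to, so running this file has no side effects.
if [ "${INSTALL_MINIFORGE:-0}" = "1" ]; then
    install_miniforge
    "$CONDA_DIR/bin/python" --version
fi
```

Run with `INSTALL_MINIFORGE=1` inside an image build; adding `$CONDA_DIR/bin` to `PATH` afterwards makes this python the default one.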

@cboettig
Member

cboettig commented Jun 5, 2020

👏 Thanks @yuvipanda , this is great.

Been meaning to ask you about this anyway, particularly as we realize how easy it is to have even multiple scripts in the same analysis that require different versions of python (e.g. all those libs that still need Tensorflow 1.x and hence python < 3.8, while currently Ubuntu is on 20.04).

Just a note that we've recently moved the rocker/versioned stack over onto an Ubuntu LTS-based platform, so the current rocker/binder is actually now being built by https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_binder.sh. Dockerfiles are generated automatically from a JSON spec, e.g. https://github.com/rocker-org/rocker-versioned2/blob/master/stacks/binder.json

We're hoping that by moving to Ubuntu LTS, and by using standalone setup scripts as the RUN commands, we can better align with other binder images etc. I'm looking forward to porting the miniforge approach across our stack. One thing the modular approach should let us do is share build recipes more widely across the stack (since python is now essential to so many parts of the R stack anyway, especially the ML stuff).

@yuvipanda
Contributor Author

Oooo, very interesting. Does this mean these changes would need to be made in https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh instead? Happy to do that there instead.

I am glad you switched to Ubuntu LTS! Much better, IMO, for most docker image bases :)

@cboettig
Member

cboettig commented Jun 8, 2020

Yup, right on! (The env vars will also need to be updated in all the https://github.com/rocker-org/rocker-versioned2/blob/master/stacks but I can do that when I get a moment.)

@yuvipanda
Contributor Author

@cboettig awesome! lmk if i can help

@cboettig
Member

@yuvipanda Would love to chat more about python setup strategies at some point -- in particular, in getting the balance right between system-level install, supporting containers with multiple-user environments, and supporting different versions of python (well, everything) in the same container.

This new recipe installs everything at the $USER level. We've been configuring a python venv in /opt/venv so that it can be shared across users: https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh -- by setting the WORKON_HOME env var, reticulate will automatically detect this as the default environment, while also still giving users the freedom to set a separate venv if/when needed (if we set RETICULATE_PYTHON_ENV a user would be stuck in that venv unless they overrode / unset the env var manually). I've tried to set permissions in /opt/venv so that users can install there, but maybe that is asking for trouble. Thoughts on this approach vs the home-level install?
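
For reference, the shared-venv setup described above can be sketched roughly as follows. This is an illustration rather than the contents of install_python.sh: the `reticulate` environment name, the `staff` group, the `CREATE_VENV` switch and the helper names are assumptions.

```shell
#!/bin/sh
# Sketch only: the "reticulate" env name and the "staff" group are assumptions.
set -eu

WORKON_HOME="${WORKON_HOME:-/opt/venv}"
VENV_NAME="${VENV_NAME:-reticulate}"

# Path of the shared virtualenv inside WORKON_HOME (one trailing slash is tolerated).
venv_path() {
    echo "${WORKON_HOME%/}/${VENV_NAME}"
}

# Create the venv and make it group-writable so non-root users can install into it.
create_shared_venv() {
    python3 -m venv "$(venv_path)"
    chgrp -R staff "$(venv_path)"
    chmod -R g+rwX "$(venv_path)"
}

# Only touch the filesystem when explicitly asked to.
if [ "${CREATE_VENV:-0}" = "1" ]; then
    create_shared_venv
fi
```

Exporting `WORKON_HOME` in the image environment is what lets reticulate find the venv; the group-writable permissions are the part that may be "asking for trouble" on a multi-user system.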

The home-level install is also what reticulate now does by default (as well as installing miniconda). Like you say, this can be very useful, though in my experience it's been that the system python is too new (python 3.8, which isn't compatible with tensorflow 1.x and all the bazillion modules that still depend on tensorflow 1.x) rather than that system python is too old. It's tempting to actually just omit installing python and python modules entirely, and assume the user will just install these on the fly with the reticulate functions. In a way that's appealing since I keep running up against issues of needing different versions of python for different projects (or even different scripts in the same project), but at some level it feels a bit silly.

Would love your advice on this. Also any PRs directly to https://github.com/rocker-org/rocker-versioned2/blob/master/scripts/install_python.sh would be great as well. (we're also trying to wrap heads around a new set of version constraints as we build images with different versions of the CUDA libs -- have you played around with that?)

@eddelbuettel
Member

(Sorry, late to the party here, but aren't there different Python PPAs at launchpad that would be easier / simpler / less disruptive than jumping to conda?)

@cboettig
Member

@eddelbuettel thanks for weighing in and good question. Yeah, I was looking at that earlier (e.g. I believe the deadsnakes ppa is popular for this). Part of the challenge here though is allowing flexible user configuration without root access (e.g. a single user may want to toggle between multiple python versions and not leave the RStudio R console) and trying not to swim against the current of decisions we inherit from the reticulate interface in particular, which embraces the venv and miniconda approach (and user-level install) as first-class citizens (while appearing to frown on the very idea of 'system' install...). I am worried that some of these choices are driven by ignoring the standard linux model of multi-user / shared library...

@eddelbuettel
Member

eddelbuettel commented Jun 22, 2020

It is all hugely painful. A few months ago I updated reticulate and it insisted on getting me a massive miniconda install for no reason. I understand why Kevin et al do that -- reliable everywhere etc pp -- but I was on my own well-tended-to Ubuntu box. So I had to poke a little. And once you point it to your preferred Python (which is of course just /usr/bin/python3) via an env var it all works too.
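
A small sketch of the "point it to your preferred Python via an env var" step. It assumes the variable meant here is `RETICULATE_PYTHON`, which reticulate reads to pick its interpreter; the helper name and the choice of `~/.Renviron` are illustrative.

```shell
#!/bin/sh
# Sketch only: the Renviron location is an assumption; RETICULATE_PYTHON is
# the variable reticulate consults when choosing an interpreter.
set -eu

# Append RETICULATE_PYTHON=<interpreter> to an Renviron file, unless that
# exact line is already present.
pin_reticulate_python() {
    interpreter="$1"
    renviron="$2"
    line="RETICULATE_PYTHON=${interpreter}"
    touch "$renviron"
    # -x matches whole lines, -F treats the pattern as a fixed string.
    if ! grep -qxF "$line" "$renviron"; then
        echo "$line" >> "$renviron"
    fi
}

# Example (commented out so nothing is modified by default):
# pin_reticulate_python /usr/bin/python3 "$HOME/.Renviron"
```

Setting the variable in the shell before starting R (`export RETICULATE_PYTHON=/usr/bin/python3`) has the same effect for a single session.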

Ubuntu has good Python support. There are reasons we put Rocker on top of Ubuntu. Maybe we'll have enough energy left to do the same for Binder. They too have their reasons of many distros to support but maybe we can do better here.

@cboettig
Member

@eddelbuettel yes but I'm still a little unclear what the ideal Ubuntu-based python setup is. 20.04 has lovely built-in support for python 3.8 out of the box, but as noted above this is a huge pain for working with the vast constellation of python modules that are not compatible with it. We could add the deadsnakes PPA and start adding other older python versions (python3.6, python3.7 etc) to the system installation, but how would a user initialize a project that needs these? Would we pre-install them and make them available from /usr/bin (version numbers attached)? I think that would work, as long as users know to choose, e.g. /usr/bin/python3.7 as system python before trying to use greta (R package) or stable-baselines (python module) or any of the other things that don't support python 3.8 yet. I just worry that may be swimming against the tide of the mechanisms that are documented out there in reticulate and binder etc. Binder builds off Ubuntu LTS too and is developed by a python-focused team, so there's some incentive to be aligned with them as well.
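
The deadsnakes option above could be sketched like this. The package names follow the PPA's `pythonX.Y` naming, while the helper names and the version list are illustrative assumptions, and the install step needs root.

```shell
#!/bin/sh
# Sketch only: versions are examples; package names follow the deadsnakes scheme.
set -eu

# Install extra interpreters side by side from the deadsnakes PPA (needs root).
install_pythons() {
    apt-get update
    apt-get install -y software-properties-common
    add-apt-repository -y ppa:deadsnakes/ppa
    apt-get update
    for v in "$@"; do
        apt-get install -y "python${v}" "python${v}-venv"
    done
}

# Print the path of the first requested version that is actually installed;
# return non-zero if none of them is on PATH.
pick_python() {
    for v in "$@"; do
        if command -v "python${v}" >/dev/null 2>&1; then
            command -v "python${v}"
            return 0
        fi
    done
    return 1
}

# Example: install_pythons 3.6 3.7 && pick_python 3.7 3.6
```

The path printed by `pick_python` is what a user would then have to hand to reticulate explicitly, which is the discoverability concern raised above.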

@eddelbuettel
Member

eddelbuettel commented Jun 22, 2020

I am not a Python programmer but I have lived with Debian systems (and hence Ubuntu) for 25 years. And for as long as I can now remember they always had

 /usr/{,local}/lib/(site|dist)-library/{2,3}-[0-9]

in other words they had happily coexisting multiple versions. (Just how we had 1.* and 2.4-ish coexist way back when). I am a little fuzzy about whether you would get something like NumPy simultaneously for 3.6, 3.7 and 3.8 (and now 3.9) from the distro. I never cared enough about Python to really push that. (And now that I have 3.8 fine for free here, work wants 3.7 so off to an 18.04 container we go for that...)

I just find it a very sad statement for all of us that best solution on the OS with the arguably best supported package management system ... the best we can do is to double everything up again for every user below ~/.conda (or alike). But I am apparently an outlier here because everybody else thinks duplication in *env systems is the Bee's Knees.

So don't let me stop you, I am after all probably not your audience.

@eddelbuettel
Member

As for the selection by version, i.e. picking python37 explicitly etc: that is as far as I can tell done everywhere, including when people switch R interpreters by $PATH setting. The repos are versioned and don't overlap, so yes, that should work. But if three different Python projects used via reticulate need three different Python interpreters then you have yourself a problem. But that one was made, I'd argue, by the Python devs not trying harder to all gel on a current version. But what do I know...

@yuvipanda
Contributor Author

> I just find it a very sad statement for all of us that best solution on the OS with the arguably best supported package management system ... the best we can do is to double everything up again for every user below ~/.conda (or alike). But I am apparently an outlier here because everybody else thinks duplication in *env systems is the Bee's Knees.

This is unfortunately very true. I originally set up repo2docker (which builds images for mybinder) to use venvs only, and resisted calls to just use conda for everything for a long time - if all you are doing is installing a couple of libraries, it's just way faster and smaller. However, when we had to start supporting multiple versions of python, it basically proved impossible to do in a stable way. I spent a lot of time trying to make sure deadsnakes worked properly for multiple python versions in the same container image, and eventually gave up. Part of that was to try not to swim against the stream, but part of that is also a recognition that the use cases of data science & interactive computing aren't the cases that packaging in the ubuntu / debian sense was trying to solve.

Either way, I currently believe that just using miniforge (rather than miniconda) python environments for everything is the way to go on interactive datascience focused container images.
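
To make the "environments for everything" idea concrete, here is a rough sketch of one conda environment per Python version. The environment naming scheme, the `/opt/conda` default and the helper names are assumptions for illustration only.

```shell
#!/bin/sh
# Sketch only: environment names and versions are illustrative.
set -eu

CONDA_DIR="${CONDA_DIR:-/opt/conda}"

# Derive an environment name such as "py37" from a version such as "3.7".
env_name() {
    echo "py$(echo "$1" | tr -d '.')"
}

# Create one conda environment per requested Python version.
create_envs() {
    for v in "$@"; do
        "$CONDA_DIR/bin/conda" create --yes --name "$(env_name "$v")" "python=${v}"
    done
}

# Example: create_envs 3.7 3.8
# A script can then run under a specific version with:
#   "$CONDA_DIR/bin/conda" run --name py37 python script.py
```

Each environment carries its own interpreter and site-packages, which is the duplication discussed earlier in the thread, traded for version isolation without root.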

@vsoch

vsoch commented Jun 23, 2020

This is such an interesting issue! I don't know if I have any wisdom, but I want to offer another perspective, and all the ways I've installed Python over the years in containers. I'm reading that others have a preference for being able to install different Python versions, and my "strategy" is sort of the opposite of that - make it as easy as possible to just shell into the container and hit one version of Python, the "right" one for the container, and don't rely on the user to need to know about environments, etc. With that in mind, here are a bunch of installs I've done and some background around them.

Miniconda

I really liked using miniconda3 for a few years because it seemed so easy to start off with a working container with pip and conda. It's mostly out of laziness, because instead of needing to install anything I just do:

FROM continuumio/miniconda3

I also don't use "best practice" and checkout any environments, I would just install to the core. But I'm not sure how often the image bases are updated, albeit being used a ton.

minideb

This led me to try starting with a more well-updated base, and installing miniconda on my own. That would look something like:

FROM bitnami/minideb:stretch
WORKDIR /code
ENV PATH=/opt/conda/bin:${PATH}
ENV LANG=C.UTF-8
ENV SHELL=/bin/bash
RUN apt-get update && \
    /bin/bash -c "install_packages wget bzip2 ca-certificates git && \
    wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh && \
    bash Miniconda3-latest-Linux-x86_64.sh -b -p /opt/conda && \
    rm Miniconda3-latest-Linux-x86_64.sh && \
    conda create --name urlchecker && \
    conda clean --all -y"

And I created an environment this time. I think this one resulted in fewer security issues for the quay.io security scan, and I felt comforted that it was a more-up-to-date base image.

python

And of course it's super easy to use a Python base, and just install the basics that you need. I would typically do

FROM python:3.7
RUN apt-get update && apt-get install -y python3 python3-dev python3-pip python3-setuptools

(I can't find an example off the bat, I think I most recently used this in CI recipes actually).

So clearly the choice of "how to install python" is fairly subjective. It's like deciding based on gut feelings, but clearly we could come to some "best practice" based on some very simple criteria:

  • security issues (how often the container is updated)
  • ease of understanding the recipe
  • build size (multistage build is going to not help with python so much as something we could compile)
  • user interaction with the container

Probably we could do a project that 1) identifies some set of criteria, and then 2) browses around for different ways to do it, and then makes a comparison table to give a recommendation. Thoughts?

@cboettig
Member

I think this can be closed now too?

@yuvipanda
Contributor Author

@cboettig yeah, this is no longer relevant to this repo. But I'd like to perhaps re-open this on the rocker-versioned2 repo sometime.

yuvipanda closed this Jan 10, 2022
@cboettig
Member

@yuvipanda yeah, definitely would like to revisit the python setup! Currently have been leaning into pipenv (see https://github.com/rocker-org/ml#python-versions-and-virtualenvs, which probably also should be converted to a template repo now since that config is also in rocker-versioned2), which seems to provide a pretty straightforward way to install different versions of python for different projects (something which seems pretty necessary for our python work). pipenv configs are now detected by reticulate and renv too, I believe.
