feat: get host uid:gid and use in docker #576

Merged: 2 commits, Sep 16, 2019


focusaurus
Contributor

Impact: major
Type: feature|bugfix

Issue

On Linux, the docker container runs as user node (uid 1000), which in many cases is not the same as the developer's host OS uid. This can cause mismatches in ownership of files that are shared between the host and the container via docker volume mounts. Ultimately this surfaces as filesystem errors (EACCES, permission denied, etc.) at various times, including when trying to run yarn, do builds, etc.

Why doesn't this bug affect all Linux users?

On many Linux distributions, the first non-system user is assigned uid 1000. That's why the node user in our docker container has uid 1000: it's just a default. That's also why many developers have uid 1000 as well: it's the default for the first non-system user on their distribution, and their user is the first one they add when doing their OS install. Because of that default and coincidence, many users avoid this issue because their uid just happens to match the one we use. (This was the case for me personally, which is why it is harder for me to encounter/reproduce these issues.)
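
A quick way to tell which group you fall into is to compare your host uid with the container's default. This is only a minimal sketch: the function name uid_status is hypothetical and not part of this PR, and it assumes the container user is the stock node user at uid 1000 as described above.

```shell
#!/bin/sh
# Hypothetical helper, not part of this PR.
# Compares a host uid against the container's default node uid (assumed 1000).
uid_status() {
  host_uid="$1"
  container_uid="${2:-1000}"  # second argument is optional, defaults to 1000
  if [ "$host_uid" = "$container_uid" ]; then
    echo "match: volume ownership should line up by coincidence"
  else
    echo "mismatch: expect permission errors on volume mounts"
  fi
}
```

On a host you would call it as uid_status "$(id -u)".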

Why doesn't this affect macOS users?

Docker for Mac automatically maps file ownership between the host and the container, which largely avoids this issue.

How does this apply to reaction core? Other repos?

If testing proves successful in storefront, I believe this pattern would apply to reaction core and essentially every docker-based service we have that makes use of volume mounts, which is probably most but not all of them. I started with core initially, as a fix there would have the widest impact, but the docker setup there is significantly more complex, and given the slow startup time I switched to storefront as an easier first project to tackle.

Related issues

Solution

The key elements of the solution are:

  • Start the docker process as root (so if necessary we can chown volumes to fix them)
  • Determine the correct uid:gid with stat on the repo root
    • I think this will be pretty reliable and less hassle than REACTION_USER
  • During container startup, modify the node user account in the container to have the matching uid
  • Do a quick non-recursive stat on each volume mount point and check if ownership is correct.
    • If so, proceed with startup
    • If not, fix ownership
    • This is a performance trade-off that I think is the best we can do here balancing "it always just works" and keeping startup fast
  • su-exec to node in the container then proceed to launch the application
  • Provide a ./bin/fix-volumes script that can be run at any time that will chown/chmod all volume mount directories properly and should be a 1-stop fix to this entire category of errors
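
The steps above can be sketched roughly as follows. This is not the script from this PR: the function names, the REPO_ROOT and MOUNT_POINTS variables, and the exact usermod call are assumptions for illustration. It also assumes a GNU or BusyBox stat (the -c flag), and that usermod and su-exec are installed in the image.

```shell
#!/bin/sh
# Illustrative sketch only; names below are hypothetical.

# Print "uid:gid" for a path (GNU/BusyBox stat syntax).
owner_of() {
  stat -c '%u:%g' "$1"
}

# Succeed when the path's top-level owner equals the wanted "uid:gid".
# Deliberately non-recursive so the check stays cheap at startup.
owner_matches() {
  [ "$(owner_of "$1")" = "$2" ]
}

# Leave a correctly owned volume alone; otherwise chown it recursively.
fix_volume() {
  if owner_matches "$1" "$2"; then
    return 0
  fi
  printf 'Fixing volume %s (before=%s after=%s)\n' "$1" "$(owner_of "$1")" "$2"
  chown -R "$2" "$1"
}

# Runs as root: match the node uid to the repo root's owner, fix the
# mounts, then drop privileges and start the application.
prepare_and_start() {
  want="$(owner_of "${REPO_ROOT:-.}")"
  usermod -u "${want%%:*}" node      # assumes usermod exists in the image
  for mount in $MOUNT_POINTS; do     # assumed space-separated list of paths
    fix_volume "$mount" "$want"
  done
  exec su-exec node "$@"
}

# A real entrypoint would end with: prepare_and_start "$@"
```

Keeping the ownership check separate from the recursive chown is what makes the common case fast: a single stat per mount point when nothing needs fixing.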

There are a ton of unix-heavy details in here that we should scrutinize during code review.

Some things to note about the solution (pending QA testing)

  • The goal here is to have things "just work" in all cases
  • File owners and permissions on the host filesystem will be changed when running fix-volumes
    • There's potential here for surprise or confusion in the user base, and perhaps "hey don't do that!"
  • I believe the solution will handle pre-existing files in the volume mount directories with assorted owner/permission combinations and it should force them all to be correct, but I think there's a lot of testing surface here

Breaking changes

I don't think anything here would count as "breaking" but the changed ownership/permission of host files could be surprising/unexpected as noted above.

Testing

  • Try a fresh git clone of this example-storefront branch, ./bin/setup, and docker-compose up
    • Try on mac, linux with userid !=1000, linux with userid=1000
  • Create some permutations of ownership mismatches and test the fix scripts.
    • Directories of interest include
      • $HOME/.cache/yarn-offline-mirror
      • $HOME/.cache/yarn
      • example-storefront/node_modules
      • example-storefront/build
  • Note you may want to create a new user in your linux host for this to force non-1000 uid. On linux mint I was able to do this via the users & groups GUI (or you can use adduser CLI) and once I did sudo adduser plyons2 docker the new user could use docker. Then I sudo su - plyons2 to get a shell with that user for testing.
  • Verify when the application finally loads that it is not running as root
    • docker-compose run web ps -ef
    • You will see some early root processes then a switch to node for the bin/start
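
When working through the ownership permutations above, it can help to print the owner of each directory of interest before and after a run. A small sketch, assuming a GNU or BusyBox stat; the function name report_owners is hypothetical and not part of this PR.

```shell
#!/bin/sh
# Hypothetical helper: print "uid:gid path" for each argument,
# or "missing path" when the path does not exist.
report_owners() {
  for dir in "$@"; do
    if [ -e "$dir" ]; then
      printf '%s %s\n' "$(stat -c '%u:%g' "$dir")" "$dir"
    else
      printf 'missing %s\n' "$dir"
    fi
  done
}
```

For example: report_owners "$HOME/.cache/yarn-offline-mirror" "$HOME/.cache/yarn" example-storefront/node_modules example-storefront/build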

Example good output

  • Your container id may vary
  • If everything is running as root, that's a bug
docker exec --interactive --tty rc-storefront_web_1 ps -ef
PID   USER     TIME  COMMAND
    1 node      0:00 sh ./bin/start
   81 node      0:00 node /opt/yarn-v1.13.0/bin/yarn.js dev
  102 node      0:46 /usr/local/bin/node ./src/server.js
  113 root      0:00 ps -ef
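
To check output like the above without reading it by eye, the process list can be filtered for root-owned entries. This is a sketch under assumptions: the function name root_processes is hypothetical, and it expects the BusyBox column order shown above (PID, USER, TIME, COMMAND); a procps ps -ef prints different columns and would need different field numbers.

```shell
#!/bin/sh
# Hypothetical helper: reads `ps -ef` output on stdin and prints any
# root-owned process other than the ps invocation itself.
# Assumes BusyBox columns: PID USER TIME COMMAND.
root_processes() {
  awk 'NR > 1 && $2 == "root" && $4 != "ps" { print }'
}
```

Usage would look like docker exec rc-storefront_web_1 ps -ef | root_processes. Empty output means nothing besides ps is running as root; with docker-compose run, some early root processes from container startup may still show up, as noted above.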

@focusaurus
Contributor Author

I tested on mac and linux and it worked properly in a few basic cases.

Contributor

@rosshadden rosshadden left a comment

How do you feel about using this as a case study for a bit until we're satisfied it works for everyone before we roll out to other projects?

@focusaurus
Contributor Author

Sounds pragmatic to me. I'd also like a bit more testing before merge. I'll put a call out on slack. @manueldelreal might you be able to test this branch?

@focusaurus
Contributor Author

focusaurus commented Sep 7, 2019 via email

@focusaurus
Contributor Author

OK I started implementing separate Dockerfiles but that got me thinking. I think duplicating the dockerfile will make things worse and the inevitable drift between them is something I want to avoid. I've made Dockerfile changes that ensure fix-volumes.sh is cheap for a production start and I think that's better than having 2 complex Dockerfiles to maintain in almost-sync.

@rosshadden
Contributor

You can run commands based on environment variables. For example (ripped from some medium post so I wouldn't have to contrive one):

RUN if [ "$NODE_ENV" = "development" ]; \
	then npm install;  \
	else npm install --only=production; \
	fi

@aldeed
Contributor

aldeed commented Sep 9, 2019

@focusaurus Yes there will always be a point at which maintaining the conditional logic correctly is more confusing than splitting to two files. We may not be at that point, but I just thought I'd float the idea.

If you're sticking with one, then you'll need to keep CMD ["yarn", "start"] instead of the entrypoint when it's a prod build. The command is overridden in docker-compose.yml anyway, with command: "/usr/local/src/reaction-app/bin/start". Couldn't you keep the same override paradigm and add your script to bin/start?

I'm not particular about the exact mechanics of how we do it, but I do think the Dockerfile should be as minimal, dependency-free, and production-targeted as possible, since docker-compose.yml can override most things when it's being used for development. Even the existing CMD ["yarn", "start"] could be changed to directly do node ./src/server.js command to eliminate that yarn dep.

One other thing that I've wanted to try to do is to create a single published "development environment" image that most of our docker-compose files can use instead of using the local Dockerfile. That would help people get going faster by eliminating the initial image build time, and would take up less hard drive space. Since all we really need for development is a Node image with some pre-chown'd folders into which we can link host files, why not use the same image everywhere? But that would mean a separate Dockerfile, which I guess is what was in the back of my mind when suggesting that.

If there is a way to solve the USER/chown stuff without image changes, then we actually could use one of the official Node images directly for development. That would be my true goal.

To be clear, what I mean is that in docker-compose.yml, ideally we would change this:

build:
  context: .

to this:

image: node:10-alpine

or at least to this:

image: reactioncommerce/dev:node10v1

At which point Dockerfile needs no conditional logic because it's only for production (maybe also for CI tests).

@focusaurus
Contributor Author

If you're sticking with one, then you'll need to keep CMD ["yarn", "start"] instead of the entrypoint when it's a prod build.

entrypoint.sh runs ./bin/start as the default command.

The command is overridden in docker-compose.yml anyway, with command: "/usr/local/src/reaction-app/bin/start". Couldn't you keep the same override paradigm and add your script to bin/start?

I removed the docker-compose override. There are lots of ways this could work. My thought was "create a generic mechanism to solve the docker uid issue that we can eventually templatize to all projects that need volume mounts", so I was trying to keep it logically separated into a "prepare the mounts" phase prior to a "start the app" phase.

I'm not particular about the exact mechanics of how we do it, but I do think the Dockerfile should be as minimal, dependency-free, and production-targeted as possible, since docker-compose.yml can override most things when it's being used for development. Even the existing CMD ["yarn", "start"] could be changed to directly do node ./src/server.js command to eliminate that yarn dep.

I was initially trying to limit the scope of changes in this PR in hopes of getting it to land without spending a full cycle on it. There's lots we can change about how we use docker. All I'm focused on in this PR is getting the volume mount permissions correct.

Your other comment about not needing a Dockerfile for local dev might be nice, I guess, but it's beyond the scope I want to tackle here.

@manueldelreal
Member

@focusaurus tested with fresh installs for users with UID 1001 (my default one), 1000 and an entirely new user, all three scenarios yielded a running container with no permissions issues.

I had a slight hiccup on the first run, but it was unrelated to this branch's code changes. I'd say this looks good for the scope that you are trying to tackle here.

@aldeed
Contributor

aldeed commented Sep 10, 2019

Regarding "beyond the scope", 👍 . We so rarely touch the docker setup that it's tempting to do all of the things that have been building up whenever we do.

@focusaurus
Contributor Author

OK, bin/start has some dev-only stuff in there, so I need to change the default prod command.

@focusaurus focusaurus changed the title feat: get host uid:gid and use in docker WIP: feat: get host uid:gid and use in docker Sep 10, 2019
@focusaurus
Contributor Author

Marking this WIP. We have a lot of stuff that's good and ready to go but we're considering more substantial changes to the prod Dockerfile.

@focusaurus
Contributor Author

@manueldelreal Can you pull down the latest changes and run a few more quick tests please?

@focusaurus
Contributor Author

Tested latest commits on my mac, all good:

Successfully built a33ce44c4c93
Successfully tagged example-storefront_web:latest
Recreating example-storefront_web_1 ... done
Attaching to example-storefront_web_1
web_1  | Fixing volume ./node_modules (before=0:0 after=501:0)…✓
web_1  | [11:05:44 PM] Compiling server
web_1  | [11:05:50 PM] Compiling client
web_1  | > Using external babel configuration
web_1  | > Location: "/usr/local/src/reaction-app/.babelrc"
web_1  | [11:05:59 PM] Compiled server in 15s
web_1  | [11:06:03 PM] Compiled client in 14s
web_1  |  DONE  Compiled successfully in 19488ms11:06:03 PM
web_1  | 
web_1  | Server started ! ✓
web_1  | 
web_1  |       http://localhost:4000
web_1  |       Press CTRL-C to stop
web_1  |     
web_1  |  WAIT  Compiling...11:06:04 PM
web_1  | 
web_1  | [11:06:05 PM] Compiling client
web_1  | [11:06:05 PM] Compiled client in 542ms
web_1  |  DONE  Compiled successfully in 580ms11:06:05 PM
web_1  | 

@focusaurus
Contributor Author

Also tested on linux with uid 1001. Looks good. FYI this line of output is the new code:

web_1  | Fixing volume ./node_modules (before=0:0 after=1001:1001)…✓

- Grab the uid:gid of the repo root in docker
- Use that for the node user we run as in the container
- Check owner of all volume mounts, if not OK, fix them
- this should avoid permission errors on linux
- provide bin/fix-volumes to fix owner issues ad hoc
- There is no more ../node_modules
  - 1 and only 1 place where modules go for local dev (and prod)
  - local dev it's a volume mount, prod it's baked in
  - We have no native add-ons at the moment, so it should be OK, but
    `npm rebuild` as needed
- Also many docker and CI refactorings
  - split prod and dev dockerfiles
    - They are both small now
  - Use ci-scripts/docker-labels to reduce LABEL boilerplate
  - change lint CI task to run outside a docker container

Signed-off-by: Peter Lyons <pete@reactioncommerce.com>
@focusaurus focusaurus changed the title WIP: feat: get host uid:gid and use in docker feat: get host uid:gid and use in docker Sep 12, 2019
@focusaurus
Contributor Author

OK I think I'm good for someone to merge this after the next round of code re-review and testing.

@focusaurus
Contributor Author

OK I'm going to merge this. The interesting change is a single commit we can revert later if any serious issues surface.

@focusaurus focusaurus merged commit f159587 into develop Sep 16, 2019
@focusaurus focusaurus deleted the feat-docker-uid-match-3 branch September 16, 2019 15:20
@janus-reith
Collaborator

janus-reith commented Nov 15, 2019

Just noticed that this might not work in some cases.
On AWS Linux my default USER:GROUP is 1000:1000 as expected.
When fix-volumes.sh runs, it states:
Fixing volume /home/node/.cache/yarn (before=0:0 after=1000:1000)…✓

After that, everything inside the reaction-next-starterkit directory belongs to root and can't be edited by my normal user anymore. I wonder what is wrong here.

After removing the existing volumes and chowning my folder back, I can't reproduce it anymore.
It still states that it changes permissions from 0:0 to 1000:1000, but without changing everything to root.

@focusaurus
Contributor Author

Hmm. We saw a little weirdness today too with some older versions of the docker-base dev images. Could you docker pull reactioncommerce/node-dev:10.16.3-v2 to make sure you have the most recent build of that and then see if you can reproduce your issue? I should be able to track it down if it's doing something weird like that consistently.
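
When trying to reproduce something like this, listing the files under the project that ended up owned by a particular uid can narrow down what was touched. A small sketch: the function name owned_by is hypothetical and not part of this PR, and it relies only on the standard find -user test.

```shell
#!/bin/sh
# Hypothetical helper: list everything under a directory owned by a uid.
# Usage: owned_by <uid> <directory>
owned_by() {
  find "$2" -user "$1" -print
}
```

For example, owned_by 0 . | head run from the project root on the host would show the first few root-owned paths.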
