starter
starter's code is found in cmd/starter. It's a CGO program, with high-level configuration management implemented in Go and low-level system access implemented in C.
An init function is used to make sure the Go runtime is configured to use a single goroutine and to pin the main function to a single thread.
The C portion of the code runs before the Go runtime by way of a constructor function.
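The constructor mechanism mentioned above can be sketched as follows. This is a minimal illustration, not starter's actual code: the names `early_init`, `early_init_done` and `early_init_ran` are hypothetical, and the sketch assumes a GCC or Clang toolchain, which provide the `constructor` attribute.

```c
/* Hypothetical flag recording that the constructor has run. */
static int early_init_done = 0;

/* GCC and Clang call functions marked with the constructor attribute when
 * the binary is loaded, before main() is entered. */
__attribute__((constructor)) static void early_init(void) {
    early_init_done = 1;
}

/* Returns 1 if the constructor has already run. */
int early_init_ran(void) {
    return early_init_done;
}
```

By the time `main` (or, in a CGO binary, Go code) calls `early_init_ran()`, the flag is already set.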
Two environment variables affect the program's behavior:
- SINGULARITY_MESSAGELEVEL: the log level
- PIPE_EXEC_FD: the file descriptor used to pass the configuration (in JSON format) to starter
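Reading a file descriptor number such as PIPE_EXEC_FD from the environment could look roughly like the sketch below. The helper name `fd_from_env` is hypothetical, and the validation rules (decimal, non-negative, fits in an `int`) are assumptions rather than starter's documented behavior.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <limits.h>
#include <stdlib.h>

/* Parse a file descriptor number from the named environment variable.
 * Returns -1 if the variable is unset, empty, malformed or out of range. */
int fd_from_env(const char *name) {
    const char *s = getenv(name);
    char *end = NULL;
    long v;

    if (s == NULL || *s == '\0')
        return -1;
    errno = 0;
    v = strtol(s, &end, 10);
    if (errno != 0 || *end != '\0' || v < 0 || v > INT_MAX)
        return -1;
    return (int)v;
}
```

A caller would use it as `int fd = fd_from_env("PIPE_EXEC_FD");` and treat a negative result as a fatal configuration error.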
The configuration is placed in shared memory so that the child processes can access it.
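One common way to share a structure with child processes is an anonymous shared mapping. The sketch below assumes that approach; this page does not say which mechanism starter uses. `struct demo_config` and both function names are hypothetical stand-ins.

```c
#define _GNU_SOURCE
#include <string.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical stand-in for the real configuration structure. */
struct demo_config {
    int ready;
    char json[64];
};

/* Map an anonymous shared region. Unlike ordinary memory, its pages stay
 * shared (not copied) across fork(). The kernel zero-fills the region. */
struct demo_config *alloc_shared_config(void) {
    void *p = mmap(NULL, sizeof(struct demo_config), PROT_READ | PROT_WRITE,
                   MAP_SHARED | MAP_ANONYMOUS, -1, 0);
    return p == MAP_FAILED ? NULL : (struct demo_config *)p;
}

/* Fork a child that writes into the shared region. Returns 0 once the child
 * has exited successfully, -1 otherwise. */
int fill_config_in_child(struct demo_config *cfg, const char *json) {
    int status;
    pid_t pid = fork();

    if (pid < 0)
        return -1;
    if (pid == 0) {
        /* The last byte is never written, so the string stays terminated. */
        strncpy(cfg->json, json, sizeof(cfg->json) - 1);
        cfg->ready = 1;
        _exit(0);
    }
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return (WIFEXITED(status) && WEXITSTATUS(status) == 0) ? 0 : -1;
}
```

After `fill_config_in_child` returns, the parent sees the child's writes in `cfg`, which would not be the case for memory obtained from `malloc`.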
The program determines whether its setuid bit is set by examining the auxiliary vector looking for the AT_SECURE attribute. The kernel sets this attribute to a non-zero value to indicate that the program should be treated securely, and this usually means the setuid bit was set.
Why isn't getauxval used? It's been available since glibc 2.16.
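For comparison, here is a sketch of both ways to query AT_SECURE. It assumes the manual scan reads /proc/self/auxv as pairs of native words, which may differ from how starter actually walks the vector. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <elf.h>
#include <fcntl.h>
#include <sys/auxv.h>
#include <unistd.h>

/* Scan /proc/self/auxv for an entry of the given type. Each entry is a
 * (type, value) pair of native words. Returns 0 and stores the value on
 * success, -1 if the file cannot be read or the entry is absent. */
int auxv_lookup(unsigned long type, unsigned long *value) {
    unsigned long entry[2];
    int found = -1;
    int fd = open("/proc/self/auxv", O_RDONLY);

    if (fd < 0)
        return -1;
    while (read(fd, entry, sizeof(entry)) == (ssize_t)sizeof(entry)) {
        if (entry[0] == AT_NULL)
            break; /* end of the vector */
        if (entry[0] == type) {
            *value = entry[1];
            found = 0;
            break;
        }
    }
    close(fd);
    return found;
}

/* The same query through glibc (getauxval, available since glibc 2.16). */
unsigned long auxv_secure_glibc(void) {
    return getauxval(AT_SECURE);
}
```

Both functions should agree: a non-zero value means the kernel asked for secure-execution mode, which is typically the case for a setuid binary.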
If the program is running as root or setuid, it tries to mount an overlay in order to get the kernel to load the overlay module.
Why not modprobe overlay? Probably to account for the overlay module being built into the kernel (a comment should be added to the code).
Privileges are dropped temporarily.
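A temporary drop usually means changing only the effective IDs, so that the saved set-user-ID still allows switching back. The sketch below assumes that classic `seteuid`/`setegid` pattern. The exact calls starter makes are not stated on this page, and both function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sys/types.h>
#include <unistd.h>

/* Temporarily drop privileges by switching the effective GID and UID to the
 * real ones. The group is changed first, while the process is still allowed
 * to change it. The saved IDs are left untouched. */
int drop_privileges_temporarily(void) {
    if (setegid(getgid()) != 0)
        return -1;
    if (seteuid(getuid()) != 0)
        return -1;
    return 0;
}

/* Restore previously recorded effective IDs. The UID is restored first so
 * that the process regains the right to change its GID. */
int restore_privileges(uid_t euid, gid_t egid) {
    if (seteuid(euid) != 0)
        return -1;
    if (setegid(egid) != 0)
        return -1;
    return 0;
}
```

A caller records `geteuid()` and `getegid()` before dropping and passes them to `restore_privileges` later. In a binary that is not setuid, both calls are effectively no-ops.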
Configuration is read and the configuration file descriptor is closed.
Each of stdin, stdout and stderr is pointed to /dev/null if it is closed. This is done because some programs do not work properly if file descriptors 0, 1 and 2 are closed at startup.
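A minimal sketch of this step, assuming a closed descriptor is detected with `fcntl(fd, F_GETFD)` failing with `EBADF`. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Returns 1 if fd does not refer to an open file. */
static int fd_is_closed(int fd) {
    return fcntl(fd, F_GETFD) == -1 && errno == EBADF;
}

/* Point each closed standard descriptor (0, 1, 2) at /dev/null.
 * Returns 0 on success, -1 on failure. */
int fix_std_fds(void) {
    int fd;

    for (fd = 0; fd <= 2; fd++) {
        int null_fd;

        if (!fd_is_closed(fd))
            continue;
        null_fd = open("/dev/null", O_RDWR);
        if (null_fd < 0)
            return -1;
        /* open() returns the lowest free descriptor, which is normally fd
         * itself. Handle the general case anyway. */
        if (null_fd != fd) {
            if (dup2(null_fd, fd) < 0) {
                close(null_fd);
                return -1;
            }
            close(null_fd);
        }
    }
    return 0;
}
```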
The list of open file descriptors is saved.
The stage 1 thread is launched sharing open files and filesystem information with the main thread, in both directions. This is achieved by passing CLONE_FILES to the clone call, causing both processes to share the same file descriptor table. CLONE_FS is also passed to clone, causing both processes to share the same filesystem information, including the root of the filesystem, the current working directory and the umask.
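The effect of these two flags can be demonstrated with the glibc `clone` wrapper. This is a sketch under assumptions: starter may invoke clone differently, and `child_main`, `run_sharing_child` and the stack size are hypothetical. Memory is not shared here (no CLONE_VM), only the descriptor table and filesystem information.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define STACK_SIZE (256 * 1024)

/* Child entry point. It opens /dev/null on the descriptor number passed in
 * and changes directory to "/". Both effects are visible in the parent. */
static int child_main(void *arg) {
    int target_fd = *(int *)arg;
    int fd = open("/dev/null", O_RDONLY);

    if (fd < 0 || dup2(fd, target_fd) < 0)
        return 1;
    if (fd != target_fd)
        close(fd);
    return chdir("/") == 0 ? 0 : 1;
}

/* Run child_main in a process sharing the caller's descriptor table and
 * filesystem information. Returns the child's exit status, or -1. */
int run_sharing_child(int target_fd) {
    char *stack = malloc(STACK_SIZE);
    int status;
    pid_t pid;

    if (stack == NULL)
        return -1;
    /* The stack grows downward on common architectures: pass its top. */
    pid = clone(child_main, stack + STACK_SIZE,
                CLONE_FILES | CLONE_FS | SIGCHLD, &target_fd);
    if (pid < 0 || waitpid(pid, &status, 0) < 0) {
        free(stack);
        return -1;
    }
    free(stack);
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

After `run_sharing_child(100)` returns 0, descriptor 100 is open in the caller and the caller's working directory is "/", although only the child performed those operations.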
Stage 1 is responsible for parsing the singularity configuration file, handling user input, reading capabilities and checking which namespaces are required.
If the binary is setuid, root privileges are restored and stage 1 is prepared.
The master thread waits for stage 1 to be done.
The master thread checks the exit status of the stage 1 process. If it's non-zero, the master exits. If stage 1 was terminated by a signal, the master sends the same signal to itself.
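The status check can be sketched with the standard `wait` macros. The names below are hypothetical, and exiting with the child's own status code is an assumption, since the text only says that the master exits.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

enum outcome { OUTCOME_OK, OUTCOME_FAILED, OUTCOME_SIGNALED, OUTCOME_ERROR };

/* Wait for pid and classify how it ended. On OUTCOME_FAILED *detail is the
 * exit status, on OUTCOME_SIGNALED it is the terminating signal. */
enum outcome wait_and_classify(pid_t pid, int *detail) {
    int status;

    if (waitpid(pid, &status, 0) < 0)
        return OUTCOME_ERROR;
    if (WIFSIGNALED(status)) {
        *detail = WTERMSIG(status);
        return OUTCOME_SIGNALED;
    }
    if (WIFEXITED(status)) {
        *detail = WEXITSTATUS(status);
        return *detail == 0 ? OUTCOME_OK : OUTCOME_FAILED;
    }
    return OUTCOME_ERROR;
}

/* Act on the outcome: re-send the signal to ourselves, or exit on failure.
 * Returns only if the child succeeded or the signal did not terminate us. */
void propagate_outcome(enum outcome o, int detail) {
    if (o == OUTCOME_SIGNALED)
        kill(getpid(), detail);
    else if (o != OUTCOME_OK)
        exit(o == OUTCOME_FAILED ? detail : 1); /* assumption: reuse status */
}
```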
If the container to be started is an instance, fork:
- Child:
- Set itself as session leader
- Set process mask to 0
- Parent (master):
- Close both file descriptors from socket pair.
- Set signal handler for SIGUSR1 (exit with status 0) and SIGUSR2 (exit with status 1)
- Wait for child
- Exit with same status as child
What is this and why is it necessary?
The new list of open file descriptors is captured.
The two file descriptor lists are compared. Any new file descriptor is closed if it corresponds to a tty device, to an anonymous inode (obtained from calling epoll_create, inotify_init, eventfd, etc.), or if it cannot be resolved (the /proc/$pid/fd/$fd symlink is broken or cannot be read). The file descriptors corresponding to the socket pair are ignored. For all the other file descriptors, the close-on-exec flag is set.
Why?
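The classification described above might be sketched as follows. This is an illustration under assumptions: the names are hypothetical, anonymous inodes are recognized by the `anon_inode:` prefix that Linux shows in the /proc symlink target, only the "cannot be read" case of an unresolvable link is checked, and the socket pair exemption is left out.

```c
#define _GNU_SOURCE
#include <fcntl.h>
#include <stdio.h>
#include <string.h>
#include <unistd.h>

enum fd_action { ACT_CLOSE, ACT_KEEP_CLOEXEC };

/* Decide what to do with a descriptor that appeared between two snapshots. */
enum fd_action classify_new_fd(int fd) {
    char link[64], target[4096];
    ssize_t n;

    snprintf(link, sizeof(link), "/proc/self/fd/%d", fd);
    n = readlink(link, target, sizeof(target) - 1);
    if (n < 0)
        return ACT_CLOSE; /* the symlink cannot be read */
    target[n] = '\0';
    if (isatty(fd))
        return ACT_CLOSE; /* tty device */
    if (strncmp(target, "anon_inode:", 11) == 0)
        return ACT_CLOSE; /* epoll, inotify, eventfd, ... */
    return ACT_KEEP_CLOEXEC;
}

/* Apply the decision: close the descriptor, or mark it close-on-exec. */
int apply_fd_action(int fd) {
    int flags;

    if (classify_new_fd(fd) == ACT_CLOSE)
        return close(fd);
    flags = fcntl(fd, F_GETFD);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFD, flags | FD_CLOEXEC);
}
```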
User namespace is initialized. This elevates privileges if any of these conditions are true:
- A user namespace is not specified and a new user namespace is not requested
- A user namespace is not specified, a new user namespace is requested, and a shared mount is not requested
Note that from here on the process might be operating with elevated privileges (see above).
In the same step, if a new user namespace is requested, a user namespace is not specified and shared mount is not requested, then CLONE_NEWUSER is added to fork flags.
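The conditions above can be written as two small predicates. The sketch encodes the rules exactly as this page states them and has not been checked against the source. `struct userns_request` and both function names are hypothetical.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

/* Hypothetical view of the user namespace part of the configuration. */
struct userns_request {
    bool specified;    /* an existing user namespace was specified */
    bool create_new;   /* a new user namespace is requested */
    bool shared_mount; /* a shared mount is requested */
};

/* True when either of the two listed conditions holds:
 *   !specified && !create_new
 *   !specified &&  create_new && !shared_mount */
bool elevates_privileges(const struct userns_request *r) {
    if (r->specified)
        return false;
    return !r->create_new || !r->shared_mount;
}

/* Add CLONE_NEWUSER to the fork flags when a new user namespace is
 * requested, none is specified and no shared mount is requested. */
unsigned int userns_fork_flags(const struct userns_request *r,
                               unsigned int flags) {
    if (r->create_new && !r->specified && !r->shared_mount)
        flags |= CLONE_NEWUSER;
    return flags;
}
```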
If fork flags is exactly CLONE_NEWUSER, a file descriptor for event notification is set up.
If a join mount is not requested, the RPC socket pair is set up.
If the process is running suid, the filesystem ID is reset to the real ID of the calling process.
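Assuming "filesystem ID" refers to the Linux filesystem UID, the reset can be sketched with `setfsuid`. The helper names are hypothetical. The filesystem GID would be handled the same way with `setfsgid`.

```c
#define _GNU_SOURCE
#include <sys/fsuid.h>
#include <sys/types.h>
#include <unistd.h>

/* setfsuid() always returns the previous filesystem UID. Passing an invalid
 * ID such as -1 therefore queries the current value without changing it. */
uid_t current_fsuid(void) {
    return (uid_t)setfsuid((uid_t)-1);
}

/* Reset the filesystem UID to the real UID of the calling process. */
void reset_fsuid_to_real(void) {
    setfsuid(getuid());
}
```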
A pipe is created for synchronization.
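A pipe-based rendezvous between two processes can be sketched like this: one side blocks in `read` until the other side writes a byte. The function names are hypothetical and the one-byte protocol is an assumption.

```c
#define _GNU_SOURCE
#include <sys/wait.h>
#include <unistd.h>

/* Block until the peer writes one byte. Returns 0 if a byte arrived,
 * -1 on end-of-file or error. */
int sync_wait(int read_fd) {
    char c;
    return read(read_fd, &c, 1) == 1 ? 0 : -1;
}

/* Release a peer blocked in sync_wait(). Returns 0 on success. */
int sync_signal(int write_fd) {
    char c = 'x';
    return write(write_fd, &c, 1) == 1 ? 0 : -1;
}

/* Demo: the child waits for the parent's go-ahead, then exits with 0.
 * Returns the child's exit status, or -1 on error. */
int rendezvous_demo(void) {
    int p[2], status;
    pid_t pid;

    if (pipe(p) < 0)
        return -1;
    pid = fork();
    if (pid < 0) {
        close(p[0]);
        close(p[1]);
        return -1;
    }
    if (pid == 0) {
        close(p[1]);
        _exit(sync_wait(p[0]) == 0 ? 0 : 1);
    }
    close(p[0]);
    sync_signal(p[1]); /* on failure the child sees EOF and exits with 1 */
    close(p[1]);
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```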
PID namespace is set up. This adds CLONE_NEWPID to fork flags if a new PID namespace is requested.
Set up process to be killed if parent dies.
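On Linux this is typically done with `prctl(PR_SET_PDEATHSIG, ...)`. The sketch below assumes that call and assumes SIGKILL as the signal, neither of which is confirmed on this page. The function names are hypothetical.

```c
#define _GNU_SOURCE
#include <signal.h>
#include <sys/prctl.h>

/* Ask the kernel to send SIGKILL to the calling process when its parent
 * dies. Returns 0 on success, -1 on error. */
int die_with_parent(void) {
    return prctl(PR_SET_PDEATHSIG, (unsigned long)SIGKILL);
}

/* Return the currently configured parent-death signal (0 if none),
 * or -1 on error. */
int current_pdeathsig(void) {
    int sig = 0;

    if (prctl(PR_GET_PDEATHSIG, &sig) < 0)
        return -1;
    return sig;
}
```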
Rendezvous with master on user namespace mappings and apply user namespace mappings.
Close one end of the master socket pair.
Initialize hostname (UTS) namespace.
Rendezvous with master process on sync pipe.
If a new PID namespace is requested and a new mount namespace is requested, a new PID namespace is created.
If a new user namespace is requested, set up the new user namespace mappings for the stage 2 process, then rendezvous with the stage 2 process.
Terminal control is passed to stage 2 process.
Close one end of the master socket pair.
Rendezvous with stage 2 process on sync pipe.
Stage preparation is the same for all the stages; it only changes as a function of the current configuration.
TBD
The configuration structure consists of:
- capabilities: the set of permitted, effective, inheritable, bounding and ambient Linux capabilities
- namespace: the network, mount, user, IPC, UTS, cgroup and PID namespace information
- container
- json: the entire configuration as a JSON object