Posh, the "Process Offload Shell'", is a shell and runtime that automatically reduces data movement when running shell pipelines on data stored in remote storage, such as NFS.
Posh enables speedups for I/O heavy shell applications that access remote
filesystems by pushing computation to proxy servers closer to the data.
Posh uses annotations, metadata
about individual shell commands (awk
, cat
, grep
, etc.) to understand which files an arbitrary shell pipeline will access to schedule and execute the command across proxy servers, so the computation looks like it had been running locally.
For more details, check out our research paper, POSH: A Data Aware Shell which will be published at Usenix ATC 2020.
This implementation is a research prototype -- use at your own risk!
- The latest version of Rust. See this link for installation details.
- Our prototype has been fully tested on Linux (versions 18.04 and 19.10).
- To build the repo, run from the main directory:
cargo b --release
- Run the integration tests and unit tests to ensure that Posh can properly redirect
stdin
,stdout
andstderr
between processes:
cargo test
- Posh includes a server binary that runs at a proxy server, which must have access to the same remote filesystem data that the client is trying to access.
- Posh includes two client binaries, that intercept shell commands and
schedule and execute them across the proxy servers:
- The first executes shell scripts (with one or more commands) over Posh
- The second starts a shell prompt and lets user type in individual commands.
- The client and server must communicate over a custom port, which can be configured in both the server and client binaries; the default is 1235. The server must keep this port open for TCP traffic.
- The client and server binaries require a directory to store temporary output while processes are running.
- A proxy server must have access to one remote folder on behalf of a client. The proxy server could be a remote storage server itself (store this data locally) or even access this data over NFS or another remote filesystem protocol. Run the following at the proxy server:
$POSH_SRC/target/release/server
--folder <client_folder> # folder this Proxy provides access to, required
--ip_address <ip_addr> # ip address of the client, required
--runtime_port <runtime_port> # port server has open for all Posh communication, default = 1235
--tmpfile <path/to/temporary/directory> # place for Posh to keep temporary output while running commands, required
- The Posh client shell requires an annotations file and a configuration file to understand and schedule shell commands. Each section, linked contains further information about the information these files much contain.
- To run the shell script binary, run:
$POSH_SRC/target/release/shell-exec
<binary> # shell script to run over Posh, required
--annotations_file <path> # path to annotations, required
--mount_file <path> # path to config file, required
--pwd <directory> # directory to execute this script from, required
--tmpfile <path/to/temporary/directory> # place for Posh to keep temporary output while running commands, required
--runtime_port <runtime_port> # port to communicate with server with, default = 1235
--splitting_factor <splitting factor> # parallelization factor, default = 1
--tracing_level <tracing_level> # log debug outpu†, default = none
- To run the shell prompt binary, run:
$POSH_SRC/target/release/shell-client
--annotations_file <path> # path to annotations, required
--mount_file <path> # path to config file, required
--tmpfile <path/to/temporary/directory> # place for Posh to keep temporary output while running commands, required
--runtime_port <runtime_port> # port to communicate with server with, default = 1235
--splitting_factor <splitting factor> # parallelization factor, default = 1
--tracing_level <tracing_level> # log debug outpu†, default = none
- Syntax allowed:
- Posh can accelerate commands with standard shell syntax, including pipes
(
|
), andstdin
,stdout
andstderr
redirections (<
,>
,2>
) - Posh allows export commands (e.g.
export VAR=VALUE
) to configure environment variables within scripts - We are working on including more standard syntax.
- Posh can accelerate commands with standard shell syntax, including pipes
(
- A sample config file is provided in
config/sample.config
. To use Posh, edit the lines undermounts
with your configuration information. - The config file has up to 3 parts. # 1 is required, while 2 and 3 are
only necessary for experimental features.
- [Required] A list of
mounts
, e.g. a list of IPs for proxy servers mapped to the corresponding client remote mounted directory, which must be an absolute path, for example:mounts: "255.255.255.0": "/home/user/remote_mount1" "255.255.255.1": "/home/user/remote_mount2"
- [Optional] A list of rough link speeds between different proxies, where the
client
is included as a local proxy. This is used for an experimental scheduling algorithm. For example:links: "(255.255.255.0,client)": 500 # in Mbps "(255.255.255.1,client)": 500 # in Mbps ```
- [Optional] A list of temporary file locations on each proxy server
that Posh can write to.
tmp_directory: "255.255.255.1": "/tmp/posh"
- [Required] A list of
- Sample annotations are provided in
config/eval_annotations.txt
- See ANNOTATIONS.md for more information on the annotation format and adding your own annotations.
Here, we'll describe the end to end steps for running a program over Posh, for a
simple pipeline that runs cat
of files from two different NFS mounts, and
pipes the output to grep
. The proxy servers will run on each NFS mount directly.
-
Configure NFS at the client and servers so the two servers expose NFS mounts to the client. In this example, each NFS servers hosts the shared directory at
/mnt/logs
and the client mounts them at/home/user/mount1
and/home/user/mount2
respectively. -
The data we'll be using is network access logs from the SEC's Edgar log dataset. Download two sample logs, one to each mount, from the edgar log website:
# (at the first NFS server) wget http://www.sec.gov/dera/data/Public-EDGAR-log-file-data/2017/Qtr1/log20170314.zip unzip log20170314.zip mv log20170314.csv /mnt/logs/log.csv # (at the second NFS server) wget http://www.sec.gov/dera/data/Public-EDGAR-log-file-data/2017/Qtr1/log20170315.zip unzip log20170315.zip mv log20170315.csv /mnt/logs/log.csv
-
At each of the proxy servers, run the following command (substituting the client IP address):
$POSH_SRC/target/release/server --folder /mnt/logs --ip_address $CLIENT_IP --tmpfile /tmp/posh
- Configure the client configuration file as described in the above section to look like the following:
mounts:
"FIRST_SERVER_IP": "/home/user/mount1"
"SECOND_SERVER_IP": "/home/user/mount2"
- Run the following at the client:
cd /home/user
$POSH_SRC/target/release/shell-client --annotations_file $POSH_SRC/config/eval_annotations.txt --mount_file $POSH_SRC/config/sample.config
- At the resulting prompt, type in:
cat mount1/log.csv mount2/log.csv | grep "127.0.0.1"
The result will show up faster than using bash
, as Posh offloads a cat | grep
command to run at each proxy server and just aggregates the output in the correct order.