# -*- Mode: sh; sh-basic-offset:2 ; indent-tabs-mode:nil -*-
#
# Copyright (c) 2017 Los Alamos National Security, LLC. All rights
# reserved.
# $COPYRIGHT$
#
# Additional copyrights may follow
#
# $HEADER$
#
Notes on using DataWarp on LANL Cray systems
============================================
Last Update 27 March 2017
Cornell Wright
DataWarp is a Cray product that provides a high speed storage device based on
SSD technologies that is more closely integrated with the compute nodes on an
HPC system than a traditional parallel file system (PFS).
A full description of DataWarp is available in the documents:
XC(TM) Series DataWarp(TM) User Guide (CLE 6.0.UP02) S-2558 and
XC(TM) Series DataWarp(TM) Installation and Administration Guide (CLE 6.0.UP02) S-2564,
both of which are available on the web site: https://pubs.cray.com.
A Brief Description
-------------------
DataWarp is implemented as a set of service nodes, each of which has about 6
TiB of SSD installed. When requested by a directive in a job script, a
DataWarp filesystem is initialized on some or all of the DataWarp nodes and
made available to the job's compute nodes via a mount on each compute node.
The DataWarp filesystem can operate in a variety of modes, including both
scratch and cache; this document will focus only on "striped scratch" mode
which is suitable for checkpoints.
In addition to the job script directive allocating the DataWarp filesystem,
additional directives are supported to cause pre-job stage-in of files or
post-job stage-out of files between DataWarp and the PFS.
Examples of those directives:
#DW jobdw type=scratch access_mode=striped capacity=100GiB
#DW stage_in destination=$DW_JOB_STRIPED/<dir> source=<pfs_dir> type=directory
#DW stage_out source=$DW_JOB_STRIPED/<dir> destination=<pfs_dir> type=directory
See note under known problems about #DW directives.
DataWarp Usage
--------------
When a DataWarp filesystem has been allocated, its mount point can be located
via environment variable DW_JOB_STRIPED. This variable is set both on the
internal login node and on all the compute nodes. DataWarp itself, however, is
only accessible from the compute nodes, so if a job script needs to access
DataWarp, it must do so via aprun.
For example, to make a directory in DataWarp, issue the command:
aprun -n 1 mkdir $DW_JOB_STRIPED/subdir
The compute nodes see DataWarp as a normal POSIX compatible filesystem. In
addition to the POSIX functionality, files and directories can be transferred
between DataWarp and the PFS independent of the compute nodes. These transfers
can be invoked as mentioned above by job script directives or by DataWarp API
calls made by the application. They are documented in the User Guide.
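For example, a minimal job script combining the directives and the aprun usage
described above might look like the following sketch (the application name
my_app and the PFS paths are placeholders; scheduler resource directives are
omitted):
#!/bin/bash
#DW jobdw type=scratch access_mode=striped capacity=100GiB
#DW stage_in destination=$DW_JOB_STRIPED/input source=<pfs_dir>/input type=directory
#DW stage_out source=$DW_JOB_STRIPED/output destination=<pfs_dir>/output type=directory
# DataWarp is only visible from the compute nodes, so all access goes through aprun
aprun -n 1 mkdir -p $DW_JOB_STRIPED/output
aprun -n 32 ./my_app $DW_JOB_STRIPED/input $DW_JOB_STRIPED/output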
When a DataWarp job is submitted, 3 additional jobs are automatically submitted
by Moab, called stage-in, stage-out and coordinating. The message printed to
stdout by msub, which normally contains just the job ID, for a DataWarp job
contains the stage-in, compute and stage-out job IDs, in that order.
Like this:
msub /lustre/ttscratch1/cornell/run_dw_simple_20170111.130847/dw_simple_job.sh
55983.datawarp-stagein 55984 55983.datawarp-stageout
The additional jobs are submitted by a Moab function, the msub DataWarp filter.
It logs its actions to the file ~/ac_datawarp.log in the user's home directory.
The state of DataWarp and all of a user's allocations can be displayed by the
dwstat command (documented in the User Guide). An example:
module load dws
dwstat most
    pool units quantity     free  gran
wlm_pool bytes 34.88TiB 34.69TiB 32GiB

sess state token creator     owner             created expiration nodes
  20 CA--- 45722 MOAB-TORQUE  4611 2016-12-05T13:42:20      never     0
  21 CA--- 45716 MOAB-TORQUE  4611 2016-12-05T13:42:20      never     0

inst state sess bytes nodes             created expiration intact label public confs
  20 CA---   20 32GiB     1 2016-12-05T13:42:20      never   true I20-0  false     1
  21 CA---   21 32GiB     1 2016-12-05T13:42:20      never   true I21-0  false     1

conf state inst    type activs
  22 CA---   20 scratch      0
  23 CA---   21 scratch      0
The libHIO package (described below) contains a script, dw_simple_sub.sh, which
will submit a simple DataWarp test job. This can be used to familiarize
oneself with the process of creating and submitting a DataWarp job.
An application team could choose to use DataWarp directly by issuing POSIX IO
calls and initiating stage-in and stage-out either via job script directives or
via DataWarp API calls. Alternatively, the LANL developed, open source package
libHIO could be used. See below for more information on libHIO.
libHIO
------
libHIO is a flexible, high-performance parallel IO package developed at LANL.
It has been released as open source and is available at:
https://github.com/hpc/libhio
libHIO supports IO to either a conventional PFS or to DataWarp, and it manages
DataWarp space as well as stage-in from and stage-out to the PFS.
For more information on using libHIO, see the github package, in particular:
README
libhio_api.pdf
hio_example.c
LANL Systems with DataWarp
--------------------------
System            DW Nodes  Capacity  Theoretical Speed
----------------  --------  --------  -----------------
Trinity Ph 1           300  1743 TiB  1589 GiB/s
Trinity Ph 2           234  1360 TiB  1239 GiB/s
Trinity Combined       576  3347 TiB  3050 GiB/s
Trinitite                6    35 TiB    32 GiB/s
As of the writing of this document, actual peak speeds are running
at about 90% of theoretical for N-N and at about 75% for N-1.
Using DWStat and DWCli:
-----------------------
The dwcli command line tool is the easiest way to stage data out from within
the job script (but not from within the application). The path handling is a
little tricky: you do not use real burst buffer paths, but rather paths
relative to $DW_JOB_STRIPED that must nevertheless begin with a slash. Here is
an example dwcli invocation that should make the concept clear:
# Retrieve only your own session id and configuration id
dwsessid=$(dwstat sessions |grep $((ACDW_JOBID+1)) |awk '{print $1}')
dwconfid=$(dwstat configurations |awk -v sess=$dwsessid '{if ($3 == sess) print $1}')
# Create a stageout directory, bb directory, and write stuff
mkdir -p /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/bbdat
aprun -n 1 mkdir -p $DW_JOB_STRIPED/lustre/ttscratch1/bws/job.81111.tt-drm/jobdir
< Write some data into $DW_JOB_STRIPED/lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/*>
dwcli stage out --session $dwsessid --configuration $dwconfid --backing-path /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir/bbdat/ --dir /lustre/ttscratch1/bws/job.81111.tt-drm/jobdir
Note 1: The stage out is asynchronous. To wait for the staging to complete
use a loop similar to this:
status=""
count=0
while [ "-" != "$status" -a $count < 100 ]; do
sleep 30
dwcli stage query --session $dwsessid --configuration $dwconfid
status=$(dwcli stage query --session $dwsessid --configuration $dwconfid |awk '{ print $8 }')
count=$((count++))
echo "Files remaining to be staged: <$status>"
done
Note 2: The backing path must end with a slash. Note that the --dir is
relative to $DW_JOB_STRIPED, and must start with a slash. This is
non-intuitive, but does make some degree of sense.
Note 3: Stage in is similar. You do not specify $DW_JOB_STRIPED for --dir;
that part is implied.
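As an illustration only, here is a stage-in sketch that assumes dwcli stage in
accepts the same --session, --configuration, --backing-path and --dir options
as stage out (the lustre path is a placeholder):
dwcli stage in --session $dwsessid --configuration $dwconfid --backing-path /lustre/ttscratch1/<user>/indata/ --dir /indata
After the stage completes, the data should be visible to the compute nodes
under $DW_JOB_STRIPED/indata.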
Known Problems and Gotchas:
---------------------------
Many of the problems seen by DataWarp users are related to the msub DataWarp
filter. This is a program that runs during the execution of the msub command.
The msub DataWarp filter is responsible for recognizing #DW directives in the
job script and using their information to launch the 3 auxiliary DataWarp jobs
that manage DataWarp allocation, deallocation, stage-in and stage-out. The
difficulties arise because the msub DataWarp filter runs in the user's context
and can be sensitive to user-specific factors such as the environment and SSH
configuration.
The msub DataWarp filter logs to a file in each user's home directory named
~/ac_datawarp.log. This log is a good starting point for debugging msub
related DataWarp problems. It can also be deleted when not needed to free up
disk space.
1) Proxy Environment Variables
------------------------------
Proxy environment variables, e.g., HTTP_PROXY, HTTPS_PROXY, can cause problems
with job submission because they interfere with some of Moab and/or DataWarp's
communications. If you use these settings, unset them before submitting a
DataWarp job. This should be corrected soon.
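For example, in the shell from which msub is run (the job script name is a
placeholder):
unset HTTP_PROXY HTTPS_PROXY http_proxy https_proxy
msub my_dw_job.sh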
2) SSH Communications Problems
------------------------------
Sometimes there is an ssh-based communication problem between the msub command
and the DataWarp service. Cray plans to move that interface to a RESTful
interface, but for the moment it operates over ssh. If keys have changed, the
underlying ssh can fail. The (rather generic) symptom is an error message from
msub:
DataWarp commands in job script failed validation
To diagnose the problem, look in ~/ac_datawarp.log; the last 4 lines will be
something like:
2017-02-07 17:15:50,617 pid:45756 ac_dw_submitfilter common.py:run_cmd:178 DEBUG Executing Popen with args: ["/opt/cray/elogin/eswrap/default/bin/dw_wlm_cli", "-f", "job_process", "--job", "/users/cornell/pgm/hio/ga-gnu/libhio-1.4.0/test/run/job.20170206.094549.724988764.sh"]
2017-02-07 17:16:25,902 pid:45756 ac_dw_submitfilter dw_cli_commands.py:dw_job_process:304 ERROR DW dw_job_process command failed with rc = 1
2017-02-07 17:16:25,902 pid:45756 ac_dw_submitfilter ac_dw_submitfilter:main:664 ERROR DataWarp commands in job script failed validation in DW API job_process method
2017-02-07 17:16:25,903 pid:45756 ac_dw_submitfilter ac_dw_submitfilter:main:666 INFO END, rc = 0
The Popen command on the first line may be the source of the problem.
Removing the Python Popen parameter escaping, it converts to:
/opt/cray/elogin/eswrap/default/bin/dw_wlm_cli -f job_process --job /users/cornell/pgm/hio/ga-gnu/libhio-1.4.0/test/run/job.20170206.094549.724988764.sh
Perform the equivalent conversion on the command from your ~/ac_datawarp.log
file and run the command by hand to see if there are any SSH related errors.
If it is a key problem, it can be corrected by removing the incorrect entry
from the ~/.ssh/known_hosts file.
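For example, if the offending host name is known, the stale key entry can be
removed with (host name is a placeholder):
ssh-keygen -R <hostname>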
3) $DW_JOB_STRIPED string in #DW stage_in and stage_out
-------------------------------------------------------
The string $DW_JOB_STRIPED in the #DW stage_in and stage_out directives is a
required literal string; it is not a shell variable. All of the #DW directives
are processed during the msub command, and shell variables will not be expanded. I
speculate that the $DW_JOB_STRIPED string is a placeholder for a future
capability, but for now, it must be coded exactly as shown in the DataWarp User
Guide.
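For example (the PFS path and the $MY_DIR variable are illustrative only):
# Wrong: $MY_DIR is a shell variable and will not be expanded by msub
#DW stage_in destination=$DW_JOB_STRIPED/run source=$MY_DIR type=directory
# Right: spell out the PFS path; $DW_JOB_STRIPED is a required literal string
#DW stage_in destination=$DW_JOB_STRIPED/run source=/lustre/ttscratch1/<user>/run type=directory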
4) #DW stage_in type=file destination
-------------------------------------
The "#DW stage_in type=file" directive has a required destination argument. As
in the type=dir format of the directive, the destination must start with the
literal string "$DW_JOB_STRIPED". It must be followed by a file name or a
subdirectory and file name. Examples:
#DW stage_in type=file source=<pfs path> destination=$DW_JOB_STRIPED/foo
will copy the specified source file into file name "foo" in the root of
the DataWarp allocation.
#DW stage_in type=file source=<pfs path> destination=$DW_JOB_STRIPED/sub/foo
will create subdirectory "sub" in the root of the DataWarp allocation and
will copy the specified source file into file name "foo" in that
subdirectory.
5) #DW stage_in type=list list file format
------------------------------------------
The list file specified on a "#DW stage_in type=list" directive contains
source and destination on each line. Similar to the format of the
destination argument of the type=file stage_in directive described above,
destination must start with the literal string "$DW_JOB_STRIPED". It must
be followed by a file name or a subdirectory and file name. Example list
file lines:
<pfs path> $DW_JOB_STRIPED/foo
will copy the specified source file into file name "foo" in the root of
the DataWarp allocation.
<pfs path> $DW_JOB_STRIPED/sub/foo
will create subdirectory "sub" in the root of the DataWarp allocation and
will copy the specified source file into file name "foo" in that
subdirectory.
6) Stage_in errors poorly messaged
----------------------------------
When there is an error processing a stage_in directive, no indication is given
to the user. The datawarp-stagein job will run, the other 3 sub-jobs will be
cancelled, and no job output file will be written. From the user's perspective,
the job will simply have disappeared without producing any output. A generic
indication of the failure is available from the checkjob -v
command. Example:
-->msub ga_home_1.sh
53044.datawarp-stagein 53045 53044.datawarp-stageout
-->checkjob -v 53045
job 53045
. . . .
Message[0] Scheduling DataWarp Storage
Message[1] DW Staging running
Message[2] DataWarp StageIn failed
7) JSON Error - eswrap module
-----------------------------
One user had added a "module load eswrap" to their .bashrc file. This caused
the following error message from msub:
ERROR: Error encountered: No JSON object could be decoded DataWarp request
recognized, but errors were encountered processing DW directives.
There were also other eswrap related error messages when the dwstat command was
used. Removing the module load corrected the problem.
This is because the msub DataWarp filter uses the same eswrap communications
mechanism and similar paths as the dwstat command. A good debugging step,
therefore, is to make sure the dwstat command is working correctly before
submitting a DataWarp job with msub.
8) Finding Python
-----------------
The msub DataWarp filter is written in Python 2 and runs in the user's context
while the msub command is running. Normally, msub is configured to invoke the
system Python interpreter via a fully qualified path name, typically
/usr/bin/python. There have been instances where it was instead configured to
run just the unqualified executable name python. In that case, changing the
default python to Python 3, for example via a "module load python" command,
will cause problems. Additionally, having a file or directory named "python" on
the current PATH (particularly if the PATH contains ".") will also cause
problems.
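A quick sketch of how to check for both conditions before running msub
(assumes a bash login shell):
type -a python                            # the first hit should be /usr/bin/python
echo $PATH | tr ':' '\n' | grep -x '\.'   # prints a line if "." is on PATH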
--- end of README.datawarp ---