diff --git a/doc/source/chapter02_beginner-tutorial/awscli-config.rst b/doc/source/chapter02_beginner-tutorial/awscli-config.rst
index 00e8e27..31059c3 100644
--- a/doc/source/chapter02_beginner-tutorial/awscli-config.rst
+++ b/doc/source/chapter02_beginner-tutorial/awscli-config.rst
@@ -40,6 +40,8 @@ Running ``aws configure``, you will be asked 4 questions::
 
 "Access Key ID" and "Secret Access Key" are just like your AWS account name and password. For security reasons they are not the one you use to log into the console. You can obtain them in the "My Security Credentials" console.
 
+.. _credentials-label:
+
 Obtaining security credentials
 ------------------------------
 
diff --git a/doc/source/chapter02_beginner-tutorial/img/EC2_launch_seven_steps.png b/doc/source/chapter02_beginner-tutorial/img/EC2_launch_seven_steps.png
new file mode 100644
index 0000000..57e72bf
Binary files /dev/null and b/doc/source/chapter02_beginner-tutorial/img/EC2_launch_seven_steps.png differ
diff --git a/doc/source/chapter02_beginner-tutorial/quick-start.rst b/doc/source/chapter02_beginner-tutorial/quick-start.rst
index 95fd60c..c1648b4 100644
--- a/doc/source/chapter02_beginner-tutorial/quick-start.rst
+++ b/doc/source/chapter02_beginner-tutorial/quick-start.rst
@@ -33,6 +33,8 @@ In the EC2 console, make sure you are in the **US East (N. Virginia)** region as
 .. figure:: img/region_list.png
    :width: 300 px
 
+.. _choose_ami-label:
+
 In the EC2 console, click on "AMI" (Amazon Machine Image) under "IMAGES" on the left navigation bar. Then select "Public images" and search for **ami-08c83a8b3ebd20b63** or **GEOSChem_tutorial_20180926** – that's the system with GEOS-Chem installed. Select it and click on "Launch".
 
 .. figure:: img/search_ami.png
@@ -66,6 +68,8 @@ You can monitor your server in the EC2-Instance console. Within < 1min of initia
 
 You now have your own server running on the cloud!
 
+.. _login_ec2-label:
+
 Step 3: Log into the server and run GEOS-Chem
 ---------------------------------------------
 
@@ -192,6 +196,8 @@ We encourage users to try the new NetCDF diagnostics, but you can still use the
 
 Also, you could indeed download the output data and use old tools like IDL & MATLAB to analyze them, but we highly recommend the open-source Python/Jupyter/xarray ecosystem. It will vastly improve user experience and working efficiency, and also help open science and reproducible research.
 
+.. _terminate-label:
+
 Step 5: Shut down the server (Very important!!)
 -----------------------------------------------
 
diff --git a/doc/source/chapter02_beginner-tutorial/research-workflow.rst b/doc/source/chapter02_beginner-tutorial/research-workflow.rst
index 34b41b4..782cb5c 100644
--- a/doc/source/chapter02_beginner-tutorial/research-workflow.rst
+++ b/doc/source/chapter02_beginner-tutorial/research-workflow.rst
@@ -1,25 +1,298 @@
-Final word on research workflow on cloud
-========================================
+Put everything together: a complete workflow
+============================================
 
-Congrats! You've finished all beginner tutorials. Now you've learned enough AWS stuff to perform most of simulation, data analysis, and data management tasks. These tutorials could feel pretty intense if you are new to cloud computing (although I really tried to make them as user-friendly as possible). Don't worry, repeat these practices several times and you will get familiar with the research workflow on the cloud very quickly.
-There are also advanced tutorials, but they are just add-ons and are really not necessary for just getting science done.
+Congrats! If you've been reading the tutorials in order, now you should know enough AWS stuff to perform most simulation, data analysis, and data management tasks. These tutorials could feel pretty intense if you are new to cloud computing (although I really tried to make them as user-friendly as possible). Don't worry, repeat these practices several times and you will get familiar with the research workflow on the cloud very quickly. There are also advanced tutorials, but they are just add-ons and are really not necessary for just getting science done.
+
+General comments on cloud vs local computer
+-------------------------------------------
 
 The major difference (in terms of research workflow) between local HPC clusters and cloud platforms is **data management**, and that's what new users might feel uncomfortable with. To get used to the cloud, the key is to use and love S3! On traditional local disks, any files you create will stay there forever (so I often end up leaving tons of random legacy files in my home directory). On the other hand, the pricing model of cloud storage (charge you by the exact amount of data) will force you to really think about what files should kept by transferring to S3, and what should be simply discarded (e.g. TBs of legacy data that are not used anymore). There are also ways to make cloud platforms :ref:`behave like traditional HPC clusters `, but they can often bring more restrictions than benefits. To fully utilize the power and flexibility of cloud platforms, directly use native, basic services like EC2 and S3.
 
-Here's my typical research workflow for reference:
+A reference workflow
+--------------------
+
+Here's the outline of a typical research workflow:
 
-1. Launch EC2 instances from pre-configured AMI. Consider spot instances for big computing. (:doc:`Use AWSCLI to launch them with one command <../chapter03_advanced-tutorial/advanced-awscli>`.)
+1. Launch EC2 instances from pre-configured AMI. Consider spot instances for big computing. :doc:`Consider using AWSCLI to simplify this step to one shell command <../chapter03_advanced-tutorial/advanced-awscli>`.
 2. Prepare input data by pulling them from S3 to EC2. Put commonly used ``aws s3 cp`` commands into bash scripts.
 3. Tweak model configurations as needed.
-4. Run simulations :ref:`with tmux `. Log out and go to sleep if the model runs for a long time.
+4. Run simulations :ref:`with tmux `. Log out and go to sleep if the model runs for a long time. Log back in at any time to check progress.
 5. Use Python/Jupyter to analyze output data.
-6. After simulation and data analysis tasks are done, upload output data and customized model configuration (mostly run directories) to S3. Or download them to local machines if necessary (Recall that :ref:`egress charge ` is $90/TB; for several GBs the cost is negligible).
+6. When the EC2 instance is not needed anymore, transfer output data and customized model configuration (mostly run directories) to S3. Or download them to local machines if necessary (recall that the :ref:`egress charge ` is $90/TB; for several GBs the cost is negligible).
 7. Once important data safely live on S3 or on your local machine, shut down EC2 instances to stop paying for CPU charges.
-8. Go to write papers, attend meetings, do anything other than computing. During this time, no machines are running on the cloud, and the only cost is data storage on S3 ($23/TB/month).
-Consider `S3 - Infrequent Access `_ which costs half if the data will not be used for several months.
-9. Whenever need to continue computing, launch EC2 instances, pull stuff from S3, start coding again.
+8. Go to write papers, attend meetings, do anything other than computing. During this time, no machines are running on the cloud, and the only cost is data storage on S3 ($23/TB/month).
+9. Whenever you need to continue computing, launch new EC2 instances and pull the previous data from S3.
+
+Talk is cheap. Let's actually walk through these steps.
+
+Below are reproducible steps (copy & paste-able commands) to set up a custom model run. We use a half-day, 2x2.5 simulation as an example, but the same idea applies to other types of runs. **Most laborious steps only need to be done once**. The subsequent workflow will be much simpler.
+
+I assume you've read all previous sections. Don't worry if you can't remember everything -- there will be links to previous sections whenever necessary.
+
+Launch EC2 instance with custom configuration
+---------------------------------------------
+
+A complete EC2 configuration has 7 steps, with tons of options throughout the steps:
+
+.. figure:: img/EC2_launch_seven_steps.png
+
+You typically only need to touch a few options, listed in order below.
+
+- Choose our tutorial AMI :ref:`just as in the quick start guide <choose_ami-label>`. This completes "Step 1: Choose an Amazon Machine Image (AMI)" and brings you to "Step 2: Choose an Instance Type".
+
+- At Step 2, choose the "Compute optimized" family and select ``c5.4xlarge``, which is suitable for medium-sized simulations. For longer-term, higher-resolution runs, consider even bigger ones like ``c5.9xlarge`` and ``c5.18xlarge``.
+
+- At "Step 3: Configure Instance Details", select "Request Spot instances". :doc:`See here to review spot instance configuration `. (At this step you also have the chance to select an "IAM role" to simplify AWSCLI configuration for S3, as :doc:`explained in the advanced tutorial <../chapter03_advanced-tutorial/iam-role>`.)
+
+- At "Step 4: Add Storage", increase the size to 400~500 GB to host more input/output data. :doc:`See here to review EBS volume configuration `.
+
+- Nothing to do for "Step 5: Add Tags". Just go to the next step. You can always add `resource tags `_ (just convenient labels) anytime later.
+
+- At "Step 6: Configure Security Group", select a proper security group. :doc:`See here to review security group configuration `. If you don't want to bother with security group configuration, simply choose "Create a new security group" (it works, but is not optimal).
+
+- Nothing to do for "Step 7: Review Instance Launch". Just click on "Launch".
+
+Occasionally you might `hit the EC2 instance limit `_, especially when you try to launch a very large instance on a new account. Just `request a limit increase `_ if that happens.
+
+The advanced tutorial will show you how to :doc:`use AWSCLI to simplify the above process to one shell command <../chapter03_advanced-tutorial/advanced-awscli>`.
+
+Set up your own model configuration
+-----------------------------------
+
+Log into the instance :ref:`as in the quick start guide <login_ec2-label>`. Here you will set up your own model configuration, instead of using the pre-configured tutorial run directory. You can also change the model version -- any version newer than `v11-02a `_ should work smoothly. The system will still work with future releases of GEOS-Chem, unless there are big structural changes that break the compile process.
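+
+Before setting anything up, it is worth a quick check that the instance looks the way you configured it at launch time. The snippet below is just a sanity check (assuming the AMI provides ``gfortran`` and netCDF as in the tutorial setup; the exact versions on your AMI may differ)::
+
+    $ df -h /               # the root volume should show the 400~500 GB requested at "Step 4: Add Storage"
+    $ nproc                 # number of CPU cores, e.g. 16 on c5.4xlarge
+    $ gfortran --version    # Fortran compiler used to compile GEOS-Chem
+    $ nc-config --version   # netCDF library that GEOS-Chem links against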
+
+Existing GEOS-Chem users should feel quite familiar with the steps presented here. New users might need to refer to our `user guide `_ for a more complete explanation.
+
+Get source code and check out model version
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+You can obtain the latest copy of the code from `GEOS-Chem's GitHub repo `_::
+
+    $ mkdir ~/GC   # make your own folder instead of using the "tutorial" folder
+    $ cd ~/GC
+    $ git clone https://github.com/geoschem/geos-chem Code.12.0.1   # might use different names for future versions
+    $ git clone https://github.com/geoschem/geos-chem-unittest.git UT
+
+You may list all versions (they are just `git tags `_) in chronological order::
+
+    $ cd Code.12.0.1
+    $ git log --tags --simplify-by-decoration --pretty="format:%ci %d"
+    2018-08-22 16:57:08 -0400  (HEAD -> master, tag: 12.0.1, origin/master, origin/HEAD)
+    2018-08-09 16:59:22 -0400  (tag: 12.0.0)
+    2018-07-30 10:31:40 -0400  (tag: 12.0.0-1yr-bm, tag: 12.0.0-1mo-bm)
+    2018-06-21 10:04:24 -0400  (tag: v11-02-release-candidate, tag: v11-02-rc)
+    2018-05-11 16:31:42 -0400  (tag: v11-02f-1yr-Run1)
+    ...
+
+**New users should just use the default, latest version to minimize confusion**. Experienced users might want to check out a different version, say ``12.0.0``::
+
+    $ git checkout 12.0.0   # just the name of the tag
+    $ git branch
+    * (HEAD detached at 12.0.0)
+    $ # git checkout master   # restore the latest version if you want
+
+You need to do the version checkout for both the source code and the unit tester.
+
+Configure unit tester and generate run directory
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+Then you need to generate a run directory from the unit tester:
+
+In ``UT/perl/CopyRunDirs.input``, change the default paths::
+
+    GCGRID_ROOT    : /n/holylfs/EXTERNAL_REPOS/GEOS-CHEM/gcgrid
+    DATA_ROOT      : {GCGRIDROOT}/data/ExtData
+    ...
+    UNIT_TEST_ROOT : {HOME}/UT
+    ...
+    COPY_PATH      : {HOME}/GC/rundirs
+
+to::
+
+    GCGRID_ROOT    : /home/ubuntu
+    DATA_ROOT      : {GCGRIDROOT}/ExtData
+    ...
+    UNIT_TEST_ROOT : {HOME}/GC/UT
+    ...
+    COPY_PATH      : {HOME}/GC
+
+Then uncomment the run directory you want::
+
+    geosfp   2x25   -   standard   2016070100   2016080100   -
+
+In ``UT/perl/Makefile``, make sure the source code path is correct::
+
+    CODE_DIR :=$(HOME)/GC/Code.$(VERSION)
+
+Finally, generate the run directory::
+
+    $ ./gcCopyRunDirs
+
+Go to the run directory and compile::
+
+    $ make realclean
+    $ make -j4 mpbuild NC_DIAG=y BPCH_DIAG=n TIMERS=1
+
+Note that you should execute the ``make`` command **in the run directory**. This ensures the correct combination of compile flags for this specific run configuration. GEOS-Chem's compile flags have become so complicated that you will almost never get the right compile settings by compiling in the source code directory. See `our wiki `_ for more information.
+
+Get more input data from S3
+---------------------------
+
+If you just run the executable ``./geos.mp``, it will complain about missing input data. Remember that the default ``~/ExtData`` folder only contains sample data for a demo 4x5 simulation; other data need to be retrieved from S3 using AWSCLI commands (:doc:`see here to review S3 usage `). In order to use AWSCLI on EC2, you need to either :ref:`configure credentials (beginner approach) <credentials-label>` or :doc:`configure an IAM role (advanced approach) <../chapter03_advanced-tutorial/iam-role>`.
+
+Try ``aws s3 ls`` to make sure AWSCLI is working.
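+
+If you are not sure whether the credentials were picked up, a few standard AWSCLI commands can help you check the setup (``s3://gcgrid`` is the public GEOS-Chem input-data bucket used throughout this tutorial)::
+
+    $ aws configure list             # show which access key and region are in effect, and where they came from
+    $ aws sts get-caller-identity    # confirm that the credentials actually authenticate against AWS
+    $ aws s3 ls s3://gcgrid/ --request-payer=requester   # list the top level of the public input-data bucket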
+
+Then retrieve data by::
+
+    # GEOSFP 2x2.5 CN metfield
+    aws s3 cp --request-payer=requester --recursive \
+      s3://gcgrid/GEOS_2x2.5/GEOS_FP/2011/01/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2011/01/
+
+    # GEOSFP 2x2.5 1-month metfield
+    aws s3 cp --request-payer=requester --recursive \
+      s3://gcgrid/GEOS_2x2.5/GEOS_FP/2016/07/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2016/07/
+
+    # 2x2.5 restart file
+    aws s3 cp --request-payer=requester \
+      s3://gcgrid/SPC_RESTARTS/initial_GEOSChem_rst.2x25_standard.nc ~/ExtData/SPC_RESTARTS
+
+    # fix the softlink in the run directory
+    ln -s ~/ExtData/SPC_RESTARTS/initial_GEOSChem_rst.2x25_standard.nc ~/GC/geosfp_2x25_standard/GEOSChem_restart.201607010000.nc
+
+Tweak run-time configurations
+-----------------------------
+
+Here are some very common customizations in the run directory. You might further tweak any settings as needed.
+
+In ``input.geos``, change the simulation length to 12 hours instead of 1 month.
+
+::
+
+    End   YYYYMMDD, hhmmss  : 20160701 120000
+
+.. note::
+   If you do need to run the simulation over months, remember to pull more metfields in the previous step. For example, metfields for the entire year can be retrieved by ``aws s3 cp --request-payer=requester --recursive s3://gcgrid/GEOS_2x2.5/GEOS_FP/2016/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2016/``.
+
+In ``HEMCO_Config.rc``, tweak emission configurations as needed. Here I disable CEDS due to `its data size issue `_, which will be fixed in 12.1.0.
+
+::
+
+    --> CEDS                   :       false
+
+In ``HISTORY.rc``, change the output path:
+
+::
+
+    EXPID:  ./OutputDir/GEOSChem
+
+Remember to ``mkdir OutputDir`` so the path you specified actually exists.
+
+Say I am only interested in the species concentration diagnostics. Comment out the others::
+
+    COLLECTIONS: 'SpeciesConc',
+                 # 'AerosolMass',
+                 # 'Aerosols',
+                 # 'CloudConvFlux',
+                 # 'ConcAfterChem',
+                 # 'DryDep',
+                 # 'JValues',
+                 # 'JValuesLocalNoon',
+                 # 'LevelEdgeDiags',
+                 # 'ProdLoss',
+                 # 'StateChm',
+                 # 'StateMet',
+                 # 'WetLossConv',
+                 # 'WetLossLS',
+
+Output hourly instantaneous fields, instead of the original monthly mean:
+
+::
+
+    SpeciesConc.frequency:   00000000 010000
+    SpeciesConc.duration:    00000000 010000
+    SpeciesConc.mode:        'instantaneous'
+
+.. note::
+   To change this setting for all collections at once, in ``vim`` you can perform a substitution with ``:%s/00000100 000000/00000000 010000/g``.
+
+Now you should be able to execute the model without problems.
+
+Perform long-term simulation
+----------------------------
+
+:ref:`With tmux `, you can keep the program running after logging out.
+
+::
+
+    $ tmux
+    $ ./geos.mp | tee run.log
+    Type `Ctrl + b`, and then type `d`, to detach from the tmux session
+
+    $ tail -f run.log   # display the output message dynamically
+    Type `Ctrl + c` to quit the message display. Won't affect the model simulation.
+
+Log out of the server (``Ctrl + d`` or just close the terminal). The model will be safely running in the background. You can log back in anytime and check the progress by looking at ``run.log``. If you need to cancel the simulation, type ``tmux a`` to resume the interactive session and then ``Ctrl + c`` to kill the program.
+
+This half-day simulation will take about half an hour. In the meantime, do whatever you like, such as having a cup of coffee... Just come back and log in again after half an hour. The same strategy applies to simulations that run over many days. You don't have to keep the terminal open.
+
+.. note::
+   What if the model finishes at midnight? Is there any way to automatically terminate the instance to stop paying for it?
+   I tried multiple auto-checking methods but they often bring more trouble than benefit. For example, :ref:`the HPC cluster solution ` will handle server termination for you, but that often makes the workflow more complicated, especially if you are not a heavy user. Manually checking the simulation the next day is usually the easiest way. The cost of EC2 piles up for simulations that last for many days, but for just one night it is negligible.
+
+Analyze output data
+-------------------
+
+Output data will be inside ``OutputDir/`` as specified in ``HISTORY.rc``. You can :ref:`use Jupyter notebooks ` to analyze them, or simply use ``ipython`` for a quick check. One tip is that multi-file time series can be opened as a single object with ``xarray.open_mfdataset()``::
+
+    $ source activate geo
+    $ ipython
+    Python 3.6.6 |Anaconda, Inc.| (default, Jun 28 2018, 17:14:51)
+    Type 'copyright', 'credits' or 'license' for more information
+    IPython 6.5.0 -- An enhanced Interactive Python. Type '?' for help.
+
+    In [1]: import xarray as xr
+
+    In [2]: ds = xr.open_mfdataset("GEOSChem.SpeciesConc.20160701_*00z.nc4")   # open many files at once
+
+    In [3]: ds   # 12 time frames in the same object
+    Out[3]:
+    <xarray.Dataset>
+    Dimensions:  (ilev: 73, lat: 91, lev: 72, lon: 144, time: 12)
+    ...
+
+Save your files to S3
+---------------------
+
+Before terminating the EC2 instance, always make sure that important files are transferred to persistent storage (S3 or your local machine). Here we push our custom files to S3 (:ref:`see here to review S3+AWSCLI usage <s3-awscli_label>`).
+
+::
+
+    aws s3 mb s3://my-custom-gc-files                    # use a different name for your bucket, in all lower case
+    aws s3 cp --recursive ~/GC/ s3://my-custom-gc-files  # transfer data
+    aws s3 ls s3://my-custom-gc-files/                   # show the bucket content
+
+Only the ``~/GC/`` folder contains custom configurations. Input data can be easily retrieved from the ``s3://gcgrid`` bucket. However, if you made your own changes to the input data, remember to also transfer them to S3.
+
+Terminate server, start over whenever needed
+--------------------------------------------
+
+Now you can safely :ref:`terminate the server <terminate-label>`. The next time you want to continue working on this project, **you only need to do two simple things**:
+
+1. Launch an EC2 instance. It takes one second if you :doc:`use AWSCLI <../chapter03_advanced-tutorial/advanced-awscli>`.
+
+2. Retrieve data files. In this example, the commands are:
-I often use big instances (e.g c5.8xlarge) with spot pricing for computationally-expensive model simulations. But I also keep a less-powerful, on-demand instance (e.g. c5.large) for data analysis workloads or lightweight simulations. Whenever I need to make a quick plot of model output data, I just start this on-demand instance, write Python code in Jupyter, and stop this instance when I am done. Since this on-demand instance just switches between "running" and "stopped" states but never "terminates", all files are preserved on disk, so I don't have to backup temporary files to S3 all the time.
+::
-When I need large computing power to `process data in parallel `_, this on-demand instance can be easily `resized to a bigger type `_ like c5.8xlarge. Since data analysis tasks typically just take 1~2 hours (unlike model simulations that often take days), it doesn't worth the effort to set up spot instances to save one dollar.
+
+    # Assume that AWSCLI is already configured by either credentials or IAM roles
+
+    # customized code, config files, and output data
+    aws s3 cp --recursive s3://my-custom-gc-files ~/GC/
+    chmod u+x ~/GC/geos.mp   # restore execution permission
+
+    # standard input data from the public bucket
+    aws s3 cp --request-payer=requester --recursive \
+      s3://gcgrid/GEOS_2x2.5/GEOS_FP/2011/01/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2011/01/
+    aws s3 cp --request-payer=requester --recursive \
+      s3://gcgrid/GEOS_2x2.5/GEOS_FP/2016/07/ ~/ExtData/GEOS_2x2.5/GEOS_FP/2016/07/
+
+The files on this new EC2 instance will look exactly the same as on the original instance that you terminated last time. In this way, you can get a sustainable workflow on the cloud.
diff --git a/doc/source/chapter02_beginner-tutorial/use-s3.rst b/doc/source/chapter02_beginner-tutorial/use-s3.rst
index 5bf1960..b40f5e6 100644
--- a/doc/source/chapter02_beginner-tutorial/use-s3.rst
+++ b/doc/source/chapter02_beginner-tutorial/use-s3.rst
@@ -19,6 +19,8 @@ The S3 console is convenient for viewing files, but most of time you will use AW
 - To transfer data between S3 and EC2, you have to use AWSCLI since there is no graphical console on EC2 instances.
 - To work with public data set, AWSCLI is almost the only way you can use. Recall that in the previous chapter you use ``aws s3 ls s3://nasanex/`` to list the NASA-NEX data. But you cannot see the "s3://nasanex/" bucket in S3 console, since it doesn't belong to you.
 
+.. _s3-awscli_label:
+
 Working with S3 using AWSCLI
 ----------------------------
 
@@ -201,7 +203,7 @@ Then you may want to change the simulation date in ``input.geo`` to test the new
 
    Run directory       : ./
    Input restart file  : GEOSChem_restart.201607010000.nc
 
-(Note that the restart file is still at 2013/07 in this case.)
+(Note that the restart file is still at 2016/07 in this case.)
 
 The EC2 instance launched from the tutorial AMI only has limited disk by default, so the disk will be full very soon. You will learn how to increase the disk size, right in the next tutorial.