Prerequisites: Sign up at the link https://lsst-processing-tutorial-aws.splashthat.com/   to obtain your AWS user login. 




Slides: AWS-LSST-tutorial.pdf

Tutorial presenters: Hsin-Fang Chiang (AURA/LSST), Sanjay Padhi (AWS), Dino Bektesevic (University of Washington/LSST), Todd Miller (HTCondor/University of Wisconsin-Madison),  Yusra AlSayyad (Princeton/LSST), Meredith Rawls (University of Washington/LSST) with the help of the full AWS-PoC team. 



AWS - Launch an instance

Goal: Launch an Amazon EC2 instance in the AWS console. 

  1. When you register, you will be given a piece of paper with your login information. The paper should contain:

    Credentials             EXAMPLE
    Login name              aws-lsstXX
    Password                Oh19-EXAMPLE
    AWS Access Key ID       AKIAIOSFODNN7-EXAMPLE
    AWS Secret Access Key   wJalrXUtnFEMI/K7MDENG/bxp-Fake-EXAMPLEKEY

    You need the Login name and Password in the next step; you will use the AWS keys later.

  2. Navigate, in your internet browser of choice, to https://sanpadhi.signin.aws.amazon.com/console and sign in. If you have a password manager, it may try to auto-fill your personal Amazon credentials, but we want to use login credentials specific to today's tutorial.
    1. Leave the "Account ID or Alias" unchanged (it should read "sanpadhi" who is kindly facilitating this tutorial).
    2. Type your unique Login name in the "IAM user name" field.
    3. Type the shared Password in the "Password" box.
  3. You will be redirected to the AWS Dashboard. We'll be using the EC2 (Elastic Compute Cloud) service to create a kind of virtual server, called an instance.
    1. Click on the "Services" tab at the top, next to the AWS logo.
    2. Type "EC2" in the search bar, and select EC2 by clicking on it.

  4. You will be redirected to the Elastic Compute Dashboard.
    1. On the left hand side there is a panel with options. Under the "Image" section, select "AMIs."


  5. You will be redirected to the Amazon Machine Image selection page. We already have a "private" AMI (Amazon Machine Image) ready to go for this demo. The default AMIs shown on this page are ones you created and own, so you won't see any at first.
    1. Click on the drop-down triangle and select "Private Images."
    2. Find the AMI "demo_master" (ami-07d6c4c78c1530ff8), which carries the friendly suggestion "Launch me!" as its name, and click on it to select it.
    3. Click the blue "Launch" button.

  6. You will be redirected to the Launch Instance Wizard. The Wizard allows you to configure a variety of things about your instance. You will see many different instance types, and a "micro" free-tier option may be pre-selected.
    1. Scroll down and select m5.xlarge for this demo. This instance type has 4 vCPUs and 16 GB of RAM. You could select a different one if you prefer, but the specs, and therefore the performance, will differ.
    2. At the top of the Launch Instance Wizard page, in blue numbered text, you can see the various configuration steps of the wizard. Feel free to browse through them; most of the defaults are OK.
    3. We need to adjust the security settings before launching. Click on "6. Configure Security Group," choose "Select an existing security group," and select the "lsst-demo" (sg-058badbfff072b4ae) security group.
    4. Click the big blue "Review and Launch" button!


  7. You will be redirected to the Wizard Review page. It will warn you "Your instance is not eligible for the free usage tier!" and "Improve your instances' security. Your security group, lsst-demo, is open to the world." This is fine.
    1. Do a quick check of your setup before launching your instance. Ensure you have selected the "lsst-demo" security group! If not, you can edit it from here. 
    2. Click the big blue "Launch"  button in the bottom right of the screen!

  8. A pop-up window will appear. This window allows you to configure the SSH keys you want to use to connect to the instance.
    1. Since you are using this account for the first time, select "Create a new key pair".
    2. Enter a Key pair name. We recommend you use your name; for example "Firstname-demo."
       
    3. Click "Download Key Pair" to download a <your name>.pem file.
    4. Store this .pem file somewhere safe and easy to remember. If you are running a Linux machine we recommend the `~/.ssh` directory. Most SSH clients will not allow you to use SSH key files if their file permissions are too permissive, so update the permissions to be user-read-only.

      mv ~/Downloads/Firstname-demo.pem ~/.ssh/.
      chmod 400 ~/.ssh/Firstname-demo.pem


    5. You are now ready to click the big blue "Launch Instances" button in the bottom right of the pop-up window.

  9. Your instance is now launching! Click on the ID of your instance and wait for the "Instance State" to turn green.


    1. (OPTIONAL) If you want, you can give your instance a name so it is easier to find after you navigate away. To add a name, hover your mouse over the name box, which is blank by default, and click on the pencil. Please be careful to do this only to your own instance, not somebody else's.
  10. Congratulations, your instance is now up and running! Click "Connect" for instructions on how to connect to it.



Set up your instance

Goal: connect to your instance using SSH credentials.

Good job launching your first AWS EC2 instance! Now it's time to connect to your instance so we can get the LSST Science Pipelines software set up.

  1. ssh to your instance from a Terminal window on your computer.

    1. You may click on your instance in the console and click "Connect" to get an example ssh command. 

    2. The example ssh command has `root` as the login user. You must replace `root` with `centos`, adjust the path to your key file, and copy the specific instance address from the example window.

      ssh -i "~/.ssh/Firstname-demo.pem" centos@EXAMPLE-12345-EXAMPLE.amazonaws.com


    3. It can take a while to log in. Don't worry, this is normal.

    4. When ssh asks "Are you sure you want to continue connecting?", type "yes".

  2. Once you are connected, run this script to set up your credentials:

    source setUpCredentials.sh

    You will be prompted to type in the AWS Access Key ID and the AWS Secret Access Key from your paper. If you are ever disconnected and have to re-ssh into the instance, you will need to run this setup again.
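
    If you are curious what such a script does: it most likely just collects the keys and exports the standard AWS credential environment variables that command-line tools such as the AWS CLI pick up. A minimal sketch is below; the actual setUpCredentials.sh on the instance may do more.

      # Sketch only -- the real setUpCredentials.sh may differ.
      # Run via `source` so the exported variables persist in your current shell.
      read -p  "AWS Access Key ID: " AWS_ACCESS_KEY_ID
      read -sp "AWS Secret Access Key: " AWS_SECRET_ACCESS_KEY && echo
      export AWS_ACCESS_KEY_ID AWS_SECRET_ACCESS_KEY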

You are now ready to process data.


LSST Science Pipelines data processing

In this part of the tutorial we demonstrate two things:

  1. Using LSST Science Pipelines software to process some images using Amazon's S3 (Simple Storage Service) and RDS (Relational Database Service) services.
  2. Scaling the data processing in the cloud with the help of HTCondor, a workload management system.

In the first part, we want to demonstrate the native support for AWS that exists in the LSST Science Pipelines software. This is demonstrated by invoking a command to process image data using the LSST Science Pipelines but placing all of the image data and configuration information in the cloud via AWS services.

In the second part, we want to demonstrate how to scale up a singular dataset processing step from the first part to large datasets.

Running LSST Science Pipelines software using data and configs in S3 and RDS

Goal: run an example instrument signature removal (ISR) task using the LSST Science Pipelines.

The LSST "Data Butler" keeps track of the datasets in its "registry," which is a database. The registry requires dataset uniqueness by using the concept of a collection: for running the same processing step of the same input data multiple times, a different output collection needs to be specified, or else you will get an "OutputExistsError." 

Because we all share one registry for this tutorial session, and we are all running the same pipeline processing with the same input, each tutorial user needs to target a different output collection to avoid conflicts. This tells the Data Butler, "I understand I am producing identical data that may already exist in some output collections, but I still want to produce it, so place it in a different collection."

To avoid conflicting with the output collections of other users, please use a unique output collection name. We recommend prefixing it with your name or your account ID. It is convenient to use an environment variable for this, for instance:

export OUT="aws-lsstXX-MyName"  # Change it!!

To run the LSST pipeline task, copy and paste (or type in) the following command to the Terminal: 

pipetask -d 'visit.visit=903334 and detector.detector=22' \
         -b 's3://lsst-demo-pegasus/input/butler.yaml' \
         -p lsst.ip.isr \
         -i calib,shared/ci_hsc \
         -o $OUT \
         run \
         -t isrTask.IsrTask:isr \
         -C isr:/home/centos/configs/isr.py

Let's break down what this command is doing. For visual clarity, we've broken this single-line command into multiple lines with `\` continuation characters.

  1. -d restricts processing to a single data ID: visit 903334 and detector 22.
  2. -b points the Data Butler at its configuration file, which in this demo lives in S3.
  3. -p names the package (lsst.ip.isr) that provides the task.
  4. -i lists the input collections: the calibrations and the shared ci_hsc data.
  5. -o names the output collection, the unique name you exported as $OUT above.
  6. run executes the pipeline; -t selects IsrTask under the label "isr", and -C applies the configuration file /home/centos/configs/isr.py to that label.

After a short wait (it could be a few minutes), you should see log messages from the ISR processing appear in your terminal.

If you are interested in seeing where your data is coming from and where it is going, you can download the Butler configuration used in this example and inspect it:

aws s3 cp "s3://lsst-demo-pegasus/input/butler.yaml" .
cat butler.yaml

You will find that the configuration consists of a datastore and a registry. The datastore is where the files actually live; in this demo it points to an S3 bucket called "lsst-demo-repo." The registry is a database that describes the datasets and their metadata; in this demo it points to a PostgreSQL RDS instance called "demo-registry" on port 5432, and within that instance to a database called "cihsc." If you were running the processing on your local machine, these would simply be paths and databases on your computer instead of "s3" and "postgresql" locations.

If you want to see the source code for IsrTask, it is in the LSST ip_isr package on GitHub.

The following is OPTIONAL (but fun!). You may skip this and continue to the workflow in the next section (A demo workflow of LSST processing jobs).

As your ISR job finishes, the output image, a "postISRCCD", is written under your output collection in the S3 bucket. You can navigate to the bucket to find the output FITS image.

To view the S3 bucket in the AWS console, click on the "Services" tab on the upper left and type "S3" in the search bar. Select "S3" to go to the Amazon S3 page. Then search for the "lsst-demo-repo" bucket. 

Click on the "lsst-demo-repo" bucket on its name, and you will be brought to the overview of the bucket. You will see many folder names and you can navigate around the folders like in a filesystem, but it is not! The Amazon S3 console supports a concept of folders, but there is no hierarchy. S3 has a flat structure: you create a bucket, and the bucket stores objects. 

You should find a folder named after your output collection ("aws-lsstXX-MyName" in the example above). This was created when you ran the ISR job. Click on your folder and continue clicking down; you will see "postISRCCD" > "903334" > "postISRCCD_903334_22_HSC_NUMBER.fits". (Note this path includes the visit and detector numbers we specified earlier! The last number is distinct.)

You can download the fits file to your instance. For example, 

aws s3 cp s3://lsst-demo-repo/aws-lsstXX-MyName/postISRCCD/903334/postISRCCD_903334_22_HSC_NUMBER.fits .

(Change the path as appropriate for your outputs.)

Programmatically, we use the Butler API to retrieve the file as a Python object. First, launch Python.

python

from lsst.daf.butler import Butler

# Point the Butler at the demo configuration and at your own output collection
butler = Butler("/home/centos/butler.yaml", collection="aws-lsstXX-MyName")  # Update for your collection name

# Retrieve the post-ISR exposure for the data ID processed earlier
exp = butler.get("postISRCCD", {"visit": 903334, "detector": 22, "instrument": "HSC"})

Getting a "LookupError"? Are you SURE you changed your collection name in the above Python snippet? 

Getting a "FileNotFoundError" on the butler.yaml file? Do "aws s3 cp s3://lsst-demo-pegasus/input/butler.yaml ~/butler.yaml" to download a yaml config file or change the path to the location you store the file. 


A demo workflow of LSST processing jobs

Goal: scale the example processing command to 100 jobs using HTCondor and Pegasus.

In this example, we use Pegasus* as the workflow manager to submit and monitor the jobs. Pegasus runs workflows on top of HTCondor: the job dependencies are represented as a directed acyclic graph (DAG), and HTCondor manages the processes.

A Pegasus workflow of 100 jobs has been pre-made for you. We have 33 input raw images, and for each input image 3 sequential tasks are run. We also need one initialization job at the beginning, giving 1 + 33 × 3 = 100 jobs in total: the initialization job runs first, and then the 33 three-task chains run independently of one another.


Follow the steps to submit your workflow.

  1. Download files and inputs required to run the workflow

    aws s3 cp --recursive s3://lsst-demo-pegasus lsst-demo-pegasus


  2. Navigate to the folder (feel free to explore the files in it):

    cd lsst-demo-pegasus


  3. In the file wf.dax you can find the job specifications and the graph of operations that will be executed. You will find various environment settings, followed by a section that contains commands. You might notice that the commands look similar to the pipetask command you ran above, wrapped in additional XML syntax.

  4. The run_peg.sh file is the workflow submit script for Pegasus. Feel free to look at it. The command of interest is the pegasus-plan command; the rest of the file ensures that the output collection name differs from any used previously.
    If it isn't already, make the file executable, and run it.

    chmod +x run_peg.sh
    ./run_peg.sh


  5. Among the other log output you will see a pegasus-status command, which should look something like:

    pegasus-status -l /home/centos/lsst-demo-pegasus/submit/centos/pegasus/ciHsc/20191031T202640+0000

    but the timestamp part of your string will be different. Running this command lets you inspect the state of the workflow (a tip for keeping it refreshed automatically follows this list). Also note the name of your output collection printed at the end of the logs, in case you want to navigate to S3 and see the files for yourself.

  6. The workflow will remain idle until Annex workers become available. In the next step we will add workers on which the jobs will be executed. 
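
If you would like the status to refresh automatically while you work through the next section, you can wrap that command in the standard Linux watch utility (a suggestion, not part of the pre-made workflow; paste in your own submit-directory path from the logs):

watch -n 30 pegasus-status -l /home/centos/lsst-demo-pegasus/submit/centos/pegasus/ciHsc/YOUR_TIMESTAMP   # refresh every 30 seconds; Ctrl-C to stop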

HTCondor: get Annex workers

Goal: add some Annex workers to the HTCondor worker pool. 

You might have noted earlier that the AMI you launched was named "demo_master". On this AMI, HTCondor is already running and is configured in "master" mode. In this mode the instance serves only as a scheduler and resource manager and does not do any processing itself.

The HTCondor master instructs "worker" nodes to do the actual processing. There are two types of workers that condor_annex can launch: "on-demand" and "spot fleet" workers. The difference between the two is the type of "rental agreement" made with AWS. We will launch "on-demand" instances.


On-demand instances are yours until you decide to terminate them.


Spot instances are rented from AWS like on-demand instances, but AWS can reclaim them at any point. A 2-minute warning is issued, after which the instance is terminated and assigned to someone else. Because of this, Spot instances can be much cheaper, but they are also more precarious to work on.

  1. To add on-demand instances, run 

    condor_annex -count 6 \
                 -aws-on-demand-instance-type m5.xlarge \
                 -aws-on-demand-ami-id ami-04dee9fa7194ef55e \
                 -aws-on-demand-security-group-ids sg-058badbfff072b4ae \
                 -idle 1 \
                 -annex-name $OUT
    

    The command breakdown follows:

    1. -count sets the number of instances you want,

    2. -aws-on-demand-instance-type sets the type, note that changing this might affect your performance,

    3. -aws-on-demand-ami-id targets the AMI you want to use; note that the AMI used for the workers is different from the one used for the master,

    4. -idle sets how long an instance may sit idle; once that time has elapsed without receiving any new work, the instance automatically terminates.

    5. -annex-name sets the name of your Annex cluster; it doesn't have to be unique, but a recognizable name makes interpreting the status commands much easier. Here we simply name it after your output collection ($OUT).

  2. You will need to type "yes" to confirm you want to request the Annex instances. 
  3. We kept the number of instances relatively low so that you have time to run status commands while the workflow executes. Try any of the following:
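
    For example (these are standard HTCondor status commands; the exact output will vary as workers join and jobs run):

      condor_status          # list the worker slots that have joined your pool
      condor_q               # show the jobs queued and running on this scheduler
      condor_annex -status   # report the state of your Annex instances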

Sit back and watch your jobs run. You can repeat the "pegasus-status" command to see how the workflow is going.

Eventually, all of your pipetask jobs will finish, depending on worker availability.

You are welcome to go to the folder /home/centos/lsst-demo-pegasus/output, where you will find the pipeline logs of all jobs.

Terminate your instance

Congratulations on finishing this tutorial and thank you for your patience following along! 

When you are done, please help us by terminating your instance. On-demand instances keep running until they are terminated, and so do the $$ charges.

Go to the EC2 Dashboard. (If you navigated away to S3, you can get back via the Services menu.) Select your instance, click "Actions", go to "Instance State", and select "Terminate".

Confirm "Yes, Terminate" if you are sure. 

Please only do so if you are sure this is your instance! If you are not sure, just leave it running. We can do the cleanup afterwards. 
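
If you prefer the command line, the same thing can be done from any machine that has the AWS CLI and your credentials configured (a hedged alternative: it assumes your tutorial account is allowed to terminate instances, and you may need to pass the tutorial's region with --region):

aws ec2 terminate-instances --instance-ids i-0123456789abcdef0   # example ID; substitute your own from the EC2 console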


Appendix: Find/Create Your Access Key ID and Secret Access Key

For this tutorial you will be given a paper with your access key and secret access key.  If you lose it, you may create a new key as below. 

  1. In the top right of the EC2 panel, click on the dropdown menu next to your username and navigate to "My Security Credentials".
  2. On the page you are redirected to, find the section titled "Access keys for CLI, SDK, & API access" and click the "Create access key" button.
  3. In the pop-up that appears, click "Show secret access key" and store the key somewhere safe; you will not be able to see it again. The Access Key ID remains visible in the "Access keys for CLI, SDK, & API access" section of your Security Credentials page. Password managers are highly recommended.
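
If you ever want to use a key pair with the AWS CLI directly (rather than through this tutorial's setUpCredentials.sh script), one standard way to store it is aws configure, which writes the keys to ~/.aws/credentials:

aws configure
# AWS Access Key ID [None]: AKIAIOSFODNN7-EXAMPLE
# AWS Secret Access Key [None]: wJalrXUtnFEMI/K7MDENG/bxp-Fake-EXAMPLEKEY
# Default region name [None]:     (press Enter, or type the region used in this tutorial)
# Default output format [None]:   (press Enter for the default)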


Appendix: Configure the SSH keys for Windows

  1. Download PuTTY and PuTTYgen from https://www.chiark.greenend.org.uk/~sgtatham/putty/latest.html
  2. Assume you saved your SSH key file to "This PC/Documents/lsst.pem".
  3. Run PuTTYgen and load the .pem key file: navigate to the Documents folder and make sure "All Files (*.*)" is selected so that lsst.pem is visible.
  4. Click "Save private key" and save it as c:/Downloads/private_key.
  5. Open PuTTY.
  6. Enter centos@your-ec2-public-ip in the "Host Name" field.
  7. Expand Connection > SSH > Auth and browse to c:/Downloads/private_key.
  8. Click "Open". You are now logged onto the EC2 instance via SSH.

* Disclaimer: The LSST project does not commit to any particular workflow manager as its production tool at this time. The usage here is meant only as an example and reflects Hsin-Fang Chiang's own choice for this demo session.