These instructions are valid for the Amazon Linux 2 General Environment or the CentOS 7 General Environment.
For other environments, ensure that wget, pip and the AWS CLI are available. Something like:
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install wget
sudo yum --enablerepo epel install python-pip
sudo pip install --upgrade pip
sudo pip install awscli
Ensure that your security group allows TCP communication on port 9618 and that your IAM user has the cloudformation:DescribeStacks permission on arn:aws:cloudformation resources for the region you are in.
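For example, the port rule can be added with the AWS CLI (a sketch; the security group ID below is a placeholder and the CIDR range should be narrowed to what your pool actually needs):

# hypothetical example: allow HTCondor traffic on TCP 9618;
# sg-0123456789abcdef0 is a placeholder security group ID
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9618 --cidr 0.0.0.0/0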
Note: Ensure that the environment variable USER is set correctly.
1.1. Procedure
Follow HTCondor's official instructions. Some commands are not quite correct (some paths are wrong, one code-editing instruction is out of date, etc.). A corrected, abridged version of the official instructions is therefore given below. Most of the steps, however, are identical. First, make sure the system is up to date:
sudo yum update
sudo yum upgrade
sudo reboot now
Then proceed directly with the repository setup and installation:
sudo wget -O /etc/yum.repos.d/RPM-GPG-KEY-HTCondor https://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
sudo rpm --import /etc/yum.repos.d/RPM-GPG-KEY-HTCondor
sudo wget -O /etc/yum.repos.d/htcondor-development-rhel7.repo https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-development-rhel7.repo
sudo yum install condor-all
sudo systemctl start condor
sudo systemctl enable condor
Ensure Condor is running; the following should list condor_master, condor_procd and condor_shared_port:
ps -aux | grep condor
If you followed the instructions for setting up the Amazon Linux 2 General Environment, skip the AWS CLI installation via pip in the manual; it will already be present. Proceed with:
sudo yum install condor-annex-ec2
HTCondor and condor_annex should now be installed.
1.1.1. Configuring Master Condor Node (head)
Edit the `/etc/condor/config.d/local` file and make sure it contains the following configuration:
The MayUseAWS and SUBMIT_ATTRS job flags could have been supplied in the job script itself, but this makes them defaults for all jobs. Without `UID_DOMAIN` the jobs would not run as $USER but as `nobody`. Whether a job is run as $USER or `nobody` depends on the UID_DOMAIN configuration parameter of the submit and execute machines: the value must be the same on the two machines and must be a substring of the submit machine's full hostname, otherwise the job runs as the nobody user. Setting `UID_DOMAIN` to `compute.internal` ensures this condition is always met, since that string is always the ending of the `ip-<SOME_IP>.<SOME_REGION>.compute.internal` private DNS instance identification strings used by HTCondor. Setting `TRUST_UID_DOMAIN` then allows the jobs to elevate their privileges to those of the login `$USER`.
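A minimal sketch of what the head-node `local` file could contain, based on the description above (verify the attribute names against your HTCondor version before using it):

# hypothetical sketch of /etc/condor/config.d/local for the head node
sudo tee /etc/condor/config.d/local >/dev/null <<'EOF'
# make MayUseAWS a default submit attribute for all jobs
MayUseAWS = TRUE
SUBMIT_ATTRS = $(SUBMIT_ATTRS) MayUseAWS
# run jobs as $USER instead of nobody on the annex instances
UID_DOMAIN = compute.internal
TRUST_UID_DOMAIN = TRUE
EOF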
The following step in the official manual is out of date at the time of writing this instruction set. The file `/usr/libexec/condor/condor-annex-ec2` needs to be edited: delete lines 59 to 341, inclusive. Lines 47-51 can also be deleted if the `FROM INSTANCE` feature is wanted; the feature allows condor_annex to launch new EC2 instances without having to copy the AWS secret and access keys manually. The final file should look similar to:
depending on whether the `iptables` lines were removed or not.
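One way to perform the edit from the shell is sketched below; it assumes the line numbers quoted above still match the file shipped with your HTCondor version, so inspect the file (and keep a backup) before deleting anything:

sudo cp /usr/libexec/condor/condor-annex-ec2 /usr/libexec/condor/condor-annex-ec2.bak
# delete lines 59-341 inclusive
sudo sed -i '59,341d' /usr/libexec/condor/condor-annex-ec2
# optional, only if the FROM INSTANCE feature is wanted; lines 47-51 keep their
# numbering because only lines after the deleted range shift
sudo sed -i '47,51d' /usr/libexec/condor/condor-annex-ec2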
HTCondor will use password authentication between its instances, but it does not matter what the password is. One still needs to be created by running:
sudo condor_store_cred -c add -f `condor_config_val SEC_PASSWORD_FILE`
and entering any password desired.
A `condor_pool_password` file should appear in `/etc/condor/`. Make sure the permissions and ownership of the file are correct:
sudo chmod 600 /etc/condor/condor_pool_password
sudo chown root /etc/condor/condor_pool_password
Create a way for condor_annex to share the password with new instances:
mkdir -p ~/.condor
sudo cp /etc/condor/condor_pool_password ~/.condor/
sudo chmod 600 ~/.condor/condor_pool_password
sudo chown $USER ~/.condor/condor_pool_password
Add additional required keys to that directory:
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/.condor/privateKeyFile echo "YOUR_AWS_ACCESS_KEY_ID" > ~/.condor/publicKeyFile sudo chmod 600 ~/.condor/*KeyFile
They should look something like `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` and `AKIAIOSFODNN7EXAMPLE` respectively; you can create them in the AWS security credentials console.
Create the `~/.condor/user_config` file. Edit the file manually if you ran the `condor_annex -setup` command beforehand, as the file will then already contain other sensitive configuration information. Add the following lines:
echo "SEC_PASSWORD_FILE=/home/$USER/.condor/condor_pool_password" > ~/.condor/user_config echo "ANNEX_DEFAULT_AWS_REGION=us-east-2" >> ~/.condor/user_config
Make sure that the file is owned by the current user. If the file was accidentally created with `sudo emacs/vi/vim/anytexteditor`, the owner will be root and the configuration will not be written to the file. If `condor_annex -setup` was run before the new default AWS region was set, setting the region afterwards will raise the error "Secret Key File must be accessible by owner". The option can only be set before running `condor_annex -setup`; the setup should still work, though, even if the wrong region is selected.
sudo chown $USER ~/.condor/user_config
Run the following command, which creates the final setup files:
condor_annex -setup
This will edit the `~/.condor/user_config` file along with some others.
1.1.2. Configuring Worker Condor Node
Edit or create the file `/etc/condor/config.d/local` with the following content:
Without `UID_DOMAIN` the jobs would not run as $USER but as `nobody`. Whether a job is run as $USER or `nobody` depends on the UID_DOMAIN configuration parameter of the submit and execute machines: the value must be the same on the two machines and must be a substring of the submit machine's full hostname, otherwise the job runs as the nobody user. Setting `UID_DOMAIN` to `compute.internal` ensures this condition is always met, since that string is always the ending of the `ip-<SOME_IP>.<SOME_REGION>.compute.internal` private DNS instance identification strings used by HTCondor. Setting `TRUST_UID_DOMAIN` then allows the jobs to elevate their privileges to those of the login `$USER`.
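A minimal sketch of the worker-node `local` file, matching the `UID_DOMAIN` discussion above (verify against your pool's configuration):

# hypothetical sketch of /etc/condor/config.d/local for a worker node
sudo tee /etc/condor/config.d/local >/dev/null <<'EOF'
# run jobs as $USER instead of nobody
UID_DOMAIN = compute.internal
TRUST_UID_DOMAIN = TRUE
EOF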
Edit the `/etc/condor/condor_config` file and add the following line at the end of the file:
SSH_TO_JOB_SSHD_CONFIG_TEMPLATE = /etc/condor/condor_ssh_to_job_sshd_config_template
Edit the `/etc/condor/condor_ssh_to_job_sshd_config_template` file. Change `UsePrivilegeSeparation` from `sandbox` to `yes` and (optionally) set `RhostsRSAAuthentication` to `yes`:
UsePrivilegeSeparation yes # sandbox separation not supported in RHEL6
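If preferred, the same edits can be made non-interactively; the one-liners below are a sketch that assumes both directives appear as single uncommented lines in the template, so inspect the file first:

sudo sed -i 's/^UsePrivilegeSeparation.*/UsePrivilegeSeparation yes # sandbox separation not supported in RHEL6/' /etc/condor/condor_ssh_to_job_sshd_config_template
sudo sed -i 's/^RhostsRSAAuthentication.*/RhostsRSAAuthentication yes/' /etc/condor/condor_ssh_to_job_sshd_config_template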
1.1.3. Shared configuration
The following steps are required on both the master and the worker nodes. They follow after either configuration step 1.1.1 or 1.1.2 has been completed.
Create the file `/usr/libexec/condor/s3.sh` with the following content:
#!/bin/sh

if [ "$1" = "-classad" ]
then
    echo 'PluginVersion = "0.1"'
    echo 'PluginType = "FileTransfer"'
    echo 'SupportedMethods = "s3"'
    exit 0
fi

source=$1
dest=$2

exec aws s3 cp ${source} ${dest}
And give it the following permissions:
sudo chmod 755 /usr/libexec/condor/s3.sh
Create a file `/etc/condor/config.d/10-s3` with contents:
ENABLE_URL_TRANSFERS = TRUE
FILETRANSFER_PLUGINS = $(LIBEXEC)/s3.sh
This allows us to omit the `should_transfer_files` flag from job submit scripts.
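As an illustration of what the plugin enables (a sketch; the bucket and object names are placeholders and the machine needs credentials that can read them), a job can then list an `s3://` URL directly among its inputs:

# hypothetical snippet: append an S3 input to a submit file;
# "my-bucket" and "input.dat" are placeholders
cat >> job.submit <<'EOF'
transfer_input_files = s3://my-bucket/input.dat
EOF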
The manual states that `/etc/condor/50ec2.config` should be deleted, but in test attempts that file did not exist. File `/etc/condor/config.d/50ec2.config` does exist, so proceed with:
sudo rm /etc/condor/config.d/50ec2.config
The manual states that a line starting with `After=...` in the `[Unit]` section of `/usr/lib/systemd/system/condor.service` needs to be copied to the `[Unit]` section of `/usr/lib/systemd/system/condor-annex-ec2.service`. I found these lines to already exist in both files and to be identical:
After=network-online.target nslcd.service ypbind.service time-sync.target nfs.client.target autofs.service
so instead just check that this line exists and is correct in both files.
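A quick way to compare the two lines:

grep '^After=' /usr/lib/systemd/system/condor.service /usr/lib/systemd/system/condor-annex-ec2.service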
Enable the service so that it starts at boot:
sudo systemctl enable condor-annex-ec2
1.1.4. Verify setup was correct
Unfortunately, this will only work for the master node setup:
sudo systemctl restart condor
condor_restart
sudo reboot now
# after successful reboot only:
condor_annex -check-setup
It should return all `OK`s.
1.2. Pegasus
The last step is to add Pegasus. If the Amazon Linux 2 General Environment or CentOS 7 General Environment is used, it is enough to run:
sudo wget -O /etc/yum.repos.d/pegasus.repo https://download.pegasus.isi.edu/pegasus/rhel/7/pegasus.repo
sudo yum install pegasus
Pegasus will now be installed and linked to the system-wide Python installation.
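A quick sanity check (assuming the package placed the Pegasus command-line tools on the PATH):

pegasus-version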
Using Pegasus from the LSST environment.
The link to the system-wide Python is an issue if the end goal is to run Pegasus in an activated LSST Stack environment: activating the LSST environment overrides the Python search locations, after which importing Pegasus will not work. To fix this it is necessary to re-link the correct Pegasus library paths back into the environment without accidentally short-circuiting the LSST environment's `PYTHONPATH` values. Do the following if Pegasus is intended to be used from the LSST environment:
sudo mkdir -p /usr/share/pegasus/lib64/python2.7/site-packages
sudo ln -s /usr/lib64/python2.7/site-packages/Pegasus /usr/share/pegasus/lib64/python2.7/site-packages/
and add the paths to environment:
export PYTHONPATH=/usr/share/pegasus/lib64/python2.7/site-packages/:$PYTHONPATH
In the provided AMIs these paths are already exported in `~/.bashrc`.
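To verify the re-link from inside an activated LSST environment, a quick import test can be run; this is a sketch that assumes the Python 2.7 `Pegasus.DAX3` DAX API module shipped with this Pegasus version:

# should print a path under the re-linked site-packages if the symlink and
# PYTHONPATH are set up correctly
python -c "import Pegasus.DAX3 as DAX3; print(DAX3.__file__)"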
1.3. Using the prepared AMI
AMIs with names `*_condor_base` contain an environment built according to the instructions given above. They are a set of minimal environments in which Condor and Pegasus will work and that are capable of compiling the LSST Stack (i.e. the steps described in the note above have been executed). Before using the provided images, however, the following configuration step is required:
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/.condor/privateKeyFile echo "YOUR_AWS_ACCESS_KEY_ID" > ~/.condor/publicKeyFile sudo chmod 600 ~/.condor/*KeyFile
A basic test if the AMI works can be done by creating a `job.submit` file with contents:
universe = vanilla
executable = /usr/bin/ls
getenv = True
output = out
log = log
error = err
Requirements = TARGET.IsAnnex == true
queue
and then launching an annex worker and submitting a job:
condor_annex -count 1 -annex-name SimpleTest
# wait
condor_submit job.submit
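While waiting for the annex to come up, its state and the job queue can be inspected (assuming the annex-aware `condor_status` option available in recent HTCondor versions):

# list slots that have joined the named annex, then check the job queue
condor_status -annex SimpleTest
condor_q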
Additional options can be supplied to annex creation, such as `-aws-on-demand-instance-type m4.large` and/or `-aws-on-demand-ami-id ami-AMI_ID`. More flags exist, some of which can be found here.