These instructions are valid for the Amazon Linux 2 General Environment or the CentOS 7 General Environment.
For other environments, ensure that wget, pip and the AWS CLI are available. Something like:
sudo yum install -y https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
sudo yum install wget
sudo yum --enablerepo epel install python-pip
sudo pip install --upgrade pip
sudo pip install awscli
Ensure that your security group allows TCP communication on port 9618 and that your IAM user has the cloudformation:DescribeStacks permission on arn:aws:cloudformation resources for the region you are in.
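For example, the port rule can be added with the AWS CLI (a sketch; the security group ID below is a placeholder and the CIDR range should be narrowed to what your pool actually needs):

# hypothetical example: allow HTCondor traffic on TCP 9618;
# sg-0123456789abcdef0 is a placeholder security group ID
aws ec2 authorize-security-group-ingress \
    --group-id sg-0123456789abcdef0 \
    --protocol tcp --port 9618 --cidr 0.0.0.0/0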
Note: Ensure that the environment variable USER is set correctly.
1.1. Procedure
Follow HTCondor's official instructions. Some commands are not quite correct (some paths are wrong, one code-editing instruction is out of date, etc.). A corrected, abridged version of the official instructions is therefore given below. Most of the steps, however, are identical. First, make sure the system is up to date:
sudo yum update
sudo yum upgrade
sudo reboot now
Then proceed directly with the repository setup and installation:
sudo wget -O /etc/yum.repos.d/RPM-GPG-KEY-HTCondor https://research.cs.wisc.edu/htcondor/yum/RPM-GPG-KEY-HTCondor
sudo rpm --import /etc/yum.repos.d/RPM-GPG-KEY-HTCondor
sudo wget -O /etc/yum.repos.d/htcondor-development-rhel7.repo https://research.cs.wisc.edu/htcondor/yum/repo.d/htcondor-development-rhel7.repo
sudo yum install condor-all
sudo systemctl start condor
sudo systemctl enable condor
Ensure Condor is running; the following should list condor_master, condor_procd and condor_shared_port:
ps -aux | grep condor
If you followed the instructions for setting up the Amazon Linux 2 General Environment, skip the AWS CLI installation via pip in the manual; it will already be present. Proceed with:
sudo yum install condor-annex-ec2
HTCondor and condor_annex should now be installed.
1.1.1. Configuring Master Condor Node (head)
Edit the `/etc/condor/config.d/local` file and make sure it contains the following configuration:
The MayUseAWS and SUBMIT_ATTRS job flags could have been supplied in the job script itself, but this makes them defaults for all jobs. Without `UID_DOMAIN` the jobs would not run as $USER but as `nobody`. Whether a job is run as $USER or `nobody` depends on the UID_DOMAIN configuration parameter of the submit and execute machines: the value must be the same on the two machines and must be a substring of the submit machine's full hostname, otherwise the job runs as the nobody user. Setting `UID_DOMAIN` to `compute.internal` ensures this condition is always met, since that string is always the ending of the `ip-<SOME_IP>.<SOME_REGION>.compute.internal` private DNS instance identification strings used by HTCondor. Setting `TRUST_UID_DOMAIN` then allows the jobs to elevate their privileges to those of the login `$USER`.
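A minimal sketch of what the head-node `local` file could contain, based on the description above (verify the attribute names against your HTCondor version before using it):

# hypothetical sketch of /etc/condor/config.d/local for the head node
sudo tee /etc/condor/config.d/local >/dev/null <<'EOF'
# make MayUseAWS a default submit attribute for all jobs
MayUseAWS = TRUE
SUBMIT_ATTRS = $(SUBMIT_ATTRS) MayUseAWS
# run jobs as $USER instead of nobody on the annex instances
UID_DOMAIN = compute.internal
TRUST_UID_DOMAIN = TRUE
EOF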
The following step in the official manual is out of date at the time of writing this instruction set. The file `/usr/libexec/condor/condor-annex-ec2` needs to be edited: delete lines 59 to 341, inclusive. Lines 47-51 can also be deleted if the `FROM INSTANCE` feature is wanted; the feature allows condor_annex to launch new EC2 instances without having to copy the AWS secret and access keys manually. The final file should look similar to:
depending on whether the `iptables` lines were removed or not.
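One way to perform the edit from the shell is sketched below; it assumes the line numbers quoted above still match the file shipped with your HTCondor version, so inspect the file (and keep a backup) before deleting anything:

sudo cp /usr/libexec/condor/condor-annex-ec2 /usr/libexec/condor/condor-annex-ec2.bak
# delete lines 59-341 inclusive
sudo sed -i '59,341d' /usr/libexec/condor/condor-annex-ec2
# optional, only if the FROM INSTANCE feature is wanted; lines 47-51 keep their
# numbering because only lines after the deleted range shift
sudo sed -i '47,51d' /usr/libexec/condor/condor-annex-ec2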
HTCondor will use password authentication between its instances, but it does not matter what the password is. One still needs to be created by running:
sudo condor_store_cred -c add -f `condor_config_val SEC_PASSWORD_FILE`
and entering any password desired.
A `condor_pool_password` file should appear in `/etc/condor/`. Make sure the permissions and ownership of the file are correct:
sudo chmod 600 /etc/condor/condor_pool_password
sudo chown root /etc/condor/condor_pool_password
Create a way for condor_annex to share the password with new instances:
mkdir -p ~/.condor
sudo cp /etc/condor/condor_pool_password ~/.condor/
sudo chmod 600 ~/.condor/condor_pool_password
sudo chown $USER ~/.condor/condor_pool_password
Add additional required keys to that directory:
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/.condor/privateKeyFile echo "YOUR_AWS_ACCESS_KEY_ID" > ~/.condor/publicKeyFile sudo chmod 600 ~/.condor/*KeyFile
They should look something like `wJalrXUtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY` and `AKIAIOSFODNN7EXAMPLE` respectively; you can create them in the AWS security credentials console.
Create the `~/.condor/user_config` file. Edit the file manually if you ran the `condor_annex -setup` command beforehand, as the file will then already contain other sensitive configuration information. Add the following lines:
echo "SEC_PASSWORD_FILE=/home/$USER/.condor/condor_pool_password" > ~/.condor/user_config echo "ANNEX_DEFAULT_AWS_REGION=us-east-2" >> ~/.condor/user_config
Make sure that the file is owned by the current user. If the file was accidentally created with `sudo emacs/vi/vim/anytexteditor`, the owner will be root and the configuration will not be written to the file. If `condor_annex -setup` was run before the new default AWS region was set, setting the region afterwards will raise the error "Secret Key File must be accessible by owner". The option can only be set before running `condor_annex -setup`; the setup should still work, though, even if the wrong region is selected.
sudo chown $USER ~/.condor/user_config
Run the following command, which creates the final setup files:
condor_annex -setup
This will edit the `~/.condor/user_config` file along with some others.
1.1.2. Configuring Worker Condor Node
Edit or create the file `/etc/condor/config.d/local` with the following content:
Without `UID_DOMAIN` the jobs would not run as $USER but as `nobody`. Whether a job is run as $USER or `nobody` depends on the UID_DOMAIN configuration parameter of the submit and execute machines: the value must be the same on the two machines and must be a substring of the submit machine's full hostname, otherwise the job runs as the nobody user. Setting `UID_DOMAIN` to `compute.internal` ensures this condition is always met, since that string is always the ending of the `ip-<SOME_IP>.<SOME_REGION>.compute.internal` private DNS instance identification strings used by HTCondor. Setting `TRUST_UID_DOMAIN` then allows the jobs to elevate their privileges to those of the login `$USER`.
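A minimal sketch of the worker-node `local` file, matching the `UID_DOMAIN` discussion above (verify against your pool's configuration):

# hypothetical sketch of /etc/condor/config.d/local for a worker node
sudo tee /etc/condor/config.d/local >/dev/null <<'EOF'
# run jobs as $USER instead of nobody
UID_DOMAIN = compute.internal
TRUST_UID_DOMAIN = TRUE
EOF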
Edit the `/etc/condor/condor_config` file and add the following line at the end of the file:
SSH_TO_JOB_SSHD_CONFIG_TEMPLATE = /etc/condor/condor_ssh_to_job_sshd_config_template
Edit the `/etc/condor/condor_ssh_to_job_sshd_config_template` file. Change `UsePrivilegeSeparation` from `sandbox` to `yes` and (optionally) set `RhostsRSAAuthentication` to `yes`:
UsePrivilegeSeparation yes # sandbox separation not supported in RHEL6
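If preferred, the same edits can be made non-interactively; the one-liners below are a sketch that assumes both directives appear as single uncommented lines in the template, so inspect the file first:

sudo sed -i 's/^UsePrivilegeSeparation.*/UsePrivilegeSeparation yes # sandbox separation not supported in RHEL6/' /etc/condor/condor_ssh_to_job_sshd_config_template
sudo sed -i 's/^RhostsRSAAuthentication.*/RhostsRSAAuthentication yes/' /etc/condor/condor_ssh_to_job_sshd_config_template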
1.1.3. Shared configuration
The following steps are required on both the master and the worker nodes. They follow after either configuration step 1.1.1 or 1.1.2 has been completed.
Create the file `/usr/libexec/condor/s3.sh` with the following content:
#!/bin/sh

if [ "$1" = "-classad" ]
then
    echo 'PluginVersion = "0.1"'
    echo 'PluginType = "FileTransfer"'
    echo 'SupportedMethods = "s3"'
    exit 0
fi

source=$1
dest=$2

exec aws s3 cp ${source} ${dest}
And give it the following permissions:
sudo chmod 755 /usr/libexec/condor/s3.sh
Create a file `/etc/condor/config.d/10-s3` with contents:
ENABLE_URL_TRANSFERS = TRUE
FILETRANSFER_PLUGINS = $(LIBEXEC)/s3.sh
This allows us to omit the `should_transfer_files` flag from job submit scripts.
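As an illustration of what the plugin enables (a sketch; the bucket and object names are placeholders and the machine needs credentials that can read them), a job can then list an `s3://` URL directly among its inputs:

# hypothetical snippet: append an S3 input to a submit file;
# "my-bucket" and "input.dat" are placeholders
cat >> job.submit <<'EOF'
transfer_input_files = s3://my-bucket/input.dat
EOF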
The manual states that `/etc/condor/50ec2.config` should be deleted, but in test attempts that file did not exist. File `/etc/condor/config.d/50ec2.config` does exist, so proceed with:
sudo rm /etc/condor/config.d/50ec2.config
The manual states that a line starting with `After=...` in the `[Unit]` section of `/usr/lib/systemd/system/condor.service` needs to be copied to the `[Unit]` section of `/usr/lib/systemd/system/condor-annex-ec2.service`. I found these lines to already exist in both files and to be identical:
After=network-online.target nslcd.service ypbind.service time-sync.target nfs.client.target autofs.service
so instead just check that this line exists and is correct in both files.
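A quick way to compare the two lines:

grep '^After=' /usr/lib/systemd/system/condor.service /usr/lib/systemd/system/condor-annex-ec2.service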
Enable the service so that it starts at boot:
sudo systemctl enable condor-annex-ec2
1.1.4. Verify setup was correct
Unfortunately, this will only work for the master node setup:
sudo systemctl restart condor
condor_restart
sudo reboot now
# after successful reboot only:
condor_annex -check-setup
It should return all `OK`s.
1.2. Pegasus
The last step is to add Pegasus. If the Amazon Linux 2 General Environment or CentOS 7 General Environment is used, it is enough to run:
sudo wget -O /etc/yum.repos.d/pegasus.repo https://download.pegasus.isi.edu/pegasus/rhel/7/pegasus.repo
sudo yum install pegasus
Pegasus will now be installed and linked to the system-wide Python installation.
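A quick sanity check (assuming the package placed the Pegasus command-line tools on the PATH):

pegasus-version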
Using Pegasus from the LSST environment.
The link to the system-wide Python is an issue if the end goal is to run Pegasus in an activated LSST Stack environment: activating the LSST environment overrides the Python search locations, after which importing Pegasus will not work. To fix this it is necessary to re-link the correct Pegasus library paths back into the environment without accidentally short-circuiting the LSST environment's `PYTHONPATH` values. Do the following if Pegasus is intended to be used from the LSST environment:
sudo mkdir -p /usr/share/pegasus/lib64/python2.7/site-packages
sudo ln -s /usr/lib64/python2.7/site-packages/Pegasus /usr/share/pegasus/lib64/python2.7/site-packages/
and add the paths to environment:
export PYTHONPATH=/usr/share/pegasus/lib64/python2.7/site-packages/:$PYTHONPATH
In the provided AMIs these paths are already exported in `~/.bashrc`.
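To verify the re-link from inside an activated LSST environment, a quick import test can be run; this is a sketch that assumes the Python 2.7 `Pegasus.DAX3` DAX API module shipped with this Pegasus version:

# should print a path under the re-linked site-packages if the symlink and
# PYTHONPATH are set up correctly
python -c "import Pegasus.DAX3 as DAX3; print(DAX3.__file__)"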
1.3. Using the prepared AMI
AMIs with names `*_condor_base` contain an environment built according to the instructions given above. They are a set of minimal environments in which Condor and Pegasus will work and that are capable of compiling the LSST Stack (i.e. the steps described in the note above have been executed). Before using the provided images, however, the following configuration step is required:
echo "YOUR_AWS_SECRET_ACCESS_KEY" > ~/.condor/privateKeyFile echo "YOUR_AWS_ACCESS_KEY_ID" > ~/.condor/publicKeyFile sudo chmod 600 ~/.condor/*KeyFile
A basic test if the AMI works can be done by creating a `job.submit` file with contents:
universe = vanilla
executable = /usr/bin/ls
getenv = True
output = out
log = log
error = err
Requirements = TARGET.IsAnnex == true
queue
and then launching an annex worker and submitting a job:
condor_annex -count 1 -annex-name SimpleTest
# wait
condor_submit job.submit
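While waiting for the annex to come up, its state and the job queue can be inspected (assuming the annex-aware `condor_status` option available in recent HTCondor versions):

# list slots that have joined the named annex, then check the job queue
condor_status -annex SimpleTest
condor_q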
Additional options can be supplied to annex creation, such as `-aws-on-demand-instance-type m4.large` and/or `-aws-on-demand-ami-id ami-AMI_ID`. More flags exist, some of which can be found here.