AWS PoC Meeting 2019-10-18

Time, Date & Place

12:00 Pacific on Amazon Chime https://chime.aws/1930107527

Attendees

Wil, Dino, Hsin-Fang, Michelle Gower, Greg Daues, Greg Thain, Michelle Butler, Todd Miller, Lorena, Sanjay

Discussion Items

Updates:
- Dino has the new annex workers working. Configurations & software environment issues were tackled.
- Hsin-Fang continued the RC2 run with the old AMIs in the background. Some jobs require more memory, so partitionable condor slots are used. Hsin-Fang will switch to Dino's new AMIs.
The annex connectivity error was seen again. Any new findings?
- Need better debug logs
- May write a tool to reproduce the failure on purpose
Several times, Hsin-Fang saw Spot instances disappear from the condor pool (according to condor_status) while "condor_annex -status" still reported the instances as running. The instances seemed to have lost reachability (for example, Hsin-Fang could not ssh to the Spot instance directly as usual). Eventually, after a delay, the EC2 "instance status check" showed that the reachability check had failed.
- Was the Spot instance being interrupted, even though the "instance state" remained "running"? What is the expected behavior when a Spot instance is cancelled?
  - Spot cancellation comes with a 2-minute warning, but in that case the instance should not be shown as running in the console. More logs would help.
- Hsin-Fang could manually reboot the failed instances so that they passed the reachability check again, but they did not re-join the condor pool.
  - Todd is not surprised they can't rejoin
- Todd has seen instances lose connection in large runs (>1000), but only rarely
- Sanjay recommended:
  - try on-demand instances in the same workflow/setup for debugging
  - try other instance types such as R4; R5 instances are Nitro systems and new
  - try Spot Block (e.g. a 6-hour block), though annex doesn't support it yet
- May have another call including Chris/Aaron.
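To help collect evidence for the symptom above, the two views could be compared automatically. The following is a minimal sketch only: it assumes the list of running instances and the list of pool machine names have already been gathered separately (for example from the EC2 console and from condor_status), and the function name and input shapes are hypothetical.

```python
def find_missing_from_pool(running_instances, pool_machines):
    """Return IDs of instances reported as running that have no machine in the pool.

    running_instances: dict of instance ID -> host name (hypothetical input shape)
    pool_machines: iterable of machine names currently seen in the condor pool
    """
    # Compare host names case-insensitively.
    pool = {name.lower() for name in pool_machines}
    return sorted(
        instance_id
        for instance_id, host in running_instances.items()
        if host.lower() not in pool
    )


if __name__ == "__main__":
    # Made-up example values, for illustration only.
    running = {
        "i-aaa": "ip-10-0-0-1.ec2.internal",
        "i-bbb": "ip-10-0-0-2.ec2.internal",
    }
    in_pool = ["ip-10-0-0-1.ec2.internal"]
    print(find_missing_from_pool(running, in_pool))
```

Logging the result with a timestamp at regular intervals would show when an instance dropped out of the pool relative to when the EC2 status check failed.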
For HTCondor Annex and AWS, is there a recommended strategy for choosing which instances to launch to accommodate a complex workflow with very different memory needs?
- We had some discussions and shared experience about this.
- It is good to know the categories of jobs and create separate annex groups to meet those needs. Set job requirements to match.
- annex node names
- Can use CPU/memory requirements in the Spot JSON and let AWS help choose instance types
- Partitionable slots. Condor slots can also be partitioned by disk space.
- Have instances big enough for the largest jobs.
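The idea of sorting jobs into categories and matching each category to its own annex group can be sketched as follows. This is an illustration under stated assumptions, not a description of how annex works: the function name, the job names, and the memory sizes are all hypothetical, and each group is represented only by the memory its nodes offer.

```python
from bisect import bisect_left


def assign_annex_groups(job_memory_gb, group_memory_gb):
    """Place each job in the smallest group whose node memory fits it.

    job_memory_gb: dict of job name -> memory needed in GB (hypothetical values)
    group_memory_gb: iterable of per-group node memory sizes in GB (hypothetical)
    """
    sizes = sorted(group_memory_gb)
    groups = {size: [] for size in sizes}
    for job, need in job_memory_gb.items():
        # Index of the first group size that is >= the job's requirement.
        idx = bisect_left(sizes, need)
        if idx == len(sizes):
            raise ValueError(f"no group is large enough for {job} ({need} GB)")
        groups[sizes[idx]].append(job)
    return groups


if __name__ == "__main__":
    # Made-up job categories and group sizes, for illustration only.
    jobs = {"small_task": 4, "large_task": 30, "medium_task": 16}
    print(assign_annex_groups(jobs, [16, 64]))
```

The error raised for an oversized job corresponds to the last point above: at least one group needs instances big enough for the largest jobs.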

Some of us continued to discuss the Kavli workshop tutorial session
