Time, Date & Place
12:00 Pacific on Amazon Chime https://chime.aws/1930107527
Attendees
Wil, Dino, Hsin-Fang, Michelle Gower, Greg Daues, Greg Thain, Michelle Butler, Todd Miller, Lorena, Sanjay
Discussion Items
- Updates:
- Dino has the new annex workers working. Configuration and software environment issues were addressed.
- Hsin-Fang continued the RC2 run with the old AMIs in the background. Some jobs require more memory, and partitionable condor slots are used. Hsin-Fang will switch to Dino's new AMIs.
- The annex connectivity error was seen again. Any new findings?
- Need better debug logs
- May write a tool to reproduce the failure on purpose
- Several times Hsin-Fang saw Spot instances disappear from the condor pool (according to condor_status) while "condor_annex -status" still reported the instances as running. The instances seemed to have lost reachability (for example, direct ssh to the Spot instance no longer worked as usual). Eventually, after a delay, the EC2 "instance status check" showed that the reachability check had failed.
Is this a Spot instance being interrupted, even though the "instance state" remained "running"? What is the expected behavior when a Spot instance is cancelled?
- Spot cancellation comes with a 2-minute warning, and in that case the instance should not be shown as running in the console. More logs would help.
- Manually rebooting the failed instances gets them to pass the reachability check again, but they do not rejoin the condor pool.
- Todd is not surprised they can't rejoin
- Todd has seen instances lose connection in large runs (>1000), but rarely
- Sanjay recommended:
- try on-demand instances in the same workflow/setup for debugging
- try other instance types such as R4; R5 instances are Nitro-based and new
- try Spot Block (e.g. a 6-hour block), but annex doesn't support it yet
- May have another call including Chris/Aaron.
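To make the symptom above easier to log, a small helper could list the instances that annex still reports as running but that are no longer in the condor pool, together with their EC2 reachability status. This is only a sketch: it assumes the instance IDs on both sides have already been collected (for example from "condor_annex -status" and condor_status output), and that the EC2 status data is a dict shaped like the response of boto3's describe_instance_status call. The function name and all sample values are hypothetical.

```python
def find_lost_workers(annex_running_ids, pool_ids, status_response):
    """Return (instance_id, reachability) pairs for workers missing from the pool.

    annex_running_ids: instance IDs that annex reports as running (assumed input)
    pool_ids: instance IDs of workers currently visible in the condor pool (assumed input)
    status_response: dict shaped like an EC2 describe_instance_status response
    """
    # Collect the "reachability" instance status check for every instance listed.
    reachability = {}
    for entry in status_response.get("InstanceStatuses", []):
        details = entry.get("InstanceStatus", {}).get("Details", [])
        for detail in details:
            if detail.get("Name") == "reachability":
                reachability[entry["InstanceId"]] = detail.get("Status", "unknown")

    # Instances that annex believes are running but the pool no longer sees.
    lost = sorted(set(annex_running_ids) - set(pool_ids))
    return [(instance_id, reachability.get(instance_id, "unknown")) for instance_id in lost]


if __name__ == "__main__":
    # Hypothetical sample data for illustration only.
    sample_response = {
        "InstanceStatuses": [
            {
                "InstanceId": "i-bbb",
                "InstanceState": {"Name": "running"},
                "InstanceStatus": {
                    "Status": "impaired",
                    "Details": [{"Name": "reachability", "Status": "failed"}],
                },
            }
        ]
    }
    for instance_id, status in find_lost_workers(
        ["i-aaa", "i-bbb"], ["i-aaa"], sample_response
    ):
        print(instance_id, status)
```

Running such a check periodically and keeping its output with timestamps would show how long the delay is between an instance leaving the pool and the reachability check failing.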
For HTCondor Annex and AWS, is there a recommended strategy on what instances to launch to accommodate a complex workflow with very different memory needs?
- We had some discussions and shared experience about this.
- It is good to know the categories of jobs and create separate annex groups to meet those needs. Set job requirements to match.
- annex node names
- Can use CPU/memory requirements in the Spot JSON and let AWS help choose instance types
- Partitionable slots. Condor slots can also partition disk space.
- Have instances big enough for the largest jobs.
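The idea of sorting jobs into categories and matching each category to its own annex group can be sketched as follows. This is an illustration only: the group names, the memory limits, and the job list are hypothetical and were not part of the discussion. Each job goes to the smallest group whose memory limit covers its request.

```python
from collections import defaultdict

# Hypothetical annex groups: group name -> memory (MiB) one job may use there.
ANNEX_GROUPS = {"small-mem": 4096, "medium-mem": 16384, "large-mem": 65536}


def assign_group(request_memory_mib, groups=ANNEX_GROUPS):
    """Return the smallest group whose memory limit covers the request."""
    for name, limit in sorted(groups.items(), key=lambda item: item[1]):
        if request_memory_mib <= limit:
            return name
    raise ValueError(f"no annex group can hold a {request_memory_mib} MiB job")


def plan_groups(jobs, groups=ANNEX_GROUPS):
    """Map each group name to the names of the jobs assigned to it.

    jobs: iterable of (job_name, request_memory_mib) pairs.
    """
    plan = defaultdict(list)
    for job_name, request_memory_mib in jobs:
        plan[assign_group(request_memory_mib, groups)].append(job_name)
    return dict(plan)


if __name__ == "__main__":
    # Hypothetical jobs with very different memory requests.
    jobs = [("job_a", 2000), ("job_b", 30000), ("job_c", 12000)]
    for group, names in plan_groups(jobs).items():
        print(group, names)
```

The per-group job counts from such a plan give a rough idea of how many instances of each size to request, and the largest limit has to cover the largest job, in line with the last point above.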
- Some of us continued to discuss the Kavli workshop tutorial session