Time, Date & Place

12:00 Pacific on Amazon Chime: https://chime.aws/1930107527

Attendees

Wil, Dino, Hsin-Fang, Michelle Gower, Greg Daues, Greg Thain, Michelle Butler, Todd Miller, Lorena, Sanjay

Discussion Items

  • Updates:
    • Dino has the new annex workers working. Configuration and software-environment issues were tackled.
    • Hsin-Fang continued the RC2 run with the old AMIs in the background. Some jobs require more memory, so partitionable condor slots are used. Hsin-Fang will switch to Dino's new AMIs.
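The partitionable-slot setup mentioned above can be expressed with standard HTCondor startd knobs. This is a generic sketch, not the configuration actually used in the run:

```
# Generic partitionable-slot configuration (standard HTCondor knobs,
# not copied from the actual AMI setup).  A single partitionable slot
# owns all resources; jobs carve off what they request.
NUM_SLOTS = 1
NUM_SLOTS_TYPE_1 = 1
SLOT_TYPE_1 = cpus=100%, memory=100%, disk=100%
SLOT_TYPE_1_PARTITIONABLE = TRUE
```

Each job then claims its share via request_cpus / request_memory in its submit description.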
  • The annex connectivity error was seen another time. Any new findings?
    • Need better debug logs 
    • May write a tool to generate the failure on purpose 
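One way to get the better debug logs mentioned above is to raise the daemons' debug levels. These are standard HTCondor configuration knobs, offered here as a suggestion rather than the agreed plan:

```
# Verbose logging for connection debugging (standard HTCondor knobs).
ALL_DEBUG = D_FULLDEBUG
# D_SECURITY adds detail on authentication and connection failures.
STARTD_DEBUG = D_FULLDEBUG D_SECURITY
```

The resulting logs are much larger, so this is best enabled only while reproducing the failure.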
  • Multiple times, Hsin-Fang saw Spot instances disappear from the condor pool (according to condor_status) while "condor_annex -status" still said the instances were running. The instances seemed to lose reachability (for example, Hsin-Fang couldn't ssh to the Spot instance directly as usual). Eventually, with a delay, the EC2 "instance status check" showed that the reachability check failed.
    • Is the Spot instance getting disrupted, even though the "instance state" remained "running"? What is the expected behavior when a Spot instance gets cancelled?

      • Spot cancellation comes with a 2-minute warning, but in that case the instance should not be shown as running in the console. More logs would help.
    • Hsin-Fang could manually reboot the failed instances to get them to pass the reachability check again, but they won't rejoin the condor pool.
      • Todd is not surprised that they can't rejoin.
    • Todd has seen instances lose connection in large runs (>1000 instances), but only rarely.
    • Sanjay recommended:
      • trying on-demand instances in the same workflow/setup for debugging
      • trying other instance types such as R4; R5 instances are Nitro-based and newer
      • trying Spot Blocks (e.g. a 6-hour block), though annex doesn't support them yet
    • May have another call including Chris/Aaron. 
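The symptom above (instance state "running" while the reachability check fails) can be detected from the output of `aws ec2 describe-instance-status`. A minimal sketch in Python; the field names follow the public EC2 API, but the instance IDs and sample data are made up for illustration:

```python
def unreachable_but_running(statuses):
    """Return IDs of instances reported as running whose EC2
    'reachability' status check is not 'passed'."""
    bad = []
    for s in statuses:
        state = s["InstanceState"]["Name"]
        details = s["InstanceStatus"]["Details"]
        reach = next(
            (d["Status"] for d in details if d["Name"] == "reachability"),
            "unknown",
        )
        if state == "running" and reach != "passed":
            bad.append(s["InstanceId"])
    return bad

# Made-up sample payload mimicking describe-instance-status output.
sample = [
    {
        "InstanceId": "i-0aaa",
        "InstanceState": {"Name": "running"},
        "InstanceStatus": {"Details": [{"Name": "reachability", "Status": "failed"}]},
    },
    {
        "InstanceId": "i-0bbb",
        "InstanceState": {"Name": "running"},
        "InstanceStatus": {"Details": [{"Name": "reachability", "Status": "passed"}]},
    },
]
print(unreachable_but_running(sample))  # -> ['i-0aaa']
```

Run periodically against the live API, this would flag the stuck instances before condor_status and condor_annex disagree for long.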
  • For HTCondor Annex and AWS, is there a recommended strategy on what instances to launch to accommodate a complex workflow with very different memory needs?

    • We had some discussions and shared experience about this.
    • It is good to know the categories of jobs and create separate annex groups to meet those needs; set job requirements to match.
    • Annex node names can be used for this matching.
    • CPU/memory requirements can be specified in the Spot Fleet JSON, letting AWS help pick instance types.
    • Use partitionable slots; Condor slots can also partition disk space.
    • Have instances big enough for the largest jobs. 
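The grouping idea above can be sketched as a small helper that sorts jobs into a few memory tiers, with each tier served by its own annex group of appropriately sized instances. The tier boundaries and job names here are illustrative, not from the meeting:

```python
def annex_group(request_memory_mb, tiers=((4096, "small"), (16384, "medium"))):
    """Map a job's memory request (MB) to an annex-group label.
    Anything above the last tier goes to 'large'."""
    for limit, name in tiers:
        if request_memory_mb <= limit:
            return name
    return "large"

# Hypothetical jobs with their memory requests in MB.
jobs = {"coadd": 2048, "warp": 8192, "jointcal": 30000}
groups = {job: annex_group(mem) for job, mem in jobs.items()}
print(groups)  # -> {'coadd': 'small', 'warp': 'medium', 'jointcal': 'large'}
```

Each job would then carry a requirement matching its group's annex (assuming the annex name is exposed in the slot ads), so small jobs never occupy the big-memory instances.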
  • Some of us continued to discuss the Kavli workshop tutorial session