Date

Attendees

Goals

  • Status

Discussion items

Rucio updates:

  • Stephen P has modified the Rucio Hermes service to send messages out to a Kafka broker, to work around the fact that Hermes deletes messages as soon as they are (successfully) read. This is to connect ActiveMQ and the Camel daemon.
  • A Kafka broker on the Butler side, with a client daemon, can then interrogate status and trigger ingestion as appropriate.
  • Still need to figure out the sidecar issue.
    • Hermes logs Rucio events such as transfers, deletions, etc.
    • The issue arises when logging and monitoring at the same time.
    • If Hermes fails to send a message, it does not delete that message.
    • Wei noted that Rucio also logs messages in the Reaper daemon, so the information is retrievable from there.
  • Brandon has been working out how to deploy Rucio on the cluster.
  • He is confident he can configure things and run the latest version of Rucio.
  • Next step is to read and check the various YAML configurations, and to set up the secrets in Vault.
  • Kubernetes ingress still to be tested.
  • Yuyi is testing transfers from SLAC to IN2P3 and from SLAC to Lancaster. Both exhibit slow speeds with significant fluctuations. The SLAC end-point has been checked (it is also being used for the NCSA transfer). SLAC to RAL is also slow, though this is related to a VOMS issue which has so far only been fixed on a slow, old gateway. The VOMS fix is to be rolled out across all gateways on Mon 24th/Tue 25th, so it should be ready to retest imminently.
  • George noted the option to engage Richard H-J once the RAL test is completed, if useful. Wei noted that the choice of MTU 9000 vs 1500 had caused problems. SLAC previously had MTU set to 1500, but switched to 9000 to work around issues with transfers from NCSA. Wei is seeing deterioration compared to tests run a few months ago.
  • When testing, Yuyi found significant failures with MTU 9000. Reliability was better with MTU 1500, but transfers were slower. Pete noted that the WAN is unlikely to be the problem. Pete suggested testing transfers from Fermilab to the end points, as this route is used all of the time, and that traceroute might provide more information.
  • Yuyi is investigating optimisation of the Rucio database. The ATLAS DB admin at CERN led on this, though with a focus on Oracle DB. Yuyi hopes to migrate the Oracle optimisations over to PostgreSQL, so that Rubin can benefit as well.
  • Greg Daues is doing some tests on transfers: he transferred 4GB files with a single process, testing between NERSC and NCSA and between NCSA and IN2P3. He is also trying AWS S3cp for comparison, chopping up the file and using multiple streams. This motivates a review of the XRootD configurations.
  • USDF data processing
    • Michelle noted Hsin Fan was having issues with merge jobs needing more memory than available by default.
    • Wen would like to define PanDA logical queues so that he can use pull mode on some of them and push mode on others (pull mode has a shorter latency). The mode is determined by the queue being used.
  • PanDA
    • The PanDA env (with CAs and pilots) is waiting on Fabio to deploy to CVMFS. Wen would like to add some PanDA setup to the CVMFS environment. CVMFS will remove the need to download it in HTCondor jobs. Wei noted a ticket he had submitted to support on the timeout for staging files; he is waiting on a reply. Peter Love's pilot wrapper will be included (from wrapper), which will allow site-specific secrets/environment to be sent to sites. Wen noted discussions with the HammerCloud team regarding support for testing sites.
    • Running jobs at UKDF and France DF is awaiting the CVMFS update (from Fabio).
  • Wen is working on a new release of PanDA, to address a port issue identified when testing submission of jobs to CEs via a Squid server (ARC-CE ignores the proxy setup). For SLAC, there should be NAT-based outbound TCP to the outside world from USDF; Wei will check. Wen also confirmed that PanDA ingress (with X.509) is not working. He is trying a different ingress controller, testing with HTTP.
  • Pete noted that ARC-CE 7 is to be released imminently, with exclusive use of REST. Wei confirmed that all sites are running ARC-CE 6, but with the REST API. Wei asked whether ARC-CE plans to support tokens. Pete offered to share slides from a talk that discussed this topic (https://indico.cern.ch/event/1096032/).
  • Wei noted that token usage in ATLAS is still premature.
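
For reference, the Hermes behaviour noted under Rucio updates (a message is deleted only once it has been sent successfully, and is kept if the send fails) can be illustrated with a minimal sketch. This is not Rucio or Hermes code: the function `forward_pending`, the `send` callback and the event names are all hypothetical, chosen only to show the delete-on-success pattern.

```python
from collections import deque


def forward_pending(pending, send):
    """Try to send each pending message once.

    Messages whose send succeeds are removed from the queue;
    messages whose send raises are kept for a later retry.
    Returns the list of messages that were sent.
    """
    kept = deque()
    sent = []
    while pending:
        msg = pending.popleft()
        try:
            send(msg)          # hypothetical stand-in for publishing to a broker
        except Exception:
            kept.append(msg)   # failed send: the message is not deleted
        else:
            sent.append(msg)   # successful send: the message leaves the queue
    pending.extend(kept)       # failed messages stay queued, in original order
    return sent


if __name__ == "__main__":
    # Hypothetical event names, for illustration only.
    queue = deque(["transfer-done", "deletion-done"])

    def broker_down(msg):
        raise ConnectionError("broker unreachable")

    print(forward_pending(queue, broker_down))  # nothing sent
    print(list(queue))                          # both messages still queued
```

Under this pattern a second consumer reading the same queue would miss anything already sent and deleted, which is the motivation for fanning messages out to a separate broker.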

Action items

  •