Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Apologies

Agenda — Data Replication

  • Status of replication exercises among the 3 facilities [ Yuyi Guo, Brandon White ]
    • the evaluation instance of FTS has been showing some fragility. Is there anything we can do to improve this situation?
    • what replication exercises have we performed? What is working and what is not?
    • what is preventing us from starting regular replication of the equivalent of one night's worth of raw data?
  • Status of Rucio & FTS monitoring [ Timothy Noble, George Beckett ]
    • are we ready to set up an initial version of a dashboard showing Rucio and FTS activity?
  • Status of butler & Rucio integration [ Steve Pietrowicz]
  • Status of replication of STS data from USDF to FrDF [ Wei Yang, Kian-Tat Lim, Fabio Hernandez ]

Data Replication JIRA Issues

  • JIRA tickets with tag "data-replication"

Notes

Data Replication (Fabio)

  • Brandon noted lots of transfer experiments between SLAC and Lancaster (coordinated by Yuyi with support from Brandon)
    • Trying to figure out realisable transfer speed
      • Tested with two large transfers (300 GB => 4.7 Gbps; 1,700 GB => similar speed), based on a mixed set of O(10 MB) files.
      • Attempted a 12-hour transfer test, but it took around 3 days to complete, due to issues with the FTS server
        • FTS server cannot cope with large numbers of transfer requests
        • Wei has been helping to diagnose issue
      • Rucio smart enough to cope with FTS outages
      • Of 5,000 files, only one failed to transfer in three-day test.
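As a rough sanity check on the figures above, the quoted sizes and rates can be related to transfer durations (an illustrative back-of-the-envelope calculation only, using the numbers from the notes):

```python
# Illustrative check of the transfer figures quoted in the meeting.
def transfer_time_seconds(size_gb: float, rate_gbps: float) -> float:
    """Time to move size_gb gigabytes at rate_gbps gigabits per second."""
    return size_gb * 8 / rate_gbps

# 300 GB at 4.7 Gbps should complete in under 10 minutes...
t_300 = transfer_time_seconds(300, 4.7)
print(f"300 GB at 4.7 Gbps: {t_300 / 60:.1f} min")

# ...so a 1,700 GB test stretching to ~3 days implies the FTS server,
# not the network link, was the bottleneck.
avg_gbps = 1700 * 8 / (3 * 24 * 3600)
print(f"1,700 GB over 3 days: {avg_gbps * 1000:.0f} Mbps average")
```

This supports the observation that the three-day duration was an FTS-server issue rather than a link-capacity one.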
    • Yuyi compared with FTS server at FermiLab
      • Looks to require ~15 MB of memory per concurrent transfer (the server currently has 8 GB of memory, of which roughly 5–6 GB is available for transfers)
      • Too many concurrent transfers cause the server to crash
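Using the figures quoted in the meeting (~15 MB per concurrent transfer, 8 GB total RAM, roughly 5–6 GB usable for transfers), a rough capacity estimate for the evaluation FTS server can be sketched:

```python
# Rough capacity estimate for the evaluation FTS server, using the
# per-transfer memory footprint quoted in the meeting. All numbers are
# the meeting's estimates, not measured values.
MB = 1024 ** 2
GB = 1024 ** 3

mem_per_transfer = 15 * MB          # ~15 MB per concurrent transfer
usable_for_transfers = 5.5 * GB     # midpoint of the 5-6 GB estimate

max_concurrent = int(usable_for_transfers // mem_per_transfer)
print(f"~{max_concurrent} concurrent transfers before memory is exhausted")
```

This order-of-magnitude bound (a few hundred concurrent transfers) is consistent with the server crashing once too many transfers run at once.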
    • FTS deployed via Docker container (plan to move to Kubernetes in due course)
      • Also has limited log file space (17 GB), though a request is in to increase this to 50 GB
      • Potential to constrain the number of concurrent transfers
      • Tim N noted "for reference RAL runs FTS3 as well, we have 8 FTS servers with 8GB ram, 4 processors, 120 GB disc space each"
      • Looks to be priority to increase FTS capacity
      • RAL FTS instance will likely be very helpful for France–UK transfers, due to reduced latency.
    • Yuyi noted a bigger dataset would be needed to saturate the link if the specs of the FTS server were increased.
    • Wei asked if requests were grouped
      • Yuyi confirmed they were – Rucio organises files into groups of 100 files.
      • Wei noted grouping files reduces the load on FTS
    • Peter noted that the FTS configuration (the ** channel) has 0/0 set for minimum/maximum transfers
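The grouping behaviour described above (Rucio submitting files to FTS in batches of ~100) can be sketched as follows. This is a simplified illustration, not Rucio's actual code, and `submit_fts_job` is a hypothetical stand-in for a real FTS submission call:

```python
from typing import Iterable, Iterator, List

def batch(files: Iterable[str], size: int = 100) -> Iterator[List[str]]:
    """Yield successive batches of at most `size` files, mirroring the
    way Rucio groups files into one FTS job per ~100 files."""
    chunk: List[str] = []
    for f in files:
        chunk.append(f)
        if len(chunk) == size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk  # final partial batch

def submit_all(files: Iterable[str]) -> int:
    """Submit files in batches; returns the number of FTS jobs created."""
    jobs = 0
    for group in batch(files):
        # submit_fts_job(group)  # hypothetical FTS submission call
        jobs += 1
    return jobs

# 5,000 files (as in the three-day test) become 50 FTS jobs rather than
# 5,000 individual submissions, which is why grouping reduces FTS load.
print(submit_all(f"file_{i}" for i in range(5000)))
```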
  • In parallel, Peter is testing DP0.2 transfer from France to UK
    • Currently, has managed to transfer a small amount, though is waiting on the go-ahead from Wei to resume.
    • Also transferring subset of data from SLAC, using rsync, though this is too slow
    • Clear that Rucio would help with managing this.
  • Greg noted a potentially separate issue with large file counts in submissions (based on documentation)
    • This is being investigated
    • In Slack, Greg noted some commented lines in the FTS config template:

      ## Parameters for QoS daemon - BringOnline operation
      # Maximum bulk size
      # If the size is too large, it will take more resources (memory and CPU) to generate the requests
      # and parse the responses. Some servers may reject the requests if they are too big.
      # If it is too small, performance will be reduced.
      # Keep it to a sensible size (between 100 and 1k)
      # StagingBulkSize=200

      This suggests there could be server issues with large submits.
  • Tim noted progress with Rucio dashboard
    • https://grafana.slac.stanford.edu/d/000000003/rucio-overview?orgId=1&refresh=1m
      • Reporting transfer success (from viewpoint of Rucio), with timeline
      • Rule state – slightly opaque; extracting more info would require custom scripts
      • Transmogrifier daemon
      • Reaper (deletions)
      • K-T noted high action count.
      • Data sourced from Rucio logging, via Prometheus
      • Link in Grafana folder – see 'Rucio Overview'
    • Requires FTS monitoring to progress further.
      • FTS logging is still being set up
      • Also, need a queue server (e.g., ActiveMQ) or a direct route into Prometheus.
      • Potentially, details could then be obtained from FTS
      • Steve P noted the potential to use Hermes/ActiveMQ, which includes FTS logging.
      • Tim noted lots more information in both Hermes and FTS (compared to Rucio)
        • Includes filtering by end point, for example
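The kind of per-endpoint filtering mentioned above could be built on the transfer-state messages Hermes/FTS publish to the queue. A minimal offline sketch follows; the message schema here is a simplified assumption for illustration, not the real Hermes format:

```python
import json
from collections import Counter

# Simplified stand-ins for transfer-state messages of the kind Hermes/FTS
# would publish to ActiveMQ. Field names ("dst", "state") are assumptions.
SAMPLE_MESSAGES = [
    '{"dst": "LANCS", "state": "DONE"}',
    '{"dst": "LANCS", "state": "FAILED"}',
    '{"dst": "IN2P3", "state": "DONE"}',
]

def count_states(raw_messages, endpoint):
    """Tally transfer states for a single destination endpoint,
    e.g. to feed a per-endpoint panel in the Grafana dashboard."""
    states = Counter()
    for raw in raw_messages:
        msg = json.loads(raw)
        if msg["dst"] == endpoint:
            states[msg["state"]] += 1
    return states

print(count_states(SAMPLE_MESSAGES, "LANCS"))
```

In a real deployment the messages would arrive from ActiveMQ and the tallies would be exported to Prometheus rather than printed.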
  • Butler-Rucio integration
    • Steve ready to set up Kafka end points in UK and France.
    • Assume would be deployed within Kubernetes at each DF
      • Would be handy to be able to use Phalanx to streamline this, as it includes CertBot for auto-renewal of certificates
    • K-T noted Phalanx is potential overkill if we don't need it.
      • Peter noted manual deployment of Kafka at Lancaster already
      • K-T noted three levels of Kafka deployment
        • Baremetal
        • Kubernetes
        • Sasquatch-based
      • George suggested to try existing Phalanx instance (in Edinburgh), plus option to build on MirrorMaker 2 experience from Lasair team
      • Fabio asked if certificates were necessary.
        • K-T noted was for TLS (HTTPS security), based on LetsEncrypt.
      • Fabio noted France has stopped using LetsEncrypt.
      • IN2P3 also using Phalanx for RSP, though is running on isolated Kubernetes cluster, serving RSP and Qserv only.
    • Potential site-specific concerns for LetsEncrypt
    • Pete proposed that each site worries about its own site's infrastructure
      • K-T noted that Rubin would provide configurations that would be deployable via Kubernetes
      • K-T noted different views – infrastructure provided service or integral part of service deployment
    • Steve noted Greg has joined team to help with RSE deployment
  • Replication of STS data from USDF to France (Fabio)
    • Wei noted in process of registering data to Rucio (on US side)
    • Fabio noted a way to detect whether files are replicated into the SE in France has been configured.
      • Using Kafka server to broadcast event from dCache
    • Next step is to work out how to trigger actions in reaction to events (e.g., when a file arrives, how do you take an action based on that trigger?)
      • K-T notes potential to reuse setup from Summit
      • Wei noted the STS setup (sites in the US and France) will not use a hash function
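The open step above – triggering an action when an event (such as a file arrival broadcast from dCache via Kafka) is received – could take the shape of a simple event dispatcher. A minimal offline sketch, in which the event fields and the handler are illustrative assumptions rather than the actual dCache event schema:

```python
from typing import Callable, Dict

# Registry mapping event types to handler functions. In a real deployment
# events would be consumed from the Kafka topic fed by dCache.
handlers: Dict[str, Callable[[dict], str]] = {}

def on(event_type: str):
    """Register a handler for a given event type."""
    def register(fn):
        handlers[event_type] = fn
        return fn
    return register

@on("file_arrived")
def register_replica(event: dict) -> str:
    # e.g., mark the replica as available at the French SE
    return f"registered {event['path']}"

def dispatch(event: dict) -> str:
    """Route an incoming event to its handler, if one is registered."""
    handler = handlers.get(event["type"])
    return handler(event) if handler else "ignored"

print(dispatch({"type": "file_arrived", "path": "/sts/raw/exp_001.fits"}))
```

The Summit setup K-T mentioned might already provide this pattern, in which case only the handlers would need writing.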

Multi-site PanDA Support and Operation (Wei)

  • Focus of activity has been on transfer from France to UK of DC2 data, followed by registration of the data in a UK Butler.
    • Pete confirmed this is on his to-do list, once the data is transferred.

Date of next meeting

Monday April 3rd, 8 am PT

Discussions on Slack #dm-rucio-testing in interim

Action items

  •