Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Absent

Agenda: Data Replication

  • Status of Rucio evaluation platform at USDF and replication exercises
  • Status of Rucio production platform at USDF
  • Status of data replication exercises
  • Status of data replication monitoring tools and logging platform
  • Status of integration of Rucio and the Butler for automated ingestion
      • JIRA issue?
  • Collective writing of a technote where we collect details on what we need to replicate and when
  • JIRA tickets relevant for data replication

Notes

Data Replication

  • Yuyi is testing file replication between USDF and RAL. Recent transfers achieve around 10–15 MB/s (using the XRootd protocol), which is in line with expectations.
    • No progress on resolving the poor performance from USDF to Lancaster and from USDF to IN2P3; it looks to be an issue with drop-offs (with FTS pull).
      • Pete C noted investigations by Duncan R, which revealed no apparent issues with transfer rates from SLAC to either RAL or Lancaster.
      • Matt D noted previous issues with XRootd transfers to Lancaster, which should now be resolved, so it is worthwhile to re-test.
      • Yuyi noted IN2P3 was the first site to be tested, and that the configuration had been optimised since then, so it needs rechecking.
      • Fabio asked that transfer-rate issues be recorded in the ticket; previous experiments (conducted by Fabio and Wyn) produced reasonable results.
        • Yuyi confirmed the results were already posted to the ticket.
      • Pete suggested running traceroute to sanity-check the route that transfers are taking.
      • Fabio noted that Rucio file sizes are O(100 MB), which is smaller than the file sizes being experimented with.
        • K-T suggested raw files would be smaller; O(100 MB) relates to calibrated exposures and similar.
        • Pete noted that small-file transfers might be adversely affected by the TCP/IP settings.
      • George proposed that it would be good to have a transfer test that could easily be run in the future if, for example, we suspected transfer problems between any of the sites making up the Rucio environment.
    • Steve has not been able to progress work on Hermes messaging, but believes Kafka can support multiple clients retrieving messages from Rucio.
      • Steve noted the work is tracked in an old ticket (from the Data Backbone Design).
        • Fabio suggested adding the label 'data replication' to the ticket.
    • No progress on creating an automated/streamlined deployment of Rucio.
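George's suggestion of an easily repeatable transfer test could look something like the sketch below. This is only a local illustration: it times a plain file copy, whereas a real inter-site check would time a `gfal-copy` or `xrdcp` between endpoints; all file names here are invented.

```python
import os
import shutil
import tempfile
import time

def transfer_rate_mb_s(src_path, dest_path):
    """Copy a file and return the achieved rate in MB/s.
    In a real test the copy step would be an inter-site transfer."""
    start = time.monotonic()
    shutil.copyfile(src_path, dest_path)
    elapsed = time.monotonic() - start
    size_mb = os.path.getsize(src_path) / 1e6
    return size_mb / elapsed if elapsed > 0 else float("inf")

# Local demonstration: write a ~10 MB test file and "transfer" it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "testfile.bin")
    dst = os.path.join(tmp, "copy.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(10_000_000))
    rate = transfer_rate_mb_s(src, dst)
    print(f"{rate:.1f} MB/s")
```

Running such a check periodically between each pair of Rucio sites would give a baseline to compare against when problems are suspected.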
  • Brandon is working on transfer of files from NCSA to SLAC.
    • The bulk transfer has completed, though the file status on 27th July now needs to be 'diffed' against that on 1st August, to catch any late changes to the NCSA file systems.
    • NCSA file servers are to be switched to supervisor mode shortly after 15th August, for a couple of days, in case of last-minute issues.
    • Brandon noted that later transfers were based on tarballs of around 1 TB, which achieved much higher bandwidth than the individual-file transfers used initially.
      • Pete noted that enabling many parallel streams for single-file transfers might also achieve higher bandwidths, similar to tarring up the files.
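The 'diff' of the two file-status snapshots amounts to a set comparison. A minimal sketch, assuming each snapshot is a mapping of path to checksum (the paths and checksums below are made up):

```python
def late_changes(listing_before, listing_after):
    """Compare two {path: checksum} snapshots; return files added,
    removed, or modified between the two dates."""
    before, after = set(listing_before), set(listing_after)
    added = after - before
    removed = before - after
    modified = {p for p in before & after
                if listing_before[p] != listing_after[p]}
    return added, removed, modified

# Illustrative snapshots (paths and checksums are invented).
july_27 = {"/repo/a.fits": "c1", "/repo/b.fits": "c2", "/repo/c.fits": "c3"}
aug_01 = {"/repo/a.fits": "c1", "/repo/b.fits": "c9", "/repo/d.fits": "c4"}

added, removed, modified = late_changes(july_27, aug_01)
print(added, removed, modified)
```

Anything in the three result sets would need to be re-transferred (or deleted) before the NCSA side is frozen.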
  • Fabio has proposed a tool set up for the ESCAPE project, which he found very useful (it works at the Rucio and FTS levels).
    • Tim N to liaise with Yuyi, who is most likely to be interested.
    • Steve noted that Hermes 2 (with Kafka integration) takes information from the database and then deletes it, so it may cause issues for monitoring tools that rely on the Rucio database.
  • Fabio noted that IN2P3 is deploying a dCache instance for Rubin, to replace the interim instance currently being used for transfer tests.
    • Fabio tried to test transfers from SLAC DTN hosts, but is seeing an issue with asymmetric routing.
      • It looks as if the SLAC network is trying to talk to the dCache server at IN2P3 via LHCONE, though the dCache server is not on LHCONE.
      • Hopefully the network can be configured, at the SLAC end, to talk to the dCache server via the regular internet.
      • PREOPS-1321
    • Pete reiterated that ESNet staff are open to discuss networking strategies with Rubin.
    • Richard noted that the link from the Summit to SLAC is operational, and wondered if there was a risk of a similar issue (incorrectly assuming traffic could travel over the dedicated VPN) for that connection.
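Pete's traceroute suggestion also gives a quick way to spot the LHCONE routing asymmetry discussed above: scan the hop hostnames for an unexpected network. A minimal sketch, using canned output (all hostnames below are invented for illustration):

```python
def hops_matching(traceroute_output, needle="lhcone"):
    """Return hop lines whose hostname contains the given substring.
    Useful for spotting traffic unexpectedly routed via LHCONE."""
    return [line.strip() for line in traceroute_output.splitlines()
            if needle in line.lower()]

# Canned traceroute-style output; hostnames are invented.
sample = """\
 1  gw1.slac.stanford.edu (134.79.0.1)  0.5 ms
 2  core-router.example.net (198.51.100.7)  2.1 ms
 3  lhcone-gw.example.net (198.51.100.9)  80.3 ms
 4  dcache-door.example.fr (203.0.113.4)  140.2 ms
"""
suspect = hops_matching(sample)
print(suspect)
```

In practice one would feed in the output of `traceroute <dcache-host>` from the SLAC DTN and flag any hop on a network the destination is not attached to.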

Multi-site PanDA Support and Operation

  • Wen noted that the PanDA system has been deployed to SLAC and the network configuration completed.
    • Jobs have been successfully submitted from other sites to SLAC. However, because the SLAC PanDA server is on a private network, it cannot submit jobs to the ARC-CE (no outbound connections are allowed from PanDA).
      • Wen is investigating options, though initial attempt to set up Squid proxy was only partially successful.
      • Peter L noted it is potentially better to simply connect the PanDA server to the internet, rather than trying to work around the issue (adding complexity to the networking).
      • K-T noted a longer-term plan to enable NAT support, which would eliminate the issue.
      • George suggested it might be worth trying to accelerate the timeline for setting up NAT support.
  • Wen is also working with Fabio to make the PanDA environment available via CVMFS (the PanDA environment needs to be present at all PanDA sites, and the hope is to facilitate this via CVMFS).
    • Wen noted there is still some ATLAS code in the wrapper, which he'd like Peter L to look at.
    • Peter L is on leave until end of August, but can work on this in September.
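The outbound-connectivity question above (can the private-network PanDA server reach the ARC-CE, directly, via a Squid proxy, or via NAT?) reduces to a simple TCP reachability probe. A minimal sketch; the host and port would be the actual ARC-CE endpoint in practice:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds.
    A quick check of whether a host on a private network can reach a
    given endpoint (directly or via NAT routing)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the PanDA server against the ARC-CE's submission port, this would confirm whether a given networking workaround (proxy, NAT, direct connection) actually restores outbound access.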

Date of next meeting

Monday August 22nd (discussions on Slack #dm-rucio-testing in interim)

Action items

  •