Date
Zoom
https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09
Attendees
- Richard Dubois
- Brandon White
- Matthew Doidge
- Kian-Tat Lim
- Brian Yanny
- Yuyi Guo
- Peter Love
- Peter Clark
- Wen Guan
- Steve Pietrowicz
- George Beckett
- Fabio Hernandez
Absent
Agenda — Data Replication
- Status of Rucio evaluation platform at USDF and replication exercises
- Status of Rucio production platform at USDF
- Status of data replication exercises
- Status of data replication monitoring tools and logging platform
- Status of integration of Rucio and butler for automated ingestion
- JIRA issue?
- Collective writing of a technote where we collect details on what we need to replicate and when
- JIRA tickets relevant for data replication
Notes
Data Replication
- Yuyi is testing file replication between USDF and RAL. Recent transfers achieve around 10–15 MB/s (using the XRootD protocol), which is in line with expectations.
- No progress on resolving poor performance from USDF to Lancaster and from USDF to IN2P3. The issue appears to be drop-offs (with FTS pull).
- Pete C noted investigations from Duncan R, which revealed no apparent issues with transfer rates for either SLAC to RAL or SLAC to Lancaster.
- Matt D noted previous issues with XRootD transfers to Lancaster, which should now be resolved, so it is worthwhile to re-test.
- Yuyi noted that IN2P3 was the first site to be tested, and that the configuration etc. had been optimised since then, so it needed rechecking.
- Fabio asked that transfer-rate issues be recorded in ticket. Previous experiments (conducted by Fabio and Wyn) produced reasonable results.
- Yuyi confirmed results were already posted to ticket.
- Pete suggested running traceroute to sanity-check the route that messages are taking.
- Fabio noted that Rucio file sizes would be of O(100 MB), which is smaller than the file sizes being experimented with.
- K-T suggested raw files would be smaller. O(100 MB) relates to calibrated exposures and similar.
- Pete noted that small-file transfers might be affected by a potential issue with TCP/IP settings.
- George proposed that it would be good to have a transfer test that could easily be run in the future if, for example, we suspected transfer problems between any of the sites making up the Rucio environment.
- Steve was not able to progress with work on Hermes messaging, but believes Kafka can support multiple clients getting messages from Rucio.
- Steve noted work was tracked in old ticket (from Data Backbone Design).
- Fabio suggested adding the label 'data replication' to the ticket.
- No progress on creating an automated/streamlined deployment of Rucio.
- Brandon is working on transfer of files from NCSA to SLAC.
- Bulk transfer has completed, though now need to 'diff' the file status on 27th July to that on 1st August, to catch any late changes to NCSA file systems.
- NCSA file servers to be switched to supervisor mode shortly after 15th August, for a couple of days, in case of last-minute issues.
- Brandon noted that later transfers were based on file tar-balls of around 1 TB, which achieved much higher bandwidth than individual files, as used initially.
- Pete noted enabling lots of parallel streams for single-file transfers might also achieve higher bandwidths, similar to tarring up the files.
- Fabio proposed a tool set up for the ESCAPE project, which he found very useful (it works at the Rucio and FTS level).
- Tim N to liaise with Yuyi, who is most likely to be interested.
- Steve noted that Hermes 2 (with Kafka integration) takes information from the database and then deletes it, so may cause issues for monitoring tools that rely on the Rucio database.
- Fabio noted that IN2P3 is deploying a dCache instance for Rubin, to replace interim instance being used for transfer tests at the moment.
- Fabio tried to test transfers from SLAC DTN hosts, but is seeing an issue with asymmetric routing.
- Pete reiterated that ESnet staff are open to discussing networking strategies with Rubin.
- Richard noted that the link from Summit to SLAC is operational and wondered if there was a risk of a similar issue (incorrectly assuming we could talk over the dedicated VPN) for that connection.
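The 'diff' of file status between two dates, mentioned above for the NCSA to SLAC transfer, could be sketched roughly as follows. This is only an illustration, not the procedure Brandon used: it assumes each snapshot is a text listing with one hypothetical `<size> <mtime> <path>` line per file, and the function names are made up for this example.

```python
from typing import Dict, Iterable, List, NamedTuple, Tuple


class FileStat(NamedTuple):
    size: int   # file size in bytes
    mtime: int  # modification time, seconds since the epoch


def parse_listing(lines: Iterable[str]) -> Dict[str, FileStat]:
    """Parse lines of the assumed form '<size> <mtime> <path>'."""
    snapshot: Dict[str, FileStat] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        # maxsplit=2 keeps any spaces inside the path intact
        size, mtime, path = line.split(maxsplit=2)
        snapshot[path] = FileStat(int(size), int(mtime))
    return snapshot


def diff_snapshots(
    old: Dict[str, FileStat], new: Dict[str, FileStat]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (added, removed, changed) paths between two snapshots."""
    added = sorted(p for p in new if p not in old)
    removed = sorted(p for p in old if p not in new)
    # A file counts as changed if its size or mtime differs
    changed = sorted(p for p in new if p in old and new[p] != old[p])
    return added, removed, changed
```

Files reported as added or changed would be candidates for a follow-up transfer. Comparing size and modification time is cheaper than checksumming but can miss in-place edits that preserve both.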
Multi-site PanDA Support and Operation
- Wen noted PanDA system deployed to SLAC and network configuration completed.
- Successfully submitted jobs from other sites to SLAC. However, because SLAC PanDA server is on private network, cannot submit jobs to ARC-CE (no allowed out-bound connections from PanDA).
- Wen is investigating options, though initial attempt to set up Squid proxy was only partially successful.
- Peter L noted it is potentially better to just connect the PanDA server to the internet, rather than trying to work around the issue (adding complexity to networking).
- K-T noted a longer-term plan to enable NAT support, which would eliminate the issue.
- George suggested it might be worth trying to accelerate the timeline for setting up NAT support.
- Wen also working with Fabio to make PanDA Environment available via CVMFS (PanDA Environment needs to be present at all PanDA sites and hope is to facilitate this via CVMFS).
- Wen noted still some ATLAS code in wrapper, which he'd like Peter L to look at.
- Peter L is on leave until end of August, but can work on this in September.
Date of next meeting
Monday August 22nd (discussions on Slack #dm-rucio-testing in interim)