Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Absent

Agenda: Data Replication

  • Status of Rucio evaluation platform at USDF and replication exercises
  • Status of Rucio production platform at USDF
  • Status of data replication exercises
  • Status of data replication monitoring tools and logging platform
  • Status of integration of Rucio and the Butler for automated ingestion
      • JIRA issue?
  • Collective writing of a technote where we collect details on what we need to replicate and when
  • JIRA tickets relevant for data replication

Notes

Data Replication

  • Yuyi is testing file replication between USDF and RAL. Recent transfers achieve around 10–15 MB/s (using the XRootd protocol), which is in line with expectations.
    • No progress on resolving the poor performance from USDF to Lancaster and from USDF to IN2P3; it looks to be an issue with drop-offs (with FTS pull).
      • Pete C noted investigations by Duncan R, which revealed no apparent issues with transfer rates from SLAC to either RAL or Lancaster.
      • Matt D noted previous issues with XRootd transfers to Lancaster, which should now be resolved, so it is worthwhile to re-test.
      • Yuyi noted IN2P3 was the first site to be tested, and that the configuration had been optimised since then, so it needs rechecking.
      • Fabio asked that transfer-rate issues be recorded in the ticket; previous experiments (conducted by Fabio and Wyn) produced reasonable results.
        • Yuyi confirmed the results were already posted to the ticket.
      • Pete suggested running traceroute to sanity-check the route that transfers are taking.
      • Fabio noted that Rucio file sizes are O(100 MB), which is smaller than the file sizes being experimented with.
        • K-T suggested raw files would be smaller; O(100 MB) relates to calibrated exposures and similar.
        • Pete noted that small-file transfers might be adversely affected by the TCP/IP settings.
      • George proposed that it would be good to have a transfer test that could easily be run in the future if, for example, we suspected transfer problems between any of the sites making up the Rucio environment.
    • Steve has not been able to progress work on Hermes messaging, but believes Kafka can support multiple clients retrieving messages from Rucio.
      • Steve noted the work is tracked in an old ticket (from the Data Backbone Design).
        • Fabio suggested adding the label 'data replication' to the ticket.
    • No progress on creating an automated/streamlined deployment of Rucio.
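George's suggestion of an easily repeatable transfer test could look something like the sketch below. This is only a local illustration: it times a plain file copy, whereas a real inter-site check would time a `gfal-copy` or `xrdcp` between endpoints; all file names here are invented.

```python
import os
import shutil
import tempfile
import time

def transfer_rate_mb_s(src_path, dest_path):
    """Copy a file and return the achieved rate in MB/s.
    In a real test the copy step would be an inter-site transfer."""
    start = time.monotonic()
    shutil.copyfile(src_path, dest_path)
    elapsed = time.monotonic() - start
    size_mb = os.path.getsize(src_path) / 1e6
    return size_mb / elapsed if elapsed > 0 else float("inf")

# Local demonstration: write a ~10 MB test file and "transfer" it.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "testfile.bin")
    dst = os.path.join(tmp, "copy.bin")
    with open(src, "wb") as f:
        f.write(os.urandom(10_000_000))
    rate = transfer_rate_mb_s(src, dst)
    print(f"{rate:.1f} MB/s")
```

Running such a check periodically between each pair of Rucio sites would give a baseline to compare against when problems are suspected.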
  • Brandon is working on transfer of files from NCSA to SLAC.
    • The bulk transfer has completed, though the file status on 27th July now needs to be 'diffed' against that on 1st August, to catch any late changes to the NCSA file systems.
    • NCSA file servers are to be switched to supervisor mode shortly after 15th August, for a couple of days, in case of last-minute issues.
    • Brandon noted that later transfers were based on tarballs of around 1 TB, which achieved much higher bandwidth than the individual-file transfers used initially.
      • Pete noted that enabling many parallel streams for single-file transfers might also achieve higher bandwidths, similar to tarring up the files.
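The 'diff' of the two file-status snapshots amounts to a set comparison. A minimal sketch, assuming each snapshot is a mapping of path to checksum (the paths and checksums below are made up):

```python
def late_changes(listing_before, listing_after):
    """Compare two {path: checksum} snapshots; return files added,
    removed, or modified between the two dates."""
    before, after = set(listing_before), set(listing_after)
    added = after - before
    removed = before - after
    modified = {p for p in before & after
                if listing_before[p] != listing_after[p]}
    return added, removed, modified

# Illustrative snapshots (paths and checksums are invented).
july_27 = {"/repo/a.fits": "c1", "/repo/b.fits": "c2", "/repo/c.fits": "c3"}
aug_01 = {"/repo/a.fits": "c1", "/repo/b.fits": "c9", "/repo/d.fits": "c4"}

added, removed, modified = late_changes(july_27, aug_01)
print(added, removed, modified)
```

Anything in the three result sets would need to be re-transferred (or deleted) before the NCSA side is frozen.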
  • Fabio has proposed a tool set up for the ESCAPE project, which he found very useful (it works at the Rucio and FTS levels).
    • Tim N to liaise with Yuyi, who is most likely to be interested.
    • Steve noted that Hermes 2 (with Kafka integration) takes information from the database and then deletes it, so it may cause issues for monitoring tools that rely on the Rucio database.
  • Fabio noted that IN2P3 is deploying a dCache instance for Rubin, to replace the interim instance currently being used for transfer tests.
    • Fabio tried to test transfers from SLAC DTN hosts, but is seeing an issue with asymmetric routing.
      • It looks as if the SLAC network is trying to talk to the dCache server at IN2P3 via LHCONE, though the dCache server is not on LHCONE.
      • Hopefully the network can be configured, at the SLAC end, to talk to the dCache server via the regular internet.
      • PREOPS-1321
    • Pete reiterated that ESNet staff are open to discuss networking strategies with Rubin.
    • Richard noted that the link from the Summit to SLAC is operational, and wondered if there was a risk of a similar issue (incorrectly assuming traffic could travel over the dedicated VPN) for that connection.
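Pete's traceroute suggestion also gives a quick way to spot the LHCONE routing asymmetry discussed above: scan the hop hostnames for an unexpected network. A minimal sketch, using canned output (all hostnames below are invented for illustration):

```python
def hops_matching(traceroute_output, needle="lhcone"):
    """Return hop lines whose hostname contains the given substring.
    Useful for spotting traffic unexpectedly routed via LHCONE."""
    return [line.strip() for line in traceroute_output.splitlines()
            if needle in line.lower()]

# Canned traceroute-style output; hostnames are invented.
sample = """\
 1  gw1.slac.stanford.edu (134.79.0.1)  0.5 ms
 2  core-router.example.net (198.51.100.7)  2.1 ms
 3  lhcone-gw.example.net (198.51.100.9)  80.3 ms
 4  dcache-door.example.fr (203.0.113.4)  140.2 ms
"""
suspect = hops_matching(sample)
print(suspect)
```

In practice one would feed in the output of `traceroute <dcache-host>` from the SLAC DTN and flag any hop on a network the destination is not attached to.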

Multi-site PanDA Support and Operation

  • Wen noted that the PanDA system has been deployed to SLAC and the network configuration completed.
    • Jobs have been successfully submitted from other sites to SLAC. However, because the SLAC PanDA server is on a private network, it cannot submit jobs to the ARC-CE (no outbound connections are allowed from PanDA).
      • Wen is investigating options, though initial attempt to set up Squid proxy was only partially successful.
      • Peter L noted it is potentially better to simply connect the PanDA server to the internet, rather than trying to work around the issue (adding complexity to the networking).
      • K-T noted a longer-term plan to enable NAT support, which would eliminate the issue.
      • George suggested it might be worth trying to accelerate the timeline for setting up NAT support.
  • Wen is also working with Fabio to make the PanDA environment available via CVMFS (the PanDA environment needs to be present at all PanDA sites, and the hope is to facilitate this via CVMFS).
    • Wen noted there is still some ATLAS code in the wrapper, which he'd like Peter L to look at.
    • Peter L is on leave until end of August, but can work on this in September.
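The outbound-connectivity question above (can the private-network PanDA server reach the ARC-CE, directly, via a Squid proxy, or via NAT?) reduces to a simple TCP reachability probe. A minimal sketch; the host and port would be the actual ARC-CE endpoint in practice:

```python
import socket

def can_reach(host, port, timeout=5.0):
    """Return True if an outbound TCP connection to host:port succeeds.
    A quick check of whether a host on a private network can reach a
    given endpoint (directly or via NAT routing)."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False
```

Run from the PanDA server against the ARC-CE's submission port, this would confirm whether a given networking workaround (proxy, NAT, direct connection) actually restores outbound access.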

Date of next meeting

Monday August 22nd (discussions on Slack #dm-rucio-testing in interim)

Action items

  •