Date
Zoom
https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09
Attendees
- Richard Dubois
- Brandon White
- Kian-Tat Lim
- Brian Yanny
- Yuyi Guo
- Steve Pietrowicz
- Wei Yang (chair)
- George Beckett
- Greg Daues
- Andy Hanushevsky
- Matt Doidge
- Wen Guan
- Peter Love
- Timothy Noble
- Lionel Schwarz
- Peter Clark
- Michelle Gower
Apologies
Agenda — Data Replication
- Status of replication exercises among the 3 facilities [Yuyi Guo, Brandon White]
- update on the status of the FTS instances (Rubin's and UKDF's)
- what is preventing us from starting regular replication of the equivalent of one night's worth of raw data, first using FTS only, then Rucio?
- Status of Rucio & FTS monitoring [Timothy Noble, George Beckett]
- update on the improvements to the Rucio and FTS monitoring platform
- Status of butler & Rucio integration [Steve Pietrowicz]
- deployment progress at the DFs of the consumers of Rucio events. Relevant JIRA issues:
- proposal for naming the RSEs: PREOPS-3441 (please provide your inputs there)
- Status of replication of STS data from USDF to FrDF [Wei Yang, Kian-Tat Lim, Fabio Hernandez]
Data Replication JIRA Issues
- JIRA tickets with tag "data-replication"
Notes
Data Replication (Fabio)
- Replication tests
- perfSONAR instances have been set up at SLAC on two different DTN nodes (one with MTU 9000 and one with MTU 1500)
- Fabio will progress network testing, looking at performance and stability.
- Recent measurements have shown transfer rates as low as 1 Mbps
- Rucio transfer tests are suspended while Yuyi debugs file subscriptions.
- Files appear to be moving, but no replication rules are being created (see the sketch after this list).
- Testing is at a trivial scale initially.
- Investigation of the issue with FTS defaulting to one file per transfer has not progressed.
- The team continues to use the RAL-based FTS instance.
- The SLAC FTS installation is progressing: the database is up and accessible from the application. Brandon is working on the ingress/egress setup and debugging some FTS container problems (some expected processes do not start).
- Greg D noted the RAL FTS is not accessible: this may mean switching back to the FTS evaluation instance (perhaps with reduced logging verbosity, to avoid running out of space).
- Tim noted the FTS is being upgraded in the week beginning 15th May, which will hopefully resolve the issues.
- Wei asked whether the service uses IPv4 or IPv6. Tim confirmed both.
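For reference, a minimal sketch of how the missing rules could be checked with the Rucio Python client, assuming a configured client environment; the scope and dataset name are hypothetical placeholders, not actual Rubin DIDs.

```python
# Check whether a subscription attached replication rules to a dataset.
# A working Rucio client configuration (rucio.cfg) is assumed.
from rucio.client import Client

client = Client()

scope, name = "raw", "night_20230515"  # hypothetical DID

rules = list(client.list_did_rules(scope, name))
if not rules:
    print(f"No rules on {scope}:{name} -- did the subscription fire?")
for rule in rules:
    print(rule["id"], rule["rse_expression"], rule["state"])
```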
- Monitoring
- An MQ server is needed for FTS monitoring at SLAC (assuming CERN Monit will not be used), plus Elasticsearch and Logstash to feed into Grafana (see the sketch after this list)
- Tim has a couple of scripts from the Edinburgh monitoring system that could be added to the FTS monitoring at SLAC
- Tim asked about options for deploying services at SLAC
- Wei suggested they would need to be containerised for Kubernetes.
- Brandon suggested setting up a separate cluster for monitoring.
- Wei suggested Tim talk to Yee about this
- Wei asked about the use of CERN Monit vs. a local MQ server
- Tim noted many FTS instances use CERN Monit
- Richard noted that Yee may be planning to deploy Monit at SLAC
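As a rough illustration of the proposed chain, a hedged sketch of the MQ-to-Elasticsearch leg, assuming the stomp.py 8.x and elasticsearch 8.x packages; the broker host, credentials, topic name, and index name are all hypothetical.

```python
import json
import time

import stomp
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

class FtsListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Index each FTS completion message as one document; Grafana
        # dashboards can then be built on top of the index.
        es.index(index="fts-transfers", document=json.loads(frame.body))

conn = stomp.Connection([("mq.example.org", 61613)])  # hypothetical broker
conn.set_listener("fts", FtsListener())
conn.connect("user", "password", wait=True)           # hypothetical credentials
conn.subscribe("/topic/transfer.fts_monitoring_complete", id="1", ack="auto")

while True:  # keep the consumer alive
    time.sleep(1)
```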
- Rucio-Butler integration
- Steve has posted a document describing the implementation of the integration service – see the Jira ticket on the USDF topic *** LINK***
- Recent work has focused on testing the integration: Steve hopes to have GitHub Actions for creating the ingest daemon, to ease deployment.
- Richard asked if it would be ready in time for the multi-site processing attempt (in three weeks' time)
- Steve confirmed that was his plan.
- DF deployment of services is happening in parallel (tracked in tickets).
- Peter L offered to help with the setup of services, if required.
- Steve is aiming to capture detailed requirements in a Jira ticket
- Peter L noted UKDF services were set up (in Docker) and ready for testing: next step is to deploy integration service.
- Steve noted it is timely to contribute to the naming convention for RSEs
- Fabio has proposed a convention in Jira
- Wei noted the names contain many underscores, especially for the UKDF SEs; they could potentially be truncated
- The name is intended to provide a human-readable identifier to help with debugging
- Rucio can use RSE attributes to match properties, so it does not need to rely on naming (see the sketch after this list).
- Steve noted some further work may be required as not all RSEs will be backed by a Butler.
- K-T noted he had posted some questions on the Jira ticket for Fabio's consideration
- Steve asked whether RSE names are used throughout the database
- Brandon confirmed the schema is properly normalised, so each name is associated with an internal id
- K-T noted that RSE 'expressions' use names, not ids, so they would need to be updated if names changed
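A minimal sketch of the attribute-based matching mentioned above, assuming a configured Rucio client with the necessary privileges; the RSE name, attribute key/value, and DID are hypothetical.

```python
from rucio.client import Client

client = Client()

# Tag an RSE with an attribute once...
client.add_rse_attribute(rse="UKDF_DISK", key="site", value="UKDF")  # hypothetical RSE

# ...then rules can select on attributes rather than exact (long) RSE names.
client.add_replication_rule(
    dids=[{"scope": "raw", "name": "night_20230515"}],  # hypothetical DID
    copies=1,
    rse_expression="site=UKDF",
)
```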
- STS data transfer from USDF to FrDF
- Currently impacted by FTS issue at RAL
- K-T offered to register a file in an RSE endpoint to enable download testing.
- Wei noted the option of having more than one FTS server, in case of problems with the primary FTS server (see the sketch below).
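A hedged sketch of what falling back to a secondary FTS server could look like with the fts3 REST Python bindings; the endpoints and file URLs are hypothetical placeholders. Note that new_job() takes a list, so several files can be grouped into one transfer job (relevant to the one-file-per-transfer issue above).

```python
import fts3.rest.client.easy as fts3

ENDPOINTS = [
    "https://fts.usdf.example.org:8446",  # primary (hypothetical)
    "https://fts.ukdf.example.org:8446",  # fallback (hypothetical)
]

# new_job() accepts a list of transfers, so several files can share one job.
job = fts3.new_job([
    fts3.new_transfer("root://usdf.example.org//sts/file.fits",
                      "root://frdf.example.org//sts/file.fits"),
])

for endpoint in ENDPOINTS:
    try:
        context = fts3.Context(endpoint)
        print("Submitted job:", fts3.submit(context, job))
        break
    except Exception as exc:  # e.g. server unreachable
        print(f"{endpoint} failed ({exc}); trying next endpoint")
```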
Multi-site PanDA Support and Operation (Wei)
- Wen is testing job submission at all three DFs
- There looks to be heavy demand for resources (e.g., at the UKDF) from another processing campaign
- Wen noted the PanDA development system at USDF is broken – possibly since the load balancer was switched to a new server (meaning the certificate is broken)
- A new certificate is needed for both the development and production PanDA systems
- Work is underway to boost the memory available to PanDA
- Instances are being seen where virtual memory is allocated but not backed by physical memory, leading to cluster jobs being killed (see the sketch below).
- This affects both the UKDF and FrDF sites, which kill jobs based on a virtual-memory requirement; the same issue is not seen at USDF.
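To illustrate the distinction (a Linux-only sketch, not Rubin code): address space (VmSize) can far exceed resident memory (VmRSS), so a scheduler enforcing virtual-memory limits can kill a job whose physical footprint is modest.

```python
import mmap

# Reserve 8 GiB of address space without touching it: VmSize grows by
# ~8 GiB while VmRSS barely moves, yet a virtual-memory limit would
# treat this process as an 8 GiB consumer.
region = mmap.mmap(-1, 8 * 1024**3)

with open("/proc/self/status") as status:
    for line in status:
        if line.startswith(("VmSize:", "VmRSS:")):
            print(line.strip())
```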
- Deployment of production PanDA needs a host certificate and an allocation of S3 storage (for backups) to progress.
- Also, jobs use lscratch, but there is not enough disk space as testing scales up, potentially conflicting with other users of the node.
- Lscratch on the Milan nodes looks to be comparable to lscratch on the older 'Rome' nodes
- Richard flagged that the node specification is for 950 GB (whereas it looks to be around 300 GB).
- There is potential to use scratch instead of lscratch, though there may be a performance penalty
- Alternatively, purging of lscratch could be made more aggressive.
- Richard believes scratch content is deleted when a user's last job ends
- Wei noted this may be challenging in practice, with multiple jobs per user.
- Alternatively, lscratch could be defined as a resource to be specified when a job is submitted.
- Wei suggested including basic checks of node capabilities as part of job startup (see the sketch below).
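A hedged sketch of what such a startup check might look like; the mount point and thresholds are hypothetical, not agreed values.

```python
import shutil
import sys

LSCRATCH = "/lscratch"   # hypothetical mount point
MIN_DISK_GB = 100        # hypothetical per-job requirements
MIN_MEM_GB = 16

# Free space on the local scratch filesystem.
free_gb = shutil.disk_usage(LSCRATCH).free / 1024**3

# Available physical memory, from /proc/meminfo (Linux only).
with open("/proc/meminfo") as meminfo:
    mem_kb = int(next(line for line in meminfo
                      if line.startswith("MemAvailable:")).split()[1])
mem_gb = mem_kb / 1024**2

if free_gb < MIN_DISK_GB or mem_gb < MIN_MEM_GB:
    sys.exit(f"Node unsuitable: {free_gb:.0f} GB lscratch, {mem_gb:.0f} GB memory")
print("Node capability check passed")
```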
Date of next meeting
Monday May 29th, 8am PT
Discussions to continue on Slack (#dm-rucio-testing) in the interim