Date
Zoom
https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09
Attendees
- Richard Dubois
- Brandon White
- Kian-Tat Lim
- Brian Yanny
- Yuyi Guo
- Steve Pietrowicz
- Wei Yang (chair)
- George Beckett
- Greg Daues
- Andy Hanushevsky
- Matt Doidge
- Wen Guan
- Peter Love
- Timothy Noble
- Lionel Schwarz
- Peter Clark
- Michelle Gower
Apologies
Agenda — Data Replication
- Status of replication exercises among the 3 facilities [Yuyi Guo, Brandon White]
- update on the status of the FTS instances (Rubin's and UKDF's)
- what is preventing us from starting regular replication of the equivalent of one night's worth of raw data, first using FTS only, then Rucio?
- Status of Rucio & FTS monitoring [Timothy Noble, George Beckett]
- update on the improvements to the Rucio and FTS monitoring platform
- Status of butler & Rucio integration [Steve Pietrowicz]
- deployment progress at the DFs of the consumers of Rucio events. Relevant JIRA issues:
- proposal for naming the RSEs: PREOPS-3441 (please provide your inputs there)
- Status of replication of STS data from USDF to FrDF [Wei Yang, Kian-Tat Lim, Fabio Hernandez]
Data Replication JIRA Issues
- JIRA tickets with tag "data-replication"
Notes
Data Replication (Fabio)
- Replication tests
- perfSONAR instances have been set up at SLAC on two different DTN nodes (one with MTU 9000 and one with MTU 1500)
- Fabio will progress network testing, looking at performance and stability.
- Recent measurements have shown transfer rates as low as 1 Mbps
- Rucio transfer tests are suspended while Yuyi debugs file subscriptions.
- Files appear to be moving, but no replication rules are being created (see the sketch after this list).
- Testing is at a trivial scale initially.
- Investigation of the issue with FTS defaulting to one file per transfer has not progressed.
- The team continues to use the RAL-based FTS instance.
- The SLAC FTS installation is progressing: the database is up and accessible from the application. Brandon is working on the ingress/egress setup and debugging some FTS container problems (some expected processes do not start).
- Greg D noted the RAL FTS is not accessible: this may mean switching back to the FTS evaluation instance (perhaps with reduced logging verbosity, to avoid running out of space).
- Tim noted the FTS is being upgraded in the week beginning 15th May, which will hopefully resolve the issues.
- Wei asked whether the service uses IPv4 or IPv6. Tim confirmed both.
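For reference, a minimal sketch of how the missing rules could be checked with the Rucio Python client, assuming a configured client environment; the scope and dataset name are hypothetical placeholders, not actual Rubin DIDs.

```python
# Check whether a subscription attached replication rules to a dataset.
# A working Rucio client configuration (rucio.cfg) is assumed.
from rucio.client import Client

client = Client()

scope, name = "raw", "night_20230515"  # hypothetical DID

rules = list(client.list_did_rules(scope, name))
if not rules:
    print(f"No rules on {scope}:{name} -- did the subscription fire?")
for rule in rules:
    print(rule["id"], rule["rse_expression"], rule["state"])
```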
- Monitoring
- An MQ server is needed for FTS monitoring at SLAC (assuming CERN Monit will not be used), plus Elasticsearch and Logstash to feed into Grafana (see the sketch after this list)
- Tim has a couple of scripts from the Edinburgh monitoring system that could be added to the FTS monitoring at SLAC
- Tim asked about options for deploying services at SLAC
- Wei suggested they would need to be containerised for Kubernetes.
- Brandon suggested setting up a separate cluster for monitoring.
- Wei suggested Tim talk to Yee about this
- Wei asked about the use of CERN Monit vs. a local MQ server
- Tim noted many FTS instances use CERN Monit
- Richard noted that Yee may be planning to deploy Monit at SLAC
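As a rough illustration of the proposed chain, a hedged sketch of the MQ-to-Elasticsearch leg, assuming the stomp.py 8.x and elasticsearch 8.x packages; the broker host, credentials, topic name, and index name are all hypothetical.

```python
import json
import time

import stomp
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")  # hypothetical endpoint

class FtsListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Index each FTS completion message as one document; Grafana
        # dashboards can then be built on top of the index.
        es.index(index="fts-transfers", document=json.loads(frame.body))

conn = stomp.Connection([("mq.example.org", 61613)])  # hypothetical broker
conn.set_listener("fts", FtsListener())
conn.connect("user", "password", wait=True)           # hypothetical credentials
conn.subscribe("/topic/transfer.fts_monitoring_complete", id="1", ack="auto")

while True:  # keep the consumer alive
    time.sleep(1)
```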
- Rucio-Butler integration
- Steve has posted a document describing the implementation of the integration service – see the Jira ticket on the USDF topic *** LINK***
- Recent work has focused on testing the integration: Steve hopes to have GitHub Actions for creating the ingest daemon, to ease deployment.
- Richard asked if it would be ready in time for the multi-site processing attempt (in three weeks' time)
- Steve confirmed that was his plan.
- DF deployment of services is happening in parallel (tracked in tickets).
- Peter L offered to help with the setup of services, if required.
- Steve is aiming to capture detailed requirements in a Jira ticket
- Peter L noted UKDF services were set up (in Docker) and ready for testing: next step is to deploy integration service.
- Steve noted it is timely to contribute to the naming convention for RSEs
- Fabio has proposed a convention in Jira
- Wei noted the names contain many underscores, especially for the UKDF SEs; they could potentially be truncated
- The name is intended to provide a human-readable identifier to help with debugging
- Rucio can use RSE attributes to match properties, so it does not need to rely on naming (see the sketch after this list).
- Steve noted some further work may be required as not all RSEs will be backed by a Butler.
- K-T noted he had posted some questions on the Jira ticket for Fabio's consideration
- Steve asked whether RSE names are used throughout the database
- Brandon confirmed the schema is properly normalised, so each name is associated with an internal id
- K-T noted that RSE 'expressions' use names, not ids, so they would need to be updated if names changed
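A minimal sketch of the attribute-based matching mentioned above, assuming a configured Rucio client with the necessary privileges; the RSE name, attribute key/value, and DID are hypothetical.

```python
from rucio.client import Client

client = Client()

# Tag an RSE with an attribute once...
client.add_rse_attribute(rse="UKDF_DISK", key="site", value="UKDF")  # hypothetical RSE

# ...then rules can select on attributes rather than exact (long) RSE names.
client.add_replication_rule(
    dids=[{"scope": "raw", "name": "night_20230515"}],  # hypothetical DID
    copies=1,
    rse_expression="site=UKDF",
)
```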
- STS data transfer from USDF to FrDF
- Currently impacted by FTS issue at RAL
- K-T offered to register a file in an RSE endpoint to enable download testing.
- Wei noted the option of having more than one FTS server, in case of problems with the primary FTS server (see the sketch below).
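A hedged sketch of what falling back to a secondary FTS server could look like with the fts3 REST Python bindings; the endpoints and file URLs are hypothetical placeholders. Note that new_job() takes a list, so several files can be grouped into one transfer job (relevant to the one-file-per-transfer issue above).

```python
import fts3.rest.client.easy as fts3

ENDPOINTS = [
    "https://fts.usdf.example.org:8446",  # primary (hypothetical)
    "https://fts.ukdf.example.org:8446",  # fallback (hypothetical)
]

# new_job() accepts a list of transfers, so several files can share one job.
job = fts3.new_job([
    fts3.new_transfer("root://usdf.example.org//sts/file.fits",
                      "root://frdf.example.org//sts/file.fits"),
])

for endpoint in ENDPOINTS:
    try:
        context = fts3.Context(endpoint)
        print("Submitted job:", fts3.submit(context, job))
        break
    except Exception as exc:  # e.g. server unreachable
        print(f"{endpoint} failed ({exc}); trying next endpoint")
```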
Multi-site PanDA Support and Operation (Wei)
- Wen is testing job submission at all three DFs
- There looks to be heavy demand for resources (e.g., at the UKDF) from another processing campaign
- Wen noted the PanDA development system at USDF is broken – possibly since the load balancer was switched to a new server (meaning the certificate is broken)
- A new certificate is needed for both the development and production PanDA systems
- Work is underway to boost the memory available to PanDA
- Instances are being seen where virtual memory is allocated but not backed by physical memory, leading to cluster jobs being killed (see the sketch below).
- This affects both the UKDF and FrDF sites, which kill jobs based on a virtual-memory requirement; the same issue is not seen at USDF.
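To illustrate the distinction (a Linux-only sketch, not Rubin code): address space (VmSize) can far exceed resident memory (VmRSS), so a scheduler enforcing virtual-memory limits can kill a job whose physical footprint is modest.

```python
import mmap

# Reserve 8 GiB of address space without touching it: VmSize grows by
# ~8 GiB while VmRSS barely moves, yet a virtual-memory limit would
# treat this process as an 8 GiB consumer.
region = mmap.mmap(-1, 8 * 1024**3)

with open("/proc/self/status") as status:
    for line in status:
        if line.startswith(("VmSize:", "VmRSS:")):
            print(line.strip())
```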
- Deployment of production PanDA needs a host certificate and an allocation of S3 storage (for backups) to progress.
- Also, jobs use lscratch, but there is not enough disk space as testing scales up, potentially conflicting with other users of the node.
- Lscratch on the Milan nodes looks to be comparable to lscratch on the older 'Rome' nodes
- Richard flagged that the node specification is for 950 GB (whereas it looks to be around 300 GB).
- There is potential to use scratch instead of lscratch, though there may be a performance penalty
- Alternatively, purging of lscratch could be made more aggressive.
- Richard believes scratch content is deleted when a user's last job ends
- Wei noted this may be challenging in practice, with multiple jobs per user.
- Alternatively, lscratch could be defined as a resource to be specified when a job is submitted.
- Wei suggested including basic checks of node capabilities as part of job startup (see the sketch below).
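A hedged sketch of what such a startup check might look like; the mount point and thresholds are hypothetical, not agreed values.

```python
import shutil
import sys

LSCRATCH = "/lscratch"   # hypothetical mount point
MIN_DISK_GB = 100        # hypothetical per-job requirements
MIN_MEM_GB = 16

# Free space on the local scratch filesystem.
free_gb = shutil.disk_usage(LSCRATCH).free / 1024**3

# Available physical memory, from /proc/meminfo (Linux only).
with open("/proc/meminfo") as meminfo:
    mem_kb = int(next(line for line in meminfo
                      if line.startswith("MemAvailable:")).split()[1])
mem_gb = mem_kb / 1024**2

if free_gb < MIN_DISK_GB or mem_gb < MIN_MEM_GB:
    sys.exit(f"Node unsuitable: {free_gb:.0f} GB lscratch, {mem_gb:.0f} GB memory")
print("Node capability check passed")
```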
Date of next meeting
Monday May 29th, 8am PT
Discussions to continue on Slack (#dm-rucio-testing) in the interim