Date
Zoom
https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09
Attendees
- Richard Dubois
- Brandon White
- Kian-Tat Lim
- Brian Yanny
- Yuyi Guo
- Steve Pietrowicz
- Wei Yang
- George Beckett
- Greg Daues
- Andy Hanushevsky
- Matt Doidge
- Wen Guan
- Peter Love
- Timothy Noble
- Lionel Schwarz
- Fabio Hernandez
- Peter Clark
- Michelle Gower
Apologies
Agenda — Data Replication
- Status of replication exercises among the 3 facilities [Yuyi Guo, Brandon White]
- the evaluation instance of FTS has been showing some fragility. Is there anything we can do to improve this situation?
- what replication exercises have we performed? What is working and what is not?
- what is preventing us from starting regular replication of the equivalent of one night's worth of raw data?
- Status of Rucio & FTS monitoring [Timothy Noble, George Beckett]
- are we ready to set up an initial version of a dashboard showing Rucio and FTS activity?
- Status of butler & Rucio integration [Steve Pietrowicz]
- Status of replication of STS data from USDF to FrDF [Wei Yang, Kian-Tat Lim, Fabio Hernandez]
Data Replication JIRA Issues
- JIRA tickets with tag "data-replication"
Notes
Data Replication (Fabio)
- Brandon noted lots of transfer experiments between SLAC and Lancaster (coordinated by Yuyi with support from Brandon)
- Trying to figure out realisable transfer speed
- Tested with two large transfers (300 GB => 4.7 Gbps; 1,700 GB => similar speed) based on a mixed set of O(10 MB) files.
- Attempted 12-hour transfer test, but took around 3 days to complete, due to issues with FTS server
- FTS server cannot cope with large numbers of transfer requests
- Wei has been helping to diagnose issue
- Rucio smart enough to cope with FTS outages
- Of 5,000 files, only one failed to transfer in three-day test.
- Yuyi compared with FTS server at Fermilab
- Looks to require 15 MB per concurrent transfer (the server currently has 8 GB of memory, of which 5–6 GB is available for transfers, implying a ceiling of roughly 330–400 concurrent transfers)
- Too many concurrent transfers causes server to crash
- FTS deployed via Docker container (plan to move to Kubernetes in due course)
- Also has limited log file space (17 GB), though a request is in to increase this to 50 GB
- Potential to constrain the number of concurrent transfers
- Tim N noted "for reference RAL runs FTS3 as well, we have 8 FTS servers with 8GB ram, 4 processors, 120 GB disc space each"
- Looks to be priority to increase FTS capacity
- RAL FTS instance will likely be very helpful for France-UK transfers due to reduced latency.
- Yuyi noted a bigger dataset would be needed to saturate the link if the FTS server specs were increased.
- Wei asked if requests were grouped
- Yuyi confirmed they were; Rucio organises files into groups of 100.
- Wei noted grouping files reduced the load on FTS
- Peter noted that the FTS link configuration for the catch-all channel has 0/0 set for minimum/maximum active transfers (see the configuration sketch after this block)
- https://fts-eval01.slac.stanford.edu:8449/fts3/ftsmon/#/config/links
- Transfers suspended until extra disk, for FTS logging, is available.
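As a point of reference for the 0/0 link settings above, a minimal sketch (untested) of how the per-link concurrency caps could be inspected and set through the FTS3 REST configuration API, in Python. The REST port, proxy path, symbolic link name, and cap values are assumptions for illustration, not the evaluation server's actual settings:

# Inspect and cap per-link concurrent transfers via the FTS3 REST API.
import requests

FTS = "https://fts-eval01.slac.stanford.edu:8446"  # assumed REST port
PROXY = "/tmp/x509up_u12345"  # hypothetical X.509 proxy for authentication

# Read back the current link configuration (includes the catch-all link).
links = requests.get(f"{FTS}/config/links", cert=PROXY, verify=False).json()
print(links)

# Cap concurrent transfers on the catch-all link; 0/0 leaves limits unset.
payload = {
    "symbolicname": "star-star",  # illustrative name for the catch-all link
    "source": "*",
    "destination": "*",
    "min_active": 2,
    "max_active": 50,  # example cap, to be tuned against server memory
}
requests.post(f"{FTS}/config/links", json=payload, cert=PROXY, verify=False)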
- In parallel, Peter is testing DP0.2 transfer from France to UK
- So far, a small amount has been transferred; waiting on a go-ahead from Wei to resume.
- Also transferring subset of data from SLAC, using rsync, though this is too slow
- Clear that Rucio would help with managing this.
- Greg noted a potential further issue with large file counts in submissions, based on the FTS documentation (see the submission sketch after the quote below)
- This is being investigated
- In Slack, Greg noted: "Some commented lines in FTS config template:
## Parameters for QoS daemon - BringOnline operation
# Maximum bulk size
# If the size is too large, it will take more resources (memory and CPU) to generate the requests
# and parse the responses. Some servers may reject the requests if they are too big.
# If it is too small, performance will be reduced.
# Keep it to a sensible size (between 100 and 1k)
# StagingBulkSize=200
Suggests there could be server issues with large submits."
- Tim noted progress with Rucio dashboard
- https://grafana.slac.stanford.edu/d/000000003/rucio-overview?orgId=1&refresh=1m
- Reporting transfer success (from viewpoint of Rucio), with timeline
- Rule state is slightly opaque; extracting more info would require custom scripts
- Transmogrifier daemon
- Reaper (deletions)
- K-T noted high action count.
- Data sourced from Rucio logging, via Prometheus
- Link in Grafana folder – see 'Rucio Overview'
- Requires FTS monitoring to progress further.
- FTS logging is still being set up
- Also, need a queue server (e.g., ActiveMQ) or a direct route into Prometheus, potentially to get details from FTS
- Steve P noted the potential use of Hermes/ActiveMQ, which includes FTS logging (see the consumer sketch below).
- Tim noted lots more information in both Hermes and FTS (compared to Rucio)
- Includes filtering by endpoint, for example
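On the queue-server option, a sketch of what consuming FTS monitoring events from an ActiveMQ broker (as fed by Hermes) might look like, using the stomp.py library. The broker host/port, credentials, and topic name are placeholders and would need to be confirmed against the actual Hermes configuration:

# Consume FTS monitoring messages from ActiveMQ for the dashboard.
import time
import stomp

class FtsListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Each frame body is a JSON monitoring record; just print it here.
        print(frame.body)

conn = stomp.Connection([("mq.example.org", 61613)])  # hypothetical broker
conn.set_listener("fts", FtsListener())
conn.connect("user", "password", wait=True)
conn.subscribe(destination="/topic/transfer.fts_monitoring_complete",
               id="fts-dashboard", ack="auto")
time.sleep(600)  # keep the connection open while messages arrive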
- Butler-Rucio integration
- Steve is ready to set up Kafka endpoints in the UK and France (a TLS connection sketch follows this block).
- Assume would be deployed within Kubernetes at each DF
- Would be handy to use Phalanx to streamline this, as it includes CertBot for auto-renewal of certificates
- K-T noted Phalanx is potentially overkill if we don't need it.
- Peter noted manual deployment of Kafka at Lancaster already
- K-T noted three levels of Kafka deployment
- Baremetal
- Kubernetes
- Sasquatch-based
- George suggested trying the existing Phalanx instance (in Edinburgh), plus the option to build on the Lasair team's MirrorMaker 2 experience
- Fabio asked if certificates were necessary.
- K-T noted was for TLS (HTTPS security), based on LetsEncrypt.
- Fabio noted France has stopped using LetsEncrypt.
- IN2P3 also using Phalanx for RSP, though is running on isolated Kubernetes cluster, serving RSP and Qserv only.
- Potential site-specific concerns for LetsEncrypt
- Pete proposed that each site look after its own infrastructure
- K-T noted that Rubin would provide configurations that would be deployable via Kubernetes
- K-T noted differing views: infrastructure-provided service versus an integral part of the service deployment
- Steve noted Greg has joined team to help with RSE deployment
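For the Kafka endpoints discussed above, a sketch of what a DF-side client connection over TLS might look like, using kafka-python. The hostnames, topic, and CA path are placeholders; the server certificate is what CertBot/LetsEncrypt (or a site CA) would keep renewed:

# Connect to a hypothetical DF Kafka endpoint over TLS and print messages.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "butler-rucio-events",                    # hypothetical topic name
    bootstrap_servers="kafka.ukdf.example:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/pki/tls/certs/ca-bundle.crt",
)
for msg in consumer:
    print(msg.topic, msg.offset, msg.value)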
- Replication of STS data from USDF to France (Fabio)
- Wei noted in process of registering data to Rucio (on US side)
- Fabio noted a way to detect when files are replicated into the SE in France has been configured.
- Using Kafka server to broadcast event from dCache
- Next step is to work out how to trigger actions in reaction to events (e.g., when a file arrives, how do you take an action based on that event? see the sketch below)
- K-T noted potential to reuse the setup from the Summit
- Wei noted the STS setup (sites in the US and France) will not use a hash function
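On triggering actions from arrival events, a sketch of the reaction step: consume dCache events from Kafka and call a follow-up action when a file lands. The topic name and JSON fields are illustrative; the actual dCache event schema would need checking:

# React to dCache file-arrival events published on Kafka.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer("dcache-events",  # hypothetical topic
                         bootstrap_servers="kafka.in2p3.example:9092")

def on_file_arrived(path):
    # Placeholder action, e.g. register the replica or notify the Butler.
    print(f"new file replicated: {path}")

for msg in consumer:
    event = json.loads(msg.value)
    if event.get("msgType") == "transfer":  # illustrative event-type check
        on_file_arrived(event.get("billingPath", "<unknown>"))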
Multi-site PanDA Support and Operation (Wei)
- Focus of activity has been on transfer from France to UK of DC2 data, followed by registration of data in a UK Butler.
- Pete confirmed this is on his to-do list, once the data is transferred.
Date of next meeting
Monday April 3rd, 8 am PT
Discussions on Slack #dm-rucio-testing in interim