Date
Zoom
https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09
Attendees
- Richard Dubois
- Brandon White
- Kian-Tat Lim
- Brian Yanny
- Yuyi Guo
- Steve Pietrowicz
- Wei Yang
- George Beckett
- Greg Daues
- Andy Hanushevsky
- Matt Doidge
- Wen Guan
- Peter Love
- Timothy Noble
- Lionel Schwarz
- Fabio Hernandez
- Peter Clark
- Michelle Gower
Apologies
Agenda — Data Replication
- Status of replication exercises among the 3 facilities [Yuyi Guo, Brandon White]
- the evaluation instance of FTS has been showing some fragility. Is there anything we can do to improve this situation?
- what replication exercises have we performed? What is working and what is not?
- what is preventing us from starting regular replication of the equivalent of one night's worth of raw data?
- Status of Rucio & FTS monitoring [Timothy Noble, George Beckett]
- are we ready to set up an initial version of a dashboard showing Rucio and FTS activity?
- Status of butler & Rucio integration [Steve Pietrowicz]
- Status of replication of STS data from USDF to FrDF [Wei Yang, Kian-Tat Lim, Fabio Hernandez]
Data Replication JIRA Issues
- JIRA tickets with tag "data-replication"
Notes
Data Replication (Fabio)
- Brandon noted lots of transfer experiments between SLAC and Lancaster (coordinated by Yuyi with support from Brandon)
- Trying to figure out realisable transfer speed
- Tested with two large transfers (300 GB => 4.7 Gbps; 1,700 GB => similar speed) based on a mixed set of O(10 MB) files.
- Attempted 12-hour transfer test, but took around 3 days to complete, due to issues with FTS server
- FTS server cannot cope with large numbers of transfer requests
- Wei has been helping to diagnose issue
- Rucio smart enough to cope with FTS outages
- Of 5,000 files, only one failed to transfer in three-day test.
- Yuyi compared with FTS server at Fermilab
- Looks to require 15 MB per concurrent transfer (the server currently has 8 GB of memory, of which 5–6 GB is available for transfers, implying a ceiling of roughly 330–400 concurrent transfers)
- Too many concurrent transfers causes server to crash
- FTS deployed via Docker container (plan to move to Kubernetes in due course)
- Also has limited log file space (17 GB), though a request is in to increase this to 50 GB
- Potential to constrain the number of concurrent transfers
- Tim N noted "for reference RAL runs FTS3 as well, we have 8 FTS servers with 8GB ram, 4 processors, 120 GB disc space each"
- Looks to be priority to increase FTS capacity
- RAL FTS instance will likely be very helpful for France-UK transfers due to reduced latency.
- Yuyi noted a bigger dataset would be needed to saturate the link if the FTS server specs were increased.
- Wei asked if requests were grouped
- Yuyi confirmed they were; Rucio organises files into groups of 100.
- Wei noted grouping files reduced the load on FTS
- Peter noted that the FTS link configuration for the catch-all channel has 0/0 set for minimum/maximum active transfers (see the configuration sketch after this block)
- https://fts-eval01.slac.stanford.edu:8449/fts3/ftsmon/#/config/links
- Transfers suspended until extra disk, for FTS logging, is available.
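As a point of reference for the 0/0 link settings above, a minimal sketch (untested) of how the per-link concurrency caps could be inspected and set through the FTS3 REST configuration API, in Python. The REST port, proxy path, symbolic link name, and cap values are assumptions for illustration, not the evaluation server's actual settings:

# Inspect and cap per-link concurrent transfers via the FTS3 REST API.
import requests

FTS = "https://fts-eval01.slac.stanford.edu:8446"  # assumed REST port
PROXY = "/tmp/x509up_u12345"  # hypothetical X.509 proxy for authentication

# Read back the current link configuration (includes the catch-all link).
links = requests.get(f"{FTS}/config/links", cert=PROXY, verify=False).json()
print(links)

# Cap concurrent transfers on the catch-all link; 0/0 leaves limits unset.
payload = {
    "symbolicname": "star-star",  # illustrative name for the catch-all link
    "source": "*",
    "destination": "*",
    "min_active": 2,
    "max_active": 50,  # example cap, to be tuned against server memory
}
requests.post(f"{FTS}/config/links", json=payload, cert=PROXY, verify=False)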
- In parallel, Peter is testing DP0.2 transfer from France to UK
- So far, a small amount has been transferred; waiting on a go-ahead from Wei to resume.
- Also transferring subset of data from SLAC, using rsync, though this is too slow
- Clear that Rucio would help with managing this.
- Greg noted a potential further issue with large file counts in submissions, based on the FTS documentation (see the submission sketch after the quote below)
- This is being investigated
- In Slack, Greg noted: "Some commented lines in FTS config template:
## Parameters for QoS daemon - BringOnline operation
# Maximum bulk size
# If the size is too large, it will take more resources (memory and CPU) to generate the requests
# and parse the responses. Some servers may reject the requests if they are too big.
# If it is too small, performance will be reduced.
# Keep it to a sensible size (between 100 and 1k)
# StagingBulkSize=200
Suggests there could be server issues with large submits."
- Tim noted progress with Rucio dashboard
- https://grafana.slac.stanford.edu/d/000000003/rucio-overview?orgId=1&refresh=1m
- Reporting transfer success (from viewpoint of Rucio), with timeline
- Rule state is slightly opaque; extracting more info would require custom scripts
- Transmogrifier daemon
- Reaper (deletions)
- K-T noted high action count.
- Data sourced from Rucio logging, via Prometheus
- Link in Grafana folder – see 'Rucio Overview'
- Requires FTS monitoring to progress further.
- FTS logging is still being set up
- Also, need a queue server (e.g., ActiveMQ) or a direct route into Prometheus, potentially to get details from FTS
- Steve P noted the potential use of Hermes/ActiveMQ, which includes FTS logging (see the consumer sketch below).
- Tim noted lots more information in both Hermes and FTS (compared to Rucio)
- Includes filtering by endpoint, for example
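On the queue-server option, a sketch of what consuming FTS monitoring events from an ActiveMQ broker (as fed by Hermes) might look like, using the stomp.py library. The broker host/port, credentials, and topic name are placeholders and would need to be confirmed against the actual Hermes configuration:

# Consume FTS monitoring messages from ActiveMQ for the dashboard.
import time
import stomp

class FtsListener(stomp.ConnectionListener):
    def on_message(self, frame):
        # Each frame body is a JSON monitoring record; just print it here.
        print(frame.body)

conn = stomp.Connection([("mq.example.org", 61613)])  # hypothetical broker
conn.set_listener("fts", FtsListener())
conn.connect("user", "password", wait=True)
conn.subscribe(destination="/topic/transfer.fts_monitoring_complete",
               id="fts-dashboard", ack="auto")
time.sleep(600)  # keep the connection open while messages arrive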
- Butler-Rucio integration
- Steve is ready to set up Kafka endpoints in the UK and France (a TLS connection sketch follows this block).
- Assume would be deployed within Kubernetes at each DF
- Would be handy to use Phalanx to streamline this, as it includes CertBot for auto-renewal of certificates
- K-T noted Phalanx is potentially overkill if we don't need it.
- Peter noted manual deployment of Kafka at Lancaster already
- K-T noted three levels of Kafka deployment
- Baremetal
- Kubernetes
- Sasquatch-based
- George suggested trying the existing Phalanx instance (in Edinburgh), plus the option to build on the Lasair team's MirrorMaker 2 experience
- Fabio asked if certificates were necessary.
- K-T noted was for TLS (HTTPS security), based on LetsEncrypt.
- Fabio noted France has stopped using LetsEncrypt.
- IN2P3 also using Phalanx for RSP, though is running on isolated Kubernetes cluster, serving RSP and Qserv only.
- Potential site-specific concerns for LetsEncrypt
- Pete proposed that each site look after its own infrastructure
- K-T noted that Rubin would provide configurations that would be deployable via Kubernetes
- K-T noted differing views: infrastructure-provided service versus an integral part of the service deployment
- Steve noted Greg has joined team to help with RSE deployment
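For the Kafka endpoints discussed above, a sketch of what a DF-side client connection over TLS might look like, using kafka-python. The hostnames, topic, and CA path are placeholders; the server certificate is what CertBot/LetsEncrypt (or a site CA) would keep renewed:

# Connect to a hypothetical DF Kafka endpoint over TLS and print messages.
from kafka import KafkaConsumer

consumer = KafkaConsumer(
    "butler-rucio-events",                    # hypothetical topic name
    bootstrap_servers="kafka.ukdf.example:9093",
    security_protocol="SSL",
    ssl_cafile="/etc/pki/tls/certs/ca-bundle.crt",
)
for msg in consumer:
    print(msg.topic, msg.offset, msg.value)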
- Replication of STS data from USDF to France (Fabio)
- Wei noted in process of registering data to Rucio (on US side)
- Fabio noted a way to detect when files are replicated into the SE in France has been configured.
- Using Kafka server to broadcast event from dCache
- Next step is to work out how to trigger actions in reaction to events (e.g., when a file arrives, how do you take an action based on that event? see the sketch below)
- K-T noted potential to reuse the setup from the Summit
- Wei noted the STS setup (sites in the US and France) will not use a hash function
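On triggering actions from arrival events, a sketch of the reaction step: consume dCache events from Kafka and call a follow-up action when a file lands. The topic name and JSON fields are illustrative; the actual dCache event schema would need checking:

# React to dCache file-arrival events published on Kafka.
import json
from kafka import KafkaConsumer

consumer = KafkaConsumer("dcache-events",  # hypothetical topic
                         bootstrap_servers="kafka.in2p3.example:9092")

def on_file_arrived(path):
    # Placeholder action, e.g. register the replica or notify the Butler.
    print(f"new file replicated: {path}")

for msg in consumer:
    event = json.loads(msg.value)
    if event.get("msgType") == "transfer":  # illustrative event-type check
        on_file_arrived(event.get("billingPath", "<unknown>"))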
Multi-site PanDA Support and Operation (Wei)
- Focus of activity has been on transfer from France to UK of DC2 data, followed by registration of data in a UK Butler.
- Pete confirmed this is on his to-do list, once the data is transferred.
Date of next meeting
Monday April 3rd, 8 am PT
Discussions on Slack #dm-rucio-testing in interim