Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Apologies

Agenda — Data Replication

  • Status of replication exercises among the 3 facilities [ Yuyi Guo, Brandon White ]
    • update on the status of the FTS instances (Rubin's and UKDF's)
    • what is preventing us from starting regular replication of the equivalent of one night's worth of raw data, first using FTS only, then Rucio?
  • Status of Rucio & FTS monitoring [ Timothy Noble, George Beckett ]
    • update on the improvements to the Rucio and FTS monitoring platform
  • Status of Butler & Rucio integration [ Steve Pietrowicz ]
    • progress on deploying the consumers of Rucio events at the DFs. Relevant JIRA issues:
    • proposal for naming the RSEs: PREOPS-3441 (please provide your inputs there)
  • Status of replication of STS data from USDF to FrDF [ Wei Yang, Kian-Tat Lim, Fabio Hernandez ]

Data Replication JIRA Issues

  • JIRA tickets with tag "data-replication"

Notes

Data Replication (Fabio)

  • Replication tests
    • perfSONAR instances set up at SLAC on two different DTN nodes (one with MTU 9000 and one with MTU 1500)
      • Fabio to progress some network testing, looking at performance and stability.
      • Recent measurements have shown transfer rates as low as 1 Mbps
    • Rucio transfer tests suspended while Yuyi works on debugging file subscriptions (see the first sketch after this list).
      • Files appear to be transferring, but no replication rules are being created.
      • Trivial scale of testing initially.
    • Investigation of the issue with FTS defaulting to one file per transfer has not progressed (see the batching sketch after this list).
    • Team continuing to use RAL-based FTS instance.
      • SLAC FTS installation is progressing. The database is up and accessible from the application. Brandon is working on the ingress/egress process and debugging failures of some FTS containers (some expected processes don't start).
    • Greg D noted the RAL FTS is not accessible: this may mean a need to switch back to the FTS eval instance (perhaps with reduced logging verbosity, to avoid running out of space).
    • Tim noted FTS is being upgraded in the week beginning 15th May, which will hopefully resolve the issues.
      • Wei asked whether IPv4 or IPv6. Tim confirmed both.
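
The check Yuyi is working through can be sketched with the Rucio Python client: list the subscriptions for the transfer account, then see whether the transferred DID actually carries any rules. This is a minimal, hypothetical sketch; the account, scope, and DID names are placeholders, and it assumes a configured Rucio client environment.

```python
# Hypothetical sketch of the "files move, but no rules" check, using the
# Rucio Python client. Account, scope, and DID names are placeholders.
from rucio.client import Client

client = Client()  # assumes rucio.cfg and X.509/token auth are configured

# Subscriptions define which new DIDs should automatically receive rules.
for sub in client.list_subscriptions(account="rubin_transfer"):
    print(sub["name"], sub["state"])

# If the subscription matched, the transferred DID should carry rules;
# an empty list here reproduces the observed symptom.
rules = list(client.list_did_rules(scope="raw", name="20230515_example_night"))
print(f"{len(rules)} rule(s) found for the DID")
```
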
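For the one-file-per-transfer issue, the FTS3 REST "easy" bindings allow several files to be batched into a single job. A minimal sketch, assuming a delegated X.509 proxy; the endpoint and storage URLs are placeholders, not the actual instances discussed above.

```python
# Hypothetical sketch: batch several files into one FTS job instead of
# submitting one job per file. Endpoint and URLs are placeholders.
import fts3.rest.client.easy as fts3

context = fts3.Context("https://fts3-test.example.org:8446")

files = ["raw_0001.fits", "raw_0002.fits", "raw_0003.fits"]
transfers = [
    fts3.new_transfer(
        f"davs://usdf.example.org:1094/rubin/raw/{name}",
        f"davs://ukdf.example.org:1094/rubin/raw/{name}",
    )
    for name in files
]

# One job carries all the transfers, so FTS can schedule them together.
job = fts3.new_job(transfers, verify_checksum=True, retry=2)
job_id = fts3.submit(context, job)
print(f"Submitted job {job_id} with {len(transfers)} transfers")
```
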
  • Monitoring
    • An MQ server is needed for FTS monitoring at SLAC (assuming CERN MONIT won't be used), plus Elasticsearch and Logstash to feed into Grafana (see the consumer sketch after this list)
    • Tim has a couple of scripts, from the Edinburgh monitoring system, which could be added to the FTS monitoring at SLAC
    • Tim asked about options to deploy services in SLAC
      • Wei suggested services would need to be containerised for K8s.
      • Brandon suggested should set up separate cluster for monitoring.
      • Wei suggested Tim talk to Yee about this
    • Wei asked about the use of CERN MONIT vs. an MQ server
      • Tim noted many FTS instances use CERN MONIT
      • Richard noted that Yee may be planning to deploy MONIT at SLAC
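
As a sketch of the MQ side of that pipeline: FTS publishes monitoring messages to a broker over STOMP, which a small consumer (or Logstash) can read and forward to Elasticsearch for display in Grafana. The broker host, credentials, and topic name below are assumptions, not the actual SLAC setup.

```python
# Hypothetical sketch of consuming FTS monitoring messages over STOMP
# (stomp.py). Broker host, credentials, and topic name are assumptions.
import json
import time

import stomp

class FtsListener(stomp.ConnectionListener):
    def on_message(self, frame):
        msg = json.loads(frame.body)
        # In the real pipeline this would be shipped to Elasticsearch
        # (e.g. via Logstash) rather than printed.
        print(msg.get("job_id"), msg.get("file_state"))

conn = stomp.Connection([("mq.example.org", 61613)])
conn.set_listener("fts", FtsListener())
conn.connect("monitor_user", "monitor_pass", wait=True)
conn.subscribe(destination="/topic/transfer.fts_monitoring_complete",
               id=1, ack="auto")

while True:  # keep the consumer alive
    time.sleep(60)
```
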
  • Rucio-Butler integration
    • Steve has posted a document describing the implementation of the integration service – see the Jira ticket on the USDF topic ***LINK***
    • Recent work has focused on testing the integration: Steve hopes to have GitHub Actions for creating the ingest daemon, to ease deployment.
    • Richard asked if it would be ready in time for the multi-site processing attempt (in three weeks' time)
    • Steve confirmed that was his plan.
    • DF deployment of services is happening in parallel (tracked in tickets).
      • Peter L offered to help with setting up services, if required.
      • Steve is aiming to capture detailed requirements in a Jira ticket
      • Peter L noted UKDF services were set up (in Docker) and ready for testing: next step is to deploy integration service.
    • Steve noted it is timely to contribute to the naming convention for RSEs
      • Fabio has proposed a convention in Jira
      • Wei noted there are lots of underscores, especially in the UKDF SE names; these could potentially be truncated
      • The name is meant to provide a human-readable identifier to help with debugging
      • Rucio can use RSE attributes to match properties, so rules do not need to rely on naming (see the sketch after this list).
      • Steve noted some further work may be required as not all RSEs will be backed by a Butler.
      • K-T noted he had posted some questions on the Jira ticket for Fabio's consideration
      • Steve asked if RSE names are used throughout the database
        • Brandon confirmed the schema is properly normalised, so each name is associated with an id internally
      • K-T noted RSE 'expressions' use names, not ids, so these would need to be updated if names changed
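
To illustrate the attribute-based alternative mentioned above, a minimal sketch with the Rucio client; the attribute names and values and the DID are hypothetical.

```python
# Hypothetical sketch: select RSEs via an attribute expression rather than
# by name, so rules are insulated from RSE renames. Attributes and the DID
# below are placeholders.
from rucio.client import Client

client = Client()

# List RSEs matching an attribute expression instead of naming them.
for rse in client.list_rses("tier=1&country=fr"):
    print(rse["rse"])

# A rule written against attributes avoids embedding RSE names, which is
# what would break on a rename (per K-T's point about expressions).
client.add_replication_rule(
    dids=[{"scope": "raw", "name": "20230515_example_night"}],
    copies=1,
    rse_expression="tier=1&country=fr",
)
```
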
  • STS data transfer from USDF to FrDF (question)
    • Currently impacted by FTS issue at RAL
    • K-T offered to register a file in an RSE endpoint to enable download testing.
    • Wei noted the option of having more than one FTS server, in case of problems with the primary.

Multi-site PanDA Support and Operation (Wei)

  • Wen is testing job submission at all three DFs
    • There looks to be heavy demand for resources (e.g., at the UKDF) from another processing campaign
  • Wen noted the PanDA development system at USDF is broken – possibly from when the load balancer was switched to a new server (meaning the certificate is broken)
    • A new certificate is needed for both development PanDA systems
  • Work to try to boost memory available to PanDA
    • Seeing instances where virtual memory is being allocated, not physical memory, leading to cluster jobs being killed.
    • This affects both the UKDF and FrDF sites, which kill jobs based on a virtual-memory requirement. The same issue is not seen at USDF.
  • Deployment of production PanDA needs a host certificate and an allocation of S3 storage (for backup) to progress.
    • Also, lscratch is needed to run jobs, but there is not enough disk space as testing scales up, potentially conflicting with other users of the node.
      • lscratch on the Milan nodes looks to be comparable to lscratch on the older 'Rome' (question) nodes
      • Richard flagged that the node specification calls for 950 GB (whereas it looks to be around 300 GB).
    • There is potential to use scratch instead of lscratch, though there may be a performance penalty
    • An alternative is to make the purging of lscratch more aggressive.
      • Richard believes scratch content is deleted when a user's last job ends
      • Wei noted this may be challenging in practice, with multiple jobs per user.
    • Alternatively, lscratch could be defined as a resource to be specified when a job is submitted.
  • Wei suggests including basic checking of node capabilities as part of job startup (see the sketch below).
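
A minimal sketch of the kind of startup check Wei suggested, verifying local scratch space and physical memory before the payload runs; the /lscratch path and the thresholds are assumptions, not agreed values.

```python
# Hypothetical node-capability check to run at job startup; the /lscratch
# path and the thresholds below are placeholders.
import os
import shutil
import sys

SCRATCH = "/lscratch"
MIN_SCRATCH_GB = 100
MIN_MEM_GB = 16

scratch_free_gb = shutil.disk_usage(SCRATCH).free / 1e9
mem_total_gb = os.sysconf("SC_PAGE_SIZE") * os.sysconf("SC_PHYS_PAGES") / 1e9

if scratch_free_gb < MIN_SCRATCH_GB:
    sys.exit(f"node check failed: {scratch_free_gb:.0f} GB free on {SCRATCH}")
if mem_total_gb < MIN_MEM_GB:
    sys.exit(f"node check failed: {mem_total_gb:.0f} GB physical memory")

print(f"node ok: {scratch_free_gb:.0f} GB scratch, {mem_total_gb:.0f} GB RAM")
```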

Date of next meeting

Monday May 29th, 8am PT

Discussions on Slack #dm-rucio-testing in the interim

Action items

  •