Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Apologies

Agenda — Data Replication

Data Replication JIRA Issues

  • Status of Rucio evaluation platform at USDF and replication exercises
  • Status of Rucio production platform at USDF
  • Status of data replication exercises
  • Status of data replication monitoring tools and logging platform
  • Status of integration of Rucio and butler for automated ingestion
  • Collective writing of a technote capturing what we need to replicate and when
  • JIRA tickets relevant for data replication

Note: the JIRA issues related to data replication have the label "data-replication" (among others)

Notes

Data Replication (Fabio)

  • Three-DF Replication experiments (Yuyi and Brandon)
    • Yuyi is measuring transfer throughput between facilities
      • SLAC-IN2P3 third-party transfers initially failed, but it was possible to push data.
        • Measured 4 MB/s with test data.
        • The third-party transfer issue was tracked down to a checksum problem at the IN2P3 end, requiring reconfiguration of the XRootD server to recompute checksums as a workaround.
          • It is not clear why this is an issue, nor why it was not seen when moving data from IN2P3 to Lancaster.
          • Coordinated configuration of XRootD, FTS, and dCache is expected to be needed.
          • Greg noted dCache appears to perform its own checksum verification, independent of FTS (possibly through its default configuration).
          • Fabio noted it is possible to disable the check in dCache: by default, dCache asks the remote server for a checksum, which it uses to confirm the transfer completed successfully. dCache supports Adler-32, though it prefers other checksums. Fabio expects FTS could be configured to provide an Adler-32 checksum if dCache requests one (see the Adler-32 sketch after these notes).
          • Andy noted he worked with the dCache team in February to resolve this issue, and that the order of checksums in the request is important. Wei noted the fix does not seem to have appeared in XRootD version 5.5.3.
      • SLAC-Lancaster did not work as expected, as Rucio chose to source the data from IN2P3 in preference.
        • Achieved 40 Mbps between IN2P3 and Lancaster.
        • A second attempt to transfer from SLAC to Lancaster on Friday 3rd failed, due to a missing link to the test disk at the IN2P3 and Lancaster sites.
        • Yuyi recommends using the test disk at Lancaster and IN2P3 to avoid test data getting mixed up with production data.
          • Fabio noted that IN2P3 has configured a test area in storage (a separate endpoint).
          • Matt thinks the data disk is okay to use.
        • Matt proposes to set up a second Rucio endpoint at Lancaster (a separate path within the endpoint).
      • Fabio asked if we are ready to set up regular tests of transfers between facilities, e.g., replicating a night of observations three times per week (see the replication-rule sketch after these notes).
        • Yuyi confirms this is possible, though she would like to measure performance between SLAC and Lancaster, and to solve the IN2P3 checksum problem, before doing so.
        • Regular transfers could start this week.
        • Also, Yuyi noted that monitoring is currently very limited.
        • George asked whether different data would be needed for each transfer, to ensure that Rucio actually performs it.
          • Fabio noted we could use different data each time, or delete the existing data.
          • Yuyi noted the available data is limited.
          • Fabio suggested starting with one night of data per transfer experiment; this would be a useful volume.
          • Wei suggested including Rucio deletion in the exercise: in the past this has been a weakness of Rucio, as deletion is done file by file. Wei considered running the deletion daemon in Europe, to avoid the high latency of using a SLAC-based deletion daemon.
          • Brandon recommended setting up a separate RSE and using the greedy-deletion attribute to encourage quick deletion of files (see the greedy-deletion sketch after these notes).
        • Stephen asked if deletion latency would be an issue for production.
          • K-T noted we will likely need the ability to remove data from specific sites when it is no longer needed.
          • Stephen proposed further consideration, as this is a potential source of problems: the Butler may need to be notified about any deletions.
        • Brandon noted a likely requirement to write a custom policy package to deal with the creation of PFNs (see the policy-package sketch after these notes).
          • DUNE has done this, though it was not a smooth process and required significant debugging.
          • K-T hopes it is not as complicated in Rubin as in DUNE.
          • Wei noted a potential need to reconfigure Reg-Exp to deal with the PFN-creation use case.
          • Stephen noted that, in testing, he had been replacing the root of the file path on the receiver side for Butler ingest. He is not sure whether the PFN use case means the path would need to be transferred from the sender, as it could not be pre-computed.
            • Wei proposes registering the majority of the path (excluding the site-specific root prefix).
            • K-T does not think the prefix would need to be shared.
          • Fabio noted this could be tested with real data in the near future.
  • Monitoring (George for Tim)
    • ATLAS is running Rucio 1.30, whereas Rubin is running 1.29.
      • Rubin will stick with version 1.29, as this is the LTS version.
    • Yuyi will provide a list of the metrics that are needed for monitoring.
    • Radu has renamed metrics.
    • Yuyi has provided the list of metrics that she needs (details in JIRA).
  • Butler-Rucio integration
    • Stephen is writing up the design to date, ready to be sanity-checked by K-T.
    • Fabio noted an investigation into setting up Kubernetes at IN2P3 for a Kafka cluster and MirrorMaker.
      • A ticket is to be created describing what is required.
      • Stephen noted this is in progress, and a Butler ingest instance would also be needed.
  • Fabio asked if we are ready to begin replicating SDS data to IN2P3.
    • Wei noted that the path issue is a blocker; it requires a new Rucio policy file.
    • Fabio proposed starting with simple tests to check that the physical filename is consistent with the logical filename.
    • K-T noted successful transfers by Tony to the test storage at Tucson. This could be used as a source for file replication in the short term.
    • Wei asked if K-T had tried to register files with Rucio. K-T noted not yet: he was reviewing the APIs for doing this (see the registration sketch after these notes).
    • Andy noted Adler-32 is not a good checksum and proposes CRC-32C (hardware-assisted and faster to compute).
      • Andy recommends switching to CRC-32C.
      • Fabio noted Rucio currently requires an Adler-32 checksum to be provided when registering a file.
      • Andy noted Rucio could be updated to support CRC-32C.
    • Yuyi and Greg noted that changing to another checksum potentially requires a database update, as the choice is hard-wired and is the basis for naming in Rucio.
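
Adler-32 sketch (referenced in the checksum discussion above). A minimal illustration in Python, not tied to any specific deployment, of computing the zero-padded hex Adler-32 digest that tools such as Rucio and FTS exchange; the file path is hypothetical.

    import zlib

    def adler32_hex(path, chunk_size=1 << 20):
        """Compute an Adler-32 digest incrementally and return the
        8-character zero-padded hex string used by transfer tools."""
        checksum = 1  # Adler-32 is seeded with 1
        with open(path, "rb") as f:
            while chunk := f.read(chunk_size):
                checksum = zlib.adler32(chunk, checksum)
        return f"{checksum & 0xFFFFFFFF:08x}"

    # Hypothetical usage: compare the local digest against the digest
    # reported by the remote endpoint to verify a transfer.
    print(adler32_hex("/tmp/test_file.fits"))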
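
Replication-rule sketch (for the regular transfer exercise discussed above). A sketch assuming the Rucio Python client; the scope, dataset name, and RSE name are hypothetical.

    from rucio.client.ruleclient import RuleClient

    # Hypothetical DID: one night of observations packaged as a dataset.
    dids = [{"scope": "test", "name": "night_20230306"}]

    # Ask Rucio to maintain one replica at the destination RSE; the rule
    # engine then drives FTS to perform the actual transfers.
    rule_id = RuleClient().add_replication_rule(
        dids=dids,
        copies=1,
        rse_expression="LANCASTER_TEST",  # hypothetical RSE name
        lifetime=7 * 24 * 3600,           # seconds; lets test data expire
    )
    print(rule_id)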
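
Greedy-deletion sketch (for Brandon's recommendation above). A sketch of marking a test RSE for prompt deletion via an RSE attribute; the RSE name is hypothetical, and the exact attribute key should be checked against the deployed Rucio version.

    from rucio.client.rseclient import RSEClient

    # Mark the (hypothetical) test RSE so the reaper deletes expired
    # replicas promptly rather than waiting for space pressure.
    RSEClient().add_rse_attribute(
        rse="LANCASTER_TEST", key="greedyDeletion", value=True
    )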
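
Policy-package sketch (for the PFN discussion above). A minimal illustration of a deterministic LFN-to-PFN algorithm in the style of a Rucio policy package; the function name, the registration hook, and the path layout are assumptions for illustration, not the eventual Rubin policy.

    # Sketch of a custom deterministic LFN -> PFN algorithm in the style
    # of a Rucio policy package; names and layout are illustrative only.

    def rubin_lfn2pfn(scope, name, rse, rse_attrs, protocol_attrs):
        """Map a logical file name to a relative physical path.

        Rucio prepends the protocol- and RSE-specific prefix, so only
        the site-independent part of the path is produced here (matching
        the proposal to register the path without the site-specific
        root prefix).
        """
        return f"{scope}/{name}"

    def get_algorithms():
        # Hook through which a policy package exposes its algorithms.
        return {"lfn2pfn": {"rubin": rubin_lfn2pfn}}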
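
Registration sketch (for the file-registration discussion above). A sketch assuming the Rucio Python client; the file metadata and RSE name are hypothetical. Rucio currently expects the adler32 field here, which is part of what a switch to CRC-32C would touch.

    from rucio.client.replicaclient import ReplicaClient

    # Hypothetical file metadata; Rucio currently expects an Adler-32
    # digest when registering a replica, which is why switching to
    # CRC-32C would affect registration (and, per the notes, possibly
    # the database schema).
    files = [{
        "scope": "test",
        "name": "raw/20230306/exp_000001.fits",
        "bytes": 123456789,
        "adler32": "0a1b2c3d",
    }]

    ReplicaClient().add_replicas(rse="SLAC_TEST", files=files)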

Multi-site PanDA Support and Operation (Wei)

  • Able to run the CI-HSC test on all three DFs (but without clustering).
    • Clustering is likely to significantly improve performance.
    • Michelle is working on clustering support for PanDA (HTCondor and Parsl can support clustering).
  • Peter is planning to create a routine test of the PanDA setup based on CI-HSC data.
  • Wei believes current issues need to be addressed before progressing to the next step.
  • No objections from Wen and Michelle.

Date of next meeting

Monday March 20th, 8 am PT

Discussions on Slack (#dm-rucio-testing) in the interim

Action items

  •