Date

Zoom

https://stanford.zoom.us/j/91889629258?pwd=KzNleVdmSnA1dkN6VkRVUTNtMHBPZz09

Attendees

Absent

Agenda — Data Replication

  • Status of Rucio evaluation platform at USDF and replication exercises
  • Status of Rucio production platform at USDF
  • Status of data replication exercises
  • Status of data replication monitoring tools and logging platform
  • Status of integration of Rucio and butler for automated ingestion
    • Status of USDF network configuration
  • Collective writing of a technote where we collect details on what we need to replicate and when
  • JIRA tickets relevant for data replication

Note: the JIRA issues related to data replication have the label "data-replication" (among others)

Notes

Data Replication

  • USDF team focused on wrap-up of migration of data and services from NCSA to SLAC.
    • This will continue to be the focus for much of the remainder of August.
  • No progress with new deployment of Rucio instance.
    • Existing instance should be usable.
  • Wei summarised the new facility called S3DF (replacing the interim SDF, which has hosted services up until now).
    • Good progress with S3DF more generally, e.g., some users can log in and various Kubernetes tools are deployed.
    • However, not suitable for DRP yet, as no FTS, Compute Endpoints, etc.
    • Brandon expected re-deployment of services onto the S3DF would take a few days.
      • Potential to deploy both test and production instances of key services, such as Rucio.
    • Wen has started to define required network configuration for S3DF (e.g., thinking about traffic ingress from UK and French DF).
  • Fabio advised that replication experiments between USDF and French DF are still affected by the LHCONE routing issue (that is, by default, USDF puts traffic destined for IN2P3 onto the LHCONE VLAN, even though the end-point in IN2P3 is not connected to this VLAN).
    • Wei proposed to investigate further; he did not think the issue was with traffic routing to LHCONE.
  • Fabio noted the need to agree and document datasets to be replicated between sites.
    • Fabio has been building up a basic proposal based on details from DP0.2.
      • Needs more info on number and sizes of datasets and on when data can be replicated.
    • George suggested working through the seven stages of data processing with a Pipeline expert.
    • Fabio hopes to progress in September.
    • Brian noted tables that were produced for DP0.2 based on input/output and sizes.
      • Brian also noted some intermediate files did not need to be duplicated between sites.
    • Brian would be happy to help with this.
  • Yuyi has been investigating monitoring options and provided feedback on requirements.
    • Fabio shared useful notes from ESCAPE project, re. setting up of Monit tools.
    • Tim thought there was sufficient information from ESCAPE project to progress with setup (though Tim is away until mid-September).
    • In production, expect to colocate monitoring tools with Rucio instance (i.e., in USDF).
      • Wei noted it would be useful to confirm access to Rucio DB from outside of USDF.
    • Stephen P noted that Monit, by default, accesses messages via the ActiveMQ mechanism, which would clash with his Rucio-Data Butler integration.
    • May need to update Monit to access the Kafka stream that Stephen is setting up.
    • Alternatively, we should look more generally at what services need to access file-status information in Rucio.
      • E.g., other sites may read from ActiveMQ into a database (such as Elasticsearch) and then point all downstream services to that.
  • Stephen noted no progress on work to integrate Rucio and Data Butler since end of PCW.

Multi-site PanDA Support and Operation

  • SLAC S3DF
    • Wei plans to rebuild grid infrastructure in S3DF, using data from CVMFS
    • Note that S3DF has a newer Linux distro for images (Red Hat 8), so likely to require some work (SDF images based on Red Hat 7).
    • Open question as to whether to use containers (would require containerisation of some elements, e.g., ARC-CE) or virtualised infrastructure.
    • Expect to be a few weeks away from being ready.
  • Wei also assembling details of network configuration (has meeting with Security officer on Thursday 25th).
    • Needs information on both outbound and inbound traffic to/from UKDF and French DF.
  • Status of existing SDF
    • Should still be fine for running PanDA jobs.
    • Wen noted that they have started using Rubin CVMFS for shipping code and environment.
  • Preparing to run at EU DFs
    • Testing submitting jobs to IN2P3.
    • Requires some updates to environment that is accessed from jobs at IN2P3.
      • Need to set up local record of required secrets (credentials for accessing GCS, AWS and the local Butler registry database), so job can access necessary services.
    • Wei noted jobs running at IN2P3 are accessing Butler at SLAC and asked if that was as intended. 
      • Wen believes this is okay, depending on the job. Job can define which Butler to use.
    • Peter Love is working on a wrapper. Goal is to standardize the PanDA job startup process at all 3 DFs and use CVMFS as much as possible (as opposed to staging in via PanDA job submission). The wrapper will:
      • "source" a site-specific setup script: environment variables, secrets, etc.
      • start the PanDA pilot wrapper
      • A preliminary release of this wrapper and the pilot wrapper is already available in CVMFS (PREOPS-1263). Site-specific setup scripts are meant to be visible only at the site.
  • PanDA installation at SLAC
    • PanDA components are deployed
      • For incoming traffic, SSL passthrough works from SLAC but not from external sites. Potentially, this is a complication caused by information needing to pass through the load balancer. Believed to be a blocking issue.
      • No outbound traffic working yet. Wei now has details of the address range and certificate.
    • PanDA deployment using test database.
      • Unclear when it will move to the production database. Will continue with the test instance for now.
      • Wei noted that PanDA DB, Rucio DB, and Butler DB will all use Postgres.
      • Database performance and backup strategy to be finalised in due course.
    • Wei asked if Rubin Security documentation discussed details of networking and traffic flow.
      • Fabio did not expect this level of detail in the current document library, but noted it would be needed for implementation of each DF.
  • George noted progress with RTN-021 and plan to circulate update before end of August.
  • Fabio asked if USDF team had information on periodic processing of HSC2 data.
    • Fabio proposed to work towards being able to complete HSC2 data in the coming months (e.g., before end of calendar year).
    • George suggested to start with just Step 1 (or Step 1 and Step 2) of the pipeline.

Date of next meeting

Monday September 19th (moved from September 5th, which is the US Labour Day holiday; discussions on Slack #dm-rucio-testing in the interim)

Action items
