Scheduled Maintenance

See the LSST Service Status Page

Note that the December maintenance window has been moved from 12/21 to 12/14.

As-is Services

Incidents

  • 0 created, 0 resolved.

Discussion of Notable Issues

Unexpected reboot of lsst-qserv-db16 (IHS-606).

Spontaneous reboots of PDAC nodes have been an ongoing issue since Nov. 14.  The last event was on Nov. 23.  Datacenter infrastructure has been ruled out as a cause.  The problem has been determined to be with the servers themselves, likely on the system board.

Despite this issue, Igor Gaponenko was able to complete his data ingests and meet his Nov. 30 milestone.

On Mon. of this week, our engineering team and vendor tech support identified what they believe is the likely cause, and we've initiated an emergency change request.  The fix entails firmware updates, which are currently being installed.  Nodes will be unavailable while they are upgraded.  This could be a lengthy process.

Requests

None created or resolved


Change Management

This process primarily targets requests that can be handled with current level-of-effort (LOE) resources.  It is also designed to detect items that exceed LOE resources and redirect them to the EVMS process.

Successful changes proceed through 5 stages:

  1. Business Case & T/CAM Concurrence: Check that the submitter has stated a plausible business case and the relevant T/CAM agrees
  2. Feasibility: Is the change well-formulated, address a project need and
  3. Planning: A detailed implementation plan is created which takes into account impacts, resource needs, testing and verification.
  4. Insertion: The plan is executed to implement the change.
  5. Assessment: Verification of successful change, issues analysis, documentation and close-out.
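
As a rough illustration only, the five stages and the LOE/EVMS split described above could be modeled as a small ordered enumeration. The names `Stage`, `next_stage` and `track_for` are hypothetical and do not come from any LSST tooling. The routing rule is a simplified reading of the process description.

```python
from enum import IntEnum
from typing import Optional


class Stage(IntEnum):
    """The five change-process stages, in the order listed above."""
    BUSINESS_CASE = 1  # Business Case & T/CAM Concurrence
    FEASIBILITY = 2
    PLANNING = 3
    INSERTION = 4
    ASSESSMENT = 5


def next_stage(stage: Stage) -> Optional[Stage]:
    """Return the stage that follows `stage`, or None after Assessment."""
    if stage is Stage.ASSESSMENT:
        return None
    return Stage(stage + 1)


def track_for(exceeds_loe: bool) -> str:
    """Assumed routing rule: work beyond LOE resources is redirected to EVMS."""
    return "EVMS" if exceeds_loe else "LOE"
```

A request that fits within LOE resources would walk from `Stage.BUSINESS_CASE` to `Stage.ASSESSMENT` via repeated calls to `next_stage`. One that exceeds LOE resources would be handed to the EVMS process instead.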


Open Change Requests

Key | Summary | Process Stage† | Reporter | P | Created | Resource track | Status

IHS-580 | DM developers need a build/test environment that supports docker containers | Feasibility | Joshua Hoblitt | Minor | 02/Nov/17 | |
  • This request is orthogonal to the planned FY18 kubernetes deployment
  • Determined that this change is significant enough that it needs to be inserted into the EVMS system.
  • Will be discussed in the next T/CAM meeting.
  • Has been added as EVMS epic DM-12846

IHS-576 | Configure slurm to accept jobs to use only partial nodes | Planning | Tim Morton | Major | 02/Nov/17 | |
  • Greg Daues is discussing IHS-576 & IHS-612 with stakeholders to enumerate use cases and needs.

 | Implement debug and normal queues for developers on the verification cluster | Planning | Yusra AlSayyad | Major | 16/Nov/17 | LOE |
  • Tentatively scheduled for mid-December

IHS-595 | iperf3 installed on lsst-xfer | Closed (inserted) | John Parejko | Major | 08/Nov/17 | LOE |
  • iperf3 was installed on Mon. Verified that it's working to John's satisfaction yesterday (Tu.)

IHS-613 | move /scratch files to holding area before deletion | Closed (will not implement) | John Parejko | Minor | 16/Nov/17 | - |


Heard on the Street This Week, but no Ticket Filed

  • New

    • Several users expressed a desire to have the Intel compiler suite (icc) available on lsst-dev
    • Increase ssh idle session timeout, which is currently 1 hr. (John Parejko via Slack) 


  • Previous

    • Suggestion to deploy kubernetes on PDAC; it is assumed that this is being handled through the rolling-wave (EVMS) process
    • Tools for parallel programming in a batch computing environment (GNU parallel and others)

Change Process Notes

  • LDF Service Management Operations Meeting Notes are now posted on the LSST confluence site.
  • Change process is being exercised, refined and socialized with T/CAMs as well as submitters

Problem Management

Report format under development

Interactions

  1. T/CAM interactions

    1. None

  2. ITSC

    1. Next meeting

  3. PDAC

    1. Next PDAC meeting is tomorrow.

  4. Summit-base Tiger Team

    1. Suspended until after the first of the year since Jeff is in Chile.  However, I’m linked in to Chile IT.

  5. Infrastructure

    1. Next meeting

Other business

(None)

Action Items

New

  • Write training .ppt docs on service manager duties that describe what needs to be done on a daily basis.  The first will cover daily duties.
  • Define how the LDMCR closure process works
  • Revise the change process to include, in phase 1, linking in the T/CAM at the discretion of the CM
  • Document the change process, issue types, etc.

From last week