
All times listed are Project Time (Pacific)


Upcoming Scheduled Maintenance

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status

17-Jan-2019 6:00am

17-Jan-2019 10:00am


Monthly maintenance

NCSA
  • power rebalancing in one rack
  • switch maintenance in select racks
  • critical security patching

ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)

primarily:

  • lsp-stable
  • lsp-int
  • Oracle

secondarily:

  • VMs (dbb-gw VM will be migrated to another host temporarily)
  • admin testing services

    Status: Scheduled

    Recurring Scheduled Maintenance

    (All times are Project Time (Pacific))

    Start | End | Event | Location | Description | Systems/services that will NOT be available | Status

    Third Thursday of every month 06:00

    Third Thursday of every month 10:00

    Recurring-Monthly

    Monthly lsst-dev maintenance

    NCSA
    • Routine system updates.
    Variable. Do not expect any lsst-dev system to be available during this period.
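
For planning around the monthly window, the "third Thursday" rule can be computed rather than looked up. The following is a minimal sketch using only the Python standard library; the function names `third_thursday` and `maintenance_window` are illustrative and not part of any LSST or NCSA tooling. Times are naive and should be read as Project Time (Pacific).

```python
import calendar
from datetime import date, datetime, time


def third_thursday(year: int, month: int) -> date:
    """Return the date of the third Thursday of the given month."""
    # monthcalendar() returns one list per week, Monday first by default;
    # days that fall outside the month are represented as 0.
    thursdays = [
        week[calendar.THURSDAY]
        for week in calendar.monthcalendar(year, month)
        if week[calendar.THURSDAY] != 0
    ]
    return date(year, month, thursdays[2])


def maintenance_window(year: int, month: int) -> tuple:
    """Return (start, end) of the 06:00-10:00 window as naive datetimes."""
    day = third_thursday(year, month)
    return datetime.combine(day, time(6, 0)), datetime.combine(day, time(10, 0))
```

For example, `third_thursday(2019, 1)` gives 17 January 2019, which matches the upcoming maintenance entry listed on this page.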

    Status: Scheduled


    Every Mon. 04:00


    Recurring-Weekly
    Purge of GPFS /scratch partition

    NCSA

    Per LSST data management policies, files older than 180 days will be purged from the LSST shared (GPFS) /scratch file system.

    Purge logs can be found in /lsst/admin/purge_logs/scratch/
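
To check ahead of time which of your files might be affected, something like the sketch below can list files older than the 180-day threshold. It assumes that "older than 180 days" is judged by modification time, which this page does not state; the actual purge job may use a different criterion. The function name `purge_candidates` is hypothetical.

```python
import os
import stat
import time

PURGE_AGE_DAYS = 180  # threshold taken from the policy statement above


def purge_candidates(root, now=None, max_age_days=PURGE_AGE_DAYS):
    """Yield regular files under root whose mtime is older than max_age_days.

    Assumption: age is measured by modification time (st_mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                info = os.lstat(path)  # do not follow symlinks
            except OSError:
                continue  # file disappeared or is unreadable; skip it
            if stat.S_ISREG(info.st_mode) and info.st_mtime < cutoff:
                yield path
```

A typical use would be to iterate over `purge_candidates("/scratch/<your directory>")` and copy anything still needed to a non-purged file system.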

    No outage or service disruption.

    Status: SCHEDULED

    Every Tu. 08:00

    Every Tu. 10:00

    Recurring-Weekly

    Weekly Nebula Maintenance

    NCSA

    Routine system updates. Computational services continue to run.

    Horizon and API interfaces.

    Status: Scheduled



    Previous Outages & Events

    Start | End | Event | Location | Description | Systems/services that were NOT available | Status

    18-Dec-2018

    9:46am

    18-Dec-2018

    11:00am

    Host reboots due to power fluctuation

    LDF (NCSA)

    A power event caused some hosts to reboot:

    • lspdev kubernetes cluster (12 nodes including master node did not come back on their own and were manually brought online around 11:00am)
    • some L1 nodes rebooted as well
    lspdev was unavailable from ~09:40 until ~11:00am

    Status: RESOLVED

    Systems are back online and should be functioning, but please open tickets if there are lingering issues.

    5-Dec-2018

    7:00am

    5-Dec-2018

    8:30am

    PDAC and lspdev k8s merge

    NCSA

    The PDAC k8s environment was merged into the lspdev k8s cluster. Services will continue to be isolated through Kubernetes namespaces, labels, taints, etc.

    Services running in PDAC Kubernetes

    Status: complete

    29-Nov-2018

    6:00am

    29-Nov-2018 12:00 noon

    Monthly maintenance

    NCSA
    • Puppet code changes
    • disable CPU hyperthreading (requires reboot!!!)
    • OS/Yum updates
    • code upgrades on select service & management switches NPCF
    • pfSense updates
    ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)

    Status: complete

    13-Nov-2018

    5:30 PST

    13-Nov-2018

    6:30 PST

    lspdev cluster reboot

    NCSA
    • Reseating Kubernetes nodes in their chassis slots to resolve errors caused by power event over the weekend.
    lspdev/Kubernetes cluster

    Status: RESOLVED

    10-Nov-2018 ~02:40

    10-Nov-2018 ~02:45

    Host reboots due to power fluctuation

    LDF (NCSA)

    A power event caused some hosts to reboot:

    • lspdev kubernetes cluster (3 nodes including master node did not come back on their own and were manually brought online around 07:30)
    • some L1 nodes rebooted as well

    lspdev was unavailable from ~02:40 until ~07:30

    Status: RESOLVED

    Systems are back online and should be functioning, but please open tickets if there are lingering issues.

    11/6/2018 5am (PT)

    11/6/2018 1pm (PT)


    Power maintenance

    LDF (NCSA)

    Some power distribution panels are being worked on, but should NOT cause any LSST environment disruptions.

    None

    Status: complete

    01-Nov-2018 10:00

    01-Nov-2018 10:05

    Critical security patching

    NCSA

    Addressed vulnerability CVE-2018-14665 on the 3 lsst-dev hosts.

    No interruption of service.

    Status: RESOLVED

    18-Oct-2018 06:00

    18-Oct-2018 10:00

    Monthly maintenance

    NCSA

    Activities are minimal this month and are expected to cause little impact:

    • firmware update and reboot on monitor01 (monitoring collector)
    • OS & Kernel updates on tus-ats01.lsst.ncsa.edu
    • Puppet code changes
    • monitor01/InfluxDB (and likely the front-end Grafana monitoring, e.g., monitor-ncsa.lsst.org) will be unavailable for a short period of time
    • tus-ats01 will be unavailable for OS & Kernel updates
    • the Puppet changes are intended to be functional "no-ops" and should cause no outage, although we scheduled these changes during our monthly PM window in case something unexpected occurs

    Status: complete

    15-Oct-2018 05:35

    15-Oct-2018 07:15

    Power event -> host outage at one datacenter

    NCSA

    A power blip caused all physical hosts at the NCSA building to power off or reboot.

    • None of the LSST physical hosts at the NPCF building were affected.

    affected: all physical LSST hosts (and VMs) at the NCSA building:

    • incl. lsst-dev*, lsst-xfer, lsst-l1*, lsst-daq, lsst-dev-db
    • most physical hosts rebooted themselves after the event, although a few L1 systems had to be manually powered on
    • most VMs had to be manually started after the event
    • update: also includes Nebula, which is still impacted

    unaffected: all physical LSST hosts (and VMs) at the NPCF building:

    • incl. lsst-qserv*, lsst-verify-worker*, lsst-sui*, lsst-kub*, GPFS

    Status: RESOLVED

    • note: Nebula is still impacted by the outage

    04-Oct-2018 06:00

    04-Oct-2018 07:15

    Critical security patching

    NCSA

    An incorrect date (Oct 1) was initially posted for this maintenance. The correct date is Thu, Oct 4.

    ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters)

    The following systems will remain online and unaffected:

    • tus-ats01

    Status: RESOLVED

    • sui-tomcat02 is getting rebooted once more to resolve an issue with NFS mounts but we expect it to be resolved easily

    20-Sep-2018 06:00

    22-Sep-2018 14:50

    Qserv Master outage

    NCSA

    qserv-master01 is having trouble booting after a motherboard replacement during planned maintenance.

    Qserv in general, specifically qserv-master

    Status: RESOLVED

    20-Sep-2018 06:00
    20-Sep-2018 12:40
    LSPdev Kubernetes

    NCSA
    1. LSPdev is having a gateway error

    LSPdev

    Status: RESOLVED

    20-Sep-2018 06:00

    20-Sep-2018 12:00

    Monthly maintenance

    NCSA
    1. Network switch firmware updates/reboots
    2. Lenovo firmware updates/reboots
    3. OS package updates/reboots
    4. ESXi hypervisor updates/reboots
    5. GPFS client changes and upgrade to 4.2.3-10

    6. GPFS server upgrade to 4.2.3-10

    All systems will be unavailable during this period.

    Status: RESOLVED

    qserv-master01 and LSPdev are still having issues. These will be tracked as separate incidents.

    09-Aug-2018 09:00

    09-Aug-2018 09:37

    lsst-dev01 Outage

    NCSA

    The lsst-dev01 server was unreachable from the GPFS cluster for more than 60 seconds and was expelled from it. Open file handles and/or bind mounts from GPFS prevented lsst-dev01 from reconnecting to GPFS until it was rebooted. We suspect that a big job on the Slurm cluster may have contributed to network congestion that triggered this.

    lsst-dev01

    Status: RESOLVED

    03-Aug-2018 10:00

    03-Aug-2018 13:30

    NCSA VPN was not working for some users.

    NCSA

    A configuration issue caused connection problems to some NCSA resources for some VPN users.

    NCSA VPN

    Status: RESOLVED


    29-Jul-2018

    03-Aug-2018 05:45

    Bulk Transfer Server Rebuild

    NCSA

    The Globus endpoint on lsst-xfer stopped working on July 29 after a certificate from the outdated GridFTP service expired. lsst-xfer was rebuilt and upgraded with CentOS 7.5, Globus Connect Server (v4), bbcp (17.12), and iRODS client (4.2.3). Globus bookmarks to the lsst#lsst-xfer endpoint may need to be updated to point to the rebuilt endpoint.

    Globus on lsst-xfer

    Status: RESOLVED

    27-Jul-2018

    27-Jul-2018

    NCSA VPN Migration

    NCSA

    NCSA will be migrating to a new VPN with multi-factor authentication. The new VPN is currently available, and users are encouraged to start using the new VPN before the cutoff date in order to ensure continued connectivity. All users must be registered with NCSA's Duo before they can use the new VPN. Links to the how-to article as well as the new VPN and Duo login are included below.

    No interruption of service is expected.

    Status: complete

    19-Jul-2018 10:00

    19-Jul-2018 10:30

    DB services on lsst-dev-db are unavailable along with dependent services, including:

    • lspdev
    NCSA

    The MariaDB service did not start on lsst-dev-db after maintenance. A newer MariaDB setting was incompatible with the current mount point.

    DB services on lsst-dev-db

    Services that depend on lsst-dev-db, including:

    • lspdev

    Status: RESOLVED

    19-Jul-2018 06:00

    19-Jul-2018 10:00

    Monthly lsst-dev maintenance

    NCSA
    1. Dell firmware updates/reboots
    2. OS package updates/reboots
      1. including upgrades to CentOS 7.5
    3. GPFS client changes and upgrade to 4.2.3-9

    4. GPFS server upgrade to 4.2.3-9

    ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters)

    The following systems will remain online and unaffected:

    • lsst-daq
    • lsst-l1-*
    • tus-ats01

    Status: complete

    DB services on lsst-dev-db will not start after maintenance, impacting dependent services such as lspdev. This will be tracked in a separate status event.

    27-Jun-2018 07:00

    27-Jun-2018 11:00

    lspdev outage

    NCSA

    The Kubernetes head node unexpectedly rebooted at approximately 7:00 AM, causing a JupyterHub outage. Service was brought back online around 11:00 AM.

    lsst-kub0[01-20]

    Status: complete

    27-Jun-2018 06:10

    27-Jun-2018 06:30

    Monitoring Update

    NCSA

    First phase of enabling encryption on monitoring traffic

    Monitoring Dashboards


    21-Jun-2018 06:00

    21-Jun-2018 07:35

    Monthly lsst-dev maintenance

    NCSA
    1. pfSense firewall update
    2. OS package updates/reboots for CentOS 6.9 servers (lsst-web, lsst-xfer, lsst-nagios)
    3. Slurm update (lsst-dev01, lsst-verify-worker*)
    4. Update host firewalls on GPFS servers
    5. iDRAC configuration updates on lsst-dev01 and ESXi hosts

    CentOS 6.9 servers:

    • lsst-web
    • lsst-xfer
    • lsst-nagios

    Slurm/verification cluster

    Other impact is not expected but unexpected issues could lead to connectivity issues for other hosts or downtime for lsst-dev01 or hosted VMs

    Status: complete

    18-Jun-2018 11:00

    19-Jun-2018 17:00

    Nebula outage

    NCSA

    Nebula is undergoing a complete reboot. Last week's storms damaged more than just the one node initially thought to be affected.

    Nebula will be unavailable until 15:00 (5pm CDT)

    Status: RESOLVED

    19-Jun-2018 06:00

    19-Jun-2018 10:00

    Level One Test Stand Maintenance

    NCSA
    1. BIOS firmware updates
    2. Puppet and firewall changes (including support of SAL unicast/multicast traffic)
    3. OS package updates (staying with CentOS 7.4)

    Level One Test Stand, including:

    • lsst-daq
    • lsst-l1-*

    Status: Resolved

    12-Jun-2018 ~01:40 PDT

    12-Jun-2018 07:01 PDT

    Storm → outage of Kubernetes Commons & 75% of verification cluster compute nodes

    NCSA

    A storm caused a power event at the NPCF datacenter, taking down Kubernetes Commons and lspdev as well as 75% of the verification cluster compute nodes.

    • Kubernetes Commons / lsst-lspdev / kub*
    • 75% of verify-worker* / Slurm nodes

    Status: RESOLVED

    17-May-2018 11:30

    17-May-2018 12:25

    Grafana monitoring was offline

    NCSA

    The InfluxDB data used by Grafana monitoring was offline while its storage was rebuilt.

    https://monitor-ncsa.lsst.org/ monitoring data was offline

    Status: RESOLVED

    17-May-2018 06:00

    17-May-2018 11:30

    Monthly lsst-dev maintenance

    NCSA
    1. GPFS maintenance

      • Replace floor tile

      • GPFS service upgrade to 4.2.3-8

      • Rebuild of /lsst/backups structure

    2. PDAC Firewall maintenance for new vLANs

    3. BIOS Firmware updates (lsst-bastion01, lsst-sui*, lsst-qserv*, LevelOne Test Stand, lsst-dev-db)

    4. Node changes with reboots (all nodes)

      • switch to rsyslog v8 yum repository & upgrade rsyslog (bastion01 & kub, qserv, sui, verification clusters)

      • puppet-stdlib module update (lsst-dev01, lsst-dev-db, lsst-web, lsst-xfer, LevelOne Test Stand)

      • GPFS client upgrade (4.2.3-8) and nosuid mount option changes (lsst-dev01, lsst-qserv*, lsst-web, lsst-xfer, verification cluster)
      • NFS nosuid mount option changes of GPFS (lsst-demo01 and kub & verification clusters)
      • enable PXE boot on new network interfaces (lsst-kub* & lsst-backup01)
      • OS Updates (all nodes)

    ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters)

    The following systems will remain online and unaffected:

    • lsst-daq
    • lsst-l1-*
    • tus-ats01

    Status: RESOLVED


    30-Apr-2018 18:37

    14-May-2018 15:00

    Security & AA infrastructure offline

    La Serena

    The Security & AA infrastructure went offline around 18:37 Project Time. None of the infrastructure is accessible via the network.

    A UPS had to be replaced and an electrical circuit upgraded for the replacement UPS.

    None.

    Status: RESOLVED

    11-Apr-2018 06:00

    07-May-2018 10:30

    Production-size run (HSC-PDR1) on the verification cluster

    NCSA

    Per IHS-749, ~15 nodes of the batch compute resources will be reserved in order to complete HSC-PDR1 data runs. It is expected that the reservation can be scaled back to <10 after the first couple of weeks.

    All systems available.

    Status: complete

    25-Apr-2018 11:30

    25-Apr-2018 12:40

    Test new Puppet changes for sssd and LDAP access on SUI* nodes.

    NCSA

    A minor change to the sssd service configuration needs to be rolled out to all nodes. The change will require a momentary outage of the sssd service, and some actions will take longer (for a short period of time) as the cache is repopulated. Changes in Puppet structure (affecting LDAP group sync) also need testing and can happen simultaneously.

    Affected services:

    • Firefly proxy and tomcat services
      • Some actions may appear slow while cache re-populates

    Status: complete

    04/24/2018 07:10

    04/24/2018 07:50

    Increased LDAP timeout to 60 seconds in sssd.conf

    NCSA

    Increased the LDAP timeout to 60 seconds in sssd.conf to fix problems with long login times and failures to start batch jobs.

    We will coordinate in the near future to apply the same change on qserv* and sui*.

    Affected nodes: kub*, verify-worker*

    Status: RESOLVED

    All nodes are back in service, although affected nodes may have slow LDAP response times for a short while as the local cache is rebuilt.

    19 Apr 2018

    19 Apr 2018

    Monthly lsst-dev maintenance

    NCSA

    CANCELLED. No major work is needed and key personnel are travelling. Deployment of the new DTN and VM infrastructure will be delayed until after the May maintenance period.

    N/A

    Status: CANCELLED

    4/11 at 07:00

    4/11 at 08:00

    Firewall update at NCSA

    NCSA

    Per LSST-1257, the primary firewall needs to have its routing software updated. No failover is required and traffic will continue to flow through the primary firewall during the upgrade.

    No outage or service disruption.

    Status: RESOLVED

    4/3/2018 16:40

    4/3/2018 16:45

    LDAP problems

    NCSA

    New logins to the LSST resources at NCSA were hanging.

    New logins can't take place right now.

    Fixed.

    3/26/2018 08:00

    4/2/2018 9:00

    A fileserver on Nebula became unstable, resulting in diminished performance for some instances and volumes.

    NCSA

    Any instances or volumes hosted on the healing filesystem will be impacted, or approximately 20% of instances and volumes.

    We are migrating instances around to speed up the process.

    3/15/2018 10:20 am PT

    3/15/2018 14:20 PT

    Lingering issues on select nodes following March PM

    NCSA

    Select nodes had issues coming out of the PM.

    • lsst7 - issue w/ sshd

    Status: RESOLVED

    3/15/2018 10:20 am PT

    3/15/2018 11:23 am PT

    Lingering issues on select nodes following March PM

    NCSA

    Select nodes had issues coming out of the PM.

    • lsst-qserv-master01 - cannot mount local /qserv volume
    • lsst-xfer - issue w/ sshd
    • lsst-dts - issue w/ sshd
    • lsst-l1-cl-dmcs - unknown issue

    Status: RESOLVED

    3/15/2018 6:00 am PT

    3/15/2018 10:20 am PT

    March lsst-dev maintenance (regular schedule)

    NCSA
    • GPFS server updates and configuration of additional NFS/Samba services
    • Urgent Firmware updates
    • Increase size of /tmp on lsst-dev01
    • Hardware maintenance/memory increases on select servers/VMs
    • Release of refactored Puppet code
    • OS updates
    • Recabling servers in dev server room to new switches
    Systems/services that were NOT available: ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification and Kubernetes clusters)

    Status: COMPLETE

    Select nodes (lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, lsst-l1-cl-dmcs) required additional attention following the PM, as noted in a separate status entry.

    3/12/2018 7:00 am PT

    3/12/2018 3:00 pm PT

    Nebula (OpenStack resource) is down

    NCSA

    Nebula is being taken down for patches to be applied across the whole infrastructure.

    All containers on Nebula are going down.

    Status: COMPLETE

    07 Mar 2018 13:00

    07 Mar 2018 14:10

    qserv-db12 maintenance

    NCSA

    qserv-db12 had one failed drive in the OS mirror replaced, but the other drive is presenting errors as well, so the RAID cannot rebuild. The node was taken down to replace the second disk, rebuild the RAID in the OS volume, and reinstall the OS.

    qserv-db12

    Status: COMPLETE

     09:02

    09:21

    lsst-dev01 Out of Space

    NCSA

    The main / drive partition ran out of space due to a user's faulty pip build. The faulty files were moved elsewhere for the user to review.

    lsst-dev01

    Status: COMPLETE

    27 Feb 2018 08:40

    27 Feb 2018 09:40

    Puppet maintenance at NCSA

    NCSA

    Enable environment isolation on puppet master

    No outage or service disruption is expected.

    Status: COMPLETE

     06:00

    07:00

    Puppet updates

    NCSA

    Rolled out significant changes to the logic and organization of the Puppet resources in the NCSA 3003 data center in order to standardize between LSST Puppet environments at NCSA. We had done extensive testing and did not expect any outages or disruption of services.

    None

    Changes were applied to: lsst-dev01, lsst-dev-db, lsst-web, lsst-xfer, lsst-dts, lsst-demo, L1 test stand, DBB test stand, elastic test stand.

    Status: complete

     12:55

     13:18

    lsst-dev-db crashed

    NCSA

    The developer MySQL server lost network and crashed.

    lsst-dev-db MySQL database

    Status: Restored

     06:00

     11:00

    February lsst-dev maintenance (regular schedule)

    NCSA
    • Updating GPFS mounts to access new storage appliance
    • Rewire 2 PDUs in dev server room (hosts lsst-dev01, lsst-xfer, etc.)
    • Switch stack configuration changes in dev server room (hosts lsst-dev01, lsst-xfer, etc.)
    • Routine system updates
    • Firewall maintenance at datacenter (hosts PDAC, verification cluster, etc.)
    • Updates to system monitoring
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: complete

    • NOTE: GPFS was not remounted on qserv-dax01 until 4:27pm

    , 08:00

    , 08:30

    Slurm reconfiguration

    NCSA

    The Slurm scheduler on the verification cluster will be repartitioned from one queue (debug) into two:

    debug: 3 nodes, MaxTime=30 min

    normal: 45 nodes, MaxTime=INFINITE
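
As a rough illustration of how these limits affect job submission, the sketch below picks a partition for a request based only on the node counts and MaxTime values listed above. It does not model how Slurm itself schedules jobs, and preferring debug for requests that fit is an assumption made here for illustration. The name `pick_partition` is hypothetical.

```python
from typing import Optional

# Limits as listed above; max_minutes=None stands for MaxTime=INFINITE.
PARTITIONS = {
    "debug": {"nodes": 3, "max_minutes": 30},
    "normal": {"nodes": 45, "max_minutes": None},
}


def pick_partition(nodes: int, minutes: int) -> Optional[str]:
    """Return the first partition whose limits fit the request, else None.

    Assumption: short, small jobs are sent to debug before normal.
    """
    for name in ("debug", "normal"):
        limits = PARTITIONS[name]
        fits_nodes = nodes <= limits["nodes"]
        fits_time = (
            limits["max_minutes"] is None or minutes <= limits["max_minutes"]
        )
        if fits_nodes and fits_time:
            return name
    return None
```

Under these assumptions, a one-node, ten-minute test fits debug, while anything longer than 30 minutes or wider than 3 nodes has to go to normal.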

    No outages

    Status: complete

    Wed 1/24/2018 13:35

    Wed 1/24/2018 14:55

    Loss of LSST NFS services

    NCSA

    All NFS mounts for LSST systems were not working.

    NFS access on lsst-demo and lsst-SUI was not working.

    Status: Restored

     16:40

     21:00

    Firewall outage

    NCSA

    Both pfSense firewalls were accidentally powered off.

    PDAC (Qserv & SUI) and verification clusters were inaccessible, and GPFS issues arose across many services, e.g. lsst-dev01.

    Status: Restored

     06:00

     08:00

    January lsst-dev maintenance (regular schedule)

    NCSA
    • Routine system updates
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: Complete

     06:00

     11:30

    Critical patches on lsst-dev systems (incl. kernel updates)

    NCSA
    • Update kernel and system packages to address a security vulnerability.
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: COMPLETE

     09:00

     17:00

    Nebula

    NCSA

    Nebula (OpenStack) will be shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm.

    All Nebula systems unavailable.

    Status: COMPLETE


    Saturday  

    Tuesday  

    Support over holiday break

    NCSA

    2017-12-22 to 2018-01-01 (inclusive) is the University holiday period. Services will be operational. Please report problems via the JIRA IHS queue. The queue will be monitored by NCSA staff, and users will be notified via Jira whether and when their issue can be addressed.

    All services will be operational.

    Status: COMPLETE

    Wednesday  

    06:00

    Wednesday  

    08:00

    NFS Server switch

    NCSA

    NFS services will be moved to a different host.

    Brief outage of NFS services to SUI nodes, lsst-demo, lsst-demo2

    Status: Completed

    Wednesday  

    06:00

    Wednesday  

    07:00

    Firewall drive replacement

    NCSA

    The current pfSense firewall has a bad drive. If it fails, all nodes behind the firewall will be inaccessible. There are redundant firewalls, so no service interruptions are expected.

    None Expected

    Status: Completed

    Thursday 2017-12-14 04:00

    Thursday 2017-12-14, 10:00

    19:00

    December lsst-dev maintenance

    (off-schedule)

    NCSA
    • Due to holiday schedules, the December maintenance event is being moved up 1 week, from 2017-12-21 to 2017-12-14
    • Routine system updates
    • Network switch replacement
    • lsst-db server replacement
    • Further details here
    Do not expect any lsst-dev system to be available during this period.

    Status: Completed


    Tuesday 2017-11-28, 10:00

    TBD

    Rolling reboots of PDAC qserv nodes

    NCSA
    • In order to address a spontaneous rebooting issue with some qserv nodes, firmware upgrades are being performed.
    The occasional qserv node will need to be rebooted. Experience with the first couple will allow NCSA to give more precise information on the order and timing of the reboots.

    Status: COMPLETED

    2017-11-20 7:00

    2017-11-20 14:00

    Nebula OpenStack cluster

    NCSA

    Nebula OpenStack cluster will be unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch will be replaced.

    Not all instances will be impacted. If any running Nebula instances are affected by the outage they will be shut down, then restarted again after we finish maintenance that day.

    Status: Completed

    Thursday 2017-11-16 06:00

    Thursday 2017-11-16 10:00

    Extended monthly lsst-dev maintenance

    NCSA
    • Routine system updates.
    • Due to the volume of work that needs to be done, this event is being extended by 2 hrs. If systems become available before the end of the maintenance window, we will announce it here.
    • Be aware that this event will include an off-schedule purge of items in /scratch older than 180 days.
    Do not expect any lsst-dev system to be available during this period.

    Status: Completed

    2017-10-31

    NFS instability

    NCSA

    NFS becomes intermittently unresponsive.

    Status: ~stable

    We are guardedly optimistic that this problem has been resolved. PDAC is now utilizing native GPFS mounts.

    2017-10-24 09:50

    LSSTGPFS outage

    NCSA

    All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection.

    GPFS

    Status: Online

    Storage is working to bring GPFS back online

    2017-10-21 17:15

    LSSTpublic/protected network switch is down in rack N76 at NPCF


    nodes cannot communicate DNS, LDAP, etc. so largely cannot communicate with other nodes, e.g., no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, no communication between affected qserv-db nodes and the rest of qserv

    Effectively, the whole verification cluster

    Status: Restored

    in progress, replacement switch is on order

    Workaround in progress. If all goes well, systems should be back online by late afternoon.

    2017-10-19 06:00

    2017-10-19 14:00

    qserv-master replacement

    NCSA

    qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates here: IHS-378.

    qserv-master will be down for this entire period

    Status: Complete

    Archived events