
Current Status

Status
colourYellowGreen
titleMaintenanceNormal

Start | Event | Location | Description | Systems/services that will NOT be available | Status

 06:00

Critical patches on lsst-dev systems (incl. kernel updates)
NCSA
  • Update kernel and system packages to address a security vulnerability.
Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)
Status: In progress

  • bastion01 will not boot; we are working with the vendor to resolve this.
  • All other systems are back in production, although PDAC will not be accessible until the bastion01 issue is resolved.

    Request Support

    Upcoming Scheduled Maintenance

    ...

     06:00
    Start | End | Event | Location | Description | Systems/services that will NOT be available | Status

     10:00

    Critical patches on lsst-dev systems (incl. kernel updates)
    NCSA
    • Update kernel and system packages to address a security vulnerability.
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: Scheduled

     06:00

     06:00

    January lsst-dev maintenance (regular schedule)
    NCSA
    • Routine system updates
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: Scheduled

    Third Thursday of every month 06:00

    Third Thursday of every month 08:00

    Recurring-Monthly

    Monthly lsst-dev maintenance

    NCSA
    • Routine system updates.
    Variable. Do not expect any lsst-dev system to be available during this period.

    Status: Scheduled

    Every Mon. 04:00


    Recurring- Weekly

    Purge of GPFS /scratch partition
    NCSA

    Per LSST data management policies, files older than 180 days will be purged from the LSST shared (GPFS) /scratch file system.

    Purge logs can be found in /gpfs/fs0/admin/purge_logs/scratch/

    No outage or service disruption.

    Status: SCHEDULED

    Every Tu. 08:00

    Every Tu. 10:00

    Recurring- Weekly

    Weekly Nebula Maintenance

    NCSA

    Routine system updates. Computational services continue to run.

    Systems/services that will NOT be available: Horizon and API interfaces.

    Status: Scheduled
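
The weekly /scratch purge listed above removes files older than 180 days. As a rough aid for spotting files that may be at risk, the following is a minimal Python sketch. It assumes that age is judged by modification time, which this page does not state, and the function name `files_older_than` is hypothetical. It is not the purge tool NCSA runs.

```python
import os
import sys
import time
from pathlib import Path

PURGE_AGE_DAYS = 180  # threshold taken from the purge policy on this page


def files_older_than(root, days=PURGE_AGE_DAYS, now=None):
    """Return a sorted list of files under `root` whose mtime is older than `days` days.

    Assumption: the purge keys on modification time. If it actually uses
    access time, swap st_mtime for st_atime.
    """
    now = time.time() if now is None else now
    cutoff = now - days * 86400  # seconds in a day
    old = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat avoids following symlinks out of the tree
                if path.lstat().st_mtime < cutoff:
                    old.append(path)
            except OSError:
                # file vanished or is unreadable; skip it
                continue
    return sorted(old)


if __name__ == "__main__" and len(sys.argv) > 1:
    # Usage: python find_old.py <directory under /scratch>
    for p in files_older_than(sys.argv[1]):
        print(p)
```

Run it against your own directory under /scratch to get a list of candidates, then compare with the purge logs mentioned above if you need to confirm what was actually removed.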








    ...

    Start | End | Event | Location | Planned Activities | Systems/services that will NOT be available | Status

     06:00

     11:30

    Critical patches on lsst-dev systems (incl. kernel updates)
    NCSA
    • Update kernel and system packages to address a security vulnerability.
    Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

    Status: COMPLETE

     09:00

     17:00

    Nebula

    NCSA

    Nebula (OpenStack) will be shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm.

    All Nebula systems unavailable.

    Status: COMPLETE


    Saturday  

    Tuesday  

    Support over holiday break

    NCSA

    2017-12-22 to 2018-01-01 (inclusive) is the University holiday period. Services will be operational. Please report problems via the Jira IHS queue. The queue will be monitored by NCSA staff, and users will be notified via Jira whether and when their issue can be addressed.


    All services will be operational.

    Status: COMPLETE

    Wednesday  

    06:00

    Wednesday  

    08:00

    NFS server switch

    NCSA

    NFS services will be moved to a different host.

    Brief outage of NFS services to SUI nodes, lsst-demo, and lsst-demo2.

    Status: Completed

    Wednesday  

    06:00

    Wednesday  

    07:00

    Firewall drive replacement

    NCSA

    The current pfSense firewall has a bad drive. If it fails, all nodes behind the firewall will be inaccessible. Because the firewalls are redundant, no service interruptions are expected.

    None expected.

    Status: Completed

    Thursday 2017-12-14 04:00

    Thursday 2017-12-14, 10:00

    19:00

    December lsst-dev maintenance

    (off-schedule)

    NCSA
    • Due to holiday schedules, the December maintenance event is being moved up 1 week, from 2017-12-21 to 2017-12-14
    • Routine system updates
    • Network switch replacement
    • lsst-db server replacement
    • Further details here
    Do not expect any lsst-dev system to be available during this period.

    Status: Completed


    Tuesday 2017-11-28, 10:00

    TBD

    Rolling reboots of PDAC qserv nodes

    NCSA
    • In order to address a spontaneous rebooting issue with some qserv nodes, firmware upgrades are being performed.
    The occasional qserv node will need to be rebooted. Experience with the first couple will allow NCSA to give more precise information on the order and timing of the reboots.

    Status: COMPLETED

    2017-11-20 07:00

    2017-11-20 14:00

    Nebula OpenStack cluster

    NCSA

    Nebula OpenStack cluster will be unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch will be replaced.

    Not all instances will be impacted. If any running Nebula instances are affected by the outage they will be shut down, then restarted again after we finish maintenance that day.

    Status: Completed

    Thursday 2017-11-16 06:00

    Thursday 2017-11-16 10:00

    Extended monthly lsst-dev maintenance

    NCSA
    • Routine system updates.
    • Due to the volume of work that needs to be done, this event is being extended by 2 hrs. If systems become available before the end of the maintenance window, we will announce it here.
    • Be aware that this event will include an off-schedule purge of items in /scratch older than 180 days.
    Do not expect any lsst-dev system to be available during this period.

    Status: Completed

    2017-10-31
    NFS instability

    NCSA

    NFS becomes intermittently unresponsive.

    Status: ~stable

    We are guardedly optimistic that this problem has been resolved. PDAC is now utilizing native GPFS mounts.

    2017-10-24 09:50

    LSST GPFS outage

    NCSA

    All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection.

    GPFS

    Status: Online

    Storage is working to bring GPFS back online

    2017-10-21 17:15

    LSSTpublic/protected network switch is down in rack N76 at NPCF


    Affected nodes cannot reach DNS, LDAP, etc., and so largely cannot communicate with other nodes. For example, there is no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, and none between affected qserv-db nodes and the rest of qserv.

    Effectively, the whole verification cluster.

    Status: Restored

    in progress, replacement switch is on order

    Workaround in progress. If all goes well, systems should be back online by late afternoon.

    2017-10-19 06:00

    2017-10-19 14:00

    qserv-master replacement

    NCSA

    qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates are tracked in Jira issue IHS-378.

    qserv-master will be down for this entire period

    Status: Complete

    ...