Previous Outages & Events
Start | End | Event | Location | Planned Activities | Outcome | |
---|---|---|---|---|---|---|
2017-10-19 06:00 | 2017-10-19 10:00 | Monthly routine maintenance (extended) | NCSA | Routine patching and reboots, firewall firmware updates, server firmware updates. | All NCSA-hosted resources except for Nebula. | COMPLETE |
2017-10-19 06:00 | 2017-10-19 14:00 | qserv-master replacement | NCSA | qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates here: IHS-378 - PDAC qserv-master upgrade [IN PROGRESS]. | qserv-master will be down for this entire period | COMPLETE |
2017-07-20 04:00 | 2017-07-20 08:00 | Monthly lsst-dev maintenance | NCSA | See IHS-365 - scheduled maintenance for NCSA-hosted dev machines (July 20, 2017, 4:00 - 8:00am Pacific) [DONE] for details | verify-worker31 suffered a failure and will be out of commission for a while | |
2017-06-22 (06:00) | 2017-06-22 (10:00) | Critical kernel upgrades | NCSA | Upgrade kernel and system packages to address the Stack Guard Page vulnerability. See also: IHS-324 - Emergency update of all NCSA hosted dev machines [DONE] | All NCSA hosted resources (except Nebula). UPDATE: 08:00 PT - Outage is being extended until 10:00 PT. Outage was completed at 10:00 PT. Some nodes didn't come back. See ticket for details. | |
|
| Pushed back - yet to be rescheduled | ||||
2017-05-18 (06:00) | 2017-05-18 (08:00) | LSST monthly maintenance | NCSA | | Success. | |
2017-05-04 09:30 | 2017-05-04 10:00 | Unplanned lsst-dev file systems full | NCSA | The lsst-dev filesystems / and /home filled up at approximately 09:30. This was a result of inode usage from another process. | The admins freed up inodes to make the filesystems responsive again. Admins are currently tracking down the root cause. | |
2017-04-27 13:11 | 2017-04-27 14:20 | Unplanned Nebula outage | glusterfs crashed due to this bug, so no instances could access their filesystems | All instances running on Nebula | Needed to reboot the node that systems were mounting from, but took the opportunity to upgrade all gluster clients on other systems while waiting for a reboot. Version 3.10.1 fixes the bug. All instances with errors in their logs were restarted. | |
2017-04-20 (04:30) | 2017-04-20 (09:30) | LSST monthly maintenance | NCSA | This event is cancelled so as not to interfere with Early Integration Activity #03 being held at NCSA April 19 & 20. | nothing bad happened | |
2017-04-17 (13:41) | 2017-04-17 (13:53) | Unplanned lsst-dev login node down | NCSA | Users unable to log in to lsst-dev. Probable cause is that the root file system filled up due to excessive logging | Fixed | |
2017-03-27 (22:00) | 2017-03-29 (14:00) | Blue Waters maintenance | NCSA | Due to maintenance of cooling infrastructure at NPCF, Blue Waters will be down during this period. Cray will also use this maintenance window to perform some system updates. Systems that will be down
Systems that will remain up: Qserv nodes (lsst-qserv-*), SUI nodes (lsst-sui-*), and the bastion node (lsst-bastion01) should remain online during the outage. However, if temperatures in the NPCF rise too high, we will be forced to shut these down as well. I've been told that this is a low-probability scenario and that we will be given time to do graceful shutdowns. In the unlikely event that this happens, it will be communicated through the DM Slack channel and also posted here. | All systems normal | |
2017-03-23 (0800) | 2017-03-23 (1300) | NCSA Nebula Outage | NCSA | Nebula will take an outage to balance and build a more stable setup for the file system. This will require a pause of all instances, and Horizon being unavailable. | Nebula is back to normal. | |
2017-03-16 (0430) | 2017-03-16 (0930) | LSST monthly maintenance | NCSA | GPFS filesystems will go offline for the entire duration of the outage. Some systems may be rebooted, especially those that mount one or more of the GPFS filesystems. | ||
Aug. 24, 06:00 | Aug. 24, 13:30 | LSST Dev infrastructure upgrades | NCSA | | Completed successfully | |
Aug. 24, 06:00 | Aug. 24, 07:30 | LSST Dev patching | NCSA |
| Completed. Note the separate status message for "LSST Dev infrastructure upgrades", which includes systems in NCSA 3003 and is scheduled for 06:00 - 15:00. Maintenance on the following:
| |
Aug. 22, 06:00 | Decommissioning of lsst-dbdev machines | NCSA |
| Done. | ||
2017-06-15 0600 | 2017-06-15 0730 | Deploy unbound on LSST cluster nodes (verify-worker*, qserv*, sui*, bastion01, test*, backup01) | NCSA | DNS resolution may be briefly delayed (~30 minutes). | Updates deployed successfully via new puppet module. All tests passed. | |
2017-02-22 1415 | 2017-02-22 (1615) | Nebula Gluster Issues | NCSA | All Nebula instances paused while gluster repaired | Nebula is available. |