Current Status
Start | End | Event | Location | Description | Systems/services that will NOT be available | Status |
---|---|---|---|---|---|---|
Scheduled Maintenance
Start | End | Event | Location | Description | Systems/services that will NOT be available | Status |
---|---|---|---|---|---|---|
Recurring Scheduled Maintenance
(All times are Project Time (Pacific))
Start | End | Event | Location | Description | Systems/services that will NOT be available | Status |
---|---|---|---|---|---|---|
Last Thursday of every month 06:00 | Last Thursday of every month 10:00 | | NCSA | | Variable. Do not expect any lsst-dev system to be available during this period. | |
Every Mon. 04:00 | Recurring - Weekly | Purge of GPFS /scratch partition | | Per LSST data management policies, files older than 180 days will be purged from the LSST shared (GPFS) /scratch file system. Purge logs can be found in /lsst/admin/purge_logs/scratch/ | | |
Every Tu. 08:00 | Recurring - Weekly | Weekly Nebula Maintenance | | | | |
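As a rough illustration of the two recurring rules above, the following Python sketch computes the last Thursday of a month and lists files that have passed the 180-day threshold. The function names are hypothetical, and using modification time as the age measure is an assumption; the policy text does not say which timestamp the actual purge uses.

```python
import calendar
import os
import time
from datetime import date

PURGE_AGE_DAYS = 180  # threshold taken from the purge policy stated above


def last_thursday(year: int, month: int) -> date:
    """Return the date of the last Thursday of the given month."""
    last_day = calendar.monthrange(year, month)[1]
    d = date(year, month, last_day)
    # Step back from the final day of the month to the nearest Thursday.
    offset = (d.weekday() - calendar.THURSDAY) % 7
    return d.replace(day=last_day - offset)


def purge_candidates(root, now=None, max_age_days=PURGE_AGE_DAYS):
    """Yield files under root whose mtime is older than max_age_days.

    Assumption: age is measured by modification time (st_mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.lstat(path).st_mtime < cutoff:
                    yield path
            except OSError:
                continue  # file vanished or is unreadable; skip it
```

For example, `last_thursday(2020, 2)` returns 27 February 2020, which matches the February 2020 monthly maintenance listed under Previous Outages & Events. Running `purge_candidates` on a directory you own only reports files; it deletes nothing.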
Previous Outages & Events
Start | End | Event | Location | Description | Systems/services that were NOT available | Status | |||||
---|---|---|---|---|---|---|---|---|---|---|---|
Wednesday 2022-08-17 10:00 | N/A | NCSA hosted services were powered down. | NCSA | ALL services hosted at NCSA are being transitioned to other LSST sites. On Wednesday, 17 Aug, all services were stopped and servers powered down. (This was delayed from the originally scheduled date of 15 Aug.) | ALL services hosted at NCSA |
|
|
| |||||||||||||||||||||||
Friday 2022-07-08 15:00 | Monday 2022-07-11 06:00 | NCSA building power outage | NCSA | NCSA building will have no power from 7AM - 5PM on Sunday, 10 July. Affected servers will be shut down at COB (local time) on Friday and restarted on Monday morning. | Firefly service (lsst-demo.ncsa.illinois.edu) (Grafana for LSST users) lsst-dm-monitor.ncsa.illinois.edu (Globus DTN) lsst-xfer.ncsa.illinois.edu |
| |||||||||||||||||
Tuesday | Tuesday | Issues on SUI nodes after PM. | NCSA |
|
|
| |||||||||||||||||
Tuesday | Tuesday | Some systems having issues after firmware and OS updates. | NCSA |
|
|
| |||||||||||||||||
Tuesday | Tuesday | Kubernetes still down after PM | NCSA |
|
|
| |||||||||||||||||
Tuesday | Tuesday | Quarterly NCSA Maintenance | NCSA |
| ALL services hosted at NCSA |
| |||||||||||||||||
Friday 2021-11-19 10:52 | Friday 2021-11-19 11:25 | VM server at NCSA crashed | NCSA | lsst-esxi08 crashed | The following VMs rebooted: ldap-lsst-ncsa3 |
| |||||||||||||||||
Thursday | Thursday | Quarterly NCSA Maintenance | NCSA |
| ALL services hosted at NCSA |
| |||||||||||||||||
Thursday | Thursday | Emergency NCSA Maintenance | NCSA |
| ALL services hosted at NCSA EXCEPT
|
| |||||||||||||||||
2021-07-28 08:00 | Wednesday 2021-07-28 8:50 | NCSA Test Stand Updates | NCSA |
| All services in the NCSA Test Stand (NTS) |
| |||||||||||||||||
2021-06-24 | 2021-06-24 | Prod/Stable k8s updates | NCSA |
| Prod/Stable k8s |
| |||||||||||||||||
2021-06-24 | 2021-06-24 | Quarterly NCSA Maintenance | NCSA |
| ALL services hosted at NCSA |
except Prod/Stable k8s environment | |||||||||||||||||
2021-05-20 03:40 | 2021-05-20 06:45 | ESXi host outage | NCSA | ESXi host outage causing degradation of select services. | Degradation of select services:
Also loss of redundancy for some underlying services, including auth/access & k8s head nodes. |
| |||||||||||||||||
2021-04-29 1400 | 2021-04-29 1500 | Add new nodes into Condor service pools | NCSA | Add new nodes into Condor service pools:
| Minor risk of interruptions in:
|
| |||||||||||||||||
Thursday | Thursday | Quarterly NCSA Maintenance | NCSA |
| ALL services hosted at NCSA |
| |||||||||||||||||
Wednesday 2021-01-27 09:40 | Wednesday 2021-01-27 10:10 | Patched sudo package | NCSA | The |
|
| |||||||||||||||||
Thursday 2020-10-01 08:00 | Thursday 2020-10-01 09:30 | Changed SSH Access & Retiring Services at NCSA | NCSA |
|
|
| |||||||||||||||||
Thursday | Thursday | HTcondor reservation still blocking new jobs | NCSA | PM leftover, work still in progress to remove the reservation so new jobs can run. | HTCondor |
| |||||||||||||||||
Wednesday | Thursday | Monthly server maintenance | NCSA | GPFS version upgrade from 4.x to 5.x. Routine system OS and firmware updates. | ALL services hosted at NCSA |
| |||||||||||||||||
12:26 (GMT -4) | 22:49 (GMT -4) | Main link Santiago - La Serena down | LHN Path | Fiber cut in the main link Santiago - La Serena | No LHN connection to Rubin |
| |||||||||||||||||
2020-06-24 | 2020-06-25 Kubernetes upgrade completed at 2020-06-25 1330 | Monthly server maintenance | NCSA | Routine system OS and firmware updates. GPFS firmware updates (fixes network issues). Two significant changes are being applied to the
| ALL services hosted at NCSA |
| |||||||||||||||||
2020-06-18 | 2020-06-18 | Developer Web Server Upgrade | NCSA | NCSA replaced the web server that hosts http://lsst-web.ncsa.illinois.edu/ . All old URLs now redirect to a new hostname/URL under https://lsst.ncsa.illinois.edu/ . | lsst.ncsa.illinois.edu |
| Start | End | Event | Location | Description | Systems/services that were NOT available | Status|||||||||||
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
2020-02-27 06:00 | 2020-02-27 12:00 | Monthly LSST system maintenance | NCSA |
| ALL LSST systems will be updated, including:
|
| |||||||||||||||||
2020-02-17 18:00 | 2020-02-17 21:25 | LDAP and authentication interruption | NCSA | Intermittent timeouts on LSST's LDAP replica servers at NCSA caused authentication issues for most LSST servers at NCSA. This was triggered by replication timeouts after NCSA's primary LDAP server outage earlier around 16:00. LDAP issues for LSST resources did not occur till about 18:00. The LSST LDAP replica servers recovered around 19:20. Later, we discovered that many servers needed their SSSD cache manually cleared to allow authentication, which was resolved around 21:20. | Authentication to most LSST systems, including:
|
| |||||||||||||||||
2020-01-30 06:09 | 2020-01-30 07:51 | Shared filesystem interruption and errant "New LSSTdev Account" emails | NCSA | Many LSST users who have NCSA accounts received an errant email this morning with a subject of "New LSSTdev Account at NCSA". This email can be ignored. The email was caused by a bug in provisioning scripts that was triggered by a short shared filesystem interruption. The shared filesystem issue has now been resolved. | Many LSST systems (all native GPFS clients):
|
| |||||||||||||||||
2020-01-15 05:00 | 2020-01-15 12:00 | Hardware repair in NCSA Test Stand | NCSA |
| 19 of 52 active servers in the NCSA Test Stand, including:
Most LSST servers will remain up. |
| |||||||||||||||||
2019-12-12 06:00 | 2019-12-12 12:00 | Monthly LSST system maintenance | NCSA |
| ALL LSST systems will be updated, including:
|
| |||||||||||||||||
27 Sep 2019 7:00pm (PT) | 28 Sep 2019 11:30pm (PT) | Building power maintenance & GPFS firmware upgrade | LDF (NCSA) |
| ALL LSST systems, including:
|
Oracle is still down but is expected to be returned to service later this afternoon. | |||||||||||||||||
18 Jul 2019 8:15am (PT) | 18 Jul 2019 2:40pm (PT) | lsst-oradb down | LDF (NCSA) | The primary Oracle service was down after this morning's planned maintenance due to issues accessing NetApp storage. | lsst-oradb was down
|
| |||||||||||||||||
18 Jul 2019 6:00am (PT) | 18 Jul 2019 8:15am (PT) | Monthly maintenance | LDF (NCSA) |
| ALL LSST systems, including:
|
| |||||||||||||||||
23 Jun 2019 5:00am (PT) | 23 Jun 2019 12:00 Noon (PT) | Full building power maintenance | LDF (NCSA) | Full building power outage in NPCF facility at NCSA. End time is approximate. |
|
| |||||||||||||||||
21-May-2019 9:00am (PT) | 22-May-2019 5:00pm (PT) | k8s cluster migration | NCSA | Migrate old Kubernetes cluster to redeployed clusters with redundant head nodes. UPDATE: lsp-stable was primarily stable a few hours after the planned end time on May 21st. lsp-int services required an extra day to stabilize. | All k8s at NCSA including: lsp-stable lsp-int |
| |||||||||||||||||
16-May-2019 6:00am (PT) | 16-May-2019 10:00am (PT) | Monthly maintenance | LDF (NCSA) | Update authentication to use new LDAP & Kerberos servers. No interruption of service or downtime is expected. | No interruption of service or downtime. ALL LSST systems, including:
|
| |||||||||||||||||
10-May-2019 07:30am | 10-May-2019 08:10am | LSST Identity | LDF (NCSA) | Upgrade PHP to v7 on LSST Identity website. | Expect LSST Identity website will have minimal, momentary downtime |
| |||||||||||||||||
23-Apr-2019 10am | 23-Apr-2019 3pm | CILogon AWS | AWS | On April 23, 2019, the Amazon Web Services (AWS) infrastructure supporting the CILogon COmanage Registry and LDAP services will be modified to increase the high availability (HA) posture. A new network load balancer (NLB) will be introduced and DNS entries modified to point to the new NLB interfaces. The existing NLB interfaces will continue to function for 72 hours after the transition to support any clients that have cached the older (current) DNS mappings. | No anticipated outages. Contact help@cilogon.org if there are problems with the schedule. |
| |||||||||||||||||
18-Apr 2019 6:00am (PT) | 18-Apr 2019 10:00am (PT) | Monthly maintenance | LDF (NCSA) |
| ALL LSST systems, including:
|
| |||||||||||||||||
4Apr2019 6:00AM (PT) | 4Apr2019 7:00AM (PT) | Network Maintenance | NCSA | Network engineers at NCSA migrated switches servicing L1 test stand and others within NCSA-3003 to a new router. A brief blip (<60s) took place as router interfaces were migrated. | L1 Test Stand NCSA-3003 Infrastructure |
| |||||||||||||||||
4/2/2019 7am | 4/2/2019 8am | CILogon will be upgraded | NCSA | CILogon is upgrading. The new code can be tested now at test.cilogon.org. | No outage is expected. The new release is simply being moved from test to production. |
| |||||||||||||||||
19-Mar-2019 04:15 | 19-Mar-2019 07:58 | lsst-dev01 GPFS issue | LDF (NCSA) | The lsst-dev01 server was repeatedly being expelled from GPFS cluster after unexpected socket errors. | lsst-dev01 |
A reboot of the server resolved the socket errors. | |||||||||||||||||
12-Mar 2019 5am (PST) | 13-Mar 2019 3:45pm (PST) | Network testing with LSSTdev Slurm compute nodes | LDF (NCSA) | 24 of the LSSTdev/Slurm compute nodes were reserved for admin use for this testing | verify-worker[13-36] | Testing was extended into the 13th but was completed and nodes were returned to service | |||||||||||||||||
12-Mar 2019 11:25am (PST) | 12-Mar 2019 12:25pm (PST) | LSST Oracle service unavailable | LDF (NCSA) | public DNS names were inadvertently removed for LSST's Oracle servers/service | LSST Oracle service at the LDF (lsst-oradb) |
| |||||||||||||||||
09-Mar-2019 8:35pm | 09-Mar-2019 8:35pm | Host reboots due to power fluctuation | LDF (NCSA) | 27 L1 "NCSA test stand" nodes rebooted
| 27 L1 "NCSA test stand" nodes |
| |||||||||||||||||
06-Mar-2019 6am (PST) | 06-Mar-2019 7am (PST) | pfsense firewall config update. | NCSA | pfsense network config update to stage 'k8s-prod' deployment. Requires failover of firewall, and may cause short (~60s) outage of systems behind the firewall. | All services behind pfsense firewall at NCSA. (qserv, verify, lsp, oradb) |
| |||||||||||||||||
21-Feb-2019 6:00am (PT) | 21-Feb-2019 10:00am (PT) | Monthly maintenance | NCSA |
| ALL systems operated by NCSA, including:
|
| |||||||||||||||||
2/18/2019 - 7am (PT) | 2/18/2019 - 9am(PT) | K8 upgrade for security reasons | LDF | Security vulnerabilities require an update to the K8/Docker infrastructure at LDF. Upgrade Docker from `17.03.1` to `18.09.2`. Upgrade completed on time but some additional troubleshooting had to be done to get lsp-stable and lsp-int back online. | K8 nodes | Emergency | |||||||||||||||||
17-Jan-2019 6:00am | 17-Jan-2019 10:00am | Monthly maintenance | NCSA |
| ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01) |
Please open tickets if you notice other issues. | |||||||||||||||||
18-Dec-2018 9:46am | 18-Dec-2018 11:00am | Host reboots due to power fluctuation | LDF (NCSA) | A power event caused some hosts to reboot:
| lspdev was unavailable from ~09:40 until ~11:00am |
Systems are back online and should be functioning, but please open tickets if there are lingering issues. | |||||||||||||||||
5-Dec-2018 7:00am | 5-Dec-2018 8:30am | PDAC and lspdev k8s merge | NCSA | The PDAC k8s environment was merged into the lspdev k8s cluster. Services will continue to be isolated through Kubernetes namespaces, labels, taints, etc. | Services running in PDAC Kubernetes |
| |||||||||||||||||
29-Nov-2018 6:00am | 29-Nov-2018 12:00 noon | Monthly maintenance | NCSA |
| ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01) |
| |||||||||||||||||
13-Nov-2018 5:30 PST | 13-Nov-2018 6:30 PST | lspdev cluster reboot | NCSA |
| lspdev/Kubernetes cluster |
| |||||||||||||||||
10-Nov-2018 ~02:40 | 10-Nov-2018 ~02:45 | Host reboots due to power fluctuation | LDF (NCSA) | A power event caused some hosts to reboot:
| lspdev was unavailable from ~02:40 until ~07:30 |
Systems are back online and should be functioning, but please open tickets if there are lingering issues. | |||||||||||||||||
11/6/2018 5am (PT) | 11/6/2018 1pm (PT) | Power maintenance | LDF (NCSA) | Some power distribution panels are being worked on, but should NOT cause any LSST environment disruptions. | None |
| |||||||||||||||||
01-Nov-2018 10:00 | 01-Nov-2018 10:05 | Critical security patching | NCSA | Addressed vulnerability CVE-2018-14665 on the 3 lsst-dev hosts. | No interruption of service. |
| |||||||||||||||||
18-Oct-2018 06:00 | 18-Oct-2018 10:00 | Monthly maintenance | NCSA | Activities are minimal this month and are expected to cause little impact:
|
|
| |||||||||||||||||
15-Oct-2018 05:35 | 15-Oct-2018 07:15 | Power event -> host outage at one datacenter | NCSA | A power blip caused all physical hosts at the NCSA building to power off or reboot.
| affected: all physical LSST hosts (and VMs) at the NCSA building:
unaffected: all physical LSST hosts (and VMs) at the NPCF building:
|
| |||||||||||||||||
04-Oct-2018 06:00 | 04-Oct-2018 07:15 | Critical security patching | NCSA | An incorrect date (Oct 1) was initially posted for this maintenance. The correct date is Thu, Oct 4. | ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected:
|
| |||||||||||||||||
20-Sep-2018 06:00 | 22-Sep-2018 14:50 | Qserv Master outage | NCSA | qserv-master01 is having trouble booting after a motherboard replacement during planned maintenance. | Qserv in general, specifically qserv-master |
| |||||||||||||||||
20-Sep-2018 06:00 | 20-Sep-2018 12:40 | LSPdev Kubernetes | NCSA |
| LSPdev |
| |||||||||||||||||
20-Sep-2018 06:00 | 20-Sep-2018 12:00 | Monthly maintenance | NCSA |
| All systems will be unavailable during this period. |
qserv-master01 and LSPdev are still having issues. These will be tracked as separate incidents. | |||||||||||||||||
09-Aug-2018 09:00 | 09-Aug-2018 09:37 | lsst-dev01 Outage | NCSA | The lsst-dev01 server was unreachable for >60sec from the GPFS cluster and got expelled from the GPFS cluster. Open file handles and/or bind mounts from GPFS prevented lsst-dev01 from reconnecting to GPFS until it was rebooted. We suspect that a big job on the Slurm cluster may have contributed to some network congestion that triggered this. | lsst-dev01 |
| |||||||||||||||||
03-Aug-2018 10:00 | 03-Aug-2018 13:30 | NCSA VPN was not working for some users. | NCSA | A configuration issue caused connection problems to some NCSA resources for some VPN users. | NCSA VPN |
| |||||||||||||||||
29-Jul-2018 | 03-Aug-2018 05:45 | Bulk Transfer Server Rebuild | NCSA | The Globus endpoint on lsst-xfer stopped working on July 29 after a certificate from the outdated GridFTP service expired. lsst-xfer was rebuilt and upgraded with CentOS 7.5, Globus Connect Server (v4), bbcp (17.12), and iRODS client (4.2.3). Globus bookmarks to the lsst#lsst-xfer endpoint may need to be updated to point to the rebuilt endpoint. | Globus on lsst-xfer |
| |||||||||||||||||
27-Jul-2018 | 27-Jul-2018 | NCSA VPN Migration | NCSA | NCSA will be migrating to a new VPN with multi-factor authentication. The new VPN is currently available, and users are encouraged to start using the new VPN before the cutoff date in order to ensure continued connectivity. All users must be registered with NCSA's Duo before they can use the new VPN. Links to the how-to article as well as the new VPN and Duo login are included below. | No interruption of service is expected. |
| |||||||||||||||||
19-Jul-2018 10:00 | 19-Jul-2018 10:30 | DB services on lsst-dev-db are unavailable along with dependent services, including:
| NCSA | MariaDB service did not start on lsst-dev-db after maintenance. There is a newer setting in MariaDB that didn't like the current mount point. | DB services on lsst-dev-db Services that depend on lsst-dev-db, including:
|
| |||||||||||||||||
19-Jul-2018 06:00 | 19-Jul-2018 10:00 | Monthly lsst-dev maintenance | NCSA |
| ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected:
|
DB services on lsst-dev-db will not start after maintenance, impacting dependent services such as lspdev. This will be tracked in a separate status event. | |||||||||||||||||
27-Jun-2018 07:00 | 27-Jun-2018 11:00 | lspdev outage | NCSA | The Kubernetes head node unexpectedly rebooted at approximately 7:00 AM, causing a JupyterHub outage. Service was brought back online around 11:00 AM. | lsst-kub0[01-20] |
| |||||||||||||||||
27-Jun-2018 06:10 | 27-Jun-2018 06:30 | Monitoring Update | NCSA | First phase of enabling encryption on monitoring traffic | Monitoring Dashboards | ||||||||||||||||||
21-Jun-2018 06:00 | 21-Jun-2018 07:35 | Monthly lsst-dev maintenance | NCSA |
| CentOS 6.9 servers:
Slurm/verification cluster Other impact is not expected but unexpected issues could lead to connectivity issues for other hosts or downtime for lsst-dev01 or hosted VMs |
| |||||||||||||||||
18-Jun-2018 11:00 | 19-Jun-2018 17:00 | Nebula outage | NCSA | Nebula is undergoing a complete reboot. Last week's storms damaged more than just one node initially thought to be affected. | Nebula will be unavailable until 15:00 (5pm CDT) |
| |||||||||||||||||
19-Jun-2018 06:00 | 19-Jun-2018 10:00 | Level One Test Stand Maintenance | NCSA |
| Level One Test Stand, including:
|
| |||||||||||||||||
12-Jun-2018 ~01:40 PDT | 12-Jun-2018 07:01 PDT | Storm → outage of Kubernetes Commons & 75% of verification cluster compute nodes | NCSA | A storm caused a power event at the NPCF datacenter taking down Kubernetes commons and lspdev as well as 75% of the verification cluster compute nodes. |
|
| |||||||||||||||||
17-May-2018 11:30 | 17-May-2018 12:25 | Grafana monitoring was offline | NCSA | The InfluxDB data used by Grafana monitoring was offline while its storage was rebuilt | https://monitor-ncsa.lsst.org/ monitoring data was offline |
| |||||||||||||||||
17-May-2018 06:00 | 17-May-2018 11:30 | Monthly lsst-dev maintenance | NCSA |
| ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected:
|
| |||||||||||||||||
30-Apr-2018 18:37 | 14-May-2018 15:00 | Security & AA infrastructure offline | La Serena | The Security & AA infrastructure went offline around 18:37 Project Time. None of the infrastructure is accessible via the network. A UPS had to be replaced and an electrical circuit upgraded for the replacement UPS. | None. |
| |||||||||||||||||
11-Apr-2018 06:00 | 07-May-2018 10:30 | production-size run (HSC-PDR1) on the verification cluster | NCSA | Per IHS-749, ~15 nodes of the batch compute resources will be reserved in order to complete HSC-PDR1 data runs. It is expected that the reservation can be scaled back to <10 after the first couple of weeks. | All systems available. |
| |||||||||||||||||
25-Apr-2018 11:30 | 25-Apr-2018 12:40 | Test new puppet changes for sssd and ldap access on SUI* nodes. | NCSA | A minor change to sssd service configuration needs to be rolled out to all nodes. The change will require a momentary outage of the sssd service and some actions will take longer (for a short period of time) as cache is repopulated. Changes in puppet structure (affecting ldap group sync) are also in need of testing and can happen simultaneously. | Affected services:
|
| |||||||||||||||||
04/24/2018 07:10 | 04/24/2018 07:50 | Increased LDAP timeout to 60 seconds in sssd.conf | NCSA | Increased LDAP timeout to 60 seconds in sssd.conf to fix problems with long login times and failure to start batch jobs. We will coordinate in the near future to apply the same change on qserv* & sui*. | Affected nodes: kub*, verify-worker* |
All nodes are back in service, although affected nodes may have slow LDAP response times for a short while (because the local cache needs to be rebuilt). | |||||||||||||||||
19 Apr 2018 | 19 Apr 2018 | Monthly lsst-dev maintenance | NCSA | CANCELLED. No major work is needed and key personnel are travelling. Deployment of the new DTN and VM infrastructure will be delayed until after the May maintenance period. | N/A |
| |||||||||||||||||
4/11 at 07:00 | 4/11 at 08:00 | Firewall update at NCSA | NCSA | Per LSST-1257, the primary firewall needs to have its routing software updated. No failover is required and traffic will continue to flow through the primary firewall during the upgrade. | No outage or service disruption. |
| |||||||||||||||||
4/3/2018 16:40 | 4/3/2018 16:45 | LDAP problems | NCSA | Causing new logins to the LSST resources at NCSA to hang. | new logins can't take place right now. | fixed. | |||||||||||||||||
3/26/2018 08:00 | 4/2/2018 9:00 | A fileserver on Nebula became unstable, resulting in diminished performance for some instances and volumes. | NCSA | Any instances or volumes hosted on the healing filesystem will be impacted, or approximately 20% of instances and volumes. | We are migrating instances around to | ||||||||||||||||||
3/15/2018 10:20 am PT | 3/15/2018 14:20 PT | Lingering issues on select nodes following March PM | NCSA | Select nodes had issues coming out of the PM. |
|
| |||||||||||||||||
3/15/2018 10:20 am PT | 3/15/2018 11:23am PT | Lingering issues on select nodes following March PM | NCSA | Select nodes had issues coming out of the PM. |
|
| |||||||||||||||||
3/15/2018 6:00 am PT | 3/15/2018 10:20 am PT | March lsst-dev maintenance (regular schedule) | NCSA |
| Systems/services that were NOT available: ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification and Kubernetes clusters) |
Select nodes (lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, lsst-l1-cl-dmcs) required additional attention following the PM, as noted in a separate status entry. | |||||||||||||||||
3/12/2018 7:00 am PT | 3/12/2018 3:00pm PT | Nebula (OpenStack resource) is down | NCSA | Nebula is being taken down for patches to be applied across the whole infrastructure. | All containers on Nebula are going down. |
| |||||||||||||||||
07 Mar 2018 13:00 | 07 Mar 2018 14:10 | qserv-db12 maintenance | NCSA | qserv-db12 had one failed drive in the OS mirror replaced but the other is presenting errors as well so the RAID cannot rebuild. The node was taken down for replacement of the 2nd disk, to rebuild the RAID in the OS volume, and to reinstall the OS. | qserv-db12 |
| |||||||||||||||||
09:02 | 09:21 | lsst-dev01 Out of Space | NCSA | The main / drive partition ran out of space due to a user's faulty pip build. The faulty files were moved elsewhere for the user to review. | lsst-dev01 |
| |||||||||||||||||
27 Feb 2018 08:40 | 27 Feb 2018 09:40 | Puppet maintenance at NCSA | NCSA | Enable environment isolation on puppet master | No outage or service disruption is expected. |
| |||||||||||||||||
06:00 | 07:00 | Puppet updates | NCSA | Rolled out significant logic and organization changes to the Puppet resources in the NCSA 3003 data center in order to standardize between LSST Puppet environments at NCSA. We had done extensive testing and did not expect any outages or disruption of services. | None. Changes were applied to: |
| |||||||||||||||||
12:55 | 13:18 | lsst-dev-db crashed | NCSA | The developer MySQL server lost network and crashed. | lsst-dev-db MySQL database |
| |||||||||||||||||
06:00 | 11:00 | February lsst-dev maintenance (regular schedule) | NCSA |
| Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster) |
| |||||||||||||||||
, 08:00 | , 08:30 | Slurm reconfiguration | NCSA | The Slurm scheduler on the verification cluster will be repartitioned from one queue (debug) into two: debug (3 nodes, MaxTime=30 min) and normal (45 nodes, MaxTime=INFINITE). | No outages |
| |||||||||||||||||
Wed 1/24/2018 13:35 | Wed 1/24/2018 14:55 | Loss of LSST NFS services | NCSA | All NFS mounts for LSST systems were not working | NFS access on lsst-demo and lsst-SUI was not working |
| |||||||||||||||||
16:40 | 21:00 | Firewall outage | NCSA | Both pfSense firewalls were accidentally powered off. | PDAC (Qserv & SUI) and verification clusters were inaccessible, and the outage introduced GPFS issues across many services, e.g. lsst-dev01. |
| |||||||||||||||||
06:00 | 08:00 | January lsst-dev maintenance (regular schedule) | NCSA |
| Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster) |
| |||||||||||||||||
06:00 | 11:30 | Critical patches on lsst-dev systems (incl. kernel updates) | NCSA |
| Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster) |
| |||||||||||||||||
09:00 | 17:00 | Nebula | NCSA | Nebula (OpenStack) will be shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm. | All Nebula systems unavailable. |
| |||||||||||||||||
Saturday | Tuesday | Support over holiday break | NCSA | 2017-12-22 to 2018-1-01 (inclusive) is the University holiday period. Services will be operational. Please report problems via the JIRA IHS queue. The queue will be monitored by NCSA staff and users will be notified via Jira as to if or when their issue can be addressed. | All services will be operational. |
| |||||||||||||||||
Wednesday 06:00 | Wednesday 08:00 | NFS Server switch | NCSA | NFS services will be moved to a different host | brief outage of NFS services to SUI nodes, lsst-demo, lsst-demo2 |
| |||||||||||||||||
Wednesday 06:00 | Wednesday 07:00 | Firewall drive replacement | NCSA | The current pfSense firewall has a bad drive. If it fails, all nodes behind the firewall will be inaccessible. There are redundant firewalls, so no service interruptions are expected. | None Expected |
| |||||||||||||||||
Thursday 2017-12-14 04:00 | Thursday 2017-12-14, 19:00 | December lsst-dev maintenance (off-schedule) | NCSA |
| Do not expect any lsst-dev system to be available during this period. |
| |||||||||||||||||
Tuesday 2017-11-28, 10:00 | TBD | Rolling reboots of PDAC qserv nodes | NCSA |
| The occasional qserv node will need to be rebooted. Experience with the first couple will allow NCSA to give more precise information on the order and timing of the reboots. |
| |||||||||||||||||
2017-11-20 7:00 | 2017-11-20 14:00 | Nebula OpenStack cluster | NCSA | Nebula OpenStack cluster will be unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch will be replaced. | Not all instances will be impacted. If any running Nebula instances are affected by the outage they will be shut down, then restarted again after we finish maintenance that day. |
| |||||||||||||||||
Thursday 2017-11-16 06:00 | Thursday 2017-11-16 10:00 | Extended monthly lsst-dev maintenance | NCSA |
| Do not expect any lsst-dev system to be available during this period. |
| |||||||||||||||||
2017-10-31 | NFS instability | NCSA | NFS becomes intermittently unresponsive. |
We are guardedly optimistic that this problem has been resolved. PDAC is now utilizing native GPFS mounts. | |||||||||||||||||||
2017-10-24 09:50 | LSST | GPFS outage | NCSA | All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection. | GPFS |
Storage is working to bring GPFS back online | |||||||||||||||||
2017-10-21 17:15 | LSST | public/protected network switch is down in rack N76 at NPCF | nodes cannot reach DNS, LDAP, etc., so largely cannot communicate with other nodes, e.g., no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, no communication between affected qserv-db nodes and the rest of qserv | Effectively, the whole verification cluster |
In progress; replacement switch is on order. Workaround in progress. If all goes well, systems should be back online by late afternoon. | ||||||||||||||||||
2017-10-19 06:00 | 2017-10-19 14:00 | qserv-master replacement | NCSA | qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates here:
| qserv-master will be down for this entire period |
|