LSST Service Status page

All times listed are Project Time (Pacific)

Current Status

Start	End	Event	Location	Description	Systems/services that will NOT be available	Status

Upcoming Scheduled Maintenance

Start

End

Event

Location

Description

Systems/services that will NOT be available

Status

2020-02-27 06:00

2020-02-27 12:00

Monthly LSST system maintenance

NCSA

OS updates and reboots
Other updates as needed.

ALL LSST systems will be updated, including:

TBD

TBD

Recurring Scheduled Maintenance

(All times are Project Time (Pacific))

Start

End

Event

Location

Description

Systems/services that will NOT be available

Status

Last Thursday of every month 06:00

Last Thursday of every month 10:00

Recurring-Monthly

Monthly lsst-dev maintenance

NCSA

Routine system updates.

Variable. Do not expect any lsst-dev system to be available during this period.

SCHEDULED

Every Mon. 04:00

Recurring- Weekly
Purge of GPFS /scratch partition

NCSA

Per LSST data management policies, files older than 180 days will be purged from the LSST shared (GPFS) /scratch file system.

Purge logs can be found in /lsst/admin/purge_logs/scratch/

No outage or service disruption.

SCHEDULED

Every Tu. 08:00

Every Tu. 10:00

Recurring- Weekly

Weekly Nebula Maintenance

NCSA

Routine system updates. Computational services continue to run.

Horizon and API interfaces.

SCHEDULED

Previous Outages & Events

Start	End	Event	Location	Description	Systems/services that were NOT available	Status
2020-02-17 18:00	2020-02-17 21:25	LDAP and authentication interruption	NCSA	Intermittent timeouts on LSST's LDAP replica servers at NCSA caused authentication issues for most LSST servers at NCSA. This was triggered by replication timeouts after NCSA's primary LDAP server outage earlier around 16:00. LDAP issues for LSST resources did not occur till about 18:00. The LSST LDAP replica servers recovered around 19:20. Later, we discovered that many servers needed their SSSD cache manually cleared to allow authentication, which was resolved around 21:20.	Authentication to most LSST systems, including: lsst-dev01, lsst-xfer, etc. PDAC/Kubernetes/LSP clusters NCSA Test Stand	RESOLVED
2020-01-30 06:09	2020-01-30 07:51	shared filesystem interruption errant "New LSSTdev Account" emails	NCSA	Many LSST users who have NCSA accounts received an errant email this morning with a subject of "New LSSTdev Account at NCSA". This email can be ignored. The email was caused by a bug in provisioning scripts that was triggered by a short shared filesystem interruption. The shared filesystem issue has now been resolved.	Many LSST systems (all native GPFS clients): lsst-dev01, lsst-xfer, etc. Slurm verification cluster	RESOLVED
2020-01-15 05:00	2020-01-15 12:00	Hardware repair in NCSA Test Stand	NCSA	21 servers in the NCSA Test Stand had their drive backplanes replaced by the vendor	19 of 52 active servers in the NCSA Test Stand, including: lsst-l1-cl-arctl.ncsa.illinois.edu lsst-l1-cl-audit.ncsa.illinois.edu lsst-l1-cl-efd.ncsa.illinois.edu lsst-l1-cl-fault.ncsa.illinois.edu lsst-l1-cl-header.ncsa.illinois.edu lsst-l1-us-fault.ncsa.illinois.edu lsst-teststand-ts1.ncsa.illinois.edu Most LSST servers will remain up.	COMPLETE
2019-12-12 06:00	2019-12-12 12:00	Monthly LSST system maintenance	NCSA	OS updates and reboots GPFS filesystem restructure	ALL LSST systems will be updated, including: lsst-dev01, lsst-xfer, etc. Slurm verification cluster PDAC/Kubernetes/LSP clusters ~~tus-ats01~~ NCSA test stand	COMPLETE
27 Sep 2019 7:00pm (PT)	28 Sep 2019 11:30pm (PT)	Building power maintenance & GPFS firmware upgrade	LDF (NCSA)	Full building power outage in NCSA facility at NCSA Firmware upgrade of GPFS appliance (causes Home directories to be unavailable)	ALL LSST systems, including: lsst-dev01, lsst-xfer, etc. Slurm verification cluster PDAC/Kubernetes/LSP clusters L1 test stand	COMPLETE Oracle is still down but is expected to be returned to service later this afternoon.
18 Jul 2019 8:15am (PT)	18 Jul 2019 2:40pm (PT)	lsst-oradb down	LDF (NCSA)	The primary Oracle service was down after this morning's planned maintenance due to issues accessing NetApp storage.	lsst-oradb was down lsst-oradb-test was up	RESOLVED
18 Jul 2019 6:00am (PT)	18 Jul 2019 8:15am (PT)	Monthly maintenance	LDF (NCSA)	OS updates and reboots ~~Dell firmware updates~~ ~~firmware update on bastion01 (see separate entry scheduled for 04:00-06:00am)~~ ~~pfSense firewall maintenance~~ (postponed)	ALL LSST systems, including: lsst-dev01, lsst-xfer, etc. Slurm verification cluster PDAC/Kubernetes/LSP clusters tus-ats01 L1 test stand	COMPLETE with the following exceptions: lsp-int is still inaccessible two L1 test nodes are still down: ~~lsst-l1-cL-frwd16~~ (back up) lsst-l1-cl-ocs production Oracle services (lsst-oradb.ncsa.illinois.edu) are still down
23 Jun 2019 5:00am (PT)	23 Jun 2019 12:00 Noon (PT)	Full building power maintenance	LDF (NCSA)	Full building power outage in NPCF facility at NCSA. End time is approximate.	lsst-dev01, lsst-xfer, lsst-dbb-gw Slurm verification cluster PDAC/Kubernetes/LSP clusters	COMPLETE
21-May-2019 9:00am (PT)	22-May-2019 5:00pm (PT)	k8s cluster migration	NCSA	Migrate old kubernetes cluster to redeployed clusters with redundant head nodes. UPDATE: lsp-stable was primarily stable a few hours after the planed end time on May 21st. lsp-int services required an extra day to stabilize.	All k8s at NCSA including: lsp-stable lsp-int	COMPLETE some lsp-int services are still offline, but impact a very limited number of users.
16-May-2019 6:00am (PT)	16-May-2019 10:00am (PT)	Monthly maintenance	LDF (NCSA)	Update authentication to use new LDAP & Kerberos servers. No interruption of service or downtime is expected.	No interruption of service or downtime. ALL LSST systems, including: lsst-dev01, lsst-xfer, etc. Slurm verification cluster PDAC/Kubernetes/LSP clusters tus-ats01	COMPLETE
10-May-2019 07:30am	10-May-2019 08:10am	LSST Identity	LDF (NCSA)	Upgrade PHP to v7 on LSST Identity website.	Expect LSST Identity website will have minimal, momentary downtime	COMPLETED
23-Apr-2019 10am	23-Apr-2019 3pm	Ci-logon AWS	AWS	On April 23, 2019, the Amazon Web Services (AWS) infrastructure supporting the CILogon COmanage Registry and LDAP services will be modified to increase the high availability (HA) posture. A new network load balancer (NLB) will be introduced and DNS entries modified to point to the new NLB interfaces. The existing NLB interfaces will continue to function for 72 hours after the transition to support any clients that have cached the older (current) DNS mappings.	No anticipated outages help@cilogon.org if problems with schedule	COMPLETED
18-Apr 2019 6:00am (PT)	18-Apr 2019 10:00am (PT)	Monthly maintenance	LDF (NCSA)	~~10G network switch maintenance~~ GPFS server updates OS updates (incl. updated kernel) and reboots Dell firmware updates Kubernetes update (v1.13.3 to v1.14.0) Pending configuration changes via Puppet	ALL LSST systems, including: lsst-dev01, lsst-xfer, etc. Slurm verification cluster PDAC/Kubernetes/LSP clusters tus-ats01	COMPLETED
4Apr2019 6:00AM (PT)	4Apr2019 7:00AM (PT)	Network Maintenance	NCSA	Network engineers at NCSA migrated switches servicing L1 test stand and others within NCSA-3003 to a new router. A brief blip (<60s) took place as router interfaces are migrated.	L1 Test Stand NCSA-3003 Infrastructure	COMPLETED
4/2/2019 7am	4/2/2019 8am	CIlogon will be upgraded	NCSA	CILogon is upgrading. Can test code now at test.cilogon.org.	No outage is expected. Just new release moved from test to production.	COMPLETED
19-Mar-2019 04:15	19-Mar-2019 07:58	lsst-dev01 GPFS issue	LDF (NCSA)	The lsst-dev01 server was repeatedly being expelled from GPFS cluster after unexpected socket errors.	lsst-dev01	RESOLVED A reboot of the server resolved the socket errors.
12-Mar 2019 5am (PST)	13-Mar 2019 3:45pm (PST)	network testing with LSTdev Slurm compute nodes	LDF (NCSA)	24 of the LSSTdev/Slurm compute nodes were reserved for admin use for this testing	verify-worker[13-36]	testing was extended into the 13th but was completed and nodes were returned to service
12-Mar 2019 11:25am (PST)	12-Mar 2019 12:25pm (PST)	LSST Oracle service unavailable	LDF (NCSA)	public DNS names were inadvertently removed for LSST's Oracle servers/service	LSST Oracle service at the LDF (lsst-oradb)	RESOLVED DNS entries were recreated and active by 12:25 (PST) slowness following return to service was initially reported by one user but this seems to have resolved itself
09-Mar-2019 8:35pm	09-Mar-2019 8:35pm	Host reboots due to power fluctuation	LDF (NCSA)	27 L1 "NCSA test stand" nodes rebooted NCSA continues to engage Dell on this issue; this particular model (C6420) has been uniquely susceptible to issues during brownouts	27 L1 "NCSA test stand" nodes	RESOLVED
06-Mar-2019 6am (PST)	06-Mar-2019 7am (PST)	pfsense firewall config update.	NCSA	pfsense network config update to stage 'k8s-prod' deployment. Requires failover of firewall, and may cause short (~60s) outage of systems behind the firewall.	All services behind pfsense firewall at NCSA. (qserv, verify, lsp, oradb)	COMPLETED
21-Feb-2019 6:00am (PT)	21-Feb-2019 10:00am (PT)	Monthly maintenance	NCSA	OS/Yum updates Switch maintenance in NPCF N73 & P73 pfSense update & port negotiation change GPFS server updates Firmware updates for Dell C6420s	ALL systems operated by NCSA, including: lsst-dev01, lsst-xfer, etc. PDAC, verification, and Kubernetes clusters tus-ats01	COMPLETED
2/18/2019 - 7am (PT)	2/18/2019 - 9am(PT)	K8 upgrade for security reasons	LDF	Security vulnerabilities require a update to the K8/Docker infrastructure at LDF. Upgrade Docker from `17.03.1` to `18.09.2` Upgrade Kubernetes from `1.11.5` to `1.13.3` Upgrade completed on time but some additional troubleshooting had to be done to get lsp-stable and lsp-int back online.	K8 nodes	Emergency
17-Jan-2019 6:00am	17-Jan-2019 10:00am	Monthly maintenance	NCSA	power rebalancing in one rack switch maintenance in select racks critical security patching server firmware upgrades	ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)	RESOLVEDSystems are back online and should be functioning with the following exceptions: lsp services in Kubernetes are not fully functional (this is carryover from before the PM; see discussion on Slack, dm-lsp-users and possibly other channels) lsst-l1-cl-dmcs will not boot after firmware updates Please open tickets if you notice other issues.
18-Dec-2018 9:46am	18-Dec-2018 11:00am	Host reboots due to power fluctuation	LDF (NCSA)	A power event caused some hosts to reboot: lspdev kubernetes cluster (12 nodes including master node did not come back on their own and were manually brought online around 11:00am) some L1 nodes rebooted as well	lspdev was unavailable from ~09:40 until ~11:00am	RESOLVED Systems are back online and should be functioning, but please open tickets if there are lingering issues.
5-Dec-2018 7:00am	5-Dec-2018 8:30am	PDAC and lspdev k8s merge	NCSA	The PDAC k8s environment was merged into the lspdev k8s cluster. Services will continue to be isolated through Kubernetes namespaces, labels, taints, etc.	Services running in PDAC Kubernetes	COMPLETE
29-Nov-2018 6:00am	29-Nov-2018 12:00 noon	Monthly maintenance	NCSA	Puppet code changes disable CPU hyperthreading (requires reboot!!!) OS/Yum updates code upgrades on select service & management switches NPCF pfSense updates	ALL LSST systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters, and tus-ats01)	COMPLETE
13-Nov-2018 5:30 PST	13-Nov-2018 6:30 PST	lspdev cluster reboot	NCSA	Reseating Kubernetes nodes in their chassis slots to resolve errors caused by power event over the weekend.	lspdev/Kubernetes cluster	RESOLVED
10-Nov-2018 ~02:40	10-Nov-2018 ~02:45	Host reboots due to power fluctuation	LDF (NCSA)	A power event caused some hosts to reboot: lspdev kubernetes cluster (3 nodes including master node did not come back on their own and were manually brought online around 07:30) some L1 nodes rebooted as well	lspdev was unavailable from ~02:40 until ~07:30	RESOLVED Systems are back online and should be functioning, but please open tickets if there are lingering issues.
11/6/2018 5am (PT)	11/6/2018 1pm (PT)	Power maintenance	LDF (NCSA)	Some power distribution panels are being worked on, but should NOT cause any LSST environment disruptions.	None	COMPLETE
01-Nov-2018 10:00	01-Nov-2018 10:05	Critical security patching	NCSA	Addressed vulnerability CVE-2018-14665 on the 3 `lsst-dev` hosts.	No interruption of service.	RESOLVED
18-Oct-2018 06:00	18-Oct-2018 10:00	Monthly maintenance	NCSA	Activities are minimal this month and are expected to cause little impact: firmware update and reboot on monitor01 (monitoring collector) OS & Kernel updates on tus-ats01.lsst.ncsa.edu Puppet code changes	monitor01/InfluxDB (and likely the front-end Grafana monitoring, e.g., monitor-ncsa.lsst.org) will be unavailable for a short period of time tus-ats01 will be unavailable for OS & Kernel updates the Puppet changes are intended to be functional "no-ops" and should cause no outage, although we scheduled these changes during our monthly PM window in case something unexpected occurs	COMPLETE
15-Oct-2018 05:35	15-Oct-2018 07:15	Power event -> host outage at one datacenter	NCSA	A power blip caused all physical hosts at the NCSA building to power off or reboot. None of the LSST physical hosts at the NPCF building were affected.	affected: all physical LSST hosts (and VMs) at the NCSA building: incl. lsst-dev, lsst-xfer, lsst-l1, lsst-daq, lsst-dev-db most physical hosts rebooted themselves after the event, although a few L1 systems had to be manually powered on most VMs had to be manually started after the event update: also includes Nebula, which is still impcated unaffected: all physical LSST hosts (and VMs) at the NPCF building: incl. lsst-qserv, lsst-verify-worker, lsst-sui, lsst-kub, GPFS	RESOLVED note: Nebula is still impacted by the outage
04-Oct-2018 06:00	04-Oct-2018 07:15	Critical security patching	NCSA	An incorrect date (Oct 1) was initially posted for this maintenance. The correct date is Thu, Oct 4.	ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected: tus-ats01	RESOLVED sui-tomcat02 is getting rebooted once more to resolve an issue with NFS mounts but we expect it to be resolved easily
20-Sep-2018 06:00	22-Sep-2018 14:50	Qserv Master outage	NCSA	qserv-master01 is having trouble booting after a motherboard replacement during planned maintenance.	Qserv in general, specifically qserv-master	RESOLVED
20-Sep-2018 06:00	20-Sep-2018 12:40	LSPdev Kubernetes	NCSA	LSPdev is having a gateway error	LSPdev	RESOLVED
20-Sep-2018 06:00	20-Sep-2018 12:00	Monthly maintenance	NCSA	Network switch firmware updates/reboots Lenovo firmware updates/reboots OS package updates/reboots ESXi hypervisor updates/reboots GPFS client changes and upgrade to 4.2.3-10 GPFS server upgrade to 4.2.3-10	All systems will be unavailable during this period.	RESOLVED qserv-master01 and LSPdev are still having issues. These will be tracked as a separate incidents.
09-Aug-2018 09:00	09-Aug-2018 09:37	lsst-dev01 Outage	NCSA	The lsst-dev01 server was unreachable for >60sec from the GPFS cluster and got expelled from the GPFS cluster. Open file handles and/or bind mounts from GPFS prevented lsst-dev01 from reconnecting to GPFS until it was rebooted. We suspect that a big job on the Slurm cluster may have contributed to some network congestion that triggered this.	lsst-dev01	RESOLVED
03-Aug-2018 10:00	03-Aug-2018 13:30	NCSA VPN was not working for some users.	NCSA	A configuration issue caused some VPN users connection problems to some NCSA resources.	NCSA VPN	RESOLVED
29-Jul-2018	03-Aug-2018 05:45	Bulk Transfer Server Rebuild	NCSA	The Globus endpoint on lsst-xfer stopped working on July 29 after a certificate from the outdated GridFTP service expired. lsst-xfer was rebuilt and upgraded with CentOS 7.5, Globus Connect Server (v4), bbcp (17.12), and iRODS client (4.2.3). Globus bookmarks to the `lsst#lsst-xfer` endpoint may need to be updated to point to the rebuilt endpoint.	Globus on lsst-xfer	RESOLVED
27-Jul-2018	27-Jul-2018	NCSA VPN Migration	NCSA	NCSA will be migrating to a new VPN with multi-factor authentication. The new VPN is currently available, and users are encouraged to start using the new VPN before the cutoff date in order to ensure continued connectivity. All users must be registered with NCSA's Duo before they can use the new VPN. Links to the how-to article as well as the new VPN and Duo login are included below. How-To: https://wiki.ncsa.illinois.edu/x/hJsDAg VPN: https://sslvpn.ncsa.illinois.edu/ Duo: https://duo.security.ncsa.illinois.edu/login/	No interruption of service is expected.	COMPLETE
19-Jul-2018 10:00	19-Jul-2018 10:30	DB services on lsst-dev-db are unavailable along with dependent services, including: lspdev	NCSA	MariaDB service did not start on lsst-dev-db after maintenance. There is a newer setting in MariaDB that didn't like the current mount point.	DB services on lsst-dev-db Services that depend on lsst-dev-db, including: lspdev	RESOLVED
19-Jul-2018 06:00	19-Jul-2018 10:00	Monthly lsst-dev maintenance	NCSA	Dell firmware updates/reboots OS package updates/reboots including upgrades to CentOS 7.5 GPFS client changes and upgrade to 4.2.3-9 GPFS server upgrade to 4.2.3-9	ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected: lsst-daq lsst-l1-* tus-ats01	COMPLETE DB services on lsst-dev-db will not start after maintenance, impacting dependent services such as lspdev. This will be tracked in a separate status event.
27-Jun-2018 07:00	27-Jun-2018 11:00	lspdev outage	NCSA	The Kubernetes head node unexpectedly rebooted at approximately 7:00 AM, causing a JupyterHub outage. Service was brought back online around 11:00 AM.	lsst-kub0[01-20]	COMPLETE
27-Jun-2018 06:10	27-Jun-2018 06:30	Monitoring Update	NCSA	First phase of enabling encryption on monitoring traffic	Monitoring Dashboards
21-Jun-2018 06:00	21-Jun-2018 07:35	Monthly lsst-dev maintenance	NCSA	pfSense firewall update OS package updates/reboots for CentOS 6.9 servers (lsst-web, lsst-xfer, lsst-nagios) Slurm update (lsst-dev01, lsst-verify-worker*) Update host firewalls on GPFS servers iDRAC configuration updates on lsst-dev01 and ESXi hosts	CentOS 6.9 servers: lsst-web lsst-xfer lsst-nagios Slurm/verification cluster Other impact is not expected but unexpected issues could lead to connectivity issues for other hosts or downtime for lsst-dev01 or hosted VMs	COMPLETE
18-Jun-2018 11:00	19-Jun-2018 17:00	Nebula outage	NCSA	Nebula is undergoing a complete reboot. Last week's storms damaged more than just one node initially thought to be affected.	Nebula will be unavailable until 15:00 (5pm CDT)	RESOLVED
19-Jun-2018 06:00	19-Jun-2018 10:00	Level One Test Stand Maintenance	NCSA	BIOS firmware updates Puppet and firewall changes (including support of SAL unicast/multicast traffic) OS package updates (staying with CentOS 7.4)	Level One Test Stand, including: lsst-daq lsst-l1-*	RESOLVED
12-Jun-2018 ~01:40 PDT	12-Jun-2018 07:01 PDT	Storm → outage of Kubernetes Commons & 75% of verification cluster compute nodes	NCSA	A storm caused a power event at the NPCF datacenter taking down Kubernetes commons and lspdev as well as 75% of the verification cluster compute nodes.	Kubernetes Commons / lsst-lspdev / kub* 75% of verify-worker* / Slurm nodes	RESOLVED
17-May-2018 11:30	17-May-2018 12:25	Grafana monitoring was offline	NCSA	The influxdb data used by grafana monitoring was offline while it's storage was rebuilt	https://monitor-ncsa.lsst.org/ monitoring data was offline	RESOLVED
17-May-2018 06:00	17-May-2018 11:30	Monthly lsst-dev maintenance	NCSA	GPFS maintenance Replace floor tile GPFS service upgrade to 4.2.3-8 Rebuild of `/lsst/backups` structure PDAC Firewall maintenance for new vLANs BIOS Firmware updates (`lsst-bastion01`, `lsst-sui`, `lsst-qserv`, LevelOne Test Stand, `lsst-dev-db`) Node changes with reboots (all nodes) switch to rsyslog v8 yum repository & upgrade rsyslog (`bastion01` & kub, qserv, sui, verification clusters) puppet-stdlib module update (`lsst-dev01`, `lsst-dev-db`, `lsst-web`, `lsst-xfer`, LevelOne Test Stand) GPFS client upgrade (4.2.3-8) and `nosuid` mount option changes (`lsst-dev01`, `lsst-qserv`, `lsst-web`, `lsst-xfer`, verification cluster) NFS `nosuid` mount option changes of GPFS (`lsst-demo01` and kub & verification clusters) enable PXE boot on new network interfaces (`lsst-kub` & `lsst-backup01`) OS Updates (all nodes)	ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC, verification, and Kubernetes clusters) The following systems will remain online and unaffected: lsst-daq lsst-l1-* tus-ats01	RESOLVED
30-Apr-2018 18:37	14-May-2018 15:00	Security & AA infrastructure offline	La Serena	The Security & AA infrastructure went offline around 18:37 Project Time. None of the infrastructure is accessible via the network. A UPS had to be replaced and an electrical circuit upgraded for the replacement UPS.	None.	RESOLVED
11-Apr-2018 06:00	07-May-2018 10:30	production-size run (HSC-PDR1) on the verification cluster	NCSA	Per IHS-749, ~15 nodes of the batch compute resources will be reserved in order to complete HSC-PDR1 data runs. It is expected that the reservation can be scaled back to <10 after the first couple of weeks.	All systems available.	COMPLETE
25-Apr-2018 11:30	25-Apr-2018 12:40	Test new puppet changes for sssd and ldap access on SUI* nodes.	NCSA	A minor change to sssd service configuration needs to be rolled out to all nodes. The change will require a momentary outage of the sssd service and some actions will take longer (for a short period of time) as cache is repopulated. Changes in puppet structure (affecting ldap group sync) are also in need of testing and can happen simultaneously.	Affected services: Firefly proxy and tomcat services Some actions may appear slow while cache re-populates	COMPLETE
04/24/2018 07:10	04/24/2018 07:50	increased LDAP timeout to 60 seconds in sssd.conf	NCSA	increased LDAP timeout to 60 seconds in sssd.conf to fix problems with long login times and failure to start batch jobs we will coordinate in the near future to apply the same change on qserv* & sui*	Affected nodes: kub, verify-worker	RESOLVED All nodes are back in service, although affected nodes may have slow LDAP response times for a short while (due to local cache needing rebuilt).
19 Apr 2018	19 Apr 2018	Monthly lsst-dev maintenance	NCSA	CANCELLED. No major work is needed and key personnel are travelling. Deployment of the new DTN and VM infrastructure will be delayed until after the May maintenance period.	N/A	CANCELLED
4/11 at 07:00	4/11 at 08:00	Firewall update at NCSA	NCSA	Per LSST-1257, the primary firewall needs to have its routing software updated. No failover is required and traffic will continue to flow through the primary firewall during the upgrade.	No outage or service disruption.	RESOLVED
4/3/2018 16:40	4/3/2018 16:45	LDAP problems	NCSA	Causing new logins to the LSST resources at NCSA to hang.	new logins can't take place right now.	fixed.
3/26/2018 08:00	4/2/2018 9:00	A fileserver on Nebula became unstable, resulting in diminished performance for some instances and volumes.	NCSA	Any instances or volumes hosted on the healing filesystem will be impacted, or approximately 20% of instances and volumes.		We are migrating instances around to speed up the process.
3/15/2018 10:20 am PT	3/15/2018 14:20 am PT	Lingering issues on select nodes following March PM	NCSA	Select nodes had issues coming out of the PM.	lsst7 - issue w/ sshd	RESOLVED
3/15/2018 10:20 am PT	3/15/2018 11:23am PT	Lingering issues on select nodes following March PM	NCSA	Select nodes had issues coming out of the PM.	lsst-qserv-master01 - cannot mount local /qserv volume lsst-xfer - issue w/ sshd lsst-dts - issue w/ sshd lsst-l1-cl-dmcs - unknown issue	RESOLVED
3/15/2018 6:00 am PT	3/15/2018 10:20 am PT	March lsst-dev maintenance (regular schedule)	NCSA	GPFS server updates and configuration of additional NFS/Samba services Urgent Firmware updates Increase size of /tmp on lsst-dev01 Hardware maintenance/memory increases on select servers/VMs Release of refactored Puppet code OS updates Recabling servers in dev server room to new switches	Systems/services that were NOT be available: ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification and Kubernetes clusters)	COMPLETE Select nodes (lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, lsst-l1-cl-dmcs) required additional attention following the PM, as noted in a separate status entry.
3/12/2018 7:00 am PT	3/12/2018 3:00pm PT	nebula(Open stack resource) is down	NCSA	Nebula is being taken down for patches to be applied across the whole infrastructure.	All containers on Nebula are going down.	COMPLETE
07 Mar 2018 13:00	07 Mar 2018 14:10	qserv-db12 maintenance	NCSA	qserv-db12 had one failed drive in the OS mirror replaced but the other is presenting errors as well so the RAID cannot rebuild. The node was taken down for replacement of the 2nd disk, to rebuild the RAID in the OS volume, and to reinstall the OS.	qserv-db12	COMPLETE
28 Feb 2018 09:02	28 Feb 2018 09:21	`lsst-dev01` Out of Space	NCSA	The main `/` drive partition ran out of space due to a user's faulty pip build. The faulty files were moved elsewhere for the user to review.	lsst-dev01	COMPLETE
27 Feb 2018 08:40	27 Feb 2018 09:40	Puppet maintenance at NCSA	NCSA	Enable environment isolation on puppet master	No outage or service disruption is expected.	COMPLETE
23 Feb 2018 06:00	23 Feb 2018 07:00	Puppet updates	NCSA	Rolled out significant logic and organization of the Puppet resources in NCSA 3003 data center in order to standardize between LSST Puppet environments at NCSA. We had done extensive testing and did not expect any outages or disruption of services.	None Changes were applied to: `lsst-dev01`, `lsst-dev-db`, `lsst-web`, `lsst-xfer`, `lsst-dts`, `lsst-demo`, L1 test stand, DBB test stand, elastic test stand.	COMPLETE
15 Feb 2018 12:55	15 Feb 2018 13:18	lsst-dev-db crashed	NCSA	The developer MySQL server lost network and crashed.	lsst-dev-db MySQL database	RESTORED
15 Feb 2018 06:00	15 Feb 2018 11:00	February lsst-dev maintenance (regular schedule)	NCSA	Updating GPFS mounts to access new storage appliance Rewire 2 PDUs in dev server room (hosts lsst-dev01, lsst-xfer, etc.) Switch stack configuration changes in dev server room (hosts lsst-dev01, lsst-xfer, etc.) Routine system updates Firewall maintenance at datacenter (hosts PDAC, verification cluster, etc.) Updates to system monitoring	Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)	COMPLETE NOTE: GPFS was not remounted on qserv-dax01 until 4:27pm
31 Jan 2018, 08:00	31 Jan 2018, 08:30	Slurm reconfiguration	NCSA	The slurm scheduler on the verification cluster will be repartitioned from one queue (debug) into two: debug: 3 nodes, MaxTime=30 min normal: 45 nodes, MaxTime=INFINITE	No outages	COMPLETE
Wed 1/24/2018 13:35	Wed 1/24/2018 14:55	Loss of LSST NFS services	NCSA	All NFS mounts for LSST systems were not working	NFS access on lsst-demo and lsst-SUI were not working	RESTORED
18 Jan 2018 16:40	18 Jan 2018 21:00	Firewall outage	NCSA	Both pfSense firewalls were accidentally powered off.	PDAC (Qserv & SUI) and verification clusters were inaccessible, as well as introducing GPFS issues across many services, e.g. lsst-dev01.	RESTORED
18 Jan 2018 06:00	18 Jan 2018 08:00	January lsst-dev maintenance (regular schedule)	NCSA	Routine system updates	Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)	COMPLETE
11 Jan 2018 06:00	11 Jan 2018 11:30	Critical patches on lsst-dev systems (incl. kernel updates)	NCSA	Update kernel and system packages to address a security vulnerability.	Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)	COMPLETE
02 Jan 2018 09:00	05 Jan 2018 17:00	Nebula	NCSA	Nebula (OpenStack) will be shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm.	All Nebula systems unavailable.	COMPLETE
Saturday 23 Dec 2017	Tuesday 02 Jan 2018	Support over holiday break	NCSA	2017-12-22 to 2018-1-01 (inclusive) is the University holiday period. Services will be operational. Please report problems via the JIRA IHS queue. The queue will be monitored by NCSA staff and users will be notified via Jira as to if or when their issue can be addressed.	All services will be operational.	COMPLETE
Wednesday 20 Dec 2017 06:00	Wednesday 20 Dec 2017 08:00	NFS Server switch	NCSA	NFS services will be moved to a different host	brief outage of NFS services to SUI nodes, lsst-demo, lsst-demo2	COMPLETED
Wednesday 20 Dec 2017 06:00	Wednesday 20 Dec 2017 07:00	Firewall drive replacement	NCSA	Current pfSense has a bad drive. If it fails, all nodes behind the firewall will be inaccessible. There are redundant firewalls, no service interrupts are expected.	None Expected	COMPLETED
Thursday 2017-12-14 04:00	Thursday 2017-12-14, 10~~:00~~ 19:00	December lsst-dev maintenance (off-schedule)	NCSA	Due to holiday schedules, the December maintenance event is being moved up 1 week, from 2017-12-21 to 2017-12-14 Routine system updates Network switch replacement lsst-db server replacement Further details here	Do not expect any lsst-dev system to be available during this period.	COMPLETED
Tuesday 2017-11-28, 10:00	TBD	Rolling reboots of PDAC qserv nodes	NCSA	In order to address a spontaneous rebooting issue with some qserv nodes, firmware upgrades are being performed.	The occasional qserv node will need to be rebooted. Experience with the first couple will allow NCSA to give more precise information on the order and timing of the reboots.	COMPLETED
2017-11-20 7:00	2017-11-20 14:00	Nebula Openstack cluster	NCSA	Nebula OpenStack cluster will be unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch will be replaced.	Not all instances will be impacted. If any running Nebula instances are affected by the outage they will be shut down, then restarted again after we finish maintenance that day.	COMPLETED
Thursday 2017-11-16 06:00	Thursday 2017-11-16 10:00	Extended monthly lsst-dev maintenance	NCSA	Routine system updates. Due to the volume of work that needs to be done, this event is being extended by 2 hrs. If systems become available before the end of the maintenance window, we will announce it here. Be aware that this event will include an off-schedule purge of items in /scratch older than 180 days.	Do not expect any lsst-dev system to be available during this period.	COMPLETED
2017-10-31		NFS instability	NCSA	NFS becomes intermittently unresponsive.		~STABLE We are guardedly optimistic that this problem has been resolved. PDAC is now utilizing native GPFS mounts.
2017-10-24 09:50	LSST	GPFS outage	NCSA	All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NCPF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection.	GPFS	ONLINE Storage is working to bring GPFS back online
2017-10-21 17:15	LSST	public/protected network switch is down in rack N76 at NPCF		nodes cannot communicate DNS, LDAP, etc. so largely cannot communicate with other nodes, e.g., no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, no communication between affected qserv-db nodes and the rest of qserv	Efffectively, the whole verification cluster	RESTORED in progress, replacement switch is on order Workaround in progress. If all goes well, systems should be back online by late afternoon.
2017-10-19 06:00	2017-10-19 14:00	qserv-master replacement	NCSA	qserve-master will be down so that systems engineering can finish configuring the new server and xfering files. Status updates here: IHS-378 - Getting issue details... STATUS .	qserv-master will be down for this entire period	COMPLETE

Archived events

Space shortcuts

Page tree

Current Status

Upcoming Scheduled Maintenance

Recurring Scheduled Maintenance

Previous Outages & Events

DM Meetings and Events