...

(All times are Project Time (Pacific))

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status
Every Tu. 09:00

Every Tu. 11:00

Recurring

Weekly Nebula Maintenance

NCSA

Routine system updates. Computational services continue to run.

Horizon and API interfaces.

Status: Scheduled

Third Thursday of every month, 07:00

Third Thursday of every month, 09:00

Recurring

Monthly lsst-dev maintenance

NCSA
  • Routine system updates.
Variable. Do not expect any lsst-dev system to be available during this period.

Status: Scheduled

Thursday 2017-11-16 07:00

Thursday 2017-11-16 11:00

Extended monthly lsst-dev maintenance

NCSA
  • Routine system updates.
  • Because of the volume of work required, this event is extended by 2 hours. If systems become available before the maintenance window ends, we will announce it here.
Do not expect any lsst-dev system to be available during this period.

Status: Tentative

2017-10-31
GPFS instability

NCSA

All disks in the GPFS storage system went offline temporarily and came back online by themselves. NFS services were restarted.

This is the second drop-out in less than 24 hours. GPFS has the hiccups.


most NCSA-hosted LSST resources

native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC)

Status: Unstable

All GPFS services are currently running.

Until a cause is identified and fixed, we will consider GPFS unstable.

Logs have been sent to the vendor for analysis.









Previous Outages & Events

Start | End | Event | Location | Planned Activities | Systems/services that will NOT be available | Status
2017-10-24 09:50

LSST GPFS outage

NCSA

All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection.

GPFS

Status: Online

Storage is working to bring GPFS back online

2017-10-21 17:15

LSST public/protected network switch is down in rack N76 at NPCF


Affected nodes cannot reach DNS, LDAP, etc., so they largely cannot communicate with other nodes. For example, there is no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, or between affected qserv-db nodes and the rest of qserv.

Effectively, the whole verification cluster

Status: Restored

in progress, replacement switch is on order

Workaround in progress. If all goes well, systems should be back online by late afternoon.

2017-10-19 06:00

2017-10-19 14:00

qserv-master replacement

NCSA

qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates are tracked in Jira issue IHS-378.

qserv-master will be down for this entire period

Status: Complete

Archived events


Important Project Dates

(those with asterisk* are LSSTC funded):

...