...
(All times are Project Time (Pacific))
Start | End | Event | Location | Description | Systems/services that will NOT be available | Status |
---|---|---|---|---|---|---|
Every Tu. 09:00 | Every Tu. 11:00 | Recurring Weekly Nebula Maintenance | NCSA | Routine system updates. Computational services continue to run. | Horizon and API interfaces. | |
Third Thursday of every month 07:00 | Third Thursday of every month 09:00 | Recurring Monthly lsst-dev maintenance | NCSA | | Variable. Do not expect any lsst-dev system to be available during this period. | |
Thursday 2017-11-16 07:00 | Thursday 2017-11-16 11:00 | Extended monthly lsst-dev maintenance | NCSA | | Do not expect any lsst-dev system to be available during this period. | |
2017-10-31 | | GPFS instability | NCSA | All disks in the GPFS storage system went offline temporarily and came back online by themselves. NFS services were restarted. This is the second drop-out in <24 hrs. GPFS has the hiccups. | Most NCSA-hosted LSST resources: native mounts (e.g., lsst-dev01, verify-worker*) and NFS mounts (e.g., PDAC) | All GPFS services are currently running. Until a cause is identified and fixed, we'll consider it unstable. Logs have been sent to the vendor for analysis. |
Previous Outages & Events
Start | End | Event | Location | Planned Activities | Systems/services that will NOT be available | Status |
---|---|---|---|---|---|---|
2017-10-24 09:50 | LSST | GPFS outage | NCSA | All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection. | GPFS | Storage is working to bring GPFS back online. |
2017-10-21 17:15 | LSST | Public/protected network switch is down in rack N76 at NPCF | | Nodes cannot reach DNS, LDAP, etc., so they largely cannot communicate with other nodes; e.g., no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, and no communication between affected qserv-db nodes and the rest of qserv. | Effectively, the whole verification cluster | In progress; replacement switch is on order. Workaround in progress. If all goes well, systems should be back online by late afternoon. |
2017-10-19 06:00 | 2017-10-19 14:00 | qserv-master replacement | NCSA | qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates here: | qserv-master will be down for this entire period. | |
Important Project Dates
(those with asterisk* are LSSTC funded):
...