


Current Status

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status







Upcoming Scheduled Maintenance

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status
Last week in Feb. Exact date TBD. | ~6 weeks after start | Production-size run (HSC-PDR1) on the verification cluster | NCSA

Per IHS-749, ~15 nodes of the batch compute resources will be reserved in order to complete HSC-PDR1 data runs. It is expected that the reservation can be scaled back to fewer than 10 nodes after the first couple of weeks.

All systems available.

SCHEDULED

4/11 at 07:00 | 4/11 at 08:00 | Firewall update at NCSA | NCSA | Per LSST-1257, the primary firewall needs to have its routing software updated. No failover is required and traffic will continue to flow through the primary firewall during the upgrade. | No outage or service disruption.

SCHEDULED

19 Apr 2018 | 19 Apr 2018 | Monthly lsst-dev maintenance | NCSA | CANCELLED. No major work is needed and key personnel are travelling. Deployment of the new DTN and VM infrastructure will be delayed until the May maintenance period. | N/A

CANCELLED

Recurring Scheduled Maintenance

(All times are Project Time (Pacific))

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status

Third Thursday of every month 06:00

Third Thursday of every month 10:00

Recurring-Monthly

Monthly lsst-dev maintenance

NCSA
  • Routine system updates.
Variable. Do not expect any lsst-dev system to be available during this period.

SCHEDULED
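
For anyone planning work around the monthly window, the "third Thursday of every month" rule above can be computed ahead of time. The following is a minimal sketch using only the Python standard library; the function name `third_thursday` is illustrative and is not part of any LSST or NCSA tooling.

```python
import calendar
from datetime import date


def third_thursday(year: int, month: int) -> date:
    """Return the date of the third Thursday of the given month."""
    # Use an explicit Monday-first calendar so the weekday index below
    # does not depend on any global calendar settings.
    cal = calendar.Calendar(firstweekday=calendar.MONDAY)
    # monthdayscalendar() yields one list per week; days outside the
    # month are reported as 0 and must be skipped.
    thursdays = [
        week[calendar.THURSDAY]
        for week in cal.monthdayscalendar(year, month)
        if week[calendar.THURSDAY] != 0
    ]
    return date(year, month, thursdays[2])


if __name__ == "__main__":
    # Print the maintenance dates for a few months of 2018.
    for month in (3, 4, 5):
        print(third_thursday(2018, month).isoformat())
```

The maintenance window itself (06:00 to 10:00 Project Time) is not modelled here; the sketch only finds the calendar date.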

Every Mon. 04:00


Recurring-Weekly
Purge of GPFS /scratch partition

NCSA

Per LSST data management policies, files older than 180 days will be purged from the LSST shared (GPFS) /scratch file system.

Purge logs can be found in /gpfs/fs0/admin/purge_logs/scratch/

No outage or service disruption.

SCHEDULED
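
To see which of your own files are likely to be affected before the next weekly purge, something like the following sketch may help. It only reports files and never deletes anything. It assumes that file age is judged by modification time, which this page does not state; the actual purge may use a different timestamp. The function name `purge_candidates` and the constant `PURGE_AGE_DAYS` are illustrative.

```python
import os
import time
from pathlib import Path
from typing import Iterator, Optional

PURGE_AGE_DAYS = 180  # threshold taken from the policy described above


def purge_candidates(
    root: str,
    max_age_days: int = PURGE_AGE_DAYS,
    now: Optional[float] = None,
) -> Iterator[Path]:
    """Yield files under `root` whose mtime is older than the threshold.

    Assumption: age is measured by modification time (st_mtime).
    """
    now = time.time() if now is None else now
    cutoff = now - max_age_days * 86400  # seconds per day
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            path = Path(dirpath) / name
            try:
                # lstat() so that symlinks are judged by their own age
                # rather than the age of their target.
                mtime = path.lstat().st_mtime
            except OSError:
                continue  # file vanished or is unreadable; skip it
            if mtime < cutoff:
                yield path
```

Calling `list(purge_candidates("/path/to/your/scratch/dir"))` returns the files older than 180 days under that directory; the path shown is a placeholder for your own directory on /scratch.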

Every Tu. 08:00

Every Tu. 10:00

Recurring-Weekly

Weekly Nebula Maintenance

NCSA | Routine system updates. Computational services continue to run. | Horizon and API interfaces.

SCHEDULED


Previous Outages & Events

Start | End | Event | Location | Description | Systems/services that will NOT be available | Status
4/3/2018 16:40 | 4/3/2018 16:45 | LDAP problems | NCSA | New logins to the LSST resources at NCSA were hanging. | New logins could not take place. | FIXED
3/26/2018 08:00

4/2/2018 9:00

A fileserver on Nebula became unstable, resulting in diminished performance for some instances and volumes.

NCSA

Any instances or volumes hosted on the healing filesystem will be impacted, or approximately 20% of instances and volumes.

We are migrating instances around to speed up the process.

3/15/2018 10:20 am PT | 3/15/2018 14:20 PT | Lingering issues on select nodes following March PM | NCSA

Select nodes had issues coming out of the PM.

  • lsst7 - issue w/ sshd

RESOLVED

3/15/2018 10:20 am PT | 3/15/2018 11:23 am PT | Lingering issues on select nodes following March PM | NCSA

Select nodes had issues coming out of the PM.

  • lsst-qserv-master01 - cannot mount local /qserv volume
  • lsst-xfer - issue w/ sshd
  • lsst-dts - issue w/ sshd
  • lsst-l1-cl-dmcs - unknown issue

RESOLVED

3/15/2018 6:00 am PT | 3/15/2018 10:20 am PT | March lsst-dev maintenance (regular schedule) | NCSA
  • GPFS server updates and configuration of additional NFS/Samba services
  • Urgent firmware updates
  • Increase size of /tmp on lsst-dev01
  • Hardware maintenance/memory increases on select servers/VMs
  • Release of refactored Puppet code
  • OS updates
  • Recabling servers in dev server room to new switches
Systems/services that were NOT available: ALL lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification and Kubernetes clusters)

COMPLETE

Select nodes (lsst-qserv-master01, lsst7, lsst-xfer, lsst-dts, lsst-l1-cl-dmcs) required additional attention following the PM, as noted in a separate status entry.

3/12/2018 7:00 am PT

3/12/2018 3:00 pm PT | Nebula (OpenStack resource) is down | NCSA | Nebula is being taken down for patches to be applied across the whole infrastructure. | All containers on Nebula are going down.

COMPLETE

07 Mar 2018 13:00

07 Mar 2018 14:10 | qserv-db12 maintenance | NCSA | qserv-db12 had one failed drive in the OS mirror replaced, but the other drive is presenting errors as well, so the RAID cannot rebuild. The node was taken down to replace the second disk, rebuild the RAID in the OS volume, and reinstall the OS. | qserv-db12

COMPLETE

 09:02

09:21

lsst-dev01 out of space | NCSA | The root (/) partition ran out of space due to a user's faulty pip build. The faulty files were moved elsewhere for the user to review.
lsst-dev01

COMPLETE

27 Feb 2018 08:40

27 Feb 2018 09:40

Puppet maintenance at NCSA | NCSA

Enable environment isolation on the Puppet master.

No outage or service disruption is expected.

COMPLETE

 06:00

07:00

Puppet updates | NCSA | Rolled out significant changes to the logic and organization of the Puppet resources in the NCSA 3003 data center in order to standardize across LSST Puppet environments at NCSA. Extensive testing was done beforehand, and no outages or disruption of services were expected.

None

Changes were applied to: lsst-dev01, lsst-dev-db, lsst-web, lsst-xfer, lsst-dts, lsst-demo, L1 test stand, DBB test stand, elastic test stand.

COMPLETE

 12:55

 13:18

lsst-dev-db crashed | NCSA | The developer MySQL server lost network and crashed. | lsst-dev-db MySQL database

RESTORED

 06:00

 11:00

February lsst-dev maintenance (regular schedule) | NCSA
  • Updating GPFS mounts to access new storage appliance
  • Rewire 2 PDUs in dev server room (hosts lsst-dev01, lsst-xfer, etc.)
  • Switch stack configuration changes in dev server room (hosts lsst-dev01, lsst-xfer, etc.)
  • Routine system updates
  • Firewall maintenance at datacenter (hosts PDAC, verification cluster, etc.)
  • Updates to system monitoring
Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

COMPLETE

  • NOTE: GPFS was not remounted on qserv-dax01 until 4:27pm

08:00

08:30

Slurm reconfiguration | NCSA

The Slurm scheduler on the verification cluster will be repartitioned from one queue (debug) into two:

debug: 3 nodes, MaxTime=30 min

normal: 45 nodes, MaxTime=INFINITE

No outages

COMPLETE

Wed 1/24/2018 13:35 | Wed 1/24/2018 14:55 | Loss of LSST NFS services | NCSA | All NFS mounts for LSST systems were not working. | NFS access on lsst-demo and lsst-SUI was not working.

RESTORED

 16:40

 21:00

Firewall outage | NCSA | Both pfSense firewalls were accidentally powered off. | PDAC (Qserv & SUI) and verification clusters were inaccessible. The outage also introduced GPFS issues across many services, e.g. lsst-dev01.

RESTORED

 06:00

 08:00

January lsst-dev maintenance (regular schedule) | NCSA
  • Routine system updates
Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

COMPLETE

 06:00

 11:30

Critical patches on lsst-dev systems (incl. kernel updates) | NCSA
  • Update kernel and system packages to address a security vulnerability.
Systems/services that will NOT be available: all lsst-dev systems (incl. lsst-dev01, lsst-xfer, etc. as well as PDAC and the verification cluster)

COMPLETE

 09:00

 17:00

Nebula | NCSA | Nebula (OpenStack) will be shut down for hardware and software maintenance from January 2nd, 2018 at 9am until January 5th, 2018 at 5pm. | All Nebula systems unavailable. | COMPLETE


Saturday  

Tuesday  

Support over holiday break | NCSA

2017-12-22 to 2018-01-01 (inclusive) is the University holiday period. Services will be operational. Please report problems via the JIRA IHS queue. The queue will be monitored by NCSA staff, and users will be notified via Jira whether and when their issue can be addressed.


All services will be operational.

COMPLETE

Wednesday  

06:00

Wednesday  

08:00

NFS server switch | NCSA | NFS services will be moved to a different host.

Brief outage of NFS services to SUI nodes, lsst-demo, lsst-demo2.

COMPLETED

Wednesday  

06:00

Wednesday  

07:00

Firewall drive replacement | NCSA | The current pfSense firewall has a bad drive. If it fails, all nodes behind the firewall will be inaccessible. Because the firewalls are redundant, no service interruptions are expected. | None expected

COMPLETED

Thursday 2017-12-14 04:00

Thursday 2017-12-14, 10:00

19:00

December lsst-dev maintenance

(off-schedule)

NCSA
  • Due to holiday schedules, the December maintenance event is being moved up 1 week, from 2017-12-21 to 2017-12-14
  • Routine system updates
  • Network switch replacement
  • lsst-db server replacement
  • Further details here
Do not expect any lsst-dev system to be available during this period.

COMPLETED


Tuesday 2017-11-28, 10:00 | TBD | Rolling reboots of PDAC qserv nodes | NCSA
  • In order to address a spontaneous rebooting issue with some qserv nodes, firmware upgrades are being performed.
The occasional qserv node will need to be rebooted. Experience with the first couple will allow NCSA to give more precise information on the order and timing of the reboots.

COMPLETED

2017-11-20 7:00 | 2017-11-20 14:00 | Nebula OpenStack cluster

NCSA

Nebula OpenStack cluster will be unavailable for emergency hardware maintenance. A failing RAID controller from one of the storage nodes and a network switch will be replaced.

Not all instances will be impacted. If any running Nebula instances are affected by the outage they will be shut down, then restarted again after we finish maintenance that day.

COMPLETED

Thursday 2017-11-16 06:00

Thursday 2017-11-16 10:00

Extended monthly lsst-dev maintenance

NCSA
  • Routine system updates.
  • Due to the volume of work that needs to be done, this event is being extended by 2 hrs. If systems become available before the end of the maintenance window, we will announce it here.
  • Be aware that this event will include an off-schedule purge of items in /scratch older than 180 days.
Do not expect any lsst-dev system to be available during this period.

COMPLETED

2017-10-31
NFS instability | NCSA | NFS becomes intermittently unresponsive.

~STABLE

We are guardedly optimistic that this problem has been resolved. PDAC is now utilizing native GPFS mounts.

2017-10-24 09:50 | LSST GPFS outage | NCSA | All LSST nodes from NCSA 3003 (e.g., lsst-dev01/lsst-dev7) and NPCF (verify-worker, PDAC) that connect to GPFS (as GPFS or NFS) have lost their connection. | GPFS

ONLINE

Storage is working to bring GPFS back online

2017-10-21 17:15

LSST public/protected network switch is down in rack N76 at NPCF


Nodes cannot reach DNS, LDAP, etc., so they largely cannot communicate with other nodes, e.g., no communication between affected verify-worker nodes and the Slurm scheduler on lsst-dev01, and no communication between affected qserv-db nodes and the rest of qserv.

Effectively, the whole verification cluster

RESTORED

in progress, replacement switch is on order

Workaround in progress. If all goes well, systems should be back online by late afternoon.

2017-10-19 06:00

2017-10-19 14:00 | qserv-master replacement | NCSA

qserv-master will be down so that systems engineering can finish configuring the new server and transferring files. Status updates: IHS-378.

qserv-master will be down for this entire period

COMPLETE

Archived events

DM Meetings and Events

Name | Dates | Location | Notes/links
JupyterCon 2018 | 2018/08/21-24 | New York City | https://www.oreilly.com/conferences/ (Call for speakers: 2018/01 - 2018/02)
LSST2018 Project & Community Workshop | 2018/08/13–17 | Tucson, AZ
LSST@Europe3 | 2018/06/11–15 | Lyon, France
SPIE 2018 | 2018/06/10-15 | Austin, TX | SPIE Conference in Austin, Meeting: 10 June – 15 June 2018
IVOA InterOp Northern Spring | 2018/05/28-06/01 | CADC, Victoria, BC
DMLT face to face | 2018/05/22-24 | SLAC or UW
Python in Astronomy 2018 | 2018/04/30-05/04 | New York, NY | Deadline for applications is December 9th.
DM Joint Meeting with Systems Engineering | 2018/03/06-08 | IPAC, Pasadena, CA
DESC Meeting | 2018/02/05–09 | SLAC | https://confluence.slac.stanford.edu/display/LSSTDESC/February+2018+Collaboration+Meeting+-+SLAC
Jupyter Widgets Workshop | 2018/01/23-26 | Saclay, France | Developer-centered workshop at CMAP Laboratory at Ecole Polytechnique. Some details at end of this GitHub thread, or contact Sylvain Corlay sylvain.corlay@gmail.com
DM Gen. 3 Middleware Meeting | 2018/01/22-25 | Princeton, NJ | Internal DM meeting to further develop SuperTask/Butler designs and do some collaborative development. Agenda and list of attendees are still in progress.
DM Boot Camp 2 | TBD
231st AAS Meeting | 2018/01/08–12 | Washington, DC
Towards Science in Chile with LSST in Chile | 2017/12/13-15 | Santiago, Chile | https://www.lsst-chile.cl/2017-workshop


