Summary
This document outlines the monitoring resources available to developers who use NCSA LDF resources. The monitoring framework is still being expanded and improved, but it is stable enough to give developers a good picture of system health and utilization.
Current Monitoring Systems
System | Monitoring Platform | Site | Status |
---|---|---|---|
lsst-dev, lsst-stor, lsst-dbdev | Nagios | https://lsst-web.ncsa.illinois.edu/nagios/ | Legacy |
All Machines | Grafana | https://monitor-ncsa.lsst.org | Production |
Accessing Production Monitoring
Production monitoring can be accessed at the Grafana link in the table above. Credentials are required; they are the same credentials used to log into LSST machines hosted at NCSA (e.g. lsst-dev01). If you can't remember your credentials, they can be reset here.
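For developers who prefer to script against the monitoring site instead of using the browser, the following is a minimal sketch of how an authenticated request to Grafana's dashboard search endpoint (`/api/search`) could be built. It assumes that this Grafana instance accepts HTTP basic authentication with the same NCSA credentials, which this document does not confirm. The function names `build_dashboard_search_request` and `list_dashboard_titles` are hypothetical helpers introduced only for illustration.

```python
import base64
import json
import urllib.parse
import urllib.request

# Production Grafana URL taken from the table above.
GRAFANA_URL = "https://monitor-ncsa.lsst.org"


def build_dashboard_search_request(username, password, base_url=GRAFANA_URL):
    """Build (but do not send) a request to Grafana's dashboard search API.

    Assumption: the server allows HTTP basic auth with NCSA credentials.
    """
    query = urllib.parse.urlencode({"type": "dash-db"})  # dashboards only
    request = urllib.request.Request(f"{base_url}/api/search?{query}")
    token = base64.b64encode(f"{username}:{password}".encode("utf-8"))
    request.add_header("Authorization", "Basic " + token.decode("ascii"))
    return request


def list_dashboard_titles(request, timeout=10):
    """Send the request and return the dashboard titles from the JSON reply."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        dashboards = json.load(response)
    return [entry["title"] for entry in dashboards]
```

A call such as `list_dashboard_titles(build_dashboard_search_request(user, password))` would then return the dashboard names if the assumption about basic authentication holds. Building the request separately from sending it keeps the credential handling easy to inspect without touching the network.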
Available Dashboards
Dashboard Name | Description of Dashboard | Metrics Collected |
---|---|---|
Batch System | Overall state of the batch system from a high level | Running Jobs/Pending Jobs by partition, Nodes Busy/Idle/Offline by Partition, Batch System Utilization, Batch Usage Trends |
Batch System Compute Summary | Node health metrics for machines participating in the batch system | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage |
GPFS Usage Report | Usage across LSST file system areas, broken down by user | Bytes Used and Files Owned by each user for home, project, scratch, and datasets |
LSST-NCSA Status | Overall health panel of LDF systems (check here if you are experiencing issues) | States of all sub-system services |
Qserv Systems Summary | Node health metrics for machines serving qserv needs | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage, Web Page/DB Health |
SUI Systems Summary | Node health metrics for machines serving SUI needs | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage, Web Page Health |
Tools Used in Framework
Collectors
Metric Storage
Visualization & Alerts
Log Analysis
References
- TICK Stack (InfluxData Platform)
- DM JTM 2018 Presentation