Summary

This document outlines the monitoring resources available to developers who are using NCSA LDF resources.  Expansion of and improvements to the monitoring framework are ongoing, but the framework is stable enough to give developers a good idea of system health and utilization.

Current Monitoring Systems

System | Monitoring Platform | Site | Status
lsst-dev, lsst-stor, lsst-dbdev | Nagios | https://lsst-web.ncsa.illinois.edu/nagios/ | Legacy
All Machines | Grafana | https://monitor-ncsa.lsst.org | Production


Accessing Production Monitoring

Production monitoring can be accessed at the link in the table above.  Credentials are required; they are the same ones used to log into LSST machines that are hosted at NCSA (e.g. lsst-dev01).  If you can't remember your credentials, they can be reset here.

Available Dashboards

Dashboard Name | Description of Dashboard | Metrics Collected
Batch System | Overall state of the batch system from a high level | Running Jobs/Pending Jobs by Partition, Nodes Busy/Idle/Offline by Partition, Batch System Utilization, Batch Usage Trends
Batch System Compute Summary | Node health metrics for machines participating in the batch system | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage
GPFS Usage Report | Report of usage across LSST file system areas, broken down by user | Bytes Used, Files Owned by each user per home, project, scratch, and datasets
LSST-NCSA Status | Overall health panel of LDF systems (check if experiencing issues) | States of all sub-system services
Qserv Systems Summary | Node health metrics for machines serving Qserv needs | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage, Web Page/DB Health
SUI Systems Summary | Node health metrics for machines serving SUI needs | Ping, Uptime, Load, Memory Usage, Network Utilization, Local Disk Usage, Web Page Health
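
The dashboards listed above can also be looked up by name from a script. The following is a minimal sketch, not a documented procedure for this site: it assumes the production Grafana instance accepts HTTP basic authentication with the same NCSA credentials used in the browser, and that Grafana's standard dashboard search endpoint (/api/search) is reachable. Neither has been confirmed for this deployment. The helper names build_search_request, dashboard_titles, and find_dashboards are hypothetical and exist only for illustration.

```python
import base64
import json
import urllib.parse
import urllib.request

# Production monitoring site from the table of monitoring systems.
GRAFANA_URL = "https://monitor-ncsa.lsst.org"


def build_search_request(base_url, username, password, query):
    """Build (but do not send) a request to Grafana's dashboard search API.

    Assumption: the server allows HTTP basic auth for API calls.
    """
    params = urllib.parse.urlencode({"type": "dash-db", "query": query})
    url = f"{base_url.rstrip('/')}/api/search?{params}"
    token = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {token}"})


def dashboard_titles(response_body):
    """Extract dashboard titles from the JSON list returned by the search API."""
    return [item["title"] for item in json.loads(response_body)]


def find_dashboards(username, password, query, base_url=GRAFANA_URL):
    """Send the search request and return the titles of matching dashboards."""
    request = build_search_request(base_url, username, password, query)
    with urllib.request.urlopen(request, timeout=10) as response:
        return dashboard_titles(response.read().decode("utf-8"))
```

Calling find_dashboards(username, password, "Batch System") would, under the assumptions above, return the titles of dashboards whose names match the query. Read the password from a prompt (for example with getpass.getpass) rather than storing it in the script.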


Tools Used in Framework

Collectors

Metric Storage

Visualization & Alerts

Log Analysis

References


