HTCondor DAGMAN Log Files

The HTCondor DAGMAN utility writes a variety of files to your $HOME/condor_scratch directory.   One of the files is a log of all processing done through HTCondor.   All log messages are in the form of records, which are delimited by an ellipses.

Records are delimited with an ellipses. Here is a portion of one of the files, with "updated", "submitted" and "terminated" records:

006 (61916.000.000) 10/04 11:48:51 Image size of job updated: 120060
    0  -  ResidentSetSize of job (KB)
...
000 (61931.000.000) 10/04 11:48:56 Job submitted from host: <141.142.225.136:37267>
    DAG Node: A51
...
005 (62891.000.000) 10/04 12:54:29 Job terminated.
    (1) Normal termination (return value 0)
        Usr 0 00:11:24, Sys 0 00:00:05  -  Run Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Run Local Usage
        Usr 0 00:11:24, Sys 0 00:00:05  -  Total Remote Usage
        Usr 0 00:00:00, Sys 0 00:00:00  -  Total Local Usage
    2647  -  Run Bytes Sent By Job
    1912  -  Run Bytes Received By Job
    2647  -  Total Bytes Sent By Job
    1912  -  Total Bytes Received By Job
    Partitionable Resources :    Usage  Request
       Cpus                 :                 1
       Disk (KB)            :       12       12
       Memory (MB)          :               871
...

ctrl_stats package

The ctrl_stats package contains commands to ingest Condor log files into a database by taking the event "records" HTCondor emits during execution. The utilities group all of these records according to Condor job id in order to get an overview of what happened during the job. These records are then reduced into two tables, one that describes each submitted dag node and one that describes all of the events that happens to each condor job id. Note that the dag node can be submitted more than one time for a variety of reasons. All of this information is captured in the database records.  Note that you have to have database write capabilities to execute the ingest command.echo

The following commands are available:

condorLogIngest.py - Takes a list of log files and ingests them into a database 

 

usage: condorLogIngest.py [-h] -H HOST [-p PORT] -d DATABASE -f FILENAMES [FILENAMES ...] [-v]

example:
$ cd ~/condor_scratch/srp_2012_1211_090023
$ condorLogIngest.py -H lsst10 -p 3600 -d testing -f logs/worker-pre.log

 


condorDirectoryIngest - given a directory in ~/condor_scratch, take the preJob, worker and postJob log files and add them into a database named after the directory.

usage: condorDirectoryIngest -H HOST -p PORT -d directory
example:
$ cd ~/condor_scratch
$ condorDirectoryIngest -H lsst10 -p 3600 -d srp_2012_0925_160117

condorLogInfo.py - a debugging utility to view record groups. This was used as a debugging tool, but is useful in viewing records without having to do a complete database ingest.

usage: condorLogInfo.py [-h] [-v] [-c CONDORIDS [CONDORIDS ...]] -f FILENAMES [FILENAMES ...]
example:
$ cd ~/condor_scratch
$ condorLogInfo.py -c 630.000.000 -f srp_2012_0925_160117/*nodes.log

 

Record Groups

 The are examples of how these individual records are grouped:

Job was executing, increase it's image size to 118472, and was evicted because job id 68101 (the dagman job which submitted this dag) was removed.

Submitted 2012-10-18 16:54:01 jobNum=68103.000.000 dagNode=A1
Executing 2012-10-18 16:54:03 host=198.202.102.162:53706
Updated 2012-10-18 16:54:12 imageSize=118472
Evicted 2012-10-18 16:57:04 
Aborted 2012-10-18 16:57:04 	removed because <OtherJobRemoveRequirements = DAGManJobId == 68101> fired when job (68101.0) was removed

Job was submitted, but never executed, and was evicted when the dagman job was removed.

Submitted 2012-10-18 16:57:02 jobNum=68247.000.000 dagNode=A145
Aborted 2012-10-18 16:57:03 	removed because <OtherJobRemoveRequirements = DAGManJobId == 68101> fired when job (68101.0) was removed

Job was submitted, executed, updated its image size to 117796, then again updated it's image size to 689000, and terminated.

Submitted 2012-10-03 10:57:10 jobNum=59179.000.000 dagNode=A1078
Executing 2012-10-03 10:57:12 host=198.202.101.70:58180
Updated 2012-10-03 10:57:21 imageSize=117796
Updated 2012-10-03 11:02:22 imageSize=689000
Terminated 2012-10-03 11:07:49

Job was submitted, executed on node 198.202.100.66, updated its image size to 100916, then again to 891028, communication to the node was lost, reconnection failed once, executing started on node 198.202.100.198, and then terminated.

Submitted 2012-10-04 09:39:07 jobNum=60540.000.000 dagNode=A6
Executing 2012-10-04 09:39:28 host=198.202.100.66:39897
Updated 2012-10-04 09:39:36 imageSize=100916
Updated 2012-10-04 09:44:36 imageSize=891028
SocketLost 60540.000.000 2012-10-04 09:53:55
SocketReconnectFailure 60540.000.000 2012-10-04 10:13:55
Executing 2012-10-04 10:13:58 host=198.202.100.198:43111
Terminated 2012-10-04 10:25:29

Submitted, executing on node 198.202.101.206, updated to image size 119476, an exception in the Shadow daemon occurs, execution starts on 198.202.101.110, updated image size to 102492, and then again to 682660, disconnected from node, reconnection to node failed, execution starts on 198.202.101.185, and then terminates.

Submitted 2012-10-04 09:41:41 jobNum=60660.000.000 dagNode=A126
Executing 2012-10-04 09:42:08 host=198.202.101.206:51636
Updated 2012-10-04 09:42:17 imageSize=119476
ShadowException 60660.000.000 2012-10-04 09:48:18
Executing 2012-10-04 09:48:45 host=198.202.101.110:48658
Updated 2012-10-04 09:48:53 imageSize=120492
Updated 2012-10-04 09:54:24 imageSize=682660
SocketLost 60660.000.000 2012-10-04 09:56:47
SocketReconnectFailure 60660.000.000 2012-10-04 10:16:47
Executing 2012-10-04 10:16:52 host=198.202.101.185:60256
Terminated 2012-10-04 10:29:13

Database Table Formats

Nodes Table

The following is the definition of the nodes table which ctrl_stats writes into:

CREATE TABLE IF NOT EXISTS `nodes` (
  `id` int(11) unsigned NOT NULL auto_increment,
  `condorId` varchar(24) default NULL,
  `dagNode` varchar(10) default NULL,
  `submitTime` datetime default NULL,
  `executionHost` varchar(24) default NULL,
  `executionStartTime` datetime default NULL,
  `executionStopTime` datetime default NULL,
  `updateImageSize` int(11) default NULL,
  `updateMemoryUsageMB` int(11) default NULL,
  `updateResidentSetSizeKB` int(11) default NULL,
  `userRunRemoteUsage` int(11) default NULL,
  `sysRunRemoteUsage` int(11) default NULL,
  `finalMemoryUsageMB` int(11) default NULL,
  `finalMemoryRequestMB` int(11) default NULL,
  `bytesSent` int(11) default NULL,
  `bytesReceived` int(11) default NULL,
  `terminationTime` datetime default NULL,
  `terminationCode` varchar(3) default NULL,
  `terminationReason` varchar(4096) default NULL,
  PRIMARY KEY  (`id`)
) ENGINE=MyISAM DEFAULT CHARSET=latin1;
  • condorId - the number Condor assigns for this job
  • dagNode - the DAGMan node name associated with this job
  • submitTime - the time that DAGMan submitted this job to Condor
  • executionHost - The host on which this job started executing. This may be blank if the job was never executed.
  • executionStartTime - The time at which this job started executing. This may be listed as "0000-00-00 00:00:00" if the job was never started.
  • executionStopTime - The time at which this job stop executing. This may be listed as "0000-00-00 00:00:00" if the job did not terminate normally.
  • updateImageSize - Condor can sometimes send out information about the imageSize of the job. This is the last updated received before the job ended. This may be 0.
  • updateMemoryUsageMB - In updates, Condor sometimes adds additional information about the memory usage in MB. This may be 0, if this information is not reported.
  • updateResidentSetSizeKB - in updates, Condor sometimes adds additional information about the portion of the process's memory that is held in RAM. This may be 0, if this information is not reported.
  • userRunRemoteUsage - At termination, the amount of remote user time that this job took to execute.
  • sysRunRemoteUsage - At termination, the amount of remote system time that this job took to execute. 
  • finalMemoryUsageMB - At termination, the last reported memory usage statistic reported by Condor
  • finalMemroryRequestMB - At termination, the last reported memory request statistic reported by Condor. This may be 0.
  • bytesSent - the number of bytes set to this job.
  • bytesReceived - the number of bytes received from this job.
  • terminationTime - The time recorded for this job's termination.
  • terminationCode - The Condor event code for this terminated job.
  • terminationReason - The text description reported by Condor as the reason for this job's termination.

Event Code Table

Each time Condor outputs a record it puts out an event code. 

+-----------+-----------------------------------------+
| eventCode | eventName                               |
+-----------+-----------------------------------------+
| 000       | Job submitted                           |
| 001       | Job executing                           |
| 002       | Error in executable                     |
| 003       | Job was checkpointed                    |
| 004       | Job evicted from machine                |
| 005       | Job Terminated                          |
| 006       | Image size of job updated               |
| 007       | Shadow exception                        |
| 008       | Generic log event                       |
| 009       | Job aborted                             |
| 010       | Job was suspended                       |
| 011       | Job was unsuspended                     |
| 012       | Job was held                            |
| 013       | Job was released                        |
| 014       | Parallel node executed                  |
| 015       | Parallel node terminated                |
| 016       | POST script terminated                  |
| 017       | Job submitted to Globus                 |
| 018       | Globus submit failed                    |
| 019       | Globus resource up                      |
| 020       | Detected Down Globus Resource           |
| 021       | Remote error                            |
| 022       | Remote system call socket lost          |
| 023       | Remote system call socket reestablished |
| 024       | Remote system call reconnect failure    |
| 025       | Grid Resource Back Up                   |
| 026       | Detected Down Grid Resource             |
| 027       | Job submitted to grid resource          |
| 028       | Job ad information event was triggered  |
| 029       | The job's remote status is unknown      |
| 030       | The job's remote statis is known again  |
| 031       | unused                                  |
| 032       | unused                                  |
| 033       | Attribute update                        |
+-----------+-----------------------------------------+
  • No labels