Task and Node States in CRAB3-HTCondor


This document covers the various states of a task and node in CRAB3-HTCondor.

Vocabulary Terms

Although this document assumes a general familiarity with HTCondor and CRAB3 concepts, we outline our definitions of a few vocabulary words that are often confusing.

  • Task: A unit of work corresponding to the processing or simulation of an input dataset.
  • DAG task: The realization of a task as a condor_dagman invocation. When run, the DAG task is logically divided up into many nodes. The condor_dagman process will attempt to complete the DAG task successfully. When the DAG task is done, the CRAB3 task is successful.
  • Node: A portion of the work in a task. The work associated with the node does not change during the task lifetime. At any given time when the task is in the SUBMITTED state, a node may have either zero or one HTCondor jobs associated with it.
    • Node states are reported to the user in monitoring and in the output of crab status -l.
  • Job: A job in the HTCondor schedd. This corresponds to an attempt to complete a single node. A successful job is necessary, but not sufficient, to successfully complete a node, because the post-job work run by condor_dagman (such as ASO) could fail. At most one job may be in the HTCondor schedd per node at a time; however, over time, there may be many jobs that attempted to complete a node.
    • The job states are equivalent to the job states in HTCondor.
  • Resubmit: A job that is not the first attempt at a node where the previous attempt ended normally. A resubmit may either be done manually (if the prior attempt ended with a fatal error) or by condor_dagman itself (if the prior attempt ended with a recoverable error).
  • Restart: A job that is restarted by HTCondor because the previous run ended abnormally (examples: HTCondor lost its network connection to the job, or there was a glexec issue). No user action causes a restart.
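
The relationship between nodes and jobs can be illustrated with a minimal Python sketch. The class and attribute names below (Job, Node, history, in_schedd) are hypothetical and chosen for illustration only; they are not CRAB3 or HTCondor APIs. The sketch only encodes the rule stated above: a node keeps its identity for the task lifetime and has zero or one job in the schedd at any time, while many jobs may accumulate over time.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class Job:
    """One HTCondor job: a single attempt to complete a node (illustrative)."""
    attempt: int
    in_schedd: bool = True  # set to False once the job has left the schedd


@dataclass
class Node:
    """A fixed portion of a task's work; jobs come and go, the node stays."""
    node_id: int
    history: List[Job] = field(default_factory=list)

    def active_job(self) -> Optional[Job]:
        """Return the job currently in the schedd, if any (zero or one)."""
        active = [job for job in self.history if job.in_schedd]
        assert len(active) <= 1, "at most one job per node may be in the schedd"
        return active[0] if active else None

    def submit(self) -> Job:
        """Start a new attempt; refuse if a job is already in the schedd."""
        if self.active_job() is not None:
            raise RuntimeError("node already has a job in the schedd")
        job = Job(attempt=len(self.history) + 1)
        self.history.append(job)
        return job
```

In this model, a resubmit or restart is simply another call to submit() after the previous Job has left the schedd; the Node object itself never changes.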

Task States

  • NEW, RESUBMIT, KILL: Temporary statuses to indicate the action ('submit', 'resubmit' or 'kill') that has to be applied to the task.
  • QUEUED: An action ('submit', 'resubmit' or 'kill') affecting the task is queued in the CRAB3 system.
  • SUBMITTED: The task was submitted to HTCondor as a DAG task. The DAG task is currently running.
  • SUBMITFAILED: The 'submit' action has failed (CRAB3 was unable to create a DAG task).
  • FAILED: The DAG task completed all nodes and at least one is a permanent failure.
  • COMPLETED: All nodes have been completed.
  • KILLED: The user killed the task.
  • KILLFAILED: The 'kill' action has failed.
  • RESUBMITFAILED: The 'resubmit' action has failed.
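
As a reading aid, the pairing between each action and its temporary and failure statuses can be written as a small lookup table. This is only a sketch: the status names are taken from the list above, but the table and the helper function are hypothetical and are not part of CRAB3.

```python
# Hypothetical lookup table built from the task state list above.
# "requested" is the temporary status that marks the pending action;
# "on_failure" is the status the task ends up in if the action fails.
ACTION_STATUSES = {
    "submit": {"requested": "NEW", "on_failure": "SUBMITFAILED"},
    "resubmit": {"requested": "RESUBMIT", "on_failure": "RESUBMITFAILED"},
    "kill": {"requested": "KILL", "on_failure": "KILLFAILED"},
}


def failure_status(action: str) -> str:
    """Return the task status reported when the given action fails."""
    return ACTION_STATUSES[action]["on_failure"]
```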

Node States

States in bold are terminal states - nodes will stay in this state unless there is some human intervention.

States in italics are transient states - CRAB3 is still attempting to make progress on this node. No user interaction is required.

  • unsubmitted: The node is ready to have a job, but condor_dagman has not yet submitted a job to the HTCondor schedd.
    • This is usually because the task already has too many idle jobs or the DAG task has just recently started up.
  • idle: The job for this node is in the HTCondor schedd queue, but is not running.
  • running: The job for this node is running on a remote worker node.
  • finished: The node has had a successful job and transfer.
  • failed: The node had a permanent failure. condor_dagman will not resubmit this job without user intervention (via 'crab resubmit').
  • cooloff (start state): condor_dagman is waiting before submitting this job. Typically, this is because either the DAG task is brand new (and condor_dagman hasn't considered this job yet) or the prior job for this node exited with a recoverable error. condor_dagman will resubmit this job in the future.
  • transferring: The job has finished and condor_dagman is running stageout.
  • killing: condor_dagman is killing the job associated with this node. Killing is typically very quick, but may last several minutes in a few cases.
  • held: The job associated with this node is held in the HTCondor queue due to some fatal error; this will require operator intervention to fix.

Note that, due to implementation details, cooloff is the start state for all nodes. We hope to start all nodes in the unsubmitted state in the future.
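
The node states above can be collected into an enumeration, as in the following sketch. The NodeState class, START_STATE, and who_must_act are hypothetical names used for illustration, not CRAB3 code. The intervention table records only what the descriptions above state explicitly: failed needs a user action ('crab resubmit') and held needs an operator.

```python
from enum import Enum
from typing import Optional


class NodeState(Enum):
    """Node states as listed above (illustrative, not a CRAB3 class)."""
    UNSUBMITTED = "unsubmitted"
    IDLE = "idle"
    RUNNING = "running"
    FINISHED = "finished"
    FAILED = "failed"
    COOLOFF = "cooloff"
    TRANSFERRING = "transferring"
    KILLING = "killing"
    HELD = "held"


# Due to implementation details, every node currently starts in cooloff.
START_STATE = NodeState.COOLOFF

# States whose description says someone has to act before the node moves on.
INTERVENTION_BY = {
    NodeState.FAILED: "user",    # via 'crab resubmit'
    NodeState.HELD: "operator",
}


def who_must_act(state: NodeState) -> Optional[str]:
    """Return who has to intervene for this state, or None if nobody does."""
    return INTERVENTION_BY.get(state)
```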

Node State Transitions

The figure below outlines the node state transitions in CRAB3.

Automatic Resubmit Policies

CRAB3 will examine the results of the job and, for a few known cases, will automatically resubmit the job.

CRAB3 will NEVER resubmit if ANY of these are true:

  • The last attempt at the job lasted more than 24 hours.
  • The cumulative wall time for all attempts is more than 36 hours.
  • The last attempt used more than 2GB of memory.
  • The job did not produce a usable job report.
  • The job hit its max resubmit limit (currently, 10 resubmits are allowed after any submit/resubmit).

If none of the above are true, CRAB3 will resubmit if ANY ONE of these is true:

  • The job ended with a file open or read error (8021, 8020, 8028)
  • The job ended with a stageout error (60307)
  • The job did not find the required CMSSW version on the worker node (10034)
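
The two lists above amount to a veto check followed by an exit-code check. The following Python sketch expresses that logic under stated assumptions: the AttemptSummary fields and the should_resubmit function are hypothetical names, memory is assumed to be measured in MB with 2 GB taken as 2048 MB, and the resubmit limit is treated as reached once 10 automatic resubmits have happened. It is not the actual post-job code.

```python
from dataclasses import dataclass
from typing import Optional

HOUR = 3600
MAX_LAST_ATTEMPT_SECONDS = 24 * HOUR
MAX_TOTAL_WALL_SECONDS = 36 * HOUR
MAX_MEMORY_MB = 2 * 1024  # "2GB" from the list; the MB convention is assumed
MAX_RESUBMITS = 10
# File open/read errors, stageout error, missing CMSSW version.
RECOVERABLE_EXIT_CODES = {8020, 8021, 8028, 60307, 10034}


@dataclass
class AttemptSummary:
    """Hypothetical summary of a node's attempts (not a CRAB3 type)."""
    last_attempt_seconds: float
    total_wall_seconds: float
    peak_memory_mb: float
    has_job_report: bool
    resubmits_so_far: int  # automatic resubmits since the last submit/resubmit
    exit_code: Optional[int]


def should_resubmit(a: AttemptSummary) -> bool:
    """Apply the NEVER-resubmit vetoes first, then the recoverable codes."""
    vetoes = (
        a.last_attempt_seconds > MAX_LAST_ATTEMPT_SECONDS,
        a.total_wall_seconds > MAX_TOTAL_WALL_SECONDS,
        a.peak_memory_mb > MAX_MEMORY_MB,
        not a.has_job_report,
        a.resubmits_so_far >= MAX_RESUBMITS,
    )
    if any(vetoes):
        return False
    return a.exit_code in RECOVERABLE_EXIT_CODES
```

Checking the vetoes first mirrors the wording above: a recoverable exit code is only considered when none of the NEVER conditions hold.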

Note that the CRAB3 wrapper forces the minimum job runtime to be at least 20 minutes; this prevents a single worker node from failing too many jobs.

Whether a job is resubmitted is determined by the post-job process, run by dagman on the schedd. If the post-job determines the job can be resubmitted, then the resubmit will be done by the dagman process at some time in the future; immediately, however, the job enters the cooloff state.

Automatic Restart Policies

If the job goes into the hold state for one of the following reasons, the HTCondor schedd will automatically release it:

  • glexec failures.
  • File stage-in failures.

The job can be released at most once every 5 minutes.
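
The release rule can be sketched as a simple rate-limited check. This is an illustration only: the reason labels and the may_release function are hypothetical, and the sketch does not show how the HTCondor schedd is actually configured to do this.

```python
from typing import Optional

RELEASE_INTERVAL_SECONDS = 5 * 60
# Hypothetical labels for the two hold reasons listed above.
AUTO_RELEASE_REASONS = {"glexec_failure", "stage_in_failure"}


def may_release(hold_reason: str,
                seconds_since_last_release: Optional[float]) -> bool:
    """Return True if a held job qualifies for automatic release now."""
    if hold_reason not in AUTO_RELEASE_REASONS:
        return False
    if seconds_since_last_release is None:
        return True  # the job has never been released before
    return seconds_since_last_release >= RELEASE_INTERVAL_SECONDS
```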

If the HTCondor starter process has an unexpected and unrecoverable network disconnect from the schedd (during job startup, job running, or the end of the job), then the HTCondor schedd will requeue the job and attempt to re-run it in the future.

Automatic Site Blacklist Policies

(NOTE: We are in the process of implementing this policy.)

A job may blacklist a site at submit time if ALL of the following conditions are met:

  • The previous job for this node failed at the site.
  • There have been more than 10 completed jobs for this task (failed or successful).
  • There are 5 successful jobs for this task (this protects against tasks which are going to fail everywhere).
  • The site success rate is less than 50% for this task.
  • The site has failed at least 5 jobs.
  • The job can match at least one other site.
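
Because all six conditions must hold together, the policy reduces to a single conjunction. The sketch below is one possible reading, not the implementation: TaskStats, SiteStats, and should_blacklist are hypothetical names, "5 successful jobs" is interpreted as at least 5, and a site with no completed jobs is assumed to have a success rate of 100%.

```python
from dataclasses import dataclass


@dataclass
class TaskStats:
    """Hypothetical per-task counters (not a CRAB3 type)."""
    completed_jobs: int   # failed or successful
    successful_jobs: int


@dataclass
class SiteStats:
    """Hypothetical per-site counters for this task."""
    succeeded: int
    failed: int


def should_blacklist(previous_job_failed_here: bool,
                     task: TaskStats,
                     site: SiteStats,
                     other_matching_sites: int) -> bool:
    """Return True only if ALL of the blacklist conditions are met."""
    site_total = site.succeeded + site.failed
    # Assumption: a site with no completed jobs counts as fully successful.
    success_rate = site.succeeded / site_total if site_total else 1.0
    return (
        previous_job_failed_here
        and task.completed_jobs > 10
        and task.successful_jobs >= 5   # assumption: "5" means at least 5
        and success_rate < 0.5
        and site.failed >= 5
        and other_matching_sites >= 1
    )
```

For example, with 20 completed jobs of which 8 succeeded, a site that has 2 successes and 6 failures for this task would be blacklisted after another failure there, provided the job can still match some other site.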

A 'crab resubmit' will clear all statistics with respect to the automatic blacklist.

In the future, the automatic blacklist will be shown in the glidemon monitoring.

Topic revision: r8 - 2015-09-14 - AndresTanasijczuk