WMStats Monitoring for Tier0


WMStats is a status-based monitoring tool for WMAgent. It is a user interface that presents the current status of the tier0 system. In WMStats you can find:

  • A summary of the production the tier0 is working on: run processing status.
  • Job creation for the requests in the system, job success rate and failure modes - if any.
  • Tier0 WMAgent status and health.
Run-Workflow-Jobs.png

WMStats creates a summary of the run, request/workflow and job progress. Each run can have several workflows depending on the streams it has, e.g. Run228566 can have workflows like:

  • Express_Run228566_StreamExpressCosmics
  • Repack_Run228566_StreamA
  • Repack_Run228566_StreamCalibration ...
And each workflow can have a different number of jobs created.
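
The naming convention in the examples above (workflow type, run number, input stream) can be split apart programmatically. The following is only a minimal sketch: the regular expression is inferred from the Express and Repack names shown here, the helper `parse_workflow_name` is hypothetical, and other workflow types may follow a different pattern.

```python
import re
from typing import NamedTuple


class WorkflowName(NamedTuple):
    workflow_type: str
    run: int
    stream: str


# Assumed pattern, inferred only from names such as
# "Repack_Run228566_StreamA"; it is not an official specification.
_NAME_RE = re.compile(r"^(?P<type>[A-Za-z0-9]+)_Run(?P<run>\d+)_Stream(?P<stream>\w+)$")


def parse_workflow_name(name: str) -> WorkflowName:
    """Split a workflow name into its type, run number and stream."""
    match = _NAME_RE.match(name)
    if match is None:
        raise ValueError(f"unrecognised workflow name: {name!r}")
    return WorkflowName(match["type"], int(match["run"]), match["stream"])


print(parse_workflow_name("Express_Run228566_StreamExpressCosmics"))
```

Grouping the parsed names by `run` gives the per-run list of workflows that the Run tab summarizes.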

These are the WMStats instances the tier0 is using:

Observations: Percentages DO NOT REFLECT the workflow status. They vary according to the job counts and go to 0% when the workflow is complete. Also, once all the processing is done, the workflow is only declared complete after the merge jobs and the slow logCollect jobs have finished.

Tool description

Run tab

Run_tab.png

In this tab you can find the current list of runs in the system and their details:

  • requests: number of workflows created but not yet completed for this run
  • run status: depends on the status of the workflows for the given run. It can be: Active, Real Time Processing, Real Time Merge, Real Time Harvesting, PromptReco, Reco Merge, Reco Harvest, Processing Done, Real Time Done, Complete
  • success, failure, pending, running and job progress: total number of jobs, broken down by job status. Job progress shows the percentage of jobs that are done compared to the total created.
  • cool off: a job in cool off is currently retrying. You may want to check the failure mode of these jobs: it can be a temporary problem, but it can also be an early warning of a real problem. The job is paused if the 4th retry still fails.
  • paused job: jobs that exhausted the maximum number of retries. The failure mode should be investigated and fixed if possible. Resuming paused jobs is a manual procedure done by the tier0 experts.
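
The relation between cool off and paused described above can be written down as a small rule. This is a sketch for illustration, not the WMAgent implementation: the function `job_state` is hypothetical, and the limit of 4 is taken from the sentence above about the 4th retry.

```python
MAX_RETRIES = 4  # from the text: the job is paused if the 4th retry still fails


def job_state(failed_retries: int, max_retries: int = MAX_RETRIES) -> str:
    """State of a job whose latest attempt failed.

    failed_retries counts the retries that have already failed
    (0 means only the initial attempt has failed so far).
    """
    if failed_retries < 0:
        raise ValueError("failed_retries must not be negative")
    if failed_retries >= max_retries:
        return "paused"    # retries exhausted, needs expert intervention
    return "cool off"      # waiting for the next automatic retry


for n in range(5):
    print(n, job_state(n))
```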

Workflow Tab

Requests_Tab.png

In this tab you can find the current list of workflows in the system and their details (you can filter them by run, workflow or status from the search bar):

  • workflow: name of the request. This name contains the workflow type, the run number and the stream used as input.
  • status: can be new, Closed, Merge, Harvesting, Processing Done, AlcaSkim, completed
  • duration: time the request has spent in the system (since it was created)
  • submitted: number of jobs created and submitted for this workflow
  • pending, running and job progress: total number of jobs, broken down by job status. Job progress shows the percentage of jobs that are done compared to the total created.
  • cool off: a job in cool off is currently retrying. You may want to check the failure mode of these jobs: it can be a temporary problem, but it can also be an early warning of a real problem. The job is paused if the 4th retry still fails.
  • paused job: jobs that exhausted the maximum number of retries. The failure mode should be investigated and fixed if possible. Resuming paused jobs is a manual procedure done by the tier0 experts.

Jobs Tab

Jobs_tab.png

In this tab you can find the current list of tasks for a given workflow:

  • task: name of one of the tasks of the given workflow.
  • created: number of jobs created in the tier0 agent for the task
  • queued: number of jobs that have been submitted by the tier0 agent
  • pending: number of jobs pending in condor
  • running: number of jobs running in condor
  • cool off: a job in cool off is currently retrying. You may want to check the failure mode of these jobs: it can be a temporary problem, but it can also be an early warning of a real problem. The job is paused if the 4th retry still fails.
  • paused job: jobs that exhausted the maximum number of retries. The failure mode should be investigated and fixed if possible. Resuming paused jobs is a manual procedure done by the tier0 experts.
  • event and lumi progress: number of events and lumis (average for all the output datasets)

Workflows status

Depending on the kind of workflow, there are several statuses. The following diagram shows the status transitions for Repack, Express and PromptReco workflows.

Workflow_Status.png

A few remarks: if a workflow status is new, the run is not closed yet. Check for missing bookkeeping if a workflow stays in this status for a long time. The system currently archives every workflow that reaches completed as soon as possible. There is development work to change this: the goal is to keep showing a workflow until the run is completed, as this is convenient for monitoring purposes.

Run status

The run status is defined by the latest status of the workflows the run has. The following diagram describes the transitions:

Run_Status.png

  • When a run is first created, its status is set to Active. This status may mean that data is still coming in for the given run.
  • Four statuses are directly linked to the Express and Repack steps: Real Time Processing, Real Time Merge, Real Time Harvesting and Real Time Done.
  • Four statuses are directly linked to PromptReco: PromptReco, Reco Merge, Reco Harvest, Processing Done.
  • When every workflow is completed, the status is set to Complete.
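
The grouping of run statuses listed above can be captured in a small lookup, for example when post-processing a list of runs copied from WMStats. This is a sketch only: the status strings come from the list above, while the function `run_phase` and the phase labels it returns are hypothetical names chosen for illustration.

```python
# Status names as listed above; the grouping mirrors the four bullets.
REAL_TIME = ("Real Time Processing", "Real Time Merge",
             "Real Time Harvesting", "Real Time Done")
PROMPT_RECO = ("PromptReco", "Reco Merge", "Reco Harvest", "Processing Done")


def run_phase(status: str) -> str:
    """Map a run status to the processing phase it belongs to."""
    if status == "Active":
        return "created"          # data may still be coming in
    if status in REAL_TIME:
        return "express/repack"
    if status in PROMPT_RECO:
        return "promptreco"
    if status == "Complete":
        return "complete"         # every workflow is completed
    raise ValueError(f"unknown run status: {status!r}")


print(run_phase("Real Time Merge"))
```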

FAQ

Are Express and Repack done for a given run?

We want Express and Repack done as we take data. So far, the best way to check this is the run status: a run that is still being processed shows "Real Time Processing". To see how far along the run is, click the "L" button, which takes you to the expanded list of workflows. There you can see what is still left to finish (hopefully not Express).

If the run is past the point we need to worry about, its state will be "Real Time Done".

Are we keeping up with data taking?

We use the same concept here: a healthy WMStats should show only the first few runs as "Real Time Processing" (the fewer the better). All subsequent runs must be in the "Real Time Done" state, which, again, means we are done with Express and Repack.
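
The rule of thumb above can be expressed as a check over the run statuses in the order WMStats lists them. The sketch below rests on several assumptions that are not stated on this page: the list is ordered newest run first, the threshold `max_in_progress=3` is an arbitrary stand-in for "the first few", and Active, Real Time Merge and Real Time Harvesting are also counted as "not yet done with Express and Repack". The function `keeping_up` is hypothetical.

```python
# Assumption: these statuses mean Express/Repack are not finished yet.
IN_PROGRESS = {"Active", "Real Time Processing",
               "Real Time Merge", "Real Time Harvesting"}


def keeping_up(statuses_newest_first, max_in_progress=3):
    """Return True when only the newest few runs are still in Express/Repack."""
    in_progress = [status in IN_PROGRESS for status in statuses_newest_first]
    count = sum(in_progress)
    # Healthy: few unfinished runs, and all of them at the front of the list,
    # i.e. no older run is stuck behind runs that are already done.
    return count <= max_in_progress and all(in_progress[:count])


runs = ["Real Time Processing", "Real Time Processing",
        "Real Time Done", "Real Time Done"]
print(keeping_up(runs))
```

An older run stuck in "Real Time Processing" behind runs that are already "Real Time Done" makes the check fail, which is the situation worth investigating.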

Regardless of the run status, did we upload the PCL payload?

Go to the run and click "L"; you will see an Express workflow like:

Express_Run228548_StreamExpressCosmics

Click "L" on it; you should see this task with all jobs complete and in success status:

ExpressAlcaSkimwrite _StreamExpressCosmics_ALCARECOAlcaHarvestALCARECOStreamPromptCalibProdSiStrip

The name is too long to spot easily, so simply search the page in the browser for "alcaharvest".
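
The same search can be done on a task list that has been exported or copied out of the Jobs tab. This is a sketch under assumptions: the record layout (dictionaries with `task`, `created` and `success` keys) and the function `harvest_succeeded` are hypothetical and are not a WMStats API.

```python
from typing import Iterable, Mapping


def harvest_succeeded(tasks: Iterable[Mapping], keyword: str = "alcaharvest") -> bool:
    """True if at least one task name contains keyword (case-insensitive)
    and every such task has all of its created jobs in success status."""
    keyword = keyword.lower()
    matches = [t for t in tasks if keyword in str(t["task"]).lower()]
    if not matches:
        return False  # no harvest task found at all
    return all(t["created"] > 0 and t["success"] == t["created"] for t in matches)


# Hypothetical records; the task name is copied from the example above.
tasks = [
    {"task": "ExpressAlcaSkimwrite _StreamExpressCosmics_ALCARECOAlcaHarvest"
             "ALCARECOStreamPromptCalibProdSiStrip", "created": 1, "success": 1},
    {"task": "Express", "created": 40, "success": 38},
]
print(harvest_succeeded(tasks))
```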

Regardless of the run status, did we upload data to the DQM GUI?

Go to the run and click "L"; you will see an Express workflow like:

Express_Run210611_StreamExpress

Click "L" on it; you should see this task with all jobs complete and in success status:

ExpressMergewrite _StreamExpressCosmics_DQMEndOfRunDQMHarvestMerged

To be more explicit, this task harvests the output of the task:

ExpressMergewrite _StreamExpressCosmics_DQM

What is the PromptReco state?

All you need to do is go to the "workflow" filter box and type "promptreco".

You will then see all the runs for which PromptReco has been released, together with their status:

  • PromptReco - the processing jobs are running; pick these out to learn about the farm usage
  • Reco Harvest - the DQM and AlCa harvest jobs are running; processing and merge are finished
  • Processing Done - all Primary Datasets are done at all levels of their workflows (merge, harvest, skim, etc.)

Observations: Be aware of the reco triggering delay: only runs older than 6 hours are expected to show up. More recent runs will not appear.
Topic revision: r7 - 2014-12-15 - LuisContreras
 