Monitoring Links
  • CMS Online Services: Storage Manager, Prompt Calibration Loop URL, WBM Stream Summary
  • Tier-0 Service: Kibana Tier-0 Monitoring, Tier0 Jira
  • Tier-0 Job Monitoring: CondorMonitoring, WMStats, T0 Prodmon


Tier-0 workflows monitoring

NOTE: THESE ARE NEW INSTRUCTIONS. WE HAVE MOVED FROM ELOG TO JIRA. PLEASE AVOID SENDING NEW ELOGS. THANKS!

Introduction / Machine status

Please read this once at the start of your shift to learn about the Tier-0 workflow. This introduction will help you understand the importance of the different components of the workflow and which problems to look for.

  • Tier0 now uses Jira to follow issues. All Jira tickets should be opened in the Tier0 project: https://its.cern.ch/jira/projects/CMSTZ/issues/
  • The T0 is one of the most important computing systems of CMS. It is responsible for creating the RAW datasets out of the data streams sent from P5. It also handles the first reconstruction of the RAW data, called PromptReco.
  • It runs many kinds of jobs against the collision data. The most important job types are Express and Repack.
    • Express jobs quickly reconstruct a special portion of the RAW data coming from the detector and are supposed to finish within 1 hour of the recording of this data.
    • Repack jobs process all the data coming from P5, convert it into RAW files and split those files into Primary datasets.
    • These jobs should run in real time. A delay impacts all teams and groups downstream. For example, online shifters for the detector subsystems can't work if these jobs get delayed.
    • The main problems that can be encountered are stuck transfers from P5, or failing Express and/or Repack jobs that cause runs to get stuck within the T0 processing chain.
  • The Computing Plan of the Day tells you whether we are in a data-taking period. When we are, any error in the T0 should be reported. We should not have runs delayed.
  • The following diagram shows a summary of the CMS data flow and the role of the Tier0 in it. For details on how the Tier-0 processing happens, please have a look at this link.

CMS_Data_Flow.png

Tier-0 Service

Checks of the most relevant Tier-0 issues

When reporting an issue through JIRA, please check if a ticket already exists for that issue. If one does, use that ticket to report any major changes. There is absolutely no reason to create more than one ticket for an issue. JIRA tickets are fully editable so it is possible to fix any mistakes. All other changes and updates can be addressed through comments on the existing issue.

IMPORTANT NOTE: Please note that there are two different WMStats pages, one for the Tier0 and another for Central Production. Please make sure that you are checking the Tier0 one (https://cmsweb.cern.ch/tier0_wmstats/index.html) when following these instructions.

IMPORTANT NOTE: Please note there are two different JIRA projects for Tier0 and Central Production. Please make sure that you report on the Tier0 one (https://its.cern.ch/jira/projects/CMSTZ/issues)
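
The duplicate check described above can be sped up with a pre-built search link. The following is a minimal sketch, not an official tool: it assumes the Jira instance at https://its.cern.ch/jira accepts a JQL query through the standard issue navigator URL (/issues/?jql=...), and the function name open_ticket_search_url is hypothetical. The project key CMSTZ is taken from the project URL given in these instructions.

```python
from urllib.parse import quote

# Base URL and project key as they appear in these instructions.
JIRA_BASE = "https://its.cern.ch/jira"
PROJECT_KEY = "CMSTZ"


def open_ticket_search_url(keywords: str) -> str:
    """Build a link listing unresolved Tier0 tickets whose summary matches keywords.

    Assumption: the instance exposes the standard issue navigator at /issues/?jql=.
    """
    # Escape backslashes and double quotes so the keywords stay inside the JQL string.
    escaped = keywords.replace("\\", "\\\\").replace('"', '\\"')
    jql = (
        f"project = {PROJECT_KEY} AND resolution = Unresolved "
        f'AND summary ~ "{escaped}"'
    )
    return f"{JIRA_BASE}/issues/?jql={quote(jql)}"


if __name__ == "__main__":
    # Open the printed link in a browser before creating a new ticket.
    print(open_ticket_search_url("Component Error"))
```

If the link returns an open ticket for the same problem, comment on that ticket instead of creating a new one.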

Check Tier-0 Components

Check the status of Tier 0 components through WMStats:

  • Look at the "agent info" section near the top of the page. If there is a warning, please proceed as follows:
    • If the warning is Red AND the component has been down for more than 90 minutes, please open a JIRA ticket with the title "Component Error - 'Component Name'".
      ComponenError_Example.png
    • If the warning is Yellow with the message "Disk warning" and it refers to cvmfs locations, please ignore it. It is not necessary to report.
      Tier0_CSP_DiskAlert.png
    • If the warning is Yellow with the message "Disk warning" and it refers to other locations (like /var/):
      • If the reported disk usage is lower than 90%, please ignore it. It is not necessary to report.
        tier0_var.png
      • If the reported disk usage is equal to or greater than 90%, please open a JIRA ticket with the title "Disk is almost full - ".
        tier0_var_bad.png
    • If the warning is Yellow with the message "Proxy warning", stating that "Agent proxy '/data/certs/serviceproxy-vocmsXXX.pem' must be renewed ASAP. Its time left is: xxx.xx hours.":
      • If the time left is more than 80 hours, please ignore it. It is not necessary to report.
        proxy_warning_wmstats.png
      • If the time left is less than 80 hours, please open a JIRA ticket with the title "Proxy warning on vocmsXXX - hours left".
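
The decision rules above can be summarised as a small function. This is only an illustrative sketch for shifters, not part of WMStats: the function name ticket_title and all of its parameters are hypothetical, and the values (colour, location, usage, hours left) are assumed to be read by hand from the "agent info" section. The titles are copied as written in these instructions, so the text after "Disk is almost full - " and the number of hours in the proxy title are left for the shifter to fill in.

```python
from typing import Optional

# Thresholds taken from the instructions above.
COMPONENT_DOWN_MINUTES = 90.0
DISK_FULL_PERCENT = 90.0
PROXY_HOURS_LEFT = 80.0


def ticket_title(color: str, kind: str, *, name: str = "",
                 minutes_down: float = 0.0, location: str = "",
                 disk_percent: float = 0.0,
                 hours_left: float = 0.0) -> Optional[str]:
    """Return the JIRA ticket title to use, or None if the warning can be ignored.

    kind is one of "component", "disk" or "proxy" (hypothetical labels).
    name is the component name or the vocmsXXX host, depending on kind.
    """
    color = color.lower()
    kind = kind.lower()
    if color == "red" and kind == "component":
        # Report only if the component has been down for more than 90 minutes.
        if minutes_down > COMPONENT_DOWN_MINUTES:
            return f"Component Error - '{name}'"
        return None
    if color == "yellow" and kind == "disk":
        # Disk warnings about cvmfs locations are always ignored.
        if location.startswith("/cvmfs"):
            return None
        if disk_percent >= DISK_FULL_PERCENT:
            # The remainder of the title is not specified in the instructions.
            return "Disk is almost full - "
        return None
    if color == "yellow" and kind == "proxy":
        # Assumption: exactly 80 hours is treated as "ignore"; the
        # instructions only cover "more than" and "less than" 80 hours.
        if hours_left < PROXY_HOURS_LEFT:
            return f"Proxy warning on {name} - hours left"
        return None
    return None


if __name__ == "__main__":
    print(ticket_title("yellow", "disk", location="/var/", disk_percent=93.0))
    print(ticket_title("yellow", "proxy", name="vocmsXXX", hours_left=120.0))
```

A return value of None means the warning does not need to be reported.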

How to create a Jira Issue

Topic revision: r68 - 2018-09-17 - unknown
 