Monitoring Links
CMS Online Services: DAQstatus, SM, P5 Elog
Tier-0 Service: T0 Alarms, Tier-0 Elog
Tier-0 Job Monitoring: CondorMonitoring, WMStats


Tier-0 workflows monitoring

Introduction / Machine status

Please read this once at the start of your shift to learn about the Tier-0 workflow. This introduction will help you understand the importance of the different components of the workflow and which problems to look for.

  • A general T0 description is provided here:
    • The T0 is one of the most important computing systems of CMS: it is responsible for creating the RAW datasets out of the data streams sent from P5, and it also handles the first reconstruction of the RAW data, called PromptReco. It runs many kinds of jobs against the collision data; the most important types are Express and Repack. Express jobs speedily reconstruct a special portion of the RAW data coming from the detector and are supposed to finish within 1 hour of the data being recorded. Repack jobs process all the data coming from P5, converting it into RAW files and splitting them into Primary Datasets. These jobs should run in real time; a delay impacts all teams and groups downstream, e.g. the online shifters for the detector subsystems cannot work if these jobs get delayed. The main problems to look out for are stuck transfers from P5, and Express and/or Repack jobs failing and causing runs to get stuck within the T0 processing chain.
    • As you have probably read in the Computing Plan of the Day, you already know whether or not we are in a data-taking period. When we are, any error in the T0 should be reported; we should not have delayed runs.

  • The following diagram summarizes the CMS dataflow and the Tier-0's role in it. For details on how the Tier-0 processing happens, please have a look at this link.

Figure: CMS_Data_Flow.png (summary of the CMS dataflow and the Tier-0 role in it)

CMS Online Services

Check the CMS Online status related to incoming runs

During LHC collisions, periodically cross-check the Tier-0 WMStats monitoring below against the ongoing data-taking status seen here.

  • DAQ Status URL: DAQstatus (shift-reload this page periodically)
    • INSTRUCTIONS: This overview indicates whether a run is currently ongoing, specifies the data-taking mode, and shows whether data is being transferred from the CMS detector to the Tier-0. In the top center of the page you can find the ongoing run number; a green "TIER0 TRANSFER ON" field on the upper left indicates that data is being sent to the Tier-0. The first line above "Data Flow" on the top left specifies the LHC data-taking mode; the tag "physics" marks the most relevant data. The bottom-right histogram shows the run history of the last 24 h; the run numbers on the graph should be reflected in the WMStats page (a sketch of the basic cross-check as code follows below).
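
A minimal sketch of this cross-check as code, assuming the three inputs are read by eye from the DAQstatus and Storage Manager pages (no real API is used; the run numbers in the example are made up):

<verbatim>
# Sketch of the DAQ/Tier-0 cross-check: when "TIER0 TRANSFER ON" is green,
# the current DAQ run should also be the latest run seen by the Tier-0
# (Storage Manager). All inputs are read off the monitoring pages by hand.

def daq_status_warning(tier0_transfer_on, daq_run, latest_tier0_run):
    """Return a warning string if the DAQ and Tier-0 views disagree."""
    if tier0_transfer_on and daq_run != latest_tier0_run:
        return ("DAQ is on run %d but the Tier-0 last saw run %d "
                "-- open an Elog" % (daq_run, latest_tier0_run))
    return None

# Example with made-up run numbers:
print(daq_status_warning(True, 251885, 251883))
</verbatim>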

  • Storage Manager URL: Storage Manager
    • INSTRUCTIONS: Check on the DAQ page above (see URL in the previous bullet) that the latest run number in the Storage Manager matches the current run number (if "TIER0 TRANSFER ON" is green and some data has been logged). If "TIER0 TRANSFER ON" is green and the run number is not the latest, open an Elog. The Storage Manager also shows the whole transfer/preliminary-processing chain: Close -> Inject -> Transfr -> Check -> Repack is what matters. If you see missing files (a later number lower than the previous one) and it is shown in red, open an Elog (if it is blue, that is expected). If you see anything else red on the page, e.g. "Server down", open an Elog. A sketch of the chain check as code follows below.
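
The missing-files rule can be sketched the same way; the stage names below follow the column labels quoted above, and the counts in the example are invented:

<verbatim>
# Sketch of the Storage Manager chain check: for a given run, the file
# counts along Close -> Inject -> Transfr -> Check -> Repack should match.
# A later stage with a lower count means missing files; that is only
# Elog-worthy when the page shows it in red (blue is expected).

CHAIN = ["Close", "Inject", "Transfr", "Check", "Repack"]

def missing_file_stages(counts):
    """Return the chain transitions where the file count drops."""
    problems = []
    for prev, cur in zip(CHAIN, CHAIN[1:]):
        if counts[cur] < counts[prev]:
            problems.append("%s=%d < %s=%d" % (cur, counts[cur], prev, counts[prev]))
    return problems

# Example: two files never made it past Transfr.
print(missing_file_stages(
    {"Close": 120, "Inject": 120, "Transfr": 118, "Check": 118, "Repack": 118}))
</verbatim>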

  • Prompt Calibration Loop URL
    • INSTRUCTIONS: This monitoring page (look only at "Latency since the end of the run") shows whether the Express workflows delivered the calibration payload to the online systems (a black dot for the run on the x-axis). This has to happen before PromptReco starts (red dotted line). Runs to the left of the red line can be ignored. If the red line gets too close (within 3 runs) to the last run that uploaded conditions (i.e. has a black dot), open an Elog to warn the Tier-0 experts (see the sketch below).
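
A minimal sketch of the proximity rule, assuming the two run numbers are read off the plot (the values in the example are made up):

<verbatim>
# Sketch of the prompt-calibration proximity rule: warn when fewer than 3
# runs separate the PromptReco release line (red dotted line) from the last
# run whose conditions were uploaded (black dot). Run numbers are made up.

def calibration_loop_at_risk(next_promptreco_run, last_uploaded_run, margin=3):
    """True when PromptReco is about to catch up with the condition uploads."""
    return (last_uploaded_run - next_promptreco_run) < margin

# Example: PromptReco is about to start on run 251880 and conditions were
# uploaded only up to run 251882 -> a margin of 2 runs, open an Elog.
print(calibration_loop_at_risk(251880, 251882))
</verbatim>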

  • P5 Elog (for information only, see instructions)
    • INSTRUCTIONS: This elog is not needed in "normal" situations, but it may be useful in case of very special events at the CMS detector. You may also use it simply to find out who the online shifter is (the shift role corresponding to yours, but for everything related to online data taking). You will need to log in with your AFS credentials.

Tier-0 Service

Checks of the most relevant Tier-0 issues

The main causes of runs getting stuck are monitored:

  • Tier-0 Alarms in Kibana monitoring for the Tier-0
    • INSTRUCTIONS: This monitoring shows some metrics about critical parts of the Tier-0. Please open an Elog in case you see any of the following (a sketch of these checks as code follows this list):
      • USE OF EOS/UNMERGED AREA is over 90%. This might mean that we are creating more files than we can merge, or that there is some problem with the cleanup jobs.
      • USE OF EOS/T0STREAMER is over 95%. T0Streamer is the area that receives the streamer files from the Transfer System. There is an automatic cleanup script for that area.
      • PAUSED JOBS (# OF JOBS) is more than 0 and there are no known problems in the latest Tier-0 elogs. This is a very important metric: it shows jobs that a Tier-0 operator should look at manually because they have failed all retries so far. After investigating the failure mode, the operator can decide to let them retry again.
      • WMAGENT BACKLOG (# OF JOBS) is growing steadily, or hundreds of jobs are reported for more than 3 hours.
      • # OF LATE WORKFLOWS is more than 0. A workflow can be running late for many reasons, e.g. if it has paused jobs.
      • LONG JOBS - AVAILABILITY (%) < 100%. Some jobs can run for a long time, which may mean the job is stuck.
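
A sketch of these thresholds as a single snapshot check. The metrics dict is a stand-in for values read off the Kibana dashboards, and its field names are illustrative, not the real dashboard fields; the WMAgent backlog rule needs a time series (sustained growth over roughly 3 hours), so it is left out here:

<verbatim>
# Snapshot check encoding the Kibana alarm thresholds listed above.
# Field names in the metrics dict are illustrative assumptions.

def tier0_alarms(metrics):
    """Return the list of alarm messages that warrant an Elog."""
    alarms = []
    if metrics["eos_unmerged_pct"] > 90:
        alarms.append("EOS/unmerged over 90%: merging or cleanup may be lagging")
    if metrics["eos_t0streamer_pct"] > 95:
        alarms.append("EOS/t0streamer over 95%: streamer cleanup may be stuck")
    if metrics["paused_jobs"] > 0:
        alarms.append("%d paused jobs need operator attention" % metrics["paused_jobs"])
    if metrics["late_workflows"] > 0:
        alarms.append("%d late workflows" % metrics["late_workflows"])
    if metrics["long_jobs_availability_pct"] < 100:
        alarms.append("long jobs below 100% availability: possible stuck jobs")
    return alarms

# Example with made-up values (triggers two Elog-worthy alarms):
print(tier0_alarms({"eos_unmerged_pct": 93, "eos_t0streamer_pct": 80,
                    "paused_jobs": 2, "late_workflows": 0,
                    "long_jobs_availability_pct": 100}))
</verbatim>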


Check the status of Tier-0 components and workflows through WMStats

  • WMStats
    • INSTRUCTIONS:
      • Check that T0 components are up: Look at the "agent info" section near the top of the page. If there is a warning (yellow or red), please open an Elog.
      • Check for new runs appearing: New runs should appear in the "Run" table of WMStats within a few minutes of appearing on the Storage Manager page above. If a new run doesn't appear in WMStats within the next 15 minutes, open an Elog. Also in the "Run" table, if an old run still has 'run status' = 'Active' even though ALL the files have been transferred according to the Storage Manager page above, please open an Elog. A sketch of the first check as code follows below.
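
A minimal sketch of the 15-minute rule, assuming the run lists are read (or scraped) from the two pages by hand; the helper name and run numbers are illustrative:

<verbatim>
# Sketch of the "new run appears in WMStats" check. Both inputs come from
# the Storage Manager and WMStats pages; only the 15-minute rule is coded.
import time

GRACE_SECONDS = 15 * 60  # a new run should show up within 15 minutes

def run_is_overdue(run_number, first_seen_on_sm, wmstats_runs, now=None):
    """True if a run seen on the Storage Manager is overdue in WMStats."""
    now = now if now is not None else time.time()
    return run_number not in wmstats_runs and (now - first_seen_on_sm) > GRACE_SECONDS

# Example: run 251883 appeared on the Storage Manager 20 minutes ago but
# WMStats still only knows about older runs -> open an Elog.
print(run_is_overdue(251883, time.time() - 20 * 60, {251880, 251881}))
</verbatim>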


Check the Tier-0 Jobs

  • URL: CondorMonitoring
    • INSTRUCTIONS: The Tier-0 project quota at CERN is 13500 cores. In the link you can find the running/pending jobs and how many cores we are using (in case there are multicore jobs). If there are many jobs pending but few or none running, please open an Elog (a sketch of this check follows below).
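
A rough sketch of this symptom as a check; the counts would be read off the CondorMonitoring page, and the thresholds are illustrative assumptions, not official operational values:

<verbatim>
# Sketch of the "lots pending, almost nothing running" symptom.
# Thresholds below are illustrative assumptions.

TIER0_QUOTA_CORES = 13500  # Tier-0 project quota at CERN (from this page)

def pool_looks_stuck(running_cores, pending_jobs,
                     min_pending=500, max_running_fraction=0.05):
    """Flag the pool when many jobs wait while almost no cores are in use."""
    return (pending_jobs >= min_pending
            and running_cores < max_running_fraction * TIER0_QUOTA_CORES)

# Example: 3000 jobs pending with only 200 of 13500 cores busy -> Elog.
print(pool_looks_stuck(running_cores=200, pending_jobs=3000))
</verbatim>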

  • URL: Dashboard
    • INSTRUCTIONS: You can see the number of jobs that ran on the Tier-0 resources during the last 24 hours. If you see that many jobs were failing (red) and there are some paused jobs in Kibana, please open an Elog (see the sketch below). It is normal to have a few failures; jobs might have temporary problems.
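
A sketch of this cross-check; the 20% failure-fraction threshold is an illustrative assumption, not a documented cutoff:

<verbatim>
# Sketch of the Dashboard cross-check: a few failures are normal, but a
# large failure fraction combined with paused jobs in Kibana is worth an
# Elog.

def failures_worth_an_elog(succeeded, failed, paused_jobs, max_fail_fraction=0.20):
    """True when the 24h failure fraction is high and jobs are paused."""
    total = succeeded + failed
    if total == 0:
        return False
    return (failed / float(total)) > max_fail_fraction and paused_jobs > 0

# Example with made-up 24-hour counts:
print(failures_worth_an_elog(succeeded=8000, failed=2500, paused_jobs=3))
</verbatim>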


Topic attachments:
  • CMS_Data_Flow.png (311.5 K, 2015-07-01, LuisContreras)
  • KibanaDiag.png (28.8 K, 2015-07-02, UnknownUser)