Tier-0 workflows monitoring

Introduction / Machine status

Please read this once at the start of your shift to learn about the Tier-0 workflow. This introduction will help you understand the importance of the different components of the workflow and which problems to look for.
  • A general T0 description is provided here:
    • The T0 is one of the most important computing systems of CMS: it is responsible for creating the RAW datasets out of the data stream sent from the pit, and it also handles the first reconstruction pass of the RAW data, called PromptReco. It runs many kinds of jobs against the collision data; the most important types are EXPRESS and REPACK. EXPRESS jobs quickly reconstruct a special portion of the RAW data coming from the detector and are supposed to finish within 1 hour of the recording of that data (a minimal sketch of this latency check follows this list). REPACK jobs process all the data coming from the pit, converting the stream into RAW files and splitting them into Primary Datasets. These jobs should run in real time: a delay impacts all subsequent workflows, and for example the online shifters for the detector subsystems cannot work if these jobs get delayed. The main problems to look for are stuck transfers from P5 and failing Express and/or Repack jobs, either of which causes runs to get stuck within the T0 processing chain.
    • As you have probably read in the Computing Plan of the Day, you already know whether we are in a data-taking period or not. When we are, any error in the T0 should be reported; we should not have runs delayed.
  • A drawing of the main Tier-0 components can be found below (clickable boxes on the picture may be used).
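
The "within 1 hour" Express turnaround mentioned above can be phrased as a minimal latency check. The sketch below is illustrative only: the run-end and Express-completion times are assumed inputs that would have to be read from the monitoring pages, not values fetched from any real T0 interface.

```python
from datetime import datetime, timedelta

# Maximum allowed delay between the end of data recording and the
# completion of the corresponding Express processing (see text above).
EXPRESS_DEADLINE = timedelta(hours=1)

def express_is_late(recording_end, express_done):
    """Return True if Express finished more than 1 hour after recording ended.

    Both arguments are datetime objects; where they come from (T0Mon,
    the Storage Manager, ...) is deliberately left open in this sketch.
    """
    return (express_done - recording_end) > EXPRESS_DEADLINE

# Hypothetical timestamps, for illustration only.
if express_is_late(datetime(2012, 3, 26, 0, 22), datetime(2012, 3, 26, 1, 40)):
    print("Express is late for this run - consider opening an Elog")
```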

Check the CMS Online status:

  • Incoming Runs --- %COMPLETE5%

During LHC collisions, periodically cross-check the T0Mon monitoring below against the ongoing data-taking status seen here.

  • DAQ Status URL: http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg (shift-reload this page periodically)
    • INSTRUCTIONS: This overview indicates whether a run is currently ongoing, which data-taking mode is in use, and whether data is being transferred from the CMS detector to the Tier-0. In the top center of the page you can find the ongoing run number; a green field on the bottom right shows "TIER0_TRANSFER_ON" when data is being sent to the Tier-0. The first line under "Data Flow" on the top right specifies the data-taking mode (the "HLT key", a string containing a tag such as "interfill", "physics", ...); data tagged "physics" is the most relevant. The bottom-left histogram shows the run history of the last 24h; the run numbers on the graph should be reflected in the T0Mon page (see the Monitoring links below).
  • Storage Manager URL: http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
    • INSTRUCTIONS: Check on the DAQ page above (see URL in the previous bullet) that the latest run number in the Storage Manager matches the current run number (if TIER0_TRANSFER is on). If TIER0_TRANSFER is on and the run number is not the latest, open an Elog. The Storage Manager also shows the whole transfer/preliminary-processing chain; what matters is close -> inject -> transfer -> check -> repack. If you see missing files (the next number is lower than the previous one) and the entry is red, open an Elog; if it is blue, it is expected (a sketch of this consistency check follows this list). If you see anything else red on the page, e.g. "Server down", open an Elog.

  • P5 Elog URL (for information only, see instructions)
    • INSTRUCTIONS: This elog is not needed in "normal" situations, but it may be useful in case of very special events at the CMS detector. You may also use it simply to find out who the online shifter is (the shift role corresponding to yours, but for everything related to online data taking). You will need to log in with your AFS credentials.
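
The "missing files" condition in the Storage Manager instructions above amounts to checking that the file count never drops along the close -> inject -> transfer -> check -> repack chain. This is a minimal sketch assuming the per-stage counts have already been read off the Storage Manager page; remember that a drop shown in blue there is expected and needs no Elog.

```python
# Order of the Storage Manager processing chain as described above.
CHAIN = ["close", "inject", "transfer", "check", "repack"]

def find_missing_files(counts):
    """Return (earlier_stage, later_stage) pairs where the file count drops.

    `counts` maps stage name -> number of files; a later stage having fewer
    files than the previous one is the "missing files" pattern that should
    be reported if the Storage Manager shows it in red.
    """
    drops = []
    for earlier, later in zip(CHAIN, CHAIN[1:]):
        if counts[later] < counts[earlier]:
            drops.append((earlier, later))
    return drops

# Hypothetical counts for one run, for illustration only.
example = {"close": 120, "inject": 120, "transfer": 118, "check": 118, "repack": 118}
for earlier, later in find_missing_files(example):
    print(f"{later} has fewer files than {earlier} - open an Elog if shown in red")
```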

Tier-0 Service

Check the most urgent alarms of the Tier-0. The main causes of runs getting stuck are monitored: failed jobs, stuck jobs and, in case of overload, backlog.

  • INSTRUCTIONS:
    • General assumptions
      • Green bar - check is fine
      • Yellow bar - check is failing. There is no "red" or worse state; the checks are binary.
      • There are no defective plots: values of 0 are fine and expected.
      • If a monitoring element shows a yellow bar, report it in the Elog under the T0 Processing category.
    • Permanent Failures
      • This is the most important check in this category. It checks whether there have been any failed jobs in the last 24h. The information corresponds to the counters on the T0Mon page (see below); it turns yellow as soon as a single job failure is detected and will not change back to green until an expert fixes/resubmits the job. Even if it is already yellow, check whether the number of failures is increasing (look at the plots). If you see new failures and there is no known reason (check the Elog), open an Elog.
    • Backlog
      • This shows how many jobs are queued in the Tier-0 system and *still not submitted* to the batch system, so you will not see them in the "running/pending" jobs plot. Thresholds are specified on the alarm page for each kind of job. If you see any job type accumulating above its threshold, open an Elog (see the threshold-check sketch after this list).
    • Long jobs
      • This is mostly an indicator of whether everything is normal. Jobs can take longer than expected, but if they take a lot longer, it is possible that they are stuck due to an infrastructure problem. If you see this alarm turn yellow, open an Elog.
    • Cluster health
      • Skip this for now; it is a new alarm still under commissioning and experts are taking care of it.
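
The Permanent Failures and Backlog checks above are both simple counter comparisons: new failures mean the failure counter grew since the last look, and a backlog alarm means queued-but-unsubmitted jobs exceed a per-job-type threshold. The job types and threshold numbers in the sketch below are placeholders; the authoritative thresholds are those configured on the alarm page.

```python
# Placeholder per-job-type backlog thresholds; the authoritative values
# are those configured on the alarm page, not these.
BACKLOG_THRESHOLDS = {"Express": 500, "Repack": 1000, "PromptReco": 2000}

def backlog_alarms(queued):
    """Return job types whose queued (not yet submitted) count exceeds its threshold."""
    return [job_type for job_type, count in queued.items()
            if count > BACKLOG_THRESHOLDS.get(job_type, float("inf"))]

def new_failures(previous_count, current_count):
    """True if the failed-job counter grew since the last check, i.e. new failures."""
    return current_count > previous_count

# Hypothetical numbers, for illustration only.
print(backlog_alarms({"Express": 120, "Repack": 1500, "PromptReco": 300}))  # ['Repack']
print(new_failures(previous_count=3, current_count=5))                      # True
```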


Check the status of Tier 0 components and workflows through T0Mon --- %COMPLETE5%

  • T0Mon URL: https://cmsweb.cern.ch/T0Mon/
    • INSTRUCTIONS:
      • Check that the T0 components are up: look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red, open an elog in the T0 category (see the heartbeat-check sketch after this list).
      • Check for new runs appearing: New runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page above (excluding TransferTest or local runs with labels such as privmuon). If a new run doesn't appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog.
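
The heartbeat check above boils down to flagging any component whose last heartbeat is older than some tolerance. T0Mon only shows this as a coloured table, so the input format (component name -> seconds since last heartbeat) and the 15-minute tolerance in the sketch below are assumptions made purely for illustration.

```python
# Assumed tolerance before a component is considered stale; T0Mon itself
# signals this by colouring the component red.
MAX_HEARTBEAT_AGE_SECONDS = 15 * 60

def stale_components(heartbeat_ages):
    """Return components whose last heartbeat is older than the tolerance.

    `heartbeat_ages` maps component name -> seconds since its last heartbeat.
    """
    return [name for name, age in heartbeat_ages.items()
            if age > MAX_HEARTBEAT_AGE_SECONDS]

# Hypothetical component names and ages, for illustration only.
ages = {"Tier0Feeder": 120, "JobCreator": 60, "PhEDExInjector": 3600}
for name in stale_components(ages):
    print(f"{name}: no recent heartbeat - open an elog in the T0 category")
```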


Check the Castor pools/subclusters used by CMS: focus on t0export and t1transfer --- %COMPLETE5%

  • URL1s: load average summary and network utilization: t0export, t1transfer
    • INSTRUCTIONS:
      • For each of the URL1 links above, check the "load average" pie chart on the left-hand side: if you see that some hosts have a load average higher than 10, please open an ELOG in the "T0" category (a sketch of these Castor threshold checks follows this list).

        Also check the network utilization plot on the bottom right-hand side of the URL1s pages: if you see sustained throughput at the resource-saturation plateau (that is, number of hosts x 100 Mbyte/s), open an ELOG in the "T0" category.
  • URL2s: Total space on t0export https://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week&title=
    • INSTRUCTIONS:
      • The total space available on t0export should be constant. Watch out for sudden drops, as these indicate a missing disk server. If this happens, make an ELOG post in the T0 category.
  • URL3s: Active Transfers/Queued Transfers on the t0export pool.
    • INSTRUCTIONS:
      • There can be up to a few thousand active transfers but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category.
  • URL5s: Active Transfers/Queued Transfers on t1transfer pool.
    • INSTRUCTIONS:
      • There can be up to a thousand active transfers but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category.
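
The Castor checks above (per-host load average, queued transfers, total pool space) are all threshold comparisons. The sketch below encodes the numbers quoted in the instructions (load average 10, 200 queued transfers); the 1 TB drop tolerance and the host names are assumptions for illustration, and the monitoring pages themselves remain the reference.

```python
LOAD_AVERAGE_LIMIT = 10      # per-host load average limit quoted above
QUEUED_TRANSFER_LIMIT = 200  # queued-transfer limit quoted above

def overloaded_hosts(load_averages):
    """Return hosts whose load average exceeds the limit (-> ELOG in the T0 category)."""
    return [host for host, load in load_averages.items() if load > LOAD_AVERAGE_LIMIT]

def transfer_backlog(queued_transfers):
    """True if the number of queued transfers is above the limit (-> ELOG in the T0 category)."""
    return queued_transfers > QUEUED_TRANSFER_LIMIT

def space_dropped(previous_tb, current_tb, tolerance_tb=1.0):
    """True if the total pool space dropped suddenly, hinting at a missing disk server.

    The 1 TB tolerance is an assumption for illustration; any clear step down
    in the SLS "Total Space TB" plot is worth an ELOG.
    """
    return (previous_tb - current_tb) > tolerance_tb

# Hypothetical readings, for illustration only (host names are made up).
print(overloaded_hosts({"diskserver01": 4.2, "diskserver02": 17.5}))  # ['diskserver02']
print(transfer_backlog(350))                                          # True
print(space_dropped(previous_tb=620.0, current_tb=600.0))             # True
```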


Check the activity on the Tier-0 LSF WNs --- %COMPLETE5%


Check the Tier-0 Jobs --- %COMPLETE5%

  • URL1: Queued and running Jobs on cmst0
  • INSTRUCTIONS: The cmst0 queue has ~3200 job slots. If URL1 shows a sustained saturation of running (green, at ~2.8k) or pending (blue) jobs, it might not be an issue, but it is worth notifying the DataOps team via the ELOG in the "T0" category (see the sketch below).
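
What matters here is sustained saturation rather than a brief spike. This minimal sketch flags the cmst0 queue only if the running-job count stays at or above the ~2.8k level for several consecutive samples; the sample values and the choice of three consecutive samples are assumptions for illustration.

```python
RUNNING_SATURATION = 2800   # running-job level quoted above (~2.8k of ~3200 slots)
CONSECUTIVE_SAMPLES = 3     # assumed number of consecutive samples to call it "sustained"

def sustained_saturation(running_counts):
    """True if the running-job count stayed at/above saturation for N consecutive samples."""
    streak = 0
    for count in running_counts:
        streak = streak + 1 if count >= RUNNING_SATURATION else 0
        if streak >= CONSECUTIVE_SAMPLES:
            return True
    return False

# Hypothetical samples read off URL1, for illustration only.
if sustained_saturation([2500, 2850, 2900, 2950, 2800]):
    print("cmst0 looks saturated for a sustained period - notify DataOps via the T0 ELOG")
```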