New CSP Instructions Draft


Special check of the Castor pool T0Export outgoing rate during HI data taking --- %COMPLETE0%


Tier-0 workflows monitoring

Introduction / Machine status

Read this if you have not done so before. Once you understand what is and is not important, your shift will be more efficient and you will know which kinds of problems to look for.

  • T0 description
    • The T0 is one of the most important CMS systems. It is responsible for creating the first RAW datasets, and the RECO, from the real data that comes directly from CMS. It runs many kinds of jobs against the collision data; the most important job types are EXPRESS and REPACK. These should run in real time: if they are delayed by one hour, the whole workflow may be compromised. Keep this in mind during your shift. The main problems to look for are stuck transfers from P5 and runs stuck or failing in Express or Repack.

Monitoring links

  • Critical status board - this shows the most urgent alarms provided by the CERN central monitoring systems. It monitors the main reasons runs get stuck: failed jobs, stuck jobs and, in case of overload, Express backlog.
    • More notes in the SLS alarms themselves.

http://cmsprod.web.cern.ch/cmsprod/sls/


Check the status of Tier 0 components and workflows through T0Mon --- %COMPLETE5%

    • T0Mon URL: https://cmsweb.cern.ch/T0Mon/
    • INSTRUCTIONS:
      • Check that T0 components are up: Look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red as below, open an elog in the T0 category.
        cmsproc.rrd_RUN_m.gif.png
      • Check for new runs appearing: New runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page (see "Check the CMS Online status" below), excluding TransferTest or local runs with labels such as privmuon. If a new run doesn't appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog.
        cmsproc.rrd_RUN_m.gif.png
      • Monitor progress of processing: In the runs table, as processing proceeds, the status for a run will move through "Active", "CloseOutRepack", "CloseOutRepackMerge", "CloseOutPromptReco", "CloseOutRecoMerge", "CloseOutAlcaSkim", "CloseOutAlcaMerge". When processing is complete for a run, and only transfers are remaining, the status will be "CloseOutExport". Fully processed and transferred runs will show up as either "CloseOutT1Skimming" or "Complete". If older runs are stuck in an incomplete status, or show no progress in the status bars over a long period, make an elog post.
      • Check for new job failures: In the job stats area beneath the Runs table, look at the number of failed jobs in each of the 5 job categories. If you see any of these counts increasing, make an elog post immediately.
        cmsproc.rrd_RUN_m.gif.png
      • Check for slow-running jobs: In the running and acquired jobs tables beneath the job stats, check for jobs which are older than about 12 hours, and open an elog if you see any.
        cmsproc.rrd_RUN_m.gif.png cmsproc.rrd_RUN_m.gif.png
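
The T0Mon checks above can be sketched as a few small helper functions. This is only an illustration of the rules as written, not a T0Mon client: the function names, the input shapes (plain tuples and dictionaries that a shifter would fill in by hand from the page) and the job category names used in the usage note are assumptions.

```python
from datetime import datetime, timedelta

# Run status sequence as listed in the instructions above.
PROCESSING_STATUSES = [
    "Active", "CloseOutRepack", "CloseOutRepackMerge", "CloseOutPromptReco",
    "CloseOutRecoMerge", "CloseOutAlcaSkim", "CloseOutAlcaMerge", "CloseOutExport",
]
FINISHED_STATUSES = {"CloseOutT1Skimming", "Complete"}

# "older than about 12 hours" from the slow-running jobs check.
MAX_JOB_AGE = timedelta(hours=12)


def status_rank(status):
    """Position of a status in the processing sequence.

    Finished runs rank just past the end of the sequence.
    Raises ValueError for a status that is not in either list.
    """
    if status in FINISHED_STATUSES:
        return len(PROCESSING_STATUSES)
    return PROCESSING_STATUSES.index(status)


def slow_jobs(jobs, now):
    """Return ids of jobs older than MAX_JOB_AGE.

    jobs: iterable of (job_id, start_time) pairs (hypothetical input shape).
    """
    return [job_id for job_id, started in jobs if now - started > MAX_JOB_AGE]


def increased_failures(previous, current):
    """Compare two {category: failed_count} snapshots.

    Returns the sorted list of categories whose failed count grew.
    """
    return sorted(cat for cat, n in current.items() if n > previous.get(cat, 0))
```

For example, `increased_failures({"Express": 1, "Repack": 0}, {"Express": 1, "Repack": 2})` returns `["Repack"]`, which under the rule above would call for an immediate elog post. A run whose `status_rank` does not change between two checks far apart in time is a candidate for the "stuck in an incomplete status" case.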



  • The most important Tier-0 monitoring links are collected here, and the detailed shift instructions follow below. If you wish, you can use the clickable boxes on the picture below (active map) while you follow the instructions.


Check the CMS Online status:

  • Incoming Runs --- %COMPLETE5%

Keep an eye on these while we are actively taking data, and periodically cross-check the T0Mon monitoring against what they show to be going on at P5.

    • Storage Manager URL: http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
      • INSTRUCTIONS:
        • On the DAQ page (http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg), check in the right column, Data Flow, that data taking was declared and that all of the following fields are green: LHC RAMPING OFF, PreShower HV ON, Tracker HV ON, Pixel HV ON, Physics declared. Also check that TIER0_TRANSFER_ON, at the bottom of the right column, is green.
        • Then check the current status of data taking on the Storage Manager URL given above. The current run appears first, in the upper box called Last streams. Check the Files column for files per stream to be transferred. If there are more than 10 files, check the Transf column. If zero files are listed there for a stream that has files to be transferred, open an elog in the T0 section. If there are more than 1000 files not transferred, call the CRC.
        • Next, check the LAST SM INSTANCES boxes for lines with numbers marked in RED, and open an elog in the T0 section if you find any.
        • Finally, look at the Last Runs box and also elog all lines with numbers marked in RED.
        • As usual in elogs, please be as verbose as possible.
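
The Storage Manager thresholds above can be written down as a small decision function. This is a minimal sketch, assuming that the Files and Transf columns are plain counts and that "files not transferred" means Files minus Transf; the function name and the returned strings are hypothetical.

```python
def storage_manager_action(files, transferred):
    """Suggest an action for one stream from its Files and Transf values.

    Assumption: the backlog is files - transferred.
    """
    backlog = files - transferred
    # More than 1000 files not transferred: escalate by phone.
    if backlog > 1000:
        return "call CRC"
    # More than 10 files to transfer but none transferred yet: elog.
    if files > 10 and transferred == 0:
        return "open elog (T0)"
    return "ok"
```

For instance, `storage_manager_action(50, 0)` gives `"open elog (T0)"`, while `storage_manager_action(5000, 3000)` gives `"call CRC"`. The CRC check is placed first so that a large backlog is always escalated, even when nothing has been transferred yet.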




Check the Castor pools/subclusters used by CMS: focus on t0export and t1transfer --- %COMPLETE5%



Check the activity on the Tier-0 LSF WNs --- %COMPLETE5%



Check the Tier-0 Jobs --- %COMPLETE5%

  • URL1: Queued and running Jobs on cmst0
  • INSTRUCTIONS: The cmst0 queue uses ~2,800 job slots. If you see on URL1 a sustained saturation of running (green at 2.8k) or pending (blue) jobs, as in the example in the picture below, it might not be an issue, but it is worth notifying the DataOps team via the elog in the "T0" category.
    cmsproc.rrd_RUN_d.gif
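
The instructions do not define what counts as "sustained" saturation, so the sketch below makes that explicit with two hypothetical parameters: how many consecutive samples must sit near the slot limit, and how close to the limit counts as saturated. Only the ~2,800 slot figure comes from the text above; the samples would be read by eye from the URL1 plot.

```python
CMST0_SLOTS = 2800  # approximate number of job slots quoted above


def sustained_saturation(running_samples, min_consecutive=6, tolerance=0.02):
    """True if the running-job count stays near the slot limit long enough.

    running_samples: running-job counts in time order.
    min_consecutive and tolerance are illustrative assumptions,
    not values taken from the shift instructions.
    """
    threshold = CMST0_SLOTS * (1 - tolerance)
    streak = 0
    for count in running_samples:
        # Extend the streak while saturated, reset it on any dip.
        streak = streak + 1 if count >= threshold else 0
        if streak >= min_consecutive:
            return True
    return False
```

A short dip resets the streak, so `sustained_saturation([2800, 2800, 100, 2800, 2800, 2800])` is `False`, whereas six consecutive samples at 2800 give `True`.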



-- SamirCury - 30-Sep-2011

Topic revision: r2 - 2020-08-20 - TWikiAdminUser
 