#T0Workflow
---+++ Tier-0 workflows monitoring

---++++ Introduction / Machine status

Read this section first: once you understand what is important and what is less so, your shift will be more efficient and you will know which kinds of problems to look for.

  • A general T0 description is provided here.
  • The T0 is responsible for creating the first RAW datasets, and RECO from the real data that comes directly from CMS. It runs many kinds of jobs against the CMS data; the most important job types are Express and Repack. These should run in real time: if they get delayed by a few hours, the whole workflow may be compromised and online shifters for the detector subsystems cannot work. Keep this in mind during your shift: the main problems you should look for are stuck transfers from P5, or runs stuck/failing in Express or Repack.
  • From the Computing Plan of the Day you already know whether we are in a data-taking period or not. When we are, any error should be reported in the "T0 processing" Elog; runs should not be delayed.
  • A drawing of the main Tier-0 components can be found below (the boxes on the picture are clickable).
  • <img width="640" usemap="#Tier0c4df3fe2" alt="" src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/Tier0.jpg" height="452" border="0" /><map name="Tier0c4df3fe2"><area shape="rect" coords="23,341,127,419" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host" alt="" /><area shape="rect" coords="251,192,352,269" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host" alt="" /><area shape="rect" coords="23,192,127,269" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:votape&period=month" alt="" /><area shape="rect" coords="165,220,225,235" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:rwfull&period=month" alt="" /><area shape="rect" coords="165,380,225,400" href="https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=CERN&dest_filter=T1&no_mss=true&period=l7d&upto=" alt="" /><area shape="rect" coords="251,350,352,430" href="https://cmsweb.cern.ch/prodmon/plots/plot/timebargraph/resourcesPerSite?wf=&ds=&job_type=Any&exit_code=&prod_team=&prod_agent=&starttime=2009-01-01+18:41:38&endtime=2009-02-28+18:41:38&site=T1" alt="" /><area shape="rect" coords="410,40,460,50" href="http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0" alt="" /><area shape="rect" coords="15,45,110,150" href="http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager" alt="" /></map>


    To check the current data-taking status, you can look in two places:

    • DAQ Status URL: http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg (shift-reload this page periodically to bypass the browser cache)
      • INSTRUCTIONS: This tells you if a run is ongoing, the data-taking mode, and if data is being transferred from the CMS detector to the T0. If the DAQ is running, this is specified on the top bar under "DAQ state". Next to it you can find the ongoing Run Number; if that run is sent to the T0, it will be specified in the green field on the bottom right of the view, "TIER0_TRANSFER_ON". The first line under Data Flow on the top right specifies the data-taking mode (the "HLT key", a string containing a tag such as "interfill", "physics", ...). The tag "physics" identifies the most relevant data. The bottom-left histograms show the run history in the last 24h; the run numbers on the graph should be reflected in the T0Mon page (see the Monitoring links below).
    • P5 Elog URL (for information only, see instructions)
      • INSTRUCTIONS: You don't need this elog in "normal" situations, but it may be useful to know what's going on at P5, especially in case of very special events at the CMS detector. You may also use it simply to find out who the online shifters are (Shift leader, DAQ shifter, ...). You will need to log in with your AFS credentials.
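    The status checks above boil down to a simple decision. The sketch below is purely illustrative (the function name, field names, and "Running" string are assumptions, not the real DAQ API): a run is expected in T0Mon only when the DAQ is running, TIER0_TRANSFER_ON is set, and the HLT key is known.

    ```python
    # Hypothetical helper, not part of any CMS tool: decide from the DAQ
    # status fields described above whether the ongoing run should reach T0Mon.
    def run_expected_in_t0mon(daq_state, tier0_transfer_on, hlt_key):
        """True only if the DAQ is running, transfer to T0 is on,
        and the HLT key is known."""
        if daq_state != "Running":
            return False
        if not tier0_transfer_on:
            return False
        return hlt_key != "UNKNOWN"
    ```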

    ---++++ Tier-0 Service


    SLS shows the most urgent alarms of the Tier-0. It monitors the main causes of runs getting stuck: failed jobs, stuck jobs and, in case of overload, backlog.

        • General assumptions
          • Only two states:
            • Green bar - check is fine
            • Orange bar - check is failing
          • Note that there's no other state (e.g. "red")
          • In case of need (see below) open an Elog on the "T0 Processing" category
        • Permanent Failures
          • This is the most critical alarm. It shows whether the T0 had any failed job in the last 24h (just like the counters in T0Mon). It turns orange as soon as there is at least one failure, and the failure won't "go away" until experts fix/resubmit the job or, if it is unrecoverable, until 24h have passed. If this alarm goes orange it will probably stay orange for your whole shift, but you should still check whether the counters are increasing (take a look at the plots). If you see new unknown failures, open an Elog (or reply to an existing one). If you see new Express/Repack failures with no known reason, open an Elog and call the CRC (note: the calling procedure is not yet in force).
        • Long Running jobs
          • This mostly indicates whether everything is normal or not. Jobs can take longer than expected, but if they take much longer they may be stuck due to an infrastructure problem (quite common lately). If you see this alarm go orange, check the jobs it reports: if any job is older than 16h, open an Elog listing them (if the list is too big, there is no need to list them all). As usual, Express and Repack jobs should be quick; if they are stuck for more than 4h, also open an Elog.
        • Backlog
          • This shows how many jobs still have to run and are not yet submitted, so they do not appear in the "running/pending" jobs plot. The thresholds for each kind of job are specified in the alarm page. Express and Repack should not accumulate a backlog; a PromptReco backlog is acceptable. If you see a backlog above the thresholds for Express or Repack, open an Elog; PromptReco above threshold is still fine.
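    The age thresholds above (16h for any job, 4h for Express and Repack) can be sketched as a small check. This is a hedged illustration only; the job-record format and the `stuck_jobs` helper are assumptions, not the real T0 data model.

    ```python
    from datetime import datetime, timedelta

    # Illustrative limits taken from the shift instructions above:
    # Express/Repack jobs should not run longer than 4h, anything else 16h.
    LIMITS = {"Express": timedelta(hours=4),
              "Repack": timedelta(hours=4)}
    DEFAULT_LIMIT = timedelta(hours=16)

    def stuck_jobs(jobs, now):
        """jobs: iterable of (job_id, job_type, start_time) tuples.
        Returns the ids of jobs exceeding their age limit, in input order."""
        flagged = []
        for job_id, job_type, started in jobs:
            limit = LIMITS.get(job_type, DEFAULT_LIMIT)
            if now - started > limit:
                flagged.append(job_id)
        return flagged
    ```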

#T0WorkflowT0Mon
---++++ Check the status of Tier-0 components and workflows through T0Mon %COMPLETE5%

    • T0Mon URL: https://cmsweb.cern.ch/T0Mon/
        • Check that T0 components are up: Look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red as below, open an elog in the T0 processing category.
          <img alt="T0Mon-heartbeats.png" src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/T0Mon-heartbeats.png" width="900" />
        • Check for new runs appearing: New runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page (excluding TransferTest or local runs with labels such as privmuon). If a new run doesn't appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog in the T0 processing category.
          <img alt="T0Mon-runs.png" src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/T0Mon-runs.png" />
          INSTRUCTIONS:
            • First check on the DAQ page (http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg) that, in the right column (Data Flow), data taking was declared and that all the following fields are green: LHC RAMPING OFF, PreShower HV ON, Tracker HV ON, Pixel HV ON, Physics declared. Also check that, at the bottom of the right column, TIER0_TRANSFER_ON is green.
            • Then check the current status of data taking on the Storage Manager URL given above. You should see the current run in the upper box called Last streams. Check that there are files per stream to be transferred in the Files column.
            • If there are more than 10 files, check the Transf column. If there are zero files listed for a stream that has files to be transferred, open an elog in the T0 processing section. If there are more than 1000 files not transferred, call the CRC.
            • Check the LAST SM INSTANCES boxes for lines with numbers marked in RED and open an elog in the T0 section.
            • Finally, look at the Last Runs box and also elog all lines with numbers marked in RED.
            • As usual, please be as verbose as possible in elogs.
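    The heartbeat check above can be sketched as follows. This is a minimal illustration under stated assumptions: the component names and the staleness threshold are made up for the example, and T0Mon does not expose such an API.

    ```python
    # Hypothetical sketch: flag T0 components whose last heartbeat is too old,
    # mirroring the rows that turn red in T0Mon's heartbeat section.
    def stale_components(heartbeat_ages, max_age_s=600):
        """heartbeat_ages: dict of component name -> seconds since last heartbeat.
        Returns the sorted names of components to report in an elog."""
        return sorted(name for name, age in heartbeat_ages.items()
                      if age > max_age_s)
    ```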

#T0WorkflowCastorPoolst0t1
---++++ Check the Castor pools/subclusters used by CMS: focus on t0export and t1transfer %COMPLETE5%

    • URL1s: load-average summary and network utilization: t0export, t1transfer
    • URL2s: load-average history of t0export hosts, load-average history of t1transfer hosts
        • For each of the URL1 links above, check the "load average" pie chart on the left-hand side: if some hosts have a load average higher than 10, check the load-average history (URL2) of each host. If you see a problematic node like in the example image below, open an ELOG in the "T0 processing" category; however, first make sure the node is in production status (shown at the top of its page). <br /><img src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/lxb8468_1_0_LOADAVG_STACKEDL_1.gif.png" alt="lxb8468_1_0_LOADAVG_STACKEDL_1.gif.png" width='441' height='172' /> <br /><br /> Also check the network utilization plot on the bottom right side of URL1s: if you see a sustained throughput at the resource-saturation plateau (that is, number of hosts × 100 MByte/s), open an ELOG in the "T0 processing" category.
          <img alt="t0exportTotal.png" src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/t0exportTotal.png" />
    • URL3s: Total space on t0export https://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week&title=
        • The total space available on t0export should be constant. Watch out for sudden drops, as these indicate a missing disk server as shown below. If this happens, make an ELOG post in the T0 processing category.
          <img alt="t0exportTotal.png" src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/t0exportTotal.png" />
    • URL4s: Active Transfers/Queued Transfers on the t0export pool.
        • There can be up to a few thousand active transfers but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 processing category.
    • URL5s: Active Transfers/Queued Transfers on t1transfer pool.
        • There can be up to a thousand active transfers but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 processing category.
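    The two pool checks above (queued-transfer limit and sudden total-space drops) can be sketched together. This is a hedged illustration: the 5% drop fraction and the `pool_alarms` helper are assumptions for the example, while the 200-queued-transfers limit comes from the instructions above.

    ```python
    # Illustrative check combining the Castor-pool rules above.
    QUEUE_LIMIT = 200  # queued transfers should not exceed this (from the text)

    def pool_alarms(queued_transfers, space_history_tb, drop_fraction=0.05):
        """space_history_tb: chronological samples of the pool's total space.
        A drop larger than drop_fraction between samples suggests a disk
        server fell out of the pool."""
        alarms = []
        if queued_transfers > QUEUE_LIMIT:
            alarms.append("queued transfers above %d" % QUEUE_LIMIT)
        for prev, cur in zip(space_history_tb, space_history_tb[1:]):
            if cur < prev * (1 - drop_fraction):
                alarms.append("total space dropped from %.1f TB to %.1f TB"
                              % (prev, cur))
        return alarms
    ```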

#T0WorkflowWNs
---++++ Check the activity on the Tier-0 LSF WNs %COMPLETE5%

#T0WorkflowJobs
---++++ Check the Tier-0 Jobs %COMPLETE5%

    • URL1: Queued and running Jobs on cmst0
    • INSTRUCTIONS: The cmst0 queue uses ~2,800 job slots. If URL1 shows a sustained saturation of running (green, at 2.8k) or pending (blue) jobs like in the example picture below, it might still be OK, but it is worth notifying the DataOps team via an ELOG in the "T0 processing" category. <img src="/twiki/pub/Sandbox/SamirCurySandboxCSP2/cmsproc.rrd_RUN_d.gif" alt="cmsproc.rrd_RUN_d.gif" width='495' height='296' />
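    "Sustained saturation" of the ~2,800-slot queue can be sketched as a check over a series of running-job samples. The window length and 95% fill fraction are assumptions for the example; the slot count comes from the instructions above.

    ```python
    # Illustrative sketch: detect sustained saturation of the cmst0 queue.
    SLOTS = 2800  # approximate number of job slots (from the text above)

    def sustained_saturation(running_samples, window=6, fill=0.95):
        """True if `window` consecutive samples sit at >= fill * SLOTS,
        i.e. the queue stayed pinned near capacity rather than spiking once."""
        streak = 0
        for running in running_samples:
            streak = streak + 1 if running >= fill * SLOTS else 0
            if streak >= window:
                return True
        return False
    ```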


    -- SamirCury - 15-Mar-2012
