#T0Workflow
---+++ Tier-0 workflows monitoring

---++++ Introduction / Machine status

*Please read this once at the start of your shift* to learn about the Tier-0 workflow. This introduction will help you to understand the importance of the different components of the workflow and which problems to look for.

   * %TWISTY{id="T0Intro" mod="div" showlink="General description of the Tier-0 workflow" hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%"}%<br /> The T0 is one of the most important computing systems of CMS: it is responsible for creating the RAW datasets out of the data stream sent from the pit, and it also handles the first reconstruction pass of the RAW data, called PromptReco. It runs many kinds of jobs against the collision data; the most important job types are EXPRESS and REPACK. EXPRESS jobs quickly reconstruct a special portion of the RAW data coming from the detector and are supposed to finish within 1 hour of the recording of this data. REPACK jobs process all the data coming from the pit, convert it into RAW files and split them into Primary Datasets. These jobs should run in real time; a delay impacts all subsequent workflows. For example, online shifters for the detector subsystems cannot work if these jobs get delayed. The main problems that can be encountered are stuck transfers from P5 and Express and/or Repack job failures, which cause runs to get stuck within the T0 processing chain. %ICON{new}% Please check the [[http://cmsdoc.cern.ch/cmscc/shift/today.jsp][Computing Plan of the Day]] to see whether the LHC is colliding beams and we are taking data. In this case, any Tier-0 error should be reported to avoid delaying the processing of new runs.%ENDTWISTY%
   * A drawing of the main Tier-0 components can be found below (*clickable boxes on the picture may be used*): <img width="640" usemap="#Tier0c4df3fe2" alt="" src="%ATTACHURLPATH%/Tier0.jpg" height="452" border="0" /><map name="Tier0c4df3fe2"><area shape="rect" coords="23,341,127,419" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host" alt=""><area shape="rect" coords="251,192,352,269" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host" alt=""><area shape="rect" coords="23,192,127,269" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:votape&period=month" alt=""><area shape="rect" coords="165,220,225,235" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:rwfull&period=month" alt=""><area shape="rect" coords="165,380,225,400" href="https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=CERN&dest_filter=T1&no_mss=true&period=l7d&upto=" alt=""><area shape="rect" coords="251,350,352,430" href="https://cmsweb.cern.ch/prodmon/plots/plot/timebargraph/resourcesPerSite?wf=&ds=&job_type=Any&exit_code=&prod_team=&prod_agent=&starttime=2009-01-01+18:41:38&endtime=2009-02-28+18:41:38&site=T1" alt=""><area shape="rect" coords="410,40,460,50" href="http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0" alt=""><area shape="rect" coords="15,45,110,150" href="http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager" alt=""></map>

The status of the ongoing data taking can be seen at the 2 following locations: %ICON{new}%
   * DAQ Status URL: [[http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg][http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg]] (shift-reload this page periodically)
%TWISTY{id="DAQStatus" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: This overview indicates whether a run is currently being ongoing and specifies the data taking mode and if data is being transferred from the CMS detector to the Tier-0. In the top center of the page you can find the on-going Run Number; a green field on the bottom right indicates "TIER0_TRANSFER_ON" when data is sent to the Tier-0. The first line under "Data Flow" on the top right specifies the data taking mode (or "HLT key", which is a string containing a tag such as "interfill", "physics", ...). The tag "physics" is the most relevant data. The bottom left histograms shows the run history in the last 24h; the run numbers specified on the graph should be reflected in the T0Mon page, see Monitoring links below.%ENDTWISTY% * [[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Elog?_piref815_429145_815_429142_429142.strutsAction=%2FviewSubcatMessages.do%3FcatId%3D2%26subId%3D7%26page%3D1%26fetch%3D1][P5 Elog URL]] (for information only, see instructions) * %TWISTY{id="P5ElogURL" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: This elog is not required to be followed by the CSP in "normal" situations, however it may be useful in cases of very special events at the CMS detector. You may use it to find out who is the current online shifter (== the shift role corresponding to yours, but for every thing related to online data taking). You will need to log in with your AFS credentials. %ENDTWISTY% ---++++ Tier-0 Service #SLS Checks of the *the most urgent alarms of the Tier-0 *. The main causes of runs getting stuck are monitored: *Failed jobs *, *Stuck jobs * and in case of overload, *Backlog *. %ICON{new}% * URL : http://cmsprod.web.cern.ch/cmsprod/sls/ * * %TWISTY{id="T0Critical" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * General assumptions * Green bar - check is fine * Yellow bar - check is failing, there's no "red", or worse state. It's binary. * Report if a monitoring element shows a yellow bar in the Elog in the T0 Processing category. * Permanent Failures * This is the most important check in this category. It checks whether there have been any failed jobs in the last 24h. The information corresponds to the counters in the T0mon page (see below), it will turn yellow once a single job failure was detected, it will not change back to green until an experts fixes/resubmits the job. If the failure is not recoverable, the field will change to back to green automatically green after 24h passed by and if there have been no new job failures. If this alarm goes yellow, it can stay yellow during all your shift, but you should always check if the [[https://sls.cern.ch/sls/service.php?id=CMST0-permanent-failures][counters]] are increasing (take a look in the [[https://sls.cern.ch/sls/history.php?id=CMST0-permanent-failures&more=ALL&period=24h][plots]]). If you see new unknown failures, open an Elog (or answer an existing one). If you see *new * Express/Repack failing with no known reason, you should open an Elog and Call the CRC. 
---
#T0WorkflowT0Mon
Check the *status of Tier-0 components and workflows through T0Mon* %COMPLETE5%
   * T0Mon URL: [[https://cmsweb.cern.ch/T0Mon/]]
   * %TWISTY{id="T0Mon" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS:
      * Check that all T0 components are up: look at the "Time from component's last heartbeat" section near the top of the page. If one or more components show up in red as below, open an elog in the T0 category (a scripted version of this heartbeat check is sketched after this subsection). %BR% <img width="900" alt="T0Mon-heartbeats.png" src="%ATTACHURLPATH%/T0Mon-heartbeats.png" />
      * Check for new runs appearing: new runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page (linked in the CMS Online status check below), excluding TransferTest or local runs with labels such as "privmuon". If a new run does not appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog. %BR% <img alt="T0Mon-runs.png" src="%ATTACHURLPATH%/T0Mon-runs.png" />
      <!-- * Check number of streamers: within a short delay, the number of Streamers for a run shown by T0Mon should match the number of files in the "safe 99" column on the Storage Manager page. If there is a discrepancy which persists, especially for older runs, make an elog post. -->
   %ENDTWISTY%
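If you would rather be alerted about stale heartbeats than keep the T0Mon page open, a minimal scraping sketch is shown below. The table row layout matched by the regular expression is purely hypothetical (adapt it to the actual HTML of the heartbeat table), and cmsweb normally requires your grid certificate or SSO cookie, which plain urllib2 does not provide, so treat this only as a starting point.

<verbatim>
#!/usr/bin/env python
"""Minimal sketch: flag stale T0 component heartbeats from the T0Mon page.
ASSUMPTIONS: the row format matched below is hypothetical, and cmsweb access
normally needs a grid certificate/SSO cookie not handled by plain urllib2."""

import re
import urllib2

T0MON_URL = "https://cmsweb.cern.ch/T0Mon/"
MAX_AGE = 15 * 60   # seconds since last heartbeat before we call a component stale

def stale_components(html):
    """Return [(component, age_in_seconds)] for components with an old heartbeat."""
    stale = []
    # Hypothetical row format: <tr><td>ComponentName</td><td>1234 s</td></tr>
    for name, age in re.findall(r"<tr><td>(\w+)</td><td>(\d+)\s*s</td></tr>", html):
        if int(age) > MAX_AGE:
            stale.append((name, int(age)))
    return stale

if __name__ == "__main__":
    page = urllib2.urlopen(T0MON_URL, timeout=30).read()
    for name, age in stale_components(page):
        print "%s: last heartbeat %d s ago; open an elog in the T0 category" % (name, age)
</verbatim>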
---
#T0WorkflowCMSOnline
Check the CMS Online status: *Incoming Runs* %COMPLETE5%

During LHC collisions, periodically check the T0Mon monitoring above against the ongoing data taking status.
   * Storage Manager URL: [[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager][http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager]]
   * %TWISTY{id="SMURL" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: Check on the DAQ page ([[http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg][http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg]]) that the latest run number in the Storage Manager matches the current run number (if T0_Transfer is on). If T0_Transfer is on and the run number is not the latest, open an Elog. The Storage Manager also shows the whole transfer/preliminary processing chain: close -> inject -> transfer -> check -> repack is what matters. If you see missing files (the next number is lower than the previous one) and the entry is red, open an elog (if it is blue, it is expected). If you see anything else red on the page, e.g. "Server down", open an Elog. <!-- This is a bit tricky - to do this right at the moment you have to psychically determine the intent of the folks at P5. We have in the past had trouble where intended global runs were taken with a setuplabel of TransferTestWithSafety. This is bad when you intend to keep the data, as it means the data will be transferred but then autodeleted. It is also a legitimate test mode at the right time. Until we find a good way to help you read the mind of P5, for now only alarm when we are clearly taking beam or collision data. In addition, please exclude entries for the *OnlineErrors* stream. -->%ENDTWISTY%

---
#T0WorkflowCastorPoolst0t1
Check the *Castor pools/subclusters used by CMS: focus on t0export and t1transfer* %COMPLETE5%
   * URL1s: load average summary and network utilization: [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host][t0export]], [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host][t1transfer]]
   * %TWISTY{id="Castor2" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: For each of the URL1 links above, check the "load average" pie chart on the left-hand side: if you see that some hosts have a load average higher than 10, please open an ELOG in the "T0" category. <br /> <br /> Check also the network utilization plot on the bottom right-hand side of the URL1s: if you see a sustained throughput at the resource-saturation plateau (that is, number of hosts x 100 MByte/s), open an ELOG in the "T0" category. %BR% <img alt="t0exportTotal.png" src="%ATTACHURLPATH%/t0exportTotal.png" /> %ENDTWISTY%
   * URL2s: Total space on t0export: [[https://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week&title=]]
   * %TWISTY{id="Castor3" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: The total space available on t0export should be constant. Watch out for sudden drops, as these indicate a missing disk server, as shown below. If this happens, make an ELOG post in the T0 category. %BR% <img alt="t0exportTotal.png" src="%ATTACHURLPATH%/t0exportTotal.png" /> %ENDTWISTY%
   * URL3s: [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Active+transfers&period=day][Active Transfers]]/[[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Queued+transfers&period=day][Queued Transfers]] on the t0export pool
   * %TWISTY{id="Castor6" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: There can be up to a few thousand active transfers, but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category (a scripted version of this queued-transfers check is sketched after this subsection). %ENDTWISTY%
   * URL5s: [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Active+transfers&period=day][Active Transfers]]/[[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Queued+transfers&period=day][Queued Transfers]] on the t1transfer pool
   * %TWISTY{id="Castor7" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: There can be up to a thousand active transfers, but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category. %ENDTWISTY%
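As a complement to reloading the "Queued transfers" plots, the minimal sketch below compares the queued-transfer counters of the t0export and t1transfer pools against the 200-transfer threshold quoted above. As with the counter watcher earlier in this page, the SLS XML endpoint and the "Queued transfers" metric name are assumptions; cross-check them against the linked SLS plots before trusting the numbers.

<verbatim>
#!/usr/bin/env python
"""Minimal sketch: compare the queued-transfer counters of the t0export and
t1transfer pools against the 200-transfer threshold quoted in the instructions.
ASSUMPTIONS: the SLS endpoint and the 'Queued transfers' metric name below."""

import urllib2
import xml.etree.ElementTree as ET

SLS_URL = "http://sls.cern.ch/sls/getServiceData.php?id=%s"   # assumed endpoint
QUEUE_THRESHOLD = 200
POOLS = ["CASTORCMS_T0EXPORT", "CASTORCMS_T1TRANSFER"]

def queued_transfers(service_id):
    """Read the 'Queued transfers' numeric value published by the service (assumed field)."""
    xml_data = urllib2.urlopen(SLS_URL % service_id, timeout=30).read()
    root = ET.fromstring(xml_data)
    for node in root.findall(".//numericvalue"):
        if node.get("name") == "Queued transfers":
            return float(node.text)
    return None

if __name__ == "__main__":
    for pool in POOLS:
        queued = queued_transfers(pool)
        if queued is None:
            print "%s: could not read queued transfers, check the SLS page by hand" % pool
        elif queued > QUEUE_THRESHOLD:
            print "%s: %d transfers queued (> %d); make an ELOG post in the T0 category" % (
                pool, queued, QUEUE_THRESHOLD)
        else:
            print "%s: %d queued, OK" % (pool, queued)
</verbatim>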
---
#T0WorkflowWNs
Check the *activity on the Tier-0 LSF WNs* %COMPLETE5%
   * URL1: [[http://lemonweb.cern.ch/lemon-web/info.php?entity=lxbatch/cmst0&cluster=1&type=host][CPU and network utilization of the cmst0 batch cluster]]
   * URL1bis: [[http://lemonweb.cern.ch/lemon-web/info.php?entity=lxbatch/cmsproc&cluster=1&type=host][CPU and network utilization of the cmsproc batch cluster]]
   * URL2: [[http://lemonweb.cern.ch/lemon-web/rrd_distribution.php?rrd_metric=LOADAVG&time=1&offset=0&entity=lxbatch%25252Fcmst0&cluster=1&type=host][load-average history of cmst0 hosts]]
   * URL2bis: [[http://lemonweb.cern.ch/lemon-web/rrd_distribution.php?rrd_metric=LOADAVG&time=1&offset=0&entity=lxbatch%25252Fcmsproc&cluster=1&type=host][load-average history of cmsproc hosts]]
   * %TWISTY{id="T0WNs" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }% For URL1 above, check the CPU utilization and the "load average" pie chart on the left-hand side: if you see a sustained CPU utilization above 95% and/or some hosts with a load average higher than 12 in a sustained manner (see the example of a problematic node in the image below), then open an ELOG in the "T0" category. <br /><img width="441" alt="lxb8468_1_0_LOADAVG_STACKEDL_1.gif.png" src="%ATTACHURLPATH%/lxb8468_1_0_LOADAVG_STACKEDL_1.gif.png" height="172" /> <br /> Check also the network utilization plot on the bottom right-hand side of URL1. %ENDTWISTY%

---
#T0WorkflowJobs
Check the *Tier-0 Jobs* %COMPLETE5%
   * URL1: [[http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0][Queued and running jobs on cmst0]]
   * %TWISTY{id="T0Jobs" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%The cmst0 queue is using ~3200 job slots. If you see on URL1 a sustained saturation of running (green, at 2.8k) or pending (blue) jobs like in the example shown in the picture below, it might not be an issue, but it is worth notifying the DataOps team via the ELOG in the "T0" category (a command-line version of this check is sketched after this subsection). <img width="495" alt="cmsproc.rrd_RUN_d.gif" src="%ATTACHURLPATH%/cmsproc.rrd_RUN_d.gif" height="296" /> %ENDTWISTY%
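For a quick look at the cmst0 queue occupancy without the lsf-rrd plot, the sketch below parses the standard bqueues summary line. It assumes a node with the LSF client configured (e.g. lxplus); the 2800 figure is only the saturation level quoted in the instructions above, not an official limit.

<verbatim>
#!/usr/bin/env python
"""Minimal sketch: read pending/running job counts of the cmst0 queue with the
LSF client instead of the lsf-rrd plot.  Assumes a node with the LSF tools
(e.g. lxplus) and the usual 'bqueues' column layout; the 2800 saturation value
simply mirrors the plot description above."""

import subprocess

QUEUE = "cmst0"
SATURATION = 2800   # running-jobs level described as saturation in the text above

def queue_counts(queue):
    """Parse the PEND and RUN columns from the 'bqueues <queue>' summary."""
    out = subprocess.Popen(["bqueues", queue], stdout=subprocess.PIPE).communicate()[0]
    header, row = out.splitlines()[:2]
    columns = dict(zip(header.split(), row.split()))
    return int(columns["PEND"]), int(columns["RUN"])

if __name__ == "__main__":
    pending, running = queue_counts(QUEUE)
    print "cmst0: %d running, %d pending" % (running, pending)
    if running >= SATURATION or pending >= SATURATION:
        print "Sustained saturation of the queue is worth an ELOG to DataOps in the T0 category."
</verbatim>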