%TABLE{ sort="off" tableborder="0" cellpadding="1" cellspacing="1" cellborder="0" }% | *Tier-0 workflows* |||||| | %RED% CMS Online Services:%ENDCOLOR% [[http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg][DAQstatus]],[[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager][SM]], [[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Elog?_piref815_429145_815_429142_429142.strutsAction=%2FviewSubcatMessages.do%3FcatId%3D492%26subId%3D7%26page%3D1%26fetch%3D1][P5 Elog]] | %RED%Tier-0 Service :%ENDCOLOR% [[http://cmsprod.web.cern.ch/cmsprod/sls/][T0 Alarms]], [[http://cmsweb-testbed.cern.ch/T0Mon/][T0Mon]] | %RED%T0export Castor Pool :%ENDCOLOR% [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host][Load&Network]], [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week][Total Space]], [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Active+transfers&period=day][Active Transfers]], [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Queued+transfers&period=day][Queued Transfers]] | %RED%T1transfer Castor Pool :%ENDCOLOR% [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host][Load&Network]], [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Active+transfers&period=day][Active Transfers]], [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Queued+transfers&period=day][Queued Transfers]] | %RED%cmst0 LSF Pool :%ENDCOLOR% [[http://lemonweb.cern.ch/lemon-web/info.php?entity=lxbatch/cmst0&cluster=1&type=host][Load&Network]] | %RED%Tier-0 Jobs :%ENDCOLOR% [[http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0][cmst0 Jobs]] | --- #T0Workflow ---+++ Tier-0 workflows monitoring ---++++ Introduction / Machine status *Please read this once at the start of your shift* to learn about the Tier-0 workflow. 
This introduction will help you understand the importance of the different components of the workflow and which problems to look for. * %TWISTY{id="T0Intro" mod="div" showlink="A general T0 description is provided here" hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%"}%<br /> * The T0 is one of the most important computing systems of CMS: it is responsible for creating the RAW datasets out of the data stream sent from the pit. It also handles the first reconstruction pass of the RAW data, called !PromptReco. It runs many kinds of jobs against the collision data; the most important job types are EXPRESS and REPACK. EXPRESS jobs speedily reconstruct a special portion of the RAW data coming from the detector and are supposed to finish within 1 hour of the recording of that data. REPACK jobs process all the data coming from the pit, convert it into RAW files and split them into Primary datasets. These jobs should run in real time; a delay impacts all subsequent workflows. For example, online shifters for the detector subsystems cannot work if these jobs get delayed. The main problems that can be encountered are stuck transfers from P5, or Express and/or Repack jobs failing and causing runs to get stuck within the T0 processing chain. %ICON{new}% * As you have read in the [[http://cmsdoc.cern.ch/cmscc/shift/today.jsp][Computing Plan of the Day]], you already know whether we are in a data-taking period or not. When we are, any error in the T0 should be reported.
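The EXPRESS timing target above can be sketched as a simple check. This is a hypothetical illustration, not a CMS tool; in practice the timestamps would come from !T0Mon:

```python
from datetime import datetime, timedelta

# Hypothetical sketch of the EXPRESS latency rule: EXPRESS jobs are
# supposed to finish within 1 hour of the recording of the data.
EXPRESS_DEADLINE = timedelta(hours=1)

def express_is_late(recorded_at: datetime, finished_at: datetime) -> bool:
    """Return True if the EXPRESS pass exceeded the 1-hour target."""
    return finished_at - recorded_at > EXPRESS_DEADLINE

t0 = datetime(2013, 1, 11, 12, 0)
print(express_is_late(t0, t0 + timedelta(minutes=90)))  # True: 90 min > 1 h
print(express_is_late(t0, t0 + timedelta(minutes=45)))  # False: within target
```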
We should not have runs delayed.%ENDTWISTY% * A drawing of the main Tier-0 components can be found below ( *clickable boxes on the picture may be used*) <img width="640" usemap="#Tier0c4df3fe2" alt="" src="%ATTACHURLPATH%/Tier0.jpg" height="452" border="0" /><map name="Tier0c4df3fe2"><area shape="rect" coords="23,341,127,419" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host" alt=""><area shape="rect" coords="251,192,352,269" href="http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host" alt=""><area shape="rect" coords="23,192,127,269" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:votape&period=month" alt=""><area shape="rect" coords="165,220,225,235" href="http://sls.cern.ch/sls/history.php?id=CASTORTapeCMS&more=nv:rwfull&period=month" alt=""><area shape="rect" coords="165,380,225,400" href="https://cmsweb.cern.ch/phedex/prod/Activity::RatePlots?graph=quantity_rates&entity=dest&src_filter=CERN&dest_filter=T1&no_mss=true&period=l7d&upto=" alt=""><area shape="rect" coords="251,350,352,430" href="https://cmsweb.cern.ch/prodmon/plots/plot/timebargraph/resourcesPerSite?wf=&ds=&job_type=Any&exit_code=&prod_team=&prod_agent=&starttime=2009-01-01+18:41:38&endtime=2009-02-28+18:41:38&site=T1" alt=""><area shape="rect" coords="410,40,460,50" href="http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0" alt=""><area shape="rect" coords="15,45,110,150" href="http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager" alt=""></map> #T0WorkflowCMSOnline ---++++ CMS Online Services Check the CMS Online status related to *incoming runs* --- %COMPLETE5% During LHC collisions, periodically cross-check the !T0Mon monitoring below against the ongoing data-taking status seen here.
* DAQ Status URL: [[http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg][http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg]] (shift-reload this periodically) * %TWISTY{id="DAQStatus" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: This overview indicates whether a run is currently ongoing, specifies the data-taking mode, and shows whether data is being transferred from the CMS detector to the Tier-0. In the top center of the page you can find the ongoing Run Number; a green field on the bottom right shows "TIER0_TRANSFER_ON" when data is sent to the Tier-0. The first line under "Data Flow" on the top right specifies the data-taking mode (or "HLT key", which is a string containing a tag such as "interfill", "physics", ...). The tag "physics" marks the most relevant data. The bottom left histogram shows the run history in the last 24h; the run numbers specified on the graph should be reflected in the !T0Mon page, see Monitoring links below.%ENDTWISTY% * Storage Manager URL: [[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager][http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager]] * %TWISTY{id="SMURL" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: Check on the DAQ page above (see URL in the previous bullet) that the latest run number in the Storage Manager matches the current run number (if T0_Transfer is on and some data has been logged). If T0_Transfer is on and the run number is not the latest, open an Elog. The Storage Manager also shows the whole transfer/preliminary processing chain.
close -> inject -> transfer -> check -> repack is what matters; if you see missing files (the next number is lower than the previous one) and it is red, open an Elog (if it is blue, it is expected). If you see anything red around the page, e.g. Server down, open an Elog. <!--This is a bit tricky - to do this right at the moment you have to psychically determine the intent of the folks at P5. We have in the past had trouble where intended global runs were taken with a setuplabel of TransferTestWithSafety. This is bad when you intend to keep the data, as it means the data will be transferred but then autodeleted. It is also a legitimate test mode at the right time. Until we find a good way to help you read the mind of P5, for now only alarm when we are clearly taking beam or collision data. In addition, please exclude entries for the *OnlineErrors* stream.-->%ENDTWISTY% * [[http://cms-alcadb.web.cern.ch/cms-alcadb/Monitoring/PCLTier0Workflow/][Prompt Calibration Loop URL]] * %TWISTY{id="P5ElogURL" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: This monitoring page (look only at *Latency since the end of the run*) checks whether Express workflows delivered the calibration payload to the online systems (a black dot corresponding to the run on the X axis). This has to happen before !PromptReco starts (red dotted line). Runs before (to the left of) the red line can be ignored. If the red line gets too close (3 runs or fewer) to the last run that uploaded conditions (has the black dot), open an Elog to warn the Tier-0 Experts.
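The "3 runs before" rule can be sketched as below. This is a hypothetical illustration only; run positions would be read off the PCL monitoring plot, and run "distance" is approximated here by consecutive indices:

```python
# Hypothetical sketch of the Prompt Calibration Loop proximity rule: warn
# when the advancing PromptReco line (red dotted line) comes within 3 runs
# of the last run that uploaded conditions (the last black dot).
def pcl_needs_elog(last_uploaded_run: int, promptreco_line_run: int,
                   max_gap: int = 3) -> bool:
    """True when the gap between the red line and the last black dot,
    counted in run positions on the plot, is 3 runs or fewer."""
    return last_uploaded_run - promptreco_line_run <= max_gap

print(pcl_needs_elog(10, 8))  # True: only 2 runs of margin left
print(pcl_needs_elog(10, 2))  # False: 8 runs of margin
```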
%ENDTWISTY% * [[http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Elog?_piref815_429145_815_429142_429142.strutsAction=%2FviewSubcatMessages.do%3FcatId%3D2%26subId%3D7%26page%3D1%26fetch%3D1][P5 Elog URL]] (for information only, see instructions) * %TWISTY{id="P5ElogURL" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: This elog is not needed in "normal" situations, but it may be useful in cases of very special events at the CMS detector. You may also use it simply to find out who the online shifter is (== the shift role corresponding to yours, but for everything related to online data taking). You will need to log in with your AFS credentials. %ENDTWISTY% ---++++ Tier-0 Service Check the *most urgent alarms of the Tier-0* --- %COMPLETE5% The main causes of runs getting stuck are monitored: *Failed jobs*, *Stuck jobs* and, in case of overload, *Backlog*. * Tier-0 Alarms URL : http://cmsprod.web.cern.ch/cmsprod/sls/ * %TWISTY{id="T0Critical" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * General assumptions * Green bar - check is fine * Yellow bar - check is failing; there is no "red" or worse state, it is binary * There are no defective plots; values = 0 are fine * If a monitoring element shows a yellow bar, report it in the Elog in the T0 Processing category. * Permanent Failures * This is the most important check in this category. It checks whether there have been any failed jobs in the last 24h. The information corresponds to the counters in the !T0Mon page (see below); it will turn yellow once a single job failure is detected and will not change back to green until an expert fixes/resubmits the job.
Even if it is yellow, check whether the number of failures is increasing (look at the [[https://sls.cern.ch/sls/history.php?id=CMST0-permanent-failures&more=ALL&period=24h][plots]]). If you see new failures and there is no known reason (check the Elog), open an Elog. * Backlog * This shows how many jobs are queued in the Tier-0 system and are *still not submitted* to the batch system, so you won't see them in the "running/pending" jobs [[http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0][plot]]. Thresholds have been specified in the alarm [[http://sls.cern.ch/sls/service.php?id=CMST0-backlog][page]] for each kind of job. If you see any job type accumulating above the thresholds, open an Elog. * Long jobs * This is mostly an indicator of whether everything is normal or not. Jobs can take longer than expected, but if they take a lot longer, it is possible that they are stuck due to an infrastructure problem. If you see the alarm going yellow, open an Elog. * Cluster health * Skip for now; this is a new alarm under commissioning. Experts are taking care of this alarm for now. %ENDTWISTY% --- #T0WorkflowT0Mon Check the *status of Tier-0 components and workflows through !T0Mon* --- %COMPLETE5% * !T0Mon URL: [[https://cmsweb-testbed.cern.ch/T0Mon/]] * %TWISTY{id="T0Mon" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * Check that T0 components are up: Look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red as below, open an Elog in the T0 category.
%BR% <img width="900" alt="T0Mon-heartbeats.png" src="%ATTACHURLPATH%/T0Mon-heartbeats.png" /> * Check for new runs appearing: New runs should appear on !T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page above (excluding TransferTest or local runs with labels such as privmuon). If a new run doesn't appear on !T0Mon, or shows up with an HLTKey of "UNKNOWN", open an Elog. %BR% <img alt="T0Mon-runs.png" src="%ATTACHURLPATH%/T0Mon-runs.png" /> <!-- * Check number of streamers: Within a short delay, the number of Streamers for a run shown by !T0Mon should match the number of files in the "safe 99" column on the Storage Manager page. If there is a discrepancy which persists, especially for older runs, make an elog post.-->%ENDTWISTY% --- #T0WorkflowCastorPoolst0t1 Check the *Castor pools/subclusters used by CMS: focus on t0export and t1transfer* --- %COMPLETE5% * URL1s: load average summary and network utilization : [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t0export&cluster=1&type=host][t0export]], [[http://lemonweb.cern.ch/lemon-status/info.php?time=0&offset=0&entity=c2cms/t1transfer&cluster=1&type=host][t1transfer]] * %TWISTY{id="Castor2" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * For each of the URL1 links above, check the "load average" pie chart on the left-hand side: if you see that some hosts have a load average higher than 20, please open an ELOG in the "T0" category. <br /> <br /> Check also the network utilization plot on the bottom right-hand side of the URL1s: if you see a sustained throughput at the resource-saturation plateau (that is, # hosts x 100 Mbyte/s), open an ELOG in the "T0" category.%ENDTWISTY%
* URL2s: Total space on t0export [[https://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week&title=]] * %TWISTY{id="Castor3" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * The total space available on t0export should be constant. Watch out for sudden drops, as these indicate a missing disk server, as shown below. If this happens, make an ELOG post in the T0 category. %BR% <img alt="t0exportTotal.png" src="%ATTACHURLPATH%/t0exportTotal.png" /> %ENDTWISTY% * URL3s: [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Active+transfers&period=day][Active Transfers]]/[[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Queued+transfers&period=day][Queued Transfers]] on the t0export pool. * %TWISTY{id="Castor6" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * There can be up to a few thousand active transfers, but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category. %ENDTWISTY% * URL4s: [[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Active+transfers&period=day][Active Transfers]]/[[http://sls.cern.ch/sls/history.php?id=CASTORCMS_T1TRANSFER&more=nv:Queued+transfers&period=day][Queued Transfers]] on the t1transfer pool. * %TWISTY{id="Castor7" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%INSTRUCTIONS: * There can be up to a thousand active transfers, but the number of queued transfers should not go above 200. If this happens, please make an ELOG post in the T0 category.
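The queued-transfer rule for both pools can be sketched as below. This is a hypothetical illustration; the counts would be read off the SLS plots linked above:

```python
# Hypothetical sketch of the Castor queued-transfer alarm rule: active
# transfers may reach the thousands, but queued transfers above 200 on
# either pool (t0export or t1transfer) warrant an ELOG post.
QUEUED_TRANSFER_LIMIT = 200

def pool_needs_elog(queued_transfers: int,
                    limit: int = QUEUED_TRANSFER_LIMIT) -> bool:
    """True when the queued-transfer count crosses the alarm threshold."""
    return queued_transfers > limit

print(pool_needs_elog(150))  # False: below threshold
print(pool_needs_elog(450))  # True: open an ELOG in the T0 category
```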
%ENDTWISTY% --- #T0WorkflowWNs Check the *activity on the Tier-0 LSF WNs* --- %COMPLETE5% * URL1: [[http://lemonweb.cern.ch/lemon-web/info.php?entity=lxbatch/cmst0&cluster=1&type=host][Load average summary and network utilization of the cmst0 batch cluster]] * %TWISTY{id="T0WNs" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }% For URL1 above, check the CPU Utilization and the "load average" pie chart on the left-hand side: if you see a sustained CPU utilization above 95%, open an ELOG in the "T0" category. <br> %ENDTWISTY% --- #T0WorkflowJobs Check the *Tier-0 Jobs* --- %COMPLETE5% * URL: [[http://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0][Queued and running Jobs on cmst0]] * %TWISTY{id="T0Jobs" mod="div" showlink="Show INSTRUCTIONS " hidelink="Hide " remember="off" showimgright="%ICONURLPATH{toggleopen-small}%" hideimgright="%ICONURLPATH{toggleclose-small}%" start="hide" }%The cmst0 queue currently has ~3500 job slots. We should use all of them if there are jobs to run ([[http://sls.cern.ch/sls/history.php?id=CMST0-backlog&more=ALL&period=24h][backlog]]). If there are more than 3500 pending+running jobs, but far fewer than 3500 running, it might be an LSF issue; report it in the T0 section of the Elog.%ENDTWISTY%
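The slot-usage heuristic above can be sketched as below. This is a hypothetical illustration; the "well below capacity" margin of 80% is an assumption, not an official threshold, and the running/pending counts would be read off the LSF plot:

```python
# Hypothetical sketch of the cmst0 heuristic: with ~3500 slots and enough
# jobs queued, the running count should sit near capacity. Many pending
# jobs with far fewer running suggests an LSF problem worth an Elog entry.
CMST0_SLOTS = 3500

def suspect_lsf_issue(running: int, pending: int, slack: float = 0.8) -> bool:
    """True if total demand exceeds the slot count while the running job
    count stays well below it (here, below 80% of the slots; assumed)."""
    return (running + pending) > CMST0_SLOTS and running < slack * CMST0_SLOTS

print(suspect_lsf_issue(running=1200, pending=4000))  # True: slots idle despite demand
print(suspect_lsf_issue(running=3450, pending=2000))  # False: queue is nearly full
```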
Topic revision: r1 - 2013-01-11 - DiegoBallesterosVillamizar
Copyright © 2008-2021 by the contributing authors. All material on this collaboration platform is the property of the contributing authors.