New CSP Instructions Draft
Special check of the Castor pool t0export outgoing rate during HI data taking --- %COMPLETE0%
Tier-0 workflows monitoring
Introduction / Machine status
Read this if you haven't already: once you understand what is important and what is not, your shift will be more efficient and you will know which kinds of problems to look for.
- The T0 is one of the most important CMS systems: it is responsible for creating the first RAW datasets, and the RECO datasets, from the real data that comes directly from CMS. It runs many kinds of jobs against the collision data; the most important job types are EXPRESS and REPACK. These should run in real time, and if they are delayed by one hour the whole workflow may be compromised. Keep this in mind during your shift: the main problems to look for are stuck transfers from P5 and runs stuck or failing in Express or Repack.
Monitoring links
- Critical status board - the most urgent alarms provided by the CERN central monitoring systems. It essentially monitors the main reasons runs get stuck: failed jobs, stuck jobs and, in case of overload, an Express backlog.
- More details are given in the SLS alarms themselves.
http://cmsprod.web.cern.ch/cmsprod/sls/
Check the status of Tier-0 components and workflows through T0Mon --- %COMPLETE5%
- T0Mon URL: https://cmsweb.cern.ch/T0Mon/
INSTRUCTIONS:
- Check that T0 components are up: Look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red as below, open an elog in the T0 category.
- Check for new runs appearing: New runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page above (excluding TransferTest or local runs with labels such as privmuon). If a new run doesn't appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog.
- Monitor progress of processing: In the runs table, as processing proceeds, the status for a run will move through "Active", "CloseOutRepack", "CloseOutRepackMerge", "CloseOutPromptReco", "CloseOutRecoMerge", "CloseOutAlcaSkim", "CloseOutAlcaMerge". When processing is complete for a run, and only transfers are remaining, the status will be "CloseOutExport". Fully processed and transferred runs will show up as either "CloseOutT1Skimming" or "Complete". If older runs are stuck in an incomplete status, or show no progress in the status bars over a long period, make an elog post.
- Check for new job failures: In the job stats area beneath the Runs table, look at the number of failed jobs in each of the five job categories. If you see any of these counts increasing, make an elog post immediately.
- Check for slow-running jobs: In the running and acquired jobs tables beneath the job stats, check for jobs which are older than about 12 hours, and open an elog if you see any.
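A minimal sketch (not an official tool) of the two thresholded T0Mon checks above: jobs older than about 12 hours and failed-job counts that increased since the previous check. It assumes the ages and counts are read off T0Mon by hand; the job names, categories and numbers in the example are hypothetical.
```python
# Flag the two elog-worthy conditions from the T0Mon instructions above.
# Inputs are assumed to be noted down by the shifter; nothing here queries T0Mon.

from datetime import timedelta

MAX_JOB_AGE = timedelta(hours=12)   # threshold from the instructions above

def stale_jobs(job_ages):
    """Return the jobs (name -> age) running or acquired for more than 12 h."""
    return {name: age for name, age in job_ages.items() if age > MAX_JOB_AGE}

def new_failures(previous_counts, current_counts):
    """Return job categories whose failed count grew since the last check,
    mapped to (previous, current) -- any entry here warrants an elog."""
    return {cat: (previous_counts.get(cat, 0), count)
            for cat, count in current_counts.items()
            if count > previous_counts.get(cat, 0)}

if __name__ == "__main__":
    # Hypothetical readings from two consecutive checks.
    ages = {"Repack-Run123456-job7": timedelta(hours=14),
            "Express-Run123457-job2": timedelta(hours=1)}
    prev = {"Express": 0, "Repack": 2, "PromptReco": 5}
    curr = {"Express": 0, "Repack": 3, "PromptReco": 5}
    print("Jobs older than 12 h:", stale_jobs(ages))
    print("Failure counts that increased:", new_failures(prev, curr))
```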
- The most important Tier-0 monitoring links are collected here, and the detailed shift instructions follow below. If you wish, you can also use the clickable boxes on the picture below (active map) while you follow the instructions.
Check the CMS Online status: Incoming Runs --- %COMPLETE5%
You should keep an eye on these while we are in active data taking, and periodically cross-check the T0Mon monitoring described above against what they show to be going on at P5.
- Storage Manager URL: http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
INSTRUCTIONS: Check on the DAQ page (http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg) that, in the right-hand column (Data Flow), data taking has been declared and all the following fields are green: LHC RAMPING OFF, PreShower HV ON, Tracker HV ON, Pixel HV ON, Physics declared. Also check that, at the bottom of the right-hand column, TIER0_TRANSFER_ON is green. Then check the current status of data taking on the Storage Manager URL given above. You should first see the current run in the upper box called Last streams. Check that each stream has files to be transferred in the Files column. If there are more than 10 files, check the Transf column. If a stream has files to be transferred but zero files transferred, open an elog in the T0 section. If there are more than 1000 files not transferred, call the CRC.
Next, check the LAST SM INSTANCES boxes for lines with numbers marked in RED and open an elog in the T0 section for any you find.
Finally, look at the Last Runs box and also elog all lines with numbers marked in RED.
As usual, please be as verbose as possible in your elogs.
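A minimal sketch (not an official tool) of the per-stream thresholds above. It assumes the Files and Transf counts are read off the Storage Manager page by hand; the stream names and numbers in the example are purely hypothetical.
```python
# Apply the Storage Manager thresholds from the instructions above to counts
# noted down by the shifter. Nothing here queries the Storage Manager itself.

def check_stream(name, files, transferred):
    """Suggest the action from the shift instructions for one stream.
    The more severe condition (call the CRC) is checked first -- an
    ordering choice, not something prescribed by the instructions."""
    not_transferred = files - transferred
    if not_transferred > 1000:
        return f"{name}: {not_transferred} files not transferred -> call the CRC"
    if files > 0 and transferred == 0:
        return f"{name}: files waiting but none transferred -> open an elog (T0 section)"
    if files > 10:
        return f"{name}: {files} files, {transferred} transferred -> keep an eye on Transf"
    return f"{name}: OK"

if __name__ == "__main__":
    # Hypothetical values for the current run's streams.
    for stream, files, transf in [("Express", 120, 118),
                                  ("A", 2500, 900),
                                  ("HLTMON", 5, 0)]:
        print(check_stream(stream, files, transf))
```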
Check the Castor pools/subclusters used by CMS: focus on t0export and t1transfer --- %COMPLETE5%
- URL1s: load average summary and network utilization: t0export, t1transfer
- URL2s: load-average history of t0export hosts, load-average history of t1transfer hosts
INSTRUCTIONS:
- For each of the URL1 links above, check the "load average" pie chart on the left-hand side: if some hosts have a load average higher than 10, check the load-average history (URL2) of each such host, and if you see a problematic node like in the image example below, open an ELOG in the "T0" category. However, first make sure the node is in production status (shown at the top of its page).
- Also check the network utilization plot on the bottom right-hand side of the URL1 pages: if you see a sustained throughput at the resource-saturation plateau (that is, number of hosts x 100 MByte/s), open an ELOG in the "T0" category.
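A minimal sketch of these two checks, assuming the per-host load averages and the pool throughput are read off the URL1 plots by hand. The load threshold of 10 and the 100 MByte/s-per-host plateau come from the instructions above; the host names, host count and numbers in the example are hypothetical.
```python
# Flag hosts worth inspecting on URL2 and a pool sitting at its network plateau.
# All inputs are values noted down from the monitoring plots; nothing is queried.

LOAD_THRESHOLD = 10          # per-host load average that warrants a closer look
PLATEAU_PER_HOST = 100.0     # MByte/s per host, the saturation plateau from the text

def overloaded_hosts(load_by_host):
    """Hosts whose load average exceeds the threshold (check their URL2 history next)."""
    return [host for host, load in load_by_host.items() if load > LOAD_THRESHOLD]

def pool_saturated(throughput_mbyte_s, n_hosts, fraction=0.95):
    """True if the pool throughput sits at (or very near) the saturation plateau."""
    return throughput_mbyte_s >= fraction * n_hosts * PLATEAU_PER_HOST

if __name__ == "__main__":
    loads = {"lxfsrk1234": 3.2, "lxfsrk5678": 14.7}   # hypothetical hosts
    print("Hosts to inspect on URL2:", overloaded_hosts(loads))
    print("t0export at the plateau:", pool_saturated(1950.0, n_hosts=20))
```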
- URL3s: Total space on t0export: https://sls.cern.ch/sls/history.php?id=CASTORCMS_T0EXPORT&more=nv:Total+Space+TB&period=week&title=
INSTRUCTIONS:
- The total space available on t0export should be constant. Watch out for sudden drops, as these indicate a missing disk server as shown below. If this happens, make an ELOG post in the T0 category.
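A minimal sketch for spotting such a drop, assuming the weekly total-space history from URL3 has been noted down as a simple list of (time, TB) samples. The 5 TB drop threshold and the numbers in the example are illustrative assumptions, not official values.
```python
# Detect sudden drops in the t0export total space, which usually mean a
# missing disk server. Samples are assumed to be read off the SLS plot by hand.

def sudden_drops(samples, max_drop_tb=5.0):
    """Return (start, end, TB lost) for consecutive samples where the total
    space fell by more than max_drop_tb terabytes."""
    drops = []
    for (t0, s0), (t1, s1) in zip(samples, samples[1:]):
        if s0 - s1 > max_drop_tb:
            drops.append((t0, t1, s0 - s1))
    return drops

if __name__ == "__main__":
    # Hypothetical weekly samples in TB: the ~31 TB step down gets flagged.
    history = [("Mon", 900.0), ("Tue", 901.0), ("Wed", 870.0), ("Thu", 870.0)]
    for start, end, lost in sudden_drops(history):
        print(f"Total space dropped by {lost:.0f} TB between {start} and {end} -> ELOG (T0)")
```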
- URL4s: Active Transfers / Queued Transfers on the t0export pool.
INSTRUCTIONS:
- There can be up to a few thousand active transfers, but the number of queued transfers should not go above 200. If it does, please make an ELOG post in the T0 category (a threshold sketch covering both pools follows the t1transfer item below).
- URL5s: Active Transfers / Queued Transfers on the t1transfer pool.
INSTRUCTIONS:
- There can be up to a thousand active transfers, but the number of queued transfers should not go above 200. If it does, please make an ELOG post in the T0 category.
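A minimal sketch covering the queued-transfer checks for both pools (URL4s and URL5s), assuming the active and queued counts are read off the plots by hand. The limit of 200 queued transfers comes from the instructions above; the active-transfer ceilings are only rough renderings of "a few thousand" and "a thousand", and the example numbers are hypothetical.
```python
# Apply the queued/active transfer thresholds to counts noted down from the plots.

QUEUED_LIMIT = 200                                         # from the instructions above
ACTIVE_CEILING = {"t0export": 5000, "t1transfer": 1000}    # assumed rough ceilings

def check_pool(pool, active, queued):
    """Return the suggested action for one Castor pool."""
    if queued > QUEUED_LIMIT:
        return f"{pool}: {queued} queued transfers (> {QUEUED_LIMIT}) -> ELOG in T0"
    if active > ACTIVE_CEILING.get(pool, max(ACTIVE_CEILING.values())):
        return f"{pool}: {active} active transfers looks unusually high -> mention in an ELOG"
    return f"{pool}: OK (active={active}, queued={queued})"

if __name__ == "__main__":
    # Hypothetical readings.
    print(check_pool("t0export", active=3200, queued=150))
    print(check_pool("t1transfer", active=800, queued=350))
```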
Check the activity on the Tier-0 LSF WNs --- %COMPLETE5%
Check the Tier-0 Jobs --- %COMPLETE5%
- URL1: Queued and running Jobs on cmst0
INSTRUCTIONS:
- The cmst0 queue has roughly 2,800 job slots. If URL1 shows a sustained saturation of running jobs (green curve pinned at ~2.8k) or of pending jobs (blue curve), as in the example picture below, it might not be an issue, but it is worth notifying the DataOps team via an ELOG in the "T0" category.
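A minimal sketch of what "sustained saturation" could look like numerically, assuming the running-job counts over the last few hours are read off the URL1 plot by hand. The ~2,800-slot figure comes from the text above; the window length, sampling interval and 98% fraction are illustrative choices, not official definitions.
```python
# Decide whether the cmst0 running-job curve has been pinned at the slot limit
# long enough to be worth an ELOG to DataOps. Readings are noted down by hand.

SLOTS = 2800   # approximate number of cmst0 job slots quoted in the text

def sustained_saturation(running_samples, fraction=0.98, min_samples=6):
    """True if the last min_samples readings all sit at (or very near) the
    slot limit, i.e. the green curve has been at ~2.8k for a while."""
    recent = running_samples[-min_samples:]
    return len(recent) == min_samples and all(r >= fraction * SLOTS for r in recent)

if __name__ == "__main__":
    # Hypothetical half-hourly readings of running jobs.
    readings = [2100, 2650, 2790, 2800, 2795, 2800, 2800, 2798]
    if sustained_saturation(readings):
        print("cmst0 pinned at the slot limit -> notify DataOps via ELOG (T0 category)")
```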
--
SamirCury - 30-Sep-2011