2013 WMA Deployment

https://github.com/dmwm/WMCore/wiki/All-in-one-test

Requests for vocms13:

https://twiki.cern.ch/twiki/bin/viewauth/CMS/WMAgentRequestsVocms13

higgs

http://www.sciencedirect.com/science/article/pii/0550321395003797

https://twiki.cern.ch/twiki/pub/CMS/TriDASWikiHome/CMS-Detector-Paper-DAQ-jinst8_08_s08004.pdf

T0AST Tables


OBJECT_NAME
--------------------------------------------------------------------------------
WMBS_WORKFLOW_OUTPUT
WMBS_WORKFLOW
WMBS_USERS
WMBS_SUB_TYPES
WMBS_SUB_FILES_FAILED
WMBS_SUB_FILES_COMPLETE
WMBS_SUB_FILES_AVAILABLE
WMBS_SUB_FILES_ACQUIRED
WMBS_SUBSCRIPTION_VALIDATION
WMBS_SUBSCRIPTION
WMBS_LOCATION
WMBS_JOB_STATE
WMBS_JOB_MASK
WMBS_JOB_ASSOC
WMBS_JOBGROUP
WMBS_JOB
WMBS_FILE_RUNLUMI_MAP
WMBS_FILE_PARENT
WMBS_FILE_LOCATION
WMBS_FILE_DETAILS
WMBS_FILE_CHECKSUMS
WMBS_FILESET_FILES
WMBS_FILESET
WMBS_CHECKSUM_TYPE
TRIGGER_LABEL
T0_CONFIG
STREAM_SPECIAL_PRIMDS_ASSOC
STREAMER
STREAM
STORAGE_NODE
RUN_TRIG_PRIMDS_ASSOC
RUN_STREAM_STYLE_ASSOC
RUN_STREAM_FILESET_ASSOC
RUN_STREAM_CMSSW_ASSOC
RUN_STATUS
RUN_PRIMDS_STREAM_ASSOC
RUN_PRIMDS_SCENARIO_ASSOC
RUN
REPACK_CONFIG
RECO_CONFIG
PROMPTSKIM_CONFIG
PROCESSING_STYLE
PRIMDS_ERROR_PRIMDS_ASSOC
PRIMARY_DATASET
PHEDEX_CONFIG
LUMI_SECTION_SPLIT_ACTIVE
LUMI_SECTION_CLOSED
LUMI_SECTION
EXPRESS_CONFIG
EVENT_SCENARIO
DATA_TIER
CMSSW_VERSION

In WMAGENT


STREAMER
REPACK_CONFIG
EXPRESS_CONFIG
RECO_CONFIG
PHEDEX_CONFIG
PROMPTSKIM_CONFIG
WORKFLOW_MONITORING
WM_COMPONENTS
WM_WORKERS
DBSBUFFER_DATASET
DBSBUFFER_ALGO
DBSBUFFER_ALGO_DATASET_ASSOC
DBSBUFFER_FILE
DBSBUFFER_FILE_PARENT
DBSBUFFER_FILE_RUNLUMI_MAP
DBSBUFFER_LOCATION
DBSBUFFER_FILE_LOCATION
DBSBUFFER_BLOCK
DBSBUFFER_CHECKSUM_TYPE
DBSBUFFER_FILE_CHECKSUMS
DBSBUFFER_WORKFLOW
BL_STATUS
BL_RUNJOB
RC_THRESHOLD
WMBS_FILESET
WMBS_FILE_DETAILS
WMBS_FILESET_FILES
WMBS_FILE_PARENT
WMBS_FILE_RUNLUMI_MAP
WMBS_LOCATION_STATE
WMBS_LOCATION
WMBS_FILE_LOCATION
WMBS_USERS
WMBS_WORKFLOW
WMBS_SUB_TYPES
WMBS_WORKFLOW_OUTPUT
WMBS_SUBSCRIPTION
WMBS_SUB_FILES_ACQUIRED
WMBS_SUB_FILES_AVAILABLE
WMBS_SUBSCRIPTION_VALIDATION
WMBS_SUB_FILES_FAILED
WMBS_SUB_FILES_COMPLETE
WMBS_JOBGROUP
WMBS_JOB_STATE
WMBS_JOB
WMBS_JOB_ASSOC
WMBS_JOB_MASK
WMBS_CHECKSUM_TYPE
WMBS_FILE_CHECKSUMS
WMBS_LOCATION_SENAMES
T0_CONFIG
RUN_STATUS
PROCESSING_STYLE
EVENT_SCENARIO
DATA_TIER
CMSSW_VERSION
STREAM
TRIGGER_LABEL
PRIMARY_DATASET
STORAGE_NODE
RUN
RUN_TRIG_PRIMDS_ASSOC
RUN_PRIMDS_STREAM_ASSOC
RUN_PRIMDS_SCENARIO_ASSOC
RUN_STREAM_STYLE_ASSOC
RUN_STREAM_CMSSW_ASSOC
RUN_STREAM_FILESET_ASSOC
RECO_RELEASE_CONFIG
STREAM_SPECIAL_PRIMDS_ASSOC
PRIMDS_ERROR_PRIMDS_ASSOC
LUMI_SECTION
LUMI_SECTION_CLOSED
LUMI_SECTION_SPLIT_ACTIVE
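
For the record, listings like the two above can be regenerated from the schema's own catalog. Below is a minimal Python sketch assuming cx_Oracle is available; the account name, password and TNS alias are placeholders, not the real ones:

import cx_Oracle

# Placeholder credentials/TNS alias -- not the real T0AST or WMAgent account.
conn = cx_Oracle.connect("SOME_USER/SOME_PASSWORD@SOME_TNS")
cur = conn.cursor()
# USER_TABLES lists only the tables owned by the connected account.
cur.execute("SELECT table_name FROM user_tables ORDER BY table_name")
for (name,) in cur:
    print name
cur.close()
conn.close()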

CMSBUILD tricks

http://lat.web.cern.ch/lat/dmwmtut/environ/index.html

To get the dependencies that are not in the tag, without pulling in everything else from HEAD:

for i in wmcore-db-mysql wmcore-db-oracle wmcore-db-couch wmcore-webtools py2-cjson dbs-client dls-client py2-zmq py2-psutil pystack dbs3-client; do cvs checkout -r HEAD -d CMSDIST-1204h CMSDIST/$i.spec; done

WMCore tree update:

[lxplus305] /afs/cern.ch/user/s/samir/private/t0/WMCore > git svn fetch -r HEAD
[lxplus305] /afs/cern.ch/user/s/samir/private/t0/WMCore > git rebase remotes/trunk # Move base of stack to current SVN head

WMCore CHECKOUT

git svn clone svn+ssh://svn.cern.ch/reps/CMSDMWM/WMCore/trunk -r 16070 wmcore/

WMA T0 renew :

./config/tier0/manage clean-all
sqlplus CMS_T0AST_REPLAY2/**@INT2R < wipe_oracle.sql
./config/tier0/manage activate-tier0
./config/tier0/manage start-services
python t0astgrants.py
./config/tier0/manage start-tier0

New JobState relevant docs

https://svnweb.cern.ch/trac/CMSDMWM/ticket/3114

https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMComponent/RetryManager/PlugIns/ExponentialAlgo.py

https://svnweb.cern.ch/trac/CMSDMWM/browser/WMCore/trunk/src/python/WMCore/JobStateMachine

The plugin returns True if the job should be resubmitted (moved back to the created state), and False otherwise.
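
To illustrate the idea only (the class and method names below are invented for this sketch; the real interface is in the RetryManager plugin code linked above): the plugin looks at how many times the job has failed and how long it has been cooling off, and decides whether it may go back to created.

import time

class ExponentialRetrySketch(object):
    """Toy version of an exponential-backoff retry decision (not the WMCore class)."""

    def __init__(self, baseCooloff=600, maxRetries=3):
        self.baseCooloff = baseCooloff  # seconds to wait before the first retry
        self.maxRetries  = maxRetries   # after this many failures the job is archived

    def shouldRetry(self, retryCount, lastFailureTime):
        """Return True if the job should be resubmitted (moved back to created)."""
        if retryCount >= self.maxRetries:
            return False                # give up: the job goes to the failure archive
        cooloff = self.baseCooloff * (2 ** retryCount)
        return (time.time() - lastFailureTime) >= cooloff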

WMA Unit tests

source /data/srv/wmagent/current/apps/wmagent/etc/profile.d/init.sh
export PYTHONPATH=$PWD/test/python/:$PYTHONPATH
export PYTHONPATH=$PWD/src/python/:$PYTHONPATH
export PYTHONHOME=/data/srv/wmagent/current/sw/slc5_amd64_gcc461/external/python/2.6.4-comp2/
export DBSOCK=/data/srv/wmagent/v0.8.26pre6/install/mysql/logs/mysql.sock
export DATABASE=mysql://wma:mysql@localhost/wmagent
export PYTHONPATH=$PYTHONPATH:/usr/lib/python2.6/site-packages/
export COUCHURL=http://admin:couch@wma-dev.cern.ch:5984
export WMCORE_ROOT


    find your socket file:
    [vocms201] ~ $ ps aux | grep mysql
    cmst1     1250  0.0  0.0  64012  1296 ?        S    Mar06   0:00 /bin/sh /data/srv/wmagent/v0.8.25/sw/slc5_amd64_gcc461/external/mysql/5.1.58/bin/mysqld_safe --defaults-extra-file=/data/srv/wmagent/v0.8.25/config/mysql/my.cnf --datadir=/data/srv/wmagent/v0.8.25/install/mysql/database --log-bin --socket=/data/srv/wmagent/v0.8.25/install/mysql/logs/mysql.sock --skip-networking --log-error=/data/srv/wmagent/v0.8.25/install/mysql/logs/error.log --pid-file=/data/srv/wmagent/v0.8.25/install/mysql/logs/mysqld.pid
    

 scfoulkes 11:14 pm
    --socket=/data/srv/wmagent/v0.8.25/install/mysql/logs/mysql.sock 

 scfoulkes 11:14 pm
    DBSOCK = that file

 Samir Cury 11:14 pm
    DATABASE is not needed then

 scfoulkes 11:14 pm
    DATABASE = mysql://user:password@localhost/DBNAME

 Samir Cury 11:14 pm
    =)
    what about the SOCK?
    just stays there?

 scfoulkes 11:15 pm
    the DBSOCK variable
    needs to point at the socket file for the DB
    which you can get from the ps output
    looking for the mysql process

 Samir Cury 11:15 pm
    DBNAME would be wmagent 
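
Putting the chat above together, a quick sanity check before running the unit tests could look like this (a convenience sketch, not part of WMCore; it only verifies that the variables are set and that DBSOCK points at an existing socket file):

import os
import sys

required = ["DATABASE", "DBSOCK", "COUCHURL"]
missing  = [var for var in required if not os.environ.get(var)]
if missing:
    sys.exit("Missing environment variables: %s" % ", ".join(missing))

sock = os.environ["DBSOCK"]
if not os.path.exists(sock):
    sys.exit("DBSOCK points to %s, but that socket file does not exist" % sock)

print "Unit test environment looks sane."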

CMS T0 Monitoring through SLS

Goal/Intro

This page documents the ongoing work with SLS for people who started following it as CRCs and, soon, CSPs. It first explains in simple words what each alarm does, and then gives the details for those who need them.

Alarms

For now we have three:

  • CMS T0 Long Jobs
  • CMS T0 Express queue
  • CMS T0 Permanent failures

All of them are listed in what I call the "T0 Status Board" for quick monitoring; thanks to its small size it also works fine on mobile devices:

http://cmsprod.web.cern.ch/cmsprod/sls/

Description for CSP/CRC

The most important thing is that, for now, this is a "just FYI" tool: T0 Ops is still developing this monitoring/alarming framework, so the team is watching all fired alarms closely and trying to correct problems ASAP. For now CRCs and CSPs do not need to alarm anyone; everyone is aware and there are already several pairs of eyes on it.

CMS T0 Long Jobs

Status: it alarms reliably when there is a real problem; we can fully trust it.

It basically tells whether we have jobs running for longer than the defined threshold, currently 12 hours. Jobs are not supposed to run longer than this; it is not fatal, but it is suspicious. If this service stays degraded for more than ~10 hours it becomes a real problem, because late jobs make runs late, and a single stuck job is enough to get a run stuck (a minimal sketch of this threshold check is shown at the end of this subsection). You can see a full Status/Run list here:

http://cmsprod.web.cern.ch/cmsprod/runs.txt (updated hourly)

Most people get worried when runs sit in CloseOutPromptReco for more than 48h, but this is the intentional delay built into the system so that previous tasks can complete before PromptReco fires. If a run takes more than 48+16h in PromptReco, that is a problem, and it usually happens because of stuck jobs, so these two checks are normally related.

The T0 team already performs all of these checks; I am documenting them here in case CRCs/CSPs want a more global view of what we have.
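
A minimal sketch of the check behind this alarm (only the 12 h threshold is taken from above; where the job start times actually come from, LSF or the agent database, is left open):

import time

THRESHOLD_HOURS = 12  # jobs running longer than this are suspicious

def long_jobs(jobs, now=None):
    """jobs: iterable of (job_id, start_timestamp); return the ones over threshold."""
    now = now or time.time()
    limit = THRESHOLD_HOURS * 3600
    return [(jid, (now - start) / 3600.0) for jid, start in jobs if now - start > limit]

# A single stuck job is enough to fire the alarm.
stuck = long_jobs([(12345, time.time() - 14 * 3600)])
if stuck:
    print "CMS T0 Long Jobs ALARM: %s" % stuck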

CMS T0 Express queue

Status: proved useful once during big runs, when we had a backlog there. It has only been tested once; it looked trustworthy, but it still does not replace monitoring by hand.

Context: PromptReco jobs usually occupy 1500-2000 slots and take ~8h; Repack/Express jobs (which should be real time) take 1-2h at most.

Big runs generate a lot of jobs, and if PromptReco from runs taken 48h earlier is running at the same time, it may prevent the real-time jobs from running, creating a huge backlog in the Express queue because many long PromptReco jobs occupy the slots. Express/Repack backlogs are consumed faster than PromptReco, so it is good practice to reduce the PromptReco thresholds so that Express/Repack get more slots and drain the backlog sooner. Keep in mind that Express/Repack must finish within 12h after the run is closed, because AlcaHarvest needs that.

This alarm fires when we have more than 2500 Express jobs in the PA queue, i.e. jobs not yet submitted into the batch system, so the CSP will not see them here (https://lsf-rrd.cern.ch/lrf-lsf/info.php?queue=cmst0).

When it fires, action should be taken by Samir during CERN hours or by David Mason during FNAL hours; on weekends, by whoever catches the alarm first. The action is to adjust thresholds so that Express has enough room to finish soon enough.

CMS T0 Permanent failures

Status: this is the most unstable alarm; it has been caught not alarming when the problem was actually there. Some patches have been applied, but I still do not trust it and you should not either: it does not replace monitoring by hand.

Basically, when a given job fails more times than the defined retry count, the system gives up and puts it in the failure archive. This usually requires a fast response, because the job has probably already spent several hours in retries, and that time plus the response time is how long the run will be delayed.

Usual suspects for this issue (for ops mostly)

  • 500 Error: a webserver problem; just resubmit the jobs and they should work. This happens most often with DQMHarvest jobs, but may happen with others.

  • Memory thresholds too low: jobs are killed by LSF because they try to use more memory than they were allowed. Investigate and, if this is the case, raise the memory reservation in the PA before resubmitting them. Do your best to bring the memory reservation back down after they run, except when we are having a considerable number of permanent failures and it starts to generate more operational work than it should; in that case elog it and discuss it with the team.

Future alarms

This can help in implementing RRDs for the metrics below:

https://twiki.cern.ch/twiki/bin/view/FIOgroup/SLSManualForSM#Accounting_information_in_SLS

Transient Failures

These happen every day. They may reveal problems with jobs being retried a few or several times before succeeding. The lower this number, the more efficient we are. We do not have this metric today; running it once per day looks sufficient. The CondorTracker is the key source for it.

If this number increases too much, we are on the edge of having permanent failures and real problems.
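
A sketch of how that daily metric could be computed, assuming we can obtain a per-job count of retries before success (e.g. from the CondorTracker; the input format here is invented):

def retry_metric(jobs):
    """jobs: list of (job_id, retries_before_success); return (total, average)."""
    if not jobs:
        return 0, 0.0
    total = sum(retries for _, retries in jobs)
    return total, float(total) / len(jobs)

# Example: one job succeeded first try, one after 2 retries, one after 5.
total, avg = retry_metric([(101, 0), (102, 2), (103, 5)])
print "Transient failures today: %d retries, %.1f per successful job" % (total, avg)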

PA queue

It is supposed to watch the PA queue (jobs before they are submitted), log the number of jobs over time into an RRD through SLS, and alarm if it exceeds a threshold.

Delayed runs

Access http://cmsprod.web.cern.ch/cmsprod/runs.txt (or run the underlying script directly) and alarm when a given run exceeds a given threshold in a given state.
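
A minimal sketch of such a check, assuming each line of runs.txt looks like "run status hours_in_status" (an assumed format; the thresholds below are also illustrative, apart from the 48+16 h PromptReco limit mentioned earlier):

import urllib

RUNS_URL = "http://cmsprod.web.cern.ch/cmsprod/runs.txt"
# Illustrative per-state thresholds, in hours.
THRESHOLDS = {"CloseOutPromptReco": 48 + 16, "Active": 12}

def delayed_runs(lines):
    """Yield (run, status, hours) for runs over the threshold of their state."""
    for line in lines:
        parts = line.split()
        if len(parts) < 3:
            continue  # skip lines that do not match the assumed format
        try:
            run, status, hours = parts[0], parts[1], float(parts[2])
        except ValueError:
            continue  # header or non-numeric line
        if hours > THRESHOLDS.get(status, float("inf")):
            yield run, status, hours

for run, status, hours in delayed_runs(urllib.urlopen(RUNS_URL)):
    print "Run %s has been in %s for %.1f h" % (run, status, hours)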

New CSP Instructions Draft


Special check: Castor pool T0Export outgoing rate during HI data taking --- %COMPLETE0%


Tier-0 workflows monitoring

Introduction / Machine status

Read this section if you have not before; once you understand what is important and what is not, your shift will be more efficient and you will know which kinds of problems to look for.

  • A general T0 description is provided here:
    • The T0 is one of the most important CMS systems: it is responsible for creating the first RAW datasets, and the RECO from the real data that comes directly from CMS. It runs many kinds of jobs over the collision data; among them, the most important job types are EXPRESS and REPACK. These should run in real time: if they get delayed by one hour, the whole workflow may be compromised. Keep this in mind during your shift; the main problems to look for are stuck transfers from P5 and runs stuck or failing in Express or Repack.
    • As you have probably read in the Computing Plan of the Day, you already know whether we are in a data-taking period or not. When we are, any error in the T0 should be reported; we should not have delayed runs.
  • A drawing of the main Tier-0 components can be found below (clickable boxes on the picture may be used).


To help you know what the status is right now, you can look at two places:

  • DAQ Status URL: http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg (shift reload this periodically)
    • INSTRUCTIONS: This tells you whether a run is ongoing, the data-taking mode, and whether data is being transferred from the CMS detector to the T0. If the DAQ is running, it is specified on the top bar under "DAQ state". Next to it you can find the ongoing Run Number; if that run is sent to the T0, it is indicated by the green field on the bottom right of the view, "TIER0_TRANSFER_ON". The first line under Data Flow on the top right specifies the data-taking mode (or "HLT key", a string containing a tag such as "interfill", "physics", ...); "physics" is obviously the most relevant data. The bottom-left histogram shows the run history of the last 24h; the run numbers on the graph should be reflected in the T0Mon page, see the Monitoring links below.
  • P5 Elog URL (for information only, see instructions)
    • INSTRUCTIONS: This elog is not needed in "normal" situations, but it may be useful in case of very special events at the CMS detector, or simply to find out who the online shifter is (the shift role corresponding to yours, but for everything related to online data taking). You will need to log in with your AFS credentials.

Tier-0 Service

Check the most urgent alarms of the Tier-0. Basically this monitors the main causes of runs getting stuck: failed jobs, stuck jobs and, in case of overload, Express backlog.


Check the status of Tier 0 components and workflows through T0Mon --- %COMPLETE5%

  • T0Mon URL: https://cmsweb.cern.ch/T0Mon/
    • INSTRUCTIONS:
      • Check that T0 components are up: Look at the "Time from component's last heartbeat" section near the top of the page. If one or more of the components show up in red, open an elog in the T0 category.
      • Check for new runs appearing: New runs should appear on T0Mon in the "Runs" table within a few minutes of appearing on the Storage Manager page above (excluding TransferTest or local runs with labels such as privmuon). If a new run doesn't appear on T0Mon, or shows up with an HLTKey of "UNKNOWN", open an elog.
      • Monitor progress of processing: In the runs table, as processing proceeds, the status for a run will move through "Active", "CloseOutRepack", "CloseOutRepackMerge", "CloseOutPromptReco", "CloseOutRecoMerge", "CloseOutAlcaSkim", "CloseOutAlcaMerge". When processing is complete for a run, and only transfers are remaining, the status will be "CloseOutExport". Fully processed and transferred runs will show up as either "CloseOutT1Skimming" or "Complete". If older runs are stuck in an incomplete status, or show no progress in the status bars over a long period, make an elog post.
      • Check for new job failures: In the job stats area beneath the Runs table, look at the number of failed jobs in each of the 5 job categories. If you see any of these counts increasing, make an Elog post immediately.
      • Check for slow-running jobs: In the running and acquired jobs tables beneath the job stats, check for jobs which are older than about 12 hours, and open an elog if you see any.



Check the CMS Online status:

  • Incoming Runs --- %COMPLETE5%

You should keep an eye on these while we are in active data taking, and periodically cross-check the T0Mon monitoring below against what they show to be going on at P5.

    • Storage Manager URL: http://cmsonline.cern.ch/portal/page/portal/CMS%20online%20system/Storage%20Manager
      • INSTRUCTIONS: Check on the DAQ page (http://cmsonline.cern.ch/daqStatusSCX/aDAQmon/DAQstatusGre.jpg) that, in the right column (Data Flow), data taking is declared and all the following fields are green: LHC RAMPING OFF, PreShower HV ON, Tracker HV ON, Pixel HV ON, Physics declared. Also check that, at the bottom of the right column, TIER0_TRANSFER_ON is green. Then check the current status of data taking on the Storage Manager URL given above. You should first see the current run in the upper box called Last streams. Check that there are files per stream to be transferred in the Files column. If there are more than 10 files, check the Transf column. If there are zero files listed for a stream that has files to be transferred, open an elog in the T0 section. If there are more than 1000 files not transferred, call the CRC.
        Next, check the LAST SM INSTANCES boxes for lines with numbers marked in RED and open an elog in the T0 section.
        Finally, look at the Last Runs box and also elog any lines with numbers marked in RED.
        As usual, please be as verbose as possible in the elogs.




Check the Castor pools/subclusters used by CMS: focus on t0export and t1transfer --- %COMPLETE5%



Check the activity on the Tier-0 LSF WNs --- %COMPLETE5%



Check the Tier-0 Jobs --- %COMPLETE5%

  • URL1: Queued and running Jobs on cmst0
  • INSTRUCTIONS: The cmst0 queue uses ~2,800 job slots. If you see on URL1 a sustained saturation of running (green, at 2.8k) or pending (blue) jobs, it might not be an issue, but it is worth notifying the DataOps team via the ELOG in the "T0" category.




PD PeakValueRss PeakValueVsize AvgEventTime MaxEventTime
Commissioning NO_REPORT NO_REPORT NO_REPORT NO_REPORT
Cosmics NO_REPORT NO_REPORT NO_REPORT NO_REPORT
DoubleElectron 1770.06 2091.48 13.04 35.88
DoubleMu 2117.85 2432.80 7.74 121.46
ElectronHad 2163.88 2479.44 9.71 98.19
HcalNZS NO_REPORT NO_REPORT NO_REPORT NO_REPORT
HT 2223.68 2535.55 12.15 377.57
Jet 1959.37 2277.05 8.91 49.42
MET NO_REPORT NO_REPORT NO_REPORT NO_REPORT
BTag NO_REPORT NO_REPORT NO_REPORT NO_REPORT
MinimumBias 2141.32 2476.99 6.59 126.17
MuEG 2090.04 2392.83 7.35 76.98
MuHad 2091.00 2409.90 9.37 132.51
MultiJet 2195.08 2509.85 11.87 221.83
MuOnia 2067.37 2383.45 7.48 55.45
Photon 2053.85 2361.51 8.00 65.66
PhotonHad 2151.27 2469.36 11.78 65.66
SingleElectron 2152.21 2481.82 7.48 140.06
SingleMu 2126.10 2453.34 9.84 102.50
Tau 2135.85 2449.58 9.37 165.19
TauPlusX 2135.85 2449.58 9.37 165.19

PD PeakValueRss PeakValueVsize AvgEventTime MaxEventTime TotalJobTime[h]
Commissioning 2543.50 2853.54 22.75 239.63 5.52
Cosmics NO_REPORT NO_REPORT NO_REPORT NO_REPORT NO_REPORT
DoubleElectron 2995.54 3326.40 20.74 224.64 5.53
DoubleMu 2743.33 3074.85 21.94 205.35 5.88
ElectronHad 2985.61 3295.51 27.65 323.26 7.42
HcalNZS NO_REPORT NO_REPORT NO_REPORT NO_REPORT NO_REPORT
HT 3091.69 3412.01 35.33 379.27 9.13
Jet 2912.35 3197.07 25.34 181.88 6.59
MET 2982.85 3308.28 29.41 217.82 7.75
BTag 2911.18 3248.52 28.11 164.60 7.34
MinimumBias 2999.07 3319.20 19.42 211.75 5.38
MuEG 2869.68 3179.32 22.38 176.43 5.86
MuHad 2938.05 3258.33 30.66 224.54 8.02
MultiJet 3184.42 3505.14 39.37 296.53 10.46
MuOnia 2680.07 3002.35 20.07 198.00 5.12
Photon 2769.31 3081.49 21.35 298.69 5.60
PhotonHad 2815.15 3116.57 24.39 198.65 6.27
SingleElectron 2842.88 3151.97 20.99 175.92 5.60
SingleMu 2814.28 3133.10 22.65 200.37 5.85
Tau 2972.46 3277.78 23.94 205.42 6.18
TauPlusX 3009.18 3311.59 24.39 205.42 6.18

Self reports

Dev - week 15/04/2012

Fixed/helped with:

  • #3624
  • #3613
  • #3625

-- SamirCury - 16-Jun-2011
