Please refer to the up-to-date Computing Operations Production and Reprocessing documentation.

Team Leader Responsibilities

This twiki contains a collection of tricks, some helpful, some dirty, some nasty, that you may need in your daily duties.

Your Mission

To put it in a few words, your mission is to keep the system under control. Since the word "system" is rather broad and fuzzy, you'll focus your attention on six areas:

  1. WMAgent machines.
  2. Sites.
  3. Workflows.
  4. Operators (shifters).
  5. Scripts.
  6. Communication with other teams.
Each of these has its own tweaks and quirks. An easy way to start is to check the list in CompOpsWorkflowOperatorResponsibilities, where you'll find the basic tasks and how to do them. Your tasks include those and some more.

A Check-List

This is a list of things you should check during your day. They may include a few tasks from CompOpsWorkflowOperatorResponsibilities, so if you have dutiful operators it's better to delegate some of them.

  1. WMAgent machines:
  2. Sites:
  3. Workflows:
  4. Operators
    • Chat once a week with your operators.
    • Is the schedule full?
  5. Scripts
    • Check if these scripts need any updates: WmAgentScripts
      • New types of workflows
      • New validations on closing out workflows.
      • Fix bugs or improve performance
      • Documentation and clarity
  6. Communications with other teams:

Scheduling operators

  • Gather availability information from team members and make sure we have enough people scheduled to get the work done.
  • Coordinate the work of team members.

Answering operator questions

Automatically Closing Out / Announcing Workflows that Are Ready

  • Currently, assignment, closing-out and announcing of workflows is done automatically by Unified, which is under the supervision of your L3's.
  • Check that workflows don't stay forever in the "assistance" page.
  • If a workflow is OK to be announced, you can tell Unified to skip any verification on it by:
    • Creating a text file in your AFS folder at: public/ops/bypass.json
    • Writing the workflow names as a JSON list:
      [ "WORKFLOW1", "WORKFLOW2", .... "WORKFLOWn"] 
    • Telling your L3's to include that file in the bypass list.
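For example, a minimal way to create and sanity-check that file from lxplus (the path is the one given above; the AFS home and workflow names are placeholders):

  # create the bypass file in your AFS public area (adjust to your own AFS home)
  echo '[ "WORKFLOW1", "WORKFLOW2" ]' > ~/public/ops/bypass.json
  # quick check that the file parses as JSON
  python -m json.tool ~/public/ops/bypass.json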

Resource Planning

* Note: this will become more important as we start balancing T1/T2 work over the same sites and have to balance loads, take downtimes into consideration, etc.

Resource planning is one of the most important daily tasks of an operator. It is done on a site-by-site basis. For each site the following is done:

  1. Through ReqMgr, gather what is running at the chosen site. It is better to cross-check this with the batch pages at each site; the site pages can be found here: Site Pages, and for sites which are not there, the Dashboard.
  2. Then gather what is in the assignment-approved state.
  3. If the slots at a site are completely used, or if there is work which will soon arrive at the site, nothing needs to be done. Otherwise it is necessary to inject backfills for the under-used site.
Once this is done for all sites an ELOG should be created with the subject: Resource Planning <date>.

This ELOG has two parts:

  • the first one lists, for each site, which campaigns are running on it

  • the second one is done per site, and lists the changes that are going to be made on the system, e.g. a change of thresholds or the injection of new workflows.
A typical ELOG can look like this:
Overview
========
- CNAF:(0 slots)
   * Fall11 - (waiting to be approved)
   * Summer11 - Tails
- FNAL: (~6000  slots in use right now)
   * Fall 11 
   * Relval (<100)
- IN2P3: (1300 slots in use)
  * Fall 11 
  * Backfill
- KIT: (~2500 slots in use)
  * Fall 11 
- PIC: (~100)
   * MC Production
    * Fall 11 (waiting to be approved)
- RAL: (~1700)
  * Fall 11 
 
Planned
=======

- CNAF:
  * I will increase the thresholds to 11000 on vocms201, since Fall 11 hasn't started
- FNAL:
  * I will keep the thresholds at 0 on vocms201, since Fall 11 is taking all of the slots.
- IN2P3:
  * I will keep the thresholds at 0 on vocms201, since Fall 11 is taking all of the slots.
- KIT:
   * I will keep the thresholds at 0 on vocms201, since Fall 11 is taking all of the slots.
- PIC:
  * I will increase the thresholds to 700 on vocms201, since Fall 11 hasn't started
- RAL
   * I will keep the thresholds at 0 on vocms201, since Fall 11 is taking all of the slots.

Debugging

Creating ACDC's / Cloning / Resubmitting / Recovering Workflows

  • Check the list of workflows that are completed in the following link: https://cmst2.web.cern.ch/cmst2/unified/assistance.html
  • Pay special attention to those that are in the assistance-manual section. Those are workflows that finished but are not yet good to go. This happens due to:
    • Jobs failed due to site problems
    • Misplaced input
    • Missing files
    • Duplicated events
    • Job information lost due to WMAgent crashes
    • Communication failures with other components (PhEDEx, ReqMgr, DBS, etc.)
    • Errors in the request itself.
    • Just about anything else.
There are different mechanisms to retry a workflow partially or totally:
  • Creating ACDC's: a partial retry (only the failed jobs are retried). Use this when:
    • Workflows are massive (more than 2K jobs)
    • Failures are recoverable (i.e. they are site failures)
    • Keep in mind that TaskChain ACDC's need special handling, see Creating
    • For event-based splitting workflows, only timeout and memory can be changed; do not change the splitting or you will end up with duplicates and have to start over anyway.
    • TaskChain ACDC's need to be assigned with assignProdTaskChain.py (see the sketch after this list):
      • You can also provide a list of sites separated by commas (no spaces): T1_US_FNAL,T2_US_UCSD,...
      • For TaskChains it is very easy (because we always use the same list):
      • You can use -s all: it will assign to all available sites (works for any TaskChain ACDC).
      • You can skip the -s option: it will assign to the "good site" list (works for any clone you need).
  • Cloning / Resubmitting Workflows: a total retry of a workflow (from the beginning). Use this when:
    • Workflows are small (fewer than 2K jobs)
    • Failures cannot be recovered (i.e. files are lost)
    • Most of the workflow failed but it can be recovered by rerunning from scratch
  • Extending: creating a workflow that appends more data to the result of a previous workflow. Use this:
    • Only when the workflow is a MonteCarlo (from scratch), i.e. it has no input dataset.
  • Recovering: this also runs a "smart" partial retry of the workflow, taking into account which lumis were not produced. Use this when:
    • You lost job information (merge jobs)
    • Merge jobs failed because unmerged files are broken or unavailable.
    • The workflow is too large to clone from the beginning.
    • The workflow has an input and an output dataset (MonteCarloFromGEN, ReDigi, ReReco)
  • Clearing up duplicates
    • A workflow will have duplicate events if an ACDC or recovery has run twice over the same files, OR if the parent dataset has duplicates.
    • To clear the duplicates we need to find out which files have duplicate data in them and then file a GGUS ticket to the transfer team asking to have the files invalidated.
      • Finding the list of files with duplicates takes 2 steps:
      • This will output a minimal list of files that need to be invalidated so that there are no duplicates in the dataset.
      • Open a GGUS ticket to the transfer team to invalidate the files. Once the files are invalidated the workflow should clear on its own.
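As a rough sketch of the TaskChain ACDC assignment mentioned above: the -s behaviour is the one described in the list, but the exact form of the other arguments is an assumption here, so check the script's help in WmAgentScripts before relying on it.

  # hypothetical invocation; argument names other than -s may differ, check the script's help
  python assignProdTaskChain.py <ACDC_WORKFLOW_NAME> -s all
  # or restrict it to an explicit comma-separated site list (no spaces)
  python assignProdTaskChain.py <ACDC_WORKFLOW_NAME> -s T1_US_FNAL,T2_US_UCSD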

The Usual Errors in Assistance-Manual

Exit Code: 50660

When you find a workflow getting

  • Error in CMSSW step cmsRun1 Job has exceeded maxRSS: [SOME VALUE HERE] Job has RSS: [SOME VALUE HERE]
you should resubmit the workflow with increased memory. An ACDC with increased memory would also work.
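If you want to confirm this failure from a retrieved logArchive (see the log-retrieval section further down), a simple grep over the cmsRun stdout log is usually enough; the path assumes the usual cmsRun1 step name used elsewhere on this page:

  # look for the maxRSS error message in the unpacked job logs
  grep -i "exceeded maxRSS" cmsRun1/cmsRun1-stdout.log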

Clearing out Problem WF's

When a workflow has persistent errors that cannot be recovered and are not caused by our infrastructure, we need to notify the requestors about it.

  • Write an elog about it.
  • Write to the corresponding HyperNews thread.
  • Tell Unified that this workflow needs to be on hold, by:
    • Creating a text file in your AFS folder at: public/ops/onhold.json
    • Adding the workflows as a JSON list (same pattern as the bypass.json example above):
      ["WORKFLOW1", "WORKFLOW2", ... ] 
    • Telling your L3's to add that file to Unified.

Checking for Failed WF's

  • Once a week, check the list of failed workflows to determine why they failed and whether they need to be re-submitted or can be moved to aborted/announced.
  • If a WF is complaining of missing blocks, you can check whether the missing blocks are now available as follows:
      • dbsf "block,site where block = /QCD_pt50to150_bEnriched_MuEnrichedPt14_TuneZ2star_7TeV-pythia6/Summer11-START311_V2-v1/GEN-SIM*" | sort -k 2
      • any blocks with no site are bad
      • dbsf "block,block.numevents where block = /QCD_pt50to150_bEnriched_MuEnrichedPt14_TuneZ2star_7TeV-pythia6/Summer11-START311_V2-v1/GEN-SIM*" | sort -n -k 2
      • any blocks with 0 events are bad
      • In this case it looks like the block is OK now, so no clone is needed and I have reassigned the workflow (you could also do this in the future; you only need to re-select the T1 site).
      • /QCD_pt50to150_bEnriched_MuEnrichedPt14_TuneZ2star_7TeV-pythia6/Summer11-START311_V2-v1/GEN-SIM#89dc61f8-fcfa-11e1-90b5-0024e83ef644 storm-fe-cms.cr.cnaf.infn.i
  • Workflows not passing PhEDEx subscription test:
    • Custodial subscriptions of output datasets are made automatically by the WMAgent when only one Tier-1 site is selected in the white list.
    • The custodial subscription is made by the agent after the first merge job runs successfully.
    • Custodial subscriptions are grouped and requested per site every 12 hours (in order not to overburden the site admins).
    • The DQM step (if there is one) runs when all the processing jobs have finished, and the mergeDQM afterwards, so its custodial request may be made in a later poll, after the other tiers have been requested.
    • T1 sites are expected to approve the subscriptions within 24 hours of them being made, but this can take somewhat longer over holidays or weekends.
    • When suspecting that a request is taking too long to be accepted, pick a workflow and check its output dataset:
For example this wf:

etorassa_EXO-Fall11_R1-01206_T1_FR_CCIN2P3_MSS_v1_120130_111455

with this output dataset

/UnpartmonoJet_S-1_dU-1p8_LU-3_7TeV_Tune4C-pythia8/Fall11-PU_S6_START42_V14B-v1/GEN-RAW

First of all check if a custodial subscription has been done:

https://cmsweb.cern.ch/phedex/datasvc/xml/prod/requestlist?dataset=/UnpartmonoJet_S-1_dU-1p8_LU-3_7TeV_Tune4C-pythia8/Fall11-PU_S6_START42_V14B-v1/GEN-RAW

If it has, then check the time at which it was made:

time_create="1328155152.58407">

This is Unix time; converted to normal time it is: Thu, 02 Feb 2012 03:59:12 GMT
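A quick way to do this conversion on any Linux machine (GNU date):

  date -u -d @1328155152
  # prints: Thu Feb  2 03:59:12 UTC 2012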

If it has been more than 24 hours, contact the site via a Savannah ticket and name the custodial request (PhEDEx link) which hasn't been approved.

  • Check what the percentage of jobs done is; discuss with Jacob and Andrew if you have questions about what can/should be closed.

Force-Completing Workflows

If we hear from requestors that a workflow "has enough statistics" or "enough events", or that they "need it urgently", you may want to force-complete the workflow to speed up its announcement.

Force-Completing on ReqMgr:

  • Simply move the workflow manually from running-closed to force-complete
  • If you need to do this for several workflows, use forceCompleteWorkflows.py instead.

Force-Completing inside the WMAgent:

  • Find out which agent the WF is running on
  • Log in to the machine and initialize the WMAgent environment.
  • Open the WorkQueueManager console:
    $manage execute-agent wmagent-workqueue -i > 
  • Mark work as done in WorkQueueManager:
    > workqueue.doneWork(WorkflowName = 'WORKFLOW_NAME') 
  • this should move the WF to "complete" in a few hours

Note:

  • If a workflow stays too long in force-complete, it can be manually moved to complete; however, if you do this, double-check that there are no Production or Merge jobs running or pending.

Finding Missing Run/Lumi information

  • If there are only a few missing events, the fastest way of doing this is to look under Status - Complete in WMStats.
  • An easy way to do the procedure listed below:
    • On the agent machine:
[vocms216] /afs/cern.ch/user/j/jen_a/WmAgentScripts > python checkRequest.py franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576
Starting... 21:52:27
Getting information from cmsweb about franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576... 21:52:27
Done querying cmsweb... 21:52:29
Loading DBS full information for /Jet/Run2011A-v1/RAW... 21:52:29
Loading DBS full information for /Jet/Run2011A-12Oct2013-v1/DQM... 21:53:10
Loading DBS full information for /Jet/Run2011A-12Oct2013-v1/AOD... 21:53:34
Loading DBS full information for /Jet/Run2011A-LogError-12Oct2013-v1/RAW-RECO... 21:54:04
Loading DBS full information for /Jet/Run2011A-HighMET-12Oct2013-v1/RAW-RECO... 21:54:28
Analyzing differences in /Jet/Run2011A-12Oct2013-v1/AOD... 21:54:50
Analyzing differences in /Jet/Run2011A-12Oct2013-v1/DQM... 21:54:50
Analyzing differences in /Jet/Run2011A-HighMET-12Oct2013-v1/RAW-RECO... 21:54:50
Analyzing differences in /Jet/Run2011A-LogError-12Oct2013-v1/RAW-RECO... 21:54:50
[vocms216] /afs/cern.ch/user/j/jen_a/WmAgentScripts >

Step 1
Log in to cmslxplus.cern.ch and set up a CMSSW environment:
 cd CMSSW_5_3_7_patch4/
 echo $SCRAM_ARCH
 cmsenv
If you do not have a CMSSW area, this should get one for you:
 scram project CMSSW CMSSW_5_3_7_patch4

Make a list of all input and output datasets and paste them into a text file (in this case ds.txt):
You can get the input datasets from the workflow page:
https://cmsweb.cern.ch/reqmgr/view/details/vlimant_Winter532012CDoubleMuParked_CNAFPrio1_537p5_130122_195017_1947
You can get the output datasets from dbsTest.py

e.g.

> cat ds.txt
/DoubleMuParked/Run2012C-v1/RAW
/DoubleMuParked/Run2012C-TkAlZMuMu-22Jan2013-v1/ALCARECO
/DoubleMuParked/Run2012C-DtCalib-22Jan2013-v1/ALCARECO
/DoubleMuParked/Run2012C-MuAlCalIsolatedMu-22Jan2013-v1/ALCARECO
/DoubleMuParked/Run2012C-22Jan2013-v1/RECO
/DoubleMuParked/Run2012C-22Jan2013-v1/DQM
/DoubleMuParked/Run2012C-22Jan2013-v1/AOD
/DoubleMuParked/Run2012C-MuAlOverlaps-22Jan2013-v1/ALCARECO
/DoubleMuParked/Run2012C-EXOHSCP-22Jan2013-v1/USER
/DoubleMuParked/Run2012C-HZZ-22Jan2013-v1/AOD
/DoubleMuParked/Run2012C-LogErrorMonitor-22Jan2013-v1/USER
/DoubleMuParked/Run2012C-HighLumi-22Jan2013-v1/RAW-RECO
/DoubleMuParked/Run2012C-LogError-22Jan2013-v1/RAW-RECO
/DoubleMuParked/Run2012C-Zmmg-22Jan2013-v1/RAW-RECO

Step 2
Make a list of run,lumi for each of these datasets:

cat ds.txt |  awk '{split($1,array,"/"); print "dbs search --production --noheader --query=\"find run,lumi where dataset = "$1"\"  | sort &> "array[2]"_"array[3]"_"array[4]"_runlumi.log &"}' > makerunlumilists.sh
source makerunlumilists.sh 

Step 3
Find the run whitelist from the WF page and paste it to a blank text file (in this case runWL2012C.txt):

https://cmsweb.cern.ch/reqmgr/view/details/vlimant_Winter532012CDoubleMuParked_CNAFPrio1_537p5_130122_195017_1947

[198022, 198023, 198041, 198044, 198045, 198048, 198049, 198050, 198063, 198116, 198202, 198207, 198208, 198210, 198212, 198213, 198214, 198215, 198229, 198230, 198249, 198268, 198269, 198270, 198271, 198272, 198346, 198372, 198485, 198486, 
198487, 198522, 198523, 198588, 198589, 198603, 198609, 198898, 198899, 198900, 198901, 198902, 198903, 198941, 198954, 198955, 198969, 199008, 199011, 199021, 199276, 199282, 199306, 199318, 199319, 199336, 199356, 199409, 199428, 199429, 
199435, 199436, 199563, 199564, 199565, 199566, 199568, 199569, 199570, 199571, 199572, 199573, 199574, 199608, 199698, 199699, 199703, 199739, 199745, 199751, 199752, 199753, 199754, 199804, 199812, 199832, 199833, 199834, 199862, 199864, 
199867, 199868, 199875, 199876, 199877, 199960, 199961, 199967, 199973, 200041, 200042, 200049, 200075, 200091, 200152, 200160, 200161, 200174, 200177, 200178, 200180, 200186, 200188, 200190, 200228, 200229, 200243, 200244, 200245, 200368, 
200369, 200381, 200466, 200473, 200491, 200515, 200519, 200525, 200532, 200599, 200600, 200601, 200961, 200976, 200990, 200991, 200992, 201062, 201097, 201114, 201115, 201159, 201164, 201168, 201173, 201174, 201191, 201193, 201195, 201196, 
201197, 201199, 201200, 201202, 201228, 201229, 201278, 201535, 201554, 201602, 201611, 201613, 201624, 201625, 201657, 201658, 201668, 201669, 201671, 201678, 201679, 201692, 201705, 201706, 201707, 201708, 201718, 201727, 201729, 201794, 
201802, 201812, 201813, 201815, 201816, 201817, 201818, 201819, 201824, 202000, 202012, 202013, 202014, 202016, 202044, 202045, 202047, 202054, 202060, 202074, 202075, 202084, 202086, 202087, 202088, 202093, 202116, 202178, 202205, 202207, 
202208, 202209, 202237, 202272, 202299, 202305, 202314, 202328, 202333, 202389, 202469, 202472, 202477, 202478, 202504, 202792, 202793, 202794, 202970, 202972, 202973, 203002, 203708, 203709, 203739, 203742]

Then convert the run whitelist into an egrep command
cat runWL2012C.txt  |  sed 's/, /|/g' | sed 's/\[/(/g' | sed 's/\]/)/g' |awk '{print "egrep \""$1"\""}'

Step 4
Apply the egrep command to the RAW run,lumi list

cat DoubleMuParked_Run2012C-v1_RAW_runlumi.log | egrep "
(198022|198023|198041|198044|198045|198048|198049|198050|198063|198116|198202|198207|198208|198210|198212|198213|198214|198215|198229|198230|198249|198268|198269|198270|198271|198272|198346|198372|198485|198486|198487|19852
2|198523|198588|198589|198603|198609|198898|198899|198900|198901|198902|198903|198941|198954|198955|198969|199008|199011|199021|199276|199282|199306|199318|199319|199336|199356|199409|199428|199429|199435|199436|199563|1995
64|199565|199566|199568|199569|199570|199571|199572|199573|199574|199608|199698|199699|199703|199739|199745|199751|199752|199753|199754|199804|199812|199832|199833|199834|199862|199864|199867|199868|199875|199876|199877|199
960|199961|199967|199973|200041|200042|200049|200075|200091|200152|200160|200161|200174|200177|200178|200180|200186|200188|200190|200228|200229|200243|200244|200245|200368|200369|200381|200466|200473|200491|200515|200519|20
0525|200532|200599|200600|200601|200961|200976|200990|200991|200992|201062|201097|201114|201115|201159|201164|201168|201173|201174|201191|201193|201195|201196|201197|201199|201200|201202|201228|201229|201278|201535|201554|2
01602|201611|201613|201624|201625|201657|201658|201668|201669|201671|201678|201679|201692|201705|201706|201707|201708|201718|201727|201729|201794|201802|201812|201813|201815|201816|201817|201818|201819|201824|202000|202012|
202013|202014|202016|202044|202045|202047|202054|202060|202074|202075|202084|202086|202087|202088|202093|202116|202178|202205|202207|202208|202209|202237|202272|202299|202305|202314|202328|202333|202389|202469|202472|20247
7|202478|202504|202792|202793|202794|202970|202972|202973|203002|203708|203709|203739|203742)" > DoubleMuParked_Run2012C-v1_RAW_runlumi_runwl.log 

Step 5
Compare the output dataset run,lumi lists to the whitelisted input list (this must be done for each output dataset):

e.g. to count missing lumis:

diff  DoubleMuParked_Run2012C-v1_RAW_runlumi_runwl.log DoubleMuParked_Run2012C-22Jan2013-v1_AOD_runlumi.log  | grep "< " | wc -l
339

to print missing lumis in json format for twiki:

 diff  DoubleMuParked_Run2012C-v1_RAW_runlumi_runwl.log DoubleMuParked_Run2012C-22Jan2013-v1_AOD_runlumi.log  | grep "< "  | awk '{print "\""$2"\": ![["$3","$3"]]," }' | tr '\n' ' ' | awk '{print "{"$line"}"}'  | sed 's/, }/}/g'
{"198210": ![[193,193]], "198210": ![[194,194]], "198210": ![[195,195]], "198210": ![[196,196]], "198210": ![[197,197]], "198230": ![[734,734]], "198230": ![[735,735]], "198230": ![[736,736]], "198230": ![[737,737]], "198230": ![[756,756]], "198230": ![[757,757]], "198230": !
[[763,763]], "198230": ![[764,764]], "198269": ![[161,161]], "198269": ![[162,162]], "198271": ![[146,146]], "198271": ![[147,147]], "198271": ![[410,410]], "198271": ![[411,411]], "198271": ![[412,412]], "198271": ![[44,44]], "198271": ![[45,45]], "198522": ![[62,62]], 
"198522": ![[66,66]], "198522": ![[67,67]], "198522": ![[68,68]], "198522": ![[69,69]], "198522": ![[70,70]], "198522": ![[71,71]], "198522": ![[72,72]], "198522": ![[73,73]], "198522": ![[74,74]], "198522": ![[76,76]], "198955": ![[305,305]], "198955": ![[306,306]], "198955": !
[[307,307]], "199008": ![[76,76]], "199008": ![[77,77]], "199008": ![[78,78]], "199008": ![[79,79]], "199008": ![[80,80]], "199008": ![[81,81]], "199008": ![[82,82]], "199008": ![[84,84]], "199008": ![[86,86]], "199021": ![[61,61]], "199021": ![[62,62]], "199021": ![[63,63]], 
"199021": ![[64,64]], "199021": ![[65,65]], "199021": ![[66,66]], "199021": ![[67,67]], "199021": ![[68,68]], "200243": ![[72,72]], "200243": ![[73,73]], "200243": ![[74,74]], "200243": ![[75,75]], "200243": ![[76,76]], "200243": ![[77,77]], "200243": ![[78,78]], "200243": !
[[79,79]], "200243": ![[83,83]], "200244": ![[171,171]], "200244": ![[172,172]], "200244": ![[308,308]], "200244": ![[309,309]], "200244": ![[310,310]], "200244": ![[417,417]], "200244": ![[419,419]], "200244": ![[420,420]], "200244": ![[560,560]], "200244": ![[563,563]], 
"200244": ![[564,564]], "200244": ![[599,599]], "200244": ![[600,600]], "200244": ![[601,601]], "200491": ![[128,128]], "200491": ![[129,129]], "200491": ![[85,85]], "200491": ![[86,86]], "200491": ![[88,88]], "200491": ![[89,89]], "200491": ![[90,90]], "200491": ![[91,91]], 
"200491": ![[93,93]], "200491": ![[94,94]], "200491": ![[95,95]], "200491": ![[96,96]], "200491": ![[97,97]], "200491": ![[98,98]], "200515": ![[100,100]], "200515": ![[101,101]], "200515": ![[102,102]], "200515": ![[103,103]], "200515": ![[104,104]], "200515": ![[105,105]], 
"200515": ![[106,106]], "200515": ![[107,107]], "200515": ![[97,97]], "200515": ![[98,98]], "200515": ![[99,99]], "200525": ![[111,111]], "200525": ![[112,112]], "200525": ![[144,144]], "200525": ![[147,147]], "201062": ![[73,73]], "201062": ![[78,78]], "201062": ![[79,79]], 
"201062": ![[80,80]], "201062": ![[81,81]], "201062": ![[82,82]], "201062": ![[83,83]], "201062": ![[84,84]], "201062": ![[85,85]], "201062": ![[86,86]], "201062": ![[87,87]], "201062": ![[88,88]], "201097": ![[76,76]], "201097": ![[77,77]], "201097": ![[78,78]], "201097": !
[[79,79]], "201097": ![[80,80]], "201097": ![[83,83]], "201097": ![[84,84]], "201097": ![[85,85]], "201097": ![[86,86]], "201097": ![[87,87]], "201097": ![[88,88]], "201097": ![[90,90]], "201097": ![[93,93]], "201097": ![[94,94]], "201097": ![[96,96]], "201159": ![[73,73]], "201159": 
![[74,74]], "201159": ![[75,75]], "201159": ![[76,76]], "201159": ![[77,77]], "201159": ![[78,78]], "201159": ![[81,81]], "201164": ![[140,140]], "201164": ![[145,145]], "201164": ![[333,333]], "201164": ![[334,334]], "201164": ![[335,335]], "201191": ![[76,76]], "201191": !
[[77,77]], "201191": ![[78,78]], "201191": ![[79,79]], "201191": ![[80,80]], "201191": ![[81,81]], "201191": ![[82,82]], "201191": ![[84,84]], "201191": ![[85,85]], "201278": ![[64,64]], "201278": ![[65,65]], "201278": ![[66,66]], "201278": ![[67,67]], "201278": ![[68,68]], "201278": 
![[69,69]], "201278": ![[70,70]], "201278": ![[73,73]], "201278": ![[76,76]], "201278": ![[78,78]], "201278": ![[83,83]], "201554": ![[53,53]], "201554": ![[54,54]], "201554": ![[55,55]], "201554": ![[56,56]], "201554": ![[57,57]], "201554": ![[58,58]], "201554": ![[59,59]], 
"201554": ![[70,70]], "201554": ![[71,71]], "201554": ![[72,72]], "201554": ![[74,74]], "201554": ![[78,78]], "201554": ![[79,79]], "201624": ![[100,100]], "201624": ![[102,102]], "201624": ![[105,105]], "201624": ![[76,76]], "201624": ![[77,77]], "201624": ![[78,78]], "201624": !
[[79,79]], "201624": ![[80,80]], "201624": ![[81,81]], "201624": ![[84,84]], "201624": ![[85,85]], "201624": ![[86,86]], "201624": ![[96,96]], "201624": ![[97,97]], "201657": ![[71,71]], "201657": ![[72,72]], "201657": ![[73,73]], "201657": ![[74,74]], "201657": ![[75,75]], "201657": 
![[76,76]], "201657": ![[77,77]], "201657": ![[78,78]], "201657": ![[79,79]], "201657": ![[80,80]], "201657": ![[81,81]], "201657": ![[82,82]], "201657": ![[83,83]], "201657": ![[98,98]], "201668": ![[79,79]], "201668": ![[80,80]], "201668": ![[81,81]], "201668": ![[82,82]], 
"201668": ![[84,84]], "201668": ![[85,85]], "201668": ![[86,86]], "201668": ![[87,87]], "201706": ![[58,58]], "201706": ![[59,59]], "201706": ![[60,60]], "201706": ![[61,61]], "201706": ![[62,62]], "201718": ![[56,56]], "201718": ![[58,58]], "201718": ![[59,59]], "201718": !
[[60,60]], "201718": ![[61,61]], "201718": ![[63,63]], "201718": ![[64,64]], "201718": ![[65,65]], "201718": ![[67,67]], "201718": ![[68,68]], "201718": ![[69,69]], "201727": ![[70,70]], "201727": ![[71,71]], "201727": ![[72,72]], "201727": ![[73,73]], "201727": ![[74,74]], "201727": 
![[75,75]], "201727": ![[76,76]], "201727": ![[77,77]], "201727": ![[78,78]], "201794": ![[61,61]], "201794": ![[62,62]], "201794": ![[63,63]], "201794": ![[66,66]], "201794": ![[86,86]], "201794": ![[89,89]], "202044": ![[100,100]], "202044": ![[88,88]], "202044": ![[89,89]], 
"202044": ![[90,90]], "202044": ![[91,91]], "202044": ![[92,92]], "202044": ![[95,95]], "202044": ![[97,97]], "202044": ![[99,99]], "202178": ![[281,281]], "202178": ![[282,282]], "202178": ![[68,68]], "202178": ![[69,69]], "202178": ![[70,70]], "202178": ![[71,71]], "202178": !
[[72,72]], "202178": ![[73,73]], "202178": ![[74,74]], "202178": ![[75,75]], "202178": ![[76,76]], "202178": ![[77,77]], "202178": ![[78,78]], "202299": ![[53,53]], "202299": ![[54,54]], "202299": ![[55,55]], "202299": ![[56,56]], "202299": ![[57,57]], "202299": ![[58,58]], "202299": 
![[66,66]], "202299": ![[67,67]], "202299": ![[69,69]], "202299": ![[70,70]], "202299": ![[71,71]], "202299": ![[72,72]], "202299": ![[73,73]], "202314": ![[57,57]], "202314": ![[58,58]], "202314": ![[66,66]], "202314": ![[67,67]], "202314": ![[68,68]], "202314": ![[69,69]], 
"202314": ![[70,70]], "202314": ![[71,71]], "202314": ![[72,72]], "202389": ![[102,102]], "202389": ![[82,82]], "202389": ![[83,83]], "202389": ![[84,84]], "202389": ![[96,96]], "202389": ![[97,97]], "202389": ![[98,98]], "202389": ![[99,99]], "202469": ![[110,110]], "202469": !
[[111,111]], "202469": ![[127,127]], "202469": ![[129,129]], "202469": ![[135,135]], "202469": ![[205,205]], "202469": ![[273,273]], "202469": ![[88,88]], "202469": ![[96,96]], "202469": ![[99,99]], "202504": ![[71,71]], "202504": ![[72,72]], "202504": ![[73,73]], "202504": !
[[74,74]], "202504": ![[75,75]], "202504": ![[76,76]], "202504": ![[78,78]], "202504": ![[79,79]], "202504": ![[80,80]], "202504": ![[81,81]], "202504": ![[82,82]], "202504": ![[83,83]], "202504": ![[84,84]], "202504": ![[85,85]], "202970": ![[100,100]], "202970": ![[101,101]], 
"202970": ![[102,102]], "202970": ![[103,103]], "202970": ![[104,104]], "202970": ![[96,96]], "202970": ![[97,97]], "202970": ![[98,98]], "202970": ![[99,99]], "203002": ![[77,77]], "203002": ![[78,78]], "203002": ![[80,80]], "203002": ![[82,82]], "203002": ![[83,83]], "203002": !
[[85,85]], "203002": ![[86,86]], "203002": ![[87,87]], "203002": ![[88,88]], "203002": ![[89,89]]}

Bringing agents up and down when machines need to be rebooted, watching and bringing up and down components

  • When agent machines need to be rebooted, we need to gracefully bring the agent and couch down and, after the restart, bring everything back up:
    • $manage stop-agent
    • $manage stop-services
    • exit
    • wait for the machine to come back up
    • source status.sh
    • $manage start-services
    • $manage start-agent
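Once everything is restarted, it is worth a quick sanity check that CouchDB and the agent components actually came back; for example (paths assume the standard /data/srv/wmagent/current layout used elsewhere on this page):

  # check that the couch processes are back
  ps aux | grep -i couch | grep -v grep
  # glance at the tail of each component log for fresh activity or errors
  tail /data/srv/wmagent/current/install/wmagent/*/ComponentLog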

https://twiki.cern.ch/twiki/bin/view/CMSPublic/CompOpsWorkflowOperationsWMAgentToolkit

Installing Cronjobs for Issues Pages and Monday Meeting Plots

Instructions for Installing Cronjobs for Issues Pages and Monday Meeting Plots

Keep an overview of high and highest priority workflows

Clearing out space in the couch database:

bash-3.2$ pwd
/data/srv/wmagent/current
bash-3.2$ cd install/couchdb/logs/
bash-3.2$ > couch.log
  • Clear out the binary logs for MySQL. You need to figure out the latest binary log and then head into the MySQL command prompt:
mysql> PURGE BINARY LOGS TO 'mysqld-bin.000059';
Query OK, 0 rows affected (1 min 28.54 sec)
  • Truncate the stderr and stdout files for each component. These logs do not get rotated like the regular component logs.

  • We can save some more space (>50GB) by deleting or moving the job archives in: /data/srv/wmagent/current/install/wmagent/JobArchiver/logDir
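Before cleaning anything, it helps to confirm which of these areas is actually consuming the disk; a quick check using the paths mentioned above:

  cd /data/srv/wmagent/current
  du -sh install/couchdb/logs install/wmagent/JobArchiver/logDir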

Unregistering a WMAgent from WMStats

Log in to the WMAgent machine to be unregistered and run:

$manage execute-agent wmagent-unregister-wmstats `hostname -f`:9999

Invalidating and deleting datasets

When we have data that needs to be invalidated and deleted, do the following:

1) Verify the current status of the data and the site that it is at:
./dbssql --input='find dataset,sum(block.numevents),dataset.status,site where dataset=/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-*/*'

### Look-up data in CMS_DBS_PROD_GLOBAL instance ###
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v1/AODSIM 2170270 VALID srm-cms.gridpp.rl.ac.uk
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v1/DQM 2170270 VALID srm-cms.gridpp.rl.ac.uk
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v1/DQM 2170270 VALID srm-eoscms.cern.ch
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v2/AODSIM 2170270 INVALID srm-cms.gridpp.rl.ac.uk
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v2/DQM 2170270 INVALID srm-eoscms.cern.ch
/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v2/DQM 2170270 INVALID srm-cms.gridpp.rl.ac.uk

2) DBS3SetDatasetStatus.py --dataset=/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v2/DQM --status=INVALID --url=https://cmsdbsprod.cern.ch:8443/cms_dbs_prod_global_writer/servlet/DBSServlet

3) Double-check that you have changed the status.

4) Don't forget to make a manual PhEDEx deletion request at: https://cmsweb.cern.ch/phedex/prod/Request::Create?type=delete
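For step 3, one way to double-check is to re-run the status query from step 1, restricted to the dataset you just invalidated, and confirm that it now reports INVALID (this assumes the same dbssql helper and query syntax used above):

  ./dbssql --input='find dataset,dataset.status where dataset=/DYJetsToLL_M-50_scaleup_8TeV-madgraph-tauola/Summer12_DR53X-PU_S10_START53_V7A-v2/DQM'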

Resolving StoreResults tickets

If you get a ticket requesting "Elevation of an user dataset", you should follow the StoreResults procedure.

Updating pledges

Update the site pledges here:

http://dashb-ssb-dev.cern.ch/dashboard/request.py/siteview#currentView=Pledges&highlight=true

and also please update the txt file here: /afs/cern.ch/user/c/cmst1/www/T2List.txt

Assigning Workflows:

Retrieving log files from failed workflows for post mortem

  • The first step is to find the logCollect tarball containing the logArchive for one of the jobs you're interested in:
    • click on the link for one of the failed jobs from WMStats "number of jobs" view (requires tunnel to the agent)
    • find logArchive names at the bottom, and choose the one corresponding to the resubmission with the failure type you're interested in, e.g.
      • Retry 0 -> /store/unmerged/logs/prod/2013/11/12/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449/DataProcessing/10001/0/fb87ed2e-4bda-11e3-ad7a-003048f02c8a-1476-0-logArchive.tar.gz
      • Used by: 1164669
    • follow the "used by" link, which will take you to the log collect job (if no used by link, the logCollect hasn't run yet and the logArchive can still be obtained directly from /store/unmerged) - this page can take several minutes to load
    • find the log collect address at the bottom of the page, e.g. Output Files: srm://srm-cms.cern.ch:8443/srm/managerv2?SFN=/castor/cern.ch/cms/store/logs/prod/2013/11/WMAgent/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449-LogCollect-1-logs.tar
  • If the logArchive from any job will do, you can skip the above and find the names of all the LogCollect tarballs as follows:
    • login to lxplus
    • list all the log tarballs: nsls /castor/cern.ch/cms/store/logs/prod/2013/11/WMAgent/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449 (change the year, month, and workflow name to match the one you're searching for)

  • Copy the logCollect tarball as follows:
    • login to lxplus
    • rfcp /castor/cern.ch/cms/store/logs/prod/2013/11/WMAgent/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449/jen_a_recovery-2-franzoni_Fall53_2011A_MultiJet_Run2011A-v1_Prio2_5312p1__131108_051402_7449-LogCollect-1-logs.tar .
    • extract the tarball to find the logArchive you wanted

  • Copy logArchives from unmerged (before logCollect cleans them up) as follows:
    • on CERN machines, when you already have your environment set up
    • lcg-cp -v [srm address] file:////`pwd`/[file]
    • where you can find the srm address from the LogCollect jobs
    • untar the files and look in the cmsRun1/cmsRun1-stdout.log for the fatal error message
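Once you have the logCollect tarball locally, getting down to the cmsRun log is just nested tar calls; for example (the names in angle brackets are placeholders for the tarball and logArchive you actually retrieved):

  tar xf <LOGCOLLECT_TARBALL>.tar            # unpack the logCollect tarball
  tar xzf <LOGARCHIVE>-logArchive.tar.gz     # unpack one of the logArchives inside it
  less cmsRun1/cmsRun1-stdout.log            # look for the fatal error message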

Updating myproxy credentials

NOTE 1: This procedure can currently only be followed by Dave Mason or Alan Malta, since proxies need to have production VOMS roles and for a couple of sites the DN is completely hard-coded in the grid-map-file.

NOTE 2: This procedure MUST be followed both for CERN and for FNAL agents, since they are using different credentials (different set of DNs are allowed to retrieve them).

NOTE 3: Right now there is a cronjob in each agent (running every 12h) that retrieves a short-lived proxy (168 hours) from the MYPROXY server and also extends it from a grid proxy to a VOMS proxy.

NOTE 4: There is another cronjob in each agent (running every 12h) that will send an email to Alan (in the future to the WF egroup) in case the short-lived proxy goes below 96h of validity.

In case we (Alan :)) receive a warning email that the proxy is about to expire, we have to create and upload a new long-lived proxy (30 days) to the MYPROXY server, following this step for CERN agents:

  1. Connect to a machine where you have your grid credentials (usercert and userkey files), e.g., vocms049 (you'll be prompted for your private key):
    myproxy-init -v -l amaltaro -c 720 -t 168 -s "myproxy.cern.ch" -x -Z "/DC=ch/DC=cern/OU=computers/CN=wmagent/(vocms040|vocms0308).cern.ch" -n -k amaltaroCERN
AND then the following for FNAL agents (remember, they are different, and different agents/certificates retrieve them; BTW, FNAL agents don't use the regex option anymore):
  1. Connect to a machine where you have your grid credentials (usercert and userkey files), e.g., cmslpc nodes (you'll be prompted for your private key):
    myproxy-init -v -l amaltaro -c 720 -t 168 -s "myproxy.cern.ch" -x -Z "/DC=org/DC=opensciencegrid/O=Open Science Grid/OU=Services/CN=wmagent/wmagent.fnal.gov" -n -k amaltaroFNAL
    
  2. At this point we should have a long-term proxy (720h = 30 days) in the MYPROXY server. The cronjob running every 12h in the agent will pick up a new short-term proxy on its next run, so you DO NOT need to do anything else.
  3. Checking the long-term proxy validity:
    (export X509_USER_CERT=/data/certs/myproxy.pem; export X509_USER_KEY=/data/certs/myproxy.pem; myproxy-info -v -l amaltaro -s "myproxy.cern.ch" -k amaltaroCERN)

Commissioning Disk/Tape Separation

Changing WMAgent internal passwords

Passwords are stored in the secrets file. Each agent may have different passwords for the different databases. None of this should be public on an ELOG or ticket. If you need to change some password, here is the procedure:

Couch DB

  • First tunnel to the agent machine
  • Open /data/srv/wmagent/current/config/couchdb/local.ini file
  • Update the password under [admin] - use the plain-text password; it will be changed to a hash value when the couch server is restarted.
  • Update the new password in the secrets and agent config files (be careful: in the agent config file there are a lot of places where the password is hardcoded, and all of them have to be changed). Be sure you only change the couchdb-related hardcoded passwords.
  • Restart agent and couchdb:
$manage stop-agent 
$manage stop-couchdb 
$manage start-couchdb 
$manage start-agent

MySQL

  • First tunnel to the agent machine
  • Go to the mysql prompt:
source /data/admin/wmagent/env.sh
$manage mysql-prompt wmagent

  • Get the encrypted current password by doing:
SELECT PASSWORD('old_password');

  • Get all the users that are using that password by comparing the results of the encrypted old password to what you get doing:
SELECT * FROM mysql.user;

  • For all 'Host' and 'User' using the old password, do:
SET PASSWORD FOR 'User'@'Host' = PASSWORD('new_password');

  • Update the new password in the secrets and agent config files (be careful: in the agent config file there are a lot of places where the password is hardcoded, and all of them have to be changed). Be sure you only change the MySQL-related hardcoded passwords.

  • Restart agent and services:
$manage stop-agent 
$manage stop-services 
$manage start-services 
$manage start-agent

Places to report things and what to report where...

Problems with WmAgent/WmStats

Problems with WmAgentScripts/Unified

Site issues

General issues that need tracking that are not software/site/or transfer related

Transfer issue

General logging information

Problems with agents

  • ELOG reboots; if it becomes an issue, open a JIRA ticket and assign it to Scarlett.
