Updating and maintaining .ssh/config and proxy.pac files - Alan
Workflow Team Meeting - Oct 30 4PM CERN time
Attending
- FNAL: Jen, Luis, Seanchan, Dave, John
- CERN: Julian, Alan, Andrew
Personnel
EU
| Oct 23 -> Oct 30 | Jasper |
| Oct 31 -> Nov 6 | ?? |
US
- Julian will be in Colombia 25th Nov - 25th Dec. He plans on working and being online.
News
- Should the monitoring scripts move to vocms049 too? We would need condor installed on that machine.
- It would be better to run them on one of the SL6 agents, but we don't have any at CERN yet. Luis wants to migrate the jobs ASAP.
- Luis has been testing on 015, a T0 machine using SL6. The script is crashing on SL5, reporting false info for SL6 nodes. Luis knows what the issue is.
- Luis and Julian will work it out offline.
- CMSWEB migration of production CouchDB back-ends
- The back-ends will be down for some hours. We are going to shut down some agents (the drained ones) to minimize the impact on couch.
- Everything is back up. submit2 and vocms235 are both in drain; we can turn them back on at lunchtime FNAL time. The ACDC view is rebuilding from the start and will take 13 hrs to get up to date. Do not do ACDCs until tomorrow.
- When setting up Ian, we realized that we need somebody to be responsible for updating and maintaining the .ssh/config and proxy.pac files.
- https://github.com/gutsche/scripts/tree/master/Kerberos is a good master copy. Dave suggests Alan be our guy to maintain this, as he is the one who brings up new machines.
- Julian was working on it. He's hoping we can set up general rules so we don't have to have specific names, and is planning on putting it into the wmAgent scripts repository.
- Once we have it working OK, we need to update the documentation.
- US operators have set up a tentative meeting time of 1PM FNAL time on Tues
- decide if each operator will do 10 hrs/wk or 20 hrs every other week
- discuss issues they are having with shift work, more training
- Do we need to change the meeting time for the EU shifters?
- We need a new EU shift schedule!
Site support
- A few sites have 0 in CPU bound column
Jasper's notes
Agent Issues
- 98 had a lot of WF's that were stuck, thresholds were messed up. Anybody want to fess up what happened?
- If you change thresholds anywhere you HAVE to e-log it so we know what is going on!
- Submit1 components down because global pool is messed up right now. Backfill on hold
- We leave them down for today; Julian will tell us when we can turn them back on.
Redeployment plan
- Production Pool:
| production SL6 | mc SL5 | reproc_lowprio SL5 | step0 SL5 |
| cmssrv217 (up/new version) 218 (up/new version) 219 (up/new) | vocms216 (up/new version) 201 (up/new version) 235 (drain) cmssrv98 (up - will be abandoned) | vocms202 (up/new version) 234 (up/new version) 85 (up - will be abandoned) cmssrv112 (drain - will be abandoned) | vocms237 (up/new version - will be abandoned) |
- Global Pool
- Redeploying submit2
- Draining vocms235 (to redeploy)
Workflows
miniaod's
Rereco
Store Results
SL6 testing/backfill
- Can we autoapprove transfers of relval datasets to CERN?
- Can we agree on this? https://github.com/dmwm/WMCore/issues/5432
- How can we remove the dependence of the cloning script on the wmagent code?
- Did we figure out why footprints was reopening tickets automatically?
--
JenniferAdelmanMcCarthy - 24 Oct 2014