https://indico.cern.ch/conferenceDisplay.py?confId=254682
Attending
Adli, Julian - CERN
John, Luis, Jen
SeangChan - FNAL
Congratulations to Dave on his new baby. We are still waiting for cute baby photos.
Andrew
Personnel:
* coming on shift Xavier
* coming on shift Sunil
* list of everyone's holidays? (CERN closeout)
-
- US will be having the Thanksgiving Holiday Thursday-Sunday
- Jen will be pulling "best effort" days Wed-Sunday. In other words, I'll log in, run the close out script and make sure the machines aren't on fire during US/Asia shifts, but will not be spending a lot of time tracking down issues unless they cannot be ignored.
- SeangChan will be taking off completely Wed-Sun
- Julian's holidays: Dec 27-29 and Jan 7-11
- Dec 23-Jan 1 - Xavier and Sunil will be on shift but working from home
- CERN closed Dec 22-Jan6
- Luis will be gone Dec 11-13; Dec 16-20 Luis will be working remotely
Issues
- Dashboard is being unreliable this week - Julian will make sure Sunil understands the Dashboard issues we are having so we are not relying on it for debugging
- time plots: the # of jobs is unreliable; we are running 60K jobs but the dashboard plots say 120K jobs
- Efficiency plots - John will look into why we keep going green/yellow/green
- Couch issues on 201, 216: couch keeps going down; we need to keep a close eye on it
Agents
- vocms201: Issues with couch. This is getting usual under heavy load.
- vocms235: Sandbox problem solved. 11386 fwjr with missing task field, manually added
- vocms85: Workflows stuck in Acquired. Oracle connection problem, solved. 11359
- Stuck MonteCarlos: Why are there so many? Not reaching "complete"
- the acquired WF's were waiting for resources
- 2-3 WF's with many queued jobs and everything is piled up behind them
- When you make ACDC's they need to have higher priority so that they go in.
- Blocks were not being closed in DBS. Problem identified. There will be a procedure change
- already discussed procedure changes, we need to make sure the twiki is updated
- when workflow is force completed ACDC's also need to be force completed
- Julian will update the twikis; Jen or Sunil will test
Site Issues that affected workflows
* Xavier - good job going through the sites with issues and working with the site support team!
* We need to get the EU operators working closer with the EU site support team. Right now the EU site support team is still on a steep learning curve but we aren't going to get them ramped up unless we get them working!
- T2_TH_CUNSTDA (80 slots) - solved links and HC (91%) - SAM problems
- T2_BR_UERJ (200 slots) - new SE - fixing phedex links and SAM availability
- pledges view - update: there are sites with n/a - automatic? talk with Julian
- Script to automatically change drain list with WR list in SSB
Workflows
Monte Carlo
Reprocessing
Workload Summary/Problems in the config file
AOB
- Dashboard Alarms (Adli and Julian working on it) 40%
- Site status script, tested on vocms201. Successful for now, this week is the migration for vocms216 and vocms85.
- Emails to workflow HN: is the TaskChain working properly? Luis has in fact read the emails but hasn't replied or looked yet. All jobs are running over the same event.
- Are we doing DQM harvesting on ACDC's?
-- JenniferAdelmanMcCarthy - 25 Nov 2013--