Workflow Team Meeting - Oct 2 4PM CERN time
Attending
- FNAL: Jen, Jorge, Dave, Seangchan, Luis
- CERN : Andrew and Julian
Personnel
Sep 25 -> Oct 2 | Sara
Oct 2 -> Oct 9 | Xavier
- Dave is done with Jury duty. He can now tell you what parts of Kane county to avoid.
News
- hyper urgent Upg2023SHCAL14DR is down to one last ACDC2
- SL6 stress testing, backfill - more on this further down the agenda
- resubmit.py script modified to get a backfill wf from any given workflow (not Taskchain)
- to use it:
python resubmit.py WORKFLOW USER GROUP -b
- it will add "Backfill" everywhere (Acquisition era, processing string, etc.)
- it will reset the request date (treated as new request).
- example:
[vocms174] /afs/cern.ch/user/j/jbadillo > python WmAgentScripts/resubmit.py pdmvserv_TSG-Spring14dr-00048_00226_v0__140924_200502_3926 jbadillo DATAOPS -b
Submitting workflow
RESPONSE ....
Cloned workflow: jbadillo_TSG-Spring14dr-Backfill-00048_00226_v0__141002_110514_6144
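A minimal sketch of what the -b option is described as doing to the request schema, assuming the schema is a plain dict. The field names (AcquisitionEra, ProcessingString, Requestor, Group, RequestDate), the helper name make_backfill_schema and the way "Backfill" is appended are illustrative assumptions, not the actual code in resubmit.py:

```python
import copy
from datetime import datetime, timezone

BACKFILL = "Backfill"


def make_backfill_schema(schema, user, group, now=None):
    """Return a copy of `schema` marked as a backfill request.

    Hypothetical helper: key names and the suffix style are assumptions.
    """
    now = now or datetime.now(timezone.utc)
    clone = copy.deepcopy(schema)  # never touch the original request

    # Tag the naming fields so backfill output cannot be mistaken for real data.
    for key in ("AcquisitionEra", "ProcessingString"):
        value = clone.get(key)
        if isinstance(value, str) and BACKFILL not in value:
            clone[key] = value + BACKFILL

    # The clone belongs to whoever resubmits it.
    clone["Requestor"] = user
    clone["Group"] = group

    # Reset the request date so the clone is treated as a new request.
    clone["RequestDate"] = [now.year, now.month, now.day,
                            now.hour, now.minute, now.second]
    return clone
```

The original schema is copied rather than edited in place, so the source workflow stays untouched if the submission fails.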
Site support
- we began AAA testing with SL6 but need to stop doing so...
Sara's notes
Agent Issues
- Agents have been well behaved... that's what happens when you aren't running much
Redeployment plan
- cmssrv112 - in drain for redeployment as disk filled
Workflows
- Processing string - Where do we stand with this? We had 2 tasks assigned last week; what is their status?
- We need to talk to MCM about how they want the policy set: Dave will reply to the e-mail in which they say they are postponing it, and ask for clarification as to how we will know that we are using it.
- Dave is working with them; we just need to handle the responsibility of request to request manager2
- Julian will make a list of WF's that have processing string in the schema that are already in the system.
- slow... just wrapping up old work; new stuff dribbling in is behaving itself
miniaod's
- Workflow with duplicates was resubmitted by PPD - new miniaod ran, no duplicates, old workflows rejected and outputs were deleted
- New Miniaod at 300%: pdmvserv_SUS-Spring14miniaod-00012_00049_v2__140916_141253_9688
https://cmslogbook.cern.ch/elog/Workflow+processing/17065
Rereco
Store Results
SL6 testing/backfill
- We have begun ramping up the load testing on the 3 SL6 agents; only one suspicious crash in both cmssrv217 and 218 (see elog)
- had issues with couch replication
- problems overnight that were solved. New one appeared after it and Alan will talk to Seangchan after the meeting about it.
- Fri-Weekend - submitted Backfill ran at nice steady state, no crashes but thresholds were improperly set (Use the new resubmit.py)
- Mon-Tues realized that the thresholds were set incorrectly when we couldn't get the 2nd Backfill to go, once the thresholds were reset things ramped up nicely and held steady state
- We are running T1's at threshold using these 3 agents
- Wed - Started another redigi backfill, with higher priority than the others, it is currently taking over slots as we want it to!
- Wed - Julian started MC backfill and is ramping up more jobs
- cmssrv217.fnal.gov cmssrv217. 11618 27143 0
- cmssrv218.fnal.gov cmssrv218. 13986 20560 1
- cmssrv219.fnal.gov cmssrv219. 10986 23072 0
- We think we are ready to start running low priority real work on these machines. Anybody want to throw some work at us?
- Tests to run
- replication didn't work in the first tries, and then it worked. Seangchan can't think of any specific tests
- need to ask Alan or Justice if they have any specific tests they think we should run - Julian will work with Alan to come up with a list of things we need to run
- need to come up with a good variety of WF's to put through, multi step, miniaod's, mixing.... what else?
- get something big for each site
- get some hard workflows multistep gensim+raw + gen_sim_Reco
- the Frankenflow should run through - made aodsim and miniaodsim
- miniaod's that run on reco
- some of the CSA14 produced gen sim raw and gen sim reco; they actually had the reco format rather than AOD, then did miniaod from reco
- can we put these into cmsweb-dev?
- we should remove the "release validation" elog and replace it with a "WM debugging" elog I think
Topic revision: r5 - 2014-10-02 - JenniferAdelmanMcCarthy