Workflow Team Meeting - Nov 5, 4PM CERN, 9AM FNAL time
Vidyo Link
Attending
Personnel
- Jorge - Nov 9-13
- SeangChan off Nov 3-10
- Julian to Colombia Dec 14, working remotely from there until the first of Jan; contract then ends
- Ajit out till Nov
- JR unavailable until Nov 23rd
News - Dima
3 top issues affecting production
- See ReReco notes
- StepChain -
- cmsweb upgrades - finished already https://cms-logbook.cern.ch/elog/Workflow+processing/22103
- Do we need to revive the recovery script for ReReco?
- Lots of WFs in status failed, with no reason given in the file for why they failed. Is something broken? I found several that had been reset or cloned while the original was still stuck in status failed. There was nothing in the elog about the workflows being cloned or reset; the only way I knew what had happened was to look for output datasets with higher version numbers. Is it really that hard to post an elog when you clone/reset workflows? Is there a better way to check on the workflows in failed?
Site support - Gaston
Currently in the waiting room:
T2_ES_IFCA, T2_IT_Bari, T2_IN_TIFR, T2_TH_CUNSTDA, T2_PK_NCP, T2_EE_Estonia, T2_RU_INR, T2_BR_UERJ.
No changes in morgue.
News & Issues
- (JR) Several sites are over the dataops DDM quota. Reasons:
- heavy gen-sim in production or in use
- 4 or 5 open campaigns with secondary input: 40+40+25+10+10 TB = 125 TB of secondary on disk
- (JR) Large TaskChains with a large job blow-up ratio: now assigning to sites with >4k slots if ratio >5
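The assignment rule JR describes could be sketched roughly as below. This is only an illustration, not the actual assignment code: the function name, the site dictionary and the slot counts are hypothetical, and only the two thresholds (ratio >5, >4k slots) come from the notes.

```python
# Hypothetical sketch of the rule above; names and slot counts are made up.
BLOWUP_RATIO_THRESHOLD = 5    # from the notes: "if ratio >5"
MIN_SLOTS_FOR_LARGE = 4000    # from the notes: ">4k slots"


def eligible_sites(site_slots, blowup_ratio):
    """Return the sorted list of sites a TaskChain may be assigned to.

    site_slots   -- dict mapping site name to number of slots (assumed input)
    blowup_ratio -- job blow-up ratio of the TaskChain (assumed precomputed)
    """
    if blowup_ratio > BLOWUP_RATIO_THRESHOLD:
        # Large blow-up: restrict to sites with more than 4k slots.
        return sorted(s for s, n in site_slots.items() if n > MIN_SLOTS_FOR_LARGE)
    # Otherwise no slot-based restriction is applied in this sketch.
    return sorted(site_slots)
```

With placeholder sites such as {"T1_A": 8000, "T2_B": 1500}, a ratio of 6 would keep only T1_A, while a ratio of 5 or less would keep both.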
Workflows
- Workflows with None AcqEra - https://cms-logbook.cern.ch/elog/Workflow+processing/22100
- We have already discussed why this happened, and outlined the procedure to fix it
- This is a problem of communication and documentation: Alan and Julian knew what to do, and I sort of knew what to do, but nothing was documented anywhere!
- What other exceptions to regular ACDC production and assigning are there?
- (JR) What can be done about the very long running TaskChains for now?
Rereco
- Need to start returning WFs that are not 100%
- Need to update the get missing run/lumi info
- How much have we talked to the developers about why we are getting missing lumis that are not showing up in WMStats as errors? We need to find a better way of recovering this!
Store Results
Agent Issues
Redeployment Plan
- everything is up to date - redeploy in mid-November, and then we'll have major changes to iron out before Christmas Production
- all sites need proper local site node name info or they won't work - reposting last week's list here to see where we are in getting things done!
- Previously the site config returned node names, but now the SE name and PhEDEx node name need to be returned or they will fail. Gaston needs to verify that the CERN local configs are right
* as discussed in the meeting, some sites are not providing the phedex-node value in the site-local-config.xml. For example, T2_CH_CERN is Ok:
- but T2_CH_CERN_HLT is missing that info:
- We need all the processing sites to properly provide this info; otherwise we'll start having problems in the next WMAgent release.
- Also, the site we are using has to be listed here. If not, we either update SiteDB or get the site name from other sources.
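A quick way to spot sites missing this info might look like the sketch below. It is a hedged illustration only: the function name is hypothetical, the `se-name` tag name and the XML layout in the sample are assumptions (only `phedex-node` and the `value` attribute are named in these notes), and the sample site and host names are placeholders.

```python
import xml.etree.ElementTree as ET

# Illustrative sample only; the layout and the "se-name" tag are assumptions.
SAMPLE_MISSING_PNN = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <se-name value="se.example.org"/>
    </local-stage-out>
  </site>
</site-local-config>
"""


def missing_stageout_fields(xml_text, required=("se-name", "phedex-node")):
    """Return the required tags that lack a non-empty 'value' attribute."""
    root = ET.fromstring(xml_text)
    missing = []
    for tag in required:
        # A tag counts as present if any element with that name,
        # anywhere in the document, carries a non-blank value.
        found = any(el.get("value", "").strip() for el in root.iter(tag))
        if not found:
            missing.append(tag)
    return missing


if __name__ == "__main__":
    print(missing_stageout_fields(SAMPLE_MISSING_PNN))
```

Running the same check over each processing site's site-local-config.xml would give a list of sites to chase before the next WMAgent release.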
L3 discussion - Ajit, Jean-Roch, Matteo
Opportunistic Resources
Automatic Assignment And Unified Software
- In the past Julian has "extended MC workflows" that didn't meet statistics. Is this something Unified can take over?
AOB
- PNN change - have Gaston check sitedb list that maps
--
JenniferAdelmanMcCarthy - 2015-10-28