Workflow Team Meeting - Oct 15, 4PM CERN, 9AM FNAL time
Vidyo Link
Attending
- FNAL: Jen, Gaston, Jorge, Eliana, SeangChan
- US:
- CERN : Andrew and Julian
- EU:
Personnel
News - Dima
- Plenty of work to keep us out of trouble
- KIT downtime ended 6 min ago
- Site readiness has been below 80% for the last couple of days. Do we force them on, or wait for them to have a "good day" via tests so they go on automatically?
3 top issues affecting production
- Sites failing and having to move workflows
- TIFR: https://cms-logbook.cern.ch/elog/Workflow+processing/21866
- The site has been drained, and we have sent some test WFs there. Waiting for the test workflows to run.
- Submit Failures at Bristol
- File read/merge issues at KIPT
- Same errors at Lisbon. Julian dug deeper, and it looks like it might be a permissions error.
- Stage-out failures at RALPP
- file read issues at Lisbon
- Workflows with no errors but not 100%
- We suspected non-reporting errors. Where are we in debugging this? Already debugged, twice.
- Conclusions:
- Robust merge (in most cases) or error document missing on couch (in one case).
- Kill and clone works most of the time. The error is not easy to reproduce.
Site support - Gaston
News & Issues
- About the sites:
- KIPT: SAM, HC, Links: OK. Waiting for resolution of ticket 116876. The site was on downtime from Oct 7 through Oct 9.
- TIFR: Problems could be related to a problematic DPM node; issue reported to be solved in 116867. SR ok during last 2 weeks.
- RALPP: Link issues on Monday and Tuesday. Site appears in good condition now according to SR. Apparent overload on SRM according to HN thread.
- Double check date/time of failures to see if it's still a problem.
- Bristol: Waiting for resolution of 116683, 116873. No answer from site admin yet. Site appears in good shape according to SAM, HC and link metrics.
- Possible problem with the TFC? Jorge will check that the files are actually there. PhEDEx thinks the files are there, but the agent is failing the submits because it can't find the files.
- New morgue controller script: if a site has better than 80% for more than a week, it will go back into the waiting room, but it will not go back into production until we move it manually.
- Out of the morgue: T2_BR_UERJ, T2_RU_INR, T2_PK_NCP
- Into the waiting room: T2_FI_HIP, T2_IT_Bari, T2_ES_IFCA
- Out of the waiting room: T2_EE_Estonia.
- Sites in Waiting Room: 6
- Sites in Morgue: 7
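The morgue controller rule noted above can be sketched roughly as follows. This is only an illustration, not the actual script: the function name, the state names, and the input format (one percentage per day) are all hypothetical. It also assumes that the 80% figure refers to a daily site readiness percentage, and it reads "more than a week" as at least 8 consecutive days above the threshold.

```python
from typing import List

# Assumed values, taken from the note in the minutes; the real script may differ.
THRESHOLD_PERCENT = 80.0   # "better than 80%" is read as strictly greater
MIN_GOOD_DAYS = 8          # "more than a week" is read as at least 8 days


def next_state(state: str, daily_percent: List[float]) -> str:
    """Return the state a site should be in after applying the rule.

    state: one of "morgue", "waiting_room", "production" (hypothetical names).
    daily_percent: daily percentages, oldest first (hypothetical input format).
    """
    # Only sites in the morgue are moved automatically. Promotion from the
    # waiting room to production stays a manual step, so it is never done here.
    if state != "morgue":
        return state

    recent = daily_percent[-MIN_GOOD_DAYS:]
    enough_history = len(recent) == MIN_GOOD_DAYS
    if enough_history and all(p > THRESHOLD_PERCENT for p in recent):
        return "waiting_room"
    return "morgue"
```

For example, `next_state("morgue", [90.0] * 8)` moves the site to the waiting room, while a site already in the waiting room stays there regardless of its history until someone moves it by hand.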
Workflows
Rereco
Store Results
Agent Issues
Redeployment Plan
- Everything is up to date. Redeploy in November, and then we'll have major changes to iron out before Christmas production.
L3 discussion - Ajit, Jean-Roch, Matteo
Opportunistic Resources - Stefan
Automatic Assignment And Unified Software
AOB
--
JenniferAdelmanMcCarthy - 2015-10-14
Topic revision: r5 - 2015-10-21 - JenniferAdelmanMcCarthy