Workflow Team Meeting - July 2nd, 4 PM CERN time, 9 AM FNAL time
Vidyo Link
Attending
- US: Ajit, Matteo
- EU: Dima, Alan, Julian, Andrew
Personnel
- Jen off June 26 - July 5: will have e-mail access but painfully slow internet
- Jen off Aug 10-26 (tentative)
- SeangChan July 27-31
- Julian Sept 14-30
- Alan July 11-20 (approximate)
News
3 top issues affecting production
- GlideIn: Collector problems after an update; things look better now.
- Some workflows were delayed in acquired status.
- RunIISpring15DR74: exit code 8003, a Step3 MiniAOD problem seen across sites; reported to the PPD HN (HN link)
- ACDCs for some of them do not help.
- ACDCs stuck in acquired in the Global Queue: SeangChan worked on them. Any updates?
Site support - John
- Some test wfs are running on TH_CUNSTDA and RU_IHEP.
- We may need to kill them in order to upgrade the agent.
Workflows
- EXO-RunIIWinter15GS wfs with high failure rates:
- maxRSS exceeded.
- All wfs aborted.
- Discussion is still ongoing (HN link)
Store Results
Agent Issues
Redeployment Plan
- Only two agents remaining. Everything else was upgraded.
- We got rid of the wfs waiting for the same dataset.
- SeangChan is not here
- Andrew wants to add an exit code so that jobs killed by timeout are not retried; retrying them delays the completion of a wf.
- Andrew also asked about the archiving delay.
L3 discussion - Ajit, Jean-Roch, Matteo
Opportunistic Resources - Stefan
- Olivier found some issues that are preventing jobs from running properly on HLT.
- Some details still need to be tuned.
- Meanwhile we keep pending jobs on HLT.
- NERSC was able to handle some wfs
- Matteo is dealing with SDSC
- Julian is going to re-assign something there, but over the last few days there were many pending jobs and very few running.
Automatic Assignment And Unified Software
- JR implemented the deletion of invalidated data.
AOB
JulianBadillo - 2015-07-02
Topic revision: r4 - 2015-07-02 - JulianBadillo