Workflow Team Meeting - Jun 12

Vidyo Link

Attending

  • Jen - phone
  • Luis, SeangChan and Dave - FNAL
  • Jasper
  • Vincenzo, Ajit
  • Julian
  • Andrew

Personnel

June 5 --> June 12 Sara
June 12 --> June 19 Jasper
(Or tell Oli | CW | CP)
  • Jen will be off for half a day on June 12 and all day June 13, and again July 28 - Aug 8; may have limited access in the evenings
  • Julian Badillo July 24 - July 25
  • Dave Mid July
  • SeangChan will be taking a week off at the end of June
  • Note: Krista will be going on maternity leave sometime in July

News

  • Reminder to all Operators that we need to keep on top of "stuck" workflows. Please check daily for WFs that have no jobs running but are sitting in running-closed, as well as Vincenzo's list
  • What happened to our daily e-mails? Are we keeping on top of restarting components and checking that all agents are full?
  • HIG-Fall13wmLHE -
  • CSA14 is the high priority: HIG-Fall13wmLHE and Digi-Recos, any Fall13 GEN-SIMs, Spring14. We need to push hard on these and watch them
    • There are 5 requests with the wmLHE WFs; the remaining requests are the ones we need to take care of
      • They can't be extended. They are at 88%, with no queued jobs and no jobs in condor. The problem is that they are on the standard MC agents and have no condor jobs, so they haven't been submitted. Julian can't find jobs to submit inside the agent. These are merge jobs that can only run at KIT, and the agent has 50 pending jobs from other WFs, so we need to wait. Can SeangChan take a look? They've been stuck at 85-88%. We have the max jobs running on higher-priority agents.
      • We need to go to the requestors to see if they are happy with the stats we already have. Can we increase the priority of the agent? We are unsure what the side effects would be
    • are there new Fall13 that started yesterday?
  • The list of WF's that can be Force Completed at 90% 15163
  • http://cmst2.web.cern.ch/cmst2/mc/requests.html
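
The daily "stuck" check from the reminder above (WFs sitting in running-closed with no jobs running) could be sketched roughly as below. This is only an illustration: the `Workflow` record, its field names, and the sample workflow names are hypothetical, and how the job counts are actually obtained from the agents or condor is not covered here.

```python
from dataclasses import dataclass


@dataclass
class Workflow:
    # Hypothetical record; field names are assumptions, not a real API.
    name: str
    status: str
    running_jobs: int


def stuck_workflows(workflows):
    """Return workflows in running-closed that have no jobs running."""
    return [
        wf for wf in workflows
        if wf.status == "running-closed" and wf.running_jobs == 0
    ]


if __name__ == "__main__":
    sample = [
        Workflow("wf_a", "running-closed", 0),
        Workflow("wf_b", "running-closed", 12),
        Workflow("wf_c", "running-open", 0),
    ]
    for wf in stuck_workflows(sample):
        print(wf.name)
```

Anything the check flags would still need to be looked at by hand, together with Vincenzo's list.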

Sara's notes

Site support

[Images: SR.png, links.png]

  • Drain List metric in SSB
    Automate data entry: respect manual changes and update automatically as well
    	* prod status metric 158: automatic filling procedure for “drain” (follow the order indicated)
    		+ add previous drain list (read the metric 158 itself)
    		- remove sites with SR ranking last week > 80%
    		https://dashb-ssb.cern.ch/dashboard/request.py/sitereadinessrank?columnid=45#time=168&start_date=&end_date=&sites=all
    		(this condition only for sites in “drain”. If site was “down”, don’t do anything)
    		+ add sites in WR=in (metric 153)

    	* prod status metric 158: automatic filling procedure for “down”
    		+ add previous down list (read the metric 158 itself)
    		+ add sites in Morgue

    Output file:
    a txt file called “drain.txt” with the usual format for SSB
    link= web address of the txt file
    output location: the usual SST web folder
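
A minimal sketch of the filling procedure above, applying the steps in the order listed. The inputs are plain Python collections; how metrics 158 and 153, the SR ranking, and the Morgue list are actually read is not shown. Several points are assumptions rather than anything stated in the notes: a site with no SR ranking stays in drain, "down" takes precedence if a site ends up in both lists, and the tab-separated line layout in `render` is only a placeholder for "the usual format for SSB".

```python
from datetime import datetime, timezone

SR_THRESHOLD = 80.0  # percent; from the notes: "SR ranking last week > 80%"


def next_drain(prev_drain, sr_rank_last_week, waiting_room_in):
    """Apply the "drain" steps in the order given in the notes."""
    drain = set(prev_drain)  # + previous drain list (read from metric 158)
    # - remove drained sites whose SR ranking last week was above 80%.
    # Assumption: a site with no ranking available stays in drain.
    drain = {s for s in drain if sr_rank_last_week.get(s, 0.0) <= SR_THRESHOLD}
    drain |= set(waiting_room_in)  # + sites with WR=in (metric 153)
    return drain


def next_down(prev_down, morgue):
    """Previous down list (metric 158) plus the sites in the Morgue."""
    return set(prev_down) | set(morgue)


def render(drain, down, link, now=None):
    """Build the text lines for drain.txt (layout is a placeholder)."""
    now = now or datetime.now(timezone.utc)
    stamp = now.strftime("%Y-%m-%d %H:%M:%S")
    status = {site: "drain" for site in drain}
    # Assumption: "down" wins if a site appears in both lists.
    status.update({site: "down" for site in down})
    return [
        f"{stamp}\t{site}\t{value}\t{link}"
        for site, value in sorted(status.items())
    ]
```

Because the previous lists are read back from metric 158 itself, manual changes made there carry over into the next automatic update, which is the "respect manual changes" requirement.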
    
[Image: history.png]

Agent Issues

Workflows

  • we've had a lot of bottlenecks in getting WF's running this week.

ReDigi

  • Finally have the ReDigi backlog under control
    • we have a number of WFs where ACDC or the WF was 100% successful, but we are not at 100%. Julian, you said you were going to look at the recovery script, did you have time? Can we wait on these or should I just clone and re-run?
      • pdmvserv_EXO-Fall13dr-00319_T1_DE_KIT_MSS_00204_v0_tsg_140430_070111_5033 15063
      • pdmvserv_EXO-Fall13dr-00304_T1_DE_KIT_MSS_00206_v0_tsg_140429_200703_3105 15061
      • pdmvserv_EXO-Summer12DR53X-02980_T1_UK_RAL_MSS_00209_v0__140427_130346_3539 14898
      • pdmvserv_EXO-Summer12DR53X-02976_T1_UK_RAL_MSS_00209_v0__140427_112658_9371 14898
      • pdmvserv_EXO-Summer12DR53X-02974_T1_UK_RAL_MSS_00209_v0__140427_112608_9905 14898
      • pdmvserv_EXO-Summer12DR53X-02965_T1_UK_RAL_MSS_00207_v0__140423_190314_9100 14898
      • pdmvserv_TSG-Spring14dr-00045_T1_IT_CNAF_MSS_00013_v0__140412_002912_3455 14299
      • pdmvserv_B2G-Spring14dr-00008_T1_IT_CNAF_MSS_00007_v0__140412_002635_4233 14641
      • pdmvserv_TSG-Spring14dr-00011_T1_US_FNAL_MSS_00006_v0__140411_185116_9520 14898
      • pdmvserv_TOP-Summer12DR53X-00231_T1_US_FNAL_MSS_00191_v0__140331_140116_863 14789
  • pdmvserv_HIG-Fall11R2-015 at PIC - 100% failure! 15110
  • pdmvserv_EXO-Spring14premixdr-00002_T1_US_FNAL_MSS_00001_v0_premixinPilotGF_140527_171514_2197 100% failure 14894
  • ALCARECO = we need to adjust the closeout script test for ALCARECO. Julian do you have time?

Store Results

  • Was there any work this week? How did it go?
    • I have assigned a few of these workflows. We need a shared and a few procedure agreements in order to avoid repeating work.
    • Diego will push to testbed the code for moving this to central wf operation.

MonteCarlo

  • We had low running numbers all week. Do we understand what happened? How do we get things to improve? Is it documented on the TWikis so we can get things moving faster next time?
  • Some workflows waiting for merges at KIT.
  • Wfs under watch:

RelVal Andrew

AOB

-- JenniferAdelmanMcCarthy - 11 Jun 2014
Topic revision: r10 - 2014-06-18 - JenniferAdelmanMcCarthy
 