Workflow Team Meeting - April 2, 4 PM CERN time / 9 AM FNAL time

Vidyo Link

Attending

  • FNAL: Seangchan, Luis, Jen
  • US: Ajit, Stephan
  • CERN: Dima, Alan, Andrew, JeanRoc
  • EU:

Personnel

  • Luis in Colombia around May 1-10 (dates to be confirmed); talk on the 6th
  • Julian out over Easter holidays, March 28-April 4
  • Seangchan off a few days, March 28-April 4

News

  • New EU Operators!
    • James Keaveney - Belgium
    • Haneol Lee - SNU Korea
  • Ongoing Upgrade Workflows
  • Anything else coming up??
  • We went through a new version of the workflow assignment scripts. It will live with the other scripts in wmagent and is in testing. We are also working on getting official approval to spread input datasets out more, to allow more flexibility in running data.

3 top issues affecting production

  • Still cleaning up workflows that have step1 files missing and so can't run step2 and step3. There appears to be a problem with PhEDEx; there was a nice discussion on the workflow team list. We need to get to the bottom of this!
  • All jobs failing to T0_CH_CERN - this mostly affected the T0 jobs, as we were not running much there
    • all agents are now patched and it is working fine
  • Short-handed this week - between holidays in the EU and spring break in the US, large parts of the workflow team are out at least part time. This is making things go a bit slower than normal.
  • Infrastructure issues - we always have components going down, but we should list the top 3 issues that can be worked on

Site support

Opportunistic sites

  • Little happening. We are running workflows at NERSC and SDSC. SDSC is commissioned: it's running successfully and staging out to FNAL, and merge jobs are running successfully at FNAL! YEAH! Now that merges are happening automatically, is the data being injected at FNAL in DBS? Haven't checked yet.
    • The dataset is produced; has it been mapped properly, yes or no? Alan will work with Stephan to see if this is right.
  • Stephan submitted another WF yesterday to see if automatic merges were happening or not. This workflow has been in acquired for over 1 day.
    • Stephan will file an e-log for this workflow
  • Pilots running or failing at NERSC so things are not yet running.

Workflows

  • Found a problem with the recovery script: when you run it on workflows that have whitelists, it ignores the whitelist when it makes the recovery workflows.
    • Julian, have you had time to look at this yet? -- Way beyond my WMCore knowledge; I can try spawning some tests.
    • Julian is not here this week, so we need to discuss it next week
    • Seangchan looked at it but hasn't done much
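
As a starting point for that discussion, here is a minimal sketch of the behaviour we would expect from the recovery script: the recovery workflow inherits the site restrictions of the original. This is not the actual script or any WMCore API; the function name, the dictionary layout, and the key names (RequestName, OriginalRequestName, SiteWhitelist, SiteBlacklist) are assumptions made for illustration only.

```python
def make_recovery_request(original, recovery_name):
    """Build a recovery request that keeps the original site restrictions.

    `original` is a plain dict standing in for a workflow request.
    All key names here are hypothetical, not taken from the real script.
    """
    recovery = {
        "RequestName": recovery_name,
        "OriginalRequestName": original["RequestName"],
    }
    # Copy the site lists so the recovery can only run where the
    # original workflow was allowed to run.
    for key in ("SiteWhitelist", "SiteBlacklist"):
        recovery[key] = list(original.get(key, []))
    return recovery


original = {"RequestName": "wf1", "SiteWhitelist": ["T1_US_FNAL"]}
recovery = make_recovery_request(original, "wf1_recovery")
print(recovery["SiteWhitelist"])  # ['T1_US_FNAL']
```

Copying the lists, rather than sharing them, means later edits to the recovery request cannot silently change the original.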

ReDigi

  • Problem with step1 files missing on Disk endpoint

TaskChains

  • script has been integrated into the other assignment scripts

MiniAODs

  • nothing to report

Rereco

  • Data skims - we can't decide anything here, what do we want to do??
  • Long-open requests: what are we doing with them? Luis can work on running replay. They should be triggered to put them in. Talk offline.

Store Results

  • Nothing to report

MonteCarlo

  • Bunch of Summer12 wfs in completed, but with significant number of failed jobs.
  • One Summer11Leg wf in similar situation. A batch of RunIIWinter15GS wfs too.
  • All need ACDCs unless they are already in place.
  • Julian usually looks at them but Jen will take them on today since Julian is out of town.

Agent Issues

  • SSB thresholds were not publishing correct info, so we had nothing running on Tuesday. It is now fixed, not just patched, so let's re-enable the agent watcher
    • Jen will make the changes and restart everything.

Redeployment Plan

production SL6

FNAL                              CERN
cmsgwms-submit1 (up)              vocms0308 (ready to wake)
cmsgwms-submit2 (ready to wake)   vocms0309 (drain)
cmssrv217 (up)                    vocms0310 (drain)
cmssrv218 (up)
cmssrv219 (up)

RelVal Andrew

  • Any progress on GitHub issues? He's working on the log part
  • We've been discussing jobs asking for too many resources, but the question is going nowhere. Alan and Seangchan, do we know what to do about this?
    • The parameter is walltime (maxwalltime). If the requested value is too long, the job will be pending forever, and the value can have nothing to do with how long the jobs actually run if it is set wrong at job splitting. By default we should make it 48 hrs, except for CERN.
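
The proposed default could be sketched roughly as below. This is only an illustration of the rule as stated in the notes, not existing agent code: the function name, the constant, and the test for a CERN site (the string "CERN" in the site name) are assumptions. The notes do not say what value CERN should use, so the sketch leaves CERN requests untouched.

```python
DEFAULT_MAX_WALLTIME_HOURS = 48  # proposed default from the discussion above


def effective_max_walltime(requested_hours, site_name):
    """Return the max walltime (hours) to request for a job.

    requested_hours is None when job splitting did not set a value.
    Hypothetical helper; CERN sites are exempt from the default.
    """
    if "CERN" in site_name:
        # The CERN value is not specified in the notes, so pass it through.
        return requested_hours
    if requested_hours is None:
        return DEFAULT_MAX_WALLTIME_HOURS
    return requested_hours


print(effective_max_walltime(None, "T1_US_FNAL"))  # 48
print(effective_max_walltime(12, "T1_US_FNAL"))    # 12
print(effective_max_walltime(None, "T2_CH_CERN"))  # None
```

Note that this only fills in a missing value; whether an explicitly requested value above 48 hrs should also be capped is still an open question.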

AOB

-- JenniferAdelmanMcCarthy - 2015-04-02
