Workflow Team Meeting - Feb 5 4PM CERN time

Vidyo Link


  • FNAL: Jen, Luis, Jorge, SeangChan, John
  • US: Ajit, Ian
  • EU: Vincenzio, Julian, Xavier, Alan, Andrew, Dima


  • Julian will be off Feb 10-13 (in Istanbul); calendar already updated.
  • Jen may be off Feb 13-16, pending weather.


  • About EU operators:
    • Two new people from Seoul Univ. will start training in the next two weeks.
      • Already added to the workflow team e-group.
    • Also news from Belgium:
      • In principle, we are still using Sara for 2 months; she will leave before summer.
      • Xavier will have 4 weeks for us.
      • There is a new postdoc in the group whom we will have for the next year.
      • Xavier will put us in contact with the people he is arranging. We need to work on giving better feedback to the operators.
      • Give shifters more specific tasks.
  • US Operations news:
    • Sean will not be shifting for the time being; he has too many responsibilities with classwork and teaching.
  • FNAL will have a downtime on the 11th
  • Old requests monitoring (Dima)
    • we need to check periodically for requests that are in production for too long:
    • Typical issues:
      • rejected workflows that were not properly communicated back to PPD and still reflected as "submitted" in McM
      • lost/forgotten workflows. Example: pdmvserv_HIG-Summer12DR53X-01991_T1_US_FNAL_MSS_00212_v0__140502_155516_4571
  • PPD - Lots of local GEN-SIM to produce minbias sample, but they injected everything
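
For the old requests check mentioned above, a minimal sketch of how stale workflows could be flagged from their names alone. It assumes that the trailing `_140502_155516_4571` part of a request name encodes the injection time as YYMMDD_HHMMSS followed by a counter; that reading, the function names, and the 90-day cutoff are illustrative assumptions, not something agreed in the meeting.

```python
import re
from datetime import datetime, timedelta

# Assumption: request names end in _YYMMDD_HHMMSS_<counter>.
_STAMP = re.compile(r"_(\d{6})_(\d{6})_\d+$")


def injection_time(workflow_name):
    """Return the timestamp embedded in a request name, or None if absent."""
    match = _STAMP.search(workflow_name)
    if match is None:
        return None
    try:
        return datetime.strptime(match.group(1) + match.group(2), "%y%m%d%H%M%S")
    except ValueError:
        # Digits were present but do not form a valid date/time.
        return None


def stale_workflows(names, now, max_age=timedelta(days=90)):
    """List the names whose embedded timestamp is older than max_age (hypothetical cutoff)."""
    stale = []
    for name in names:
        stamp = injection_time(name)
        if stamp is not None and now - stamp > max_age:
            stale.append(name)
    return stale


example = "pdmvserv_HIG-Summer12DR53X-01991_T1_US_FNAL_MSS_00212_v0__140502_155516_4571"
print(stale_workflows([example], now=datetime(2015, 2, 5)))
```

The list of names would still have to come from whatever source lists requests currently in production; the sketch only covers the age check.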

Site support

  • SAM and HC problems at all sites.
  • The drain list script had some issues and sites have not been updated. The SSB list is OK, so we can do it manually.

Agent Issues

  • JobAccountant was unstable on Feb 3. SeangChan had to run Alan's script a number of times to get things running again. It is not clear why this is happening, but he spent some time looking at it.
    • Condor sent back a corrupted FWJR with missing jobtype information. We have a script to fix it, but we should try to figure out the source of the problem.

Redeployment plan

  • Submit2 redeployed on Wed
    • Global Pool, production SL6:
      • submit1 (up)
      • submit2 (up)
      • cmssrv217 (up)
      • 218 (up)
      • 219 (up)
      • vocms0308 (down)
      • vocms0309 (down)
      • vocms0310 (down)
    • Production Pool: all Production machines have been retired.
  • CERN machines installed and tested:
    • We will wake them up when one of the FNAL agents reaches 75% disk usage, most probably submit1 (at 61% now).
    • The idea: drain submit1 and submit2 and use them as backup.
    • Please check that you have access to the machines!
    • Also check access to vocms049 (for running scripts).
    • Condor scripts have been moved to vocms0308.
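
The 75% disk rule above could be checked with something like the following sketch. It uses only the Python standard library; the function names and the choice of path are hypothetical, and the fraction computed here (used / total) may differ slightly from what `df` reports on the agents.

```python
import shutil

# Threshold quoted in the minutes: wake up the CERN machines at 75% disk usage.
WAKE_UP_THRESHOLD = 0.75


def should_wake_backup(used, total, threshold=WAKE_UP_THRESHOLD):
    """True once the used fraction of the disk reaches the threshold."""
    return used / total >= threshold


def check_path(path=".", threshold=WAKE_UP_THRESHOLD):
    """Apply the rule to the filesystem that holds `path` (hypothetical helper)."""
    usage = shutil.disk_usage(path)
    return should_wake_backup(usage.used, usage.total, threshold)


if __name__ == "__main__":
    print("wake up backup agents:", check_path("."))
```

With the numbers quoted above, `should_wake_backup(61, 100)` is still false for submit1.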


  • Backfill again: two TMNT (Teenage Mason's Nuclear Trolls)
    • 1x10^9 events ~ 770K jobs, 4.5K prio
    • Next time we put in backfill, we should choose the WFs better so we have a shorter lag time (3-4 hr).
  • Priorities are working fine; however, check this elog.




Store Results

  • 1 wf waiting for resources at FNAL


  • Huge load of RunIIWinter15GS injected yesterday, 61 in acquired, 9 running

RelVal (Andrew)

  • The old issue of WFs not closing out when multiple WFs run off the same input dataset is still happening.


  • Getting low-level backfill going on all sites to keep the sites healthy: good idea, Julian is looking into it.

-- JenniferAdelmanMcCarthy - 2015-02-04

Topic revision: r6 - 2015-02-05 - JenniferAdelmanMcCarthy