Workflow Team Meeting - Dec 4 4PM CERN time & US Meeting Tues Dec 2 at 1PM FNAL time

Vidyo Link


  • US Tues Meeting : Jen, Ian and Sean, Dave
  • Thurs Meeting:
    • FNAL: Jen, Seangchan, Ian
    • CERN : Dima, Julian, Andrew, Alan



Nov 28 -> Dec 4 Jasper
Dec 4 -> Dec 10 Xavier

  • Holiday vacation plans
    • Julian will be in Colombia 25th Nov - 25th Dec; he plans on working and being online.
    • Luis will be in Colombia from Dec 20 through New Year; he will get us exact dates soon.
    • Jen will be in MN Dec 26-Jan 2 and will have limited internet access.
    • Ian & Sean will be around, with no prolonged absences.
    • Alan will be in Brazil Dec 13 - New Year's; if you need him you have to call Alan's grandmother, he will post the number in the TWiki ;)
    • Seangchan will take a couple of days off, but is not telling us when ;)


  • Changes needed to the .ssh/config file so you can login to CERN machines:
    • #ProxyCommand proxy-ssh /usr/bin/nc %h %p
    • ProxyCommand proxy-ssh /usr/bin/nc %h %p
      • Julian is changing it right now
  • News on Christmas production? Expect it to hit right as CERN shuts down
    • No specific answer, just planning to get lots of stuff
    • Both WF and Site support will be on skeleton staff so keep
    • Upgrades and redigi/reco: Dima will get us specific information so we can get things set up.
      • We need to know what the input datasets are ASAP so we can get things staged
      • Upgrade is high I/O. FNAL and T2_CH_CERN, CNAF are the only two sites that can do it, so we need to put the pileup datasets there and everything else at other sites, and make sure that replication is done properly.
      • Send GGUS tickets to the actual sites requesting that the datasets be replicated: "We expect this dataset will have high access and they will give a high replication factor on their site."
    • Dima will submit the information to the WF group not the compops main list and we will make sure that everything is staged and ready to go.
    • Do we know what time per event we will be looking at? Don't know yet, most likely not something nice.
    • What is the drop-dead date for the outputs? Which of these samples need to be done by what date? Do we go all the way to 95%, or is 75% good enough for each sample?

Site support

EU shift notes

  • Wed meetings

US Shift notes

  • Starting next week there will be no Tues Meeting as both Ian and Sean will be able to make the Thurs Meetings!
  • No real questions - need to make the changes in .ssh/config file still
  • Ian will look at the Production, and Sean will look into redigi errors and then e-log them

Agent Issues

  • Intervention in CMSR database (Wednesday) see hypernews
    • Crashed all Oracle-dependent components; see elog
  • Workflows with duplicate lumis: see elog
    • Affected Phys14 and a few MC.
      • Latency with central couch meant we had two copies of Workqueue running, so we had exactly 200%
      • Lumi-based issue: there was a lumi creation method that had a bug in it
    • Solved last week by the development team. Workflows had to be resubmitted.
    • Where do we stand on the resubmission?
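
As a rough illustration of the duplicate-lumi symptom noted above, the sketch below counts how often each (run, lumi) pair appears in a workflow's output and reports the processed fraction. When every lumi is produced twice, the fraction comes out at exactly 200%. The function names and the sample run/lumi numbers are hypothetical and are not taken from WMAgent or from the affected workflows.

```python
from collections import Counter


def find_duplicate_lumis(processed):
    """Return {(run, lumi): count} for every pair seen more than once."""
    counts = Counter(processed)
    return {pair: n for pair, n in counts.items() if n > 1}


def processed_fraction(processed, expected):
    """Processed lumi entries divided by the number of distinct expected lumis."""
    expected = set(expected)
    if not expected:
        raise ValueError("expected lumi list is empty")
    return len(processed) / len(expected)


# Hypothetical input: run 1 with lumis 1..5, each delivered twice,
# as would happen if two copies of the same work were acquired.
expected = [(1, lumi) for lumi in range(1, 6)]
processed = expected + expected

duplicates = find_duplicate_lumis(processed)
fraction = processed_fraction(processed, expected)
print(f"{len(duplicates)} duplicated lumis, {fraction:.0%} processed")
```

A check like this only flags the symptom; it does not distinguish the two causes listed above (double Workqueue acquisition versus the lumi creation bug).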

Redeployment plan

  • Migration to global pool
    • Production Pool:
      mc SL5
      vocms216 (up)
      201 (up)
      235 (up)
      reproc_lowprio SL5 step0 SL5
      vocms202 (up)
      234 (up)
      vocms237 (up - will be abandoned)
    • Global Pool
      backfill+production SL6 production SL6
      submit1 (up)
      submit2 (up)
      cmssrv217 (waiting)
      218 (waiting)
  • what are we waiting for in turning cmssrv217, 218 and 219 back up?
    • Julian says to just turn them on. Jen will turn them on so we have people to look at them today.
    • should we remove other team names to submit1 &2? Dave is giving the go ahead
  • Log collect jobs aren't working properly
    • Log collects were failing at CERN and FNAL; see elog report.
    • Log collects running at CERN are not properly mapped to EOS, which is why the CERN log collects are failing. The production DN is mapped; once the production role is mapped properly as well, it should work. Somebody needs to work with Nicolo to get this working; Alan has been tasked with doing this.
by_agent by_componet

* Agent Load

Reproc team: [plots: reproc_running.png, reproc_pending.png]
MC team: [plots: mc_running.png, mc_pending.png]
Production team: [plots: production_running.png, production_pending.png]


pdmvserv_B2G-Phys14DR-00031_00011_v0__141107_221903_3924 submitFailed/ACDC failed... started clone:


  • MiniAODs sitting in assignment-approved
  • WFs with no errors but not at 100%
    • pdmvserv_EXO-Spring14miniaod-00163_00067_v0__141030_093040_4447 80%
    • pdmvserv_EXO-Spring14miniaod-00192_00068_v0__141030_093231_949 86%
    • pdmvserv_SUS-Spring14miniaod-00058_00072_v0__141030_100336_4528 89%
    • pdmvserv_EXO-Spring14miniaod-00215_00077_v0__141030_153350_2462 86%
  • pdmvserv_MUO-Spring14miniaod-00019_00084_v0__141030_155616_5795 100% error
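
To make the "good enough" question concrete, here is a minimal sketch that checks the four no-error workflows listed above against a completion threshold. The workflow names and percentages are copied from these notes. The helper name `below_threshold` is hypothetical, and applying the 75% and 95% figures from the Christmas production discussion to these particular workflows is only an illustration, not a decision recorded in the meeting.

```python
# Completion percentages as recorded in these notes.
workflows = {
    "pdmvserv_EXO-Spring14miniaod-00163_00067_v0__141030_093040_4447": 80,
    "pdmvserv_EXO-Spring14miniaod-00192_00068_v0__141030_093231_949": 86,
    "pdmvserv_SUS-Spring14miniaod-00058_00072_v0__141030_100336_4528": 89,
    "pdmvserv_EXO-Spring14miniaod-00215_00077_v0__141030_153350_2462": 86,
}


def below_threshold(completion, threshold):
    """Return names of workflows under the threshold (in %), lowest completion first."""
    short = (name for name, pct in completion.items() if pct < threshold)
    return sorted(short, key=completion.get)


# Illustrative thresholds only (assumption: the same 75% / 95% candidates apply here).
for threshold in (75, 95):
    pending = below_threshold(workflows, threshold)
    print(f"threshold {threshold}%: {len(pending)} of {len(workflows)} still short")
```

With a 75% cut all four would count as done, while a 95% cut would leave all four open.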


Store Results


SL6 testing/backfill

  • Do we need the SL6 testing section anymore? I think we are set. The question is whether we want to keep some low-priority BF in the system in case resources are left open.
  • "SL5 decommissioning + global pool testing" will be the new name of this section

RelVal Andrew

  • Made a pull request; needs to talk to Seangchan and Alan. They are holding the merge until validation is done and will then do the pull after Tues.

ACDC issue

-- JenniferAdelmanMcCarthy - 2014-12-02
Topic revision: r4 - 2014-12-04 - JenniferAdelmanMcCarthy