FNAL: Jen, Luis, John, Dave
CERN: Adli, Julian, Andrew, Xavier


  • EU:
    • Nov 12 --> Nov 19: Sunil
    • Nov 19 --> Nov 26: Xavier

  • US: Jen & Luis
  • Upcoming Holidays:
    • US - Nov 28-Dec 1, Thanksgiving holidays: we will be on "best effort" but not working those days.
    • Julian's holidays: Dec 27-29 and Jan 7-11
    • Dec 23-Jan 1 - Xavier and Sunil will be on shift but working from home
    • List of everyone's holidays? (CERN closeout)
      • CERN closed Dec 22-Jan 6
    • Luis will be gone Dec 11-13; Dec 16-20 Luis will be working remotely

Issues last week:

  • Sunil -
    • Xavier will talk directly to Sunil
  • Jen -
    • Agents fairly stable, usual SQL issues
    • Problems with WFs running at sites outside of the whitelist
    • Problem with a dataset that didn't want to properly subscribe
      • Jen worked with Lucy to get it taken care of
    • Blocks that didn't close out properly
      • If we force complete a WF, we also need to make sure that we force complete any associated ACDCs
  • Does DQM Harvesting happen if we run ACDC? SeangChan needs to look and see what it is doing
    • Andrew will think over what needs to happen and send an e-mail to the Comp-Ops for discussion
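
The force-complete rule noted above (a force-completed WF should take its associated ACDCs with it) could be sketched as follows. This is only an illustration: the `acdc_map` dictionary, the function name, and the workflow names are hypothetical and do not correspond to any WMAgent or ReqMgr API, and the actual force-complete call is deliberately left out.

```python
from collections import deque


def workflows_to_force_complete(workflow, acdc_map):
    """Return the workflow followed by every ACDC that descends from it.

    acdc_map is a hypothetical dict: workflow name -> list of ACDC names
    created from that workflow (an ACDC may itself have ACDCs).
    """
    queue = deque([workflow])
    seen = set()
    ordered = []
    while queue:
        current = queue.popleft()
        if current in seen:
            continue  # guard against duplicates or cycles in the mapping
        seen.add(current)
        ordered.append(current)
        queue.extend(acdc_map.get(current, []))
    return ordered


# Hypothetical example data, not real request names.
example_map = {
    "wf_A": ["acdc_A1", "acdc_A2"],
    "acdc_A1": ["acdc_A1_retry"],
}
targets = workflows_to_force_complete("wf_A", example_map)
```

Whoever force completes `wf_A` would then work through `targets` rather than the single workflow, so no ACDC is left running.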

  • Spent most of my week working through the WFs that needed a recovery run

  • We had a LOT of WFs that needed help getting closed out last week! Good job, everyone, in working through our problems. Which brings me to my next point:
    • Ideas on what we can do to make operations run more smoothly.
    • Are there any lessons learned this week?
    • What did we do by hand that needs to be documented?
    • Is there any automation we can do for the problems we had last week?

  • Workflow issues:
    • Open blocks on closed-out WFs: it's closed now.
    • Workflow without PhEDEx subscription: jbadillo_B2G-Summer12-00461_00039_v0__131111_094210_477, now with subscription. Not closed out yet.
      • Let's add the PhEDEx fix to a GitHub issue so that we don't have to do this manually: it is time-consuming, and Luis reports he needed to run it ~20 times on different agents last week.
    • Some issues with workflows running on KIT.


  • The SSB site update script is going to be tested on vocms201 this week.
    • Procedures to put sites on drain/normal are going to change
  • The Dashboards Alarms Script (a.k.a. the vocms174 killer): Adli and Julian are working on it.
    • The shell script is inefficient, so they are translating it to Python. It will take a while, but they are working on it.
    • For now we've moved it off production machines and changed the interval to 6 hrs.
  • Followed up with Chris Jones about the missing lumi problems. There is something we can add to the config for jobs that fail when they get a SIGINT.
    • By default CMSSW will take a Ctrl-C and end the job, but you can add a line that will ignore the Ctrl-C; then your batch system would have to really fail it, and then it will get retried.
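
The effect of ignoring Ctrl-C can be illustrated with a plain Python sketch. This is not the CMSSW config line Chris Jones referred to (that line is not recorded in these notes); it only shows the general mechanism of a process ignoring SIGINT, using the standard `signal` module (Python 3.8+ for `raise_signal`). The helper names `sigint_ignored` and `run_payload` are made up for the example.

```python
import signal
from contextlib import contextmanager


@contextmanager
def sigint_ignored():
    """Ignore SIGINT inside the block, then restore the previous handler."""
    previous = signal.signal(signal.SIGINT, signal.SIG_IGN)
    try:
        yield
    finally:
        # previous is None if the old handler was not installed from Python
        if previous is not None:
            signal.signal(signal.SIGINT, previous)


def run_payload():
    with sigint_ignored():
        # Simulate a Ctrl-C arriving mid-job; it is ignored, so no
        # KeyboardInterrupt is raised and the job keeps running.
        signal.raise_signal(signal.SIGINT)
        return "payload finished"


result = run_payload()
```

With SIGINT ignored, only a harder failure from the batch system stops the job, which is what leads to the retry described above.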

-- JenniferAdelmanMcCarthy - 19 Nov 2013

Topic revision: r4 - 2013-11-19 - JenniferAdelmanMcCarthy