Vidyo Link

Attending

  • Luis, Seangchan, Dave, Jen, John - FNAL
  • Andrew, Julian, Alan - CERN
  • Sunil

Shift

  • Feb 4-11 Sunil
  • Feb 11-18 Sunil

Thank You

  • First, a BIG thank you to the whole workflow team, who put in a lot of extra hours over the last couple of weeks and worked all weekend to get things shut down so we can be ready for the DBS3 upgrades this week. Treat yourself to a decaffeinated drink: judging from the hours at which I've seen e-mails/IMs come through, I think we've all consumed way too much caffeine over the last couple of weeks.

Agent issues:

  • All of the agents were unstable all week (sometimes several crashes within an hour):
  • The disk was hovering at ~95% full, whereas the agents are designed for disk usage of at most ~75%.
  • The push to get the most out of the extra two weeks of data processing showed that we cannot run the machines at this level for any length of time and expect them to stay stable.
  • We plan a stricter rotation of agents after everything is upgraded.
  • Moving away from teams will also make us more flexible when agents get full/unstable in the future.
  • This week: Agent deployment: DBS2_DBS3 Migration Plan
    • Let's go through this week's plan and edit the migration plan twiki as we go, rather than taking notes here.
  • Agents shutdown: all the checks for pending work look OK, and it seems no remaining real work was skipped. However, there were some blocks from backfill/aborted/rejected workflows that could not be inserted into DBS3 or closed in DBS. We should be careful with the invalidation list; we cannot forget any dataset.
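
As an illustration of the disk levels mentioned above, here is a minimal sketch of a disk-usage check. It is not part of the agent tooling: the function names and status labels are hypothetical, and the only figures taken from these notes are the ~75% design maximum and the ~95% level at which the agents were unstable. Note that the used/total ratio computed here can differ slightly from what `df` reports.

```python
import shutil

# Thresholds taken from the notes above; everything else is an assumption.
DESIGN_MAX = 0.75      # agents are designed for disk usage of at most ~75%
UNSTABLE_LEVEL = 0.95  # level at which the agents were seen crashing


def used_fraction(path="/"):
    """Return the fraction of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def classify_disk(fraction, design_max=DESIGN_MAX, unstable_level=UNSTABLE_LEVEL):
    """Map a used fraction to a coarse status label (labels are hypothetical)."""
    if fraction >= unstable_level:
        return "critical"
    if fraction > design_max:
        return "over-design"
    return "ok"


if __name__ == "__main__":
    frac = used_fraction("/")
    print(f"{frac:.0%} used: {classify_disk(frac)}")
```

A check like this could be run periodically on each agent machine to decide when an agent should be rotated out before it reaches the unstable range.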

Workflow Issues

  • Production was shut down on Saturday (FNAL time); any remaining Redigi/MC workflows were force completed.
  • We loosened our usual closeout standard for redigi from 95% to 90%; it was that or make people wait another ~2 weeks to get their data.
    • Dave and Andrew will let people know what we have when they announce the workflows.
    • If it turns out that we need more statistics for any of these datasets, let us know and we can always clone and run again later.
  • Any workflows that were < 90% complete, had duplication or missing-input issues, or were still suffering after changes in the SE etc. were rejected and cloned.
  • A full list of all rejected workflows and their clones is here: 12749
  • A full list of all datasets that will be invalidated and deleted can be found here: 12752
  • There were a handful of workflows with some weird issues that prevented auto closeout, but they were far enough along that we wanted to keep them instead of rejecting them. Those workflows and the issues I was seeing can be found here: 12747
  • You can also take a look here: DBS2_DBS3 Migration Plan
  • Jen will prepare a list of datasets for deletion. As discussed in the Monday CompOps Meeting we will not submit the deletion requests until after DBS3 is up and stable. We have to be sure all the datasets
  • Have we come up with a list of data that sites can delete from unmerged? We will rely on the 2-week turnaround for cleanup to get things cleaned up.
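
The closeout rule described in this section (keep workflows at 90% or more, reject and clone anything below 90% or with known problems) can be sketched as a small function. This is only an illustration under stated assumptions: the function name, arguments and return labels are hypothetical, completion is assumed to be measured as events done over events requested, and the handful of workflows that were kept despite auto-closeout problems were handled by hand and are not modelled here.

```python
def closeout_decision(events_done, events_requested, has_issues=False, threshold=0.90):
    """Decide what to do with a workflow at shutdown.

    has_issues stands for duplication, missing input, or lingering
    SE-related problems. threshold is the loosened 90% standard
    (the usual standard is 95%).
    """
    if events_requested <= 0:
        raise ValueError("events_requested must be positive")
    fraction = events_done / events_requested
    # Anything below the threshold, or with known problems, is rejected and cloned.
    if has_issues or fraction < threshold:
        return "reject_and_clone"
    # Everything else was force completed and kept.
    return "force_complete"
```

For example, a workflow with 90 of 100 requested events and no issues would be kept, while one at 89 of 100 would be rejected and cloned. Passing `threshold=0.95` gives the behaviour of the usual standard.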

List of scripts -

Site Issues - John

Andrew's Questions

  • Continuing problem with log collect jobs at FNAL due to incorrect timestamps.

-- JenniferAdelmanMcCarthy - 10 Feb 2014

Topic revision: r7 - 2014-02-11 - JenniferAdelmanMcCarthy