Workflow Team Meeting - April 17

Vidyo Link

Attending

  • FNAL - Jen, John, Dave, Seangchan, Luis

Personnel

April 10 -> April 17 Sara
April 17 -> April 24 Jasper
  • Luis is going to Colombia for a seminar April 21-25 then on vacation April 28-May 2
  • Dave & Oli at SLAC April 6-10
  • Dave at CERN April 21-29
  • Julian will be on vacation May 16-18
  • CERN shutdown for Easter Holidays April 17-21
  • Adli will be on vacation from May 14-24

News

  • Issues with WFs stuck in running-open: fixed.
    • This was a problem with the dbs2 shutdown. A handful of WFs needed to be fixed by hand.
  • Need to come up with requirements for Request Manager 2.
    • Things that we currently do with scripts need to move to Request Manager.

Sara's Notes

  • Lots of agent issues; had to restart. TaskArchiver is crashing a lot.
    • This is a known issue. In the meantime, just keep restarting and e-logging.
  • Central couch compacts at night, CERN time. View creation is slow or stuck during compaction, and until compaction finishes all agents will report AnalyticsDataCollector as "down". In fact it is running; it just hasn't reported in 20 min. If you see this, just wait. If you see TaskArchiver or JobUpdater down, those are actually down.
  • Couch is maxing out. It depends on how many documents we have, and it seems to be happening more often now. It has been increased to 100 and we are still maxing out, so we need to restart couch. Developers are testing BigCouch. If this works, problem solved, but as of now it doesn't appear that it is going to work. For now, just restart couch.
  • A change needs to be made to the documentation. We only need to shut down the agent, restart couch, and then restart the agent if replication is down. Otherwise just restarting couch is OK, and then JobSubmitter doesn't have to rebuild things.
  • There are 3 WFs that have failures at all sites with SCRAM_ARCH issues.
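
The triage rules in the notes above could be sketched roughly as follows. This is only an illustration, not an actual WMAgent tool: the function names, the returned strings, and the compaction_running / replication_down flags are all hypothetical, and how those conditions would be detected is not covered here.

```python
# Hypothetical sketch of the shift triage rules in the notes above.
# None of these function names or return values come from WMAgent itself.

def triage_down_component(component, compaction_running):
    """Suggest what to do when an agent reports `component` as down."""
    if component == "AnalyticsDataCollector" and compaction_running:
        # Per the notes: it is running, it just hasn't reported in 20 min.
        return "wait"
    if component in ("TaskArchiver", "JobUpdater"):
        # Per the notes: these are actually down.
        return "actually down"
    # Assumption: anything not covered by the notes needs a human look.
    return "check by hand"


def couch_restart_steps(replication_down):
    """Return the ordered steps for a couch restart."""
    if replication_down:
        return ["shut down agent", "restart couch", "restart agent"]
    # Restarting couch alone means JobSubmitter doesn't have to rebuild things.
    return ["restart couch"]


print(triage_down_component("AnalyticsDataCollector", compaction_running=True))
print(couch_restart_steps(replication_down=False))
```

The point of the second function is that the full agent shutdown is only needed when replication is down; in every other case the shorter path applies.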

Agent Issues

  • What is the status of workflows stuck in acquired? We have plenty of work assigned and sitting in acquired, but not much running.
    • Checked: it is not a glidein issue, it is an issue with the agent. Seangchan was looking into it on Wed afternoon. Where do we stand?
    • I had ACDCs going in on Tues afternoon, but the stuff I put in Tues night was still in acquired Wed afternoon.
  • 216 and 234 draining for upgrades

Site Issues

  • Purdue changed its SE name, so we need to keep an extra close eye on it.
  • RALPP had a problem with some of its settings in the factories.

MonteCarlo

Redigi/ReReco

  • I have workflows moving again. Lots of ACDCs to run; I just need them to be moving again.

Relval == Andrew's Questions

-- JenniferAdelmanMcCarthy - 17 Apr 2014

Topic revision: r4 - 2014-04-24 - JenniferAdelmanMcCarthy
 