Note the new time and day: Thursday at 16:00 CERN time. This is 10 am FNAL time this week; once CERN changes clocks it will be a 9 am meeting!

Vidyo Link

  • Yes, I know: this link is for the meeting we were supposed to have on Tuesday. I just realized I need to figure out how to go through and re-edit all the workflow team meeting times from now until forever...


  • Jen, Luis, Dave, John, SeangChan
  • Julian, Alan, Andrew
  • Adli, Sara


  • Question: does it make sense to move the day we switch shifts from Tuesday to Thursday, now that we meet on Thursday with the shifters in attendance, so we can do the formal handover at the beginning of each meeting?
  • March 11 -> March 18 Xavier
  • March 18 -> March 25 Adli
  • March 25 -> ? Sara (thanks for taking the extra couple of days)
  • Jen took Tuesday off; does anybody else have time off planned in the near future?
  • SeangChan will be at CERN next week and working directly with Julian

Shift news

  • Xavier's summary:
    • Getting log files from FNAL was problematic over the weekend (see ELOG). Probably a consequence of all the disk/tape migration --> needs to be solved, as we need to know where our logs are.
      • FNAL people will ask the T1 people whether we can have the unmerged space on the agent machines so we can see the logs.
      • Find out why srmcp was not working over the weekend.
    • Very high load on some agents caused stress in the system (CouchDB crashes, long factory polling cycles, ...).
      • We had a big list of pending jobs in the collector, which is why things were delayed so long in getting assigned.
      • We have thresholds set according to the site support table.
      • We need a cap on the jobs submitted by the agents: right now each agent submits up to its own maximum, and we ended up with 200,000 pending jobs, causing all the issues.
        • Julian will make a GitHub issue proposing a global threshold on the agents that limits this number, so the local queue isn't pulling down an excessive amount of work.
    • DR53X was not working at FNAL
      • When a job lands on a worker node, it then compiles the file list for MSS.
        • Subscribe the dataset to MSS: a temporary hack, because at other sites the pileup dataset was on both MSS and disk, while we don't have the pileup datasets on tape.
        • The real solution is to have the jobs look at the disk instance at the T1s. Dave will make a GitHub issue, since he understands the problem and the solution best.
        • The best place to have it fail would be at the assignment page.
      • Pileup has a list of files it reads from DBS; the list is created when the config is created on the worker node.
    • Step0 at CERN problems: ELOG
      • A combination of workflow and site issues; we are back running.
    • Several sites were still underused after the production kickstart on Thursday (see list in ELOG) --> would be nice to follow up and start checking...
      • There was a problem in the reprocessing agents: some of them got stuck querying PhEDEx, so no more jobs were being split.
      • Procedure change: do we want to bring down the WorkQueue, PhEDEx, and DBS uploader components?
        • The agents will retry PhEDEx and DBS later, so the WorkQueue is the only one that really needs to be brought down.
        • OK, next time we will leave it up but watch it carefully.
    • The operator TWiki was cleaned up and rewritten with the experience of having Jasper as a new shifter and the help of Julian.
    • SSB column usage was also cleaned up/reviewed with the help of John/Julian/Eddie; see the compmon HyperNews thread. The SSB feeding script is being further improved; see GitHub pull request #14, currently being tested.
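The 200,000-pending-job incident above suggests the shape of the global threshold Julian will file an issue for: each agent should only get the smaller of its own headroom and the pool-wide headroom. A minimal sketch, assuming invented names and an assumed cap value (not the real WMAgent interface):

```python
# Hypothetical sketch of a global pending-job threshold: new work is
# pulled only while the pool-wide pending total stays under a cap.
# All names and the cap value are illustrative assumptions.

GLOBAL_PENDING_CAP = 50_000  # assumed value; the real cap would be tuned


def allowed_new_jobs(pending_by_agent, agent, per_agent_max,
                     global_cap=GLOBAL_PENDING_CAP):
    """Return how many jobs `agent` may still queue: the smaller of
    its own per-agent headroom and the remaining global headroom."""
    total_pending = sum(pending_by_agent.values())
    global_headroom = max(0, global_cap - total_pending)
    agent_headroom = max(0, per_agent_max - pending_by_agent.get(agent, 0))
    return min(agent_headroom, global_headroom)


if __name__ == "__main__":
    # Four agents already holding 48k pending jobs in the pool:
    pending = {"vocms1": 15_000, "vocms2": 15_000,
               "vocms3": 10_000, "vocms4": 8_000}
    # vocms4 could locally queue 12k more (max 20k), but the global
    # cap leaves room for only 2k.
    print(allowed_new_jobs(pending, "vocms4", per_agent_max=20_000))  # 2000
```

With a per-agent-only limit, each agent independently fills to its maximum, which is exactly how the pool reached 200,000 pending jobs.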


  • When are we moving to global prioritization in the production pool? ~April, more likely May.
    • The hard part will be merging the two different infrastructures together.
    • We need to get to 100K+ jobs running at the same time, so we will need at least 5 agents going simultaneously and on the order of 10 agents in total.
    • The current MC agents are in fact running both MC and ReDigi/ReReco backfill, so we need to make sure those are debugged before getting rid of teams.
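The agent counts quoted above can be sanity-checked with simple arithmetic, assuming a per-agent capacity of roughly 20K simultaneously running jobs (an assumption for illustration, not a measured figure):

```python
# Back-of-the-envelope check of the scale quoted in the minutes.
import math

TARGET_RUNNING = 100_000      # 100K+ jobs running at the same time
PER_AGENT_CAPACITY = 20_000   # assumed per-agent capacity

agents_minimum = math.ceil(TARGET_RUNNING / PER_AGENT_CAPACITY)
agents_provisioned = 2 * agents_minimum  # headroom for drains/failures

print(agents_minimum, agents_provisioned)  # → 5 10
```

Doubling the minimum for provisioning matches the "at least 5 agents, on the order of 10 in total" estimate: spares are needed so agents can be drained or reinstalled without dropping below target.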

Agent issues

  • Agent instabilities over the weekend; the MC machines had issues with load.
  • 235 is still waiting for a reinstall; when will it be done?
    • It's taking longer than expected (it still has jobs running); removing the jobs is not working.
      • It still has 3 running jobs; hopefully it can be reinstalled later today or tomorrow. Some of the jobs are from aborted workflows.
      • LogCollect/Cleanup/Merge jobs: if there are still jobs running, Luis wants to look at them.
      • Next time we put an agent in drain, if we see a job running from an aborted workflow, we need to let Luis know so he can debug.
  • vocms85 had a higher priority in Condor; already taken care of.
  • SeangChan & Luis, please take a look at this:
      • When WMStats lost the job information, we could not see how many jobs were created/running/pending.
      • Luis will try to get more debugging information on it
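One simple guard the debugging could add is to flag requests whose job-state counts are missing, so the lost-information case is visible instead of silently showing zeros. This sketch uses an invented request-summary shape, not the real WMStats schema:

```python
# Illustrative check (invented data shape, not the real WMStats API):
# report which job-state counts are absent from a request summary.

def missing_job_counts(request_summary):
    """Return the state names absent from a request's job summary."""
    expected = ("created", "running", "pending")
    counts = request_summary.get("status", {})
    return [state for state in expected if state not in counts]


if __name__ == "__main__":
    broken = {"workflow": "Summer12DR53X_task1",
              "status": {"created": 120}}  # running/pending were lost
    print(missing_job_counts(broken))  # → ['running', 'pending']
```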

Workflow Issues

  • Large number of pending jobs:
    • Flushed during the last few days.
    • Systems have stabilized



  • Summer12DR53X workflows at FNAL are failing.
    • Pileup issues: the pileup is not on disk. Has this been confirmed? Are we testing?

  • Status of the ReDigis from the comp ops meeting:
  • Status of Summer12DR53X:
  • What happened with the IN2P3 transfers?
    • There is already a Savannah ticket open on this; their disk instance is full.
    • We are going to sit on these ACDCs as-is until IN2P3 gets their disk situation fixed and the files transferred.
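The pileup failures above are the motivation for the "fail at assignment" idea from the shift summary: refuse to assign a workflow whose pileup dataset has no disk replica at the target site. A hedged sketch, with illustrative function and field names (the real check would query PhEDEx/DBS):

```python
# Hypothetical assignment-time validation: block workflows whose
# pileup dataset exists only on tape (MSS) at the target site.
# Data shapes and names are assumptions for illustration.

def pileup_on_disk(replicas_by_site, site):
    """replicas_by_site maps site name -> set of storage classes
    ('disk', 'mss', ...) holding the pileup dataset."""
    return "disk" in replicas_by_site.get(site, set())


def validate_assignment(workflow, replicas_by_site, site):
    """Raise if a pileup workflow would be assigned to a site where
    the pileup dataset has no disk replica; otherwise return True."""
    if workflow.get("pileup") and not pileup_on_disk(replicas_by_site, site):
        raise ValueError(
            f"{workflow['name']}: pileup dataset has no disk replica "
            f"at {site}; refusing assignment (tape-only would fail)")
    return True


if __name__ == "__main__":
    replicas = {"T1_US_FNAL": {"mss"}}  # tape only, no disk copy
    wf = {"name": "Summer12DR53X_step", "pileup": True}
    try:
        validate_assignment(wf, replicas, "T1_US_FNAL")
    except ValueError as err:
        print("blocked:", err)
```

Failing here is cheap: the operator sees the problem on the assignment page instead of jobs dying on worker nodes after the pileup file list is built.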

Site issues for the Workflow team

Andrew's questions / Luis & SeangChan's answers

  • cmssrv113 replication seems to die every day.
    • SeangChan will look further into it.
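A small watchdog would make the daily replication deaths visible as they happen. CouchDB's standard `/_active_tasks` endpoint lists running replications, so polling it and alerting when none are active is one option; the host URL and alerting here are placeholders, only the endpoint is real CouchDB API:

```python
# Minimal monitoring sketch for the cmssrv113 issue. The
# /_active_tasks endpoint is standard CouchDB; the host URL is a
# placeholder and the "alert" is just a printed message.
import json
from urllib.request import urlopen


def replication_alive(active_tasks):
    """True if at least one CouchDB replication task is running."""
    return any(task.get("type") == "replication" for task in active_tasks)


def check(host="http://cmssrv113.example.org:5984"):  # placeholder URL
    with urlopen(host + "/_active_tasks") as resp:
        return replication_alive(json.load(resp))


if __name__ == "__main__":
    # Offline demo with a canned task list (no network access needed):
    print(replication_alive([{"type": "replication", "progress": 42}]))
```

Run from cron every few minutes, a `False` result would pinpoint when the replication dies instead of discovering it the next day.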


-- JenniferAdelmanMcCarthy - 20 Mar 2014