John, SeangChan, Luis, Jen, Dave - FNAL
Julian, Andrew, Adli - CERN


  • Nov 26 --> Dec 3 Sunil
  • Dec 3 --> Dec 10 Sara

Review of last week's issues

  • 2011 legacy reprocessing:
    • Recovery is very close to complete: we are on the last workflow and waiting for its recovery to finish. Jobs were still running as of Monday night.
  • Where are we with the stuck Workflow problems?
    • 40 stuck yesterday, 15 today
      • Luis restarted some stuck components,
      • DBS component was restarted and that
      • Problem with couch in the MC agents, there is a TaskArchive query that is taking 24 hrs and crashing, it hasn't
    • When can we shut down the agent and hand it over to Dirk? We will give Dirk 216 the next time it crashes we wil
  • Problem with log file access, especially when there is very little time between jobs running and access to logfiles,
    • Suggestion: first store the logarchives on EOS and then on CASTOR.
      • We don't change the physical location, so we can keep track of where they are for 1 week to 1 month. Dave is following up on this issue.
    • How do we give users access to these files? Will they be able to get them themselves, or will the workflow team have to fetch them?
  • T2_CH_CERN_HLT & AI - most likely a site issue; John will look into this. The WFs that are stuck due to this issue are all running on 227, which doesn't read from the drain list, so the fact that it is in drain isn't the issue. It's an actual site issue.
  • low latency agent

Agent Issues

Site Issues that affected workflows

  • T2_UK_Bristol - giving warnings for some jobs on the production view of the dashboard. The CREAM CE is down (unscheduled downtime). John will see how long they are down and then decide whether they need to be in drain. They should be out of downtime in a couple of hours.

Workflow Issues

  • Drop of running jobs on Tuesday-Wednesday: GlideIn Front-end.
  • Retrieving logs before force-complete.

  • EXO-Fall13 with merge failures 11474, Kill-and-clone policy.
  • BTV-Fall13 batch without failing info 11508
  • High priority WFs finished: 11428
  • Highest priority WF stuck: 11489. ACDC completed successfully but still at 34%

  • franzoni_Fall53_2011A_Jet_Run2011A-v1_Prio1_5312p1_130916_235201_2576 - still running recovery; the rest of the ReReco WFs are done
  • pdmvserv_SMP-Fall11R4-00002_T1_US_FNAL_MSS_00006_v0__131113_125318_6858
    • Monday's round of tests with Burt showed it is in fact a site issue and Burt is working on it.
    • Still tracking down the issue, so I will run another round.
  • pdmvserv_EGM-UpgradePhase1Age1Kdr61SLHCx-00002_T1_US_FNAL_MSS_00002_v0__131125_160358_1243 - performance failures 92% so I will run ACDC to see if we get more

Andrew's Questions

  • Requesters asked what sets the memory (RSS) limit of workflows, for heavy ion
    • We do have sites where we could bump it up a little; the limit is set by the hardware limit of most sites.
    • They want 4 GB; I don't think we have any sites that have that kind of limit.
  • RE priority of agents
    • we need to set a date
  • We made a ticket for the problems with the PhEDEx injector, 400 error:

-- JenniferAdelmanMcCarthy - 03 Dec 2013

Topic revision: r5 - 2013-12-03 - AndrewLevin