Workflow Team Meeting - Feb 11, 4 PM CERN time, 9 AM FNAL time

Vidyo Link

Attending

  • FNAL: Jen, Jorge, Gaston
  • US:
  • CERN: Dima, Paola, and Alan

Personnel

  • JR in Zurich Feb 18-20
  • Jen to CERN Feb 29-March 4 - tickets are booked!
  • Possible training sessions Feb 8-12 - ND student Alison, Matteo, Paola
    • Paola waiting on FNAL accounts, problems with grid certs - Alan will help
    • Jen and Paola will get together at 4 CERN time on the 12th to go through things.
  • New Transfer team member - Sebastian

News - Dima

  • Toward summer major production
    • super high priority and it will not be a stable version
    • we need to get any development done ASAP: the release should be available April 1st, and we should be staging things in and be ready. The target is 3 billion events, but they really want 6 billion events by June!
    • this means that Request Manager 2 would have to be ready for the March release. It needs to be tested.
      • Jen and SeangChan will work together for 1-2 hrs a day next week to figure out what changes we need to make and to see if we can make this happen.

3 top issues affecting production

  • Instabilities at NCP - kill and clone; we are just leaving it in drain manually even though the site is passing all its tests
    • we need to figure out what is going on there, but not on production
  • Network issues at FNAL late last week caused file read problems that should be cleared now
  • Intervention on Tues had all the CERN agents down for the day
  • High abort rate for workflows (Elog 23098)
    • Gaston you were chasing this down on Monday, where are we now?
    • this was due to the network being overloaded at FNAL, mostly read failures over xrootd
    • We need to fix pileup: the way we read it saturates the networks everywhere

Site support - Gaston

  • What do we do about chronically unstable sites like CSCS, NCP, and UERJ? They work for a couple of days, then give us issues again, which makes doing anything long term on them difficult.

  • Into the waiting room: T2_DE_RWTH, T2_FI_HIP, T2_IT_Rome
  • Out of the waiting room: T2_ES_IFCA.

Transfers - Jorge

Workflows

ReDigi

MiniAOD

  • Putting NCP into drain and killing and cloning the last batch of WFs seems to have gotten this list under control

TaskChains

StepChain

Rereco

  • The new campaign is better behaved: ~20 workflows in recovery, but recoveries are working well
  • There are a handful of WFs where the number of files in DBS = # files in PhEDEx. These workflows are also under recovery, but we need to figure out what is going on with the dbsInjector so that they can be announced when we are done with recovery.

Store Results

  • NA

MonteCarlo

  • WFs with merge failures at CSCS, which has been unstable: we had to kill and clone a bunch of them now that CSCS is in drain again.

Agent Issues

  • All CERN agents restarted Tues due to work being done at CERN

Agent redeployment

RelVal - Andrew

L3 discussion - Ajit, Jean-Roch, Matteo

Opportunistic Resources

Automatic Assignment And Unified Software

AOB

-- JenniferAdelmanMcCarthy - 2016-02-1
Topic revision: r3 - 2016-02-11 - JenniferAdelmanMcCarthy
 