Reprocessing and Production Team Meeting - June 2, 4 PM CERN time, 9 AM FNAL time


Vidyo Link

Attending

  • FNAL: Jen, Gaston, Jesus, Matteo, Jorge
  • US: Allie
  • CERN : Sebastian, Dima, Alan
  • Korea :

Personnel

  • Jen: half day June 3 and June 6
  • Jen: vacation July 25-29
  • SeangChan: June 2-July 5
  • Jorge: June 13-24

News - Dima

  • We need to get rid of the DR80 tails that are waiting for AOD and MiniAOD
  • Another group of requests is coming up, but we need to focus on these tails first; the system is behaving better
  • The assignment script was using backfill. We don't know which workflows have written to backfill; we need to find them and replace them
    • We need to query DBS3 to get this information - Jen and Paola
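
Once the file lists have been pulled from DBS3, spotting output written to backfill is essentially a prefix check on the LFNs. A minimal sketch, assuming backfill output lives under an LFN prefix such as `/store/backfill/` (an assumption, not confirmed in these notes) and that the per-workflow LFN lists have already been fetched; the function name, workflow names, and file names are hypothetical:

```python
from collections import defaultdict

# Assumed LFN prefix for backfill output; adjust to whatever the assignment script used.
BACKFILL_PREFIX = "/store/backfill/"


def workflows_with_backfill_output(lfns_by_workflow, prefix=BACKFILL_PREFIX):
    """Return {workflow: [lfn, ...]} for workflows with at least one LFN under prefix."""
    affected = defaultdict(list)
    for workflow, lfns in lfns_by_workflow.items():
        for lfn in lfns:
            if lfn.startswith(prefix):
                affected[workflow].append(lfn)
    return dict(affected)


# Made-up input, standing in for the result of the DBS3 queries.
example = {
    "wf_a": ["/store/backfill/1/mc/x/file1.root", "/store/mc/x/file2.root"],
    "wf_b": ["/store/mc/y/file3.root"],
}
affected = workflows_with_backfill_output(example)
print(affected)
```

Only workflows with at least one matching file appear in the result, which keeps the list of candidates for replacement short.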

Top issues affecting production

  • Problem with mergedlfnbase set for multiple ACDCs: we need to find out which workflows are affected and assess the impact; this looks like a major issue
    • Backfill issue; we need to get this figured out and fixed
  • Problem with error 134: we need wmstats to report the actual exception (https://github.com/dmwm/WMCore/issues/6894)
  • jen_a_Commissioning2015-Cosmics-Boff-01Mar2016_763p2_160331_172001_8114 - ACDC was run over the same ReReco twice. I am making a list of duplicated lumis overnight tonight. Do we want to try to invalidate and then run the recovery script, or kill and clone and raise the priority?
    • Kill and clone per Dima
  • workflow with duplicates, but no ACDC : https://cmsweb.cern.ch/reqmgr/view/details/vlimant_EXO-RunIISummer15GS-01845_00322_v0__160523_112111_9696
    • Alan will look at this
    • We should also check the input dataset to make sure that it doesn't have duplicates - Jen will start an elog conversation on it either way
  • A large number of DR80 workflows were accidentally killed and cloned; we need to make sure that we have cleaned up after this
  • AODSIM transfers from Fermilab are failing - one dataset has files in different directories
  • We need to use the LFN explicitly if we want to use the assign scripts - Matteo is working on this
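
The list of duplicated lumis mentioned above boils down to finding (run, lumi) pairs that appear in more than one file. A minimal sketch, assuming the file-to-lumi mapping has already been retrieved (for example from DBS3) into a plain dictionary; the input layout, function name, and file names are hypothetical:

```python
from collections import defaultdict


def find_duplicate_lumis(lumis_by_file):
    """Map each (run, lumi) seen in more than one file to the sorted list of those files.

    lumis_by_file: {lfn: iterable of (run, lumi) pairs}  (hypothetical layout)
    """
    files_by_lumi = defaultdict(set)
    for lfn, lumis in lumis_by_file.items():
        for run_lumi in lumis:
            files_by_lumi[tuple(run_lumi)].add(lfn)
    return {rl: sorted(files) for rl, files in files_by_lumi.items() if len(files) > 1}


# Made-up input: lumi (1, 11) appears in both files.
example = {
    "file_a.root": [(1, 10), (1, 11)],
    "file_b.root": [(1, 11), (1, 12)],
}
duplicates = find_duplicate_lumis(example)
print(duplicates)
```

The same check can be run on the input dataset to see whether the duplicates were already present before the workflow ran.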

Site support -

  • 3 sites in waiting room
    • UCSD - coming out of waiting room tomorrow
    • Ioannina - down and dead
    • NCHC - has been in the waiting room for a while due to availability problems
  • Morgue hasn't changed

Transfers - Jorge

  • Slow transfers between FNAL and CERN
  • Blocks in the stuck-transfer JSON: the blocks are still open. SeangChan is looking at why the agent didn't close them automatically, but they can be closed.

Workflows

ReDigi

MiniAOD

TaskChains

StepChain

  • NA

Rereco

Store Results

MonteCarlo

Agent Issues

  • ErrorHandler - Alan is looking at it. Is the job submitter too aggressive? That is not believed to be the issue. Keep restarting the component in the meantime.

Agent redeployment

  • Validation of the new agent has just started, so we don't have feedback yet; hopefully we can start migrating in 2 weeks

RequestMgr2 Migration

  • Can we inject things into reqmgr2? Yes, it works fine in both testbed and production

Merging Scripts

  • Allie and Matteo are going to help us improve the resubmit.py and rejec.py scripts so that they have an option to increase the maximum memory.
    • Set maxRSS to memory x 1024 instead of adding 15 MB. The watchdog will kill the job before it gets to the level we are requesting, so we should round up memory
  • The assign script needs work
  • Need to double check that the clone script copies the splitting from the parent workflow
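
The maxRSS suggestion above amounts to a unit conversion with rounding. A minimal sketch, assuming the requested memory is given in MB and maxRSS is expressed in KB (as the factor of 1024 suggests); the function name and the 100 MB rounding granularity are illustrative assumptions, not taken from resubmit.py:

```python
import math


def max_rss_kb(memory_mb, round_to_mb=100):
    """Round memory up to a multiple of round_to_mb (in MB), then convert to KB.

    round_to_mb is a hypothetical granularity chosen for illustration.
    """
    rounded_mb = math.ceil(memory_mb / round_to_mb) * round_to_mb
    return rounded_mb * 1024


print(max_rss_kb(2300))  # already a multiple of 100 MB: 2300 * 1024 = 2355200
print(max_rss_kb(2310))  # rounds up to 2400 MB: 2400 * 1024 = 2457600
```

Rounding up before converting keeps the limit at or above the requested memory rather than relying on a fixed 15 MB margin.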

RelVal - Andrew

AOB

-- JenniferAdelmanMcCarthy - 2016-06-02

Topic revision: r5 - 2016-06-02 - JenniferAdelmanMcCarthy
 