Workflow Team Meeting - April 14, 4 PM CERN time, 9 AM FNAL time

Vidyo Link

Attending

  • FNAL: SeangChen, Gaston, Allie, Matteo, Jesus, Jorge, Dave
  • US:
  • CERN : Paola, Dima, Sebastian, Allan, JeanRoch
  • Korea :

Personnel

  • Youn on shift
  • Jorge to Colombia April 15-May 2, Talk on April 27
  • Gaston to Colombia in early May; dates yet to be determined.
  • Allie will be gone April 29 through May 8.

News - Dima

  • Fermilab storage downtime next Tuesday, April 19, while we switch over to the new storage
    • We need to do a full drain of T1_US_FNAL
    • Full replication has already happened from the full disk instance to the new one. During the drain we will be copying over the last remaining files that are being produced.
    • Unmerged is being moved to dCache
    • Dave is recommending we start draining Friday night
    • This will also affect other sites that read their input from FNAL, but only during the actual downtime itself; jobs will fail their file reads during this time. We are draining so that we don't end up with data that is not on disk; while we drain, we will be copying the last bits over.
    • Submitter nodes will be up and still running.
  • We need to make sure that DR80 is running at highest priority; if it is not, post an elog and ping Dima

3 top issues affecting production

  • Issues assigning ACDCs for merge steps for TaskChain workflows.
    • Paola and Alan are fixing the script; they got my ACDCs running, but it doesn't look like the ones I tried are moving yet.
    • Paola will pull Allie's script and merge the changes together.
  • T0 testing has been shut down because the T0 needed their slots.
    • Not yet ready to run production; we need to make changes to protect T0. It's safe to start testing again. John and Alan are doing the testing.
  • Things are running smoothly; the only thing holding us back is getting data in place.
    • What is making things slow is that we are pulling data off tape to random T2s; we can pull data off tape much faster when we are putting it on our own disk. Pulling to T2s means that tapes get dismounted in the meantime, so our tape transfer time is dominated by tape swapping, which slows us down. Buffer to destination is slow. We've had long queues in PhEDEx and are trying to figure out what is causing them; the best we can figure is that data is being requested inefficiently. Dave is going to try to get something more concrete.

Site support - Gaston

Waiting Room / Morgue

  • Into the Waiting Room: T2_BR_UERJ, T2_PT_NCG_Lisbon, T2_UK_SGrid_Bristol, T2_BR_SPRACE

  • Out of the Waiting Room: T2_FI_HIP, T2_EE_Estonia, T2_DE_RWTH

Sites in Waiting Room: 11; Sites in Morgue: 9

Transfers - Jorge

  • Do we have a way to watch transfers by campaign? Not really, but it would be useful to have for Dima. Russian sites are having some problems transferring files.
    • A better way to check is to look for DR80 in the workflow name and check those. Jorge now knows the two JSON files he needs to cross-correlate to get the info he needs.
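
The check described above (select workflows with DR80 in the name, then cross-correlate two JSON files) could be sketched roughly as follows. This is only an illustration: the minutes do not say which two JSON files are meant or what they contain, so the two inline documents, their schemas, the dataset names, the site names, and the function `campaign_transfers` are all hypothetical stand-ins.

```python
import json

# Hypothetical stand-ins for the two JSON files; the real files and their
# schemas are not described in these minutes.
# Assumed file 1: workflow name -> list of datasets it is waiting for.
WORKFLOWS_JSON = """
{
  "pdmvserv_task_TOP-RunIISpring16DR80-00001__v1_T_160331_151408_3872": ["/Example/DatasetA/TIER"],
  "vlimant_SUS-RunIIWinter15pLHE-00354_00157_v0__160411_144741_6938": ["/Example/DatasetB/TIER"]
}
"""
# Assumed file 2: dataset -> transfer status (made-up fields and values).
TRANSFERS_JSON = """
{
  "/Example/DatasetA/TIER": {"destination": "T2_XX_Example", "percent_complete": 40},
  "/Example/DatasetB/TIER": {"destination": "T2_YY_Example", "percent_complete": 100}
}
"""


def campaign_transfers(workflows, transfers, tag="DR80"):
    """Return {workflow: [(dataset, transfer_info), ...]} for workflows whose name contains tag."""
    result = {}
    for name, datasets in workflows.items():
        if tag not in name:
            continue  # not part of the campaign we are watching
        # Keep only datasets that actually appear in the transfer file.
        result[name] = [(ds, transfers[ds]) for ds in datasets if ds in transfers]
    return result


report = campaign_transfers(json.loads(WORKFLOWS_JSON), json.loads(TRANSFERS_JSON))
for name, rows in report.items():
    for dataset, info in rows:
        print(name, dataset, info["destination"], info["percent_complete"])
```

Matching on a substring of the workflow name is the approach suggested in the meeting; it assumes the campaign tag always appears in the workflow name.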

Workflows

  • Multicore on T2s: pdmvserv_task_TOP-RunIISpring16DR80-00001__v1_T_160331_151408_3872
    • Getting successes on the ACDC; still some submit failures, but better
  • vlimant_SUS-RunIIWinter15pLHE-00354_00157_v0__160411_144741_6938: 500 events per job; then it's their problem, so we send it back
  • Harvest workflows - even with the largest memory requirement it failed, so we sent it back. What do we do with the rest?
    • Set to maxRSS and MaxMemory; if they still fail, we send them back

ReDigi

MiniAOD

TaskChains

StepChain

  • NA

Rereco

Store Results

MonteCarlo

Agent Issues

  • Having to restart TaskArchiver - Alan is going to look into it. Just keep restarting for now.
  • Which FNAL machines have we rebooted? Make a list so we know what we have left, and post it to the thread.

Agent redeployment

  • vocms311 and 219 are the only ones running the old code; both are in drain and ramping down
  • FNAL machine 217 was having problems restarting components, so we need to look at that machine and figure out what is going on. Is there a disk issue? The machine has been set to drain; we should have Krista look at it.

RequestMgr2 Migration

Merging Scripts

RelVal - Andrew

AOB

-- JenniferAdelmanMcCarthy - 2016-04-12

Topic revision: r5 - 2016-04-14 - JenniferAdelmanMcCarthy