Workflow Team Meeting - July 3 4PM CERN time

Vidyo Link

Attending

  • FNAL - Jen, Luis, SeangChan, Jorge
  • CERN - Julian, Andrew
  • Ajit

Personnel

June 26 - July 3 Jasper
July 3 - July 10 Sara

(Or tell Oli | CW | CP)
  • Jen will be taking off July 28 - Aug 8 - may have limited access evenings
  • Julian Badillo July 24 - July 25
  • Dave mid-July - do we have dates yet?
  • Dave will be at jury duty next Tuesday - hope he doesn't get picked
  • Note: Krista will be going on maternity leave sometime in July

News

  • Welcome Jorge (the new Lucy) and Juan! (1/2 developer, 1/2 grid)
  • Spring14dr and Fall13 take highest priority, period. The CSA14 deadline is July 1
    • we have the go ahead to force complete them at 90%
    • Fall13: aggressively cut at 90%. Spring14dr: let it finish in the tails
    • big set of requests for miniaod: the release came out 24 hrs ago and the requests are not in yet.
      • Dave is planning on handling them himself. There will be a few hundred of them.
      • They are very fast, but in the tests we have done they are iffy.
      • They read the AODSIM, which we have been placing at the T2s. We want to run where it is already on disk. We have never run reprocessing-type wfs at T2s, and it will require changes to the assignment script.
      • The assignment script makes the dataset name changes, but it is designed for Digi/reco. We don't need the pileup string here; we are just stripping info out of the modified file. It takes off the era and version number and propagates the rest as it is.
      • It is like a skim, but we are keeping all events, so we should get 100% in the outputs.
      • The name of the campaign is spring14miniaod.
      • They will be assigned to the low-prio reprocessing and given a high priority so they move ahead.
      • Dave is assuming the requests will come in tomorrow, a little after midnight CERN time, and we will get things going and breaking things.
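
A minimal sketch of the 90% force-complete check mentioned above. The `Workflow` record, its field names, and the example names are hypothetical; only the 90% threshold comes from these notes, and nothing here reflects the actual tooling.

```python
from dataclasses import dataclass

# The 90% figure comes from the notes above; everything else is illustrative.
FORCE_COMPLETE_THRESHOLD = 0.90


@dataclass
class Workflow:
    name: str              # hypothetical workflow name
    campaign: str          # e.g. "Fall13" or "Spring14dr"
    events_done: int       # events produced so far
    events_requested: int  # events requested


def completion(wf):
    """Return the fraction of requested events produced (0.0 if nothing was requested)."""
    if wf.events_requested <= 0:
        return 0.0
    return wf.events_done / wf.events_requested


def force_complete_candidates(workflows, threshold=FORCE_COMPLETE_THRESHOLD):
    """Return the names of workflows at or above the completion threshold."""
    return [wf.name for wf in workflows if completion(wf) >= threshold]
```

Filtering the input by `campaign` first would let the same check be applied only to the campaigns being cut.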

Jasper's notes

  • Open GGUS ticket to Lisbon: https://ggus.eu/?mode=ticket_info&ticket_id=106587 -> they are solving it (waiting for consistency check to clean up the site).
  • Keep an eye on T2_UA_KIPT, which has had no running jobs in the past days.
    • Site is not in drain but has not been stable for a while. When John comes in we will discuss putting it in drain with him.

Site support

Agent Issues

  • upgrading couch on all agents
    • vocms237, vocms202, vocms85, vocms142
    • would like to upgrade one of the MC agents on Thurs to see if anything breaks before the long weekend.
      • Go with vocms216, it's going to be redeployed on Fri.
      • need to make sure that we are pointing the deployment to the proper version of couch
  • draining and upgrading agents: 235 in drain - probably redeployed on Fri (after SeangChan updates couch).
  • cmssrv98 & 112: do we move them to new teams?
    • may affect small HP requests, but this is a step toward moving to one team. Dave will talk to Andrew L; if Andrew is OK with it, they will go on the next-to-redeploy list and be redeployed as mc/lowprio machines

Workflows

  • jobs sitting in acquired for long periods of time.
    • so far it seems like they are legit, we just have a lot of work but we need to keep an eye on this
  • Can we look at the WFs in acquired and their priority, and at how old the WFs in running are?
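
One possible way to tabulate this, as a rough sketch only. It assumes each workflow is available as a dict with hypothetical keys `name`, `status`, `priority`, and `status_since` (a timezone-aware datetime); how that data would be pulled from the request manager is not covered here.

```python
from datetime import datetime, timezone


def summarize_by_status(workflows, now):
    """Group workflows by status as (name, priority, age in days), oldest first.

    The dict keys used here are assumptions made for illustration.
    """
    summary = {}
    for wf in workflows:
        age_days = (now - wf["status_since"]).total_seconds() / 86400.0
        row = (wf["name"], wf["priority"], round(age_days, 1))
        summary.setdefault(wf["status"], []).append(row)
    for rows in summary.values():
        # Oldest entries first, so long-sitting workflows stand out.
        rows.sort(key=lambda r: r[2], reverse=True)
    return summary
```

Looking at the top rows of the `acquired` group next to their priorities would show whether the long waits are just low-priority work queued behind a backlog.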

ReDigi

  • Mostly under control.

Store Results

  • about to get busier with miniaodsim.
    • right now in CSA14 there are 2 pileups produced using crab3; someplace they have a list of ~12 datasets that they plan on promoting
  • received almost 60 wfs last week. Here is the working table: Google Docs
    • 6 wfs with no successful jobs - Luis found the problem but needs to confirm it: file access issues (a permission problem)
    • 5 wfs with 200% of events (is this normal?) - Luis will dig more

MonteCarlo

  • Some Fall13 with filter efficiency issues:
    • How to spot them? They have almost 100% of lumis but a much lower event %. Examples: Elog
    • 3 workflows affected.
  • 9 Fall13 still in the system, but doing OK.
  • Big workflows:
    • SMP-Summer12-00013 (60M events) was cloned on Friday (80% of events by now)
    • BPH-Summer12-00166 (20M events) was cloned on Tuesday (50% of events by now)
  • everything else is flowing.
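
The "almost 100% of lumis but much lower event %" rule of thumb for filter efficiency issues could be sketched as below. The two thresholds and the example workflow names are assumptions picked for illustration, not agreed values.

```python
def suspect_filter_efficiency(workflows, min_lumi_pct=95.0, max_event_ratio=0.8):
    """Flag workflows whose lumis are nearly complete but whose events lag far behind.

    workflows: iterable of (name, lumi_pct, event_pct) tuples.
    min_lumi_pct and max_event_ratio are illustrative cut-offs.
    """
    flagged = []
    for name, lumi_pct, event_pct in workflows:
        lumis_nearly_done = lumi_pct >= min_lumi_pct
        events_lagging = event_pct < lumi_pct * max_event_ratio
        if lumis_nearly_done and events_lagging:
            flagged.append(name)
    return flagged


# Hypothetical numbers: only wf_a has ~100% of lumis with a much lower event %.
print(suspect_filter_efficiency([
    ("wf_a", 99.0, 60.0),
    ("wf_b", 99.5, 98.0),
    ("wf_c", 50.0, 30.0),
]))
```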

RelVal Andrew

-- JenniferAdelmanMcCarthy - 02 Jul 2014

Topic revision: r5 - 2014-07-03 - JenniferAdelmanMcCarthy
 