Workflow Team Meeting - July 17 4PM CERN time

Vidyo Link

Attending

  • Ajit, Luis C, Seangchan, Alan, Andrew Levin, Dave, Jen, Joe McCartin, Julian

Personnel

July 10 --> July 17 Jasper
July 17 --> July 24 Joseph
(Or tell Oli | CW | CP)
  • Jen will be taking off July 28 - Aug 8 and may have limited access in the evenings
    • Juan will be working as a US Shifter while Jen is on vacation, so we will have eyes on the system.
    • Juan, please send your IM (gtalk/skype/etc.) info to the workflow team.
  • Julian Badillo July 24 - July 25
  • Dave is not going on vacation, but will not be available on Tuesdays until October.
  • Note: Krista is on maternity leave until September.

News

  • Oli's notes on what is coming up:
  • To bring everyone up to speed, we are expecting two important things in the next days in addition to whatever is going on already:
    • pre-mixing
      • pre-mixing is very important to us (computing) because it reduces the I/O load of digi/reco workflows with a lot of PileUp events
      • you do the heavy I/O in combining many MinBias events once
      • you do the digi/reco many times, reading only one pre-mixed MinBias event per signal event => low I/O (see the rough sketch after this list)
      • we expect a high-statistics test workflow (the software is not completely validated yet, there are still problems; see the presentations in the PPD meeting tomorrow, Wednesday, July 16th)
      • step 1: we have to run the high I/O load pre-mixing, as many events as the planned signal sample
      • step 2: we have to use the pre-mixed sample and a signal sample to run digi/reco
      • the current proposal is to run the 25M-event ttbar sample
    • upgrade workflows
      • we are expecting, at the end of this week or the beginning of next week (very soon), a complete digi/reco pass of all upgrade samples
      • these have 140 or even 200 PileUp events, a significant I/O load for the sites
      • they will not be run in pre-mixing mode
      • this is very important for the collaboration, because it will define the physics case for the upgrade technical proposal which will be used by the funding agencies to decide if we get the money
      • although we are close to restarting the LHC with the most important run of the machine at 13 TeV, this is equally important, as it defines the far future of the experiment (sorry, I cannot say this in easier words, it is complicated)
    • We need to figure out how to do these things:
      • we need to choose sites that can stand high I/O load for the upgrade samples and the pre-mixing
      • we need to choose the site for the 2nd step of the pre-mixing workflow, taking into account latencies to transfer the pre-mixed MinBias
      • we need to ask for the upgrade input sample list and make sure that they are on disk at appropriate sites with high I/O capabilities
      • Here is already a proposal:
      • FNAL and CERN: upgrade
      • CNAF: 1st and 2nd step of pre-mixing
      • Or can we run upgrade at more T1 sites without impacting them too much? Or should we run the pre-mixing steps at separate sites? All this taking into account the already existing workload at the sites.
      • Before the requests come in, we will make sure that there is a clear and documented prioritization.
      • Let's discuss here and find a solution for everything above so that we can succeed.
        • In the e-mail discussion that followed, it was suggested that we use UCSD for processing, but it is not ready yet. John???
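To make the I/O argument above concrete, here is a rough back-of-the-envelope sketch (illustrative only, not production code; the PileUp values and sample size are just taken from the points above):

<verbatim>
# Rough comparison of PileUp-related event reads in the digi step:
# classic mixing reads roughly <PU> MinBias events per signal event,
# pre-mixing reads a single pre-mixed event per signal event.
# Illustrative numbers only; not how the production system counts I/O.

def minbias_reads(n_signal, pileup, premixed):
    """Approximate number of PileUp-related event reads in the digi step."""
    return n_signal * (1 if premixed else pileup)

n_signal = 25 * 10**6          # e.g. the proposed 25M-event ttbar sample
for pu in (140, 200):          # upgrade-like PileUp scenarios
    classic = minbias_reads(n_signal, pu, premixed=False)
    premix = minbias_reads(n_signal, pu, premixed=True)
    print("PU=%d: classic mixing ~%.1e reads, pre-mixing ~%.1e reads "
          "(factor %d less I/O in step 2)" % (pu, classic, premix, classic // premix))
</verbatim>

The price, as noted above, is that the heavy I/O of combining the MinBias events is paid once, up front, in the step-1 pre-mixing workflow.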
What we should know:
  • Pay attention to the location of the pileup samples.
  • Upgrade workflows will be high priority, but pre-mixing is higher.
  • Follow the normal ACDC procedures (finer splitting, longer timeouts).

Jasper's notes

  • Joseph is coming on shift
  • Keep track of the elogs and the new operators' twiki.

Site support

Agent Issues

  • Do we understand why 98 sucked up all the work when it was redeployed?
    • In principle it is about threshold tweaking; let's see how the new agents behave.
  • NoJobReport error reported on the T2s elog.
    • vocms201 and 142 are affected; it looks like a network problem.
    • Luis L should back up Alison while she is on vacation.

Redeployment plan

  • 98 and 216 are both draining
    • 216 may be redeployed tomorrow (only log-collects pending for T2_DE_RWTH).
    • 98 was drained and redeployed into the mc team, but then sucked up all the work and filled its disk.
  • 112 was redeployed today --> reproc_lowprio team.
  • vocms237 (step0) was redeployed; however, PPD people mentioned they need the hack anyway.

Workflows

MiniAODs

  • ramping down, will probably need site reassignment.

ReDigi

  • A large number of workflows are waiting for DBS blocks.
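As a side note, a quick way to see whether the blocks of a given input dataset are all closed in global DBS is sketched below; this assumes the DBS3 python client (dbsClient) and a valid grid proxy are available, and the dataset name is just a placeholder:

<verbatim>
# Minimal sketch: list the blocks of a dataset in global DBS and report
# which are still open for writing. Assumes the DBS3 client is installed
# and a grid proxy is in place; the dataset name below is a placeholder.
from dbs.apis.dbsClient import DbsApi

dbs = DbsApi(url="https://cmsweb.cern.ch/dbs/prod/global/DBSReader")

dataset = "/SomePrimaryDataset/SomeProcessedDataset/AODSIM"   # placeholder
blocks = dbs.listBlocks(dataset=dataset, detail=True)

open_blocks = [b["block_name"] for b in blocks if b.get("open_for_writing")]
print("%d blocks total, %d still open for writing" % (len(blocks), len(open_blocks)))
for name in open_blocks:
    print("  still open: %s" % name)
</verbatim>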

Rereco

  • Some workflows are waiting and ready to be announced.

Store Results

  • Retried workflows for T3_US_UMD
  • Waiting for these PhEDEx subscriptions to close out workflows: 419205 and 419334
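One way to keep an eye on these two subscriptions, assuming 419205 and 419334 are PhEDEx request IDs and that the data-service 'subscriptions' JSON API is queried directly (the endpoint and field names below are assumptions, and cmsweb may require a grid certificate):

<verbatim>
# Rough sketch: poll the PhEDEx data service for the completion of the two
# subscription requests mentioned above. The request IDs come from the note
# above; the JSON field names are assumptions and may differ.
import requests

DATASVC = "https://cmsweb.cern.ch/phedex/datasvc/json/prod/subscriptions"

for request_id in (419205, 419334):
    reply = requests.get(DATASVC, params={"request": request_id}).json()
    for ds in reply.get("phedex", {}).get("dataset", []):
        for sub in ds.get("subscription", []):
            print("%s  %s at %s -> %% files complete: %s"
                  % (request_id, ds.get("name"), sub.get("node"), sub.get("percent_files")))
</verbatim>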

MonteCarlo

  • Everything is flowing, aside from the already mentioned hiccups.
  • Ajit asked whether someone set some datasets to VALID, since some closed-out workflows were expected.
    • The procedure states that Jen & Julian leave workflows once they reach closed-out; after that they are Vincenzo/Ajit's or Andrew/Dave's responsibility.

RelVal Andrew

-- JenniferAdelmanMcCarthy - 09 Jul 2014
