Workflow Team Meeting - May 7, 4PM CERN, 9AM FNAL time

Vidyo Link

Attending

  • FNAL: Jen, SeanChan
  • US: Ajit
  • CERN : Julian, Dima, Alan and Andrew
  • EU:

Personnel

  • Luis in Colombia around May 1-15, working remotely (very limited availability)
    • John will be in charge of the T0 while Luis is gone
  • Matteo will be at CERN May 4-14
  • Jen will not be available the weekend of May 16-17

News - DIMA

  • Run2 74X DigiReco is still planned to get started on May 8th.
    • Expect a bunch of requests tomorrow. Discussed with Jean-Roch: we are not going to jump on them right away; we want to see the requests but filter how we put them in.
    • We don't want to kill the system on Friday. It's redigi, so we need to transfer all the data.
    • It is urgent, but efficiency is more important, so we will put in just a few requests, make sure they don't blow up, and do the rest on Monday.
  • TP2023HGCALDR jobs need more memory than was requested. 2.6GB should be enough, but something close to 3.0GB is safer. Need to check if we will get enough pilots.
    • We need to understand how far we can go in increasing memory requirements on ACDC, and at which sites we can do this, so that we know where we can submit the ACDCs. The requested memory parameter has to be put in correctly; otherwise the jobs will go into condor, not find matches, and be stuck.
    • We need to ACDC as normal, then clone the ACDC and adjust the memory in the clone. These ACDCs will be Julian's babies for now, until he gets a procedure down that works.
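
A minimal sketch of the clone-and-adjust step described above, assuming an ACDC request can be handled as a plain dictionary with a "Memory" entry in MB. The dictionary layout, the "ClonedFrom" key, the site names, and the per-site limits are all hypothetical and only illustrate the idea; this is not the actual WMAgent or ACDC tooling.

```python
import copy


def clone_with_memory(acdc_request, memory_mb):
    """Return a copy of an ACDC request with its Memory value replaced.

    The request is assumed to be a plain dict; "Memory" is assumed to be in MB.
    """
    if memory_mb <= 0:
        raise ValueError("memory_mb must be positive")
    clone = copy.deepcopy(acdc_request)
    clone["Memory"] = memory_mb
    # Hypothetical bookkeeping key so the clone can be traced back.
    clone["ClonedFrom"] = acdc_request["RequestName"]
    return clone


def sites_with_enough_memory(site_limits_mb, memory_mb):
    """List the sites whose (assumed) per-job memory limit covers the request."""
    return sorted(s for s, limit in site_limits_mb.items() if limit >= memory_mb)


# Illustrative values only: 3000 MB corresponds to the "close to 3.0GB" above.
acdc = {"RequestName": "example_ACDC", "Memory": 2300}
clone = clone_with_memory(acdc, 3000)
sites = sites_with_enough_memory({"T1_EXAMPLE_A": 4000, "T2_EXAMPLE_B": 2500}, 3000)
```

Filtering the site list before submitting is one way to avoid jobs that sit in condor without ever matching a slot.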

3 top issues affecting production

  • Xrootd fallback seems broken for CMSSW builds based on ROOT6 (CMSSW_7_4_x) when the first open is attempted with DCAP
    • Problem identified by framework/ROOT experts
  • Broken merges at FNAL
    • mostly affected RelVal; a few ReDigis were affected too, but they were far overshadowed by the Xrootd issue
    • we recovered all the RelVals by running at CERN
    • did we do a full post-mortem at FNAL to determine what happened and how to avoid it in the future? Jen will ask at the afternoon meeting
  • patches for agents to keep them stable again - JobUpdater, couch and couch DB were not synced; it's fixed now

Site support - John

Workflows

  • New close out script and pages:
    • now that we've lived with it for two weeks, how is it going? We are happy

ReDigi

  • RunIISpring15Digi74 workflows
  • Julian, tell us about the script you are using to mass-do ACDCs. When I have 50-60 workflows to ACDC, automating this could be a carpal tunnel saver.
    • How do you select which workflows to put into the script? How are you filtering the workflows to know ACDC isn't already running?
    • Do you have to make the ACDCs manually, or is it just an assignment script?
      • ACDC still needs to be produced manually
    • What are the provisions for adjusting parameters? In this case it was xrootd, but sometimes we have mass failures due to issues where we need to change splitting etc. Can this handle it?
      • The script currently just uses the original parameters, so it will be useful for xrootd/file read issues, but anything else we still have to do the old way.
    • Will it run on all types of WFs? It doesn't work on TaskChain but will work for others.
    • Julian: I was using JR's assistance page; right now I create the ACDC manually and assign by script.
      • Next step: create ACDCs automatically (using a wf list and the taskname to ACDC)
      • There is an old script for that, but it needs some serious refactoring.
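
One way the selection questions above could be sketched, assuming each workflow is described by a small dictionary and that the set of workflows with an ACDC already in progress is known. The field names and example workflow names are hypothetical; this is not Julian's script.

```python
def select_for_acdc(workflows, acdc_in_progress):
    """Pick workflow names that are eligible for a scripted ACDC.

    Skips TaskChain workflows (the script does not handle them) and
    workflows that already have an ACDC in progress.
    """
    selected = []
    for wf in workflows:
        if wf["type"] == "TaskChain":
            continue
        if wf["name"] in acdc_in_progress:
            continue
        selected.append(wf["name"])
    return selected


# Illustrative input only.
workflows = [
    {"name": "wf_a", "type": "ReDigi"},
    {"name": "wf_b", "type": "TaskChain"},
    {"name": "wf_c", "type": "ReDigi"},
]
eligible = select_for_acdc(workflows, acdc_in_progress={"wf_c"})
print(eligible)
```

Keeping the selection separate from the assignment makes it easy to review the list by hand before anything is submitted.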

TaskChains

  • nothing to report

Rereco

  • ongoing cosmic processing: where are we? They are only 2-3% done; not sure why they are stuck

Store Results

  • nothing to report

MonteCarlo

  • nothing to report

Agent Issues

  • ErrorHandler keeps crashing and we don't understand why. Does anyone know what is going on? For now just keep restarting it, but if it keeps happening we need to bug Diego.
  • Yesterday Faruuk updated condor to 8.3.5, so if we see weird condor errors, report them ASAP

Redeployment Plan

production SL6

FNAL                               CERN
cmsgwms-submit1 (ready to wake)    vocms0308 (ready to wake)
cmsgwms-submit2 (up)               vocms0309 (up)
cmssrv217 (ready to wake)          vocms0310 (up)
cmssrv218 (ready to wake)          vocms0311 (up)
cmssrv219 (ready to wake)

RelVal - Andrew

  • The issue about separation of Workflow is closed; it will go into the next release
  • Random seeds GitHub issue: not being worked on. There were lots of messages 17 days ago, but we don't have a conclusion yet. Dirk is against it, Andrew is for it. Need more feedback.
  • Duplicate dataset issue: if we have 2 workflows writing to the same dataset we get duplicates, and then we need to delete parts, making the dataset a mess.
    • It would be good to check when the request is created, but that isn't always possible. We should have a "write to existing dataset" flag in the workflow. Without the flag, the system checks whether the dataset is already there and, if it is, does not allow the new workflow to go in.
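
The proposed flag could behave roughly as in the sketch below. The function names, the `write_to_existing` argument, and the example dataset names are hypothetical, and the list of existing datasets is assumed to be available at request-creation time, which the discussion notes is not always the case.

```python
def conflicting_datasets(output_datasets, existing_datasets, write_to_existing=False):
    """Return the output datasets that already exist and would be written to again.

    With write_to_existing=True the request explicitly opts in, so nothing
    is reported as a conflict.
    """
    if write_to_existing:
        return []
    existing = set(existing_datasets)
    return [d for d in output_datasets if d in existing]


def can_accept(output_datasets, existing_datasets, write_to_existing=False):
    """A request is accepted only if it has no dataset conflicts."""
    return not conflicting_datasets(output_datasets, existing_datasets, write_to_existing)


# Illustrative dataset names only.
existing = ["/PrimaryA/Processed-v1/TIER"]
new_outputs = ["/PrimaryA/Processed-v1/TIER", "/PrimaryB/Processed-v1/TIER"]

rejected = can_accept(new_outputs, existing)
allowed = can_accept(new_outputs, existing, write_to_existing=True)
```

Returning the list of conflicting datasets, rather than only a yes/no answer, would let the requester see exactly which output needs a new name or the explicit flag.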

L3 discussion - Ajit, Jean-Roch, Matteo (issues held over from last week)

  • JR: How does the "Memory" parameter get passed along?
    • We are running with whatever we can at the moment. It will fail with a high percentage, so we will need to do hacking at the ACDC stage; Julian will do the hacking and see if we can get this working.
    • It comes from MCM (known from the heavy ion workflows) and goes through WMAgent to the jobs. It is not something we can easily toggle; it will have to be done by script: ACDC the workflow, then clone the ACDC with a script that manually adjusts the parameter.
  • JR: How do we catch stuck requests easily?
  • JR: Should we abandon the redigi/MC separation and just talk about workflows to be done? Technically there is no separation anymore.
    • for the L3 level we keep separation, but workflow team will keep separation to keep from stepping on each other's toes.
  • From Dave: Long tails: is that caused by the wider whitelist? Is it an issue? Again on the wider whitelist: if everything is running everywhere and one site is affected by some issue, this will create problems for a large number of workflows, namely the ones running there. How can we deal with this?
  • Matteo: Is it possible to use the T0 for processing?

Opportunistic Resources - Stefan

  • We finally have jobs running at NERSC! I submitted a test workflow on Friday, which is progressing nicely there. We're still testing the site though, so it is not yet part of production.

HLT

  • Matteo: did we succeed in running MC/ReDigi? Is the EOS permission issue solved?

SDSC

  • JR: 1M events/day observed, and going "down". Is anything happening with the site?

Automatic Assignment And Unified Software

AOB

-- JenniferAdelmanMcCarthy - 2015-04-30

Topic revision: r7 - 2015-05-07 - JulianBadillo