Workflow Team Meeting - Feb 4, 4PM CERN, 9AM FNAL time

Vidyo Link

Attending

  • FNAL: Jen, Jorge, Gaston, Matteo
  • US:
  • CERN : Alan, Paola, JeanRoch, Dima

Personnel

  • Rokas - new site support team member
  • New Julian starting Feb 1 - Paola
  • JR in Zurich Feb 18-20
  • Jen to CERN Feb 29-March 4 - tickets are booked!
  • Possible training sessions Feb 8-12 - ND student Alison, Matteo, Paola
    • Paola waiting on FNAL accounts, problems with grid certs - Alan will help

News - Dima

  • Nothing major coming up
  • Amazon news - we have our first workflow through! We need to make the datasets available now.
    • not all the scripts are being maintained in github so not sure if the scripts Dave found on Github to publish still work.

3 top issues affecting production

  • ReReco WFs that have 0 failures but are only 99% complete
    • Do we know why this is happening? Is the workflow being launched before all the lumis are in place? If so, just wait the little extra time it takes to have all the data in place before submitting. Yes, it may cause a couple of hours of extra delay at the onset, but it will save us days of recovery time!
    • example: fabozzi_Run2015A-HINMuon-boff-27Jan2016_763p2_160128_143210_5683
    • all failures were log collect or cleanup - but 99.7% recovery
    • Created JSON recovery-0-fabozzi_Run2015A-HINMuon-boff-27Jan2016_763p2_.json for recovery of ['/HINMuon/Run2015A-27Jan2016-v1/MINIAOD']
    • This will recover 1 lumis in 1 files
    • Created JSON recovery-1-fabozzi_Run2015A-HINMuon-boff-27Jan2016_763p2_.json for recovery of ['/HINMuon/Run2015A-27Jan2016-v1/DQMIO']
    • This will recover 1 lumis in 1 files
    • Created JSON recovery-2-fabozzi_Run2015A-HINMuon-boff-27Jan2016_763p2_.json for recovery of ['/HINMuon/Run2015A-27Jan2016-v1/AOD']
    • This will recover 3 lumis in 3 files
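
The recovery listed above boils down to a set difference between the lumis expected from the input and the lumis present in the output. A minimal sketch, assuming lumis are represented as (run, lumi) pairs; the function names, the run number and the sample values are hypothetical and are not taken from the actual recovery tooling:

```python
def missing_lumis(expected, produced):
    """Return the (run, lumi) pairs in `expected` that are absent from `produced`."""
    return sorted(set(expected) - set(produced))


def completion_fraction(expected, produced):
    """Fraction of the expected lumis that are present in the output."""
    expected = set(expected)
    if not expected:
        return 1.0
    return len(expected & set(produced)) / len(expected)


# Hypothetical example: 1000 expected lumis in one run, 3 missing from the output.
expected = [(1, lumi) for lumi in range(1, 1001)]
produced = [pair for pair in expected if pair[1] not in (17, 250, 999)]

to_recover = missing_lumis(expected, produced)
fraction = completion_fraction(expected, produced)
```

Running the same comparison on the input before submission would show whether all the lumis are in place yet.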

  • (JR) "file mismatch" https://cms-logbook.cern.ch/elog/Workflow+processing/23012 : who should do what? This is holding several workflows from announcement.
    • Jorge is going to look into this - the files were there, but the DBS injector didn't like them
    • check which blocks are missing
  • (JR) "duplicates" https://cms-logbook.cern.ch/elog/Workflow+processing/23006 : who should do what? This is holding the second oldest DR.
  • (JR) Non-runnable ACDCs (GQ location not in the site whitelist) are creating tails. Is that solved, so that we do not get any more of those? Many were aborted to let go of workflows above 95%.
  • (JR) AcqEra=None ACDCs are creating a mess of output and creating tails. Is that solved, so that we do not get any more of those? Do we have a GH issue to fix this?
  • (JR) Long-acquired ACDCs, even with very high priority, are creating tails: https://cms-logbook.cern.ch/elog/Workflow+processing/23014 What can be the reason? (The GQ is intersecting the site whitelist, and the site is not in drain.)
  • (JR) Are there any recent ACDCs that can be automated?
    • (JR) Where does one programmatically retrieve the sites that created the error (wmstats doc?) so that they can be automatically checked against drain status => for reject and clone.
    • Any other source of failure recently?
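
The drain-status check asked about above could look roughly like the sketch below. It assumes the per-site failure counts and the list of sites in drain have already been retrieved (how to get them is exactly the open question). The function names, the site names and the rule "every failure came from a site in drain" are all hypothetical, not an agreed policy:

```python
def failing_sites_in_drain(failures_by_site, draining_sites):
    """Return the sites that produced failures and are currently in drain."""
    draining = set(draining_sites)
    return sorted(site for site, count in failures_by_site.items()
                  if count > 0 and site in draining)


def reject_and_clone_candidate(failures_by_site, draining_sites):
    """True when every failure came from a site in drain (assumed rule)."""
    failing = {site for site, count in failures_by_site.items() if count > 0}
    return bool(failing) and failing <= set(draining_sites)


# Hypothetical input: placeholder site names, made-up failure counts.
failures = {"T2_XX_SiteA": 12, "T2_XX_SiteB": 0, "T2_XX_SiteC": 3}
draining = {"T2_XX_SiteA", "T2_XX_SiteC"}

in_drain = failing_sites_in_drain(failures, draining)
candidate = reject_and_clone_candidate(failures, draining)
```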

Site support - Gaston

  • T2_BR_UERJ has not been failing since 02/02, currently 184 successful jobs.
  • T2_CH_CSCS: 87.28% success since 02/02. I will check the failed jobs to see why they are failing.
  • T2_PK_NCP: 88.7% success since yesterday.

| Date       | Site              | Into Waiting Room | Out of Waiting Room | Into Morgue | Out of Morgue |
| 2016-01-27 | T2_UK_SGrid_RALPP | x                 |                     |             |               |
| 2016-01-27 | T2_US_UCSD        | x                 |                     |             |               |
| 2016-01-27 | T2_IN_TIFR        |                   | x                   |             |               |
| 2016-01-27 | T2_RU_INR         |                   | x                   |             |               |
| 2016-01-29 | T2_DE_RWTH        | x                 |                     |             |               |
| 2016-01-30 | T2_US_UCSD        |                   | x                   |             |               |
| 2016-01-31 | T2_TH_CUNSTDA     |                   |                     | x           |               |

  • Sites in Waiting Room: T2_RU_IHEP, T2_RU_SINP, T2_DE_RWTH, T2_UK_SGrid_RALPP
  • Sites in Morgue: T2_PL_Warsaw, T2_TR_METU, T2_RU_PNPI, T2_RU_ITEP, T2_MY_UPM_BIRUNI, T2_RU_RRC_KI, T2_TH_CUNSTDA.

Transfers - Jorge

  • (JR) Can we work out an automatic diagnosis of stuck_transfer.json? Some of the issues are known, others aren't.

Workflows

  • (JR) Is there an alert in wmstats for acquired workflows with GQ elements on sites that are no longer available?
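
The condition such an alert would test can be sketched as follows: a GQ element has nowhere to run when its locations, intersected with the site whitelist and with unavailable sites removed, come out empty. This is only an illustration. The element layout (a dict with "id" and "locations"), the function names and the site names are hypothetical and do not reflect the wmstats or WorkQueue data model:

```python
def runnable_sites(gq_locations, site_whitelist, unavailable_sites=()):
    """Sites where an element can still run: whitelisted, holding the data, and available."""
    return (set(gq_locations) & set(site_whitelist)) - set(unavailable_sites)


def stuck_elements(elements, site_whitelist, unavailable_sites=()):
    """Return the ids of elements that have no site left to run on."""
    return [element["id"] for element in elements
            if not runnable_sites(element["locations"], site_whitelist,
                                  unavailable_sites)]


# Hypothetical input: placeholder element ids and site names.
elements = [
    {"id": "elem-1", "locations": ["T2_XX_SiteA"]},
    {"id": "elem-2", "locations": ["T2_XX_SiteB", "T2_XX_SiteC"]},
]
whitelist = ["T2_XX_SiteA", "T2_XX_SiteB"]
unavailable = ["T2_XX_SiteA"]

stuck = stuck_elements(elements, whitelist, unavailable)
```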

ReDigi

TaskChains

  • Workflows with "none" in the dataset name for ACDCs: sorry - I was doing too many ACDCs at once and accidentally submitted them the wrong way.

StepChain

Rereco

  • new campaign

Store Results

MonteCarlo

Agent Issues

Agent redeployment

RelVal Andrew

L3 discussion - Ajit, Jean-Roch, Matteo

Opportunistic Resources

Automatic Assignment And Unified Software

  • We need documentation! Matteo is working on it and will continue to look at it.
  • It will also be worked on while we are training Alison.

AOB

-- JenniferAdelmanMcCarthy - 2016-02-03

Topic revision: r5 - 2016-02-11 - JenniferAdelmanMcCarthy
 