Workflow Team Meeting - Nov 5, 4PM CERN time, 9AM FNAL time

Vidyo Link

Attending

  • FNAL: Jorge, Jen, Gaston, Matteo
  • US:
  • CERN : Dima, JonRoc, Alan, Julian
  • EU:

Personnel

  • Jorge - Nov 9-13
  • SeangChan off Nov 3-10
  • Julian goes to Colombia Dec 14 and will be working remotely from there until the first of Jan, then contract ends
  • Ajit out till Nov
  • JR unavailable until Nov 23rd
  • Dima traveling next week
  • Jen won't be checking in on Sat

News - Dima

  • Not really, we have a couple big requests keeping things busy, but no other real news

3 top issues affecting production

  • See ReReco notes
  • cmsweb upgrades - finished already https://cms-logbook.cern.ch/elog/Workflow+processing/22103
    • Watch out for PNNs
    • New cmsweb and WMAgents have the concept of a PhEDEx node name (PNN) - cms site name will be mapped to PSN Phedex site name
    • We need to be watching for failures such as a site that doesn't have proper values, e.g. T3s don't have site names in the site local config
    • We might be staging out someplace that doesn't have a PNN, so we won't get merge jobs
  • Do we need to revive the recovery script for ReReco?
  • WMStats views were behind, but should be up to date now
  • support of pileup for pileup wf's
  • Lots of WFs in status failed, with no reason given in the file for why they failed. Is something broken? I found several of them that had been reset or cloned, but the original was still stuck in status failed. There was nothing in elog about the workflows being cloned or reset; the only way I knew what happened was to look for output datasets with higher version numbers. Is it really that hard to post an elog when you clone/reset workflows? Is there a better way to check on the workflows in failed?
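The manual check described in the last item (looking for output datasets with higher version numbers) could be sketched roughly as below. This is only an illustration: it assumes output dataset names end their middle part with "-vN" (for example /Primary/Era-Proc-v1/TIER), the dataset names in the usage note are hypothetical, and the function names are made up here rather than taken from any existing script.

```python
import re
from collections import defaultdict

# Assumption: dataset names look like /Primary/Processed-vN/TIER
_VERSION_RE = re.compile(
    r"^(?P<base>/[^/]+/[^/]+)-v(?P<version>\d+)(?P<tier>/[^/]+)$"
)


def split_version(dataset):
    """Return (name without version, version number), or None if no match."""
    match = _VERSION_RE.match(dataset)
    if match is None:
        return None
    return match.group("base") + match.group("tier"), int(match.group("version"))


def superseded(failed_outputs, known_datasets):
    """Map each failed output dataset to the highest newer version found.

    failed_outputs: output datasets of workflows stuck in status failed.
    known_datasets: all dataset names to compare against.
    """
    latest = defaultdict(int)
    for name in known_datasets:
        parsed = split_version(name)
        if parsed:
            key, version = parsed
            latest[key] = max(latest[key], version)

    result = {}
    for name in failed_outputs:
        parsed = split_version(name)
        if parsed:
            key, version = parsed
            if latest.get(key, 0) > version:
                result[name] = latest[key]
    return result
```

Called with a failed output such as "/Prim/Era-Proc-v1/AODSIM" and a known list that also contains "/Prim/Era-Proc-v2/AODSIM", it reports that the failed workflow was probably cloned or reset. It does not replace an elog entry; it only automates the lookup.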

Site support - Gaston

Currently in the waiting room: T2_ES_IFCA, T2_IT_Bari, T2_IN_TIFR, T2_TH_CUNSTDA, T2_PK_NCP, T2_EE_Estonia, T2_RU_INR, T2_BR_UERJ.

No changes in morgue.

News & Issues

  • (JR) Several sites over dataops DDM quota : reasons are
    • heavy gen-sim in production or in use
    • 4 or 5 open campaigns with secondary input : 40+40+25+10+10 TB = 125 TB of secondary on disk
  • (JR) Large TaskChain with large job blow-up ratio : now assigning to sites with >4k slots if ratio >5
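The TaskChain assignment rule from JR's item can be written down as a small filter. This is a sketch of the rule as stated in the notes, not the code Unified or the assignment scripts actually use; the Site class, the function name and the site names in the usage note are hypothetical.

```python
from dataclasses import dataclass

MIN_SLOTS = 4000   # ">4k slots" from the notes
MAX_RATIO = 5.0    # "ratio >5" from the notes


@dataclass
class Site:
    name: str
    slots: int


def eligible_sites(sites, blowup_ratio):
    """Return the names of sites a TaskChain may be assigned to.

    If the job blow-up ratio is above MAX_RATIO, only sites with more
    than MIN_SLOTS slots are kept; otherwise every site is kept.
    """
    if blowup_ratio > MAX_RATIO:
        return [site.name for site in sites if site.slots > MIN_SLOTS]
    return [site.name for site in sites]
```

For example, with sites Site("T2_XX_Big", 6000) and Site("T2_XX_Small", 1500), a ratio of 8 keeps only T2_XX_Big, while a ratio of 3 keeps both.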

Workflows

ReDigi

*

TaskChains

  • Workflows with None AcqEra - https://cms-logbook.cern.ch/elog/Workflow+processing/22100
  • We have already discussed why this happened, and outlined the procedure to fix it
    • this is a problem of communication and documentation, Alan and Julian knew what to do, I sort of knew what to do but nothing was documented anywhere!
    • What other exceptions to regular acdc production and assigning are there?
  • (JR) what can be done about the very long running taskchains for now - long SUSY task chains training for 3-4 wks
    • keep an eye on them, if they are over 95% force complete, check to see if they are running, then check stuck workflow scripts

StepChain

Rereco

  • Need to start returning WF's that are not 100%
  • How much have we talked to the developers about why we are getting missing lumis that are not showing up in WMStats as errors? We need to find a better way of recovering these!

Store Results

MonteCarlo

Agent Issues

Redeployment Plan

  • ideally we will be upgrading agents next week!
  • everything is up to date - redeploy in mid-November, and then we'll have major changes to iron out before Christmas Production
  • all sites need proper local site node name info or they won't work - reposting last week's list here to see where we are in getting things done!
    • Before, the site config returned node names, but now the SE name and PhEDEx node name need to be returned or they will fail. Gaston needs to verify that the CERN local configs are right

  • As discussed in the meeting, some sites are not providing the phedex-node value in the site-local-config.xml. For example, T2_CH_CERN is OK:

  • We need all the processing sites to properly provide this info otherwise we'll start having problems in the next WMAgent release.
  • Also the site we are using has to be listed here. If not we either update siteDB or get the site name from other sources.
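A quick way to check whether a site's site-local-config.xml provides the phedex-node value is sketched below. The only things taken from the notes are the file name and the phedex-node value; the surrounding XML layout (a local-stage-out block holding se-name and phedex-node elements with a value attribute), the site name T2_XX_Example and the function names are assumptions for illustration, so adjust them to what the real configs look like.

```python
import xml.etree.ElementTree as ET

# Hypothetical configs; the layout is an assumption, not copied from a real site.
EXAMPLE_OK = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <se-name value="se.example.org"/>
      <phedex-node value="T2_XX_Example"/>
    </local-stage-out>
  </site>
</site-local-config>
""".strip()

EXAMPLE_MISSING = """
<site-local-config>
  <site name="T2_XX_Example">
    <local-stage-out>
      <se-name value="se.example.org"/>
    </local-stage-out>
  </site>
</site-local-config>
""".strip()


def phedex_nodes(xml_text):
    """Return every non-empty phedex-node value found in the config text."""
    root = ET.fromstring(xml_text)
    return [
        element.get("value")
        for element in root.iter("phedex-node")
        if element.get("value")
    ]


def has_phedex_node(xml_text):
    """True if the config provides at least one phedex-node value."""
    return bool(phedex_nodes(xml_text))
```

Running has_phedex_node over the config of each processing site would give the list of sites that still need to be fixed before the next WMAgent release.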

RelVal Andrew

L3 discussion - Ajit, Jean-Roch, Matteo

Opportunistic Resources

Automatic Assignment And Unified Software

  • In the past Julian has "extended MC workflows" that didn't meet statistics, is this something Unified can take over?

AOB

  • PNN change - have Gaston check sitedb list that maps

-- JenniferAdelmanMcCarthy - 2015-10-28

Topic revision: r6 - 2015-11-05 - JenniferAdelmanMcCarthy
 