Workflow Team Meeting - March 5, 4 PM CERN time, 10 AM FNAL time!

Vidyo Link

  • https://indico.cern.ch/event/381100/

Attending

  • FNAL:
  • US:
  • CERN:
  • EU:

Personnel

  • Luis to Colombia around May 1-10 (dates tentative); talk on the 6th
  • Julian out over Easter holidays, March 28-April 4
  • Jen may be taking time off around Easter; not sure on dates yet

News

  • New EU Operators!
    • James Keaveney - Belgium
    • Haneol Lee - SNU Korea

3 top issues affecting production

  • Matteo's addition for the Monday meeting
  • Would like to discuss the possibility of testing multiple sites in the whitelist, so that if a high-priority WF gets assigned to a site that is big, we could transfer the data to another site to see whether it would start running there.
    • Sounds doable; I don't think we've tried it before. Comments?
  • What else do we see as issues causing processing to be slow? Communications?
  • Open issue from the Transfer team and US sites: Missing files after creation - https://ggus.eu/index.php?mode=ticket_info&ticket_id=111932
    • I'm running some test workflows.
  • About the priority-remapping schema: this is still in place, right?

Site support

Opportunistic sites

Workflows

  • Found a problem with the recovery script: when you run it on workflows that have whitelists, it ignores the whitelist when it makes the recovery workflows.
    • Julian, have you had time to look at this yet? -- Way beyond my WMCore knowledge; I can try spawning some tests.

ReDigi

  • IN2P3 had/has transfer problems that leave the datasets without all the block information on _DISK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=112030 They are working on it, but until this gets fixed and the remaining blocks get transferred, the following WFs are going to sit in complete. If we get the recovery script going before this happens we can try to recover; otherwise we clone.
    • pdmvserv_HIN-HiFall13DR53X-00034_00009_v0__150305_153317_8745
    • pdmvserv_HIN-HiFall13DR53X-00035_00009_v0__150305_153314_5782
    • pdmvserv_HIN-HiFall13DR53X-00036_00009_v0__150305_153311_7540
    • I tried ACDC with trustSiteList = True. In theory, only the missing files should fail.
      • jbadillo_ACDC_HIN-HiFall13DR53X-00034_00009_v0__150312_113309_8838 - result: absolute failure.

TaskChains

  • 6 taskchains in the system.
  • task_SMP-Summer12WMLHE-00011 -> waiting for files
  • task_TOP-RunIIWinter15wmLHE-00003 -> is having timeouts on Task2, but requestors said that they can use the LHE.

MiniAODs

Rereco

  • We still have skims flowing:
    • amaltaro_CosmicsSP_732_Feb9test_732_150302_231736_5479
    • franzoni_6Mar2015_Cosmics_732p5_150309_175340_221
    • alahiff_6Mar2015_MinimumBias_732p5_150309_145708_5216
    • The original is still completed: vlimant_CosmicsSP_732_Feb9test_732_150216_142419_5298
    • Who is taking care of assigning and announcing this? Also, do we need to reject the v1?

Store Results

  • Nothing to report

MonteCarlo

  • Nothing to report

Agent Issues

  • CERN's vocms03XX have /var almost at 100%
    • While debugging, I found some offending files, /var/log/secure-2015xxxx, ~2 GB each. Any idea what is producing those files?

Redeployment Plan

Production SL6

FNAL                             CERN
cmsgwms-submit1 (up)             vocms0308 (ready to wake)
cmsgwms-submit2 (ready to wake)  vocms0309 (drain)
cmssrv217 (up)                   vocms0310 (drain)
cmssrv218 (up)
cmssrv219 (up)

RelVal Andrew

AOB

-- JenniferAdelmanMcCarthy - 2015-03-11
Topic revision: r4 - 2015-03-12 - JulianBadillo
 