Workflow Team Meeting - March 5, 4 PM CERN time, 10 AM FNAL time!

Vidyo Link

  • https://indico.cern.ch/event/381100/

Attending

  • FNAL: Jen, Matteo, Seangchan, Luis, Jorge
  • US: Ian, Ajit, Stephan, Dave
  • CERN: Julian, Alan, Andrew, Dima
  • EU:

Personnel

  • Luis to Colombia around May 1-10 (dates tentative); talk on the 6th
  • Julian out over the Easter holidays, March 28-April 4
  • Jen may be taking time off around Easter, not sure on dates yet
  • Seangchan off a few days, March 28-April 4
  • Matteo will be at CERN next week, Moriond the week after

News

  • New EU Operators!
    • James Keaveney - Belgium
    • Haneol Lee - SNU Korea
  • Production coming up? Preparing for a bunch of things: more GEN-SIM in the form of TaskChains; still deciding on the release and on how much to produce.
    • Still validating, so there are already plans for more than we can produce.
  • T0 would like to take over ~1/2 of the resources if they can. Do we have problems with that? If we have nothing going, or it's "just backfill", we are giving the T0 permission to take things over. For condor monitoring, they will report to the dashboard.

3 top issues affecting production

  • Matteo's addition for the Monday meeting
  • Would like to discuss the possibility of testing multiple sites in the whitelist, so that if a high-priority WF gets added to a site that is big, we could transfer the data to another site to see if it would start running there.
    • Sounds doable; I don't think we've tried it before. Comments?
    • Not as easy as it sounds. We need to remind users that low priority means low priority and they need to wait...
  • what else do we see as issues causing processing to be slow? Communications?
  • Open issue from the Transfer team and US sites: Missing files after creation - https://ggus.eu/index.php?mode=ticket_info&ticket_id=111932
    • I'm running some test workflows.
    • It's a small % of MC, so not a huge deal to just invalidate, but it is increasing, so we need to figure out what is going on.
  • About the priority-remapping schema: this is still in place, right?

Site support

Opportunistic sites

Workflows

  • Found a problem with the recovery script: when you run it on workflows that have whitelists, it ignores the whitelist when it makes the recovery workflows.
    • Julian, have you had time to look at this yet? -- Way beyond my WMCore knowledge; I can try spawning some tests.

ReDigi

  • IN2P3 had/has transfer problems, causing the datasets to not have all the block information on _DISK: https://ggus.eu/index.php?mode=ticket_info&ticket_id=112030 They are working on it, but until this gets fixed and the remaining blocks get transferred, the following WFs are going to sit in complete. If we get the recovery script going before this happens, we can try to recover; otherwise we clone.
    • pdmvserv_HIN-HiFall13DR53X-00034_00009_v0__150305_153317_8745
    • pdmvserv_HIN-HiFall13DR53X-00035_00009_v0__150305_153314_5782
    • pdmvserv_HIN-HiFall13DR53X-00036_00009_v0__150305_153311_7540
    • I tried ACDC with trustSiteList = True; in theory, only the missing files should fail.
      • jbadillo_ACDC_HIN-HiFall13DR53X-00034_00009_v0__150312_113309_8838 - result: absolute failure.

TaskChains

  • 6 TaskChains in the system.
  • task_SMP-Summer12WMLHE-00011 -> waiting for files
  • task_TOP-RunIIWinter15wmLHE-00003 -> is having timeouts on Task2, but requestors said that they can use the LHE.

MiniAODs

Rereco

  • We still have skims flowing:
    • amaltaro_CosmicsSP_732_Feb9test_732_150302_231736_5479
    • franzoni_6Mar2015_Cosmics_732p5_150309_175340_221
    • alahiff_6Mar2015_MinimumBias_732p5_150309_145708_5216
    • The original is still completed: vlimant_CosmicsSP_732_Feb9test_732_150216_142419_5298
    • Who is taking care of assigning and announcing this? Also asking whether we need to reject the v1.

Store Results

  • Nothing to report

MonteCarlo

  • Nothing to report

Agent Issues

  • CERN's vocms03XX machines have /var almost at 100%
    • While debugging, I found some offending files, /var/log/secure-2015xxxx, at ~2 GB each. Any idea what is producing those files? Julian has his answer.
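
A minimal sketch of how the largest files under a log directory could be listed when an agent's /var fills up. This is not the procedure used during the debugging above; the function name largest_files, its parameters, and the 100 MB threshold are assumptions chosen for illustration only.

```python
import heapq
import os


def largest_files(root, top_n=10, min_bytes=0):
    """Return up to top_n (size_bytes, path) pairs under root, largest first.

    Hypothetical helper: unreadable directories and files are skipped
    silently, and symbolic links are not followed.
    """
    found = []
    for dirpath, _dirnames, filenames in os.walk(root, onerror=lambda err: None):
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                if os.path.islink(path):
                    continue
                size = os.path.getsize(path)
            except OSError:
                # File vanished or is not readable; ignore it.
                continue
            if size >= min_bytes:
                found.append((size, path))
    return heapq.nlargest(top_n, found)


if __name__ == "__main__":
    # Assumed starting point and threshold: files of at least 100 MB in /var/log.
    for size, path in largest_files("/var/log", top_n=10, min_bytes=100 * 1024**2):
        print(f"{size / 1024**3:6.2f} GB  {path}")
```

Reading /var/log/secure-* usually requires root, so without it those files may simply be missing from the listing rather than raising an error.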

Redeployment Plan

production SL6
FNAL                             CERN
cmsgwms-submit1 (up)             vocms0308 (ready to wake)
cmsgwms-submit2 (ready to wake)  vocms0309 (drain)
cmssrv217 (up)                   vocms0310 (drain)
cmssrv218 (up)
cmssrv219 (up)

RelVal Andrew

AOB

-- JenniferAdelmanMcCarthy - 2015-03-11
Topic revision: r5 - 2015-03-12 - JenniferAdelmanMcCarthy
 