Vidyo Link

Attending

  • Jen, Dave, Julian, Xavier, Andrew, SeangChan, Luis, John

Shift

Feb 18 -> Feb 25 Sara
Feb 18 -> Feb 25 Xavier

News

  • Are we draining for the Oracle database upgrade?
    • The upgrade is between 8 and 12 CERN time on Thurs, so we need to shut down the agents
    • By "drain" we mean just turning off JobSubmitter - Julian. Andrew and Alan will take care of the RelVal agents
      • Wed at 8 am - shut off JobSubmitter
      • Thurs at 8 am - shut down agents
      • Thurs - after the Oracle upgrade we bring the agents back up ~1 hr later
  • Migration to disk tape at FNAL and PIC

Issues

  • vocms85 is having the same DBS3Uploader issue (job location missing) : https://cmslogbook.cern.ch/elog/Workflow+processing/13038
    • We will use 85 and treat it as a redeploy and let the developers look at it to figure out what is going on
  • Central couch maxed out during the weekend: 13007. Components and WMStats were affected.
  • Still an ongoing issue with blocks not being uploaded when the workflow is completed.
    • For now we let completed workflows sit for 24 hours in "complete" before resubmission.
    • SeangChan is creating a patch, which will require an upgrade, that will just close out the WFs once the last job has finished instead of waiting 18 hrs
  • Too many pending workflows: tune up pending jobs 13020
  • Script for extending MonteCarlo from scratch 13017, in order to make smart resubmissions.
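
The 24-hour hold on completed workflows noted above could be checked with something like the following. This is only a minimal sketch: the function and argument names are hypothetical and do not come from any existing agent code.

```python
from datetime import datetime, timedelta

# Hold period taken from the note above; everything else here is illustrative.
HOLD = timedelta(hours=24)


def ready_for_resubmission(completed_at, now):
    """Return True once a workflow has sat in "complete" for the full hold period."""
    return now - completed_at >= HOLD
```

A workflow that reached "complete" at 08:00 on one day would first be eligible at 08:00 on the next.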

Workflow issues

  • The known file read error means we need to ACDC pretty much all WFs to get all the events.
  • Duplicate lumis
    • When you move a WF from assign/run to rejected, the blocks in the global queue are not killed. When the agent runs it will process them, but WMStats won't know anything about them and they will not be monitored.
    • You should ONLY ABORT workflows that are in a state between Assign->Complete. The only safe time to reject a workflow is before assign status.
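
The abort/reject rule above can be written down as a small lookup. This is a sketch under stated assumptions: the status names and their order are hypothetical stand-ins for the real request states, and treating both ends of Assign->Complete as inclusive is also an assumption.

```python
# Hypothetical status names, ordered as a workflow progresses; the real
# request manager may use different names or additional states.
LIFECYCLE = ["new", "assignment-approved", "assigned", "acquired", "running", "completed"]


def safe_action(status):
    """Return "reject" before assignment, "abort" from assigned through completed."""
    if status not in LIFECYCLE:
        raise ValueError(f"unknown status: {status!r}")
    if LIFECYCLE.index(status) < LIFECYCLE.index("assigned"):
        return "reject"
    return "abort"
```

Anything already handed to an agent falls on the "abort" side, which is the case where leftover blocks in the global queue would otherwise be processed unmonitored.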

Savannah tickets to watch

Datasets take more than 10 hours to appear in DAS after the workflow has finished

Site issues

  • There are a few points we need to look at and correct in the scripts that produce running/pending job info for the dashboard
    • One of the main problems is that the script is still using the old drain list name (drain - pledge) and reporting that info to the dashboard in the sites view
      • it needs to be using SSB
      • we have 2 collectors and the 2 are redundant, which is why we are getting double counting - Julian is looking at this
    • When the script is dividing assignments/groups, some steps have "merge" in their name even though they are not merge jobs; this needs to be cleared up
      • there is a way to get the task type in condor; we need to grab this from condor rather than from the LFN so we are reporting accurately
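
The two counting fixes above (count each job once even when both collectors report it, and group by task type rather than by name) could look roughly like this. It is a sketch only: the record layout and the keys "job_id" and "task_type" are hypothetical, and fetching the records from condor is deliberately left out.

```python
def count_by_task_type(jobs):
    """Count jobs per task type, counting each job id once.

    `jobs` is an iterable of dicts with the hypothetical keys "job_id" and
    "task_type". A record for a job id that was already seen, for example
    from a second redundant collector, is skipped.
    """
    seen = set()
    counts = {}
    for job in jobs:
        if job["job_id"] in seen:
            continue  # duplicate report of the same job
        seen.add(job["job_id"])
        task_type = job["task_type"]
        counts[task_type] = counts.get(task_type, 0) + 1
    return counts
```

Because the grouping key is the reported task type, a processing step whose name happens to contain "merge" is not counted as a merge job.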
Topic revision: r4 - 2014-02-25 - JenniferAdelmanMcCarthy
 