Workflow Team meeting

Attendance

  • FNAL: Luis, SeangChan, Jen, Dave
  • CERN: Julian
  • Sara and Jasper

Vidyo Link

Shifts

March 25 -> April 3 Sara
April 3 -> April 10 Jasper
  • Luis is going to Columbia April 21-25, then on vacation April 28-May 2
  • SeangChan 1/2 day on Friday
  • Dave & Oli at SLAC April 6-10
  • Dave at CERN April 21-29

Sara's operations notes

  • Things went smoothly
    • Problems with the dashboard: a script problem kept things from reporting properly, which made monitoring difficult
  • FNAL - disk full errors and drain
    • we will be monitoring the unmerged space differently so for now we can ignore that particular
    • Site people will be watching this; we will shut down the alarm, patch the code and update the scripts
    • What do we look for? Best to look at everything and then add exception lists for the checks that cause false alarms, like this week.
  • Condor submit errors
    • Modifying the threshold script to divide the pending threshold and make it lower; this should improve the situation for the
    • Each agent is independent and doesn't know about the other agents. When a bunch of workflows with global site lists are injected, every agent loads the maximum number of slots and everything gets maxed out. This is how we ended up with 400K pending jobs, which causes our condor submit errors.
    • We need different thresholds for MC and processing. Could the thresholds be set by a cronjob that is aware of the number of agents up, so we don't flood the system?
    • Remove the GitHub download from the cronjob: Julian will do this after Luis turns in his changes for the threshold issues
  • issues list http://spinoso.web.cern.ch/spinoso/mc/issues.html
    • who is responsible for going through this list and making sure things are resolved?
    • Luis will update twiki with info on stuck resources
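
The cronjob idea raised under "Condor submit errors" could be sketched roughly as below. This is only an illustration, not the actual threshold script: the function name, the slot counts and the pending fraction are all hypothetical, and it assumes the cronjob can find out how many agents are up.

```python
def per_agent_thresholds(site_slots, agents_up, pending_fraction=0.5):
    """Split each site's slots evenly across the agents that are up.

    site_slots: dict of site name -> total slots (hypothetical numbers).
    agents_up: number of agents currently running.
    pending_fraction: hypothetical ratio of pending to running jobs per agent.
    """
    if agents_up < 1:
        raise ValueError("need at least one agent up")
    thresholds = {}
    for site, slots in site_slots.items():
        # Each agent gets an equal share, so the sum over all agents
        # never exceeds the site's total slot count.
        running = slots // agents_up
        pending = int(running * pending_fraction)
        thresholds[site] = {"running": running, "pending": pending}
    return thresholds


if __name__ == "__main__":
    example_slots = {"T1_US_FNAL": 8000}  # made-up slot count
    for site, limits in per_agent_thresholds(example_slots, agents_up=4).items():
        print(site, limits)
```

With shares computed this way, adding or removing an agent changes every agent's limit on the next cron run instead of letting each agent fill all slots on its own. Separate MC and processing thresholds could be handled by calling the function with different pending fractions.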

News

  • FNAL in drain
    • what is the current status?
    • we need to have 0 jobs running on Monday, everything will become inaccessible so
    • MC - put agents in drain
    • redigi/resub - resubmit only if clones are needed; they don't get put in until after the change
  • ggus transition
    • I opened a ticket against FNAL on Wed, anybody know why sites are not listed as T1_US_FNAL etc for submission?
    • They are listed indeed, but you need to request tickets from the CMS interface:
Screen_Shot_2014-04-03_at_7.55.24_AM.png

Agent Issues

  • Too many pending jobs: elog thread
    • Alison says that RequestDisk (among other parameters) is used to create job-groups inside the collector
    • There are too many values of the same attribute (that are very similar to each other)
    • Collector matching is optimized for larger and fewer groups of similar jobs
    • We need a github issue for rounding up values like this to have fewer values.
      • Collector experts are looking into this and verifying that it will help fix the issues
      • this will speed up job matching in the collector - Julian will set up the issue
    • We also need to push the running/pending limit per agent: github issue
  • Alarms on FNAL machine about /lustre/unmerged: elog thread
    • What are we doing about them?
  • Step0 Workflows stuck in "running-closed"
    • Could we debug them?
    • Luis will look at them and update the documentation; Jasper will follow up

  • Redeployment plan:
    • vocms201 redeployed --> vocms85 in drain (it had several issues during the weekend)
    • vocms202 redeployed (both vocms234 and 202 are up)
    • vocms85 will be attached to reproc_lowprio team to drain vocms234.
    • we are waiting for FNAL to finish the Disk/Tape separation.
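
The rounding idea from "Too many pending jobs" can be illustrated with a small sketch. It assumes the values are plain integers and that rounding up to a fixed step is acceptable; the step size and the sample values are hypothetical, and this is not the agent's actual code.

```python
def round_up_request(value, step):
    """Round value up to the next multiple of step (step is a hypothetical bucket size)."""
    if step <= 0:
        raise ValueError("step must be positive")
    # Ceiling division using integer arithmetic only.
    return -(-value // step) * step


if __name__ == "__main__":
    # Made-up RequestDisk-like values that differ only slightly.
    requests = [1000001, 1000350, 1999999, 2000000, 2000001]
    rounded = sorted({round_up_request(v, 1000000) for v in requests})
    print(len(set(requests)), "distinct values before,", len(rounded), "after:", rounded)
```

Rounding up rather than down keeps every job's request at least as large as what it originally asked for, while the collector sees far fewer distinct values to group on.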

Site Issues

  • FNAL in drain

MonteCarlo

Redigi/Rereco

  • Need to go over the list with Dave: https://cmslogbook.cern.ch/elog/Workflow+processing/13771

Relval == Andrew's Questions

-- JulianBadillo - 01 Apr 2014

Topic revision: r8 - 2014-04-03 - JenniferAdelmanMcCarthy