https://indico.cern.ch/conferenceDisplay.py?confId=254676

Attending

CERN: Andrew, Adli, Edgar

FNAL: Jen, LuisC, SeangChan

Personnel

  • Welcome Adli! Please tell us a little about yourself!
  • Oct 8 --> Oct 15 Sara
  • Oct 15 --> Oct 22 Julian (+Edgar)
  • We may need to pull Dorian in to work US/Asia shifts at the end of the week.
  • Edgar will be on vacation from Monday the 21st until Wednesday the 23rd.

Agent Issues

  • replication stops and does not restart - patch created
    • All machines patched, but some replication issues still persist.
    • Restarting the Couch replication should deal with the problem.
    • However, some data is restarted.
    • We need to monitor this more closely.
  • vocms85 JobCreator was crashing - hard link limit
    • Might be due to a bad workflow setup; 32k LogCollect job events is too many.
    • Directories for LogCollect jobs need to be handled the same way as those for Production jobs.
    • SeangChan created the issue.
  • vocms201 JobCreator: also crashed on the hard link limit; fix to JobSubmitter (probably a typo).
  • vocms235 deployment: 0.9.69a
    • Problems deploying, most probably with the machine installation.
    • It will be used to deploy the new version (tomorrow) anyway.
  • parenting problem: tested, patched on vocms234
    • Also patched on vocms85, vocms201, vocms235, vocms216, vocms227 and vocms237.
    • Discussion is now ongoing in ComOps on how to deal with parentage problems.
    • We can identify which file is missing; this can maybe be done manually if there are only a few datasets.
    • It will probably have an impact on DAS/DBS.
  • vocms202 draining or upgrade
  • Discuss changing the teams of some of the agents.
  • Jobs that are created after workflows are aborted:
    • Most of them are LogCollect and Cleanup jobs.
    • Sometimes we will need to kill those jobs manually and let the sites clean up the logs.
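
The hard link limit crashes noted above for vocms85 and vocms201 happen when one directory collects too many job subdirectories. A minimal sketch of one way to avoid this, assuming the limit is the filesystem's per-directory link count (for example 32000 on ext3), is to spread job directories over fixed-size buckets. The function name, the bucket naming and the cap of 1000 are hypothetical illustrations, not WMAgent code or settings.

```python
import os

# Hypothetical cap on job directories per bucket, chosen well below the
# link limit assumed in the text above; this is not a WMAgent setting.
MAX_ENTRIES_PER_DIR = 1000


def bucketed_job_dir(base_dir, job_index, per_dir=MAX_ENTRIES_PER_DIR):
    """Return a two-level path with at most `per_dir` job directories per bucket."""
    if job_index < 0:
        raise ValueError("job_index must be non-negative")
    # Integer division assigns consecutive runs of `per_dir` jobs to one bucket.
    bucket = job_index // per_dir
    return os.path.join(base_dir, "bucket_%05d" % bucket, "job_%d" % job_index)
```

With this layout, 32000 LogCollect jobs would occupy 32 buckets of 1000 directories each instead of 32000 entries in a single directory.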

Workflow Issues

  • We have LOTS of stuck workflows to work our way through.
    • Luis has discovered that most of the stuck WFs are caused by jobs that have no site information in reqManager. We will let him discuss the problem further in tomorrow's meeting.
    • Edgar's script can be used to finish or force-complete most of the stuck WFs (those that are over 95% complete).
  • We may be moving a lot of data around this week. Once we get a better feel for the scope of the work ahead of us, we may re-prioritize people's work; please stay tuned for updates!
  • We have now restarted the fall processing with the parentage problem fixed. This data has the highest priority, to make sure it keeps moving, since the date by which we were supposed to have finished it has already passed.
  • No tickets.
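
The 95% rule for force-completing stuck WFs mentioned above can be sketched as a simple filter. This is not Edgar's script; the function name, the input layout (workflow name mapped to a pair of done and total counts) and the example workflow names are assumptions made for illustration only.

```python
def force_complete_candidates(workflows, threshold=0.95):
    """Return the names of workflows whose completed fraction exceeds `threshold`.

    `workflows` maps a workflow name to a (done, total) pair; this layout is
    an assumption for the sketch, not the ReqMgr or WMStats format.
    """
    candidates = []
    for name, (done, total) in sorted(workflows.items()):
        # Skip workflows with no known total to avoid dividing by zero.
        if total > 0 and done / total > threshold:
            candidates.append(name)
    return candidates


# Hypothetical example data.
stuck = {
    "wf_a": (990, 1000),  # 99% done: candidate
    "wf_b": (500, 1000),  # 50% done: still needs investigation
    "wf_c": (0, 0),       # no totals known: skipped
}
print(force_complete_candidates(stuck))
```

Workflows below the threshold would still need the kind of investigation described above, for example checking for missing site information.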

Site Issues

AOB

  • John, Julian and Gockhan will follow up with the dashboard on how to finally get rid of the conf file. In the *.json published to the SSB, all sites have to appear even if they have 0 jobs in condor.
    • For now use Request Alert on WMStats (200 stuck requests now)
    • We should deal with MC workflows first.
  • How can we detect stuck requests faster?
  • WMStats database rotation (Alan and Diego) was successful and reduced the database size, so it should work better this week.

Action Items

  • Julian: Create a patch for JobSubmitter and apply it to one agent to test it.
  • LuisC and SeangChan: Find the cause of the stuck WFs.
  • SeangChan: Create issue on jobs running after aborted WF.
  • Jen: List of stuck workflows that are ready to be force-closed.
  • All: Move all the stuck requests.
Topic revision: r5 - 2013-10-16 - JohnArtieda
 