https://indico.cern.ch/conferenceDisplay.py?confId=254678
Attending:
  • FNAL - Jen, Luis, SeangChan, Dave
  • CERN - Julian, Adli, Andrew, Edgar
Personnel:
  • Oct 22 --> Oct 29 Julian (+Adli)
  • Oct 29 --> Nov 4 Adli
  • Edgar: is it Nov 8th now, or are you officially gone on the 31st?
  • John is gone Oct 23-Nov 11
Agent issues
  • vocms235 was having couch issues today; a patch was applied and it has now been up for 3 hrs
    • couch processes maxing out is our most common problem, but we are not sure that was the cause of today's crashes - Luis will look into whether this was in fact the problem today
    • Error handler patch was applied. It has been tested, but this is its first deployment on a production machine, so we need to keep an extra close eye on this machine for the next few days.
    • failed job report numbers may not be matching properly, so SeangChan will keep an extra close eye on things.
  • vocms227 ErrorHandler problem - it's a connection problem
  • vocms202 (reprocessing agent): up and running
  • vocms216 & 234 are re-installed and ready for jobs
  • vocms201 & cmssrv112 are in drain for upgrades that will happen later this week
  • RelVal now has the same priority as MC, so this should help our issue with RelVal taking all the slots at FNAL
  • FNAL not getting any jobs: behaving normally
  • WMAgent issues:
    • v0.9.82 deployed in vocms202, vocms235, cmssrv98, and vocms85. Next: vocms216, vocms234.
    • Oracle on vocms85
  • disk full problem: warning patch is being tested
    • for now we still do not have the warning, but we have an alarm that will e-mail us if the disk is filling
    • this will let us know if /data1 or /data starts to fill; if it's /data1 there is nothing to clean
    • SeangChan will update the twiki with information on what to do when the disk fills
    • Jen and Luis will ask Burt/Krista to put this same alarm for the FNAL machines at 90%
    • 113 is currently at 88%. SeangChan will write up documentation for cleanup / Andrew will test it
  • PhEDEx subscriptions issue: solved
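
For the disk-full alarm discussed above, a minimal sketch of the kind of check involved is shown here. This is not the actual alarm or warning patch; the function names are hypothetical, and only the /data and /data1 mount points and the 90% threshold come from these notes.

```python
import shutil
from typing import Callable, Iterable, List, Tuple


def percent_used(total: int, used: int) -> float:
    """Return used space as a percentage of total (0.0 if total is 0)."""
    return 100.0 * used / total if total else 0.0


def mounts_over_threshold(
    mounts: Iterable[str],
    threshold: float = 90.0,  # 90% is the level mentioned for the FNAL machines
    usage_fn: Callable = shutil.disk_usage,
) -> List[Tuple[str, float]]:
    """Return (mount, percent used) for every mount at or above threshold."""
    over = []
    for mount in mounts:
        try:
            usage = usage_fn(mount)
        except OSError:
            # Mount point does not exist on this machine: skip it.
            continue
        pct = percent_used(usage.total, usage.used)
        if pct >= threshold:
            over.append((mount, pct))
    return over


if __name__ == "__main__":
    # Hypothetical usage: print a line per mount that needs attention.
    for mount, pct in mounts_over_threshold(["/data", "/data1"]):
        print(f"ALARM: {mount} is {pct:.1f}% full")
```

A script like this could be run from cron and its output mailed to the operators; how the real alarm is delivered is not specified in these notes.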

Workflow issues:

  • Workflow with massive fail - Input Data invalidation.
    • once the files were invalidated it moved along nicely and is now out of our hands
  • We have a number of workflows that are not at 100%, but we've run ACDC and have no failures.
  • we have made HUGE strides in understanding why WF's are getting stuck and in getting the stuck list back under control... for now... please keep up the good work everyone!
  • we need to start incorporating Edgar's ideas for preventing stuck workflows in the first place.
"backfill"
  • due to recent agent issues, and the fact that we currently do not have a lot of jobs running to keep sites busy, Oli has requested that we run "backfill", which is basically running a known job over and over again to make sure that there are no stability issues
  • the first 2 workflows are in: https://cmslogbook.cern.ch/elog/Workflow+processing/10729
    • we need to keep some end to end statistics
      • when was the WF submitted
      • when did it end
      • how long did it take for ACDC to go through, etc.
      • we will treat these like normal WF's; only when we are all done do we delete the outputs, as we have already run the data.
    • as soon as one backfill WF is finished the next one goes in; the idea is that we keep constant pressure on the sites to ensure that there are no problems creeping up on us. We will start with just the T1's; once we get that going smoothly we will add T2's
    • query DAS for the first and last events to go into a dataset to get the numbers
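
The end-to-end statistics listed above could be kept in a small record per backfill workflow. The following is only a sketch under assumptions: the class and field names are hypothetical, and the timestamps would have to be filled in by hand or from whatever source we settle on.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackfillRecord:
    """Timing bookkeeping for one backfill workflow (hypothetical layout)."""
    name: str
    submitted: datetime                       # when the WF was submitted
    finished: datetime                        # when the WF ended
    acdc_started: Optional[datetime] = None   # set only if ACDC was run
    acdc_finished: Optional[datetime] = None

    def total_time(self) -> timedelta:
        """Wall-clock time from submission to end."""
        return self.finished - self.submitted

    def acdc_time(self) -> Optional[timedelta]:
        """Time ACDC took to go through, or None if no ACDC was run."""
        if self.acdc_started is None or self.acdc_finished is None:
            return None
        return self.acdc_finished - self.acdc_started
```

Keeping one record per workflow makes it easy to compare successive backfill rounds and spot a site or agent that is slowing down.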
Site Problems
  • All US T2s fail the xrootd-fallback SAM test
    • Need to understand why
  • Issue with T2_US_Vanderbilt SAM availability
  • We have started rolling out the SL6 workers in the FNAL cluster. All the test workflows ran but if you or anyone in dataops see any issues with FNAL, please let me or cmst1 know so we can take care of it asap.
cms-comp-ops-site-support-team (Site Support Team) <cms-comp-ops-site-support-team@cern.ch>

  • For any problem, email the site support team list (while John is on vacation).
AOB
  • resource control script had an error updating site information; an issue was created and a change got into 9.28, but it hasn't really been fixed
    • needs to be patched in on all the agents
  • vocms227 ErrorHandler had a typo; that is fixed
  • certificate problems: certificates should not be shared with production agents; this needs to be checked and verified
Topic revision: r3 - 2013-10-29 - JenniferAdelmanMcCarthy
 