FNAL - Dave M, SeangChan, Yuyi, Jen, Luis, Julian, Andrew - CERN


Dec 10 --> Dec 17 Xavier
Dec 17 --> Dec 24 Sunil
Dec 24 --> Dec 31 Xavier
Dec 31 --> Jan 7 Sunil
  • Julian's holidays: Dec 27th to 29th and Jan 7th to 11th
  • Adli may be on holiday Jan 1-6
  • Jen: around the week of Christmas, but probably off Dec 29th to Jan 6
  • Jen: off Dec 18th
  • Dec 23-Jan1 - Xavier and Sunil will be on shift but working from home
  • CERN closed Dec 22-Jan6
  • Luis will be gone Dec 11-13 and will be working remotely Dec 16-20

Issues from Last week

* thresholds

* DBS3 - Components up on all but cmssrv85, 112, 113 and 237
* 201 and 202 - turned on but they crashed right away; needed to restart twice and now they are up but still unstable

    • What is the error? DBS connection timeout.
    • From the DBS3 side, only 1-2 errors
      • parentage error, block already in error, so the errors are not matching up between the agent and DBS3
      • in 235, more than 200 requests going in at once
      • only turned 2 on at a time, and they failed after ~1hr; maybe the agent is sending too many requests
      • the DBS3 side is abnormally heavily loaded, so it takes several min to insert a
      • 2 steps
        • 13 hrs to just copy data
        • 13 hrs to validate data between 2 datasets
      • the database is shared between agent, dbs and PhEDEx
  • we need to really understand which steps still depend on DBS2 and make sure that they are tested and ready to switch to DBS3
  • we need to have a time schedule. Meetings at 10:30, everyone needs to attend! At FNAL we will meet in the ROC
    • list of systems affected
    • time table
    • checks
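
One way to test the "agent is sending too many requests" theory above is to cap how many inserts are in flight at once. This is only a minimal sketch, not the agent's actual upload code: `upload_with_limit`, `insert_block` and `max_in_flight` are hypothetical names, and `insert_block` stands in for whatever call really inserts a block into DBS3.

```python
from concurrent.futures import ThreadPoolExecutor


def upload_with_limit(blocks, insert_block, max_in_flight=5):
    """Call insert_block(block) for every block, never more than
    max_in_flight calls at the same time. Results keep input order."""
    # The pool never runs more than max_workers calls concurrently,
    # so the pool size acts as the cap on simultaneous requests.
    with ThreadPoolExecutor(max_workers=max_in_flight) as pool:
        return list(pool.map(insert_block, blocks))
```

Starting with a small `max_in_flight` and raising it until the timeouts come back would give a rough idea of how much load the DBS3 side can take.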

Agent Issues

  • DBS/DBS3 upload halted; can we turn DBS3Upload back on?
    • We are manually tracking invalidated datasets on DBS.
    • Julian restarted only vocms201 and vocms202 but they failed with a timeout error. Is that normal?
  • This week: Upgrade to condor 8 (MC machines already upgraded)
    • MC machines are done
    • rest will be done tomorrow
    • cmssrv112 is also done
  • Oracle query that doesn't end: Still ongoing. Status of that?
  • ACDC's with timed out jobs: Still ongoing, gathering information and replicating.
  • Step0 workflows stuck in acquired: Due to massive assignment.
    • The solution is to assign in small groups so they won't overload the global queue.
  • PhEDEx unavailability caused PhedexInjector problems.
  • we need to increase the thresholds for FNAL since we now have 4000 more slots
  • we need to re-evaluate our pending numbers
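
The "assign in small groups" fix for the Step0 workflows could be scripted roughly as below. This is a sketch under assumptions, not an existing tool: `chunked`, `assign_in_groups`, `assign_batch`, `group_size` and `pause_seconds` are hypothetical names, and `assign_batch` stands in for whatever actually performs the assignment.

```python
import time


def chunked(items, size):
    """Yield consecutive slices of items, each at most `size` long."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def assign_in_groups(workflows, assign_batch, group_size=10, pause_seconds=0):
    """Assign workflows one small group at a time and return the number
    of groups sent. The pause gives the global queue time to pick up
    each group before the next one arrives."""
    groups_sent = 0
    for group in chunked(workflows, group_size):
        assign_batch(group)
        groups_sent += 1
        time.sleep(pause_seconds)
    return groups_sent
```

With 25 workflows and `group_size=10` this sends groups of 10, 10 and 5. Sensible values for the group size and pause would have to come from watching the global queue.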

Site Issues

  • Site thresholds/slots inside the agent:
    • Fixed and set for T1 on MC machines.
    • Scripts ~cmst1/ and ~cmst1/
    • They should be centrally updated (SSB).


  • ReDigis that complete with no failures yet are less than 25% done (11710)
  • Recovery is done :)
  • Stuck Workflow List - let's make sure we understand WHY they are stuck. Force-completing is fine if things are high priority, but we need to fix the cause in the agent so we don't have to do manual work.
    • site thresholds.
    • 7 stuck MC now.

Andrew's Questions


  • Is there going to be anybody around in the next 2 weeks to have a meeting? If I'm talking to myself the meetings will be virtual, but I would like summaries of the week's issues to be posted here.
    • Next Tuesday is the 24th, so Julian might not connect.

-- JenniferAdelmanMcCarthy - 13 Dec 2013

Topic revision: r6 - 2013-12-17 - JenniferAdelmanMcCarthy