Indico connect


* Jen, Dave, Seangchan, Adli, Andrew, Alan, Julian, Luis, John


Jan 21 - Jan 27: Xavier. What happened?
Jan 28 - Feb 3: next shift, Adli.
Next shift? Xavier? Sara or Sunil?
  • Please post to the e-log when you shift in and shift out.
  • Please do a shift handoff.
  • Julian will arrange next shift

Agent Issues

    • For us: last three columns (CPU bound and IO bound), updated. We are responsible for updating and managing the values.
      • Enough merges and not too many.
  • IN2P3 disk/tape separated site
  • 12335 PhedexInjector 400 error, patched in all agents.
  • 12156 Disk problem errors.
  • 12372 147 workflows stuck in acquired; for some of them the cause was not resource control. Luis and Seangchan developed a patch in Global Workqueue that increased a connection timeout.
  • Several couch crashes during the weekend.
    • These cause missing job information (jobs that were supposed to be allocated but disappeared).
    • What is our line of attack to recover from this? We do not have time to kill and clone all the workflows that were affected before the Feb 10 deadline.
    • Some troubles with xrootd during the weekend.
    • disk full alarm: to workflow team and CRC list.
  • Agent release-redeploy plan:
    • Agents need to be redeployed periodically to avoid couch and db issues.
    • Every two months? Next deploy at the DBS migration (Feb 10), the following one in mid-March.
    • Some agents will be redeployed before DBS migration.
      • start with vocms202, vocms216 and vocms85 (this Friday)
      • redeploy the version that is already running.

Site Issues that affected workflows

  • 12404 FNAL stageout issues.
  • Increasing thresholds on IT_CNAF and FR_CCIN2P3

  • T2_TH_CUNSTDA - Commissioning for production
  • T2_IN_TIFR (100 slots) - Commissioning for production

  • Poll - High memory slots
    • T1s and US T2s complete - pending EU T2s
    • Google spreadsheet: link

Workflows Issues

  • Closeout Script output
  • Complete WFs that have no running ACDC
    • This is nice. It would be good to have this as a "view" in the agent where we can also mark the current status. The list looks long, but many of the entries have open issues on which we are waiting for a response. A page like this, with a way to link to the e-log thread or similar, would let us say "Oh, this one isn't running ACDC, what's going on with it?" without having to dig. I believe we have discussed having this as a feature in WMStats and are just waiting for it to be implemented. This is also where the "pending X" state in the agent would be good: it gives us fast access to things that need to be taken care of.


DBS2/DBS3 Upgrade

Relval/Andrew's questions

  • wall-clock/CPU time ratio, time per event
  • Nebraska for RelVal: 3K slots at 4GB. They don't need special configuration.
  • FNAL EOS downtime, Jan 30
    • A bit of delay, hold jobs but no issues.
  • high memory workflows
  • do we have a version of dbsTest with getRunLumiCountDatasetBlockList that works with DAS?
  • how to get process ID for job running on a worker node
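
For the wall-clock/CPU time ratio and time-per-event item above, a minimal sketch of how the two numbers could be computed from per-job totals. The class and function names are hypothetical and are not taken from WMAgent, WMStats, or any RelVal tooling; the input is assumed to be wall-clock seconds, CPU seconds, and event counts already collected per job.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobTiming:
    """Hypothetical per-job record; field names are illustrative only."""
    wall_clock_s: float  # elapsed wall-clock time in seconds
    cpu_s: float         # consumed CPU time in seconds
    events: int          # number of events processed


def wall_to_cpu_ratio(jobs: List[JobTiming]) -> Optional[float]:
    """Total wall-clock time divided by total CPU time (None if no CPU time)."""
    total_cpu = sum(j.cpu_s for j in jobs)
    if total_cpu == 0:
        return None
    return sum(j.wall_clock_s for j in jobs) / total_cpu


def time_per_event(jobs: List[JobTiming]) -> Optional[float]:
    """Total wall-clock seconds divided by total events (None if no events)."""
    total_events = sum(j.events for j in jobs)
    if total_events == 0:
        return None
    return sum(j.wall_clock_s for j in jobs) / total_events
```

Summing over all jobs before dividing weights long jobs more heavily than averaging per-job ratios would; which of the two is wanted for RelVal is not stated in these notes.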


-- JenniferAdelmanMcCarthy - 28 Jan 2014

Topic revision: r9 - 2014-01-28 - LuisContreras