Workflow Team Meeting Oct 1, 2013

Login Information

https://indico.cern.ch/conferenceDisplay.py?confId=254674

Attending

Julian, Alan, Jen, SeangChan, Luis, John, Dave M, Andrew, Sara, Xavier

Personnel

  • Edgar on vacation all week
  • Coming off shift - Sara
  • Coming on shift - Xavier, Oct 1 --> Oct 8
  • The schedule does not go past next week! Xavier, I believe you are in charge of schedules. Please let me know if there is anything I can do to help!

Issues from last week

Infrastructure

  • Central CouchDB and WMStats issues (Alan):
    • We all know that there are central problems which are badly affecting workflow and cmsweb operations. After some iterations with SeangChan we got a patch for WMAgent and another for WMStats:
      • WMA_0969_patch and WMA_0979_patch: these make the agent not replicate documents of successful jobs (unless there was a job retry).
      • WMStats_patch: a task summary overview was added to the Jobs tab, so we don't miss useful information/numbers related to the successful jobs.
    • Here is an action plan to start coping with these problems:
      • validation of WMA 0.9.79: Preslav and I are performing the validation. We already know of two issues.
      • validation of SeangChan's patch for WMStats: it was deployed to cmsweb-testbed, and if everything is fine by Wednesday morning GVA time (Oct 2nd), it'll be announced and go into production on Thursday (Oct 3rd).
      • drain of agents: vocms235 has been draining for a long time. Can we start draining vocms202 or vocms234? We need at least two re-deployed agents to do the wmstats operation/cleaning in cmsweb.
      • with 0.9.79 validated and these agents drained, we can deploy a v0.9.79 WMA and wipe out the local couchDB (it's required for the cleaning of the couch in cmsweb). Only after this step will WMStats information be reliable again, I hope...
      • in the end, we should move all production agents to a new version (0.9.79 maybe) and then we can do another cleanup of the wmstats database in cmsweb. At that point we will have a small and reliable database.
  • WMAgent issues:
    • AnalyticsDataCollector monitoring improvement: vocms201, 216 and 235 patched. The patch separates the thread that reports component status to wmstats from the data collector thread. The logs now show whether a component is down in the agent.
      • hope to upgrade 235 on Friday
      • Action Item - Jen to talk to Jacob about whether we can drain 202 or 234 for upgrades and run all the jobs on the other agent
    • New:
      • Display last time data was updated from each agent in wmstats
      • Don't make JobUpdater/TaskArchiver crash with couch connection error
    • Pending:
    • Need to discuss with SeangChan the procedure for re-booting Couch
      • SeangChan says we can just stop/restart any down components. I specifically remember Steve telling us not to do this as it could do "very bad things", and that the only safe way to restart couch was to shut down the agent, shut down couch, restart couch, restart the agent. Did something change to make this no longer dangerous?
        • We are already losing the data, so shutting down and restarting the agent isn't going to buy us anything
        • It may say that the threads are down when it just can't talk to couch
        • If couch is down "silently", it means that replication is down and we need to restart the replication (manage status shows couch down)
        • ps -elf | grep couch: if it is more than 100 then we need to restart couch
        • vocms216 JobAccountant down - FJR missing the test name. This is a new issue that SeangChan is working on.
  • Machine issues
    • We stopped running jobs over the weekend because the CERN frontend's firewall subscription expired (Savannah ticket 139981)
      • Alison updated the firewall subscription and we are now back to running jobs
      • Nice catch, Sara!
  • Other operational issues
    • On Friday afternoon Jacob and I discovered a problem with the closeout script: it is not properly accounting for the RERECO jobs.
      • Jacob has a dbsTest that is still giving the proper % done, so either I will use that this week to close out, or I will dig through the changes Edgar made before going on vacation and fix the main script
      • There are some ~50 MC jobs that are stuck in running due to the Couch instability issues.
      • I will come up with a complete list of jobs, have SeangChan verify that the agent in fact believes they are done, then post the list to e-log to have the MC team verify that we can close them out.
        • Does somebody else have time to help dig? Instructions for checking out stuck WFs are on the operators page
      • Not all of the WFs have the statistics that were requested yet, so we will have to decide what to do about them.
    • Luis put directions on the Operations page for running the script that reports the status of workflows. It should be run and reported on daily, along with the site usage. Please make sure you run and send this report.
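
The couch process check mentioned under the WMAgent pending items above (ps -elf | grep couch, restart if more than 100) could be wrapped in a small shell script. This is only a sketch: it assumes "more than 100" means the number of matching lines of ps output, and the function names and the THRESHOLD variable are made up for illustration. It only reports; it does not restart anything.

```shell
#!/bin/sh
# Threshold taken from the minutes; reading it as a line count is an assumption.
THRESHOLD=100

# count_couch_lines (hypothetical helper): count lines of `ps -elf`
# output that mention couch. The [c] bracket pattern keeps grep from
# matching its own command line.
count_couch_lines() {
    ps -elf | grep -c '[c]ouch'
}

# needs_restart N (hypothetical helper): succeeds (exit status 0)
# only when N is strictly greater than the threshold.
needs_restart() {
    [ "$1" -gt "$THRESHOLD" ]
}

n=$(count_couch_lines)
if needs_restart "$n"; then
    echo "couch lines: $n (more than $THRESHOLD): restart couch"
else
    echo "couch lines: $n: no restart indicated by this check"
fi
```

Keeping the count and the comparison in separate functions makes the threshold easy to change if 100 turns out not to be the right number.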

Site issues

Sites for Production

Site in MC | Slots | Status | Notes | Issues
T2_RU_PNPI | 176 | skip | to be commissioned | under maintenance until Oct 07

Action Items

  • Write twiki for disk/tape separation at T1_IT_CNAF (Edgar)
  • Recovery workflows - Jen - suspend
    • The first 2 workflows are completely through, and now we are waiting for people to really look and make sure that there are no show stoppers before we do the other 50.
    • Guillemo is bothering JeanRoc about whether people have actually looked at the data
  • Site status script, feeding SSB. (Julian)
    • Stuck with the whole list script; will work with John offline on making progress
    • There is a script behind the site status board that reads from a file other than the one that we update so that the site status board is updating properly.
  • Tracking of .tar file sizes on some workflows (Julian)
    • ongoing
  • Action Item - Jen to talk to Jacob about whether we can drain 202 or 234 for upgrades and run all the jobs on the other agent

AOB

  • Why are some of the ACDCs not being assigned? They are trying to use a merged file as input, but the file is gone, so the jobs are not going to be run. If the ACDC happens after 2 weeks, the files are gone.
Topic revision: r5 - 2013-10-01 - JenniferAdelmanMcCarthy
 