https://indico.cern.ch/conferenceDisplay.py?confId=254674

Attending

  • Jen - from home
  • SeangChan, Luis, Dave M at Fermilab
  • Andrew

Personnel

  • Edgar back from vacation
  • Oct 1 --> Oct 8 Xavier
  • Oct 8 --> Oct 15 Sara
  • Jen is dealing with a sick dog and will likely be working from home all week. She will be available online.

Infrastructure

  • WMAgent issues:
    • We spent considerable time looking into stability issues of the agents. Luis and SeangChan made patches and updated the agents and that seemed to help stabilize things a bit.
    • New:
      • Replication stops - Solved issue, 216 and 201 need to be patched
      • central couch problem: rotation planned for this Friday; a patch is available to filter out successful jobs so they are not migrated to central. We will drain more and more agents.
      • on the weekend, found that local couch was also not properly deleting documents, patched
      • need to update documentation on restarting couch replication.
    • Pending:
      • Display last time data was updated from each agent in wmstats
      • Don't make JobUpdater/TaskArchiver crash with couch connection error
      • CondorPlugin UnitTests
      • Couch calls take too long
    • upgrade of 235
    • testing of the parentage problem - Edgar and Andrew
  • Workflow issues:
    • We had a significant number of WF's that were 'stuck' for MC processing.
      • Jen & Luis spent time debugging these workflows
      • We looked first at the workflows that were stuck but over 95% complete.
      • One of the main reasons they were not working was that the agent lost the site information for cleanup & Merge jobs. The cause was determined to be the instability of couch all month. Luis and SeangChan are looking into ways to prevent this from happening in the future.
    • Issues with closeout script
      • The version of dbsTest.py that is currently in git gives the wrong counts for the ReDigi WF's. Jacob verified that an old version he had around gives the correct answers, so this week we used the old version of the script to close out the ReDigi/ReReco WF's and the new version for MC.
        • seems to be working OK now
      • All week long dbsTest.py was having problems talking to DBS on and off. https://cmslogbook.cern.ch/elog/Workflow+processing/10120
        • We have not yet found a fix for this problem. Hopefully Edgar will have a quick fix now that he is back.
        • this problem is persisting
    • condor_overview fixed and improved

Site Problems

Waiting Room

Sites for Production

| Site | in MC | Slots | Status | Notes | Issues |
| T2_RU_PNPI | | 176 | skip | to be commissioned | under maintenance until Sep 30 - SAM & Links errors |
| T2_IN_TIFR | | 355 | drain | they claim they fixed site issues | Site was in WR for 22 weeks |
Topic revision: r4 - 2013-10-08 - JenniferAdelmanMcCarthy
 