https://indico.cern.ch/conferenceDisplay.py?confId=254674
Attending
- Jen - from home
- SeangChan, Luis, Dave M at Fermilab
- Andrew
- John at Fermilab (different room)
Personnel
- Edgar back from vacation
- Oct 1 --> Oct 8 Xavier
- Oct 8 --> Oct 15 Sara
- Jen is dealing with a sick dog this week and will likely be working from home all week. She will be available online.
Infrastructure
- WMAgent issues:
- We spent considerable time looking into the stability issues of the agents. Luis and SeangChan made patches and updated the agents, which seemed to help stabilize things a bit.
- New:
- Replication stops - Solved issue, 216 and 201 need to be patched
- central couch problem: rotation is planned for this Friday; a patch is available that filters out successful jobs so they are not migrated to central couch. We will drain more and more agents.
- over the weekend, we found that local couch was also not deleting documents properly; this has been patched
- need to update documentation on restarting couch replication.
- It won't be needed when we move to 0.9.79
- Pending:
- Display last time data was updated from each agent in wmstats
- Don't make JobUpdater/TaskArchiver crash with couch connection error
- CondorPlugin UnitTests
- Couch calls take too long
- upgrade of 235
- testing of the parentage problem at the skims - Edgar and Alan
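For the JobUpdater/TaskArchiver item in the pending list, one possible approach is to catch the connection error inside the polling cycle and skip that cycle rather than let the component die. A minimal sketch in plain Python, not the WMAgent code: `CouchConnectionError`, `poll_once` and `run_cycle` are hypothetical names used only for illustration.

```python
import logging


class CouchConnectionError(Exception):
    """Hypothetical stand-in for whatever error the couch client raises."""


def run_cycle(poll_once, logger=logging.getLogger("component")):
    """Run one polling cycle.

    Returns True on success and False if couch was unreachable.
    A connection error is logged and swallowed so the component survives
    and can try again on its next cycle. Other exceptions still propagate.
    """
    try:
        poll_once()
    except CouchConnectionError as exc:
        logger.warning("couch unreachable, skipping this cycle: %s", exc)
        return False
    return True
```

Only the connection error is caught here, so genuine bugs in the component would still surface as crashes.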
- Workflow issues:
- We had a significant number of WFs that were 'stuck' for MC processing.
- Jen & Luis spent time debugging these workflows
- We looked first at the workflows that were stuck but over 95% complete.
- One of the main reasons they were not working was that the agent lost the site information for Cleanup & Merge jobs. The cause was determined to be the instability of couch all month. Luis and SeangChan are looking into ways to prevent this from happening in the future.
- Edgar will provide a script with the idea for Luis to work on.
- Issues with closeout script
- The version of dbsTest.py that is currently in git gives the wrong counts for the ReDigi WFs. Jacob verified that an old version he had around gave the correct answers, so for the week we used the old version of the script to close out the ReDigi/ReReco WFs and the new version for MC.
- seems to be working OK now
- dbsTest.py had intermittent problems talking to DBS all week long. https://cmslogbook.cern.ch/elog/Workflow+processing/10120
- we have not yet come up with a solution on how to fix this problem. Hopefully Edgar will have a quick fix now that he is back.
- this problem is persisting
- It will be fixed in newest version of DAS in Oct deployment.
- condor_overview fixed and improved
- Rereco jobs at Vanderbilt: Jen will double check.
Site Problems
- Two T2_PL_Warsaw tickets:
Waiting Room
Sites for Production
Site in MC | Slots | Status | Notes | Issues |
T2_RU_PNPI | 176 | skip | to be commissioned | under maintenance until Sep 30 - SAM & Links errors |
T2_IN_TIFR | 355 | drain | they claim they fixed site issues | Site was in WR for 22 weeks |
- The old "enable DQM harvesting" parameter is causing problems. Andrew will create a GitHub issue.
- Request manager should check against DBS for valid data tiers at request creation time. Andrew will create the GitHub issue and contact SeangChan.
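The data tier check proposed above could look roughly like the sketch below. It assumes the caller has already fetched the list of valid tiers from DBS; that lookup is not shown, and `check_data_tiers` is a hypothetical name, not an existing Request manager function.

```python
def check_data_tiers(requested_tiers, valid_tiers):
    """Reject a request that names data tiers DBS does not know about.

    valid_tiers is assumed to have been fetched from DBS by the caller.
    Raises ValueError listing every unknown tier; otherwise returns the
    requested tiers unchanged.
    """
    unknown = sorted(set(requested_tiers) - set(valid_tiers))
    if unknown:
        raise ValueError("unknown data tiers: " + ", ".join(unknown))
    return list(requested_tiers)
```

Failing at request creation time means a mistyped tier is reported to the requester immediately instead of surfacing later during processing.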
AOB
- John, Julian and Gockhan will follow up with the dashboard on how to finally get rid of the conf file. In the .json published to the SSB, all sites have to appear even if they have 0 jobs in condor.
- How can we detect stuck requests faster?
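On the SSB .json point above, the requirement that every site appears even with 0 jobs can be met by building the report from the full site list rather than from the condor output. A minimal sketch, assuming a simple site-to-job-count layout; the real SSB .json format may differ, and `build_site_report` is a hypothetical name.

```python
import json


def build_site_report(all_sites, condor_job_counts):
    """Return a JSON string with one entry per known site.

    Sites missing from condor_job_counts are reported with 0 jobs
    instead of being left out of the report.
    """
    report = {site: condor_job_counts.get(site, 0) for site in all_sites}
    return json.dumps(report, sort_keys=True)
```

Driving the loop from the site list also removes the need for a separate conf file to know which sites to expect, provided the site list itself comes from an authoritative source.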