https://indico.cern.ch/conferenceDisplay.py?confId=254678
Attending:
  • FNAL - Jen, Luis, SeangChan, Dave
  • CERN - Julian, Adli, Andrew, Edgar
Personnel:
  • Oct 22 --> Oct 29 Julian (+Adli)
  • Oct 29 --> Nov 4 Adli
  • Edgar is gone Nov 8th now?? or are you officially gone on the 31st?
  • John is gone Oct 23-Nov 11
Agent issues
  • vocms235 was having couch issues today, they applied a patch and it has now been up for 3 hrs
    • couch processes maxing out is our most common problem; not sure if that was the issue with today's crashes. Luis will look into whether this was in fact the problem today
    • Error handler patch was applied. It has been tested, but this is its first deployment on a production machine, so we need to keep an extra close eye on this machine for the next few days.
    • failed job report numbers may not be matching properly, so SeangChan will keep an extra close eye on things.
  • vocms227 ErrorHandler problem - it's a connection problem
  • vocms202 (reprocessing agent): up and running
  • vocms216 & vocms234 are re-installed and ready for jobs
  • vocms201 & cmssrv112 are in drain for upgrades that will happen later this week
  • RelVal now has the same priority as MC, so this should help our issue with RelVal taking all the slots at FNAL
  • FNAL not getting any jobs: behaving normally
  • WMAgent issues:
    • v0.9.82 deployed in vocms202, vocms235, cmssrv98, and vocms85. Next: vocms216, vocms234.
    • Oracle on vocms85
  • disk full problem: warning patch is being tested
    • for now we still do not have the warning, but we have an alarm that will e-mail us if the disk is filling
    • this will let us know if /data1 or /data starts to fill; if it's /data1 there is nothing to clean.
    • SeangChan will update the twiki with information on what to do when the disk fills
    • Jen and Luis will ask Burt/Krista to put this same alarm for the FNAL machines at 90%
    • 113 is currently at 88%. SeangChan will write up documentation for cleanup; Andrew will test it
  • PhEDEx subscriptions issue: solved
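
A minimal sketch of the disk-fill alarm discussed above, assuming a simple periodic check in Python. The paths /data and /data1 and the 90% threshold come from the notes; the function names and the idea of returning messages (rather than sending e-mail directly) are illustrative assumptions, not the alarm actually deployed.

```python
import shutil

# Paths and threshold taken from the meeting notes; everything else is hypothetical.
WATCHED_PATHS = ("/data", "/data1")
THRESHOLD = 0.90


def disk_fill_fraction(path, usage_fn=shutil.disk_usage):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding `path`."""
    usage = usage_fn(path)
    return usage.used / usage.total


def check_disks(paths=WATCHED_PATHS, threshold=THRESHOLD, usage_fn=shutil.disk_usage):
    """Return one alarm message per path at or above `threshold`.

    `usage_fn` is injectable so the check can be exercised without real disks.
    """
    alarms = []
    for path in paths:
        fraction = disk_fill_fraction(path, usage_fn)
        if fraction >= threshold:
            alarms.append("%s is %.0f%% full" % (path, fraction * 100))
    return alarms
```

The returned messages could then be handed to whatever e-mail mechanism the agents already use. A machine at 88%, as reported for 113, would not yet trigger at the 90% threshold.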

Workflow issues:

  • Workflow with massive failures - input data invalidation.
    • once the files were invalidated it moved along nicely and is now out of our hands
  • We have a number of workflows that are not at 100%, but we've run ACDC and have no failures.
  • we have made HUGE strides in understanding why WFs are getting stuck and in getting the stuck list back under control... for now. Please keep up the good work everyone!
  • we need to start incorporating Edgar's ideas for preventing stuck workflows in the first place.
"backfill"
  • due to recent agent issues, and because we currently do not have many jobs running to keep sites busy, Oli has requested that we run "backfill": running a known job over and over again to make sure that there are no stability issues
  • the first 2 workflows are in: https://cmslogbook.cern.ch/elog/Workflow+processing/10729
    • we need to keep some end to end statistics
      • when was the WF submitted
      • when did it end
      • how long did it take for ACDC to go through, etc.
      • we will treat these like normal WFs; only when we are all done do we delete the outputs, as we have already run the data.
    • as soon as one backfill WF is finished the next one goes in; the idea is that we keep constant pressure on the sites to ensure that there are no problems creeping up on us. We will start with just the T1s; once that is going smoothly we will add T2s.
    • query DAS for the first and last events to go into a dataset to get the numbers
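
One possible way to keep the end-to-end statistics listed above is a small record per backfill workflow. This is only a sketch: the class name, field names, and the choice to track ACDC start and end separately are assumptions for illustration, and the timestamps would have to be filled in by hand or from DAS queries.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackfillRecord:
    """Hypothetical bookkeeping entry for one backfill workflow."""
    name: str
    submitted: datetime
    ended: Optional[datetime] = None
    acdc_started: Optional[datetime] = None
    acdc_ended: Optional[datetime] = None

    def total_duration(self) -> Optional[timedelta]:
        """Time from submission to completion, or None if still running."""
        if self.ended is None:
            return None
        return self.ended - self.submitted

    def acdc_duration(self) -> Optional[timedelta]:
        """Time ACDC took to go through, or None if ACDC has not finished."""
        if self.acdc_started is None or self.acdc_ended is None:
            return None
        return self.acdc_ended - self.acdc_started
```

Keeping one record per workflow makes it easy to compare successive backfill rounds once T2s are added.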
Site Problems
  • All US T2s fail the xrootd-fallback SAM test
    • Need to understand why
  • Issue with T2_US_Vanderbilt SAM availability
  • We have started rolling out the SL6 workers in the FNAL cluster. All the test workflows ran but if you or anyone in dataops see any issues with FNAL, please let me or cmst1 know so we can take care of it asap.
cms-comp-ops-site-support-team (Site Support Team) <cms-comp-ops-site-support-team@cern.ch>

Status on the site-local-config.xml update:

CMS Site       Contacted  Replied  Updated
T1_DE_KIT      Yes        Yes      Not yet
T1_ES_PIC      Yes        Yes      Not yet
T1_FR_CCIN2P3  Yes        Not yet  Not yet
T1_IT_CNAF     Yes        Yes      Not yet
T1_RU_JINR     Yes        Yes      Yes
T1_UK_RAL      Yes        Yes      Yes
T1_US_FNAL     Yes        Not yet  Not yet

Sites currently not in the enabled LifeStatus state:

Site            Status        Duration     Reason
T2_AT_Vienna    Waiting Room  1+ month(s)  HC + FTS evals
T2_ES_IFCA      Waiting Room  1 week(s)    SAM + FTS Evals
T2_GR_Ioannina  Waiting Room  1 month(s)   SAM + HC + FTS Evals
T2_IN_TIFR      Waiting Room  1+ week(s)   SAM Evals
T2_PK_NCP       Waiting Room  1+ month(s)  HC + FTS Evals
T2_PL_Swierk    Waiting Room  1 month(s)   FTS Evals
T2_RU_ITEP      Waiting Room  1 week(s)    SAM + HC + FTS Evals
T2_RU_SINP      Morgue        1+ month     No change

Updated Jan/18/2021

Ticket journal of last week:

Ticket  CMS Site          Last update       Status       Subject
144936  T2_US_Vanderbilt  Tuesday, 01/12    solved       CVMFS fail-over from Vanderbil
148997  T2_CN_Beijing     Thursday, 01/14   solved       Update CMS SE from dCache to D
149392  T2_PL_Swierk      Thursday, 01/14   in progress  T2_PL_Swierk failing SAM WN-xr
149733  T2_AT_Vienna      Wednesday, 01/13  solved       T2_AT_Vienna failing SAM WN-xr
149942  T1_IT_CNAF        Tuesday, 01/12    closed       Transfer issues from your site
149966  T2_TW_NCHC        Friday, 01/15     assigned     Transfers timing out from/to T
149969  T2_RU_JINR        Tuesday, 01/12    solved       Jobs that were sent to T2_RU_J
150020  T2_ES_IFCA        Monday, 01/11     closed       Both Squids at T2_ES_IFCA
150041  T2_US_Nebraska    Wednesday, 01/13  closed       TFS transfer failure to T2_US_
150042  T2_CN_Beijing     Wednesday, 01/13  closed       SAM WN-xrootd-access failing a
150077  T2_US_Purdue      Monday, 01/11     unsolved     Pilots at T2_US_Purdue (Hadoop
150087  T2_CN_Beijing     Thursday, 01/14   unsolved     SAM tests for one CE are not b
150106  T2_RU_IHEP        Monday, 01/11     solved       StageOut Failures at T2_RU_IHE
150152  T2_KR_KISTI       Thursday, 01/14   solved       Read/open error in your site
150170  T2_DE_RWTH        Tuesday, 01/12    solved       Read/open error at your site
150175  T2_IN_TIFR        Monday, 01/11     assigned     Transfer issues from your site

Generated on 18/Jan/2021 (GMT)

Sites with open GGUS tickets:

Generated on 18/Jan/2021 11:14:52 (GMT), Total number of tickets: 25

CMS Site             Number of Tickets  Tickets
T0_CH_CSCS_HPC       2                  149166 147830
T2_AT_Vienna         1                  149686
T2_BE_IIHE           1                  149789
T2_BE_UCL            2                  148354 149968
T2_CN_Beijing        1                  149976
T2_FI_HIP            1                  149930
T2_FR_GRIF_IRFU      2                  149150 150076
T2_IN_TIFR           2                  147831 150175
T2_IT_Pisa           1                  149791
T2_PK_NCP            2                  150075 149773
T2_PL_Swierk         2                  149392 147006
T2_PT_NCG_Lisbon     2                  149275 150074
T2_RU_ITEP           1                  149999
T2_TW_NCHC           1                  149966
T2_UK_London_Brunel  1                  149955
T2_US_MIT            1                  150088
T3_US_Rice           1                  149315
T3_US_UMD            1                  149837

Updated on: 2021-01-11 at 11:58:10 by HectorCamiloZambranoHernandez

  • For any problem, email the site support team list (while John is on vacation).
AOB
  • resource control script had an error updating site information; an issue was created and a change got into 9.28, but it hasn't really been fixed
    • needs to be patched in on all the agents
  • 227 errorhandler had a typo, that is fixed
  • certificate problems: certificates should not be shared with production agents; this needs to be checked and verified


This topic: CMSPublic > CompOps > CompOpsWorkflowTeam > WorkflowTeamMeeting > WorkflowTeamMeeting20131029
Topic revision: r4 - 2016-07-22 - StephanLammel
 