https://indico.cern.ch/conferenceDisplay.py?confId=254678
Attending:
  • FNAL - Jen, Luis, SeangChan, Dave
  • CERN - Julian, Adli, Andrew Edgar
Personnel:
  • Oct 22 --> Oct 29 Julian (+Adli)
  • Oct 29 --> Nov 4 Adli
  • Edgar is gone Nov 8th now?? or are you officially gone on the 31st?
  • John is gone Oct 23-Nov 11
Agent issues
  • vocms235 was having couch issues today; a patch was applied and it has now been up for 3 hrs
    • couch processes maxing out is our most common problem, but we are not sure that was the issue with today's crashes - Luis will look into this to see if this was in fact the problem today
    • Error handler patch was applied. It has been tested, but this is its first deployment on a production machine, so we need to keep an extra close eye on this machine for the next few days.
    • failed job report numbers may not be matching properly, so SeangChan will keep an extra close eye on things.
  • vocms227 ErrorHandler problem - it's a connection problem
  • vocms202 (reprocessing agent): up and running
  • vocms216 & 234 are re-installed and ready for jobs
  • vocms201 & cmssrv112 are in drain for upgrades that will happen later this week
  • RelVal now has the same priority as MC, so this should help our issue with RelVal taking all the slots at FNAL
  • FNAL not getting any jobs: behaving normally
  • WMAgent issues:
    • v0.9.82 deployed in vocms202, vocms235, cmssrv98, and vocms85. Next: vocms216, vocms234.
    • Oracle on vocms85
  • disk full problem: warning patch is being tested
    • for now we still do not have the warning, but we have an alarm that will e-mail us if the disk is filling
    • this will let us know if /data1 or /data starts to fill; if it's /data1 there is nothing to clean
    • SeangChan will update the twiki with information on what to do when the disk fills
    • Jen and Luis will ask Burt/Krista to put this same alarm on the FNAL machines at 90%
    • 113 is currently at 88%; SeangChan will write up documentation for cleanup and Andrew will test it
  • PhEDEx subscriptions issue: solved
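A minimal sketch of the kind of disk-usage check discussed under the disk full problem above. This is not the actual alarm or the warning patch being tested; the function names are hypothetical, and only the mount points (/data, /data1) and the 90% threshold come from these notes.

```python
import shutil

# 90% is the threshold mentioned in the notes for the FNAL machines.
ALARM_THRESHOLD = 0.90


def usage_fraction(path):
    """Return the used fraction (0.0 to 1.0) of the filesystem holding path."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def over_threshold(fractions, threshold=ALARM_THRESHOLD):
    """Return the sorted mount points whose used fraction is >= threshold."""
    return sorted(p for p, f in fractions.items() if f >= threshold)


def check_mounts(paths, threshold=ALARM_THRESHOLD):
    """Measure each existing path and return those at or above threshold."""
    fractions = {}
    for path in paths:
        try:
            fractions[path] = usage_fraction(path)
        except OSError:
            # Mount point does not exist on this machine; skip it.
            continue
    return over_threshold(fractions, threshold)


if __name__ == "__main__":
    for mount in check_mounts(["/data", "/data1"]):
        print(f"WARNING: {mount} is at or above {ALARM_THRESHOLD:.0%}")
```

With this threshold, a machine at 88% (as 113 currently is) would not yet trigger the alarm.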

Workflow issues:

  • Workflow with massive fail - Input Data invalidation.
    • once the files were invalidated it moved along nicely and is now out of our hands
  • We have a number of workflows that are not at 100%, but we've run ACDC and have no failures.
  • we have made HUGE strides in understanding why WF's are getting stuck and in getting the stuck list back under control.. for now... please keep up the good work everyone!
  • we need to start incorporating Edgar's ideas for preventing stuck workflows in the first place.
"backfill"
  • due to recent agent issues, and the fact that we currently do not have a lot of jobs running to keep sites busy, Oli has requested that we run "backfill", which is basically running a known job over and over again to make sure that there are no stability issues
  • the first 2 workflows are in: https://cmslogbook.cern.ch/elog/Workflow+processing/10729
    • we need to keep some end to end statistics
      • when was the WF submitted
      • when did it end
      • how long did it take for ACDC to go through, etc.
      • we will treat these like normal WF's, except that when we are all done we delete the outputs, as we have already run the data.
    • as soon as one backfill WF is finished the next one goes in; the idea is that we keep constant pressure on the sites to ensure that no problems are creeping up on us. We will start with just the T1's; once we get that going smoothly we will add T2's
    • query DAS for first and last events to go into a dataset to get the numbers
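A possible way to keep the end-to-end statistics listed above, sketched in Python. The record layout, the field names and the example workflow names and timestamps in the usage note are all hypothetical; the notes only say that submission time, end time and the time ACDC took should be recorded.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackfillRecord:
    """Timing information for one backfill workflow (hypothetical layout)."""
    name: str
    submitted: datetime
    ended: datetime
    acdc_started: Optional[datetime] = None
    acdc_ended: Optional[datetime] = None

    def total_duration(self) -> timedelta:
        """Time from submission to the end of the workflow."""
        return self.ended - self.submitted

    def acdc_duration(self) -> Optional[timedelta]:
        """Time ACDC took, or None if no ACDC was run or recorded."""
        if self.acdc_started is None or self.acdc_ended is None:
            return None
        return self.acdc_ended - self.acdc_started


def mean_duration(records):
    """Average submission-to-end time over a non-empty list of records."""
    if not records:
        raise ValueError("no records to average")
    total = sum((r.total_duration() for r in records), timedelta())
    return total / len(records)
```

For example, with made-up timestamps, `BackfillRecord("backfill_example", datetime(2013, 10, 29, 8), datetime(2013, 10, 30, 8)).total_duration()` gives a `timedelta` of one day.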
Site Problems
  • All US T2 fail the xrootd-fallback SAM test
    • Need to understand why
  • Issue with T2_US_Vanderbilt SAM availability
  • We have started rolling out the SL6 workers in the FNAL cluster. All the test workflows ran but if you or anyone in dataops see any issues with FNAL, please let me or cmst1 know so we can take care of it asap.
cms-comp-ops-site-support-team (Site Support Team) <cms-comp-ops-site-support-team@cern.ch>

Status on WebDAV_CMS deployment:

Previous week


Current week

Status on the New Subsite Mechanism update (sites number):

CMS Site Contacted Replied Updated
T1_DE_KIT Yes Yes Yes
T1_IT_CNAF Yes Yes Yes
T2_DE_RWTH No No No
T2_FR_GRIF_IRFU No No No
T2_FR_GRIF_LLR No No No
T2_UK_London_Brunel No No No
T2_US_Florida No No No
T2_US_Nebraska No No No
T2_US_Purdue No No No

Status on the site-local-config.xml update:

CMS Site Contacted Replied Updated
T1_DE_KIT Yes Yes Yes
T1_ES_PIC Yes Yes Not yet
T1_FR_CCIN2P3 Yes Yes Yes
T1_IT_CNAF Yes Yes Yes
T1_RU_JINR Yes Yes Yes
T1_UK_RAL Yes Yes Yes
T1_US_FNAL Yes Not yet Not yet

Sites currently not in the enabled LifeStatus state:

Updated 17/Jan/2022

SITE Status Duration Reason
T2_GR_Ioannina Morgue 1+ month(s) SAM + HC Evals
T2_PK_NCP Waiting Room 1+ month(s) SAM + HC + FTS Evals
T2_RU_ITEP Waiting Room 1+ month(s) SAM + FTS Evals
T2_TW_NCHC Waiting Room 1+ month(s) SAM Evals
T2_US_UCSD Waiting Room 1+ month(s) SAM Evals
T2_RU_SINP Morgue 1+ month(s) No change

SITE Update
T2_AT_Vienna Exiting WR
T2_US_UCSD Exiting WR

Ticket journal of last week:

Ticket CMS Site Last update Status Subject
152030 T2_CN_Beijing Thursday, 01/13 assigned Erroneous consistency check en
152031 T2_UK_London_Brunel Monday, 01/10 in progress Erroneous consistency check en
152036 T2_TR_METU Tuesday, 01/11 solved Erroneous consistency check en
153517 T2_ES_IFCA Wednesday, 01/12 in progress Consistency check (cc) scans f
153686 T2_TW_NCHC Tuesday, 01/11 assigned JobSubmit errors at T2_TW_NCHC
153990 T2_RU_ITEP Wednesday, 01/12 in progress Failed CMS Consistency Enforce
154053 T1_FR_CCIN2P3 Monday, 01/10 in progress esting Tape access via srm+htt
154054 T1_US_FNAL Thursday, 01/13 assigned Testing Tape access via srm+ht
154228 T2_TW_NCHC Wednesday, 01/12 assigned Deletion issues at your site (
154399 T1_UK_RAL Monday, 01/10 solved CMS Hammer Cloud jobs are fail
154585 T2_TW_NCHC Tuesday, 01/11 solved CMS data world-readble via Web
154667 T1_ES_PIC Wednesday, 01/12 in progress FileReadErrors at T1_ES_PIC
154860 T3_HR_IRB Wednesday, 01/12 in progress TPC WebDAV protocol deployment
154893 T2_US_Nebraska Monday, 01/10 in progress Pilots at T2_US_Nebraska
154927 T2_US_UCSD Wednesday, 01/12 assigned CMS Frontier fail-over from SD
154985 T2_US_UCSD Monday, 01/10 waiting for reply Jobs removed from the CE but s
155111 T2_FI_HIP Wednesday, 01/12 in progress SAM tests for one CE failing a
155154 T2_TW_NCHC Tuesday, 01/11 solved All services failing at T2_TW_
155162 T2_US_Purdue Wednesday, 01/12 assigned Pilots at T2_US_Purdue
155179 T2_US_MIT Monday, 01/10 verified GPU resources at T2_US_MIT
155228 T2_RU_ITEP Thursday, 01/13 solved Transfers failing from T2_RU_I
155236 T2_US_Vanderbilt Monday, 01/10 in progress Pilots at T2_US_Vanderbilt
155255 T1_IT_CNAF Tuesday, 01/11 closed File exists and overwrite not
155258 T1_US_FNAL Wednesday, 01/12 closed File exists and overwrite not
155261 T2_PK_NCP Thursday, 01/13 assigned SAM tests failing at T2_PK_NCP
155280 T2_FR_GRIF_IRFU Wednesday, 01/12 reopened Eurasian redirector at T2_IT_P
155294 T2_UK_London_Brunel Wednesday, 01/12 reopened XRootD tests failiing at T2_UK
155310 T2_EE_Estonia Monday, 01/10 solved SAM tests for CE not executing
155326 T2_US_MIT Wednesday, 01/12 closed File exists and overwrite not
155365 T2_ES_IFCA Monday, 01/10 verified Pilots at T2_ES_IFCA
155396 T2_US_MIT Wednesday, 01/12 closed Tape files to delete T2_US_MIT
155397 T1_IT_CNAF Tuesday, 01/11 closed Tape files to delete T1_IT_CNA
155404 T2_US_Vanderbilt Monday, 01/10 verified Please check your loadtest fil
155427 T2_FR_GRIF_IRFU Thursday, 01/13 waiting for reply Container creation failures at
155432 T2_US_Purdue Friday, 01/14 closed SAM tests for one CE are not b
155459 T1_FR_CCIN2P3 Friday, 01/14 in progress TrivialFileCatalog errors at T
155464 T3_BG_UNI_SOFIA Thursday, 01/13 in progress CMS Frontier Squid at T3_BG_UN
155467 T2_US_Florida Monday, 01/10 solved SAM tests for CEs are not bein
155472 T2_ES_IFCA Monday, 01/10 solved All services failing at T2_ES_
155485 T2_UK_London_IC Monday, 01/10 in progress SAM Squid test for T2_UK_Londo
155486 T3_UK_ScotGrid_GLA Monday, 01/10 waiting for reply SAM Squid test for T3_UK_ScotG
155491 T1_DE_KIT Monday, 01/10 in progress Incomplete configuration error
155499 T2_US_Nebraska Thursday, 01/13 assigned xrootd and webdav SAM tests fa
155512 T2_DE_RWTH Wednesday, 01/12 in progress Production failures at T2_DE_R
155515 T2_IT_Pisa Thursday, 01/13 assigned SAM tests failing at T2_IT_Pis
155518 T1_IT_CNAF Wednesday, 01/12 verified Pilots at T1_IT_CNAF
155519 T2_ES_IFCA Wednesday, 01/12 verified Pilots at T2_ES_IFCA
155520 T2_US_Nebraska Thursday, 01/13 verified Pilots at T2_US_Nebraska
155523 T3_CH_PSI Thursday, 01/13 reopened CMS Frontier Squid at T3_CH_PS
155524 T2_AT_Vienna Friday, 01/14 in progress One CE is not executing SAM te
155537 T2_UK_SGrid_RALPP Friday, 01/14 in progress SAM tests failing at T2_UK_SGr
155540 T2_IT_Pisa Friday, 01/14 assigned Pilot jobs going held with "Er

Number of tickets: 52, Generated on 17/Jan/2022

Sites with open GGUS tickets:

Generated on 17/Jan/2022, Total number of tickets: 82
CMS Site Number of Tickets Tickets
T0_CH_CERN 1 155135
T1_DE_KIT 3 155362 155491 155003
T1_ES_PIC 2 154052 154667
T1_FR_CCIN2P3 2 154053 155459
T1_RU_JINR 1 155002
T1_UK_RAL 1 155191
T1_US_FNAL 4 155183 154054 155184 154751
T2_AT_Vienna 1 155524
T2_BE_UCL 2 153516 148354
T2_BR_SPRACE 2 154677 155206
T2_CN_Beijing 2 153493 152030
T2_DE_RWTH 1 155512
T2_ES_IFCA 1 153517
T2_FI_HIP 1 155111
T2_FR_GRIF_IRFU 2 155280 155427
T2_GR_Ioannina 4 152029 150722 154569 154803
T2_IT_Bari 1 154499
T2_IT_Legnaro 1 153024
T2_IT_Pisa 3 155515 155293 155540
T2_IT_Rome 1 153026
T2_KR_KISTI 1 154265
T2_PK_NCP 1 155261
T2_PT_NCG_Lisbon 1 149275
T2_RU_ITEP 2 154583 153990
T2_TW_NCHC 2 154228 153686
T2_UK_London_Brunel 2 152031 155294
T2_UK_London_IC 1 155485
T2_UK_SGrid_Bristol 3 150734 154573 154227
T2_UK_SGrid_RALPP 1 155537
T2_US_MIT 1 155289
T2_US_Nebraska 4 154893 155499 155393 153627
T2_US_Purdue 1 155162
T2_US_UCSD 4 155454 154985 154927 154931
T2_US_Vanderbilt 4 154417 155236 155030 155431
T3_BG_UNI_SOFIA 2 154857 155464
T3_CH_PSI 2 155523 154858
T3_FR_IPNL 1 154859
T3_HR_IRB 2 154611 154860
T3_IT_MIB 1 154861
T3_KR_KISTI 1 154863
T3_MX_Cinvestav 1 154865
T3_TW_NCU 1 150488
T3_TW_NTU_HEP 1 151300
T3_UK_ScotGrid_GLA 1 155486
T3_US_Colorado 1 154866
T3_US_Minnesota 1 154337
T3_US_NotreDame 1 155278
T3_US_PuertoRico 1 154588
T3_US_Rutgers 1 155190
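The per-site counts in the table above can be cross-checked against the listed ticket IDs with a few lines of Python. This is only an illustrative sketch: the helper names are hypothetical, and it assumes each row has the form "SITE COUNT ID [ID ...]" as in the listing. The three sample rows are copied from the table.

```python
def parse_row(line):
    """Split 'SITE COUNT ID [ID ...]' into (site, declared_count, ticket_ids)."""
    fields = line.split()
    return fields[0], int(fields[1]), fields[2:]


def check_rows(lines):
    """Return (total number of ticket IDs, sites whose count disagrees)."""
    total = 0
    mismatched = []
    for line in lines:
        site, declared, ids = parse_row(line)
        total += len(ids)
        if declared != len(ids):
            mismatched.append(site)
    return total, mismatched


rows = [
    "T0_CH_CERN 1 155135",
    "T1_DE_KIT 3 155362 155491 155003",
    "T1_US_FNAL 4 155183 154054 155184 154751",
]
print(check_rows(rows))  # (8, [])
```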

Updated on: 2021-11-01 at 08:21:24 by HectorCamiloZambranoHernandez

  • For any problem, email the site support team list (while John is on vacation).
AOB
  • the resource control script had an error; an issue was created and a change got into 9.28, but updating site information hasn't really been fixed
    • needs to be patched in on all the agents
  • the ErrorHandler on 227 had a typo; that is fixed
  • certificate problems should not be shared with production agents; this needs to be checked to verify that we are doing things correctly


This topic: CMSPublic > CompOps > CompOpsWorkflowTeam > WorkflowTeamMeeting > WorkflowTeamMeeting20131029
Topic revision: r4 - 2016-07-22 - StephanLammel