https://indico.cern.ch/conferenceDisplay.py?confId=254678
Attending:
  • FNAL - Jen, Luis, SeangChan, Dave
  • CERN - Julian, Adli, Andrew, Edgar
Personnel:
  • Oct 22 --> Oct 29 Julian (+Adli)
  • Oct 29 --> Nov 4 Adli
  • Edgar is gone Nov 8th now?? or are you officially gone on the 31st?
  • John is gone Oct 23-Nov 11
Agent issues
  • vocms235 was having couch issues today; a patch was applied and it has now been up for 3 hrs
    • couch processes maxing out is our most common problem, but we are not sure that was the cause of today's crashes - Luis will look into whether this was in fact the problem today
    • The ErrorHandler patch was applied. It has been tested, but this is its first deployment on a production machine, so we need to keep an extra close eye on this machine for the next few days.
    • failed job report numbers may not be matching properly, so SeangChan will keep an extra close eye on things.
  • vocms227 ErrorHandler problem - it's a connection problem
  • vocms202 (reprocessing agent): up and running
  • vocms216 & vocms234 are re-installed and ready for jobs
  • vocms201 & cmssrv112 are in drain for upgrades that will happen later this week
  • RelVal now has the same priority as MC, so this should help with our issue of RelVal taking all the slots at FNAL
  • FNAL not getting any jobs: behaving normally
  • WMAgent issues:
    • v0.9.82 deployed in vocms202, vocms235, cmssrv98, and vocms85. Next: vocms216, vocms234.
    • Oracle on vocms85
  • disk full problem: warning patch is being tested
    • for now we still do not have the warning, but we have an alarm that will send us e-mail if the disk is filling
    • this will let us know if /data1 or /data starts to fill; if it's /data1 there is nothing to clean
    • SeangChan will update the twiki with information on what to do when the disk fills
    • Jen and Luis will ask Burt/Krista to put this same alarm on the FNAL machines at 90%
    • 113 is currently at 88%; SeangChan will write up documentation for cleanup and Andrew will test it
  • PhEDEx subscriptions issue: solved
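A minimal sketch of the kind of threshold check discussed under the disk full item, in case it helps when writing up the twiki instructions. This is not the actual alarm or the warning patch; the function names are hypothetical, and the only things taken from the notes are the /data and /data1 mount points and the 90% threshold.

```python
import shutil


def disk_usage_percent(path):
    """Return the used fraction of the filesystem holding `path`, in percent."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def check_mounts(mounts, threshold=90.0, usage_fn=disk_usage_percent):
    """Return (mount, percent) pairs for mounts at or above the threshold.

    `usage_fn` is injectable so the check can be tried without real disks.
    """
    alerts = []
    for mount in mounts:
        percent = usage_fn(mount)
        if percent >= threshold:
            alerts.append((mount, percent))
    return alerts


def format_alert(alerts):
    """Build a short plain-text message body for an e-mail alarm."""
    lines = ["%s is at %.1f%%" % (mount, percent) for mount, percent in alerts]
    return "\n".join(lines)
```

On an agent this could be called as `check_mounts(["/data", "/data1"])` from a cron job, with the e-mail sending left to whatever mechanism the existing alarm already uses.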

Workflow issues:

  • Workflow with massive failures - input data invalidation.
    • once the files were invalidated it moved along nicely and is now out of our hands
  • We have a number of workflows that are not at 100%, but we've run ACDC and have no failures.
  • we have made HUGE strides in understanding why WFs are getting stuck and in getting the stuck list back under control... for now... please keep up the good work everyone!
  • we need to start incorporating Edgar's ideas for preventing stuck workflows in the first place.
"backfill"
  • due to recent agent issues, and the fact that we currently do not have a lot of jobs running to keep sites busy, Oli has requested that we run "backfill", which is basically running a known job over and over again to make sure that there are no stability issues
  • the first 2 workflows are in: https://cmslogbook.cern.ch/elog/Workflow+processing/10729
    • we need to keep some end to end statistics
      • when was the WF submitted
      • when did it end
      • how long did it take for ACDC to go through, etc.
      • we will treat these like normal WFs, except that when we are all done we delete the outputs, as we have already run the data.
    • as soon as one backfill WF is finished the next one goes in; the idea is that we keep constant pressure on the sites to ensure that no problems are creeping up on us. We will start with just the T1s; once we get that going smoothly we will add T2s
    • query DAS for the first and last events to go into a dataset to get the numbers
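One possible way to keep the end-to-end statistics listed above is a small record per backfill workflow. This is only a sketch: the `BackfillRecord` class, its field names, and the example workflow name and timestamps are all made up for illustration, and the times would have to be filled in by hand or from whatever source we settle on (for example the DAS query mentioned above).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional


@dataclass
class BackfillRecord:
    """Hypothetical bookkeeping entry for one backfill workflow."""
    name: str
    submitted: datetime
    ended: Optional[datetime] = None
    acdc_started: Optional[datetime] = None
    acdc_ended: Optional[datetime] = None

    def total_duration(self) -> Optional[timedelta]:
        """Time from submission to end, or None if the WF is still running."""
        if self.ended is None:
            return None
        return self.ended - self.submitted

    def acdc_duration(self) -> Optional[timedelta]:
        """Time ACDC took to go through, or None if not (yet) known."""
        if self.acdc_started is None or self.acdc_ended is None:
            return None
        return self.acdc_ended - self.acdc_started


# Example with made-up timestamps for a made-up workflow name.
record = BackfillRecord(
    name="backfill_example_wf",
    submitted=datetime(2013, 10, 21, 8, 0),
    ended=datetime(2013, 10, 23, 20, 0),
    acdc_started=datetime(2013, 10, 23, 9, 0),
    acdc_ended=datetime(2013, 10, 23, 15, 30),
)
```

Keeping the raw timestamps rather than only the durations means the same records can later be compared across T1s and T2s once those are added.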
Site Problems
  • All US T2s fail the xrootd-fallback SAM test
    • Need to understand why
  • Issue with T2_US_Vanderbilt SAM availability
  • We have started rolling out the SL6 workers in the FNAL cluster. All the test workflows ran but if you or anyone in dataops see any issues with FNAL, please let me or cmst1 know so we can take care of it asap.
cms-comp-ops-site-support-team (Site Support Team) <cms-comp-ops-site-support-team@cern.ch>

Status on WebDAV_CMS deployment:

Previous week


Current week

Status on the New Subsite Mechanism update (sites number):

CMS Site Contacted Replied Updated
T1_DE_KIT Yes Yes Yes
T1_IT_CNAF Yes Yes Yes
T2_DE_RWTH No No No
T2_FR_GRIF_IRFU No No No
T2_FR_GRIF_LLR No No No
T2_UK_London_Brunel No No No
T2_US_Florida No No No
T2_US_Nebraska No No No
T2_US_Purdue No No No

Status on the site-local-config.xml update:

CMS Site Contacted Replied Updated
T1_DE_KIT Yes Yes Yes
T1_ES_PIC Yes Yes Not yet
T1_FR_CCIN2P3 Yes Yes Yes
T1_IT_CNAF Yes Yes Yes
T1_RU_JINR Yes Yes Yes
T1_UK_RAL Yes Yes Yes
T1_US_FNAL Yes Not yet Not yet

Sites currently not in the enabled LifeStatus state:

Updated 11/Oct/2021

SITE Status Duration Reason
T2_ES_IFCA Waiting Room 2 day(s) FTS Evals
T2_FI_HIP Waiting Room 3 day(s) SAM Evals
T2_GR_Ioannina Morgue 1+ month(s) SAM + HC + FTS Evals
T2_PK_NCP Waiting Room 1+ month(s) SAM + HC + FTS Evals
T2_RU_IHEP Waiting Room 1+ month(s) FTS Evals
T2_RU_ITEP Waiting Room 1+ month(s) SAM Evals
T2_TW_NCHC Waiting Room 1+ month(s) FTS + HC Evals
T2_RU_SINP Morgue 1+ month(s) No change

SITE Update
T2_BR_UERJ Exiting WR
T2_IT_Pisa Exiting WR
T2_RU_INR Exiting WR
T2_TR_METU Exiting WR
T2_TW_NCHC Entering WR

Ticket journal of last week:

Ticket CMS Site Last update Status Subject
148240 T2_IT_Pisa Tuesday, 10/05 closed New HTCondor CE at T2_IT_Pisa
150331 T2_RU_ITEP Wednesday, 10/06 closed ARC-CE is not executing SAM te
150724 T2_IT_Bari Wednesday, 10/06 closed WebDAV endpoint deployment for
150734 T2_UK_SGrid_Bristol Monday, 10/04 on hold WebDAV endpoint deployment for
151168 T2_IT_Pisa Saturday, 10/09 in progress WebDAV endpoint deployment for
151228 T1_FR_CCIN2P3 Monday, 10/04 closed Destination Overwrite error at
151300 T3_TW_NTU_HEP Thursday, 10/07 in progress Transfer issues from your site
152037 T2_US_Nebraska Thursday, 10/07 closed Erroneous consistency check en
152644 T2_UK_SGrid_Bristol Monday, 10/04 in progress Transfers timing out at T2_UK_
152992 T2_PT_NCG_Lisbon Wednesday, 10/06 solved Enabling TPC transfers over da
153112 T2_TR_METU Friday, 10/08 closed LoadTest Transfers are failing
153181 T2_IT_Bari Friday, 10/08 assigned wrong write/delete permissions
153377 T2_FI_HIP Wednesday, 10/06 closed Enabling TPC transfers over We
153463 T2_US_UCSD Monday, 10/04 on hold ~8% of missing files at UCSD
153570 T1_IT_CNAF Monday, 10/04 waiting for reply Testing Tape access via srm+ht
153594 T2_BE_UCL Wednesday, 10/06 closed Transfers failing from/to T2_B
153670 T2_CH_CSCS Friday, 10/08 closed CEs running few tests at T2_CH
153686 T2_TW_NCHC Tuesday, 10/05 assigned JobSubmit errors at T2_TW_NCHC
153860 T2_BR_SPRACE Wednesday, 10/06 in progress StageOut Failures at T2_BR_SPR
153982 T2_IN_TIFR Wednesday, 10/06 assigned XRootD tests failiing at T2_IN
153989 T2_US_UCSD Monday, 10/04 closed Failed CMS Consistency Enforce
153996 T2_BR_UERJ Monday, 10/04 in progress SAM tests and transfers failin
154020 T2_PL_Swierk Thursday, 10/07 solved T2_PL_Swierk not running any g
154035 T2_RU_ITEP Wednesday, 10/06 closed HC tests failing at T2_RU_ITEP
154045 T2_FR_GRIF_IRFU Monday, 10/04 solved SAM tests for one CE failing a
154052 T1_ES_PIC Saturday, 10/09 in progress Testing Tape access via srm+ht
154063 T2_UK_London_IC Monday, 10/04 closed SAM tests for one CE failing a
154071 T0_CH_CERN Tuesday, 10/05 waiting for reply WebDAV endpoint deployment for
154086 T2_UK_SGrid_RALPP Wednesday, 10/06 closed XRootD tests failiing at T2_UK
154101 T2_UK_SGrid_Bristol Tuesday, 10/05 closed SAM tests for one CE failing a
154139 T2_FI_HIP Thursday, 10/07 closed SAM tests for one CE failing a
154164 T2_BR_SPRACE Friday, 10/08 closed SAM tests for one CE failing a
154188 T1_DE_KIT Monday, 10/04 waiting for reply Production jobs failing at T1_
154203 T2_BE_UCL Thursday, 10/07 solved Transfers are failing from T2_
154217 T2_IT_Bari Thursday, 10/07 in progress SAM tests for one CE failing a
154219 T2_US_MIT Monday, 10/04 assigned I/O Error reading from MIT
154220 T2_ES_IFCA Monday, 10/04 solved SAM tests for CE failing at T2
154227 T2_UK_SGrid_Bristol Monday, 10/04 in progress Deletion issues at your site (
154233 T1_US_FNAL Friday, 10/08 assigned Tape writing test at FNAL_Tape
154239 T1_UK_RAL Monday, 10/04 solved XRootD tests failiing at T1_UK
154243 T1_RU_JINR Friday, 10/08 assigned Tape writing test at JINR_Tape
154246 T1_DE_KIT Saturday, 10/09 in progress Tape writing test at KIT_Tape
154247 T1_ES_PIC Friday, 10/08 assigned Tape writing test at PIC_Tape
154249 T1_IT_CNAF Friday, 10/08 in progress Tape writing test at CNAF_Tape
154251 T1_UK_RAL Monday, 10/04 in progress Tape writing test at RAL_Tape
154264 T2_IT_Pisa Tuesday, 10/05 solved SAM tests failing at T2_IT_Pis
154265 T2_KR_KISTI Wednesday, 10/06 in progress HammerCloud tests failing at T
154266 T2_UK_SGrid_Bristol Tuesday, 10/05 solved SAM tests failing at T2_UK_SGr
154267 T2_RU_IHEP Friday, 10/08 solved Transfers failing from T2_RU_I
154272 T2_CN_Beijing Friday, 10/08 assigned HammerCloud tests failing at T
154274 T2_FI_HIP Friday, 10/08 in progress StageOut Failures at T2_FI_HIP
154281 T2_RU_ITEP Tuesday, 10/05 assigned SAM tests failing at T2_RU_ITE
154287 T2_AT_Vienna Thursday, 10/07 solved SRM-SAM tests failing at T2_AT
154288 T2_BR_SPRACE Tuesday, 10/05 solved JobSubmit errors at T2_BR_SPRA
154301 T2_UK_SGrid_RALPP Thursday, 10/07 solved SAM tests failing at T2_UK_SGr
154305 T2_CH_CSCS Wednesday, 10/06 assigned Enabling access to GPU resourc
154309 T2_US_Caltech Thursday, 10/07 solved Merge job failures at T2_US_Ca
154312 T2_IT_Pisa Wednesday, 10/06 in progress JobSubmit errors at T2_IT_Pisa
154316 T0_CH_CERN Wednesday, 10/06 assigned Enabling access to GPU resourc
154337 T3_US_Minnesota Thursday, 10/07 assigned CMS Frontier activity from T3_
154348 T2_US_MIT Thursday, 10/07 assigned Transfers failing to T2_US_MIT
154349 T2_BE_IIHE Friday, 10/08 solved Transfers failing to T2_BE_IIH
154355 T2_DE_DESY Friday, 10/08 solved SAM tests failing at T2_DE_DES
154356 T2_BR_SPRACE Saturday, 10/09 solved SAM tests failing at T2_BR_SPR
154357 T2_PK_NCP Friday, 10/08 assigned SAM tests failing at T2_PK_NCP
154360 T2_US_Nebraska Friday, 10/08 assigned SAM tests failing intermittent
154361 T2_PT_NCG_Lisbon Friday, 10/08 in progress Transfers failing to T2_PT_NCG
154363 T1_UK_RAL Friday, 10/08 waiting for reply RAL-FTS transfer requests fail
154364 T0_CH_CERN Saturday, 10/09 in progress Checksum request not supported

Number of tickets: 69, Generated on 11/Oct/2021
AAA WAN Access CAF Operations Central Workflows
Data Transfers Facilities HammerCloud
Register New CMS Site SAM tests Submission Infrastructure
Tier-1 Tape Families

Sites with open GGUS tickets:

CMS Site Number of Tickets Tickets
Generated on 11/Oct/2021, Total number of tickets: 67
T0_CH_CERN 3 153373 153032 154071
T1_DE_KIT 2 154246 154188
T1_ES_PIC 2 154247 154052
T1_FR_CCIN2P3 2 153474 154053
T1_IT_CNAF 2 153570 154249
T1_RU_JINR 1 154243
T1_UK_RAL 3 154251 154196 153496
T1_US_FNAL 3 154214 154054 154233
T2_BE_UCL 3 154012 148354 153516
T2_BR_SPRACE 2 152895 153860
T2_BR_UERJ 1 153996
T2_CN_Beijing 2 153493 152030
T2_DE_DESY 1 154011
T2_EE_Estonia 1 154008
T2_ES_IFCA 1 153517
T2_FI_HIP 1 154014
T2_FR_GRIF_IRFU 2 153452 151272
T2_GR_Ioannina 2 152029 150722
T2_IN_TIFR 3 154195 151932 153982
T2_IT_Bari 2 154217 153181
T2_IT_Legnaro 3 151275 154006 153024
T2_IT_Pisa 1 151168
T2_IT_Rome 1 153026
T2_PK_NCP 2 151063 150075
T2_PT_NCG_Lisbon 1 149275
T2_RU_ITEP 1 153990
T2_TR_METU 1 152036
T2_TW_NCHC 2 154228 153686
T2_UA_KIPT 1 150729
T2_UK_London_Brunel 1 152031
T2_UK_London_IC 1 154005
T2_UK_SGrid_Bristol 3 152644 150734 154227
T2_US_MIT 2 154219 154015
T2_US_Nebraska 1 153627
T2_US_UCSD 1 153463
T2_US_Vanderbilt 2 153014 152893
T3_TW_NCU 1 150488
T3_TW_NTU_HEP 1 151300
T3_US_TACC 1 152790
T3_US_TAMU 1 150296

Updated on: 2021-09-26 at 15:25:01 by HectorCamiloZambranoHernandez

  • For any problem, email the site support team list (while John is on vacation).
AOB
  • the resource control script had an error; an issue was created and a change got into 9.28, but updating site information has not really been fixed
    • needs to be patched in on all the agents
  • vocms227 ErrorHandler had a typo; that is fixed
  • certificate problems: certificates should not be shared with production agents; this needs to be checked and verified
Topic revision: r4 - 2016-07-22 - StephanLammel
 