https://indico.cern.ch/conferenceDisplay.py?confId=254678
Attending:
  • FNAL - Jen, Luis, SeangChan, Dave
  • CERN - Julian, Adli, Andrew Edgar
Personnel:
  • Oct 22 --> Oct 29 Julian (+Adli)
  • Oct 29 --> Nov 4 Adli
  • Edgar: is your last day now Nov 8th, or are you officially gone on the 31st?
  • John is gone Oct 23-Nov 11
Agent issues
  • vocms235 was having couch issues today; they applied a patch and it has now been up for 3 hrs
    • couch processes maxing out is our most common problem, but we are not sure that was the cause of today's crashes - Luis will look into whether this was in fact the problem today
    • Error handler patch was applied. It has been tested, but this is its first deployment on a production machine, so we need to keep an extra close eye on this machine for the next few days.
    • failed job report numbers may not be matching properly, so SeangChan will keep an extra close eye on things.
  • vocms227 ErrorHandler problem - it's a connection problem
  • vocms202 (reprocessing agent): up and running
  • vocms216 & 234 are re-installed and ready for jobs
  • vocms201 & cmssrv112 are in drain for upgrades that will happen later this week
  • RelVal now has the same priority as MC, so this should help our issue with RelVal taking all the slots at FNAL
  • FNAL not getting any jobs: behaving normally
  • WMAgent issues:
    • v0.9.82 deployed in vocms202, vocms235, cmssrv98, and vocms85. Next: vocms216, vocms234.
    • Oracle on vocms85
  • disk full problem: warning patch is being tested
    • for now we still do not have the warning, but we have an alarm that will e-mail us if the disk is filling
    • this will let us know if /data1 or /data starts to fill; if it's /data1 there is nothing to clean
    • SeangChan will update the twiki with information on what to do when the disk fills
    • Jen and Luis will ask Burt/Krista to put this same alarm for the FNAL machines at 90%
    • 113 is currently at 88%; SeangChan will write up documentation for cleanup and Andrew will test it
  • PhEDEx subscriptions issue: solved
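
For the disk-full alarm discussed above, a minimal sketch of a threshold check is shown here. It is not the alarm actually deployed on the agents; the function names are hypothetical, and only the /data and /data1 partitions and the 90% threshold come from these minutes.

```python
import os
import shutil


def percent_used(path):
    """Return the percentage of the filesystem holding `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def partitions_over(threshold, paths, probe=percent_used):
    """Return {path: percent} for every path at or above `threshold` percent.

    `probe` is injectable so the check can be exercised without real disks.
    """
    readings = {p: probe(p) for p in paths}
    return {p: pct for p, pct in readings.items() if pct >= threshold}


if __name__ == "__main__":
    # /data and /data1 are the partitions named in the minutes;
    # skip any that do not exist on the machine running this sketch.
    paths = [p for p in ("/data", "/data1") if os.path.exists(p)]
    for path, pct in partitions_over(90, paths).items():
        print(f"WARNING: {path} is {pct:.0f}% full")
```

A machine at 88%, like 113 above, would not trigger at a 90% threshold, so cleanup documentation is still needed before the alarm fires.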

Workflow issues:

  • Workflow with massive fail - Input Data invalidation.
    • once the files were invalidated it moved along nicely and is now out of our hands
  • We have a number of workflows that are not at 100%, but we've run ACDC and have no failures.
  • we have made HUGE strides in understanding why WFs are getting stuck and in getting the stuck list back under control... for now. Please keep up the good work everyone!
  • we need to start incorporating Edgar's ideas for preventing stuck workflows in the first place.
"backfill"
  • due to recent agent issues, and the fact that we currently do not have a lot of jobs running to keep sites busy, Oli has requested that we run "backfill": running a known job over and over again to make sure that there are no stability issues
  • the first 2 workflows are in: https://cmslogbook.cern.ch/elog/Workflow+processing/10729
    • we need to keep some end to end statistics
      • when was the WF submitted
      • when did it end
      • how long did it take for ACDC to go through, etc.
      • we will treat these like normal WFs, except that when we are all done we delete the outputs, as we have already run the data.
    • as soon as one backfill WF is finished the next one goes in. The idea is that we keep constant pressure on the sites to ensure that there are no problems creeping up on us. We will start with just the T1s; once we get that going smoothly we will add T2s.
    • query DAS for the first and last events to go into a dataset to get the numbers
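
The end-to-end statistics listed above could be kept with something as simple as the following sketch. The record field names and the timestamp format are assumptions made for illustration, and the dates in the usage example are invented; none of this reflects an existing tool.

```python
from datetime import datetime

# Assumed timestamp format for hand-entered records (hypothetical).
FMT = "%Y-%m-%d %H:%M"


def elapsed_hours(start, end, fmt=FMT):
    """Hours between two timestamp strings in the given format."""
    delta = datetime.strptime(end, fmt) - datetime.strptime(start, fmt)
    return delta.total_seconds() / 3600.0


def summarize(record):
    """Turn one backfill record into the numbers we want to track."""
    return {
        "workflow": record["workflow"],
        # submission to completion of the whole workflow
        "total_hours": elapsed_hours(record["submitted"], record["ended"]),
        # how long the ACDC pass took to go through
        "acdc_hours": elapsed_hours(record["acdc_start"], record["acdc_end"]),
    }
```

For example, a record with made-up times could be summarized like this:

```python
example = {
    "workflow": "backfill_example",  # placeholder name
    "submitted": "2013-10-22 08:00",
    "ended": "2013-10-23 20:00",
    "acdc_start": "2013-10-23 14:00",
    "acdc_end": "2013-10-23 20:00",
}
print(summarize(example))
```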
Site Problems
  • All US T2s fail the xrootd-fallback SAM test
    • Need to understand why
  • Issue with T2_US_Vanderbilt SAM availability
  • We have started rolling out the SL6 workers in the FNAL cluster. All the test workflows ran, but if you or anyone in dataops sees any issues with FNAL, please let me or cmst1 know so we can take care of it asap.
cms-comp-ops-site-support-team (Site Support Team) <cms-comp-ops-site-support-team@cern.ch>

Sites with PhEDEx agents running:

CMS Site PhEDEx agents state
T2_CN_Beijing Running
T3_CH_CERN_OpenData Running
T3_RU_MEPhI Running

XRootD restarting pending tickets:

Ticket State CMS Site
149071 waiting for reply T2_DE_DESY

Sites currently not in the enabled LifeStatus state:

Site Status Duration Reason
T2_PK_NCP Waiting Room 1+ month 2+ HC evals
T2_PL_Swierk Waiting Room 1 week(s) SAM evals continuously failing
T2_PT_NCG_Lisbon Waiting Room 1 week(s) 2+ HC evals
T2_RU_ITEP Downtime 1 week(s) Went from morgue to DT
T2_RU_SINP Morgue 1+ month No change
T2_UK_SGrid_Bristol Waiting Room 1 week(s) FTS evals continuously failing

Warsaw to be decommissioned

Ticket journal of last week:

Ticket CMS Site Creation Date Status Subject
149475 T2_BE_UCL Monday, 11/16 solved Transfers and VOPut failing at
149476 T2_IT_Rome Monday, 11/16 solved CEs are not executing SAM test
149477 T2_FI_HIP Monday, 11/16 solved Few SAM tests running at T2_FI
149486 T2_UK_SGrid_Bristol Monday, 11/16 assigned Unable to create loadtest file
149488 T2_US_MIT Tuesday, 11/17 solved Transfers to MIT failing: Syst
149491 T2_TR_METU Tuesday, 11/17 solved JobSubmit failing at T2_TR_MET
149493 T0_CH_CERN Tuesday, 11/17 in progress Files unavailable at CERN
149501 T2_KR_KISTI Tuesday, 11/17 verified Pilots at T2_KR_KISTI
149503 T2_UK_London_Brunel Tuesday, 11/17 solved SAM tests failing at T2_UK_Lon
149508 T1_US_FNAL Wednesday, 11/18 assigned Transfers failing from T1_US_F
149523 T2_BR_SPRACE Thursday, 11/19 solved Traffic from T2_BR_SPRACE not
149524 T2_PT_NCG_Lisbon Thursday, 11/19 solved HC tests failing at T2_PT_NCG_
149525 T2_RU_IHEP Thursday, 11/19 solved CE holding SAM tests at T2_RU_
149529 T2_DE_RWTH Thursday, 11/19 in progress Transfers failing from/to T2_D
149533 T0_CH_CSCS_HPC Thursday, 11/19 solved site PhEDEx agents at (T0_CH_
149534 T1_ES_PIC Thursday, 11/19 solved site PhEDEx agents at (T1_ES_
149535 T1_FR_CCIN2P3 Thursday, 11/19 solved site PhEDEx agents at (T1_FR_
149536 T1_RU_JINR Thursday, 11/19 solved site PhEDEx agents at (T1_RU_
149537 T1_US_FNAL Thursday, 11/19 solved site PhEDEx agents at (T1_US_
149538 T2_AT_Vienna Thursday, 11/19 solved site PhEDEx agents at (T2_AT_
149539 T2_BE_IIHE Thursday, 11/19 solved site PhEDEx agents at (T2_BE_
149540 T2_BE_UCL Thursday, 11/19 solved site PhEDEx agents at (T2_BE_
149541 T2_BR_UERJ Thursday, 11/19 solved site PhEDEx agents at (T2_BR_
149542 T2_CN_Beijing Thursday, 11/19 assigned site PhEDEx agents at (T2_CN_
149543 T2_DE_DESY Thursday, 11/19 solved site PhEDEx agents at (T2_DE_
149544 T2_DE_RWTH Thursday, 11/19 solved site PhEDEx agents at (T2_DE_
149545 T2_EE_Estonia Thursday, 11/19 solved site PhEDEx agents at (T2_EE_
149546 T2_ES_IFCA Thursday, 11/19 solved site PhEDEx agents at (T2_ES_
149547 T2_FR_CCIN2P3 Thursday, 11/19 solved site PhEDEx agents at (T2_FR_
149548 T2_GR_Ioannina Thursday, 11/19 assigned site PhEDEx agents at (T2_GR_
149549 T2_HU_Budapest Thursday, 11/19 solved site PhEDEx agents at (T2_HU_
149550 T2_IN_TIFR Thursday, 11/19 solved site PhEDEx agents at (T2_IN_
149551 T2_IT_Legnaro Thursday, 11/19 solved site PhEDEx agents at (T2_IT_
149552 T2_KR_KISTI Thursday, 11/19 solved site PhEDEx agents at (T2_KR_
149553 T2_PL_Swierk Thursday, 11/19 solved site PhEDEx agents at (T2_PL_
149554 T2_RU_IHEP Thursday, 11/19 solved site PhEDEx agents at (T2_RU_
149555 T2_RU_INR Thursday, 11/19 solved site PhEDEx agents at (T2_RU_
149556 T2_RU_ITEP Thursday, 11/19 in progress site PhEDEx agents at (T2_RU_
149557 T2_TR_METU Thursday, 11/19 solved site PhEDEx agents at (T2_TR_
149558 T2_TW_NCHC Thursday, 11/19 assigned site PhEDEx agents at (T2_TW_
149559 T2_UK_London_Brunel Thursday, 11/19 solved site PhEDEx agents at (T2_UK_
149560 T2_UK_London_IC Thursday, 11/19 solved site PhEDEx agents at (T2_UK_
149561 T2_US_MIT Thursday, 11/19 solved site PhEDEx agents at (T2_US_
149562 T2_US_Vanderbilt Thursday, 11/19 assigned site PhEDEx agents at (T2_US_
149563 T2_US_Wisconsin Thursday, 11/19 solved site PhEDEx agents at (T2_US_
149564 T3_CH_PSI Thursday, 11/19 solved site PhEDEx agents at (T3_CH_
149565 T3_FR_IPNL Thursday, 11/19 solved site PhEDEx agents at (T3_FR_
149566 T3_KR_KNU Thursday, 11/19 solved site PhEDEx agents at (T3_KR_
149567 T3_US_Baylor Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149568 T3_US_Colorado Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149569 T3_US_NotreDame Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149570 T3_US_Princeton_ICSE Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149571 T3_US_PuertoRico Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149572 T3_US_Rutgers Thursday, 11/19 assigned site PhEDEx agents at (T3_US_
149573 T3_US_UMD Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149574 T3_US_UMiss Thursday, 11/19 solved site PhEDEx agents at (T3_US_
149576 T2_IN_TIFR Thursday, 11/19 assigned Pilots at T2_IN_TIFR
149577 T2_US_UCSD Thursday, 11/19 assigned Pilots at T2_US_UCSD
149588 T2_UK_SGrid_RALPP Friday, 11/20 solved SAM tests failing at T2_UK_SGr

Generated on 23/Nov/2020 (GMT)

Sites with open GGUS tickets:

Generated on 23/Nov/2020 13:52:36 (GMT), Total number of tickets: 36
CMS Site Number of Tickets Tickets
T0_CH_CERN 1 148790
T0_CH_CSCS_HPC 2 149166 147830
T1_IT_CNAF 1 149215
T2_AT_Vienna 1 148309
T2_BE_UCL 1 148354
T2_BR_SPRACE 1 149430
T2_CN_Beijing 4 148997 149321 148780 148688
T2_DE_DESY 1 149071
T2_FI_HIP 1 148708
T2_FR_GRIF_IRFU 2 149150 149272
T2_IN_TIFR 1 147831
T2_IT_Bari 1 148735
T2_IT_Legnaro 1 149019
T2_IT_Pisa 1 149184
T2_PK_NCP 1 148185
T2_PL_Swierk 3 149392 149274 147006
T2_PT_NCG_Lisbon 1 149275
T2_RU_ITEP 2 149046 148568
T2_TR_METU 2 149344 149276
T2_UK_SGrid_Bristol 1 149425
T2_US_UCSD 1 149443
T2_US_Vanderbilt 2 148148 144936
T3_US_Minnesota 2 139435 140562
T3_US_Rice 1 149315
T3_US_UMD 1 148604

Updated on: 2020-11-23 at 10:55:18 by HectorCamiloZambranoHernandez

  • For any problem, email the site support team list (while John is on vacation).
AOB
  • resource control script had an error; an issue was created and a change got into 9.28, but updating site information hasn't really been fixed
    • needs to be patched in on all the agents
  • 227 errorhandler had a typo, that is fixed
  • certificate problems: certificates should not be shared with production agents; this needs to be checked and verified
Topic revision: r4 - 2016-07-22 - StephanLammel
 