Week of 120625

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota.

WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments

  • VO summaries of site usability: ALICE ATLAS CMS LHCb
  • SIRs, open issues & broadcasts: WLCG Service Incident Reports, WLCG Service Open Issues, Broadcast archive
  • Change assessments: CASTOR Change Assessments

General Information

  • General information: CERN IT status board, M/W PPSCoordinationWorkLog, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - Cooldown Status - News


Monday

Attendance: local(David, Eva, Luc, Maarten, Raja, Steve);remote(Dimitri, Dmytro, Gonzalo, Jhen-Wei, Lisa, Lorenzo, Michael, Oliver, Onno, Rob, Rolf, Tiju).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • NTR
    • T1
      • IN2P3-CC: many job failures when sourcing the setup from CVMFS. GGUS:83517 filed on Sunday morning at 08:30 and escalated to ALARM (problem with the local Squids).
        • Rolf: the problem with the first Squid was due to a full file system; still investigating why the second one died; a failover to CERN or RAL did not work due to connection filtering; will look further into these issues (a simple Squid probe is sketched after this report)
      • SARA: file transfer failures: "destination error, failed to contact on remote SRM". GGUS:83523 raised to ALARM on Sunday, 20:21. It looks like the srm layer is broken. Affects the high priority ATLAS tasks. Moving SRM to new hardware.
      • NDGF-T1 unscheduled DT (SE,SRM) till Tuesday. Cloud set to "brokeroff", analysis queue set "offline". Elog.:37134-37136.
    • T2
      • NAPOLI -> CERN and IN2P3-CC slow file transfers affected urgent tasks. GGUS:83513 assigned on Saturday. Transfers of large multi-GB files to CERN and TAIWAN failed with timeouts.
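
For illustration, a minimal sketch of how each Squid in a CVMFS proxy chain can be probed by fetching the repository manifest through it; the Squid host names are placeholders and the Stratum-1 URL is only an assumed example.

    # Minimal sketch, hypothetical Squid hosts: fetch the CVMFS repository
    # manifest (.cvmfspublished) through each proxy and report which respond.
    import urllib.request

    PROXIES = ["http://squid01.example.org:3128",   # placeholder site Squids
               "http://squid02.example.org:3128"]
    URL = "http://cvmfs-stratum-one.cern.ch/cvmfs/atlas.cern.ch/.cvmfspublished"

    for proxy in PROXIES:
        opener = urllib.request.build_opener(
            urllib.request.ProxyHandler({"http": proxy}))
        try:
            opener.open(URL, timeout=10).read()
            print(proxy, "OK")
        except Exception as exc:
            print(proxy, "FAILED:", exc)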

  • CMS reports -
    • LHC machine / CMS detector
      • Technical stop
    • CERN / central services and T0
      • T0 will be migrated to a new machine and updated tomorrow
    • Tier-1/2:
      • T1_TW_ASGC: HammerCloud problems, seems this is an overload situation on one of the CREAM CEs, GGUS:83526
      • T1_FR_CCIN2P3: also failing HammerCloud, problem seems related to CREAM CEs, GGUS:83520
      • FTS for T1_IT_CNAF: credential delegation problem, actively followed up in GGUS:83486
        • Maarten: this issue is probably related to the intermittent FTS problems seen at various T1 for various VOs; the FTS developers are working on it with high priority and have identified a few issues in auxiliary components, with workarounds for some of them; a workable setup without those problems is expected in the near future (a simple client-side proxy-lifetime check is sketched after this report)
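
For illustration, a minimal client-side check that can be ruled out quickly when FTS reports delegation errors: the remaining lifetime of the VOMS proxy that would be delegated, using the standard voms-proxy-info client; the 6-hour threshold is arbitrary.

    # Minimal sketch: report the remaining lifetime of the current VOMS proxy.
    import subprocess

    left = int(subprocess.check_output(["voms-proxy-info", "-timeleft"]))
    print("proxy lifetime left: %d s" % left)
    if left < 6 * 3600:   # arbitrary illustrative threshold
        print("proxy is short-lived; a fresh delegation to FTS will be needed")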

  • LHCb reports -
    • Users analysis at T1s ongoing
    • MC production at all sites
    • New GGUS (or RT) tickets
    • T0:
    • T1:
      • FZK-LCG2: Looking forward to new dCache instance soon for LHCb
        • Dimitri: will ping the experts
      • IN2P3 : CVMFS problem (GGUS:83528)
      • NL-T1 : SARA srm problems - but they have gone on for >3 months now.

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • KIT - ntr
  • NDGF
    • yesterday the Danish part of NDGF-T1 suffered a power cut; its disk pools are still unavailable; a downtime has been declared, more news tomorrow
  • NLT1
    • already for a while the main SRM node at SARA has been having non-fatal Fibre Channel problems when accessing its DB; an earlier planned intervention to fix that had to be postponed; Sunday evening matters got worse and dCache collapsed due to DB timeouts; the DB has been moved to new HW now, but we are concerned about the very low network traffic observed for the SE, that may not all be explained by ATLAS and LHCb blacklisting it; there could be some other problem due to the migration, we are looking into it
  • OSG
    • the BNL-Indiana BDII network issue mentioned last Fri was solved shortly after that meeting
    • regarding the python update issue for ATLAS mentioned last Fri, broadcasts to OSG sites can be requested by sending mail to "goc" in the opensciencegrid.org domain
      • Michael: ATLAS OSG sites had already been informed Tue or Wed last week
  • PIC
    • Last Thursday 21 June at around 16:00 CEST, the cooling system of the 2nd machine room at PIC (the one hosting most of the WNs) had an incident and stopped working. To avoid dangerous overheating, about 100 WNs (800 job slots approx) had to be rapidly powered off.
      The exact cause of the problem is still being investigated, but we suspect that the cooling unit was working too close to its max capacity so a peak in the outside temperature could have caused it to stop. We plan to put (at least some of) the powered-off WNs back into production slowly during these days, while carefully watching the cooling behaviour.
      We will submit a Service Incident Report with more details as soon as we finalise it.
  • RAL - ntr

  • dashboards - ntr
  • databases
    • ongoing intervention on ATLAS online DB
    • storage upgrades on CMS on- and offline, COMPASS, LCGR
    • tomorrow interventions on various DB for ATLAS (e.g. ADCR), LHCb and PDBR
    • storage interventions should be transparent, but bugs were hit in the past; a 15-min downtime for some DB would therefore be possible
  • grid services - ntr

AOB:

Tuesday

Attendance: local(David, Eva, Jacob, Luc, Maarten, Maria D, Raja);remote(Dmytro, Gonzalo, Jeremy, Jhen-Wei, Lisa, Michael, Onno, Rolf, Ron, Tiju, Xavier M).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • ATLAS Central Catalog: load-balancing issue. One machine (voatlas181) was not being selected, hence a monitoring alert; it was stopped and restarted and is OK now.
    • T1
      • SARA: file transfer failures: "destination error, failed to contact on remote SRM". GGUS:83523
      • NDGF-T1 unscheduled DT (SE,SRM) till Tuesday. Cloud set to "brokeroff", analysis queue set "offline". Elog.:37134-37136.
    • T2
      • NAPOLI -> CERN and IN2P3-CC slow file transfers affected urgent tasks. GGUS:83513 assigned on Saturday. Transfers of large multi-GB files to CERN and TAIWAN failed with timeouts.

  • ALICE reports -
    • KIT: disk SE has low read efficiency, leading jobs to fail over to off-site replicas of their input data; seems to be due to a GPFS issue; under investigation

  • LHCb reports -
    • Users analysis at T1s ongoing
    • MC production at all sites
    • New GGUS (or RT) tickets
    • T0:
      • Moving DIRAC accounting services to new machines. Will take ~24 hours.
    • T1:
      • FZK-LCG2: Looking forward to new dCache instance soon for LHCb. Also requested more storage in current configuration. Failed user jobs (GGUS:83425) - any update?
      • NL-T1 : SARA srm problems - but they have gone on for >3 months now. (GGUS:83584)
      • Rolf: GGUS:82247 about corrupted files at IN2P3 has been updated - KIT saw even more files affected by such problems!
        • Raja: we will look into that

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • FNAL - ntr
  • GridPP - ntr
  • IN2P3 - nta
  • KIT
    • extra disk storage requested by LHCb will be implemented this week
    • no update on failed user jobs ticket (GGUS:83425) yet, waiting on update in dCache user forum discussion
    • no news on separate dCache instance for LHCb
  • NDGF
    • the Danish storage cluster is back in operation
  • NLT1
    • Ron: last week dCache at SARA was upgraded to the new golden release 2.2.1, which went OK (not related to the current problems); the SRM node has been suffering multi-path failures a few tens of times per day, which recently led to high loads and dCache errors; the node was migrated to new HW, but the problems remained; today the SRM DB was reindexed and things look a lot better; the remaining errors are expected to disappear within a few hours as clients time out and retry (a DB reindex sketch is given after the site reports below)
    • Raja: a load-balanced set of SRM nodes would lower the load on each instance
    • Ron: that probably will not help; there are other T1 with the SRM on a single node; the typical load is 1 or 2 on an 8-core machine
    • Maarten: wasn't the DB already reindexed last week?
    • Onno: that was the Name Service DB
    • Luc: should ATLAS lower the number of transfer requests?
    • Ron: just keep going, we expect the remaining errors to go away shortly
    • Raja: LHCb will unban the SE after the meeting
  • OSG
    • currently in routine maintenance window during which all centrally hosted operations services will be upgraded transparently
  • PIC - ntr
  • RAL
    • interventions on FTS and CASTOR tomorrow morning
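
For illustration, a minimal sketch of the kind of index rebuild mentioned by Ron for the SARA SRM database, assuming a PostgreSQL backend reachable via psql; the host, user and database names are placeholders, and on a production SRM this would only be done in agreement with the dCache admins.

    # Minimal sketch, placeholder host/user/database names.
    import subprocess

    subprocess.check_call([
        "psql", "-h", "srmdb.example.org", "-U", "srm_dba", "-d", "dcache",
        "-c", "REINDEX DATABASE dcache;",   # rebuild all indexes in this DB
    ])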

  • dashboards - ntr
  • databases
    • one patch that was applied to the online DBs for LHCb and CMS last week has revealed a new bug affecting PVSS sessions; a roll-back of that patch has been proposed to the experiments
    • other DB interventions are ongoing OK
  • GGUS/SNOW
    • A GGUS search for user "CompAtP1Shift" shows that no test ticket has been opened since yesterday, the release date. Only ATLAS shifters can do this test, as it requires a login AND password. We'd like to close Savannah:129607 after making sure this is actually working as required by the experiment. The next GGUS release is as early as July 9th, followed by a long holiday period; one wouldn't like to leave this untested for the whole summer.

AOB:

Wednesday

Attendance: local(David, Jacob, Luc, Luca C, Maarten, Maria D, Raja, Steve);remote(Dmytro, Gonzalo, Lisa, Lorenzo, Michael, Pavel, Rob, Rolf, Ronald, Tiju).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • GGUS connection with the atlcomp1 userid now shows the ATLAS VO membership.
      • The Python security update made running ATLAS jobs fail (CERN Linux ticket RQF:0111006). Now solved with a new pilot version.
    • T1
      • SARA: since the DB was reindexed the situation has improved: no transfer failures with destination errors in the last 8 hours. GGUS:83523.
      • RAL DT for Oracle upgrade ended. Batch farm kept up as much as possible to optimize resources.
    • T2
      • NAPOLI slow transfers GGUS:83513. Solved. Configuration at GEANT router corrected.

  • CMS reports -
    • LHC machine / CMS detector
      • Technical stop
    • CERN / central services and T0
      • T0 successfully migrated to a new machine yesterday
    • Tier-1/2:
      • T1_FR_CCIN2P3: also failing HammerCloud, problem seems related to CREAM CEs, GGUS:83520 (HammerCloud red again)
      • FTS for T1_IT_CNAF: credential delegation problem, actively followed up in GGUS:83486 (currently OK, keeping here in case of reoccurrence)

  • LHCb reports -
    • Users analysis at T1s ongoing
    • MC production at all sites
    • New GGUS (or RT) tickets
    • T0:
      • Moving DIRAC accounting services to new machines. Should be completed this evening.
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83425) - any update? More failed user jobs (GGUS:83608). Slow pool?
        • Raja: about 1 job in 7 fails
      • NL-T1 : SARA problems ongoing. Files there are now inaccessible, even though the SRM agrees they exist and they are accessible locally. (GGUS:83584)
      • Corrupted files (IN2P3 & FZK): not clear why the problem with jobs was seen only at IN2P3 and not at GridKa; possibly just chance. LHCb will clean up the corrupted files found there. The source of the corruption is not clear either, though the two (three?) bursts of writing times when it occurred may be a clue. The check possibly needs to be extended to other sites (at least those using dCache); a checksum-verification sketch is given after this report.
        • Raja: the problem is not seen at RAL; other T1 still to be checked
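
For illustration, a minimal sketch of the kind of local integrity check that could be extended to other sites: compute a file's Adler-32 checksum and compare it with the value recorded in the catalogue; the file path and expected checksum below are placeholders.

    # Minimal sketch with a placeholder path and catalogue checksum.
    import zlib

    def adler32(path, chunk=1024 * 1024):
        value = 1                          # Adler-32 starts from 1
        with open(path, "rb") as f:
            while True:
                data = f.read(chunk)
                if not data:
                    break
                value = zlib.adler32(data, value)
        return "%08x" % (value & 0xffffffff)

    expected = "0a1b2c3d"                  # checksum from the file catalogue
    local = adler32("/scratch/some_replica.dst")
    print("CORRUPTED" if local != expected else "checksum OK")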

Sites / Services round table:

  • BNL - ntr
  • CNAF - ntr
  • NDGF - ntr
  • FNAL - ntr
  • IN2P3
    • last Fri new machines were added to the batch farm and since then there have been scheduling performance issues from time to time, causing jobs to accumulate in queues for a while; looking into it
  • KIT - ntr
  • NLT1
    • the current SARA SRM problem reported by LHCb looks unrelated to the SRM DB access problem that was solved yesterday; looking into it
  • OSG
    • yesterday's central service maintenance went OK without service interruptions
    • Maria: does anything special need to be done w.r.t. the migration of the DOEGrids CA to OSG?
      • Rob: nothing special expected; the full deployment will not happen before the fall; users will experience the new situation one at a time, viz. as their current certificates expire; they may then need to re-register in VOMS/VOMRS
  • PIC - ntr
  • RAL
    • today's Oracle intervention for FTS and CASTOR went OK, all services are back

  • dashboards - ntr
  • databases
    • the problematic patch mentioned yesterday was rolled back on the online DBs of CMS (yesterday) and LHCb (today)
    • other interventions (e.g. ALICE and ATLAS online) went OK
  • GGUS/SNOW
    • File ggus-tickets.xls is up-to-date and attached to page WLCGOperationsMeetings. There have already been 6 real ALARMs since the last MB, all submitted by ATLAS against various sites.
  • grid services
    • next Tue morning (July 3) myproxy.cern.ch will be updated to the latest MyProxy version to reinstate the support of VOMS attributes in authorization policies; should be transparent

AOB:

Thursday

Attendance: local(Eddie, Luc, Maarten, Manuel, Maria D, Przemek, Raja, Xavier E);remote(Dmytro, Gonzalo, Jacob, Jhen-Wei, John, Kyle, Lisa, Michael, Rolf, Ronald).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • NTR
    • T1
      • SARA: transfers stable & successful
      • IN2P3-CC: Oracle outage, one CREAM CE died, GE overloaded. Running with limited capacity.
        • Rolf: IN2P3-CC services should all be OK again now; the Oracle issue was due to a configuration error; the CREAM problem was due to a side effect of moving from gLite to EMI: one maintenance routine had stopped working due to paths having changed
    • T2
      • NTR

  • CMS reports -
    • LHC machine / CMS detector
      • Technical stop
    • CERN / central services and T0
      • T0 successfully migrated to a new machine yesterday
    • Tier-1/2:
      • T1_FR_CCIN2P3: failing HammerCloud due to overloaded CREAM CE, GGUS:83520
        • cccreamceli05 back in production today, but problem remains
        • Rolf: will look into it
      • FTS for T1_IT_CNAF: credential delegation problem, actively followed up in GGUS:83486 (currently OK, keeping here in case of reoccurrence)

  • LHCb reports -
    • Users analysis at T1s ongoing
    • MC production at all sites
    • New GGUS (or RT) tickets
    • T0:
      • Moving DIRAC accounting services to new machines. Not yet back in production - grid submission as usual.
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites). Wait and see.
      • NL-T1 : SARA problems ongoing. New ticket opened (GGUS:83676) - old ticket referred to 5 different problems and the latest failures seem different from all of those.
        • Ron (after the meeting): there were not 5 issues reported in the old GGUS ticket but three, of which 2 are now solved. For the remaining open issue, ticket GGUS:83676 was opened.
      • IN2P3 / GridKa file corruption : Requested PIC contact to check at PIC.

Sites / Services round table:

  • ASGC
    • Sat and Tue there will be maintenance on the 10-Gbit link; at-risk downtimes have been posted in the GOCDB; the traffic will go over the backup link
  • BNL - ntr
  • FNAL - ntr
  • IN2P3 - nta
  • NDGF - ntr
  • NLT1 - ntr
  • OSG - ntr
  • PIC - ntr
  • RAL - ntr

  • dashboards - ntr
  • databases - ntr
  • GGUS/SNOW
    • The certificate used for the ALARM tests initiated by the GGUS developers will change as of the next release on 2012/07/09. Sites must be configured to accept it; details are in Savannah:129944. A GGUS ticket notifying the T0 and each T1 has also been created. (A quick way to display the new certificate's subject DN is sketched after this round table.)
      • Kyle: might it affect the ticket exchange mechanism?
      • Maria: no, and it ought not break the alarm tests either
  • grid services - ntr
  • storage
    • transparent interventions on the CASTOR DB backends are proposed for July 9 starting at 09:00 and lasting ~4h
      • alternative dates would be Tue July 10 or Wed July 11, starting either at 09:00 or at 14:00
      • the date will be July 9 unless an objection is raised soon
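
For illustration, a minimal sketch of how a site admin could display the subject and issuer DN of the new GGUS alarm certificate before authorizing it; the file name is a placeholder, the actual certificate details are in Savannah:129944.

    # Minimal sketch with a placeholder certificate file name.
    import subprocess

    print(subprocess.check_output(
        ["openssl", "x509", "-in", "ggus-alarm-cert.pem",
         "-noout", "-subject", "-issuer", "-dates"]).decode())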

AOB:

Friday

Attendance: local(David, Eva, Jacob, Luc, Maarten, Raja, Steve, Xavier E);remote(Alexander, Dmytro, Jhen-Wei, John, Kyle, Marc, Michael, Xavier M).

Experiments round table:

  • ATLAS reports -
    • CERN CENTRAL SERVICES, T0
      • ALARM ticket GGUS:83705 (Armin) linked to power cut. Fixed.
      • EOS problem GGUS:83715: a file is seen in EOS but cannot be copied.
        • Xavier: some EOS-ATLAS nodes have not been fully recovered yet
    • T1
      • IN2P3-CC: down due to loss of connectivity to the outside (hardware problem on a central router). Ongoing.
    • T2
      • NTR

  • LHCb reports -
    • Users analysis at T1s ongoing
    • MC production at all sites
    • Some catch up of processing tail at SARA.
    • New GGUS (or RT) tickets
    • T0:
      • Moving DIRAC accounting services to new machines. Not yet back in production - grid operations going on as usual, though some background LHCb operations are paused as a result.
      • CERN power failure : LHCb VOboxes not affected. However, there were some problems accessing files in CASTOR, with quite a few failed jobs as a result. Also 340 TB missing from LHCbDisk (GGUS:83713).
        • Xavier: the missing disk servers are being recovered, currently ~30 TB still unavailable; SLS shows the status of the service class, which is OK whenever the probe can write, read and remove a test file (a minimal probe of that kind is sketched after this report)
    • T1:
      • FZK-LCG2: Failed user jobs (GGUS:83608). As requested, dCache client upgraded to v2.47.5-0 (applies to all dCache sites). Wait and see.
      • NL-T1 : SARA problem with file access understood and fixed. (GGUS:83676). Jobs going through now.
      • IN2P3 / GridKa file corruption : PIC and SARA contacts also checking.
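
For illustration, a minimal sketch of a write/read/remove probe like the SLS check described by Xavier, assuming a node with the standard CASTOR rfcp/nsrm clients; the CASTOR path and service-class name are placeholders.

    # Minimal sketch with a placeholder path and service class.
    import os, subprocess

    os.environ["STAGE_SVCCLASS"] = "lhcbdisk"            # placeholder
    remote = "/castor/cern.ch/lhcb/probe/sls_testfile"   # placeholder

    subprocess.check_call(["rfcp", "/etc/hostname", remote])      # write
    subprocess.check_call(["rfcp", remote, "/tmp/sls_readback"])  # read back
    subprocess.check_call(["nsrm", remote])                       # remove
    print("probe OK")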

Sites / Services round table:

  • ASGC - ntr
  • BNL - ntr
  • KIT
    • today 75 TB have been added to the LHCb-Disk token
  • NDGF
    • there will be rolling upgrades of dCache pools Mon-Thu next week, possibly making some ALICE or ATLAS data temporarily unavailable
  • IN2P3
    • between 07:30 and 11:00 UTC there was a total loss of connectivity due to a HW issue in a central router, whose CPU board was replaced to fix the problem; the declared downtime has been shortened
  • NLT1 - ntr
  • OSG
    • noticed issues with bdii206.cern.ch 06:15-07:10 UTC, most probably due to the power cut (it is among the subset of BDII nodes that were rebooted); a simple BDII query for such checks is sketched after the site reports below
  • RAL - ntr
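
For illustration, a minimal sketch of a plain anonymous query of a top-level BDII on its standard port, listing the published site names; bdii206.cern.ch is the node mentioned above.

    # Minimal sketch: anonymous LDAP query of a top-level BDII.
    import subprocess

    subprocess.check_call([
        "ldapsearch", "-x", "-LLL",
        "-H", "ldap://bdii206.cern.ch:2170",
        "-b", "o=grid",
        "(objectClass=GlueSite)", "GlueSiteName",
    ])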

  • dashboards
    • nearly all Dashboard instances were affected by the power cut, all OK now
  • databases
    • the power cut affected an Active Data Guard copy between ADCR and ATLAS online, fixed
    • there was an intervention on one node of the LHCb online DB
  • grid services
    • everything seems to be back now
  • storage
    • Luc: what about the remaining disk servers?
    • Xavier: we will try to have them available again later today

AOB:

-- JamieShiers - 22-May-2012
