Week of 151102

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a topic to be discussed at the daily meeting requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Links to Tier-1 downtimes

ASGC BNL CERN CNAF FNAL IN2P3-CC JINR KISTI
KIT NDGF NIKHEF SARA NRC-KI PIC RAL TRIUMF

Monday

Attendance:

  • local: Andrea Sciabà (SCOD), Stefan Roiser (LHCb), Alessandro Fiorot (IT-DSS), Emil Pilecki (IT-DB), Lorena Lobato (IT-PES)
  • remote: Michael Ernst (BNL), Dmitri (KIT), Christoph Wissing (CMS), Francesco Noferini (CNAF), Jose Enrique Garcia Navarro (ATLAS), Kyle Gross (OSG), Sang Un Ahn (KISTI), Ulf Tigerstedt (NDGF), Tiju Idiculla (RAL), Onno Zweers (NL-T1), Rolf Rumler (CC-IN2P3), Pepe Flix (PIC)

Experiments round table:

  • ATLAS reports -
    • ATLAS General
      • Normal data-taking and Grid production activities ongoing.

  • CMS reports -
    • Good load in the production system since the weekend
    • Problems with EOS internal transfers at CERN: GGUS:117321
      • Transfers seem to have recovered this morning

  • ALICE -
    • High to very high activity.
      • Many hours of 90k+ running jobs on Fri.

  • LHCb reports -
    • Data Processing:
      • Data processing of pp data at T0/1/2 sites. Some T2 sites are attached to T1 sites in order to speed up the processing.
      • Monte Carlo mostly at T2, user analysis at T0/1/2D sites
    • T0
    • T1
      • IN2P3: problem with one dCache server (GGUS:117311) - solved
      • RRCKI: problems with the tape system (GGUS:117267). This seems to be a recurrent issue at the site.
      • SARA: transfer problems (GGUS:116939) continued over the weekend; no more errors today
    • AOB
      • Tickets, especially at CERN, are not being explained or closed.

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF:
    • a glitch caused a 10-15 minute slowdown in dCache due to CPU starvation
    • tomorrow there will be a full update cycle starting at 06:30 UTC: a 30-minute head node outage plus a day-long at-risk period to update the dCache pools one site at a time (Java security updates and dCache updates). The impact is the short outage and possible temporary unavailability of some files. The downtime still has to be scheduled in GOCDB (it was mistakenly scheduled for last Friday)
  • NL-T1:
    • last Thursday dCache collapsed, possibly due to a configuration change that was supposed to be transparent. Rolling it back was not enough and a complete reboot of the dCache cluster was needed. Tomorrow's downtime will also allow us to test whether the configuration change was the cause
    • last Friday we had a network issue that left part of the compute cluster unable to connect to the storage. To avoid job failures, the affected part was taken offline. The issue was fixed this morning by rebooting the switch.
    • tomorrow: scheduled downtime for network maintenance
  • NRC-KI:
  • OSG: ntr
  • PIC: ntr
  • RAL: ntr
  • TRIUMF:

  • CERN batch and grid services:
    • Another attempt to upgrade to LSF9 will be made on Wednesday
    • The HTCondor pool goes into production today
    • One ARC CE (hostname ce502) will be decommissioned and removed from the BDII next Wednesday, Nov 11

  • CERN storage services: several upgrades planned:
    • 9 Nov: EOSATLAS
    • 10 Nov: CASTOR PUBLIC
    • 11 Nov: CASTORCMS and EOSCMS
    • 12 Nov: CASTORATLAS and CASTORLHCB
  • Databases: last week a new application was deployed in production on the CMS offline database: the software popularity service, implementing a new popularity model. It is working fine.
  • GGUS:
  • Grid Monitoring:
    • Draft availability reports for October have been sent and are available in the SAM3 UI
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Alessandro F (storage), Asa (ASGC), Emil (databases), Lorena (batch & grid), Maarten (SCOD + ALICE)
  • remote: Dennis (NLT1), Lisa (FNAL), Matteo (CNAF), Michael (BNL), Rolf (IN2P3), Tiju (RAL), Ulf (NDGF)

Experiments round table:

  • ATLAS reports -
    • Grid production activities ongoing
    • Central services: reasonable CVMFS monitoring is needed (the probe stays orange)

  • CMS reports -
    • Likely no CMS person is available to join the meeting
    • Had many data transfer failures to EOS at CERN: GGUS:117369
      • The quota settings in EOS and in Dynamic Data Management were found to be inconsistent - fixed internally by CMS
    • Global CMS xrootd redirector does not see European sites: GGUS:117392

  • ALICE -
    • CERN: team ticket GGUS:117357 opened for CASTOR Tue evening
      • most reco jobs could not access their raw data files
      • due to a huge amount of disk-to-disk copies keeping CASTOR busy
      • due to a big staging request using the wrong pool as destination
      • fixed on the ALICE side, with thanks to the CASTOR team!

  • LHCb reports -
    • Data Processing:
      • Data processing of pp data at T0/1/2 sites. Some T2 sites are attached to T1 sites in order to speed up the processing.
      • Monte Carlo mostly at T2, user analysis at T0/1/2D sites
    • T0
      • EOS storage access problems solved by restarting Bestman (GGUS:117368)
    • T1
      • IN2P3: Files not accessible via one xroot door, door was closed (GGUS:117359)
      • IN2P3: one dCache server stuck and needed to be rebooted (GGUS:117311)
      • GRIDKA: pilot submission failing to one of the CEs (GGUS:117341)

Sites / Services round table:

  • ASGC: ntr
  • BNL:
    • there was an incident with the FTS-3 service this week:
      • some 300k files were requested to be staged (through the bringOnline function)
      • the target SEs were at BNL and TRIUMF
      • the requests got promptly forwarded to those SEs
      • the time to fulfil the requests exceeded the timeout on the Rucio side
      • the requests remained stuck in the DB
      • corresponding queries were extremely slow
      • the FTS admin at BNL identified the cause and introduced a few indexes
      • that improved the situation dramatically
    • the FTS-3 mailing list was in the loop:
      • other instances should have similar recipes applied for the time being
      • ideally, the next release should prevent such issues "out of the box"
    • after the meeting:
      • the next release will have a standard cap of 1k files per such request (an illustrative sketch of such chunking follows the site list below)
      • an FTS admin can already apply that limit to the current version
      • some indexes will be added if the new cap is not sufficient
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3:
    • there was a major network incident on Nov 3 due to a broken router
      • it mainly affected the traffic with the outside world
      • as some internal communication was also affected, no downtime was recorded in the GOCDB
      • the problems were fixed after ~1h
      • during the incident the Lyon and Annecy T2 sites were cut off from LHCONE
      • a Service Incident Report will be created
  • JINR:
  • KISTI:
  • KIT:
  • NDGF:
    • the dCache upgrade on Tue went OK and lasted ~15 min
      • the ATLAS FAX node needed more time to recover
  • NL-T1:
    • At SARA, last Tuesday's downtime took longer than planned because some Qfabric switch upgrades failed. The vendor provided online assistance but was unable to help us update all the switches. In the end the effort was abandoned and the switches were rolled back, but even that turned out to be a challenge, so we had to extend the downtime. The vendor had made several claims that turned out to be incorrect. We have escalated this. We won't attempt this again until the procedure has been tested thoroughly.
    • In the same downtime we've upgraded dCache and we've made some architectural changes to try to improve performance: our main dCache node has moved from a VM to dedicated hardware, and some dCache components that communicate a lot with each other have been placed closer together in one "domain". We're curious to see if this helps to decrease the "space manager timeout" errors; since the issue seems load-related it will probably take some time to know the result. We welcome feedback from the experiments.
  • NRC-KI:
  • OSG:
    • some CMS PhEDEx tickets for sites are still open; we will see if we can help
    • GGUS:117377 is a request from a T3 site to CMS that got wrongly assigned to OSG
      • Maarten: will reassign to VO Support (done)
  • PIC:
  • RAL: ntr
  • TRIUMF:
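
A minimal sketch related to the FTS-3 staging incident reported by BNL above: the planned cap of 1k files per bringOnline request amounts to splitting a large staging request into batches before submission (300k files would become 300 requests). The file list and the submit function below are placeholders for illustration, not the actual FTS-3 or Rucio API.

    # Hypothetical illustration of chunking a large staging request into
    # batches of at most MAX_FILES_PER_REQUEST files, as the next FTS-3
    # release is expected to enforce. submit_staging_request() is a
    # placeholder, not a real FTS-3 or Rucio call.
    MAX_FILES_PER_REQUEST = 1000

    def batches(surls, size=MAX_FILES_PER_REQUEST):
        """Yield successive batches of at most `size` SURLs."""
        for i in range(0, len(surls), size):
            yield surls[i:i + size]

    def submit_staging_request(batch):
        # Placeholder: in reality this would go through the FTS-3 REST API
        # (bringOnline) or the experiment's data management layer.
        print("submitting staging request for %d files" % len(batch))

    surls = ["srm://se.example.org/atlas/file%06d" % n for n in range(300000)]
    for batch in batches(surls):    # 300k files -> 300 requests of 1k files each
        submit_staging_request(batch)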

  • CERN batch and grid services:
    • FTS Production: after some delay, all CERN FTS servers are now running with sqlalchemy.pool_size=10 as of 11:30 CET today (a generic illustration of this setting follows at the end of this round table).
    • FTS Pilot: 2 of 4 nodes are now running CentOS 7; this is the end point for this year. The migration of the remaining nodes, including production, will be fully transparent and gradual, and can be rolled back.
    • VOMS: The VOMS service will be unavailable at 7:30 CET on Tuesday 10th of November for approximately 5 minutes due to an upgrade and reboot of the DB. More details in OTG:0026128.
    • MyProxy will probably be updated before Christmas. More news will come next week, as the responsible person still has to look at the changelog and plan the testing.

    • The LSF master upgrade is finished for the public instance. A small subset of test nodes is now being upgraded to the new version, affecting about 1% of the batch nodes and ce407.
  • CERN storage services:
    • reminder: various CASTOR and EOS updates have been agreed for next week
  • Databases:
    • yesterday afternoon there was a largely transparent incident with the CASTOR Name Server DB
      • instance 2 got stuck, apparently due to a yet unidentified OS issue
      • instance 1 took over while the other was restarting
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
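
For context on the sqlalchemy.pool_size=10 setting mentioned in the CERN batch and grid services report above: in SQLAlchemy this parameter bounds the number of persistent database connections kept in an engine's connection pool. The sketch below is a generic illustration only; the backend URL is a placeholder, not the actual FTS configuration.

    # Generic SQLAlchemy example: pool_size bounds the number of persistent
    # connections kept in the engine's pool. The SQLite URL is a placeholder
    # for illustration; the FTS servers use their own database backend.
    from sqlalchemy import create_engine, text
    from sqlalchemy.pool import QueuePool

    engine = create_engine(
        "sqlite:///example.db",   # placeholder backend
        poolclass=QueuePool,      # make the pool explicit so pool_size applies
        pool_size=10,             # the value deployed on the FTS servers
        max_overflow=0,           # no extra connections beyond the pool
    )

    with engine.connect() as conn:
        print(conn.execute(text("SELECT 1")).scalar())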

AOB:
