Week of 141103

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Alberto (grid services), Alessandro Di G. (ATLAS), Alessandro F. (storage), Belinda (storage), Maarten (SCOD + ALICE), Raja (LHCb), Tsung-Hsun (ASGC)
  • remote: Dimitri (KIT), Dmytro (NDGF), Lisa (FNAL), Michael (BNL), Onno (NLT1), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Sonia (CNAF), Soyun (KISTI), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • Problem during the weekend due to an automatic update of OpenSSL on the pandaserver nodes (happened on Thursday, picked up on Friday night at 4am). Problem understood and fixed. Actions to be done:
        • review the list of packages in the exclusion list from automatic updates (most of the packages should not be automatically updated). Review the procedures and the notifications when upgrades happen
        • improve test suite for PandaServer upgrades
      • not many jobs on the Grid in general. MC production and Derivation Framework coordinators contacted.
    • CentralService/T0/T1s
      • RAL-LCG2 GGUS:109814: many jobs failing due to "lost heartbeat".
      • TRIUMF-LCG2 GGUS:109800: transfer issue between TRIUMF and RU-PROTVINO; to be understood whether it is a TRIUMF or a Protvino issue.
      • NIKHEF ACL set GGUS:109779: the ticket is a change request; thanks to the site, which made the requested changes (these are difficult to make from outside the site).

  • CMS reports (raw view) -
    • NTR
    • CMS Computing and Offline week ongoing, therefore nobody will join the call, sorry

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs. Prestaging from tapes for Stripping21 campaign.
    • T0: Problems accessing files at CERN over the weekend (GGUS:109808 and GGUS:109672). Affected ~2K files. Problem fixed quickly; jobs and transfers have finished successfully.
    • T1:
      • IN2P3 : dCache upgrade to fix ROOT6 access problems.
      • GridKa : Investigating possible problems accessing two files and also some FTS transfer issues (GGUS:109825).
    • Others : Changes in the way ROOT6 handles include files are causing crashes at many sites on the grid. Working on a fix within the LHCb software stack.
    • discussion:
      • Onno: the dCache fix for the ROOT6 issue has not yet been applied at SARA, because we are waiting for the pending fix for the "POODLE" vulnerability to be applied at the same time - how urgent is the ROOT6 matter?
      • Raja: we intend to start our re-stripping campaign soon (maybe already later this week), so it would be good to apply that fix in the coming days
      • Maarten: as the POODLE vulnerability is low risk for grid components, the fix for that could wait a bit longer

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
    • the communication about the VOMS server outage last week could have been better
      • Maarten: to be discussed at this week's Ops Coordination meeting as needed
  • GridPP:
  • IN2P3:
    • on Dec 9 there will be a 1-day outage, further details will follow
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
    • 1 CE at our Finnish site has problems that are being looked into
  • NL-T1: ntr
  • OSG:
    • status of the new VOMS servers?
      • the ports for ATLAS and CMS are still not open to the world
      • ATLAS and CMS will be asked to proceed with their validations
      • the new servers will be part of the OSG client config as of the Nov 11 release
      • the firewalls could be opened to specific OSG hosts for testing
  • PIC: ntr
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Databases:
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Alberto (grid services), Alessandro F (storage), Andrea (MW Officer), Antonio (PIC), Belinda (storage), Felix (ASGC), Maarten (SCOD + ALICE), Pablo (GGUS + grid monitoring), Pepe (PIC)
  • remote: Alessandro Di G (ATLAS), Dea Han (KISTI), Dennis (NLT1), Jeremy (GridPP), Lisa (FNAL), Lucia (CNAF), Michael (BNL), Raja (LHCb), Rob (OSG), Rolf (IN2P3), Thomas (KIT), Tiju (RAL), Tommaso (CMS), Ulf (NDGF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Daily Activity overview
      • RAW dataset in Rucio: Andreu will ask today at the ProdSys2 meeting
      • Rucio ProdSys2 Full chain test: still not on rucio but more on the test itself. APFs.
      • Production: DE cloud skipped due to space. Checking space: FZK, PIC and TAIWAN are low on space. AK started the cleaning of last month's transient data.
      • Running full steam.
      • HammerCloud issue with clients, not able to list Rucio datasets.
    • CentralService/T0/T1s

  • CMS reports (raw view) -
    • Computing week ongoing
    • Global run data taking lasting a few weeks. Also involves CERN/remote computing
    • Processing overview:
      • New production campaign of MINIAOD (Tier-1 and Tier-2 sites)
      • Preparations for new campaigns (PHYS14 and another round of Upgrade) ongoing
    • Testing of new VOMS server infrastructure
      • Little progress, no problems seen
      • Will ramp up to meet the known deadline (end of November)
        • Maarten: the host certs of the old VOMS servers expire in the morning of Nov 26
        • Rob: the Nov 11 OSG release will have clients already refer to the new servers in addition to the old ones
    • Operational items
      • Reconfiguration campaign of xrootd for European sites started at the beginning of this week (the famous privacy issue)
      • Data transfer test T0->Tier-1 and Tape staging exercise planned for November/December
      • Rebalancing T0 resource mix by killing idle machines, as asked (~500 just today)
    • GGUS open (T0, T1s only...)
      • GGUS:109876 : FR T1, a stuck submission
        • Rolf: we will look into that
      • GGUS:109855 : a CREAM CE misbehaving at CERN
      • GGUS:109919 : glExec problem at RAL_T1
        • Tiju: the failing jobs had the production instead of the pilot role
      • GGUS:109812 : slow Xrootd access at CERN, turns out to be VM related (probably creating a new ticket...)

  • ALICE -
    • very high activity during the last 11 days
      • 69k concurrent jobs reached on Nov 5

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL:
    • downtime Thu Nov 13 7am-4pm local time, various services and activities will be impacted to some extent
  • GridPP: ntr
  • IN2P3:
    • there was a problem with yesterday's test alarm ticket: the ticket update did not come through due to SSLv3 having been switched off on the GGUS web service (as advised); the problem was solved by adjusting our code on the client side
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF:
    • NDGF also ran into an issue with GGUS: the NorduGrid CA was no longer accepted; this has been fixed
    • the Xrootd logging for ALICE writing files to the tape SE has been improved
    • our Danish site has 1 broken machine affecting access to some of the ALICE and ATLAS data; possibly fixed later today
  • NL-T1: ntr
  • OSG:
    • reminder: our Nov 11 release will have clients already refer to the new VOMS servers in addition to the old ones
  • PIC: ntr
  • RAL: ntr
  • RRC-KI:
  • TRIUMF:

  • CERN batch and grid services: ntr
  • CERN storage services: ntr
  • Databases:
  • GGUS:
    • Update of GGUS on the 5th of November. The release includes, among other things, two new support units ("WLCG perfSONAR Monitoring" and "Vac/Vcycle"), new permissions for 'Expert' users, and the disabling of SSLv3 on the web server
    • Several T1s have not acknowledged the test alarms: IN2P3-CC (solved), KISTI, Taiwan-LCG2
  • Grid Monitoring: ntr
  • MW Officer: ntr
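
Note on the SSLv3 change reported above by GGUS and IN2P3: the minutes do not say what the IN2P3 client code looks like or how it was adjusted. The following is only a minimal sketch of one way a Python client could make sure it never offers SSLv3. The function name make_client_context and the choice of TLS 1.2 as the lowest accepted version are assumptions made for illustration.

```python
import ssl


def make_client_context() -> ssl.SSLContext:
    """Build a client-side TLS context that never negotiates SSLv3.

    Hypothetical helper for illustration; not the code used at any site.
    """
    # create_default_context() enables certificate and hostname verification.
    ctx = ssl.create_default_context()
    # Assumption: TLS 1.2 is the lowest version the server still accepts.
    # Anything older, including SSLv3, is refused by the client.
    ctx.minimum_version = ssl.TLSVersion.TLSv1_2
    return ctx


if __name__ == "__main__":
    ctx = make_client_context()
    print(ctx.minimum_version.name)
```

Such a context could then be passed to whatever HTTPS client library submits the ticket updates, provided that library accepts an ssl.SSLContext.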

AOB:

Topic revision: r11 - 2014-11-06 - MaartenLitmaath
 