Week of 180730

WLCG Operations Call details

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 Geneva time until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Portal
  • Whenever a topic requiring information from sites or experiments needs to be discussed at the operations meeting, it is highly recommended to announce it by email to wlcg-scod@cern.ch, so that the SCOD can make sure the relevant parties have time to collect the required information or can invite the right people to the meeting.

Best practices for scheduled downtimes

Monday

Attendance:

  • local: Julia (WLCG), Kate (WLCG, DB, chair), Maarten (WLCG, ALICE), Roberto (storage), Petr (ATLAS), Vincent (security), Borja (monitoring), Alberto (monitoring), Marian (networks)
  • remote: Xavier (KIT), Alexander (NL-T1), Vladimir (LHCb), Eric (LHCb), Dmytro (NDGF), John (RAL), Xin (BNL), Pepe (PIC), Dave (FNAL)

Experiments round table:

  • ATLAS reports ( raw view) -
    • variable production load between 270k and 350k slots
      • almost 100k from CERN-P1 for most of the week (but new installation and scalability issues)
      • problems with some big sites and their local storage
      • problems with grid CE for CERN-T0 resources (update & bouncycastle)
      • updates related to the pilot causing job failures
      • issues with production input file transfer rate
    • overloaded Rucio readers & problems with automatic midnight restarts (new implementation in the pipeline)
    • data reprocessing campaign - more inputs transferred from tape
    • storage at INFN-T1 unstable and down several times during the last week
    • automatic downtime synchronization from OIM/GOCDB to AGIS not working for some SEs (requires manual blacklisting)

  • CMS reports ( raw view) -
    • smooth sailing, with full utilization of resources.
    • lower pressure on CERN resources (T0 + HLT) is likely in the coming days, in order to drain a production buffer which is getting a bit out of control
    • we deployed the new CMSSW release @ T0, which is expected to give physics-grade samples from now until the end of the run
    • we started preparing the mixing library for 2018 MC, already close to the 500M events needed. The main activities will switch from the tails of 2017 MC to 2018 MC.

  • ALICE -
    • Normal activity on average until the weekend
    • Central services task queue DB HW problem Sat afternoon
      • Fixed Sat late evening
    • Lowish activity also on Sunday due to a big production having a bad TTL
      • The TTL was set so large that no resources could match it (see the sketch after this list)
      • Fixed Sunday late evening
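
To illustrate the TTL problem above: a job is only dispatched to a slot whose remaining wall-clock time covers the job's TTL, so a TTL larger than any slot lifetime matches nothing. The sketch below is purely illustrative (not the actual AliEn/JAliEn matchmaking code; site names and numbers are invented).

    def matching_slots(job_ttl_seconds, slots):
        """Return the slots whose remaining lifetime can accommodate the job's TTL."""
        return [s for s in slots if s["remaining_seconds"] >= job_ttl_seconds]

    # Hypothetical slots: no slot offers more than 36 hours.
    slots = [{"site": "SiteA", "remaining_seconds": 24 * 3600},
             {"site": "SiteB", "remaining_seconds": 36 * 3600}]

    print(matching_slots(48 * 3600, slots))  # bad 48 h TTL -> [], jobs stay queued
    print(matching_slots(12 * 3600, slots))  # sane 12 h TTL -> both slots match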

Vladimir commented that it is unacceptable for such an issue to last one week, as it is not the first time this issue has been reported and the monitoring is still missing. It is also not acceptable to close issues without a solution. WLCG ops will follow up with the batch team (this was done after the meeting).

Comment from the batch team obtained after the meeting: a CVMFS issue happened internally in production in the past (OTG:0044920). That did lead to some follow-ups, including a better CVMFS probe to avoid accepting jobs on worker nodes that have CVMFS issues. For the cloud there are some configuration differences, which is why we saw a similar problem there; we have the CVMFS probe ready to deploy there too. It is unfortunate that the misrouting of the original ticket and the weekend resulted in a delay in resolving this at RHEA.
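
For illustration, a worker-node health probe of the kind mentioned above could look roughly like the minimal sketch below. This is only a sketch under assumptions, not the batch team's actual probe: it assumes the standard cvmfs_config client tool is installed on the node, and the repository list is just an example.

    #!/usr/bin/env python3
    # Minimal sketch of a CVMFS health probe (not the production probe):
    # report failure if any required CVMFS repository fails "cvmfs_config probe",
    # so the batch system can avoid scheduling jobs on an unhealthy node.
    import subprocess
    import sys

    REPOSITORIES = ["atlas.cern.ch", "lhcb.cern.ch"]  # example repositories only

    def cvmfs_ok(repo):
        """Return True if 'cvmfs_config probe <repo>' reports the repository as healthy."""
        try:
            result = subprocess.run(["cvmfs_config", "probe", repo],
                                    capture_output=True, text=True, timeout=60)
        except (OSError, subprocess.TimeoutExpired):
            return False
        return result.returncode == 0

    if __name__ == "__main__":
        # Non-zero exit = node unhealthy; how this is wired into the batch system
        # (e.g. a node health check that drains the machine) is site-specific.
        sys.exit(0 if all(cvmfs_ok(r) for r in REPOSITORIES) else 1)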

Sites / Services round table:

  • ASGC: nc
  • BNL: Power distribution units will be shut down to fix an issue with the electricity supply. This will be done in a rolling fashion; an announcement was sent to ATLAS.
  • CNAF: nc
  • EGI: nc
  • FNAL: NTR
  • IN2P3: nc
  • JINR: NTR
  • KISTI: nc
  • KIT: NTR
  • NDGF: The Bluegrass site (Triolith's replacement) is still not up. The work on it is postponed until August; until then ca. 1/4 of the computing power of the NDGF-T1 site is missing.
  • NL-T1: This morning we suddenly saw major errors in our dCache environment and transfers started to fail. We tried to solve this by restarting dCache. Unfortunately dCache refused to start again and it took us several hours to find a workaround. We are now back online. We have contacted dCache support to investigate this matter.
  • NRC-KI: nc
  • OSG: nc
  • PIC: NTR
  • RAL: Last week's power testing went more or less OK.
  • TRIUMF: Started replicating the data to the new storage at the new data centre; this will likely finish within one month.

  • CERN computing services: nc
  • CERN storage services: The date in the SSB announcement for the CMS CVMFS intervention will be changed (done after the meeting)
  • CERN databases: NTR
  • GGUS: NTR
  • Monitoring: NTR
  • MW Officer:
    • UMD-4 was updated on July 24 to fix the issues that broke CREAM on SL6 on July 11
    • Unfortunately, in some cases workarounds may still be needed
    • Please consult GGUS:136074 for further details
    • Due to holidays, the remaining issues can only be fixed a few weeks from now
  • Networks: GGUS:135962 - Transfers from FNAL to DESY failing due to timeout. Re-testing was done today, but it didn't indicate any obvious network issue.
  • Security: NTR

AOB:
