Week of 151130

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting and requires information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or invite the right people to the meeting.

Links to Tier-1 downtimes

ALICE  ATLAS  CMS   LHCB
       BNL    FNAL

Monday

Attendance:

  • local: Andrea Sciabà (SCOD), Maarten Litmaath (ALICE), Hervé Rousseau (IT-DSS), Mark Slater (LHCb), Emil Pilecki (IT-DB)
  • remote: Lisa Giacchetti (FNAL), Michael Ernst (BNL), Alexander Verkooijen (NL-T1), Francesco Noferini (CNAF), Kyle Gross (OSG), Sang Un Ahn (KISTI), Dmytro Karpenko (NDGF), Gareth Smith (RAL), Rolf Rumler (CC-IN2P3), Christoph Wissing (CMS)

Experiments round table:

  • CMS reports (raw view) -
    • Heavy Ion Run is ongoing
    • Followup from last week: Overloaded CASTOR at CERN
      • Force-completed a huge workflow, which triggered too much log-file archiving
      • Some threshold got adjusted internally
      • GGUS:117960, which has been solved
    • Permission issue on EOS at CERN: GGUS:118027
      • Mapping problem fixed during the weekend
    • Problems with reading files on EOS from WN at Wigner: GGUS:118037
      • This is impacting PromptRECO for HI data taking - therefore made an ALARM ticket
      • Presently a network problem is suspected
      • Any news?
    • Many thanks to CERN storage experts, who responded during the weekend
    • Slow tape performance at KIT: GGUS:117910

Hervé: the problem in Wigner went away yesterday evening; it is still under investigation, but everything indicates it was a network problem.
The permission issue in EOS was due to a known, but not yet fully understood, bug.

  • ALICE -
    • Heavy ion data reconstruction has been OK so far. Thanks to the participating sites!
    • CNAF added ~300 TB to the disk in front of their tape system, thanks!
    • RRC-KI-T1 tape transfer performance has been improved, thanks!
      • Unexpected appearance of small files will be looked into.

  • LHCb reports (raw view) -
    • Data Processing:
      • Data processing of pp data at T0/T1/T2 mostly complete. pp reference data processing starting this week.
      • Monte Carlo mostly at T0/T1/T2/T2D, user analysis at T0/1/2D sites
    • No issues to report

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP:
  • IN2P3: An outage is scheduled (and in GOCDB) for December 8th, impacting batch services all day (plus the previous day for draining), and dCache all day, plus some other services which will be in maintenance for just a few hours.
  • JINR:
  • KISTI: ntr
  • KIT:
  • NDGF: ntr
  • NL-T1: just a reminder of the full-day downtime tomorrow.
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: during the weekend there was an issue with three ATLAS disk servers, which are now back in read-only mode. The coincidence is just random; there is no evidence for a common cause. The space token affected was DATADISK.
  • TRIUMF:

  • CERN batch and grid services:
  • CERN storage services: ntr
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Maria Dimou (SCOD), Maarten Litmaath (ALICE), Hervé Rousseau (IT-DSS), Mark Slater (LHCb), Emil Pilecki (IT-DB), Asa Hsu (ASGC)
  • remote: Lisa Giacchetti (FNAL), Michael Ernst (BNL), Andrew Pickford (NL-T1), Kyle Gross (OSG), Sang Un Ahn (KISTI), Dmytro Karpenko (NDGF), Gareth Smith (RAL), Rolf Rumler (CC-IN2P3), Christoph Wissing (CMS), Di Qing (TRIUMF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Apologies from ATLAS, nobody can attend today.

  • CMS reports (raw view) -
    • Heavy Ion data taking ongoing
    • Issue with EOS pools disappearing from the network is back again: GGUS:118082
      • CERN storage is investigating with network experts
    • CMS Tier-0 workflows are driving some CERN OpenStack hardware to its limits: GGUS:118056
    • Tape staging at KIT working again: GGUS:117910
      • CMS transfer experts and CMS site contacts working through the backlog

  • ALICE -
    • CERN: team ticket GGUS:118062 opened Monday evening
      • ALICE was severely impacted by an OpenStack issue
      • the standard build system could not be used to release analysis updates
      • a local mini build system was put together for the most urgent cases
      • thanks to the OpenStack team for solving the complex issue as fast as possible!
    • CERN: submission to HTCondor CE at CERN in production since yesterday evening

  • LHCb reports (raw view) -
    • Data Processing
      • Data processing of 13 TeV pp data at T0/T1/T2 now complete. pp reference data processing started
      • Monte Carlo mostly at T0/T1/T2/T2D, user analysis at T0/1/2D sites
      • Will be starting processing of Heavy Ion runs soon
      • Significant MC generation incoming
    • Issues
      • CA issues at IN2P3 caused problems with certain users accessing files (GGUS:118077)
        • Maarten: that user issue had nothing to do with the French CA; for the record, the French CA now is able to provide certificates that are compliant with the upcoming policy change in Globus and, indeed, such a certificate was installed on the given SRM host around the time the user issue happened to go away
        • Maria: Please do not verify GGUS tickets whose solution field contains no text helping the reader understand how the problem was solved, as was the case for GGUS:118077
      • The problem with an RRCKI tape that was put offline, preventing access to certain files, is now solved.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: not connected
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: ntr
  • JINR: not connected
  • KISTI: ntr
  • KIT: not connected
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI: not connected
  • OSG: ntr
  • PIC: not connected
  • RAL: Network changes planned for next week. All downtimes published in GOCDB. GGUS:118109 opened yesterday to report an issue with Golden Gate database synchronization from CERN. The "Notify Site" field had no value, so the GGUS TPM assigned this back to RAL. The ticket is in progress. The issue is most probably related to the firewall problem CERN experienced yesterday.
  • TRIUMF: ntr

  • CERN batch and grid services:
    • there was a brief outage of VOMS and VOMS-Admin Wed afternoon (OTG:0026924)
  • CERN storage services:
    • Connectivity issues affecting EOS between Wigner and Meyrin: under investigation with the Network Team. So far, only CMS seems to be affected.
  • Databases: A storage overload started building up on Monday around 6pm. This caused connectivity issues with the Accelerator DB, the experiments' online DBs and non-LHC experiment DBs. At about the same time, failures were observed on the "Storage filer", along with an overload of the OpenStack VMs between 3am and 6am on Tuesday and a hardware problem on the ATLAS archiver DB.
  • GGUS: no report
  • Grid Monitoring:
    • The REBUS page for the submission of the monthly T1 accounting reports has been modified to retrieve the default data from accounting-devel.egi.eu, which includes multi-core usage.
  • MW Officer: no report. More in WLCG Ops Coord today.

AOB:

Topic revision: r15 - 2015-12-03 - MaartenLitmaath
 