Week of 150824

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which could degrade or have degraded experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally until 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Andrea Sciabà, Iain Steers, Xavier Espinal, Maarten Litmaath, Andrea Manzi, Michal (ATLAS), Katarzyna Dziedziniewicz-Wojcik
  • remote: Michael Ernst (BNL), Asa (ASGC), Christoph Wissing (CMS), Dimitri (KIT), Dmytro Karpenko (NDGF), Rolf Rumler (IN2P3-CC), Pepe Flix (PIC), Luca Tomassetti (LHCb), Onno Zweers (NL-T1), Sang Un Ahn (KISTI), Tiju Idiculla (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • due to the high level of analysis activity (which is causing problems for MC production), T2s were asked to change their shares from 50% production / 50% analysis to 75% production / 25% analysis
    • Taiwan - "Put error: Permission denied" job failures - one disk server was not properly set up after reinstallation; afterwards jobs started failing with get errors, caused by a missing VOMS plugin for xrootd4

  • CMS reports (raw view) -
    • Had a wave of Upgrade requests in the system at the end of last week
      • Reached ~100k parallel CMS jobs in the Global Pool
      • Now down to rather low utilization
    • Once more an issue with DAS (Data Aggregation Service) on Friday
      • Overloaded by too many remote queries
      • This time only DAS was affected - the new protection of the other CMS web services worked
      • Related tickets: GGUS:115812, GGUS:115810

  • ALICE -
    • high activity
    • CERN: team ticket GGUS:115823 opened on Friday evening because of job submission issues
      • due to the unresolved issue occasionally affecting Argus
      • support was provided during the weekend, thanks!

  • LHCb reports (raw view) -
    • Data Processing:
      • validation productions of 25ns data will be redone this week (Reco + Stripping), with proper refitting of the PV
      • current validation (Reco15b and Stripping 23a) almost finished (>99% of the data processed, 75% merged)
    • T0
      • Nothing to report
    • T1
      • Nothing to report

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: announced an outage on September 22; all services will be down for one day. More details will be given one week in advance.
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: this Thursday there will be a downtime in the dCache cluster to apply a fix for the recently published dCache vulnerability (see below) and for other updates (OS, BIOS). The downtime should last around four hours.
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL: This Wednesday and Thursday the CASTOR disk servers will be updated to SL6. The downtime is marked "at risk".
  • TRIUMF:

  • CERN batch and grid services:
    • MyProxy upgrade scheduled for the 31st. A QA node running the new version will join the production alias several times during this week. Please read the ITSSB entry for more details.
    • ARGUS issues over the weekend. Intermittent very high load meant that Argus nodes were being knocked out of the alias, and occasionally it seemed to cause Argus to crash. Restarting the Argus services was occasionally required, as they seemed to get out of sync. The investigation is still ongoing, but the source of the issue is more or less understood: excessive bursts of JobCancel requests from a single VOBox, which Argus had to reject. In the future, offenders could be temporarily added to iptables (a detection sketch follows this list). The issue was noticed by ALICE and CMS.
  • CERN storage services:
    • Yesterday there was a crash on EOS ATLAS, causing a downtime of about 15 minutes. The cause is being investigated.
    • Next week is a technical stop for the LHC, so it would be a good time to update all EOS instances to the latest version (EOS CMS already has it). It is suggested to do it on Tuesday or Wednesday. No experiment objects to the proposal.
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:
    • Both CERN and Nikhef reported issues with the slapd process crashing after upgrading to SLC6.7/CentOS 6.7, which includes a new version of openLDAP (openldap-servers-2.4.40-5). This possibly affects all BDII installations (resource/site/top). While the issue is under investigation, we suggest that sites do not upgrade to this version of openLDAP.
    • dCache vulnerability broadcast today by EGI SVG (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-2015-9323). It affects GSI and Kerberos FTP doors. All versions of dCache prior to 2.13.7, 2.12.19, 2.11.30 and 2.10.39 are affected. The baselines for dCache have been updated and sites are advised to upgrade to the latest versions containing the fix (a version-check sketch follows this list).
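
A minimal sketch of the burst detection idea mentioned under CERN batch and grid services above; the window size, threshold and the idea of feeding it observations from an access log are assumptions for illustration, not details of the actual Argus deployment:

    # Hypothetical sketch (not the production setup): flag clients that
    # flood Argus with requests, e.g. JobCancel bursts from one VOBox.
    from collections import defaultdict, deque

    WINDOW = 60.0     # sliding window length in seconds (assumed)
    THRESHOLD = 500   # requests per window treated as a burst (assumed)

    recent = defaultdict(deque)   # client IP -> timestamps of recent requests

    def observe(ip, timestamp):
        """Record one request; return True if this client is now bursting."""
        q = recent[ip]
        q.append(timestamp)
        # drop timestamps that have fallen out of the window
        while q and timestamp - q[0] > WINDOW:
            q.popleft()
        return len(q) > THRESHOLD

A client flagged this way could then be blocked temporarily, e.g. with an iptables DROP rule for its address, along the lines suggested in the report.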
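
For the dCache advisory, a small sketch of the version check implied by the new baselines; the fixed versions come from the advisory itself, while the handling of series outside the 2.10-2.13 range is an assumption:

    # Check whether a dCache version predates the fixes listed in
    # SVG-2015-9323 (2.13.7, 2.12.19, 2.11.30, 2.10.39).
    FIXED = {
        (2, 13): (2, 13, 7),
        (2, 12): (2, 12, 19),
        (2, 11): (2, 11, 30),
        (2, 10): (2, 10, 39),
    }

    def parse(version):
        # "2.10.38-1" -> (2, 10, 38); any release suffix is ignored
        return tuple(int(x) for x in version.split("-")[0].split(".")[:3])

    def is_vulnerable(version):
        v = parse(version)
        fixed = FIXED.get(v[:2])
        if fixed is not None:
            return v < fixed
        # Assumption: series older than 2.10 are affected, while
        # series newer than 2.13 already contain the fix.
        return v < (2, 10, 0)

    for v in ("2.10.38", "2.10.39", "2.11.25", "2.13.7"):
        print(v, "VULNERABLE" if is_vulnerable(v) else "ok")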

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Iain Steers (IT-PES), Xavier Espinal (IT-DSS), Andrei Dumitru (IT-DB)
  • remote: Michael Ernst (BNL), Andrew Pickford (NL-T1), Dmytro Karpenko (NDGF), Lisa Giacchetti (FNAL), Salvatore Tupputi, Sang Un Ahn (KISTI), Gareth Smith (RAL), Rolf Rumler (IN2P3-CC), Luca Tomassetti (LHCb), Daniele Cesini (CNAF), Thomas Hartmann (KIT), Pepe Flix (PIC)

Experiments round table:

  • ATLAS reports (raw view) -
    • T0/Central services
      • Frontier degradation observed in Kibana (ELOG:54542)

  • CMS reports (raw view) -
    • Many of us are still in holiday mode - likely nobody will join the call, sorry for that
    • Very little activity in the system
    • Otherwise nothing to report - address issues to Christoph, who will follow up next week

  • ALICE -
    • high activity

  • LHCb reports (raw view) -
    • Data Processing:
      • p-Ne collisions reconstruction started
    • T0
      • Nothing to report
    • T1
      • RAL 'at risk' yesterday and today due to the CASTOR update.

Luca mentioned that some problems were observed at RAL, even though the update (moving the disk servers to SL6) was expected to be transparent. Gareth added that the disk areas have now been completed; the tape areas should be finished within two hours.

Sites / Services round table:

  • ASGC:
  • BNL: ntr
  • CNAF: today at 8:15 there was a fire in the computing centre affecting the power supply. The site is now up and running, but without any fire protection available. Later today it will be decided whether to shut down all services for safety reasons; if so, it will be announced via the usual channels.
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI: ntr
  • KIT:
    • there will be a downtime on September 5 for network maintenance, expected to be transparent
    • The annual big downtime is scheduled from September 29 to October 1, during which the site will be mostly offline
  • NDGF: one computing endpoint has had problems since Tuesday evening and will be down until next week, reducing the available compute capacity
  • NL-T1: ntr
  • NRC-KI:
  • OSG: ntr
  • PIC: ntr
  • RAL:
    • Next Monday is a holiday, so RAL will not connect to the meeting
    • A short "at risk" downtime is scheduled on September 3 for a minor network reconfiguration and it should be transparent
  • TRIUMF:

  • CERN batch and grid services:
    • a new version of MyProxy is running on one of the nodes behind myproxy.cern.ch without any issues. The general update is scheduled for August 31
    • a downtime is scheduled for next Thursday on the CEs, which will not accept new jobs during that period. Existing jobs will continue to run unaffected, unless they use glexec.
  • CERN storage services: the dates for next week's EOS updates have been agreed with all the experiments; however, the intervention may still be postponed, as we are waiting for some new bug fixes. The final decision will be taken on Monday.
  • Databases: ntr
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:
