Week of 160808

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20; exceptionally it may run until 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. In case stronger constraints do not allow choosing another time slot, the Tier-1 will point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, for the current and the following two weeks; in case a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent.
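
As an illustration of the overlap check above, here is a minimal Python sketch. The downtime records, field names and example sites are hypothetical stand-ins for what one would read from the downtimes calendar (e.g. GOCDB); they are not its real API or schema.

    # Minimal sketch (hypothetical data layout, not the GOCDB API).
    from datetime import datetime

    def overlaps(start_a, end_a, start_b, end_b):
        # Two time slots overlap if each one starts before the other ends.
        return start_a < end_b and start_b < end_a

    def conflicting_outages(proposed, existing):
        # Return already declared Tier-1 "outage" downtimes that clash with a
        # proposed one: another site, an overlapping time slot, and at least
        # one VO supported by both sites.
        return [
            d for d in existing
            if d["site"] != proposed["site"]
            and d["severity"] == "outage"
            and overlaps(proposed["start"], proposed["end"], d["start"], d["end"])
            and set(d["vos"]) & set(proposed["vos"])
        ]

    # Example with made-up sites and dates: a proposed outage vs. one already declared.
    declared = [{"site": "Tier1-A", "vos": ["atlas", "lhcb"], "severity": "outage",
                 "start": datetime(2016, 8, 9, 6, 0), "end": datetime(2016, 8, 11, 18, 0)}]
    proposed = {"site": "Tier1-B", "vos": ["atlas", "cms"], "severity": "outage",
                "start": datetime(2016, 8, 10, 8, 0), "end": datetime(2016, 8, 10, 16, 0)}
    print(conflicting_outages(proposed, declared))  # -> the Tier1-A downtime record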

Links to Tier-1 downtimes

ALICE ATLAS CMS LHCb
  BNL FNAL  

Monday

Attendance:

  • local: Alberto (monitoring), Alessandro (ATLAS), Andrea (LHCb), Belinda (storage), David (databases), Fa-Hui (ASGC), Iain (computing), Ivan (ATLAS), Jesus (storage), Julia (WLCG), Maarten (SCOD + ALICE), Marcelo (LHCb), Marian (networks), Oliver (CMS)
  • remote: Andrew (NLT1), Daniele (CMS), David (FNAL), Dimitri (KIT), Eric (BNL), Kyle (OSG), Renaud (IN2P3), Sang-Un (KISTI), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • Activities:
      • Grid almost full.
      • A few tasks doing MC overlay with real data are running on a dedicated site (BNL), to avoid causing trouble for the rest of the Frontier/DB infrastructure.
      • CERN-RAL network link now running with both the primary and the secondary 10 Gbps links.
    • Problems:
      • CERN-PROD Tier-0 LSF issue GGUS:123304 (alarm), GGUS:123245
        • Iain: the issue in the alarm ticket was an extra requirement on the jobs,
          which prevented many resources from matching

  • CMS reports (raw view) -
    • General overview: analysis and production activity slowed down due to the ICHEP effect. Data taking going strong.
      • Small problem on Friday with CMS quota settings on EOS impacting transfers; solved on Friday, and transfers caught up during the weekend
    • Monitoring problems
      • CMS Critical Services Kibana page (https://meter.cern.ch/public/_plugin/kibana/#/dashboard/temp/CMS::CMS)
        • Problems with services going "pink" (no info) intermittently continue
        • ticket: INC:1092589
        • Explanation from service provider
          • At the beginning of the week we experienced some issues with the migrated gateways: poor network performance caused some Lemon events to arrive very late.
          • During the week our ES cluster suffered from something similar (one of the master nodes was migrated) and, again because of the network performance, there were some timeouts in the connections (causing the meter dashboards not even to load).
          • On Thursday afternoon there was a planned intervention on CEPH volumes, causing some of our gateways to be down again.
          • Finally, on Friday the HDFS service was in intervention, which again caused our gateways to start clogging (normally this should not affect ES, but in this case the amount of data caused some disks to fill up).
          • Right now we are checking that all our infrastructure is back to a stable state, and we do not expect any data loss from these incidents.
      • Site Status Board: the links behind the column entries produce an empty page; to get to detailed SAM3 information, one needs to click on the site name and then navigate the SAM3 monitoring
    • Cloud Infrastructure - DNS resolution can be slow or fail with IPv6 in "cern_geneva_c" AVZ
      • OTG:0031373
      • We were affected and observed intermittent service outages
      • Difficult to determine where our VMs are running (Meyrin or Wigner)
        • Oliver: example ticket INC:1095816
        • one can check by IP address (see the sketch after this report):
          • Wigner addresses start with 188.185
          • Meyrin addresses start with 188.184
          • Meyrin also has other networks
        • Iain (after the meeting):
          • The availability zones always have the datacentre somewhere in the name:
            • $ openstack server show your-hostname
          • For Puppet nodes, both the zone and the datacentre are available as facts on the server.
    • HC tests failing - cannot unzip the sandbox tar
      • GGUS:123303
      • corrupted tarball due to the machine running out of disk space or to connectivity issues
      • new tests are fine, correcting test results for affected sites
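
Regarding the Cloud Infrastructure item above, here is a minimal Python sketch of the IP-based check; the 188.185 (Wigner) and 188.184 (Meyrin) prefixes come from the report, while the /16 masks and the example addresses are assumptions for illustration only (Meyrin also has other networks).

    # Minimal sketch: guess the datacentre from an IPv4 address.
    import ipaddress

    WIGNER_NET = ipaddress.ip_network("188.185.0.0/16")  # assumed mask
    MEYRIN_NET = ipaddress.ip_network("188.184.0.0/16")  # assumed mask

    def datacentre_from_ip(ip_string):
        # Meyrin also has other networks, so "unknown" does not rule it out.
        ip = ipaddress.ip_address(ip_string)
        if ip in WIGNER_NET:
            return "Wigner"
        if ip in MEYRIN_NET:
            return "Meyrin"
        return "unknown (possibly another Meyrin network)"

    # Made-up example addresses:
    for example in ("188.185.10.25", "188.184.93.1", "137.138.1.1"):
        print(example, "->", datacentre_from_ip(example))

For a definitive answer, the availability zone shown by "openstack server show" (see Iain's note above) names the datacentre directly.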

  • ALICE -
    • CERN: writing to CASTOR timed out Fri evening (alarm GGUS:123307)
      • the problem came back Sat evening, again quickly fixed, thanks!

  • LHCb reports (raw view) -
    • Activity
      • Monte Carlo simulation, data reconstruction/stripping and user jobs on the Grid
    • Site Issues
      • T0: NTR
      • T1: SARA.nl in downtime for the tape system being moved to a new location: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=20703
      • T1: RRCKI.ru will enter downtime next week: https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=21254
        • Julia: in a discussion with RRC-KI-T1 manager Eygene Ryabinkin
          it turned out that this intervention had to be decided quite suddenly
          due to external factors, hence could not be discussed in advance
        • Maarten: we will set up a page in the WLCG ops area that describes
          the generic guidelines for non-trivial planned downtimes;
          note that such guidelines may not always be honored exactly,
          because of external circumstances beyond the site's control

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • EGI:
  • FNAL: ntr
  • GridPP:
  • IN2P3: ntr
  • JINR:
    • One dCache pool selected for SAM was overloaded. Some settings were changed in the pool selection algorithm.
  • KISTI:
    • on Thu there will be a downtime from 04:00 to 06:00 UTC for maintenance on internal switches
      • the actual downtime may be as short as 10 min
  • KIT: ntr
  • NDGF:
  • NL-T1:
    • we have observed an overload of the SRM at NIKHEF that appears
      to have been due to a large number of tests; it went away by itself
  • NRC-KI:
  • OSG: ntr
  • PIC:
  • RAL:
    • On Thursday morning, the 10G backup OPN link to CERN was enabled for active use.
  • TRIUMF:

  • CERN computing services:
    • The ARGUS service (site-argus.cern.ch) will be upgraded to CentOS 7 and Argus 1.7 between 08:00 and 10:00 UTC on Thursday 11th. After the intervention all backend nodes will be running the new Argus release on CentOS 7. Note that some proxies are expected to be mapped to a different pool account after the migration, due to differences in how VOMS groups are dealt with in the new release; this has been seen especially with some CMS proxies.
      • Maarten: if all goes as planned, there will be a final instability
        experienced by up to a few CMS users; afterwards the mappings
        should at last become how we want them, and in particular this
        should fix the SAM gLExec tests of the CERN CEs
  • CERN storage services:
    • The CASTOR ALICE instance suffered two unavailabilities during the weekend (a GGUS ALARM ticket was raised, cf. the ALICE report). The root cause is still under investigation. In both cases the process responsible for the scheduling was stuck and a simple restart was enough to re-establish the service.
  • CERN databases: ntr
  • GGUS:
  • Monitoring:
    • Draft reports for the July availability sent around
    • Issue with the June final availability reports fixed on 1-8-2016. The site OU_OSCER_ATLAS joined after the draft reports were generated and before the final reports were generated, so it appeared with an empty line; these empty lines have been removed.
    • Issue with Kibana speed for meter: INC:1092589, being investigated. Since meter is going to be migrated quite soon to the unified monitoring, this issue might be addressed at the time of the migration.
    • The new portal (http://monit.cern.ch) is accessible to all CERN accounts. It is not in production and is under development, but there is already a lot of WLCG data. We will mention the new features every couple of weeks.
  • MW Officer:
  • Networks:
    • GGUS:121687 RAL consistent loss - waiting for router upgrade - plan is to upgrade by end of Sept.
    • GGUS:121905 BNL to SARA - consistent loss to other T1s (KIT, PIC, CERN) no longer seen after the last intervention, ticket will be closed.
    • GGUS:123285 - transfer errors for transfers from CA-MCGILL-CLUMEQ-T2 to NET2 - gridftp transfers timing out; the issue re-appeared, but it doesn't seem to be related to the network.
    • MIT inbound throughput - the issue started on July 7 and was narrowed down to the Internet2-to-MIT segment; MIT will open a ticket with Internet2 to investigate further.
  • Security:

AOB:
