Week of 160815

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, to make sure that the relevant parties have time to collect the required information or to invite the right people to the meeting.

Tier-1 downtimes

Experiments may experience problems if two or more of their Tier-1 sites are inaccessible at the same time. Therefore Tier-1 sites should do their best to avoid scheduling a downtime classified as "outage" in a time slot overlapping with an "outage" downtime already declared by another Tier-1 site supporting the same VO(s). The following procedure is recommended:
  1. A Tier-1 should check the downtimes calendar to see if another Tier-1 already has an "outage" downtime in the desired time slot.
  2. If there is a conflict, another time slot should be chosen.
  3. If stronger constraints do not allow choosing another time slot, the Tier-1 should point out the existence of the conflict to the SCOD mailing list and at the next WLCG operations call, to discuss it with the representatives of the experiments involved and the other Tier-1.

As an additional precaution, the SCOD will check the downtimes calendar for Tier-1 "outage" downtime conflicts at least once during his/her shift, covering the current and the following two weeks; if a conflict is found, it will be discussed at the next operations call, or offline if at least one relevant experiment or site contact is absent. A minimal sketch of such an overlap check is given below.
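
For illustration, the conflict check in the procedure above can be sketched in Python as follows. This is a minimal sketch under assumptions: the records, site names, dates and the (site, VOs, start, end) layout are hypothetical; a real check would read the "outage" entries from the GOCDB downtimes calendar instead.

    from datetime import datetime
    from itertools import combinations

    # Hypothetical "outage" downtime records: (site, supported VOs, start, end).
    # In practice these would be taken from the GOCDB downtimes calendar.
    downtimes = [
        ("RAL",   {"ALICE", "ATLAS", "CMS", "LHCb"},
         datetime(2016, 8, 23, 8), datetime(2016, 8, 23, 18)),
        ("IN2P3", {"ALICE", "ATLAS", "CMS", "LHCb"},
         datetime(2016, 8, 23, 14), datetime(2016, 8, 24, 14)),
        ("FNAL",  {"CMS"},
         datetime(2016, 8, 25, 8), datetime(2016, 8, 25, 12)),
    ]

    def conflicts(records):
        """Yield pairs of downtimes at different Tier-1s that overlap in time
        and affect at least one common VO."""
        for (s1, v1, a1, b1), (s2, v2, a2, b2) in combinations(records, 2):
            common = v1 & v2
            if common and a1 < b2 and a2 < b1:
                yield s1, s2, sorted(common), max(a1, a2), min(b1, b2)

    for s1, s2, vos, start, end in conflicts(downtimes):
        print(f"Conflict: {s1} and {s2} overlap for {vos} from {start} to {end}")

In this example, the RAL and IN2P3 slots overlap on the afternoon of Aug 23 for all four experiments, so one of the two sites would need to move its slot or raise the conflict with the SCOD as described above.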

Links to Tier-1 downtimes

ALICE | ATLAS | CMS  | LHCb
      | BNL   | FNAL |

Monday

Attendance:

  • local: Alberto (monitoring), Marian (networks), Alessandro (storage), Vincent (security), Maarten (ALICE), Maria D. (SCOD), Fernando (T0 computing), Ignacio (databases).
  • remote: Andrew (NL_T1), Daniele (CMS), David (FNAL), Dimitri (KIT), Sang-Un (KISTI), Gareth (RAL), Eygene (RRC-KI-T1), Ulf (NDGF), Zoltan (LHCb).

Experiments round table:

  • ATLAS reports (raw view) -
    • Activities:
      • Grid almost full except on Sunday, when we ran out of jobs.
      • Detection of dark data at RRC-KI-T1_DATADISK (108352 files); deletion in progress
      • Improved Tzero usage by Grid jobs
    • Problems:
      • Again problems with some Frontier and Squid servers, due to overload by overlay tasks

Eygene said that the site does not yet know whether the ~100k dark data files are registered in Rucio or only present in their storage. Jiri explained that, being dark data, they must be in the site storage and not in Rucio. A schematic example of such a consistency check is sketched below.
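
Schematically, such a consistency check boils down to a set difference between a storage dump and a Rucio replica dump. The sketch below is illustrative only: the dump file names and the one-path-per-line format are assumptions, not the actual ATLAS or Rucio tooling.

    # "Dark data" are files present in the site storage but absent from the
    # Rucio catalogue; "lost" files are the opposite case.
    # Dump file names and format are hypothetical.
    def load_paths(dump_file):
        """Read one file path per line from a dump, ignoring blank lines."""
        with open(dump_file) as f:
            return {line.strip() for line in f if line.strip()}

    storage   = load_paths("storage_dump.txt")  # what the storage element holds
    catalogue = load_paths("rucio_dump.txt")    # what Rucio believes is there

    dark = storage - catalogue  # on disk, unknown to Rucio: candidates for deletion
    lost = catalogue - storage  # in Rucio, missing on disk: recover or declare lost

    print(f"{len(dark)} dark files, {len(lost)} lost files")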

  • CMS reports (raw view) -
    • General:
      • Analysis and production activity still relatively slowed down due to the ICHEP effect. Data taking proceeds. Computing-wise, a busy but relatively quiet week.
    • Central services:
    • Tier-1s (selected items):
      • T1_US_FNAL (Aug 9-11): SAM status errors for BDII ("Visible"), and in general SAM3 SRM behavior rather unstable
        • On the SRM side, Dave Mason reported that xrootd was hung; it was restarted on Aug 13 and things went green again
      • T1_US_FNAL (Aug 14): high failure rates
        • Dave Mason reported these are probably related to the hours earlier when xrootd was failing (see above); getting better
      • T1_FR_CCIN2P3 (Aug 12): problem on SAM3 CE. GGUS:123381 opened. Debugged to be caused by the analysis-ops pool being full
      • T1_RU_JINR (Aug 13): SAM3 CE in Critical State. GGUS:123390 opened
      • T1_RU_JINR (Aug 14): PhEDEx agents for the Disk node were down for a while; the shifter (mistakenly) opened a GGUS ticket, while the site was actually in a scheduled downtime

Alessandro said that the strange used-space values seen on the dashboard are related to the EOS namespace restart.

  • ALICE -
    • CERN: EOS-ALICE was down Tue evening (GGUS:123343)
      • Namespace lock cured by a restart, after repair and compactification
      • In parallel, the EOS version was updated to the latest release

  • LHCb reports (raw view) -
    • Activity
      • Monte Carlo simulation, data reconstruction and user jobs on the Grid
    • Site Issues
      • T0:
        • There were errors writing to CASTOR (GGUS:123349), due to an Oracle update.
      • T1:
        • RAL.uk: jobs cannot open an SQLite DB via CVMFS (GGUS:123382, in progress)

Sites / Services round table:

  • ASGC: not connected
  • BNL: not connected
  • CNAF: not connected
  • EGI: not connected
  • FNAL: xrootd went down and was restarted during the weekend. It is not clear why this happened.
  • GridPP: not connected
  • IN2P3: not connected - Public holiday 15 August
  • JINR: not connected
  • KISTI: ntr
  • KIT: ntr
  • NDGF: A machine shut down by itself during the weekend. The service is now restored, but the reasons remain unknown. A scheduled downtime for next Thursday, which will make some ATLAS and ALICE files unavailable, has been published in GOCDB.
  • NL-T1: SARA is experiencing problems between the dCache server and the tape library. An unscheduled downtime was published to resolve this.
  • NRC-KI: The data center cooling maintenance will take at least 5 days, while 2 weeks have conservatively been declared for the downtime.
  • OSG: not connected
  • PIC: not connected
  • RAL: a problem on a database machine affected the CASTOR database on Friday evening. ALICE was impacted. Gareth will check the ticket mentioned by LHCb.
  • TRIUMF: not connected

  • CERN computing services: A new ARGUS server version was installed last Thursday.
  • CERN storage services: nothing to add
  • CERN databases: ntr
  • GGUS: Some maintenance work on the KIT border routers will take place on August 22nd. Although the network connection is redundant, short outages may occur. Please see also https://goc.egi.eu/portal/index.php?Page_Type=Downtime&id=21275 and the GGUS news for details.
  • Monitoring:
  • MW Officer:
  • Networks:
    • GGUS:121687: RAL consistent packet loss; waiting for a router upgrade, planned by the end of September
    • MIT inbound throughput issue: started July 7, narrowed down to the Internet2-to-MIT segment. Recently an additional issue was reported with LHCONE routing at the MIT border
    • LCG distribution router upgrade scheduled for Tue Aug 23 2016, 10:00-10:30 CEST
  • Security: ntr

AOB:

Topic attachments
  • GGUS-for-MB-Aug-16.pptx (pptx, 2848.0 K, 2016-08-15, MariaDimou): GGUS slide for the cancelled MB service report of Aug 16th