Week of 150119

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions can be found here.

General Information

  • The purpose of the meeting is:
    • to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
    • to announce or schedule interventions at Tier-1 sites;
    • to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
    • to provide important news about the middleware;
    • to communicate any other information considered interesting for WLCG operations.
  • The meeting should run from 15:00 until 15:20, exceptionally to 15:30.
  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web
  • Whenever a particular topic needs to be discussed at the daily meeting requiring information from sites or experiments, it is highly recommended to announce it by email to wlcg-operations@cern.ch to make sure that the relevant parties have the time to collect the required information or to invite the right people to the meeting.

Monday

Attendance:

  • local: Belinda (storage), Ignacio (grid services), Lorena (databases), Maarten (SCOD + ALICE)
  • remote: Christoph (CMS), Di (TRIUMF), Dmytro (NDGF), Felix (ASGC), Michael (BNL), Onno (NLT1), Pavel (KIT), Pepe (PIC), Renato (LHCb), Rolf (IN2P3), Tiju (RAL)

Experiments round table:

  • ATLAS reports (raw view) -
    • CentralServices/T0
      • many APF hosts have a full /tmp; pilots are not submitted in the DE, ES, FT, IT, NL, TW, UK clouds
    • Data loss at SARA, reported Friday afternoon
      • 500k files lost
      • Rucio recovery ongoing; monitoring to provide a list of definitely lost files is still missing
      • most of the files are lost, the logs are all lost, ~80k files are recoverable
      • it is difficult to derive the dataset from a file name (see the sketch after this report)
      • the recovery is being planned with ProdSys-2
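A minimal sketch (not the ATLAS recovery procedure) of how lost files could be mapped back to their datasets, assuming the Rucio client's list_parent_dids call and a hypothetical plain-text list of "scope:name" entries:

    # Sketch only: group lost files by their parent dataset with the Rucio client.
    # The input file name and its "scope:name" format are hypothetical.
    from collections import defaultdict
    from rucio.client import Client  # assumes a configured Rucio client environment

    client = Client()
    datasets = defaultdict(list)

    with open("lost_files_sara.txt") as f:        # hypothetical list of lost files
        for line in f:
            scope, name = line.strip().split(":", 1)
            # list_parent_dids yields the dataset(s) containing this file DID
            for parent in client.list_parent_dids(scope=scope, name=name):
                datasets[f"{parent['scope']}:{parent['name']}"].append(f"{scope}:{name}")

    # One entry per dataset, e.g. as input for the ProdSys-2 recovery planning
    for ds, files in sorted(datasets.items()):
        print(f"{ds}: {len(files)} lost files")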

  • CMS -
    • Only very moderate CMS activity in the system
    • RelVal production was delayed by CVMFS Stratum 0 problems over the weekend (GGUS:111223)
    • The production system attempted to run over an 'invalidated' file at IN2P3-CC (GGUS:111163); the ticket is to be closed from the CMS side

  • ALICE -
    • high activity until Thu Jan 15 and again since Fri evening
    • on Thu ~08:45 a few central service certificates expired
      • they were quickly replaced and the services restarted (a proactive expiry check is sketched after this report)
      • as of that time, very few tasks managed to run
      • most of the time the VOBOX and/or the Job Agents found no matching jobs for their site
      • a hefty debugging exercise finally converged Fri ~16:00
      • a bug in the operation of an internal cache got exposed when the central services had to be restarted with new certificates
    • big data loss at SARA (NLT1) due to RAID controller failure
      • 108k ALICE files (~8 TB) lost
      • catalog cleanup and re-replications to be done
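A minimal sketch, not the ALICE setup, of the kind of proactive check that would flag expiring service certificates ahead of time, using openssl's -checkend option; the certificate paths are hypothetical placeholders:

    # Sketch only: warn when service certificates are close to expiry.
    import subprocess

    CERTS = ["/etc/grid-security/hostcert.pem"]   # hypothetical list of service certificates
    WARN_SECONDS = 14 * 24 * 3600                 # warn two weeks ahead

    for cert in CERTS:
        # "openssl x509 -checkend N" exits non-zero if the certificate expires within N seconds
        check = subprocess.run(
            ["openssl", "x509", "-checkend", str(WARN_SECONDS), "-noout", "-in", cert],
            capture_output=True,
        )
        if check.returncode != 0:
            enddate = subprocess.run(
                ["openssl", "x509", "-enddate", "-noout", "-in", cert],
                capture_output=True, text=True,
            ).stdout.strip()
            print(f"WARNING: {cert} expires soon ({enddate})")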

  • LHCb reports (raw view) -
    • "Legacy Run1 Stripping" campaign "almost" done + MC and user jobs.
    • T0:
    • T1: A ticket was opened on Friday to Nikhef/SARA concerning lost files (more than 100k); today they were declared "lost".
      • GGUS:111205 (in progress)
      • The LHCb data management team is working on recovering whatever is possible.

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF:
  • FNAL:
  • GridPP:
  • IN2P3: ntr
  • JINR:
  • KISTI:
  • KIT: ntr
  • NDGF:
    • tomorrow 12:00-14:00 UTC dCache head nodes upgrade, all data will be unavailable during the intervention
  • NL-T1:
    • Last Wednesday a dCache pool node had a broken backplane. The storage is configured as RAID60: a RAID0 set striped over 3 RAID6 arrays. One of the RAID6 arrays went down, leaving the whole RAID0 set in a failed state (see the toy model after this list). Efforts by the vendor and by us to recover the RAID0 failed, so we had to declare the files permanently lost. ATLAS, ALICE and LHCb had files on this node and have been provided with lists of the lost files; LHCb submitted GGUS:111205. With help from the vendor we are getting the RAID up and running again. Due to some technical difficulties and the recommended testing we expect the node to be back in production by the end of the week; until then there may be a little less storage capacity. Graphs are available at http://web.grid.sara.nl/dcache.php?r=month#poolgroups. Our sincere apologies for the inconvenience.
      • Maarten: was the system wrong in declaring the RAID-0 data lost, instead of waiting for the missing RAID-6 to come back? Can another such incident easily happen on another disk server?
      • Onno: with the vendor we will carefully analyze the incident to obtain answers to such questions.
      • Renato: should the SE be banned for now, or can it be used as normal?
      • Onno: the unaffected parts of the SE can be used as normal.
  • NRC-KI:
  • OSG:
  • PIC:
    • this morning an erroneous change of the squid server ACLs caused CMS SAM tests to fail; should be OK now
  • RAL: ntr
  • TRIUMF: ntr
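A toy model of the RAID60 failure mode reported by NL-T1 above: the RAID0 stripe has no redundancy of its own, so losing a single RAID6 leg loses the whole pool. Only the 3-leg RAID0-over-RAID6 layout comes from the report; the failure counts below are illustrative:

    # Toy model only: availability of a RAID0 stripe over three RAID6 legs.

    def raid6_ok(failed_disks: int) -> bool:
        # A RAID6 array tolerates at most two failed disks
        return failed_disks <= 2

    def raid60_ok(failed_per_leg: list[int]) -> bool:
        # The RAID0 stripe fails as soon as any one of its legs fails
        return all(raid6_ok(f) for f in failed_per_leg)

    print(raid60_ok([2, 0, 1]))   # True: every leg still within RAID6 tolerance
    print(raid60_ok([3, 0, 0]))   # False: one leg down, the whole stripe is lost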

  • CERN batch and grid services: The FTS server fts3.cern.ch will be upgraded from 3.2.30 to 3.2.31 at 10:00 CET on Tuesday the 20th (OTG:0017635). The upgrade is expected to be fully transparent.
  • CERN storage services: ntr
  • Databases:
    • during the weekend the 4th instance of the LCGR database crashed due to an Oracle internal error; a high-priority service request has been opened with Oracle
      • Maarten: that incident appears to have been fairly transparent to the experiments
  • GGUS:
  • Grid Monitoring:
  • MW Officer:

AOB:

Thursday

Attendance:

  • local: Andrea Sciabà, Lorena Lobato (databases), Maarten Litmaath (ALICE), Belinda Chan Kwok Cheong (storage services), Andrea Manzi (middleware officer).
  • remote: Andrej Filipcic (ATLAS), Dmytro Karpenko (NDGF), John Kelly (RAL), Lisa Giacchetti (FNAL), Renato Santana (LHCb), Thomas Hartmann (KIT), Sang Un Ahn (KISTI), Jeremy Coles (GridPP), Rolf Rumler (IN2P3-CC), Di Qing (TRIUMF), Salvatore Tupputi (CNAF), Kyle Gross (OSG).

Experiments round table:

  • ATLAS reports (raw view) -
    • CentralServices/T0
      • RAL FTS3 upgrade to be done Friday at 10:00
      • Recovering from the data loss at SARA; the Rucio recovery procedure is working but the relevant information on the files/datasets removed from the catalogs needs to be obtained from the Rucio log files for now.
      • Having issues with CREAM at CERN as of this morning. There is no ticket for it yet, but experts are looking into it.

  • CMS
    • Nobody from CMS could join today.
    • Nothing to report

  • ALICE -
    • Also working on the SARA data loss. The impact is not huge, but work is needed to update the file catalog and re-replicate files. Some files have been irretrievably lost, but they were not very important.

  • LHCb reports (raw view) -
    • "Legacy Run1 Stripping" campaign "almost" done + MC and user jobs.
      • T1: The ticket opened on Friday to Nikhef/SARA concerning lost files (more than 100k) was closed yesterday:
        • GGUS:111205
        • According to the LHCb data management team, 80% of the files could not be recovered.

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF: a bug was found in StoRM 1.11.5 affecting some ATLAS transfers. Version 1.11.6 was released and installed to fix it; it is not yet available in EMI. The bug had probably been there for a long time but was only exposed now. It should not be a serious issue for other sites.
  • FNAL: there is a problem (GGUS:111106) affecting most CMS OSG sites, by which their CEs drop out of the CMS VO feed, affecting the SAM monitoring. Maarten says the problem appeared on January 12 and is caused by a change in the case of site names, which clashes with the (incorrect) case-sensitive matching in the VO feed generation code (see the sketch after this list). While the VO feed will be changed to remove that assumption, the real source of the problem remains unknown. In principle there should be no impact on the daily availability/reliability, as CEs do not disappear for a whole day, but Lisa mentions that some sites were indeed affected. Maarten says that any false A/R results will be corrected as usual.
  • GridPP: ntr
  • IN2P3-CC: ntr
  • JINR:
  • KISTI: ntr
  • KIT: ntr
  • NDGF: ntr
  • NL-T1: ntr
  • NRC-KI:
  • OSG: no progress on the issue reported in GGUS:111078 about transfer failures to UPENN. Andrej will follow it up in ATLAS.
  • PIC:
  • RAL: the power tests conducted in the past days were successful, the system looks healthy.
  • TRIUMF: A 4-hour site-wide electrical power shutdown to allow for work on TRIUMF 12.5 kV switchgear is foreseen for Wednesday 28 around 16:00 local time.
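A minimal sketch of the kind of fix Maarten describes for the VO feed issue above: matching site names case-insensitively so that a change of case upstream no longer drops a site's CEs. The data structures and names are hypothetical, not the actual VO feed generation code:

    # Sketch only: case-insensitive matching of CEs to registered site names.

    def build_feed(site_ces: dict[str, list[str]], known_sites: list[str]) -> dict[str, list[str]]:
        # Index registered site names by their lower-cased form
        canonical = {name.lower(): name for name in known_sites}
        feed: dict[str, list[str]] = {}
        for site, ces in site_ces.items():
            match = canonical.get(site.lower())
            if match is not None:
                feed.setdefault(match, []).extend(ces)
        return feed

    # Example: a case change such as "T1_US_Fnal" vs "T1_US_FNAL" no longer drops the CE
    print(build_feed({"T1_US_Fnal": ["ce.example.org"]}, ["T1_US_FNAL"]))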

  • CERN batch and grid services:
  • CERN storage services: ntr
  • Databases: no news about the problem with one machine in LCGR that happened last weekend. Waiting for information from Oracle support.
  • GGUS:
    • Transparent network intervention at KIT on Monday from 5:00 to 7:00.
    • The release will be done on the 28th of January. The usual test alarms will be sent to all Tier-1 sites.
  • Grid Monitoring:
  • MW Officer: ntr

AOB: -- AndreaSciaba - 2014-12-16
