Week of 140714

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Alandes (chair, minutes), Giuseppe Bagliesi (CMS), Luca Canali (IT-DB), Maria Dimou (GGUS), Alessandro di Girolamo (ATLAS), Felix Lee (ASGC), Maarten Litmaath (ALICE)
  • remote: Thomas Bellman (NDGF), Michael Ernst (BNL), Tiju Idiculla (RAL), Lisa Giacchetti (FNAL), Dmitry Nilsen (KIT), Elisabeth Prout (OSG), Alexander Verkooijen (NL-T1), Matteo (CNAF)

Experiments round table:

Tiju reports that these were in fact two different disk servers and that both are now back in production.

  • CMS reports (raw view) -
    • No major issues; processing and production are continuing
      • CSA14 exercise ongoing: kick-off on Monday 7th
    • Problems with voms-proxy-init and CRL on lxplus5 (solved) GGUS:106789
    • T0
      • NTR
    • T1
      • NTR

  • ALICE -
    • NTR

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: Due to a kernel upgrade on SL6 machines, the computing power of the farm has temporarily decreased while the affected machines are being rebooted.
  • FNAL: NTR
  • GridPP: Not present
  • IN2P3: Not present
  • JINR: Not present
  • KISTI: Not present
  • KIT: NTR
  • NDGF: A downtime for the tape libraries is scheduled for next Wednesday; ATLAS data will therefore be unavailable for the whole day. The service is expected to be back in production on Thursday.
  • NL-T1: NTR
  • OSG: NTR
  • PIC: NTR (reported offline in email)
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services: Not present
  • CERN storage services: Not present
  • Databases: NTR. Alessandro reports that the FTS 3 pilot and production services have been suffering from some instabilities, and it is believed that it would help to move the MySQL DB hosted on the same physical node to a different machine. Luca replies that it is better to follow this up with the DB on Demand team, to understand whether this in fact causes any overload and to study the possibility of moving it to a different machine, which in principle should be possible.
  • GGUS: Reminder! Release this Wednesday with ALARM tests using new GGUS host cert. Maria adds that the release should be transparent in any case.
  • Grid Monitoring: Not present
  • MW Officer: Not present

AOB:

Thursday

Attendance:

  • local: Maria Alandes (chair, minutes), Maria Dimou (GGUS), Kate Dziedziniewicz-Wojcik (IT-DB), Massimo Lamanna (Storage), Felix Lee (ASGC), Maarten Litmaath (ALICE), Andrea Manzi (MW Officer), Ulrich Schwickerath (Grid&Batch)
  • remote: Sang-Un Ahn (KISTI), Thomas Bellman (NDGF), Michael Ernst (BNL), Lisa Giacchetti (FNAL), Thomas Hartmann (KIT), John Kelly (RAL), Elisabeth Prout (OSG), Rolf Rumler (IN2P3), Kai Leffhalm (ATLAS), Saverio Virgilio (CNAF), Alexey Zhelezov (LHCb)

Experiments round table:

  • ATLAS reports (raw view) -
    • Tier0/1
      • CERNPROD_TZERO: staging errors on Tuesday; the system was overloaded by heavy request load and protected itself (GGUS:106878)
      • Taiwan: Network issues, under investigation (GGUS:106736); transfers with Taiwan as source are failing
      • FZK: Staging errors, decreasing since 9:00 UTC, under investigation

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC (74%) and User (26%) jobs only, with no critical problems.
    • T0: NTR
    • T1: NTR

Sites / Services round table:

  • ASGC: Ongoing investigation of a network issue with the Taiwan - CERN LHCOPN link. More details in ITSSB.
  • BNL: NTR
  • CNAF: NTR
  • FNAL: There were some problems with the TEST ALARM GGUS:106939 sent to FNAL after the new GGUS release. Maria Dimou will follow this up. Lisa also asks whether such test alarms could be sent at a more convenient time for FNAL; this will also be followed up.
  • GridPP: Not present
  • IN2P3: NTR
  • JINR: Not present
  • KISTI: The scheduled downtime from 7 am to 8 am UTC to upgrade the kernel went fine.
  • KIT: NTR. Maarten Litmaath asks how much data was lost after the tape issue. Kai reports that it is about 150 files. Maarten asks whether there will be a SIR for this, especially as it could also be useful for other sites. Kai replies that in principle they were not planning to write one, but it could be done.
  • NDGF: The downtime announced for yesterday did not take place because the HW arrived late. It has been postponed to next week.
  • NL-T1: Not present
  • OSG: NTR
  • PIC: Not present
  • RAL: NTR
  • RRC-KI: Not present
  • TRIUMF: Not present

  • CERN batch and grid services:
    • FTS3 software upgrade from 3.2.22 to 3.2.26 on Tuesday afternoon next week; transparent, see the ITSSB entry. It includes workarounds for frequent crashes in the underlying gridsite.
    • An attempt to upgrade our QA CEs went wrong on Monday and had to be rolled back. As a result these CEs were unavailable for 2-3 h. Ulrich also warns other sysadmins that downgrading the BLAH rpm removes all accounting records.
  • CERN storage services: Massimo reports that the problem reported by ATLAS in GGUS:106878 is still under investigation to better understand the real cause. Maarten points out that ATLAS is running stress tests and that the heavy load may come from these. Massimo says that this is not yet clear and that in any case it needs to be understood whether this is going to be the load for Run 2. More news probably on Monday.
  • Databases:
    • LHCBR Offline DB was not available at 9am today due to a storage misconfiguration. A permanent fix has been applied. More details in ITSSB.
    • ATLAS Conditions Data Replication is scheduled to be moved to Oracle GoldenGate next Wednesday. See ITSSB for more details.
  • GGUS: Msg from Guenter Grein (GGUS developer): During yesterday's alarm tests we faced a couple of problems with the new certificate. The reason is that the new certificate has the attribute "X509v3 Extended Key Usage" with values "TLS Web Server Authentication, TLS Web Client Authentication" but not the value "emailProtection". Therefore the verify operations at various T1s failed. We have now rolled back to the old certificate. The old certificate is valid until July 28. Meanwhile our CA has to fix the attribute issue. Progress recorded in JIRA:1276
  • Grid Monitoring: Not present
  • MW Officer:
    • A vulnerability has been found in PERFSONAR-PS and a fix has already been provided. The baseline table will be updated with a link to a twiki with upgrade instructions, which is going to be prepared by the Network&Transfer Metrics WG.
    • GGUS tickets will be sent in the coming days to ATLAS sites that are not up to date with the latest CVMFS client, 2.1.19.
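
Note on the GGUS certificate item above: the failure mode described by Guenter can be illustrated with a minimal sketch. It assumes the Extended Key Usage values are available as a comma-separated text line, as quoted in the message, and that a signed-mail verification is rejected when the extension is present without emailProtection. The function name and the accepted spellings are hypothetical and are not part of GGUS or of any site's tooling.

```python
# Spellings under which the emailProtection usage may appear in a text dump.
# Assumption: these two spellings are enough for this illustration.
EMAIL_PROTECTION_NAMES = {"emailprotection", "e-mail protection"}


def eku_allows_email_signing(eku_line):
    """Return True if the listed Extended Key Usage values permit mail signing.

    An empty line stands for a certificate without the extension, in which
    case key usage is not restricted by it (assumption for this sketch).
    """
    values = {v.strip().lower() for v in eku_line.split(",") if v.strip()}
    if not values:
        return True
    return bool(values & EMAIL_PROTECTION_NAMES)


# Values quoted in the message for the new GGUS host certificate.
new_cert_eku = "TLS Web Server Authentication, TLS Web Client Authentication"
print(eku_allows_email_signing(new_cert_eku))  # prints False
```

A check of this kind run before a release would flag a certificate that cannot be used to sign alarm mails, which is the situation that led to the rollback.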

AOB:

Topic attachments
  • 2014-07-15.pptx (2841.2 K, 2014-07-14 15:44, MariaDimou): Final GGUS slides for the 2014/07/15 WLCG MB
Topic revision: r14 - 2014-07-18 - MaartenLitmaath
 