Week of 130429

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO summaries of site usability: ALICE, ATLAS, CMS, LHCb
  • Service Incident Reports: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS information: GgusInformation
  • LHC machine information: Sharepoint site - LHC Page 1



  • local: Simone (SCOD), Jarka (CERN - dashboard), Maria (CERN - GGUS), Jerome (CERN - PES), Felix (ASGC), Stefan (CERN - ES), Belinda (CERN - DSS), Victor (LHCb), Marcin (CERN - DB), Pepe (PIC).
  • remote: Michael (BNL), Alexander (NL-T1), Xavier (KIT), Kyle (OSG), Christian (NDGF), Gareth (RAL), Stefano (CMS), Alessandro (ATLAS), Salvatore (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T1s
    • ND.ARC: on Saturday thousands of jobs failed with transfer timeouts, and a large backlog of jobs accumulated in the transferring state, as the FTS channel for ND->DE could not transfer the files fast enough (elog:44012-44015). The issue was fixed by increasing the timeout from 2 days to 4 days and doubling the number of parallel transfers.
        • From KIT: the FTS configuration at KIT was indeed changed on Sunday at 10:00. The number of active transfers from NDGF was increased from 10 to 20.

  • CMS reports (raw view) -
    • nothing to report on the distributed system
    • Oracle DB problem at CERN last Thursday. The CMS CRC opened an ALARM ticket GGUS:93653 at 13:38; SNOW ticket INC:285815 had been issued earlier, at 13:12, by the CMS operator. According to CERN DB Support, the problem started at 12:40 and was due to internal reasons not tied to CMS activity. CMS applications had been failing for a couple of hours before service was restored, causing alarm (but not panic) among operators. The first communication from CERN Oracle support came at the 15:00 meeting, but the CRC's phone link dropped near the end, so the CRC did not hear/understand it. CMS would benefit from direct written communication between DB Support and CMS operators. To be followed up by CMS Computing Operations.
      • Maria Dimou: opening an alarm ticket was indeed the correct action by CMS. What was the communication problem mentioned? Stefano: it would have been good to have direct communication from the Oracle team stating that there was a problem.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • The restriping campaign started on Saturday
    • T0:
    • T1: RAL (UK sites): some MC jobs are failing to upload from UK sites to various destinations

Sites / Services round table:

  • NL-T1: some problems staging files from tape this morning, due to load on the tape drives. The tape drive configuration has since been tuned accordingly.
  • OSG: still seeing some errors on the CERN BDII (a 20% drop in the number of entries w.r.t. the usual level). Looking into it.
  • NDGF: short downtime (marked as 2h) on Friday morning to reboot some disk servers.
  • RAL: on Wednesday morning there will be a warning (At Risk) for Oracle patching of the database behind CASTOR.



Topic revision: r3 - 2013-04-29 - SimoneCampana