Week of 130429

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

  • VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
  • SIRs: WLCG Service Incident Reports
  • Broadcasts: Broadcast archive
  • Operations Web: Operations Web

General Information

  • General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
  • GGUS Information: GgusInformation
  • LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: Simone (SCOD), Jarka (CERN - dashboard), Maria (CERN - GGUS), Jerome (CERN - PES), Felix (ASGC), Stefan (CERN - ES), Belinda (CERN - DSS), Victor (LHCb), Marcin (CERN - DB), Pepe (PIC).
  • remote: Michael (BNL), Alexander (NL-T1), Xavier (KIT), Kyle (OSG), Christian (NDGF), Gareth (RAL), Stefano (CMS), Alessandro (ATLAS), Salvatore (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T1s
      • ND.ARC: on Saturday thousands of jobs failed with transfer timeouts, and a large number of jobs were stuck in the transferring state, because the FTS channel for ND->DE could not transfer the files fast enough (elog:44012-44015). The issue was fixed by increasing the timeout from 2 days to 4 days and doubling the number of parallel transfers.
        • From KIT: the FTS configuration at KIT was indeed changed on Sunday at 10:00 AM. The number of active transfers from NDGF was increased from 10 to 20.

  • CMS reports (raw view) -
    • nothing to report on the distributed system
    • Oracle DB problem at CERN last Thursday. The CMS CRC opened an ALARM ticket, GGUS:93653, at 13:38. SNOW ticket INC:285815 had been issued earlier, at 13:12, by a CMS operator. According to CERN DB Support, the problem started at 12:40 and was due to internal reasons not tied to CMS activity. CMS applications had been failing for a couple of hours before service was restored, causing alarms (but not panic) among operators. The first communication from CERN Oracle support came at the 15:00 meeting, but the CRC phone link dropped near the end and the CRC did not hear or understand it. We could benefit from direct, written communication between DB Support and CMS operators. To be followed up by CMS Computing Operations.
      • Maria Dimou: CMS's action of opening an alarm ticket was indeed the correct one. What was the communication problem mentioned? Stefano: it would have been good to have a direct communication from the Oracle team stating that there was a problem.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Restripping campaign started on Saturday
    • T0:
    • T1: RAL (UK sites): some MC jobs are failing to upload from UK sites to different destinations

Sites / Services round table:

  • NL-T1: some problems staging files from tape this morning, due to load on the tape drives. The tape drive configuration has since been tuned accordingly.
  • OSG: still seeing some errors on the CERN BDII (a 20% drop in the number of entries compared to usual). Looking into it.
  • NDGF: short downtime (marked 2h) on Friday morning to reboot some disk servers.
  • RAL: on Wednesday morning, a warning (At Risk) for Oracle patching of the DB behind CASTOR.

AOB:

Thursday

Attendance:

  • local: AndreaV/SCOD, Alessandro/ATLAS, Felix/ASGC, Maarten/ALICE, Jerome/Grid services, Marcin/Databases, Belinda/Storage, Victor/PIC&LHCb
  • remote: Jeff/NLT1, Xavier/KIT, Gareth/RAL, Kyle/OSG, Paolo/CNAF, Lisa/FNAL, Rolf/IN2P3, Christian/NDGF, Jeremy/Gridpp; MariaD/GGUS, David/CMS

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • NTR
    • T1s
      • today (Thursday) a disk server problem appeared at RAL, causing timeouts when reading some files; investigation ongoing (GGUS:93795)

  • CMS reports (raw view) -
    • In general a slow half-week.
    • 2012 Rereco is complete; some MC production on Tier 1s is ramping up. In general, activity is low.
    • CMS-only Tier 2s are encouraged to make the transition to SL6; sites shared with other VOs need to wait until June 1 to do this.

  • ALICE -
    • central services: some servers went down due to a long power cut this morning, causing most jobs to fail and the queues to get drained; the job numbers are slowly ramping up again

  • LHCb reports (raw view) -
    • Incremental stripping campaign in progress: ramping up, with between 2k and 3k stripping and merging jobs and 98% of execution completed
    • T0:
    • T1:

Sites / Services round table:

  • Jeff/NLT1:
    • there was a 30-minute unscheduled intervention this morning for tapes, but the disk buffers were working OK
    • Nikhef will move to SLC6 during the week of May 21st
  • Xavier/KIT: ntr
  • Gareth/RAL:
    • investigating issues reported by ATLAS
    • also working on the firmware upgrade of ALICE disk servers, batch queues have been drained
    • next week, on Wednesday or Thursday, the DB backend of CASTOR will be switched over between primary and standby
  • Kyle/OSG: ntr
  • Paolo/CNAF: there was an unscheduled intervention for ALICE three days ago, everything was solved by yesterday
  • Lisa/FNAL: ntr
  • Rolf/IN2P3: ntr
  • Christian/NDGF: ntr
  • Jeremy/Gridpp: ntr
  • Felix/ASGC: there has been an unscheduled intervention on the CREAM CE
  • Victor/PIC: following up issues with the CREAM CE on SLC6, will wait for the expert next week

  • Jerome/Grid services: the CREAM CE upgrade to EMI2 at CERN is ongoing
  • Marcin/DB: ntr
  • Belinda/Storage: ntr
  • MariaD/GGUS:
    • The next GGUS release will take place on June 2nd.
    • The file ggus-tickets.xls is up to date and attached to the TWiki page WLCGOperationsMeetings. There have been 2 real ALARMs so far since the last WLCG MB, which took place on 2013/03/19.

AOB: the meetings next week will take place on Monday and Friday (CERN is closed on Thursday)

Topic revision: r9 - 2013-05-02 - MaartenLitmaath