Week of 130826

Daily WLCG Operations Call details

To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

| VO Summaries of Site Usability | SIRs | Broadcasts | Operations Web |
| ALICE, ATLAS, CMS, LHCb | WLCG Service Incident Reports | Broadcast archive | Operations Web |

General Information

| General Information | GGUS Information | LHC Machine Information |
| CERN IT status board, WLCG Baseline Versions, WLCG Blogs | GgusInformation | Sharepoint site - LHC Page 1 |


Monday

Attendance:

  • local: MariaD/SCOD, Felix/ASGC, Xavi/CERN-DSS, Eddy/CERN-Dashboards, Stefan/LHCb, Maarten/ALICE.
  • remote: Xavier/KIT, Salvatore/CNAF, Elena/ATLAS, Sang-Un/KISTI, David/CMS, Onno/NL_T1, Christian/NDGF, Rob+Kyle/OSG, Lisa/FNAL, Rolf/IN2P3, Michael/BNL, Pepe/PIC.

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • FTS3 testing at RAL: a huge number of simulation jobs were stuck in the transferring state on Friday. The FZK, LYON and RAL FTS were switched back to FTS2 on Friday afternoon. The problem was gone by Saturday morning for FZK and LYON, but incoming transfers to RAL were still affected. The site services box for RAL indicated a problem, which was fixed. By Saturday evening all delayed job outputs had been transferred.
      • CVMFS has been updated with new CMTUSERCONTEXT on Thursday.
      • FZK (GGUS:96839): transfer problems due to one pool blocking regular operations. The process was restarted, but the restart failed. The host was then rebooted, but that also failed with further errors. Under investigation.
Xavier/KIT explained that a problem with the storage system requires an intervention by the vendor, so it is set at risk till Tue evening. To Stefan's question on the nature of the CVMFS problem, Elena confirmed it is due to ATLAS software.

  • CMS reports (raw view) -
    • Quiet weekend. All tickets from the last report are solved except GGUS:96725.
    • New tickets since then, all relating to issues at T2 level:
      • Transfer issues -- GGUS:96816, Warsaw, and GGUS:96843, between Wisconsin & Belgian sites
      • GGUS:96826, T2_TR_METU: possible CVMFS-related SAM test failure

  • ALICE -
    • CERN: large number of job submission failures from ~08:30 to ~14:00 on Fri, causing the site to get mostly drained of ALICE jobs; caused by overload of the Argus servers; improvements are being looked into (INC:365019, opened by CMS)
    • CNAF: some jobs failed due to absence of a package on the SL5 WN (GGUS:96793); the admins added it (thanks!), while the SL6 (sic) VOBOX was reconfigured to prevent similar issues
    • RAL: on Fri many WN ran out of memory due to one ALICE user's broken jobs that allocated huge amounts of memory in a very short time; to stop the issue, ALICE jobs have been banned since Fri; the guilty tasks have been removed and their owner has been admonished

  • LHCb reports (raw view) -
    • Only MC productions ongoing, including using BOINC prototype
    • T0:
      • switched back to the FTS2 transfer system because of an issue with the accounting of the file transfer start date, which is then wrongly accounted within DIRAC; needs to be fixed on both sides.
      • CE301, CE302 new CEs show wrong OS information in the BDII
      • removal campaign for large number of histograms on CASTOR Disk was launched
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • KIT: nta
  • PIC: ntr
  • RAL: Bank holiday.
  • OSG: ntr
  • NDGF: Scheduled downtime this Thursday. ALICE and ATLAS will be affected.
  • NL_T1: ntr
  • IN2P3: Scheduled downtime all day on 24/9. Batch and Storage will definitely be down. Details will follow.
  • KISTI: ntr

  • CERN:
    • Dashboards: ntr
    • Storage: ntr

AOB:

Thursday

Attendance:

  • local: MariaD/SCOD
  • remote:

Experiments round table:

  • CMS reports (raw view) -
    • all extremely quiet, not much activity indeed
    • GGUS tickets still open:
      • GGUS:96725 (KISTI APEL status, "waiting for info")
      • GGUS:96816 ("Debug transfer failing from RAL_Disk to Warsaw", waiting for response by RAL)
      • GGUS:96843 ("Failing transfer from Wisconsin to IIHE", assigned to Wisconsin)
      • GGUS:96826 ("SAM CE error in T2_TR_METU, probably caused by CVMFS", waiting for answer)
    • New ones:
      • GGUS:96912 ("CVMFS problems", bridged just today from savannah)
    • We still have a SNOW ticket open: INC:365019, on CERN Argus servers being overloaded and not balancing properly. Not critical at the moment (not much traffic), but not solved either.
    • HC tests not being set (see here) between Aug 15th and yesterday. Fixed by CERN support.

  • ALICE -

Sites / Services round table:

  • ASGC:
  • BNL:
  • CNAF:
  • FNAL:
  • KIT:
  • PIC:
  • RAL:
  • OSG:
  • NDGF:
  • NL_T1:
  • IN2P3:
  • KISTI:

  • CERN:
  • GGUS: Brainstorming is under way on how to implement the possibility of notifying multiple sites via a single GGUS ticket; you are invited to contribute comments in Savannah:138299.

AOB:

Topic revision: r6 - 2013-08-29 - TommasoBoccali
 