Week of 130826

Daily WLCG Operations Call details

To join the call at 15.00 CE(S)T, Monday to Friday inclusive (in CERN 513 R-068), do one of the following:

  1. Dial +41227676000 (Main) and enter access code 0119168, or
  2. To have the system call you, click here

The SCOD rota for the next few weeks is at ScodRota

WLCG Availability, Service Incidents, Broadcasts, Operations Web

VO Summaries of Site Usability: ALICE, ATLAS, CMS, LHCb
SIRs: WLCG Service Incident Reports
Broadcasts: Broadcast archive
Operations Web: Operations Web

General Information

General Information: CERN IT status board, WLCG Baseline Versions, WLCG Blogs
GGUS Information: GgusInformation
LHC Machine Information: Sharepoint site - LHC Page 1


Monday

Attendance:

  • local: MariaD/SCOD, Felix/ASGC, Xavi/CERN-DSS, Eddy/CERN-Dashboards, Stefan/LHCb, Maarten/ALICE.
  • remote: Xavier/KIT, Salvatore/CNAF, Elena/ATLAS, Sang-Un/KISTI, David/CMS, Onno/NL_T1, Christian/NDGF, Rob+Kyle/OSG, Lisa/FNAL, Rolf/IN2P3, Michael/BNL, Pepe/PIC.

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • FTS3 testing at RAL: a huge number of simulation jobs were in transferring state on Friday. The FZK, LYON and RAL FTS were switched back to FTS2 on Friday afternoon. The problem was gone by Saturday morning for FZK and LYON, but there was still a problem with incoming transfers at RAL. The site services box for RAL indicated a problem, which was fixed. By Saturday evening all delayed job outputs had been transferred.
      • CVMFS was updated with a new CMTUSERCONTEXT on Thursday.
      • FZK (GGUS:96839): transfer problems due to one pool blocking regular operations. The process was restarted, but failed; the host was then rebooted, but that also failed due to further errors. Under investigation.
Xavier/KIT explained that a problem with the storage system requires an intervention by the vendor, so the site is set at risk until Tuesday evening. To Stefan's question about the nature of the CVMFS problem, Elena confirmed it is due to ATLAS software.

  • CMS reports (raw view) -
    • Quiet weekend; all tickets from the last report solved, except GGUS:96725
    • New tickets since then, all relating to issues at the T2 level:
      • Transfer issues: GGUS:96816 (Warsaw) and GGUS:96843 (between Wisconsin and Belgian sites)
      • GGUS:96826: T2_TR_METU, possible CVMFS-related SAM test failure

  • ALICE -
    • CERN: large number of job submission failures from ~08:30 to ~14:00 on Fri, causing the site to get mostly drained of ALICE jobs; caused by overload of the Argus servers; improvements are being looked into (INC:365019, opened by CMS)
    • CNAF: some jobs failed due to absence of a package on the SL5 WN (GGUS:96793); the admins added it (thanks!), while the SL6 (sic) VOBOX was reconfigured to prevent similar issues
    • RAL: on Fri many WN ran out of memory due to one ALICE user's broken jobs that allocated huge amounts of memory in a very short time; to stop the issue, ALICE jobs have been banned since Fri; the guilty tasks have been removed and their owner has been admonished

  • LHCb reports (raw view) -
    • Only MC productions ongoing, including use of the BOINC prototype
    • T0:
      • Switched back to the FTS2 transfer system because of an issue with the accounting of the file transfer start date, which is also accounted wrongly within DIRAC; this needs to be fixed on both sides.
      • CE301 and CE302, the new CEs, show wrong OS information in the BDII
      • A removal campaign for a large number of histograms on CASTOR disk was launched
    • T1:
      • NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • KIT: nta
  • PIC: ntr
  • RAL: Bank holiday.
  • OSG: ntr
  • NDGF: Scheduled downtime this Thursday. ALICE and ATLAS will be affected.
  • NL_T1: ntr
  • IN2P3: Scheduled downtime all day on 24/9. Batch and Storage will definitely be down. Details will follow.
  • KISTI: ntr

  • CERN:
    • Dashboards: ntr
    • Storage: ntr

AOB:

Thursday

Attendance:

  • local: MariaD/SCOD, Eddy/Dashboards, Stefan/LHCb, Felix/ASGC, Belinda/Storage, Kate/Databases, Steve/Grid services, Maarten/ALICE.
  • remote: Tommaso/CMS, Michael/BNL, Rolf/IN2P3, Lisa/FNAL, Pavol/ATLAS, Jeremy/GridPP, Tiju/RAL, Sang-Un/KISTI, Rob/OSG, Dennis/NL_T1, Pepe/PIC, Lucia/CNAF, Christian/NDGF.

Experiments round table:

  • ATLAS reports (raw view) -
    • T0
      • ntr
    • T1
      • FTS3 at RAL was successfully patched on Wednesday; the number of active transfers is now up to two times higher than before the intervention
      • The FZK disk pool problem (GGUS:96839) is still being worked on

  • CMS reports (raw view) -
    • all extremely quiet, not much activity indeed
    • GGUS tickets still open:
      • GGUS:96725 (submitted by UOS (University of Seoul, a Tier-2 for CMS in Korea) for Responsible Unit 'APEL'; status "waiting for reply", so the action is on the submitter!)
      • GGUS:96816 ("Debug transfer failing from RAL_Disk to Warsaw", waiting for response by RAL)
      • GGUS:96843 ("Failing transfer from Wisconsin to IIHE", assigned to Wisconsin)
      • GGUS:96826 ("SAM CE error in T2_TR_METU, probably caused by CVMFS", waiting for answer)
    • New ones:
      • GGUS:96912 ("CVMFS problems", bridged just today from Savannah)
    • We still have a SNOW ticket open: INC:365019, about the CERN Argus servers having overload problems and not balancing properly. Not critical at the moment (not much traffic), but not solved either.
    • HC tests were not being sent (see here) between Aug 15th and yesterday. Fixed by CERN support.

  • ALICE -
    • RAL: ALICE jobs unbanned since Mon late afternoon, normal share back since yesterday noon, thanks!

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • KIT: none connected
  • PIC: ntr
  • RAL: ntr
  • OSG: ntr
  • NDGF: Scheduled intervention on storage servers next Tuesday. ALICE and ATLAS will be affected.
  • NL_T1: ntr
  • IN2P3: ntr
  • KISTI: A network incident last night caused a DNS problem that blocked EOS. This was solved by midnight. All in operation now.

  • CERN:
    • Dashboards: ntr
    • Databases: A problem occurred in an instance of the CMS online database this morning. A fix will be installed transparently next Monday at 10:30 CEST.
    • Grid services: A wrongly configured node of the site BDII wasn't visible outside the CERN firewall. Work is ongoing on the connection with IN2P3. Details in http://itssb.web.cern.ch/service-incident/connections-in2p3-lxplus-slow/29-08-2013
    • Storage: Nothing other than the problem reported by KISTI.

AOB:
