Week of 140922

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Vidyo system. Instructions to join the phone conference can be found here.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Maria Dimou (chair, minutes), Maria Alandes (WLCG Ops Coord), Maarten Litmaath (ALICE), Kate Dziedziniewicz-Woycik (Databases), Alberto Peon (Grid Services), Tsung-Hsun Wu (ASGC), Alessandro di Girolamo (ATLAS), Andrea Sciaba (WLCG Ops Coord).
  • remote: Dea-Han Kim (KISTI), Jose Flix (PIC), Kyle Gross (OSG), Gareth Smith (RAL), Rolf Rumler (IN2P3), Onno Zweers (NL-T1), Pavel Weber (KIT), Ulf Tigerstedt (NDGF), Vladimir Romanovski (LHCb), Sonia (CNAF).
Experiments round table:

  • ATLAS reports (raw view) -
    • T0/T1s
      • NDGF-T1 GGUS:108666 and GGUS:108662 - File transfer issues. As the site answered, they are in downtime because one centre is without power.
      • RAL-LCG2 GGUS:108668 - File transfer issue, "globus_ftp_client error, no such file or directory"
      • INFN-T1 GGUS:108489 - File transfer issue, files missing. ATLAS should verify whether these files (atlasdatadisk/rucio/mc14_13TeV/) have really been deleted from the site or not.
      • BNL-ATLAS GGUS:108669 - File transfer issue, "Failed to get source file size"; BNL experts are working on it.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs.
    • T0: NTR
    • T1: NTR
    • Services: NTR
Sites / Services round table:

  • ASGC: NTR
  • BNL: not connected
  • CNAF: The downtime originally scheduled for 24/9 has to be moved. A new candidate date is 14/10, pending confirmation from CMS.
  • FNAL: not connected
  • GridPP: not connected
  • IN2P3: Reminder of tomorrow's scheduled downtime.
  • JINR: not connected
  • KISTI: NTR
  • KIT: One CMS file was lost (reported via the chat window, as the sound was not working).
  • NDGF: Computer centre cooling problems in Copenhagen led to overheating and forced a shutdown of the systems. Everything is back now, perhaps running a bit more slowly, but no data was lost.
  • NL-T1: NTR, but a question: where are we with SHA-2? Maarten explained that many CAs have already been issuing SHA-2 certificates for at least half a year, the middleware is ready, and all WLCG services work. The big campaign was, and still is, about updating the information for the new VOMS servers; the rest is transparent. (A quick way to check a certificate's signature algorithm is sketched after this list.)
  • OSG: NTR
  • PIC: Over the weekend a CMS user opened a very large number of connections to the USA, which degraded the PIC firewall. The matter is being investigated; it is not yet understood.
  • RAL:
  • RRC-KI:
  • TRIUMF:
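
Following up on the SHA-2 question above: a minimal sketch, assuming recent versions of the Python 'cryptography' package, of how one can check which hash algorithm signs a given certificate. The hostcert.pem path is only the conventional grid location and may differ per host.

  from cryptography import x509

  def signature_hash(pem_path):
      # Load the PEM certificate and report its signature hash name,
      # e.g. 'sha256' for the SHA-2 family, 'sha1' for the deprecated SHA-1.
      with open(pem_path, "rb") as f:
          cert = x509.load_pem_x509_certificate(f.read())
      return cert.signature_hash_algorithm.name

  print(signature_hash("/etc/grid-security/hostcert.pem"))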

  • CERN batch and grid services: NTR
  • CERN storage services: not present
  • Databases: NTR
  • GGUS: not present
  • Grid Monitoring: not present
  • MW Officer: not present
AOB:

Thursday

Attendance:

  • local: Maria Dimou (chair, minutes), Maarten Litmaath (ALICE), Kate Dziedziniewicz-Woycik (Databases), Alberto Peon (Grid Services), Felix Hung-Te Lee (ASGC), Pablo Saiz (Grid Monitoring & GGUS).
  • remote: Dea-Han Kim (KISTI), Jose Flix (PIC), Kyle Gross (OSG), Gareth Smith (RAL), Rolf Rumler (IN2P3), Preslav Borislavov Kostantinov (KIT), Ulf Tigerstedt (NDGF), Vladimir Romanovski (LHCb), Lucia Morandi (CNAF), Andrej Filipcic (ATLAS), Eric Wayne Vaandering (CMS), Guenter Grein (GGUS).
Experiments round table:

  • ATLAS reports (raw view) -
    • CentralService/T0/T1s
      • nothing to report in terms of obvious site-specific issues
      • we started a massive reconstruction campaign (500M events) last week, all in MultiCore. Various issues were observed at many sites, often because jobs were killed due to high VMEM: we would like to remind sites that in a 64-bit world VMEM is not an optimal metric to use for killing jobs (a short illustration follows this list). We asked ATLAS sites to check:
        1. That the resources are not needlessly kept in test by the system, i.e. set to test some time ago and never brought back online even though they are passing the HC tests.
        2. That the resources don't fail due to "lost heartbeat". It is suspected that this is caused by local batch system memory limits. We would be grateful to know whether this is the case, so that we can either find a solution or eliminate memory limits as a potential cause of the lost heartbeats.
        3. Sites that still have a static configuration with dedicated resources: could you please add more?
        4. If all looks good at your site but only a few multicore jobs are trickling through, could you report that too?
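
To illustrate the VMEM point above: a minimal sketch, Linux-only and assuming a 64-bit host with default overcommit settings, showing that a process can hold a large virtual address space while its resident set stays small, so killing on VMEM alone penalises harmless reservations. The 4 GiB figure is arbitrary.

  import mmap

  GiB = 1024 ** 3
  region = mmap.mmap(-1, 4 * GiB)  # reserve 4 GiB of anonymous virtual memory

  def vm_stats():
      # VmSize is the virtual size, VmRSS the resident (physical) set
      with open("/proc/self/status") as f:
          return [line.strip() for line in f
                  if line.startswith(("VmSize", "VmRSS"))]

  print(vm_stats())  # VmSize grows by ~4 GiB, VmRSS hardly moves
  region[0] = 1      # only touching a page makes it resident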

  • CMS reports (raw view) -
    • NTR
    • Analysis is still running at full steam. Production activity level is low due to lack of demand.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • MC and User jobs.
    • We have a problem opening files with the ROOT6 xroot protocol at sites running dCache (https://sft.its.cern.ch/jira/browse/ROOT-6639). A bug report to dCache.org was created by GridKa. (A minimal reproducer is sketched below this report.)
    • T0: NTR
    • T1: NTR
    • Services: NTR
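
A minimal sketch of the failing operation in the dCache/xroot issue above, using PyROOT; the door URL and file path are placeholders, not real endpoints.

  import ROOT

  # Placeholder dCache xrootd door; 1094 is the conventional xrootd port.
  url = "root://dcache-door.example.org:1094//pnfs/example.org/data/test.root"

  f = ROOT.TFile.Open(url)
  if not f or f.IsZombie():
      print("open failed")  # the symptom tracked in ROOT-6639
  else:
      f.ls()
      f.Close()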

Sites / Services round table:

  • ASGC: Next Monday, from noon to midnight (UTC), there will be a scheduled downtime for system maintenance and database optimisation. GOCDB has been updated.
  • BNL: ntr
  • CNAF: ntr
  • FNAL: ntr
  • GridPP: not connected
  • IN2P3: This week's planned intervention finished successfully on Wednesday morning.
  • JINR: not connected
  • KISTI: ntr
  • KIT: ntr
  • NDGF: 8M ALICE dark-data files were found in several T2s, out of a total of 17M files NDGF stores for ALICE. Some are only 2 days old. The file catalogue is being cleaned up. ALICE is kindly asked to check and fix this issue. Maarten confirmed that work is ongoing. (A reconciliation sketch follows this list.)
  • NL-T1: not connected (?)
  • OSG: ntr
  • PIC: ntr
  • RAL: The FTS3 upgrade last Tuesday was successful.
  • RRC-KI: not connected
  • TRIUMF: not connected
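
On the NDGF dark-data point: conceptually, dark data is whatever the storage holds that the catalogue does not know about. A minimal reconciliation sketch, assuming plain-text dumps (one path per line) from both sides; the dump file names are illustrative only.

  def load_paths(dump_file):
      # One logical file path per line; blank lines ignored
      with open(dump_file) as f:
          return {line.strip() for line in f if line.strip()}

  se_files = load_paths("se_dump.txt")           # what the storage holds
  catalogued = load_paths("catalogue_dump.txt")  # what the catalogue lists

  dark = se_files - catalogued   # on disk, unknown to the catalogue
  lost = catalogued - se_files   # registered, missing from disk
  print(len(dark), "dark files;", len(lost), "lost files")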

  • CERN batch and grid services:
    • ce208 will be drained for final retirement as of Monday.
    • sam-bdii, lcg-bdii and site-bdii are running the newer BDII 5.2.23 in QA. The lcg and sam BDIIs are also having the number of slapd threads raised from the default 16 to 64. (A sample query against these BDIIs is sketched at the end of this section.)
    • WMS decommissioning will happen on 1 October.
  • CERN storage services: not present
  • Databases: There was a 1-hour outage of the ATLAS offline database on Wednesday at midnight. The service was back up at 1am on Thursday.
  • GGUS: On Monday 22nd at 10pm an e-mail problem occurred at KIT: email notifications for tickets were delayed by up to 10 hours. The problem was caused by the VMware system, which had trouble accessing the disk pools; this impacted email transfer from GGUS to the mail servers.
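
On the BDII changes above: a minimal sketch of the kind of LDAP query these BDIIs serve, assuming the python-ldap package; the endpoint and search base are the conventional top-level BDII ones.

  import ldap

  con = ldap.initialize("ldap://lcg-bdii.cern.ch:2170")
  result = con.search_s(
      "Mds-Vo-name=local,o=grid",   # conventional top-level BDII base
      ldap.SCOPE_SUBTREE,
      "(objectClass=GlueSite)",     # sites published in the GLUE 1.3 schema
      ["GlueSiteName"],
  )
  print(len(result), "sites published")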

AOB: