Week of 140818

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Tsung-Hsun Wu (ASGC), Alessandro (ATLAS), Jan (Storage), Luca (Databases), Pablo (Monitoring)
  • remote: Stefano (CMS), You-jin (KISTI), David (IN2P3), Dimitri (KIT), Lisa (FNAL), Michael (BNL), Onno (NL-T1), Tiju (RAL), Elizabeth (OSG), Joel (LHCb), Christian (NDGF)

Experiments round table:

  • ATLAS
    • Central Services / Tier-0/1 issues
      • Nothing to report today

  • CMS
    • Nothing major to report
    • GGUS:107578: odd CERN-EOS failures from the Tier-0. Why no reply yet? (opened Aug 8th)
      • Jan will look into the GGUS ticket

  • ALICE -
    • NTR

  • LHCb
    • MC and User jobs
    • T0:
    • T1:
      • PIC found some jobs running at their site and accessing files at GRIDKA. Under investigation.
    • VAC: incident in Manchester
      • Ale mentions that ATLAS was also affected by the incident

Sites / Services round table:

  • ASGC: NTR
  • BNL: NTR
  • CNAF: NR
  • FNAL: NTR
  • GridPP:
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NTR
  • NDGF: NTR
  • NL-T1: Follow-up on GGUS:107655, the LHCb ticket about the Brazilian DNs problem. The problem was related to double proxy jar files and is now fixed. Activating the change requires a dCache reboot (about 15 minutes), which will interrupt all transfers. Question to the experiments: when should the reboot happen? It is proposed to combine the reboot with another software intervention next week; NL-T1 will announce it in GOCDB.
  • OSG: NTR
  • PIC: NTR
  • RAL: NTR
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services:
  • CERN storage services: NTR
  • Databases: NTR
  • GGUS: NR
  • Grid Monitoring: NTR
  • MW Officer: NR

AOB:

Thursday

Attendance:

  • local: Stefan (SCOD), Joel (LHCb), Felix (ASGC), Sebastien (Storage), Xavier (Storage), Luca (Databases), Jan (Storage), Andrea (MW Officer)
  • remote: You-jin (KISTI), Alexei (ATLAS), Lisa (FNAL), Dennis (NL-T1), Rolf (IN2P3), John (RAL), Elizabeth (OSG), Pepe (CMS&PIC), Christian (NDGF)

Experiments round table:

  • ATLAS
    • Central Services
      • Jira is really slow (at least from outside CERN)
    • Tier0/1
      • The ATLAS deletion backlog led to the blacklisting of the scratch disks at some Tier-1s due to full disks
      • Staging problems at BNL GGUS:107789 , RAL GGUS:107778 and TRIUMF GGUS:107779
      • FZK-LCG2 storage problem GGUS:107672: network card was replaced yesterday. Now looks OK.

  • CMS
    • No major issues, processing and production is continuing
    • CSA14 is ongoing. MWGR5 is expected at the end of next week.
    • Remaining sites need to upgrade to CVMFS >= 2.1.19 immediately.
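    • A minimal version check (a sketch, assuming the client is installed from the standard cvmfs RPM; package naming may differ per site) can be run on a worker node:
      $ rpm -q cvmfs    # the reported version should be >= 2.1.19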

  • ALICE -
    • NTR

  • LHCb
    • MC and User jobs
    • T0: Problem with CASTOR lhcbtape, under investigation
    • T1:

Sites / Services round table:

  • ASGC: NTR
  • BNL: NR
  • CNAF: NR
  • FNAL: NTR
  • GridPP: NR
  • IN2P3: NTR
  • JINR: NR
  • KISTI: NTR
  • KIT: NR
  • NDGF: NTR
  • NL-T1: Downtime for storage at SARA is ongoing and shall finish at 4pm.
  • OSG: NTR
  • PIC: NTR
  • RAL: Monday is a bank holiday in the UK. Nobody can follow the coordination meeting today.
  • RRC-KI: NR
  • TRIUMF: NR

  • CERN batch and grid services: NR
  • CERN storage services:
    • EOS machines at Wigner have issues being accessible from outside.
    • Asked ATLAS whether a downtime for compacting the namespace would be acceptable; no answer yet.
    • Found a problem in production with gridftp (globus-xio-gsi-driver): version 2.1-2 is not compatible with SLC6. As a result, transfers to CASTOR SLC6 machines will not succeed; with a retry they may fall back to an SLC5 node and succeed. The CASTOR team is in contact with Globus on this. Andrea M will prepare a broadcast to inform sites and experiments about the issue. More details about the issue below.
  • Databases: Interventions next Tuesday (also on the CERN IT status board):
    • Patching with the latest security updates and changing COMPATIBLE for LHCB Offline database services (Tue, Aug 26, 2014 - 8:30 to 10:30)
    • Change of COMPATIBLE parameter on LHCB Online database (Tue, Aug 26, 2014 - 14:30 to 15:30)
    • Rolling Intervention on LCG Database Service (Tue, Aug 26, 2014 16:00 to 17:00)
  • GGUS: NR
  • Grid Monitoring: NR
  • MW Officer: Broadcast done about an issue with APEL, which is buggy; a fix is available in version 1.2.2, retrievable from GitHub.

Details of the gridftp issue as reported by Sebastien and Giuseppe:

We have seen a number of failures for gridftp transfers with the following signature on the server:

[24464] Wed Aug 20 13:46:36 2014 :: Transfer failure:
globus_xio: The GSI XIO driver failed to establish a secure connection. The failure occured during a handshake read.
globus_xio: System error in recv: Connection reset by peer
globus_xio: A system call failed: Connection reset by peer

This has been traced down to a combination of a gridftp server running on SLC6 and a globus-url-copy client loading a particular version of the following library:

$ rpm -qf /usr/lib64/libglobus_xio_gsi_driver.so.0
globus-xio-gsi-driver-2.1-2.el6.x86_64

Upgrading this library to:

globus-xio-gsi-driver-2.4-1.el6.x86_64

cures the problem. Note that globus-url-copy does not directly depend on this library, thus a yum install globus-url-copy will not fix the issue: one needs to explicitly upgrade that package.
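
As a sketch, assuming the fixed globus-xio-gsi-driver package is available from the site's configured repositories, the explicit check and upgrade on an affected client host would be:

# check which version of the GSI XIO driver is currently installed
$ rpm -q globus-xio-gsi-driver
# explicitly upgrade the driver; 'yum install globus-url-copy' alone will not pull it in
$ yum update globus-xio-gsi-driver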

Moreover, note that SLC5 servers are immune to the issue, while SLC6 servers are affected at the latest available version, that is 6.38-1.el6.

In the case of a CASTOR disk server, we actually observe that the forked gridftp server for the affected transfer crashes before the log can be flushed, so the signature is different and can be found in /var/log/castor/stagerjob.log as:

2014-08-20T14:29:37.629208+02:00 lxfsrf11c04 stagerjob[5490]: LVL=Error TID=5490 MSG="Child exited" REQID=e526710c-85c5-4aa0-bbdd-42b60b464dc4 NSHOSTNAME=castorns NSFILEID=1371569368 PID=5491 Status=11 SUBREQID=00e9a0d3-19bc-1980-e053-9308100a86ae

where 'Status=11' is the relevant part as the mover got a segmentation fault.
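
As a rough illustration, using the log path and message fields quoted above, affected transfers on a CASTOR disk server could be counted directly from the stager job log:

# count "Child exited" records with Status=11, i.e. movers that got a segmentation fault
$ grep -c 'MSG="Child exited".*Status=11' /var/log/castor/stagerjob.log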

AOB:
