Week of 140120

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Simone (SCOD), Maarten (ALICE), Ignacio (CERN - Grid Services), Xavi (CERN - Storage), Felix (ASGC)
  • remote: Pavol (ATLAS), Michael (BNL), Renato (LHCb), Tiju (RAL), Pepe (PIC), Ulf (NDGF), Dimitri (KIT), Rolf (IN2P3-CC), Matteo (CNAF)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • Adding the TW cloud to FTS3 this morning
      • ADCR database now looks fine, no problems seen during weekend
    • T1
      • Short interruption of INFN-T1 SE, quickly solved on Saturday morning
      • Still some instabilities on the TW T1 SE; the renaming campaign is interfering with regular activity (GGUS:100268)
        • ASGC: trying to sort this out, still no good news

  • CMS reports (raw view) -
    • Not much to report, smooth operations.
    • GGUS:100315: T2_UK_London_IC, problems with Proxy lifetime [SOLVED]
    • GGUS:100373: SAM CE Error in T1_DE_KIT [NEW]
      • The problem looks like a monitoring issue. The ticket has been updated.

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Mainly MC jobs
    • T0: NTR
    • T1: NTR
    • T2:
      • CBPF(T2-D) transfers still slow. Trying new configs for FTS3 (number of channels, network configuration, etc.)

Sites / Services round table:

  • BNL: an issue was observed after the upgrade to dCache 2.6. For rapidly growing SRM tables (such as those of the space manager), a problem can arise when the postgres vacuum process kicks in: the process locks the table and the SRM server process cannot write to the DB, producing timeout errors. After some tuning, a set of database parameters has been defined that avoids the issue; it will be circulated through the dCache channels.
    • NDGF: in 2.8 the space manager has been rewritten completely, which should avoid this problem.

  • NDGF: a dCache upgrade caused problems for the monitoring system. The storage was healthy and continued working.

  • KIT: last week there was a problem with the CREAM CEs caused by the buffer pool size, which has since been increased.

  • CERN Grid Services: FTS (ITSSB, GOCDB): the FTS service at fts3.cern.ch will be updated from 3.1.34 to the latest FTS version, 3.1.55. The intervention is expected to be transparent and to last 1 hour from 10:00 CET on Wednesday 22 January.

  • CERN Grid Services: an upgrade of the WMSes is scheduled for tomorrow, but it covers only the few components (gridsite, globus and LB) needed to generate 1024-bit keys for the openSSL issue. The full new release of the WMS (3.6.2, a.k.a. EMI 3 update 12) has 2 outstanding bugs and sites should not deploy it.

  • CERN storage: outage for EOS ATLAS on Friday from 8 PM to 10 PM.
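
The parameter set mentioned in the BNL report above is not reproduced in these minutes. As a rough illustration only, the sketch below builds a PostgreSQL statement that sets per-table autovacuum options, so that a fast-growing table is vacuumed more often and in throttled passes. The table name `example_space_table` and both values are hypothetical placeholders, not the settings BNL arrived at.

```python
# Sketch only: the table name and the values are hypothetical, not BNL's settings.

def autovacuum_sql(table, scale_factor, cost_delay_ms):
    """Build an ALTER TABLE statement that sets per-table autovacuum options."""
    # Accept only simple identifiers so the name can be embedded safely.
    if not table.replace("_", "").isalnum():
        raise ValueError("unexpected table name: %r" % (table,))
    return (
        "ALTER TABLE {t} SET ("
        # A lower scale factor triggers autovacuum after fewer row changes.
        "autovacuum_vacuum_scale_factor = {s}, "
        # The cost delay (in ms) throttles each autovacuum pass.
        "autovacuum_vacuum_cost_delay = {d});"
    ).format(t=table, s=scale_factor, d=cost_delay_ms)


if __name__ == "__main__":
    print(autovacuum_sql("example_space_table", 0.01, 10))
```

The statement would be run by a database administrator against the SRM database; whether these two options are the relevant ones for dCache is an assumption.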

AOB:

  • Maarten on openSSL: last week it was discovered that addressing gridsite alone is not sufficient. We have to ensure that the latest globus, 5.2.5, is used on the relevant node types. That version is relatively new (2 months old). The WMS is certainly affected, but other node types may be as well (for example CondorG). The situation will be clarified further and another broadcast will be sent. OSG will also be contacted. Another bug was found in the WMS: it causes a bad host proxy to be generated, which breaks proxy renewal. Java 7 based services are also affected (a simple configuration change is needed, see the GDB presentation).

Thursday

Attendance:

  • local: Simone (SCOD), Maarten (ALICE), Alberto (CERN Grid Services), Felix (ASGC), Xavier (CERN Storage)
  • remote: Sang-Un (KISTI), Lisa (FNAL), Renato (LHCb), John (RAL), Dennis (NL-T1), Jeremy (GridPP), Kai (ATLAS), Saverio (CNAF), Kyle (OSG)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • RAL was switched to FTS3
        • Many OpenSSL errors seen, as for other sites switched to FTS3; transfers succeeded
        • Maarten clarified that the RAL FTS3 was updated to the newest version and the OpenSSL errors came back. The service manager rolled back and since then there have been no more OpenSSL errors. A new version is available and will be installed at RAL on Monday. The upgrade will cause a few hundred transfers to fail (this has been discussed and agreed with the experiments). CERN is currently OK (both production and pilot). It is possible that the production instance is still at the old release. We will learn from the RAL experience and decide what to do for the CERN upgrade, if it is still needed.
    • T1
      • Still some instabilities on the TW T1 SE; the renaming campaign is interfering with regular activity (GGUS:100268), but the failures have been decreasing since last night

  • ALICE -
    • NTR

  • LHCb reports (raw view) -
    • Mainly MC, few user jobs.
    • T0: Yesterday, all jobs failed at CERN ONLINE: "Failed to upload output data".
    • T1: NTR
    • T2: NTR

Sites / Services round table:

  • NDGF: we had a short outage of the Danish pools today (2 minutes); otherwise it has been quiet.
  • ASGC: the CASTOR DB went down a few minutes ago. The issue is high temperature in the room. CASTOR is the only affected service so far.
  • KISTI: the site is now connected to the OPN. The first attempt was last Friday, but there was a routing problem. The second attempt was successful. The failure in the first attempt affected the co-located T2 but not the T1.
  • OSG: the RSV2SAM validation instance will be stopped on 29 January. The production one will be stopped on 30 April. The BDII problem reported last week was due to a network issue.

AOB:

  • Maarten on the openSSL issue: besides the new version of gridsite, it was discovered that the latest version of globus is needed in some cases, because it increases the default key length to 1024 bits. For the WMS, a new version of the proxy renewal daemon is also needed. It will probably be released in late January.
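
To make the key-length point concrete, here is a minimal sketch of the kind of check involved. It assumes that 1024 bits is the minimum the updated OpenSSL accepts, which is an inference from these notes rather than something stated in them; the function name and constant are hypothetical.

```python
# Assumed minimum, taken from the 1024-bit default mentioned in these notes.
MIN_KEY_BITS = 1024


def key_is_long_enough(modulus, min_bits=MIN_KEY_BITS):
    """Return True if an RSA modulus (a Python int) has at least min_bits bits."""
    return modulus.bit_length() >= min_bits


if __name__ == "__main__":
    short_modulus = (1 << 511) | 1    # a 512-bit number
    long_modulus = (1 << 1023) | 1    # a 1024-bit number
    print(key_is_long_enough(short_modulus), key_is_long_enough(long_modulus))
```

Extracting the modulus from an actual proxy or host certificate is left out here, since it needs a certificate-parsing library.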

-- SimoneCampana - 16 Dec 2013

Topic attachments

  • MB-Jan.pptx (2864.7 K, 2014-01-21 09:10, PabloSaiz)
Topic revision: r17 - 2014-01-23 - MaartenLitmaath