Week of 140203

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: MariaD (SCOD), Maarten (ALICE), Massimo (CERN Data Mgnt), Vitor (CERN Grid Services), Felix (ASGC).
  • remote: Roger (NDGF), Sang-Un (KISTI), Michael (BNL), Matteo (CNAF), Elena (ATLAS), Eric (CMS), Onno (NL_T1), Kyle (OSG), Tiju (RAL), Alexei (LHCb), Lisa (FNAL), Pepe (PIC).

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services/T0
      • CERN_PROD: Transfers were failing with permission denied errors on Monday morning. Noticed and fixed by CERN team. Thanks.
    • T1
      • TAIWAN: heavy SRM load caused transfer failures on Sunday (GGUS:100904). Fixed.
      • FZK: staging errors for DATATAPE on Friday (GGUS:100885). Fixed by issuing a retry for all outstanding stage requests for ATLAS and restarting tape storage software.
      • PIC: problem with one disk pool caused transfers to fail on Friday (GGUS:100874). dCache pool restarted.

  • CMS reports (raw view) -
    • T1/T2/Others: Business as usual. Smooth running.
    • Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing.
    • One problem: ARGUS cluster issue(s) (DNS? and then a new, uninitialized node in the cluster) caused problems with analysis jobs running.
      • Debugged by CMS analysis operations. It would be better to have SLS monitoring of the ARGUS cluster. Ticket is GGUS:100870

  • ALICE -
    • sites please take note of the necessary WLCG VOBOX update announced last Fri
      • see details below
    • KIT
      • the number of corrupted files has shrunk by 45% to 26126
      • 21k files have been salvaged after all, thanks very much!

  • LHCb reports (raw view) -
    • Mostly simulation and user jobs. Smooth running over most of the grid.
      • T0: Pilots aborted at ce202.cern.ch today. Ticket is GGUS:100902
      • T1: NTR
      • T2: NTR

Sites / Services round table:

  • ASGC: ntr
  • BNL: ntr
  • FNAL: ntr
  • OSG: ntr
  • KISTI: ntr
  • NL_T1: ntr
  • CNAF: ntr
  • PIC: ntr
  • NDGF: ntr
  • IN2P3: ntr (sent by email)
  • RAL: tape system intervention tomorrow between 08:00 and 10:00 UK time. Site set at risk in GOCDB.
  • CERN:
    • Grid Services: ntr
    • Data Mgnt:
      • Problem accessing EOS from outside CERN, now solved. It lasted 1h 15min.
      • ROOT access to CASTOR is now switched off. Only about 10 users are affected. They have been informed about alternative access methods.

AOB:

  • WLCG VOBOX
    • as announced on the wlcg-operations list last Fri, please ensure your WLCG VOBOX instances generate host proxies with 1024-bit keys!
    • preferably update Globus; the minimum fixed versions of the affected rpm are:
      • globus-proxy-utils-5.0-6 (Globus 5.0)
      • globus-proxy-utils-5.2-1 (Globus 5.2)
        • from EPEL for EMI-3 and EMI-2
    • otherwise one can apply this quick hack:
              perl -pi.bak -e 's/ -q / -bits 1024 $&/' \
                  /etc/vobox/templates/voname-box-proxyrenewal \
                  /etc/init.d/*-box-proxyrenewal
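
To see what the quick hack above does to a matching line before applying it, the substitution can be reproduced on a sample string. This is only a sketch: the function name add_key_bits and the sample command line are hypothetical and are not taken from the actual proxy renewal scripts. It mirrors the perl expression, which inserts " -bits 1024" in front of the first " -q " on each line and leaves other lines unchanged.

```python
import re


def add_key_bits(line: str, bits: int = 1024) -> str:
    """Mimic perl's s/ -q / -bits 1024 $&/ on a single line.

    Only the first " -q " on the line is rewritten, as in perl without /g.
    The matched text itself (perl's $&) is kept after the inserted option.
    """
    return re.sub(
        r" -q ",
        lambda m: f" -bits {bits}{m.group(0)}",
        line,
        count=1,
    )


# Hypothetical sample line; the real script content may differ.
sample = "some-proxy-command -q -valid 48:00\n"
print(add_key_bits(sample), end="")
```

The perl command itself keeps a .bak copy of each edited file, so the change can be compared with the original or reverted.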
              

Thursday

Attendance:

  • local:
  • remote:

Experiments round table:

  • ALICE -
    • CNAF
      • tape SE updated to xrootd v3.3.4 (on Jan 28) with new checksum plugin successfully validated (Feb 5) with test transfers, thanks!
    • KIT
      • investigating why many jobs read a lot of data remotely from CERN

Sites / Services round table:

AOB:

  • OpenSSL issue
    • EGI broadcast sent Feb 4 describing current state of affairs and recipes for cures
    • Sites using HTCondor as batch system may need to apply one of these configuration changes for now:
      • DELEGATE_JOB_GSI_CREDENTIALS = False
      • GSI_DELEGATION_KEYBITS = 1024
    • HTCondor v8.0.6 will have the default for GSI_DELEGATION_KEYBITS increased to 1024
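
A site could check whether a configuration text already contains one of the two HTCondor workarounds listed above with something like the sketch below. It is a simplified illustration: the function name has_workaround is hypothetical, and the parser only understands plain "NAME = value" lines and "#" comments, not macros, includes or other HTCondor configuration features. The condor_config_val tool remains the reliable way to query the effective value on a running system.

```python
def has_workaround(config_text: str) -> bool:
    """Return True if either workaround setting is present in the text.

    Simplified parsing: one "NAME = value" per line, names compared
    case-insensitively, lines starting with "#" ignored, last value wins.
    """
    settings = {}
    for raw in config_text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()

    # Workaround 1: credential delegation switched off.
    delegation_off = (
        settings.get("DELEGATE_JOB_GSI_CREDENTIALS", "").lower() == "false"
    )

    # Workaround 2: delegated proxies generated with at least 1024-bit keys.
    keybits = settings.get("GSI_DELEGATION_KEYBITS", "")
    keybits_ok = keybits.isdigit() and int(keybits) >= 1024

    return delegation_off or keybits_ok


example = """
# local overrides
GSI_DELEGATION_KEYBITS = 1024
"""
print(has_workaround(example))
```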
Topic revision: r10 - 2014-02-05 - MaartenLitmaath
 