Week of 140203
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
- Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
- To have the system call you, click here
- In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.
General Information
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
Monday
Attendance:
- local: MariaD (SCOD), Maarten (ALICE), Massimo (CERN Data Mgnt), Vitor (CERN Grid Services), Felix (ASGC).
- remote: Roger (NDGF), Sang-Un (KISTI), Michael (BNL), Matteo (CNAF), Elena (ATLAS), Eric (CMS), Onno (NL_T1), Kyle (OSG), Tiju (RAL), Alexei (LHCb), Lisa (FNAL), Pepe (PIC).
Experiments round table:
- ATLAS reports (raw view) -
- Central services/T0
- CERN_PROD: Transfers were failing with permission denied errors on Monday morning. Noticed and fixed by CERN team. Thanks.
- T1
- TAIWAN: heavy SRM load caused transfer failures on Sunday (GGUS:100904). Fixed.
- FZK: staging errors for DATATAPE on Friday (GGUS:100885). Fixed by issuing a retry for all outstanding stage requests for ATLAS and restarting tape storage software.
- PIC: a problem with one disk pool caused transfers to fail on Friday (GGUS:100874). dCache pool restarted.
- CMS reports (raw view) -
- T1/T2/Others: Business as usual. Smooth running.
- Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing.
- One problem: ARGUS cluster issue(s) (DNS? and then a new, uninitialized node in the cluster) caused problems with analysis jobs running.
- Debugged by CMS analysis operations. Better would be to have SLS monitoring of the ARGUS cluster. Ticket is GGUS:100870
- ALICE -
- sites please take note of the necessary WLCG VOBOX update announced last Fri
- KIT
- the number of corrupted files has shrunk by 45% to 26126
- 21k files have been salvaged after all, thanks very much!
- LHCb reports (raw view) -
- Mostly simulation and user jobs. Smooth running over most of the grid.
- T0: Pilots aborted at ce202.cern.ch today. Ticket is GGUS:100902
- T1: NTR
- T2: NTR
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- FNAL: ntr
- OSG: ntr
- KISTI: ntr
- NL_T1: ntr
- CNAF: ntr
- PIC: ntr
- NDGF: ntr
- IN2P3: ntr (sent by email)
- RAL: Tape system intervention tomorrow between 08:00 and 10:00 UK time. Site set at risk in GOCDB.
- CERN:
- Grid Services: ntr
- Data Mgnt:
- Problem accessing EOS from outside CERN. Now solved. Lasted for 1h 15'.
- ROOT access to CASTOR is now switched off. Hardly 10 users concerned. They have been informed about alternative access methods.
AOB:
Thursday
Attendance:
Experiments round table:
- ATLAS reports (raw view) -
- Central services/T0
- IT and DE clouds moved to FTS3
- T1
- CMS reports (raw view) -
- T1/T2/Others: Business as usual. Smooth running.
- Preparing for DBS (data catalog) upgrade on Feb 10. That week will see little to no central processing. Ramp-down has begun.
- We are encouraging all our sites to switch to FTS3 server at RAL for load testing. Begins in a week or so.
- ALICE -
- CNAF
- tape SE updated to xrootd v3.3.4 (on Jan 28) with new checksum plugin successfully validated (Feb 5) with test transfers, thanks!
- KIT
- investigating why many jobs read a lot of data remotely from CERN
- RRC-KI-T1
- memory tuning for jobs ongoing, thanks!
- LHCb reports (raw view) -
- Mostly simulation and user jobs. Smooth running over most of the grid.
- T0: CVMFS caused 50%+ job failures on Mon and Tue, back to normal since Wed
- T1: NTR
Sites / Services round table:
- GGUS:
- Suggestion to remove three fields from the 'Ticket Submission Form' (see attachment). Those fields are hardly ever used, and they are concatenated to the body of the issue anyway.
AOB:
- OpenSSL issue
- EGI broadcast sent Feb 4 describing current state of affairs and recipes for cures
- Sites using HTCondor as batch system may need to apply one of these configuration changes for now:
- DELEGATE_JOB_GSI_CREDENTIALS = False
- GSI_DELEGATION_KEYBITS = 1024
- HTCondor v8.0.6 will have the default increased to 1024
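As a rough illustration only (not part of the EGI broadcast), a site could check whether one of the two workarounds above is already present in a piece of HTCondor configuration text. The Python sketch below makes several assumptions: the function names `parse_config` and `has_openssl_workaround` are hypothetical, only plain `NAME = value` lines are handled (no includes, macros or conditionals), and names and boolean values are treated as case-insensitive.

```python
def parse_config(text):
    """Collect simple NAME = value assignments; later lines override earlier ones."""
    settings = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blank lines, comments and anything that is not an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, value = line.split("=", 1)
        settings[name.strip().upper()] = value.strip()
    return settings


def has_openssl_workaround(text):
    """Return True if either workaround from the minutes appears to be set."""
    settings = parse_config(text)
    # Workaround 1: credential delegation switched off.
    if settings.get("DELEGATE_JOB_GSI_CREDENTIALS", "").lower() == "false":
        return True
    # Workaround 2: delegation key size raised to at least 1024 bits.
    keybits = settings.get("GSI_DELEGATION_KEYBITS", "")
    return keybits.isdigit() and int(keybits) >= 1024
```

For example, `has_openssl_workaround("GSI_DELEGATION_KEYBITS = 1024")` is true, while a configuration that sets neither variable gives false. The effective configuration of a real pool should still be confirmed with the site's own HTCondor tools.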