Week of 140407

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web



Monday

  • local: Alberto, Belinda, Eddie, Felix, Maarten
  • remote: Alexander, Christian, Daniele, Dimitri, Eygene, Joel, Lisa, Pepe, Rolf, Sang-Un, Sonia, Tiju
Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
    • T0/1s
      • Quite a few jobs failing at CERN-PROD; CVMFS on some nodes is still suspected (GGUS:102824), but this is hard to diagnose properly; will follow up
      • SRM went down at SARA early Saturday morning (GGUS:103024); resolved later that evening, and everything was back on Sunday
      • Many jobs failing at FZK (GGUS:103025); this morning the admins reported possible networking problems

  • CMS reports (raw view) -
    • General:
      • HI rereco ongoing. It proceeds at a good pace, but we are struggling to get all the CPU power we would need for this on the HLT+AI resources. Excellent support so far from IT (thanks)
      • April Global Run is April 7-11, i.e. this week: you should expect some CMS T0 activity
    • Central services:
      • The CMSONR DB was red on the CMS critical services for a while on April 1st; the CSP opened Savannah:142866 (and closed it soon afterwards when it returned to OK)
      • cmsdoc SLS unavailability (here) for ~1 hr last Friday around noon; OK soon afterwards, and it did not show up again
    • Tier-1 level:
      • T1_UK_RAL: failing transfers from the RAL_Buffer to the RAL_Disk node (just 2 files with a wrong checksum, it seems), tracked in Savannah:142583
      • T1_RU_JINR: a shifter opened Savannah:142493 when there were failures to access a PU dataset, but the site was in downtime, so it just needs to be re-checked now that the downtime is over
      • T1_DE_KIT: low transfer quality to T1_DE_KIT_Buffer from several T1 sources, including CNAF, JINR and KIT_Disk itself (tracked with an ELOG for the shifter so far, and it is being monitored; a ticket may follow if it persists)
        • Dimitri: connection reset errors are being investigated

  • ALICE -
    • KIT: large fluctuations in job numbers and failures since Apr 1 early afternoon
      • no correlated changes were found so far
      • VOBOX reboot seemed to cure the problem for 1 day
      • improved network parameters on the VOBOX did not help a lot
      • the VOBOX overload seems to be a side effect of the errors experienced by the jobs
      • experts have worked on the matter also during the weekend, thanks!
      • no other site appears to be affected
      • to be continued

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: ticket re-opened with CERN about access to the CE.
    • T1: one user is unable to create directories on storage at PIC and GridKa (a GGUS ticket will be opened).
      • Pepe: e-mail forwarded to our storage experts
Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • there was a CRL update problem last night; no issues were reported
  • KIT - nta
  • KISTI - ntr
  • NDGF
    • this morning's downtime went fine, a fiber got replaced
    • another fiber maintenance on Thu morning, some ALICE and ATLAS data may be temporarily unavailable
  • NLT1
    • downtime on Thu for HW and SW upgrades
  • PIC
    • Tue-Wed next week downtime for annual electrical maintenance and cooling infrastructure upgrade; services will be stopped for 1.5 days
  • RAL - ntr
  • RRC-KI-T1
    • had a network outage on Sunday, 06.04.2014, lasting 1h05; the problem is understood, an emergency fix has been applied, and a permanent fix is underway (should be ready on 08.04.2014)

  • CERN grid services - ntr
  • CERN storage - ntr
  • dashboards - ntr
  • databases
    • ATLAS online database (ATONR) will be migrated and upgraded on Tuesday, 08.04.2014, between 09:30 and 12:30 (outage notice)



Thursday

  • local:
  • remote:
Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • OpenSSL vulnerability: almost all ATLAS services have been fixed. RQF0323273: "Revoke host-certificates cause of recent openssl issue CVE-2014-0160."
    • T0/1s
      • FZK WN-Storage connection is still an issue for ATLAS
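
The revocation campaign above follows the standard Heartbleed remediation logic: any certificate issued before the disclosure date may have had its private key exposed and should be reissued. A minimal sketch of that check, assuming `notBefore` strings in the format returned by Python's `ssl` module (the cutoff date and example values are illustrative, not from the report):

```python
# Sketch: flag certificates issued before the Heartbleed disclosure
# (CVE-2014-0160, published 2014-04-07) as candidates for reissue.
# Assumes 'notBefore' strings in the format used by
# ssl.SSLSocket.getpeercert(), e.g. "Mar 10 12:00:00 2014 GMT".
import ssl
from datetime import datetime, timezone

HEARTBLEED_DISCLOSURE = datetime(2014, 4, 7, tzinfo=timezone.utc)

def needs_reissue(not_before: str) -> bool:
    """True if the certificate was issued before the disclosure date."""
    issued = datetime.fromtimestamp(
        ssl.cert_time_to_seconds(not_before), tz=timezone.utc)
    return issued < HEARTBLEED_DISCLOSURE

print(needs_reissue("Mar 10 12:00:00 2014 GMT"))   # issued before: True
print(needs_reissue("Apr 09 09:00:00 2014 GMT"))   # issued after: False
```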

  • ALICE -
    • KIT
      • the job profile has been stable since the cap was reduced to 1.5k
      • a large fraction of those jobs are doing analysis (in trains)
      • access to the local SE failed from many WN, while it worked from others
      • 1 rack with a wrong configuration for port 1094 (xrootd) was fixed
      • almost all WN were checked today and are OK now
      • still a lot of jobs are reading data remotely for some reason
        • being debugged
      • meanwhile the job cap will be lowered to 1k to reduce the load on the firewall
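
The per-rack misconfiguration above is the kind of problem found by probing the SE's xrootd port from every worker node. A minimal sketch of such a probe, assuming it is run on each WN (the host name is hypothetical; only port 1094 for xrootd comes from the report):

```python
# Sketch: from a worker node, test whether the local SE accepts TCP
# connections on the xrootd port (1094). WNs in a misconfigured rack
# would report the port as unreachable. Host name is hypothetical.
import socket

XROOTD_PORT = 1094

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        return False

# Run on each WN, e.g. via the batch system or pdsh:
# if not port_open("se.example.gridka.de", XROOTD_PORT):
#     print(f"{socket.gethostname()}: port {XROOTD_PORT} unreachable")
```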

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: 22nd of April: migrations from CASTOR-USER to EOS-USER.
    • T1: we still have no news about the delivery of the LHCb pledge at IN2P3.

Sites / Services round table:

  • CERN Grid Services: certificates are being renewed today due to the Heartbleed bug.
AOB: -- SimoneCampana - 20 Feb 2014

This topic: LCG > WebHome > WLCGCommonComputingReadinessChallenges > WLCGOperationsMeetings > WLCGDailyMeetingsWeek140407
Topic revision: r10 - 2014-04-10 - AlbertoRodriguezPeon