Week of 140407

WLCG Operations Call details

  • At CERN the meeting room is 513 R-068.

  • For remote participation we use the Alcatel system. At 15.00 CE(S)T on Monday and Thursday (by default) do one of the following:
    1. Dial +41227676000 (just 76000 from a CERN office) and enter access code 0119168, or
    2. To have the system call you, click here

  • In case of problems with Alcatel, we will use Vidyo as backup. Instructions can be found here. The SCOD will email the WLCG operations list in case the Vidyo backup should be used.

General Information

  • The SCOD rota for the next few weeks is at ScodRota
  • General information about the WLCG Service can be accessed from the Operations Web

Monday

Attendance:

  • local: Alberto, Belinda, Eddie, Felix, Maarten
  • remote: Alexander, Christian, Daniele, Dimitri, Eygene, Joel, Lisa, Pepe, Rolf, Sang-Un, Sonia, Tiju

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
    • T0/1s
      • Quite a few jobs failing at CERN-PROD; CVMFS on some nodes is still suspected (GGUS:102824), but the problem is hard to diagnose properly; will follow up (a minimal probe sketch follows this report)
      • SRM went down at SARA early Saturday morning (GGUS:103024); resolved later in the evening, and on Sunday everything was back
      • Many jobs failing at FZK (GGUS:103025); this morning the admins reported possible networking problems
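
    A minimal sketch, in Python, of the kind of per-node CVMFS probe hinted at above; the repository names and the use of "cvmfs_config probe" as the health check are assumptions, not the actual diagnosis procedure for GGUS:102824:

      #!/usr/bin/env python3
      # Hypothetical per-WN CVMFS probe for the suspicion tracked in GGUS:102824.
      import socket
      import subprocess

      REPOSITORIES = ["atlas.cern.ch", "atlas-condb.cern.ch"]  # assumed repository names

      def probe(repo):
          """Run 'cvmfs_config probe <repo>'; treat a non-zero exit status as a failure."""
          try:
              result = subprocess.run(["cvmfs_config", "probe", repo],
                                      capture_output=True, text=True, timeout=60)
          except (OSError, subprocess.TimeoutExpired) as exc:
              print(f"{repo}: probe could not run ({exc})")
              return False
          ok = result.returncode == 0
          print(f"{repo}: {'OK' if ok else result.stdout.strip() or 'FAILED'}")
          return ok

      if __name__ == "__main__":
          bad = [r for r in REPOSITORIES if not probe(r)]
          # A non-empty 'bad' list would flag this worker node as a candidate
          # for the CVMFS problem mentioned in the report.
          print(f"{socket.gethostname()}: {len(bad)} of {len(REPOSITORIES)} repositories failed")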

  • CMS reports (raw view) -
    • General:
      • HI re-reco is ongoing. It proceeds at a good pace, but we are struggling to get all the CPU power we would need for this on the HLT+AI resources. Excellent support from IT so far (thanks)
      • April Global Run is April 7-11, i.e. this week: you should expect some CMS T0 activity
    • Central services:
      • CMSONR DB showed red on the CMS critical services for a while on April 1st; the CSP opened Savannah:142866 (and closed it soon afterwards when the DB returned to OK)
      • cmsdoc SLS unavailability (here) for ~1 hr last Friday around noon; it was OK soon afterwards and the problem did not recur
    • Tier-1 level:
      • T1_UK_RAL: failing transfers from the RAL_Buffer to the RAL_Disk node (apparently just 2 files with a wrong checksum), tracked in Savannah:142583
      • T1_RU_JINR: a shifter opened Savannah:142493 for failures to access a PU dataset, but the site was in downtime, so it just needs to be re-checked now that the downtime is over
      • T1_DE_KIT: low transfer quality to T1_DE_KIT_Buffer from several T1 sources, including CNAF, JINR and KIT_Disk itself (tracked with an ELOG for the shifter so far, and it is being monitored: a ticket may follow if it persists)
        • Dimitri: connection reset errors are being investigated

  • ALICE -
    • KIT: large fluctuations in job numbers and failures since Apr 1 early afternoon
      • no correlated changes have been found so far
      • a VOBOX reboot seemed to cure the problem for 1 day
      • improved network parameters on the VOBOX did not help much
      • the VOBOX overload seems to be a side effect of the errors experienced by the jobs
      • experts have worked on the matter also during the weekend, thanks!
      • no other site appears to be affected
      • to be continued

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: ticket re-opened with CERN about access to a CE.
    • T1: one user is not able to create a directory on storage at PIC and GridKa (a GGUS ticket will be opened).
      • Pepe: e-mail forwarded to our storage experts

Sites / Services round table:

  • ASGC - ntr
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • there was a CRL update problem last night, but no issues were reported
  • KIT - nta
  • KISTI - ntr
  • NDGF
    • this morning's downtime went fine; a fiber was replaced
    • another fiber maintenance is scheduled for Thursday morning; some ALICE and ATLAS data may be temporarily unavailable
  • NLT1
    • downtime on Thu for HW and SW upgrades
  • PIC
    • Tue-Wed next week downtime for annual electrical maintenance and cooling infrastructure upgrade; services will be stopped for 1.5 days
  • RAL - ntr
  • RRC-KI-T1
    • had a network outage on Sunday, 06.04.2014, that lasted 01:05; the problem is understood, an emergency fix was applied, and a permanent fix is in progress (should be ready on 08.04.2014).

  • CERN grid services - ntr
  • CERN storage - ntr
  • dashboards - ntr
  • databases
    • ATLAS online database (ATONR) will be migrated and upgraded on Tuesday, 08.04.2014, between 09:30 and 12:30 (outage notice)
AOB:

Thursday

Attendance:

  • local: Alberto (grid services), Alessandro (ATLAS), Andrej (ATLAS), Belinda (storage), Felix (ASGC), Joel (LHCb), Maarten (SCOD + ALICE), Maria A (Ops Coord), Pablo (dashboards + tracking tools)
  • remote: Christian (NDGF), Dennis (NLT1), Gareth (RAL), Jeremy (GridPP), Lisa (FNAL), Lucia (CNAF), Pavel (KIT), Rolf (IN2P3), Sang-Un (KISTI), Stefano (CMS)

Experiments round table:

  • ATLAS reports (raw view) -
    • Central services
      • OpenSSL vulnerability: almost all of the ATLAS services have been fixed. RQF0323273 "Revoke host-certificates cause of recent openssl issue CVE-2014-0160." (a simple version-check sketch follows this report)
    • T0/1s
      • the FZK WN-to-storage connection is still an issue for ATLAS
        • Alessandro: our workload did not change; we see that the FZK firewall is saturated
        • see ALICE report below
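
    A minimal sketch, in Python, of a Heartbleed (CVE-2014-0160) exposure check based purely on the reported OpenSSL version string; this is an illustration, not the procedure used for RQF0323273:

      #!/usr/bin/env python3
      # CVE-2014-0160 affected OpenSSL 1.0.1 up to and including 1.0.1f
      # (fixed in 1.0.1g); 1.0.2-beta1 was also affected but is ignored here.
      import ssl

      def looks_vulnerable(version_string):
          """Rough string test, e.g. 'OpenSSL 1.0.1e-fips 11 Feb 2013' -> True."""
          parts = version_string.split()
          if len(parts) < 2 or parts[0] != "OpenSSL":
              return False
          version = parts[1]
          if not version.startswith("1.0.1"):
              return False  # the 0.9.8 and 1.0.0 branches were not affected
          letter = version[len("1.0.1"):][:1]  # '', 'a'..'f' vulnerable; 'g' or later fixed
          return letter == "" or "a" <= letter <= "f"

      if __name__ == "__main__":
          local = ssl.OPENSSL_VERSION  # version Python itself was linked against
          verdict = ("patch and revoke/renew host certificates"
                     if looks_vulnerable(local)
                     else "not in the vulnerable 1.0.1-1.0.1f range")
          print(f"{local}: {verdict}")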

  • CMS reports (raw view) -
    • Detector commissioning is ongoing (April Global Run). The Tier-0 is processing data and performing Prompt Calibration. There was about one day of downtime because some LSF nodes did not get an AFS token; solved by IT. TEAM ticket: GGUS:103180
    • On the WLCG grid, the emergency reconstruction of HI data is ongoing. Overall we are running at full throttle for both production and analysis. The user load has somewhat decreased lately and we no longer hit the global pool limits.
    • None of the currently open GGUS tickets deserves a mention

  • ALICE -
    • KIT
      • the job profile has been stable since the cap was reduced to 1.5k
      • a large fraction of those jobs are doing analysis (in trains)
      • access to the local SE failed from many WN, while it worked from others
      • 1 rack with a wrong configuration for port 1094 (xrootd) was fixed
      • almost all WN were checked today and are OK now
      • still a lot of jobs are reading data remotely for some reason
        • being debugged
      • meanwhile the job cap will be lowered to 1k to reduce the load on the firewall
    • Alessandro:
      • can ALICE produce plots showing what traffic is generated by jobs?
      • why do the WLCG Transfer Dashboard rates not match the OPN congestion observed for CERN to FZK?
        • Pablo: for ALICE traffic there is a known mismatch to be fixed, for other experiments it is OK
      • we need to have an easy way to see what traffic the WN are generating
        • WN to/from local SE
        • WN to/from remote SE
      • to be followed up (see the traffic-classification sketch below)
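
    A rough sketch, in Python, of the per-WN traffic breakdown asked for above (WN to/from local SE vs. WN to/from remote SE); the record format and the site-domain mapping are assumptions, and real monitoring would have to take its input from the job wrappers or the SE servers:

      #!/usr/bin/env python3
      # Classify worker-node transfer records as local-SE or remote-SE traffic
      # by comparing the domain suffixes of the WN and SE host names.
      from collections import defaultdict

      SITE_DOMAIN = {"gridka.de": "KIT", "cern.ch": "CERN"}  # hypothetical mapping

      def site_of(hostname):
          """Map a host name to a site label via its domain suffix, or 'UNKNOWN'."""
          for domain, site in SITE_DOMAIN.items():
              if hostname.endswith(domain):
                  return site
          return "UNKNOWN"

      def summarize(transfers):
          """transfers: iterable of (wn_host, se_host, bytes_moved) records."""
          totals = defaultdict(int)
          for wn, se, nbytes in transfers:
              kind = "local SE" if site_of(wn) == site_of(se) else "remote SE"
              totals[(site_of(wn), kind)] += nbytes
          return dict(totals)

      if __name__ == "__main__":
          sample = [  # made-up hosts and numbers, purely illustrative
              ("wn001.gridka.de", "alice-se.gridka.de", 5 * 10**9),
              ("wn001.gridka.de", "eosalice.cern.ch", 12 * 10**9),
          ]
          for (site, kind), nbytes in sorted(summarize(sample).items()):
              print(f"{site:5s} {kind:9s} {nbytes / 1e9:6.1f} GB")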

  • LHCb reports (raw view) -
    • Stripping, MC simulation and user jobs.
    • T0: 22 April: migration from CASTOR-USER to EOS-USER
    • T1: we still have no news about the delivery of the LHCb pledge at IN2P3
      • Rolf: discussions ongoing, more news expected soon

Sites / Services round table:

  • ASGC
    • the tape library is working OK again
  • CNAF - ntr
  • FNAL - ntr
  • IN2P3
    • today there was an SE outage between 11:50 and 12:00 CEST due to a HW failure; the bad node has been taken out
  • KISTI
    • yesterday the batch system was drained, openssl was updated and services were reconfigured
    • today there is a routing issue between the T1 and CERN, being investigated
  • KIT
    • see the reports from the experiments
    • network experts are involved
    • not clear what changed on Apr 1
    • there may be several conspiring issues
  • NDGF
    • today's downtime has been postponed, no new date yet
  • NLT1
    • SARA: due to the Heartbleed vulnerability, the upgrade planned for Thursday was done on Tuesday instead.
    • SARA: a power outage on both feeds caused some downtime on Wednesday. At ~12:00 CEST everything was back online.
    • NIKHEF: none of our grid systems was vulnerable to the Heartbleed bug, as we are running CentOS 6.4.
  • RAL - ntr

  • CERN Grid Services
    • Certificates are being renewed today due to the Heartbleed bug; this should all be done by now.
  • CERN storage - ntr
  • Dashboards - ntr
  • Tracking tools
    • The GGUS development tracker has been migrated (again) from Savannah to Jira. The issue with the 'Planned release' field not being mapped to 'Fix Version' has been solved

AOB:
