Week of 130715
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Ben, Eddie, Eva, Luc, Maarten, Robert
- remote: Alexander, David, Kyle, Lisa, Marc, Pepe, Ron, Saverio, Sonia, Tiju, Vladimir, Wei-Jen, Xavier
Experiments round table:
- CMS reports (raw view) -
- Continuing 2011 legacy rereco activity and some Upgrade MC generation
- Several issues open at IN2P3:
- GGUS:95654: SAM CE briefly red due to job submission timeout on Jul 11, then green until midnight Jul 15.
- GGUS:95704: Data staging; leaving ticket open until staged
- GGUS:95720: File read error
- Two tickets against RAL:
- One against CNAF: GGUS:95698, CVMFS black hole WN: "The wn have been closed before this tkt was filed."
- One against CERN: GGUS:95713, SAM SRM errors on Jul 13, green since then.
- LHCb reports (raw view) -
- Incremental stripping campaign in progress and MC productions ongoing
- T0:
- CERN: Wrongly terminated FTS transfers (GGUS:95642)
- T1:
- GridKa: staging currently works without problems, but we have no information about what was changed (GGUS:95135)
- Xavier: communication with the tape system was broken for all VOs since Fri evening 22:00 CEST, fixed this morning, cause unknown; the ticket will be updated when more is known
- PIC: Pilots aborted (GGUS:95730), solved by removing all *_sl5 queues from the configuration.
Sites / Services round table:
- ASGC - ntr
- CNAF - ntr
- FNAL - ntr
- IN2P3
- ticket GGUS:95726 was wrongly assigned to us; the problem was due to an expired DDM proxy
- KIT
- Wed-Thu July 24-25 site downtime, jobs will be drained on Tue
- Tue July 23 additional downtime for LHCb SE
- Fri July 26 additional downtime for CMS SE
- NLT1 - ntr
- OSG
- the GGUS alarm re-test went OK, the original issue was understood
- PIC
- because of high electricity costs we will stop 25% of our CPU resources until the start of August
- RAL - ntr
- dashboards - ntr
- databases
- tomorrow morning at 10:00 CEST: transparent intervention on the integration DBs to activate encryption and checksumming in the Oracle network layer
- GGUS: Data and instructions for tomorrow's MB attached.
- grid services - ntr
AOB:
Thursday
Attendance:
- local: Eddie, Eva, Jan, Ken, Luc, Maarten
- remote: Jeremy, John, Kyle, Marc, Matteo, Michael, Roger, Ronald, Wei-Jen, Xavier
Experiments round table:
- ATLAS reports (raw view) -
- T0/Central services
- atlascops (voatlas161 node unreachable) INC:342109: rebooted OK
- No such file or directory 550 at CERN-EOS GGUS:95715 & INC:338842. User contacted to recreate the files
- Jan: to be clear, those files have never been on EOS
- T1
- BNL file recovery progressing
- RAL disk server problem. Successfully completed its rebuild, files can be accessed
- John: the bad disk server has been drained successfully and will be tested
- CMS reports (raw view) -
- Continuing 2011 legacy rereco activity and some Upgrade MC generation, everything pretty quiet
- Several tickets open at RAL
- Currently seem to have some problem with CMSSW installs on various nodes at FNAL, SAV:138771
- Yesterday the CASTOR T1TRANSFER service was degraded for some hours, but everything seems OK now.
- Jan: activity spikes appear to have led to SRM timeouts and subsequent aborts, though there might (also) have been an issue with the CASTOR DB performance, we will look further into it
- INC:340403 was opened on Tuesday about a machine that had high load. It got fixed yesterday, I'm not sure by whom.
- ALICE -
- CNAF: a plan has been developed for re-staging 2010 data (400k files) to check for corrupted files (GGUS:95073) and have such cases fixed, while avoiding contention with reprocessing campaigns: thanks!
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF - ?
- GridPP - ntr
- IN2P3 - ntr
- KIT
- reminder of various downtimes next week, as announced already and recorded in GOCDB
- NDGF
- network maintenance 18:00-20:00 UTC today, some pools may be affected
- NLT1 - ntr
- OSG - ntr
- PIC - ntr
- RAL
- CVMFS 2.1.12 has been deployed on all WN
- CASTOR upgrades for ALICE and LHCb foreseen for next Tue, but not yet decided
- dashboards - ntr
- databases - ntr
- storage
- srm-lhcb has had various core dumps lately; a patch applied on Tue seems to have cured the problem; next Tue the other instances will also be patched, should be transparent
AOB: