Week of 130325
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Ivan, Luca M, Maarten, Stefan, Steve, Ulrich
- remote: Alexander, David, Kyle, Lucia, Michael, Pepe, Rolf, Stephane, Tiju, Ulf, Wei-Jen, Xavier
Experiments round table:
- ATLAS reports -
- Central services
- CERN-PROD ALARM: GGUS:92166: file transfer failures due to 'SECURITY_ERROR'. No reply to the ticket, and the problem occurs each day for a few hours.
- Steve: will investigate further as time permits
- T1s
- RAL: Limitations observed in current FTS3 setup at RAL -> No FTS 3 over the week-end until (Savannah:135151)
- Transfer timeout does not scale with file size
- Bug in the optimizer, resulting in the number of streams being too low
- Maarten: please ensure the FTS-3 developers are informed of these issues
- CMS reports -
- LHC / CMS
- Rereconstruction of 2012 data progressing well -- processing the last of 4 data-taking eras. Still utilizing all T1 resources for this
- CERN / central services and T0
- Working on reconfiguring for reprocessing (details on last Thursday's report -- coordination meeting between CMS/CERN will occur this week)
- Luca: t0streamer in progress; T0 restructuring details to be agreed during that meeting
- Tier-1:
- GGUS ticket 92754 (RAL) -- xrootd fallback reconfig caused errors in SAM tests briefly -- fixed right away on 3/21, but ticket still open...
- Tier-2:
- Continue to work on moving some T1 workflows to larger T2s with xrootd input, due to high T1 use.
- US T2s working well, expanding to larger European T2s -- requests to enable xrootd fallback and accept the t1production role were issued at the end of last week
- ALICE -
- CERN: job submission to CERN CEs OK again since Thu evening, thanks to a big effort of IT-PES! (GGUS:92521)
- Luca: an EOS update needs to happen soon, will send details via e-mail
- LHCb reports -
- Mainly user jobs with some MC ongoing. Preparation for Re-Stripping to be launched next week.
- T0:
- latest SW releases not available in CVMFS Stratum 1 instance for LHCb (GGUS:92815)
- Steve: for some reason the LHCb repository decided to resync from scratch, being looked into
- T1:
- PIC: Failed pilots. Is this already the draining of queues for tomorrow's DT? LHCb person at PIC contacted.
Sites / Services round table:
- ASGC - ntr
- BNL - ntr
- CNAF
- kernel upgraded on half of the WN, back in production; the other half will be done gradually
- IN2P3 - ntr
- KIT - ntr
- NDGF - ntr
- NLT1
- tape library maintenance on Wed
- OSG
- during tomorrow's maintenance window all services will be rebooted at some time
- PIC
- reminder: downtime 5-19 UTC tomorrow for electrical maintenance, queues already being drained
- RAL - ntr
- dashboards - ntr
- GGUS
- Reminder! Monthly Release on 2013/03/27 as usual (last Wednesday of the month). There will be NO ALARM tests this time; see Savannah:136545 for the reason. Basically, the release is minor; the list of development items is at http://bit.ly/12nJ9ev. (Text entered by MariaD, now away.)
- grid services
- CVMFS Stratum-0 instance offline tomorrow for back-end storage change; Stratum-1 instances should not be affected
- Argus service nodes need improved memory tuning, should be transparent
- storage
- file updates have been disabled on CASTOR-PUBLIC today, the other instances to follow on Apr 8
AOB:
Thursday
Attendance:
- local: Andrea, Ivan, Luca M, Maarten, Manuel, Stefan
- remote: Gareth, Lisa, Lucia, Michael, Oliver, Rob, Rolf, Ronald, Ulf, Wei-Jen, Xavier
Experiments round table:
- CMS reports -
- LHC / CMS
- Rereconstruction of 2012 data in the tails, load at the T1 sites small
- CERN / central services and T0
- GGUS:92909: SAM tests stopped on the 27th because the machine needed to be rebooted. Although an alarm was generated automatically, which is supposed to prompt a reboot by the sysadmin, no reboot has been done so far, even after repeated requests by the SAM team. The impact for CMS is that no SAM tests have run since yesterday, so there is no monitoring information.
- A ticket has been opened for the SAM team to get the SAM production machines' importance increased, so that the sysadmins will look into such issues earlier, as suggested by Manuel
- Tier-1:
- GGUS:92754 -- xrootd fallback reconfig caused errors in SAM tests briefly -- fixed right away on 3/21, closed
- Tier-2:
- Continue to work on moving some T1 workflows to larger T2s with xrootd input, due to high T1 use.
- US T2s working well, expanding to larger European T2s -- requests to enable xrootd fallback and accept the t1production role
- ALICE -
- CNAF: VOBOX suffered network issues from Mon afternoon, fixed Tue afternoon (GGUS:92845)
- IN2P3: investigating very low job rates since the switch to the new VOBOX on Mon
- KIT: all jobs lost around 09:00 CET today, being investigated
- Xavier: no idea what might have caused that
- the VOBOX was rebooted and ended up in a bizarre state: not understood, but jobs are running again
- LHCb reports -
- Mainly user jobs with some MC ongoing.
- T0:
- No SAM tests displayed on the SUM dashboard (GGUS:92924)
- also see the CMS SAM incident reported above
- T1:
- PIC: Failed pilots after DT. The jobs submitted had requested 700k HepSpec06.seconds, but for some reason they ended up in the short queue (17k sec) (GGUS:92914). No more failures as of yesterday evening, ticket closed
Sites / Services round table:
- ASGC
- this morning one of the DPM daemons crashed; fixed
- yesterday 50 ATLAS files were lost in CASTOR; ATLAS and the CASTOR developers have been informed
- BNL - ntr
- CNAF
- kernel upgrades on UI nodes and second half of the WN will be done next week
- last night there were problems with the Italian top-level BDII, which was recently upgraded to EMI-2; being investigated
- added after the meeting: a few sites reported memory usage increases on the LCG-Rollout list today
- FNAL - ntr
- IN2P3 - ntr
- KIT - nta
- NDGF - ntr
- NLT1 - ntr
- OSG - ntr
- RAL
- on Tue morning there was a network issue caused by the RALPP T2 site overloading the site firewall, due to Xrootd redirection tests by CMS
- dashboards - ntr
- grid services - ntr
- storage - ntr
AOB:
- ATTENTION: start of European Summer Time on Sun March 31 !
- ATTENTION: next meeting on Tuesday April 2 !
- Have a good Easter break!