Week of 130225
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance: local(Simone - SCOD, Alexandre - Dashboards, Jan - CERN Storage, Alessandro - ATLAS); remote(Ulf - NDGF, Michael - BNL, Joel - LHCb, MariaD - GGUS, Wei-Jen - ASGC, Saverio - CNAF, Tiju - RAL, Onno - NL-T1, Dimitri - KIT, Rob - OSG, Pepe - PIC)
Experiments round table:
- ATLAS reports -
- Central services
- T0/1s
- PIC_MCTAPE ~6,000 transfer failures: "server err.451.No write pools configured" on Friday night. GGUS:91739 verified: an error creating tape file families. Fixed on Saturday at ~10am.
- RAL-LCG2 ~10,000 transfer failures: "SOURCE:SRM_ABORTED". GGUS:91743 filed on Saturday, in progress.
- FZK-LCG2: concerning the "ongoing issues" at the top of ADCOperationsDailyReports2013 - file transfer problems from UK sites: these errors are still being observed, mostly at the FZK-LCG2_PERF-IDTRACKING and FZK-LCG2_SCRATCHDISK tokens, at a rate of about a thousand errors in 4 hours on Friday at ~11pm. GGUS:87958 in progress, updated.
- ALICE reports -
- Central services: this morning the AliEn catalogue DB was moved to a new, more powerful machine to sustain its steady growth.
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0:
- T1: IN2P3 (GGUS:91760): authentication problem with one certificate used for Production. SARA: downtime
Sites / Services round table:
- NDGF: there will be a three-day interruption of the NDGF-CERN primary OPN link (starting today). The backup should be working.
- More information collected after the meeting concerning the maintenance windows:
- Maintenance window 1: 2013-02-25 15:59 - 22:59 UTC
- Maintenance window 2: 2013-02-27 15:59 - 22:59 UTC
- Maintenance window 3: 2013-02-28 15:59 - 22:59 UTC
- ASGC: the ASGC-CERN network link is down. Under investigation.
- More info collected after the meeting. From ASGC: "Both the IPLC 10G link and the 2.5G link have been disconnected since 04:47am GMT/UTC, Feb. 25, 2013. Our telecom carriers, CHT-i and FarEastone, have confirmed that the disconnection was caused by a power cut following a fire at the IPLC city PoP (Chief Data Center located at Neihu, Taipei). The blaze has been stopped, but the electric power cannot be restored until the Fire Brigade releases the building. The IPLC link will remain disconnected until the electric power system is restarted."
- RAL: tomorrow morning the site will be AT RISK for maintenance
- SARA: will be in downtime today and tomorrow (the intervention will include the migration of the WNs to SL6+EMI2)
- KIT: VMEM has been set to 10 GB per job slot at ALICE's request (see the first sketch after this list).
- OSG: tomorrow there will be the monthly scheduled maintenance of central OSG operational services. Some services might have a very short outage.
- PIC: following last Thursday's discussion on PROOF-Lite on grid sites, PIC verified that 1% of the jobs at PIC exploit multiple cores within the same job slot (see the second sketch after this list).
- GGUS: Reminder! As announced last week, there will be a GGUS Release this Wednesday 2013/02/27 with ALARM tests as usual. The interface to Ibergrid changes, PIC is affected!
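Note (not discussed at the meeting): as a rough illustration of what the KIT change means in practice, the minimal sketch below applies a 10 GB virtual-memory (address-space) cap to a job payload on Linux. It is not the mechanism used by KIT's batch system, which was not described here; the payload command is a placeholder.

  #!/usr/bin/env python
  # Illustrative only: apply a 10 GB virtual-memory (address-space) limit to a
  # child process, similar in effect to a per-job-slot VMEM cap in a batch system.
  import resource
  import subprocess

  VMEM_LIMIT = 10 * 1024**3  # 10 GB (GiB), in bytes

  def set_vmem_limit():
      # Called in the child just before exec; allocations beyond the limit fail
      # (malloc returns NULL, or MemoryError in Python payloads).
      resource.setrlimit(resource.RLIMIT_AS, (VMEM_LIMIT, VMEM_LIMIT))

  # "payload.sh" is a placeholder for the actual job payload.
  subprocess.call(["/bin/sh", "payload.sh"], preexec_fn=set_vmem_limit)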
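Note (not discussed at the meeting): a minimal sketch of one way a site could spot jobs using more than one core inside a single job slot, by sampling the CPU usage of the job's process tree. It assumes the third-party psutil library is available and that the job slot's top-level PID is known; it is not the actual check PIC performed, which was not described.

  #!/usr/bin/env python
  # Illustrative only: estimate whether a job (given the PID of its top-level
  # process) is using more than one core, by sampling its whole process tree.
  import sys
  import time

  import psutil  # third-party; assumed available on the worker node

  def tree_cpu_percent(pid, interval=10.0):
      """Return the summed CPU usage (in % of one core) of a process tree."""
      parent = psutil.Process(pid)
      procs = [parent] + parent.children(recursive=True)
      for p in procs:
          p.cpu_percent(None)  # prime the per-process counters
      time.sleep(interval)
      return sum(p.cpu_percent(None) for p in procs if p.is_running())

  if __name__ == "__main__":
      usage = tree_cpu_percent(int(sys.argv[1]))
      # Above ~110% of one core, the job is effectively multi-core.
      verdict = "multi-core" if usage > 110 else "single-core"
      print("CPU usage: %.0f%% -> %s" % (usage, verdict))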
AOB:
Tuesday
Attendance: local(Simone - SCOD, Alexandre - Dashboards, Jan - CERN Storage, Alessandro - ATLAS, Maarten - Alice, Maria - GGUS); remote(Ulf - NDGF, Matteo - CNAF, Onno - NL-T1, Michael - BNL, Wei-Jen - ASGC, Joel - LHCb, John - RAL, Lisa - FNAL, Rolf - IN2P3, Rob - OSG, Jeremy - GridPP, Pepe - PIC)
Experiments round table:
- ATLAS reports -
- Central services
- T1s and network
- TRIUMF-LCG2 many job failures "no such file or dir" (group generator input for Sherpa). GGUS:91766 in progress.
- RRC-KI-T1 commissioning is ongoing.
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0:
- T1: IN2P3 (GGUS:91760): authentication problem with one certificate used for Production: fixed (Tomcat restarted). SARA: downtime
- Dashboard: in the "Site Groups" drop-down box, RHUL does not appear if you select "All sites". However, if you pick "Tier 0/1/2", then you do see UKI-LT2-RHUL.uk.
(http://dashb-lhcb-sum.cern.ch/dashboard/request.py/historicalsmryview-sum#view=siteavl&time[]=last48&granularity[]=default&profile=LHCb_CRITICAL&group=Tier+0/1/2&site[]=LCG.UKI-LT2-RHUL.uk&type=quality)
Sites / Services round table:
- NL-T1: maintenance proceeding according to plan.
- RAL:
- At risk this morning for a network intervention. One switch after reboot went down for 30 mins.
- Currently there is a problem with the batch system not being able to dispatch jobs fast enough. Under investigation.
- OSG: the maintenance intervention mentioned yesterday has started. No problems are expected for WLCG services.
- ASGC: tried to install CASTOR 2.9.13 but discovered a bug. Discussing with the CASTOR experts.
- MariaD: tomorrow GGUS release (as planned).
- From John Shade: the 10G IPLC link of ASGC recovered at 19:45 UTC on Feb. 25, 2013. Communication among ASGC, LHCOPN, and LHCONE has resumed.
AOB:
Wednesday
Attendance: local(Simone - SCOD, Alexandre - Dashboards, Maarten - Alice, Alessandro - ATLAS, Luca - CERN DB, Jan - CERN Storage, Alex - CERN PES); remote(Michael - BNL, Wei-Jen - ASGC, Joel - LHCb, Lisa - FNAL, Ulf - NDGF, John - RAL, Pavel - KIT, Ronald - NL-T1, Rolf - IN2P3, Rob - OSG, Pepe - PIC)
Experiments round table:
- ATLAS reports -
- Central services: many ATLAS central services were affected by the hypervisor issue
- ATLAS-T0-Frontier service was degraded (availability 50%) Wednesday ~4am, GGUS:91793, fixed by ~8am.
- ATLAS_DDM_VOBOXES ss-bnl, ss-ral, ss-lyon were degraded Wednesday ~5am. After the reboot earlier during the night the dashb services didn't start on atlas-ss-bnl (voatlas316), atlas-ss-ral (voatlas259), atlas-ss-lyon (voatlas314). Those site services were restarted.
- ATLAS_DDM_Deletion was also showing down in SLS, with deletion service stuck. Restarted, back to normal ~8am.
- T1s
- CERN-PROD: GGUS:91755. It would be useful if a new channel RRC-KI-T1 -> CERN for the FTS T0 export could be created as soon as possible. Thanks in advance.
- TRIUMF-LCG2:
- Transfer failures: SOURCE: failed to contact on remote SRM. GGUS:91766 in progress: network outage, should be fixed.
- Frontier 0% efficiency, the panglia plots are frozen, cannot ping ce1.triumf.ca. GGUS:91791 in progress: there was a network outage, the network should be back.
- ALICE reports -
- KISTI: disk SE unstable, experts looking into it
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- T0:
- T1:
- SARA : downtime extended for CPU.
- IN2P3: thanks to IN2P3 for their additional 200 TB of tape.
Sites / Services round table:
- ASGC: DPM has been upgraded to 1.6.6-1, while CASTOR has been upgraded to 2.1.13-0. The next step will be the upgrade to CVMFS 2.1 (not scheduled yet)
- NDGF: the network intervention originally scheduled for yesterday and today has been shifted to today and tomorrow. The OPN backup link will work normally. No issues have been reported so far.
- OSG: brief outage of the OSG BDII yesterday during the maintenance intervention.
- PIC: downtime on the 5th of March to upgrade Enstore. Reading from tape will not be available; writing will be buffered. Alessandro: is this flagged as a warning or an outage in GOCDB? ATLAS would prefer a warning, because an outage will be interpreted as all the storage being down (while here only tape reading is affected). Pepe will check and come back on that.
- CERN Storage: one ATLAS file has an incorrect checksum both in CASTOR and EOS (incorrect wrt the LFC). Being checked (an illustrative checksum sketch follows this list).
- CERN PES: the hypervisor problem was due to a failure of the Storage Area Network, most likely related to high bursts of I/O. It is not a new issue; many changes have been tried to solve the problem, but nothing conclusive yet. Some VM rebalancing is being done to alleviate the issue. The problem is not easily reproducible. Alessandro: ATLAS critical services are being moved to virtual machines. Maybe we should discuss deploying the very critical experiment services on special clusters with reduced I/O (fewer VMs per hypervisor). To be followed up.
- GGUS: the new release has been put in place this morning. So far so good. Alarm tests going on.
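Note (not discussed at the meeting): a minimal sketch of the kind of verification involved in the CERN Storage item above: computing the Adler-32 checksum of a locally staged replica and comparing it with the value recorded in a catalogue such as the LFC. The file name and expected value below are placeholders; this is not the actual procedure used by the CASTOR/EOS teams.

  #!/usr/bin/env python
  # Illustrative only: compute the Adler-32 checksum of a local copy of a file
  # and compare it with the value recorded in a catalogue (e.g. the LFC).
  # "replica.dat" and EXPECTED are placeholders.
  import zlib

  EXPECTED = "0x12345678"  # placeholder catalogue value

  def adler32_of(path, chunk=1024 * 1024):
      value = 1  # Adler-32 starts at 1
      with open(path, "rb") as f:
          while True:
              data = f.read(chunk)
              if not data:
                  break
              value = zlib.adler32(data, value)
      return "0x%08x" % (value & 0xffffffff)

  checksum = adler32_of("replica.dat")
  print(checksum, "matches" if checksum == EXPECTED.lower() else "MISMATCH", EXPECTED)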
AOB:
- Simone reminded that from next week, the WLCG Daily Ops Meeting will move to the new format: 2 meetings per week, on Monday and Thursday, both at 15:00.
Thursday
Attendance: local(Alessandro, Alex B, Jan, Maarten, Zbyszek); remote(Dennis, Gareth, Jeremy, Joel, Lisa, Matteo, Michael, Rob, Ulf, Wei-Jen, WooJin).
Experiments round table:
- ATLAS reports -
- Central services
- T1s and network
- TRIUMF-LCG2, many job failures today again with "job killed by signal 15", GGUS:91766 reopened.
- ALICE reports -
- KISTI: suffering a network problem due to the maintenance of the GLORIAD-CERN backbone network, which began 27/02/2013 02:00 (UTC) and is scheduled to finish by 28/02/2013 02:00 (UTC). Only a one-hour link cut was foreseen during the maintenance, but the network link has appeared to be down since 09:00 (UTC) yesterday for a reason we do not know. We are waiting for the maintenance to end and will inquire about this problem.
Sites / Services round table:
- ASGC
- last night the SRM became unstable after the CASTOR upgrade; after consulting the CASTOR team at CERN it looks OK again
- BNL - ntr
- CNAF
- CREAM ce04-lcg downtime March 1-2 for upgrade to EMI-2 on SL6
- CREAM ce01-lcg is up again
- FNAL - ntr
- GridPP - ntr
- KIT - ntr
- NDGF - ntr
- NLT1
- NIKHEF: 1 failed disk server with 50 TB of ATLAS data; machine down, data not lost; expected back tomorrow
- SARA: 1 failed tape with ~150 files irrecoverable, owners have been informed
- OSG - ntr
- RAL - ntr
- dashboards - ntr
- databases
- next week a security patching campaign will start:
- Mon: ATLAS archive
- Tue: ALICE online, LCGR
- Wed: LHCb offline
- Thu: LHCb online
- GGUS/SNOW
- yesterday's alarm tests appear to have gone OK
- storage - ntr
AOB:
Friday
Attendance: local(Alessandro, Alex B, Jan, Maarten, Manuel, Marcin);remote(Joel, Michael, Onno, Pepe, Salvatore, Thomas, Tiju, Wei-Jen, Xavier).
Experiments round table:
- ATLAS reports -
- Central services
- Schedconfig to AGIS transition: write access to the SchedConfig SVN repository has been closed; from now on all modifications of Panda objects will be done through AGIS.
- Some problems observed from the pilot factories (pilotlimit and transferring time set to 0 in AGIS, while NULL in Panda) were fixed this morning.
- T1s and network
- TRIUMF-LCG2, some job failures today again with "No such file or directory" error, GGUS:91766. The source of the problem and ways to bypass it are discussed in the ticket.
- TAIWAN-LCG2, one disk server currently unavailable due to hardware failure. It's offline and data can not be accessed from it. Vendor was contacted for technical support.
- ALICE reports -
- KISTI: running jobs OK again, but main disk SE still failing the tests.
- LHCb reports -
- Ongoing activity as before: reprocessing (CERN,IN2P3 + T2s), some prompt-processing (CERN + T2s), MC and user jobs.
- GGUS:91882 opened to find out the policy for having the libtool-ltdl rpm installed by default on the grid.
- T0:
- T1:
Sites / Services round table:
- ASGC - nta
- BNL - ntr
- CNAF
- CREAM ce04-lcg down until March 7
- KIT - ntr
- NDGF - ntr
- NLT1 - ntr
- OSG - ntr
- PIC - ntr
- RAL - ntr
- dashboards
- still looking into RHUL issue reported by LHCb
- databases - ntr
- storage
- 2 CASTOR functionalities to be turned off in the near future:
- updating files in place
- "root" protocol (not "xroot")
- Alessandro: looks OK for ATLAS
AOB: