Week of 120716
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
- The scod rota for the next few weeks is at ScodRota
WLCG Service Incidents, Interventions and Availability, Change / Risk Assessments
General Information
Monday
Attendance: local(Massimo, Luc, Maarten, Giuseppe, Ulrich, Edward, Alexandre); remote(Michael, Saerda, Gonzalo, JhenWei, Lisa, Ronald, Paolo, Tiju, Vladimir, Rolf, Rob).
Experiments round table:
- ATLAS reports -
- CERN CENTRAL SERVICES, T0
- T1
- PIC transfer failures after migration to Chimera. Alarm ticket GGUS:84217.
- PIC stable now & back in T0 export.
- CALIB_T2
- CMS reports -
- LHC machine / CMS detector
- Taking data during the week-end
- Van der Meer scan for CMS is foreseen on Tuesday morning
- CERN / central services and T0
- Tier-1/2:
- PIC recovered almost completely; some runs were not transferred from T0 to PIC due to a problem which was fixed in the morning
- GGUS:83486 (FTS delegation problem): currently no problems, but keeping it here until the software is fixed
- GGUS:84229: CMSSW_5_3_2_patch4 missing at PIC. Will be installed by the SW deployment team ASAP
- T2_DE_DESY had a power cut today. Site is recovering, GRID services may be affected until tomorrow
- Other:
- LHCb reports -
- User analysis and reconstruction at T1s
- New GGUS (or RT) tickets
- T0:
- CERN: (GGUS:84126) Pilots aborted; under investigation
- T1:
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- NDGF: ntr
- NLT1: In response to the question in GGUS:84223 (ticket solved after the weekend), Ronald pointed out this is their service level (weekend support on best effort)
- PIC: The CMS SW ticket was due to an "sgm" worker node misconfiguration; CMS should trigger another SW install. GGUS:84217 was due to the system overloading after the upgrade (registrations + new transfers); the latter had to be cancelled in order to finish the registrations (situation recovered on Saturday around noon).
- RAL: ntr
- OSG: ntr
- CASTOR/EOS: ntr
- Central Services: One CE's /var partition filled up. It is now back in production, but the root cause is under investigation. The LHCb ticket is also under investigation
- Dashboard: ntr
AOB:
Tuesday
Attendance: local(Massimo, JhenWei, Oliver, Guido, Alexandre, Edward, Ulrich, Eva, Maarten); remote(Michael, Saerda, Paolo, Lisa, Tiju, Gonzalo, Jeremy, Rolf, Vladimir, Rob).
Experiments round table:
- ATLAS reports -
- CERN CENTRAL SERVICES, T0
- CERN-PROD: FTS problem GGUS:84154 still open, no major news, not a showstopper
- atlt3 Castor pool being erased, will be discarded by ATLAS in the next few days
- T1
- PIC downtime finished yesterday
- NDGF-T1 transfer failures to MCTAPE due to staging problem GGUS:84207, solved (files lost)
- CALIB_T2
- INFN-NAPOLI still in downtime after the power cut over the weekend
- CMS reports -
- LHC machine / CMS detector
- Van der Meer scan for CMS is now foreseen on Tuesday afternoon/night
- Tomorrow, Wednesday, back to physics
- CERN / central services and T0
- Tier-1/2:
- KIT: high load situation on Frontier squids, maybe related to large number of running jobs yesterday? Peaked at close to 5k running jobs in parallel.
- GGUS:83486 (FTS delegation problem): currently no problems, but keeping it here until the software is fixed
- T2_DE_DESY had a power cut yesterday. Site is recovering, network and basic services are working again, queues have been opened
- Other:
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- KIT: ntr
- NDGF: ntr
- NLT1: ntr
- PIC: ntr
- RAL: ntr
- OSG: ntr
- CASTOR/EOS: ntr
- Central Services: ntr
- Data bases: ntr
- Dashboard: ntr
AOB:
Wednesday
Attendance: local(Massimo, Guido, Alexandre, Edward, Ulrich, Luca, Luca, Maarten); remote(Oliver, Michael, Saerda, JhenWei, Paolo, Burt, Tiju, Gonzalo, Rolf, Vladimir, Kyle, Alexandre).
Experiments round table:
- ATLAS reports -
- CERN CENTRAL SERVICES, T0
- CERN-PROD: some failures in writing to castor (castoratlas/t0atlas), under investigation
- 2 new LFC frontend nodes added yesterday; since then, a high number of connections (each node accepts 90 connections)
- T1
- PIC: many transfer failures (GGUS:84311). All dCache pools assigned to the ATLAS VO were full; new disk space was assigned to ATLAS, no more errors since then
- TRIUMF: GGUS:84327, bad ACLs on some directories in LFC; asked the site to kindly change them (the certificate used to create them is no longer valid)
- CALIB_T2
- INFN-NAPOLI back in production since last night, everything fine
- CMS reports -
- LHC machine / CMS detector
- Machine had RF and cryo problems yesterday
- Van der Meer scans postponed, expected to start this afternoon with LHCb, then Atlas and CMS starting in the evening, takes 12 hours for Atlas and CMS
- CERN / central services and T0
- GGUS:84302: ce206, ce207, ce208 show issues when jobs from wms316 land there ("Failed to create a delegation id"); recovered overnight, most probably an effect of yesterday's reconfiguration; ticket closed
- Tier-1/2:
- KIT: high load situation on Frontier squids, again a spike in the number of jobs
- The two CMS squids were again maxed out, although they max out at 54 MB/s rather than at over 100 MB/s, which is normal for squids with 1 Gbit connections
- Currently the failover to CERN is able to sustain the load
- Recommendation is that KIT deploy a 3rd squid (CNAF already did) and investigate why the performance of the two already deployed is limited
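As a rough sanity check of the figures above (an illustrative calculation, not part of the minutes): a 1 Gbit/s NIC corresponds to about 125 MB/s, so a squid peaking at 54 MB/s is using well under half of its nominal link capacity, which supports the conclusion that something other than the network is limiting the KIT squids.

```python
# Back-of-the-envelope check of squid throughput vs. nominal link capacity.
# The 54 MB/s observed peak and 1 Gbit NICs are from the minutes;
# the calculation itself is only illustrative.

GBIT_PER_S = 1                        # nominal NIC speed in Gbit/s
LINK_MB_S = GBIT_PER_S * 1000 / 8     # theoretical maximum: 125 MB/s
OBSERVED_MB_S = 54                    # observed per-squid peak at KIT

utilization = OBSERVED_MB_S / LINK_MB_S
print(f"link capacity ~{LINK_MB_S:.0f} MB/s, "
      f"observed peak {OBSERVED_MB_S} MB/s "
      f"({utilization:.0%} of nominal)")
```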
- Other:
Sites / Services round table:
- ASGC: ntr
- BNL: ntr
- CNAF: ntr
- FNAL: ntr
- IN2P3: ntr
- NDGF: ntr
- NLT1: ntr
- PIC: Shortage of space (ATLAS) due to some HW problems with the new delivery. Being fixed; in the meantime some spare disks have been put in production to alleviate the problem
- RAL: ntr
- OSG: ntr
- CASTOR/EOS: investigating the root cause of the ATLAS problem
- Central Services: CE reconfiguration was due to a restart of Tomcat after installing the new BLAH component (normally invisible)
- Data bases: In relation to the LFC problem: whenever possible they would like to be pre-alerted about expected increases in load
- Dashboard: ntr
AOB:
Thursday
Attendance: local();remote().
Experiments round table:
Sites / Services round table:
- ASGC:
- BNL:
- CNAF:
- FNAL:
- IN2P3:
- KIT:
- NDGF:
- NLT1:
- PIC:
- RAL:
- OSG:
- CASTOR/EOS:
- Central Services:
- Data bases:
- Dashboard:
AOB:
Friday
Attendance: local();remote().
Experiments round table:
Sites / Services round table:
- ASGC:
- BNL:
- CNAF:
- FNAL:
- IN2P3:
- KIT:
- NDGF:
- NLT1:
- PIC:
- RAL:
- OSG:
- CASTOR/EOS:
- Central Services:
- Data bases:
- Dashboard:
AOB:
--
JamieShiers - 09-Jul-2012