Week of 130429
Daily WLCG Operations Call details
To join the call, at 15.00 CE(S)T Monday to Friday inclusive (in CERN 513 R-068) do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota.
General Information: WLCG Availability, Service Incidents, Broadcasts, Operations Web
Monday
Attendance:
- local: Simone (SCOD), Jarka (CERN - dashboard), Maria (CERN - GGUS), Jerome (CERN - PES), Felix (ASGC), Stefan (CERN - ES), Belinda (CERN - DSS), Victor (LHCb), Marcin (CERN - DB), Pepe (PIC).
- remote: Michael (BNL), Alexander (NL-T1), Xavier (KIT), Kyle (OSG), Christian (NDGF), Gareth (RAL), Stefano (CMS), Alessandro (ATLAS), Salvatore (CNAF)
Experiments round table:
- ATLAS reports -
- Central services
- T1s
- ND.ARC: on Saturday thousands of jobs failed with transfer timeouts, and a large number of jobs were stuck in the transferring state because the FTS channel ND->DE could not transfer the files fast enough (elog:44012-44015). The issue was fixed by increasing the timeout from 2 days to 4 days and doubling the number of parallel transfers.
- From KIT: the FTS configuration at KIT was indeed changed on Sunday at 10:00 AM; the number of active transfers from NDGF was increased from 10 to 20 (an illustrative estimate of the effect follows below).
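For context, a rough back-of-the-envelope sketch of why raising the timeout and doubling the active transfers lets a queued backlog drain before jobs time out. The numbers below are purely hypothetical and are not the actual ND->DE channel statistics (the real values were not reported at the meeting):

# Hypothetical illustration only: estimate how long a backlog of queued files
# takes to drain for a given number of concurrent transfer slots on a channel.
def drain_time_hours(backlog_files, concurrent_slots, minutes_per_file):
    # Each slot moves one file every `minutes_per_file` minutes.
    files_per_hour = concurrent_slots * 60.0 / minutes_per_file
    return backlog_files / files_per_hour

backlog = 3000          # hypothetical number of files stuck in 'transferring'
minutes_per_file = 15   # hypothetical average time per file on this channel

for slots in (10, 20):  # before and after KIT doubled the active transfers
    hours = drain_time_hours(backlog, slots, minutes_per_file)
    print(f"{slots} slots -> {hours:.0f} h (~{hours/24:.1f} days)")

With these assumed numbers the backlog would need roughly 3 days to drain with 10 slots (beyond the old 2-day timeout) but only about 1.5 days with 20 slots, consistent with the combination of a longer timeout and more parallel transfers clearing the queue.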
- CMS reports -
- nothing to report on the distributed system
- Oracle DB problem at CERN last Thursday. The CMS CRC opened an ALARM ticket GGUS:93653 at 13:38; SNOW ticket INC:285815 had been issued earlier, at 13:12, by the CMS operator. According to CERN DB Support the problem started at 12:40 and was due to internal reasons not tied to CMS activity. CMS applications had been failing for a couple of hours before the service was restored, causing alarm (but not panic) among operators. The first communication from CERN Oracle support was at the 15:00 meeting, but the CRC phone link dropped near the end and the CRC did not hear/understand it. CMS would benefit from written, direct communication between DB Support and CMS operators. To be followed up by CMS Computing Operations.
- Maria Dimou: the action of CMS to open an ALARM ticket was indeed the correct one. What was the communication problem mentioned? Stefano: it would have been good to have direct communication from the Oracle team mentioning that there was a problem.
- LHCb reports -
- The restriping campaign started on Saturday
- T0:
- T1: RAL (UK sites): some MC jobs are failing to upload output from UK sites to various destinations
Sites / Services round table:
- NL-T1: some problems staging files from tape this morning, due to load on the tape drives. The tape drive configuration has since been tuned accordingly.
- OSG: still seeing some errors on the CERN BDII (a 20% drop in the number of entries w.r.t. the usual level). Looking into it.
- NDGF: short downtime (marked as 2h) on Friday morning to reboot some disk servers.
- RAL: on Wednesday morning there will be a warning (At Risk) for Oracle patching on the DB behind CASTOR.
AOB:
Thursday
Attendance:
- local: AndreaV/SCOD, Alessandro/ATLAS, Felix/ASGC, Maarten/ALICE, Jerome/Grid services, Marcin/Databases, Belinda/Storage, Victor/PIC&LHCb
- remote: Jeff/NLT1, Xavier/KIT, Gareth/RAL, Kyle/OSG, Paolo/CNAF, Lisa/FNAL, Rolf/IN2P3, Christian/NDGF, Jeremy/Gridpp; MariaD/GGUS, David/CMS
Experiments round table:
- ATLAS reports -
- Central services
- T1s
- today (Thursday) a disk server problem appeared at RAL, causing timeouts when reading some files; investigation ongoing (GGUS:93795)
- CMS reports -
- In general a slow half-week.
- The 2012 re-reco is complete; some MC production at Tier-1s is ramping up. In general activity is low.
- CMS-only Tier-2s are encouraged to make the transition to SL6; sites shared with other VOs need to wait until June 1 to do this.
- ALICE -
- central services: some servers went down due to a long power cut this morning, causing most jobs to fail and the queues to get drained; the job numbers are slowly ramping up again
- LHCb reports -
- Incremental stripping campaign in progress: ramping up to between 2k and 3k stripping and merging jobs, with 98% of executions completed
- T0:
- T1:
Sites / Services round table:
- Jeff/NLT1:
- there was a 30-minute unscheduled intervention this morning for tapes, but the disk buffers were working OK
- Nikhef will move to SLC6 during the week of May 21st
- Xavier/KIT: ntr
- Gareth/RAL:
- investigating issues reported by ATLAS
- also working on the firmware upgrade of the ALICE disk servers; the batch queues have been drained
- next week, on Wednesday or Thursday, the CASTOR DB backend will be switched over between primary and standby
- Kyle/OSG: ntr
- Paolo/CNAF: there was an unscheduled intervention for ALICE three days ago; everything was resolved by yesterday
- Lisa/FNAL: ntr
- Rolf/IN2P3: ntr
- Christian/NDGF: ntr
- Jeremy/Gridpp: ntr
- Felix/ASGC: there has been an unscheduled intervention on the CREAM CE
- Victor/PIC: following up on issues with the CREAM CE on SLC6; will wait for the expert next week
- Jerome/Grid services: the CREAM CE upgrade to EMI2 at CERN is ongoing
- Marcin/DB: ntr
- Belinda/Storage: ntr
- MariaD/GGUS:
- The next GGUS release will take place on June 2nd.
- File ggus-tickets.xls is up-to-date and attached to the twiki page WLCGOperationsMeetings. There have been 2 real ALARMs so far since the last WLCG MB, which took place on 2013/03/19.
AOB: the meetings next week will take place on Monday and Friday (CERN is closed on Thursday)