Week of 130930
WLCG Operations Call details
To join the call, at 15.00 CE(S)T, by default on Monday and Thursday (at CERN in 513 R-068), do one of the following:
- Dial +41227676000 (Main) and enter access code 0119168, or
- To have the system call you, click here
The SCOD rota for the next few weeks is at ScodRota
WLCG Availability, Service Incidents, Broadcasts, Operations Web
General Information
Monday
Attendance:
- local: Simone (SCOD), Alessandro (ATLAS), Ivan (dashboards), Ulrich (CERN/PES), Maarten (Alice), Ken (CMS)
- remote: Michael (BNL), Thomas (NDGF), Vladimir (LHCb), Lisa (FNAL), Rolf (IN2P3), Sang-Un (KISTI), Wei-Jen (ASGC), Gareth (RAL), Rob (OSG), Pepe (PIC), Salvatore (CNAF), Onno (NL-T1)
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- T0
- T1/T2
- Many NTUP datasets were recently subscribed to T2s and caused the DATADISKs to fill up. The transfers of these datasets go through the T1 SCRATCHDISKs, which also filled up: TRIUMF-LCG2, IN2P3-CC, FZK-LCG2, INFN-T1, PIC, RAL-LCG2. https://savannah.cern.ch/bugs/index.php?102694
- CMS reports (raw view) -
- CVMFS Stratum 1 is down, although this doesn't appear to be affecting our operations.
- On Friday there was trouble with the BDII for the SAM tests; it got resolved.
- Two unresolved tickets, INC:395272 and INC:396136. Both came in over the weekend; no action as of this morning (but I think we need to deal with this on the CMS side).
- OK, I've got the story now on the network link to Russia....
- ALICE -
- NDGF: 35k files (96% production, 4% organized analysis, some >2 years old) no longer found in dCache, probably lost; being investigated
- many sites moving to CVMFS
- LHCb reports (raw view) -
- The main activity is MC production.
- The fall incremental stripping campaign will be launched this week (see also the WLCG Operations Coordination meeting of 19 Sep).
- T0:
- T1:
- RAL: one disk server is down; technicians are looking into the issue. For the moment the files are declared unreachable within LHCb.
- GRIDKA: problem with staging during stress test
Sites / Services round table:
- RAL:
- 50% of the batch resources have been moved from Torque to Condor (all on SL6). The remaining 50% will be upgraded to SL6 on Wednesday but will remain under Torque for some weeks.
- The RAL link to the UK academic network will be upgraded tomorrow (AT RISK).
- RAL UPS test on Wednesday (AT RISK).
- PIC: CVMFS problem with LHCb reported. Now stable.
- NL-T1: some tape problems last week, caused by a large amount of incoming data; all is fine now.
- KIT: next Thursday will be a national holiday in Germany. Alarms will receive the normal response; less urgent tickets will have to wait until Friday.
- CERN:
- 2 new CEs are available (301 and 302, both pointing to SLC6 resources) but are not yet being used.
- More worker nodes are being deployed in the Wigner Center. ATLAS would like to have a dedicated discussion on this; it will happen next Monday in the ATLAS-IT monthly meeting.
- CERN CVMFS
- During the weekend the CVMFS stratum 1 service at CERN suffered from a full partition.
- CVMFS repositories mirrored only at CERN became totally unavailable, i.e. the non-production /cvmfs/lhcb-conddb.cern.ch.
- For the other repositories, e.g. /cvmfs/atlas.cern.ch and /cvmfs/cms.cern.ch, all clients, including those at CERN, will by now have switched away from CERN to alternate Stratum 1 servers (see the configuration sketch below).
- Current situation at 13:00 on Monday:
- All cvmfs repositories except for /cvmfs/atlas.cern.ch are again served correctly from cvmfs-stratum-one.cern.ch.
- /cvmfs/atlas.cern.ch is currently not available from CERN.
- With the exception of /cvmfs/lhcb-conddb.cern.ch, the incident is transparent to all users, including both readers and writers of CVMFS.
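- For reference (not part of the report): the failover happens because each client lists several Stratum 1 servers in its CVMFS configuration and tries them in order. A minimal sketch, assuming a local override file; the hostnames below are examples, not the officially recommended list:
    # /etc/cvmfs/domain.d/cern.ch.local -- illustrative override; server list is an example only
    CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch/cvmfs/@fqrn@;http://cernvmfs.gridpp.rl.ac.uk/cvmfs/@fqrn@;http://cvmfs.racf.bnl.gov/cvmfs/@fqrn@"
  When the first server fails, the client retries against the next one in the list, which matches the transparent switch described above.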
- Dashboards and Monitoring: the MyWLCG and SUM interfaces will be down tomorrow from 7:00 to 15:00 UTC for their upgrade, so there will basically be no interface to check the results of the SAM tests. Metric results will be queued and, after the release, the availability will be calculated. No data will be lost, but there will be a delay.
AOB:
Thursday
Attendance:
- local: Simone (SCOD), Xavi (CERN-DSS), Maarten (Alice), Alessandro (ATLAS), Sang-Un (KISTI), Yeo (KISTI), Park (KISTI), Ulrich (CERN-PES), Maria (GGUS)
- remote: Vladimir (LHCb), Lisa (FNAL), David (CMS), Kyle (OSG), Rolf (IN2P3), Gareth (RAL), Thomas (NDGF), Ronald (NL-T1), Wei-Jen(ASGC), Pepe (PIC)
Experiments round table:
- ATLAS reports (raw view) -
- Central Services
- The BNL VOMS server had to be switched off: its host certificate is now provided by a new CA, resulting in a different DN for the VOMS server (/DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=vo.racf.bnl.gov). This DN is hardcoded in a file at the EGI sites. We are working on a broadcast to have this file updated.
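- Illustration (not from the report): on EGI sites the accepted VOMS server DN is typically pinned per VO in an LSC file, so after the broadcast the updated file could look roughly like the sketch below. The path, the VO directory and the issuer placeholder are assumptions; the exact DNs should be taken from the broadcast itself.
    /etc/grid-security/vomsdir/atlas/vo.racf.bnl.gov.lsc   (hypothetical path)
        /DC=com/DC=DigiCert-Grid/O=Open Science Grid/OU=Services/CN=vo.racf.bnl.gov
        <issuer DN of the new DigiCert Grid CA, as announced in the broadcast>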
- Tier0/1s
- IN2P3-CC storage issue GGUS:97685: this was a HW problem, now solved.
- RAL-LCG2: all transfers failing for ~1h in the early morning of 3 October, GGUS:97731. As reported in the ticket, RAL had a network issue 02:45 - 03:30 UTC.
- INFN-T1 GGUS:97687: a few transfers failing between the UAM site and INFN-T1, with problems contacting the INFN-T1 SRM. Not understood yet.
- TRIUMF GGUS:97698: not sure yet whether this is related to the BNL VOMS issue, but most probably it is. If the site can confirm, we can close the ticket.
- CMS reports (raw view) -
- Continuing to chew through the legacy re-reco of the 2011 dataset; otherwise quiet on most fronts. GGUS activity of the last few days:
- GGUS:97714: a WMS configuration at DESY was causing CERN WMS nodes to overload. It was an old configuration; now solved.
- GGUS:97705: a user experiencing a mapping (?) issue at Bari. Ongoing.
- GGUS:97677: CMS production glideins becoming held due to proxy-to-local-user delegation at KIT.
- GGUS:97732: not being able to override the default 5.5-hour timeout for CREAM CE sites is causing them to fail CMS SAM tests more often than they probably should.
- Maarten: ALICE implemented its own version of the probe to overcome the hardcoded timeout; will suggest the same to CMS.
- After the meeting: the CMS problem is different; it used to work fine.
- ALICE -
- CVMFS
- T0/T1: CNAF and IN2P3 switched (joining KIT and RAL)
- T2/T3: 23 sites
- NDGF: dCache file name mapping turned out to be broken due to a misconfiguration a few weeks ago; some renaming to do, after which the missing files should be available again
- LHCb reports (raw view) -
- The main activity is MC production.
- The fall incremental stripping campaign will be launched this week (see also the WLCG Operations Coordination meeting of 19 Sep).
- T0:
- T1:
- GRIDKA: staging stress test finished successfully
- Discussion: CERN will reply to GGUS:97736 concerning ce202 failing all jobs. If LHCb have the impression that not enough jobs are being executed at CERN, they will open a separate ticket.
Sites / Services round table:
- KISTI is currently in a scheduled 2-week downtime, but a problem with the delivery of Cisco switches will extend it by 10 extra days.
- OSG: need to understand why Tuesday's MyWLCG downtime was not sufficiently propagated to OSG. Also, there will be an upgrade of the RSV -> SAM uploader next Tuesday, October 8.
- RAL: a network outage cut the site's connectivity to the outside world for ~45 minutes this morning. At 7:00 a router got into trouble (as a consequence of the other problem). All problems should be solved now. The batch farm has been upgraded to SL6 as planned.
- PIC: in the last day there have been some internal network issues (packet loss, some services might be affected).
- CERN: a problem was discovered in the configuration of newly created VMs (solved). WNs are being upgraded to EMI3 on SL6; 100 nodes (800 job slots) are in Wigner.
- GGUS: the September GGUS release has been merged with the October one, scheduled for 23 October.
- Storage Services: CASTOR v14 rollout. Transparent intervention on the central services and the namespace, but downtime is required for the stagers (~5h). Tentative schedule:
- 7/10/2013 10h to 12h Nameserver and Central Services. Transparent intervention.
- 14/10/2013 (tentative) 9h to 14h CASTORATLAS. Intervention not transparent. Downtime required.
- 21/10/2013 (tentative) 9h to 14h CASTORALICE. Intervention not transparent. Downtime required.
- 22/10/2013 (tentative) 9h to 14h CASTORCMS and CASTORLHCb. Intervention not transparent. Downtime required.
AOB: