Week of 141110
WLCG Operations Call details
- At CERN the meeting room is 513 R-068.
- For remote participation we use the Vidyo system. Instructions can be found here.
General Information
- The purpose of the meeting is:
- to report significant operational issues (i.e. issues which can or did degrade experiment or site operations) which are ongoing or were resolved after the previous meeting;
- to announce or schedule interventions at Tier-1 sites;
- to inform about recent or upcoming changes in the experiment activities or systems having a visible impact on sites;
- to provide important news about the middleware;
- to communicate any other information considered interesting for WLCG operations.
- The meeting should run from 15:00 until 15:20; exceptionally it may extend to 15:30.
- The SCOD rota for the next few weeks is at ScodRota
- General information about the WLCG Service can be accessed from the Operations Web
- Whenever a particular topic requiring information from sites or experiments needs to be discussed at the daily meeting, it is highly recommended to announce it by email to wlcg-operations@cern.ch, so that the relevant parties have time to collect the required information or to invite the right people to the meeting.
Monday
Attendance:
- local: Belinda (storage), Hervé (storage), Lorena (databases), Maarten (SCOD + ALICE), Stefan (LHCb), Steve (grid services), Tsung-Hsun (ASGC)
- remote: Antonio (CNAF), Christian (NDGF), Dea Han (KISTI), Dimitri (KIT), Lisa (FNAL), Onno (NLT1), Pepe (PIC), Rob (OSG), Rolf (IN2P3), Tiju (RAL), Tommaso (CMS)
Experiments round table:
- ATLAS reports (raw view) -
- Daily Activity overview
- FZK DATADISK is still nearly full (only 59 TB free); TAIWAN-LCG2 has 76 TB free.
- Some RDO data were moved from FZK to other sites, but only 50 TB.
- The DQ2 Central Catalog writer was stuck due to a log file; fixed. N.B. we now have 3 nodes behind that alias and will ask CS to put back the spare one with high weight.
- Reprocessing tasks were submitted with inputs in both Rucio and DQ2. The DQ2 ones started, but the Rucio ones did not. Under investigation.
- CentralService/T0/T1s
- Nothing special to report.
- CMS reports (raw view) -
- A global run lasting a few weeks is taking data; it also involves CERN and remote computing.
- Testing of new VOMS server infrastructure
- CMS computing operations reminded about the Nov 26th deadline
- Operational items
- Reconfiguration campaign of xrootd for European sites is ongoing (more than 1/3 now seem compliant)
- We had a Frontier problem on Thu night, caused by the migration of the Launchpads to Puppetized machines that did not have the correct firewall settings. Reverting to the old machines solved the problem.
- GGUS open (T0, T1s only...)
- GGUS:109876: FS T1, a stuck submission, SOLVED
- GGUS:109855: a CREAM CE misbehaving at CERN; STILL WAITING FOR RESPONSE
- GGUS:109919: glExec problem at RAL_T1, SOLVED
- GGUS:109812: slow Xrootd access at CERN, which turned out to be VM-related; MOVED TO A NEW TICKET
- TWiki is still often RED in SLS ...
- ALICE -
- KIT: due to local staging issues, raw data reprocessing jobs needed to read a lot of data from CERN, thereby putting a high load on the OPN link over the weekend
- the staging should work again now
- LHCb reports (raw view) -
- MC and user jobs. The "Legacy Run1 stripping campaign" is to be launched today at T1 sites
- T0: NTR
- T1: the problem with SLC6.6 worker nodes and ROOT 6 based applications was fixed by deploying a new software stack. All new (MC) productions launched as of today will also be able to run on SLC6.6 nodes (some T1 sites were affected)
Sites / Services round table:
- ASGC: ntr
- BNL:
- CNAF: ntr
- FNAL: ntr
- GridPP:
- IN2P3: ntr
- JINR:
- KISTI: ntr
- KIT:
- the tape issue mentioned by ALICE was solved; it affected all experiments
- NDGF:
- tomorrow IPv6 will be enabled on Norwegian pool nodes, with short interruptions potentially impacting access to some ALICE or ATLAS data
- NL-T1: ntr
- OSG:
- reminder: tomorrow's OSG release will have the new VOMS servers enabled for voms-proxy-init
- Maarten: ATLAS and CMS have made progress with validating their central services for use with the new servers, but could not declare victory just yet; users may get warnings when trying to obtain a proxy while the new servers are not yet open to the world, but the command should then fall back to the old servers
- Rob: the uptake of this new release will probably be slow; we just need to have it available, given that the old servers expire on Nov 26
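The fallback behaviour Maarten describes can be sketched generically: the client tries the new endpoints first and, on failure, warns and retries against the old ones. This is a minimal illustration only; the host names and the `contact_server` stand-in are hypothetical, not the real VOMS client internals.

```shell
#!/bin/sh
# Sketch of the proxy-request fallback described above.
# contact_server is a hypothetical stand-in for contacting a VOMS endpoint;
# here we assume the new servers are not yet reachable from the outside.
contact_server() {
  case "$1" in
    new-voms*) return 1 ;;   # new servers: simulate "not yet open to the world"
    *)         return 0 ;;   # old servers: still answering
  esac
}

get_proxy() {
  # Try new servers first, then fall back to the old ones.
  for host in new-voms1.example.org old-voms1.example.org; do
    if contact_server "$host"; then
      echo "proxy obtained from $host"
      return 0
    fi
    echo "warning: $host unreachable, trying next server" >&2
  done
  return 1
}

get_proxy
```

The user-visible effect matches the minutes: a warning while the new endpoint is closed, followed by a successful proxy from an old server.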
- PIC: ntr
- RAL: ntr
- RRC-KI:
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services: ntr
- Databases:
- we are applying security patches to the integration DBs
- GGUS:
- Grid Monitoring:
- MW Officer:
AOB:
Thursday
Attendance:
- local: Alessandro (ATLAS), Andrea M (MW Officer), Andrea S (WLCG), Belinda (storage), Felix (ASGC), Hervé (storage), Jerome (grid services), Kate (databases), Luca M (WLCG), Maarten (SCOD + ALICE), Pablo (GGUS + grid monitoring), Stefan (LHCb)
- remote: Dennis (NLT1), Gareth (RAL), Jeremy (GridPP), Kyle (OSG), Michael (BNL), Rolf (IN2P3), Sang-Un (KISTI), Thomas (KIT), Ulf (NDGF)
Experiments round table:
- ATLAS reports (raw view) -
- Daily Activity overview -- today all ATLAS internal
- CentralService/T0/T1s
- TAIWAN file transfer issue GGUS:110075 -- significant packet drops and strange network activity -- can we check with perfSONAR?
- Felix: we will follow up with our perfSONAR setup
- Alessandro: this illustrates the importance of a reliable and usable perfSONAR infrastructure throughout WLCG
- Maarten: with the latest perfSONAR version we should be getting there soon
- LHCb reports (raw view) -
- MC and user jobs. "Legacy Run1 stripping campaign": the first batch of jobs is currently being validated by physicists; running will continue early next week
- T0: One (old) VOMS server crashed and needed to be restarted (GGUS:110068)
- T1: a glitch with FTS transfers to RRC-KI on Tuesday, now fixed
Sites / Services round table:
- ASGC:
- router overload due to high traffic from Asia-Pacific, not yet understood, being investigated
- downtime next Thu for DPM disk server upgrades
- Andrea M: version 1.8.8 is OK, whereas for 1.8.9 the SAM SRM tests currently fail: that should be fixed later this month
- BNL: ntr
- CNAF:
- FNAL:
- GridPP: ntr
- IN2P3: ntr
- JINR:
- KISTI:
- the test alarm reply delay has been investigated: the message did not get filtered as spam, yet did not arrive, so the site's e-mail address appears not to be working; we are following up
- KIT: ntr
- NDGF:
- busy all day upgrading dCache servers to 2.10.9 or 2.10.10, so far all looks OK
- NL-T1: ntr
- OSG:
- the OSG release with the new VOMS servers in the client configuration happened on Nov 11 as planned; minor issues in the configuration files will be fixed in the next release; ATLAS and CMS are not affected
- PIC:
- RAL: ntr
- RRC-KI:
- TRIUMF:
- CERN batch and grid services: ntr
- CERN storage services: ntr
- Databases:
- GoldenGate servers will be migrated to new HW on Mon, with 5 min of downtime per instance
- GGUS:
- Grid Monitoring:
- SAM3 is in production. This tool will replace SUM and MyWLCG
- MW Officer: UMD 3.9.0 has been released (http://repository.egi.eu/2014/11/10/release-umd-3-9-0/). Among the released packages are the new gfal2/gfal2-utils libraries (the gfal/lcg-utils libraries are now unsupported) and new UI and WN metapackages that include them.
AOB: