WLCG Operations Coordination Minutes - February 5th, 2015
Agenda
Attendance
- local: Maria Dimou (secretary), Andrea Sciabà (chairman), Prasanth Kothuri (IT-DB), Andrea Manzi (MW Officer), Maarten Litmaath (ALICE), Christoph Wissing (CMS), Jerôme Belleman (T0), Thomas Hartmann (KIT), Marian Babik, Stefan Roiser (LHCb).
- remote: Antonio Perez Calero Yzquierdo (PIC), Dave Mason (FNAL), Di Qing (Triumf), Catherine Biscarat (IN2P3), Jeremy Coles (GridPP), Yury Lazin (NRC-KI), Brij Kishor Jashal (Indian CMS T2), Hung-Te (Felix) Lee (ASGC), Massimo Sgaravatto (Legnaro/Padova), Rob Quick (OSG), Ulf Bobson Severin Tigerstedt & baby (NDGF), Dave Dykstra (FNAL).
Operations News
Middleware News
- Baselines:
- frontier/squid 2.7.STABLE9-22: includes a patch from Debian, backported from the fix for a denial-of-service vulnerability reported for squid3. The upgrade is suggested only for squids exposed to the Internet.
- CREAM 1.16.4: verified by MW Readiness; it will soon be included in UMD and set as baseline.
- MW Issues:
- T0 and T1 services
- CERN
- upgraded EOS for CMS to 0.3.94
- RRC-KI-T1
- upgraded dCache for LHCb to 2.6.38 and EOS xrootd for ALICE to 0.3.29
- planned to upgrade dCache for ATLAS to 2.10.17
- JINR-T1
- upgraded dCache to 2.10.17
- planned to install FTS 3 later in February and decommission FTS2 at the same time.
- CNAF
- upgraded StoRM for ATLAS to 1.11.6
- planned to upgrade StoRM for CMS as well
- NDGF
- upgraded dCache to 2.11.7/2.11.8
Tier 0 News
- VOMRS decommissioning and replacement by VOMS-admin: the new VOMS-admin release has been deployed for testing. Test feedback from the experiments is gathered here: https://ggus.eu/index.php?mode=ticket_info&ticket_id=110227 . If there are no showstoppers, it will be deployed in production in the middle of next week, and VOMRS will be decommissioned on Feb 16th. Asked what will happen to the old VOMRS machines, Alberto Peon reported by email that the plan is to freeze the VOMRS database and keep the VMs around for a while (with VOMRS stopped), in case we need to revert to the old service.
- Access to the AFS UI was closed on Monday 2nd Feb, as announced. No tickets received till now.
Tier 1 Feedback
Tier 2 Feedback
Jeremy relayed a question from UK sites about access to the Monitoring reports for people without a CERN login. Maarten suggested a lightweight account. Further input from sites to Jeremy shows that these accounts give insufficient access rights (to some e-groups and twikis at most). To be followed up offline.
Experiments Reports
ALICE
- continued high activity over the past weeks
- SARA (NLT1) data loss impact
- 108k ALICE files (~8 TB) lost
- catalog cleanup and re-replication have finished
- 11804 files had a single replica there, none of which were very important anymore
- mainly logs and intermediate QA results
- VOMS-Admin testing
- admin feedback provided in the ticket (GGUS:110227)
- user experience to be tested
Ulf said that ALICE Xrootd transfers to dCache can result in dark data when the destination disk server is very busy and the upload then times out on a checksum calculation.
Apparently it is a matter to be improved on the dCache side, not in the ALICE client.
ATLAS
No report. Software week.
CMS
- Production/Processing overview
- Upgrade Digi-Reco on Tier-1s
- bigger Run2 MC production campaign launched this week
- Tape staging test at Tier-1 sites
- Coordinated with CMS contacts and via GGUS tickets
- Presently PIC and KIT
- Consistency checks (SE inventory vs. CMS catalog) at all Tier-1s
- Coordinated through GGUS tickets
- So far no (big) impact due to decommissioning of the Grid-UI in AFS space at CERN
- 50% of Tier-1 capacity multi-core enabled
- If site has dedicated multi-core resources, it should provide this fraction
- Will be partly used in "partitionable slot mode" (running n single-core jobs in an n-core multi-core pilot)
- Long lifetime of pilots preferred -- what is still feasible for the sites?
- Migration to a single global Condor pool for Analysis and Production (almost) done
- Tier-2 will stop receiving jobs with VOMS role production
- Will request changes in fairshare configuration in the next few weeks - will be reported also here
- Pushing for some site configurations
LHCb
- Operations
- "Run1 Legacy stripping"
- mostly finished, last files are being merged
- many problems because of SARA disk failure, 40k unmerged files (single replica) were affected and need to be declared lost
- SARA disk failure
- 40k production (mostly stripping) and 60k user files were lost. Many of the user files also had only a single replica.
- Waiting for the SIR
- CERN simulation jobs started failing on Tuesday and none has succeeded since. Job assignment to CERN was first disabled completely and is now running at a very low pace to check whether the situation is changing.
- Discovered some wrongly reported numbers for KIT and CERN in the APEL portal
- WLCG services
- VOMS testing
- one issue found with an LHCb member not visible via the new interface GGUS:110227#update#70
- Open question about the "users entry page" when joining the VO or re-signing the AUP. NB: this is not in GGUS:110227. It should be added there or it won't be done...
- Other
- FYI as of next meeting the report will be created and presented by the "Grid Expert on Call" of the week.
On the wrong KIT data, Christoph said that he had reported the lack of KIT accounting data for 2 months. Progress in GGUS:111117. Further comment by John Gordon by email:
Today KIT republished their data from the last few months with the correct HEPSPEC. This was after a successful test on December's data. So the issue is close to resolution. CERN are migrating their publishing from SSM1.2 to SSM2. I've told them they have a gap in December publishing around the time they changed. There is a ticket open.
Ongoing Task Forces and Working Groups
gLExec Deployment TF
- gLExec in PanDA:
- testing campaign is covering 54 sites so far (out of 105)
- almost all are usually OK
- some issues with gLExec infrastructure or implicit assumptions for ATLAS workflows
- to be ramped up further in the coming weeks
SHA-2
- retirement plans for the old VOMS servers
- special firewall and router configurations will remain in place until VOMRS is gone
- a VOMS configuration reminder broadcast will be sent shortly before
Machine/Job Features
- Contacted by French and UK sites for test deployments on their batch systems. One point to review was the level of documentation provided for the sites to successfully deploy MJF on their batch systems; this is currently being discussed with the providers of the individual scripts. This TF will meet again to record completed actions and then close.
Middleware Readiness WG
- Full minutes of the last MW Readiness WG meeting are now available from MWReadinessMeetingNotes20150121
- Thanks to the MW Officer, ATLAS, CMS and the Volunteer Sites a lot of progress is being made for all Actions, namely:
- Condor-G verification is in the pipeline via the ATLAS pilot factory. JIRA:MWREADY-39
- Brunel and Liverpool Universities offered to participate in the Readiness verification of the ARC CE. JIRA:MWREADY-37.
- The MW Officer is starting to investigate with ATLAS & CMS the use of the Prometheus test system, a small dCache instance (currently a single node), offered by DESY, for people who want to help test whether the next major release of dCache (currently 2.12) has any problems. JIRA:MWREADY-36.
- StoRM 1.11.6 is in the pipeline for testing at QMUL. JIRA:MWREADY-18.
- GRIF starts DPM testing with xrootd 4 for CMS.
- dCache 2.11.8 testing for ATLAS starts at NDGF. JIRA:MWREADY-38. Triumf will probably soon follow.
- EOS testing entered the MW Readiness activity for CMS. JIRA:MWREADY-40 (https://its.cern.ch/jira/browse/MWREADY-40).
- Thanks to the Volunteer Sites which installed the new version of the MW Package Reporter. More are very welcome.
Multicore Deployment
IPv6 Validation and Deployment TF
- Thanks to Duncan, there is a dual-stack perfSONAR mesh that for the moment includes the sites participating in the IPv6 working group.
- An IPv6 test will be added to the Nagios server testing the pS instances (as announced by the N&T metrics WG)
Ulf reported dCache 2.12 testing being done with IPv6 in the framework of the relevant HEPiX WG.
Squid Monitoring and HTTP Proxy Discovery TFs
- 154 squid services registered so far in GOCDB or OIM, 178 squids total
- will need a targeted campaign to get the remaining squids registered that we know about
- next need to improve the specification of exceptions and generate separate ATLAS & CMS pages
- Squids that are restricting CVMFS server destinations will need to be updated (if they haven't yet) to allow access to some new Stratum 1s serving egi.eu and opensciencegrid.org repositories
- easy to do with pre-defined acl in recent frontier-squid releases
Stefan will send the CMS and LHCb links to the dashboard that uses a CVMFS tool discovering which sites are already updated. The SAM test checks every configured StratumOne for every configured HTTP proxy. EGI could be interested in running these tests by themselves, but they are based on the WN, for which EGI are going to phase out the tests altogether.
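The check described above, every configured Stratum 1 tried through every configured HTTP proxy, amounts to probing the full cross product of the two lists. A minimal sketch follows, assuming nothing about the real SAM test: the host names, the function names and the probe are all hypothetical, and the probe is a stand-in that performs no network access.

```python
from itertools import product
from typing import Callable, Dict, List, Tuple

Pair = Tuple[str, str]  # (proxy, stratum_one)


def check_matrix(
    proxies: List[str],
    stratum_ones: List[str],
    probe: Callable[[str, str], bool],
) -> Dict[Pair, bool]:
    """Probe every (proxy, Stratum 1) combination and record the outcome."""
    return {(p, s): probe(p, s) for p, s in product(proxies, stratum_ones)}


def failing_pairs(results: Dict[Pair, bool]) -> List[Pair]:
    """Return the combinations whose probe failed, in sorted order."""
    return sorted(pair for pair, ok in results.items() if not ok)


# Hypothetical host names, for illustration only.
PROXIES = ["squid-a.example.org", "squid-b.example.org"]
STRATUM_ONES = ["s1-old.example.org", "s1-new.example.org"]


def fake_probe(proxy: str, stratum_one: str) -> bool:
    """Stand-in for a real HTTP request: squid-b still blocks the new server."""
    return not (
        proxy == "squid-b.example.org" and stratum_one == "s1-new.example.org"
    )


results = check_matrix(PROXIES, STRATUM_ONES, fake_probe)
for proxy, stratum_one in failing_pairs(results):
    print(f"{proxy} cannot reach {stratum_one}")
```

In a real deployment the probe would issue an HTTP request for a repository file through the given proxy; a squid whose destination restrictions have not been updated would then show up as a failing pair for the new Stratum 1s only.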
Network and Transfer Metrics WG
- WG still waiting on input from ATLAS on use-cases/requirements for network metrics
- Meeting to discuss the use cases will be held on 18th of February (https://indico.cern.ch/event/372546/)
- 2nd broadcast was sent to remind sites to update to 3.4.1 - final deadline is 16th of February - sites that won't update by this date will receive tickets
- Production version of perfSONAR infrastructure monitoring available at http://pfomd.grid.iu.edu/ (you need to have your certificate loaded in the browser to access)
- Pilot versions of maddash and datastore (http://pfds.grid.iu.edu) available
- perfSONAR operations meeting was held last week - minutes available at https://indico.cern.ch/event/369420/
- Agreed to start full mesh latency testing starting with top-k sites and gradually moving to all sites
- Follow up campaign to bring all perfSONARs to the correct configuration
The experiments are encouraged to test the API that gives access to PerfSONAR data. Stefan has students who work on extracting data from the message bus of the PerfSONAR box. This is experiment-agnostic so it can be made available for all.
Action list
- ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Status: HTCondor CE tests enabled in production on SAM CMS; sites publishing sam_uri in OIM will be tested via HTCondor (all others via GRAM). Number of CMS sites publishing HTCondor CE is increasing.
- Ongoing discussions on publication in AGIS for ATLAS.
- ONGOING on experiment representatives - report on voms-admin test feedback
- Experiment feedback and feature requests collected in GGUS:110227
AOB
- Next meeting in 2 weeks, Feb 19th. GDB next week, Feb 11th.
- Missing mouse from the meeting room. Opened SNOW:INC0727932.
--
MariaDimou - 2015-02-05