WLCG Operations Coordination Minutes - September 18th, 2014
Agenda
Attendance
Operations News
- At the last MB, held on Tuesday, WLCG Operations presented a set of preliminary ideas on where WLCG Operations could improve and save sites some effort. The MB decided that WLCG Operations will investigate these ideas in more detail and carry out a survey among sites to find out more specifically where operational effort currently goes. This will help identify areas where operational costs could be optimised. The survey will be prepared in the coming days and sites will be contacted and asked to fill it in.
- The lack of support for certain ARGUS components was presented at the MB and a series of actions were defined to try to find a solution for this problem.
Middleware News
- Baselines:
- New EMI release last week
- Fix for APEL parsers 2.3.1 (compressed accounting log parsing failure) released in EMI-3
- New BDII version (1.6.0) released in EMI-3, containing small fixes for GLUE 2 and configuration changes to enhance performance; to be set as baseline when released in UMD too
- New versions of the EMI-3 UI and WN are available. New dependencies such as gfal2-util and ginfo are included, plus a new bouncycastle-mail version required for SHA-2 VOMS. The updated meta packages are not yet in UMD.
- CASTOR new release 2.1.14-14, already installed at CERN
- FTS3 new release 3.2.27, already installed at CERN
- Frontier/Squid 2.7.STABLE9-19, enhancements and bug fixes
- MW Issues:
- StoRM and the new glue-validator check: a check introduced in the new glue-validator version uncovered an issue in the StoRM info-provider. The glue-validator is integrated as a Nagios operational probe for EGI; for the moment it has been downgraded while waiting for a fix of the StoRM info-provider or a workaround inside glue-validator
- T0 and T1 services
- FTS2 decommissioning
- CERN
- EOSCMS upgraded to 0.3.35
- CASTOR upgraded to 2.1.14-14
- FTS upgraded to 3.2.27 (going to be upgraded on other sites in the coming days)
- NDGF-T1
- dCache upgraded to 2.10.4-SNAPSHOT
Oracle Deployment
Tier 0 News
- AFS UI:
- Ongoing work to understand the use cases. A very common one is fetching CRLs; the normal approach for grid services is to keep the CA certificates locally and let the standard fetch-crl cron job update the CRL lists regularly. We have documented how to configure this both with Quattor (https://cern.service-now.com/service-portal/article.do?n=KB0002773) and Puppet (https://cern.service-now.com/service-portal/article.do?n=KB0002772)
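For illustration, the cron entry shipped by the fetch-crl packages looks roughly like the fragment below; exact paths and timings may differ per distribution, so treat this as a sketch rather than the definitive file:

```shell
# /etc/cron.d/fetch-crl -- illustrative fragment; the fetch-crl RPM ships its own entry.
# Run fetch-crl every 6 hours, quietly (-q), with a random delay of up to
# 360 seconds (-r 360) to spread the load on the CA servers.
45 */6 * * *  root  [ -x /usr/sbin/fetch-crl ] && /usr/sbin/fetch-crl -q -r 360
```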
- Access statistics: a 124-minute sampling period was taken on the AFS server that exclusively hosts both AFS UI volumes. Many non-CERN stats and reads came from ific.uv.es, unige.ch, IN2P3 and a few others; the CERN clients are mostly batch nodes and lxplus. Distinct users, sorted by phonebook entry:
- 50 PH-UCM, 14 PH-UAT, 10 Other, 7 PH-ULB, 3 PH-CMG, 1 PH-UAI,1 PH-LBD, 1 PH-LBC, 1 PH-AIP, 1 IT-SDC, 1 IT-PES, 1 IT-ES
- WMS for EGI SAM: With the decommissioning of the WMS instances for SAM scheduled for October 1st, the only remaining use case is availability and reliability monitoring for EGI; for this we will use a catch-all WMS provided by EGI
- plus5/batch5: User input has been analysed; some 40-50 tickets have been opened. A common pattern is the need for a platform to build binaries under SLC5; we have provided a KB article describing how to set up a private VM under OpenStack with all the relevant RPMs (see https://cern.service-now.com/service-portal/article.do?n=KB0002752
). This appears to address many users' concerns. Of the remaining tickets, a large fraction comes from CMS and TOTEM users. We are setting up a dedicated meeting with CMS and TOTEM representatives in order to better understand the situation and clarify the next steps to be taken. Other user communities raising concerns are NA48, NA49 and NA61, with whom we will also meet.
Tier 1 Feedback
Tier 2 Feedback
- GridPP: Raises a question about the move from gfal to gfal2. There were indications that gfal (no longer supported) would be removed from EPEL in early November, yet gfal2 has only just arrived in the WN release (3.1.0), leaving only a short time to check for issues. Responses on the WLCG ops coordination list now suggest gfal will not be removed in November; in that case there is no problem, nor any need to discuss whether the WN baseline should now be changed from v3.0.0 to v3.1.0.
Experiments Reports
ALICE
- CERN: investigation of job failure rates and inefficiencies
- RSS limit was increased to 2.3 GB - thanks!
- more jobs were still seen getting killed unexpectedly
- further details will be provided to the batch team
ATLAS
- apologies: nobody from ATLAS can be present due to the ongoing ATLAS SW and Computing week.
- no major news to report with respect to 2 weeks ago: many improvements/steps done on the commissioning and integration of Rucio and ProdSys2, but still a long way to go (on the order of several weeks)
- activities as usual.
CMS
- Processing overview:
- Upgrade (Phase2) processing with high priority at Tier-1
- Bad incident with the xrootd fallback test on Sep 16
- An unfortunate version of the test caused an exit-state error
- Site readiness will be corrected
- Space monitoring at sites
- Want to close 'unknowns' in the storage accounting
- Present Phedex space accounting is (known to be) incomplete
- Only official datasets are tracked, no user data
- Will be based on storage dumps
- Upload locally pre-processed data to a central DB
- Privacy of space usage of individual users can be kept
- Site controls level of aggregation to be exposed
- Same storage dumps can be used for consistency checks (locally stored files vs catalog)
- Prototype ready
- Request sites to join
- More details: https://twiki.cern.ch/twiki/bin/view/CMSPublic/SpaceMonSiteAdmin
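The consistency check mentioned above (locally stored files vs catalog) can be sketched with standard tools; the file names below are invented for illustration, and a real check would compare a full storage dump against the experiment catalogue listing:

```shell
# Illustrative consistency check: compare a storage dump against the
# catalogue with comm(1); both input lists must be sorted.
printf '%s\n' /store/data/a.root /store/data/b.root /store/data/c.root > storage_dump.txt
printf '%s\n' /store/data/a.root /store/data/c.root /store/data/d.root > catalogue.txt
# files on disk but unknown to the catalogue ("dark data")
comm -23 storage_dump.txt catalogue.txt   # -> /store/data/b.root
# files in the catalogue but missing from disk
comm -13 storage_dump.txt catalogue.txt   # -> /store/data/d.root
```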
- AFS-UI at CERN
- Services are being migrated to SL6 (with no dependency on AFS-UI)
- Extending the UI availability beyond Oct 31st is still preferred
- Request a supported Grid UI in CVMFS
- CRAB3 clients outside CERN
- Opportunistic resources (via Parrot)
- Closing of lxplus5/lxbatch5
- CMS answer sent by the Computing coordinator
I get back to you about the SLC5 support. CMS would like to keep a number of interactive machines on SLC5. We expect on the order of 10 virtual machines would satisfy the CMS use cases on SLC5. These will be used to compile software and allow legacy analyses from Run 1 to be completed, with a timescale of Spring 2015.
- AAA exercise
- Recommended monitoring: http://dashb-wlcg-transfers.cern.ch/ui/#
- Known to be incomplete, but should indicate global trends
- Select VO cms and xrootd
- Time line
- Sep 5-20: 'Overflow' running in US
- Job sent to the data -> redirected elsewhere when waiting in the queue
- Done at the 1% level of analysis jobs in the US for over a year
- Increased rate to ~15% with peaks to 25% (typical US-T2 analysis job load is 11k)
- Sep 21-28: HammerCloud test for European/Asian region at 10% WAN access
- First half of Oct: Overflow in the US and HC test in EU/Asia at 20% level
- Reminders for sites
- Update xrootd fallback configuration
- Opened tickets to various sites - quite some took action - thanks!
- Add "Phedex Node Name" to site configuration
LHCb
Ongoing Task Forces and Working Groups
gLExec Deployment TF
Machine/Job Features
Middleware Readiness WG
Multicore Deployment
Accounting: Publishing multicore accounting to APEL works. ARC CEs publish correctly. For CREAM CEs it only works with an EMI-3 CE, and it has to be enabled in the configuration: edit /etc/apel/parser.cfg and set the attribute parallel=true. If a site was already running multicore before upgrading and/or applying this modification, it needs to reparse and republish the corrected accounting.
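A minimal sketch of the configuration change; the real file is /etc/apel/parser.cfg (here we edit a local copy), and the section name below is a placeholder since the exact layout depends on the APEL version deployed at the site:

```shell
# Illustrative only: enable parallel (multicore) accounting in the APEL
# parser configuration. "[batch]" is a placeholder section name.
cat > parser.cfg <<'EOF'
[batch]
parallel = false
EOF
# flip the flag so multicore usage records are published
sed -i 's/^parallel *=.*/parallel = true/' parser.cfg
grep '^parallel' parser.cfg   # -> parallel = true
```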
SHA-2 Migration TF
- introduction of the new VOMS servers
- by the Sep 15 deadline only a few T2 sites were failing the preprod tests
- open tickets (~8 on Sep 18)
- most of the working T3 sites were also OK on time
- SAM prod machines use the new servers since Tue Sep 16
- now we need to check experiments' job and data systems ASAP
- we can gradually open the node firewalls to selected "customers" per experiment
- e.g. lxplus-testing.cern.ch
- we could keep the new servers blocked for outside CERN a while longer
- EGI are running a campaign for Ops via the NGIs and ROCs
- first try to ensure sites are configured OK
- then switch SAM-Nagios instances, probably around the end of Sep
- the old VOMS servers will continue running until the end of Nov
- their final SHA-1 host certs expire by then
WMS Decommissioning TF
- Deployment of the Condor-based SAM probes planned on Wed 1st of October 2014
- Discussing with ATLAS to also make ARC-CE tests critical for WLCG monthly reports on the same date (4 sites that would be affected are: ARNES, SE-SNIC-T2, SiGNET, UNIBE-LHEP)
IPv6 Validation and Deployment TF
Squid Monitoring and HTTP Proxy Discovery TFs
- No progress to report today.
- OSG attempted to consolidate the Squid field in the OIM but still needs to adjust it further.
- The automated squid monitor still needs the same small changes reported last meeting.
Network and Transfer Metrics WG
- Kick-off meeting minutes and slides available at https://indico.cern.ch/event/336520/
- The meeting had very good participation, including the experiments, the ESnet Science Engagement Group (perfSONAR development team), PanDA, PhEDEx, FTS and FAX, as well as the majority of the perfSONAR regional contacts. An initial overview of the current status of network and transfer metrics was presented, and a list of topics and tasks to work on in the short term was proposed. Very good feedback was received and we have agreed on the topics to discuss at the follow-up meetings.
- Please check Twiki for updated task table
- 5 sites received tickets on running an outdated version of perfSONAR
- Follow up meetings:
- Metrics area meeting focusing on use cases and review of the transfer systems (T1.1, T1.2)
- Meetings focusing on perfSONAR operations (T2.1):
Action list
- ONGOING on the WLCG middleware officer and the experiment representative: for the experiments to report their usage of the AFS UI and for the middleware officer to take the steps needed to enable the CVMFS UI distribution as a viable replacement for the AFS UI.
- The CVMFS grid.cern.ch repository contains emi-ui-3.7.3 for SL6 (path /cvmfs/grid.cern.ch/emi-ui-3.7.3-1_sv6v1) and also provides CA certs, CRLs and VOMS lsc files. Given the new UI release, we can also plan to upload UI v3.10.0.
- ONGOING on Andrea S.: to understand with EGI if it is possible to bypass the validation of VO card changes in the case of the LHC VOs. Agreed by Peter Solagna.
- ONGOING on the WLCG monitoring team: status of the CondorG probes for SAM to be able to decommission SAM WMS.
- ONGOING on the WLCG monitoring team: evaluate whether SAM works fine with HTCondor CE. Report about showstoppers. Status: the SAM team made a proposal on the steps to be taken to enable SAM. ATLAS is following up to make sure that the new CEs are correctly visible in AGIS, while for CMS the VO feed will be taken directly from OIM. The plan is to first test HTCondor CEs in preproduction and later switch to production. It is not foreseen to monitor GT5 and HTCondor endpoints on the same host at the same time.
- Maarten comments that from private communication with Marian, SAM seems fine after the arrangement with OSG to hack the information needed for discovery into OIM.
- NEW on Alessandro DG: find out from OSG about plans for publication of HTCondor CE in information system, and report findings to WLCG Ops. To be followed up with Michael Ernst and Brian Bockelman.
AOB
- (MariaD) Confirmed "GGUS status and news" talk at the GDB on October 8. Please write to ggus-info@cern.ch about any issues you would like presented, explained or clarified in this presentation, in particular issues of interest to the experiments we support and/or WLCG Ops in general. Also, please say now if you wish a meeting session with the GGUS development team about candidate new features before or after the presentation.
--
MariaALANDESPRADILLO - 20 Jun 2014