WLCG Operations Coordination Minutes, July 6th 2017
Highlights
Agenda
Attendance
- local: Andrea M (MW Officer + data management), Andrea S (IPv6), Gavin (T0), Julia (WLCG), Maarten (WLCG + ALICE)
- remote: Alessandra D (Napoli), Alessandra F (Manchester + ATLAS), Alessandro (CNAF), Brian (RAL), Catherine (LPSC + IN2P3), David B (IN2P3-CC), David C (Glasgow), David M (FNAL), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Gareth (RAL), Giuseppe (CMS), Javier (IFIC), Jeremy (GridPP), Kyle (OSG), Marcelo (LHCb), Marcin (PSNC), Renaud (IN2P3-CC), Ron (NLT1), Sang-Un (KISTI), Thomas (DESY), Vikas (VECC), Xin (BNL)
- apologies: Marian (networks), ATLAS
Operations News
- The WLCG workshop took place from the 19th to the 22nd of June, hosted by the University of Manchester. Thanks to our Manchester colleagues, in particular Alessandra, for the excellent organization. The operations session chaired by Pepe covered many important areas, such as benchmarking, monitoring, information system evolution and storage space accounting. More details can be found here
- Pre-GDB on containers will be held on the afternoon of Tue July 11
- GDB will be held on Wed July 12
- the next meeting is planned for Sep 14
- please let us know if that date would present a major issue
Middleware News
- Useful Links:
- Baselines/News:
- Globus EOL in 2018 (https://www.globus.org/blog/support-open-source-globus-toolkit-ends-january-2018).
- So far it looks likely that CERN together with OSG will take over the code maintenance and support in the short term, hopefully with the continued participation of a person from NDGF. In the longer term we will look at how this code should be replaced, in particular GSI and GridFTP. Essentially this is a non-issue for now.
- perfSONAR baseline moved to v4.0.0 (from the last meeting); removed dCache 2.13 from the baseline and added dCache 2.16.39
- dCache 2.13.x reached EOL in June; among the T1s, only KIT and FNAL are still running this version.
- Some new products are expected to be released in UMD4 within this month:
- As broadcast by C. Aiftimiei, the EMI repos were shut down on 15/06.
- Issues:
- T0 and T1 services
- CERN
- Castor upgrade to 2.1.16-18 for all VOs, disk server migration to C7
- 2 load-balanced HAProxy servers deployed in front of the production FTS
- IN2P3
- Major dCache upgrade to v2.16.37
- Upgrade of Xrootd during the next stop in September
- JINR
- Minor dCache upgrade 2.16.31 -> 2.16.39 on both instances;
- minor xrootd upgrade 4.5.0-2.osg33 -> 4.6.1-1.osg33 for CMS
- KISTI
- xrootd upgrade from v3 to v4.4.1 for tape
- NL-T1:
- SURFsara: major dCache upgrade to 2.16.36 on June 6-7
- RAL:
- Castor stagers updated to 2.1.16-13 and SRMs to 2.1.16-0.
- All data now on T10KD drives/media.
- Upgrade of the FTS "prod" instance delayed due to non-LHC VOs' usage of the SOAP API. We hope to be able to upgrade during July.
- TRIUMF:
- Major dCache upgrade to v2.16.39
Discussion
Tier 0 News
- Storage: see above
- Batch capacity increases ongoing
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- The activity levels have typically been very high
- The average was 112k running jobs, with a new record of 143k on May 29
- CERN
- Some fallout from the DNS incident on June 18
- No other major problem
ATLAS
- stable production at 300k cores, about 80k used for derivations.
- derivation production is causing too many transfers; the workflow needs further optimization (e.g. 70 outputs per multicore job)
- ongoing ATLAS P1 to EOS to CASTOR data throughput test to fully validate (at approximately double the nominal rate) the data workflow from the ATLAS experiment to the tape infrastructure.
- ongoing efforts to understand sites that are not performing well (high wasted wallclock time with respect to the average of the other sites).
CMS
- CMS Detector
- Commissioning progressing
- Most effort goes into the new pixel detector
- Processing activities
- Overall utilization rather moderate
- Finished a RE-RECO of 2016 data
- Main MC production campaign for 2017 still in preparation
- Small (but urgent) RE-RECOs of recent 2017 data for commissioning
- Sites
- Deprecation of stage-out plugins
- In contact with sites to test IPv6 readiness of storages
- EOS
- Suffered from limitations in GSI authentication capacity - fixed
- Identified a source of occasional file corruptions: Improper handling of write recoveries
- Can be circumvented by setting an environment variable
- Details: GGUS:127993
- EL7 migration
- Found some issues with Singularity under certain configurations
- The recommendation is to postpone the migration, if possible
- Rising interest in CMS to use MPI compute resources for certain generators
- Sites that want to provide such resources should contact Stephan Lammel and Giuseppe Bagliesi
LHCb
- High activity on the grid, with an average of 60k running jobs
- CERN
- The proxy expiration problem in HTCondor CEs is still being investigated (GGUS:129147)
Ongoing Task Forces and Working Groups
Accounting TF
- Progress on the storage space accounting prototype has been reported at the WLCG Workshop
- At the latest Accounting TF meeting in May, the plan to add the raw wallclock job duration to the accounting portal as a separate metric was discussed. Currently the wallclock time can contain either raw or scaled values. APEL colleagues presented EGI work regarding storage space accounting.
Information System Evolution TF
- The IS evolution plans and the progress in CRIC development have been presented at the WLCG Workshop in Manchester
IPv6 Validation and Deployment TF
- Andrea S:
- we will prepare a campaign for T2 sites to start looking into their IPv6 preparations
- it will be started at a small scale, to gain experience before all sites are contacted
- we probably need a GGUS support unit and a mailing list
- the text sent to the sites needs to be very clear
- we aim to have dual-stack deployment of storage services at the vast majority of sites by the end of Run 2
- Julia:
- there should be a communication channel for sites to share experiences
- a Twiki page would be helpful for recipes etc.
- Julia: did the IPv6 session at the workshop go OK?
- Andrea S:
- there were ~30 people in the hands-on session
- the exercises were easy and went well
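For illustration, a minimal sketch (not an official TF script; the endpoint name and port are placeholders) of the kind of basic check a site could run to verify that a storage endpoint answers over both IPv4 and IPv6:

#!/usr/bin/env python
# Minimal dual-stack reachability check; HOST and PORT are placeholders.
import socket

HOST = "se.example.org"   # hypothetical storage endpoint
PORT = 443                # e.g. an HTTPS/WebDAV door

for family, label in ((socket.AF_INET, "IPv4"), (socket.AF_INET6, "IPv6")):
    try:
        # Resolve and connect using only the requested address family
        addr = socket.getaddrinfo(HOST, PORT, family, socket.SOCK_STREAM)[0][4]
        s = socket.socket(family, socket.SOCK_STREAM)
        s.settimeout(10)
        s.connect(addr)
        s.close()
        print("%s: OK (%s)" % (label, addr[0]))
    except (socket.gaierror, socket.error) as exc:
        print("%s: FAILED (%s)" % (label, exc))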
Machine/Job Features TF
Current status
MJF hosts (all sites) total: 158
- Hosts OK: 25
- Hosts WARNING: 15
- Hosts CRITICAL: 112
The warnings/errors are of just a few types (configuration mistakes), and it looks like not much effort is required to correct them.
Namely:
WARNING
- Warning Key hs06 absent (or empty): 11
- Warning Key max_swap_bytes absent (or empty): 4
CRITICAL
- Error Environment variable MACHINEFEATURES not set: 98
- Error Environment variable JOBFEATURES not set: 2
- Error Key total_cpu absent (or empty): 10
- Error Key cpu_limit_secs absent (or empty): 2
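For illustration, below is a minimal sketch of how a job can check for the keys listed above, assuming the filesystem flavour of the MJF specification, in which $MACHINEFEATURES and $JOBFEATURES point to directories containing one file per key:

#!/usr/bin/env python
# Minimal MJF check: report the keys from the summary above.
import os

def read_mjf(env_var, keys):
    base = os.environ.get(env_var)
    if not base:
        print("Environment variable %s not set" % env_var)
        return
    for key in keys:
        path = os.path.join(base, key)
        try:
            with open(path) as f:
                print("%s/%s = %s" % (env_var, key, f.read().strip()))
        except IOError:
            print("%s/%s absent (or unreadable)" % (env_var, key))

# Keys appearing in the warnings/errors above
read_mjf("MACHINEFEATURES", ["hs06", "total_cpu"])
read_mjf("JOBFEATURES", ["cpu_limit_secs", "max_swap_bytes"])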
Propagation of MJF to other experiments requires some amount of work.
In particular, Antonio (aperez@picNOSPAMPLEASE.es) wrote about CMS:
- CMS SI have worked with glideinWMS (our pilot system) developers to incorporate the information published as MJF into our pilots (where available). So potentially we could add new features (such as job-masonry, but also signaling job/node shutdown times) when the rest of dependencies are solved. One of those dependencies of course will be the deployment of MJF to the CMS sites not shared with LHCb.
Monitoring
- the status and plans were presented during the WLCG Workshop in Manchester:
MW Readiness WG
This is the status of jira ticket updates since the last Ops Coord meeting of 2017-05-18:
- MWREADY-146
- dCache 2.16.34 verification for ATLAS @ TRIUMF, also with IPv6 - completed (there was a problem when TRIUMF updated the production instance, unfortunately not spotted in the testing instance)
- MWREADY-145
- The latest version of the WN metapackage for C7 has been released (v4.0.5, renamed to wn) and tested by Liverpool. The metapackage is being included in UMD4 (GGUS:128753).
- MWREADY-147
- ARC-CE 5.3.1 under testing at Brunel.
- MWREADY-148
- New CREAM-CE for C7: we agreed with M. Sgaravatto to do the testing for CMS at LNL.
Network and Transfer Metrics WG
- Detailed WG update presented as part of the network session at the WLCG workshop in Manchester
- perfSONAR 4.0 was released on the 17th of April
- 194 nodes updated so far
- ES/Kibana dashboard showing perfSONAR infrastructure status in testing
- WLCG/OSG network services
- New central mesh configuration interface (MCA) in production (http://meshconfig.grid.iu.edu)
- New monitoring based on ETF in production (https://psetf.grid.iu.edu/etf/check_mk/)
- New OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) in production
- New LHCOPN grafana dashboards done in collaboration with CERN IT/CS and IT/MONIT in testing
- Additional perfSONAR dashboards to be added soon
- Throughput call was held on Wed May 24th at 4pm CEST (https://indico.cern.ch/event/640627/), mainly focusing on a review of the new production services
Squid Monitoring and HTTP Proxy Discovery TFs
- The CMS frontier at CERN is now using http://grid-wpad/wpad.dat with IPv6 in production. The ATLAS frontier at CERN has all this time been randomly using squids at Geneva and Wigner, regardless of the location of the worker nodes, causing much traffic to go over the long-distance links. They are now making plans to start using http://grid-wpad/wpad.dat to select local squids.
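For illustration, a minimal sketch of the discovery step: the client fetches the PAC file from http://grid-wpad/wpad.dat and then evaluates the FindProxyForURL() function it defines to pick local squids. Only the retrieval is shown here, and the URL resolves only inside networks that publish grid-wpad (e.g. CERN):

#!/usr/bin/env python
# Fetch and print the WPAD/PAC file used for proxy discovery.
try:
    from urllib.request import urlopen   # Python 3
except ImportError:
    from urllib2 import urlopen          # Python 2

WPAD_URL = "http://grid-wpad/wpad.dat"
print(urlopen(WPAD_URL, timeout=10).read().decode("utf-8", "replace"))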
Traceability and Isolation WG
Special topics
MW deployment forums and feedback
presentation
- Gavin: we take HTCondor unchanged
- Maarten: but you enhance it, e.g. with the BDII info provider; furthermore, the matter is not just about patches, but about deployment in general
- Julia:
- the fts3-steering list is a good example, though it only involves VOs and devs
- in general the fora would need to allow VOs, sites and devs to participate
- feedback from sites should be collected and made easily available for others
- deployment documentation, workarounds etc.
- Maarten:
- the MW Readiness WG is the right place to have such things organized
- in the Sep meeting we will have a checkpoint on the progress
Theme: Providing reliable storage - IN2P3
presentation
- Maarten: do you have some services permanently available on a UPS?
- IN2P3-CC:
- the whole building is on a UPS with a minimum lifetime of about 30 minutes
- its main function is to allow switching to the other power line transparently
- if needed, we can extend the lifetime by starting to switch off all the WNs etc.
- Julia: how often do you see file losses from tape?
- IN2P3-CC:
- typically a few files per month
- such incidents tend to get revealed during repack operations
- Xin: couldn't most such files be recovered by the vendor?
- IN2P3-CC:
- we usually try other ways to recover the files first (other tapes or copy from another site)
- even if the vendor manages to recover part of the data, the files typically are corrupted
- Vikas: what are your RAID group disk size and rebuild times?
- IN2P3-CC:
- each disk is 6 to 8 TB, the next ones will be 10 TB
- we have ~145 TB per server
- the rebuild time is ~24h
- we need to rebuild 1 or 2 times per year
- Vikas: 24h is a rather big window for another disk to fail as well...
- Maarten: various parameters need to be taken into account and optimized together; in the end there will always be a calculated risk (a rough illustration follows at the end of this section)...
- IN2P3-CC:
- for the evolution of our tape system we see 2 options:
- move to IBM Jaguar, which would imply replacing the whole library
- move to LTO, which so far we have only used for backups in TSM
- we would like to discuss such matters e.g. in HEPiX
- get an idea on reliability experiences at other sites
- Alessandro:
- we have the same matter to deal with at CNAF
- we have had meetings with several vendors (IBM, Quantum, Spectra Logic)
- we heard some sites are staying with T10KD for the time being
- LTO may not be good enough for heavy stage-in and -out operations
- we support the revival of the tape forum to discuss these things
- Julia:
- we will first follow up with the owner of the existing list
- we will ensure there will be a forum and announce it
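Rough illustration of the rebuild-window risk mentioned in the discussion above (the RAID group size and per-disk annual failure rate are assumed numbers for illustration, not IN2P3-CC figures):

#!/usr/bin/env python
# Back-of-envelope estimate: chance that another disk in the same RAID
# group fails during a ~24h rebuild window.
disks_in_group = 12          # assumed RAID group size
annual_failure_rate = 0.03   # assumed per-disk AFR (3%)
rebuild_days = 1.0           # ~24h rebuild, as reported above

p_second_failure = (disks_in_group - 1) * annual_failure_rate * (rebuild_days / 365.0)
print("P(second failure during one rebuild) ~ %.2f%%" % (100 * p_second_failure))
# ~0.09% per rebuild with these assumptions; the acceptable level depends
# on the RAID level, the number of groups and the rebuild frequency --
# the "calculated risk" referred to above.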
Action list
| Creation date | Description | Responsible | Status | Comments |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see the MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting. March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5, to be released in May. May 18 update: UMD 4.5 has been delayed to June. July 6 update: UMD 4.5 has been delayed to July. |
| 03 Nov 2016 | Review the VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI |
| 26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | May 18 update: EGI collected feedback from sites and propose a compromise: 3 days' notice for any scheduled downtime |
| 18 May 2017 | Follow up on the ARC forum for WLCG site admins | WLCG Operations | In progress | |
| 18 May 2017 | Prepare discussion on the strategy for handling middleware patches | Andrea Manzi and WLCG Operations | In progress | |
| 06 Jul 2017 | Ensure a forum exists for discussing tape matters | WLCG Operations | New | |
Specific actions for experiments
Specific actions for sites
AOB