WLCG Operations Coordination Minutes, Jan 30, 2020
Highlights
- AVX2 on the WLCG infrastructure. Collecting CPU flags
Agenda
https://indico.cern.ch/event/883512/
Attendance
- local: Robert Currie (LHCb), Concezio Bozzi (LHCb), Jaroslava Schovancova (CERN)
- remote: David Cohen (IL-TAU-HEP), Di Qing (Triumf), David Mason (FNAL), Eric Fede (IN2P3), Giuseppe Bagliesi (CMS), Helge Meinhard (CERN), Johannes Elmsheuser (ATLAS), Matthew Steven Doidge (Lancaster), Panos Paparrigopoulos (CERN), Peter Love (ATLAS), Petr Vokac (ATLAS/Prague), Stephan Lammel (CMS)
- apologies:
Operations News
Special topics
AVX2 on the WLCG infrastructure. Collecting CPU flags
see presentation
Discussion
- David Mason: what is the purpose of collecting AVX2 information in CRIC or other central repository?
- Johannes: it is not realistic to use this info for brokering
- Julia: not for brokering, but rather to assess the situation on the whole WLCG and at particular sites
- Helge questioned how likely a performance gain is, since it has not yet been demonstrated by any experience
- Johannes: ATLAS did perform testing and no performance gain was demonstrated
- Julia: Has LHCb performed any analysis in this respect?
- Concezio: There have been some attempts for the software trigger, which showed that code optimization would be required in order to get a performance gain
- David Mason: The effort to understand the situation with AVX2 and other features is useful. However, it is not clear how this information can be used to direct jobs to the appropriate resources
- Julia: Is it possible that the test does not always show correct results, i.e. AVX2 is enabled on the WN but the test reports a false negative?
- Jarka: The flag could be hidden by virtualization. This will be investigated further (see the sketch below)
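For illustration, a minimal Python sketch of the kind of check discussed above (an assumed probe, not the actual collection code): it reads the CPU flags advertised by the kernel in /proc/cpuinfo on a Linux worker node and reports whether AVX2 and related instruction sets are present. On a virtual machine the hypervisor may mask these flags, which could explain the false negatives.

# Illustrative sketch only: list the instruction-set flags a worker node advertises.
def cpu_flags(path="/proc/cpuinfo"):
    """Return the set of CPU flags advertised by the kernel."""
    flags = set()
    with open(path) as f:
        for line in f:
            if line.startswith("flags"):
                flags.update(line.split(":", 1)[1].split())
    return flags

if __name__ == "__main__":
    advertised = cpu_flags()
    for flag in ("sse4_2", "avx", "avx2", "avx512f"):
        print("%-8s %s" % (flag, "yes" if flag in advertised else "no"))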
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Normal to high activity in recent weeks
- Lots of everything: MC, reconstruction, analysis trains, user jobs
- No major issues
ATLAS
- Over the Christmas break very smooth and very stable Grid production with ~400k concurrently running grid job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and user analysis, including ~75k slots from the HLT/Sim@CERN-P1 farm.
- Started the RAW/DRAW reprocessing campaign with data18 in data/tape carousel mode last Tuesday (21 Jan) at Tier1s and CERN CTA. The FTS CERN production instance was overloaded by the bulk injection of tape staging, data rebalancing and data consolidation requests. Other FTS instances are working fine. This high load caused some Tier1 staging throughput degradation in the past days. The FTS expert significantly improved the situation in DBonDemand on Tuesday. data17 staging will start in a few weeks.
- No other major issues apart from the usual storage- or transfer-related problems at sites.
- Affected by network switch/CEPH incident on Jan 22/23, but speedy recovery by restarting a few systems
- Discussions with the CTA team to put CTA in production
- ATLAS discussions about how to move forward with TPC testing - e.g. switch one site to non-gridftp mode and use it for bulk transfers
- AGIS to CRIC migration in progress
- Grand unification of PanDA queues on-going: unify separate production and analysis queues for more dynamic job scheduling.
CMS
- running at about 250k cores during last month
- usual production/analysis mix (75%/25%)
- ultra-legacy re-reconstruction of 2017 data almost complete
- ultra-legacy re-reconstruction of 2018 data progressing well
- short on disk space at many sites due to changed production pattern, extra cleaning effort underway
- a first certificate authority has a root certificate with an expiration date in summer 2038, i.e. after the point where a UNIX date stored as a signed 32-bit integer rolls over; older certificate utilities still use a signed 32-bit integer, so in particular xrootd versions below 4.9.0 fail (see the Comment and the sketch below);
- CERN network outages required manual intervention to recover; they occurred at fortunate times and were thus not too disruptive
- SSB dashboard switch to MonIT scheduled for February 17th
Comment
There are 11 CAs whose expiration dates are beyond 2038, some of which have already been in production use for WLCG for a few years
For example, these CMS SEs at GRIF:
grid05.lal.in2p3.fr
node12.datagrid.cea.fr
polgrid4.in2p3.fr
Their host certificates are all signed by this CA:
$ openssl x509 -noout -subject -dates -in /etc/grid-security/certificates/AC-GRID-FR-Services.pem
subject= /C=FR/O=MENESR/OU=GRID-FR/CN=AC GRID-FR Services
notBefore=Sep 30 08:00:00 2016 GMT
notAfter=Sep 30 08:00:00 2040 GMT
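To illustrate the 32-bit issue: the notAfter date above lies beyond the January 2038 rollover of a signed 32-bit time_t, so any utility that stores the expiration time in such an integer overflows. A minimal, purely illustrative Python check:

import calendar, time

# notAfter of the "AC GRID-FR Services" CA shown above (a UTC date)
not_after = calendar.timegm(time.strptime("Sep 30 08:00:00 2040", "%b %d %H:%M:%S %Y"))
int32_max = 2**31 - 1  # largest signed 32-bit time_t, reached on 19 Jan 2038
print("notAfter as UNIX time:", not_after)
print("fits in a signed 32-bit integer:", not_after <= int32_max)  # prints False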
LHCb
- Normal activity over past month
- Mostly MC with Stripping campaign starting in the last week
- Had to restart some services after the network outage, but no major disruptions; this highlighted the reliance on the GitLab service
- No major issues
Task Forces and Working Groups
Upgrade of the T1 storage instances for TPC
GDPR and WLCG services
Accounting TF
- Tier 0, 1 and 2 sites took part in the validation of the December accounting data. CERN showed good agreement with the auto-generated data for CPU, disk and tape storage. Good agreement was also demonstrated with the ATLAS and LHCb CPU accounting data.
Archival Storage WG
Containers WG
CREAM migration TF
dCache upgrade TF
- Out of 44 dCache sites used by the LHC VOs, 21 sites still need to migrate. The migration should be accomplished by spring 2020.
- There is an issue with SRR at some sites, which publish an empty list of data shares. It is being followed up with the dCache experts (a simple sanity check is sketched after the discussion below)
Discussion
- Peter Love: Would it be possible to accelerate the upgrade?
- Julia: We started at the end of autumn and are progressing pretty well. Sites are participating well. We hope that the majority of sites will have migrated by the beginning of spring, which would be a good result
- Petr Vokac: There are also StoRM sites which would need to migrate and enable SRR.
- Julia: Good point. We need to get in touch with Andrea to ask for SRR documentation for StoRM, and then we can start with StoRM as well. There are not too many sites, so most probably we do not need a task force for it.
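For illustration only, a minimal Python sketch of the kind of SRR sanity check mentioned above: it fetches a site's SRR JSON (the URL is hypothetical) and warns when the list of data shares is empty. The storageservice/storageshares layout is assumed to follow the usual SRR schema.

import json
import urllib.request

# Hypothetical endpoint; each site publishes its SRR JSON at a site-specific URL.
SRR_URL = "https://storage.example.org/static/srr/storagesummary.json"

def check_srr(url=SRR_URL):
    with urllib.request.urlopen(url) as resp:
        srr = json.loads(resp.read().decode())
    # Assumed layout: top-level "storageservice" containing a "storageshares" list.
    shares = srr.get("storageservice", {}).get("storageshares", [])
    if not shares:
        print("WARNING: empty storageshares list - SRR publication looks incomplete")
    for share in shares:
        print(share.get("name"), share.get("usedsize"), "/", share.get("totalsize"))

if __name__ == "__main__":
    check_srr()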
DPM upgrade TF
- Out of 55 DPM sites used by the LHC VOs, 5 are left to upgrade and reconfigure, and 6 are upgraded but still have to be reconfigured. The migration should be accomplished by the end of February.
Information System Evolution TF
- The REBUS functionality in CRIC is being validated by the WLCG Project Office (Cath)
- The CRIC team had a meeting with the MONIT team and agreed on a plan for integrating MONIT applications with CRIC and on a REBUS retirement plan
- After validation of the REBUS functionality in CRIC, REBUS will be put in read-only mode (spring this year)
- All clients using REBUS info should start migrating to CRIC for pledge and federation topology information; please contact cric-devs@cern.ch to coordinate this migration
- There will be a presentation at the next GDB about REBUS functionality in CRIC and REBUS retirement
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- CERN Networking Week took place 13-17th January (https://wiki.geant.org/display/SIGNGN/4th+SIG-NGN+Meeting)
- Feedback from LHCOPN/LHCONE workshop (https://indico.cern.ch/event/828520/)
- The importance of network monitoring was stressed by most of the experiments (covering many topics, from perfSONAR up to requests for detailed packet telemetry)
- Focus on analytics: better insight into existing results would be beneficial for most of the experiments
- The DOMA project had a dedicated slide on perfSONAR, highlighting it as a very useful diagnostic tool.
- DUNE is planning to establish a perfSONAR mesh
- Several experiments mentioned the lack of available/used capacity monitoring
- Some experiments mentioned a missing API to access the LHCOPN/LHCONE network topologies
- Next steps and a follow-up discussion will take place at the LHCOPN/LHCONE Asia meeting (8-9 March)
- The LHCOPN/LHCONE workshop also had a dedicated session on the future of LHC networking
- A dedicated TF will be set up to work on packet tagging/pacing and network orchestration, in close collaboration with the experiments
- perfSONAR infrastructure status - please ensure you're running the latest version 4.2.2-1.el7
- 100 Gbps perfSONAR testbed mailing list to join: http://cern.ch/simba3/SelfSubscription.aspx?groupName=wlcg-perfsonar-100g
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Traceability WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
- Next meeting will take place on the 5th of March