WLCG Operations Coordination Minutes, September 13th, 2018
Highlights
- A report on EOS incidents, improvements and plans
- Despite recent issues, the outlook is positive for the rest of 2018 and beyond
- CMS CRIC is deployed in production
- Sites should upgrade perfSONAR to v4.1 on CentOS 7
Agenda
Attendance
- local: Alberto (CERN storage), Borja (monitoring + WLCG), Julia (WLCG), Maarten (ALICE + WLCG), Marian (networks + monitoring), Massimo (CERN storage), Tommaso (CMS), Vincent (security)
- remote: Alessandra D (Napoli), Alessandra F (Manchester + ATLAS + WLCG), Brij (TIFR), Catherine (LPSC + IN2P3), Cristi (CERN storage), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Giuseppe (CMS), Johannes (ATLAS), Martin (Prague), Matt (EGI), Puneet (TIFR), Stephan (CMS), Vladimir (LHCb)
- apologies: Pepe (PIC)
Operations News
- Experiment contacts are kindly asked to provide input for the list of services dealing with user information, which will therefore need to provide a privacy notice. See the GDPR discussion during the GDB for more details
- The next meeting will be on Thu Oct 11
- Please let us know if that date would pose a significant problem.
Special topics
EOS report
See the presentation
Discussion
- Johannes:
- EOS-ATLAS has been fairly stable over the summer
- 1-2 short outages in the last 3 weeks, not that serious
- The improvements have worked, the situation now looks OK for ATLAS
- 1 rogue user caused an outage during the spring, dealt with on the ATLAS side
- Massimo:
- The presentation refers to a report that compares before and after
- It agrees with the conclusions from ATLAS
- 1 reboot was caused by a power cable incident
- Should be better with the HA setups we will have in the future
- Will also mitigate SW issues
- Reboots lasting from 1000+ seconds up to a few hours have been the main issue
- Tommaso:
- CMS has had the opposite experience - major incidents last week
- See the CMS report
- 1-2 h downtime can be handled OK, silent corruptions cannot!
- The source files will typically have been deleted in the meantime
- Fortunately that problem was noticed after just ~20 minutes
- Is it understood?
- Massimo:
- It was caused by a background activity, viz. the MGM memory compaction
- The files disappeared only from the namespace and can usually be recovered
- This kind of trouble will go away with the new MGM
- Tommaso:
- We also ran into directories where only root could write
- And there were FUSE mount problems on some of our service machines
- E.g. processing log files
- Do we have to switch to the new plugin?
- Massimo:
- The old plugin is still maintained
- The new version is in QA for LHCb and AMS - maybe CMS wants to join?
- Tommaso:
- Let's try it out on specific machines
- Can the FUSE plugin be used in production?
- Massimo:
- Over the last 2 years we have invested a huge effort in the FUSE plugin
- We now have v2 which indeed is supported for production
- Alberto: where will it be tested?
- Tommaso: on a few CMS VOBOXes
- Tommaso: w.r.t. the HI data-taking tests this week,
there does not seem to be a lot of tape writing by ALICE?
- Maarten: the 10 PB ALICE disk buffer in EOS + CASTOR should hold all the new data,
which will be copied to tape as fast as possible in the background
- Massimo:
- We have mainly focused on a global test
- We have seen the CMS rates get lower after some time
- Tommaso: we still have ~2 days to debug any remaining issues
- Julia: do LHCb or ALICE have concerns to mention to the EOS team?
- Vladimir: no major issues for LHCb
- Maarten: neither for ALICE
DPM SRR deployment TF
See the presentations
Discussion
- Julia:
- The timeline has not been decided yet; it may be by the next meeting
- Regarding the DPM upgrade to version 1.10.3 or higher, we want to be in line with the TPC activity in DOMA. We currently aim for the version upgrade to be finished by the end of spring; the re-configuration might come later.
- We will start with a small set of pioneer sites to polish the procedures and instructions, and then go for wider deployment
- Catherine:
- There are 9 DPM sites in France, 1 of which has been using DOME for a year
- We have discussed the deployment and the expected timeline is OK
- The current priority is on the dual-stack deployment for IPv6 support
Middleware News
- Useful Links
- Baselines/News
- Issues:
- UMD-4 update on July 11 broke SL6 CREAM CEs
- Reported here for the record - already included in the Service Report for August
- Tomcat could not start with the newer versions of canl-java, bouncy-castle and voms-api-java
- This was not caught in the Staged Rollout, because CREAM itself was not updated
- Will be handled better in the future
- Several high-priority tickets were opened, e.g. GGUS:136074
- CREAM developers quickly provided fixes in their own repository
- UMD-4 was again updated on July 24
- With additional instructions in the release notes
- Still leaving some loose ends to be tied up after the holidays
- Tickets were updated with workaround recipes in the meantime
Discussion
Tier 0 News
- CERN would like to ask the experiments how much notice they would need to have the majority of batch resources at CERN changed to CC7, assuming any intervention would take a couple of weeks to roll out.
An action for the experiments has been created
Tier 1 Feedback
- IN2P3-CC: due to a RAID problem, a disk server on the XRootD storage will be lost. 110 TB of ALICE data were lost.
Tier 2 Feedback
Experiments Reports
ALICE
- Normal activity levels on average over the summer
- IN2P3-CC: 110 TB lost due to a RAID problem
- Mostly recovered from replicas
ATLAS
- Smooth Grid production over the last weeks with ~300k concurrently running grid job slots. Additional HPC contributions with peaks of ~100k concurrently running job slots.
- In the last 6 weeks ran a large digitisation and reconstruction campaign of MC16e using about 150k job slots. This will be followed in the next weeks by a larger derivation production campaign.
- Commissioning of the Harvester submission system via PanDA is on-going on the Grid: Iberian cloud mostly done, IT and UK cloud now on-going
- Tape carousel R&D staging campaign at Tier-1s on-going: BNL, FZK, PIC, INFN-T1, TRIUMF, SARA, IN2P3-CC done so far. About 200 TB of AODs are staged from tape and possible improvements of the workflows are being evaluated.
- Heavy-ion TDAQ to EOS/CASTOR throughput test started.
Discussion
- Maarten, Julia: it would be good to present your tape carousel results
in the DOMA project and the Archival Storage WG
- Johannes: yes, after they have been presented within ATLAS
CMS
- MD3 in progress
- preparing for HI rate test in the following week
- various EOS issues at CERN during the last month
- local file access issues at RAL and PIC under investigation: GGUS:136028, GGUS:136677
- compute systems busy at about 220k cores, usual mix of 80% production and 20% analysis
- processing backlog, lower/medium priority Monte Carlos not progressing much
LHCb
- Usual activity with data reconstruction and stripping, MC simulation and user analysis
Ongoing Task Forces and Working Groups
Accounting TF
- The problem in the CERN accounting has been investigated and hopefully understood.
Archival Storage WG
Update on providing tape info
PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | NO | | KISTI has been contacted. Will work on it in the second half of September |
| KIT | YES | | |
| NDGF | NO | | NDGF has a distributed storage, which complicates the task. Discussing with NDGF the possibility to do the aggregation on the storage space accounting server side. No news recently |
| NLT1 | NO | | Almost done, waiting for the firewall to be opened, a matter of a couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |
One can see all the sites integrated in the storage space accounting for tapes here
Information System Evolution TF
- CMS CRIC is deployed in production. The functionality currently enabled replaces SiteDB for both topology and user info.
- Working on CRIC for WLCG central operations which will provide topology for all 4 experiments. Progressing well.
- The next IS Evolution Task Force meeting will take place next Thursday. The main topic is how to provide a description of computing resources in a JSON file, similar to what the Storage Resource Reporting document proposes for storage services (a parsing sketch follows below)
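As an illustration of the kind of report such a JSON file describes, below is a minimal sketch that reads an SRR-style storage summary and prints the used and total space per share. The file name and the field names (storageservice, storageshares, totalsize, usedsize) follow the SRR draft and should be treated as assumptions to be adjusted to the schema actually deployed.

```python
#!/usr/bin/env python3
# Minimal sketch (not an official tool): read an SRR-style JSON storage
# summary and print per-share usage. Field names follow the SRR draft
# ("storageservice" -> "storageshares" with "totalsize"/"usedsize" in bytes)
# and should be adjusted to the schema actually in use.
import json

with open("storagesummary.json") as f:   # hypothetical local copy of the report
    report = json.load(f)

for share in report["storageservice"].get("storageshares", []):
    name = share.get("name", "unnamed")
    used = share.get("usedsize", 0)
    total = share.get("totalsize", 0)
    print("%-25s %8.1f / %8.1f TB used" % (name, used / 1e12, total / 1e12))
```

A computing-resource description could follow the same pattern, with queues or CE endpoints taking the place of the storage shares.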
Discussion
- Alessandra F: UK sites intend to drop the BDII
- Maarten: have you considered the consequences w.r.t. EGI, e.g. the ARGO tests?
- Alessandra F:
- Other VOs are supported through the GridPP DIRAC service
- We first remove the LHC experiments from the BDII and then we will see
- Matt:
- Will discuss this matter with my colleagues
- Alessandro Paolini would know more details
- Julia: he has participated before; it would be good if he could join next Thu
IPv6 Validation and Deployment TF
Detailed status here.
See the status report presented in the Sep GDB
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status
- perfSONAR 4.1 was released a few weeks ago - the main new feature is an improved central/remote configuration
- A WLCG broadcast was sent this week to remind sites to upgrade to CC7 and review their configuration (preferably by the end of October); a version-check sketch follows at the end of this section
- Around 50% of sonars are on CC7 as of today
- WG update will be presented at the upcoming HEPiX
- WLCG/OSG network services
- Central configuration service (meshconfig/psconfig) was updated to the version released in 4.1 (officially supported by perfSONAR team)
- psconfig.opensciencegrid.org is currently unreachable via IPv6 from non-LHCONE sites due to a routing issue; this is being followed up by the network team at MSU
- NSF-funded projects: SAND and IRIS-HEP are starting; both will contribute in different ways to the OSG Network Area - more details will be provided in the HEPiX talk
- WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
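To help with the CC7 / 4.1 upgrade campaign mentioned above, here is a minimal sketch for spotting sonars that are still below 4.1. It assumes the Toolkit web UI answers https://<host>/toolkit/?format=json with a toolkit_version field (please verify on your installation), and the host names are purely illustrative.

```python
#!/usr/bin/env python3
# Minimal sketch: flag perfSONAR hosts that are not yet on version 4.1.
# Assumptions (not from the minutes): the Toolkit web UI answers
# https://<host>/toolkit/?format=json with a "toolkit_version" field,
# and the host list below is purely illustrative.
import json
import ssl
import urllib.request

HOSTS = ["psonar1.example.org", "psonar2.example.org"]  # hypothetical sonars

ctx = ssl.create_default_context()
ctx.check_hostname = False          # many sonars use host certificates that
ctx.verify_mode = ssl.CERT_NONE     # a default trust store may reject

for host in HOSTS:
    url = "https://%s/toolkit/?format=json" % host
    try:
        with urllib.request.urlopen(url, timeout=10, context=ctx) as resp:
            info = json.load(resp)
        version = info.get("toolkit_version", "unknown")
        flag = "" if version.startswith("4.1") else "  <-- please upgrade"
        print("%-30s %s%s" % (host, version, flag))
    except Exception as exc:
        print("%-30s unreachable (%s)" % (host, exc))
```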
Squid Monitoring and HTTP Proxy Discovery TFs
- Thanks to Michal Svatos, there are now CVMFS and ATLAS frontier failover monitors linked from http://wlcg-squid-monitor.cern.ch
- Based on GOCDB/OIM squid registration
- There's also one for CMS but it hasn't yet been cut over into production, waiting on more squid registrations
- Some duplicate MRTG SNMP queries have been removed by sharing data between MRTG plots
- The same technique will be used to generate a CMS-only MRTG page based on registrations, which will eliminate more SNMP query duplication and make it easier to implement (an illustrative query sketch follows below)
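For context, the MRTG plots are built from SNMP counter queries against each squid; the sketch below shows one such query using the pysnmp library, the default Squid SNMP port (3401) and the aggregate HTTP-requests counter from the Squid MIB. The host name is hypothetical and the OID should be checked against the MIB before use.

```python
#!/usr/bin/env python3
# Illustrative sketch of the kind of SNMP counter query an MRTG-style monitor
# issues against a squid. Requires pysnmp; the host is hypothetical and the
# OID (aggregate client HTTP requests from the Squid MIB) should be verified.
from pysnmp.hlapi import (SnmpEngine, CommunityData, UdpTransportTarget,
                          ContextData, ObjectType, ObjectIdentity, getCmd)

HOST = "squid.example.org"            # hypothetical squid host
OID = "1.3.6.1.4.1.3495.1.3.2.1.1.0"  # assumed: aggregate client HTTP requests

error_indication, error_status, error_index, var_binds = next(
    getCmd(SnmpEngine(),
           CommunityData("public", mpModel=1),           # SNMP v2c
           UdpTransportTarget((HOST, 3401), timeout=5),  # squid's SNMP port
           ContextData(),
           ObjectType(ObjectIdentity(OID))))

if error_indication:
    print("query failed:", error_indication)
else:
    for var_bind in var_binds:
        print(" = ".join(x.prettyPrint() for x in var_bind))
```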
Traceability WG
Container WG
See the status report presented in the Sep GDB
Action list
| Creation date | Description | Responsible | Status | Comments |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | GGUS:133915 |
| 07 Jun 2018 | Followup of OSG service URL changes | WLCG Operations | DONE | We suggest that for all middleware using various OSG-related URLs the experiments look at this page and inform operations in case you need more help |
| 07 Jun 2018 | GDPR policy implementation across WLCG and experiment services | WLCG Operations + experiments | Ongoing | Details here |
Specific actions for experiments
Specific actions for sites
AOB