WLCG Operations Coordination Minutes, Feb 7, 2019
Highlights
Agenda
https://indico.cern.ch/event/795889/
Attendance
- local: Giuseppe Bagliesi, Christoph Wissing, Alessandro Di Girolamo, Julia Andreeva
- remote: Catherine Biscarat, Mikhail Titov, Maria Grigorieva, Jeremy Coles, Di Qing, Vladimir Romanovski, Jean-Roch Vlimant, Alessandra Doria, Daniel Abercrombie, Jose Flix
- apologies:
Operations News
- End of support of CREAM CE. The CREAM working group has announced that official support for the CREAM-CE component will cease at the end of the EOSC-hub project, i.e. in Dec 2020. To prepare for this, EGI Foundation and CERN are actively working to help to minimise disruption. This will include helping users migrate to alternative solutions, i.e. ARC-CE or HTCondor-CE. The CREAM product team will be providing full support until the end of 2019, including one minor release already scheduled. During 2020 only security updates will be released.
- A workshop on migration from CREAM CE to other solutions will be held during the EGI Conference, which will take place on 6-8 May in Amsterdam. More details will follow later. If you would like to participate, or to share your experience or concerns, please get in touch with wlcg-ops-coord-chairpeople@cern.ch
Discussion
- Julia asked Pepe whether PIC could share its experience of migrating to HTCondor. Pepe confirmed that PIC would contribute to the workshop.
Special topics
Preparation of the operational intelligence discussion during HOW 19 workshop
Discussion
- Christoph and Daniel mentioned work in CMS in this area. The necessary machinery to collect data has been set up, and it might be interesting to share the experience.
- Alessandro said that several projects had been started in ATLAS, but that not much progress has been demonstrated so far.
- Alessandro encouraged people to send him ideas for the session.
Middleware News
- Useful Links
- Baselines/News
Tier 0 News
- OTG:0046088: CERN LSF public was decommissioned on Wednesday 30th January 2019. A few dedicated shares are being handled separately.
- OTG:0047300: ~30% of the CERN batch capacity is now on CC7. The remaining capacity will be migrated according to the following schedule:
  - end March 2019: ~50% of public/grid capacity will be on CC7
  - 2nd April 2019: the lxplus.cern.ch alias changes to CC7 (the lxplus6 service will remain accessible on lxplus6.cern.ch), and the default HTCondor target changes to CC7 for local submission.
  - early June 2019: the remainder of the capacity will have been migrated.
- OTG:0048002: ce511, ce512, ce513 and ce514 can now be used to target CC7 (by default). Others will follow as we migrate more capacity.
- Unprivileged user namespaces are being enabled on CERN CC7 capacity to support Singularity and other user-space container tools.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- High activity since the end of Run 2
- Lots of everything: MC, reconstruction, analysis trains, user jobs
- No major issues
ATLAS
- Smooth Grid production over the last weeks with ~300-330k concurrently running grid job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, and a small fraction of dedicated data reprocessing. There were some periods of additional HPC contributions over the last months, with peaks of ~150k concurrently running job slots, and ~15-20k jobs from BOINC.
- The HLT farm/Sim@P1 is undergoing system and hardware upgrades and will only be available again in April.
- CentOS7: ATLAS would like to start a more forceful migration to CentOS7 and have the vast majority of resources, if not all, migrated by June 1.
- SCRATCHDISK: ATLAS would like to increase the SCRATCHDISK quota to 100TB per 1000 analysis slots
- IPv6: if sites update to IPv6 dual-stack please let us know in advance. SAM tests are under development (a new SAM IPv6 dev node is sending the "normal" tests to sites, results to be understood)
- DPM DOME upgrade: ATLAS still sees instabilities at the DPM DOME sites already used in production. These first have to stabilize before a larger deployment of DPM DOME can be considered, a few months from now. Discussions with the DPM team are ongoing to reach a clear understanding of what would be best to suggest to sites.
- ATLAS sites jamboree and HPC strategy meeting, 5-8 March at CERN, https://indico.cern.ch/event/770307/
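To illustrate the SCRATCHDISK request above, here is a minimal Python sketch of what the proposed quota would mean for a given site. It assumes the quota scales linearly with the number of analysis slots. The minutes do not say whether any rounding or minimum applies, and the function name and example slot count are hypothetical.

```python
# Proposed figure from the ATLAS report: 100 TB per 1000 analysis slots.
TB_PER_1000_SLOTS = 100


def scratchdisk_quota_tb(analysis_slots: int) -> float:
    """Return the SCRATCHDISK quota in TB, assuming linear scaling (an assumption)."""
    if analysis_slots < 0:
        raise ValueError("analysis_slots must be non-negative")
    return TB_PER_1000_SLOTS * analysis_slots / 1000


# Hypothetical site with 2500 analysis slots.
print(scratchdisk_quota_tb(2500))
```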
Discussion
- IPv6. Alessandro mentioned that CMS is more advanced than ATLAS (CMS: 65%, ATLAS: 35-40%). Giuseppe mentioned that CMS occasionally experiences problems at sites that require intervention by site support (firewall problems, wrong configuration). CMS has set up SAM tests and ATLAS intends to do the same, so ATLAS can benefit from the CMS experience with IPv6 SAM tests. Di mentioned the problem of a misconfigured BNL server. Alessandro replied that it should have been fixed last week, but this still has to be confirmed. One of the problems with the IPv6 deployment campaign is that ATLAS does not really know which sites have IPv6 and which do not.
- DPM DOME migration. Alessandro stressed that a massive migration with a deadline can only be started once there is proof that migrated DPM sites work well and provide reliable storage. Alessandra Doria said that although the migration was not smooth, the site (Naples) is now working well. In her opinion it is better to wait for the next release (1.11), which should be out in the coming days. Julia suggested inviting DPM experts and reviewing this topic at the next meeting.
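On the question of which sites have IPv6, one rough first check is whether a service hostname resolves to both IPv4 and IPv6 addresses. The Python sketch below illustrates that idea using only the standard library. It is not the SAM test used by CMS or ATLAS; the function names and the default port are assumptions, and a DNS record alone does not prove that the service actually works over IPv6.

```python
import socket


def address_families(host, port=443):
    """Return the set of address families that `host` resolves to."""
    try:
        infos = socket.getaddrinfo(host, port, proto=socket.IPPROTO_TCP)
    except socket.gaierror:
        # Name could not be resolved at all.
        return set()
    # The first element of each getaddrinfo tuple is the address family.
    return {info[0] for info in infos}


def classify(families):
    """Classify a set of address families as dual-stack, single-stack or unresolved."""
    has_v4 = socket.AF_INET in families
    has_v6 = socket.AF_INET6 in families
    if has_v4 and has_v6:
        return "dual-stack"
    if has_v6:
        return "ipv6-only"
    if has_v4:
        return "ipv4-only"
    return "unresolved"
```

For a given service endpoint, `classify(address_families(hostname))` gives the classification as seen from the machine running the check.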
CMS
- smooth running, compute systems busy at about 220k cores
- usual production/analysis mix (75%/25%)
- 2018 data re-processing being wrapped up
- Monte Carlo production ongoing
- staging in B-parked data for reconstruction
  - ongoing with good performance
- two incidents in the last months where SAM3 got stuck
  - thanks to Nikolay and Simone for restoring service on the weekends
- decoupling production services from EOS
Discussion
- EOS instabilities were also experienced by ATLAS (FUSE problem). This should be followed up with the EOS team, who should be invited to the next meeting.
LHCb
- Data stripping, MC simulation and user analysis with ~100K jobs running
- No major problems
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Update of providing tape info
PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | YES | | KISTI has been contacted. Will work on in the second half of September |
| KIT | YES | | |
| NDGF | NO | | NDGF has a distributed storage which complicates the task. Discuss with NDGF possibility to do aggregation on the storage space accounting server side. Should be accomplished by the end of the year |
| NLT1 | YES | | Almost done, waiting for opening of the firewall, order of couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |
One can see all sites integrated in storage space accounting for tapes here
Information System Evolution TF
- The IS Evolution task force meeting took place on the 24th of January. The main topic of discussion was the JSON structure for the description of computing resources (CRR). The latest version of the CRR format can be found here
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
- perfSONAR infrastructure status - CC7/4.1 campaign ongoing
- perfSONAR 4.0 and perfSONARs on SL6 are no longer supported since Q4 2018 - please update ASAP
- We have started ticketing sites, starting with T1s and major T2s
- WG update will be presented at HEPiX in San Diego
- WLCG/OSG network services were updated
- WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
Squid Monitoring and HTTP Proxy Discovery TFs
- Nothing to report this month
Traceability WG
Container WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
No AOBs. The next meeting is planned for the 7th of March.