WLCG Operations Coordination Minutes, May 16, 2019

Highlights

Agenda

https://indico.cern.ch/event/820489/

Attendance

  • local: Andrea Manzi (CERN), Fabrizio Furano (CERN), Julia Andreeva (CERN), Mayank Sharma (CERN)

  • remote: Adrien, Alessandra Doria (Napoli), Baptiste Grenier (EGI), Christoph Wissing (CMS), Di Qing (TRIUMF), Dmitry Golubkov (LHCb), Eric Fede (IN2P3-CC), Felix Lee (ASGC), Giuseppe Bagliesi (CMS), Guillaume Philippon (GRIF), Jerome Pansanel (Strasbourg), Johannes Elmsheuser (ATLAS), Jose Flix Molina (PIC), Laurent Duflot (IN2P3/IRFU), Matthew Viljoen (EGI), Puneet Patel (TIFR), Ron Trompert (NLT1), Stephan Lammel (CMS), Thomas Hartmann (DESY)

  • apologies: Catherine Biscarat

Operations News

  • A site survey combining the storage- and compute-related questions has been sent around. Anyone who has problems accessing the form should contact Julia. We plan to close the survey after June 15. We need the site answers to plan the long-term future of the WLCG service and to help sites migrate away from CREAM CE, which should happen before the end of 2020.

  • The next meeting is planned for June 6
    • Please let us know if that date would be very inconvenient

Special topics

Followup on the CREAM CE migration workshop

https://indico.cern.ch/event/820489/contributions/3429580/attachments/1845746/3028177/CreamCEMigration.pdf

  • Baptiste and Julia: EGI and WLCG will work together on preparing documentation and tracking the progress of the various sites.
  • Matthew suggested creating a mailing list for sites to get in touch and for sharing the outcomes of the task force.
  • Julia: Next week we will agree on the task force membership, start the real work on the migrations, and create a TWiki page to collect all necessary information.

DPM upgrade task force status report

  • Overall, it is going very well. Sites that have enabled DOME had a good experience, and some that use gridftp2 gave good feedback as well. The list of items in the TWiki covers the known issues and problems. Open issues will be closed in the next minor release, which will likely come after the DPM workshop.
  • Alessandra Doria confirmed that things in Napoli were stable after the upgrade. The issues encountered were fixed in the latest release. Napoli recommends bringing more sites in, as the latest release is quite OK.
  • ATLAS reported at the last meeting about data deletion issues with DPM. An Apache configuration patch by Petr seems to help mitigate them. Upgrading to the latest DPM DOME flavor combined with applying Petr's Apache configuration seems a likely solution at this stage. The issue does not seem to be related to the DPM version or its configuration (it affects any DPM version). Johannes was not aware of this solution. Some 30-40 sites suffer from this issue in a round-robin fashion. Johannes suggested that it would help if the DPM team kept them in the loop about such potential solutions. There will be a follow-up with sites at the WLCG level to inform them of the solution and the recommendation from the DPM team.

Discussion with EGI on the impact of the no-SRM solutions on the EGI operations

https://indico.cern.ch/event/820489/contributions/3432362/attachments/1845873/3028414/EGI-QA-WLCGCoordMay2019.pdf

  • Julia to EGI: Q6) An important functionality for SRM was space queries. If SRM is not an option for the future, what is foreseen as a replacement for space queries?
  • Baptiste: Need to discuss with colleagues and get back.

  • Julia: SRM will no longer be a generic solution. We also have storage systems that do not support SRM (EOS, for instance). We should think seriously about alternatives for data access, removal and other functionalities. This will impact WLCG and EGI sites alike, so we should work together on this. Feel free to send email to WLCG operations to take this discussion further.

  • The end-of-security-updates deadline for LCGDM is also May 31, 2019. This could be too soon for some EGI sites to upgrade. However, the DPM team stresses that it is important to stick to the upgrade plans. This will be discussed further offline.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • No major issues
  • Prague has been back in business for 3 weeks
    • Thanks to a big effort by the Prague admins!
  • The yearly ALICE T1-T2 workshop was held May 14-16, this time at the Polytechnical University of Bucharest

ATLAS

  • Smooth Grid production over the last weeks with ~300k concurrently running grid job slots, with the usual mix of MC generation, simulation, reconstruction, derivation production and analysis, plus a small fraction of dedicated data reprocessing. Some periods of additional HPC contributions with peaks of ~500k concurrently running job slots and ~15k jobs from BOINC. The HLT farm/Sim@P1 is back in production and, in its current configuration, adds ~80k job slots.
  • CentOS7 site migration: unfortunately, it looks like not all sites will have migrated by the June 1 deadline - see https://twiki.cern.ch/twiki/bin/view/AtlasComputing/CentOS7Deployment - can WLCG help to speed up this process?
  • Commissioning of a new PanDA worker node pilot version is ongoing. We are slowly rolling out the new version to the sites at low scale.
  • Unchanged situation of DPM deletion problems as reported in the last meeting. In contact with DPM developers, but the current support model does not seem adequate at the scale that DPM is used.

Discussion

  • Julia asked the other experiments how they follow up on sites in the CentOS7 migration. CMS and LHCb are using Singularity and are therefore not much concerned by this migration, nor is ALICE. Stephan Lammel commented that for CMS the key point in enabling Singularity was good documentation with clear instructions in a single place. This is not really relevant for an OS migration. The conclusion of the discussion was that there is apparently no option other than submitting GGUS tickets to the sites that delay the migration.
    • Added 2019-06-07: LHCb will be using Singularity, see here.

CMS

  • smooth running, compute systems busy at about 250k cores
    • usual production/analysis mix (80%/20%)
  • first (or several) periods of parked B physics data being processed
  • heavy ion re-reconstruction in progress, about 40% done
  • 2017 and 2018 Monte Carlo production ongoing
  • disk deletion (ahead of tape deletion) campaign complete
    • tape deletion expected to start soon
  • 274 files lost in CERN EOS (due to EOS bug) JIRA
  • memory issues with the CMS database service, DBS, for a few weeks, under investigation

LHCb

  • Smooth running at ~100K jobs
    • usual activity: MC simulation, WG productions, and user analysis

Task Forces and Working Groups

GDPR and WLCG services

Accounting TF

  • NTR

Archival Storage WG

Information System Evolution TF

  • The CRR format is practically agreed. CRIC developers have consumed the CRR prototyped by pioneer sites.
  • During the EGI conference we discussed how to make sure that, if some sites stop using BDII, EGI can get the information required for operations from CRIC. The agreed plan is to follow the same data flow as for publishing CLOUD information in EGI via the message bus. We will prototype the data flow with our EGI colleagues.
  • CMS CRIC is in production and has been used by CMS for several months. We have now agreed with CMS on the plan to retire SiteDB. The first step is to put it in read-only mode, which will happen next week. After one month, if everything goes smoothly, SiteDB can be stopped.
  • At the request of CMS, quarterly pledges have been prototyped in the wlcg-cric. We plan to have a demo of the WLCG CRIC at the next GDB in June.

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • IP ranges from GOCDB are now used by wlcg-wpad to disambiguate sites that share GeoIP organizations and have different squids. This has affected only a few cases so far, but many sites either have no IP ranges registered or have 0.0.0.0. Our plan is to wait until someone has a problem and then ask the site admins to register IP ranges. We should probably add Kibana monitoring of wlcg-wpad usage and watch for cases where people attempt to use it and fail.
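
As an illustration only, the disambiguation described above could look roughly like the following Python sketch. The site names, IP ranges, squid hosts and the functions `usable_networks` and `pick_squids` are hypothetical and are not taken from wlcg-wpad or GOCDB; the actual service may work differently. The sketch assumes that a registered range of 0.0.0.0 carries no information and that, when no range matches, all squids of the GeoIP organization are returned.

```python
import ipaddress

# Hypothetical data: two sites share one GeoIP organization but run different squids.
SITE_RANGES = {
    "SITE-A": ["192.0.2.0/25"],
    "SITE-B": ["192.0.2.128/25"],
    "SITE-C": ["0.0.0.0"],  # placeholder registration, treated as "no range"
}
SITE_SQUIDS = {
    "SITE-A": ["squid-a.example.org:3128"],
    "SITE-B": ["squid-b.example.org:3128"],
    "SITE-C": ["squid-c.example.org:3128"],
}
ORG_SITES = {"Example Org": ["SITE-A", "SITE-B"]}


def usable_networks(ranges):
    """Parse registered ranges, skipping malformed and all-zero entries."""
    networks = []
    for text in ranges:
        try:
            net = ipaddress.ip_network(text, strict=False)
        except ValueError:
            continue  # ignore malformed registrations
        if int(net.network_address) == 0:
            continue  # 0.0.0.0 tells us nothing about the site
        networks.append(net)
    return networks


def pick_squids(client_ip, geoip_org):
    """Return squids for a client, preferring sites whose IP range matches."""
    addr = ipaddress.ip_address(client_ip)
    candidates = ORG_SITES.get(geoip_org, [])
    matches = [
        site
        for site in candidates
        if any(addr in net for net in usable_networks(SITE_RANGES.get(site, [])))
    ]
    # Assumption: fall back to every site of the organization if no range matches.
    sites = matches or candidates
    return [squid for site in sites for squid in SITE_SQUIDS[site]]
```

In this sketch, a client in a registered range gets only that site's squid, while a client outside all ranges still gets the squids of every site in its GeoIP organization, which is why missing registrations only matter once two sites actually collide.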

Traceability WG

Container WG

Action list

Creation date Description Responsible Status Comments

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Deadline Completion Comments

AOB

Topic revision: r15 - 2019-06-17 - MaartenLitmaath
 