WLCG Operations Coordination Minutes, Oct 3, 2019

Highlights

Agenda

https://indico.cern.ch/event/852337/

Attendance

  • local: Alberto (monitoring), Borja (monitoring), Concezio (LHCb), Eddie (data management + ATLAS), Gavin (T0), Julia (WLCG), Luca (monitoring), Maarten (ALICE + WLCG), Nikolay (monitoring), Pedro (monitoring), Stephan (CMS)
  • remote: Catalin (EGI), Di (TRIUMF), Eric (IN2P3-CC), Felix (ASGC), Mike (ASGC), Panos (WLCG), Pepe (PIC), Raja (LHCb)
  • apologies:

Operations News

  • the next meeting is planned for Nov 14
    • please let us know if that date would pose a major problem

Special topics

WLCG Monitoring Status and Plans

see the presentation

Discussion

  • Julia: can one dig down to the logs?
  • Borja: we will try to keep the functionality as similar as possible;
    that aspect should be fine
  • Pedro: we already dealt with other use cases with the same requirement

  • Julia: we need historical data at least for a few years
  • Borja: it will be imported
  • Pedro: how many years?
  • Julia: 3 years probably would be sufficient
  • Pedro: should be OK; note that high-level reports are kept forever,
    while the raw data is not

  • Stephan: does the data have to be in Elasticsearch?
  • Borja: or in InfluxDB; the logs are also in HDFS

  • Pedro:
    • recomputations will have a new interface which also provides a catalog of them
    • the CI back-end will apply them automatically
  • Maarten: would it be possible to undo a mistaken recomputation request?
  • MONIT team:
    • currently not foreseen and might be expensive to implement
    • instead, one can apply another correction on top, like in SAM-3 today
  • Julia: can the original raw data still be seen then, like in SAM-3 today?
  • Borja: we will look into that

  • Julia: will there be an interface to define new profiles?
  • Pedro:
    • at least in the beginning that will not be available to the user
    • instead, one would have to ask our team through a SNow ticket
    • this workflow helps avoid stale profiles
    • of the existing profiles we will only copy the ones that are used

  • Julia: SSB functionality is also used by WLCG Operations e.g. for validation
    of accounting data against the EGI portal
  • Pedro: we can have a meeting to discuss the relevant data sets etc.
  • Alberto: the migration should also include cleanup

  • Julia: we need an API for DDM (and job) accounting to compare with the new WLCG framework
  • Borja: that is supported
  • Luca: from the GUI one can download as CSV, while JSON can be obtained e.g. via curl
  • Julia:
    • we used to compare the old job accounting numbers with EGI data
    • the old system was stopped in July
    • we now see a 20% difference, while it used to be 10%
  • Pedro:
    • even our new dashboard does not match our old dashboard
    • for example due to inclusion of new features like the "closed" state
    • ATLAS defined 4 substates of finished jobs, but only 3 were taken previously
    • now they want all 4 of them
  • Eddie: in the past we were told not to include the 4th state
  • Alberto: the new dashboard was validated against PanDA
  • Julia: we will check if that change may explain what we see
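The comparison discussed above amounts to a relative difference between two accounting totals. A minimal Python sketch follows. The function names and the choice of the EGI total as the reference are assumptions for illustration, not the method used by the accounting validation. The 10% default tolerance simply reuses the earlier figure quoted in the discussion.

```python
def relative_difference(experiment_total: float, egi_total: float) -> float:
    """Return |experiment - EGI| / EGI as a fraction (EGI taken as reference, an assumption)."""
    if egi_total == 0:
        raise ValueError("reference total must be non-zero")
    return abs(experiment_total - egi_total) / egi_total


def exceeds_tolerance(experiment_total: float, egi_total: float,
                      tolerance: float = 0.10) -> bool:
    """True when the two totals disagree by more than the given fraction."""
    return relative_difference(experiment_total, egi_total) > tolerance
```

With hypothetical totals of 120 and 100, `relative_difference` gives a fraction of about 0.2, which `exceeds_tolerance` flags at the default setting.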

  • Borja: what is the situation with CRIC?
  • Julia:
    • it is awaiting validation by all relevant clients
    • topology information is independent of the pledges
    • the VO feed can already be put into production soon
    • the switch from REBUS is planned for early next year

  • Stephan: does the new system support alarms for site admins?
  • Borja: we can make that work after the validation activities

  • Julia: what are the default retention policies?
  • Borja: indefinite in InfluxDB, 30 days in Elasticsearch
  • Julia: we found that accounting data older than 2014 got removed from InfluxDB
  • Borja, Luca: that limit can be changed

  • Julia: is it possible to make items of the Grafana plots clickable?
  • Luca: that functionality has been enabled

Review of the WLCG Critical Services

Discussion

  • Kubernetes can go together with OpenStack

  • Stephan: what about services outside CERN, e.g. GGUS?
  • Julia: or the FTS, though there is some redundancy there
  • Maarten: GGUS and GOCDB only have local HA, can suffer from site network issues
  • Conclusions regarding services not hosted by T0: for those which do not have multi-site redundancy (e.g. GGUS and GOCDB) we need to create a small section. Experiments will be asked to provide their criticality estimations.

  • Next steps: Julia will merge the changes provided by the experiments into the table (CMS, ATLAS and LHCb have already made some changes in place) and create a table for non-redundant critical services not hosted by T0. Experiments will be asked to review the modifications, and we will go through this topic quickly at the next meeting.

  • We should go through this exercise on a regular basis, at least once a year, or on demand when changes are required.

Middleware News

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Sites please plan moving to CentOS / EL 7 as soon as possible
  • Normal to high activity
  • No major issues

ATLAS

  • ATLAS software and computing week 30 Sep-4 Oct - apologies if nobody from ATLAS is able to connect to the meeting
  • Smooth Grid production over the past weeks with ~320k concurrently running grid job slots with the usual mix of MC generation, simulation, reconstruction, derivation production, user analysis and a dedicated reprocessing campaign (see below). In addition ~90k job slots from the HLT/Sim@CERN-P1 farm when it was not used for TDAQ purposes.
  • Some infrastructure incidents: Rucio seriously degraded for 2-3 hours each by LHCONE (OTG:0052301) and Oracle problems (OTG:0052490)
  • Very last tails of CentOS7 site migration - still some stragglers: twiki
  • Switched off all analysis (Aug 18) and production (Sep 18) queues which are still on SL6 - ~6k job slots are missing and will come back when sites move to CentOS7
  • Finished PanDA Pilot2 deployment on Sep 18 - all queues now use Pilot2 exclusively
  • With Pilot2 comes automatic singularity container usage - there are still some sites which do not work with singularity
  • Follow-up from previous meeting: in the last 2 months (Aug, Sep), 33 ATLAS GGUS tickets were submitted for deletion-related problems at a site; 18 of these concerned DPM sites, and 9 of those look like recurring problems.
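For reference, the proportions behind the ticket counts in the last item can be worked out as follows. This is only arithmetic on the three numbers quoted above; the variable names are illustrative.

```python
# Counts quoted in the ATLAS report for Aug-Sep
total_tickets = 33          # GGUS tickets about deletion problems
dpm_tickets = 18            # of those, tickets concerning DPM sites
recurring_dpm_tickets = 9   # of the DPM tickets, apparent recurring problems

dpm_share = dpm_tickets / total_tickets
recurring_share_of_dpm = recurring_dpm_tickets / dpm_tickets
recurring_share_of_all = recurring_dpm_tickets / total_tickets

print(f"DPM share of deletion tickets: {dpm_share:.0%}")              # 55%
print(f"Recurring share of DPM tickets: {recurring_share_of_dpm:.0%}")  # 50%
print(f"Recurring share of all tickets: {recurring_share_of_all:.0%}")  # 27%
```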

CMS

  • smooth running up to 280k cores during the last month
    • production did not use full share due to campaign scheduling and disk space constraints
    • thus larger analysis share than usual, up to 120k cores
  • 2017 data re-reconstruction, ultra-legacy, in progress (staging more datasets from tape)
  • 2018 heavy-ion data processing close to complete
  • B-parked data reconstruction continuing
  • several sites with small issues
  • down to ten sites on IPv6 storage
  • global pool scaling tests in progress (extra sleep jobs at selected sites)

LHCb

  • smooth usage of grid resources in the last month, with an average of 109k jobs running
  • Small issues
    • Random files on CERN-EOS regularly become inaccessible
    • file access problems (GGUS:142350) and deleting files on ECHO (GGUS:143323) at RAL
  • Main activity: Monte Carlo production (90%), user jobs (5%), Monte Carlo reconstruction (3%), data stripping (1.5%)
  • Legacy re-stripping of Run1 + Run2 data ongoing
    • 2018, 2011, 2012 data done in spring/summer
    • now validating re-stripping of 2016, 2017 data
    • massive tape recall from Tier0+Tier1 sites
  • VO card updated for singularity support
    • requested at Tier1 sites
    • gradually expanding to Tier2 sites

Task Forces and Working Groups

Upgrade of the T1 storage instances for TPC

GDPR and WLCG services

Accounting TF

  • Steady progress in the implementation of the new accounting validation workflow
  • The SSB application for comparing the experiment CPU accounting with the EGI accounting broke because of the switch of the CMS and ATLAS Dashboards to MONIT Dashboards. Work on integration with MONIT is ongoing.

Archival Storage WG

Containers WG

Draft baseline doc out for comment regarding Singularity.

CREAM migration TF

See CreamMigrationTaskForce

dCache upgrade TF

  • The task force will follow up on the migration of the WLCG dCache sites to version 5.2.0 and higher and the enabling of Storage Resource Reporting (SRR). TF twiki available here.

DPM upgrade TF

Information System Evolution TF

  • CRIC presentation at the GDB next week

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • The latest version of frontier-squid (4.8-2.1) can now auto-register with shoal using the option SQUID_AUTO_DISCOVER=true.
    • Intended only for dynamically created squids such as with clouds
    • Any WLCG statically registered squids at a site take precedence in wlcg-wpad results over those registered with shoal
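The precedence rule above can be illustrated with a small Python sketch. It is not the wlcg-wpad implementation. The function name is hypothetical, and the sketch assumes that "take precedence" means statically registered squids are listed first, with shoal-registered ones after them; the real service may instead omit the shoal entries whenever static ones exist.

```python
def order_proxies(static_squids, shoal_squids):
    """Return proxies with statically registered squids first (assumed behaviour).

    Duplicates are dropped, keeping the first (higher-precedence) occurrence.
    """
    seen = set()
    ordered = []
    for host in list(static_squids) + list(shoal_squids):
        if host not in seen:
            seen.add(host)
            ordered.append(host)
    return ordered
```

For example, with a hypothetical static squid `squid1.example.org` and a shoal-registered `cloud-squid.example.org`, the static one is returned first. If no static squids are registered, only the shoal-registered ones are returned.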

Traceability WG

Action list

Creation date | Description | Responsible | Status | Comments

Specific actions for experiments

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

Specific actions for sites

Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments

AOB

-- JuliaAndreeva - 2019-09-30

Topic revision: r20 - 2019-11-04 - MaartenLitmaath
 