DRAFT

WLCG Operations Coordination Minutes, Sep 14th 2017

Highlights

Agenda

Attendance

  • local: Andrea M (MW Officer + data management), Andrea S (IPv6), Julia (WLCG), Kate (WLCG), Luca (monitoring), Maarten (ALICE + WLCG), Marian (networks), Nurcan (ATLAS), Panos (WLCG), Vladimir (CNAF)

  • remote: Alessandra (Napoli), Catherine (LPSC + IN2P3), Christoph (CMS), Dave D (FNAL), Di (TRIUMF), Eddie (monitoring), Felix (ASGC), Frederique (LAPP), Massimo (LNL), Thomas (DESY), Xin (BNL)

  • apologies: Eric (IN2P3-CC), Jeremy (GridPP) and UK sites

Operations News

  • The next meeting is planned for Oct 5
    • Please let us know if that date would present a major issue

Middleware News

  • Useful Links:
  • Baselines/News:
    • Baseline updates: added HEP_OSlibs 7.1.9, HEP_OSlibs_SL6 1.0.19 and Singularity 2.3.1. DPM 1.9.0 will be set as the baseline once the latest fixes are included in UMD.
    • UMD 4.5 was released in August (http://repository.egi.eu/2017/08/10/release-umd-4-5-0/); among others, it includes new packages such as the WN for CentOS 7 and DynaFed.
    • The dCache 2.13 EOL was postponed to the end of August (5 out of 12 instances have already been upgraded to 2.16/3.0). KIT, as a Tier-1, is planning the upgrade date with the experiments.
  • Issues:
    • New mailing lists for ARC-CE and HTCondor discussions in WLCG (wlcg-arc-ce-discuss at cern.ch and wlcg-htcondor-discuss at cern.ch). If you have a CERN account you can subscribe yourself via e-groups; otherwise contact us to be included. There have already been some discussions on the ARC-CE list: the ARC-CE devs have prepared procedures for community patches to be validated and included upstream (https://source.coderefinery.org/nordugrid/arc-patch-collection/wikis/home), and Alessandra Forti reported some issues with the accounting, to be discussed further. Nothing to report from the HTCondor list; we asked the devs whether they could be included in the mailing list, but got no answer. If someone is in touch with them and can help with this communication issue, please let us know.
    • EOSATLAS and EOSCMS: Bug that could cause file loss under special failure circumstances even though EOS acknowledges the write to the client. Fix deployed.
    • New JGlobus release in EPEL 6/7 and OSG 3.3 (https://twiki.opensciencegrid.org/bin/view/Documentation/Release3/Release3327) to fix an issue with RFC proxies in BeStMan for some CA certificates. Sites should install the latest version.
    • At least one site (IN2P3) started to have problems with PhEDEx when FTS at CERN was upgraded to 3.7.4. They were running an old version of the FTS client (3.4); we have asked CMS to ensure sites use at least v3.5.7, which is the WLCG baseline.
  • T0 and T1 services
    • CERN
      • check T0 report
      • Upgrades to xrootd 4.6 on all Castor nodes. EOS 4.1.29 + xrootd 4.7.0 for LHCb, EOS 0.3.268 for ATLAS, EOS 0.3.265 for CMS and ALICE.
      • FTS upgrade to v 3.7.4
      • EOS ALICE upgrade to 4.1.29 Citrine on Monday Sep 18
    • CNAF
      • CMS: after the overload experienced in the past weeks, the 4 combined GridFTP-xrootd servers have been split into 2 GridFTP-only and 2 xrootd-only servers. The overload persists on xrootd; the reason has not been understood yet.
    • IN2P3
      • The last T10KC media were removed from the tape library; all data are now on T10KD.
      • Minor upgrade of Xrootd and dCache on September 19.
    • KIT:
      • Getting in contact with VOs in order to find suitable dates for upgrading dCache to 2.16 this year.
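The FTS client baseline mentioned under Issues can be checked with a simple dotted-version comparison. A minimal sketch; the `meets_baseline` helper and hard-coded baseline are illustrative, not part of any WLCG tooling:

```python
def meets_baseline(version, baseline="3.5.7"):
    """Return True if a dotted version string is at or above the baseline."""
    as_tuple = lambda v: tuple(int(x) for x in v.split("."))
    return as_tuple(version) >= as_tuple(baseline)

# The old IN2P3 client (3.4) is below the WLCG baseline; 3.7.4 is fine:
print(meets_baseline("3.4"))    # False
print(meets_baseline("3.7.4"))  # True
```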

Discussion

Tier 0 News

  • Planning to move the bulk of grid resources from LSF (CREAM) to HTCondor in coming weeks.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • The activity levels typically have been very high
    • New concurrent job records were reached, up to 155k
  • CERN
    • Some fallout from instabilities of the HTCondor CEs
    • Now very stable for many weeks:
      • newer HTCondor versions fixed various issues
      • more CEs were added to spread the load better
      • ALICE jobs by default no longer request any output sandbox
      • job status monitoring frequency has been lowered by a factor 10
  • CERN: Oracle cloud project resources being used since the weekend
    • Up to ~9k cores with good results
    • So far a mix of MC and reco jobs
    • Analysis to be tried later

ATLAS

  • Smooth data processing at the Tier-0. Smooth production operations with ~400k concurrent running job slots; good contribution from HPCs in the past month, with peaks up to 700k.
  • The MC16 and MC15 Monte Carlo production campaigns are running at full speed; derivation production on data and MC runs in parallel. Overlay production, which stresses the Frontier servers, is running in throttled mode (capped at 1600 jobs).
  • The CERN Ceph instance was degraded due to network instabilities early last week; EOS was also affected, and central services monitoring was unavailable for some time.
  • The RAL Frontier server was overloaded last weekend by detector-group jobs requiring a large amount of conditions data; production was stopped, and two reboots were needed for the service to return to operation.
  • The bigpanda.cern.ch monitor was migrated from http to https on Sept 4 (without the CERN SSO login requirement; the issue is being discussed in INC1446438, "Chrome not supported for accessing BigPanDA using SSO").
  • Preparing for the Software&Computing Technical Interchange Meeting next week; emphasis on Singularity deployment plan, data access over WAN/LAN, compute resource descriptions, HPCs, computing model evolution for Run-3 and beyond.

CMS

  • The LHC delivers good luminosity despite the 16L2 problem
  • Data parking being setup at P5
  • Compute systems busy after low processing demand earlier this year
    • legacy re-reprocessing of 2016 data in progress
    • Monte Carlo simulations for 2017 analyses
    • for some MCs only the small analysis data tier is stored now
    • Phase 2 upgrade simulations stress storage systems at sites due to high pile-up needed
  • High load also in the transfer system
    • CERN EOS missing/lost/inaccessible file issues caused some operational issues during the summer but are now resolved
    • backlog during the summer worked down
    • system coping well with load
  • overall CPU efficiency of pilots being investigated
    • task force setup early summer
    • first set of improvements implemented
      • depth-wise filling instead of cross-pilot filling
      • faster pilot retiring
      • queuing timeout to react faster to decreased demand
      • some fixes in the WMAgents that led to too short jobs
  • Reduced number of stage-out plugins in use at sites
    • CMS recommendation for sites is to use gfal2 and xrdcp
  • Interest to run MPI applications for ME generators
    • a few sites volunteered to support a first test
  • Singularity ready and being deployed
    • all issues on the CMS side (except a minor one) sorted out/resolved
  • Small progress toward IPv6
    • all CMS Tier-1,2 sites queried, 12 sites ready/checked
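The pilot-filling improvement listed above (depth-wise filling instead of cross-pilot filling) can be illustrated with a toy scheduler. This is only a sketch of the idea, not the actual CMS/glideinWMS logic, and all names are hypothetical: filling pilots one at a time leaves tail pilots completely idle, so they can retire quickly when demand drops, whereas round-robin filling keeps every pilot partially busy.

```python
from collections import deque

def fill_depth_wise(pilot_slots, jobs):
    """Fill one pilot to capacity before moving to the next."""
    jobs = deque(jobs)
    assignment = {p: [] for p in pilot_slots}
    for pilot, slots in pilot_slots.items():
        while len(assignment[pilot]) < slots and jobs:
            assignment[pilot].append(jobs.popleft())
    return assignment

def fill_cross_pilot(pilot_slots, jobs):
    """Round-robin: one job to each pilot in turn; none stays fully idle."""
    jobs = deque(jobs)
    assignment = {p: [] for p in pilot_slots}
    while jobs:
        progress = False
        for pilot, slots in pilot_slots.items():
            if jobs and len(assignment[pilot]) < slots:
                assignment[pilot].append(jobs.popleft())
                progress = True
        if not progress:  # all pilots full
            break
    return assignment

# 3 pilots with 4 slots each, only 5 jobs left in the queue:
print(fill_depth_wise({"p1": 4, "p2": 4, "p3": 4}, list(range(5))))
# p3 stays empty and can retire
print(fill_cross_pilot({"p1": 4, "p2": 4, "p3": 4}, list(range(5))))
# every pilot is partially busy
```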

LHCb

  • High activity on the grid, keeping an average of 60K jobs
  • No significant problems

Ongoing Task Forces and Working Groups

Accounting TF

  • Issues with CERN accounting for CMS: the CPU consumption in the CERN summaries for CMS is very low (not confirmed by the Dashboard), which causes a very low efficiency for CMS in the accounting reports. This has to be investigated and corrected.
  • Discussion about storage space reporting took place at the September pre-GDB. The final agreement on the proposal is expected at the October GDB. The progress on the storage space accounting prototype strongly depends on the implementation of the storage space reporting proposal.
  • A couple of new requests for the EGI portal have been submitted.

Information System Evolution TF

  • No meetings during summer months.
  • CRIC development is progressing well.


IPv6 Validation and Deployment TF


A support unit for IPv6 has been created in GGUS. Some experts from the HEPiX IPv6 working group have volunteered to be members of it.

A WLCG broadcast will be sent very soon with this content:

The WLCG management and the LHC experiments approved several months ago (+) a deployment plan for IPv6 (++) which requires that:

  • all Tier-1 sites provide dual-stack access to their storage resources by April 1st 2018
  • all Stratum-1 and FTS instances for WLCG need to be dual-stack by April 1st 2018
  • the vast majority of Tier-2 sites provide dual-stack access to their storage resources by the end of Run2 (end of 2018).

All WLCG sites are therefore invited to plan accordingly in case they have not yet met these requirements. Individual tickets will be sent in the coming weeks to Tier-2 sites (Tier-1 sites are already tracked separately) to track their progress.
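As a first self-check against the deadlines above, a site can verify that a service endpoint resolves to both IPv4 and IPv6 addresses. A minimal sketch; the `dual_stack` helper is illustrative, and full dual-stack readiness of course also requires the service itself to listen on both protocols:

```python
import socket

def dual_stack(host):
    """Return (has_ipv4, has_ipv6) based on address resolution for host."""
    families = set()
    try:
        for family, _, _, _, _ in socket.getaddrinfo(host, None):
            families.add(family)
    except socket.gaierror:
        pass  # name does not resolve at all
    return socket.AF_INET in families, socket.AF_INET6 in families

# Example: check a (hypothetical) storage endpoint
# v4, v6 = dual_stack("se.example-site.org")
```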

Various support channels are available. Interested sites may also join the HEPiX IPv6 working group (https://hepix-ipv6.web.cern.ch/), which provides some documentation.

(+) https://espace.cern.ch/WLCG-document-repository/Boards/MB/Minutes/2016/MB-Minutes-160920-v1.pdf

(++) https://indico.cern.ch/event/467577/contributions/1976037/attachments/1340008/2017561/Kelsey20sep16.pdf

Discussion

  • Andrea:
    • the readiness of sites and services will be tested through IPv6 ETF instances
    • we will discuss with OSG how they want to handle matters; here we focus on EGI sites

Machine/Job Features TF

Monitoring

MW Readiness WG


Network and Transfer Metrics WG


  • WG update will be presented at HEPiX and LHCOPN/LHCONE workshop (co-located)
  • perfSONAR 4.0.1 was released and auto-deployed to 187 instances (21 of which are already on CentOS 7)
  • WLCG/OSG network services
    • New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
    • New central mesh configuration interface (MCA) and monitoring (ETF) in production (http://meshconfig.grid.iu.edu; https://psetf.grid.iu.edu/etf/check_mk/)
    • OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
      • GOC will distribute raw data to 3 different locations, FNAL for tape archive, Nebraska for long-term ES storage, Chicago for short-term ES storage
    • The central dashboard service (psmad.grid.iu.edu) suffers from a bug that prevents statuses from being shown correctly (as well as retrieving the graphs); ESnet is working on a fix
    • Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.

Squid Monitoring and HTTP Proxy Discovery TFs

  • A new frontier-squid-3.5.27-1.1 was released that fixes a crash when an MRTG query comes in over IPv6 to a squid running multiple workers (typically only done on machines with 10 Gbit/s bandwidth).
  • ATLAS is working on making an ATLAS-only subset web page of the wlcg-squid-monitor "WLCG" or "all" page, based on squids that are registered in AGIS.

Traceability and Isolation WG

Special topics

CVMFS Stratum-1 configuration changes

  • Plan: change the cern.ch stratum 1 server settings in the EGI cvmfs configuration repository to use port 8000 instead of port 80.
  • Background:
    • All cvmfs stratum 1s support both port 80 and port 8000. This practice began when it was observed that outgoing connections on port 80 at FNAL were being intercepted (for security logging) and a 'Pragma: no-cache' header was inserted, interfering with caching. BNL and RAL have also observed problems with outgoing port 80.
    • The egi.eu and opensciencegrid.org repositories have been configured to use port 8000 everywhere for years. The cern.ch configuration predates the use of port 8000 and has not been changed from port 80 in the cvmfs-config-default rpm.
    • OSG sites do not use cvmfs-config-default; they use cvmfs-config-osg, which puts all other shared configuration in a cvmfs configuration repository. The cern.ch configuration there was updated in January 2017 to use port 8000.
    • EGI sites mostly use cvmfs-config-default, but beginning this month there will be a cvmfs-config-egi rpm in the EGI UMD repository, so sites will begin to migrate to the EGI configuration repository.
    • The EGI configuration repository was copied from the OSG configuration repository in 2016, and the cern.ch configuration there has not yet been updated.
    • At the end of 2016, the ASGC (Taiwan) stratum 1 started to block port 80 by default because of denial of service attacks, but they did not block port 8000.
      • At that time, cvmfs-config-default was changed to remove the ASGC stratum 1, because of this issue and other instability issues with the ASGC stratum 1. Because the configuration is an rpm, there are probably many sites that have not yet upgraded.
      • OSG instead changed all cern.ch servers to port 8000 and left ASGC in. Because it is in a configuration repository, it took effect immediately.
      • The EGI configuration repository still has ASGC and all other cern.ch servers on port 80.
  • Future: IHEP in Beijing, China, also now has a full stratum 1, but it is not configured for worldwide clients yet
    • There are questions about whether their bandwidth is sufficient
    • Maarten will try to recruit sites in Asia to do performance tests comparing that stratum 1 to others
    • They are working to improve their bandwidth to the LHCONE network
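For illustration, the planned change amounts to switching the port in the stratum 1 URLs of the cern.ch domain configuration. A sketch of what such an entry looks like; the server list shown here is abbreviated and the hostnames, while real stratum 1s, do not reflect the full list in the EGI configuration repository:

```shell
# Illustrative cern.ch domain configuration entry (e.g. cern.ch.conf);
# @fqrn@ is expanded by cvmfs to the fully qualified repository name.
CVMFS_SERVER_URL="http://cvmfs-stratum-one.cern.ch:8000/cvmfs/@fqrn@;http://cernvmfs.gridpp.rl.ac.uk:8000/cvmfs/@fqrn@"
```

Because this lives in a configuration repository rather than an rpm, the change takes effect on clients without a package upgrade.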

Discussion

  • Maarten:
    • as the change from port 80 to 8000 does not look controversial,
      we go ahead and announce when it has been done
    • we keep ASGC in the configuration (on the new port), as it should be in better shape now
  • Marian: ASGC should have their Stratum-1 on the right routers to ensure good performance
  • Felix: we will look further into that

Theme: Providing reliable storage - CNAF

See the presentation

  • Vladimir: the necessary bandwidth per service on page 17 is obtained through Ethernet aggregation

  • Christoph:
    • most of the CMS recall activity actually was intended and correct
    • unfortunately it got overlaid by the unintended recall of many small files
    • the CMS transfer system did not manage to source those files from disk elsewhere
    • it then fell back onto staging them in from tape
    • the algorithm and policy for such cases need to be reviewed

  • Luca: how do you handle GPFS metadata server failover?
  • Vladimir:
    • both servers are active
    • there are 2 SSD subsystems, all metadata is on both

  • Maarten: do you have any concerns about some StoRM services being a SPOF?
  • Vladimir:
    • since ~1 year we did not need to fail over to standby services
    • in the past it was necessary more often

Data transfer monitoring

See the presentation

  • Luca:
    • the 30-day and 5-year dashboards are different instances
    • different sources can be combined into a single view

  • Maarten: could the alarming functionality be used to populate another view,
    instead of sending an e-mail? to allow having all alarms on one page?
  • Luca: the system allows that to be implemented with some development

  • Vladimir: can the symbols on the axes be changed for plots in presentations?
  • Luca: Grafana is very configurable; let's check offline

  • Julia: do you use Google Analytics to check the uptake of the new dashboards?
  • Luca: we use that for the old ones, we have our own accounting for the new ones

  • Julia: what is the status for the ATLAS DDM dashboard?
  • Luca: it is being validated and looks quite good already;
    it should be the next one to go to production

  • Julia: was the WLCG Transfer dashboard validated?
  • Luca: we did a detailed comparison covering about 1 month
  • Julia: the Xrootd collector was never fully completed in the old dashboard
  • Luca: we will look further into that after the migration

Action list

Creation date Description Responsible Status Comments
01 Sep 2016 Collect plans from sites to move to EL7 WLCG Operations Ongoing The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress.
Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting.
March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5 to be released in May
May 18 update: UMD 4.5 has been delayed to June
July 6 update: UMD 4.5 has been delayed to July
Sep 14 update: UMD 4.5 was released on Aug 10, containing the WN; CREAM and the UI are available in the UMD Preview repo; CREAM client tested OK by ALICE
03 Nov 2016 Review VO ID Card documentation and make sure it is suitable for multicore WLCG Operations Pending Jan 26 update: needs to be done in collaboration with EGI
26 Jan 2017 Create long-downtimes proposal v3 and present it to the MB WLCG Operations Pending May 18 update: EGI collected feedback from sites and propose a compromise - 3 days' notice for any scheduled downtime
18 May 2017 Follow up on the ARC forum for WLCG site admins WLCG Operations DONE  
18 May 2017 Prepare discussion on the strategy for handling middleware patches Andrea Manzi and WLCG operations DONE  
06 Jul 2017 Ensure a forum exists for discussing tape matters WLCG Operations In progress  
14 Sep 2017 Follow up on CVMFS configuration changes and check effects on sites in Asia WLCG Operations New  

Specific actions for experiments

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion

Specific actions for sites

Creation date Description Affected VO Affected TF/WG Comments Deadline Completion

AOB

  • Marian: the Network and Transfer Metrics WG intend to do a WLCG site survey regarding networks
  • Julia: that is fine, thanks for informing this meeting
Topic revision: r19 - 2017-09-15 - VladimirRomanovsky
 