DRAFT

WLCG Operations Coordination Minutes, Oct 5th 2017

Highlights

Agenda

Attendance

TO BE FIXED AFTER THE MEETING

  • local: Andrea M (MW Officer + data management), Andrea S (IPv6), Julia (WLCG), Kate (WLCG), Luca (monitoring), Maarten (ALICE + WLCG), Marian (networks), Nurcan (ATLAS), Panos (WLCG), Vladimir (CNAF)

  • remote: Alessandra (Napoli), Christoph (CMS), Dave D (FNAL), Di (TRIUMF), Eddie (monitoring), Felix (ASGC), Frederique (LAPP), Massimo (LNL), Thomas (DESY), Xin (BNL)

  • apologies: Catherine (LPSC + IN2P3)

Operations News

  • GGUS: some GUI changes will be implemented for the Nov release
    to improve ticket categorization and prioritization
    • The ticket category will no longer have a default value (was: Incident)
    • The ticket priority will no longer have a default value (was: less urgent)
    • The ticket category Change Request will be renamed Service Request (doc)

  • The HEPiX autumn meeting 2017 will be held Oct 16-20 at KEK

Middleware News

  • Useful Links:
  • Baselines/News:
    • dCache 2.13 has reached end of life (EOL). 4 of 12 instances still have to upgrade.
  • Issues:
  • T0 and T1 service
    • BNL
      • FTS upgraded to v3.7.4
      • Xrootd 4.7.x under testing
    • CERN
      • check T0 report
      • EOS ALICE major upgrade to v4.1 (Xrootd 4.7 and IPv6 support)
    • IN2P3
      • Upgraded Xrootd to 4.6.1-1, upgraded dCache to 2.16.46
    • KIT:
      • Upgraded dCache for ATLAS to 2.16 on September 20th. IPv6 support was dropped after several issues
      • The downtime for updating dCache for CMS has been scheduled for October 11th. The LHCb update is pending
    • NDGF:
      • Upgraded dCache to 3.0.28 this week with xrootd fixes for the pools
    • RRC-KI-T1:
      • Upgraded dCache to 2.16.30

Discussion

Tier 0 News

  • The remaining CERN Grid capacity will be moved fully to HTCondor as soon as possible. The CREAM CEs will be retired.

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Activity levels have typically been very high
  • CERN: EOS-ALICE was upgraded to the Citrine release with XRootD 4.7
  • CERN: Oracle cloud project resources have been used with good results
    • Up to ~9k cores, located in Phoenix AZ
    • Mostly a mix of MC and reco jobs
    • Analysis jobs were briefly enabled: their efficiency was ~3%!
    • Reco was disabled on Oct 4 after CERN's commercial 10-Gbit uplink almost got saturated!

ATLAS

  • Smooth data processing at Tier0. Mu between 60 and 40 (start - end), processing time in the ballpark of 30-15 s/event. Compute resources completely full.
  • Smooth grid production operations with ~300k concurrent running job slots, and a good contribution from HPCs in the past month with peaks up to 600k.
  • Small grid operational hiccups in the authentication service of DDM/Rucio due to service node upgrades to CC7.4 and package dependency problems (https://cern.service-now.com/service-portal/view-request.do?n=RQF0850843)
  • In the next week(s) we will run a full Derivations campaign which will reprocess all the data AODs; the inputs amount to about 3-4 PB.
  • Changes in ATLAS Computing: Torre is now CC, with Davide Costanzo as his deputy. Thanks Simone! ATLAS Distributed Computing: ADG is now the ADC coordinator, with Johannes Elmsheuser as deputy ADC coordinator. Thanks Andrej and Nurcan!

CMS

  • High CPU utilization: ~200k cores were surpassed in the last week (with the T0 also involved)
  • High transfer activity - no major problems
  • CPU efficiency Task Force: work ongoing, some changes already implemented in our production system
  • all CMS sites have updated the WLCG IPv6 site survey
  • PhEDEx: observed tape staging of files that have disk copies (should be strongly disfavoured)
    • Correction to PhEDEx tape stage policy is ongoing to fix the problem
  • T1 usage sharing (presently 95/5 for prod vs analysis)
    • Possible increase of analysis sharing under discussion
  • Posix/cp stage-out commands need to be changed to gfal2/xrdcp/command to handle certificates for Singularity: the few affected sites have been contacted
  • Missing files were observed recently in the /store/unmerged area at a few sites. As a reminder, our policy is that only files that do not belong to active workflows and are older than two weeks should be removed
  • Migration to HTCondor for GRID submission (as requested by CERN): no problem expected
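
The /store/unmerged cleanup policy quoted above (remove only files that do not belong to active workflows and are older than two weeks) could be expressed roughly as follows. This is a minimal sketch, not the CMS cleanup tool: the function name, the use of file modification time as the age criterion, and the idea that active workflows are given as a list of directories are all assumptions made for illustration.

```python
import time
from pathlib import Path

# Two weeks, as stated in the policy (age measured by mtime is an assumption).
TWO_WEEKS_SECONDS = 14 * 24 * 3600


def removable_unmerged_files(root, active_dirs, now=None):
    """Return files under `root` that are older than two weeks and do not
    live under any directory listed in `active_dirs` (hypothetical input
    describing the active workflows)."""
    now = time.time() if now is None else now
    root = Path(root)
    active = [Path(d) for d in active_dirs]
    removable = []
    for path in sorted(root.rglob("*")):
        if not path.is_file():
            continue
        # Skip anything that belongs to an active workflow directory.
        if any(a == path or a in path.parents for a in active):
            continue
        # Keep files that are not yet older than two weeks.
        if now - path.stat().st_mtime > TWO_WEEKS_SECONDS:
            removable.append(path)
    return removable
```

The function only lists candidates and deletes nothing, so a site could review the result before acting on it.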

LHCb

  • LHCb is continuously running 60k jobs: data reconstruction, data stripping, Monte Carlo simulation and user analysis
  • Pre-staging of 2015 data has started. This will increase the load on the T0/T1 sites. The reprocessing of these data will start at the beginning of November.
  • LCG.Oracle.cern has been created for using external cloud resources provided by Oracle
  • Since the LHC technical stop the data processing applications depend on the “git” command being available on the worker nodes. This requires sites to update or install the HEP_OSlibs meta package
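
A site could verify the “git” requirement mentioned above on a worker node with a check along these lines. This is only an illustrative sketch: the function name is hypothetical, and it tests nothing more than whether the command is found on the PATH.

```python
import shutil


def missing_commands(required=("git",)):
    """Return the commands from `required` that are not found on the PATH."""
    return [cmd for cmd in required if shutil.which(cmd) is None]


if __name__ == "__main__":
    missing = missing_commands()
    if missing:
        print("Missing on this node:", ", ".join(missing))
    else:
        print("All required commands are available")
```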

Ongoing Task Forces and Working Groups

Accounting TF

Information System Evolution TF

  • CRIC development is progressing well.
  • An update is foreseen to be presented in the GDB next week.


IPv6 Validation and Deployment TF


Discussion

Machine/Job Features TF

Monitoring

MW Readiness WG


Network and Transfer Metrics WG


  • A WG update will be presented at HEPiX and at the co-located LHCOPN/LHCONE workshop
  • perfSONAR YouTube channel at https://www.youtube.com/channel/UCjK-P49pAKK9hUrrNbbe0Sg
  • perfSONAR 4.0.1 was auto-deployed to 197 instances (21 are already on CentOS 7)
    • Port 443/https is now used as a controller port for pscheduler and needs to be open on central firewalls
    • Some sites suffer from an MA access issue after the upgrade; this is being followed up
  • perfSONAR 4.0.2 is planned to be released in November
    • Brings new SNMP plugin that can be used to retrieve local site router traffic
  • WLCG/OSG network services
    • New documentation is in preparation and will be hosted at https://opensciencegrid.github.io/networking/
    • OSG collector handling multiple backends (Datastore, CERN ActiveMQ and GOC RabbitMQ) now in production
      • GOC will distribute raw data to 3 different locations: FNAL for tape archive, Nebraska for long-term ES storage, and Chicago for short-term ES storage
    • Preparing new LHCOPN and perfSONAR dashboards in collaboration with CERN IT/CS and IT/MONIT
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
  • HNSciCloud will create its own perfSONAR mesh to follow up on the network performance between providers and sites
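
Since port 443 now has to be reachable for pscheduler, a quick TCP reachability test from outside the site firewall might look like the sketch below. It is not part of perfSONAR: the function name and the host name in the usage comment are hypothetical, and a successful TCP connection only shows that the port is open, not that pscheduler itself is answering.

```python
import socket


def tcp_port_is_open(host, port=443, timeout=5.0):
    """Return True if a TCP connection to host:port succeeds within `timeout`."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution failures.
        return False


# Example (hypothetical host name):
# tcp_port_is_open("perfsonar.example.org", 443)
```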

Squid Monitoring and HTTP Proxy Discovery TFs

  • Nothing to report

Traceability WG

Container WG

Special topics

WLCG forum for tape matters

Theme: Providing reliable storage - KIT

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites e.g. Triumf are working on a container solution that could mask the EL7 env. for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting. March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5 to be released in May. May 18 update: UMD 4.5 has been delayed to June. July 6 update: UMD 4.5 has been delayed to July. Sep 14 update: UMD 4.5 was released on Aug 10, containing the WN; CREAM and the UI are available in the UMD Preview repo; CREAM client tested OK by ALICE |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI |
| 26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending | May 18 update: EGI collected feedback from sites and propose a compromise - 3 days' notice for any scheduled downtime |
| 06 Jul 2017 | Ensure a forum exists for discussing tape matters | WLCG Operations | In progress | |
| 14 Sep 2017 | Followup of CVMFS configuration changes, check effects on sites in Asia | WLCG Operations | Pending | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

AOB

Topic revision: r13 - 2017-10-05 - GavinMcCance