WLCG Operations Coordination Minutes, June 7th, 2018

Highlights

  • the next meeting will be on July 5

Agenda

Attendance

  • local: Alberto (monitoring), Ivan (ATLAS), Julia (WLCG), Konrad (LHCb), Laurence (T0), Maarten (ALICE + WLCG), Manuel (T0), Marian (networks), plus 1 NN
  • remote: Andrea (WLCG), Christoph (CMS), Di (TRIUMF), Eric (IN2P3-CC), Gareth (RAL), Giuseppe (CMS), Jeff (OSG), Matt (EGI), Mayank (WLCG), Muriel (LHCb), Sang-Un (KISTI), Stephan (CMS), Thomas (DESY)
  • apologies:

Operations News

  • Welcome to Matthew Viljoen in his new role as EGI Operations Manager!
  • Welcome to Jeffrey Dost in his new role as OSG Operations Manager!

Discussion

  • Matt introduced himself as the new EGI Operations Manager:
    • has been working for EGI for a few years
    • at RAL before that: responsible for CASTOR and DB services

  • Jeff introduced himself as the new OSG Operations Manager:
    • involved in OSG for ~10 years
    • working for UCSD
    • responsible for Glidein factories for CMS and ~10 other VOs

  • Jeff then summarized the recent changes in the OSG operations infrastructure:
    • core services have been migrated from Indiana University to other sites
    • their domains were moved to opensciencegrid.org
    • the old CA stopped accepting new certificate requests on May 31st
    • the new CAs are InCommon (for WLCG services) and Let's Encrypt
    • WLCG sites use GGUS for support, other sites use an OSG-specific helpdesk

  • Julia: what about OIM?
  • Jeff:
    • OIM info has been moved to GitHub
    • sites can send e-mail or submit pull requests (PR) for changes
    • better tools are in the making
  • Julia: what about downtimes, which are needed by experiment systems?
  • Jeff: sites can submit those through PR
  • Stephan:
    • the experiments are not concerned with that, they still see the same XML
    • only the URLs need to be adjusted
    • there is a ticket for monitoring to apply those changes

  • Julia: what about Gratia?
  • Jeff: its successor GRACC is running at Nebraska

  • Julia: could you provide a short summary of what needs to be done?
  • Alberto: that would be helpful also for monitoring
  • Jeff:
    • will provide a list of old URLs
    • many of these matters are detailed here

Special topics

Services for which a GDPR data privacy statement is required

presentation

  • Julia presented the slides attached to the agenda page

  • Maarten:
    • we will create an action on Operations and the experiments
    • it may take quite some time, because non-trivial developments may be needed
    • we need to show we are taking this seriously
    • by autumn the affected services should be in much better shape

  • Matt: is the GEANT Code of Conduct v2 available for reading?
  • Julia: there is a link to its current draft

Important note

After the meeting, David Foster (CERN Data Privacy Protection Officer) pointed out some statements in the presentation which were not correct. Julia discussed these matters with David. The main outcome: "Everything needs to be thought about in context of physical location and legal entity of a particular service".

CERN, as an international organization, cannot sign a code of conduct for a particular jurisdiction, since this would be a slippery slope: other jurisdictions may similarly overreach in future.

Therefore, there may be a different approach to privacy statements for the WLCG services depending on who is hosting them:

  • Services under the CERN legal entity, for which CERN is controller or processor.
    Privacy statement should be based on CERN regulation (Service Now, Privacy notice form)
  • Non-CERN, CoCo & “binding and enforceable commitments” (Complete template)

Additional guidance will come out in the future about extra-territorial reach of GDPR.

Middleware News

  • Useful Links
  • Baselines/News
  • Issues:
    • These matters were added afterwards for the record;
      already included in the Service Report for May
    • VOMS Java client saga
      • A new voms-clients-java rpm in EPEL caused problems at several sites
        for several experiments
      • It obsoletes the old voms-clients3 rpm
      • That rpm has a non-trivial post-uninstall script
      • The upgrade resulted in the removal of essential symlinks
      • A fixed version was quickly released
      • Affected hosts were typically fixed by a simple command
        • yum reinstall voms-clients-java
      • Further confusion arose from related rpm updates
        • voms-api-java3-3.0.6 (UMD) --> voms-api-java-3.3.0 (EPEL)
        • canl-java-1.3.3 (UMD) --> canl-java-2.5.0 (EPEL)
        • bouncycastle1.58-1.58 (EPEL) alongside bouncycastle-1.46 (UMD)
      • These updates happened even where the UMD repositories appeared to be
        protected (not yet understood)
      • To do: proper EPEL-testing coverage and monitoring on QA instances of services
    • Singularity saga
      • Several critical vulnerabilities affecting versions < 2.5.0
      • Advisories released by EGI and OSG
      • Version 2.5.0 OK for current uses by CMS and ATLAS
        • ALICE and LHCb workflows do not depend on it yet
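
The version threshold noted above (versions < 2.5.0 affected) can be checked mechanically. The following is a minimal sketch, assuming a version string that starts with dotted integers, such as 2.4.6 or 2.5.0-dist. The function names and the parsing rule are illustrative assumptions and are not taken from the EGI or OSG advisories.

```python
import re

# Versions below this are affected, per the minutes.
FIXED_VERSION = (2, 5, 0)

def parse_version(text):
    """Return the leading dotted integers of a version string as a tuple."""
    match = re.match(r"\s*(\d+(?:\.\d+)*)", text)
    if match is None:
        raise ValueError(f"no numeric version found in {text!r}")
    parts = tuple(int(p) for p in match.group(1).split("."))
    # Pad to three components so that "2.5" compares equal to "2.5.0".
    return parts + (0,) * (3 - len(parts))

def is_vulnerable(text):
    """True if the version is lower than 2.5.0 (hypothetical helper)."""
    return parse_version(text) < FIXED_VERSION
```

Comparing integer tuples rather than raw strings avoids the pitfall that "2.10.1" sorts before "2.5.0" as text.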

Discussion

  • Christoph: what is the procedure to get the FTS client baseline incremented?
  • Maarten:
    • first get agreement on the fts3-steering list
      • it has the devs, all experiments using the FTS, plus other experts
    • then inform the wlcg-ops-coord list
    • Ops Coordination will then:
      • ask the MW Officer to make the change in the baseline table
      • decide if any announcement is needed, and maybe even a deployment campaign

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal to high activity levels on average
  • No major problem

ATLAS

  • Stable grid production over the last weeks with up to ~300-350k concurrently running job slots. Additional HPC contributions with peaks of 900k concurrently running job slots. N.B. the actual core power of the HPC CPUs is approximately a factor of 5-10 lower than that of regular grid site CPUs.
  • Currently the usual mix of “CPU heavy” grid workflows (MC generation and simulation) is ongoing, with a smaller fraction of more I/O-intensive MC digitization+reconstruction and derivation production.
  • Smooth data processing at the Tier0 since the start of data taking with an increased capacity of 23k CPU cores. Final commissioning of the grid spill-over workflow for the B-physics stream.
  • Open-ended derivation production of the data18 collision data is ongoing.
  • Instability of EOSATLAS over the past weeks: there have been several multi-hour downtimes of EOSATLAS. They were addressed with highest priority by the admins and the developers (very good!), but is there a deeper systematic problem with the system? If so, how can it be addressed and what would be the alternatives?
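
To put the HPC peak from the first item in perspective, the quoted factor of 5-10 can be applied to the 900k slots. This is a rough sketch that assumes the factor applies uniformly to every HPC slot; the function name is illustrative.

```python
def grid_equivalent_slots(hpc_slots, factor_low=5, factor_high=10):
    """Return (lower, upper) grid-equivalent slot counts for HPC slots.

    Assumes every HPC slot is weaker than a grid slot by a factor
    between factor_low and factor_high (5-10 in the minutes).
    """
    return hpc_slots / factor_high, hpc_slots / factor_low

low, high = grid_equivalent_slots(900_000)
print(f"{low:,.0f} to {high:,.0f} grid-equivalent slots")
```

Under that assumption the 900k peak corresponds to roughly 90k to 180k regular grid slots, compared with the ~300-350k slots of grid production.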

Discussion

  • Ivan: we would like to see some followup on the frequent EOS crashes
  • Julia: we will check with the storage team

CMS

  • LHC ahead of expected luminosity profile
  • transfers from Tier-0 to Tier-1s busy, with a backlog at Fermilab due to a slow PhEDEx component (under investigation)
  • accidental data deletion at Tier-0 understood and corrected
    • no data loss
    • lost files are being recreated from Tier-0 input data
    • recovery requires extra Tier-0 processing
  • compute systems busy at about 230k cores, about 20% analysis and 80% production
  • processing backlog, lower/medium priority Monte Carlos not progressing much
  • more than half of CMS sites now have IPv6-accessible storage

LHCb

  • Production:
    • Production of Collision18 ongoing
    • Staging of data for reprocessing of Collision16
  • No major problems

Ongoing Task Forces and Working Groups

Accounting TF

  • NTR

Archival Storage WG

Update on providing tape info

| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | NO | | On the way |
| FNAL | YES | | |
| IN2P3 | YES | | |
| JINR | NO | | |
| KISTI | NO | | |
| KIT | YES | | |
| NDGF | NO | | NDGF has distributed storage and needs to aggregate data from all instances. It is being discussed whether this can be done on the side of the storage space accounting server |
| NLT1 | NO | | |
| NRC-KI | YES | | |
| PIC | YES | | |
| RAL | YES | | Not yet integrated in the portal; some minor changes are required. On the way |
| TRIUMF | YES | | |

Information System Evolution TF

  • The IS Evolution Task Force meeting in May discussed a draft of the format for the CE description which the CE services would be encouraged to publish.
  • At this meeting GocDB presented their development plans, which cover the development required for GDPR
  • CRIC for CMS is progressing. By mid-June, the prototype should provide the basic functionality requested by CMS

IPv6 Validation and Deployment TF

Detailed status here.

  • Andrea started with a high-level summary of the current state (see the link)

  • Andrea:
    • the situation in OSG was going to be tracked by OSG Operations
    • given the recent changes, we can submit GGUS tickets also to OSG sites
    • alternatively, the situation at OSG sites could be tracked by the experiments
      • that would be easy for CMS and ALICE
      • would it be easy for ATLAS?
  • Jeff: let's postpone the decision until we have discussed this within OSG

  • Andrea then summarized what can be concluded from the table listing the sites:
    • many sites are waiting for the necessary network infrastructure deployment
    • other sites need to deploy IPv6-ready services
    • most sites do not need help
    • the vast majority of sites expect to be OK around the end of this year
    • very few sites have indicated they cannot make it this year
    • overall the deployment appears to be going OK so far!

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


  • perfSONAR infrastructure status
    • perfSONAR 4.1 beta will be released in the coming weeks - main new feature is an improved central/remote configuration
    • CC7 campaign had only modest progress recently - 86 instances on CC7 (from 81 in April, out of total 210)
    • WLCG broadcast will be sent to remind sites to plan an upgrade to CC7 and review their configuration
  • WG update was presented at HEPiX and will be presented at CHEP
  • WLCG/OSG network services
    • Following the retirement of the OSG GOC, all central services were migrated to AGLT2, which took considerable effort in planning and deployment
    • Transition happened without downtime and was transparent to all sites
    • The one exception is sites using the old OIM/myOSG central configuration URL, which was deprecated during the 3.5 update campaign (meshconfig URLs starting with myosg.grid.iu.edu/pfmesh...)
    • Impacted sites are asked to update their meshconfig-agent.conf following http://opensciencegrid.org/networking/perfsonar/installation/#installation ASAP
  • WLCG Network Throughput Support Unit: see twiki for summary of recent activities.
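
Sites unsure whether they are affected by the deprecated meshconfig URL could scan their configuration for the old prefix. The sketch below is only an illustration: it treats the file as plain text, assumes lines beginning with # are comments, and uses made-up sample lines (the directive name and both example URLs are hypothetical). The linked installation guide remains the reference for the actual change.

```python
# Prefix of the deprecated URLs, as quoted in the minutes.
DEPRECATED_PREFIX = "myosg.grid.iu.edu/pfmesh"

def find_deprecated_mesh_urls(conf_text):
    """Return (line_number, line) pairs that still reference the old URL."""
    hits = []
    for number, line in enumerate(conf_text.splitlines(), start=1):
        stripped = line.strip()
        if stripped.startswith("#"):
            continue  # assumed comment syntax; commented-out URLs are ignored
        if DEPRECATED_PREFIX in stripped:
            hits.append((number, stripped))
    return hits

# Hypothetical file content, for illustration only.
sample = """\
# configuration_url http://myosg.grid.iu.edu/pfmesh/old-commented
configuration_url http://myosg.grid.iu.edu/pfmesh/hypothetical-mesh
configuration_url https://example.org/hypothetical-new-mesh
"""
print(find_deprecated_mesh_urls(sample))
```

In practice the text would be read from the site's meshconfig-agent.conf; an empty result suggests no active line still points at the old service.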

Squid Monitoring and HTTP Proxy Discovery TFs

  • The file that is generated as input to the wlcg-wpad.cern.ch service now also contains a list of public IPs for all squids at each site. This is being put to use by an ATLAS developer who is updating the failover monitor to use that file to determine which sites have many worker nodes connecting directly to central servers or backup proxies.
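
The idea behind using the squid IP list can be sketched as follows. This is not the failover monitor's actual code: the input layout (a dict of squid IPs per site and a list of site/source-IP pairs seen by a central server) and all names and addresses are hypothetical, chosen only to show how known squid IPs separate proxied traffic from direct connections.

```python
from collections import defaultdict

def direct_clients_per_site(squid_ips_by_site, observed_hits):
    """Count distinct non-squid source IPs per site.

    squid_ips_by_site: dict mapping site name -> set of public squid IPs
    observed_hits: iterable of (site, source_ip) pairs seen by a central server
    Both layouts are assumptions made for this sketch.
    """
    direct = defaultdict(set)
    for site, source_ip in observed_hits:
        if source_ip not in squid_ips_by_site.get(site, set()):
            direct[site].add(source_ip)
    return {site: len(ips) for site, ips in direct.items()}

# Documentation-range addresses and made-up site names.
squids = {"SITE_A": {"192.0.2.10"}, "SITE_B": {"198.51.100.5"}}
hits = [
    ("SITE_A", "192.0.2.10"),    # traffic via the site's squid
    ("SITE_A", "192.0.2.21"),    # direct connection
    ("SITE_A", "192.0.2.22"),    # direct connection
    ("SITE_B", "198.51.100.5"),  # traffic via the site's squid
]
print(direct_clients_per_site(squids, hits))
```

Sites with a high count would be the candidates for follow-up, since their worker nodes appear to bypass the local squids.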

Traceability WG

Container WG

Action list

| Creation date | Description | Responsible | Status | Comments |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | GGUS:133915 |
| 07 Jun 2018 | Followup of OSG service URL changes | WLCG Operations | New | |
| 07 Jun 2018 | GDPR policy implementation across WLCG and experiment services | WLCG Operations + experiments | New | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Comments | Deadline | Completion |

AOB

  • Di:
    • TRIUMF has started running ATLAS @ HOME jobs through BOINC
    • how can such jobs be accounted for like the other jobs that we run?
  • Julia:
    • an accounting prototype has been developed by Andrew McNab
    • we will check with him
Topic revision: r16 - 2018-09-15 - MaartenLitmaath