DRAFT

WLCG Operations Coordination Minutes, October 11th, 2018

Highlights

Agenda

https://indico.cern.ch/event/757611/

Attendance

  • local: Borja (WLCG), Fabrizio (data management), Johannes (ATLAS), Julia (WLCG), Maarten (ALICE + WLCG), Oliver (data management), Stefan (LHCb), Vladimir (CERN tape systems lead)
  • remote: Alessandra D (Napoli), Catherine (LPSC + IN2P3), Christoph (CMS), Dario (ATLAS), Dmitry (FNAL), Gareth (RAL), Giuseppe (CMS), Jeremy (GridPP), Matt (EGI), Natalia (FNAL), Ron (NLT1), Sam (Glasgow), Stephan (CMS), Xavier (KIT)
  • apologies:

Operations News

  • The next meeting will be on Thu Nov 1
    • Please let us know if that date would pose a significant problem.

Special topics

Follow up on August KIT data loss incident

See the presentation

  • Christoph:
    • we do not have a firm value for the number of CMS files lost
    • consistency checks are ongoing
    • many things were simply restarted from scratch
  • Julia: is everything OK now?
  • Christoph:
    • there was an impact, but we managed to deal with it
    • the matter can be considered closed now

  • Maarten: might the disk occupancy on the master have triggered an alarm sooner?
  • Xavier: it went up very quickly and we reacted as fast as we could

WLCG Archival WG update

See the first of the two presentations

  • Stefan: will the SPACM WG give us the relative costs of tape vs. disk?
  • Oliver: as the WG is concerned with a comprehensive analysis, the answer is yes

  • Johannes:
    • w.r.t. the ATLAS tape carousel R&D, we have already followed up with
      the T1 sites regarding parameters, issues, dCache improvements, etc.
    • the performance ought to be similar at all sites, but some sites were
      able to optimize their operations in advance
  • Julia: what about parallel activities from other experiments?
  • Johannes:
    • most of the time there was no problem
    • the most realistic test would be to send "random" tasks to the sites,
      so that they cannot prepare in advance; we will try that soon
  • Oliver: isn't it normal to warn them about significant tape usage?
  • Johannes: for reprocessing we do that indeed, but for continuous operations
    it should by design not be needed

See the second of the two presentations

  • Vladimir:
    • there has been good progress with the reporting of pledges and metrics
    • already there were some operational benefits, in particular the ten-fold
      reduction in tape mounts at NRC-KI
  • Julia:
    • the achieved improvements are impressive!
    • do you need help to follow up on open issues?
  • Vladimir:
    • as the open issues are complex, we will follow up within the WG
    • the metrics should be incorporated into the SRR dashboard to make them
      more official and to allow further benefits to be gained from them
  • Julia: their incorporation is advancing according to the planning

DPM Upgrade Task Force update

See the presentation

  • Fabrizio: the new accounting also differs from the StAR records
  • Maarten: will those records still be supported for EGI?
  • Julia:
    • there is an ongoing deployment campaign in EGI to get them into production
    • we did some checks and the values looked OK compared to those obtained through SRM
  • Oliver: StAR adds up by groups and users, reading directly from the DB
  • Fabrizio: that can be compatible with the use of DOME

  • Johannes: ATLAS have some concerns about the upgrade campaign,
    given that the Prague site was unexpectedly down for a few days
  • Fabrizio, Oliver: as we have improved things thanks to the feedback from Petr,
    the upgrade should go better at other sites, and we still have more test sites

  • Catherine: in France one site, CPPM Marseille, has already switched to DOME successfully

  • Sam:
    • in GridPP Lancaster gave it a try but had to back out because of issues that
      should be solved now; we will see if they can give it another go
    • there also is a test instance at Edinburgh

Middleware News

Important notice concerning the support of TLS v1.2 on WLCG

  • On Sep 21 a Globus update in the EPEL repositories made TLS v1.2
    the only version to be supported for security handshakes in GSI.
    • The concerned package is globus-gssapi-gsi-13.10 .
  • Unfortunately, a significant number of grid services in WLCG
    were not ready for that change and started running into failures.
  • We therefore asked for the minimum supported version to be set
    to TLS v1.0 again and we arranged for services like the FTS either not to
    apply the Globus update yet, or to adjust /etc/grid-security/gsi.conf :
       MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED
       
  • Version globus-gssapi-gsi-14.7-2 has that temporary workaround
    and should soon become available in EPEL.
    • It currently is present in the EPEL-testing repositories.
  • In the meantime we would like all potentially affected services
    to be checked and updated as needed.
  • Such services may directly depend on Globus themselves,
    but could also be based on Java instead.
  • Of particular concern are SRM, GridFTP, CE and Argus services.
    • SRM services listen on port 8443 (dCache), 8444 (StoRM) or 8446 (DPM).
    • The CREAM CE service listens on port 8443.
    • GridFTP services used by CREAM, ARC and SE head nodes listen on port 2811,
      while the port may be unpredictable on SE disk servers.
    • Argus listens on port 8154.
  • To test SRM, CREAM, Argus or any other HTTPS service, please run a command like this:
       openssl s_client -tls1_2 -connect HOST:PORT 2>&1 < /dev/null |
          egrep '^New|Protocol|known|Bad|refused|route'
       
  • The following output is a sign of failure:
       New, (NONE), Cipher is (NONE)
       
  • To test a GridFTP server, one needs a valid VOMS or grid proxy:
       env GLOBUS_GSSAPI_MIN_TLS_PROTOCOL=TLS1_2_VERSION uberftp HOST pwd
       
  • If any of those commands fails due to the TLS v1.2 requirement:
    please update Java/Globus on the affected service to a recent version,
    restart the service and try again.
  • We will need to set a deadline for TLS v1.2 support to early 2019
    and will let you know when the timeline has become clearer.
  • Please report issues you encounter through the usual channels.
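The openssl one-liner above can be wrapped in a small script to scan several services at once. A minimal sketch (the host/port list is a placeholder, not an actual WLCG inventory; the output classification is based on the failure signature quoted above):

```shell
#!/bin/sh
# Check whether services complete a TLS v1.2 handshake, following the
# openssl s_client test from these minutes.

# Pure helper: interpret openssl s_client output.
classify_handshake() {
    case "$1" in
        *"Cipher is (NONE)"*) echo "FAIL: TLS v1.2 handshake refused" ;;
        *"New,"*)             echo "OK: TLS v1.2 supported" ;;
        *)                    echo "UNKNOWN: check output manually" ;;
    esac
}

# Run the handshake test against one host:port and classify the result.
check_tls12() {
    out=$(openssl s_client -tls1_2 -connect "$1:$2" 2>&1 < /dev/null)
    echo "$1:$2 $(classify_handshake "$out")"
}

# Example service list (placeholder hostnames; ports from the minutes:
# dCache SRM 8443, StoRM 8444, DPM 8446, CREAM 8443, Argus 8154, GridFTP 2811).
# check_tls12 srm.example.org 8446
# check_tls12 cream.example.org 8443
# check_tls12 argus.example.org 8154
```

GridFTP endpoints cannot be tested this way, since they require a proxy; for those the uberftp test above applies.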

Tier 0 News

  • CERN would like to ask the experiments how much notice they would need before the majority of batch resources at CERN are changed to CC7, assuming any intervention would take a couple of weeks to roll out.

An action for the experiments has been created

  • Maarten: the main concern for ALICE is to avoid any significant drop in capacity
    during the heavy-ion run and afterwards
  • Johannes: for ATLAS it would be best to wait until Run 2 has ended
  • Christoph: for CMS the upgrade matters less thanks to Singularity,
    but it may be safer to wait until the run has finished
  • Stefan:
    • LHCb also would like the upgrade to be delayed until after the run
    • some very old MC workflows need to run on SL6, but could run at other sites;
      in the future they can be run under Singularity instead

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • Normal activity levels on average
  • No major issues

ATLAS

  • Smooth Grid production over the last weeks with ~300k concurrently running grid job slots. Additional HPC contributions with peaks of ~50k concurrently running job slots and ~10k jobs from Boinc.
  • Commissioning of the Harvester submission system via PanDA is ongoing on the Grid. The CERN, TW, ES, IT and UK clouds have largely been migrated.
  • Heavy-ion throughput tests from CERN Point 1 to EOS, to tape and to 3 Tier-1s all worked fine.
  • The first part of the tape carousel R&D campaign at the Tier-1s using 200-300 TB of AOD is finished. Stage-in rates from 300 MB/s to 3 GB/s have been observed at the different sites.

CMS

  • LHC running well and CMS is collecting good data, two more weeks of p-p running
  • heavy-ion P5-->EOS rate test successful on day two
  • finalizing software and operation model for heavy-ion run in November
  • stability of EOS fuse mount improved but still encountering read issues (e.g. on 2018-Oct-10) INC:1784940
  • two CMS EOS crashes in the last two weeks, both, curiously, on Thursdays
  • Fermilab FTS issue traced down to slow CERN-->Fermilab transfers, being investigated, GGUS:137632
  • switched the dominant workflow from the 2017 Monte Carlo configuration to 2018 MC
  • compute systems busy at above 200k cores, usual mix of about 75% production and 25% analysis

LHCb

  • Operations as usual, nothing specific to report

Task Forces and Working Groups

GDPR and WLCG services

  • Julia:
    • as proposed by Dario, the page now is protected through an e-group
    • it is called gdpr-services and the wlcg-operations e-group is a member
    • if you cannot access the page, please subscribe via the e-groups portal

  • Jeremy: how will the data privacy statements for VOMS etc. be followed up?
  • Maarten:
    • we are already in touch with the VOMS devs, but in general we first need to have
      the generic WLCG data privacy statement being worked on by Dave Kelsey et al
    • that statement would be served along with the AUP the user needs to agree to
    • services with GUIs will need specific data privacy statements that users can
      inspect and possibly need to agree to
    • we will try to have the most obvious cases dealt with first
    • the process may take many months to complete

Accounting TF

Archival Storage WG

Update of providing tape info

PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | YES | | KISTI has been contacted. Will work on it in the second half of September |
| KIT | YES | | |
| NDGF | NO | | NDGF has a distributed storage, which complicates the task. Discuss with NDGF the possibility to do aggregation on the storage space accounting server side. Should be accomplished by the end of the year |
| NLT1 | YES | | Almost done, waiting for opening of the firewall, a matter of a couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |
One can see all sites integrated in storage space accounting for tapes here

Information System Evolution TF

  • Ongoing discussion on the publishing of the CE configuration via JSON file. More details can be found here
  • Storage Resource Reporting implementation by all WLCG storage middleware providers is progressing. More details here
  • The next WLCG IS Evolution Task Force meeting will take place on the 18th of October. It will continue the discussion of the JSON file structure for CE configuration publishing. UK sites will present their first experience with publishing the CE description in JSON format.

IPv6 Validation and Deployment TF

Detailed status here.

Machine/Job Features TF

Monitoring

MW Readiness WG

Network Throughput WG


Squid Monitoring and HTTP Proxy Discovery TFs

  • LHC@Home is now almost completely switched to using openhtc.io (Cloudflare) cached cvmfs & CMS Frontier services instead of using squids at CERN & Fermilab (except for a small trickle of jobs accessing only /cvmfs/grid.cern.ch). Web Proxy Auto Discovery (WPAD) is used to discover squids when LHC@Home jobs are run at WLCG sites.
  • Plans are being made to integrate a shoal service (for dynamically registering squids) with the WLCG WPAD service. This is intended for squids running in clouds serving WLCG jobs. We will also exclude the dynamically registered squids from being treated as worker nodes in the failover monitor.

Traceability WG

Container WG

Action list

| Creation date | Description | Responsible | Status | Comments |
| 03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | In progress | GGUS:133915 |
| 07 Jun 2018 | GDPR policy implementation across WLCG and experiment services | WLCG Operations + experiments | Ongoing | Details here |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |
| 13 Sep 2018 | moving most of CERN batch to CC7 | all | - | 11 Oct | ongoing | how much advance warning needed? |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF/WG | Deadline | Completion | Comments |

AOB

-- JuliaAndreeva - 2018-10-08
