WLCG Operations Coordination Minutes, October 11th, 2018
Highlights
Agenda
https://indico.cern.ch/event/757611/
Attendance
Operations News
Special topics
Middleware News
- Useful Links
- Baselines/News
Important notice concerning the support of TLS v1.2 on WLCG
- On Sep 21 a Globus update in the EPEL repositories made TLS v1.2 the only version to be supported for security handshakes in GSI.
- The concerned package is globus-gssapi-gsi-13.10.
- Unfortunately, a significant number of grid services in WLCG
were not ready for that change and started running into failures.
- We therefore asked for the minimum supported version to be set to TLS v1.0 again and we arranged for services like the FTS either not to apply the Globus update yet, or to adjust /etc/grid-security/gsi.conf:
MIN_TLS_PROTOCOL=TLS1_VERSION_DEPRECATED
- Version globus-gssapi-gsi-14.7-2 has that temporary workaround and should soon become available in EPEL.
- It currently is present in the EPEL-testing repositories.
- In the meantime we would like all potentially affected services
to be checked and updated as needed.
- Such services may directly depend on Globus themselves,
but could also be based on Java instead.
- Of particular concern are SRM, GridFTP, CE and Argus services.
- SRM services listen on port 8443 (dCache), 8444 (StoRM) or 8446 (DPM).
- The CREAM CE service listens on port 8443.
- GridFTP services used by CREAM, ARC and SE head nodes listen on port 2811,
while the port may be unpredictable on SE disk servers.
- Argus listens on port 8154.
- To test SRM, CREAM, Argus or any other HTTPS service, please run a command like this:
openssl s_client -tls1_2 -connect HOST:PORT 2>&1 < /dev/null |
egrep '^New|Protocol|known|Bad|refused|route'
- The following output is a sign of failure:
New, (NONE), Cipher is (NONE)
- To test a GridFTP server, one needs a valid VOMS or grid proxy:
env GLOBUS_GSSAPI_MIN_TLS_PROTOCOL=TLS1_2_VERSION uberftp HOST pwd
- If any of those commands fails due to the TLS v1.2 requirement, please update Java/Globus on the affected service to a recent version, restart the service and try again.
- We will need to set a deadline for TLS v1.2 support to early 2019 and will let you know when the timeline has become clearer.
- Please report issues you encounter through the usual channels.
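The HTTPS check described above can be wrapped in a small script when many endpoints need to be tested. The following is a minimal sketch only, not an agreed procedure: the function names and the service labels are hypothetical, the ports are the defaults listed in these minutes, and the error strings matched are the usual system messages behind the "known", "refused" and "route" filter terms. GridFTP is left out because it needs the uberftp test with a valid proxy.

```shell
#!/bin/sh
# Sketch only: helper names and service labels below are hypothetical.

# Map a service label to the default port listed in these minutes.
default_port() {
    case "$1" in
        dcache-srm|cream) echo 8443 ;;
        storm-srm)        echo 8444 ;;
        dpm-srm)          echo 8446 ;;
        argus)            echo 8154 ;;
        *)                return 1 ;;
    esac
}

# Read filtered `openssl s_client` output on stdin and print a verdict:
# "fail" if no cipher was negotiated or the connection could not be made,
# "ok" if a cipher was negotiated, "unknown" for anything else.
classify_tls12() {
    out=$(cat)
    case "$out" in
        *"Cipher is (NONE)"*|*"Connection refused"*|*"No route to host"*|*"not known"*)
            echo fail ;;
        *"New, "*"Cipher is "*)
            echo ok ;;
        *)
            echo unknown ;;
    esac
}

# Run the TLS v1.2 probe from the minutes against HOST:PORT ($1).
check_tls12() {
    openssl s_client -tls1_2 -connect "$1" 2>&1 < /dev/null |
        grep -E '^New|Protocol|known|Bad|refused|route' |
        classify_tls12
}

# Example (hypothetical host name):
#   check_tls12 "srm.example.org:$(default_port storm-srm)"
```

An "unknown" verdict means the output matched none of the expected patterns and should be inspected by hand.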
Tier 0 News
- CERN would like to ask the experiments how much notice they would need to have the majority of batch resources here changed to CC7, assuming any intervention would take a couple of weeks to roll out. An action for the experiments has been created.
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Normal activity levels on average
- No major issues
ATLAS
- Smooth Grid production over the last weeks with ~300k concurrently running grid job slots. Additional HPC contributions with peaks of ~50k concurrently running job slots and ~10k jobs from BOINC.
- Commissioning of the Harvester submission system via PanDA is ongoing on the Grid. CERN and the TW, ES, IT and UK clouds have largely been migrated.
- Heavy Ion throughput tests from CERN Point 1 to EOS to tape and to 3 Tier1s all worked fine.
- The first part of the tape carousel R&D campaign at the Tier1s using 200-300 TB of AOD is finished. Stage-in rates from 300 MB/s to 3 GB/s have been observed at the different sites.
CMS
- LHC running well and CMS is collecting good data, two more weeks of p-p running
- heavy-ion P5-->EOS rate test successful on day two
- finalizing software and operation model for heavy-ion run in November
- stability of EOS fuse mount improved but still encountering read issues (e.g. on 2018-Oct-10) INC:1784940
- two CMS EOS crashes in the last two weeks, both on Thursdays (?)
- Fermilab FTS issue traced down to slow CERN-->Fermilab transfers, being investigated, GGUS:137632
- the 2018 Monte Carlo configuration has replaced the 2017 one as the dominant workflow
- compute systems busy at above 200k cores, usual mix of about 75% production and 25% analysis
LHCb
- Operations as usual, nothing specific to report
Task Forces and Working Groups
GDPR and WLCG services
Accounting TF
Archival Storage WG
Update on providing tape info
PLEASE CHECK AND UPDATE THIS TABLE
| Site | Info enabled | Plans | Comments |
| CERN | YES | | |
| BNL | YES | | |
| CNAF | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| FNAL | YES | | |
| IN2P3 | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| JINR | YES | | |
| KISTI | YES | | KISTI has been contacted. Will work on in the second half of September |
| KIT | YES | | |
| NDGF | NO | | NDGF has a distributed storage which complicates the task. Discuss with NDGF possibility to do aggregation on the storage space accounting server side. Should be accomplished by the end of the year |
| NLT1 | YES | | Almost done, waiting for opening of the firewall, order of couple of days |
| NRC-KI | YES | | |
| PIC | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| RAL | YES | | Space accounting info is integrated in the portal. Other metrics are on the way |
| TRIUMF | YES | | |
One can see all sites integrated in storage space accounting for tapes here.
Information System Evolution TF
- Ongoing discussion on the publishing of the CE configuration via a JSON file. More details can be found here.
- Storage Resource Reporting implementation by all WLCG storage middleware providers is progressing. More details here.
- The next WLCG IS Evolution Task Force meeting will take place on the 18th of October. It will continue the discussion of the JSON file structure for CE configuration publishing. UK sites will present their first experience with publishing the CE description in JSON format.
IPv6 Validation and Deployment TF
Detailed status here.
Machine/Job Features TF
Monitoring
MW Readiness WG
Network Throughput WG
Squid Monitoring and HTTP Proxy Discovery TFs
- LHC@Home is now almost completely switched to using openhtc.io (Cloudflare) cached cvmfs & CMS Frontier services instead of using squids at CERN & Fermilab (except for a small trickle of jobs accessing only /cvmfs/grid.cern.ch). Web Proxy Auto Discovery (WPAD) is used to discover squids when LHC@Home jobs are run at WLCG sites.
- Plans are being made to integrate a shoal service (for dynamically registering squids) with the WLCG WPAD service. This is intended for squids running in clouds serving WLCG jobs. We will also exclude the dynamically registered squids from being treated as worker nodes in the failover monitor.
Traceability WG
Container WG
Action list
Specific actions for experiments
Specific actions for sites
AOB
--
JuliaAndreeva - 2018-10-08