WLCG Operations Coordination Minutes, April 6th 2017
Highlights
Agenda
Attendance
- local: Andrea (MW Officer + data management), Brij (CMS), Christoph (CMS), Giuseppe (CMS), Julia (WLCG), Kate (WLCG + CMS + databases), Maarten (WLCG + ALICE), Marian (networks), Vincent (security)
- remote: Antonio (PIC), Carles (PIC), Di (TRIUMF), Greg (APEL), Javier (IFIC), Jordi (PIC), Kyle (OSG), Lucia (CNAF), Max (KIT), Nurcan (ATLAS), Pepe (PIC), Renaud (IN2P3), Thomas (DESY), Xin (BNL)
- apologies: UK reps are attending the GridPP collaboration meeting
Operations News
- This month's pre-GDB on Tue April 11 is devoted to collaborating with other communities
- This month's GDB on Wed April 12 will have a session on containers
- The next Ops Coordination meeting is proposed to be held on May 18
- Please let us know ASAP if that date poses a significant problem (wlcg-ops-coord-chairpeople at cern.ch)
Middleware News
- Useful Links:
- Baselines/News:
- New UMD 3 and 4 releases during this period:
- SL5 reached its end of life on 31/03/2017. According to EGI monitoring, no SL5 services should be running any more. Sites still running SL5 resources should decommission them ASAP.
- Issues:
- After a new version of gfal2 (2.13) was deployed on the FTS hosts at CERN and RAL, CMS discovered changes in the way checksums are calculated at some sites. Given the impact, it was decided to roll back to the previous version of gfal2 (the issues on the gfal2 side have been fixed in the meantime). A manual-check sketch with the gfal2 Python bindings follows this list.
- Operational problem when upgrading the FTS DB at CERN as part of the update to v3.6.7 (OTG:0036830). Only submitted/active transfers were imported into the new tables (plus all LHCb transfers); already completed CMS/ATLAS transfers were not imported (not a problem for ATLAS, no feedback from CMS)
- SLAC dCache transfer errors after enabling IPv6 on the RAL FTS (GGUS:127262). dCache support is investigating; in the meantime the RAL FTS has disabled IPv6 transfers.
- An issue affecting IPv6 transfers to DPM when GridFTP redirection is enabled has been discovered (GGUS:127285). We suggest that sites not enable GridFTP redirection until the issue is fixed.
- GGUS:126874: ALARM ticket because LHCb VOMS users got suspended. A bug in VOMS was found and fixed by the developers (the fix is already part of the latest UMD releases)
- StoRM SRM space is not updated when a deletion goes through WebDAV (GGUS:126896)
- High-risk Linux kernel privilege escalation vulnerability CVE-2017-2636 (https://wiki.egi.eu/wiki/SVG:Advisory-SVG-CVE-2017-2636), similar to the one reported in March. Sites should apply the kernel patches or the mitigations described in the advisory.
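For context on the gfal2 checksum issue above: FTS asks gfal2 for a checksum on each side of a transfer and compares them, so a change in how the value is obtained at some sites breaks verification. Below is a minimal sketch of a manual check with the gfal2 Python bindings; the URL is a placeholder, not a real endpoint.

```python
import gfal2  # gfal2 Python bindings

def adler32_of(url):
    """Return the ADLER32 checksum of a file, the algorithm
    commonly used by FTS for transfer verification."""
    ctx = gfal2.creat_context()
    return ctx.checksum(url, "ADLER32")

# example (placeholder URL):
# print(adler32_of("gsiftp://storage.example.org/path/to/file"))
```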
- T0 and T1 services
- ASGC
- DPM is now dual stack (IPv4/IPv6)
- BNL
- Major dCache upgrade to v3.0.11 on CentOS 7.3; PostgreSQL was upgraded to 9.6 and Xrootd to v4.6.0
- CERN
- check T0 report
- FTS upgraded to v3.6.7; gfal2 downgraded to v2.12.3
- IN2P3
- Migration of the core dCache servers to CentOS 7 completed (v2.13.56)
- Chimera PostgreSQL upgraded to 9.6
- SRM PostgreSQL upgraded to 9.5
- JINR
- dCache major upgrade 2.13.51 -> 2.16.31 and Postgres upgrade 9.4.11 -> 9.6.2 done on the disk instance for CMS
- The same dCache and Postgres upgrades are planned for tomorrow on the tape instance for CMS
- KIT
- Updated dCache to 2.13.56 for CMS on March 22nd; this resolved issues with certificates issued by certain CAs
- PIC
- dCache major upgrade to v2.16.28; Xrootd door upgraded to v4.5
- FTS upgraded to 3.5.5 (WLCG ‘emergency’ use)
- RAL
- gfal2 downgraded to v 2.12.3 on FTS nodes
Tier 0 News
Tier 1 Feedback
Tier 2 Feedback
Experiments Reports
ALICE
- Activity levels have typically been very high
- New concurrent job records, up to 136k
- Central service issues had some impact
- A bad RAID controller slowed down I/O
- MySQL DB corruption, cured by a restore
- CERN: CASTOR got overloaded by reco jobs running at CERN and the T1 sites
- Cured by limiting the number of parallel transfers + request queuing
ATLAS
- Reprocessing campaign started on March 26th with bulk production. Initial troubles with the Frontier servers at LYON and RAL are still under investigation; one optimization was done on the conditions database side (disallowing modifications of the last change time of the DCS folders), another fixes the conditions data folders with a missing cache tag. Currently running with 130k slots out of ~300k; the MC16a production campaign is also running at full speed.
- ATLAS Software & Computing week took place in the week of March 17th. A parallel session on Docker containers and Singularity drew good interest; the topic will be followed up in the GDB meeting on April 12th.
CMS
- Phase 1 and 2 upgrade Monte Carlo generations ongoing
- Planning for re-RECO of 2016 RAW data starting in April
- cosmic data recording (with magnet still off) started at P5
- staging tests at Tier-1s in progress
- Puppet 4 configuration issues: 4 weeks ago they took out the DB Frontier service; last week they deleted cron entries on hosts with non-standard configurations
- An Oracle database migration uncovered incomplete jumbo frame handling. (Many thanks to Kate for her effort and engagement late into the evening, and to the admins at CNAF and Wisconsin for debugging. The source was guessed two days later by the CNAF people.)
- EOS issue, GGUS:127297
- moving the central services of the CMS global pool / workflow management system back to CERN still in progress
- old machines are still draining
- preparing for the Singularity roll-out/switch (this will later phase out glexec also for SL6)
- some SAM aspects still to be addressed
- Maarten: to be discussed further in the container session of the GDB next week
LHCb
Discussion
Ongoing Task Forces and Working Groups
Accounting TF
- The meeting on the 9th of March was dedicated to the integration of non-pledged resource usage into APEL. Currently the main issue is how to benchmark non-pledged resources.
- The T0 accounting data generated by the new system have been validated: a comparison with the data in the CMS and ATLAS dashboards showed good agreement.
- Christoph: there still is 1 CMS site showing discrepancies
- Julia: its Dashboard numbers are a lot lower than what the Accounting Portal shows; let's follow up offline with the monitoring team
- Work has started on WLCG storage space accounting; the next meeting, on the 20th of April, will be dedicated to this topic.
Information System Evolution TF
Julia: CRIC is being implemented. More news next time.
IPv6 Validation and Deployment TF
Machine/Job Features TF
Monitoring
- NTR. Sorry for not being able to participate in the meeting.
MW Readiness WG
This is the status of JIRA ticket updates since the last Ops Coord meeting of 2017-03-02:
- MWREADY-143 FTS 3.6.x for ATLAS, CMS and LHCb at CERN - completed
- MWREADY-140 ARC-CE 5.2.2 on C7 for CMS at Brunel - completed
- MWREADY-141 dCache 3.x at PIC for CMS - testing, together with the dCache developers, a new version of dCache that should fix an IPv6 issue
Regarding the UI/WN bundle for C7:
- new wiki pages created: (EL7 WN, EL7 UI)
- A new version of the WN bundle is ready, including the LB dependencies (needed by CREAM CE jobs). We opened a ticket asking for these dependencies to be removed in future CREAM releases (GGUS:127020); this is also needed because one of them conflicts with the latest HTCondor versions.
- We asked EGI to also include the cvmfs and yaim-clients RPMs in the preview repository, so that they can be added as dependencies of the bundle
- Liverpool and Manchester have joined the testing activity for ATLAS and LHCb (MWREADY-145)
- Brunel is also planning to do tests for CMS and ATLAS (MWREADY-144)
Andrea: the bundle will be released as the WN meta rpm and also in CVMFS
Network and Transfer Metrics WG
- The LHCOPN/LHCONE workshop at BNL took place this week (https://indico.cern.ch/event/581520/)
- perfSONAR 4.0 to be released on 17th of April
- Sites on auto-updates will get it automatically - no action needed.
- Sites planning to update perfSONARs to CC7 are encouraged to wait until 4.1 is released.
- The minimal hardware requirements were raised: sites running perfSONAR hosts with less than 4 GB of RAM and a 2-core CPU with a clock speed below 2 GHz are encouraged to keep running the old version (3.5.1)
- WLCG/OSG network services
- New central mesh configuration interface (MCA) will be deployed to production - transition will be transparent to all sites
- MCA was developed by OSG, but is becoming part of perfSONAR
- Integrates perfSONAR lookup service with OIM/GOCDB services, so we can now easily add NREN perfSONARs into our meshes
- Monitoring was updated to cover new features released in 4.0 and is now based on ETF
- OSG collector was updated to collect additional perfSONAR metrics (such as TCP retransmits, path MTU, etc)
- LHCOPN traffic and LHCONE simulated link utilisation are now available for subscription from the netmon brokers
- WLCG Network Throughput Support Unit: see the twiki for a summary of recent activities.
- BNL/ASGC throughput improved by a factor of 10 - details were reported at the LHCOPN/LHCONE workshop
Marian: NREN perfSONARs had to be associated to "virtual" sites to allow them to be included, whereas now they can come from a separate information source
Marian: ASGC rewired their services, following advice from the WG and ESnet
Squid Monitoring and HTTP Proxy Discovery TFs
- Most CMS jobs at CERN are now using http://grid-wpad/wpad.dat (sketched below), but a small number of jobs still use an old IPv4-only service on the same servers, so we have not yet enabled frontier-squid-3 and IPv6. We are also working on setting up an extra Puppet branch so that half the servers can run squid-2 and half squid-3 without putting some on the 'qa' Puppet branch, which causes problems when IT updates the qa branch first; we are waiting on that as well.
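For reference, WPAD proxy discovery simply means a job fetches a proxy auto-config (PAC) file over HTTP and uses the proxies it names. Below is a minimal sketch of that lookup, assuming the grid-wpad URL above and the conventional "PROXY host:port" PAC syntax; the helper is illustrative, not CMS production code.

```python
import re
import urllib.request

WPAD_URL = "http://grid-wpad/wpad.dat"  # discovery URL named above

def discover_proxies(url=WPAD_URL):
    """Fetch the PAC file and extract the proxy endpoints it mentions."""
    pac = urllib.request.urlopen(url, timeout=10).read().decode()
    # PAC return values look like "PROXY host:port; PROXY host2:port; DIRECT"
    return re.findall(r"PROXY\s+([\w.\-]+:\d+)", pac)

if __name__ == "__main__":
    for proxy in discover_proxies():
        print("candidate squid:", proxy)
```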
Traceability and Isolation WG
Last meeting on 2017/04/04 (https://indico.cern.ch/event/624751/):
- Traceability challenges would be desirable before Singularity replaces gLExec
- A first challenge is planned for jobs running at CERN
- Maarten: the information currently available from gLExec logs will need to come from central services instead
- Alternative to VO traceability: pilots exposing information to sites
- A plugin extending HTCondor to provide historical information is under development
- Not a requirement for sites, only if they want the feature (pushed by Fermilab)
- Singularity (a minimal invocation sketch follows this list):
- Security review ongoing at CERN (no guarantees on time or completeness)
- Better support for unprivileged user namespaces in RHEL 7.4 (-> no suid needed): http://seclists.org/oss-sec/2017/q2/11
- Deployed in production at 3 CMS sites (not yet at 100%) for two weeks; no major bug or showstopper identified
- Long-term support is unknown (US DOE project), but wide use in US HPC is a good sign
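To make the glexec-replacement model concrete: the pilot itself launches each payload inside a container instead of switching Unix identity. A minimal sketch of such an invocation, assuming a singularity binary on the PATH and a CVMFS-hosted image; the image path and wrapper function are illustrative.

```python
import subprocess

def run_payload_in_container(payload_cmd,
                             image="/cvmfs/example.org/images/sl6"):  # illustrative path
    """Run a payload inside a Singularity container, roughly what a pilot
    would do once glexec is phased out. --contain restricts what the
    payload can see; -B /cvmfs makes the software repositories visible."""
    cmd = ["singularity", "exec", "--contain", "-B", "/cvmfs", image]
    return subprocess.call(cmd + payload_cmd)

# example: run_payload_in_container(["/bin/sh", "payload.sh"])
```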
GDB session on containers next Wednesday afternoon: https://indico.cern.ch/event/578985/
Theme: HTCondor accounting work at PIC
see the presentation
Discussion
- Carles: the blah.py script will be made available for testing by other sites; the modified code is based on packages from the UMD
- Max: GitHub would be a good place to put the code;
it does not have to be ready for production
- Greg: use pull requests to APEL on GitHub to propose code for integration
- Max: use the HTCondor Python bindings to push records directly to APEL? (see the sketch after this discussion)
- Carles: we can investigate that
- Andrea: will check if CERN sends its records like that
- Maarten: be careful with such shortcuts - republishing must also be easy!
- Greg: if an APEL DB is not used, the records can instead be written to an intermediate set of files that can be reused for republishing
- Thomas: might one make use of the schedd history file instead of the query?
- Carles: the condor_history command gives robustness and convenient formatting
- Thomas: might the ARC CE accounting format also be used for this case?
- Carles: we can look into that
- Max: we can provide sample records
- Max: have you discussed any of this with the HTCondor devs?
- Carles: no, this does not seem to be so interesting for them
- Pepe: we might present this at the European HTCondor workshop in June
- Julia: we need to coordinate a generic solution via the Accounting TF
- Pepe: we will put in production what we have now and report on stability etc.
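For reference on Max's suggestion: the HTCondor Python bindings can read the same completed-job history that condor_history serves, which is the natural starting point for building APEL records. A minimal sketch, assuming the bindings are installed and a local schedd; the attribute selection and record shape are illustrative, not the APEL schema.

```python
import htcondor  # HTCondor Python bindings

def completed_job_records(limit=100):
    """Yield accounting-relevant attributes of recently completed jobs,
    roughly what a condor_history query would return."""
    schedd = htcondor.Schedd()  # local schedd by default
    attrs = ["GlobalJobId", "Owner", "RemoteWallClockTime",
             "RemoteUserCpu", "RequestCpus", "CompletionDate"]
    # JobStatus == 4 means "Completed" in HTCondor
    for ad in schedd.history("JobStatus == 4", attrs, limit):
        yield {a: ad.get(a) for a in attrs}

if __name__ == "__main__":
    for rec in completed_job_records(limit=5):
        print(rec)
```

Writing such records to an intermediate set of files before publishing, as Greg suggests, also keeps republishing easy, per Maarten's caveat.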
Action list
Creation date | Description | Responsible | Status | Comments
01 Sep 2016 | Collect plans from sites to move to EL7 | WLCG Operations | Ongoing | The EL7 WN is ready (see the MW report of 29.09.2016). ALICE and LHCb can use it. NDGF plan to use EL7 for new HW as of early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. Jan 26 update: this matter is tied to the EL7 validation statuses for ATLAS and CMS, which were reported in that meeting. March 2 update: the EMI WN and UI meta packages are planned for UMD 4.5, to be released in May
03 Nov 2016 | Review VO ID Card documentation and make sure it is suitable for multicore | WLCG Operations | Pending | Jan 26 update: needs to be done in collaboration with EGI
26 Jan 2017 | Create long-downtimes proposal v3 and present it to the MB | WLCG Operations | Pending |
Discussion on the long-downtimes proposal
- Maarten:
- OSG have expressed strong concerns about the proposal
- It goes against the spirit of collaboration between sites and experiments
- It would require non-trivial effort to implement
- It appears to create a problem out of a non-issue
- I agree with those sentiments
- EGI have asked the NGIs to provide feedback in the coming weeks
- The proposal was not greeted with much enthusiasm there either
- The easiest would be to drop the proposal
- Julia:
- every experiment agreed that long downtimes should be announced long in advance
- the application to Availability and Reliability calculations seems questionable, though
- Maarten:
- every site has the right to declare downtimes as it sees fit
- we can create an official page with best-practice recommendations instead
- Pepe: how many problematic downtimes did we experience?
- Maarten: very few
- Julia: we will collect further information and discuss the matter at the workshop
Specific actions for experiments
Specific actions for sites
AOB