WLCG Operations Coordination Minutes, September 29th 2016

Highlights

  • Reminder to install the necessary patches as per the critical EGI Advisory-SVG-2016-11476; deadline to update & restart affected services: 2016-09-29 00:00 UTC!
    • Note that the Worker Nodes should also be patched (even though they are not affected), to avoid false positives in the EGI security monitoring.
  • The Sept 2016 WLCG MB approved the IPv6 deployment plan. Dual stack availability is mandatory for the Tier0 and Tier1s by April 2017, at least on a testbed. Subsequent deployment deadlines are given in the detailed report below.
  • A draft questionnaire on potential Lightweight Site alternatives was presented and discussed. The questionnaire is expected to be sent out before CHEP, so that trends from early results can be included in the corresponding presentation there.

Agenda

Attendance

  • local: Alessandra Forti (chairperson), Maria Dimou (minutes), Alberto Aimar, Maarten Litmaath, Julia Andreeva, Andrea Sciaba, Andrea Manzi, Raja Nandakumar.
  • remote: Di Qing, Andreas (KIT), Andrew McNab, Shawn McKee, Marian Babik, Javier Sanchez, Dave Mason, Dave Dykstra, Giuseppe Bagliesi, Stephan Lammel, Alessandra Doria, Renaud Vernet, Pepe Flix, Paige Winslowe Lacesso.
  • Apologies: Nurcan Ozturk (ATLAS report), Ulf Tigerstedt (NDGF report), Maria Alandes (Info Sys TF report).

Operations News

Middleware News

  • Baselines/News:
    • a new version of the WN bundle has been published to CVMFS (/cvmfs/grid.cern.ch/emi-wn-3.17.1-1_sl6v1); it includes gfal2 v2.11 (GGUS:123994)
    • a new version of edg-mkgridmap (4.0.4) has been released to fix a problem on SL7/C7; it is available in EPEL-testing and soon in UMD

  • Issues:
    • critical EGI Advisory-SVG-2016-11476, deadline to update & restart affected services: 2016-09-29 00:00 UTC. Sites have to update their services and also their WNs. The rpms concerned by this vulnerability are also included on the WN, which is not affected itself, but is a target of the standard EGI security monitoring. The advisory didn't mention the WNs; that will be improved next time. We cannot easily determine whether a site has applied the patch on its services.
    • GGUS:124136: all LHCb transfers failed after an FTS upgrade. The issue is described in OTG:0033158 and INC:1143771. A fix has been deployed to allow LHCb transfers to proceed. Had LHCb participated in the MW Readiness effort, this FTS issue would have been discovered during verification, before it hit operations. Raja conveyed the invitation for more active participation to the LHCb computing management.
    • GGUS:120586 concerns an issue with the glite-ce-* clients and dual-stack CEs. No news yet. How much does this affect LHCb? Raja said that dual stack no longer affects LHCb. Nevertheless, this is a bug that CERN has to fix in the next CREAM client release.
    • EOS instabilities (see the T0 report)
  • T0 and T1 services
    • CERN
      • see the T0 report
    • IN2P3
      • XRootD proxy servers dedicated to ATLAS and CMS upgraded to version 4.4.0-1
    • KIT
      • Update of dCache for ATLAS and CMS to 2.13.44 on 28th and 26th of September respectively.
      • xrootd activity for CMS should now be reported to CMS-AAA-EU-COLLECTOR.cern.ch.
    • NDGF
      • Major dCache upgrade to v 3.0.0 two weeks ago
      • minor upgrade today to fix a bug on file upload (see the T1 feedback below)
    • PIC
      • dCache upgrade to v 2.13.42
    • RRC-KI
      • dCache upgrade to v 2.10.61
      • EOS upgrade to 0.3.197-1

Tier 0 News

Highlights

  • The bulk of the computing services is provided by a shared LSF instance (70 k cores), HTCondor (26 k cores), a dedicated LSF instance for the ATLAS Tier-0 (12 k cores), and 21.5 k cores as cloud resources for the CMS Tier-0. The limits of LSF are becoming visible: the limit of 5000 worker nodes has been reached, and on one occasion the number of jobs in the system grew larger than what LSF could reasonably handle. New capacity will be added to HTCondor.
  • At the request of the experiment validation teams, a configuration of worker nodes under CC7 is being worked on.
  • The external cloud resources (4 k cores at T-Systems) are running various job types from the experiments.
  • During September about 5.2 PB have been recorded by CASTOR.
  • Minor upgrades (maintenance) have been performed on EOS, aligning to 0.3.197. In parallel a transparent background upgrade campaign for the EOS disk server (FST) daemons is taking place that will finish by mid-October.

Issues

  • An LSF failure is being followed up. Work is going on to ensure that in case of failover, no "ghost" jobs are created.
  • A scheduled FTS upgrade (moving the VM to CC7 and upgrading the database) had to be rolled back, since all LHCb transfers were failing (while the other experiments were fine). The issue is being analysed with LHCb, putting the upgrade on hold.
  • Some instabilities affecting EOSATLAS and EOSCMS have been observed and are being studied.
  • The incident affecting several services (ATLAS users disappearing from the ATLAS group) was created by a manual error upstream (a wrong update of the e-group holding the master information), which was then propagated to many services.

Plans

  • There is progress on the third link to Wigner; the timescales will hopefully become clear soon.

Tier 1 Feedback

  • NDGF-T1 (Ulf offline during the meeting): the dCache 3.0.0 snapshot that had been running for one week had a bug whereby directories auto-created along the path of a file upload received wrong permissions (0664). These then caused the upload to fail, since the user could not access the directory (the --x bit was missing). This was mitigated by a mass update of the modes during the week, and fixed by a dCache update on Thursday. ATLAS noticed this and reported it for the atlasscratchdisk; presumably ALICE was also hit by it but didn't notice. No long-lasting problems result from it, since the buggy directories can be fixed with a simple database query.
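As an illustration of why mode 0664 breaks uploads: a directory needs the execute (search) bit to be entered. NDGF repaired the dCache namespace with a database query; on a plain POSIX filesystem an equivalent repair could look like this sketch (the paths are made up for the example):

```shell
# Sketch only: reproduce a directory with the buggy 0664 mode and repair it.
# dCache stores modes in its namespace DB; NDGF fixed them there instead.
demo=$(mktemp -d)
mkdir -p "$demo/upload/run1"
chmod 0664 "$demo/upload/run1"          # buggy: rw-rw-r--, no search bit

# Directories lacking u+x cannot be entered, so uploads into them fail.
# Restore the execute bit on every affected directory:
find "$demo" -type d ! -perm -u+x -exec chmod ug+x {} +

stat -c '%a' "$demo/upload/run1"        # now 774: rwxrwxr--
```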

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
  • Central services were unavailable from Sep 14 evening to Sep 15 afternoon
    • a big network intervention made them unreachable for many hours
    • all grid and user activity for ALICE was stopped for that period
    • in parallel the File Catalog was moved to a more powerful new machine
  • CERN: team ticket GGUS:123929 opened Sep 15 late afternoon
    • CREAM / LSF was not working for ALICE
    • fall-out from OTG:0032902
    • converted to ALARM Sep 16 afternoon
    • LSF info provider issue got fixed
    • job submissions resumed late afternoon

ATLAS

  • The grid has been running at full capacity with the new Sherpa Monte Carlo production jobs, as well as with the low priority samples.
  • Group zp issue last week: an e-group deletion attempt by a power user caused 10646 accounts to be removed from the zp group; followed up in SNOW (INC:1136368) and restored.
  • The EOSATLAS service was degraded during the night of September 27th (high latencies between the disk servers and the EOSATLAS headnode); it was stabilized late last night by the EOS team (OTG:0033142). Shortage of disk space at Meyrin; geotagging was disabled.
  • Problem with the xrootd 4.2.3 clients accessing input files on EOS in Athena 21.x releases, not fixed by upgrading to the newer clients 4.3.0 and 4.4.0. The problem is related to the choice of the compiler version and how the thread code gets optimized by the compiler. A fix was found on Monday; we are now waiting for an official xrootd client release. Hopefully release 4.2.4 will happen before the weekend and 4.4.1 next week.
  • ATLAS Software and Computing week this week, discussions for a new production system component for better resource provisioning started.

CMS

  • about four more weeks of proton-proton running then proton-lead
  • re-reconstruction campaign of early 2016 data began this weekend
  • tape-resident data deletion campaign going well, tape repacking at first sites started
  • preparing pileup libraries for next large Monte Carlo campaign

  • The CMS transfer team is asking sites with older xrootd versions to upgrade to v4.4.0, and sites using DPM to upgrade to 1.8.11

  • the networking issues at CERN and Nebraska together with the downtime at Fermilab and a rogue server in Pakistan caused significant xrootd failures lasting into this week
  • HammerCloud outage last week was quickly resolved, thanks Andrea!
  • Tier-0 transfer system was down for a weekend due to a hypervisor reboot and the service not restarting automatically

LHCb

  • Mostly simulation and user jobs on the grid now; data processing jobs are not taking many slots
  • The SARA downtime is a little confusing, especially given that the tapes were already moved 2 weeks ago.
  • CERN FTS issue (GGUS:124136) - solved promptly after alarm ticket opened.
  • Sites installing ARC CEs are requested to ensure that the publishing is correct, including the per-VO numbers that matter (running and waiting jobs). A default ARC installation does not currently publish correct numbers. Alessandra said that sites use Puppet and this fix isn't included in Puppet; HEP-Puppet is a community effort and nobody has contributed the patch yet. Sites not using Puppet have to apply it by hand; the instructions are in the GridPP documentation.

Ongoing Task Forces and Working Groups

Accounting TF

Andrea S. asked about the Pisa problem. Julia said that there were several independent issues, both with the numbers provided by Pisa and on the APEL side. Things are getting better now.

Information System Evolution


  • An IS TF meeting took place on the 22nd of September. Information sources and the main functionality of the central CRIC were discussed. Alignment with EGI plans on moving more information to GOCDB was agreed. There is ongoing progress on the defined actions.
  • The next IS TF meeting will take place on the 10th of November. The VO-feed structure and its integration with CRIC will be discussed.

IPv6 Validation and Deployment TF


See slides.

Andrea S. and Alastair Dewhurst gave a short presentation on the TF progress. The Sept 2016 MB approved the IPv6 deployment plan. Dual stack availability is mandatory for the Tier0 and Tier1s by April 2017, at least on a testbed. By April 2018 dual stack should be available in production for the Tier0 and Tier1s. By the end of Run2 a large number of sites should have migrated their storage to IPv6. Alessandra asked about the Tier2s' deadlines. Andrea S. said the plan applies to Tier2s as well, but the deadlines are not as strict. Most Tier1 sites gave a positive commitment to the plan; some haven't replied yet. He emphasised that several Tier2s already have IPv6, especially the GridPP ones. The TF is now becoming more active and participation is welcome. Maria D. suggested the creation of a dedicated GGUS SU to monitor progress with the deployment.

Machine/Job Features TF

  • MJF values used in ongoing LHCb fast benchmarking evaluation (see last GDB)
  • Some local config errors found during this exercise

Monitoring

Status

  • Contains all raw FTS, Xrootd, ETF data. Some Job Monitoring data.
  • Examples of dashboards are available.

  • Had a couple of months of instability and we needed to work on the infrastructure and resources.
  • Worked with the ES (ElasticSearch) service in order to have a separate ES MONIT instance. Continued to help in the benchmarking of ES resources.

Next Steps:

  • Will soon add a link from the existing FTS Dashboard to the new MONIT portal with FTS dashboards. The new portal is being tuned and there may be glitches (or timeouts if you select long time ranges), but all FTS data are available.

  • We are getting to a phase where we need closer, well-defined contact with WLCG representatives (VOs, sites) to show where we are and to work together on the WLCG use cases. Details of the organization are being discussed with WLCG Operations.

MW Readiness WG


  • The agenda of the 2/11 meeting http://indico.cern.ch/e/MW-Readiness_19 is taking shape and the twiki is reachable from there. Maria will prepare the table of jira tickets' status closer to the date, so please record all progress in jira or email the e-group wlcg-ops-coord-wg-middleware at cern.
  • WN and UI rpms for EL7 have been prepared (with the clients/libs now available on EL7) and pushed to the UMD preview repo for testing (MWREADY-135 and MWREADY-128). Looking for sites available for the validation.
  • We would like to debate the future of the WG at this meeting. It completes 3 years of life in December. Some products are now verified for Readiness "by default" (see examples here). Other products and 2 of the 4 experiments never embarked on this effort. Participation is declining. It is a good moment to review the continuation/transformation/dissolution of the WG.
  • This idea was circulated by email on 22/9. Alessandra's feedback is that the WG should remain alive even if meetings are not very frequent. Example reason: CentOS7 will require some coordination and the WG is the bridge with EGI. The MW Readiness jira tickets are useful, e.g. https://its.cern.ch/jira/browse/MWREADY-128 and https://its.cern.ch/jira/browse/MWREADY-135

Network and Transfer Metrics WG


  • Network session at the WLCG workshop
    • Q&A session planned, questions will be sent in advance, we encourage all to participate
    • Inder Monga (Director of ESNet) will join the session
  • LHCOPN/LHCONE workshop was held in Helsinki, Sept 19-20 (https://indico.cern.ch/event/527372/)
    • GEANT reported peaks over 100 Gbps and growth of over 65% from Q2 2015 to Q2 2016
    • ESNet reported that LHCONE traffic has increased 118% in the past year
    • Positive feedback received on the LHC Network Evolution talk
  • a pre-GDB on networking, focusing on the long-term network evolution, is planned for January 10th - save the date
  • Throughput meetings were held on 15th Sept:
    • Hendrik Borras (Univ. of Heidelberg) presented early results on the network telemetry based on perfSONAR
  • perfSONAR 4.0 RC1 was released, RC2 planned in October with final release sometime in November
  • We are now using a new mailing list, wlcg-network-throughput-wg@cern.ch, a joint mailing list for the European and NA throughput meetings
  • WLCG Network Throughput Support Unit: New cases were reported on IPv6 and are being followed up, see twiki for details.

RFC proxies

Some progress on the EGI side; OSG has already switched. Raja asked how easy it is to use RFC proxies today. Maarten said that one currently needs to supply an explicit option or environment variable, but with the UMD update in October they will become the default. For voms-proxy-init the default will simply change, while the UI environment variable today makes myproxy-init upload legacy proxies; there will be the necessary publicity on what needs to be done to change to RFC proxies. Raja would like to see the switch on the lxplus UI. Maarten confirmed that lxplus would be adjusted as soon as the release is available in the UMD. He recalled that RFC proxies are already in use today (e.g. the SAM tests already use them), not enforced but recommended, because legacy proxies have started giving problems in some areas. By the end of the year RFC proxies should have become the standard everywhere.
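As a hedged sketch of the "explicit option" mentioned above (option name as per the voms-proxy-init man page; the VO name lhcb is only an example), a user could already request an RFC proxy today with:

```shell
# Request an RFC 3820 proxy explicitly, until the UMD update in October
# makes RFC the default for voms-proxy-init.
voms-proxy-init -rfc -voms lhcb

# Inspect the result: the type line distinguishes an RFC proxy from a
# legacy (pre-RFC) Globus proxy.
voms-proxy-info -type
```

This requires a grid UI with the voms clients installed and a valid user certificate, so it is a command transcript rather than a standalone script.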

Squid Monitoring and HTTP Proxy Discovery TFs

Traceability and Isolation WG

No report

Theme: Lightweight Site

Presentation by Maarten; slides on the agenda. Storage is not considered for now. Each OSG site (mainly) supports only one LHC experiment; EGI is more complex (e.g. more experiments per site and more supported MW packages). So we should learn from the OSG sites and concentrate on the EGI ones. A questionnaire is being prepared and its draft was presented for discussion. During the discussion:

  • Lukasz asked how to 'enforce' APEL Accounting for HTCondor CE. Maarten will include it in the questionnaire to have people aware of the pending work involved. Julia committed to include it in the Accounting TF.
  • Raja asked whether network requirements should be included in the questionnaire. Maarten and Alessandra said that lightweight sites will typically be small, up to a few thousand cores at most. If such a site were used for MC simulation only, little bandwidth would be needed. In general the network requirements are driven by the data input and output patterns of the jobs, and hence should be discussed in the Data Management Coordination group, also for lightweight sites. The issues met so far in the Tier0 "cloud" experience (T-Systems) were mainly due to the amounts of data input and output by jobs, which were not always commensurate with the given network capacity. Raja thinks some recommendation of prerequisite network conditions should be included. However, a small site with few human and computing resources will not be taken seriously if it asks for a very fast network; conversely, a very rich local infrastructure with very low connectivity will be very unbalanced. Julia said that this issue will be raised e.g. in the Data Management session at the San Francisco WLCG Workshop.
  • Maria asked whether everyone knows the DMZ acronym on slide 9 (it stands for DeMilitarised Zone). For the record: https://en.wikipedia.org/wiki/DMZ_%28computing%29
  • She also asked why question 11, Allow remote access to a DMZ for the experiment(s)?, should be included at all, given that it doesn't scale well, requires manual intervention by a remote expert (as today) and introduces the possibility of remote root logins outside the site's firewall; it should be checked with the WLCG security experts. Maarten said this is already practiced by US-CMS at T3 sites in California and, to a lesser extent and without root access, today via the use of VOboxes by ALICE, LHCb and CMS.
  • Andrew will send another version of question 10 (to split it into 2). It now says: Could your site supply WNs dedicated to the experiment(s)?
  • Julia suggested the VOs should also see the questionnaire.
  • After the meeting, Maria suggested to Maarten to check the site survey https://wlcg-survey.web.cern.ch/ issued in the autumn of 2014, to make sure no question is forgotten before publishing the questionnaire; for example, ask for the contributors' emails so that we can get back to them. Results: https://twiki.cern.ch/twiki/bin/view/LCG/WLCGSiteSurvey
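To make the network sizing discussion above concrete, here is a back-of-envelope sketch. All numbers are illustrative assumptions, not WLCG requirements; it only shows how the job I/O pattern, rather than the core count alone, drives the bandwidth a lightweight site needs.

```python
# Hypothetical estimate: average WAN bandwidth needed if every core of a
# site continuously runs jobs with a given per-job data volume.
def required_bandwidth_gbps(cores, mb_per_job, job_hours):
    """Gbit/s needed when each core runs back-to-back jobs that each
    move mb_per_job MB (input + output) over job_hours hours."""
    mb_per_second = cores * mb_per_job / (job_hours * 3600.0)
    return mb_per_second * 8 / 1000.0  # MB/s -> Gbit/s

# A 1000-core site running 12-hour MC jobs that move ~200 MB each needs
# only a modest link, whereas data-heavy jobs need two orders more.
print(round(required_bandwidth_gbps(1000, 200, 12), 3))    # prints 0.037
print(round(required_bandwidth_gbps(1000, 20000, 12), 2))  # prints 3.7
```

This matches the point made in the discussion: an MC-only lightweight site has small bandwidth needs, while I/O-heavy workloads (as in the T-Systems experience) can easily outgrow the available capacity.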

Secondary theme: Tier1 downtime announcements

The goal is to devise an algorithm such that the announcement is made earlier when the downtime is likely to last longer.

This will be moved to the next meeting because the initiators in ATLAS and LHCb are absent.

Action list

| Creation date | Description | Responsible | Status | Comments |
| 01.09.2016 | Collect plans from sites to move to EL7 | WLCG Operations | On-going | The EL7 WN is ready (see the MW report of 29.09.2016). ALICE and LHCb can use it. NDGF will stay on SL6 for now but plan to go directly to EL7 in early 2017. Other ATLAS sites, e.g. TRIUMF, are working on a container solution that could mask the EL7 environment for the experiments which can't use it. Maria said that GGUS tickets are a clear way to collect the sites' intentions. Alessandra said we shouldn't ask a vague question. Andrea M. said the UI bundle is also making progress. |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in the experiments' VO feeds | all | - | Proposal to use HTCONDOR-CE. Still not done for ALICE. Raja will ask about the status for LHCb. | | Ongoing |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |

AOB

  • Raja would like to know why the tape system at SARA will be down for another 14 days, given that it has already been moved. There was no NL_T1 representative today. Maria suggested bringing this up at next Monday's 3pm call.

MariaDimou - 2016-09-21
