WLCG Operations Coordination Minutes, April 28th 2016

Highlights

  • SLC5 decommissioning (EGI, April 30, 2016). SL5 'ops' service probes will get CRITICAL. The whole retirement process is tracked here: https://wiki.egi.eu/wiki/SL5_retirement
  • New Traceability and isolation WG will have a dedicated session in May pre-GDB.
  • New TF on Accounting review is under preparation to start addressing accounting issues.
  • A detailed review of the Networking and Transfers WG was done. This includes a status report, ongoing actions and future R&D projects. More details in the agenda slides.

Agenda

Attendance

  • local: Maria Alandes (minutes), Maarten Litmaath (ALICE), Julia Andreeva (WLCG), Marian Babik (Networking WG and WLCG Monitoring), Helge Meinhard (T0), Jerome Belleman (T0), Xavier Espinal (IT Storage), Stephan Lammel (CMS), Vincent Brillault (Security), Marcelo Soares (LHCb), Aleksandr Berezhnoi, Nurcan Ozturk (ATLAS)
  • remote: Pepe Flix (chair), Michael Ernst, Alejandro, Alessandra Forti, Antonio Maria Perez Calero Yzquierdo, B. Hoeft, Christoph Wissing, Di Qing, Dave Mason, Frederique Chollet, Henryk Giemza, Jeremy Coles, Rob Quick, Shawn Mc Kee, Thomas Hartmann, Ulf Tigerstedt, Victor Zhiltsov, Vincenzo Spinoso, Xavier Mol, Giuseppe Bagliesi, Felix Lee, Renaud, Sang Un Ahn

  • apologies: Maria Dimou, Andrea Manzi, Catherine Biscarat

Operations News

  • SLC5 decommissioning (EGI, April 30, 2016). SL5 'ops' service probes will get CRITICAL. The whole retirement process is tracked here: https://wiki.egi.eu/wiki/SL5_retirement
  • Accounting review:
    • Many actions came out of the April pre-GDB. We need to understand which sites provide, or are going to provide, part of their resources through private clouds (pledged or not), and whether to enable accounting of these resources.
    • During the accounting review it was agreed that a) installed capacity is dropped from the accounting reports, while available capacity continues to be published to REBUS, and b) wallclock is used as the metric and HS06-days as the unit.
    • A report with the summary and actions to be taken will soon be provided.
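
As an illustration of the agreed metric and unit, the sketch below converts wallclock time into HS06-days. It assumes the usual normalisation (wallclock time multiplied by the number of cores and by a per-core HS06 benchmark score). The function name and the sample numbers are hypothetical and are not taken from the accounting review.

```python
SECONDS_PER_DAY = 86400

def wallclock_hs06_days(wallclock_seconds, cores, hs06_per_core):
    """Return normalised wallclock work in HS06-days.

    Assumption: wallclock is scaled by core count and a per-core HS06 score.
    """
    return wallclock_seconds * cores * hs06_per_core / SECONDS_PER_DAY

# Hypothetical 8-core job running for 12 hours on hardware rated 10 HS06 per core.
example = wallclock_hs06_days(12 * 3600, cores=8, hs06_per_core=10.0)
print(example)  # 40.0
```

Using wallclock rather than CPU time means that a slot is accounted for the whole time it is occupied, whether or not the job keeps the cores busy.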

New Traceability and isolation WG

  • Mandate: Explore new traceability and isolation paradigms, propose a new model taking advantage of new technologies and VO frameworks while keeping fully trustworthy traceability and isolation of user actions.
  • Kickoff meeting: Next pre-GDB afternoon: May 10, 13:30 to 17:00
  • Still looking for interested people (esp. from sites):

Maria asks whether experiments have also been contacted to participate in the WG, and Vincent confirms that they have.

Pepe asks whether there is already a twiki page and Vincent explains he is working on it, to be finished soon.

Middleware News

  • Useful Links:
  • Baselines/News:
    • SL5 decommissioning campaign run by EGI (https://wiki.egi.eu/wiki/SL5_retirement). The deadline for decommissioning is 30 April; sites that are not going to upgrade by that date should put their SL5 services in downtime. Sites risk suspension if the services are not upgraded by 31 May.
    • End of support for dCache 2.11 and 2.12 is end of May. (from BDII only 1 instance affected)
  • Issues:
    • Bug discovered at CERN Castor for LHCb (GGUS:121038), where the SRM TURL resolution (for xrootd) was broken after a recent upgrade. A fix has been provided by the developers and deployed in production.
    • Starting from dCache 2.14, the CA OCSP service is used by default to verify certificates. During a CERN CA intervention, SRM requests on some dCache instances got blocked because the CERN OCSP service was down (GGUS:120784). We have asked dCache to disable the check by default and use CRLs instead.
  • T0 and T1 services
    • CERN
      • FTS upgraded to v 3.4.3
    • CNAF
      • Assigned 2016 disk pledges to LHCb (Alice and Atlas already at pledge)
      • To complete assignment of disk pledges 2016 for CMS
    • KIT
      • Migration of the core management service for the tape backend onto new hardware. A downtime for this was planned for the end of February but was postponed because the responsible expert fell ill. The second downtime, planned for the 26th and 27th of April, is now cancelled due to issues with the storage infrastructure. The third attempt is currently planned for the 10th and 11th of May.
    • PIC:
      • In March we started enabling dual-stack on all of the disk pools. We can now confirm that ATLAS, CMS and LHCb disk storage is fully dual-stack.

Tier 0 News

  • Final validation of 2016 storage resources ongoing, checking peak performance, correcting some data access glitches; tape archive replenished with empty tapes for 2016 data taking
  • Cases observed where Adler32 did not spot data integrity issues (see HEPiX Spring 2016); working on improved integrity checks
  • New hardware now rolling in to be used for hardware retirements and re-arrangements (no changes of pledges)
  • VOMS issues being investigated with help by developer; some intermediate steps taken

Helge reminds everyone that support for CERN IT services outside working hours is provided on a best-effort basis only. GGUS tickets should be submitted in case of problems. Operators should not be called directly on the phone. Helge will provide a link to the service level documentation.

Regarding the Adler32 data integrity issues, Xavier explains that these are being studied and understood. Changing to a new checksum algorithm would be a major change and has not been considered for the time being.
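
As a generic illustration of how a 32-bit checksum can miss corruption, the sketch below shows two different payloads with the same Adler32 value but different SHA-256 digests. The payloads are constructed for the example; this is not claimed to be the failure mode reported at HEPiX.

```python
import hashlib
import zlib

# Two different 3-byte payloads chosen so that both Adler-32 running sums agree.
original = b"aca"
corrupted = b"bab"

same_adler32 = zlib.adler32(original) == zlib.adler32(corrupted)
same_sha256 = hashlib.sha256(original).digest() == hashlib.sha256(corrupted).digest()

print(same_adler32)  # True
print(same_sha256)   # False
```

The collision works because Adler32 only keeps two modular sums of the bytes, so any change that leaves both sums unchanged goes undetected.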

DB News

Tier 1 Feedback

  • NDGF-T1: SRM stopped working on 12.4.2016 because the CERN CA went offline following network changes and forgotten firewall updates. dCache 2.15 and newer use OCSP for online revocation checking for the CAs that advertise it. The CERN CA does, and when it went offline production stopped. The check is now turned off, and we'll push to change the default in dCache. However, this is a workaround for bad infrastructure at CERN, since the CA seemingly cannot guarantee it will always work.
  • NDGF-T1 issue 2: Ubuntu Xenial (16.04) is out. Its fetch-crl does not trust the eugridpma CA distribution because it is signed with too weak crypto. Note: this applies when downloading directly from eugridpma, not from the EGI.eu repo.
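
Regarding the OCSP outage in the first item, the sketch below illustrates a "soft-fail" revocation check that falls back to a locally cached CRL when the OCSP responder is unreachable. This is a fallback variant shown for illustration only; it is neither dCache's implementation nor the change requested from dCache (disabling the OCSP check by default). All names in it are hypothetical.

```python
class ResponderUnavailable(Exception):
    """Raised by the (hypothetical) OCSP lookup when the responder is down."""

def is_revoked(serial, ocsp_lookup, crl_serials):
    """Return True if the certificate serial is revoked.

    ocsp_lookup: callable returning True/False, or raising ResponderUnavailable.
    crl_serials: set of revoked serials taken from a locally cached CRL.
    """
    try:
        return ocsp_lookup(serial)
    except ResponderUnavailable:
        # Soft-fail: an unreachable responder does not block the request.
        return serial in crl_serials

def offline_responder(serial):
    """Stand-in for an OCSP responder that cannot be reached."""
    raise ResponderUnavailable("OCSP responder unreachable")

cached_crl = {0x1A2B}
print(is_revoked(0x1A2B, offline_responder, cached_crl))  # True
print(is_revoked(0x3C4D, offline_responder, cached_crl))  # False
```

The trade-off is that a cached CRL can be stale, so a soft-fail policy exchanges some freshness of revocation data for availability.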

Tier 2 Feedback

Experiments Reports

ALICE

  • High activity on average
    • New record 102190 running jobs on April 10
  • Successful T1-T2 workshop Apr 18-20 in Bergen!

ATLAS

  • MC15c production campaign is ongoing: 3.4B of 4.6B events have been produced since mid-March.
  • Heavy Ion reprocessing is ongoing: 330M of 500M events have been processed in the last 11 days. Extensive testing of memory usage with different configurations eventually allowed running with 40k slots on average (~24GB memory with 8-core), with peaks at 55/60k slots (an empirical measure of how many high-memory jobs we can run on mcore).
  • Large-scale derivation production has been running for the past 2 weeks using ~20k slots. It is almost done on the reprocessed 2015 data and is running on MC15c samples and 2016 data.
  • HLT reprocessing for the first stable beam was successful: 3 rounds of HLT reprocessing were done between Friday and Sunday, each in less than 15 hours.
  • Networking problems during the weekend between BNL-SARA (https://ggus.eu/index.php?mode=ticket_info&ticket_id=120957) and BNL-CERN were discovered by ATLAS through degraded job throughput. We would expect lower-level network monitoring to detect such problems first. Can this be done?

Nurcan asks whether lower-level monitoring is available in WLCG. Marian answers that it is not, and that more details will be given in the dedicated Networking WG session after the roundtable.

CMS

  • General:
    • detector re-commissioning in progress
      • magnet at 3.8 Tesla
    • first 2016 proton-proton data recorded
  • Production:
    • Monte Carlo Spring 2016 Digi-Reco with CMSSW 8.0
      • over 3 Billion events requested (many more requests expected even for ICHEP timescale)
      • current request half done (investigating visibility of completed samples in DBS)
    • other smaller campaigns ongoing too
  • Tier-0:
    • ready for calibration and prompt reco
    • unscheduled shutdown of the transfer system VM caused an interruption of a few hours
    • production jobs "flocked" successfully to Tier-0 last week
  • PhEDEx:
    • Tier-1 and Tier-2 sites upgraded to current version
    • storage consistency check at Tier-1s in progress
  • Sites:
    • site evaluation was a bit flaky in the last weeks due to VM I/O overloads, dashboard issues, and the SAM Nagios to ETF and HC plugin switches
    • everything resolved and/or complete (switch from SAM Nagios to ETF flawless)

LHCb

  • Validating workflows for 2016 data processing
  • After validation is finished, more strain is expected on T1 tape systems, both for staging input data for processing campaigns and for 2016 RAW data replication.
  • CERN
    • SRM returns a wrong URL for xroot protocol (GGUS:121038). Solved
  • T1 sites:
    • SARA SRM transfer problems (GGUS:120954) fixed (this generated a lot of "Completed" state jobs)
    • GRIDKA SRM problem (GGUS:120947) and observed very low staging throughput

Ongoing Task Forces and Working Groups

HTTP Deployment TF

Information System Evolution


  • Working on reducing BDII dependencies:
    • Dedicated LHC VOs Storage: a recipe is being prepared for sites based on PIC experience to be able to stop publishing BDII for dedicated LHC VOs storage services.
    • Computing: work ongoing to define static CE attributes in GOCDB/OIM. ATLAS contacted to test this in a few ATLAS sites and AGIS.
  • WLCG scope tags in GOCDB are being validated by Aleksandr Berezhnoi. More details in VO Tags validation twiki.
  • CRIC prototype for CMS progressing well. More details in CRIC Evaluation
  • Ongoing work will be summarised in the next TF meeting scheduled on 12.05.2016

IPv6 Validation and Deployment TF


Machine/Job Features TF

  • Continued work in validating PBS/Torque scripts
  • Preparing HTCondor scripts for tests at sites

Antonio mentions that this is now being tested at PIC for Torque/Maui.

Middleware Readiness WG


  • JIRA:MWREADY-122 ATLAS & CMS: CERN FTS 3.4.3 verification completed (on both SL6 and CentOS7); a couple of small issues were found, to be fixed in 3.4.4
  • JIRA:MWREADY-123 CMS: dCache 2.15.4 verification at PIC ongoing
  • JIRA:MWREADY-30 Argus 1.7.0 on CentOS7 verification at CERN ongoing. Half of the production cluster is running this new version.
  • Input received so far on the pakiti client installation expansion - by CERN, JINR, RAL, GRIF and NL_T1- is now included in a dedicated section for discussion at the next WG meeting of May 18th. Then we'll conclude on the issue.
  • Background: Tier1s were invited at the MWR and the Ops Coord meetings on 16 & 17/3 to tell the e-group wlcg-ops-coord-wg-middleware at cern.ch whether they agree to install the pakiti client on their production service nodes, so that the versions of MW run at the site are known to authorised DNs (site managers taken from GOCDB and expert operations supporters). The developer Lionel Cons stops further work on the tool. Site replies on their intention to expand the use of pakiti client can be found here. Maria D. will put 2 reminders to sites in the twikis of the April 7th and 28th Ops Coord meetings, as part of the MW Readiness WG report.

Network and Transfer Metrics WG


RFC proxies

  • ALICE, CMS and LHCb have been moved to ETF
  • ATLAS SE test issues in ETF are still being investigated
  • to check with EGI and OSG if RFC proxies can be made the default
    • new version of VOMS clients
    • new UI configuration for MyProxy usage
  • any compatibility concerns for central services? pilot factories?

Maarten adds that it now has to be followed up with the experiments whether RFC proxies can be made the default. Hopefully we can say goodbye to legacy proxies by the end of the year.

Squid Monitoring and HTTP Proxy Discovery TFs

  • Web Proxy Auto Discovery is now available for LHC@Home, at http://lhchomeproxy.cern.ch/wpad.dat.
    • Internal CERN only at the moment, we're waiting for it to be approved for opening to the internet
    • Currently lists only one proxy, but when others are added to a configuration file it will automatically geosort the proxies compared to the requester's IP address (exactly like CVMFS stratum 1s are sorted)
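
The geosorting described above can be sketched as ordering proxies by great-circle distance from the requester. This is a minimal sketch and not the service's implementation: the real service would first map the requester's IP address to a location, which is skipped here, and the proxy names and coordinates are hypothetical.

```python
import math

def great_circle_km(lat1, lon1, lat2, lon2):
    """Haversine distance in kilometres between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * 6371.0 * math.asin(math.sqrt(a))

def geosort(proxies, client_lat, client_lon):
    """Return proxy names ordered from nearest to farthest from the client."""
    ordered = sorted(
        proxies,
        key=lambda p: great_circle_km(client_lat, client_lon, p[1], p[2]),
    )
    return [name for name, _lat, _lon in ordered]

# Hypothetical proxies: (name, latitude, longitude).
proxies = [
    ("proxy-chicago.example.org", 41.9, -87.6),
    ("proxy-geneva.example.org", 46.2, 6.1),
    ("proxy-taipei.example.org", 25.0, 121.5),
]

# Hypothetical client located near Barcelona.
print(geosort(proxies, 41.4, 2.2))
```

With these sample coordinates the Geneva proxy is listed first, followed by Chicago and Taipei.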

Networking and Metrics WG review

  • Marian presents the slides attached in the Indico agenda covering a status review of the main achievements of the TF, open issues and activities that will have a continuation within the scope of the WG and R&D activities outside the scope of the WG. Julia thanks the WG for the good work done so far.
  • Julia asks whether the LHCb and ATLAS collectors that process perfSONAR data are general enough to be reused by other experiments. Marian explains that there are some common areas that could be reused, but in principle the collectors are quite experiment-specific. In particular, in the ATLAS work on perfSONAR data analysis, the message queue consumer that records data in Elasticsearch/Hadoop might be fairly generic. The same applies to topology resolution using GOCDB. The prediction/cost matrix generation part might also be worth evaluating as a generic implementation.
  • Maria asks whether the perfSONAR dashboard could be used by non experts. Marian explains that it is not intended for non experts but for networking teams since it's difficult to interpret and requires prior knowledge.
  • There is a final proposal to have a more in-depth discussion on R&D and open issues in either September pre-GDB, WLCG workshop in October or LHCONE/LHCOPN workshop also in October.
  • There is also a request to follow up Networking for commercial clouds, as this is something started already in CERN procurement activities for commercial clouds.
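
To make the cost matrix idea mentioned in the discussion concrete, the sketch below aggregates latency measurements into a site-to-site matrix using the median per site pair. The record fields, site names and the choice of the median are assumptions made for illustration; they do not describe the ATLAS or LHCb collectors.

```python
from collections import defaultdict
from statistics import median

def cost_matrix(records):
    """Map (source_site, destination_site) to the median measured latency in ms."""
    samples = defaultdict(list)
    for rec in records:
        samples[(rec["src"], rec["dst"])].append(rec["latency_ms"])
    return {pair: median(values) for pair, values in samples.items()}

# Hypothetical records, as they might look after host-to-site topology resolution.
records = [
    {"src": "SITE-A", "dst": "SITE-B", "latency_ms": 20.0},
    {"src": "SITE-A", "dst": "SITE-B", "latency_ms": 24.0},
    {"src": "SITE-A", "dst": "SITE-B", "latency_ms": 90.0},
    {"src": "SITE-B", "dst": "SITE-A", "latency_ms": 22.0},
]

matrix = cost_matrix(records)
print(matrix[("SITE-A", "SITE-B")])  # 24.0
```

The median is used here so that a single outlier measurement (90 ms in the sample) does not dominate the cost for a link.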

Action list

| Creation date | Description | Responsible | Status | Comments |
| 17.03.2016 | The Operations' team to check with ALICE, ATLAS and LHCb whether the gLExec monitoring can be stopped. | Maarten | Ongoing | |

Specific actions for experiments

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 29.04.2016 | Unify HTCondor CE type name in experiments VOfeeds | all | - | | | |

Specific actions for sites

| Creation date | Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| 17.03.2016 | The Operations' team to follow up ASGC's progress on the networking issues experienced there and report progress at the Ops Coord meeting | ATLAS | network metrics | This is being followed up by networking experts, ATLAS and ASGC representatives | | CLOSED |

AOB

-- MariaDimou - 2016-04-25

Topic revision: r51 - 2018-02-28 - MaartenLitmaath
 