WLCG Operations Coordination Minutes, June 4th 2015

Highlights

  • Smooth start of Run-2 data taking for all LHC experiments
  • LFC to be decommissioned at CERN on 22nd of June
  • An important issue discovered in the latest globus-gssapi-gsi-11.16-1 breaks MyProxy and FTS interactions and may affect other components too. Investigation ongoing. Details in GGUS:114076


Agenda

Attendance

  • local: Alberto Aimar, Maria Alandes (minutes), Marian Babik, Maite Barroso, Alessandro di Girolamo, Maarten Litmaath, Stefan Roiser, Andrea Sciaba, Christoph Wissing
  • remote: Daniele Bonacorsi, Jeremy Coles, Alessandra Doria, Michael Ernst, Pepe Flix, Alessandra Forti (chair), Felix Lee, Di Qing, Gareth Smith, Vincenzo Spinoso, Renaud Vernet

Operations News

  • Next WLCG workshop to be held in January or February 2016 (3 days between 2016/01/24 and 2016/02/06). Call for volunteers to host it. An email will follow later.

Middleware News

  • Baselines:
    • A new version of gfal2 (2.9.2), fixing an issue that causes crashes of the bringonline daemon of FTS 3.2.33, is available in the FTS RC repo. It has already been installed at RAL, CERN and BNL. FNAL is also encouraged to update to this version (they should have noticed the crashes as well).

  • MW Issues:
    • An important issue has been noticed at CERN and NDGF, related to a new version of a Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing (GGUS:114076). This new version changes the default behaviour of the name matching in an HTTP over TLS connection. This breaks MyProxy and FTS interactions, as reported in the ticket, and it could affect several other components. Discussion is ongoing with the package submitter in order to block submission of the package to EPEL stable.

Maarten gives a detailed explanation of the issue. There are some questions from ATLAS to understand what would need to be done, e.g. in the PANDA server. Maarten explains that specific components would need to be checked and that a workaround is being discussed in the GGUS ticket. An action to follow up this issue will be added to the minutes.

  • T0 and T1 services
    • CERN
      • LHCb and shared LFC instances to be decommissioned on the 22nd of June
    • IN2P3
      • planned dCache upgrade to 2.10.31 (core servers) on 16/06
    • PIC
      • dCache upgrade to 2.10.30
    • TRIUMF
      • dCache upgraded to 2.10.28, SRM 2.10.29, FAX xrootd 4.2.0.1

Tier 0 News

Tier 1 Feedback

Tier 2 Feedback

Experiments Reports

ALICE

  • mostly normal to high activity
  • smooth start of Run-2 data taking

ATLAS

  • Smooth data taking!
  • Rucio performance very good: reached up to 50 Hz for a few hours, with a one-day peak at the level of 3M files/day.
  • mc15 25ns reconstruction (mcore) running massively. mc15 50ns almost finished. To provide some numbers: reconstruction at >50M events/day; evgen: 55M, simul: 232M, digi+reco: 55M
  • analysis as usual
  • High Level Trigger reprocessing of "first week" data ran successfully yesterday: 1M events were reprocessed in 14 hours (the aim was to finish within 24h). Info will be fed back to the HLT community. More similar chains will be run during this first week.
  • Tier-0: some WNs were observed to be particularly slow. According to the WLCG (ex-daily) meeting, the issue seems to be related to a bug introduced by the new version of OpenStack. ATLAS would like to know how many of these WNs are affected. While waiting for a proper fix, it may be better to remove the "not properly configured" nodes from production (please discuss this with ATLAS; it depends of course on how many there are).
  • FTS issue: caused quite a lot of trouble. It is strange that other experiments did not suffer: the bringonline daemons being down should have affected them too. Maybe they were affected and just did not complain.
    • NDGF staging issue is related
    • ALARM ticket to CERN on Monday for the bringonline daemons (GGUS:114021)
    • there is a potentially serious globus package update which only affected a few transfers so far (GGUS:114076)
  • RFC proxy task force: Rucio will use RFC proxies on the pre-prod setup. The pilot factory is still to be followed up.

CMS

  • General
    • exciting days for the start of the 13 TeV LHC era!

  • Processing and production
    • Digi-Reco campaign for Run-2 is progressing very well, reached >1B evts produced in <1 month
      • about a dozen T2 sites also contributing, reaching ~80k slots for production over hot days, working with error rates comparable to T1 sites
      • a considerable number of workflows had to be invalidated due to a mistaken update of the Global Tag, but the good production rate compensated and allowed production to get quickly back on track
    • Reached ~120k parallel running jobs in the Global Pool
      • ~80k from production + ~40k analysis

  • Services
    • Last week the CMSSW Popularity DB was migrated to the CMS Oracle cluster

  • T0
    • the cores quota of the T0 project was increased. Reached a utilization of ~11k.
    • due to intense activity, the quota for /eos/cms/store/unmerged/ was reached over last weekend. Expert contacted; fixed.
    • bad transfer quality from T0, seen at least twice, fixed by CERN-IT:
      • "...there were two diskservers in T0CMS with many dead gridftp jobs that were blocking the transfer slots for these boxes, this eventually can act as a black hole and could explain the reason of the many timeouts you are observing". CASTOR developers are having a closer look.

LHCb

  • GGUS tickets sent to all T1s to make sure that all RAW data will be stored on the same tape set in each tape system when feasible (GGUS:114013 to GGUS:114019).
  • 28th May: data transfer problem to CERN EOS (GGUS:113954).

Ongoing Task Forces and Working Groups

gLExec Deployment TF

  • NTR

RFC proxies

  • RFC proxies were tried on sam-alice-preprod
    • the proxy subject DN ended up mixed:
      • /[...]/CN=1042098026/CN=1748237680/CN=2138413231/CN=proxy
    • such proxies are unusable
    • the proxy renewal sensor then needs an (easy) fix after all
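For illustration only, a minimal Python sketch of how a mixed proxy chain like the one above could be spotted from the subject DN. It assumes the usual convention that legacy proxies append CN=proxy (or CN=limited proxy) while RFC proxies append a purely numeric CN. The function names are hypothetical and this is not the actual fix applied to the proxy renewal sensor.

```python
import re

# Assumption: legacy (pre-RFC) proxies append one of these CN values.
LEGACY_CNS = {"proxy", "limited proxy"}


def proxy_styles(subject_dn):
    """Return the set of proxy styles seen among the CN components of a DN."""
    styles = set()
    for component in subject_dn.split("/"):
        if not component.startswith("CN="):
            continue
        value = component[3:]
        if value in LEGACY_CNS:
            styles.add("legacy")
        elif re.fullmatch(r"\d+", value):
            # Assumption: a purely numeric CN is an RFC-style proxy component.
            # A user whose own CN is numeric would be misclassified here.
            styles.add("rfc")
    return styles


def is_mixed(subject_dn):
    """True when both legacy and RFC-style components appear in the same DN."""
    return proxy_styles(subject_dn) == {"legacy", "rfc"}
```

Applied to the DN quoted above, is_mixed returns True because numeric CN components are followed by CN=proxy.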

Machine/Job Features

  • NTR

Middleware Readiness WG


  • The MW Readiness App moved to the production node https://wlcg-mw-readiness.cern.ch/. Remember this presentation from the last WG meeting on May 6th. This tool will eventually replace today's manually maintained Baseline table, and more.
  • The pakiti client is installed on the dCache CMS instance for MW Readiness at PIC. The nodes are correctly published in the MW Package Collector and viewable by authorised people only.
  • NB!! Next vidyo meeting on June 17th at 4pm CEST. Draft agenda http://indico.cern.ch/e/MW-Readiness_11

Multicore Deployment

  • ATLAS deployment:
    • Reminder that the goal is to have 80% of production resources usable by multicore and sites should modify the configuration to respect that. See Actions for sites.

IPv6 Validation and Deployment TF


Squid Monitoring and HTTP Proxy Discovery TFs

  • No news other than that the developer on the critical path is now back from paternity leave; hopefully progress can resume

Network and Transfer Metrics WG


  • perfSONAR status
    • Detailed report from the WG was presented on Monday at the LHCOPN-LHCONE meeting - LBL Berkeley (US) (https://indico.cern.ch/event/376098/)
    • Both LHCOPN and LHCONE meshes stable now, consistently delivering metrics. RAL shows signs of continuing network problems in both latency and bandwidth.
    • Based on the positive experience in ramping up latency mesh, we plan to establish full WLCG meshes for all types of tests and use it as a baseline for other meshes
    • In collaboration with ESNet, a bug was found in the parsing of tracepath results, which significantly reduces the efficiency of getting tracepath results. The plan is to revert to traceroutes and only run low-frequency tracepath tests until the issue is fixed.
    • The old mesh configuration interface hosted from grid-deployment.web.cern.ch will be decommissioned on Monday (8th of June). The few sites that still have the old URLs configured have been notified.
  • Network performance incidents process - new GGUS SU (WLCG Network Throughput) already available, more information at https://twiki.cern.ch/twiki/bin/view/LCG/NetworkTransferMetrics#Network_Performance_Incidents
  • Test deployed esmond2mq at CERN (developed in collaboration with LHCb), core functionality works fine, waiting for the OSG datastore to enter production in order to run it continuously
  • Next meeting postponed to 10th of June (https://indico.cern.ch/event/382624/). Plan is to focus it on discussing full WLCG meshes proposal, proximity service and initial report from the FTS performance study.
  • Very special thanks for major contributions to the WG and farewell to Soichi Hayashi (OSG) and Aaron Brown (Internet2).

HTTP Deployment TF

The second meeting of the TF took place on the 3rd of June and focused on requirements on storage. A summary will be presented at the next Ops Coordination Meeting.

Action list

| Description | Responsible | Status | Comments |
| Status of fix for Globus library (globus-gssapi-gsi-11.16-1) released in EPEL testing | Andrea Manzi | Ongoing | Details on GGUS:114076 |

Specific actions for sites

| Description | Affected VO | Affected TF | Comments | Deadline | Completion |
| ALL ATLAS sites implementing a cap to their multicore resources (whether their configuration is dynamic just for a portion of nodes or it is a static partition) should review the cap to give 80% of the ATLAS production resources to multicore. As a reminder the shares for ATLAS jobs are as follows: T1: 5% analysis and 95% production; T2: 50% analysis and 50% production. So multicore should get 80% of 95% at T1s and 80% of 50% at T2s. More info here | ATLAS | Multicore | | None | |
| LHCb T1s requested to make sure that all the RAW data will be stored on the same tape set in each tape system when it is feasible | LHCb | - | More details in GGUS:114013, GGUS:114014, GGUS:114015, GGUS:114016, GGUS:114017, GGUS:114018, GGUS:114019 | | |
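
As a worked example of the multicore shares requested from ATLAS sites above, a short Python sketch of the arithmetic. The names are illustrative, and the result is the fraction of a site's ATLAS resources that multicore should get, assuming the shares quoted in the action.

```python
# Shares quoted in the action: fraction of ATLAS resources used for production.
PRODUCTION_SHARE = {"T1": 0.95, "T2": 0.50}

# Goal: 80% of production resources usable by multicore.
MULTICORE_FRACTION = 0.80


def multicore_share(tier):
    """Fraction of a site's ATLAS resources that multicore should get."""
    return MULTICORE_FRACTION * PRODUCTION_SHARE[tier]


for tier in ("T1", "T2"):
    print(f"{tier}: {multicore_share(tier):.0%} of ATLAS resources for multicore")
```

This gives 76% at T1s (80% of 95%) and 40% at T2s (80% of 50%).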

AOB

-- MariaALANDESPRADILLO - 2015-06-03
